Wednesday, April 23, 2014

The small scale approach: How to clean up, standardize, and enrich messy data

Have you ever faced the following challenges?

  • You have a huge amount of master data that needs to be cleaned up. Over time the same terms have been used differently, names have been built up in different styles, etc.
  • You have to split data currently stored in one column into several, e.g. a product name contains a dimension that should be processed in a separate column in the future.
  • Common faults in the data need to be harmonized (multiple whitespaces, upper vs. lower case letters, etc.).
  • You have to transform well-structured data into a table for relational usage.
  • The data needs to be enriched based on publicly available or internal services, e.g. you have your customers' addresses and want to add latitude and longitude to display the number of customers on a map.
  • You have to merge two sets of data from two different systems and want to identify identical entries even if the two lists do not share a common key.
  • ....
There are several ways to tackle those tasks.
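As a small illustration of the first kind of cleanup, a few of the common fixes (trimming whitespace, collapsing repeated spaces, harmonizing case) can be sketched in plain Python; the sample values are invented for the example:

```python
import re

def harmonize(value):
    """Apply a few common cleanup rules to a single cell value."""
    value = value.strip()               # drop leading/trailing whitespace
    value = re.sub(r"\s+", " ", value)  # collapse repeated whitespace
    value = value.title()               # harmonize upper/lower case
    return value

# A messy "product name" column, as it might look in real master data
raw = ["  ACME   widget ", "Acme Widget", "acme  WIDGET"]
cleaned = [harmonize(v) for v in raw]
print(cleaned)  # all three variants collapse to "Acme Widget"
```

These are of course just the simplest rules; the point is that each fix is a small, repeatable transformation over one column.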

One interesting tool that is very useful for these kinds of tasks is "OpenRefine" (formerly known as Google Refine).
You can easily download the latest version, install it on your local client (or server), and start it by following the simple installation instructions provided on the homepage.

A good overview of the main functions is given in the screencasts: https://github.com/OpenRefine/OpenRefine/wiki/Screencasts

The tool is very easy to use. It handles large amounts of data, and for most tasks it is pretty fast (especially compared to doing the same with spreadsheet software). It is geared more toward ad-hoc tasks than toward fully automated, repeated enterprise jobs - but it provides handy functions to reapply and reuse already created transformation rules.
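The idea of reusing already created rules can be mimicked in plain Python (a rough sketch of the concept, not OpenRefine's actual mechanism): record the cleanup steps once as an ordered list of functions, then reapply the same "recipe" to any later data set:

```python
import re

# A "recipe" of cleanup steps, recorded once...
rules = [
    str.strip,                          # trim surrounding whitespace
    lambda v: re.sub(r"\s+", " ", v),   # collapse repeated whitespace
    str.lower,                          # harmonize case
]

def apply_rules(values, rules):
    """Run every rule over every value, in order."""
    for rule in rules:
        values = [rule(v) for v in values]
    return values

# ...and reapplied to different data sets later on
batch1 = apply_rules(["  Foo   Bar "], rules)
batch2 = apply_rules(["FOO BAR", " foo   bar"], rules)
print(batch1, batch2)  # ['foo bar'] ['foo bar', 'foo bar']
```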

For more sophisticated analyses that go beyond the columns and rows of a single table, tools like RapidMiner are more suitable. But those kinds of tools require a bit more upfront investment to get the first problem solved.

Tuesday, April 22, 2014

Copy tables from PDF for further processing

Have you ever tried to extract data from a table in a PDF to further process it in MS Excel or any other spreadsheet application? That can be a painful job.

You might use one of the many PDF-to-Excel conversion tools - but most of them cannot be used without submitting your PDF to an online service or buying a commercial license. In addition, you have to evaluate the quality of the results, especially for tabular data.

A few days ago I stumbled upon the tool Tabula. It is still marked as experimental, but the results are already pretty useful, at least for data-oriented tables.

Example

The PDF "The Mobile Economy 2013" contains a table on page 56 which you want to process in Excel:

[Screenshot: the table on page 56 of the PDF]

You have to follow these steps to extract the table as CSV:

  • download the PDF to your local disk
  • install and start the tool following the instructions on the homepage
  • upload the PDF and select "Submit"
  • navigate to the table and select it:

[Screenshot: selecting the table area in Tabula]

  • choose "Repeat this selection" if you want to select the following tables as well, using the same coordinates.
  • choose "Download all data" and you get:

[Screenshot: preview of the extracted data in Tabula]

  • choose "Download data" to get a CSV file with the extracted tables. This file can be opened with MS Excel or any other application that can read the CSV format for further processing.
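Once Tabula has produced the CSV file, further processing does not have to happen in a spreadsheet; Python's standard csv module reads it just as well. The file content and column names below are invented stand-ins for whatever Tabula exported:

```python
import csv
import io

# Stand-in for the exported file, e.g. open("tabula-export.csv") instead
csv_text = "Country,Subscribers\nGermany,113\nFrance,72\n"

rows = list(csv.DictReader(io.StringIO(csv_text)))
total = sum(int(r["Subscribers"]) for r in rows)
print(rows[0]["Country"], total)  # Germany 185
```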
The results are very useful, and it works for this kind of data as well:

[Screenshot: another table extracted by Tabula]

It won't work with data provided as a graphic inside the PDF - but that is a topic for another story.