Wednesday, April 23, 2014

The small scale approach: How to cleanup messy data, standardize enrich data

You ever faced to the following challanges:

  • You  have a huge amount of master data needs to be cleaned up. Over time the same terms are used differently, buildup of the names using different styles, etc.
  • Have to split data currently part of one column into different one, e.g. a product name contains dimension which should be processed in a different column in the future, etc.
  • common faults in the data needs to be harmonized (multiple whitespaces, upper vs. lower case letters, etc.)
  •  Transform well structured data into a table for relational usage of the data, e.g.
  • The data needs to be enriched based on public available or internal services, e.g. you have the address of your customer available and want to add latitude, longitude to display amount of customer on a map.
  • You have to merge two sets of data from two different systems and want to identify same entries even if the both lists does not share a common key.
  • ....
For those tasks you have several ways to work with.

One interesting tool which is very useful in those kind of tasks is "OpenRefine" (formerly known as Google Refine).
You can easily download the latest version and install it on your local client (or server) and start it just by following the easy installation instruction provided on the homepage.

A good overview of the main functions is shown in : https://github.com/OpenRefine/OpenRefine/wiki/Screencasts

The tool is very easy to use. It works for huge amount of data and for the most tasks it is pretty fast  (especially compared to doing the same with spreadsheet software). It is more likely used for ad-hoc tasks compared to fully automated and repeated tasks in the enterprise - but it provides  handy functions to reapply and reuse already created rules.

For more sophisticated analyses goes out of the scope of one columns and rows of one table tools like RapidMiner are more sufficient. But those kind of tools require a bit more upfront invest to get the first problem solved.

1 comment:

Unknown said...

v nice article
brothers visit the site to learn some tutorials
thank you
http://zeeeshproductions.blogspot.com/