Dealing with messy data: Google refine

By Abhijit Menon-Sen <ams@toroid.org>

2011-05-20

Recently at work, I needed to import a substantial quantity of data from some spreadsheets into an SQL database. Due to multiple maintainers and changing needs, the spreadsheets were a mess of special cases, shorthand notation, and minor errors. I gave up on importing them as-is, and asked for the relevant data to be extracted into a format I defined. Writing a converter for the latter was much easier (although errors kept cropping up even in the new spreadsheet).

Today I read about Google refine, a program dedicated to dealing with messy data. The video demonstrations on the project page show off some useful capabilities, such as grouping together and canonicalising values from a column in a few steps, scaling numeric values, and rearranging the data in various ways.

Google refine is a Java program that you install and interact with using a web browser. I've never used such a program, and I suspect it may be painful to install, but I'm going to try it if any more spreadsheets appear on the horizon.