Dealing with messy data: Google refine
By Abhijit Menon-Sen <email@example.com>
Recently at work, I needed to import a substantial quantity of data from
some spreadsheets into an SQL database. Due to multiple maintainers and
changing needs, the spreadsheets were a mess of special cases, shorthand
notation, and minor errors. I gave up on importing them as-is, and asked
for the relevant data to be extracted into a format I defined. Writing a
converter for the latter was much easier (although errors kept cropping
up even in the new spreadsheet).
Today I read about
Google refine, a
program dedicated to dealing with messy data. The video demonstrations
on the project page show off some useful capabilities, such as grouping
together and canonicalising values from a column in a few steps, scaling
numeric values, and rearranging the data in various ways.
Google refine is a Java program that you install and interact with using
a web browser. I've never used such a program, and I suspect it may be
painful to install, but I'm going to try it if any more spreadsheets
appear on the horizon.