In getting more comfortable with R and data mining, I've been working with a handful of free datasets. Each of the sets that I've been working with needs some modification before feeding into R, so tools that simplify this are helpful. I've found two that I really like, Stanford Visualization Group's Data Wrangler and Google Refine (aka OpenRefine, formerly known as Freebase Gridworks). These tools are similar yet have different goals, and each can save you a ton of time. Here's a breakdown of the strengths and differences of these two tools.
This is an application that you install on your system. Once installed it runs in your browser, so while it looks and behaves like a web application it is entirely contained on your own computer. Google Refine is actually a collection of tools for working with data. These tools make it a snap to clean up messy or inconsistent data, such as grouping similar data into one type (merging "Barack Obama", "Obama, Barack" and "Obama"). Reorganizing data is also a straightforward process. Through Refine's facets and filters, your data can be split into multiple columns, merged into a single column, and adjusted to exclude quotes, parentheses or any other type of formatting.
The final piece of Google Refine is really impressive. This tool includes a tool for adding to your data through API calls to external data sources. If you have a list of songs and want to include the artists, this is possible. Or, if you want to include latitude & longitude with a list of addresses, no problem. This feature makes your data more precise, and makes it a snap for you to combine data sources into one larger dataset specific to your needs. As long as you can specify an API call to gather the information that you need, Google Refine can add columns to your data set.
This is actually two tools. The first piece is a web-based app that provides an interactive GUI front end that is used to define the procedures that will ultimately manipulate your data. Each manipulation provides a preview of the output, and there is always a history of operations that can be deselected or deleted. These tools make it a snap to merge, delete, autofill and move information in your set. Filling in missing data or incorporating data from another source becomes a straight-forward process. Further, folding or unfolding your data allows interesting views analogous to pivot-tables. The output of this web app is either a CSV of your manipulated data, or a script that can be used to manipulate similar data.
Both of these tools are fantastic and can save you hours of tedium and frustration. There is definitely some overlap in their functionality, but a high level summation would be: Google Refine is used to clean messy data and extend your data set with external sources, Data Wrangler is used to reorganize your data into easy to understand views. Further, Google Refine can be easily implemented on dozens of files. Data Wrangler can be easily applied to hundreds or thousands of documents. Either tool can handle single data sets of tens of thousands of rows (as long as your computer has enough memory).comments powered by Disqus