r-directory > Blog > Quickly Categorize Messy Data

Quickly Categorize Messy Data


A couple of weeks ago I came across a dataset which included a column of very messy character data. This is pretty common, but the difference here was that the dataset has over a million rows. I started by converting that data to a factor, but there were over a thousand levels... not very helpful. So, I figured out a way to quickly categorize the data to simplify analysis. The key to this is grep.

When I've faced this in other datasets I've created a column of individual grep statement, which is functional but messy. It would be unwieldy in this case because there are so many rows. I'll create a quick dataset to illustrate.

df <- data.frame(
  dish=c("chicken cordon bleu", "Pepperoni Pizza", "egg salad", "bagel", "GOULASH", "Chili Relleno", "cheeseburger", "BLT", "omlette"), 
  meal=c("dinner", "dinner", "lunch", "breakfast", "dinner", "dinner", "lunch", "lunch", "breakfast")

The issue with this is that it isn't clean. Suppose that we wanted to find statistics on the main ingredient, before any analysis can be done it needs to be categorized. For a small dataset it's no problem to write individual grep statements:

df$mainIngredient[grep("chicken", df$dish, ignore.case=TRUE)] <- "Chicken"

A few more lines like that one and it would be done. But there are 2 issues with that, first it's not easy to maintain, and second if your data has very many rows (or more importantly, many factors) the code can quickly become unwieldy. This is the perfect task for a function, so I wrote one.

categorizeDF <- function(df, searchColName, searchList, catList, newColName="Category") {
  catDF <- data.frame(matrix(ncol=ncol(df), nrow=0))
  colnames(catDF) <- paste0(names(df))
  df$sequence <- seq(nrow(df))
  for (i in seq_along(searchList)) {
    rownames(df) <- NULL
    index <- grep(searchList[i], df[,which(colnames(df) == searchColName)], ignore.case=TRUE)
    tempDF <- df[index,]
    tempDF$newCol <- catList[i]
    catDF <- rbind(catDF, tempDF)
    df <- df[-index,]
  if (nrow(df) > 0) {
    df$newCol <- "OTHER"
    catDF <- rbind(catDF, df)
  catDF <- catDF[order(catDF$sequence),]
  catDF$sequence <- NULL
  rownames(catDF) <- NULL
  catDF$newCol <- as.factor(catDF$newCol)
  colnames(catDF)[which(colnames(catDF) == "newCol")] <- newColName

This function takes looks at a data frame column of messy charcter (or factorial) data, and produces a new column of categorized data. The inputs are the data frame, the column name of the messy data, a list of search strings, a list of category names (these have to be correlated), and you have the option of naming the new column.

Referring back to our small data frame above, I would create a list of search strings:

search <- c("chicken", "pepperoni", "egg", "cheeseburger", "blt", "omlette")

categories <- c("Chicken", "Beef", "Egg", "Beef", "Pork", "Egg")

Next, I'll call the function, and assign that to a new data frame.

sorted <- categorizeDF(df, "dish", search, categories, "mainIngredient")

Notice that the column names need to be in quotes. Here's what our new data frame, sorted, looks like.

                 dish      meal mainIngredient
1 chicken cordon bleu    dinner        Chicken
2     Pepperoni Pizza    dinner           Beef
3           egg salad     lunch            Egg
4               bagel breakfast          OTHER
5             GOULASH    dinner          OTHER
6       Chili Relleno    dinner          OTHER
7        cheeseburger     lunch           Beef
8                 BLT     lunch           Pork
9             omlette breakfast            Egg

Now if you wanted to review main ingredients by meal, the task is trivial. If you're getting too many in the OTHER category, add more statements to your list of search strings & categories.

Once a category is assigned it won't be overwritten, so in my example if one of the dishes is "chicken omlette", the category assigned would be Chicken. Pay attention to the order of your search & category lists. If your search & category lists become large & unwieldy, this could easily be modified to accept another data frame as input. That way you could easily keep track of which categories belong with which searches.

I'm careful to refer to this as categorization. You could also solve this problem with a classification algorithm. I chose this method because it's fast and requires very little setup. This function ran against my data set with 1.2 million rows in just a few seconds. At that speed I went through a few iterations of adding search terms and had my entire data set categorized in under 10 minutes.

My code is available on GitHub, categorizeDF.R, and I do support pull requests. All feedback is welcome.

| Tagged cleaning code data function
comments powered by Disqus comments powered by Disqus
The Short List

These are the sites that are visited most frequently.

Recent Blog Posts