R's ggplot package by Hadley Wickham is one of the most versatile tools that I've come across for data visualization projects. I work with a lot of data, and where I used to wrestle with Excel to get charts and graphs put together, ggplot produces nicer images and allows more control over the final product. I'll admit that it took me a while to understand the methodology behind ggplot, but once you "get it", it really is a remarkable tool. One quick note before we dive in: I'm saying ggplot for the sake of simplicity, but I'm actually using the more recent ggplot2.
The whole trick to working with ggplot is understanding that the final output is put together in layers, and each layer gets its own statement appended to the original ggplot command. The initial call names the data that you'll be using and which columns go on which axes, and you can add as many modifications as you need until you achieve the final result. Chart type, colors, gradients, scale, axes, labels... there's a statement for each. This is what makes ggplot so powerful, and so awesome to work with.
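To make the layering idea concrete, here's a minimal sketch using mtcars, a data set that ships with R. It isn't the keyword data used in the rest of this post, and the axis labels are just placeholders I picked. Each + adds one statement on top of the base plot.

```r
library(ggplot2)

# Base plot: names the data frame and maps columns to axes, but draws nothing yet
p <- ggplot(mtcars, aes(x = wt, y = mpg))

# Layer 1: draw the points
p <- p + geom_point()

# Layer 2: add a linear trend line on top of the points
p <- p + geom_smooth(method = "lm", formula = y ~ x)

# Non-layer additions work the same way: labels, scales, themes
p <- p + labs(x = "Weight (1000 lbs)", y = "Miles per gallon")

p  # printing the object renders the plot
```

Because every step is stored back into p, you can stop after any line, look at the result, and keep going.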
It's always easiest to learn by walking through examples, so I'm going to plot the search traffic for some keywords. I'll create a bar chart that makes it simple to see the most popular terms. The goal of any visualization project should be to make the data easier to understand. In this case, I'd like to show my client that some terms have so little traffic that they need to be excluded.
First, if you don't already have ggplot installed, go ahead and do that now. Then load it into your workspace.
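In case it helps, this is roughly what that looks like. It's a sketch that assumes you're installing from CRAN with R's default settings; the requireNamespace() check is only there so the install is skipped when the package is already present.

```r
# Install ggplot2 only if it isn't already available, then load it
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}
library(ggplot2)
```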
I've cleaned and prepared a data set so that we can focus exclusively on the work at hand. You're welcome to download it and follow along: ggplot-data.csv. Load the data into memory to begin (the command below expects the file to be saved as data.csv in your working directory).
data <- read.csv("data.csv", header=TRUE, nrows=200)
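If you can't get hold of the CSV, a small stand-in data frame is enough to follow the rest of the walkthrough. The column names (Keyword, Country, Traffic) match the ones used in this post, but the name sample_data and every value in it are made up for illustration.

```r
# Hypothetical stand-in for the CSV: same column names, made-up numbers
sample_data <- data.frame(
  Keyword = rep(c("red shoes", "blue shoes", "green shoes"), times = 2),
  Country = rep(c("US", "UK"), each = 3),
  Traffic = c(12000, 4500, 300, 3000, 1500, 90)
)

str(sample_data)   # check column names and types
head(sample_data)  # peek at the first rows

# To follow along without the download, uncomment the next line:
# data <- sample_data
```

Running str() and head() on the real data is a good habit too, since it confirms the columns arrived with the names and types you expect.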
Now let's start working with ggplot.
gg <- ggplot(data, aes(x=Keyword))
I just declared a variable to hold my ggplot commands, told it which data frame to use, and defined which column will be on the X-axis of my plot. The great thing about storing your ggplot command in a variable is that you can incrementally build the image. If we were to execute "gg" in the console right now, we'd get an error stating that the plot has no layers (recent ggplot2 releases draw an empty panel instead of raising the error). We still need to tell it which values to plot on the Y-axis.
gg <- gg + geom_bar(aes(weight=Traffic))
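If the weight aesthetic looks mysterious: by default geom_bar() counts the rows for each X value, and weight tells it to sum a column instead. Here's a rough sketch with a made-up toy data frame (the names toy, p_weight, and p_col and all of the numbers are placeholders, not my client's data).

```r
library(ggplot2)

# Toy data: two rows per keyword, one per country (values are made up)
toy <- data.frame(
  Keyword = c("red shoes", "red shoes", "blue shoes", "blue shoes"),
  Country = c("US", "UK", "US", "UK"),
  Traffic = c(12000, 3000, 4500, 1500)
)

# The bar heights geom_bar() will draw: Traffic summed per keyword
heights <- tapply(toy$Traffic, toy$Keyword, sum)
print(heights)

# Same totals, drawn as bars
p_weight <- ggplot(toy, aes(x = Keyword)) + geom_bar(aes(weight = Traffic))

# geom_col() is an alternative that maps the value straight to y;
# rows sharing a keyword are stacked, so the bar totals match
p_col <- ggplot(toy, aes(x = Keyword, y = Traffic)) + geom_col()
```

Either form should give the same bar heights here. I've stuck with the weight version in this post.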
Now that we've declared a complete chart, you can view it any time by running the gg command. This gets us a bar chart with no colors. It's still kind of a mess, but we have a plot.
I notice two things that need attention right off. First, the bars are all black, so country-specific data is indistinguishable. I'll address this by adding a fill statement to the geom_bar() command. Second, the keywords are overlapping on the X-axis, making them illegible. I'll use coord_flip() to make the words appear on the Y-axis.
gg <- gg + geom_bar(aes(weight=Traffic, fill=Country)) + coord_flip()
View your work by calling gg. Great, we're making progress, and now we can make sense of what's being displayed.
The next thing I notice is that the words are listed in alphabetical order (though reversed because we inverted the axes). It probably makes more sense to have this displayed as an inverted waterfall with the most traffic at the top and the least traffic at the bottom. This has to do with the data we're feeding into ggplot, so we'll manipulate it a bit with the reorder() function.
data$kw <- reorder(data$Keyword, data$Traffic)
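A note on what this does: reorder() doesn't sort the rows. It returns a factor whose levels are ordered by a second vector, and ggplot lays out the axis in level order. One caveat worth knowing is that when a keyword appears on several rows (once per country, say), reorder() summarizes the values with the mean by default, so you may want to pass FUN = sum to rank by total traffic. A small sketch with made-up numbers:

```r
# Toy data (hypothetical): some keywords appear on more than one row
toy <- data.frame(
  Keyword = c("alpha", "alpha", "beta", "gamma", "gamma"),
  Traffic = c(50, 50, 90, 10, 20)
)

# Default: levels ordered by mean Traffic per keyword
# (alpha = 50, beta = 90, gamma = 15)
by_mean <- reorder(toy$Keyword, toy$Traffic)
levels(by_mean)
# "gamma" "alpha" "beta"

# FUN = sum: levels ordered by total Traffic per keyword
# (alpha = 100, beta = 90, gamma = 30)
by_sum <- reorder(toy$Keyword, toy$Traffic, FUN = sum)
levels(by_sum)
# "gamma" "beta" "alpha"
```

With flipped axes the last level ends up at the top of the chart, which is why ordering from smallest to largest gives the "most traffic at the top" layout.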
Now we'll call ggplot again, but this time we'll plot our kw column on the X-axis.
gg <- ggplot(data, aes(x=kw))
gg <- gg + geom_bar(aes(weight=Traffic, fill=Country)) + coord_flip()
gg
That's pretty much what I was looking for when we began. If I were to show this to a client they'd get what I'm talking about right off. We can still clean a few details up to make it better. Let's add some meaningful labels.
gg <- gg + labs(x="Keyword", y="Monthly Search Volume", title="Search Volume by Keyword")
I also prefer that my axes have commas where appropriate. This requires the scales library. Scales is usually installed with ggplot2, but if you get an error trying to load it you can use install.packages("scales").
library("scales")
gg <- gg + scale_y_continuous(labels=comma)
gg
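If you're curious what the labels argument is actually receiving, comma is just a function that turns numbers into formatted strings. Here's a quick sketch you can run on its own; the input numbers and the name fmt are arbitrary, and label_comma() assumes a reasonably recent version of scales.

```r
library(scales)

# comma() turns numbers into strings with thousands separators
comma(c(950, 12000, 1234567))
# "950" "12,000" "1,234,567"

# Newer scales releases also offer label_comma(), which returns a
# formatting function you can hand to a ggplot2 scale
fmt <- label_comma()
fmt(45000)
# "45,000"
```

So scale_y_continuous(labels=comma) simply runs each axis break through that function before drawing it.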
That's it, we're done. The last step is to export the chart so it can go into my presentation to the client. I'll export a PNG for this purpose. If you plan to print your chart directly, then I'd recommend a PDF. Here are both commands.
ggsave("myCustomPlot.png", width=5, height=4, dpi=80)
ggsave("myCustomPlot.pdf")
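Two things that may save you some confusion. ggsave() saves the last plot that was displayed unless you tell it otherwise, and width and height are in inches by default, so the PNG above comes out at 400 by 320 pixels (5 x 80 by 4 x 80). If you'd rather be explicit about which plot gets written, you can pass it via the plot argument. A sketch with a throwaway plot and a temporary file (both are placeholders):

```r
library(ggplot2)

# Throwaway plot built from a data set that ships with R
p <- ggplot(mtcars, aes(x = factor(cyl))) + geom_bar()

# Write to a temporary file so nothing lands in the working directory
out_file <- file.path(tempdir(), "example-plot.png")

# plot = p names the plot explicitly instead of relying on the last one shown
ggsave(out_file, plot = p, width = 5, height = 4, dpi = 80)

file.exists(out_file)
```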
For comparison I created the same chart in Excel.
While the two charts convey the same information, the ggplot version is cleaner and easier to read. Plus, because you're creating a chart with code, it's MUCH easier to duplicate your work with new data, port your chart between systems, or explain to someone else how you did it.
In order to focus on creating the plot, I prepared the data for this exercise in advance. It usually isn't difficult, depending on what format the data arrives in. In my case I combined the data from four sources, then used Reshape to get it into this format. I find that ggplot is easy to work with as long as the data is correctly formatted. Once you pick a chart style, you can view the documentation to see the easiest formatting to work with.
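To give a sense of what "correctly formatted" usually means: ggplot generally wants one row per observation (often called long format), with one column per variable. I used the Reshape package for my data. The sketch below swaps in base R's reshape() so that it runs without extra packages, and the wide table and its numbers are made up for illustration.

```r
# Hypothetical wide table: one row per keyword, one traffic column per country
wide <- data.frame(
  Keyword = c("red shoes", "blue shoes"),
  US = c(12000, 4500),
  UK = c(3000, 1500)
)

# Long format: one row per keyword/country pair
long <- reshape(
  wide,
  direction = "long",
  varying   = c("US", "UK"),  # columns to stack
  v.names   = "Traffic",      # name for the stacked values
  timevar   = "Country",      # name for the column holding the old column names
  times     = c("US", "UK"),
  idvar     = "Keyword"
)
rownames(long) <- NULL  # drop the generated row names

long
```

The result has Keyword, Country, and Traffic columns, which is the shape the bar chart in this post expects.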
If you need more help, the best introduction that I've come across is a tutorial written by Josef Fruehwald, available on the University of Pennsylvania website. The original documentation is available on Hadley Wickham's site. And, if you prefer physical books for reference, Wickham has also published a guide, ggplot2: Elegant Graphics for Data Analysis (Use R!).
The beauty of ggplot is that it gives you so much control over the chart that you're working on. It's easy to tweak the image however you need to produce the desired result. Plus, ggplot is capable of producing just about any type of chart you can think of.