I have been doing data analysis for more than a decade. Initially it was by necessity, I was given data at work and had to make sense of it for my managers. Excel was my tool of choice for years, I know it and its capabilities well. Three years ago I was introduced to R, and flat out rejected it as overkill. Then I started tinkering with it. Now I do as little as possible in Excel. This is just one man's opinion, but if you're goal is to analyze data R is much better suited to this task. Here's why.
The way you work in these tools is quite different. With Excel, a lot is accomplished via mouse clicks, you access various tools in different locations within the interface. This makes Excel easy to use (albeit with experience), but it can be time consuming & monotonous to repeat your process with a new project. With R, everything is accomplished via code. You load your data into memory and then explore & manipulate that data by running scripts. This is a less user friendly environment, but there are several payoffs.
I find it conceptually easier to work in R. If you're working with several columns of data, you're looking at all of that data while you focus on individual tasks. With R, the data is present in memory, but out of view unless you call it. If you're doing a transformation or calculation you work with the subset of columns or rows that are relevant, everything else is available but not in front of you. I find that this makes it easier to focus on the task at hand. Once you're done you save your work in a data frame that contains only the columns or rows that you need. You build the exact dataset that addresses the question you're trying to answer. This seems so trivial, but in practice it's a substantial benefit.
With R it is very easy to reproduce your work on other datasets. Because all data manipulation & exploration is done with code it's simple to rinse & repeat with new datasets. With Excel a lot is accomplished through mouse clicks, which is a great user experience but recreating your work with new data can be time consuming & monotonous. With R you just load a new data set and run your scripts again.
The fact that you're working with code makes troubleshooting & sharing your work much easier also. With Excel a lot of the work is based on memory (pivot tables are here, equation editors are on that tab, etc). In R, everything is accomplished through the code that's right in front of you. If you're fixing a bug you know exactly where to look, and if you need to share your work just copy & paste your code. When seeking help online, you can show exactly what you're using and ask very specific questions. In fact, most of the time when you get help online, people will post the exact code that addresses your issue.
Project organization is easier in R. In Excel I'd have a number of tabs and maybe even several spreadsheets, all labeled appropriately, but each file was different. My project notes were kept in separate files. My R project organization has a separate folder for each piece that I'm working with. Cleaning the data, exploratory graphs, models. Its all easy to understand & find, for me or anyone else that comes across my work. You absolutely can work in an organized way within Excel. I find that R's simplicity makes this much easier.
Everything I've mentioned above is nice, but not necessary. I did fine without them for years, you probably have, too. Now I'll go into a few things that really set R apart from Excel. I'd argue that apart from any niceties, R is better suited to the task of data analysis. Heres' why.
You can load any data into R. It doesn't matter where your data is or what form it's in. You can load CSV files, read JSON, execute SQL queries, scrape a website. You can even handle big data in R through Hadoop.
Through the use of packages, R is a complete toolset. R can do so much more than Excel when it comse to data analysis. Data munging, classification & regression, image processing and everything in between. If machine learning is your interest, any algorithm you can imagine is a few clicks away. There are now over 5,000 packages available for R, so regardless of what type of data you're working with R is ready.
R offers outstanding data visualization. To be fair, Excel's charts are quite good and easy to work with. R's are better. In fact, I consider this as one of R's most useful features. With ggplot2 you can quickly create any type of plot you need and customize any aspect of it. Once you're familiar with how to create a chart with ggplot2, any other type of plot is just as easy. ggplot2 also offers more types of charts. Have you ever created a scatterplot matrix with Excel? It's pretty straightforward in R. Same with CDF plots. Excel pales in comparison here.
Version control with git. I have long been in the habit of saving & versioning my work. Git is the easiest tool that I've found for this. I use RStudio as my editor, which supports projects. Create a repository for your project and you'll be able to track versions of your data exploration. You can create versions of your Excel files, but since they're stored as binary files you won't be able to identify changes. With R this is trivially simple.
I've covered a lot of ground here. The bottom line is that Excel is a good tool for data analysis. I'm sure that it is capable of more than I give it credit for. However, if it's your only tool then you'll be limited in the work you can produce. R is much better suited to the task, and it offers a more complete toolset as well. The cost is that it's less user friendly and has a fairly steep learning curve for new users. If you stick with it this will pay substantial dividends in your understanding of data and in capabilities as well.comments powered by Disqus comments powered by Disqus