Folder Structure for Data Analysis

Over time I've found that it's easier to focus on data analysis if my work is organized. I can spend more time & attention on the actual analysis if I don't have to think about how my work is structured. Hopefully this will help you keep track of your work as well. My notes are specific to R, but this would work regardless of your language or toolset.

Hopefully the screenshot is self-explanatory, but I'll say a little about each folder.

code: This folder contains all of my code. I use scripts to clean my data, generate plots and sometimes output text, and it all goes in here. If a script has associated data, images or text, I make the filenames match so that I can tell which code generated which dataset or image. I also tend to save all of my work, even the code that I choose not to include. If I try a decision tree but end up choosing k-nearest neighbors, I'll still keep the decision tree code. Sometimes I'll use another tool such as AWK, PowerShell or Excel to clean up data; those scripts get stored here, too. As I reach completion of a project, I save all code that shows my steps in the processed folder. Processed folders always show the steps that I used to produce the final output. Raw folders contain everything else.
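One way to keep filenames matched is to derive every output path from the script's own name. This is only a sketch: `matching_outputs()` is a hypothetical helper, and the script name `clean_sales.R` and the choice of output folders are assumptions for illustration.

```r
# Hypothetical helper: build output paths that share a script's base name,
# so it is obvious which script produced which dataset or image.
matching_outputs <- function(script_path, root = ".") {
  # "code/raw/clean_sales.R" -> "clean_sales"
  stem <- tools::file_path_sans_ext(basename(script_path))
  list(
    data   = file.path(root, "data", "clean", paste0(stem, ".rda")),
    figure = file.path(root, "figures", "exploratory", paste0(stem, ".png"))
  )
}

# Example script name (made up for this sketch).
outputs <- matching_outputs("code/raw/clean_sales.R")
print(outputs$data)
print(outputs$figure)
```

With this in place, a script only needs to know its own name to decide where its data and plots go.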

data: This folder contains all of my data. The raw folder contains what I started with. Sometimes I'll use Excel for a task or plot; that gets stored here as well. Once I get the data into the form I need, I store it in a data frame and save it in the clean folder as an .rda or .RData file. This lets me load my clean data with a single command.
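As a rough sketch of that save-and-load step, base R's `save()` and `load()` are enough. The data frame `sales_clean`, its contents and the file name are made up for illustration, and the project lives under `tempdir()` here only so the example can run anywhere.

```r
# Assumed location: data/clean inside a throwaway project folder.
clean_dir <- file.path(tempdir(), "my-analysis", "data", "clean")
dir.create(clean_dir, recursive = TRUE, showWarnings = FALSE)

# Stand-in for a data frame that has already been cleaned.
sales_clean <- data.frame(
  region  = c("north", "south", "west"),
  revenue = c(120.5, 98.0, 143.2)
)

# Save the cleaned data frame as an .rda file.
rda_path <- file.path(clean_dir, "sales_clean.rda")
save(sales_clean, file = rda_path)

# Later, or in a fresh session: load() restores the object
# under the same name it was saved with.
rm(sales_clean)
load(rda_path)
str(sales_clean)
```

Note that `load()` brings back the object under its original name, so it helps to give the data frame a name you will recognize later.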

figures: I usually produce a LOT of plots as I work with data, because it's so much easier to see what's happening. I save all of them in the exploratory folder. When I find the image that illustrates the story I'm trying to tell, I add axis labels, clean up the formatting, etc., and save it in the final folder.
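Here is a minimal sketch of that exploratory-then-final workflow using base graphics and the built-in `mtcars` dataset. The file names and the project location under `tempdir()` are assumptions. It uses the `pdf()` device because it is available in every R installation; `png()` could be swapped in where it is supported.

```r
# Assumed project location; swap in your own project folder.
fig_root    <- file.path(tempdir(), "my-analysis", "figures")
explore_dir <- file.path(fig_root, "exploratory")
final_dir   <- file.path(fig_root, "final")
dir.create(explore_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(final_dir, recursive = TRUE, showWarnings = FALSE)

# Exploratory version: quick look with default labels.
explore_path <- file.path(explore_dir, "wt_vs_mpg.pdf")
pdf(explore_path)
plot(mtcars$wt, mtcars$mpg)
dev.off()

# Final version: same plot with axis labels and a title.
final_path <- file.path(final_dir, "wt_vs_mpg.pdf")
pdf(final_path)
plot(
  mtcars$wt, mtcars$mpg,
  xlab = "Weight (1000 lbs)",
  ylab = "Miles per gallon",
  main = "Heavier cars get fewer miles per gallon"
)
dev.off()
```

Keeping the same file name in both folders makes it easy to trace the polished figure back to its exploratory version.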

text: I try to always keep a log of my thought process (though I'll confess that I don't always). If I'm producing a report of my analysis, that'll get stored here as well.

I found myself recreating these folders quite a bit for different projects, so I created a Git repository to make it easy to rinse & repeat. Feel free to download and use this yourself. Change the name of the parent folder to your liking, and get started.
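If you'd rather build the layout from R than download it, something like the following would do it. This is a sketch, not necessarily how the repository is laid out: `create_project()` is a hypothetical helper, the subfolder names are the ones mentioned in this post, and leaving the text folder without subfolders is an assumption.

```r
# Hypothetical helper: create the folder layout described in this post.
create_project <- function(root) {
  folders <- c(
    "code/raw", "code/processed",
    "data/raw", "data/clean",
    "figures/exploratory", "figures/final",
    "text"  # assumed to have no subfolders
  )
  paths <- file.path(root, folders)
  # dir.create() takes one path at a time, so loop over them.
  for (p in paths) {
    dir.create(p, recursive = TRUE, showWarnings = FALSE)
  }
  invisible(paths)
}

# Built under tempdir() so the example runs anywhere;
# replace with the parent folder name you prefer.
project_root <- file.path(tempdir(), "my-analysis")
create_project(project_root)
print(list.dirs(project_root, recursive = TRUE, full.names = FALSE))
```

Because `showWarnings = FALSE` silences the "already exists" warning, running it again on an existing project is harmless.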

This approach has one added benefit. If you use RStudio as your editor, you can go one step further and create a project. This makes it super simple to pick up where you left off when you return, or to review your command history when you revisit the project a year later. I'll write about the beauty of RStudio projects soon.

One quick note about RStudio. If you do create a project, it's going to create .RData & .Rhistory files in the root folder of your project. You'll want to leave those right where they are so that RStudio will load your data & command history when you open the project.

I can tell you that this organization has absolutely helped me. It just makes it easier to keep track of what you're working on. As I get more comfortable with different algorithms & strategies, I know exactly where to look to find out how I did something. Hopefully this helps you, too.
