
Free Datasets

If you work with statistical programming long enough, you're going to want more data to work with, either to practice on or to augment your own research. Here are a handful of sources.

All of the datasets listed here are free for download. If you want more, it's easy enough to do a search.

World Bank Data - Literally hundreds of datasets spanning many decades, sortable by topic or country. Data is downloadable in Excel or XML formats, or you can make API calls. This is an outstanding resource.

Gapminder - Hundreds of datasets on world health, economics, population, etc. All of it is viewable online within Google Docs, and downloadable as spreadsheets.

The Data Hub - Hosted by CKAN. Most of these datasets come from the government.

Datamob - List of public datasets.

Numbrary - Lists of datasets.

Kaggle - Kaggle is a site that hosts data mining competitions. Each competition provides a dataset that's free for download.

SNAP - Stanford's Large Network Dataset Collection. This list has several datasets related to social networking. Lots of fun in here!

KONECT - The Koblenz Network Collection. Several datasets related to social networking & Wikipedia.

Million Song Dataset - This is a collection of audio features and metadata for a million contemporary popular music tracks.

Energy Information Administration - This site offers a number of datasets on energy production, consumption, sources, etc.

GeoDa Center - This is a collection of geospatial datasets offered by Arizona State University's Center for Geospatial Analysis & Computation.

Reddit Datasets - This one isn't a dataset itself, but rather a community on a social news site devoted to datasets. It's updated regularly with news about newly available datasets.

Quandl - This is a web-based front end to a number of public datasets. What's nice about this website is that it lets you combine data from several sources and export the result in a variety of formats.

1,001 Datasets - This is a list of lists of datasets. There's not much organization here, but there really are a LOT of datasets. Dive in and have fun.

Yahoo! Webscope - A reference library of interesting and scientifically useful datasets for non-commercial use by academics and other scientists.

Time Series Data Library - Curated by Professor Rob Hyndman of Monash University in Australia, this is a collection of over 500 datasets containing time-series data, organized by category.

Awesome Public Datasets - Curated list of hundreds of public datasets, organized by topic.

Common Crawl - Massive dataset of billions of pages scraped from the web. The data itself is on Amazon Public Datasets, so it's easy to load it into an EC2 instance there. The dataset is updated with a new scrape about once per month.

Amazon Public Datasets - Collection of datasets that are ready to be loaded into an EC2 instance.
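
Some of the sources above, World Bank Data among them, can be queried through an API instead of downloaded by hand. Here is a minimal Python sketch of what that might look like. It assumes the World Bank's v2 REST endpoint, the indicator code SP.POP.TOTL (total population), and annual data. The function names are made up for illustration, and the shape of the JSON response (a two-element list of metadata followed by records) is an assumption to check against the current API documentation before relying on it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL for the World Bank v2 API; verify against the official docs.
API_ROOT = "https://api.worldbank.org/v2"


def build_indicator_url(country, indicator, start_year, end_year, per_page=100):
    """Build a request URL for one indicator, one country, and a range of years."""
    query = urlencode({
        "format": "json",
        "date": f"{start_year}:{end_year}",
        "per_page": per_page,
    })
    return f"{API_ROOT}/country/{country}/indicator/{indicator}?{query}"


def parse_indicator_payload(payload):
    """Turn the decoded JSON into a sorted list of (year, value) pairs.

    Assumes the payload is [metadata, records] and that each record has
    a "date" (a year, as a string) and a "value" (possibly null).
    """
    if not isinstance(payload, list) or len(payload) < 2 or payload[1] is None:
        return []  # error message or empty result
    rows = []
    for record in payload[1]:
        if record.get("value") is not None:  # skip years with no observation
            rows.append((int(record["date"]), record["value"]))
    return sorted(rows)


def fetch_indicator(country, indicator, start_year, end_year):
    """Download and parse one indicator series (needs network access)."""
    url = build_indicator_url(country, indicator, start_year, end_year)
    with urlopen(url, timeout=30) as response:
        return parse_indicator_payload(json.load(response))


if __name__ == "__main__":
    # Example: total population of Brazil, 2000 through 2005.
    for year, value in fetch_indicator("BRA", "SP.POP.TOTL", 2000, 2005):
        print(year, value)
```

Keeping the URL building and the parsing apart from the download means both can be checked without a network connection. The same pattern should carry over to other sources that return JSON.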
