Are you interested in pursuing a career in data science, but struggling to find interesting projects? Or perhaps you’ve started on a Python project and are now scouring the web for hours on end for the perfect dataset to analyze on a budget?
In this article, we’ve prepared a list of free datasets to download and practice on as you make your way into data science. Not only are they publicly available, but the assorted samples are also all Python-compatible, making them even more accessible and beginner-friendly. Whether you’re currently getting your degree, transitioning from another field such as computer science or economics, or have recently discovered the world of data science altogether, these resources will provide you with valuable experience and make your resume stand out.
And once you’re done with these, feel free to check out the other assortments of Python datasets, machine learning resources, and data visualization projects that we’ve collated for you to further enrich your data science portfolio.
Without further ado, let’s get right into it!
The Boston House Price Dataset
We start with the Boston House Price Dataset, a public dataset of house prices in the Boston area together with factors such as:
- Residential land
- Number of rooms
- Size in square feet
- Crime rate per town
Easy to understand and free to download, it is a great dataset for students and absolute beginners in data science. Let’s say you want to predict house prices: by applying linear regression, you can train a model to estimate how much a given house is likely to sell for.
It is also a very popular dataset, so help is readily available online.
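To make the linear regression idea concrete, here is a minimal sketch that fits a straight line to a handful of toy values. The numbers are invented for illustration and are not taken from the Boston data, and the helper name fit_line is our own; with the real dataset you would typically use several features and a library such as scikit-learn.

```python
# Toy stand-in data (NOT from the Boston dataset): number of rooms and
# sale price in thousands of dollars for five hypothetical houses.
rooms = [4, 5, 6, 7, 8]
prices = [16, 19, 25, 31, 34]


def fit_line(xs, ys):
    """Least-squares fit for one feature: y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of cross-deviations and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(rooms, prices)

# Estimate the price of a hypothetical house with 6.5 rooms.
predicted = slope * 6.5 + intercept
print(f"price = {slope:.2f} * rooms + {intercept:.2f}")
print(f"estimated price for 6.5 rooms: {predicted:.1f}k")
```

The same idea extends to many features at once, which is where the other columns in the dataset come into play.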
The MNIST Dataset
Allow us to introduce you to arguably the most popular dataset for machine learning. Don’t believe us? Ask anyone in the field – chances are they’ve experimented with MNIST at least once.
MNIST has been circulating since the mid-90s. In short, it is an image database of 70,000 handwritten digits (from 0 to 9). It’s incredibly easy to use as the data has been heavily preprocessed, so you don’t have to worry about doing that yourself. Additionally, the images in MNIST are small (28x28 pixels) and made in grayscale (each pixel has 1 numeric value – how “white” it is).
MNIST is a widely preferred dataset for image classification and for practicing convolutional neural networks (CNNs). Aside from the data being already preprocessed, there are clearly established sets for both training (60,000 images) and testing (10,000 images).
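If you download the original MNIST files rather than loading them through a library, the images come in a simple binary layout known as IDX. The sketch below shows one way to read it, assuming the usual image-file header (four big-endian 32-bit integers: magic number 2051, image count, rows, columns) followed by one byte per pixel. The function names are our own, and the demo runs on fake in-memory bytes rather than the real files.

```python
import struct


def parse_idx_images(raw: bytes):
    """Parse IDX image bytes into a list of images (lists of pixel rows)."""
    magic, count, rows, cols = struct.unpack(">IIII", raw[:16])
    if magic != 2051:
        raise ValueError(f"unexpected magic number: {magic}")
    size = rows * cols
    images = []
    for i in range(count):
        start = 16 + i * size
        pixels = raw[start:start + size]
        if len(pixels) != size:
            raise ValueError("file is shorter than its header claims")
        # Cut the flat pixel bytes into rows of `cols` values each.
        images.append([list(pixels[r * cols:(r + 1) * cols]) for r in range(rows)])
    return images


def scale_pixels(image):
    """Rescale 0-255 pixel values to the 0.0-1.0 range."""
    return [[value / 255 for value in row] for row in image]


# Fake data for demonstration: two 28x28 images, one all 0s and one all 255s.
fake_file = struct.pack(">IIII", 2051, 2, 28, 28) + bytes(784) + bytes([255] * 784)
images = parse_idx_images(fake_file)
print(len(images), "images of", len(images[0]), "x", len(images[0][0]), "pixels")
```

Rescaling pixel values to a small range, as scale_pixels does, is a common first step before feeding the images to a neural network.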
Wine Quality
This sample dataset for wine quality is perfect for machine learning projects. It is actually composed of 2 separate datasets, covering the red and white variants of the “vinho verde” wine from the Minho region in Northern Portugal. Its inputs are physicochemical properties such as:
- Acidity
- Chlorides
- Density
- pH levels
- Sulfates
As for outputs, the dataset contains sensory wine quality variables based on scores between 0 and 10.
You can perform ordinal regression or classification tasks on the data, for example by predicting whether a wine is classed as excellent or poor. Not all available features are necessary to build a good model, which means that you’ll also be able to try out some interesting feature selection methods.
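As a starting point for a classification task, the sketch below reads wine records and turns the 0 to 10 score into two classes. It makes several assumptions: the sample rows are made up, only a few columns are shown, the file is taken to be semicolon-delimited with a header row (as the commonly distributed copies are, to our knowledge), and the cut-off of 7 for a “good” wine is an arbitrary choice of ours.

```python
import csv
import io

# Made-up rows in the style of the wine quality files (illustrative values,
# and only a subset of the columns).
SAMPLE = '''"fixed acidity";"chlorides";"density";"pH";"sulphates";"quality"
7.4;0.076;0.9978;3.51;0.56;5
7.8;0.098;0.9968;3.20;0.68;5
7.3;0.065;0.9946;3.39;0.47;7
8.5;0.049;0.9940;3.30;0.75;8
'''


def label_quality(score, threshold=7):
    """Map a 0-10 score to a binary class; the threshold is an assumption."""
    return "good" if score >= threshold else "not good"


def load_labeled_rows(text):
    """Return (features, label) pairs from semicolon-delimited text."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    rows = []
    for record in reader:
        features = {
            name: float(value)
            for name, value in record.items()
            if name != "quality"
        }
        rows.append((features, label_quality(int(record["quality"]))))
    return rows


rows = load_labeled_rows(SAMPLE)
for features, label in rows:
    print(features["pH"], label)
```

To work with a real file, you could pass its contents to load_labeled_rows instead of SAMPLE.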
Stock Market Dataset
Market prediction is always a hot topic with investors who want to make sure their money is going in the right place. The Daily News for Stock Market Prediction was initially set up as a dataset for students, but anyone can play around with it as it’s available for free download. The dataset is comprised of 2 channels:
- Data from news headlines ranging from 2008 to 2016
- Data on stock prices based on the Dow Jones Industrial Average (DJIA)
In addition, the author has split the data into two sets: one for training (80%) and one for testing (20%). This makes it a great resource for practicing deep learning methods and building predictive algorithms.
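If you ever need to reproduce an 80/20 split yourself, a minimal sketch might look like the one below. It assumes you want to keep the rows in time order, so that the model is tested on later dates than it was trained on, which is a common choice for market data. The field names date and label and the generated rows are placeholders, not the dataset’s actual columns.

```python
def chronological_split(rows, train_fraction=0.8):
    """Split rows into (train, test), keeping the earliest rows for training."""
    # ISO-formatted dates (YYYY-MM-DD) sort correctly as plain strings.
    ordered = sorted(rows, key=lambda row: row["date"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Placeholder rows for demonstration only.
rows = [{"date": f"2016-01-{day:02d}", "label": day % 2} for day in range(1, 11)]
train, test = chronological_split(rows)
print(len(train), "training rows,", len(test), "test rows")
```

A random shuffle would also give you an 80/20 split, but it would let information from later dates leak into training.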
ImageNet
ImageNet is an ongoing data collection project that aims to supply researchers and developers with high-quality images for large-scale data analysis projects and deep learning research. It is organized around word meanings, or “synonym sets”, and aims to provide around 1,000 images on average to illustrate each one. What is more, it is publicly available for non-commercial use, making it the perfect dataset for students who want to experiment with computer vision.
The dataset developers have also run some of the biggest machine learning challenges, such as ImageNet’s Large-Scale Visual Recognition Challenge (ILSVRC) that produced many of the modern state-of-the-art neural networks. These challenges are a great way to integrate yourself into the community and test your abilities – especially if you’re interested in becoming a machine learning engineer.
Breast Cancer Diagnosis Dataset
Another interesting dataset for machine learning is the Breast Cancer Wisconsin Diagnostic Dataset. It contains features computed from digitized images of a fine needle aspirate (FNA) of a breast mass. These features describe the cell nuclei present in each image, such as their radius, texture, perimeter, and area.
You can use this data as a base for starter classification projects because the target is very simple, with only two categories:
- Benign (B)
- Malignant (M)
There are 569 instances overall, of which 357 are benign and 212 malignant. That’s plenty of interesting data to experiment with!
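Before training a classifier on this data, it helps to know what accuracy you would get by always guessing the more common class. The short sketch below works that out from the class counts given above; the label list is rebuilt from those counts rather than read from the actual file.

```python
from collections import Counter

# Rebuild the label column from the counts quoted in the article.
labels = ["B"] * 357 + ["M"] * 212

counts = Counter(labels)
majority_label, majority_count = counts.most_common(1)[0]

# Accuracy of a "model" that always predicts the majority class.
baseline_accuracy = majority_count / len(labels)

print(f"class counts: {dict(counts)}")
print(f"always predicting {majority_label!r} gives {baseline_accuracy:.1%} accuracy")
```

Any model you build should beat this baseline. Since missing a malignant case is costly, it is also worth looking at metrics such as recall and not accuracy alone.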
IMDB Movie Review Dataset
If you’re looking for a dataset repository that is not only publicly available, but also packed with both processed and raw data for binary sentiment classification – we’ve got you covered.
We’ve all heard of IMDB, right? Well, here we have a huge IMDB dataset that contains substantially more data than previous benchmark datasets: it provides 25,000 movie reviews for training and another 25,000 for testing. There is also unlabeled data if you’re feeling up for a challenge.
We recommend it as a good starting point for learning natural language processing (NLP).
Note: This dataset is included in TensorFlow.
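Sentiment models cannot work on raw text directly, so a typical first step is to turn each review into numbers. Below is a minimal bag-of-words sketch using only the standard library. The two reviews are invented for illustration, the labels follow the common convention of 1 for positive and 0 for negative, and the helper names are our own.

```python
import re
from collections import Counter

# Invented example reviews with binary sentiment labels (1 = positive).
reviews = [
    ("A wonderful, moving film with a great cast", 1),
    ("Dull plot and a terrible script", 0),
]


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(texts):
    """Assign each distinct word an index, in alphabetical order."""
    words = sorted({word for text in texts for word in tokenize(text)})
    return {word: index for index, word in enumerate(words)}


def vectorize(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    vector = [0] * len(vocabulary)
    for word, count in Counter(tokenize(text)).items():
        if word in vocabulary:  # words outside the vocabulary are ignored
            vector[vocabulary[word]] = count
    return vector


vocabulary = build_vocabulary(text for text, _ in reviews)
features = [vectorize(text, vocabulary) for text, _ in reviews]
targets = [label for _, label in reviews]
print(len(vocabulary), "distinct words;", len(features), "feature vectors")
```

The resulting vectors and labels can be passed to any binary classifier. Libraries such as TensorFlow offer ready-made versions of this preprocessing.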
Food Environment Atlas
Demographic data can be a powerful tool for improving a country’s government and society alike when it is used as a basis for major economic decisions, and it can also drive significant developments in the finance industry. Machine learning models trained on public government data can help policy makers identify trends and prepare for emerging issues.
This particular dataset is composed of data on how local food resources affect an individual’s nutritional lifestyle in the United States. The Food Environment Atlas contains more than 280 variables drawn from a variety of sources, time periods, and geographic locations, making it a rather comprehensive resource. In addition, the dataset is well documented and kept up to date, and all its previous versions are also available if you’d like to compare and contrast.
Overall, the Food Environment Atlas is a great choice for building predictive models through which to obtain valuable insights on people’s dietary habits and how to improve them.
Chronic Disease Indicators
We’ve already emphasized the significance of collecting demographic data. Now, following up on our previous point, we have another dataset example in which this kind of data can play a large role.
The Chronic Disease Dataset is composed of public data collected by the CDC to track important health statistics in the U.S. Much like the Food Environment Atlas, it informs the government of chronic disease trends across the country’s territories so that policymakers can improve public health practice.
Note: If you’d like to learn more about how data can save lives, read our article on data science in healthcare, as well as medical imaging and the recommendation systems in the medical industry.
Again, this public domain dataset is ideal for machine learning as you can build predictive models based on sample data accumulated over the last 15 years or so.
Best Free Python Datasets: Next Steps
Whether you’re just now embarking on your very first Python project or already have significant experience with machine learning, finding quality sample data can be tricky. And with the web being as saturated as it is, good open source datasets are almost like diamonds in the rough. We’ve included some example resources for all skill levels, ranging from beginners to experts, that will help you sharpen your abilities, enrich your portfolio, and move forward on the path toward your future data science career.
Are you ready for the next step toward a career in data science?
The 365 Data Science Program offers self-paced courses led by renowned industry experts. Starting from the very basics all the way to advanced specialization, you will learn by doing with a myriad of practical exercises and real-world business cases. If you want to see how the training works, start with a selection of free lessons by signing up below.