What Are the Best Public Datasets for Machine Learning?
In this day and age, the aspiration to automate and improve human-related tasks with the help of computers is at the forefront of technology.
Today, this is mostly done through artificial intelligence (AI) and machine learning (ML).
These topics may seem complicated at first, especially if you’re just getting started in the field.
But, in reality, it is not that difficult to get into that part of data science. All you need is practice.
And, in order to practice your machine learning skills, you need to train your models with data.
Lots of data.
Luckily, there is plenty of it available on the Internet for free. Still, you may be wondering where to begin and which of the thousands of machine learning datasets to choose.
So, to help you get off to a good start, we have selected the 10 best free datasets for machine learning projects. We made sure the list we compiled covers all main topics of machine learning. Moreover, the projects get progressively more difficult as you go through the list. This way you can gradually improve your skills as you practice.
Let’s get started, shall we?
Top 10 Public Datasets for Machine Learning
1. Boston House Price Dataset
The Boston House Price Dataset contains house prices in the Boston area along with numerous factors that influence them, such as the number of rooms, area, and crime rates. It is a perfect starting point for ML beginners looking for easy machine learning projects, as you can practice your linear regression skills by predicting what the price of a certain house should be. It is also a very popular machine learning dataset, so if you get stuck, you can find a lot of helpful resources about it online.
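To make the idea of predicting a price with linear regression more concrete, here is a minimal sketch using NumPy. Note that it does not load the real Boston data: the two features (`rooms`, `crime_rate`), the price rule, and the helper names `fit_linear_regression` and `predict` are made up for illustration.

```python
import numpy as np

def fit_linear_regression(X, y):
    """Fit y ~ X @ w + b by ordinary least squares and return (w, b)."""
    # Append a column of ones so the intercept is learned with the weights.
    X_aug = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    return coef[:-1], coef[-1]

def predict(X, w, b):
    """Apply the fitted linear model to new rows."""
    return X @ w + b

# Synthetic stand-in data (NOT the real Boston values): two hypothetical
# features and a price generated from a known linear rule plus noise.
rng = np.random.default_rng(0)
rooms = rng.uniform(3, 9, size=200)
crime_rate = rng.uniform(0, 10, size=200)
X = np.column_stack([rooms, crime_rate])
price = 5.0 * rooms - 1.5 * crime_rate + 10.0 + rng.normal(0, 0.5, size=200)

w, b = fit_linear_regression(X, price)
print("weights:", w, "intercept:", b)
print("price for 6 rooms, crime rate 2:", predict(np.array([[6.0, 2.0]]), w, b)[0])
```

With the real dataset you would replace the synthetic arrays with the actual feature columns and prices; the fitting step stays the same.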
2. Iris Dataset
The Iris dataset is another dataset suitable for beginner machine learning projects, this time for classification rather than regression. It contains measurements of different parts of flowers. All these measurements are numerical, which makes it easy to get started and requires little to no preprocessing. The objective is pattern recognition: classifying flowers into species based on these measurements.
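Because the goal here is to classify flowers, a simple classifier is a natural first model. The sketch below assumes you have scikit-learn installed and uses its bundled copy of Iris; logistic regression and the 75/25 split are our choices for illustration, not something the dataset prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load the copy of Iris that ships with scikit-learn (no download needed).
iris = load_iris()

# Hold out a quarter of the flowers to check how well the model generalizes.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0, stratify=iris.target
)

# All features are numeric, so the model can be fit without preprocessing.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
print("features:", iris.feature_names)
print("test accuracy:", accuracy)
```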
3. MNIST dataset
The MNIST dataset is one of the most popular datasets in Machine Learning. Practically everyone in the field has experimented on it at least once.
It consists of 70,000 labeled images of handwritten digits (0 to 9). 60,000 of those are in the training set and 10,000 in the test set. The images themselves are 28x28 pixels and are in grayscale (meaning each pixel has a single numeric value: how “white” it is). They have been heavily sanitized and preprocessed, so you don’t have to do much preprocessing yourself.
The popularity of this dataset stems from its ease of use and flexibility. Given the small size of the images, you don’t have to worry much about training times, so you can experiment a lot with it. Coupled with the preprocessing, this makes it very smooth and fast to get started with. In addition, this dataset allows for many different models to work well. So, if you are a beginner, you can use a straightforward linear classifier, but you can also try and practice with a deeper network. Given that the inputs are images, this is a perfect playground for learning Convolutional Neural Networks (CNN). Overall, we encourage everyone to give this dataset a try.
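To show what a straightforward linear classifier does with 28x28 grayscale images, here is a rough NumPy sketch of the forward pass only. The images are random stand-ins rather than real MNIST digits, the weights are untrained, and the function names are our own, so the predicted digits are meaningless; the point is the shapes and the flatten-and-scale step.

```python
import numpy as np

def softmax(logits):
    """Turn each row of scores into probabilities that sum to 1."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def linear_classifier_forward(images, weights, bias):
    """Flatten 28x28 images, scale pixels to [0, 1], and score 10 digit classes."""
    x = images.reshape(len(images), -1).astype(np.float32) / 255.0  # (n, 784)
    return softmax(x @ weights + bias)                              # (n, 10)

rng = np.random.default_rng(0)
# Random stand-in images with one 0-255 value per pixel (NOT real MNIST data).
fake_images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)
# Untrained parameters: one weight per pixel per digit class.
weights = rng.normal(0, 0.01, size=(784, 10)).astype(np.float32)
bias = np.zeros(10, dtype=np.float32)

probs = linear_classifier_forward(fake_images, weights, bias)
print("output shape:", probs.shape)
print("predicted digits (untrained, so arbitrary):", probs.argmax(axis=1))
```

Training would then adjust `weights` and `bias` so that the highest probability lands on the correct digit; a CNN replaces the single matrix multiplication with convolutional layers.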
4. Dog Breed Identification
The previous entry in our list (MNIST) was a transitional dataset from feed-forward neural networks to Computer Vision. This one, Dog Breed Identification, is firmly in the Computer Vision field. It is, as the name suggests, a dataset of images of different dog breeds. Your objective is to build a model that, given an image, can accurately predict which breed it shows. So, you can transfer the CNN skills you obtained from the MNIST dataset and build upon them.
5. ImageNet
ImageNet is one of the best Machine Learning datasets out there, focused on Computer Vision. It has more than 1,000 categories of objects or people with many images associated with each of them. It is also the basis of one of the biggest ML challenges, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), which produced many of the modern state-of-the-art Neural Networks.
So, if you want to do Computer Vision, you will very likely end up working with this dataset.
6. Breast Cancer Wisconsin Diagnostic Dataset
The Breast Cancer Wisconsin Diagnostic Dataset is another interesting machine learning dataset for classification projects. Its features are computed from a digitized image of a fine needle aspirate of a breast mass and describe the cell nuclei present in the image. For each cell nucleus, ten real-valued features are calculated, such as radius, texture, perimeter, and area. There are two possible predictions: benign and malignant. The dataset contains 569 instances, of which 357 are benign and 212 are malignant.
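As a starting point for this binary classification task, the sketch below loads the copy of the dataset bundled with scikit-learn (assuming you have it installed), checks the class counts, and fits a baseline model. The scikit-learn version exposes 30 columns, namely the mean, standard error, and worst value of each of the ten features. Scaling followed by logistic regression is simply our choice of baseline.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()

# Count how many instances fall into each class.
counts = {
    str(name): int(n)
    for name, n in zip(data.target_names, np.bincount(data.target))
}
print("class counts:", counts)

X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=0, stratify=data.target
)

# The features live on very different scales, so standardize before fitting.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print("test accuracy:", accuracy)
```

Since the classes are imbalanced and a missed malignant case is costly, accuracy alone is a crude measure; in a real project you would also look at recall for the malignant class.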
7. Amazon Reviews Dataset
We are now entering the territory of Natural Language Processing (NLP). This is recommended for more advanced machine learning enthusiasts.
The Amazon Reviews Dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs). The data spans more than 20 years of reviews.
8. BBC News
Continuing with NLP, this time we have text classification, or more precisely, news classification. To develop your news classifier, you need a standard dataset. The BBC News dataset contains more than 2,200 articles in different categories, and it is your job to try and classify them.
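A common baseline for news classification is to turn each article into TF-IDF features and fit a Naive Bayes model. The sketch below assumes scikit-learn is installed and uses six made-up headlines with two illustrative labels instead of the real BBC articles, so treat the texts and labels as placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny hypothetical corpus (NOT taken from the BBC News dataset).
train_texts = [
    "The striker scored two goals in the cup final",
    "The team won the league match after extra time",
    "The tennis champion won the match in straight sets",
    "Shares fell as the bank reported lower profits",
    "The company announced profits and a new chief executive",
    "Oil prices rose and the stock market rallied",
]
train_labels = ["sport", "sport", "sport", "business", "business", "business"]

# TF-IDF turns each text into weighted word counts; Naive Bayes classifies them.
classifier = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
classifier.fit(train_texts, train_labels)

new_texts = [
    "The goalkeeper saved a penalty in the match",
    "The bank reported rising profits on the stock market",
]
predictions = classifier.predict(new_texts)
for text, label in zip(new_texts, predictions):
    print(label, "<-", text)
```

With the real dataset you would load the article texts and their categories in place of the two lists and evaluate on a held-out portion of the articles.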
9. YouTube Dataset
Now we have arrived at an even more advanced topic: video classification. The YouTube dataset contains uniformly sampled videos with high-quality labels and annotations.
10. Catching Illegal Fishing
This final dataset for machine learning projects is for the experts.
There are many ships and boats in the oceans, and it is impossible to manually keep track of what everyone is doing. That is why it has been suggested to develop a system that can identify illegal fishing activities through satellite and geolocation data. With the Catching Illegal Fishing dataset, Global Fishing Watch offers real-time data for free that can be used to build such a system.
That was our list of public datasets for machine learning projects. Bear in mind that we have included interesting datasets for all skill levels and many different parts of machine learning research. However, there might be other, more specific datasets that also work for you.
Machine Learning for Beginners
You already have a good dataset for machine learning but don’t know how to use it? Well, in that case you can explore our machine learning and deep learning courses that are part of the 365 Data Science program. There, you can learn all the skills necessary to tackle the projects outlined in the list above.