The best way to practice data science is to play around and experiment with different types of projects. If you’re planning to start a career in data science, then you need to be familiar with its fundamental building block – data. It is typically stored in a dataset, which you can use and manipulate to gain insights or practice your skills.
Datasets most often come in two file formats:
- ‘.xlsx’ – the widely known Excel file extension
- ‘.csv’ – short for comma-separated values
Of course, there are many others, but these are the ones you’re more likely to encounter during your public resource sweep.
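To make the ‘.csv’ format concrete, here is a minimal Python sketch that reads comma-separated text with the standard library's csv module. The sample rows and the helper names (load_rows, column_mean) are invented for illustration and stand in for a file you would download yourself. An ‘.xlsx’ file is not plain text, so it would need a third-party reader instead.

```python
import csv
import io

# Hypothetical sample standing in for a downloaded '.csv' file.
SAMPLE_CSV = """name,age,city
Ana,34,Lisbon
Ben,29,Oslo
Chloe,41,Lyon
"""


def load_rows(text_stream):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(text_stream))


def column_mean(rows, column):
    """Average a numeric column; csv yields strings, so convert first."""
    values = [float(row[column]) for row in rows]
    return sum(values) / len(values)


rows = load_rows(io.StringIO(SAMPLE_CSV))
mean_age = column_mean(rows, "age")
print(len(rows), "rows; columns:", list(rows[0].keys()))
print("mean age:", round(mean_age, 2))
```

For a file on disk, you could pass open("your_file.csv", newline="", encoding="utf-8") to load_rows instead of the in-memory sample.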
To become a successful data professional, you must first be able to source the right dataset for your work. That’s not an easy feat, especially if you’re a beginner or in an entry-level role. In this article, we will help you find the right ones for your projects – all for free. Without further ado, let’s take a look at the top 10 online resources for public datasets.
Kaggle
The best thing about Kaggle is that it offers thousands of datasets, big and small, which you can download for free. Most of them are formatted as ‘.csv’ files.
On the website, you’ll find many interesting datasets that are originally part of competitions for data science enthusiasts. One example is the famous Titanic dataset on which you can practice building a machine learning model to predict which passengers survived the shipwreck. Additionally, you can share your results with the Kaggle community and exchange knowledge.
So, if you’re looking for an all-in-one solution to learn, practice, and compete, then Kaggle is the right place to start.
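To give a feel for the Titanic exercise mentioned above, here is a deliberately simple baseline in Python. It is not a full machine learning model: it only learns the most common outcome for each group and predicts that. The rows are hypothetical toy values, not real passenger records, and the two columns (sex, survived) are an assumption made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy rows in the spirit of the Titanic data (NOT real records).
# Columns assumed for illustration: sex, survived (1 = yes, 0 = no).
TRAIN = [
    ("female", 1), ("female", 1), ("female", 0),
    ("male", 0), ("male", 0), ("male", 1),
]
TEST = [("female", 1), ("male", 0), ("male", 1), ("female", 1)]


def fit_majority_by_group(rows):
    """Learn the most common outcome for each group value."""
    counts = defaultdict(Counter)
    for group, label in rows:
        counts[group][label] += 1
    return {group: c.most_common(1)[0][0] for group, c in counts.items()}


def accuracy(model, rows):
    """Share of rows where the group's majority label matches the truth."""
    hits = sum(1 for group, label in rows if model[group] == label)
    return hits / len(rows)


model = fit_majority_by_group(TRAIN)
score = accuracy(model, TEST)
print("learned rule:", model)
print("accuracy on held-out rows:", score)
```

A baseline like this gives you a number to beat once you move on to a real classifier trained on the actual dataset.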
Google Dataset Search
Launched in 2018, the Google Dataset Search initiative made it possible to access and download free public datasets. You can choose from a variety of topics and formats including ‘.pdf’, ‘.csv’, ‘.jpg’, ‘.txt’, and more.
Using it is as simple as running a regular Google search: just type the name or topic you’re looking for in the search bar. As you type, it suggests datasets that match your keyword, so you might discover something entirely new and exciting.
GitHub
Besides being a developer’s best friend, GitHub offers thousands of small and large datasets for your data analysis needs. On the left side, you can filter the results by “language” and “keyword”. This lets you narrow the results down to the topics that interest you.
What is more, on GitHub you can share your work with the world, making it a great opportunity to build your data science portfolio.
World Bank Open Data
World Bank Open Data is considered one of the richest, most diverse sources of statistical facts and public datasets. You can search by categories such as “country” or “indicator” to find demographic information such as:
- Income levels
- Healthcare status
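If you prefer code to the search page, the “country” and “indicator” categories can also be combined in a request to the World Bank's public API. The Python sketch below assumes the v2 endpoint layout and a JSON response shaped as a two-element list (metadata first, records second). The helper names are our own, SP.POP.TOTL (total population) is used only as an example indicator code, and the demo payload is made up so the parsing step can run offline.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_ROOT = "https://api.worldbank.org/v2"


def build_indicator_url(country, indicator, per_page=50):
    """Compose a request URL for one country code and one indicator code."""
    query = urlencode({"format": "json", "per_page": per_page})
    return f"{API_ROOT}/country/{country}/indicator/{indicator}?{query}"


def parse_observations(payload):
    """Turn the assumed [metadata, records] payload into (year, value) pairs."""
    if not isinstance(payload, list) or len(payload) < 2 or payload[1] is None:
        return []
    pairs = [
        (int(rec["date"]), rec["value"])
        for rec in payload[1]
        if rec.get("value") is not None  # skip years with no reported value
    ]
    return sorted(pairs)


def fetch_indicator(country, indicator):
    """Download and parse one annual series (needs network access)."""
    with urlopen(build_indicator_url(country, indicator), timeout=30) as resp:
        return parse_observations(json.load(resp))


# Offline demonstration with a made-up payload in the assumed shape.
demo_payload = [
    {"page": 1, "pages": 1},
    [
        {"date": "2021", "value": 20.5},
        {"date": "2020", "value": None},
        {"date": "2019", "value": 18.0},
    ],
]
demo_url = build_indicator_url("BRA", "SP.POP.TOTL")
demo_series = parse_observations(demo_payload)
print(demo_url)
print(demo_series)
```

Calling fetch_indicator("BRA", "SP.POP.TOTL") would run the same parsing against live data, provided the response matches the assumed layout.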
data.world
Through data.world, you can access free datasets, as well as work on some directly on the website. All you have to do is create a free account and you’ll be able to work on 3 free projects. Alternatively, there are pricing plans if you need to upgrade to a larger storage space.
By using the search bar, you can look for keywords, resources, organizations, or people. And if you want to be even more specific, you can click on the “Create advanced filter” button to find exactly what you’re looking for.
DataHub
DataHub is a SaaS data-publishing platform by Datopian where you can browse through a diverse collection of public datasets organized by topic. The platform also features a blog where you can enjoy articles on various data science subjects.
What’s exciting about DataHub is that it provides you with a documentation section on how to use the platform, as well as useful tutorials on how to use its features to build visualizations and manage large datasets online.
Humanitarian Data Exchange
If you’re looking for a platform where you can download, upload, use, and share data all in one place, then Humanitarian Data Exchange is a must-visit. You can search for free datasets and filter the results by location, format, organization, and licenses.
What makes this resource so unique is that, on the home page, you’ll find a tab called “Dataviz”. There, you can explore relevant COVID-19 data and discover insightful stories in the gallery, told by the great power of data visualization.
FiveThirtyEight
FiveThirtyEight is, without a doubt, one of the best data journalism websites. It’s a bit different from the previous resources, but that’s what makes it stand out.
This great platform publishes content in sports, politics, and science, providing you with the code and data used in creating the content. The best part is that it’s all publicly available. Just sign up with your email and you’ll get the newsletter sent directly to your inbox.
Now for the exciting part: the datasets. FiveThirtyEight has a large selection of data to choose from and regularly updates its resources – evidenced by the orange dot next to a dataset that is currently updating.
UCI Machine Learning Repository
This might be the least abundant resource we’ve covered so far, yet the UCI Machine Learning Repository is nevertheless quite helpful if you’re looking to build a machine learning model.
Despite not being as rich as other dataset libraries, UCI is one of the oldest data sources ever published on the internet. There’s actually a dataset online that goes back to 1987!
The user interface is pretty simple and organized. You can browse by the default task, attribute type, data type, and area of specialty. But in case you like a more elegant and modern web design, you’re in luck – the repository is currently testing a beta version with an entirely new look.
Academic Torrents Data
If you’re an academic or working on a research paper or a Master’s thesis, then Academic Torrents Data is your ideal study buddy. The platform contains a variety of large datasets from scientific papers – some as large as 2 terabytes.
Using Academic Torrents is straightforward: simply search for datasets, papers, courses, and collections. You can also upload your own so that other people can experiment with them.
The datasets themselves are free. However, to download one, you’ll need a torrent client installed on your system.
Bonus Free Dataset Resources
In case you want to dig deeper, we’ve got you covered with this bonus list of other data resources:
- Pew Research Center: Research topics, tools & resources, and datasets
- BuzzFeed News: Open-source data and tools from BuzzFeed's newsroom
- AWS Datasets: Free public datasets from Amazon Web Services
- Nasdaq Data Link: Financial & economic datasets
- gov: Data, resources, and tools by the U.S. Government
- Global Health Observatory: Data & Statistics by the World Health Organization
- UNICEF Datasets: Resources & datasets
Free Dataset Resources: Next Steps
With these great resources in hand, you’ll never run out of data to practice on or to use in a data science project. It’s absolutely okay if you’re still unsure whether you’re ready to start your career in data – we’ve promised to support you at every step of your learning journey.
The 365 Data Science Program offers self-paced courses led by renowned industry experts. Starting from the very basics all the way to advanced specialization, you will learn by doing with a myriad of practical exercises and real-world business cases. If you want to see how the training works, start with a selection of free lessons.