Cloud Computing and Data Science
Did you know that US retail giant Walmart generates 2.5 petabytes of data from approximately 1 million customers every hour?
And in case you’re wondering how much a petabyte is, as I did when I first read this, it is equal to 1 million gigabytes, or the equivalent of 13.3 years of HD video.
Considering that Walmart locations are open for business for more than 10 hours a day, we get a staggering 25 petabytes of data collected on a daily basis, which works out to more than 330 years of HD video!
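If you’d like to check the data volumes yourself, here is a minimal sketch in Python. It uses only the figures quoted above; the 10-hour day is the lower bound mentioned in the text, and the variable names are my own.

```python
# Figures quoted in the article; the 10-hour day is a lower-bound assumption.
PETABYTES_PER_HOUR = 2.5
CUSTOMERS_PER_HOUR = 1_000_000
GIGABYTES_PER_PETABYTE = 1_000_000  # decimal units: 1 PB = 10**6 GB
OPEN_HOURS_PER_DAY = 10

# Daily volume, first in petabytes, then in gigabytes.
petabytes_per_day = PETABYTES_PER_HOUR * OPEN_HOURS_PER_DAY
gigabytes_per_day = petabytes_per_day * GIGABYTES_PER_PETABYTE

# Average data generated per customer in a single hour.
gigabytes_per_customer = (
    PETABYTES_PER_HOUR * GIGABYTES_PER_PETABYTE / CUSTOMERS_PER_HOUR
)

print(f"{petabytes_per_day:.0f} PB per day")
print(f"{gigabytes_per_day:,.0f} GB per day")
print(f"{gigabytes_per_customer:.1f} GB per customer per hour")
```

Put differently, if the quoted figures hold, each customer accounts for about 2.5 gigabytes of data per hour on average.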
Yes, there aren’t many companies like Walmart.
But even smaller enterprises nowadays generate huge amounts of data, so it becomes increasingly challenging to take advantage of such an abundance of information.
And yes, data science is at the heart of all that. But before we can apply data science, we must do justice to another crucial player – the cloud and cloud computing in general. That’s exactly what we will focus on in this article.
Why is cloud computing essential for data science?
To understand the advantages cloud computing provides when it comes to data science, let’s imagine a world with as much data as we have today, but without servers.
In such an unfortunate scenario, firms would need databases that run locally, right?
So, every time you, as a data scientist, wanted to run a new analysis or refresh an existing algorithm, you’d have to transfer information from the central database to your machine, and then proceed to operate locally.
This unfortunate world would have several main drawbacks:
- Manual intervention would be necessary to retrieve data;
- Your machine would become a single point of failure for the analyses you have worked on locally;
- Processing speed would be limited by the computing power of your computer;
- Chances are you would only be able to work with a limited amount of data due to the limited computing resources at your disposal;
- Moreover, under this setup, you wouldn’t be able to leverage real-time data to build recommender systems or any type of machine learning algorithms that require ‘live’ data.
Doesn’t sound like the perfect scenario, does it?
Well, that’s why we invented servers.
And then these servers had drawbacks of their own.
- The most obvious one is that a server needs physical space. A cloud is basically somebody else’s server, so the space is their problem;
- Server infrastructure is expensive to buy and set up. Cloud infrastructure is already there, ready for you to use;
- In-house data storage requires backups, ideally kept in different locations. Clouds make data available anywhere, anytime, usually backed up on many different servers across the world;
- Servers need planning. For fast-growing companies, server needs could be unpredictable even for the current quarter. With in-house servers, you usually end up buying more servers than you actually need at a given time. With the cloud, you pay only for what you use.
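To make the planning point a bit more concrete, here is a rough sketch in Python. Every number in it is hypothetical (the quarterly demand, the prices, and the function names are mine, not real provider pricing). It simply assumes that in-house capacity has to be sized for the busiest quarter, while cloud capacity is billed per unit actually used.

```python
def in_house_cost(demand_per_quarter, cost_per_server):
    """Cost when capacity is sized for the peak quarter and kept all year."""
    peak = max(demand_per_quarter)
    return peak * cost_per_server * len(demand_per_quarter)


def cloud_cost(demand_per_quarter, cost_per_server):
    """Cost when only the servers actually used each quarter are billed."""
    return sum(demand_per_quarter) * cost_per_server


# Hypothetical demand (servers needed) for four quarters of a growing firm.
demand = [10, 14, 30, 18]

# Hypothetical prices per server per quarter; the cloud unit price is set
# higher on purpose, since pay-as-you-go capacity often costs more per unit.
print(in_house_cost(demand, cost_per_server=100))  # peak of 30 paid every quarter
print(cloud_cost(demand, cost_per_server=130))     # 72 server-quarters billed
```

With these made-up numbers the cloud comes out cheaper despite the higher unit price, because the in-house setup pays for idle capacity in three quarters out of four. With flat, predictable demand the comparison could easily go the other way.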
You see my point, right?
Fortunately, we now have clouds.
They outperform local servers in almost every respect. And, in fact, data scientists should be focused on developing great algorithms, testing hypotheses, and taking advantage of all available data, without having to wait hours to see the results of the tests they are performing and certainly without having to worry about how much memory they have left on their computer.
And yes, sometimes data scientists do end up waiting long hours for an algorithm to train, but with a cloud, they have the option to pay more and get the job done faster.
That’s yet another advantage of cloud computing over servers.
That being said, the biggest winners are smaller entities, as they get cheap access to the same tools as enormous corporations. And this is why cloud technologies are a huge enabler. They create a level playing field and allow small players to compete with much bigger ones.
If you think about it, this technological progress changed a number of businesses in a way similar to how the Internet changed commerce.
Remember when, all of a sudden, people around the world were able to open e-commerce stores and compete on a global scale with the established firms?
Well, in the same way, cloud technologies and cloud computing democratized data analysis and data science.
The fact that data scientists and data analysts can rely on data stored on the cloud truly makes their lives so much easier!
In addition, most cloud providers give data scientists access to pre-installed open-source frameworks right away. This is not only super convenient but can also be a huge time saver.
By contrast, if you wanted to use Apache Spark in the conventional way, you would have to:
- Start by installing Java;
- Then continue by installing Scala;
- After which you’ll be able to download Apache Spark and install it.
That’s the setup you need to go through if you are working on your own PC. However, if you are using a cloud service, you’ll be able to start working with the Apache Spark framework right away! Yep, it’s already been installed for you. The same is valid for many different open-source frameworks.
This type of easy-to-access, easy-to-use infrastructure is very attractive and potentially applies to all sorts of applications data analysts and data scientists use in their work.
Over the last few years, Amazon Web Services, Microsoft Azure, and Google Cloud have worked to boost the capability of their cloud services to run machine learning algorithms. The Big 3 of cloud services focused on this area extensively, as they realized it could be an important source of competitive advantage in the long run.
One of the biggest selling points of cloud machine learning is that it allows small and medium enterprises to access a machine learning infrastructure they otherwise wouldn’t be able to afford.
For example, thanks to cloud-based machine learning, a small e-commerce retailer could run a real-time recommender system algorithm to improve the product offering shown to customers based on the products they have already added to their cart. In this type of business, every website click can be interpreted as a particular type of intention and signal, and hence the real-time updated algorithm operating in the cloud will be able to make a suggestion that improves the chances of making a conversion and maximizing revenues.
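To give a feel for what such a cart-based recommender does, here is a deliberately tiny sketch in Python. It is not how any particular cloud service implements recommendations. It assumes a simple co-occurrence approach (suggest the products most often bought together with what is already in the cart), and the product names, orders, and function names are all made up for illustration.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_cooccurrence(past_orders):
    """Count how often each pair of products appeared in the same order."""
    co = defaultdict(Counter)
    for order in past_orders:
        for a, b in permutations(set(order), 2):
            co[a][b] += 1
    return co


def recommend(cart, co, k=3):
    """Return up to k products most often bought with the cart's items."""
    scores = Counter()
    for item in cart:
        for other, count in co.get(item, {}).items():
            if other not in cart:
                scores[other] += count
    # Sort by score (highest first), then by name so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [product for product, _ in ranked[:k]]


# Hypothetical order history for a small online store.
past_orders = [
    ["laptop", "mouse", "bag"],
    ["laptop", "mouse"],
    ["laptop", "bag"],
    ["phone", "case"],
    ["mouse", "pad"],
]
co = build_cooccurrence(past_orders)

# Each time the customer adds something to the cart, ask again.
print(recommend(["laptop"], co, k=2))
print(recommend(["laptop", "mouse"], co, k=2))
```

In a real-time setting, the counts would be updated continuously as new clicks and orders arrive, which is exactly the part that benefits from always-on cloud infrastructure.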
Without cloud-based machine learning, setting up the necessary infrastructure to perform this type of analysis would be really costly, and therefore difficult for small and medium enterprises to execute.
It is still unclear who will win the cloud war between giants like AWS, Microsoft Azure, and Google Cloud. But one thing is certain: this is a service that greatly benefits small and medium-sized businesses, enabling them to level the playing field when competing against large multinationals with superior IT infrastructure.