Data is literally everywhere. According to MongoDB’s APAC Senior Vice President, Simon Eid, there are over 40 zettabytes of data in the world today, which is equivalent to 40 trillion gigabytes of data! And although it’s abundant, it’s still so valuable that it’s considered the fuel for all industries, from healthcare to transportation.
However, if you don’t know how to manage, protect and clean this huge amount of data, it becomes completely worthless. Just like petrol, without the right technique and tools to extract it, you’ll never be able to use it to the fullest.
So, how can you make the most of your data? The answer is simple: the first step when it comes to dealing with data is knowing its type and properties. That said, if you’re still new to data and you want to understand how to read and interpret it, you can start with the Data Literacy course. And if you already understand the concept of data and you're looking to develop a career in data science, then Intro to Data and Data Science is the right online course for you. But for now, let’s focus on exploring the different types of data with this beginner-friendly guide.
A Complete Guide on Data Types
- Data Types According to Their Value
- Data Types According to the Sensitivity Level
- Data Types in Python
- Importance of Data Types
Data Types According to Their Value
Generally speaking, we can classify data into 2 main types: qualitative (categorical) data and quantitative (numerical) data.
Let’s break each of these down.
Qualitative (Categorical) Data
Qualitative data usually describes an object or a group of items. It’s also known as categorical data because, as the name implies, you can assign each item or data point to a specific category. Examples include colors, plants, and places.
Qualitative data is further divided into 2 subtypes – “ordinal” and “nominal”.
Ordinal Data
Ordinal data follows a specific order or ranking, as in test grades, economic status, or military rank.
Nominal Data
Nominal data, however, doesn’t follow a specific order like ordinal data. Consider gender, city, employment status, colors, etc.
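To make the ordinal/nominal distinction concrete, here is a minimal sketch using only the standard library. The rank order and the sample values are hypothetical, chosen just for illustration:

```python
from collections import Counter

# Ordinal data: the categories have a meaningful order (hypothetical ranks).
RANK_ORDER = ["private", "sergeant", "captain", "general"]
ranks = ["captain", "private", "general", "sergeant"]

# Sorting by position in RANK_ORDER respects the ranking.
sorted_ranks = sorted(ranks, key=RANK_ORDER.index)

# Nominal data: there is no order, so counting makes sense but ranking does not.
cities = ["Paris", "Tokyo", "Paris", "Lima"]
city_counts = Counter(cities)

print(sorted_ranks)
print(city_counts.most_common(1))
```

Note that the ordinal values need an explicit ordering to be sorted meaningfully, while the nominal values are only grouped and counted.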
Quantitative (Numerical) Data
On the other hand, quantitative data deals with numeric values on which we can apply mathematical operations – a person’s height, the number of fruits in a basket, the number of kids in a school.
Although these examples seem similar, here’s something else to keep in mind – quantitative data can be continuous or discrete.
The difference is that we can split continuous data further into smaller units, and the values still make sense. However, this is not possible with discrete data, as dividing it into smaller units gives us meaningless values.
For example, weight is continuous because we can measure it in kilograms, grams, and milligrams and still have a valid weight value. But can we apply the same concept to a discrete value, such as kids in a school? That would be more than unreasonable, as you can’t possibly divide a kid in half or into smaller units, right?
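The same idea can be sketched in a few lines of Python. The numbers below are made up for illustration:

```python
# Continuous: a weight stays meaningful at any unit or precision.
weight_kg = 72.5
weight_g = weight_kg * 1_000
weight_mg = weight_g * 1_000

# Discrete: a head count only makes sense as a whole number.
kids_in_school = 251
half_the_kids = kids_in_school / 2  # 125.5 is not a valid head count

print(weight_g, weight_mg)
print(half_the_kids, half_the_kids.is_integer())
```

The weight remains a valid measurement after every conversion, whereas halving the head count produces a fraction that no real group of kids can match.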
Now, it’s alright if you feel confused and can’t clearly differentiate between these different types of data just yet. It gets better over time and the key to progress is practice. In the meantime, you can always check the following diagram whenever you need a handy reference:
Data Types According to Sensitivity
Data sensitivity is a controversial matter with many loose ends yet to be tied. However, the repercussions of neglecting it are serious: if you use someone’s personal data without their permission, you can face a class-action lawsuit. Therefore, being able to categorize data according to its sensitivity is a fundamental aspect of working as a data professional. So, let’s briefly cover the main levels of sensitivity:
Low Data Sensitivity
Low sensitivity or public data is the type of data that almost anyone can access and share without harming individuals or institutions. Examples include public website content, such as blogs and downloadable materials, directory information, and company information.
Medium Data Sensitivity
Data of this level is for internal use only. Slight harm could result when disclosing medium sensitivity data, such as donors’ data, emails, and personnel records.
High Data Sensitivity
This is confidential data and disclosing it for any reason can cause serious harm both to individuals and institutions. High sensitivity data comprises passwords, social security numbers, financial data, etc.
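If you need to work with these levels in code, one possible sketch is an ordered enumeration. The class name, the field names, and the mapping below are hypothetical and only mirror the examples given above; a real organization would define its own policy:

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    LOW = 1     # public data
    MEDIUM = 2  # internal use only
    HIGH = 3    # confidential


# Hypothetical field-to-level mapping based on the examples above.
FIELD_SENSITIVITY = {
    "blog_post": Sensitivity.LOW,
    "personnel_record": Sensitivity.MEDIUM,
    "password": Sensitivity.HIGH,
}


def fields_at_or_above(level):
    """Return the field names whose sensitivity is at least `level`."""
    return sorted(name for name, s in FIELD_SENSITIVITY.items() if s >= level)


print(fields_at_or_above(Sensitivity.MEDIUM))
```

Using an ordered enum means the levels can be compared directly, which is handy when filtering out everything that should not leave the organization.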
Data Types in Python
This classification is of major importance to your work as a data professional. After all, how will you manage or analyze data if you don’t know its type?
That said, there are 2 ways to determine data types in Python. One is just by looking at the value and determining its type, and the other is by code. To be honest, there are many Python data types, but here we’ll focus on the ones you’ll most frequently deal with.
Numbers
There are 3 types of Python numbers:
Int
Int is short for integer, which is a positive or negative whole number (or zero), e.g. -4, 0, 3
Float
A float is a real decimal number, and it can be positive or negative, e.g. 15.2, -80.5, 7.509
Complex
A complex number consists of a real part and an imaginary part, and it’s denoted (x + yj). The “x” part is real and the “y” part is imaginary, e.g. 9+5j
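As a quick check, the built-in type() function reports which of the 3 numeric types a value has. The variable names here are arbitrary:

```python
whole = 3
decimal = -80.5
z = 9 + 5j

# type() returns the class of each value.
print(type(whole).__name__, type(decimal).__name__, type(z).__name__)

# A complex number exposes its two parts as floats.
print(z.real, z.imag)
```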
String
Text in Python is called a string. For instance, a country name, a zip code, or a gender label is stored as a string.
List
An ordered group of values is called a list in Python. The items in a list are enclosed within square brackets and separated from each other by commas, and the values can be modified after creating the list, e.g. x = [5, 10, 22.5, "orange", "red"]
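A short sketch of what “can be modified” means in practice:

```python
x = [5, 10, 22.5, "orange", "red"]

x[0] = 7          # replace an existing item
x.append("blue")  # add a new item at the end

print(x)
```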
Tuple
Tuples resemble lists in the way values are separated. However, the values are enclosed within parentheses, and once you create a tuple, you can’t modify the values inside, e.g. a = (10, 30, "book", "pen", 9.34)
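To see the difference from a list, here is a minimal sketch of what happens when you try to change a tuple:

```python
a = (10, 30, "book", "pen", 9.34)

try:
    a[0] = 99  # tuples do not allow item assignment
except TypeError as error:
    message = str(error)

print(message)
print(a[0])  # the original value is unchanged
```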
Set
A set is a group of unordered, unique values separated by commas and enclosed within braces, e.g. b = {10, 15, 50, 70}
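One practical consequence, sketched below, is that a set silently drops duplicates and is mainly used for membership checks:

```python
b = {10, 15, 50, 70, 10}  # the second 10 is a duplicate

print(len(b))   # duplicates are stored only once
print(15 in b)  # membership test
```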
Dictionary
Similar to a word dictionary, a Python dictionary stores pairs of items: a key and its corresponding value, separated by a colon. For example, Dict = {"Name": "Mary", "Country": "UK", "Age": "30"}
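Values are looked up by key rather than by position. A small sketch, using the variable name person instead of Dict purely as a naming choice:

```python
person = {"Name": "Mary", "Country": "UK", "Age": "30"}

print(person["Country"])                # look up a value by its key
print(type(person["Age"]).__name__)     # "30" in quotes is a string, not a number

person["Age"] = 30                      # values can be replaced, here with an int
print(type(person["Age"]).__name__)
```

Note that the age in the original example is written in quotes, so Python treats it as text until it is replaced with a number.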
DateTime
DateTime is more of a module than a built-in data type: Python’s standard-library datetime module provides it. It represents the date (year - month - day) and time (hour - minute - second), e.g. 2020-05-29 15:30:22.156
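As a sketch, the example value above can be built with the datetime class from the standard-library datetime module. Note that the last argument is given in microseconds, so .156 seconds becomes 156000:

```python
from datetime import datetime

moment = datetime(2020, 5, 29, 15, 30, 22, 156000)

print(moment)
print(moment.year, moment.month, moment.hour)
print(type(moment).__name__)
```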
Boolean
A boolean is the simplest of all types, as it has only 2 values. In Python these are True and False, which correspond to yes/no or 1/0.
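A brief sketch of how booleans behave in Python. The comparison used to produce one is arbitrary:

```python
is_adult = 30 >= 18  # comparisons produce booleans

print(is_adult, type(is_adult).__name__)
print(True == 1, False == 0)  # booleans behave like 1 and 0

# Text such as "no" is a string, not a boolean: any non-empty string is truthy.
print(bool(""), bool("no"))
```

The last line is a common pitfall: a column of “yes”/“no” text needs to be converted explicitly before it can be used as boolean data.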
How to Determine Data Types in Python?
Now, let’s figure out how to determine data types using Python code.
It’s fairly easy. For any given dataset loaded into a pandas DataFrame (called df here), you can write the following code line:
df.dtypes
And here’s the output you will get:
*Note: a string data type is displayed in the output as “object”
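Here is a minimal, self-contained sketch of that call. It assumes the pandas library is installed; the DataFrame and its column names are hypothetical stand-ins for a real dataset:

```python
import pandas as pd

# Hypothetical dataset with one text column and two numeric columns.
df = pd.DataFrame({
    "name": ["Mary", "John"],
    "age": [30, 41],
    "height_m": [1.65, 1.80],
})

# One line per column, showing the type pandas inferred for it.
print(df.dtypes)
```

The exact label shown for the text column depends on your pandas version, so treat the printed names as something to check rather than memorize.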
Another option is to use “info” to further understand all the dataset properties:
df.info()
In this case, the output will be:
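As a sketch under the same assumptions (pandas installed, a hypothetical DataFrame), the summary can also be captured as text through the buf parameter, which is convenient for logging. Calling df.info() with no arguments simply prints the same summary:

```python
import io

import pandas as pd

# Hypothetical dataset, used only for illustration.
df = pd.DataFrame({
    "name": ["Mary", "John"],
    "age": [30, 41],
    "height_m": [1.65, 1.80],
})

# Write the summary into a text buffer instead of printing it directly.
buffer = io.StringIO()
df.info(buf=buffer)
summary = buffer.getvalue()

print(summary)
```

The summary lists each column with its non-null count and type, followed by the memory usage of the DataFrame.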
Importance of Data Types
Why is it so important to understand data types?
Let’s suppose you’re a merchant who imports and exports different types of goods (some are edible, while some are not), and in the course of your job, you decide to try exporting all of them in one shipment. Now, if you don’t know how to differentiate between these different types of goods, and you don’t know how to store them, how would you be able to send them miles away intact and ready for consumption?
The same thing goes with data. To work with, clean, analyze, and share data, first you need to determine its type. Once you do that, knowing its exact properties becomes an easy job. Here’s an example of what I’m talking about. If you have a dataset with the column “zip code” and you want to determine its type, the first thing you think of is “integer” because zip codes consist of digits. But that’s not true. A numerical value can be added, subtracted, and multiplied by other numerical values. So, if you apply this to zip codes you’ll get a totally different number which is certainly not a valid zip code. And that leads us to understand that zip codes should be treated as strings.
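A short sketch of the zip code problem. The value "02134" is just an illustrative code that starts with a zero:

```python
zip_as_text = "02134"
zip_as_number = int(zip_as_text)

print(zip_as_number)                      # the leading zero is lost
print(zip_as_number + 1)                  # arithmetic works, but the result means nothing
print(str(zip_as_number) == zip_as_text)  # converting back does not restore the original

# String operations, on the other hand, are meaningful for zip codes.
print(zip_as_text.startswith("02"))
```

Besides the meaningless arithmetic, storing the code as an integer also loses information, which is another reason to keep it as text.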
Here's something else. When dealing with your computer, if you insert the wrong data type, your computer won’t be able to interpret or deal with this piece of data properly. Hence, you and your computer should always be on the same page!
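For instance, here is a minimal sketch of what Python does when a number arrives as text and you try to do arithmetic with it. The variable names are arbitrary:

```python
age_text = "30"  # a number stored as text

try:
    next_year = age_text + 1  # mixing str and int is not allowed
except TypeError:
    # Convert to the right type first, then do the arithmetic.
    next_year = int(age_text) + 1

print(next_year)
```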
To sum up, here are the reasons why understanding data types is crucial.
- Knowing the exact data format and size helps save time and space.
- It reduces the likelihood of errors in the cleaning and analysis stages.
- It ensures that the functions you’ll write later will give you the desired results.
- It helps with instrumentation, which is the process of tracking data and sending it to other systems. To instrument data properly and create an effective tracking plan, you need to determine all types of data beforehand.
But, no matter where you are on your data journey, always remember – practice makes perfect. You’ll come to a point where you’ll just notice how working with different data types has become much easier. However, if you are an aspiring data professional with zero experience and no clue where to start, here’s the place where you can build solid data skills from the ground up.
The 365 Data Science Program offers self-paced courses led by renowned industry experts. Starting from the very basics all the way to advanced specialization, you will learn by doing with a myriad of practical exercises and real-world business cases. If you want to see how the training works, start with a selection of free lessons by signing up below.