Why Do We Need to Pre-Process Time Series Data?
Time series data represents data that attaches a time stamp to each recorded entry. These make up a large portion of the datasets used in business and science since knowing when something occurred is crucial. However, these records can be incomplete, or the events could be occurring sporadically.
But what do we do in such cases? We pre-process the data into a time series data frame.
In this tutorial, you’ll learn how to pre-process time series data. In other words, we’ll guide you through a variety of ways to manipulate and convert a data set into a time-series. These include converting text or numbers into dates, as well as setting the dates as indices, assigning a frequency and accounting for missing values.
Let’s get started!
How to Import Data?
In order to pre-process time-series data, obviously, we need to import some data first. We can either scrape it or add it from a file we have stored locally. In our case, we’ll use the “Index2018” file.
If you need a hint on how to so import the data, you can refer to our dedicated tutorial on importing values.
How to Convert Values into Dates When We Pre-Process Time Series Data?
The first thing we usually do is transform the dates into the appropriate Date Type format.
How to Explore the Data?
We can start by calling the “describe” method on the “date” column.
To carry it out, write “df_comp.date.describe” and run the cell.
Subsequently, what we see is that the values are simply a text representation. As users, we can easily interpret those, but Python can’t. This poses an issue because we want the option to analyse specific consecutive intervals of time.
You will notice that the summarized statistics for dates are very limited and don’t include the first or last value in the series. The “top” date is misleading as it doesn’t hold the highest value or the date with the highest frequency. Instead, it is completely arbitrary and changes every time we re-load the data.
Let me elaborate a bit more on that.
For Python, all the values in the “date” column are 1s, because they hold some value different from 0. Therefore, any single date holds a “top” value. This explains why the method just randomly selects which one to display.
Of course, we need to fix this, if we want to analyse consecutive chunks of data and refer to the values assigned to a specific day. That’s why we must convert the dates into a “date time” type. Lucky for us, the “pandas” package provides a simple to_datetime method we can call.
How to Transform Strings into Dates?
This is the next step you should take when you pre-process time series data. The method to_datetime() requires a single argument - the attribute we wish to transform into a date time type. In our case, that’s the entire ‘date’ column of the data frame, so we write:
A specific feature of the to_datetime() method is that it assumes we are plugging in a string in a “month-day-year” form. Since our data is saved in the “day-month-year” format, we need to specify this when calling the method. And we do it by setting the value of the optional argument “dayfirst” to “True”.
The new date time values will be recorded in a “year-month-day” format. Although we are not used to seeing dates like this, it is the most convenient representation as it allows us to clearly see how the values are ordered.
Since we want to replace the string values in the column with “date time” values, we need to assign the results to the “date” attribute of the data frame. If we omit to store them, the code will only display how the column would look, without overwriting the data. Therefore, to record the new “date type” values, we insert df_comp.date = at the beginning of the line:
So, how can we make sure we have successfully converted the dates?
By using the “head” method to explore the top observations in the data frame.
The values are formatted differently than before, right? To test if the “date” column is no longer stored as text, we can once again use the “describe()” method. Just one thing to remember - only call the method on the “date” attribute, rather than the entire data frame.
After the change, we get the more detailed statistics about the time period. We see that it starts on the 7th of January 1994 and ends of the 29th of January 2018.
In addition, all the date time values also include the hours, minutes and seconds, even though the dataset only holds daily values. That’s because “pandas” is equipped to handle high frequency data with difference between intervals as small as a second.
How to Transform Numbers into Dates?
Alternatively, the initial “date” values could be numeric values instead of strings. In those cases, we can still use the do_datetime method again. The only major difference includes passing some values to the format argument. For example, if we have values like 20200113, then they clearly represent January 13th 2020, but that isn’t obvious for Python. We need to specify the order of the values goes year – month – date. Hence, we would write:
Of course, if we only have yearly, quarterly, or data with different frequencies, we just need to adjust the format argument appropriately and we’re all set.
Now that we have transformed the “date” column into the appropriate type, we can use it as an index for our time-series.
How to Set an Index When We Pre-process Time Series Data?
To refer to a set of values as a time-series, each one should correspond to a time-period. This is crucial for referring to values according to the day they were recorded. Moreover, we often wish to examine specific chunks of data between two concrete dates. For example, analyses of The Great Recession usually involve separating the data before and after September 15th, 2008 - the day Lehman Brothers filed for bankruptcy. Thus, we need to use the associated time periods as indices.
In this instance, we have a dataset of daily values. Therefore, each date would be a separate time-period, so we can use the “date” column of the data frame as the new index column.
How To Use set_index()?
To achieve this easily, the pandas package provides a method called set_index().
Since we are setting the indices for the entire data frame, we write:
In the parenthesis, we write the name of the data frame attribute we intend to use as an index from now on. But make sure to put it in quotation marks, too.
Then we set the value of the optional argument inplace equal to True to tell Python to save this new data frame in the place of the one with integer indexes.
To see if we have successfully created a time-series, let’s check the values of the data frame using the ‘head’ method.
There is no longer a column with integer values on the left. Instead, we have the “date” column in its place. We can tell these are the new index values because they appear in bold once the data frame is displayed. The indices are also pretty much “fixed”, in the sense that we cannot freely overwrite them. Therefore, it is crucial to ensure all the indexes are correct.
How to Use Yahoo Finance in Pre-Processing Time Series Data?
Now, if you opted to scrape the data from Yahoo Finance, instead of importing it from a local file, then you don’t need to set an index column. That is because the result of the yfinance.download method is a data frame, which has Datetype64 values for its indices. Thus, the output of the method is already stored in a convenient format for us.
How To Assign the Right Frequency When We Pre-Process Time Series Data?
Time-series data requires having a constant frequency that persists throughout the set.
How to use as_freq()?
The good news is, the “Pandas” allows us to assign the frequency of the set by calling the as_freq method.
It requires defining a single argument depending on the type of data we are using.
The values this argument can take are specific letters of the alphabet in single quotation marks. For instance, writing the letter “h” in parenthesis will indicate the data is recorded hourly. Most of the possible values are just as intuitive with “w” representing weekly data, “d” - daily, “m” - monthly and so on.
Most often, we refer to the data measured once a year as “annual” rather than “yearly”, so the parameter value we use is “a” instead of “y”.
What Frequency Peculiarities We Should Keep in Mind?
The first one is that our data contains daily closing prices, so we use the “as freq” method to assign the appropriate frequency.
Another peculiarity of our data is that it represents the closing prices of financial indices. These prices can only be recorded during working days, so we’re not interested in any weekends or holidays.
If we do a quick check, we find that the two periods generated by the method were indeed Saturday and Sunday. So, it’s best to avoid adding such irrelevant time-periods into our data set.
Fortunately, the “as freq” parameter also accepts the value “b”, which states the data was only recorded on business days.
Let’s adjust the argument value and once again call the “head” method.
As you can see, setting the frequency to business days avoids adding many unnecessary time periods.
Next up, we will go through the most common approaches to account for missing values.
How to Handle Missing Values When You Pre-Process Time Series Data?
How to Find Missing Values?
For starters, we need to test if assigning the frequency lead to additional time periods for which data is Not Available.
Again, there’s a dedicated method: is_na() which we can call on the entire data frame. We will see either “True” or “False” for every entry in the set, depending on whether there is missing data for that period.
To clarify, “True” indicates there are missing values and “False” suggests there aren’t.
The bigger the dataset, the more difficult it becomes to spot missing entries. In binary “True” is equivalent to 1 and “False” is equivalent to 0. Therefore, we can add up the values in each column to determine the number of missing observations. For convenience, we can call the “sum()” method along with “is_na()”, like so:
For each attribute, this will determine the number of instances without available information.
After running the cell, we see there are exactly 8 missing values for each of the market indexes.
Recall that we had no missing values when we first examined the dataset.
Therefore, setting the frequency to “business days” must have generated 8 dates, for which we have no data available.
How to Fill Missing Values?
To amend this, we need to fill these missing elements with values. The easiest way to do it is via the fill_na() method.
The method goes through the entire data and fills out missing values in several different ways.
We call the first one “front filling” and it assigns the value of the previous period. For instance, if we have no data available for the 15th of July 2004, we assign to it the value recorded on the 14th of July 2004.
The second way we can tackle missing values is through “back filling”. As you can probably guess, it assigns empty entries of the value for the next period. For instance, if we have no data for July 7th, 2004, we pass it the same value as the one we have recorded for July 8th the same year.
A third way we deal with missing data is by assigning the same value to all the NA time periods. For instance, it is common practice to assign the average value of the time-series to all the missing values.
Depending on your data, you may opt for only one or a combination of methods. Usually, filling missing values with the mean is a bad approach when it comes to time-series because there are underlying time-variant patterns in the data. This approach is appropriate only when the data heavily fluctuates around the mean from the first day to the last day.
Well, that about wraps up all the specifics of time-series pre-processing in Python. If you'd like to hone your practical Python skills, check out the rest of our Python tutorials and exercises, or read everything there is to know about Python programming in the 365 Learning Python Programming Guide.
Ready to take the next step towards a career in data science?
Check out the complete Data Science Program today. Start with the fundamentals with our Statistics, Maths, and Excel courses, build up step-by-step experience with SQL, Python, R, and Tableau, and upgrade your skillset with Machine Learning, Deep Learning, Credit Risk Modeling, Time Series Analysis, and Customer Analytics in Python. If you still aren’t sure you want to turn your interest in data science into a solid career, we also offer a free preview version of the Data Science Program. You’ll receive 12 hours of beginner to advanced content for free. It’s a great way to see if the program is right for you.