As a data science enthusiast, you probably already know that Pandas is Python’s primary library for data analysis and manipulation. What you may not have heard yet is that Pandas 1.0.0 has been officially released!
The New Pandas 1.0.0
At first sight, this version does not look much different to the user from the previous release, Pandas 0.25.3. Still, it brings plenty of enhancements that boost performance and lay a better foundation for the long run. Version 1.0.0 is positioned as a stable release of pandas with a strengthened API, from which many features deprecated in earlier versions have been removed. So,
What are the most notable improvements that come with Pandas 1.0.0?
StringDtype and BooleanDtype
The dedicated string and Boolean data types are still “experimental”, which means their behavior may change in upcoming releases. For now, pandas will not automatically assign “string” or “boolean” to your data. You get these types only if you explicitly specify dtype=pd.StringDtype() or dtype=pd.BooleanDtype() (or the aliases dtype="string" and dtype="boolean") when creating a new structure. In the future, this may become the default way in which pandas treats data of this type. We’ll just have to wait and see.
Also, consider the benefit of the new “string” data type. Until now, pandas stored text in the catch-all “object” dtype, which can also hold any other Python object, such as a date value. Using “string” lets you tell text apart from everything else, so you can select and manipulate string data much more easily. Which leads us to the second point worth mentioning.
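To make the opt-in concrete, here is a minimal sketch of creating Series with the two experimental types. The sample values and variable names are made up for illustration.

```python
import pandas as pd

# Opt in explicitly: pandas 1.0.0 does not pick these dtypes on its own.
text = pd.Series(["apple", "banana", None], dtype=pd.StringDtype())
flags = pd.Series([True, False, None], dtype=pd.BooleanDtype())

print(text.dtype)
print(flags.dtype)

# Missing entries in both Series are represented by pd.NA.
print(text.isna().sum(), flags.isna().sum())
```

Passing dtype="string" or dtype="boolean" instead of the dtype objects should give the same result.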
DataFrame.select_dtypes() performance improvement
The DataFrame.select_dtypes() method is much quicker now, because it relies on vectorization instead of iterating over a loop. So, you can run df.select_dtypes(include='string') to pull all string columns, or df.select_dtypes(include='boolean') to retrieve the Boolean columns from a DataFrame, provided that you have set them as such beforehand.
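As a rough sketch of how this selection looks in practice, the snippet below builds a small DataFrame (column names and values are invented for the example) and picks columns by the new dtypes.

```python
import pandas as pd

# Hypothetical data; "name" and "active" are set to the new dtypes up front.
df = pd.DataFrame({
    "name": pd.Series(["Ann", "Bob"], dtype="string"),
    "active": pd.Series([True, False], dtype="boolean"),
    "score": [9.5, 7.0],
})

string_cols = df.select_dtypes(include="string")
boolean_cols = df.select_dtypes(include="boolean")

print(list(string_cols.columns))
print(list(boolean_cols.columns))
```

Columns left as the generic "object" dtype would not be picked up by include="string", which is why setting the dtype beforehand matters.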
Experimental NA scalar
We can now enjoy the pd.NA scalar, which denotes missing values. Using pd.NA is a new concept in the scientific ecosystem of Python, and its goal is to provide an indicator for missing values that can be used consistently across data types.
That said, this feature is currently “experimental”, too. The reason is that it still has to be verified how it will interact with other packages such as NumPy.
A new method, DataFrame.convert_dtypes(), has also been introduced. It converts columns to the best available data types that support pd.NA.
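Here is a small sketch of both ideas together, assuming a toy DataFrame whose columns and values are invented for the example.

```python
import pandas as pd

# pd.NA propagates through arithmetic: the result is still "missing".
print(pd.NA + 1)

# With a missing entry, "id" starts out as float64; "city" holds text.
df = pd.DataFrame({"id": [1, 2, None], "city": ["Oslo", None, "Rome"]})
converted = df.convert_dtypes()

print(df.dtypes)
print(converted.dtypes)

# After conversion the missing id is pd.NA in a nullable integer column.
print(converted.loc[2, "id"])
```

The original df is left untouched; convert_dtypes() returns a new DataFrame.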
Beyond missing values, the well-known DataFrame.info() has been improved. Its output is much more readable, which helps you explore your data more quickly and efficiently.
We also now have .to_markdown(), a new method that renders a Series or DataFrame object as a Markdown table.
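A quick sketch of both methods on a made-up DataFrame follows. Note that .to_markdown() depends on the optional tabulate package, so the example guards against it being absent.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"city": ["Oslo", "Rome"], "visits": [3, 5]})

# Prints column names, non-null counts and dtypes.
df.info()

try:
    markdown_table = df.to_markdown()
except ImportError:
    # Raised when the optional 'tabulate' dependency is not installed.
    markdown_table = None
    print("to_markdown() needs the optional 'tabulate' package.")
else:
    print(markdown_table)
```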
So overall, a lot has been done, but mainly on the backend. For everyday users like us, the development of clear data types, consistent with other libraries, is surely the most prominent improvement.
In any case, it is worth checking the official release notes before you start using Pandas 1.0.0. There you can find out more about the changes to methods such as .sort_index() and .sort_values(), and much more.
Finally, note that you need at least Python 3.6.1 to use this new version.
With that in mind,
pip install pandas --upgrade
and have fun!