Data scientists work with data all the time, usually in the form of lists, dictionaries, or tables. The process can be complex, involving preprocessing, queries, and modifications such as data wrangling. As an aspiring data analyst or machine learning engineer, you’re probably thinking that these operations can be quite time-consuming, but thankfully you’ll have a helping hand. Instead of writing sorting or reversing algorithms yourself, you can let the Python NumPy package handle them efficiently. The library also offers high-level mathematical functions for linear algebra, matrices, and arrays. Because of its computational speed and rich functionality, NumPy is a go-to choice for many professionals and a good starting point for anyone looking to break into data science.
In this tutorial, I’ll show you how to install NumPy, go through its basic uses, like how to create an array, and finish off with some more advanced techniques such as performing queries and data manipulation.
Table of Contents
- What Is NumPy in Python?
- How to Install NumPy in Python?
- Why Is NumPy Used in Python?
- How to Create a NumPy Array in Python?
- How to Use NumPy in Python?
- What Are NumPy’s Advanced Functions in Python?
- Ultimate NumPy Tutorial: Next Steps
What Is NumPy in Python?
NumPy (short for Numerical Python) is one of the most popular Python libraries and underpins many other popular packages, such as pandas, SciPy, and Matplotlib. Because its arrays are typically faster than Python lists for numerical work, it improves computational performance across the workflow, from simple mathematical calculations to data manipulation for data science operations.
How to Install NumPy in Python?
Step 1: Install NumPy
You can install NumPy by using the Python tools conda or pip:
conda install numpy
pip install numpy
Run either command and, voilà, you’re ready to go!
Step 2: Import NumPy in Python
To use NumPy in Python, import it using the command:
import numpy as np
In Python, the library usually appears with the shortened np by convention.
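A quick way to confirm that the installation worked is to print the installed version. This is only a minimal sketch; the version string you see depends on your environment.

```python
import numpy as np

# Print the installed NumPy version to confirm the import succeeded.
version = np.__version__
print(version)
```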
Why Is NumPy Used in Python?
As I’ve previously mentioned, NumPy has many functionalities that make it a good fit for data scientists to use in their daily tasks. Perhaps what the Python library is most known for is its use of multidimensional arrays and their high computational speed.
How to Create a NumPy Array in Python?
A NumPy array is a type of data structure that stores, well, data. While it looks similar to a Python list in code, it is optimized for numerical work, resulting in faster computation and easier manipulation of numerical data.
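One way to see the difference in behavior is to multiply a list and an array by the same number. The sketch below uses made-up values; the variable names are illustrative only.

```python
import numpy as np

values = [1, 2, 3]
arr = np.array(values)

# Multiplying a list by an integer repeats the list.
doubled_list = values * 2      # [1, 2, 3, 1, 2, 3]

# Multiplying an array by an integer multiplies every element.
doubled_arr = arr * 2          # array([2, 4, 6])

print(doubled_list)
print(doubled_arr)
```

The array version expresses the arithmetic directly, without an explicit loop.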
One-dimensional NumPy Array
To create a basic array, you simply wrap an np.array() command around a Python list:
>>> import numpy as np
>>> a = np.array([1, 2, 3, 4, 5])
In some cases, it is useful to generate an array automatically without hardcoding values like the example above. For instance, np.empty allocates an array without initializing it, so it holds whatever values happened to be in memory. It is fast, but the contents are arbitrary and should be overwritten before use; for real noise, use NumPy's random number functions instead. Another example is an array full of 0s or 1s, the latter of which I can multiply by another number to create an array filled with any value I want.
Alternatively, if I want to initialize a NumPy array with different numbers, not just 0s and 1s, I can fill it with evenly spaced decimals over a range or with a range of running numbers:
>>> np.empty(shape=2)  # uninitialized; the values are arbitrary
array([4.24399158e-314, 8.48798317e-314])
>>> np.zeros(shape=3)  # filled with zeros
array([0., 0., 0.])
>>> np.ones(shape=3)  # filled with ones
array([1., 1., 1.])
>>> np.linspace(start=0, stop=10, num=5)  # evenly spaced over a range
array([ 0. ,  2.5,  5. ,  7.5, 10. ])
>>> np.arange(start=1, stop=5, step=0.5)  # range of running numbers
array([1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5])
In the code examples above, I have provided the argument name for clarity, but it works just the same without it – np.zeros(shape=3) is the same as np.zeros(3).
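As a hedged sketch of the two ideas mentioned above (an array filled with any constant, and small random noise), the example below uses np.full and np.random.default_rng. Neither function appears in the original text; the seed and the noise scale are arbitrary choices made for illustration.

```python
import numpy as np

# An array of ones scaled by a constant gives an array filled with that constant.
sevens = np.ones(3) * 7            # array([7., 7., 7.])

# np.full builds the same array in one step (not used in the text above).
also_sevens = np.full(3, 7.0)

# For noise, draw values from a random generator rather than relying on np.empty,
# whose contents are uninitialized.
rng = np.random.default_rng(seed=0)              # seed chosen only for reproducibility
noise = rng.normal(loc=0.0, scale=1e-3, size=3)  # assumed noise scale
noisy = sevens + noise

print(sevens, also_sevens, noisy)
```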
N-dimensional NumPy Array
So far, we have dealt with a one-dimensional array, otherwise known as a vector. NumPy, however, can handle two-dimensional matrices, three-dimensional tensors, and more. Higher-dimensional arrays have more axes along which we can perform more complex mathematical operations. For simplicity, you can call them n-dimensional arrays or, as the Python library names them, ndarray.
To create one, pass a nested Python list to np.array():
>>> np.array([[1, 2, 3], [4, 5, 6]])
array([[1, 2, 3],
       [4, 5, 6]])
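The same nesting idea extends to three dimensions. The sketch below uses made-up values and builds a small three-dimensional array from a list of two matrices.

```python
import numpy as np

# Two 2 x 3 matrices nested into one three-dimensional array.
tensor = np.array([[[1, 2, 3], [4, 5, 6]],
                   [[7, 8, 9], [10, 11, 12]]])

print(tensor.ndim)    # number of dimensions
print(tensor.shape)   # (blocks, rows, columns)
```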
You can also create an n-dimensional ndarray by reshaping a one-dimensional array or by stacking multiple one-dimensional arrays horizontally or vertically. Reshaping is useful when your data has the wrong shape or dimension for the next operation. Stacking is useful when different operations each return part of the data: joining the pieces horizontally or vertically lets you reassemble all of it.
>>> a = np.zeros(6)
>>> b = np.ones(6)
>>> a.reshape((2, 3))
array([[0., 0., 0.],
       [0., 0., 0.]])
>>> np.hstack((a, b))  # stacking horizontally
array([0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 1., 1.])
>>> np.vstack((a, b))  # stacking vertically
array([[0., 0., 0., 0., 0., 0.],
       [1., 1., 1., 1., 1., 1.]])
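To make the reassembly scenario concrete, here is a small sketch that joins two partial results and reshapes them. The data and variable names are hypothetical; passing -1 to reshape asks NumPy to infer that dimension from the array's size.

```python
import numpy as np

# Two partial results, e.g. returned by separate operations (hypothetical data).
first_half = np.array([1, 2, 3])
second_half = np.array([4, 5, 6])

# Reassemble them into one flat array, then into a 2 x 3 matrix.
flat = np.hstack((first_half, second_half))   # array([1, 2, 3, 4, 5, 6])
matrix = flat.reshape((2, 3))

# Passing -1 lets NumPy infer that dimension from the array's size.
inferred = flat.reshape((3, -1))              # shape (3, 2)

print(matrix)
print(inferred.shape)
```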
Moreover, you can generate n-dimensional arrays automatically, without hardcoding, with np.zeros and np.ones. The difference is that the shape argument is now a tuple describing the shape of the array. For a two-dimensional array, the order is (rows, columns), i.e. (height, width):
>>> np.zeros(shape=(2, 3))
array([[0., 0., 0.],
       [0., 0., 0.]])
>>> np.ones(shape=(3, 2))
array([[1., 1.],
       [1., 1.],
       [1., 1.]])
Other useful n-dimension arrays include the identity matrix or one filled with 0s or 1s that takes the shape of the input array. Of course, you can do the latter manually, but why not skip a step by writing np.zeros_like and np.ones_like to perform the same operation:
>>> np.eye(N=3)  # identity matrix
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])
>>> a = np.ones((2, 3))  # matrix filled with 1
>>> np.zeros_like(a)  # matrix filled with 0, taking shape of input matrix a
array([[0., 0., 0.],
       [0., 0., 0.]])
>>> np.ones_like(a)  # matrix filled with 1, taking shape of input matrix a
array([[1., 1., 1.],
       [1., 1., 1.]])
How to Use NumPy in Python?
After learning how to create basic and n-dimensional arrays, we can take it up a notch by modifying the values. And how do we do that? For starters, we can perform mathematical operations in NumPy for data wrangling and summary statistics, as well as sort and flip the array.
With NumPy, you can execute all the math basics, such as addition and multiplication.
There are two ways to do these: with a single number (a scalar) or with another array. When you use a scalar, it is applied to every value in the array, which is known as broadcasting. Meanwhile, if you’re working with two arrays of the same shape, the mathematical operation is done element by element.
>>> a = np.arange(1, 5)
>>> a
array([1, 2, 3, 4])
>>> a + 1  # addition of array with integer
array([2, 3, 4, 5])
>>> a + a  # addition of array with another array
array([2, 4, 6, 8])
>>> a * 2  # multiplication of array with integer
array([2, 4, 6, 8])
>>> a * a  # multiplication of array with another array
array([ 1,  4,  9, 16])
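Broadcasting also applies between arrays of different but compatible shapes. As a minimal sketch with made-up values, a one-dimensional row can be added to every row of a two-dimensional matrix:

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
row = np.array([10, 20, 30])

# The scalar is applied to every element.
scaled = matrix * 2

# The 1-D row is broadcast across both rows of the matrix.
shifted = matrix + row

print(scaled)
print(shifted)
```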
NumPy can also compute summary statistics:
>>> a = np.arange(1, 10)
>>> np.min(a)  # minimum
1
>>> np.max(a)  # maximum
9
>>> np.std(a)  # standard deviation
2.581988897471611
>>> np.cov(a)  # covariance (for a 1-D array, the sample variance)
array(7.5)
>>> np.cumsum(a)  # cumulative sum
array([ 1,  3,  6, 10, 15, 21, 28, 36, 45], dtype=int32)
>>> np.cumprod(a)  # cumulative product
array([     1,      2,      6,     24,    120,    720,   5040,  40320, 362880], dtype=int32)
By getting such a statistical summary of the data, you get a better idea of how the values are distributed within your workspace.
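A few more summary statistics round out the picture. The sketch below adds np.mean, np.median, and np.var, which are not used in the original text, and shows how the ddof argument relates the variance to the value np.cov returns for a one-dimensional array.

```python
import numpy as np

a = np.arange(1, 10)

mean = np.mean(a)                   # 5.0
median = np.median(a)               # 5.0
population_var = np.var(a)          # divides by n
sample_var = np.var(a, ddof=1)      # divides by n - 1

# For a 1-D input, np.cov returns the sample variance.
print(mean, median, population_var, sample_var, np.cov(a))
```

Note that np.std and np.var divide by n by default, while np.cov divides by n - 1, which is why the two do not square to the same number unless ddof is set.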
Another feature of the Python library is that it allows even more complex calculations such as:
- Dot product (summation of pair-wise multiplication)
- Cross product
- Trigonometry operations
These are useful if you’d like to perform computational geometry, such as to measure the angle between 2 vectors:
>>> a = np.arange(1, 3)  # array([1, 2])
>>> b = np.arange(4, 6)  # array([4, 5])
>>> np.dot(a, b)
14
>>> a = np.arange(1, 4)  # array([1, 2, 3])
>>> b = np.arange(4, 7)  # array([4, 5, 6]); two 3-element vectors, since 2-element inputs to np.cross are deprecated in NumPy 2.0
>>> np.cross(a, b)
array([-3,  6, -3])
>>> unit_vector1 = [0, 1]
>>> unit_vector2 = [1, 0]
>>> dot_product = np.dot(unit_vector1, unit_vector2)  # dot product
>>> angle = np.arccos(dot_product)  # inverse cosine
>>> angle  # angle in radians
1.5707963267948966
>>> angle / np.pi * 180  # angle in degrees
90.0
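The angle calculation above relies on both vectors having length 1. For vectors of arbitrary length, the dot product has to be divided by the product of the two lengths first. The helper below is a sketch under that assumption; the name angle_between is hypothetical and not part of NumPy.

```python
import numpy as np

def angle_between(u, v):
    """Return the angle between vectors u and v in degrees."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    cosine = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip guards against values slightly outside [-1, 1] caused by rounding.
    return np.degrees(np.arccos(np.clip(cosine, -1.0, 1.0)))

print(angle_between([3, 0], [0, 5]))   # perpendicular vectors
print(angle_between([1, 1], [1, 0]))   # diagonal versus x-axis
```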
Last but not least, we can take the sum of the whole array, or by column and row respectively. Say you want to add all the values in an array. Or you have a table where each column shows how an item changed over time – summing them all up can help derive the total change in value:
>>> a = np.vstack((np.arange(1, 4), np.arange(1, 4)))
>>> a
array([[1, 2, 3],
       [1, 2, 3]])
>>> np.sum(a)
12
>>> np.sum(a, axis=0)  # sum by column
array([2, 4, 6])
>>> np.sum(a, axis=1)  # sum by row
array([6, 6])
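For a two-dimensional array, axis=0 runs down the rows, so np.sum(a, axis=0) collapses the rows and returns one sum per column. Likewise, axis=1 runs across the columns, so np.sum(a, axis=1) returns one sum per row.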
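To tie the axis argument back to the table scenario, here is a sketch with a hypothetical table in which each row is a month and each column is an item. The numbers are invented for illustration.

```python
import numpy as np

# Hypothetical table: rows are months, columns are items.
changes = np.array([[ 1, -2,  3],
                    [ 4,  0, -1],
                    [-2,  5,  2]])

per_item = np.sum(changes, axis=0)    # collapses the rows: one total per column
per_month = np.sum(changes, axis=1)   # collapses the columns: one total per row
overall = np.sum(changes)             # total over the whole table

print(per_item, per_month, overall)
```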
Sorting and Flipping an Array
Note that you can do these NumPy operations on a list and still obtain an array regardless. In addition, it is worth mentioning that these operations will not modify the original space – instead, they’ll return a copy:
>>> a = [3, 5, 3, 8, 4]
>>> np.sort(a)
array([3, 3, 4, 5, 8])
>>> np.flip(a)
array([4, 8, 3, 5, 3])
>>> a
[3, 5, 3, 8, 4]
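A few related variations may be useful in practice. The sketch below, which goes slightly beyond the original text, shows a descending sort, np.argsort for retrieving the sorting order, and the in-place ndarray.sort method.

```python
import numpy as np

a = np.array([3, 5, 3, 8, 4])

descending = np.sort(a)[::-1]            # sort ascending, then reverse
order = np.argsort(a, kind="stable")     # indices that would sort the array

# Unlike np.sort, ndarray.sort() sorts in place and returns None.
b = a.copy()
b.sort()

print(descending, order, b, a)
```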
What Are NumPy’s Advanced Functions in Python?
As you might remember, I mentioned we’ll also be looking at some more complex techniques you can use NumPy for when working in Python. These include:
- Querying
- Data manipulation
While more advanced, obtaining these skills is definitely worthwhile if you’re an aspiring data scientist looking for an entry-level job as you can further impress potential employers with your resume.
Querying with NumPy
Suppose you want to retrieve some characteristic or another piece of information such as:
- The number of values
- The type of values
- The dimension and shape of an array
You can achieve this by querying them directly in NumPy:
>>> a = np.zeros((2, 3))
>>> a
array([[0., 0., 0.],
       [0., 0., 0.]])
>>> a.size  # number of values
6
>>> a.dtype  # type of values
dtype('float64')
>>> a.ndim  # number of dimensions
2
>>> a.shape  # shape of array
(2, 3)
Another purpose for queries is to return a portion of an array that satisfies a logical condition, i.e., conditions that return a True or False statement. Use this functionality to check whether any value in your space violates the intended range of accepted values:
>>> a = np.arange(1, 10)
>>> a
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> a > 5  # logical condition
array([False, False, False, False, False,  True,  True,  True,  True])
>>> a[a > 5]
array([6, 7, 8, 9])
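For the range-check use case, conditions can be combined. The sketch below assumes a hypothetical set of readings and an accepted range of 0 to 100; both are invented for illustration.

```python
import numpy as np

readings = np.array([12, 7, -3, 25, 18, 101])   # hypothetical sensor values
low, high = 0, 100                               # assumed accepted range

# Combine conditions with & (and) and | (or); each side needs parentheses.
out_of_range = (readings < low) | (readings > high)

has_violation = np.any(out_of_range)             # is there at least one violation?
offenders = readings[out_of_range]               # the offending values
n_offenders = np.count_nonzero(out_of_range)     # how many there are

print(has_violation, offenders, n_offenders)
```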
Data Manipulation with NumPy
NumPy arrays are mutable, meaning that the values inside can be changed. You can also build a modified copy of an array with a logical operation, which leaves the original untouched, using the format:
np.where(logical condition, value if true, value if false)
Remember how I mentioned that you can check whether a value has violated the intended range through queries? Well, with this you can replace those values with other numbers. For example, the following code retains the original values where they are greater than 5 and puts 0s elsewhere:
>>> a
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> np.where(a > 5, a, 0)
array([0, 0, 0, 0, 0, 6, 7, 8, 9])
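Because arrays are mutable, the same replacement can also be done in place by assigning through a boolean mask. The sketch below contrasts that with np.where and adds np.clip, which is not covered in the original text, as one more way to handle out-of-range values.

```python
import numpy as np

a = np.arange(1, 10)

# np.where builds a new array; a itself is unchanged.
replaced = np.where(a > 5, a, 0)

# Assigning through a boolean mask modifies the array in place.
b = a.copy()
b[b <= 5] = 0

# np.clip limits values to a range instead of replacing them with a constant.
clipped = np.clip(a, 3, 7)

print(replaced, b, clipped)
```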
Earlier, we changed an array's shape with the reshape command. It returns a new array with the shape we defined and leaves the original untouched. But if we want to modify the array directly, we can use the resize command instead:
>>> a = np.arange(1, 10)
>>> a.reshape((3, 3))
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
>>> a  # reshape does not change the array a
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> a.resize((3, 3))  # does not return anything, changes array directly
>>> a
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
Important note: resize changes the array in place, so the previous shape is not kept anywhere. If you’d like to return your array to its original shape, you will have to reshape it again.
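As a small sketch of the round trip, the example below reshapes an array to 3 x 3 and back to one dimension using reshape only. It also checks, with np.shares_memory, that the reshaped array here is a view on the same data rather than a copy.

```python
import numpy as np

a = np.arange(1, 10)
grid = a.reshape((3, 3))      # new array object with a 3 x 3 shape
flat = grid.reshape(-1)       # back to one dimension

# For a contiguous array like this one, reshape returns a view on the same data.
shared = np.shares_memory(a, grid)

print(shared)
print(flat)
```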
Ultimate NumPy Tutorial: Next Steps
Mastering NumPy in Python is absolutely critical if you are set on a career in data science. Not only will it simplify your data science workflow, but it will be a valuable asset during your job hunt.
However, learning to program can be a lengthy, confusing process before you become fully proficient, especially without a background in computer science. For a smoother learning journey, it’s a good idea to find good online resources to support you at every step. Luckily, you’ve come to the right place!
The 365 Data Science Program offers self-paced courses led by renowned industry experts. Starting from the very basics all the way to advanced specialization, you will learn by doing with a myriad of practical exercises and real-world business cases. If you want to see how the training works, start with a selection of free lessons by signing up below.