02 NumPy Core Data Structure Explained in Detail #

Hello, I am Fang Yuan.

After the previous two lessons, we already have a preliminary understanding of PyTorch. Are you eager to start working with PyTorch? Well, hold on for a moment. It is necessary for us to first taste an “appetizer”, and that is NumPy.

Why do we need to master NumPy first? I believe that it is almost impossible to engage in or get started with machine learning without touching NumPy. The fundamental computation unit in the mainstream deep learning frameworks PyTorch and TensorFlow, which is the tensor, has similar computation logic to NumPy arrays. Therefore, mastering NumPy is very helpful for learning these two frameworks.

Furthermore, NumPy is widely used in other Python modules for data science and scientific computing, such as Pandas and SciPy. Also, the face recognition technology that we use more and more in our daily lives (a part of the computer vision field) essentially converts the image into a NumPy array, and then performs a series of processing steps.

In order for you to truly grasp NumPy, I will explain it over two lessons. In this lesson, we will first introduce NumPy arrays, the key attributes of arrays, and the concept of axes, which is very important.

What is NumPy #

NumPy is a fundamental package for scientific computing in Python. It provides a multidimensional array object (to be further explained later) and various fast operations on arrays, such as sorting, transformation, selection, etc. Installing NumPy is very simple. You can install it using Conda with the following command:

conda install numpy

Alternatively, you can use pip to install it with the following command:

pip install numpy

NumPy Arrays #

The array object mentioned earlier is the core component of NumPy. It is called ndarray, which is short for “N-dimensional array”. The N is a number that indicates the dimensionality; for example, you will often hear about 1-D arrays, 2-D arrays, or arrays with even more dimensions.

In NumPy, arrays are implemented by the numpy.ndarray class, which is the core data structure of NumPy. Today’s content revolves around it.

When learning a new concept, it helps to compare it with something familiar. Logically, NumPy arrays are similar to arrays in other programming languages, and indexing starts from 0. Python lists offer similar functionality, but there are differences, which the following comparison makes clear:

A Python list can grow or shrink dynamically, but a NumPy array has a fixed size once created. If you change the length of a NumPy array, a new array is created and the original is discarded.
All elements of a NumPy array must have the same data type, while the elements of a list can be of different types.
NumPy is optimized for operations on its arrays, so they run much faster than the same operations on Python lists and use less memory.
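
The three differences above can be checked with a short sketch. The variable names are illustrative and not part of the lesson; np.append is used only to show that "growing" an array produces a new one.

```python
import numpy as np

py_list = [1, 2, 3]
arr = np.asarray(py_list)

# A list grows in place; np.append returns a new array and leaves arr unchanged.
py_list.append(4)
bigger = np.append(arr, 4)
print(len(py_list), arr.shape, bigger.shape)   # 4 (3,) (4,)

# A list can mix types; an array converts everything to one common dtype.
mixed = np.asarray([1, 2.5])
print(mixed.dtype)                             # float64

# Arithmetic applies to every element of an array at once.
doubled = arr * 2
print(doubled)                                 # [2 4 6]
```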

Creating Arrays #

Alright, let’s take a look at how to create NumPy arrays.

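The simplest way is to pass a list to np.array() or np.asarray(), and the list can be of any dimension. The difference between the two shows up when the input is already a NumPy array: np.array() copies the data by default, while np.asarray() reuses the existing array when no conversion is needed. When you pass a list, both build a new array. We will discuss the differences between them in the next lesson, but for now, you just need to have an idea.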
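
As a rough preview of that difference, the sketch below passes an existing array (rather than a list) to both functions. The variable names are made up for illustration.

```python
import numpy as np

source = np.asarray([1, 2, 3])

copied = np.array(source)     # np.array copies the data by default
same = np.asarray(source)     # no conversion needed, so the input is reused

source[0] = 100
print(copied[0])       # 1, the copy is unaffected
print(same[0])         # 100, same underlying data
print(same is source)  # True
```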

Let’s start by creating a one-dimensional array. Here’s the code:

# Import NumPy once at the top of the script
import numpy as np

arr_1_d = np.asarray([1])
print(arr_1_d)
[1]

Now let’s create a two-dimensional array:

arr_2_d = np.asarray([[1, 2], [3, 4]])
print(arr_2_d)
[[1 2]
 [3 4]]

You can also try creating arrays of higher dimensions on your own.
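
For instance, a three-dimensional array can be built from a list nested three levels deep. This is only a sketch; arr_3_d is an illustrative name, and the ndim and shape attributes it prints are explained in the next section.

```python
import numpy as np

# Two 2x2 blocks stacked along a new leading axis
arr_3_d = np.asarray([[[1, 2], [3, 4]],
                      [[5, 6], [7, 8]]])
print(arr_3_d.ndim)    # 3
print(arr_3_d.shape)   # (2, 2, 2)
```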

Array Attributes #

NumPy arrays have some inherent attributes. Today, we will introduce the most commonly used and important ones: ndim, shape, size, and dtype.

ndim #

ndim represents the number of dimensions (or axes) of an array. The array arr_1_d we just created has 1 axis, while arr_2_d has 2 axes.

arr_1_d.ndim
1
arr_2_d.ndim
2

shape #

shape represents the dimensions or shape of an array. It is an integer tuple, and the length of the tuple is equal to ndim.

The shape of arr_1_d is (1,), which represents a vector, while the shape of arr_2_d is (2, 2), which represents a matrix.

arr_1_d.shape
(1,)
arr_2_d.shape
(2, 2)

The shape attribute is widely used in practice. For example, if we have data of shape (B, W, H, C), those familiar with deep learning will recognize it as a batch of B samples, each with dimensions (W, H, C).

Now, if we need to reshape or process the data based on the dimensions (W, H, C), we can use input_data.shape[1:] to retrieve them directly (input_data.shape[1:3] would return only (W, H)), without hard-coding the width, height, and number of channels in the program.
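
Here is a quick check of what slicing shape returns. The batch size and image dimensions below are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical batch: B = 8 samples, each of shape (W, H, C) = (32, 24, 3)
input_data = np.zeros((8, 32, 24, 3))

sample_shape = input_data.shape[1:]     # everything after the batch axis
spatial_shape = input_data.shape[1:3]   # width and height only
print(sample_shape)    # (32, 24, 3)
print(spatial_shape)   # (32, 24)
```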

In practical work, we often need to transform the shape of an array. We can use the arr.reshape() function to reshape an array without changing its contents. However, you need to note that the number of elements in the original array and the reshaped array should be the same. Please see the code below for an example.

arr_2_d.shape
(2, 2)
arr_2_d
[[1 2]
 [3 4]]
# Reshape arr_2_d to a (4, 1) array
arr_2_d.reshape((4, 1))
array([[1],
       [2],
       [3],
       [4]])
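
The element-count requirement can be seen with a minimal sketch: a target shape that holds a different number of elements raises a ValueError. The flag mismatch_raised is an illustrative name.

```python
import numpy as np

arr_2_d = np.asarray([[1, 2], [3, 4]])   # 4 elements

# (1, 4) also holds 4 elements, so this works
row = arr_2_d.reshape((1, 4))
print(row)                               # [[1 2 3 4]]

# (3, 2) would need 6 elements, so NumPy raises ValueError
mismatch_raised = False
try:
    arr_2_d.reshape((3, 2))
except ValueError as err:
    mismatch_raised = True
    print("reshape failed:", err)
```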

We can also use np.reshape(a, newshape, order) to reshape the array a with the new shape specified in newshape.

Here, you need to note that the reshape function has an order parameter, which specifies the order in which elements are read and written. It accepts the following values.

‘C’: Default parameter, reads and writes elements using the C-like indexing order (row-major).
‘F’: Reads and writes elements using the Fortran-like indexing order (column-major).
‘A’: Uses Fortran-like indexing if the original array is stored contiguously in Fortran (column-major) order in memory; otherwise, uses C-like indexing.

You can understand the process of reshape as follows: first, the original array needs to be flattened according to the specified order (‘C’ or ‘F’), and then written into the new array according to the specified order.

What does this mean? Let’s take a simple example of a 2-dimensional array.

a = np.arange(6).reshape(2, 3)
array([[0, 1, 2],
       [3, 4, 5]])

To reshape the array a using the ‘C’ order to (3, 2), you can do the following. First, flatten the original array. For the ‘C’ order, the last dimension changes first, so the flattened result is as shown below, with the index column indicating the order of flattening.

  ![Image](../images/8901459e0c4f450f877bef04dc609a0d.jpg)

Hence, after reshaping, the array is written in the order 0, 1, 2, 3, 4, 5. The reshaped array is shown in the table below, with the index representing the order of writing.

Next, let’s see how to reshape the array a using the ‘F’ order to (3, 2).

Row-major order should be familiar, but the ‘F’ order is column-major, which may be a bit harder to understand for those who have not used it before.

First, flatten the original array according to column-major order. Column-major means that the first dimension of the array changes first. The result of flattening is shown below, with the index representing the order of flattening. Please note the change in coordinates (the first dimension changes first).

Hence, after reshaping, the array is written in the order 0, 3, 1, 4, 2, 5. The reshaped array is shown in the table below, with the index representing the order of writing. To make it more intuitive, I have displayed rows with the same color.
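
Both cases can be reproduced in a few lines. In this sketch, ravel(order=...) is used only to show the flattening order; it is not part of the lesson's own code.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)

# Flattening order for each mode
print(a.ravel(order='C'))   # [0 1 2 3 4 5]
print(a.ravel(order='F'))   # [0 3 1 4 2 5]

# Reshape to (3, 2): elements are read and written in the same order
c_order = np.reshape(a, (3, 2), order='C')
f_order = np.reshape(a, (3, 2), order='F')
print(c_order)
# [[0 1]
#  [2 3]
#  [4 5]]
print(f_order)
# [[0 4]
#  [3 2]
#  [1 5]]
```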

Here’s a little exercise for you: Can you try reshaping multi-dimensional arrays?

However, in most cases, the ‘C’ order (row-major) is commonly used. At least for now, I haven’t used the ‘F’ or ‘A’ order.

size #

size represents the total number of elements in an array. It is equal to the product of the elements in the shape attribute.

Take a look at the following code, where arr_2_d has a size of 4.

arr_2_d.size
4
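
The relationship between size and shape can be verified on a larger array. This is a small sketch with an arbitrary shape.

```python
import numpy as np

arr = np.zeros((2, 3, 4))
print(arr.shape)            # (2, 3, 4)
print(arr.size)             # 24
print(np.prod(arr.shape))   # 24, i.e. 2 * 3 * 4
```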

dtype #

Lastly, dtype is an object that describes the type of elements in an array. You can use the dtype attribute to view the data type of an array.

NumPy supports most commonly used data types, such as int8, int16, int32, float32, float64, etc. You will often see dtype when creating arrays or converting data types. First, let’s take a look at the data type of arr_2_d:

>>> arr_2_d.dtype
dtype('int64')

As you can see, when we created arr_2_d earlier, we did not specify a data type. If no data type is specified, NumPy will automatically determine it and assign a default data type.

Now let’s look at the code below, where we explicitly specify the data type when creating arr_2_d:

>>> arr_2_d = np.asarray([[1, 2], [3, 4]], dtype='float')
>>> arr_2_d.dtype
dtype('float64')

The data type of the array can be changed using the astype() function. However, changing the data type will create a new array instead of modifying the original array’s data type.

Please see the code below for an example:

>>> arr_2_d.dtype
dtype('float64')
>>> arr_2_d.astype('int32')
array([[1, 2],
       [3, 4]], dtype=int32)
>>> arr_2_d.dtype
dtype('float64')
# The data type of the original array has not changed
>>> arr_2_d_int = arr_2_d.astype('int32')
>>> arr_2_d_int.dtype
dtype('int32')

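However, I want to remind you not to change the data type by assigning to the dtype attribute directly. The code will not throw an error, but NumPy does not convert the values; it reinterprets the underlying bytes as the new type, so the contents and even the shape of the array change. Please see the code below: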

>>> arr_2_d.dtype
dtype('float64')
>>> arr_2_d.size
4
>>> arr_2_d.dtype='int32'
>>> arr_2_d
array([[         0, 1072693248,          0, 1073741824],
       [         0, 1074266112,          0, 1074790400]], dtype=int32)

Since one float64 occupies the same 8 bytes as two int32 values, the original 4 float64 values are read back as 8 int32 values, which is what the output shows.
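
The contrast between converting and reinterpreting can be sketched without modifying the original array. Here arr.view() stands in for the direct dtype assignment; it is not used in the lesson itself, and the reinterpreted values depend on the machine's byte order, so only shapes and byte counts are shown.

```python
import numpy as np

arr = np.asarray([[1, 2], [3, 4]], dtype='float64')

converted = arr.astype('int32')     # converts each value to an integer
reinterpreted = arr.view('int32')   # reads the same bytes as int32, no conversion

print(converted.shape, converted.tolist())   # (2, 2) [[1, 2], [3, 4]]
print(reinterpreted.shape)                   # (2, 4)
print(arr.nbytes, reinterpreted.nbytes)      # 32 32
```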

Other ways to create arrays #

In addition to using np.asarray or np.array to create an array, NumPy also provides other methods to create arrays with predefined patterns. We only need to provide the necessary parameters as required.

np.ones() and np.zeros() #

np.ones() is used to create an array filled with ones. The required parameter is the shape of the array, and the optional parameter is the data type of the array. Please refer to the code below for an example:

>>> np.ones()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ones() takes at least 1 argument (0 given)
# The error occurs because no shape parameter is given
>>> np.ones(shape=(2,3))
array([[1., 1., 1.],
       [1., 1., 1.]])
>>> np.ones(shape=(2,3), dtype='int32')
array([[1, 1, 1],
       [1, 1, 1]], dtype=int32)

Creating an array filled with zeros is done using np.zeros(). The usage is similar to np.ones(), so we won’t provide an example here.

When would you use these two functions? For example, if you need to initialize some weights, you can use them. For instance, to create a 2x3 array with every element being 0.5, you can do the following:

>>> np.ones((2, 3)) * 0.5
array([[0.5, 0.5, 0.5],
       [0.5, 0.5, 0.5]])
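
For completeness, here is a sketch of np.zeros(), together with np.full(), which is not covered in this lesson but builds a constant-valued array in one step.

```python
import numpy as np

zeros = np.zeros(shape=(2, 3), dtype='int32')
print(zeros)
# [[0 0 0]
#  [0 0 0]]

# np.full gives the same values as np.ones((2, 3)) * 0.5
half = np.full((2, 3), 0.5)
print(half)
# [[0.5 0.5 0.5]
#  [0.5 0.5 0.5]]
```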

np.arange() #

We can also use np.arange([start, ]stop, [step, ]dtype=None) to create an array with values in the range [start, stop), with a step size of step.

start is an optional parameter and defaults to 0. stop is a required parameter and specifies the end of the range. Please note that the range is a left-closed right-open interval, so the stop value is not included in the array. step is an optional parameter and defaults to 1.

Here are some examples:

# Create an array from 0 to 4
>>> np.arange(5)
array([0, 1, 2, 3, 4])
# Create an array from 2 to 4
>>> np.arange(2, 5)
array([2, 3, 4])
# Create an array from 2 to 8 with a step size of 3
>>> np.arange(2, 9, 3)
array([2, 5, 8])

np.linspace() #

Lastly, we can use np.linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None) to create an array that contains a specified number of evenly spaced values between start and stop.

start: Required parameter, the starting value of the sequence.
stop: Required parameter, the end value of the sequence.
num: Optional parameter, the number of elements in the sequence. Defaults to 50.
endpoint: Defaults to True, if set to True, the array includes the stop value.
retstep: Defaults to False, if set to True, the function returns the array and the step value.

# Create an array with 3 elements from 2 to 10
>>> np.linspace(start=2, stop=10, num=3)
array([ 2.,  6., 10.])
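
The endpoint and retstep parameters described above can be tried out as follows. This is a small sketch with arbitrary values.

```python
import numpy as np

# endpoint=False leaves out the stop value
no_end = np.linspace(2, 10, num=4, endpoint=False)
print(no_end)           # [2. 4. 6. 8.]

# retstep=True returns the array together with the spacing between values
values, step = np.linspace(2, 10, num=3, retstep=True)
print(values.tolist())  # [2.0, 6.0, 10.0]
print(step)             # 4.0
```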

np.arange and np.linspace are also commonly used functions. For example, when plotting a graph, you can use them to generate the x-axis coordinates. To plot y=x^2, you can generate the x values with np.arange(), as in the code below, or with np.linspace().

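import numpy as np
import matplotlib.pyplot as plt

# x values from -50 to 50 in steps of 2
X = np.arange(-50, 51, 2)
Y = X ** 2

# A label is needed so that plt.legend() has an entry to display
plt.plot(X, Y, color='blue', label='y = x^2')
plt.legend()
plt.show()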

Arrays and Axes #

This is a very important concept and also one of the most difficult concepts to understand in NumPy arrays. It often appears in critical aggregation functions like np.sum() and np.max().

Let’s use a question to illustrate how the same function can produce different results based on different axes. For example, we have a (4,3) matrix that stores rating data for 3 games from 4 students.

>>> interest_score = np.random.randint(10, size=(4, 3))
>>> interest_score
array([[4, 7, 5],
       [4, 2, 5],
       [7, 2, 4],
       [1, 2, 4]])

Our first requirement is to calculate the total score for each game. How do we solve this problem? Let’s analyze it together. The axes of an array correspond to its dimensions and are numbered from 0. This two-dimensional array has two axes: axis 0 runs down the rows (vertically), and axis 1 runs across the columns (horizontally). As shown in the following figure:

Our problem is to calculate the total score for each game, which means summing along axis 0. So, we just need to specify axis 0 in the sum function to sum along that axis.

>>> np.sum(interest_score, axis=0)
array([16, 13, 18])

The calculation is shown in the direction of the green arrow.

The second problem is to calculate the total score for each student, which means we need to operate on the 2D array along axis 1. So, we just need to set the axis parameter to 1.

>>> np.sum(interest_score, axis=1)
array([16, 11, 13,  7])

The calculation is shown in the direction of the green arrow.

Two-dimensional arrays are relatively easy to understand, but how do we handle multi-dimensional data? You may have noticed that axis=i means the calculation is performed along the i-th axis; in other words, the data along the i-th axis is collapsed, or aggregated together.

For an array with shape (a, b, c), after aggregation along axis 0, the shape becomes (b, c); after aggregation along axis 1, the shape becomes (a, c); after aggregation along axis 2, the shape becomes (a, b), and so on.
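
This shape rule is easy to verify. The sketch below assumes (a, b, c) = (2, 3, 4) purely for illustration.

```python
import numpy as np

data = np.ones((2, 3, 4))   # shape (a, b, c) = (2, 3, 4)

# Aggregating along an axis removes that axis from the shape
result_shapes = [data.sum(axis=i).shape for i in range(data.ndim)]
for i, shape in enumerate(result_shapes):
    print(i, shape)
# 0 (3, 4)
# 1 (2, 4)
# 2 (2, 3)
```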

Next, let’s look at an example of a multi-dimensional array. We want to find the maximum value on different dimensions of the array a.

>>> a = np.arange(18).reshape(3,2,3)
>>> a
array([[[ 0,  1,  2],
        [ 3,  4,  5]],

       [[ 6,  7,  8],
        [ 9, 10, 11]],

       [[12, 13, 14],
        [15, 16, 17]]])

We can treat the data along the same axis as the same unit. When aggregating, we only need to aggregate at the same level of units. As shown in the following figure, the green box represents the units along axis 0, the blue box represents the units along axis 1, and the red box represents the units along axis 2.

When axis=0, it means aggregating the data of the three green boxes together, resulting in a (2,3) array. The content of the array is:

[[max(a_000,a_100,a_200), max(a_001,a_101,a_201), max(a_002,a_102,a_202)],
 [max(a_010,a_110,a_210), max(a_011,a_111,a_211), max(a_012,a_112,a_212)]]

The code is as follows:

>>> a.max(axis=0)
array([[12, 13, 14],
       [15, 16, 17]])

When axis=1, it means aggregating the blue boxes within each green box together, resulting in a (3,3) array. The content of the array is:

[[max(a_000,a_010), max(a_001,a_011), max(a_002,a_012)],
 [max(a_100,a_110), max(a_101,a_111), max(a_102,a_112)],
 [max(a_200,a_210), max(a_201,a_211), max(a_202,a_212)]]

The code is as follows:

>>> a.max(axis=1)
array([[ 3,  4,  5],
       [ 9, 10, 11],
       [15, 16, 17]])

When axis=2, it means aggregating the red boxes within each blue box together, resulting in a (3,2) array. The content of the array is:

[[max(a_000,a_001,a_002), max(a_010,a_011,a_012)],
 [max(a_100,a_101,a_102), max(a_110,a_111,a_112)],
 [max(a_200,a_201,a_202), max(a_210,a_211,a_212)]]

The code is as follows:

>>> a.max(axis=2)
array([[ 2,  5],
       [ 8, 11],
       [14, 17]])

The axis parameter is very common, not only in the sum and max functions we just introduced, but also in many other aggregation functions such as min, mean, argmin (to find the index of the minimum value), argmax (to find the index of the maximum value), etc.
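
As a brief sketch of these other functions, the example below reuses the rating matrix from earlier, typed in by hand so that the results are reproducible.

```python
import numpy as np

# Rows are students, columns are games (same values as the earlier example)
interest_score = np.asarray([[4, 7, 5],
                             [4, 2, 5],
                             [7, 2, 4],
                             [1, 2, 4]])

lowest_per_game = interest_score.min(axis=0)
favorite_game = interest_score.argmax(axis=1)
mean_per_game = interest_score.mean(axis=0)

print(lowest_per_game)          # [1 2 4]
print(favorite_game)            # [1 2 0 2]
print(mean_per_game.tolist())   # [4.0, 3.25, 4.5]
```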

Summary #

Congratulations on completing this lesson. If you have some basic programming knowledge in other languages, learning NumPy should be very easy. Here I would like to emphasize once again why NumPy is an essential appetizer.

Many concepts in NumPy are closely related to PyTorch, such as the Tensor in PyTorch. NumPy is commonly used in machine learning, and many modules are based on it, especially in data preprocessing and post-processing.

NumPy is a fundamental package for scientific computing in Python. It provides a multidimensional array object and various fast operations for arrays. To give you a more intuitive experience, we have learned four ways to create arrays.

One method you need to focus on is how to use np.asarray to create an array. This involves the flexible use of array attributes (ndim, shape, dtype, size), especially regarding shape changes and data type conversions.

Finally, I introduced the concept of array axes, which we need to apply flexibly in array aggregation functions. Although this concept is commonly used, it is not easy to understand. I suggest you carefully study the examples in my course, starting from 2D arrays and gradually reasoning to multidimensional arrays. Based on different axes, you will see how the direction of array aggregation changes.

In the next lesson, we will continue to learn about the commonly used and important features in NumPy.

Practice for Each Lesson #

In the earlier example of student ratings for games, can you calculate the average rating that each student gave across the three games?

Feel free to leave your questions or comments in the comments section, and I also recommend sharing this lesson with your friends.