NumPy Tutorial: Python Machine Learning Library

NumPy is a Python package that is well suited for scientific computing. It is also a very common base library for machine learning in Python.

This article is a getting-started tutorial.

1. Introduction

NumPy is a fundamental package for scientific computing with Python. It includes:

  • The powerful N-dimensional array structure
  • Sophisticated functions
  • Tools that can be integrated into C/C++ and Fortran code
  • Linear algebra, Fourier transform and random number features

In addition to scientific computing, NumPy can also be used as an efficient multi-dimensional container for general data. Because arbitrary data types can be defined, NumPy can integrate seamlessly and efficiently with a wide variety of databases.

2. Obtain NumPy

Since NumPy is a Python package, you need a working Python environment on your machine first. Installation instructions for Python are easy to find online.

For other ways to get NumPy, see the "Installing packages" page on the official scipy.org website. I won't go into the details in this article.

I recommend installing Python packages with pip. The command is as follows:

pip3 install numpy
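
Once the install finishes, a quick sanity check is to import the package and print its version. This is only a minimal sketch; the version printed on your machine will probably differ from the one used in this article.

```python
import numpy as np

# Print the installed NumPy version to confirm the package can be imported.
installed_version = np.__version__
print("numpy version: {}".format(installed_version))
```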

The environment where the code for this article is validated and tested is as follows:

  • Hardware: MacBook Pro 2015
  • OS: macOS High Sierra
  • Language: Python 3.6.2
  • Package: numpy 1.13.3

All the source code can be found here: https://github.com/paulQuei/numpy_tutorial

In addition:

  • For simplicity, I will verify the results with Python's print function.
  • I will use import numpy as np throughout, to keep the code short.

3. Basic properties and array creation

The core of NumPy is the homogeneous multidimensional array, whose elements can be indexed by subscripts. In NumPy, a dimension is called an axis (plural: axes), and the number of axes is called the rank.

For instance:

Here is an array with rank 1, and the length of the axis is 3:

[1, 2, 3]

Below is an array with rank 2. Its first axis has length 2 and its second axis has length 3:

[[ 1, 2, 3],
 [ 4, 5, 6]]

We can create a NumPy array with the array function, for example:

a = np.array([1, 2, 3])
b = np.array([(1,2,3), (4,5,6)])

Note that the square brackets are required here, because array expects a single sequence as its first argument. The following way of writing is wrong:

a = np.array(1,2,3,4) # WRONG!!!
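
The difference between the two call styles can be checked directly. The sketch below is an illustration rather than part of the article's source files; the exact exception type and message depend on the NumPy version, so it catches both TypeError and ValueError.

```python
import numpy as np

# Correct: a single sequence argument.
ok = np.array([1, 2, 3, 4])
print(ok)

# Incorrect: four separate positional arguments.
# The exception type varies between NumPy versions, so catch both.
try:
    np.array(1, 2, 3, 4)
    failed = False
except (TypeError, ValueError) as err:
    failed = True
    print("np.array(1, 2, 3, 4) failed: {}".format(err))
```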

NumPy's array class is ndarray, also known by the alias numpy.array. It is different from array.array in the Python standard library, which only handles one-dimensional arrays. The main attributes of ndarray are as follows:

  • ndarray.ndim: the number of dimensions (axes) of the array, also called its rank.
  • ndarray.shape: the dimensions of the array. It's a tuple of integers whose length equals ndim. For example, the shape of a one-dimensional array with length n is (n,), and the shape of an array with n rows and m columns is (n, m).
  • ndarray.size: the total number of elements in the array.
  • ndarray.dtype: the type of the elements in the array, such as numpy.int32, numpy.int16, or numpy.float64.
  • ndarray.itemsize: the size of each element in the array, in bytes.
  • ndarray.data: the buffer that stores the array elements. Usually we access the elements by subscripts and don't need to touch the buffer directly.

Let's take a look at the code example:

# create_array.py

import numpy as np

a = np.array([1, 2, 3])
b = np.array([(1,2,3), (4,5,6)])

print('a=')
print(a)
print("a's ndim {}".format(a.ndim))
print("a's shape {}".format(a.shape))
print("a's size {}".format(a.size))
print("a's dtype {}".format(a.dtype))
print("a's itemsize {}".format(a.itemsize))

print('')

print('b=')
print(b)
print("b's ndim {}".format(b.ndim))
print("b's shape {}".format(b.shape))
print("b's size {}".format(b.size))
print("b's dtype {}".format(b.dtype))
print("b's itemsize {}".format(b.itemsize))

The output is as follows:

a=
[1 2 3]
a's ndim 1
a's shape (3,)
a's size 3
a's dtype int64
a's itemsize 8

b=
[[1 2 3]
 [4 5 6]]
b's ndim 2
b's shape (2, 3)
b's size 6
b's dtype int64
b's itemsize 8

We can also specify the type of the element when creating the array, for example:

c = np.array( [ [1,2], [3,4] ], dtype=complex )

For more parameter descriptions of the array functions, see: numpy.array
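
To see the effect of the dtype parameter, we can inspect dtype and itemsize after creation. This is a small sketch; the float32 array f is an extra example that does not appear in the article's source files.

```python
import numpy as np

c = np.array([[1, 2], [3, 4]], dtype=complex)
f = np.array([1, 2, 3], dtype=np.float32)

print(c)
print(c.dtype, c.itemsize)  # complex128, 16 bytes per element
print(f.dtype, f.itemsize)  # float32, 4 bytes per element
```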

Note: NumPy supports arrays with any number of dimensions and elements of many types. However, arrays with three or more dimensions are harder to visualize, and machine learning code relies mostly on matrix operations, so the rest of this article mainly uses one-dimensional and two-dimensional arrays as examples.

4. Create a specific array

In real projects, we often need arrays with specific content, and NumPy provides some helper functions for this:

  • zeros: used to create an array whose elements are all 0
  • ones: used to create an array whose elements are all 1
  • empty: used to create an uninitialized array, so its content is undefined
  • arange: used to create an array by specifying a range and a step
  • linspace: used to create an array by specifying a range and the number of elements
  • random: used to generate random numbers

# create_specific_array.py

import numpy as np

a = np.zeros((2,3))
print('np.zeros((2,3))= \n{}\n'.format(a))

b = np.ones((2,3))
print('np.ones((2,3))= \n{}\n'.format(b))

c = np.empty((2,3))
print('np.empty((2,3))= \n{}\n'.format(c))

d = np.arange(1, 2, 0.3)
print('np.arange(1, 2, 0.3)= \n{}\n'.format(d))

e = np.linspace(1, 2, 7)
print('np.linspace(1, 2, 7)= \n{}\n'.format(e))

f = np.random.random((2,3))
print('np.random.random((2,3))= \n{}\n'.format(f))

The output is as follows:

np.zeros((2,3))= 
[[ 0.  0.  0.]
 [ 0.  0.  0.]]

np.ones((2,3))= 
[[ 1.  1.  1.]
 [ 1.  1.  1.]]

np.empty((2,3))= 
[[ 1.  1.  1.]
 [ 1.  1.  1.]]

np.arange(1, 2, 0.3)= 
[ 1.   1.3  1.6  1.9]

np.linspace(1, 2, 7)= 
[ 1.          1.16666667  1.33333333  1.5         1.66666667  1.83333333
  2.        ]

np.random.random((2,3))= 
[[ 0.5744616   0.58700653  0.59609648]
 [ 0.0417809   0.23810732  0.38372978]]
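
The difference between arange and linspace is easy to miss: arange takes a step and excludes the stop value, while linspace takes a number of points and includes the stop value by default. The sketch below illustrates this; the endpoint=False call is an extra example that is not in the article's source files.

```python
import numpy as np

# arange: start, stop (excluded), step. The length follows from the step.
d = np.arange(1, 2, 0.3)

# linspace: start, stop (included by default), number of points.
e = np.linspace(1, 2, 7)

# endpoint=False excludes the stop value, as arange does.
e_open = np.linspace(1, 2, 5, endpoint=False)

print(len(d), d[-1] < 2)
print(len(e), e[0], e[-1])
print(e_open)
```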

5. Shape and operation

Besides creating arrays, we often need to build new data structures from arrays we already have. In this case, we can use the following functions:

  • reshape: used to generate a new array based on the existing array and the specified shape
  • vstack: used to stack multiple arrays in vertical direction (the dimensions of the array must be matched)
  • hstack: used to stack multiple arrays in horizontal direction (the dimensions of the array must be matched)
  • hsplit: used to split the array horizontally
  • vsplit: used to split the array vertically

We'll use some examples to illustrate.

To make it easier to test, let's create some data:

  • zero_line: an array with one row containing three 0s
  • one_column: an array with one column containing three 1s
  • a: a matrix with 2 rows and 3 columns
  • b: an integer array covering the interval [11,20)

# shape_manipulation.py

import numpy as np

zero_line = np.zeros((1,3))
one_column = np.ones((3,1))
print("zero_line = \n{}\n".format(zero_line))
print("one_column = \n{}\n".format(one_column))

a = np.array([(1,2,3), (4,5,6)])
b = np.arange(11, 20)
print("a = \n{}\n".format(a))
print("b = \n{}\n".format(b))

We can get the structure in the output:

zero_line = 
[[ 0.  0.  0.]]

one_column = 
[[ 1.]
 [ 1.]
 [ 1.]]

a = 
[[1 2 3]
 [4 5 6]]

b = 
[11 12 13 14 15 16 17 18 19]

The array b is originally one-dimensional, and we reshape it into a matrix of 3 rows and 3 columns with the reshape method:

# shape_manipulation.py

b = b.reshape(3, -1)
print("b.reshape(3, -1) = \n{}\n".format(b))

The second parameter is set to -1, which tells NumPy to infer that dimension from the number of elements. Since the array has 9 elements, the reshaped matrix is 3×3. The code output is as follows:

b.reshape(3, -1) = 
[[11 12 13]
 [14 15 16]
 [17 18 19]]
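
The -1 placeholder only works when the total number of elements is divisible by the other dimensions. A minimal sketch of both cases, using the same 9-element array:

```python
import numpy as np

b = np.arange(11, 20)          # 9 elements

print(b.reshape(3, -1).shape)  # (3, 3): 9 / 3 = 3 columns
print(b.reshape(-1, 1).shape)  # (9, 1): 9 / 1 = 9 rows

# 9 elements cannot be arranged in 2 rows, so NumPy raises ValueError.
try:
    b.reshape(2, -1)
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print("reshape(2, -1) failed: {}".format(err))
```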

Next, we'll stack the three arrays vertically through the vstack function:

# shape_manipulation.py

c = np.vstack((a, b, zero_line))
print("c = np.vstack((a,b, zero_line)) = \n{}\n".format(c))

The output is as follows. Please pay attention to the data structure before and after stacking:

c = np.vstack((a,b, zero_line)) = 
[[  1.   2.   3.]
 [  4.   5.   6.]
 [ 11.  12.  13.]
 [ 14.  15.  16.]
 [ 17.  18.  19.]
 [  0.   0.   0.]]

Similarly, we can also use the hstack for horizontal stacking. This time we need to adjust the structure of the array a first:

# shape_manipulation.py

a = a.reshape(3, 2)
print("a.reshape(3, 2) = \n{}\n".format(a))

d = np.hstack((a, b, one_column))
print("d = np.hstack((a,b, one_column)) = \n{}\n".format(d))

The output is as follows. Again, please pay attention to the data structure before and after stacking:

a.reshape(3, 2) = 
[[1 2]
 [3 4]
 [5 6]]

d = np.hstack((a,b, one_column)) = 
[[  1.   2.  11.  12.  13.   1.]
 [  3.   4.  14.  15.  16.   1.]
 [  5.   6.  17.  18.  19.   1.]]

Note that if the shapes of the arrays are not compatible, the stacking will fail. For example, the following line cannot be executed:

# shape_manipulation.py

# np.vstack((a,b)) # ValueError: dimensions not match

This is because array a has two columns while array b has three columns, so they cannot be stacked.
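
The rule can be checked by comparing shapes before stacking: vstack needs the same number of columns, hstack the same number of rows. A small self-contained sketch with the same a and b:

```python
import numpy as np

a = np.array([(1, 2), (3, 4), (5, 6)])   # shape (3, 2)
b = np.arange(11, 20).reshape(3, -1)     # shape (3, 3)

# vstack needs equal column counts: 2 != 3, so this fails.
try:
    np.vstack((a, b))
    vstack_failed = False
except ValueError as err:
    vstack_failed = True
    print("vstack failed: {}".format(err))

# hstack needs equal row counts: 3 == 3, so this works.
h = np.hstack((a, b))
print(h.shape)  # (3, 5)
```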

Next, let's take a look at the split. First, we split the array d into three arrays in horizontal direction. Then we print out the middle one (the subscript is 1):

# shape_manipulation.py

e = np.hsplit(d, 3) # Split d into 3
print("e = np.hsplit(d, 3) = \n{}\n".format(e))
print("e[1] = \n{}\n".format(e[1]))

The output is as follows:

e = np.hsplit(d, 3) = 
[array([[ 1.,  2.],
       [ 3.,  4.],
       [ 5.,  6.]]), array([[ 11.,  12.],
       [ 14.,  15.],
       [ 17.,  18.]]), array([[ 13.,   1.],
       [ 16.,   1.],
       [ 19.,   1.]])]

e[1] = 
[[ 11.  12.]
 [ 14.  15.]
 [ 17.  18.]]

In addition, if the requested number of parts does not divide the array evenly, the operation will fail:

# np.hsplit(d, 4) # ValueError: array split does not result in an equal division
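
If an uneven split is acceptable, one option (not used elsewhere in this article) is np.array_split, which allows sections of different sizes. A minimal sketch, using a stand-in 3×6 array instead of the d built above:

```python
import numpy as np

d = np.arange(18).reshape(3, 6)   # stand-in for the 3x6 array d

# Split the 6 columns into 4 parts; the first parts get the extra columns.
parts = np.array_split(d, 4, axis=1)
shapes = [p.shape for p in parts]
print(shapes)   # [(3, 2), (3, 2), (3, 1), (3, 1)]
```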

Besides giving a number of equal parts, we can also give the column indices at which to split. The following splits the array d after the first column and after the third column:

# shape_manipulation.py

f = np.hsplit(d, (1, 3)) # Split d after the 1st and the 3rd column
print("f = np.hsplit(d, (1, 3)) = \n{}\n".format(f))

The output of the code is as follows. The array d is split into three arrays containing 1, 2, and 3 columns respectively:

f = np.hsplit(d, (1, 3)) = 
[array([[ 1.],
       [ 3.],
       [ 5.]]), array([[  2.,  11.],
       [  4.,  14.],
       [  6.,  17.]]), array([[ 12.,  13.,   1.],
       [ 15.,  16.,   1.],
       [ 18.,  19.,   1.]])]
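
One way to read the index form is as a set of column slices. The sketch below uses a stand-in 3×6 array (not the d from above) and checks that hsplit with (1, 3) matches slicing the columns by hand:

```python
import numpy as np

d = np.arange(18).reshape(3, 6)   # stand-in 3x6 array

f = np.hsplit(d, (1, 3))

# The indices (1, 3) cut the columns into [0, 1), [1, 3) and [3, 6).
same = (np.array_equal(f[0], d[:, :1]) and
        np.array_equal(f[1], d[:, 1:3]) and
        np.array_equal(f[2], d[:, 3:]))

print([part.shape for part in f])  # [(3, 1), (3, 2), (3, 3)]
print(same)                        # True
```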

Finally, we split the array d in the vertical direction. As before, if the requested number of parts does not divide the rows evenly, it will fail:

# shape_manipulation.py

g = np.vsplit(d, 3)
print("np.vsplit(d, 3) = \n{}\n".format(g))

# np.vsplit(d, 2) # ValueError: array split does not result in an equal division

np.vsplit(d, 3) generates three two-dimensional arrays, each with 1 row and 6 columns:

np.vsplit(d, 3) = 
[array([[  1.,   2.,  11.,  12.,  13.,   1.]]), array([[  3.,   4.,  14.,  15.,  16.,   1.]]), array([[  5.,   6.,  17.,  18.,  19.,   1.]])]

6. Index

Next we look at how to access the data in the NumPy array.

Again, for testing convenience, let's create a one-dimensional array first. Its content is integers in the interval of [100,200).

Basically, we can specify the subscripts by array[index] to access the elements of the array.

# array_index.py

import numpy as np

base_data = np.arange(100, 200)
print("base_data\n={}\n".format(base_data))

print("base_data[10] = {}\n".format(base_data[10]))

The output for the above code is as follows:

base_data
=[100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117
 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135
 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153
 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171
 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189
 190 191 192 193 194 195 196 197 198 199]

base_data[10] = 110

In NumPy, we can create an array containing several subscripts to get the elements in the target array. For example:

# array_index.py
every_five = np.arange(0, 100, 5)
print("base_data[every_five] = \n{}\n".format(
    base_data[every_five]))

every_five is the array containing the subscripts we want. Passing it inside the square brackets returns all the elements at those subscripts:

base_data[every_five] = 
[100 105 110 115 120 125 130 135 140 145 150 155 160 165 170 175 180 185
 190 195]

The subscript array can be one-dimensional or multi-dimensional. Suppose we want a 2×2 matrix whose content comes from the subscripts 1, 2, 10, and 20 of the target array. The code can be written as follows:

# array_index.py
a = np.array([(1,2), (10,20)])
print("a = \n{}\n".format(a))
print("base_data[a] = \n{}\n".format(base_data[a]))

The output is as follows:

a = 
[[ 1  2]
 [10 20]]
 
base_data[a] = 
[[101 102]
 [110 120]]
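
Two properties of indexing with an array are worth checking: the result takes the shape of the index array, and it is a copy rather than a view of the original. A small sketch (the name idx is only for illustration):

```python
import numpy as np

base_data = np.arange(100, 200)
idx = np.array([(1, 2), (10, 20)])

picked = base_data[idx]
print(picked.shape == idx.shape)   # True: the result has the shape of idx

# Indexing with an array returns a copy, so writing to it
# leaves base_data unchanged.
picked[0, 0] = -1
print(base_data[1])                # 101
```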

So far the target array has been one-dimensional. Now let's convert base_data into a 10×10 two-dimensional array.

# array_index.py
base_data2 = base_data.reshape(10, -1)
print("base_data2 = np.reshape(base_data, (10, -1)) = \n{}\n".format(base_data2))

The reshape function was introduced earlier, and the result is as follows:

base_data2 = np.reshape(base_data, (10, -1)) = 
[[100 101 102 103 104 105 106 107 108 109]
 [110 111 112 113 114 115 116 117 118 119]
 [120 121 122 123 124 125 126 127 128 129]
 [130 131 132 133 134 135 136 137 138 139]
 [140 141 142 143 144 145 146 147 148 149]
 [150 151 152 153 154 155 156 157 158 159]
 [160 161 162 163 164 165 166 167 168 169]
 [170 171 172 173 174 175 176 177 178 179]
 [180 181 182 183 184 185 186 187 188 189]
 [190 191 192 193 194 195 196 197 198 199]]

For a two-dimensional array:

  • if we specify only one subscript, the result is still an array (one row).
  • if we specify two subscripts, the result is a single element.
  • we can use -1 to refer to the last index along an axis.

# array_index.py
print("base_data2[2] = \n{}\n".format(base_data2[2]))
print("base_data2[2, 3] = \n{}\n".format(base_data2[2, 3]))
print("base_data2[-1, -1] = \n{}\n".format(base_data2[-1, -1]))

For higher-dimensional arrays the principle is the same, and you can work it out yourself.

The output of the code is as follows:

base_data2[2] = 
[120 121 122 123 124 125 126 127 128 129]

base_data2[2, 3] = 
123

base_data2[-1, -1] = 
199
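
The same rules carry over to higher dimensions: each extra subscript removes one axis. A minimal sketch with a hypothetical 2×3×4 array named cube:

```python
import numpy as np

cube = np.arange(24).reshape(2, 3, 4)

print(cube[1].shape)      # (3, 4): one subscript leaves a 2D array
print(cube[1, 2].shape)   # (4,):   two subscripts leave a 1D array
print(cube[1, 2, 3])      # 23:     three subscripts give a single element
print(cube[-1, -1, -1])   # 23:     -1 selects the last index on each axis
```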

In addition, we can specify a range with ":", such as 2:5. Writing only ":" selects the full range.

Please see the code below:

# array_index.py
print("base_data2[2, :] = \n{}\n".format(base_data2[2, :]))
print("base_data2[:, 3] = \n{}\n".format(base_data2[:, 3]))
print("base_data2[2:5, 2:4] = \n{}\n".format(base_data2[2:5, 2:4]))

It will:

  1. get all the elements of the row whose subscript is 2
  2. get all the elements of the column whose subscript is 3
  3. get all the elements of the rows whose subscripts are in [2,5) and the columns whose subscripts are in [2,4)

Please observe the following output carefully:

base_data2[2, :] = 
[120 121 122 123 124 125 126 127 128 129]

base_data2[:, 3] = 
[103 113 123 133 143 153 163 173 183 193]

base_data2[2:5, 2:4] = 
[[122 123]
 [132 133]
 [142 143]]
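
One detail to keep in mind is that a slice is a view on the original data, not a copy, so writing through it changes the original array. The sketch below shows this with np.shares_memory; the names block and independent are only for illustration.

```python
import numpy as np

base_data2 = np.arange(100, 200).reshape(10, -1)

block = base_data2[2:5, 2:4]
print(np.shares_memory(block, base_data2))        # True: the slice is a view

# Writing through the view changes the original array as well.
block[0, 0] = 0
print(base_data2[2, 2])                           # 0

# Call copy() to obtain independent data.
independent = base_data2[2:5, 2:4].copy()
print(np.shares_memory(independent, base_data2))  # False
```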

7. Mathematics

There are also a lot of mathematical functions in NumPy. Here are some examples. For more functions, please see: NumPy manual contents

# operation.py

import numpy as np

base_data = (np.random.random((5, 5)) - 0.5) * 100
print("base_data = \n{}\n".format(base_data))

print("np.amin(base_data) = {}".format(np.amin(base_data)))
print("np.amax(base_data) = {}".format(np.amax(base_data)))
print("np.average(base_data) = {}".format(np.average(base_data)))
print("np.sum(base_data) = {}".format(np.sum(base_data)))
print("np.sin(base_data) = \n{}".format(np.sin(base_data)))

The output of the code is as follows:

base_data = 
[[ -9.63895991   6.9292461   -2.35654712 -48.45969283  13.56031937]
 [-39.75875796 -43.21031705 -49.27708561   6.80357128  33.71975059]
 [ 36.32228175  30.92546582 -41.63728955  28.68799187   6.44818484]
 [  7.71568596  43.24884701 -14.90716555  -9.24092252   3.69738718]
 [-31.90994273  34.06067289  18.47830413 -16.02495202 -44.84625246]]

np.amin(base_data) = -49.277085606595726
np.amax(base_data) = 43.24884701268845
np.average(base_data) = -3.22680706079886
np.sum(base_data) = -80.6701765199715
np.sin(base_data) = 
[[ 0.21254814  0.60204578 -0.70685739  0.9725159   0.8381861 ]
 [-0.88287359  0.69755541  0.83514527  0.49721505  0.74315189]
 [-0.98124746 -0.47103234  0.7149727  -0.40196147  0.16425187]
 [ 0.99045239 -0.66943662 -0.71791164 -0.18282139 -0.5276184 ]
 [-0.4741657   0.47665553 -0.36278223  0.31170676 -0.76041722]]
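
Because the data above is random, the numbers change on every run. The sketch below uses a small fixed array so the results can be checked by hand, and also shows the axis parameter, which the example above does not use: axis=0 reduces down the columns and axis=1 reduces along the rows.

```python
import numpy as np

data = np.array([[1.0, -2.0, 3.0],
                 [4.0, 5.0, -6.0]])

print(np.amin(data))          # -6.0
print(np.amax(data))          # 5.0
print(np.sum(data))           # 5.0
print(np.sum(data, axis=0))   # column sums: [ 5.  3. -3.]
print(np.sum(data, axis=1))   # row sums:    [2. 3.]
```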

8. Matrix

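Now, let's look at how to use NumPy for matrix operations.

First, let's create a 5×5 matrix of random integer values (stored as floats, since np.floor returns floats). There are two ways to get the transpose of a matrix: the .T attribute or the transpose function. In addition, matrices can be multiplied with the dot function. When one argument is a scalar, as in np.dot(matrix_one, -1), dot simply multiplies every element by that scalar. The sample code is as follows: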

# matrix.py

import numpy as np

base_data = np.floor((np.random.random((5, 5)) - 0.5) * 100)
print("base_data = \n{}\n".format(base_data))

print("base_data.T = \n{}\n".format(base_data.T))
print("base_data.transpose() = \n{}\n".format(base_data.transpose()))

matrix_one = np.ones((5, 5))
print("matrix_one = \n{}\n".format(matrix_one))

minus_one = np.dot(matrix_one, -1)
print("minus_one = \n{}\n".format(minus_one))

print("np.dot(base_data, minus_one) = \n{}\n".format(
    np.dot(base_data, minus_one)))

The output is as follows:

base_data = 
[[-49.  -5.  11. -13. -41.]
 [ -6. -33. -33. -47.  -4.]
 [-38.  26.  28. -18.  18.]
 [ -3. -19. -15. -39.  45.]
 [-43.   6.  18. -15. -21.]]

base_data.T = 
[[-49.  -6. -38.  -3. -43.]
 [ -5. -33.  26. -19.   6.]
 [ 11. -33.  28. -15.  18.]
 [-13. -47. -18. -39. -15.]
 [-41.  -4.  18.  45. -21.]]

base_data.transpose() = 
[[-49.  -6. -38.  -3. -43.]
 [ -5. -33.  26. -19.   6.]
 [ 11. -33.  28. -15.  18.]
 [-13. -47. -18. -39. -15.]
 [-41.  -4.  18.  45. -21.]]

matrix_one = 
[[ 1.  1.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.]]

minus_one = 
[[-1. -1. -1. -1. -1.]
 [-1. -1. -1. -1. -1.]
 [-1. -1. -1. -1. -1.]
 [-1. -1. -1. -1. -1.]
 [-1. -1. -1. -1. -1.]]

np.dot(base_data, minus_one) = 
[[  97.   97.   97.   97.   97.]
 [ 123.  123.  123.  123.  123.]
 [ -16.  -16.  -16.  -16.  -16.]
 [  31.   31.   31.   31.   31.]
 [  55.   55.   55.   55.   55.]]
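
It is easy to confuse the matrix product with element-wise multiplication. The sketch below uses two small fixed matrices (A and B are illustrative names) so the difference can be verified by hand; the @ operator is an alternative spelling of the matrix product available from Python 3.5.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])

print(np.dot(A, B))   # matrix product:       [[2 1] [4 3]]
print(A @ B)          # same product using the @ operator
print(A * B)          # element-wise product: [[0 2] [3 0]]
print(np.dot(A, -1))  # scalar second argument: every element times -1
```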

9. Random numbers

To finish the article, let's take a look at random numbers.

Random numbers are used very often in programming, for example to generate demo data, or to shuffle an existing data set before splitting it into training data and validation data.

The numpy.random package contains a number of functions for generating random numbers. Here are four of the most common usages:

# rand.py

import numpy as np

print("random: {}\n".format(np.random.random(20)))

print("rand: {}\n".format(np.random.rand(3, 4)))

print("randint: {}\n".format(np.random.randint(0, 100, 20)))

print("permutation: {}\n".format(np.random.permutation(np.arange(20))))

The four usages are:

  1. to generate 20 random numbers, each in the interval [0.0, 1.0)
  2. to generate an array of random numbers with the specified shape (here 3 rows and 4 columns)
  3. to generate a specified number (such as 20) of random integers within the specified range (such as [0, 100))
  4. to shuffle an existing sequence ([0, 1, 2, ..., 19]) randomly

The output is as follows:

random: [0.62956026 0.56816277 0.30903156 0.50427765 0.92117724 0.43044905
 0.54591323 0.47286235 0.93241333 0.32636472 0.14692983 0.02163887
 0.85014782 0.20164791 0.76556972 0.15137427 0.14626625 0.60972522
 0.2995841  0.27569573]

rand: [[0.38629927 0.43779617 0.96276889 0.80018417]
 [0.67656892 0.97189483 0.13323458 0.90663724]
 [0.99440473 0.85197677 0.9420241  0.79598706]]

randint: [74 65 51 34 22 69 81 36 73 35 98 26 41 84  0 93 41  6 51 55]

permutation: [15  3  8 18 14 19 16  1  0  4 10 17  5  2  6 12  9 11 13  7]
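
The shuffling use case mentioned at the start of this section can be sketched as follows. The data array, the 8/2 split, and the seed value 42 are assumptions chosen for illustration; np.random.seed fixes the generator state so that the same permutation is produced on every run.

```python
import numpy as np

data = np.arange(100, 110)          # 10 hypothetical samples

np.random.seed(42)                  # fix the seed for repeatable results
order = np.random.permutation(len(data))

train = data[order[:8]]             # first 8 shuffled samples
validation = data[order[8:]]        # remaining 2 samples

# Re-seeding with the same value reproduces the same permutation.
np.random.seed(42)
order_again = np.random.permutation(len(data))

print(train, validation)
print(np.array_equal(order, order_again))   # True
```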
