Appendix E: The NumPy Library

This section is under construction.


An array is an indexed sequence of objects, all of which are of the same type. In Section 1.4, we implemented arrays using the Python list data type: a list object is an indexed sequence of objects, not necessarily of the same type. Using Python lists to implement arrays incurs substantial overhead, both in terms of memory (because Python must associate type information with each element) and time (because Python must perform a type check when accessing an element). Moreover, it is the programmer's responsibility to enforce the "all elements of the same type" constraint.

Now, we describe an alternative way to represent arrays in Python using the ndarray ("n-dimensional array") data type in the widely used NumPy library: an ndarray object is an indexed sequence of objects, all of which are of the same type, and NumPy enforces the "all elements of the same type" constraint. We use the informal term NumPy array to mean "an object of type ndarray."

Typically the elements of a NumPy array are numbers, such as floats or integers. As a result, there is minimal overhead in terms of memory (because NumPy need only associate type information with the array and not each element). Also, this representation can dramatically speed up certain types of "vectorized" computations because the array elements are stored contiguously in memory.

But there is more to NumPy than numeric arrays: the NumPy library also supports arrays whose elements are booleans and strings, and arrays whose elements are of data types that you define. There also is more to NumPy than arrays: the NumPy library provides many functions that work on "scalar" numeric objects. But the most valuable aspect of NumPy is its ability to manipulate numeric arrays; this appendix describes only that aspect.



When to Use NumPy

When should you use NumPy? One short answer is "when you need the functionality that NumPy provides." Indeed, as described later in this appendix, NumPy provides a rich set of numeric array-handling functions and methods. Generally you should use the pre-defined (and thoroughly tested) functions and methods provided by NumPy instead of defining your own equivalent functions or methods.

Another short answer is "when you need your program to run faster." The NumPy library was not written in Python; instead it was written using the C programming language. Programs written in C typically run more quickly than those written in Python. So a Python program that implements arrays as NumPy arrays may run faster than an equivalent Python program that implements arrays as ordinary Python lists.

However, there is more to the story. Since NumPy was written in C, there is a boundary between NumPy code and ordinary Python code. Crossing that boundary is expensive. That is, calling a NumPy function or method from ordinary Python code consumes more time than does calling an ordinary Python function or method. Similarly, returning a value from a NumPy function or method to ordinary Python code consumes more time than does returning a value from an ordinary Python function or method. With that in mind, suppose:

Program A executes NumPy operations many times, and each operation involves little computation.
Program B executes NumPy operations only a few times, and each operation involves much computation.

In that case Program B probably will benefit from the use of NumPy, but Program A probably will not.
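The contrast between many small NumPy operations and a few large ones can be sketched as follows. This is only an illustration: the function names are hypothetical, and the timings it prints depend on the machine, so no particular numbers are claimed.

```python
import timeit

import numpy

values = numpy.arange(100000, dtype=float)


def sum_many_small_calls(array):
    """Many NumPy operations: one indexing operation per element."""
    total = 0.0
    for i in range(len(array)):
        total += array[i]   # each access crosses the Python/NumPy boundary
    return float(total)


def sum_one_big_call(array):
    """Few NumPy operations: a single call does all of the arithmetic."""
    return float(array.sum())


# Timings vary by machine, so only the comparison is of interest.
slow = timeit.timeit(lambda: sum_many_small_calls(values), number=3)
fast = timeit.timeit(lambda: sum_one_big_call(values), number=3)
print('many small calls:', slow, 'seconds')
print('one big call:    ', fast, 'seconds')
```

Both functions compute the same sum; only the number of boundary crossings differs.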



NumPy Data Types

Since the NumPy library was written in the C programming language, its fundamental data types are, for the most part, those of C. The following table (taken from the online NumPy documentation) lists the NumPy fundamental data types:

Data Type     Description
bool_         Boolean (True or False) stored using 8 bits
int_          Default integer type (same as C long; normally either int64 or int32)
intc          Identical to C int (normally int32 or int64)
intp          Integer used for indexing (same as C ssize_t; normally either int32 or int64)
int8          Byte (-128 to 127)
int16         Integer (-32768 to 32767)
int32         Integer (-2147483648 to 2147483647)
int64         Integer (-9223372036854775808 to 9223372036854775807)
uint8         Unsigned integer (0 to 255)
uint16        Unsigned integer (0 to 65535)
uint32        Unsigned integer (0 to 4294967295)
uint64        Unsigned integer (0 to 18446744073709551615)
float_        Shorthand for float64
float16       Half precision float: sign bit, 5 bits exponent, 10 bits mantissa
float32       Single precision float: sign bit, 8 bits exponent, 23 bits mantissa
float64       Double precision float: sign bit, 11 bits exponent, 52 bits mantissa
complex_      Shorthand for complex128
complex64     Complex number, represented by two 32-bit floats (real and imaginary components)
complex128    Complex number, represented by two 64-bit floats (real and imaginary components)

When you create a NumPy array, you specify the type of the array's elements. Normally you specify the element data type as a Python data type: int, float, bool, or complex. The Python int data type maps to the NumPy int_ data type. So if you create a NumPy array with elements of data type int, then internally within NumPy its elements are of type int_. Similarly, the Python float data type maps to the NumPy float_ data type, the Python bool data type maps to the NumPy bool_ data type, and the Python complex data type maps to the NumPy complex_ data type.
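The mapping from Python data types to NumPy data types can be inspected through an array's dtype attribute. The following is a minimal sketch; the variable names are illustrative, and the integer type that is printed depends on the platform.

```python
import numpy

ints = numpy.array([1, 2, 3], int)
floats = numpy.array([1.0, 2.0], float)
bools = numpy.array([True, False], bool)
complexes = numpy.array([1 + 2j], complex)

# dtype reports the NumPy element type chosen for each array.
for array in (ints, floats, bools, complexes):
    print(array.dtype)

# itemsize is the number of bytes used to store one element.
print(ints.itemsize, floats.itemsize, bools.itemsize, complexes.itemsize)
```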

Usually you need not be concerned about the distinction between Python data types and NumPy data types. When an object of a Python data type is sent "across the boundary" to NumPy, the object automatically is converted to the appropriate NumPy data type. Conversely, when NumPy sends an object of a NumPy data type back to Python, and when that object is used in a context that demands an object of a Python data type, the object automatically is converted to the appropriate Python data type. Throughout this appendix we ignore the distinction between Python and NumPy data types.

However, the distinction between Python and NumPy data types can be important in programs that manipulate large-magnitude integers, that is, integers whose absolute values are large. Whereas Python int objects have unlimited range, the range of NumPy int_ objects is limited. A NumPy int_ object has range -2147483648 to 2147483647 (that is, -2^31 to 2^31 - 1) on systems that store integers using 32 binary digits, and -9223372036854775808 to 9223372036854775807 (that is, -2^63 to 2^63 - 1) on systems that store integers using 64 binary digits.

So in NumPy it is possible for an expression to evaluate to an integer that is outside of the range that NumPy can store. When such an overflow occurs, NumPy evaluates the expression to an integer that is mathematically incorrect. Beware when manipulating large-magnitude integers in NumPy.
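A small sketch of such an overflow follows. To make the behavior the same on every system, it uses the explicit 32-bit type numpy.int32 rather than int_, whose width depends on the platform; the variable names are illustrative.

```python
import numpy

big = numpy.array([2147483647], numpy.int32)   # largest 32-bit signed integer
one = numpy.array([1], numpy.int32)

wrapped = big + one        # the result is still int32, so it wraps around
exact = int(big[0]) + 1    # a Python int has unlimited range

print(wrapped[0])   # -2147483648 (mathematically incorrect)
print(exact)        # 2147483648
```

NumPy performs the array addition without reporting an error, which is why such results deserve extra care.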



NumPy Array Fundamentals

To use the NumPy library, include the statement import numpy near the beginning of your program. Then to create a NumPy array, call the numpy.array() function specifying a Python list as the first argument and a Python data type as the second argument. For example, this statement:

a = numpy.array([18, 19, 20, 21], int)

creates a one-dimensional NumPy array containing integers 18, 19, 20, and 21, and this statement:

b = numpy.array([[18.5, 19.3], [20.1, 21.0],  [23.7, 24.9]], float)

creates a two-dimensional NumPy array of floats having three rows and two columns. If you omit the second argument to numpy.array(), then the function infers the desired element type by examining the types of the values provided in the first argument.
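The following sketch creates the two arrays described above and shows type inference when the second argument is omitted. The shape and dtype attributes are used only to inspect the results, and the names inferred_int and inferred_float are illustrative.

```python
import numpy

a = numpy.array([18, 19, 20, 21], int)
b = numpy.array([[18.5, 19.3], [20.1, 21.0], [23.7, 24.9]], float)

print(a.shape)   # (4,)
print(b.shape)   # (3, 2)

# With no second argument, the element type is inferred from the values.
inferred_int = numpy.array([1, 2, 3])        # all ints: an integer array
inferred_float = numpy.array([1, 2.5, 3])    # a float is present: float64
print(inferred_int.dtype)
print(inferred_float.dtype)   # float64
```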

To convert a NumPy array to a Python list, call the tolist() method. For example, the expression a.tolist() evaluates to [18, 19, 20, 21], and b.tolist() evaluates to [[18.5, 19.3], [20.1, 21.0], [23.7, 24.9]].

You can reference an element of a one-dimensional NumPy array, just as you can reference an element of a Python list, by specifying an index enclosed within square brackets. For example, a[1] evaluates to 19. To reference an element of a two-dimensional NumPy array, specify the indices within square brackets, separated by commas. For example, b[1, 0] evaluates to 20.1. Note that the syntax for referencing an element of a NumPy two-dimensional array differs from the syntax for referencing an element of a list of lists. (Recall that you would use the expression b[1][0] if b referenced a Python list of lists.)
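The indexing forms described above can be tried directly, as in this short sketch. It also shows that the chained form b[1][0] happens to work on a NumPy array, because b[1] evaluates to row 1 as an array in its own right; the comma form does the lookup in a single indexing operation.

```python
import numpy

a = numpy.array([18, 19, 20, 21], int)
b = numpy.array([[18.5, 19.3], [20.1, 21.0], [23.7, 24.9]], float)

print(a[1])      # 19
print(b[1, 0])   # 20.1
print(b[1][0])   # 20.1 as well: b[1] is row 1, then [0] indexes into it

b[2, 1] = 0.0    # elements are assigned with the same syntax
print(b[2].tolist())   # [23.7, 0.0]
```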

Iteration over NumPy arrays works as expected:

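# Write each element of the one-dimensional array a.
for element in a:
    stdio.writeln(element)

# Iterating over the two-dimensional array b yields its rows.
for row in b:
    for element in row:
        stdio.writeln(element)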

Slicing a NumPy array also is straightforward. For example, a[1:3] evaluates to the NumPy array [19, 20]. However, slicing a NumPy array does not generate a copy of the array. For example, this statement:

e = a[1:3]

causes e to reference a subarray that shares its elements with the array referenced by a. Assigning some value to e[0] would change both e[0] and a[1]. Accordingly, the expression a[:] does not create a copy of the NumPy array referenced by a. Instead, to make a copy of a NumPy array you must call the copy() method:

e = a.copy()
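The difference between a slice and a copy can be observed with a short experiment. This is a sketch with illustrative variable names; numpy.shares_memory() is used only to confirm whether two arrays overlap in memory.

```python
import numpy

a = numpy.array([18, 19, 20, 21], int)

view = a[1:3]        # a slice refers to the same elements as a
view[0] = 99         # so this assignment also changes a[1]
print(a.tolist())    # [18, 99, 20, 21]

duplicate = a.copy() # an independent array with its own elements
duplicate[0] = -1    # a[0] is unaffected
print(a.tolist())          # [18, 99, 20, 21]
print(duplicate.tolist())  # [-1, 99, 20, 21]

print(numpy.shares_memory(a, view))       # True
print(numpy.shares_memory(a, duplicate))  # False
```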

Comparison of two NumPy arrays performs comparison of corresponding elements. Unlike comparison of two Python lists, comparison of two NumPy arrays yields another NumPy array containing booleans. For example, these statements:

f = numpy.array([1,2,3])
g = numpy.array([3,2,1])
h = f < g

generate and assign to h the NumPy array [True, False, False].
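Because a comparison yields an array of booleans rather than a single boolean, it is often reduced to one value afterward. The sketch below shows a few ways to do that with any(), all(), and numpy.array_equal(); the variable names less and equal are illustrative.

```python
import numpy

f = numpy.array([1, 2, 3])
g = numpy.array([3, 2, 1])

less = f < g      # elementwise: [True, False, False]
equal = f == g    # elementwise: [False, True, False]
print(less.tolist())
print(equal.tolist())

print(less.any())                # True: at least one element of f is smaller
print(less.all())                # False: not every element of f is smaller
print(numpy.array_equal(f, g))   # False: the arrays differ somewhere
```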



NumPy Array Operations

The NumPy library defines many operators, methods, and functions to manipulate arrays. The following table shows examples of some from the field of linear algebra. The examples use these NumPy arrays:

a = numpy.array([1, 2, 3])
b = numpy.array([4, 5, 6])
cc = numpy.array([[7, 8, 9], [10, 11, 12]])
dd = numpy.array([[13, 14, 15], [16, 17, 18]])
ee = numpy.array([[19, 20], [21, 22], [23, 24]])
ff = numpy.array([[25, 26, 27], [28, 29, 30], [31, 32, 33]])

In many cases the same operation can be performed using an operator, a method call, or a function call. For example, both of these expressions compute the memberwise sum of two arrays:

a + b            # Using an operator
numpy.add(a, b)  # Using a function call

and both of these expressions compute the dot product of two arrays:

a.dot(b)         # Using a method call
numpy.dot(a, b)  # Using a function call

In this appendix we use an operator if it is available, we use a method call only if an operator is unavailable, and we use a function call only if neither an operator nor a method call is available.
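The equivalent forms can be checked side by side, as in this sketch. It also shows the @ operator, which Python 3.5 and later provide for matrix multiplication; this appendix does not use it, so treat that line as an aside.

```python
import numpy

a = numpy.array([1, 2, 3])
b = numpy.array([4, 5, 6])

print(a + b)             # [5 7 9]
print(numpy.add(a, b))   # [5 7 9]

print(a.dot(b))          # 32
print(numpy.dot(a, b))   # 32
print(a @ b)             # 32
```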

Operation                                Examples                          Results
Zeros array                              numpy.zeros(3, int)               [0, 0, 0]
                                         numpy.zeros(4, float)             [0.0, 0.0, 0.0, 0.0]
Ones array                               numpy.ones(3, int)                [1, 1, 1]
                                         numpy.ones(4, float)              [1.0, 1.0, 1.0, 1.0]
Scalar addition                          5 + a                             [6, 7, 8]
                                         a + 5                             [6, 7, 8]
                                         5 + cc                            [[12, 13, 14], [15, 16, 17]]
                                         cc + 5                            [[12, 13, 14], [15, 16, 17]]
Scalar subtraction                       5 - a                             [4, 3, 2]
                                         a - 5                             [-4, -3, -2]
                                         5 - cc                            [[-2, -3, -4], [-5, -6, -7]]
                                         cc - 5                            [[2, 3, 4], [5, 6, 7]]
Scalar multiplication                    5 * a                             [5, 10, 15]
                                         a * 5                             [5, 10, 15]
                                         5 * cc                            [[35, 40, 45], [50, 55, 60]]
                                         cc * 5                            [[35, 40, 45], [50, 55, 60]]
Scalar division                          5 / a                             [5., 2.5, 1.66666667]
                                         a / 5                             [0.2, 0.4, 0.6]
                                         5 / cc                            [[0.71428571, 0.625, 0.55555556], [0.5, 0.45454545, 0.41666667]]
                                         cc / 5                            [[1.4, 1.6, 1.8], [2., 2.2, 2.4]]
Vector/matrix addition                   a + b                             [5, 7, 9]
                                         cc + dd                           [[20, 22, 24], [26, 28, 30]]
Vector/matrix subtraction                a - b                             [-3, -3, -3]
                                         cc - dd                           [[-6, -6, -6], [-6, -6, -6]]
Vector/matrix memberwise multiplication  a * b                             [4, 10, 18]
                                         cc * dd                           [[91, 112, 135], [160, 187, 216]]
Vector/matrix memberwise division        a / b                             [0.25, 0.4, 0.5]
                                         cc / dd                           [[0.53846154, 0.57142857, 0.6], [0.625, 0.64705882, 0.66666667]]
Vector/matrix multiplication             a.dot(b)                          32
(dot product)                            cc.dot(ee)                        [[508, 532], [697, 730]]
Matrix transpose                         cc.transpose()                    [[7, 10], [8, 11], [9, 12]]
Identity matrix                          numpy.identity(3, int)            [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
                                         numpy.identity(2, float)          [[1., 0.], [0., 1.]]
Vector/matrix magnitude                  numpy.linalg.norm(a)              3.7416573867739413
                                         numpy.linalg.norm(cc)             23.643180835073778
Number of dimensions                     numpy.ndim(a)                     1
                                         numpy.ndim(cc)                    2
Matrix inverse                           numpy.linalg.inv(ff)              [[-1.80143985e+16, 3.60287970e+16, -1.80143985e+16], [3.60287970e+16, -7.20575940e+16, 3.60287970e+16], [-1.80143985e+16, 3.60287970e+16, -1.80143985e+16]]
Matrix determinant                       numpy.linalg.det(ff)              1.6653345369377407e-16
Matrix power                             numpy.linalg.matrix_power(ff, 3)  [[190980, 197784, 204588], [212958, 220545, 228132], [234936, 243306, 251676]]
Matrix eigenvalues                       numpy.linalg.eig(ff)[0]           [8.72064069e+01, -2.06406853e-01, 3.96103694e-15]
Matrix eigenvectors                      numpy.linalg.eig(ff)[1]           [[-0.51593724, -0.72303847, 0.40824829], [-0.57531135, -0.01621046, -0.81649658], [-0.63468545, 0.69061755, 0.40824829]]
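A few of the table's entries can be reproduced with the sketch below. Two points are worth keeping in mind when reading it. First, numpy.ndim() reports the number of dimensions of an array, which is a different quantity from the linear-algebra rank reported by numpy.linalg.matrix_rank(). Second, the rows of ff are evenly spaced (row 0 minus twice row 1 plus row 2 is zero), so ff is singular: its determinant is mathematically zero, and the tiny determinant and enormous inverse entries in the table are floating-point artifacts whose exact values may differ from one system to another.

```python
import numpy

a = numpy.array([1, 2, 3])
cc = numpy.array([[7, 8, 9], [10, 11, 12]])
ee = numpy.array([[19, 20], [21, 22], [23, 24]])
ff = numpy.array([[25, 26, 27], [28, 29, 30], [31, 32, 33]])

print(cc.dot(ee).tolist())                        # [[508, 532], [697, 730]]
print(numpy.linalg.matrix_power(ff, 3).tolist())

print(numpy.ndim(a), numpy.ndim(cc))    # 1 2 (numbers of dimensions)
print(numpy.linalg.matrix_rank(ff))     # 2 (ff is singular, so not 3)
print(numpy.linalg.det(ff))             # very close to zero
```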


An Example

Recall the Vector class defined in vector.py, the Sketch class defined in sketch.py, and the client comparedocuments.py program, all from Section 3.3. The Sketch class's constructor creates a Vector object, and then populates the Vector object such that it profiles given text. The Sketch class's similarTo() method returns a measure of the similarity between two Sketch objects by computing the dot product of the Vector objects defined within the two Sketch objects. The comparedocuments.py client calls the Sketch class's constructor and similarTo() method repeatedly to compute measures of similarity between documents.

As an example of the value of the NumPy library, consider sketchn.py as an alternative to sketch.py. The Sketch class defined in sketchn.py is the same as the one defined in sketch.py, except that it defines its constructor and similarTo() method such that they use NumPy arrays instead of Vector objects. Also consider comparedocumentsn.py as an alternative to comparedocuments.py. The comparedocumentsn.py program is the same as comparedocuments.py, except that it uses the Sketch class defined in sketchn.py instead of the one defined in sketch.py.

It is easy to believe that the comparedocumentsn.py program runs faster than the comparedocuments.py program, because the comparedocumentsn.py program executes NumPy operations only a few times, and each operation involves much computation. Indeed, it does run much faster. This table shows the approximate amount of CPU time consumed when running the two programs on a typical computer:

Command                              Time consumed for dim =
                                     10000       100000      1000000
python comparedocuments.py 5 dim     1.5 sec     2.7 sec     15.3 sec
python comparedocumentsn.py 5 dim    1.4 sec     1.5 sec     2.0 sec

Incidentally, consider this variation of the Sketch constructor:

def __init__(self, text, k, d):
    freq = numpy.zeros(d, float)
    for i in range(len(text) - k):
        kgram = text[i:i+k]
        h = hash(kgram)
        freq[h % d] += 1
    self._sketch = freq / numpy.linalg.norm(freq) # Unit vector

Note that it uses a NumPy array referenced by freq, and that it indexes into freq frequently within its for statement. This version of the constructor executes NumPy operations (array indexing) many times, and each operation involves little computation. So it is easy to believe that this version of the constructor does not benefit from using NumPy. In fact, it runs substantially slower than the constructor shown in sketchn.py. Give it a try!
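One way to avoid indexing into a NumPy array inside the loop is to count in an ordinary Python list and convert to a NumPy array once, at the end. The sketch below illustrates that idea with a hypothetical build_sketch() function; it is an assumption made for illustration and is not claimed to be the code in sketchn.py, which is not reproduced in this appendix.

```python
import numpy


def build_sketch(text, k, d):
    """Return a unit vector of length d that profiles the k-grams of text."""
    counts = [0] * d                      # ordinary Python list
    for i in range(len(text) - k):
        kgram = text[i:i+k]
        counts[hash(kgram) % d] += 1      # no NumPy operation inside the loop
    freq = numpy.array(counts, float)     # cross the boundary once
    return freq / numpy.linalg.norm(freq) # Unit vector


sketch = build_sketch('now is the time for all good people', 2, 16)
print(len(sketch))                 # 16
print(numpy.linalg.norm(sketch))   # approximately 1.0
```

Here the loop performs only list operations, and the two NumPy operations (the conversion and the normalization) each handle the whole array at once.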



Learning More

As noted previously, this appendix describes only NumPy's numeric arrays. Moreover, it describes only a handful of the many operations that can be applied to numeric arrays.

There is much more to NumPy. To learn more we recommend that you consult the online NumPy documentation. The documentation contains both a tutorial and a reference manual. The table of contents of the NumPy reference manual gives you a sense of the scope of the library.

Moreover, the Python community also provides the SciPy library. SciPy is a "Sci"entific extension to the "Py"thon language. SciPy is a client of NumPy; that is, some features of SciPy are clients of some features of NumPy. The root of the online SciPy documentation provides a table of contents that gives you a sense of the scope of the SciPy library.

In general, whenever you are faced with the task of composing a Python program that is heavily mathematical or scientific in nature, we recommend that you consider using NumPy or SciPy. The Python community has invested a large amount of time and effort in the development of NumPy and SciPy. As a result, they are full-featured and robust. It is wise to use them when appropriate.