Appendix E: The NumPy Library
This section is under construction.
An array is an indexed sequence of objects, all of which are of the same type. In Section 1.4, we implemented arrays using the Python list data type: a list object is an indexed sequence of objects, not necessarily of the same type. Using Python lists to implement arrays incurs substantial overhead, both in memory (because Python must associate type information with each element) and in time (because Python must perform a type check when accessing an element). Moreover, it is the programmer's responsibility to enforce the "all elements of the same type" constraint.
Now, we describe an alternative way to represent arrays in Python using the ndarray ("n-dimensional array") data type in the NumPy library: an ndarray object is an indexed sequence of objects, all of which are of the same type, and NumPy enforces the "all elements of the same type" constraint. We use the informal term NumPy array to mean "an object of type ndarray."
Typically the elements of a NumPy array are numbers, such as floats or integers. As a result, there is minimal overhead in terms of memory (because NumPy need only associate type information with the array and not each element). Also, this representation can dramatically speed up certain types of "vectorized" computations because the array elements are stored contiguously in memory.
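As a rough illustration of the memory point, the following sketch inspects the type information that NumPy keeps once per array. The variable names are ours, the numpy.array() function it calls is described later in this appendix, and the byte counts assume the usual 8-byte representation of a float64.

```python
import numpy

# One element type (dtype) is recorded for the whole array, not per element.
temps = numpy.array([18.5, 19.3, 20.1, 21.0], float)
print(temps.dtype)     # float64
print(temps.itemsize)  # 8 (bytes per element)
print(temps.nbytes)    # 32 (4 elements * 8 bytes)

# NumPy enforces the element type: a float assigned into an
# integer array is converted to an integer (the fraction is dropped).
counts = numpy.array([1, 2, 3], int)
counts[0] = 7.9
print(counts[0])       # 7
```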
But there is more to NumPy than numeric arrays: the NumPy library also supports arrays whose elements are booleans and strings, and arrays whose elements are of data types that you define. There also is more to NumPy than arrays: the NumPy library provides many functions that work on "scalar" numeric objects. But the most valuable aspect of NumPy is its ability to manipulate numeric arrays; this appendix describes only that aspect.
When to Use NumPy
When should you use NumPy? One short answer is "when you need the functionality that NumPy provides." Indeed, as described later in this appendix, NumPy provides a rich set of numeric array-handling functions and methods. Generally you should use the pre-defined (and thoroughly tested) functions and methods provided by NumPy instead of defining your own equivalent functions or methods.
Another short answer is "when you need your program to run faster." The NumPy library was not written in Python; instead it was written using the C programming language. Programs written in C typically run more quickly than equivalent programs written in Python. So a Python program that implements arrays as NumPy arrays may run faster than an equivalent Python program that implements arrays as ordinary Python lists.
However, there is more to the story. Since NumPy was written in C, there is a boundary between NumPy code and ordinary Python code. Crossing that boundary is expensive. That is, calling a NumPy function or method from ordinary Python code consumes more time than does calling an ordinary Python function or method. Similarly, returning a value from a NumPy function or method to ordinary Python code consumes more time than does returning a value from an ordinary Python function or method. With that in mind, suppose:
- Program A calls NumPy functions/methods many times, and each function/method call involves little computation.
- Program B calls NumPy functions/methods few times, and each function/method call involves much computation.
In that case Program B probably will benefit from the use of NumPy, but Program A probably will not.
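The contrast between Program A and Program B can be sketched with a small timing experiment. This is only an illustration under our own assumptions: the function names and the array size are hypothetical, and the measured times will vary from machine to machine, so no timings are shown.

```python
import timeit
import numpy

n = 100000
values = numpy.arange(n, dtype=float)   # 0.0, 1.0, ..., n-1

def many_small_calls():
    # Program A style: one NumPy indexing operation per element,
    # so the Python/NumPy boundary is crossed n times.
    total = 0.0
    for i in range(n):
        total += values[i]
    return total

def one_large_call():
    # Program B style: a single call does all of the work inside NumPy.
    return values.sum()

time_a = timeit.timeit(many_small_calls, number=5)
time_b = timeit.timeit(one_large_call, number=5)
print('many small calls: %.4f sec' % time_a)
print('one large call:   %.4f sec' % time_b)
```

Both functions compute the same sum; only the number of boundary crossings differs.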
NumPy Data Types
Since the NumPy library was written in the C programming language, its fundamental data types are, for the most part, those of C. The following table (taken from the online NumPy documentation) lists the NumPy fundamental data types:
Data Type     Description
bool_         Boolean (True or False) stored using 8 bits
int_          Default integer type (same as C long; normally either int64 or int32)
intc          Identical to C int (normally int32 or int64)
intp          Integer used for indexing (same as C ssize_t; normally either int32 or int64)
int8          Byte (-128 to 127)
int16         Integer (-32768 to 32767)
int32         Integer (-2147483648 to 2147483647)
int64         Integer (-9223372036854775808 to 9223372036854775807)
uint8         Unsigned integer (0 to 255)
uint16        Unsigned integer (0 to 65535)
uint32        Unsigned integer (0 to 4294967295)
uint64        Unsigned integer (0 to 18446744073709551615)
float_        Shorthand for float64
float16       Half precision float: sign bit, 5 bits exponent, 10 bits mantissa
float32       Single precision float: sign bit, 8 bits exponent, 23 bits mantissa
float64       Double precision float: sign bit, 11 bits exponent, 52 bits mantissa
complex_      Shorthand for complex128
complex64     Complex number, represented by two 32-bit floats (real and imaginary components)
complex128    Complex number, represented by two 64-bit floats (real and imaginary components)
When you create a NumPy array, you specify the type of the array's elements. Normally you specify the element data type as a Python data type: int, float, bool, or complex. The Python int data type maps to the NumPy int_ data type. So if you create a NumPy array with elements of data type int, then internally within NumPy its elements are of type int_. Similarly, the Python float data type maps to the NumPy float_ data type, the Python bool data type maps to the NumPy bool_ data type, and the Python complex data type maps to the NumPy complex_ data type.
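The mapping can be observed through an array's dtype attribute, as in the sketch below. The variable names are ours. The sketch compares against float64 and complex128 directly because, as far as we know, the shorthand names float_ and complex_ are not available in recent NumPy releases. The width of the integer type is platform-dependent, so the sketch checks only that it is a signed integer type.

```python
import numpy

ints = numpy.array([1, 2, 3], int)
floats = numpy.array([1, 2, 3], float)
bools = numpy.array([1, 0, 1], bool)
complexes = numpy.array([1, 2, 3], complex)

print(ints.dtype.kind)   # i (a signed integer type; int32 or int64)
print(floats.dtype)      # float64
print(bools.dtype)       # bool
print(complexes.dtype)   # complex128
```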
Usually you need not be concerned about the distinction between Python data types and NumPy data types. When an object of a Python data type is sent "across the boundary" to NumPy, the object automatically is converted to the appropriate NumPy data type. Conversely, when NumPy sends an object of a NumPy data type back to Python, and when that object is used in a context that demands an object of a Python data type, the object automatically is converted to the appropriate Python data type. Throughout this appendix we ignore the distinction between Python and NumPy data types.
However, the distinction between Python and NumPy data types can be important in programs that manipulate large-magnitude integers, that is, integers whose absolute values are large. Whereas Python int objects have unlimited range, the range of NumPy int_ objects is limited. A NumPy int_ object has range -2147483648 to 2147483647 (that is, -2^31 to 2^31 - 1) on systems that store integers using 32 binary digits, and -9223372036854775808 to 9223372036854775807 (that is, -2^63 to 2^63 - 1) on systems that store integers using 64 binary digits.
So in NumPy it is possible for an expression to evaluate to an integer that is outside of the range that NumPy can store. When such an overflow occurs, NumPy evaluates the expression to an integer that is mathematically incorrect. Beware when manipulating large-magnitude integers in NumPy.
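A minimal sketch of such an overflow follows. To keep the behavior the same on every platform, it requests the int64 element type explicitly instead of relying on the default integer type; the variable names are ours.

```python
import numpy

largest = 9223372036854775807     # 2**63 - 1, the largest int64 value

# Python int objects have unlimited range, so this sum is exact.
python_result = largest + 1

# The same sum on an int64 array silently wraps around.
numpy_result = numpy.array([largest], numpy.int64) + 1

print(python_result)      # 9223372036854775808
print(numpy_result[0])    # -9223372036854775808
```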
NumPy Array Fundamentals
To use the NumPy library, include the statement import numpy near the beginning of your program. Then to create a NumPy array, call the numpy.array() function, specifying a Python list as the first argument and a Python data type as the second argument. For example, this statement:
a = numpy.array([18, 19, 20, 21], int)
creates a one-dimensional NumPy array containing integers 18, 19, 20, and 21, and this statement:
b = numpy.array([[18.5, 19.3], [20.1, 21.0], [23.7, 24.9]], float)
creates a two-dimensional NumPy array of floats having three rows and two columns. If you omit the second argument to numpy.array(), then the function infers the desired element type by examining the types of the values provided in the first argument.
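The following sketch shows the inference at work when the second argument is omitted. The array named c and the use of the shape and ndim attributes are our additions for illustration.

```python
import numpy

a = numpy.array([18, 19, 20, 21])    # all ints: an integer element type
c = numpy.array([18, 19.5, 20])      # ints mixed with a float: float64
b = numpy.array([[18.5, 19.3], [20.1, 21.0], [23.7, 24.9]])

print(a.dtype.kind)  # i (a signed integer type; the width is platform-dependent)
print(c.dtype)       # float64
print(b.dtype)       # float64
print(b.shape)       # (3, 2), that is, three rows and two columns
print(b.ndim)        # 2
```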
To convert a NumPy array to a Python list, call the tolist() method. For example, the expression a.tolist() evaluates to [18, 19, 20, 21], and b.tolist() evaluates to [[18.5, 19.3], [20.1, 21.0], [23.7, 24.9]].
You can reference an element of a one-dimensional NumPy array, just as you can reference an element of a Python list, by specifying an index enclosed within square brackets. For example, a[1] evaluates to 19. To reference an element of a two-dimensional NumPy array, specify the indices within square brackets, separated by commas. For example, b[1, 0] evaluates to 20.1. Note that the syntax for referencing an element of a NumPy two-dimensional array differs from the syntax for referencing an element of a list of lists. (Recall that you would use the expression b[1][0] if b referenced a Python list of lists.)
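The difference between the two indexing syntaxes can be checked directly. In this sketch the names rows and list_accepts_index_pair are ours. The chained form b[1][0] also happens to work on a NumPy array, because b[1] is itself a (one-dimensional) NumPy array.

```python
import numpy

rows = [[18.5, 19.3], [20.1, 21.0], [23.7, 24.9]]
b = numpy.array(rows, float)

print(b[1, 0])      # 20.1 (NumPy two-index syntax)
print(b[1][0])      # 20.1 (index the row, then the column)
print(rows[1][0])   # 20.1 (the form that a list of lists supports)

# A Python list rejects the comma-separated form.
try:
    rows[1, 0]
    list_accepts_index_pair = True
except TypeError:
    list_accepts_index_pair = False
print(list_accepts_index_pair)   # False
```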
Iteration over NumPy arrays works as expected:
for element in a:
    stdio.writeln(element)
for row in b:
    for element in row:
        stdio.writeln(element)
Slicing a NumPy array also is straightforward. For example, a[1:3] evaluates to the NumPy array [19, 20]. However, slicing a NumPy array does not generate a copy of the array. For example, this statement:
e = a[1:3]
causes e to reference a subarray of the array referenced by a that is not distinct from a. Assigning some value to e[0] would change both e[0] and a[1]. Accordingly, the expression a[:] does not create a copy of the NumPy array referenced by a. Instead, to make a copy of a NumPy array you must call the copy() method:
e = a.copy()
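The following sketch demonstrates the difference between a slice and a copy. The names view and duplicate are ours, and numpy.shares_memory() is used only to confirm which arrays share storage.

```python
import numpy

a = numpy.array([18, 19, 20, 21], int)

view = a[1:3]            # a slice shares storage with a
view[0] = 99             # so this assignment also changes a[1]
print(a.tolist())        # [18, 99, 20, 21]

duplicate = a.copy()     # a copy has its own storage
duplicate[0] = -1        # so this assignment leaves a unchanged
print(a[0])              # 18

print(numpy.shares_memory(a, view))        # True
print(numpy.shares_memory(a, duplicate))   # False
```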
Comparison of two NumPy arrays performs comparison of corresponding elements. Unlike comparison of two Python lists, comparison of two NumPy arrays yields another NumPy array containing booleans. For example, these statements:
f = numpy.array([1, 2, 3])
g = numpy.array([3, 2, 1])
h = f < g
generate and assign to h the NumPy array [True, False, False].
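Because a comparison yields an array of booleans and not a single boolean, testing whether two arrays are equal as a whole takes one more step. The sketch below shows two ways that we believe are commonly used, the all() method and the numpy.array_equal() function; neither is discussed elsewhere in this appendix.

```python
import numpy

f = numpy.array([1, 2, 3])
g = numpy.array([3, 2, 1])

h = f < g
print(h.tolist())               # [True, False, False]

# Elementwise equality, then a single boolean for the whole array.
print((f == g).tolist())        # [False, True, False]
print(bool((f == g).all()))     # False
print(numpy.array_equal(f, g))  # False
print(numpy.array_equal(f, f))  # True
```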
NumPy Array Operations
The NumPy library defines many operators, methods, and functions to manipulate arrays. The following table shows examples of some from the field of linear algebra. The examples use these NumPy arrays:
a = numpy.array([1, 2, 3])
b = numpy.array([4, 5, 6])
cc = numpy.array([[7, 8, 9], [10, 11, 12]])
dd = numpy.array([[13, 14, 15], [16, 17, 18]])
ee = numpy.array([[19, 20], [21, 22], [23, 24]])
ff = numpy.array([[25, 26, 27], [28, 29, 30], [31, 32, 33]])
In many cases the same operation can be performed using an operator, a method call, or a function call. For example, both of these expressions compute the memberwise sum of two arrays:
a + b             # Using an operator
numpy.add(a, b)   # Using a function call
and both of these expressions compute the dot product of two arrays:
a.dot(b)          # Using a method call
numpy.dot(a, b)   # Using a function call
In this appendix we use an operator if it is available, we use a method call only if an operator is unavailable, and we use a function call only if neither an operator nor a method call is available.
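As a quick check that the alternative forms agree, consider the sketch below. Its last line uses the @ matrix-multiplication operator, which assumes Python 3.5 or later and a NumPy version that supports it; this appendix does not otherwise use that operator.

```python
import numpy

a = numpy.array([1, 2, 3])
b = numpy.array([4, 5, 6])

print((a + b).tolist())           # [5, 7, 9]
print(numpy.add(a, b).tolist())   # [5, 7, 9]

print(a.dot(b))                   # 32
print(numpy.dot(a, b))            # 32
print(a @ b)                      # 32
```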
Operation
    Example                            Result
Zeros array
    numpy.zeros(3, int)                [0, 0, 0]
    numpy.zeros(4, float)              [0.0, 0.0, 0.0, 0.0]
Ones array
    numpy.ones(3, int)                 [1, 1, 1]
    numpy.ones(4, float)               [1.0, 1.0, 1.0, 1.0]
Scalar addition
    5 + a                              [6, 7, 8]
    a + 5                              [6, 7, 8]
    5 + cc                             [[12, 13, 14], [15, 16, 17]]
    cc + 5                             [[12, 13, 14], [15, 16, 17]]
Scalar subtraction
    5 - a                              [4, 3, 2]
    a - 5                              [-4, -3, -2]
    5 - cc                             [[-2, -3, -4], [-5, -6, -7]]
    cc - 5                             [[2, 3, 4], [5, 6, 7]]
Scalar multiplication
    5 * a                              [5, 10, 15]
    a * 5                              [5, 10, 15]
    5 * cc                             [[35, 40, 45], [50, 55, 60]]
    cc * 5                             [[35, 40, 45], [50, 55, 60]]
Scalar division
    5 / a                              [5., 2.5, 1.66666667]
    a / 5                              [0.2, 0.4, 0.6]
    5 / cc                             [[0.71428571, 0.625, 0.55555556], [0.5, 0.45454545, 0.41666667]]
    cc / 5                             [[1.4, 1.6, 1.8], [2., 2.2, 2.4]]
Vector/matrix addition
    a + b                              [5, 7, 9]
    cc + dd                            [[20, 22, 24], [26, 28, 30]]
Vector/matrix subtraction
    a - b                              [-3, -3, -3]
    cc - dd                            [[-6, -6, -6], [-6, -6, -6]]
Vector/matrix memberwise multiplication
    a * b                              [4, 10, 18]
    cc * dd                            [[91, 112, 135], [160, 187, 216]]
Vector/matrix memberwise division
    a / b                              [0.25, 0.4, 0.5]
    cc / dd                            [[0.53846154, 0.57142857, 0.6], [0.625, 0.64705882, 0.66666667]]
Vector/matrix multiplication (dot product)
    a.dot(b)                           32
    cc.dot(ee)                         [[508, 532], [697, 730]]
Matrix transpose
    cc.transpose()                     [[7, 10], [8, 11], [9, 12]]
Identity matrix
    numpy.identity(3, int)             [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    numpy.identity(2, float)           [[1., 0.], [0., 1.]]
Vector/matrix magnitude
    numpy.linalg.norm(a)               3.7416573867739413
    numpy.linalg.norm(cc)              23.643180835073778
Number of dimensions (array rank)
    numpy.ndim(a)                      1
    numpy.ndim(cc)                     2
Matrix inverse
    numpy.linalg.inv(ff)               [[-1.80143985e+16, 3.60287970e+16, -1.80143985e+16], [3.60287970e+16, -7.20575940e+16, 3.60287970e+16], [-1.80143985e+16, 3.60287970e+16, -1.80143985e+16]]
Matrix determinant
    numpy.linalg.det(ff)               1.6653345369377407e-16
Matrix power
    numpy.linalg.matrix_power(ff, 3)   [[190980, 197784, 204588], [212958, 220545, 228132], [234936, 243306, 251676]]
Matrix eigenvalues
    numpy.linalg.eig(ff)[0]            [8.72064069e+01, -2.06406853e-01, 3.96103694e-15]
Matrix eigenvectors
    numpy.linalg.eig(ff)[1]            [[-0.51593724, -0.72303847, 0.40824829], [-0.57531135, -0.01621046, -0.81649658], [-0.63468545, 0.69061755, 0.40824829]]
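A caution about the inverse and determinant entries for ff: consecutive rows of ff differ by the same vector [3, 3, 3], so ff is singular. Its determinant is mathematically zero and it has no true inverse. The tiny determinant and the enormous inverse entries are therefore artifacts of floating-point round-off, and they may differ (or the inverse computation may fail) on another system. The sketch below checks this. Its call to numpy.linalg.matrix_rank(), which computes the linear-algebra rank as opposed to the number of dimensions, is our addition.

```python
import numpy

cc = numpy.array([[7, 8, 9], [10, 11, 12]])
ee = numpy.array([[19, 20], [21, 22], [23, 24]])
ff = numpy.array([[25, 26, 27], [28, 29, 30], [31, 32, 33]])

# Spot-check one entry of the table.
print(cc.dot(ee).tolist())               # [[508, 532], [697, 730]]

# Number of dimensions versus linear-algebra rank.
print(numpy.ndim(ff))                    # 2
print(numpy.linalg.matrix_rank(ff))      # 2 (less than 3, so ff is singular)

# The determinant is zero up to round-off error.
print(abs(numpy.linalg.det(ff)) < 1e-6)  # True
```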
An Example
Recall the Vector class defined in vector.py, the Sketch class defined in sketch.py, and the client comparedocuments.py program, all from Section 3.3. The Sketch class's constructor creates a Vector object, and then populates the Vector object such that it profiles given text. The Sketch class's similarTo() method returns a measure of the similarity between two Sketch objects by computing the dot product of the Vector objects defined within the two Sketch objects. The comparedocuments.py client calls the Sketch class's constructor and similarTo() method repeatedly to compute measures of similarity between documents.
As an example of the value of the NumPy library, consider sketchn.py as an alternative to sketch.py. The Sketch class defined in sketchn.py is the same as the one defined in sketch.py, except that it defines its constructor and similarTo() method such that they use NumPy arrays instead of Vector objects. Also consider comparedocumentsn.py as an alternative to comparedocuments.py. The comparedocumentsn.py program is the same as comparedocuments.py, except that it uses the Sketch class defined in sketchn.py instead of the one defined in sketch.py.
It is easy to believe that the comparedocumentsn.py program runs faster than the comparedocuments.py program, because the comparedocumentsn.py program executes NumPy operations few times, and each operation involves much computation. Indeed it does run much faster. This table shows the approximate amount of CPU time consumed when running the two programs on a typical computer:
Command                              Time consumed for dim =
                                     10000       100000      1000000
python comparedocuments.py 5 dim     1.5 sec     2.7 sec     15.3 sec
python comparedocumentsn.py 5 dim    1.4 sec     1.5 sec     2.0 sec
Incidentally, consider this variation of the Sketch constructor:
def __init__(self, text, k, d):
    freq = numpy.zeros(d, float)
    for i in range(len(text) - k):
        kgram = text[i:i+k]
        h = hash(kgram)
        freq[h % d] += 1
    self._sketch = freq / numpy.linalg.norm(freq)  # Unit vector
Note that it uses a NumPy array referenced by freq, and that it indexes into freq frequently within its for statement. This version of the constructor executes NumPy operations (array indexing) many times, and each operation involves little computation. So it is easy to believe that this version of the constructor does not benefit from using NumPy. In fact, it runs substantially slower than the constructor shown in sketchn.py. Give it a try!
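One way to avoid the per-element NumPy indexing is to accumulate the counts in an ordinary Python list and convert to a NumPy array only once, after the loop. The sketch below illustrates that idea. It is not the code from sketchn.py; the function name sketch_vector and the sample text are hypothetical.

```python
import numpy

def sketch_vector(text, k, d):
    # Hypothetical helper for illustration; not the sketchn.py constructor.
    counts = [0] * d                        # ordinary Python list
    for i in range(len(text) - k):
        kgram = text[i:i+k]
        counts[hash(kgram) % d] += 1        # no NumPy operation inside the loop
    freq = numpy.array(counts, float)       # cross the boundary once
    return freq / numpy.linalg.norm(freq)   # unit vector

unit = sketch_vector('now is the time for all good people', 2, 16)
print(len(unit))                 # 16
print(numpy.linalg.norm(unit))   # approximately 1.0
```

The loop now performs only list operations, and NumPy is used for the two whole-array steps (conversion and normalization), which matches the "few calls, much computation per call" pattern described earlier.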
Learning More
As noted previously, this appendix describes only NumPy's numeric arrays. Moreover, it describes only a handful of the many operations that can be applied to numeric arrays.
There is much more to NumPy. To learn more we recommend that you consult the online NumPy documentation. The documentation contains both a tutorial and a reference manual. The table of contents of the NumPy reference manual gives you a sense of the scope of the library.
Moreover the Python community also provides the SciPy library. SciPy is a "Sci"entific extension to the "Py"thon language. SciPy is a client of NumPy; that is, some features of SciPy are clients of some features of NumPy. The root of the online SciPy documentation provides a table of contents that gives you a sense of the scope of the SciPy library.
In general, whenever you are faced with the task of composing a Python program that is heavily mathematical or scientific in nature, we recommend that you consider using NumPy or SciPy. The Python community has invested a large amount of time and effort in the development of NumPy and SciPy. As a result, they are full-featured and robust. It is wise to use them when appropriate.