Numpy and Matplotlib

This lecture will introduce NumPy and Matplotlib. These are two of the most fundamental parts of the scientific python “ecosystem”. Most everything else is built on top of them.

Numpy: The fundamental package for scientific computing with Python

Matlotlib: Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.

Importing and Examining a New Package

This will be our first experience with importing a package which is not part of the Python standard library.

import numpy as np

What did we just do? We imported a package. This brings new variables (mostly functions) into our interpreter. We access them as follows.

# find out what is in our namespace
dir()
['In',
 'Out',
 '_',
 '__',
 '___',
 '__builtin__',
 '__builtins__',
 '__doc__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 '_dh',
 '_i',
 '_i1',
 '_i2',
 '_ih',
 '_ii',
 '_iii',
 '_oh',
 'exit',
 'get_ipython',
 'np',
 'quit']
# find out what's in numpy
dir(np)
['ALLOW_THREADS',
 'AxisError',
 'BUFSIZE',
 'Bytes0',
 'CLIP',
 'ComplexWarning',
 'DataSource',
 'Datetime64',
 'ERR_CALL',
 'ERR_DEFAULT',
 'ERR_IGNORE',
 'ERR_LOG',
 'ERR_PRINT',
 'ERR_RAISE',
 'ERR_WARN',
 'FLOATING_POINT_SUPPORT',
 'FPE_DIVIDEBYZERO',
 'FPE_INVALID',
 'FPE_OVERFLOW',
 'FPE_UNDERFLOW',
 'False_',
 'Inf',
 'Infinity',
 'MAXDIMS',
 'MAY_SHARE_BOUNDS',
 'MAY_SHARE_EXACT',
 'MachAr',
 'ModuleDeprecationWarning',
 'NAN',
 'NINF',
 'NZERO',
 'NaN',
 'PINF',
 'PZERO',
 'RAISE',
 'RankWarning',
 'SHIFT_DIVIDEBYZERO',
 'SHIFT_INVALID',
 'SHIFT_OVERFLOW',
 'SHIFT_UNDERFLOW',
 'ScalarType',
 'Str0',
 'Tester',
 'TooHardError',
 'True_',
 'UFUNC_BUFSIZE_DEFAULT',
 'UFUNC_PYVALS_NAME',
 'Uint64',
 'VisibleDeprecationWarning',
 'WRAP',
 '_NoValue',
 '_UFUNC_API',
 '__NUMPY_SETUP__',
 '__all__',
 '__builtins__',
 '__cached__',
 '__config__',
 '__deprecated_attrs__',
 '__dir__',
 '__doc__',
 '__expired_functions__',
 '__file__',
 '__getattr__',
 '__git_version__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 '_add_newdoc_ufunc',
 '_distributor_init',
 '_financial_names',
 '_globals',
 '_mat',
 '_pytesttester',
 '_version',
 'abs',
 'absolute',
 'add',
 'add_docstring',
 'add_newdoc',
 'add_newdoc_ufunc',
 'alen',
 'all',
 'allclose',
 'alltrue',
 'amax',
 'amin',
 'angle',
 'any',
 'append',
 'apply_along_axis',
 'apply_over_axes',
 'arange',
 'arccos',
 'arccosh',
 'arcsin',
 'arcsinh',
 'arctan',
 'arctan2',
 'arctanh',
 'argmax',
 'argmin',
 'argpartition',
 'argsort',
 'argwhere',
 'around',
 'array',
 'array2string',
 'array_equal',
 'array_equiv',
 'array_repr',
 'array_split',
 'array_str',
 'asanyarray',
 'asarray',
 'asarray_chkfinite',
 'ascontiguousarray',
 'asfarray',
 'asfortranarray',
 'asmatrix',
 'asscalar',
 'atleast_1d',
 'atleast_2d',
 'atleast_3d',
 'average',
 'bartlett',
 'base_repr',
 'binary_repr',
 'bincount',
 'bitwise_and',
 'bitwise_not',
 'bitwise_or',
 'bitwise_xor',
 'blackman',
 'block',
 'bmat',
 'bool8',
 'bool_',
 'broadcast',
 'broadcast_arrays',
 'broadcast_shapes',
 'broadcast_to',
 'busday_count',
 'busday_offset',
 'busdaycalendar',
 'byte',
 'byte_bounds',
 'bytes0',
 'bytes_',
 'c_',
 'can_cast',
 'cast',
 'cbrt',
 'cdouble',
 'ceil',
 'cfloat',
 'char',
 'character',
 'chararray',
 'choose',
 'clip',
 'clongdouble',
 'clongfloat',
 'column_stack',
 'common_type',
 'compare_chararrays',
 'compat',
 'complex128',
 'complex256',
 'complex64',
 'complex_',
 'complexfloating',
 'compress',
 'concatenate',
 'conj',
 'conjugate',
 'convolve',
 'copy',
 'copysign',
 'copyto',
 'core',
 'corrcoef',
 'correlate',
 'cos',
 'cosh',
 'count_nonzero',
 'cov',
 'cross',
 'csingle',
 'ctypeslib',
 'cumprod',
 'cumproduct',
 'cumsum',
 'datetime64',
 'datetime_as_string',
 'datetime_data',
 'deg2rad',
 'degrees',
 'delete',
 'deprecate',
 'deprecate_with_doc',
 'diag',
 'diag_indices',
 'diag_indices_from',
 'diagflat',
 'diagonal',
 'diff',
 'digitize',
 'disp',
 'divide',
 'divmod',
 'dot',
 'double',
 'dsplit',
 'dstack',
 'dtype',
 'e',
 'ediff1d',
 'einsum',
 'einsum_path',
 'emath',
 'empty',
 'empty_like',
 'equal',
 'errstate',
 'euler_gamma',
 'exp',
 'exp2',
 'expand_dims',
 'expm1',
 'extract',
 'eye',
 'fabs',
 'fastCopyAndTranspose',
 'fft',
 'fill_diagonal',
 'find_common_type',
 'finfo',
 'fix',
 'flatiter',
 'flatnonzero',
 'flexible',
 'flip',
 'fliplr',
 'flipud',
 'float128',
 'float16',
 'float32',
 'float64',
 'float_',
 'float_power',
 'floating',
 'floor',
 'floor_divide',
 'fmax',
 'fmin',
 'fmod',
 'format_float_positional',
 'format_float_scientific',
 'format_parser',
 'frexp',
 'frombuffer',
 'fromfile',
 'fromfunction',
 'fromiter',
 'frompyfunc',
 'fromregex',
 'fromstring',
 'full',
 'full_like',
 'gcd',
 'generic',
 'genfromtxt',
 'geomspace',
 'get_array_wrap',
 'get_include',
 'get_printoptions',
 'getbufsize',
 'geterr',
 'geterrcall',
 'geterrobj',
 'gradient',
 'greater',
 'greater_equal',
 'half',
 'hamming',
 'hanning',
 'heaviside',
 'histogram',
 'histogram2d',
 'histogram_bin_edges',
 'histogramdd',
 'hsplit',
 'hstack',
 'hypot',
 'i0',
 'identity',
 'iinfo',
 'imag',
 'in1d',
 'index_exp',
 'indices',
 'inexact',
 'inf',
 'info',
 'infty',
 'inner',
 'insert',
 'int0',
 'int16',
 'int32',
 'int64',
 'int8',
 'int_',
 'intc',
 'integer',
 'interp',
 'intersect1d',
 'intp',
 'invert',
 'is_busday',
 'isclose',
 'iscomplex',
 'iscomplexobj',
 'isfinite',
 'isfortran',
 'isin',
 'isinf',
 'isnan',
 'isnat',
 'isneginf',
 'isposinf',
 'isreal',
 'isrealobj',
 'isscalar',
 'issctype',
 'issubclass_',
 'issubdtype',
 'issubsctype',
 'iterable',
 'ix_',
 'kaiser',
 'kernel_version',
 'kron',
 'lcm',
 'ldexp',
 'left_shift',
 'less',
 'less_equal',
 'lexsort',
 'lib',
 'linalg',
 'linspace',
 'little_endian',
 'load',
 'loads',
 'loadtxt',
 'log',
 'log10',
 'log1p',
 'log2',
 'logaddexp',
 'logaddexp2',
 'logical_and',
 'logical_not',
 'logical_or',
 'logical_xor',
 'logspace',
 'longcomplex',
 'longdouble',
 'longfloat',
 'longlong',
 'lookfor',
 'ma',
 'mafromtxt',
 'mask_indices',
 'mat',
 'math',
 'matmul',
 'matrix',
 'matrixlib',
 'max',
 'maximum',
 'maximum_sctype',
 'may_share_memory',
 'mean',
 'median',
 'memmap',
 'meshgrid',
 'mgrid',
 'min',
 'min_scalar_type',
 'minimum',
 'mintypecode',
 'mod',
 'modf',
 'moveaxis',
 'msort',
 'multiply',
 'nan',
 'nan_to_num',
 'nanargmax',
 'nanargmin',
 'nancumprod',
 'nancumsum',
 'nanmax',
 'nanmean',
 'nanmedian',
 'nanmin',
 'nanpercentile',
 'nanprod',
 'nanquantile',
 'nanstd',
 'nansum',
 'nanvar',
 'nbytes',
 'ndarray',
 'ndenumerate',
 'ndfromtxt',
 'ndim',
 'ndindex',
 'nditer',
 'negative',
 'nested_iters',
 'newaxis',
 'nextafter',
 'nonzero',
 'not_equal',
 'numarray',
 'number',
 'obj2sctype',
 'object0',
 'object_',
 'ogrid',
 'oldnumeric',
 'ones',
 'ones_like',
 'os',
 'outer',
 'packbits',
 'pad',
 'partition',
 'percentile',
 'pi',
 'piecewise',
 'place',
 'poly',
 'poly1d',
 'polyadd',
 'polyder',
 'polydiv',
 'polyfit',
 'polyint',
 'polymul',
 'polynomial',
 'polysub',
 'polyval',
 'positive',
 'power',
 'printoptions',
 'prod',
 'product',
 'promote_types',
 'ptp',
 'put',
 'put_along_axis',
 'putmask',
 'quantile',
 'r_',
 'rad2deg',
 'radians',
 'random',
 'ravel',
 'ravel_multi_index',
 'real',
 'real_if_close',
 'rec',
 'recarray',
 'recfromcsv',
 'recfromtxt',
 'reciprocal',
 'record',
 'remainder',
 'repeat',
 'require',
 'reshape',
 'resize',
 'result_type',
 'right_shift',
 'rint',
 'roll',
 'rollaxis',
 'roots',
 'rot90',
 'round',
 'round_',
 'row_stack',
 's_',
 'safe_eval',
 'save',
 'savetxt',
 'savez',
 'savez_compressed',
 'sctype2char',
 'sctypeDict',
 'sctypes',
 'searchsorted',
 'select',
 'set_numeric_ops',
 'set_printoptions',
 'set_string_function',
 'setbufsize',
 'setdiff1d',
 'seterr',
 'seterrcall',
 'seterrobj',
 'setxor1d',
 'shape',
 'shares_memory',
 'short',
 'show_config',
 'sign',
 'signbit',
 'signedinteger',
 'sin',
 'sinc',
 'single',
 'singlecomplex',
 'sinh',
 'size',
 'sometrue',
 'sort',
 'sort_complex',
 'source',
 'spacing',
 'split',
 'sqrt',
 'square',
 'squeeze',
 'stack',
 'std',
 'str0',
 'str_',
 'string_',
 'subtract',
 'sum',
 'swapaxes',
 'sys',
 'take',
 'take_along_axis',
 'tan',
 'tanh',
 'tensordot',
 'test',
 'testing',
 'tile',
 'timedelta64',
 'trace',
 'tracemalloc_domain',
 'transpose',
 'trapz',
 'tri',
 'tril',
 'tril_indices',
 'tril_indices_from',
 'trim_zeros',
 'triu',
 'triu_indices',
 'triu_indices_from',
 'true_divide',
 'trunc',
 'typecodes',
 'typename',
 'ubyte',
 'ufunc',
 'uint',
 'uint0',
 'uint16',
 'uint32',
 'uint64',
 'uint8',
 'uintc',
 'uintp',
 'ulonglong',
 'unicode_',
 'union1d',
 'unique',
 'unpackbits',
 'unravel_index',
 'unsignedinteger',
 'unwrap',
 'use_hugepage',
 'ushort',
 'vander',
 'var',
 'vdot',
 'vectorize',
 'version',
 'void',
 'void0',
 'vsplit',
 'vstack',
 'warnings',
 'where',
 'who',
 'zeros',
 'zeros_like']
# find out what version we have
np.__version__
'1.21.1'

There is no way we could explicitly teach each of these functions. The numpy documentation is crucial!

https://numpy.org/doc/stable/reference/

NDArrays

The core class is the numpy ndarray (n-dimensional array).

The main difference between a numpy array an a more general data container such as list are the following:

  • Numpy arrays can have N dimensions (while lists, tuples, etc. only have 1)

  • Numpy arrays hold values of the same datatype (e.g. int, float), while lists can contain anything.

  • Numpy optimizes numerical operations on arrays. Numpy is fast!

from IPython.display import Image
Image(url='http://docs.scipy.org/doc/numpy/_images/threefundamental.png')
# create an array from a list
a = np.array([9,0,2,1,0])
# find out the datatype
a.dtype
dtype('int64')
# find out the shape
a.shape
(5,)
# what is the shape
type(a.shape)
tuple
# another array with a different datatype and shape
b = np.array([[5,3,1,9],[9,2,3,0]], dtype=np.float64)

# check dtype and shape
b.dtype, b.shape
(dtype('float64'), (2, 4))

Note

The fastest varying dimension is the last dimension! The outer level of the hierarchy is the first dimension. (This is called “c-style” indexing)

Array Creation

There are lots of ways to create arrays.

# create some uniform arrays
c = np.zeros((9,9))
d = np.ones((3,6,3), dtype=np.complex128)
e = np.full((3,3), np.pi)
e = np.ones_like(c)
f = np.zeros_like(d)

arange works very similar to range, but it populates the array “eagerly” (i.e. immediately), rather than generating the values upon iteration.

np.arange(10)
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

arange is left inclusive, right exclusive, just like range, but also works with floating-point numbers.

np.arange(2,4,0.25)
array([2.  , 2.25, 2.5 , 2.75, 3.  , 3.25, 3.5 , 3.75])

A frequent need is to generate an array of N numbers, evenly spaced between two values. That is what linspace is for.

np.linspace(2,4,20)
array([2.        , 2.10526316, 2.21052632, 2.31578947, 2.42105263,
       2.52631579, 2.63157895, 2.73684211, 2.84210526, 2.94736842,
       3.05263158, 3.15789474, 3.26315789, 3.36842105, 3.47368421,
       3.57894737, 3.68421053, 3.78947368, 3.89473684, 4.        ])
# log spaced
np.logspace(1,2,10)
array([ 10.        ,  12.91549665,  16.68100537,  21.5443469 ,
        27.82559402,  35.93813664,  46.41588834,  59.94842503,
        77.42636827, 100.        ])

Numpy also has some utilities for helping us generate multi-dimensional arrays. meshgrid creates 2D arrays out of a combination of 1D arrays.

x = np.linspace(-2*np.pi, 2*np.pi, 100)
y = np.linspace(-np.pi, np.pi, 50)
xx, yy = np.meshgrid(x, y)
xx.shape, yy.shape
((50, 100), (50, 100))

Indexing

Basic indexing is similar to lists

# get some individual elements of xx
xx[0,0], xx[-1,-1], xx[3,-5]
(-6.283185307179586, 6.283185307179586, 5.775453161144872)
# get some whole rows and columns
xx[0].shape, xx[:,-1].shape
((100,), (50,))
# get some ranges
xx[3:10,30:40].shape
(7, 10)

There are many advanced ways to index arrays. You can read about them in the manual. Here is one example.

# use a boolean array as an index
idx = xx<0
yy[idx].shape
(2500,)
# the array got flattened
xx.ravel().shape
(5000,)

Visualizing Arrays with Matplotlib

It can be hard to work with big arrays without actually seeing anything with our eyes! We will now bring in Matplotib to start visualizating these arrays. For now we will just skim the surface of Matplotlib. Much more depth will be provided in the next chapter.

from matplotlib import pyplot as plt

For plotting a 1D array as a line, we use the plot command.

plt.plot(x)
[<matplotlib.lines.Line2D at 0x7efecaf6f3a0>]
../../_images/numpy_and_matplotlib_35_1.png

There are many ways to visualize 2D data. He we use pcolormesh.

plt.pcolormesh(xx)
<matplotlib.collections.QuadMesh at 0x7efecae663a0>
../../_images/numpy_and_matplotlib_37_1.png
plt.pcolormesh(yy)
<matplotlib.collections.QuadMesh at 0x7efecadce9a0>
../../_images/numpy_and_matplotlib_38_1.png

Array Operations

There are a huge number of operations available on arrays. All the familiar arithemtic operators are applied on an element-by-element basis.

Basic Math

f = np.sin(xx) * np.cos(0.5*yy)
plt.pcolormesh(f)
<matplotlib.collections.QuadMesh at 0x7efecada8df0>
../../_images/numpy_and_matplotlib_41_1.png

Manipulating array dimensions

Swapping the dimension order is accomplished by calling transpose.

f_transposed = f.transpose()
plt.pcolormesh(f_transposed)
<matplotlib.collections.QuadMesh at 0x7efecad14130>
../../_images/numpy_and_matplotlib_44_1.png

We can also manually change the shape of an array…as long as the new shape has the same number of elements.

g = np.reshape(f, (8,9))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_442/3844134479.py in <module>
----> 1 g = np.reshape(f, (8,9))

<__array_function__ internals> in reshape(*args, **kwargs)

/srv/conda/envs/notebook/lib/python3.8/site-packages/numpy/core/fromnumeric.py in reshape(a, newshape, order)
    296            [5, 6]])
    297     """
--> 298     return _wrapfunc(a, 'reshape', newshape, order=order)
    299 
    300 

/srv/conda/envs/notebook/lib/python3.8/site-packages/numpy/core/fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
     55 
     56     try:
---> 57         return bound(*args, **kwds)
     58     except TypeError:
     59         # A TypeError occurs if the object does have such a method in its

ValueError: cannot reshape array of size 5000 into shape (8,9)

However, be careful with reshapeing data! You can accidentally lose the structure of the data.

g = np.reshape(f, (200,25))
plt.pcolormesh(g)
<matplotlib.collections.QuadMesh at 0x7efecac11c70>
../../_images/numpy_and_matplotlib_48_1.png

We can also “tile” an array to repeat it many times.

f_tiled = np.tile(f,(3, 2))
plt.pcolormesh(f_tiled)
<matplotlib.collections.QuadMesh at 0x7efecab78250>
../../_images/numpy_and_matplotlib_50_1.png

Another common need is to add an extra dimension to an array. This can be accomplished via indexing with None.

x.shape
(100,)
x[None, :].shape
(1, 100)
x[None, :, None, None].shape
(1, 100, 1, 1)

Broadcasting

Not all the arrays we want to work with will have the same size. One approach would be to manually “expand” our arrays to all be the same size, e.g. using tile. Broadcasting is a more efficient way to multiply arrays of different sizes Numpy has specific rules for how broadcasting works. These can be confusing but are worth learning if you plan to work with Numpy data a lot.

The core concept of broadcasting is telling Numpy which dimensions are supposed to line up with each other.

Image(url='http://scipy-lectures.github.io/_images/numpy_broadcasting.png',
     width=720)

Dimensions are automatically aligned starting with the last dimension. If the last two dimensions have the same length, then the two arrays can be broadcast.

print(f.shape, x.shape)
g = f * x
print(g.shape)
(50, 100) (100,)
(50, 100)
plt.pcolormesh(g)
<matplotlib.collections.QuadMesh at 0x7efecabed310>
../../_images/numpy_and_matplotlib_59_1.png

However, if the last two dimensions are not the same, Numpy cannot just automatically figure it out.

# multiply f by y
print(f.shape, y.shape)
h = f * y
(50, 100) (50,)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_442/1156475932.py in <module>
      1 # multiply f by y
      2 print(f.shape, y.shape)
----> 3 h = f * y

ValueError: operands could not be broadcast together with shapes (50,100) (50,) 

We can help numpy by adding an extra dimension to y at the end. Then the length-50 dimensions will line up.

print(f.shape, y[:, None].shape)
h = f * y[:, None]
print(h.shape)
(50, 100) (50, 1)
(50, 100)
plt.pcolormesh(h)
<matplotlib.collections.QuadMesh at 0x7efecaabaee0>
../../_images/numpy_and_matplotlib_64_1.png

Reduction Operations

In scientific data analysis, we usually start with a lot of data and want to reduce it down in order to make plots of summary tables. Operations that reduce the size of numpy arrays are called “reductions”. There are many different reduction operations. Here we will look at some of the most common ones.

# sum
g.sum()
-3083.038387807155
# mean
g.mean()
-0.616607677561431
# standard deviation
g.std()
1.6402280119141424

A key property of numpy reductions is the ability to operate on just one axis.

# apply on just one axis
g_ymean = g.mean(axis=0)
g_xmean = g.mean(axis=1)
plt.plot(x, g_ymean)
[<matplotlib.lines.Line2D at 0x7efecaa9b790>]
../../_images/numpy_and_matplotlib_71_1.png
plt.plot(g_xmean, y)
[<matplotlib.lines.Line2D at 0x7efecaa00610>]
../../_images/numpy_and_matplotlib_72_1.png

Data Files

It can be useful to save numpy data into files.

np.save('g.npy', g)

Warning

Numpy .npy files are a convenient way to store temporary data, but they are not considered a robust archival format. Later we will learn about NetCDF, the recommended way to store earth and environmental data.

g_loaded = np.load('g.npy')

np.testing.assert_equal(g, g_loaded)