1. NumPy

1. NumPy#

1. Introduction#

Numpy stands for Numerical python and it is typically imported as np

import numpy as np

The major data structure numpy provides is the array, or more accurately ndarray, and arrays can represent data in multiple Dimensions. Section 9 of this tutorial goes over several other types of data structures but we will focus on the array.

1.1: Features of an Array#

Homgeonous Data Type (Structured Arrays can have multiple data types but store data differently)
Once created size can not be changed
Must be rectangular

1.2: Array Dimensionality#

Dimension (Rank)	Common Name	Description	Example Shape (`.shape`)
0D	Scalar	A single numerical value, not an array but an atomic element.	`()` (empty tuple)
1D	Vector	A one-dimensional array (list-like but with array properties).	`(n,)` (e.g., `(5,)`)
2D	Matrix	A rectangular grid of numbers with rows and columns.	`(m, n)` (e.g., `(3, 4)`)
3D	Tensor	A stack of matrices, used in physics and deep learning.	`(d, m, n)` (e.g., `(2, 3, 4)`)
4D	HyperTensor	Higher-order tensor, used in advanced machine learning.	`(t, d, m, n)` (e.g., `(5, 2, 3, 4)`)
nD	nD Array (General)	An array with any number of dimensions (`n ≥ 0`).	`(dim_1, dim_2, ..., dim_n)`

Array Axes#

An axes is a dimension along which data is arranged in a NumPy array and you can think of it as an indexing direction that allows you to navigate and manipulate data. Note, axis 0 is the highest ranked axes.

Dimension (Rank)	Axes	Description
1D Array	1 axis (`axis 0`)	Like a Python list: a single sequence of elements.
2D Array	2 axes (`axis 0`: rows, `axis 1`: columns)	Like a table (rows and columns).
3D Array	3 axes (`axis 0`: depth, `axis 1`: rows, `axis 2`: columns)	Like multiple stacked tables.

4D array Example#

Multiple experiments of the spectra of a reaction mixture where reactants concentrations are varied.

Axis	Description	Role in Experiment
Axis 0	Experiment Number	Separate runs where reactant concentrations vary
Axis 1	Temporal Data Cube	Absorbance values at different times
Axis 2	Wavelength	Spectral dimension (e.g., 400-700 nm)
Axis 3	Absorbance at Wavelength	Intensity values at each wavelength

Since NumPy indexing follows axes, so you can retrieve meaningful slices from your data.

1.3 Array Properties#

Arrays use brackets like lists and in fact a 1D array looks like a list, but the array holds one block of memory while the list has pointers to multiple memory blocks.

Fixed Data Type (All elements of same data type)
Arrays are Mutable (You can change values within an array)
Fixed Size (Fixed upon creation)
Efficient Memory Usage (stores data in contiguous memory)
Vectorized Operations (supports element-wise operations that a list would require loops for)
Multi-Dimensional Support (Does not need nested structures like a list would)

2. Arrays as a Python Objects#

Table of NumPy Array Attributes#

Attribute	Description
`shape`	Tuple of array dimensions
`ndim`	Number of array dimensions
`size`	Total number of elements in the array
`dtype`	Data type of the array elements
`itemsize`	Size in bytes of each array element
`data`	Python buffer object pointing to the array’s data
`strides`	Tuple of bytes to step in each dimension when traversing the array
`flags`	Information about the memory layout of the array
`real`	Real part of the array (for complex number arrays)
`imag`	Imaginary part of the array (for complex number arrays)

Table of NumPy Array Methods#

Method	Description
`reshape()`	Returns a new array with a changed shape
`transpose()`	Returns a new array with axes transposed
`flatten()`	Returns a copy of the array collapsed into one dimension
`astype()`	Cast the array to a specified data type
`copy()`	Return a copy of the array
`fill()`	Fill the array with a scalar value
`sort()`	Sort an array in-place
`max()`	Return the maximum value along a given axis
`min()`	Return the minimum value along a given axis
`mean()`	Compute the arithmetic mean along the specified axis

Table of NumPy Functions for Arrays#

Function	Description
`np.array()`	Create an array
`np.zeros()`	Create an array filled with zeros
`np.ones()`	Create an array filled with ones
`np.arange()`	Create an array with evenly spaced values within a given interval
`np.linspace()`	Create an array with evenly spaced numbers over a specified interval
`np.concatenate()`	Join arrays along an existing axis
`np.vstack()`	Stack arrays vertically (row-wise)
`np.hstack()`	Stack arrays horizontally (column-wise)
`np.split()`	Split an array into multiple sub-arrays
`np.where()`	Return elements chosen from two arrays depending on condition

3. Array Data Types#

dtype()#

In python everything has a type, which you can find with the type() function. In Numpy data also has a type function called dtype(), which like the type(), defines how much memory the object uses. The following table covers the common data types stored in a Numpy Array and it is often important to define these when creating an array.

Code	Data Type	Example Equivalent
`'b'`	Boolean	`bool`
`'i'`	Integer (default)	`int32`, `int64` (depends on system)
`'L'`	Long Integer	`int64` (platform-dependent)
`'u'`	Unsigned Integer	`uint32`, `uint64`
`'f'`	Floating Point	`float32`, `float64`
`'c'`	Complex Float	`complex64`, `complex128`
`'m'`	Timedelta	N/A
`'M'`	Datetime	N/A
`'O'`	Object (Python objects)	N/A
`'S'` or `'a'`	String (bytes)	Fixed-length strings
`'U'`	Unicode String	Fixed-length Unicode

.astype() -changing dtypes#

import numpy as np

# Example array of long integers
arr = np.array([1733976600, 1733798400], dtype='l')

# Convert to float
arr_float = arr.astype('f8')  # Convert to float64

import numpy as np

# Example array
arr = np.array([1733976600, 1733798400], dtype='l')

# Check dtype
print(arr)
print(arr.dtype)  
print(f'memory location = {id(arr)}\n')
# Convert to float
arr_float = arr.astype('f8')  # Convert to float64
print(arr_float)
print(arr_float.dtype)
print(f'memory location = {id(arr_float)}')

[1733976600 1733798400]
int64
memory location = 130959724447856

[1.7339766e+09 1.7337984e+09]
float64
memory location = 130958383798800

4. Basic Array Tasks#

4.1: Creating Arrays#

Array syntax looks like a Python list for a 1D Array in that the elements are in square brackets and separated by commas.

Array constructor (np.array())#

Creating Arrays from other python objects. In this exercise we are going to use nested python lists to create an array with the array constructor np.array().

note, an ndarray is of a single type of data, but you can use the np.array() constructor to make a structured array, which is an array with more than one data type (see section 9 below).

1D array#

import numpy as np
my_list = [1,2,3,4,5,6,7,8]
print(f"{my_list} is {type(my_list)}.")
my_1Darray=np.array(my_list)
print(f"{my_1Darray} is {type(my_1Darray)}.")

[1, 2, 3, 4, 5, 6, 7, 8] is <class 'list'>.
[1 2 3 4 5 6 7 8] is <class 'numpy.ndarray'>.

Just as a loop an be nested inside of another loop, a list can be nested inside of another list

2D array#

my_2nested_list = [[1,2],[3,4],[5,6],[7,8]]
print(f"{my_2nested_list} is {type(my_2nested_list)}.")
my_2Darray=np.array(my_2nested_list)
print(f"{my_2Darray} is {type(my_2Darray)}.")

[[1, 2], [3, 4], [5, 6], [7, 8]] is <class 'list'>.
[[1 2]
 [3 4]
 [5 6]
 [7 8]] is <class 'numpy.ndarray'>.

3D array#

my_3nested_list = [[[1,2],[3,4]],[[5,6],[7,8]]]
print(f"{my_3nested_list} is {type(my_3nested_list)}.")
my_3Darray=np.array(my_3nested_list)
print(f"{my_3Darray} is {type(my_3Darray)}.")

[[[1, 2], [3, 4]], [[5, 6], [7, 8]]] is <class 'list'>.
[[[1 2]
  [3 4]]

 [[5 6]
  [7 8]]] is <class 'numpy.ndarray'>.

np.linespace()#

generates evenly spaced numbers over a specified range.

np.linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None)

Parameter	Description
`start`	The starting value of the sequence (inclusive).
`stop`	The ending value of the sequence. By default, it is inclusive, but you can exclude it using `endpoint=False`.
`num`	The number of points to generate (default = 50).
`endpoint`	If `True` (default), the `stop` value is included in the sequence. If `False`, the `stop` value is excluded.
`retstep`	If `True`, returns the spacing (step size) between values in addition to the array.
`dtype`	The desired NumPy data type for the output array.

# np.linspace(): 5 evenly spaced numbers
print("linspace:", np.linspace(1, 10, num=5))
# Output: [ 1.  3.25  5.5  7.75 10. ]

linspace: [ 1.    3.25  5.5   7.75 10.  ]

np.arange()#

gives evenly spaced value over a range with control of step size and data value.

np.arange(start, stop, step=1, dtype=None)

Parameter	Description
`start`	The starting value (inclusive). Defaults to `0` if not provided.
`stop`	The ending value (exclusive by default).
`step`	The increment (default = `1`). Controls how the sequence progresses.
`dtype`	The data type of the output array. If not specified, NumPy infers it automatically.

import numpy as np

# np.arange(): Step size of 2
print("arange:", np.arange(1, 10, 2))  
# Output: [1 3 5 7 9]

arange: [1 3 5 7 9]

Comparison: np.arange() vs. np.linspace()#

Feature	`np.arange()`	`np.linspace()`
Step Control	Uses a fixed step between values.	Uses a fixed number of points between `start` and `stop`.
Stop Inclusion	Exclusive (default behavior).	Inclusive (by default), unless `endpoint=False`.
Flexibility	Good for integer-based sequences with regular spacing.	Better for floating-point sequences with exact spacing.
Use Case	When you know the step size and want values to progress by that amount.	When you know the number of points and want them distributed evenly.
Example	`np.arange(1, 10, 2)` → `[1, 3, 5, 7, 9]`	`np.linspace(1, 10, num=5)` → `[ 1. 3.25 5.5 7.75 10. ]`

import numpy as np

# np.arange(): Step size of 2
print("arange:", np.arange(1, 10, 2))  
# Output: [1 3 5 7 9]

# np.linspace(): 5 evenly spaced numbers
print("linspace:", np.linspace(1, 10, num=5))
# Output: [ 1.  3.25  5.5  7.75 10. ]

arange: [1 3 5 7 9]
linspace: [ 1.    3.25  5.5   7.75 10.  ]

np.zeros()#

This function allows you to quickly create an array filled with zeros,

np.zeros((2))

array([0., 0.])

np.zeros((3,2))

array([[0., 0.],
       [0., 0.],
       [0., 0.]])

np.zeros((4,3,2))

array([[[0., 0.],
        [0., 0.],
        [0., 0.]],

       [[0., 0.],
        [0., 0.],
        [0., 0.]],

       [[0., 0.],
        [0., 0.],
        [0., 0.]],

       [[0., 0.],
        [0., 0.],
        [0., 0.]]])

np.ones()#

np.ones((2,3,4))

array([[[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]],

       [[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]]])

4.2: Array Indexing (Introduction)#

Indexing goes along the axes from zero to “n” for an “n” dimensional array. So a 3D array would be sheet, column, row. Let’s create a 3D NumPy array representing absorbance values at different wavelengths over time for multiple experiments.

Axis 0 → Experiment number
Axis 1 → Sample Times
Axis 2 → Wavelengths

import numpy as np

# Create a 3D array: 2 experiments, 4 sample times, 3 wavelengths
data = np.array([
    [[0.12, 0.15, 0.18], [0.14, 0.17, 0.20], [0.16, 0.19, 0.22], [0.18, 0.21, 0.24]],  # Experiment 1
    [[0.22, 0.25, 0.28], [0.24, 0.27, 0.30], [0.26, 0.29, 0.32], [0.28, 0.31, 0.34]]   # Experiment 2
])

print("Shape of the array:", data.shape)  # (2, 4, 3)
print(data)

Shape of the array: (2, 4, 3)
[[[0.12 0.15 0.18]
  [0.14 0.17 0.2 ]
  [0.16 0.19 0.22]
  [0.18 0.21 0.24]]

 [[0.22 0.25 0.28]
  [0.24 0.27 0.3 ]
  [0.26 0.29 0.32]
  [0.28 0.31 0.34]]]

This array has:

2 experiments (Axis 0)
4 time steps (Axis 1)
3 wavelengths (Axis 2)

Extracting a Single Value#

To get a single absorbance value for Experiment 2, sample time 3, Wavelength 2:

value = data[1, 2, 1]  # Experiment 2 (index 1), Time step 3 (index 2), Wavelength 2 (index 1)
print("Absorbance value:", value)

Absorbance value: 0.29

Extract a 1D array#

Obtain all absorbances for the first experiment at the second time step

time_2_exp_1 = data[0, 1, :]
print(time_2_exp_1)

[0.14 0.17 0.2 ]

Extract a 2D array#

Obtain absorbance values at third wavelength for all experiments and all sample times

wavelength_3 = data[:, :, 2]
print(wavelength_3)

[[0.18 0.2  0.22 0.24]
 [0.28 0.3  0.32 0.34]]

Slice out the first experiment#

subset = data[0:1, :, :]
print(subset.shape)  
print(subset)

(1, 4, 3)
[[[0.12 0.15 0.18]
  [0.14 0.17 0.2 ]
  [0.16 0.19 0.22]
  [0.18 0.21 0.24]]]

Slice out all data for second and third sample times#

subset_time = data[:, 1:3, :]
print(subset_time.shape)
print(subset_time)

(2, 2, 3)
[[[0.14 0.17 0.2 ]
  [0.16 0.19 0.22]]

 [[0.24 0.27 0.3 ]
  [0.26 0.29 0.32]]]

Summary Table#

Operation	Code	Output Shape
Single Value	`data[1, 2, 1]`	`()` (Scalar)
1D Time Slice	`data[0, 1, :]`	`(3,)`
Wavelength 3 Absorbance	`data[:, :, 2]`	`(2, 4)`
First Experiment	`data[0:1, :, :]`	`(1, 4, 3)`
Time Steps 2-3	`data[:, 1:3, :]`	`(2, 2, 3)`
Every Other Wavelength	`data[:, :, ::2]`	`(2, 4, 2)`

4.3: Array Attributes#

.shape#

Gives the shape of the array

my_array=np.ones((4,3,2))
my_array.shape

(4, 3, 2)

.ndim#

Gives the number of dimensions

my_array.ndim

.size#

Gives total number of elements in an array

my_array.size

4.4: Arrays from Functions (np.fromfunction())#

You can make an array from a function, let’s create an array for the molar mass of 1-10 elements of carbon, oxygen and hydrogen.

import numpy as np
def molar_mass_grid(element_idx, atom_count):
    # Atomic masses: C=12.01, H=1.008, O=16.00
    atomic_masses = np.array([12.01, 1.008, 16.00])  
    return atomic_masses[element_idx.astype(int)] * (atom_count + 1)

mass_grid = np.fromfunction(molar_mass_grid, (3,10), dtype=float)
print(mass_grid)

[[ 12.01   24.02   36.03   48.04   60.05   72.06   84.07   96.08  108.09
  120.1  ]
 [  1.008   2.016   3.024   4.032   5.04    6.048   7.056   8.064   9.072
   10.08 ]
 [ 16.     32.     48.     64.     80.     96.    112.    128.    144.
  160.   ]]

import numpy as np

# Step 1: Define the mass grid using np.fromfunction()
def molecular_mass_grid(element_idx, atom_count):
    atomic_masses = np.array([12.01, 1.008, 16.00])  # C, H, O
    return atomic_masses[element_idx.astype(int)] * (atom_count + 1)

mass_grid = np.fromfunction(molecular_mass_grid, (3, 10), dtype=float)

# Step 2: Define a function to compute molar mass
def compute_molar_mass(formula):
    # Define mapping for element index lookup
    element_indices = {'C': 0, 'H': 1, 'O': 2}
    
    # Parse formula into element counts
    parsed_formula = [('C', 3), ('H', 8), ('O', 1)]  # Example: C3H8O

    # Step 3: Compute molar mass
    #molar_mass = sum(mass_grid[element_indices[element], atoms - 1] for element, atoms in parsed_formula)
    molar_mass = 0  # Initialize molar mass
    for element, atoms in parsed_formula:
        element_idx = element_indices[element]  # Get index for the element
        molar_mass += mass_grid[element_idx, atoms - 1]  # Accumulate mass using array indexing
        
    return molar_mass

# Step 4: Compute molar mass of C3H8O (Propanol)
molar_mass_C3H8O = compute_molar_mass("C3H8O")
print(f"Molar Mass of C3H8O: {molar_mass_C3H8O:.2f} g/mol")

Molar Mass of C3H8O: 60.09 g/mol

4.5: Sorting Arrays#

axis=0: Sort along columns (vertically)
axis=1: Sort along rows (horizontally)
axis=-1: Sort along the last axis (default)

import numpy as np

arr = np.array([[3, 1, 4], [1, 5, 9], [2, 6, 5]])
print(f'original array: \n{arr}\n')
# Sort each column

print(f'sort axis 0 \n{np.sort(arr, axis=0)}\n')

# Sort each row
print(f'sort axis 1 \n {np.sort(arr, axis=1)}')

original array: 
[[3 1 4]
 [1 5 9]
 [2 6 5]]

sort axis 0 
[[1 1 4]
 [2 5 5]
 [3 6 9]]

sort axis 1 
 [[1 3 4]
 [1 5 9]
 [2 5 6]]

4.6 Reshaping Arrays#

.reshape()#

The number of elements of the original and reshaped arrays must be the same
(3,2,4) array must be a multiple of the product of the indices
In the following we have measurements at 5 wavelengths taken 3 different times

# Simulated absorbance data (1D)
absorbance_data = np.arange(15)  # Example: 15 absorbance values
print(absorbance_data)
# Reshape into 3 time points × 5 wavelengths
absorbance_matrix = absorbance_data.reshape(3, 5)

print(absorbance_matrix)

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14]
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]]

.flatten()#

-convert multi-dimensional array to a 1D array

print(absorbance_matrix)
flat_matrix=absorbance_matrix.flatten()
print(flat_matrix)

[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14]

Transpose Arrays#

Transposing a matrix swaps columns and rows. For dimensions higher than 2 you have two choices, .T and .transpose(). The former does a full transpose and completely rearranges the order, while the latter allows you to specify axes transposed .T vs. .transpose(): Key Differences

Feature	`.T` (Shortcut)	`.transpose()` (Flexible)
Works on 2D?	Yes (swaps rows & columns)	Yes (same as `.T` for 2D)
Works on 3D+?	Yes (reverses all axes)	Yes (custom axis order)
Custom Axis Order?	No	Yes (`.transpose(axes)`)
Effect on (2,3,4) Shape?	➝ `(4,3,2)` (full reverse)	➝ Custom order (e.g., `(3,4,2)`)

.T (Full Transpose)#

Full transpose

my_array=np.arange(1,25).reshape(6,4)
print(my_array)
print(f'\ntransposed array \n{my_array.T}')

[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]
 [13 14 15 16]
 [17 18 19 20]
 [21 22 23 24]]

transposed array 
[[ 1  5  9 13 17 21]
 [ 2  6 10 14 18 22]
 [ 3  7 11 15 19 23]
 [ 4  8 12 16 20 24]]

.transpose()#

my_array=np.arange(1,25).reshape(2,3,4)
print(my_array)
print(f'transposing rows (3) and columns(1) of 3D array \n{my_array.transpose(0,2,1)}')

[[[ 1  2  3  4]
  [ 5  6  7  8]
  [ 9 10 11 12]]

 [[13 14 15 16]
  [17 18 19 20]
  [21 22 23 24]]]
transposing rows (3) and columns(1) of 3D array 
[[[ 1  5  9]
  [ 2  6 10]
  [ 3  7 11]
  [ 4  8 12]]

 [[13 17 21]
  [14 18 22]
  [15 19 23]
  [16 20 24]]]

Scenario: Kinetic Absorbance Data for a Chemical Reaction#

A spectrometer records absorbance at multiple wavelengths over several time points across different experiments (e.g., varying reagent concentrations).
This dataset is stored in a 3D NumPy array where:

Axis 0 → 3 different experiments (e.g., different NaOH concentrations).
Axis 1 → 4 different time points (e.g., 0s, 10s, 20s…).
Axis 2 → 5 different wavelengths (e.g., 400 nm, 450 nm, 500 nm…).

Goal: The chemist needs to rearrange the dataset to focus on a wavelength-first structure, where:

Axis 0 → Wavelengths
Axis 1 → Experiments
Axis 2 → Time points

This makes it easier to extract and compare spectral changes at specific wavelengths.

import numpy as np

np.random.seed(42)  # Set the seed for reproducibility

# Simulated absorbance data: (Experiments, Time Points, Wavelengths)
absorbance_data = np.random.rand(3, 4, 5)  # Shape (3, 4, 5)

print("Original Shape:", absorbance_data.shape)  # (3,4,5)
print(absorbance_data)  # Will always generate the same random values

Original Shape: (3, 4, 5)
[[[0.37454012 0.95071431 0.73199394 0.59865848 0.15601864]
  [0.15599452 0.05808361 0.86617615 0.60111501 0.70807258]
  [0.02058449 0.96990985 0.83244264 0.21233911 0.18182497]
  [0.18340451 0.30424224 0.52475643 0.43194502 0.29122914]]

 [[0.61185289 0.13949386 0.29214465 0.36636184 0.45606998]
  [0.78517596 0.19967378 0.51423444 0.59241457 0.04645041]
  [0.60754485 0.17052412 0.06505159 0.94888554 0.96563203]
  [0.80839735 0.30461377 0.09767211 0.68423303 0.44015249]]

 [[0.12203823 0.49517691 0.03438852 0.9093204  0.25877998]
  [0.66252228 0.31171108 0.52006802 0.54671028 0.18485446]
  [0.96958463 0.77513282 0.93949894 0.89482735 0.59789998]
  [0.92187424 0.0884925  0.19598286 0.04522729 0.32533033]]]

# Transpose: (Experiments, Time Points, Wavelengths) → (Wavelengths, Experiments, Time Points)
transposed_data = absorbance_data.transpose(2, 0, 1)

print("New Shape:", transposed_data.shape)  # Expected output: (5, 3, 4)
print(transposed_data)

New Shape: (5, 3, 4)
[[[0.37454012 0.15599452 0.02058449 0.18340451]
  [0.61185289 0.78517596 0.60754485 0.80839735]
  [0.12203823 0.66252228 0.96958463 0.92187424]]

 [[0.95071431 0.05808361 0.96990985 0.30424224]
  [0.13949386 0.19967378 0.17052412 0.30461377]
  [0.49517691 0.31171108 0.77513282 0.0884925 ]]

 [[0.73199394 0.86617615 0.83244264 0.52475643]
  [0.29214465 0.51423444 0.06505159 0.09767211]
  [0.03438852 0.52006802 0.93949894 0.19598286]]

 [[0.59865848 0.60111501 0.21233911 0.43194502]
  [0.36636184 0.59241457 0.94888554 0.68423303]
  [0.9093204  0.54671028 0.89482735 0.04522729]]

 [[0.15601864 0.70807258 0.18182497 0.29122914]
  [0.45606998 0.04645041 0.96563203 0.44015249]
  [0.25877998 0.18485446 0.59789998 0.32533033]]]

4.7: Merging Arrays#

Functions like np.concatenate(), np.vstack(), np.hstack(), and np.dstack() always create a new array. They do not modify the original arrays but instead return a new one that needs to be saved to a variable.

np.vstack()#

Adds arrays as new rows (along axis 0).
Requires matching number of columns.

Example: Combining Molecular Property Data Imagine we have two datasets:

Dataset 1: Molar masses of three molecules.
Dataset 2: Molar masses of three other molecules

import numpy as np

# Molecular masses (g/mol)
mass_data = np.array([[18.015],  # Water
                      [44.01],   # CO2
                      [32.00]])  # O2

# Boiling points (°C)
boiling_data = np.array([[100],  # Water
                         [-78.5], # CO2
                         [-183]]) # O2

# Stack them vertically
merged_data = np.vstack((mass_data, boiling_data))

print(merged_data)

import numpy as np

# Molecular masses (g/mol)
mass1_data = np.array([[18.015],  # Water
                      [44.01],   # CO2
                      [32.00]])  # O2

# Molecular massess (g/mol)
mass2_data = np.array([[180.16],  # Aspirin
                         [151.16], # Acetaminophen
                         [206.28]]) # Ibuprofen

# Stack them vertically
merged_data = np.vstack((mass1_data, mass2_data))

print(merged_data)

[[ 18.015]
 [ 44.01 ]
 [ 32.   ]
 [180.16 ]
 [151.16 ]
 [206.28 ]]

np.hstack()#

Adds arrays as new columns (along axis 1).
Requires matching number of rows.

Example: Combining Molecular Property Data Imagine we have two datasets:

Dataset 1: Molecular masses of three molecules.
Dataset 2: Their boiling points.

import numpy as np

# Molecular masses (g/mol)
mass_data = np.array([[18.015],  # Water
                      [44.01],   # CO2
                      [32.00]])  # O2

# Boiling points (°C)
boiling_data = np.array([[100],  # Water
                         [-78.5], # CO2
                         [-183]]) # O2
# Stack them horizontally
merged_data = np.hstack((mass_data, boiling_data))

import numpy as np

# Molecular masses (g/mol)
mass_data = np.array([[18.015],  # Water
                      [44.01],   # CO2
                      [32.00]])  # O2

# Boiling points (°C)
boiling_data = np.array([[100],  # Water
                         [-78.5], # CO2
                         [-183]]) # O2

# Stack them horizontally
merged_data = np.hstack((mass_data, boiling_data))

print(merged_data)

[[  18.015  100.   ]
 [  44.01   -78.5  ]
 [  32.    -183.   ]]

np.column_stack()#

Works like hstack(), but is specifically designed for 1D arrays.

Example: Pairing Element Symbols with Atomic Weights

# Element symbols
elements = np.array(['H', 'O', 'C'])

# Atomic masses
atomic_masses = np.array([1.008, 16.00, 12.01])

# Combine into a structured array
merged_data = np.column_stack((elements, atomic_masses))

print(merged_data)

[['H' '1.008']
 ['O' '16.0']
 ['C' '12.01']]

np.dstack( )#

only works on 1 and 2D arrays
Results in 3d arrays

Stacking 1D arrays#

import numpy as np

a = np.array([1, 2, 3])  # Shape (3,)
b = np.array([4, 5, 6])  # Shape (3,)

stacked = np.dstack((a, b))

print("Stacked shape:", stacked.shape)  # (1, 3, 2)
print(stacked)

Stacked shape: (1, 3, 2)
[[[1 4]
  [2 5]
  [3 6]]]

Stacking 2D arrays#

a = np.array([[1, 2], [3, 4]])  # Shape (2,2)
b = np.array([[5, 6], [7, 8]])  # Shape (2,2)

stacked = np.dstack((a, b))

print("Stacked shape:", stacked.shape)  # (2, 2, 2)
print(stacked)

Stacked shape: (2, 2, 2)
[[[1 5]
  [2 6]]

 [[3 7]
  [4 8]]]

np.stack()#

For higher order arrays you need np.stack()

import numpy as np

np.random.seed(42)  # Set the seed for reproducibility

# Define common wavelength range (same for all experiments)
wavelengths = np.linspace(400, 700, 5)  # 5 wavelengths from 400 to 700 nm

# Define common time scale (to enable stacking)
common_time = np.linspace(0, 100, 6)  # 6 time points from 0s to 100s

# Simulate absorbance data (random values for illustration)
exp1 = np.random.rand(len(common_time), len(wavelengths), 1)  # Shape (6,5,1)
exp2 = np.random.rand(len(common_time), len(wavelengths), 1)  # Shape (6,5,1)
exp3 = np.random.rand(len(common_time), len(wavelengths), 1)  # Shape (6,5,1)

print("Shape of a single experiment dataset:", exp1.shape)  # (6, 5, 1)
# Stack experiments along a new axis (creating a 4D array)
spectroscopy_data = np.stack((exp1, exp2, exp3), axis=0)

print("Shape of stacked spectroscopy data:", spectroscopy_data.shape)  # (3,6,5,1)

Shape of a single experiment dataset: (6, 5, 1)
Shape of stacked spectroscopy data: (3, 6, 5, 1)

import numpy as np
import matplotlib.pyplot as plt

# Set seed for reproducibility
np.random.seed(42)

# Define common wavelength range (same for all experiments)
wavelengths = np.linspace(400, 700, 5)  # 5 wavelengths from 400 to 700 nm

# Define common time scale (to enable stacking)
common_time = np.linspace(0, 100, 6)  # 6 time points from 0s to 100s

# Simulate absorbance data (random values for illustration)
exp1 = np.random.rand(len(common_time), len(wavelengths), 1)  # Shape (6,5,1)
exp2 = np.random.rand(len(common_time), len(wavelengths), 1)  # Shape (6,5,1)
exp3 = np.random.rand(len(common_time), len(wavelengths), 1)  # Shape (6,5,1)

# Stack experiments along a new axis (creating a 4D array)
spectroscopy_data = np.stack((exp1, exp2, exp3), axis=0)

# Function to plot absorbance over time for a specific wavelength across all experiments
def plot_absorbance_at_wavelength(target_wavelength):
    """
    Plots absorbance vs time for a specific wavelength across all experiments.
    
    Parameters:
    target_wavelength (float): The wavelength to extract absorbance values for.
    """
    num_experiments = spectroscopy_data.shape[0]  # Number of experiments

    # Find the closest wavelength index in the wavelength array
    wavelength_idx = np.argmin(np.abs(wavelengths - target_wavelength))

    # Define experiment labels for the legend
    experiment_labels = [f'Experiment {i+1}' for i in range(num_experiments)]
    
    # Set up a color cycle for experiments
    colors = ['b', 'g', 'r']

    # Create figure
    plt.figure(figsize=(7, 5))

    for exp_idx in range(num_experiments):
        absorbance_values = spectroscopy_data[exp_idx, :, wavelength_idx, 0]  # Extract absorbance data

        plt.plot(common_time, absorbance_values, marker='o', linestyle='-', color=colors[exp_idx], label=experiment_labels[exp_idx])

    plt.title(f'Absorbance at {target_wavelength} nm')
    plt.xlabel('Time (s)')
    plt.ylabel('Absorbance')
    plt.legend()
    plt.grid(True)

    plt.show()

# Plot absorbance at 620 nm across all experiments
plot_absorbance_at_wavelength(620)

../../_images/28808e231106b64d44e28ec7c0250f9d6f93426e148861ce6527c0621d85a837.png

np.concatenate()#

In the above we had three experiments with the dimensions (6,5,1). We then added a fourth dimension usig np.stack() so that all three experiments are in the same array with the dimensions (3,6,5,1). We can use np.concatenate() to add a fourth experiment to the fourth dimension, giving a 4D array of shape (4,6,5,1).

# Set seed for reproducibility
np.random.seed(42)


# Simulate a new experiment (exp4) with the same shape as previous experiments
exp4 = np.random.rand(len(common_time), len(wavelengths), 1)  # Shape (6,5,1)

# Use np.concatenate() to merge exp4 along axis=0 (experiment axis)
updated_spectroscopy_data = np.concatenate((spectroscopy_data, np.expand_dims(exp4, axis=0)), axis=0)

# Print the new shape to confirm
print("Updated 4D array shape:", updated_spectroscopy_data.shape)  # Expected: (4,6,5,1)

Updated 4D array shape: (4, 6, 5, 1)

import numpy as np
import matplotlib.pyplot as plt

# Set seed for reproducibility
np.random.seed(42)

# Define common wavelength range (same for all experiments)
wavelengths = np.linspace(400, 700, 5)  # 5 wavelengths from 400 to 700 nm

# Define common time scale (to enable stacking)
common_time = np.linspace(0, 100, 6)  # 6 time points from 0s to 100s

# Simulate absorbance data (random values for illustration)
exp1 = np.random.rand(len(common_time), len(wavelengths), 1)  # Shape (6,5,1)
exp2 = np.random.rand(len(common_time), len(wavelengths), 1)  # Shape (6,5,1)
exp3 = np.random.rand(len(common_time), len(wavelengths), 1)  # Shape (6,5,1)

# Stack experiments along a new axis (creating a 4D array)
spectroscopy_data = np.stack((exp1, exp2, exp3), axis=0)

# Simulate a new experiment (exp4) with the same shape as previous experiments
exp4 = np.random.rand(len(common_time), len(wavelengths), 1)  # Shape (6,5,1)

# Use np.concatenate() to merge exp4 along axis=0 (experiment axis)
updated_spectroscopy_data = np.concatenate((spectroscopy_data, np.expand_dims(exp4, axis=0)), axis=0)

# Function to plot absorbance at different times for all experiments
def plot_absorbance_at_times(times_to_plot):
    """
    Plots absorbance vs wavelength at selected time points for all experiments.
    
    Parameters:
    times_to_plot (list): List of time indices to plot.
    """
    num_experiments = updated_spectroscopy_data.shape[0]  # Number of experiments
    
    # Define experiment labels for the legend
    experiment_labels = [f'Experiment {i+1}' for i in range(num_experiments)]
    
    # Use the correct Matplotlib colormap approach
    cmap = plt.colormaps.get_cmap('tab10')  # Updated approach
    colors = [cmap(i % 10) for i in range(num_experiments)]  # Ensures correct color mapping

    # Create subplots for each selected time point
    fig, axes = plt.subplots(1, len(times_to_plot), figsize=(5 * len(times_to_plot), 5), sharey=True)

    if len(times_to_plot) == 1:
        axes = [axes]  # Ensure axes is iterable for a single plot

    for ax, time_idx in zip(axes, times_to_plot):
        for exp_idx in range(num_experiments):
            absorbance_values = updated_spectroscopy_data[exp_idx, time_idx, :, 0]  # Extract absorbance data

            ax.plot(wavelengths, absorbance_values, marker='o', linestyle='-', color=colors[exp_idx], label=experiment_labels[exp_idx])

        ax.set_title(f'Time = {common_time[time_idx]:.1f}s')
        ax.set_xlabel('Wavelength (nm)')
        ax.set_ylabel('Absorbance')
        ax.legend()
        ax.grid(True)

    plt.tight_layout()
    plt.show()

# Example: Plot absorbance at selected time points (Modify as needed)
plot_absorbance_at_times([0, 2, 5])  # Plot at first, middle, and last time points

../../_images/7e38459db4d73ca6f9571dce774723872f43bbdb98388aae56675c68acd12803.png

4.8: Vectorization and Broadcasting Arrays#

Vectorization#

Because of the way arrays are stored in memory you can do operations on every element within an array without having to go through a loop. If the arrays have identical shapes this is called vectorization, and this is much faster than using loops.

In the following code we have two 1D arrays with 3 elements each, one containing the numbers of carbon, hydrogen and oxygen in a molecule, and the second the molar mass of carbon, hydrogen and oxygen. Two calculate the molar mass we simply multiply the two arrays by each other.

elements = np.array(['C', 'H', 'O'])
counts = np.array([6, 12, 6])

atomic_weights_np = np.array([12.01, 1.008, 16.00])  # Order matches elements

mass_matrix = atomic_weights_np * counts
print(mass_matrix)

total_mass = np.sum(atomic_weights_np * counts)
print(total_mass)  # 180.156 g/mol

What happens is

Index 0: 6 * 12.01 = 72.06
Index 1: 12 * 1.008
Index 2: 6 * 16.00 And this is much faster than looping.

elements = np.array(['C', 'H', 'O'])
counts = np.array([6, 12, 6])

atomic_weights_np = np.array([12.01, 1.008, 16.00])  # Order matches elements
mass_matrix = atomic_weights_np * counts
print(mass_matrix)

total_mass = np.sum(atomic_weights_np * counts)
print(total_mass)  # 180.156 g/mol

[72.06  12.096 96.   ]
180.156

Broadcasting#

If the shapes do not match Numpy Broadcasts from the array with the smaller dimensions to the larger by expanding the smaller one to fit the size of the larger.

Scalar math on arrays#

\[\begin{split} \left[ \begin{array}{cc} 2 & 4 \\ 6 & 8 \end{array} \right] \times \left[ \begin{array}{c} 2 \end{array} \right]= ? \end{split}\]

becomes $$ \left[ \begin{array}{cc} 2 & 4 \\ 6 & 8 \end{array} \right] \times \left[ \begin{array}{cc} 2 & 2 \\ 2 & 2 \end{array} \right]= \left[ \begin{array}{cc} 4 & 8 \\ 12 & 16 \end{array} \right] $$

a = np.array([[2,4], [6,8]])
b = np.array([2]) 
a * b

array([[ 4,  8],
       [12, 16]])

\[\begin{split} \left[ \begin{array}{cc} 2 & 4 \\ 6 & 8 \end{array} \right] \times \left[ \begin{array}{cc} 4 & 8 \end{array} \right]= ? \end{split}\]

becomes $$ \left[ \begin{array}{cc} 2 & 4 \\ 6 & 8 \end{array} \right] \times \left[ \begin{array}{cc} 4 & 8 \\ 4 & 8 \end{array} \right]= \left[ \begin{array}{cc} 8 & 32 \\ 24 & 48 \end{array} \right] $$

a = np.array([[2,4], [6,8]])
b = np.array([4,8])
a * b

array([[ 8, 32],
       [24, 64]])

4.9 Array io (saving arrays)#

You can save an array as a binary file (*.npy) or a binary compressed file (*.npz)

Save/Load a NumPy array#

import numpy as np

arr = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
np.save("data.npy", arr)  # Saves as binary file

loaded_arr = np.load("data.npy")
print(loaded_arr)

import numpy as np
import os

sandbox_dir = os.path.expanduser("~/sandbox/")
ndarray_npy_sbpath = os.path.join(sandbox_dir, "data.npy")

# Ensure the ~/sandbox/ directory exists
os.makedirs(sandbox_dir, exist_ok=True)

arr = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
np.save(ndarray_npy_sbpath, arr)  # Saves as binary file
print(arr)

[[1. 2. 3.]
 [4. 5. 6.]]

loaded_arr = np.load(ndarray_npy_sbpath)
print(loaded_arr)

[[1. 2. 3.]
 [4. 5. 6.]]

Save/Load multiple arrays#

np.savez("data.npz", arr1=arr, arr2=np.arange(10))

data = np.load("data.npz")
print(data["arr1"])  # Retrieve first array
print(data["arr2"])  # Retrieve second array

import numpy as np
import os

sandbox_dir = os.path.expanduser("~/sandbox/")
ndarray_npz_sbpath = os.path.join(sandbox_dir, "data.npz")

# Ensure the ~/sandbox/ directory exists
os.makedirs(sandbox_dir, exist_ok=True)

np.savez(ndarray_npz_sbpath, arr1=np.arange(1,25).reshape(6,4), arr2=np.arange(10))

data_files = np.load(ndarray_npz_sbpath)
print(data_files["arr1"])  # Retrieve first array
print(data_files["arr2"])  # Retrieve second array

[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]
 [13 14 15 16]
 [17 18 19 20]
 [21 22 23 24]]
[0 1 2 3 4 5 6 7 8 9]

5. Random Number Generation#

The random submodule allows for a variety of random number generations. It is a very powerful tool for creating datasets to test programs with, and these can be tailored to fit different distributions.

Compare Random module with np.random submodule#

import random

# Define isotopes and their natural abundance
isotopes = ["Cl-35", "Cl-37"]
weights = [0.75, 0.25]  # Probabilities based on natural abundance

# Generate 1000 random selections
random_sample = random.choices(isotopes, weights=weights, k=1000)

# Count occurrences
cl_35_count = random_sample.count("Cl-35")
cl_37_count = random_sample.count("Cl-37")

# Print results
print(f"Using random module:")
print(f"Cl-35: {cl_35_count}, Cl-37: {cl_37_count}")

Using random module:
Cl-35: 764, Cl-37: 236

# Numpy random Submodule
import numpy as np

# Define isotopes and their probabilities
isotopes = np.array(["Cl-35", "Cl-37"])
probabilities = [0.75, 0.25]

# Generate 1000 random selections using NumPy
np_sample = np.random.choice(isotopes, size=1000, p=probabilities)

# Count occurrences using NumPy
unique, counts = np.unique(np_sample, return_counts=True)
isotope_counts = dict(zip(unique, counts))

# Print results
print("\nUsing numpy.random:")
print(f"Cl-35: {isotope_counts['Cl-35']}, Cl-37: {isotope_counts['Cl-37']}")

Using numpy.random:
Cl-35: 749, Cl-37: 251

Comparing the time we see that the numpy.random.choice is an order of magnitude faster than random.choices

import time

# Using random module
start = time.time()
random_sample = random.choices(isotopes, weights=weights, k=1_000_000)
end = time.time()
print(f"random.choices() time: {end - start:.5f} seconds")

# Using numpy.random
start = time.time()
np_sample = np.random.choice(isotopes, size=1_000_000, p=probabilities)
end = time.time()
print(f"numpy.random.choice() time: {end - start:.5f} seconds")

random.choices() time: 0.29654 seconds
numpy.random.choice() time: 0.03485 seconds

Functions and Methods in `numpy.random`#

Name	Type	Description
`random_sample()`	Function	Generates random floats uniformly distributed in [0.0, 1.0).
`rand()`	Function	Generates random values in a given shape from a uniform distribution over [0, 1).
`randint()`	Function	Returns random integers from a specified range.
`choice()`	Function	Randomly selects elements from a given 1D array or list.
`uniform()`	Function	Generates random floats uniformly distributed between specified low and high values.
`normal()`	Function	Draws samples from a normal (Gaussian) distribution with specified mean and standard deviation.
`standard_normal()`	Function	Returns samples from a standard normal distribution (mean=0, std=1).
`seed()`	Method	Sets the seed for reproducibility of random number generation.
`RandomState()`	Class/Method	Creates a container for managing random number generation with its own state.
`default_rng()`	Method	Returns a new instance of the default random number generator introduced in NumPy 1.17.
`integers()`	Method	Generates random integers over a specified range (replacement for `randint` in newer versions).
`shuffle()`	Method	Shuffles the elements of an array in place.
`permutation()`	Method	Returns a permuted sequence or array without modifying the input.
`beta()`	Function	Draws samples from a Beta distribution.
`binomial()`	Function	Draws samples from a Binomial distribution.
`poisson()`	Function	Draws samples from a Poisson distribution.
`exponential()`	Function	Draws samples from an exponential distribution.
`gamma()`	Function	Draws samples from a Gamma distribution.
`geometric()`	Function	Draws samples from a geometric distribution.

np.random.seed()#

A random.seed allows you to generate the same random values when you rerun the code.

Allows reproducible data sets
Global - affects all subsequent random generation calls within a program unless another seed is set.
Works with random functions like np.random.rand() and np.random.int()

import numpy as np

# Set the seed
np.random.seed(42)

# Generate 10 random numbers to the hundredth place
random_numbers = np.round(np.random.rand(10), 2)
print(random_numbers)

[0.37 0.95 0.73 0.6  0.16 0.16 0.06 0.87 0.6  0.71]

np.random.default_rng()#

This allows you to create an independent (non global random instance). Note, the second time you call it the numbers are different because you have advanced through the seed iteration, and you need to reset the seed to start over.

import numpy as np

# First approach: The RNG advances
rng1 = np.random.default_rng(42) # seed is 42
rng2 = np.random.default_rng(99) # seed is 99

print("First call - RNG 1:", rng1.random(3))  # Advances
print("First call - RNG 2:", rng2.random(3))  # Advances

print("Second call - RNG 1:", rng1.random(3))  # Different numbers!
print("Second call - RNG 2:", rng2.random(3))  # Different numbers!

# Reset the generators
rng1 = np.random.default_rng(42)  # Reinitialize
rng2 = np.random.default_rng(99)  # Reinitialize

print("Reset and first call again - RNG 1:", rng1.random(3))  # Same as first call!
print("Reset and first call again - RNG 2:", rng2.random(3))  # Same as first call!

First call - RNG 1: [0.77395605 0.43887844 0.85859792]
First call - RNG 2: [0.50603067 0.56509163 0.51191596]
Second call - RNG 1: [0.69736803 0.09417735 0.97562235]
Second call - RNG 2: [0.97218637 0.61490314 0.5682835 ]
Reset and first call again - RNG 1: [0.77395605 0.43887844 0.85859792]
Reset and first call again - RNG 2: [0.50603067 0.56509163 0.51191596]

Uniform Distribution#

import numpy as np
import matplotlib.pyplot as plt

# Create an independent RNG
rng = np.random.default_rng(42)

# Generate 1000 random numbers from uniform distribution
uniform_data = rng.uniform(size=1000)


# Create a histogram plot
plt.figure(figsize=(10, 6))

# Plot uniform distribution histogram
plt.hist(uniform_data, bins=30, alpha=0.5, color='blue', label="uniform()")



# Labels and title
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.title("Histogram of uniform data)")
plt.legend()
plt.grid(True)

# Show the histogram
plt.show()

../../_images/c8e2791815bf2dd15cd3b259ab7b8f78a8c13543821ec14b488122958869e285.png

import numpy as np
import matplotlib.pyplot as plt

# Create an independent RNG
rng = np.random.default_rng(42)

# Generate 1000 random numbers from uniform distribution
standard_normal_data = rng.standard_normal(size=1000)


# Create a histogram plot
plt.figure(figsize=(10, 6))

# Plot uniform distribution histogram
plt.hist(standard_normal_data, bins=30, alpha=0.5, color='blue', label="uniform()")



# Labels and title
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.title("Histogram of standard normal data)")
plt.legend()
plt.grid(True)

# Show the histogram
plt.show()

../../_images/2c937a38965190514387993f080a0d571fde323a2d509cbafbabc7d42d1bb6e8.png

Inserting noise into an equation.

import numpy as np
import matplotlib.pyplot as plt

# Set seed for reproducibility
np.random.seed(42)

# Define wavelengths for R, I, and P
wavelengths = np.array([450, 550, 650])  # nm

# Define common time scale (10 points over 100s)
common_time = np.linspace(0, 100, 10)  

# Define the number of experiments
num_experiments = 4  

# Function to simulate reaction kinetics: R → I → P
def reaction_profiles(time):
    """
    Simulates reaction kinetics:
    - R (Reactant) decays exponentially.
    - I (Intermediate) forms, then decays as P forms.
    - P (Product) forms as I decays.
    """
    R = np.exp(-0.05 * time)  # Exponential decay for R
    I = (time * np.exp(-0.05 * time)) / 10  # Intermediate formation & decay
    P = (1 - np.exp(-0.05 * time))  # Product formation

    return np.array([R, I, P]).T  # Transpose to match (time, species)

# Initialize a NumPy array to store absorbance data (4D array)
# Shape: (Experiments, Time, Wavelengths, 1)
absorbance_data = np.zeros((num_experiments, len(common_time), len(wavelengths), 1))

# Generate data for each experiment
for exp_idx in range(num_experiments):
    reaction_data = reaction_profiles(common_time)  # Get kinetic profiles

    # Introduce slight experimental variations using a random factor
    noise = np.random.normal(0, 0.02, reaction_data.shape)  # Small noise added
    absorbance_data[exp_idx, :, :, 0] = reaction_data + noise  # Store data

# Function to plot absorbance vs time at selected wavelengths
def plot_absorbance_at_wavelengths():
    """
    Plots absorbance vs time for selected wavelengths (450nm, 550nm, 650nm),
    representing Reactant, Intermediate, and Product formation.
    """
    experiment_labels = [f'Experiment {i+1}' for i in range(num_experiments)]
    colors = plt.cm.tab10(np.linspace(0, 1, num_experiments))  # Assign colors

    # Create subplots for each wavelength
    fig, axes = plt.subplots(1, len(wavelengths), figsize=(15, 5), sharey=True)

    for ax, wavelength_idx in zip(axes, range(len(wavelengths))):
        for exp_idx in range(num_experiments):
            absorbance_values = absorbance_data[exp_idx, :, wavelength_idx, 0]  # Extract data

            ax.plot(common_time, absorbance_values, marker='o', linestyle='-',
                    color=colors[exp_idx], label=experiment_labels[exp_idx])

        ax.set_title(f'Absorbance at {wavelengths[wavelength_idx]} nm')
        ax.set_xlabel('Time (s)')
        ax.set_ylabel('Absorbance')
        ax.legend()
        ax.grid(True)

    plt.tight_layout()
    plt.show()

# Plot absorbance at 450nm, 550nm, and 650nm
plot_absorbance_at_wavelengths()

../../_images/cd5c0752e3e1d0e5055ad1d8934b578bdb1c847e518fbd272d1a9b14cedb1761.png

# Show data of second experiment
absorbance_data[1, :, :, 0]

array([[ 9.87965868e-01,  3.70455637e-02, -2.69944495e-04],
       [ 5.52599202e-01,  6.53954699e-01,  4.01829706e-01],
       [ 3.33370260e-01,  6.92346570e-01,  6.44243291e-01],
       [ 1.92812828e-01,  6.44354674e-01,  8.14551763e-01],
       [ 1.06055058e-01,  4.75613585e-01,  8.62061537e-01],
       [ 4.77796399e-02,  3.36212358e-01,  9.58965921e-01],
       [ 4.25463591e-02,  2.02565819e-01,  9.70807686e-01],
       [ 1.27664301e-02,  1.45657704e-01,  9.91765450e-01],
       [ 3.23636189e-02,  1.23013411e-01,  9.71472021e-01],
       [ 5.53699482e-04,  7.40047386e-02,  1.01277296e+00]])

# Show first four temporal measurements of third experiment
absorbance_data[2, 0:4, :, 0]

array([[ 0.99041652, -0.00371318, -0.0221267 ],
       [ 0.54982929,  0.65375432,  0.45337138],
       [ 0.32775279,  0.75161063,  0.67803973],
       [ 0.17597321,  0.63681325,  0.84188513]])

# Show the decay of the first experiment
absorbance_data[0, :, 0, 0]

array([1.00993428e+00, 6.04214018e-01, 3.60777244e-01, 1.99726804e-01,
       1.13207269e-01, 5.09307734e-02, 1.75135118e-02, 1.59525497e-02,
       8.55973967e-04, 1.42519074e-02])

#Find the values of the reactant and product when the intermediate is max.
print(absorbance_data.shape)
# Find the time index where Intermediate (I) is max for each experiment
max_I_indices = np.argmax(absorbance_data[:, :, 1, 0], axis=1)  # I at index 1 (550 nm)

# Extract corresponding R (450 nm) and P (650 nm) values
R_at_max_I = absorbance_data[np.arange(num_experiments), max_I_indices, 0, 0]  # R at 450nm
P_at_max_I = absorbance_data[np.arange(num_experiments), max_I_indices, 2, 0]  # P at 650nm

# Print results
for exp_idx in range(num_experiments):
    time_at_max_I = common_time[max_I_indices[exp_idx]]
    print(f"Experiment {exp_idx+1}:")
    print(f"  Time at max I: {time_at_max_I:.2f} s")
    print(f"  R (450nm) at max I: {R_at_max_I[exp_idx]:.3f}")
    print(f"  P (650nm) at max I: {P_at_max_I[exp_idx]:.3f}\n")

(4, 10, 3, 1)
Experiment 1:
  Time at max I: 22.22 s
  R (450nm) at max I: 0.361
  P (650nm) at max I: 0.661

Experiment 2:
  Time at max I: 22.22 s
  R (450nm) at max I: 0.333
  P (650nm) at max I: 0.644

Experiment 3:
  Time at max I: 22.22 s
  R (450nm) at max I: 0.328
  P (650nm) at max I: 0.678

Experiment 4:
  Time at max I: 22.22 s
  R (450nm) at max I: 0.335
  P (650nm) at max I: 0.671

6. Masks#

A mask is a way of filtering data based on a condition.

Instead of looping through a dataset and using if statements, we can use a Boolean mask to select specific values efficiently.

In NumPy, masks are frequently used for filtering arrays.

.nonzero()#

nonzero() returns the indices of elements that are non-zero (or True for boolean arrays). For a 1D array, it returns a tuple containing a single array of indices. For multi-dimensional arrays, it returns a tuple of arrays, one for each dimension.

import numpy as np

# Define the array
darray = np.array([3, 1, 8, -8, 2, 8])

# Create a boolean mask where values equal to 8 are True
mask = (darray == 8)
print("Mask:", mask)  # [False False  True False False  True]

# Use nonzero() to find indices where mask is True
print("nonzero() result:", mask.nonzero())  # returns a tuple of arrays
print("Extracted indices:", mask.nonzero()[0])  # extract the first array from the tuple

# Separate analysis of the types
nonzero_result = mask.nonzero()
print("nonzero_result:", nonzero_result)  # (array([2, 5]),)
print("type of nonzero_result:", type(nonzero_result))  # <class 'tuple'>
print("type of nonzero_result[0]:", type(nonzero_result[0]))  # <class 'numpy.ndarray'>

Mask: [False False  True False False  True]
nonzero() result: (array([2, 5]),)
Extracted indices: [2 5]
nonzero_result: (array([2, 5]),)
type of nonzero_result: <class 'tuple'>
type of nonzero_result[0]: <class 'numpy.ndarray'>

Example: Finding Temperatures Above a Threshold

import numpy as np

# Temperature readings in °C
temperatures = np.array([20, 25, 30, 15, 10, 35, 40])

# Mask for temperatures above 25°C
high_temps = temperatures > 25  
print(high_temps)  # [False False  True False False  True  True]

# Use the mask to filter values
filtered_temps = temperatures[high_temps]
print(filtered_temps)  # [30 35 40]

Why is this useful?

Instead of writing a loop with an if statement, we filter data in a single operation.
Faster execution for large datasets.
Useful for analyzing experimental datasets (e.g., selecting only high-energy molecules).

Why NumPy Handles True Boolean Masks#

Boolean Indexing
- NumPy allows direct indexing using a Boolean array (mask), which is not possible with standard lists.
- Example:
```
import numpy as np
arr = np.array([10, 15, 20, 25])
mask = arr > 15  # [False, False, True, True]
print(arr[mask])  # Output: [20 25]
```
  - This directly filters elements based on the mask.
  - Lists don’t support this; you’d need a list comprehension instead.
Performance Optimization
- NumPy operates at C-level speed under the hood, making Boolean masks much faster than loops or comprehensions in pure Python.
Logical Operations on Masks
- You can combine conditions efficiently using NumPy’s element-wise operations:
```
mask = (arr > 10) & (arr < 25)  # Using bitwise AND for multiple conditions
print(arr[mask])
```
  - Standard lists require multiple loops or conditions inside a comprehension.

import numpy as np

# Temperature readings in °C
temperatures = np.array([20, 25, 30, 15, 10, 35, 40])

# Mask for temperatures above 25°C
high_temps = temperatures > 25  
print(high_temps)  # [False False  True False False  True  True]

# Use the mask to filter values
filtered_temps = temperatures[high_temps]
print(filtered_temps)  # [30 35 40]

[False False  True False False  True  True]
[30 35 40]

random mask (the following will print between 0 and 10 values that were generated as true

atomic_numbers = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
random_mask = np.random.randint(0, 2, size=10, dtype=bool)  # Random 0/1 mask
selected_data = atomic_numbers[random_mask]  
print(selected_data)  # Random subset of elements

[2 4 5 6 7]

.where#

import numpy as np
data = np.array([3,1,8,-8,2,8])
mask = np.where(data == 8, np.arange(np.size(data)), 0)
print(mask)

[0 0 2 0 0 5]

8. Missing Data#

IEEE NaN (Not a Number)#

IEEE standard for floating point arithmetic.
Represents the absence of a valid numerical value in floating point systems

Using Boolean Masks for NaN#

import numpy as np

arr = np.array([1.0, 2.0, np.nan, 4.0, 5.0])
print(arr)

[ 1.  2. nan  4.  5.]

nan_mask = np.isnan(arr)
print(nan_mask)

[False False  True False False]

nan_mask = ~np.isnan(arr)
print(nan_mask)

[ True  True False  True  True]

The ~ is the Bitwise NOT operator and inverts the boolean mask

~ inverts the Boolean mask:
- True → False
- False → True

arr_no_nan = arr[~np.isnan(arr)]
print(arr_no_nan)

[1. 2. 4. 5.]

9. NumPy Statistics#

Function	Description	Example
`np.mean()`	Calculates the arithmetic mean	`np.mean([1, 2, 3, 4, 5])`
`np.median()`	Calculates the median (middle value)	`np.median([1, 2, 3, 4, 5])`
`np.std()`	Calculates the standard deviation	`np.std([1, 2, 3, 4, 5])`
`np.var()`	Calculates the variance	`np.var([1, 2, 3, 4, 5])`
`np.min()`	Finds the minimum value	`np.min([1, 2, 3, 4, 5])`
`np.max()`	Finds the maximum value	`np.max([1, 2, 3, 4, 5])`
`np.percentile()`	Calculates the qth percentile	`np.percentile([1, 2, 3, 4, 5], 50)`
`np.sum()`	Calculates the sum of all elements	`np.sum([1, 2, 3, 4, 5])`
`np.prod()`	Calculates the product of all elements	`np.prod([1, 2, 3, 4, 5])`
`np.cumsum()`	Calculates the cumulative sum	`np.cumsum([1, 2, 3, 4, 5])`
`np.cumprod()`	Calculates the cumulative product	`np.cumprod([1, 2, 3, 4, 5])`
`np.argmin()`	Returns the index of the minimum value	`np.argmin([3, 1, 2, 4, 5])`
`np.argmax()`	Returns the index of the maximum value	`np.argmax([3, 1, 2, 4, 5])`
`np.ptp()`	Range of values (max - min)	`np.ptp([1, 2, 3, 4, 5])`

10. Numpy NaN Statistics#

Function	Description	Example
`np.nanmax(arr)`	Maximum value, ignoring NaNs	`np.nanmax([1, 2, np.nan, 4])`
`np.nanmin(arr)`	Minimum value, ignoring NaNs	`np.nanmin([1, np.nan, 3, 4])`
`np.nanmean(arr)`	Arithmetic mean, ignoring NaNs	`np.nanmean([1, 2, np.nan, 4])`
`np.nanmedian(arr)`	Median value, ignoring NaNs	`np.nanmedian([1, np.nan, 3, 4])`
`np.nansum(arr)`	Sum of elements, ignoring NaNs	`np.nansum([1, 2, np.nan, 4])`
`np.nanstd(arr)`	Standard deviation, ignoring NaNs	`np.nanstd([1, 2, np.nan, 4])`
`np.nanvar(arr)`	Variance, ignoring NaNs	`np.nanvar([1, 2, np.nan, 4])`
`np.nanprod(arr)`	Product of elements, ignoring NaNs	`np.nanprod([1, 2, np.nan, 4])`
`np.nancumsum(arr)`	Cumulative sum, ignoring NaNs	`np.nancumsum([1, 2, np.nan, 4])`
`np.nancumprod(arr)`	Cumulative product, ignoring NaNs	`np.nancumprod([1, 2, np.nan, 4])`
`np.nanpercentile(arr, q)`	Compute percentiles, ignoring NaNs	`np.nanpercentile([1, np.nan, 3, 4], 50)`
`np.nanquantile(arr, q)`	Compute quantiles, ignoring NaNs	`np.nanquantile([1, np.nan, 3, 4], 0.5)`

12. Other Numpy DataTypes#

besides ndarray there are several other datatypes provided by the numpy package

9.1: Structured arrays (np.array)#

Allows defining heterogeneous data types .
Example:

# Define a structured dtype with multiple fields
dtype = [('name', 'U10'), ('mass', 'f4'), ('charge', 'i4')]

# Create a structured array with different field types
structured_arr = np.array([
    ('Electron', 9.11e-31, -1),
    ('Proton', 1.67e-27, +1)
], dtype=dtype)

print(structured_arr)
print(type(structured_arr))  # <class 'numpy.ndarray'>

# Define a structured dtype with multiple fields
dtype = [('name', 'U10'), ('mass', 'f4'), ('charge', 'i4')]

# Create a structured array with different field types
structured_arr = np.array([
    ('Electron', 9.11e-31, -1),
    ('Proton', 1.67e-27, +1)
], dtype=dtype)

print(structured_arr)
print(type(structured_arr))  # <class 'numpy.ndarray'>

[('Electron', 9.11e-31, -1) ('Proton', 1.67e-27,  1)]
<class 'numpy.ndarray'>

How Are Structured Arrays Different from Regular `ndarray`s?#

Feature	Regular `ndarray`	Structured Array (`np.ndarray` with `dtype`)
Data Type	Homogeneous (`dtype` is the same for all elements)	Heterogeneous (`dtype` varies across fields)
Row Access	Indexed by number (`arr[i]`)	Indexed by number (`arr[i]`)
Column Access	Not supported natively	Accessed by field names (`arr['column']`)
Resemblance to DataFrames?	No	Yes (similar to structured tables)

9.2: Masked arrays (np.ma.array)#

A subclass of ndarray that allows elements to be masked (ignored) in computations.
Useful for handling missing or invalid data without affecting computations.

Example:

import numpy as np
a = np.ma.array([1, 2, 3, 4], mask=[False, True, False, False])
print(a.mean())  # Ignores masked elements

9.3: Record Arrays (np.recarray)#

A variation of structured arrays with attribute-style access. Example: rec = np.rec.array([(‘Hydrogen’, 1.008), (‘Oxygen’, 16.00)], dtype=[(‘element’, ‘U10’), (‘mass’, ‘f4’)]) print(rec.element) # Access column as an attribute

Acknowledgements#

This content was developed with assistance from Perplexity AI and Chat GPT. Multiple queries were made during the Fall 2024 and the Spring 2025.