2. Pandas

2. Pandas#

1. Introduction#

Pandas stands for Panel data, which is an economics term for multi-dimensional datasets. Pandas is built on top of Numpy and has two primary data structures

Series (1D data structure)
Dataframe (2D data structure)

If you want higher dimensional data structures you should use the xrray package.

Pandas series and dataframes are typically easier to work with than Numpy arrays because they can use labels as index numbers. That is, you can give a row or a column a heading label and then using the label to obtain the data value, whereas in numpy you need to use the index numbers. Pandas can also contain multiple data types while numpy arrays can only have one data type.

Pandas DataFrames vs. Numpy Arrays#

Feature	Pandas DataFrame	NumPy Array
Type	2D table with labeled rows & columns	N-dimensional homogeneous array
Supports multiple data types?	Yes (e.g., strings & numbers in one column)	No (all elements must be the same type)
Supports heterogeneous data?	Yes (each column can have different types)	No (all elements must be the same `dtype`)
Row & column labels	Yes (uses `Index` and column names)	No (only positional indexing)
Access by labels?	Yes (`df.loc["row_label", "column_label"]`)	No (only by index position)
Access by index position?	Yes (`df.iloc[1, 2]`)	Yes (`arr[1, 2]`)
Supports missing values?	Yes (`NaN`)	No (must use masked arrays or `np.nan`)
Built-in functions for missing data?	Yes (`df.fillna()`, `df.dropna()`)	No (requires workarounds)
Supports row/column operations?	Yes (`df.mean(axis=0)`, `df.sum(axis=1)`)	Yes (`arr.mean(axis=0)`, `arr.sum(axis=1)`)
Supports grouping & aggregation?	Yes (`df.groupby()`, `df.agg()`)	No
Supports reshaping?	Yes (`df.pivot()`, `df.melt()`)	Yes (`arr.reshape()`)
Supports merging/joining?	Yes (`df.merge()`, `df.join()`)	No
Mutable	Yes (you can modify/add/delete columns, rows or elements)	Partially (you can modify elements but not change shape)
Read/Write CSV?	Yes (`df.to_csv()`, `pd.read_csv()`)	No (must use Pandas or `np.savetxt()`)
Read/Write JSON?	Yes (`df.to_json()`, `pd.read_json()`)	No
Read/Write Excel?	Yes (`df.to_excel()`, `pd.read_excel()`)	No
Save/Load Binary Format?	Yes (`df.to_pickle()`)	Yes (`np.save()`, `np.load()`)

If you have not installed pandas you need to activate your conda virtual environment and install it using your terminal

conda activate NameOfEnvironment
conda install -c conda-forge pandas

Now check the version that is installed

import pandas as pd
print(pd.__version__)

2.2.3

2. Pandas Object Overview#

2.1: Functions#

Series Functions (series only)#

Function	Syntax	Parameters	Description
`str`	`series.str.method()`	String-specific method	Enables vectorized string operations.
`dt`	`series.dt.method()`	Datetime-specific method	Enables date/time manipulations.
`map()`	`series.map(func)`	`func`: Function to apply	Maps function to each value in the Series.
`unique()`	`series.unique()`	No parameters	Returns unique values in the Series.
`nunique()`	`series.nunique()`	No parameters	Returns the number of unique values.
`value_counts()`	`series.value_counts()`	`normalize`, `sort`, `ascending`, `bins`	Returns frequency counts of unique values.
`idxmax()`	`series.idxmax()`	No parameters	Returns index of the maximum value.
`idxmin()`	`series.idxmin()`	No parameters	Returns index of the minimum value.
`cumsum()`	`series.cumsum()`	No parameters	Computes the cumulative sum.
`cumprod()`	`series.cumprod()`	No parameters	Computes the cumulative product.
`cummax()`	`series.cummax()`	No parameters	Computes the cumulative max.
`cummin()`	`series.cummin()`	No parameters	Computes the cumulative min.
`shift()`	`series.shift(periods=1)`	`periods` (int, default=1)	Shifts values up or down.
`diff()`	`series.diff(periods=1)`	`periods` (int, default=1)	Computes the difference between elements.
`rank()`	`series.rank(method='average')`	`method`: Ranking method	Computes ranks of elements.

Dataframe functions (df only)#

Function	Syntax	Parameters	Description
`groupby()`	`df.groupby(by)`	`by`: Column(s) to group by	Groups data and applies aggregation.
`pivot()`	`df.pivot(index, columns, values)`	`index`, `columns`, `values`	Reshapes data by pivoting columns.
`pivot_table()`	`df.pivot_table(values, index, columns)`	`values`, `index`, `columns`, `aggfunc`	Similar to `pivot()`, but allows aggregation.
`melt()`	`df.melt(id_vars, value_vars)`	`id_vars`, `value_vars`	Converts wide format to long format.
`merge()`	`df.merge(df2, on, how)`	`on`: Key column, `how`: Join type	Merges two DataFrames on a key.
`join()`	`df.join(df2, on)`	`on`: Key column	Joins two DataFrames based on index.
`concat()`	`pd.concat([df1, df2], axis=0)`	`axis`: 0 for rows, 1 for columns	Concatenates multiple DataFrames.
`stack()`	`df.stack()`	No parameters	Converts columns into rows (long format).
`unstack()`	`df.unstack()`	No parameters	Converts rows into columns (wide format).
`explode()`	`df.explode(column)`	`column`: Column to explode	Expands list-like column values into rows.
`T`	`df.T`	No parameters	Transposes rows and columns.
`corr()`	`df.corr()`	No parameters	Computes correlation between columns.
`cov()`	`df.cov()`	No parameters	Computes covariance between columns.
`duplicated()`	`df.duplicated(subset, keep='first')`	`subset`: Columns to check	Identifies duplicate rows.
`drop_duplicates()`	`df.drop_duplicates(subset, keep='first')`	`subset`: Columns to check	Removes duplicate rows.
`sample()`	`df.sample(n=5, frac=None)`	`n`: Number of samples, `frac`: Fraction	Returns a random sample.
`nlargest()`	`df.nlargest(n, columns)`	`n`: Number of rows, `columns`: Sort column	Returns top `n` largest values in a column.
`nsmallest()`	`df.nsmallest(n, columns)`	`n`: Number of rows, `columns`: Sort column	Returns top `n` smallest values in a column.
`fillna()`	`df.fillna(value)`	`value`: Value to fill missing data	Fills missing values.
`dropna()`	`df.dropna(axis=0)`	`axis`: 0 (rows), 1 (columns)	Drops missing values.
`replace()`	`df.replace(to_replace, value)`	`to_replace`, `value`	Replaces specific values.

Series and dataframe functions#

Function	Syntax	Parameters	Description
`head()`	`df.head(n)`	`n` (int, default=5): Number of rows to return	Returns the first `n` rows.
`tail()`	`df.tail(n)`	`n` (int, default=5): Number of rows to return	Returns the last `n` rows.
`describe()`	`df.describe()`	`percentiles`, `include`, `exclude`	Generates summary statistics for numerical data.
`count()`	`df.count()`	`axis=0` (count rows) or `axis=1` (count columns)	Counts non-null values.
`sum()`	`df.sum(axis=0)`	`axis=0` (column-wise) or `axis=1` (row-wise)	Computes sum of values.
`mean()`	`df.mean(axis=0)`	`axis=0` (default, column-wise)	Computes mean of values.
`min()`	`df.min(axis=0)`	`axis=0` (default, column-wise)	Returns the minimum value.
`max()`	`df.max(axis=0)`	`axis=0` (default, column-wise)	Returns the maximum value.
`std()`	`df.std(axis=0)`	`axis=0` (default, column-wise)	Computes standard deviation.
`var()`	`df.var(axis=0)`	`axis=0` (default, column-wise)	Computes variance.
`median()`	`df.median(axis=0)`	`axis=0` (default, column-wise)	Computes the median value.
`mode()`	`df.mode()`	No parameters	Returns the most frequent value(s).
`abs()`	`df.abs()`	No parameters	Returns absolute values.
`clip()`	`df.clip(lower, upper)`	`lower`, `upper`: Clip values to limits	Clips values to a range.
`apply()`	`df.apply(func, axis=0)`	`func`: Function to apply, `axis`	Applies a function element-wise.
`map()`	`df.map(func)`	`func`: Function to apply	Applies function to each element (Series only).
`astype()`	`df.astype(dtype)`	`dtype`: Target data type	Converts data type of elements.

2.2 Attributes#

Series Attributes#

Attribute	Syntax	Description
`name`	`series.name`	Returns or sets the name of the Series.
`dtype`	`series.dtype`	Returns the data type of the Series.
`nbytes`	`series.nbytes`	Returns the total memory usage of the Series (in bytes).
`T`	`series.T`	Returns the Series itself (useful for compatibility with DataFrames).
`hasnans`	`series.hasnans`	Returns `True` if the Series contains `NaN` values.
`is_unique`	`series.is_unique`	Returns `True` if all values in the Series are unique.

Dataframe Attributes#

Attribute	Syntax	Description
`columns`	`df.columns`	Returns column labels of the DataFrame.
`axes`	`df.axes`	Returns a list of row and column index labels.
`T`	`df.T`	Returns the transposed DataFrame (swaps rows and columns).
`info()`	`df.info()`	Displays a summary of the DataFrame (index, dtypes, memory usage).
`memory_usage()`	`df.memory_usage()`	Returns the memory usage of each column.
`select_dtypes()`	`df.select_dtypes(include=[...])`	Selects columns of a specific dtype.
`nbytes`	`df.nbytes`	Returns the total memory usage of the DataFrame (in bytes).

Series and Dataframe Attributes#

These work on both

Attribute	Syntax	Description
`index`	`df.index`	Returns the row index labels.
`columns`	`df.columns`	Returns the column labels (DataFrame only).
`shape`	`df.shape`	Returns the dimensions (rows, columns).
`size`	`df.size`	Returns the total number of elements (rows × columns).
`ndim`	`df.ndim`	Returns the number of dimensions (1D for Series, 2D for DataFrame).
`values`	`df.values`	Returns data as a NumPy array.
`dtypes`	`df.dtypes`	Returns the data type of each column.
`empty`	`df.empty`	Returns `True` if the object is empty.

3. Pandas Series#

3.2 Creating Series#

pd.Series() constructor#

General syntax for the pd.Series() Constructor:

pd.Series(data=None, index=None, dtype=None, name=None, copy=False)

Parameter	Description
`data`	The main data (can be list, dict, array, scalar, etc.)
`index`	Labels for the Series elements (default: auto-generated integers)
`dtype`	The data type (e.g., `float`, `int`, `str`)
`name`	Optional name for the Series
`copy`	If `True`, makes a copy of the data

From Lists#

Index can be manually assigned or autogenerated (0,1,2,…)
By default produces a view if possible

import pandas as pd

# Creating a Series: Boiling points (in °C) of halogens
boiling_points = pd.Series(
    [-188.1, -34.0, 59.5, 184.4, 336.8],  # Values
    index=["F", "Cl", "Br", "I", "At"],  # Element symbols as labels
    name="Boiling Point (°C)"
    )


# Display the Series
print(boiling_points)

# Example Operations
print("\nHighest Boiling Point:", boiling_points.max())  # Get the highest boiling point
print("\nSorted Boiling Points:\n", boiling_points.sort_values())  # Sort in ascending order
print("\nRanked Boiling Points:\n", boiling_points.rank())  # Rank the elements

F    -188.1
Cl    -34.0
Br     59.5
I     184.4
At    336.8
Name: Boiling Point (°C), dtype: float64

Highest Boiling Point: 336.8

Sorted Boiling Points:
 F    -188.1
Cl    -34.0
Br     59.5
I     184.4
At    336.8
Name: Boiling Point (°C), dtype: float64

Ranked Boiling Points:
 F     1.0
Cl    2.0
Br    3.0
I     4.0
At    5.0
Name: Boiling Point (°C), dtype: float64

From Numpy Arrays#

import numpy as np
import pandas as pd
arr = np.array([1,2,3,4])
s=pd.Series(arr, index=(['H','He', 'Li', 'Be']))
print(s)

H     1
He    2
Li    3
Be    4
dtype: int64

From Dictionary#

# Creating a Series from a dictionary (keys become the index)
data = {"H": 1.008, "He": 4.0026, "Li": 6.94, "Be": 9.0122}
s = pd.Series(data, name="Atomic Mass (g/mol)")
print(s)

H     1.0080
He    4.0026
Li    6.9400
Be    9.0122
Name: Atomic Mass (g/mol), dtype: float64

From Scalar Value#

-useful for initializing placeholders

s = pd.Series(1, index=["A", "B", "C", "D"])
print(s)

A    1
B    1
C    1
D    1
dtype: int64

s = pd.Series(1, index=[“A”, “B”, “C”, “D”]) print(s)

Indirect Methods#

From DataFrame Column#

import pandas as pd

# Creating a DataFrame
df = pd.DataFrame({"Element": ["H", "He", "Li"], "Atomic Mass": [1.008, 4.0026, 6.94]})

# Extracting a single column as a Series
atomic_mass_series = df["Atomic Mass"]
print(type(atomic_mass_series))  # <class 'pandas.core.series.Series'>
print(atomic_mass_series)

<class 'pandas.core.series.Series'>
0    1.0080
1    4.0026
2    6.9400
Name: Atomic Mass, dtype: float64

Using apply() on dataframe column#

df["Atomic Mass Squared"] = df["Atomic Mass"].apply(lambda x: x ** 2)
print(type(df["Atomic Mass Squared"]))  # <class 'pandas.core.series.Series'>
print(df["Atomic Mass Squared"])

<class 'pandas.core.series.Series'>
0     1.016064
1    16.020807
2    48.163600
Name: Atomic Mass Squared, dtype: float64

Using map() to a dataframe column#

import pandas as pd

# Step 1: Create a DataFrame of electronegativity values for halogens
halogens = pd.DataFrame({
    "Element": ["F", "Cl", "Br", "I", "At"],
    "Electronegativity": [3.98, 3.16, 2.96, 2.66, 2.2]
})

# Step 2: Define a function to classify elements based on electronegativity
def classify_electronegativity(value):
    if value >= 3.5:
        return "Highly Electronegative"
    elif value >= 2.5:
        return "Moderately Electronegative"
    else:
        return "Low Electronegativity"

# Step 3: Use `map()` to create a new Series with classifications
electronegativity_class = halogens["Electronegativity"].map(classify_electronegativity)

# Step 4: Add the new classification Series to the DataFrame
halogens["Electronegativity Category"] = electronegativity_class

# Step 5: Display the Series and DataFrame
print(electronegativity_class)
print(type(electronegativity_class))
print("\n")
print(halogens)
print(type(halogens))
# Step 6 (Optional): Save the DataFrame to a CSV file
halogens.to_csv("halogens_electronegativity.csv", index=False)

      Highly Electronegative
  Moderately Electronegative
  Moderately Electronegative
  Moderately Electronegative
       Low Electronegativity
Name: Electronegativity, dtype: object
<class 'pandas.core.series.Series'>


  Element  Electronegativity  Electronegativity Category
     F               3.98      Highly Electronegative
    Cl               3.16  Moderately Electronegative
    Br               2.96  Moderately Electronegative
     I               2.66  Moderately Electronegative
    At               2.20       Low Electronegativity
<class 'pandas.core.frame.DataFrame'>

Using iloc() on a dataframe#

import pandas as pd

# Create a sample DataFrame with chemical properties
df = pd.DataFrame({
    "Element": ["Na", "Mg", "Al", "Si", "P"],
    "Atomic Number": [11, 12, 13, 14, 15],
    "Electronegativity": [0.93, 1.31, 1.61, 1.90, 2.19]
})

# Extract the "Electronegativity" column using iloc
electronegativity_series = df.iloc[:, 2]  # Selecting column index 2

print(type(electronegativity_series))  # Output: <class 'pandas.core.series.Series'>
print(electronegativity_series)

<class 'pandas.core.series.Series'>
  0.93
  1.31
  1.61
  1.90
  2.19
Name: Electronegativity, dtype: float64

Using loc() on a dataframe#

# Extract "Electronegativity" column using loc
electronegativity_series = df.loc[:, "Electronegativity"]

print(type(electronegativity_series))  # Output: <class 'pandas.core.series.Series'>
print(electronegativity_series)

<class 'pandas.core.series.Series'>
  0.93
  1.31
  1.61
  1.90
  2.19
Name: Electronegativity, dtype: float64

4. Pandas DataFrames#

4.1 DataFrame Creation#

Function	Description
`pd.DataFrame()`	Creates a DataFrame from a dictionary, list, NumPy array, or other data source.
`pd.read_csv()`	Reads a CSV file into a DataFrame.
`pd.read_excel()`	Reads an Excel file into a DataFrame.
`pd.read_json()`	Reads a JSON file into a DataFrame.
`pd.read_sql()`	Reads from a SQL database into a DataFrame.
`pd.read_html()`	Reads an HTML table into a DataFrame.

pd.DataFrame() constructor#

4.2 DataFrame Methods#

Exploration#

Method	Description
`df.head(n)`	Returns the first `n` rows.
`df.tail(n)`	Returns the last `n` rows.
`df.info()`	Displays concise summary of the DataFrame.
`df.describe()`	Generates summary statistics for numerical columns.
`df.shape`	Returns tuple of (rows, columns).
`df.dtypes`	Returns the data types of each column.
`df.columns`	Returns the column labels.
`df.index`	Returns the index (row labels).
`df.sample(n)`	Returns a random sample of `n` rows.

DataFrame Selection and Filtering#

Method	Description
`df['column']`	Selects a single column as a Series.
`df[['col1', 'col2']]`	Selects multiple columns.
`df.iloc[row, col]`	Selects data using integer-location indexing.
`df.loc[row_label, col_label]`	Selects data using label-based indexing.
`df[df['column'] > value]`	Filters rows based on a condition.
`df.query('column > value')`	Queries the DataFrame using an expression.
`df.set_index('column')`	Sets a specific column as the index.
`df.reset_index()`	Resets the index to default integer values.

DataFrame Cleaning#

Method	Description
`df.dropna()`	Removes rows with missing values.
`df.fillna(value)`	Fills missing values with a specified value.
`df.drop(columns=['col1', 'col2'])`	Drops specified columns.
`df.drop(index=[0,1])`	Drops specified rows.
`df.rename(columns={'old':'new'})`	Renames column labels.
`df.duplicated()`	Checks for duplicate rows.
`df.drop_duplicates()`	Removes duplicate rows.
`df.replace(old, new)`	Replaces values in the DataFrame.

DataFrame Transformation#

Method	Description
`df.apply(func, axis=0 or 1)`	Applies a function across rows or columns.
`df.applymap(func)`	Applies a function element-wise.
`df.assign(new_col=value)`	Creates a new column.
`df.astype(dtype)`	Changes the data type of a column.
`df.sort_values(by='column')`	Sorts values based on a column.
`df.sort_index()`	Sorts DataFrame by index.
`df.pivot(index, columns, values)`	Creates a pivot table.

DataFrame Statistics and Aggregation#

Method	Description
`df.groupby('column')`	Groups data by a column.
`df.agg(['mean', 'sum'])`	Applies aggregation functions.
`df.mean()`	Computes mean of numerical columns.
`df.median()`	Computes median.
`df.sum()`	Computes sum.
`df.min()`	Computes minimum.
`df.max()`	Computes maximum.
`df.count()`	Counts non-null values.
`df.nunique()`	Counts unique values in each column.

DataFrame Merging and Concatenation#

Method	Description
`pd.concat([df1, df2])`	Concatenates DataFrames along an axis.
`df1.merge(df2, on='column')`	Merges DataFrames on a common column.
`df1.join(df2, on='column')`	Joins DataFrames based on index.

Dataframe IO#

Method	Description
`df.to_csv('file.csv')`	Writes DataFrame to a CSV file.
`df.to_excel('file.xlsx')`	Writes DataFrame to an Excel file.
`df.to_json('file.json')`	Writes DataFrame to a JSON file.
`df.to_sql('table', con=engine)`	Writes DataFrame to a SQL table.
`pd.read_clipboard()`	Reads data from clipboard into a Pandas DataFrame.
`df.to_clipboard()`	Copies a DataFrame to the clipboard for easy pasting elsewhere.
`pd.read_table("file.txt")`	Reads a tab-separated text file into a Pandas DataFrame.
`df.to_csv("file.txt", sep="\t", index=False)`	Saves a DataFrame as a tab-separated text file.
`pd.read_csv()`	Reads CSV file
pd.read_json()	Reads JSON file

4.3: Handling Missing Data#

import pandas as pd
import numpy as np

# Creating a DataFrame with missing values
data = {
    "Element": ["H", "He", "Li", "Be", "B", "C", "N", "O", "F", "Ne",
                "Na", "Mg", "Al", "Si", "P", "S", "Cl", "Ar"],
    "Atomic_Weight": [1.008, 4.0026, 6.94, 9.0122, 10.81, 12.011, 14.007, 
                      15.999, 18.998, 20.180, 22.990, None, 26.981, 28.085, 
                      30.974, None, 35.45, 39.948]
}

df = pd.DataFrame(data)

# Display the original DataFrame
print("Original DataFrame:")
print(df)

Original DataFrame:
   Element  Atomic_Weight
      H         1.0080
     He         4.0026
     Li         6.9400
     Be         9.0122
      B        10.8100
      C        12.0110
      N        14.0070
      O        15.9990
      F        18.9980
     Ne        20.1800
    Na        22.9900
    Mg            NaN
    Al        26.9810
    Si        28.0850
     P        30.9740
     S            NaN
    Cl        35.4500
    Ar        39.9480

dropna() - remove missing values#

Note the index values are removed and not changed

df_cleaned = df.dropna()
print("\nDataFrame after dropping missing values:")
print(df_cleaned)

DataFrame after dropping missing values:
   Element  Atomic_Weight
      H         1.0080
     He         4.0026
     Li         6.9400
     Be         9.0122
      B        10.8100
      C        12.0110
      N        14.0070
      O        15.9990
      F        18.9980
     Ne        20.1800
    Na        22.9900
    Al        26.9810
    Si        28.0850
     P        30.9740
    Cl        35.4500
    Ar        39.9480

fillna(np.nan)Replacing Missing Values with NaN#

Note, Pandas automatically converts None to NaN, but sometimes other characters get introduced into the data. Also, sometimes missing data can cause problems when doing functions over entire dataframes even if it is a string (remember NaN is a float).

data = {
    "Element": ["H", "He", "Li", None, "B"],
    "State": ["Gas", "Gas", "Solid", None, "Solid"]
}

df1 = pd.DataFrame(data)
print(df1)
# Ensure missing values are explicitly set to NaN
df1 = df1.fillna(np.nan)

print(df1)

  Element  State
     H    Gas
    He    Gas
    Li  Solid
  None   None
     B  Solid
  Element  State
     H    Gas
    He    Gas
    Li  Solid
   NaN    NaN
     B  Solid

Replace Custom Missing Values#

This gets tricky as Pandas is discontinuing type implicit down casting and we have two data types in the Atomic Weight column. We had to force it to a string, replace the values with np.nan, and then convert to a float

import numpy as np
import pandas as pd

# Create DataFrame
data = {
    "Element": ["H", "He", "Li", "Be", "B"],
    "Atomic_Weight": [1.008, 4.0026, "N/A", "?", 10.81]
}

df2 = pd.DataFrame(data)
print("Original DataFrame:")
print(df2)

# ✅ Fully Future-Proof Method
df2["Atomic_Weight"] = df2["Atomic_Weight"].astype("string")  # Force string type to avoid downcasting
df2["Atomic_Weight"] = df2["Atomic_Weight"].replace(["N/A", "?"], np.nan)  # Replace missing values
df2["Atomic_Weight"] = df2["Atomic_Weight"].astype(float)  # Convert to float after replacement

print("\nUpdated DataFrame:")
print(df2)

Original DataFrame:
  Element Atomic_Weight
     H         1.008
    He        4.0026
    Li           N/A
    Be             ?
     B         10.81

Updated DataFrame:
  Element  Atomic_Weight
     H         1.0080
    He         4.0026
    Li            NaN
    Be            NaN
     B        10.8100

Replace Missing Values with the Mean#

df_mean_filled = df.copy()
df_mean_filled["Atomic_Weight"] = df_mean_filled["Atomic_Weight"].fillna(df["Atomic_Weight"].mean())

print("\nDataFrame with missing values replaced by the mean:")
print(df_mean_filled)

DataFrame with missing values replaced by the mean:
   Element  Atomic_Weight
      H       1.008000
     He       4.002600
     Li       6.940000
     Be       9.012200
      B      10.810000
      C      12.011000
      N      14.007000
      O      15.999000
      F      18.998000
     Ne      20.180000
    Na      22.990000
    Mg      18.587238
    Al      26.981000
    Si      28.085000
     P      30.974000
     S      18.587238
    Cl      35.450000
    Ar      39.948000

Replace Missing Value with Custom Values#

df_filled_custom = df.copy()

# ✅ Assign the modified column explicitly (instead of using inplace=True)
df_filled_custom["Atomic_Weight"] = df_filled_custom["Atomic_Weight"].fillna(0)

print("\nDataFrame with missing values replaced by 0:")
print(df_filled_custom)

DataFrame with missing values replaced by 0:
   Element  Atomic_Weight
      H         1.0080
     He         4.0026
     Li         6.9400
     Be         9.0122
      B        10.8100
      C        12.0110
      N        14.0070
      O        15.9990
      F        18.9980
     Ne        20.1800
    Na        22.9900
    Mg         0.0000
    Al        26.9810
    Si        28.0850
     P        30.9740
     S         0.0000
    Cl        35.4500
    Ar        39.9480

Use Boolean mask to remove NaN#

Note, we are using .notna() and not .notnan(). ‘na’ stands for “not available”, and this extends to non numeric data types like strings. the not in notna() inverts the Boolean mask the way ~ does in numpy. So it returns True if the value is a number

Now Pandas uses NaN as a standard marker for missing data, not just an invalid number. This is slightly different than Numpy, which treats it as a float.

df_valid = df[df["Atomic_Weight"].notna()]
print("\nDataFrame with only valid atomic weights:")
print(df_valid)

DataFrame with only valid atomic weights:
   Element  Atomic_Weight
      H         1.0080
     He         4.0026
     Li         6.9400
     Be         9.0122
      B        10.8100
      C        12.0110
      N        14.0070
      O        15.9990
      F        18.9980
     Ne        20.1800
    Na        22.9900
    Al        26.9810
    Si        28.0850
     P        30.9740
    Cl        35.4500
    Ar        39.9480

4. Statistics#

# Creating a DataFrame with physical properties of alkali metals
alkali_metals = pd.DataFrame({
    "Atomic Number": [3, 11, 19, 37, 55],  # Lithium to Cesium
    "Atomic Radius (pm)": [152, 186, 227, 248, 265],
    "Density (g/cm³)": [0.534, 0.97, 0.86, 1.53, 1.87],
    "Melting Point (°C)": [180.5, 97.8, 63.5, 39.3, 28.5]
}, index=["Li", "Na", "K", "Rb", "Cs"])  # Using symbols as row index

# Display the DataFrame
print(alkali_metals)

# Example Operations
print("\nMean Atomic Radius:", alkali_metals["Atomic Radius (pm)"].mean())  # Average atomic radius
print("\nSorted by Melting Point:\n", alkali_metals.sort_values("Melting Point (°C)"))  # Sort by melting point
print("\nDensity Correlation:\n", alkali_metals.corr())  # Correlation between properties

    Atomic Number  Atomic Radius (pm)  Density (g/cm³)  Melting Point (°C)
Li              3                 152            0.534               180.5
Na             11                 186            0.970                97.8
K              19                 227            0.860                63.5
Rb             37                 248            1.530                39.3
Cs             55                 265            1.870                28.5

Mean Atomic Radius: 215.6

Sorted by Melting Point:
     Atomic Number  Atomic Radius (pm)  Density (g/cm³)  Melting Point (°C)
Cs             55                 265            1.870                28.5
Rb             37                 248            1.530                39.3
K              19                 227            0.860                63.5
Na             11                 186            0.970                97.8
Li              3                 152            0.534               180.5

Density Correlation:
                     Atomic Number  Atomic Radius (pm)  Density (g/cm³)  \
Atomic Number            1.000000            0.932637         0.973714   
Atomic Radius (pm)       0.932637            1.000000         0.895427   
Density (g/cm³)          0.973714            0.895427         1.000000   
Melting Point (°C)      -0.855879           -0.969793        -0.856947   

                    Melting Point (°C)  
Atomic Number                -0.855879  
Atomic Radius (pm)           -0.969793  
Density (g/cm³)              -0.856947  
Melting Point (°C)            1.000000  

# Convert melting points from °C to Kelvin (for both Series & DataFrame)
to_kelvin = lambda temp: temp + 273.15

# Apply function to the boiling points Series
boiling_points_K = boiling_points.apply(to_kelvin)
print("\nBoiling Points in Kelvin:\n", boiling_points_K)

# Apply function to the DataFrame column
#print("\n", alkali_metals)
alkali_metals["Melting Point (K)"] = alkali_metals["Melting Point (°C)"].apply(to_kelvin)
print("\nUpdated Alkali Metals DataFrame:\n", alkali_metals)

Boiling Points in Kelvin:
 F      85.05
Cl    239.15
Br    332.65
I     457.55
At    609.95
Name: Boiling Point (°C), dtype: float64

Updated Alkali Metals DataFrame:
     Atomic Number  Atomic Radius (pm)  Density (g/cm³)  Melting Point (°C)  \
Li              3                 152            0.534               180.5   
Na             11                 186            0.970                97.8   
K              19                 227            0.860                63.5   
Rb             37                 248            1.530                39.3   
Cs             55                 265            1.870                28.5   

    Melting Point (K)  
Li             453.65  
Na             370.95  
K              336.65  
Rb             312.45  
Cs             301.65  

5. Activity#

We are going to upload the periodic table from worksheet 4 and explore it in a Pandas Dataframe

Step	Command	Purpose
Shape	`df.shape`	Get number of rows & columns
Column Labels	`df.columns`	See column names
Row & Column Count	`df.shape[0]`, `df.shape[1]`	Get counts separately
Data Types	`df.dtypes`	Check variable types
Data Overview	`df.info()`	Summarize structure & missing values
First Few Rows	`df.head(n)`	Preview top rows
Last Few Rows	`df.tail(n)`	Preview bottom rows
Statistics	`df.describe()`	Get numeric summaries
Missing Data	`df.isnull().sum()`	Identify missing values

# Use the directory structure from Workbook 4
import pandas as pd
import os
base_data_dir = os.path.expanduser("~/data")  # Parent directory
pubchem_data_dir = os.path.join(base_data_dir, "pubchem_data")  # Subdirectory for PubChem
os.makedirs(pubchem_data_dir, exist_ok=True)  # Ensure directories exist
periodictable_csv_datapath = os.path.join(pubchem_data_dir, "PubChemElements_all.csv")
df_periodictable = pd.read_csv(periodictable_csv_datapath)
df_periodictable.head()

	AtomicNumber	Symbol	Name	AtomicMass	CPKHexColor	ElectronConfiguration	Electronegativity	AtomicRadius	IonizationEnergy	ElectronAffinity	OxidationStates	StandardState	MeltingPoint	BoilingPoint	Density	GroupBlock	YearDiscovered
0	1	H	Hydrogen	1.008000	FFFFFF	1s1	2.20	120.0	13.598	0.754	+1, -1	Gas	13.81	20.28	0.000090	Nonmetal	1766
1	2	He	Helium	4.002600	D9FFFF	1s2	NaN	140.0	24.587	NaN	0	Gas	0.95	4.22	0.000179	Noble gas	1868
2	3	Li	Lithium	7.000000	CC80FF	[He]2s1	0.98	182.0	5.392	0.618	+1	Solid	453.65	1615.00	0.534000	Alkali metal	1817
3	4	Be	Beryllium	9.012183	C2FF00	[He]2s2	1.57	153.0	9.323	NaN	+2	Solid	1560.00	2744.00	1.850000	Alkaline earth metal	1798
4	5	B	Boron	10.810000	FFB5B5	[He]2s2 2p1	2.04	192.0	8.298	0.277	+3	Solid	2348.00	4273.00	2.370000	Metalloid	1808

Set the symbols column as the index#

df_periodictable=df_periodictable.set_index("Symbol")
df_periodictable.head()

	AtomicNumber	Name	AtomicMass	CPKHexColor	ElectronConfiguration	Electronegativity	AtomicRadius	IonizationEnergy	ElectronAffinity	OxidationStates	StandardState	MeltingPoint	BoilingPoint	Density	GroupBlock	YearDiscovered
Symbol
H	1	Hydrogen	1.008000	FFFFFF	1s1	2.20	120.0	13.598	0.754	+1, -1	Gas	13.81	20.28	0.000090	Nonmetal	1766
He	2	Helium	4.002600	D9FFFF	1s2	NaN	140.0	24.587	NaN	0	Gas	0.95	4.22	0.000179	Noble gas	1868
Li	3	Lithium	7.000000	CC80FF	[He]2s1	0.98	182.0	5.392	0.618	+1	Solid	453.65	1615.00	0.534000	Alkali metal	1817
Be	4	Beryllium	9.012183	C2FF00	[He]2s2	1.57	153.0	9.323	NaN	+2	Solid	1560.00	2744.00	1.850000	Alkaline earth metal	1798
B	5	Boron	10.810000	FFB5B5	[He]2s2 2p1	2.04	192.0	8.298	0.277	+3	Solid	2348.00	4273.00	2.370000	Metalloid	1808

df_periodictable.tail()

	AtomicNumber	Name	AtomicMass	CPKHexColor	ElectronConfiguration	Electronegativity	AtomicRadius	IonizationEnergy	ElectronAffinity	OxidationStates	StandardState	MeltingPoint	BoilingPoint	Density	GroupBlock	YearDiscovered
Symbol
Fl	114	Flerovium	290.192	NaN	[Rn]7s2 7p2 5f14 6d10 (predicted)	NaN	NaN	NaN	NaN	6, 4,2, 1, 0	Expected to be a Solid	NaN	NaN	NaN	Post-transition metal	1998
Mc	115	Moscovium	290.196	NaN	[Rn]7s2 7p3 5f14 6d10 (predicted)	NaN	NaN	NaN	NaN	3, 1	Expected to be a Solid	NaN	NaN	NaN	Post-transition metal	2003
Lv	116	Livermorium	293.205	NaN	[Rn]7s2 7p4 5f14 6d10 (predicted)	NaN	NaN	NaN	NaN	+4, +2, -2	Expected to be a Solid	NaN	NaN	NaN	Post-transition metal	2000
Ts	117	Tennessine	294.211	NaN	[Rn]7s2 7p5 5f14 6d10 (predicted)	NaN	NaN	NaN	NaN	+5, +3, +1, -1	Expected to be a Solid	NaN	NaN	NaN	Halogen	2010
Og	118	Oganesson	295.216	NaN	[Rn]7s2 7p6 5f14 6d10 (predicted)	NaN	NaN	NaN	NaN	+6, +4, +2, +1, 0, -1	Expected to be a Gas	NaN	NaN	NaN	Noble gas	2006

print(df_periodictable.shape)
print(f'number rows is: {df_periodictable.shape[0]}, \nnumber of columns is {df_periodictable.shape[1]}')
print(df_periodictable.columns)
print(df_periodictable.dtypes)      

(118, 16)
number rows is: 118, 
number of columns is 16
Index(['AtomicNumber', 'Name', 'AtomicMass', 'CPKHexColor',
       'ElectronConfiguration', 'Electronegativity', 'AtomicRadius',
       'IonizationEnergy', 'ElectronAffinity', 'OxidationStates',
       'StandardState', 'MeltingPoint', 'BoilingPoint', 'Density',
       'GroupBlock', 'YearDiscovered'],
      dtype='object')
AtomicNumber               int64
Name                      object
AtomicMass               float64
CPKHexColor               object
ElectronConfiguration     object
Electronegativity        float64
AtomicRadius             float64
IonizationEnergy         float64
ElectronAffinity         float64
OxidationStates           object
StandardState             object
MeltingPoint             float64
BoilingPoint             float64
Density                  float64
GroupBlock                object
YearDiscovered            object
dtype: object

df_periodictable.info

<bound method DataFrame.info of         AtomicNumber         Name  AtomicMass CPKHexColor  \
Symbol                                                      
H                  1     Hydrogen    1.008000      FFFFFF   
He                 2       Helium    4.002600      D9FFFF   
Li                 3      Lithium    7.000000      CC80FF   
Be                 4    Beryllium    9.012183      C2FF00   
B                  5        Boron   10.810000      FFB5B5   
...              ...          ...         ...         ...   
Fl               114    Flerovium  290.192000         NaN   
Mc               115    Moscovium  290.196000         NaN   
Lv               116  Livermorium  293.205000         NaN   
Ts               117   Tennessine  294.211000         NaN   
Og               118    Oganesson  295.216000         NaN   

                    ElectronConfiguration  Electronegativity  AtomicRadius  \
Symbol                                                                       
H                                     1s1               2.20         120.0   
He                                    1s2                NaN         140.0   
Li                                [He]2s1               0.98         182.0   
Be                                [He]2s2               1.57         153.0   
B                             [He]2s2 2p1               2.04         192.0   
...                                   ...                ...           ...   
Fl      [Rn]7s2 7p2 5f14 6d10 (predicted)                NaN           NaN   
Mc      [Rn]7s2 7p3 5f14 6d10 (predicted)                NaN           NaN   
Lv      [Rn]7s2 7p4 5f14 6d10 (predicted)                NaN           NaN   
Ts      [Rn]7s2 7p5 5f14 6d10 (predicted)                NaN           NaN   
Og      [Rn]7s2 7p6 5f14 6d10 (predicted)                NaN           NaN   

        IonizationEnergy  ElectronAffinity        OxidationStates  \
Symbol                                                              
H                 13.598             0.754                 +1, -1   
He                24.587               NaN                      0   
Li                 5.392             0.618                     +1   
Be                 9.323               NaN                     +2   
B                  8.298             0.277                     +3   
...                  ...               ...                    ...   
Fl                   NaN               NaN           6, 4,2, 1, 0   
Mc                   NaN               NaN                   3, 1   
Lv                   NaN               NaN             +4, +2, -2   
Ts                   NaN               NaN         +5, +3, +1, -1   
Og                   NaN               NaN  +6, +4, +2, +1, 0, -1   

                 StandardState  MeltingPoint  BoilingPoint   Density  \
Symbol                                                                 
H                          Gas         13.81         20.28  0.000090   
He                         Gas          0.95          4.22  0.000179   
Li                       Solid        453.65       1615.00  0.534000   
Be                       Solid       1560.00       2744.00  1.850000   
B                        Solid       2348.00       4273.00  2.370000   
...                        ...           ...           ...       ...   
Fl      Expected to be a Solid           NaN           NaN       NaN   
Mc      Expected to be a Solid           NaN           NaN       NaN   
Lv      Expected to be a Solid           NaN           NaN       NaN   
Ts      Expected to be a Solid           NaN           NaN       NaN   
Og        Expected to be a Gas           NaN           NaN       NaN   

                   GroupBlock YearDiscovered  
Symbol                                        
H                    Nonmetal           1766  
He                  Noble gas           1868  
Li               Alkali metal           1817  
Be       Alkaline earth metal           1798  
B                   Metalloid           1808  
...                       ...            ...  
Fl      Post-transition metal           1998  
Mc      Post-transition metal           2003  
Lv      Post-transition metal           2000  
Ts                    Halogen           2010  
Og                  Noble gas           2006  

[118 rows x 16 columns]>

df_periodictable.describe()

	AtomicNumber	AtomicMass	Electronegativity	AtomicRadius	IonizationEnergy	ElectronAffinity	MeltingPoint	BoilingPoint	Density
count	118.000000	118.000000	95.000000	99.000000	102.000000	57.000000	103.000000	93.000000	96.000000
mean	59.500000	146.540281	1.732316	209.464646	7.997255	1.072140	1273.740553	2536.212473	7.608001
std	34.207699	89.768356	0.635187	38.569130	3.339066	0.879163	888.853859	1588.410919	5.878692
min	1.000000	1.008000	0.700000	120.000000	3.894000	0.079000	0.950000	4.220000	0.000090
25%	30.250000	66.480750	1.290000	187.000000	6.020500	0.470000	516.040000	1180.000000	2.572500
50%	59.500000	142.573830	1.620000	209.000000	6.960000	0.754000	1191.000000	2792.000000	7.072000
75%	88.750000	226.777165	2.170000	232.000000	8.998500	1.350000	1806.500000	3618.000000	10.275250
max	118.000000	295.216000	3.980000	348.000000	24.587000	3.617000	3823.000000	5869.000000	22.570000

#Check for missing data for each column
df_periodictable.isnull().sum()

AtomicNumber              0
Name                      0
AtomicMass                0
CPKHexColor              10
ElectronConfiguration     0
Electronegativity        23
AtomicRadius             19
IonizationEnergy         16
ElectronAffinity         61
OxidationStates           1
StandardState             0
MeltingPoint             15
BoilingPoint             25
Density                  22
GroupBlock                0
YearDiscovered            0
dtype: int64

Extracting Information from DataFrame#

df_periodictable.loc['Cr','ElectronConfiguration']

'[Ar]3d5 4s1'

# Define the list of halogens
halogens = ['F', 'Cl', 'Br', 'I', 'At']

# Select ionization energy for halogens
halogen_ionization = df_periodictable.loc[df_periodictable.index.isin(halogens), "IonizationEnergy"]

# Display results
#display(halogen_ionization.to_frame())  # Ensures it shows as a DataFrame
print(halogen_ionization)
print(type(halogen_ionization))

Symbol
F     17.423
Cl    12.968
Br    11.814
I     10.451
At     9.500
Name: IonizationEnergy, dtype: float64
<class 'pandas.core.series.Series'>

Explaining the above code

Create a list of halogens

halogens = ['F', 'Cl', 'Br', 'I', 'At']

Create a boolean mask that is True for halogens

df_periodictable.index.isin(halogens)

Select rows where boolean mask is True and return only “IonizationEnergy”

.loc[df_periodictable.index.isin(halogens), "IonizationEnergy"]

Assign to variable halogen_ionization

halogen_ionization = df_periodictable.loc[df_periodictable.index.isin(halogens), "IonizationEnergy"]

display(halogen_ionization.to_frame())  # Ensures it shows as a DataFrame

Display is a jupyter function and will only work in notebooks, you need to spring to a string if you want to print it out

print(halogen_ionization)

Symbol
F     17.423
Cl    12.968
Br    11.814
I     10.451
At     9.500
Name: IonizationEnergy, dtype: float64

print(halogen_ionization.to_string())

Symbol
F     17.423
Cl    12.968
Br    11.814
I     10.451
At     9.500

print(halogen_ionization.to_frame())
print(type(halogen_ionization.to_frame()))

        IonizationEnergy
Symbol                  
F                 17.423
Cl                12.968
Br                11.814
I                 10.451
At                 9.500
<class 'pandas.core.frame.DataFrame'>

print(halogen_ionization.to_frame().to_string())
print(type(halogen_ionization.to_frame().to_string()))

        IonizationEnergy
Symbol                  
F                 17.423
Cl                12.968
Br                11.814
I                 10.451
At                 9.500
<class 'str'>

halogen_melt_boil_pts = df_periodictable.loc[df_periodictable.index.isin(halogens), ["MeltingPoint", "BoilingPoint"]]
print(halogen_melt_boil_pts)
print(type(halogen_melt_boil_pts))

        MeltingPoint  BoilingPoint
Symbol                            
F              53.53         85.03
Cl            171.65        239.11
Br            265.95        331.95
I             386.85        457.55
At            575.00           NaN
<class 'pandas.core.frame.DataFrame'>

Acknowledgements#

This content was developed with assistance from Perplexity AI and Chat GPT. Multiple queries were made during the Fall 2024 and the Spring 2025.