2. Pandas#

1. Introduction#

Pandas stands for Panel data, which is an economics term for multi-dimensional datasets. Pandas is built on top of Numpy and has two primary data structures

  1. Series (1D data structure)

  2. Dataframe (2D data structure)

If you want higher dimensional data structures you should use the xrray package.

Pandas series and dataframes are typically easier to work with than Numpy arrays because they can use labels as index numbers. That is, you can give a row or a column a heading label and then using the label to obtain the data value, whereas in numpy you need to use the index numbers. Pandas can also contain multiple data types while numpy arrays can only have one data type.

Pandas DataFrames vs. Numpy Arrays#


Feature

Pandas DataFrame

NumPy Array

Type

2D table with labeled rows & columns

N-dimensional homogeneous array

Supports multiple data types?

Yes (e.g., strings & numbers in one column)

No (all elements must be the same type)

Supports heterogeneous data?

Yes (each column can have different types)

No (all elements must be the same dtype)

Row & column labels

Yes (uses Index and column names)

No (only positional indexing)

Access by labels?

Yes (df.loc["row_label", "column_label"])

No (only by index position)

Access by index position?

Yes (df.iloc[1, 2])

Yes (arr[1, 2])

Supports missing values?

Yes (NaN)

No (must use masked arrays or np.nan)

Built-in functions for missing data?

Yes (df.fillna(), df.dropna())

No (requires workarounds)

Supports row/column operations?

Yes (df.mean(axis=0), df.sum(axis=1))

Yes (arr.mean(axis=0), arr.sum(axis=1))

Supports grouping & aggregation?

Yes (df.groupby(), df.agg())

No

Supports reshaping?

Yes (df.pivot(), df.melt())

Yes (arr.reshape())

Supports merging/joining?

Yes (df.merge(), df.join())

No

Mutable

Yes (you can modify/add/delete columns, rows or elements)

Partially (you can modify elements but not change shape)

Read/Write CSV?

Yes (df.to_csv(), pd.read_csv())

No (must use Pandas or np.savetxt())

Read/Write JSON?

Yes (df.to_json(), pd.read_json())

No

Read/Write Excel?

Yes (df.to_excel(), pd.read_excel())

No

Save/Load Binary Format?

Yes (df.to_pickle())

Yes (np.save(), np.load())

If you have not installed pandas you need to activate your conda virtual environment and install it using your terminal

conda activate NameOfEnvironment
conda install -c conda-forge pandas

Now check the version that is installed

import pandas as pd
print(pd.__version__)
2.2.3

2. Pandas Object Overview#

2.1: Functions#

Series Functions (series only)#

Function

Syntax

Parameters

Description

str

series.str.method()

String-specific method

Enables vectorized string operations.

dt

series.dt.method()

Datetime-specific method

Enables date/time manipulations.

map()

series.map(func)

func: Function to apply

Maps function to each value in the Series.

unique()

series.unique()

No parameters

Returns unique values in the Series.

nunique()

series.nunique()

No parameters

Returns the number of unique values.

value_counts()

series.value_counts()

normalize, sort, ascending, bins

Returns frequency counts of unique values.

idxmax()

series.idxmax()

No parameters

Returns index of the maximum value.

idxmin()

series.idxmin()

No parameters

Returns index of the minimum value.

cumsum()

series.cumsum()

No parameters

Computes the cumulative sum.

cumprod()

series.cumprod()

No parameters

Computes the cumulative product.

cummax()

series.cummax()

No parameters

Computes the cumulative max.

cummin()

series.cummin()

No parameters

Computes the cumulative min.

shift()

series.shift(periods=1)

periods (int, default=1)

Shifts values up or down.

diff()

series.diff(periods=1)

periods (int, default=1)

Computes the difference between elements.

rank()

series.rank(method='average')

method: Ranking method

Computes ranks of elements.

Dataframe functions (df only)#

Function

Syntax

Parameters

Description

groupby()

df.groupby(by)

by: Column(s) to group by

Groups data and applies aggregation.

pivot()

df.pivot(index, columns, values)

index, columns, values

Reshapes data by pivoting columns.

pivot_table()

df.pivot_table(values, index, columns)

values, index, columns, aggfunc

Similar to pivot(), but allows aggregation.

melt()

df.melt(id_vars, value_vars)

id_vars, value_vars

Converts wide format to long format.

merge()

df.merge(df2, on, how)

on: Key column, how: Join type

Merges two DataFrames on a key.

join()

df.join(df2, on)

on: Key column

Joins two DataFrames based on index.

concat()

pd.concat([df1, df2], axis=0)

axis: 0 for rows, 1 for columns

Concatenates multiple DataFrames.

stack()

df.stack()

No parameters

Converts columns into rows (long format).

unstack()

df.unstack()

No parameters

Converts rows into columns (wide format).

explode()

df.explode(column)

column: Column to explode

Expands list-like column values into rows.

T

df.T

No parameters

Transposes rows and columns.

corr()

df.corr()

No parameters

Computes correlation between columns.

cov()

df.cov()

No parameters

Computes covariance between columns.

duplicated()

df.duplicated(subset, keep='first')

subset: Columns to check

Identifies duplicate rows.

drop_duplicates()

df.drop_duplicates(subset, keep='first')

subset: Columns to check

Removes duplicate rows.

sample()

df.sample(n=5, frac=None)

n: Number of samples, frac: Fraction

Returns a random sample.

nlargest()

df.nlargest(n, columns)

n: Number of rows, columns: Sort column

Returns top n largest values in a column.

nsmallest()

df.nsmallest(n, columns)

n: Number of rows, columns: Sort column

Returns top n smallest values in a column.

fillna()

df.fillna(value)

value: Value to fill missing data

Fills missing values.

dropna()

df.dropna(axis=0)

axis: 0 (rows), 1 (columns)

Drops missing values.

replace()

df.replace(to_replace, value)

to_replace, value

Replaces specific values.

Series and dataframe functions#

Function

Syntax

Parameters

Description

head()

df.head(n)

n (int, default=5): Number of rows to return

Returns the first n rows.

tail()

df.tail(n)

n (int, default=5): Number of rows to return

Returns the last n rows.

describe()

df.describe()

percentiles, include, exclude

Generates summary statistics for numerical data.

count()

df.count()

axis=0 (count rows) or axis=1 (count columns)

Counts non-null values.

sum()

df.sum(axis=0)

axis=0 (column-wise) or axis=1 (row-wise)

Computes sum of values.

mean()

df.mean(axis=0)

axis=0 (default, column-wise)

Computes mean of values.

min()

df.min(axis=0)

axis=0 (default, column-wise)

Returns the minimum value.

max()

df.max(axis=0)

axis=0 (default, column-wise)

Returns the maximum value.

std()

df.std(axis=0)

axis=0 (default, column-wise)

Computes standard deviation.

var()

df.var(axis=0)

axis=0 (default, column-wise)

Computes variance.

median()

df.median(axis=0)

axis=0 (default, column-wise)

Computes the median value.

mode()

df.mode()

No parameters

Returns the most frequent value(s).

abs()

df.abs()

No parameters

Returns absolute values.

clip()

df.clip(lower, upper)

lower, upper: Clip values to limits

Clips values to a range.

apply()

df.apply(func, axis=0)

func: Function to apply, axis

Applies a function element-wise.

map()

df.map(func)

func: Function to apply

Applies function to each element (Series only).

astype()

df.astype(dtype)

dtype: Target data type

Converts data type of elements.

2.2 Attributes#

Series Attributes#

Attribute

Syntax

Description

name

series.name

Returns or sets the name of the Series.

dtype

series.dtype

Returns the data type of the Series.

nbytes

series.nbytes

Returns the total memory usage of the Series (in bytes).

T

series.T

Returns the Series itself (useful for compatibility with DataFrames).

hasnans

series.hasnans

Returns True if the Series contains NaN values.

is_unique

series.is_unique

Returns True if all values in the Series are unique.

Dataframe Attributes#

Attribute

Syntax

Description

columns

df.columns

Returns column labels of the DataFrame.

axes

df.axes

Returns a list of row and column index labels.

T

df.T

Returns the transposed DataFrame (swaps rows and columns).

info()

df.info()

Displays a summary of the DataFrame (index, dtypes, memory usage).

memory_usage()

df.memory_usage()

Returns the memory usage of each column.

select_dtypes()

df.select_dtypes(include=[...])

Selects columns of a specific dtype.

nbytes

df.nbytes

Returns the total memory usage of the DataFrame (in bytes).

Series and Dataframe Attributes#

These work on both

Attribute

Syntax

Description

index

df.index

Returns the row index labels.

columns

df.columns

Returns the column labels (DataFrame only).

shape

df.shape

Returns the dimensions (rows, columns).

size

df.size

Returns the total number of elements (rows × columns).

ndim

df.ndim

Returns the number of dimensions (1D for Series, 2D for DataFrame).

values

df.values

Returns data as a NumPy array.

dtypes

df.dtypes

Returns the data type of each column.

empty

df.empty

Returns True if the object is empty.

3. Pandas Series#

3.2 Creating Series#

pd.Series() constructor#

General syntax for the pd.Series() Constructor:

pd.Series(data=None, index=None, dtype=None, name=None, copy=False)

Parameter

Description

data

The main data (can be list, dict, array, scalar, etc.)

index

Labels for the Series elements (default: auto-generated integers)

dtype

The data type (e.g., float, int, str)

name

Optional name for the Series

copy

If True, makes a copy of the data

From Lists#

  • Index can be manually assigned or autogenerated (0,1,2,…)

  • By default produces a view if possible

import pandas as pd

# Creating a Series: Boiling points (in °C) of halogens
boiling_points = pd.Series(
    [-188.1, -34.0, 59.5, 184.4, 336.8],  # Values
    index=["F", "Cl", "Br", "I", "At"],  # Element symbols as labels
    name="Boiling Point (°C)"
    )


# Display the Series
print(boiling_points)

# Example Operations
print("\nHighest Boiling Point:", boiling_points.max())  # Get the highest boiling point
print("\nSorted Boiling Points:\n", boiling_points.sort_values())  # Sort in ascending order
print("\nRanked Boiling Points:\n", boiling_points.rank())  # Rank the elements
F    -188.1
Cl    -34.0
Br     59.5
I     184.4
At    336.8
Name: Boiling Point (°C), dtype: float64

Highest Boiling Point: 336.8

Sorted Boiling Points:
 F    -188.1
Cl    -34.0
Br     59.5
I     184.4
At    336.8
Name: Boiling Point (°C), dtype: float64

Ranked Boiling Points:
 F     1.0
Cl    2.0
Br    3.0
I     4.0
At    5.0
Name: Boiling Point (°C), dtype: float64

From Numpy Arrays#

import numpy as np
import pandas as pd
arr = np.array([1,2,3,4])
s=pd.Series(arr, index=(['H','He', 'Li', 'Be']))
print(s)
H     1
He    2
Li    3
Be    4
dtype: int64

From Dictionary#

# Creating a Series from a dictionary (keys become the index)
data = {"H": 1.008, "He": 4.0026, "Li": 6.94, "Be": 9.0122}
s = pd.Series(data, name="Atomic Mass (g/mol)")
print(s)
H     1.0080
He    4.0026
Li    6.9400
Be    9.0122
Name: Atomic Mass (g/mol), dtype: float64

From Scalar Value#

-useful for initializing placeholders

s = pd.Series(1, index=["A", "B", "C", "D"])
print(s)
A    1
B    1
C    1
D    1
dtype: int64

s = pd.Series(1, index=[“A”, “B”, “C”, “D”]) print(s)

Indirect Methods#

From DataFrame Column#

import pandas as pd

# Creating a DataFrame
df = pd.DataFrame({"Element": ["H", "He", "Li"], "Atomic Mass": [1.008, 4.0026, 6.94]})

# Extracting a single column as a Series
atomic_mass_series = df["Atomic Mass"]
print(type(atomic_mass_series))  # <class 'pandas.core.series.Series'>
print(atomic_mass_series)
<class 'pandas.core.series.Series'>
0    1.0080
1    4.0026
2    6.9400
Name: Atomic Mass, dtype: float64

Using apply() on dataframe column#

df["Atomic Mass Squared"] = df["Atomic Mass"].apply(lambda x: x ** 2)
print(type(df["Atomic Mass Squared"]))  # <class 'pandas.core.series.Series'>
print(df["Atomic Mass Squared"])
<class 'pandas.core.series.Series'>
0     1.016064
1    16.020807
2    48.163600
Name: Atomic Mass Squared, dtype: float64

Using map() to a dataframe column#

import pandas as pd

# Step 1: Create a DataFrame of electronegativity values for halogens
halogens = pd.DataFrame({
    "Element": ["F", "Cl", "Br", "I", "At"],
    "Electronegativity": [3.98, 3.16, 2.96, 2.66, 2.2]
})

# Step 2: Define a function to classify elements based on electronegativity
def classify_electronegativity(value):
    if value >= 3.5:
        return "Highly Electronegative"
    elif value >= 2.5:
        return "Moderately Electronegative"
    else:
        return "Low Electronegativity"

# Step 3: Use `map()` to create a new Series with classifications
electronegativity_class = halogens["Electronegativity"].map(classify_electronegativity)

# Step 4: Add the new classification Series to the DataFrame
halogens["Electronegativity Category"] = electronegativity_class

# Step 5: Display the Series and DataFrame
print(electronegativity_class)
print(type(electronegativity_class))
print("\n")
print(halogens)
print(type(halogens))
# Step 6 (Optional): Save the DataFrame to a CSV file
halogens.to_csv("halogens_electronegativity.csv", index=False)
0        Highly Electronegative
1    Moderately Electronegative
2    Moderately Electronegative
3    Moderately Electronegative
4         Low Electronegativity
Name: Electronegativity, dtype: object
<class 'pandas.core.series.Series'>


  Element  Electronegativity  Electronegativity Category
0       F               3.98      Highly Electronegative
1      Cl               3.16  Moderately Electronegative
2      Br               2.96  Moderately Electronegative
3       I               2.66  Moderately Electronegative
4      At               2.20       Low Electronegativity
<class 'pandas.core.frame.DataFrame'>

Using iloc() on a dataframe#

import pandas as pd

# Create a sample DataFrame with chemical properties
df = pd.DataFrame({
    "Element": ["Na", "Mg", "Al", "Si", "P"],
    "Atomic Number": [11, 12, 13, 14, 15],
    "Electronegativity": [0.93, 1.31, 1.61, 1.90, 2.19]
})

# Extract the "Electronegativity" column using iloc
electronegativity_series = df.iloc[:, 2]  # Selecting column index 2

print(type(electronegativity_series))  # Output: <class 'pandas.core.series.Series'>
print(electronegativity_series)
<class 'pandas.core.series.Series'>
0    0.93
1    1.31
2    1.61
3    1.90
4    2.19
Name: Electronegativity, dtype: float64

Using loc() on a dataframe#

# Extract "Electronegativity" column using loc
electronegativity_series = df.loc[:, "Electronegativity"]

print(type(electronegativity_series))  # Output: <class 'pandas.core.series.Series'>
print(electronegativity_series)
<class 'pandas.core.series.Series'>
0    0.93
1    1.31
2    1.61
3    1.90
4    2.19
Name: Electronegativity, dtype: float64

4. Pandas DataFrames#

4.1 DataFrame Creation#

Function

Description

pd.DataFrame()

Creates a DataFrame from a dictionary, list, NumPy array, or other data source.

pd.read_csv()

Reads a CSV file into a DataFrame.

pd.read_excel()

Reads an Excel file into a DataFrame.

pd.read_json()

Reads a JSON file into a DataFrame.

pd.read_sql()

Reads from a SQL database into a DataFrame.

pd.read_html()

Reads an HTML table into a DataFrame.

pd.DataFrame() constructor#

4.2 DataFrame Methods#

Exploration#

Method

Description

df.head(n)

Returns the first n rows.

df.tail(n)

Returns the last n rows.

df.info()

Displays concise summary of the DataFrame.

df.describe()

Generates summary statistics for numerical columns.

df.shape

Returns tuple of (rows, columns).

df.dtypes

Returns the data types of each column.

df.columns

Returns the column labels.

df.index

Returns the index (row labels).

df.sample(n)

Returns a random sample of n rows.

DataFrame Selection and Filtering#

Method

Description

df['column']

Selects a single column as a Series.

df[['col1', 'col2']]

Selects multiple columns.

df.iloc[row, col]

Selects data using integer-location indexing.

df.loc[row_label, col_label]

Selects data using label-based indexing.

df[df['column'] > value]

Filters rows based on a condition.

df.query('column > value')

Queries the DataFrame using an expression.

df.set_index('column')

Sets a specific column as the index.

df.reset_index()

Resets the index to default integer values.

DataFrame Cleaning#

Method

Description

df.dropna()

Removes rows with missing values.

df.fillna(value)

Fills missing values with a specified value.

df.drop(columns=['col1', 'col2'])

Drops specified columns.

df.drop(index=[0,1])

Drops specified rows.

df.rename(columns={'old':'new'})

Renames column labels.

df.duplicated()

Checks for duplicate rows.

df.drop_duplicates()

Removes duplicate rows.

df.replace(old, new)

Replaces values in the DataFrame.

DataFrame Transformation#

Method

Description

df.apply(func, axis=0 or 1)

Applies a function across rows or columns.

df.applymap(func)

Applies a function element-wise.

df.assign(new_col=value)

Creates a new column.

df.astype(dtype)

Changes the data type of a column.

df.sort_values(by='column')

Sorts values based on a column.

df.sort_index()

Sorts DataFrame by index.

df.pivot(index, columns, values)

Creates a pivot table.

DataFrame Statistics and Aggregation#

Method

Description

df.groupby('column')

Groups data by a column.

df.agg(['mean', 'sum'])

Applies aggregation functions.

df.mean()

Computes mean of numerical columns.

df.median()

Computes median.

df.sum()

Computes sum.

df.min()

Computes minimum.

df.max()

Computes maximum.

df.count()

Counts non-null values.

df.nunique()

Counts unique values in each column.

DataFrame Merging and Concatenation#

Method

Description

pd.concat([df1, df2])

Concatenates DataFrames along an axis.

df1.merge(df2, on='column')

Merges DataFrames on a common column.

df1.join(df2, on='column')

Joins DataFrames based on index.

Dataframe IO#

Method

Description

df.to_csv('file.csv')

Writes DataFrame to a CSV file.

df.to_excel('file.xlsx')

Writes DataFrame to an Excel file.

df.to_json('file.json')

Writes DataFrame to a JSON file.

df.to_sql('table', con=engine)

Writes DataFrame to a SQL table.

pd.read_clipboard()

Reads data from clipboard into a Pandas DataFrame.

df.to_clipboard()

Copies a DataFrame to the clipboard for easy pasting elsewhere.

pd.read_table("file.txt")

Reads a tab-separated text file into a Pandas DataFrame.

df.to_csv("file.txt", sep="\t", index=False)

Saves a DataFrame as a tab-separated text file.

pd.read_csv()

Reads CSV file

pd.read_json()

Reads JSON file

4.3: Handling Missing Data#

import pandas as pd
import numpy as np

# Creating a DataFrame with missing values
data = {
    "Element": ["H", "He", "Li", "Be", "B", "C", "N", "O", "F", "Ne",
                "Na", "Mg", "Al", "Si", "P", "S", "Cl", "Ar"],
    "Atomic_Weight": [1.008, 4.0026, 6.94, 9.0122, 10.81, 12.011, 14.007, 
                      15.999, 18.998, 20.180, 22.990, None, 26.981, 28.085, 
                      30.974, None, 35.45, 39.948]
}

df = pd.DataFrame(data)

# Display the original DataFrame
print("Original DataFrame:")
print(df)
Original DataFrame:
   Element  Atomic_Weight
0        H         1.0080
1       He         4.0026
2       Li         6.9400
3       Be         9.0122
4        B        10.8100
5        C        12.0110
6        N        14.0070
7        O        15.9990
8        F        18.9980
9       Ne        20.1800
10      Na        22.9900
11      Mg            NaN
12      Al        26.9810
13      Si        28.0850
14       P        30.9740
15       S            NaN
16      Cl        35.4500
17      Ar        39.9480

dropna() - remove missing values#

Note the index values are removed and not changed

df_cleaned = df.dropna()
print("\nDataFrame after dropping missing values:")
print(df_cleaned)
DataFrame after dropping missing values:
   Element  Atomic_Weight
0        H         1.0080
1       He         4.0026
2       Li         6.9400
3       Be         9.0122
4        B        10.8100
5        C        12.0110
6        N        14.0070
7        O        15.9990
8        F        18.9980
9       Ne        20.1800
10      Na        22.9900
12      Al        26.9810
13      Si        28.0850
14       P        30.9740
16      Cl        35.4500
17      Ar        39.9480

fillna(np.nan)Replacing Missing Values with NaN#

Note, Pandas automatically converts None to NaN, but sometimes other characters get introduced into the data. Also, sometimes missing data can cause problems when doing functions over entire dataframes even if it is a string (remember NaN is a float).

data = {
    "Element": ["H", "He", "Li", None, "B"],
    "State": ["Gas", "Gas", "Solid", None, "Solid"]
}

df1 = pd.DataFrame(data)
print(df1)
# Ensure missing values are explicitly set to NaN
df1 = df1.fillna(np.nan)

print(df1)
  Element  State
0       H    Gas
1      He    Gas
2      Li  Solid
3    None   None
4       B  Solid
  Element  State
0       H    Gas
1      He    Gas
2      Li  Solid
3     NaN    NaN
4       B  Solid

Replace Custom Missing Values#

This gets tricky as Pandas is discontinuing type implicit down casting and we have two data types in the Atomic Weight column. We had to force it to a string, replace the values with np.nan, and then convert to a float

import numpy as np
import pandas as pd

# Create DataFrame
data = {
    "Element": ["H", "He", "Li", "Be", "B"],
    "Atomic_Weight": [1.008, 4.0026, "N/A", "?", 10.81]
}

df2 = pd.DataFrame(data)
print("Original DataFrame:")
print(df2)

# ✅ Fully Future-Proof Method
df2["Atomic_Weight"] = df2["Atomic_Weight"].astype("string")  # Force string type to avoid downcasting
df2["Atomic_Weight"] = df2["Atomic_Weight"].replace(["N/A", "?"], np.nan)  # Replace missing values
df2["Atomic_Weight"] = df2["Atomic_Weight"].astype(float)  # Convert to float after replacement

print("\nUpdated DataFrame:")
print(df2)
Original DataFrame:
  Element Atomic_Weight
0       H         1.008
1      He        4.0026
2      Li           N/A
3      Be             ?
4       B         10.81

Updated DataFrame:
  Element  Atomic_Weight
0       H         1.0080
1      He         4.0026
2      Li            NaN
3      Be            NaN
4       B        10.8100

Replace Missing Values with the Mean#

df_mean_filled = df.copy()
df_mean_filled["Atomic_Weight"] = df_mean_filled["Atomic_Weight"].fillna(df["Atomic_Weight"].mean())

print("\nDataFrame with missing values replaced by the mean:")
print(df_mean_filled)
DataFrame with missing values replaced by the mean:
   Element  Atomic_Weight
0        H       1.008000
1       He       4.002600
2       Li       6.940000
3       Be       9.012200
4        B      10.810000
5        C      12.011000
6        N      14.007000
7        O      15.999000
8        F      18.998000
9       Ne      20.180000
10      Na      22.990000
11      Mg      18.587238
12      Al      26.981000
13      Si      28.085000
14       P      30.974000
15       S      18.587238
16      Cl      35.450000
17      Ar      39.948000

Replace Missing Value with Custom Values#

df_filled_custom = df.copy()

# ✅ Assign the modified column explicitly (instead of using inplace=True)
df_filled_custom["Atomic_Weight"] = df_filled_custom["Atomic_Weight"].fillna(0)

print("\nDataFrame with missing values replaced by 0:")
print(df_filled_custom)
DataFrame with missing values replaced by 0:
   Element  Atomic_Weight
0        H         1.0080
1       He         4.0026
2       Li         6.9400
3       Be         9.0122
4        B        10.8100
5        C        12.0110
6        N        14.0070
7        O        15.9990
8        F        18.9980
9       Ne        20.1800
10      Na        22.9900
11      Mg         0.0000
12      Al        26.9810
13      Si        28.0850
14       P        30.9740
15       S         0.0000
16      Cl        35.4500
17      Ar        39.9480

Use Boolean mask to remove NaN#

Note, we are using .notna() and not .notnan(). ‘na’ stands for “not available”, and this extends to non numeric data types like strings. the not in notna() inverts the Boolean mask the way ~ does in numpy. So it returns True if the value is a number

Now Pandas uses NaN as a standard marker for missing data, not just an invalid number. This is slightly different than Numpy, which treats it as a float.

df_valid = df[df["Atomic_Weight"].notna()]
print("\nDataFrame with only valid atomic weights:")
print(df_valid)
DataFrame with only valid atomic weights:
   Element  Atomic_Weight
0        H         1.0080
1       He         4.0026
2       Li         6.9400
3       Be         9.0122
4        B        10.8100
5        C        12.0110
6        N        14.0070
7        O        15.9990
8        F        18.9980
9       Ne        20.1800
10      Na        22.9900
12      Al        26.9810
13      Si        28.0850
14       P        30.9740
16      Cl        35.4500
17      Ar        39.9480

4. Statistics#

# Creating a DataFrame with physical properties of alkali metals
alkali_metals = pd.DataFrame({
    "Atomic Number": [3, 11, 19, 37, 55],  # Lithium to Cesium
    "Atomic Radius (pm)": [152, 186, 227, 248, 265],
    "Density (g/cm³)": [0.534, 0.97, 0.86, 1.53, 1.87],
    "Melting Point (°C)": [180.5, 97.8, 63.5, 39.3, 28.5]
}, index=["Li", "Na", "K", "Rb", "Cs"])  # Using symbols as row index

# Display the DataFrame
print(alkali_metals)

# Example Operations
print("\nMean Atomic Radius:", alkali_metals["Atomic Radius (pm)"].mean())  # Average atomic radius
print("\nSorted by Melting Point:\n", alkali_metals.sort_values("Melting Point (°C)"))  # Sort by melting point
print("\nDensity Correlation:\n", alkali_metals.corr())  # Correlation between properties
    Atomic Number  Atomic Radius (pm)  Density (g/cm³)  Melting Point (°C)
Li              3                 152            0.534               180.5
Na             11                 186            0.970                97.8
K              19                 227            0.860                63.5
Rb             37                 248            1.530                39.3
Cs             55                 265            1.870                28.5

Mean Atomic Radius: 215.6

Sorted by Melting Point:
     Atomic Number  Atomic Radius (pm)  Density (g/cm³)  Melting Point (°C)
Cs             55                 265            1.870                28.5
Rb             37                 248            1.530                39.3
K              19                 227            0.860                63.5
Na             11                 186            0.970                97.8
Li              3                 152            0.534               180.5

Density Correlation:
                     Atomic Number  Atomic Radius (pm)  Density (g/cm³)  \
Atomic Number            1.000000            0.932637         0.973714   
Atomic Radius (pm)       0.932637            1.000000         0.895427   
Density (g/cm³)          0.973714            0.895427         1.000000   
Melting Point (°C)      -0.855879           -0.969793        -0.856947   

                    Melting Point (°C)  
Atomic Number                -0.855879  
Atomic Radius (pm)           -0.969793  
Density (g/cm³)              -0.856947  
Melting Point (°C)            1.000000  
# Convert melting points from °C to Kelvin (for both Series & DataFrame)
to_kelvin = lambda temp: temp + 273.15

# Apply function to the boiling points Series
boiling_points_K = boiling_points.apply(to_kelvin)
print("\nBoiling Points in Kelvin:\n", boiling_points_K)

# Apply function to the DataFrame column
#print("\n", alkali_metals)
alkali_metals["Melting Point (K)"] = alkali_metals["Melting Point (°C)"].apply(to_kelvin)
print("\nUpdated Alkali Metals DataFrame:\n", alkali_metals)
Boiling Points in Kelvin:
 F      85.05
Cl    239.15
Br    332.65
I     457.55
At    609.95
Name: Boiling Point (°C), dtype: float64

Updated Alkali Metals DataFrame:
     Atomic Number  Atomic Radius (pm)  Density (g/cm³)  Melting Point (°C)  \
Li              3                 152            0.534               180.5   
Na             11                 186            0.970                97.8   
K              19                 227            0.860                63.5   
Rb             37                 248            1.530                39.3   
Cs             55                 265            1.870                28.5   

    Melting Point (K)  
Li             453.65  
Na             370.95  
K              336.65  
Rb             312.45  
Cs             301.65  

5. Activity#

We are going to upload the periodic table from worksheet 4 and explore it in a Pandas Dataframe

Step

Command

Purpose

Shape

df.shape

Get number of rows & columns

Column Labels

df.columns

See column names

Row & Column Count

df.shape[0], df.shape[1]

Get counts separately

Data Types

df.dtypes

Check variable types

Data Overview

df.info()

Summarize structure & missing values

First Few Rows

df.head(n)

Preview top rows

Last Few Rows

df.tail(n)

Preview bottom rows

Statistics

df.describe()

Get numeric summaries

Missing Data

df.isnull().sum()

Identify missing values

# Use the directory structure from Workbook 4
import pandas as pd
import os
base_data_dir = os.path.expanduser("~/data")  # Parent directory
pubchem_data_dir = os.path.join(base_data_dir, "pubchem_data")  # Subdirectory for PubChem
os.makedirs(pubchem_data_dir, exist_ok=True)  # Ensure directories exist
periodictable_csv_datapath = os.path.join(pubchem_data_dir, "PubChemElements_all.csv")
df_periodictable = pd.read_csv(periodictable_csv_datapath)
df_periodictable.head()
AtomicNumber Symbol Name AtomicMass CPKHexColor ElectronConfiguration Electronegativity AtomicRadius IonizationEnergy ElectronAffinity OxidationStates StandardState MeltingPoint BoilingPoint Density GroupBlock YearDiscovered
0 1 H Hydrogen 1.008000 FFFFFF 1s1 2.20 120.0 13.598 0.754 +1, -1 Gas 13.81 20.28 0.000090 Nonmetal 1766
1 2 He Helium 4.002600 D9FFFF 1s2 NaN 140.0 24.587 NaN 0 Gas 0.95 4.22 0.000179 Noble gas 1868
2 3 Li Lithium 7.000000 CC80FF [He]2s1 0.98 182.0 5.392 0.618 +1 Solid 453.65 1615.00 0.534000 Alkali metal 1817
3 4 Be Beryllium 9.012183 C2FF00 [He]2s2 1.57 153.0 9.323 NaN +2 Solid 1560.00 2744.00 1.850000 Alkaline earth metal 1798
4 5 B Boron 10.810000 FFB5B5 [He]2s2 2p1 2.04 192.0 8.298 0.277 +3 Solid 2348.00 4273.00 2.370000 Metalloid 1808

Set the symbols column as the index#

df_periodictable=df_periodictable.set_index("Symbol")
df_periodictable.head()
AtomicNumber Name AtomicMass CPKHexColor ElectronConfiguration Electronegativity AtomicRadius IonizationEnergy ElectronAffinity OxidationStates StandardState MeltingPoint BoilingPoint Density GroupBlock YearDiscovered
Symbol
H 1 Hydrogen 1.008000 FFFFFF 1s1 2.20 120.0 13.598 0.754 +1, -1 Gas 13.81 20.28 0.000090 Nonmetal 1766
He 2 Helium 4.002600 D9FFFF 1s2 NaN 140.0 24.587 NaN 0 Gas 0.95 4.22 0.000179 Noble gas 1868
Li 3 Lithium 7.000000 CC80FF [He]2s1 0.98 182.0 5.392 0.618 +1 Solid 453.65 1615.00 0.534000 Alkali metal 1817
Be 4 Beryllium 9.012183 C2FF00 [He]2s2 1.57 153.0 9.323 NaN +2 Solid 1560.00 2744.00 1.850000 Alkaline earth metal 1798
B 5 Boron 10.810000 FFB5B5 [He]2s2 2p1 2.04 192.0 8.298 0.277 +3 Solid 2348.00 4273.00 2.370000 Metalloid 1808
df_periodictable.tail()
AtomicNumber Name AtomicMass CPKHexColor ElectronConfiguration Electronegativity AtomicRadius IonizationEnergy ElectronAffinity OxidationStates StandardState MeltingPoint BoilingPoint Density GroupBlock YearDiscovered
Symbol
Fl 114 Flerovium 290.192 NaN [Rn]7s2 7p2 5f14 6d10 (predicted) NaN NaN NaN NaN 6, 4,2, 1, 0 Expected to be a Solid NaN NaN NaN Post-transition metal 1998
Mc 115 Moscovium 290.196 NaN [Rn]7s2 7p3 5f14 6d10 (predicted) NaN NaN NaN NaN 3, 1 Expected to be a Solid NaN NaN NaN Post-transition metal 2003
Lv 116 Livermorium 293.205 NaN [Rn]7s2 7p4 5f14 6d10 (predicted) NaN NaN NaN NaN +4, +2, -2 Expected to be a Solid NaN NaN NaN Post-transition metal 2000
Ts 117 Tennessine 294.211 NaN [Rn]7s2 7p5 5f14 6d10 (predicted) NaN NaN NaN NaN +5, +3, +1, -1 Expected to be a Solid NaN NaN NaN Halogen 2010
Og 118 Oganesson 295.216 NaN [Rn]7s2 7p6 5f14 6d10 (predicted) NaN NaN NaN NaN +6, +4, +2, +1, 0, -1 Expected to be a Gas NaN NaN NaN Noble gas 2006
print(df_periodictable.shape)
print(f'number rows is: {df_periodictable.shape[0]}, \nnumber of columns is {df_periodictable.shape[1]}')
print(df_periodictable.columns)
print(df_periodictable.dtypes)      
(118, 16)
number rows is: 118, 
number of columns is 16
Index(['AtomicNumber', 'Name', 'AtomicMass', 'CPKHexColor',
       'ElectronConfiguration', 'Electronegativity', 'AtomicRadius',
       'IonizationEnergy', 'ElectronAffinity', 'OxidationStates',
       'StandardState', 'MeltingPoint', 'BoilingPoint', 'Density',
       'GroupBlock', 'YearDiscovered'],
      dtype='object')
AtomicNumber               int64
Name                      object
AtomicMass               float64
CPKHexColor               object
ElectronConfiguration     object
Electronegativity        float64
AtomicRadius             float64
IonizationEnergy         float64
ElectronAffinity         float64
OxidationStates           object
StandardState             object
MeltingPoint             float64
BoilingPoint             float64
Density                  float64
GroupBlock                object
YearDiscovered            object
dtype: object
df_periodictable.info
<bound method DataFrame.info of         AtomicNumber         Name  AtomicMass CPKHexColor  \
Symbol                                                      
H                  1     Hydrogen    1.008000      FFFFFF   
He                 2       Helium    4.002600      D9FFFF   
Li                 3      Lithium    7.000000      CC80FF   
Be                 4    Beryllium    9.012183      C2FF00   
B                  5        Boron   10.810000      FFB5B5   
...              ...          ...         ...         ...   
Fl               114    Flerovium  290.192000         NaN   
Mc               115    Moscovium  290.196000         NaN   
Lv               116  Livermorium  293.205000         NaN   
Ts               117   Tennessine  294.211000         NaN   
Og               118    Oganesson  295.216000         NaN   

                    ElectronConfiguration  Electronegativity  AtomicRadius  \
Symbol                                                                       
H                                     1s1               2.20         120.0   
He                                    1s2                NaN         140.0   
Li                                [He]2s1               0.98         182.0   
Be                                [He]2s2               1.57         153.0   
B                             [He]2s2 2p1               2.04         192.0   
...                                   ...                ...           ...   
Fl      [Rn]7s2 7p2 5f14 6d10 (predicted)                NaN           NaN   
Mc      [Rn]7s2 7p3 5f14 6d10 (predicted)                NaN           NaN   
Lv      [Rn]7s2 7p4 5f14 6d10 (predicted)                NaN           NaN   
Ts      [Rn]7s2 7p5 5f14 6d10 (predicted)                NaN           NaN   
Og      [Rn]7s2 7p6 5f14 6d10 (predicted)                NaN           NaN   

        IonizationEnergy  ElectronAffinity        OxidationStates  \
Symbol                                                              
H                 13.598             0.754                 +1, -1   
He                24.587               NaN                      0   
Li                 5.392             0.618                     +1   
Be                 9.323               NaN                     +2   
B                  8.298             0.277                     +3   
...                  ...               ...                    ...   
Fl                   NaN               NaN           6, 4,2, 1, 0   
Mc                   NaN               NaN                   3, 1   
Lv                   NaN               NaN             +4, +2, -2   
Ts                   NaN               NaN         +5, +3, +1, -1   
Og                   NaN               NaN  +6, +4, +2, +1, 0, -1   

                 StandardState  MeltingPoint  BoilingPoint   Density  \
Symbol                                                                 
H                          Gas         13.81         20.28  0.000090   
He                         Gas          0.95          4.22  0.000179   
Li                       Solid        453.65       1615.00  0.534000   
Be                       Solid       1560.00       2744.00  1.850000   
B                        Solid       2348.00       4273.00  2.370000   
...                        ...           ...           ...       ...   
Fl      Expected to be a Solid           NaN           NaN       NaN   
Mc      Expected to be a Solid           NaN           NaN       NaN   
Lv      Expected to be a Solid           NaN           NaN       NaN   
Ts      Expected to be a Solid           NaN           NaN       NaN   
Og        Expected to be a Gas           NaN           NaN       NaN   

                   GroupBlock YearDiscovered  
Symbol                                        
H                    Nonmetal           1766  
He                  Noble gas           1868  
Li               Alkali metal           1817  
Be       Alkaline earth metal           1798  
B                   Metalloid           1808  
...                       ...            ...  
Fl      Post-transition metal           1998  
Mc      Post-transition metal           2003  
Lv      Post-transition metal           2000  
Ts                    Halogen           2010  
Og                  Noble gas           2006  

[118 rows x 16 columns]>
df_periodictable.describe()
AtomicNumber AtomicMass Electronegativity AtomicRadius IonizationEnergy ElectronAffinity MeltingPoint BoilingPoint Density
count 118.000000 118.000000 95.000000 99.000000 102.000000 57.000000 103.000000 93.000000 96.000000
mean 59.500000 146.540281 1.732316 209.464646 7.997255 1.072140 1273.740553 2536.212473 7.608001
std 34.207699 89.768356 0.635187 38.569130 3.339066 0.879163 888.853859 1588.410919 5.878692
min 1.000000 1.008000 0.700000 120.000000 3.894000 0.079000 0.950000 4.220000 0.000090
25% 30.250000 66.480750 1.290000 187.000000 6.020500 0.470000 516.040000 1180.000000 2.572500
50% 59.500000 142.573830 1.620000 209.000000 6.960000 0.754000 1191.000000 2792.000000 7.072000
75% 88.750000 226.777165 2.170000 232.000000 8.998500 1.350000 1806.500000 3618.000000 10.275250
max 118.000000 295.216000 3.980000 348.000000 24.587000 3.617000 3823.000000 5869.000000 22.570000
#Check for missing data for each column
df_periodictable.isnull().sum()
AtomicNumber              0
Name                      0
AtomicMass                0
CPKHexColor              10
ElectronConfiguration     0
Electronegativity        23
AtomicRadius             19
IonizationEnergy         16
ElectronAffinity         61
OxidationStates           1
StandardState             0
MeltingPoint             15
BoilingPoint             25
Density                  22
GroupBlock                0
YearDiscovered            0
dtype: int64

Extracting Information from DataFrame#

df_periodictable.loc['Cr','ElectronConfiguration']
'[Ar]3d5 4s1'
# Define the list of halogens
halogens = ['F', 'Cl', 'Br', 'I', 'At']

# Select ionization energy for halogens
halogen_ionization = df_periodictable.loc[df_periodictable.index.isin(halogens), "IonizationEnergy"]

# Display results
#display(halogen_ionization.to_frame())  # Ensures it shows as a DataFrame
print(halogen_ionization)
print(type(halogen_ionization))
Symbol
F     17.423
Cl    12.968
Br    11.814
I     10.451
At     9.500
Name: IonizationEnergy, dtype: float64
<class 'pandas.core.series.Series'>

Explaining the above code

  1. Create a list of halogens

halogens = ['F', 'Cl', 'Br', 'I', 'At']
  1. Create a boolean mask that is True for halogens

df_periodictable.index.isin(halogens)
  1. Select rows where boolean mask is True and return only “IonizationEnergy”

.loc[df_periodictable.index.isin(halogens), "IonizationEnergy"]
  1. Assign to variable halogen_ionization

halogen_ionization = df_periodictable.loc[df_periodictable.index.isin(halogens), "IonizationEnergy"]
display(halogen_ionization.to_frame())  # Ensures it shows as a DataFrame

Display is a jupyter function and will only work in notebooks, you need to spring to a string if you want to print it out

print(halogen_ionization)
Symbol
F     17.423
Cl    12.968
Br    11.814
I     10.451
At     9.500
Name: IonizationEnergy, dtype: float64
print(halogen_ionization.to_string())
Symbol
F     17.423
Cl    12.968
Br    11.814
I     10.451
At     9.500
print(halogen_ionization.to_frame())
print(type(halogen_ionization.to_frame()))
        IonizationEnergy
Symbol                  
F                 17.423
Cl                12.968
Br                11.814
I                 10.451
At                 9.500
<class 'pandas.core.frame.DataFrame'>
print(halogen_ionization.to_frame().to_string())
print(type(halogen_ionization.to_frame().to_string()))
        IonizationEnergy
Symbol                  
F                 17.423
Cl                12.968
Br                11.814
I                 10.451
At                 9.500
<class 'str'>
halogen_melt_boil_pts = df_periodictable.loc[df_periodictable.index.isin(halogens), ["MeltingPoint", "BoilingPoint"]]
print(halogen_melt_boil_pts)
print(type(halogen_melt_boil_pts))
        MeltingPoint  BoilingPoint
Symbol                            
F              53.53         85.03
Cl            171.65        239.11
Br            265.95        331.95
I             386.85        457.55
At            575.00           NaN
<class 'pandas.core.frame.DataFrame'>

Acknowledgements#

This content was developed with assistance from Perplexity AI and Chat GPT. Multiple queries were made during the Fall 2024 and the Spring 2025.