2. Pandas#
1. Introduction#
Pandas stands for Panel data, which is an economics term for multi-dimensional datasets. Pandas is built on top of Numpy and has two primary data structures
Series (1D data structure)
Dataframe (2D data structure)
If you want higher dimensional data structures you should use the xrray package.
Pandas series and dataframes are typically easier to work with than Numpy arrays because they can use labels as index numbers. That is, you can give a row or a column a heading label and then using the label to obtain the data value, whereas in numpy you need to use the index numbers. Pandas can also contain multiple data types while numpy arrays can only have one data type.
Pandas DataFrames vs. Numpy Arrays#
Feature |
Pandas DataFrame |
NumPy Array |
---|---|---|
Type |
2D table with labeled rows & columns |
N-dimensional homogeneous array |
Supports multiple data types? |
Yes (e.g., strings & numbers in one column) |
No (all elements must be the same type) |
Supports heterogeneous data? |
Yes (each column can have different types) |
No (all elements must be the same |
Row & column labels |
Yes (uses |
No (only positional indexing) |
Access by labels? |
Yes ( |
No (only by index position) |
Access by index position? |
Yes ( |
Yes ( |
Supports missing values? |
Yes ( |
No (must use masked arrays or |
Built-in functions for missing data? |
Yes ( |
No (requires workarounds) |
Supports row/column operations? |
Yes ( |
Yes ( |
Supports grouping & aggregation? |
Yes ( |
No |
Supports reshaping? |
Yes ( |
Yes ( |
Supports merging/joining? |
Yes ( |
No |
Mutable |
Yes (you can modify/add/delete columns, rows or elements) |
Partially (you can modify elements but not change shape) |
Read/Write CSV? |
Yes ( |
No (must use Pandas or |
Read/Write JSON? |
Yes ( |
No |
Read/Write Excel? |
Yes ( |
No |
Save/Load Binary Format? |
Yes ( |
Yes ( |
If you have not installed pandas you need to activate your conda virtual environment and install it using your terminal
conda activate NameOfEnvironment
conda install -c conda-forge pandas
Now check the version that is installed
import pandas as pd
print(pd.__version__)
2.2.3
2. Pandas Object Overview#
2.1: Functions#
Series Functions (series only)#
Function |
Syntax |
Parameters |
Description |
---|---|---|---|
|
|
String-specific method |
Enables vectorized string operations. |
|
|
Datetime-specific method |
Enables date/time manipulations. |
|
|
|
Maps function to each value in the Series. |
|
|
No parameters |
Returns unique values in the Series. |
|
|
No parameters |
Returns the number of unique values. |
|
|
|
Returns frequency counts of unique values. |
|
|
No parameters |
Returns index of the maximum value. |
|
|
No parameters |
Returns index of the minimum value. |
|
|
No parameters |
Computes the cumulative sum. |
|
|
No parameters |
Computes the cumulative product. |
|
|
No parameters |
Computes the cumulative max. |
|
|
No parameters |
Computes the cumulative min. |
|
|
|
Shifts values up or down. |
|
|
|
Computes the difference between elements. |
|
|
|
Computes ranks of elements. |
Dataframe functions (df only)#
Function |
Syntax |
Parameters |
Description |
---|---|---|---|
|
|
|
Groups data and applies aggregation. |
|
|
|
Reshapes data by pivoting columns. |
|
|
|
Similar to |
|
|
|
Converts wide format to long format. |
|
|
|
Merges two DataFrames on a key. |
|
|
|
Joins two DataFrames based on index. |
|
|
|
Concatenates multiple DataFrames. |
|
|
No parameters |
Converts columns into rows (long format). |
|
|
No parameters |
Converts rows into columns (wide format). |
|
|
|
Expands list-like column values into rows. |
|
|
No parameters |
Transposes rows and columns. |
|
|
No parameters |
Computes correlation between columns. |
|
|
No parameters |
Computes covariance between columns. |
|
|
|
Identifies duplicate rows. |
|
|
|
Removes duplicate rows. |
|
|
|
Returns a random sample. |
|
|
|
Returns top |
|
|
|
Returns top |
|
|
|
Fills missing values. |
|
|
|
Drops missing values. |
|
|
|
Replaces specific values. |
Series and dataframe functions#
Function |
Syntax |
Parameters |
Description |
---|---|---|---|
|
|
|
Returns the first |
|
|
|
Returns the last |
|
|
|
Generates summary statistics for numerical data. |
|
|
|
Counts non-null values. |
|
|
|
Computes sum of values. |
|
|
|
Computes mean of values. |
|
|
|
Returns the minimum value. |
|
|
|
Returns the maximum value. |
|
|
|
Computes standard deviation. |
|
|
|
Computes variance. |
|
|
|
Computes the median value. |
|
|
No parameters |
Returns the most frequent value(s). |
|
|
No parameters |
Returns absolute values. |
|
|
|
Clips values to a range. |
|
|
|
Applies a function element-wise. |
|
|
|
Applies function to each element (Series only). |
|
|
|
Converts data type of elements. |
2.2 Attributes#
Series Attributes#
Attribute |
Syntax |
Description |
---|---|---|
|
|
Returns or sets the name of the Series. |
|
|
Returns the data type of the Series. |
|
|
Returns the total memory usage of the Series (in bytes). |
|
|
Returns the Series itself (useful for compatibility with DataFrames). |
|
|
Returns |
|
|
Returns |
Dataframe Attributes#
Attribute |
Syntax |
Description |
---|---|---|
|
|
Returns column labels of the DataFrame. |
|
|
Returns a list of row and column index labels. |
|
|
Returns the transposed DataFrame (swaps rows and columns). |
|
|
Displays a summary of the DataFrame (index, dtypes, memory usage). |
|
|
Returns the memory usage of each column. |
|
|
Selects columns of a specific dtype. |
|
|
Returns the total memory usage of the DataFrame (in bytes). |
Series and Dataframe Attributes#
These work on both
Attribute |
Syntax |
Description |
---|---|---|
|
|
Returns the row index labels. |
|
|
Returns the column labels (DataFrame only). |
|
|
Returns the dimensions (rows, columns). |
|
|
Returns the total number of elements (rows × columns). |
|
|
Returns the number of dimensions (1D for Series, 2D for DataFrame). |
|
|
Returns data as a NumPy array. |
|
|
Returns the data type of each column. |
|
|
Returns |
3. Pandas Series#
3.2 Creating Series#
pd.Series() constructor#
General syntax for the pd.Series() Constructor:
pd.Series(data=None, index=None, dtype=None, name=None, copy=False)
Parameter |
Description |
---|---|
|
The main data (can be list, dict, array, scalar, etc.) |
|
Labels for the Series elements (default: auto-generated integers) |
|
The data type (e.g., |
|
Optional name for the Series |
|
If |
From Lists#
Index can be manually assigned or autogenerated (0,1,2,…)
By default produces a view if possible
import pandas as pd
# Creating a Series: Boiling points (in °C) of halogens
boiling_points = pd.Series(
[-188.1, -34.0, 59.5, 184.4, 336.8], # Values
index=["F", "Cl", "Br", "I", "At"], # Element symbols as labels
name="Boiling Point (°C)"
)
# Display the Series
print(boiling_points)
# Example Operations
print("\nHighest Boiling Point:", boiling_points.max()) # Get the highest boiling point
print("\nSorted Boiling Points:\n", boiling_points.sort_values()) # Sort in ascending order
print("\nRanked Boiling Points:\n", boiling_points.rank()) # Rank the elements
F -188.1
Cl -34.0
Br 59.5
I 184.4
At 336.8
Name: Boiling Point (°C), dtype: float64
Highest Boiling Point: 336.8
Sorted Boiling Points:
F -188.1
Cl -34.0
Br 59.5
I 184.4
At 336.8
Name: Boiling Point (°C), dtype: float64
Ranked Boiling Points:
F 1.0
Cl 2.0
Br 3.0
I 4.0
At 5.0
Name: Boiling Point (°C), dtype: float64
From Numpy Arrays#
import numpy as np
import pandas as pd
arr = np.array([1,2,3,4])
s=pd.Series(arr, index=(['H','He', 'Li', 'Be']))
print(s)
H 1
He 2
Li 3
Be 4
dtype: int64
From Dictionary#
# Creating a Series from a dictionary (keys become the index)
data = {"H": 1.008, "He": 4.0026, "Li": 6.94, "Be": 9.0122}
s = pd.Series(data, name="Atomic Mass (g/mol)")
print(s)
H 1.0080
He 4.0026
Li 6.9400
Be 9.0122
Name: Atomic Mass (g/mol), dtype: float64
From Scalar Value#
-useful for initializing placeholders
s = pd.Series(1, index=["A", "B", "C", "D"])
print(s)
A 1
B 1
C 1
D 1
dtype: int64
s = pd.Series(1, index=[“A”, “B”, “C”, “D”]) print(s)
Indirect Methods#
From DataFrame Column#
import pandas as pd
# Creating a DataFrame
df = pd.DataFrame({"Element": ["H", "He", "Li"], "Atomic Mass": [1.008, 4.0026, 6.94]})
# Extracting a single column as a Series
atomic_mass_series = df["Atomic Mass"]
print(type(atomic_mass_series)) # <class 'pandas.core.series.Series'>
print(atomic_mass_series)
<class 'pandas.core.series.Series'>
0 1.0080
1 4.0026
2 6.9400
Name: Atomic Mass, dtype: float64
Using apply() on dataframe column#
df["Atomic Mass Squared"] = df["Atomic Mass"].apply(lambda x: x ** 2)
print(type(df["Atomic Mass Squared"])) # <class 'pandas.core.series.Series'>
print(df["Atomic Mass Squared"])
<class 'pandas.core.series.Series'>
0 1.016064
1 16.020807
2 48.163600
Name: Atomic Mass Squared, dtype: float64
Using map() to a dataframe column#
import pandas as pd
# Step 1: Create a DataFrame of electronegativity values for halogens
halogens = pd.DataFrame({
"Element": ["F", "Cl", "Br", "I", "At"],
"Electronegativity": [3.98, 3.16, 2.96, 2.66, 2.2]
})
# Step 2: Define a function to classify elements based on electronegativity
def classify_electronegativity(value):
if value >= 3.5:
return "Highly Electronegative"
elif value >= 2.5:
return "Moderately Electronegative"
else:
return "Low Electronegativity"
# Step 3: Use `map()` to create a new Series with classifications
electronegativity_class = halogens["Electronegativity"].map(classify_electronegativity)
# Step 4: Add the new classification Series to the DataFrame
halogens["Electronegativity Category"] = electronegativity_class
# Step 5: Display the Series and DataFrame
print(electronegativity_class)
print(type(electronegativity_class))
print("\n")
print(halogens)
print(type(halogens))
# Step 6 (Optional): Save the DataFrame to a CSV file
halogens.to_csv("halogens_electronegativity.csv", index=False)
0 Highly Electronegative
1 Moderately Electronegative
2 Moderately Electronegative
3 Moderately Electronegative
4 Low Electronegativity
Name: Electronegativity, dtype: object
<class 'pandas.core.series.Series'>
Element Electronegativity Electronegativity Category
0 F 3.98 Highly Electronegative
1 Cl 3.16 Moderately Electronegative
2 Br 2.96 Moderately Electronegative
3 I 2.66 Moderately Electronegative
4 At 2.20 Low Electronegativity
<class 'pandas.core.frame.DataFrame'>
Using iloc() on a dataframe#
import pandas as pd
# Create a sample DataFrame with chemical properties
df = pd.DataFrame({
"Element": ["Na", "Mg", "Al", "Si", "P"],
"Atomic Number": [11, 12, 13, 14, 15],
"Electronegativity": [0.93, 1.31, 1.61, 1.90, 2.19]
})
# Extract the "Electronegativity" column using iloc
electronegativity_series = df.iloc[:, 2] # Selecting column index 2
print(type(electronegativity_series)) # Output: <class 'pandas.core.series.Series'>
print(electronegativity_series)
<class 'pandas.core.series.Series'>
0 0.93
1 1.31
2 1.61
3 1.90
4 2.19
Name: Electronegativity, dtype: float64
Using loc() on a dataframe#
# Extract "Electronegativity" column using loc
electronegativity_series = df.loc[:, "Electronegativity"]
print(type(electronegativity_series)) # Output: <class 'pandas.core.series.Series'>
print(electronegativity_series)
<class 'pandas.core.series.Series'>
0 0.93
1 1.31
2 1.61
3 1.90
4 2.19
Name: Electronegativity, dtype: float64
4. Pandas DataFrames#
4.1 DataFrame Creation#
Function |
Description |
---|---|
|
Creates a DataFrame from a dictionary, list, NumPy array, or other data source. |
|
Reads a CSV file into a DataFrame. |
|
Reads an Excel file into a DataFrame. |
|
Reads a JSON file into a DataFrame. |
|
Reads from a SQL database into a DataFrame. |
|
Reads an HTML table into a DataFrame. |
pd.DataFrame() constructor#
4.2 DataFrame Methods#
Exploration#
Method |
Description |
---|---|
|
Returns the first |
|
Returns the last |
|
Displays concise summary of the DataFrame. |
|
Generates summary statistics for numerical columns. |
|
Returns tuple of (rows, columns). |
|
Returns the data types of each column. |
|
Returns the column labels. |
|
Returns the index (row labels). |
|
Returns a random sample of |
DataFrame Selection and Filtering#
Method |
Description |
---|---|
|
Selects a single column as a Series. |
|
Selects multiple columns. |
|
Selects data using integer-location indexing. |
|
Selects data using label-based indexing. |
|
Filters rows based on a condition. |
|
Queries the DataFrame using an expression. |
|
Sets a specific column as the index. |
|
Resets the index to default integer values. |
DataFrame Cleaning#
Method |
Description |
---|---|
|
Removes rows with missing values. |
|
Fills missing values with a specified value. |
|
Drops specified columns. |
|
Drops specified rows. |
|
Renames column labels. |
|
Checks for duplicate rows. |
|
Removes duplicate rows. |
|
Replaces values in the DataFrame. |
DataFrame Transformation#
Method |
Description |
---|---|
|
Applies a function across rows or columns. |
|
Applies a function element-wise. |
|
Creates a new column. |
|
Changes the data type of a column. |
|
Sorts values based on a column. |
|
Sorts DataFrame by index. |
|
Creates a pivot table. |
DataFrame Statistics and Aggregation#
Method |
Description |
---|---|
|
Groups data by a column. |
|
Applies aggregation functions. |
|
Computes mean of numerical columns. |
|
Computes median. |
|
Computes sum. |
|
Computes minimum. |
|
Computes maximum. |
|
Counts non-null values. |
|
Counts unique values in each column. |
DataFrame Merging and Concatenation#
Method |
Description |
---|---|
|
Concatenates DataFrames along an axis. |
|
Merges DataFrames on a common column. |
|
Joins DataFrames based on index. |
Dataframe IO#
Method |
Description |
---|---|
|
Writes DataFrame to a CSV file. |
|
Writes DataFrame to an Excel file. |
|
Writes DataFrame to a JSON file. |
|
Writes DataFrame to a SQL table. |
|
Reads data from clipboard into a Pandas DataFrame. |
|
Copies a DataFrame to the clipboard for easy pasting elsewhere. |
|
Reads a tab-separated text file into a Pandas DataFrame. |
|
Saves a DataFrame as a tab-separated text file. |
|
Reads CSV file |
pd.read_json() |
Reads JSON file |
4.3: Handling Missing Data#
import pandas as pd
import numpy as np
# Creating a DataFrame with missing values
data = {
"Element": ["H", "He", "Li", "Be", "B", "C", "N", "O", "F", "Ne",
"Na", "Mg", "Al", "Si", "P", "S", "Cl", "Ar"],
"Atomic_Weight": [1.008, 4.0026, 6.94, 9.0122, 10.81, 12.011, 14.007,
15.999, 18.998, 20.180, 22.990, None, 26.981, 28.085,
30.974, None, 35.45, 39.948]
}
df = pd.DataFrame(data)
# Display the original DataFrame
print("Original DataFrame:")
print(df)
Original DataFrame:
Element Atomic_Weight
0 H 1.0080
1 He 4.0026
2 Li 6.9400
3 Be 9.0122
4 B 10.8100
5 C 12.0110
6 N 14.0070
7 O 15.9990
8 F 18.9980
9 Ne 20.1800
10 Na 22.9900
11 Mg NaN
12 Al 26.9810
13 Si 28.0850
14 P 30.9740
15 S NaN
16 Cl 35.4500
17 Ar 39.9480
dropna() - remove missing values#
Note the index values are removed and not changed
df_cleaned = df.dropna()
print("\nDataFrame after dropping missing values:")
print(df_cleaned)
DataFrame after dropping missing values:
Element Atomic_Weight
0 H 1.0080
1 He 4.0026
2 Li 6.9400
3 Be 9.0122
4 B 10.8100
5 C 12.0110
6 N 14.0070
7 O 15.9990
8 F 18.9980
9 Ne 20.1800
10 Na 22.9900
12 Al 26.9810
13 Si 28.0850
14 P 30.9740
16 Cl 35.4500
17 Ar 39.9480
fillna(np.nan)Replacing Missing Values with NaN#
Note, Pandas automatically converts None to NaN, but sometimes other characters get introduced into the data. Also, sometimes missing data can cause problems when doing functions over entire dataframes even if it is a string (remember NaN is a float).
data = {
"Element": ["H", "He", "Li", None, "B"],
"State": ["Gas", "Gas", "Solid", None, "Solid"]
}
df1 = pd.DataFrame(data)
print(df1)
# Ensure missing values are explicitly set to NaN
df1 = df1.fillna(np.nan)
print(df1)
Element State
0 H Gas
1 He Gas
2 Li Solid
3 None None
4 B Solid
Element State
0 H Gas
1 He Gas
2 Li Solid
3 NaN NaN
4 B Solid
Replace Custom Missing Values#
This gets tricky as Pandas is discontinuing type implicit down casting and we have two data types in the Atomic Weight column. We had to force it to a string, replace the values with np.nan, and then convert to a float
import numpy as np
import pandas as pd
# Create DataFrame
data = {
"Element": ["H", "He", "Li", "Be", "B"],
"Atomic_Weight": [1.008, 4.0026, "N/A", "?", 10.81]
}
df2 = pd.DataFrame(data)
print("Original DataFrame:")
print(df2)
# ✅ Fully Future-Proof Method
df2["Atomic_Weight"] = df2["Atomic_Weight"].astype("string") # Force string type to avoid downcasting
df2["Atomic_Weight"] = df2["Atomic_Weight"].replace(["N/A", "?"], np.nan) # Replace missing values
df2["Atomic_Weight"] = df2["Atomic_Weight"].astype(float) # Convert to float after replacement
print("\nUpdated DataFrame:")
print(df2)
Original DataFrame:
Element Atomic_Weight
0 H 1.008
1 He 4.0026
2 Li N/A
3 Be ?
4 B 10.81
Updated DataFrame:
Element Atomic_Weight
0 H 1.0080
1 He 4.0026
2 Li NaN
3 Be NaN
4 B 10.8100
Replace Missing Values with the Mean#
df_mean_filled = df.copy()
df_mean_filled["Atomic_Weight"] = df_mean_filled["Atomic_Weight"].fillna(df["Atomic_Weight"].mean())
print("\nDataFrame with missing values replaced by the mean:")
print(df_mean_filled)
DataFrame with missing values replaced by the mean:
Element Atomic_Weight
0 H 1.008000
1 He 4.002600
2 Li 6.940000
3 Be 9.012200
4 B 10.810000
5 C 12.011000
6 N 14.007000
7 O 15.999000
8 F 18.998000
9 Ne 20.180000
10 Na 22.990000
11 Mg 18.587238
12 Al 26.981000
13 Si 28.085000
14 P 30.974000
15 S 18.587238
16 Cl 35.450000
17 Ar 39.948000
Replace Missing Value with Custom Values#
df_filled_custom = df.copy()
# ✅ Assign the modified column explicitly (instead of using inplace=True)
df_filled_custom["Atomic_Weight"] = df_filled_custom["Atomic_Weight"].fillna(0)
print("\nDataFrame with missing values replaced by 0:")
print(df_filled_custom)
DataFrame with missing values replaced by 0:
Element Atomic_Weight
0 H 1.0080
1 He 4.0026
2 Li 6.9400
3 Be 9.0122
4 B 10.8100
5 C 12.0110
6 N 14.0070
7 O 15.9990
8 F 18.9980
9 Ne 20.1800
10 Na 22.9900
11 Mg 0.0000
12 Al 26.9810
13 Si 28.0850
14 P 30.9740
15 S 0.0000
16 Cl 35.4500
17 Ar 39.9480
Use Boolean mask to remove NaN#
Note, we are using .notna() and not .notnan(). ‘na’ stands for “not available”, and this extends to non numeric data types like strings. the not in notna() inverts the Boolean mask the way ~ does in numpy. So it returns True if the value is a number
Now Pandas uses NaN as a standard marker for missing data, not just an invalid number. This is slightly different than Numpy, which treats it as a float.
df_valid = df[df["Atomic_Weight"].notna()]
print("\nDataFrame with only valid atomic weights:")
print(df_valid)
DataFrame with only valid atomic weights:
Element Atomic_Weight
0 H 1.0080
1 He 4.0026
2 Li 6.9400
3 Be 9.0122
4 B 10.8100
5 C 12.0110
6 N 14.0070
7 O 15.9990
8 F 18.9980
9 Ne 20.1800
10 Na 22.9900
12 Al 26.9810
13 Si 28.0850
14 P 30.9740
16 Cl 35.4500
17 Ar 39.9480
4. Statistics#
# Creating a DataFrame with physical properties of alkali metals
alkali_metals = pd.DataFrame({
"Atomic Number": [3, 11, 19, 37, 55], # Lithium to Cesium
"Atomic Radius (pm)": [152, 186, 227, 248, 265],
"Density (g/cm³)": [0.534, 0.97, 0.86, 1.53, 1.87],
"Melting Point (°C)": [180.5, 97.8, 63.5, 39.3, 28.5]
}, index=["Li", "Na", "K", "Rb", "Cs"]) # Using symbols as row index
# Display the DataFrame
print(alkali_metals)
# Example Operations
print("\nMean Atomic Radius:", alkali_metals["Atomic Radius (pm)"].mean()) # Average atomic radius
print("\nSorted by Melting Point:\n", alkali_metals.sort_values("Melting Point (°C)")) # Sort by melting point
print("\nDensity Correlation:\n", alkali_metals.corr()) # Correlation between properties
Atomic Number Atomic Radius (pm) Density (g/cm³) Melting Point (°C)
Li 3 152 0.534 180.5
Na 11 186 0.970 97.8
K 19 227 0.860 63.5
Rb 37 248 1.530 39.3
Cs 55 265 1.870 28.5
Mean Atomic Radius: 215.6
Sorted by Melting Point:
Atomic Number Atomic Radius (pm) Density (g/cm³) Melting Point (°C)
Cs 55 265 1.870 28.5
Rb 37 248 1.530 39.3
K 19 227 0.860 63.5
Na 11 186 0.970 97.8
Li 3 152 0.534 180.5
Density Correlation:
Atomic Number Atomic Radius (pm) Density (g/cm³) \
Atomic Number 1.000000 0.932637 0.973714
Atomic Radius (pm) 0.932637 1.000000 0.895427
Density (g/cm³) 0.973714 0.895427 1.000000
Melting Point (°C) -0.855879 -0.969793 -0.856947
Melting Point (°C)
Atomic Number -0.855879
Atomic Radius (pm) -0.969793
Density (g/cm³) -0.856947
Melting Point (°C) 1.000000
# Convert melting points from °C to Kelvin (for both Series & DataFrame)
to_kelvin = lambda temp: temp + 273.15
# Apply function to the boiling points Series
boiling_points_K = boiling_points.apply(to_kelvin)
print("\nBoiling Points in Kelvin:\n", boiling_points_K)
# Apply function to the DataFrame column
#print("\n", alkali_metals)
alkali_metals["Melting Point (K)"] = alkali_metals["Melting Point (°C)"].apply(to_kelvin)
print("\nUpdated Alkali Metals DataFrame:\n", alkali_metals)
Boiling Points in Kelvin:
F 85.05
Cl 239.15
Br 332.65
I 457.55
At 609.95
Name: Boiling Point (°C), dtype: float64
Updated Alkali Metals DataFrame:
Atomic Number Atomic Radius (pm) Density (g/cm³) Melting Point (°C) \
Li 3 152 0.534 180.5
Na 11 186 0.970 97.8
K 19 227 0.860 63.5
Rb 37 248 1.530 39.3
Cs 55 265 1.870 28.5
Melting Point (K)
Li 453.65
Na 370.95
K 336.65
Rb 312.45
Cs 301.65
5. Activity#
We are going to upload the periodic table from worksheet 4 and explore it in a Pandas Dataframe
Step |
Command |
Purpose |
---|---|---|
Shape |
|
Get number of rows & columns |
Column Labels |
|
See column names |
Row & Column Count |
|
Get counts separately |
Data Types |
|
Check variable types |
Data Overview |
|
Summarize structure & missing values |
First Few Rows |
|
Preview top rows |
Last Few Rows |
|
Preview bottom rows |
Statistics |
|
Get numeric summaries |
Missing Data |
|
Identify missing values |
# Use the directory structure from Workbook 4
import pandas as pd
import os
base_data_dir = os.path.expanduser("~/data") # Parent directory
pubchem_data_dir = os.path.join(base_data_dir, "pubchem_data") # Subdirectory for PubChem
os.makedirs(pubchem_data_dir, exist_ok=True) # Ensure directories exist
periodictable_csv_datapath = os.path.join(pubchem_data_dir, "PubChemElements_all.csv")
df_periodictable = pd.read_csv(periodictable_csv_datapath)
df_periodictable.head()
AtomicNumber | Symbol | Name | AtomicMass | CPKHexColor | ElectronConfiguration | Electronegativity | AtomicRadius | IonizationEnergy | ElectronAffinity | OxidationStates | StandardState | MeltingPoint | BoilingPoint | Density | GroupBlock | YearDiscovered | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | H | Hydrogen | 1.008000 | FFFFFF | 1s1 | 2.20 | 120.0 | 13.598 | 0.754 | +1, -1 | Gas | 13.81 | 20.28 | 0.000090 | Nonmetal | 1766 |
1 | 2 | He | Helium | 4.002600 | D9FFFF | 1s2 | NaN | 140.0 | 24.587 | NaN | 0 | Gas | 0.95 | 4.22 | 0.000179 | Noble gas | 1868 |
2 | 3 | Li | Lithium | 7.000000 | CC80FF | [He]2s1 | 0.98 | 182.0 | 5.392 | 0.618 | +1 | Solid | 453.65 | 1615.00 | 0.534000 | Alkali metal | 1817 |
3 | 4 | Be | Beryllium | 9.012183 | C2FF00 | [He]2s2 | 1.57 | 153.0 | 9.323 | NaN | +2 | Solid | 1560.00 | 2744.00 | 1.850000 | Alkaline earth metal | 1798 |
4 | 5 | B | Boron | 10.810000 | FFB5B5 | [He]2s2 2p1 | 2.04 | 192.0 | 8.298 | 0.277 | +3 | Solid | 2348.00 | 4273.00 | 2.370000 | Metalloid | 1808 |
Set the symbols column as the index#
df_periodictable=df_periodictable.set_index("Symbol")
df_periodictable.head()
AtomicNumber | Name | AtomicMass | CPKHexColor | ElectronConfiguration | Electronegativity | AtomicRadius | IonizationEnergy | ElectronAffinity | OxidationStates | StandardState | MeltingPoint | BoilingPoint | Density | GroupBlock | YearDiscovered | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Symbol | ||||||||||||||||
H | 1 | Hydrogen | 1.008000 | FFFFFF | 1s1 | 2.20 | 120.0 | 13.598 | 0.754 | +1, -1 | Gas | 13.81 | 20.28 | 0.000090 | Nonmetal | 1766 |
He | 2 | Helium | 4.002600 | D9FFFF | 1s2 | NaN | 140.0 | 24.587 | NaN | 0 | Gas | 0.95 | 4.22 | 0.000179 | Noble gas | 1868 |
Li | 3 | Lithium | 7.000000 | CC80FF | [He]2s1 | 0.98 | 182.0 | 5.392 | 0.618 | +1 | Solid | 453.65 | 1615.00 | 0.534000 | Alkali metal | 1817 |
Be | 4 | Beryllium | 9.012183 | C2FF00 | [He]2s2 | 1.57 | 153.0 | 9.323 | NaN | +2 | Solid | 1560.00 | 2744.00 | 1.850000 | Alkaline earth metal | 1798 |
B | 5 | Boron | 10.810000 | FFB5B5 | [He]2s2 2p1 | 2.04 | 192.0 | 8.298 | 0.277 | +3 | Solid | 2348.00 | 4273.00 | 2.370000 | Metalloid | 1808 |
df_periodictable.tail()
AtomicNumber | Name | AtomicMass | CPKHexColor | ElectronConfiguration | Electronegativity | AtomicRadius | IonizationEnergy | ElectronAffinity | OxidationStates | StandardState | MeltingPoint | BoilingPoint | Density | GroupBlock | YearDiscovered | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Symbol | ||||||||||||||||
Fl | 114 | Flerovium | 290.192 | NaN | [Rn]7s2 7p2 5f14 6d10 (predicted) | NaN | NaN | NaN | NaN | 6, 4,2, 1, 0 | Expected to be a Solid | NaN | NaN | NaN | Post-transition metal | 1998 |
Mc | 115 | Moscovium | 290.196 | NaN | [Rn]7s2 7p3 5f14 6d10 (predicted) | NaN | NaN | NaN | NaN | 3, 1 | Expected to be a Solid | NaN | NaN | NaN | Post-transition metal | 2003 |
Lv | 116 | Livermorium | 293.205 | NaN | [Rn]7s2 7p4 5f14 6d10 (predicted) | NaN | NaN | NaN | NaN | +4, +2, -2 | Expected to be a Solid | NaN | NaN | NaN | Post-transition metal | 2000 |
Ts | 117 | Tennessine | 294.211 | NaN | [Rn]7s2 7p5 5f14 6d10 (predicted) | NaN | NaN | NaN | NaN | +5, +3, +1, -1 | Expected to be a Solid | NaN | NaN | NaN | Halogen | 2010 |
Og | 118 | Oganesson | 295.216 | NaN | [Rn]7s2 7p6 5f14 6d10 (predicted) | NaN | NaN | NaN | NaN | +6, +4, +2, +1, 0, -1 | Expected to be a Gas | NaN | NaN | NaN | Noble gas | 2006 |
print(df_periodictable.shape)
print(f'number rows is: {df_periodictable.shape[0]}, \nnumber of columns is {df_periodictable.shape[1]}')
print(df_periodictable.columns)
print(df_periodictable.dtypes)
(118, 16)
number rows is: 118,
number of columns is 16
Index(['AtomicNumber', 'Name', 'AtomicMass', 'CPKHexColor',
'ElectronConfiguration', 'Electronegativity', 'AtomicRadius',
'IonizationEnergy', 'ElectronAffinity', 'OxidationStates',
'StandardState', 'MeltingPoint', 'BoilingPoint', 'Density',
'GroupBlock', 'YearDiscovered'],
dtype='object')
AtomicNumber int64
Name object
AtomicMass float64
CPKHexColor object
ElectronConfiguration object
Electronegativity float64
AtomicRadius float64
IonizationEnergy float64
ElectronAffinity float64
OxidationStates object
StandardState object
MeltingPoint float64
BoilingPoint float64
Density float64
GroupBlock object
YearDiscovered object
dtype: object
df_periodictable.info
<bound method DataFrame.info of AtomicNumber Name AtomicMass CPKHexColor \
Symbol
H 1 Hydrogen 1.008000 FFFFFF
He 2 Helium 4.002600 D9FFFF
Li 3 Lithium 7.000000 CC80FF
Be 4 Beryllium 9.012183 C2FF00
B 5 Boron 10.810000 FFB5B5
... ... ... ... ...
Fl 114 Flerovium 290.192000 NaN
Mc 115 Moscovium 290.196000 NaN
Lv 116 Livermorium 293.205000 NaN
Ts 117 Tennessine 294.211000 NaN
Og 118 Oganesson 295.216000 NaN
ElectronConfiguration Electronegativity AtomicRadius \
Symbol
H 1s1 2.20 120.0
He 1s2 NaN 140.0
Li [He]2s1 0.98 182.0
Be [He]2s2 1.57 153.0
B [He]2s2 2p1 2.04 192.0
... ... ... ...
Fl [Rn]7s2 7p2 5f14 6d10 (predicted) NaN NaN
Mc [Rn]7s2 7p3 5f14 6d10 (predicted) NaN NaN
Lv [Rn]7s2 7p4 5f14 6d10 (predicted) NaN NaN
Ts [Rn]7s2 7p5 5f14 6d10 (predicted) NaN NaN
Og [Rn]7s2 7p6 5f14 6d10 (predicted) NaN NaN
IonizationEnergy ElectronAffinity OxidationStates \
Symbol
H 13.598 0.754 +1, -1
He 24.587 NaN 0
Li 5.392 0.618 +1
Be 9.323 NaN +2
B 8.298 0.277 +3
... ... ... ...
Fl NaN NaN 6, 4,2, 1, 0
Mc NaN NaN 3, 1
Lv NaN NaN +4, +2, -2
Ts NaN NaN +5, +3, +1, -1
Og NaN NaN +6, +4, +2, +1, 0, -1
StandardState MeltingPoint BoilingPoint Density \
Symbol
H Gas 13.81 20.28 0.000090
He Gas 0.95 4.22 0.000179
Li Solid 453.65 1615.00 0.534000
Be Solid 1560.00 2744.00 1.850000
B Solid 2348.00 4273.00 2.370000
... ... ... ... ...
Fl Expected to be a Solid NaN NaN NaN
Mc Expected to be a Solid NaN NaN NaN
Lv Expected to be a Solid NaN NaN NaN
Ts Expected to be a Solid NaN NaN NaN
Og Expected to be a Gas NaN NaN NaN
GroupBlock YearDiscovered
Symbol
H Nonmetal 1766
He Noble gas 1868
Li Alkali metal 1817
Be Alkaline earth metal 1798
B Metalloid 1808
... ... ...
Fl Post-transition metal 1998
Mc Post-transition metal 2003
Lv Post-transition metal 2000
Ts Halogen 2010
Og Noble gas 2006
[118 rows x 16 columns]>
df_periodictable.describe()
AtomicNumber | AtomicMass | Electronegativity | AtomicRadius | IonizationEnergy | ElectronAffinity | MeltingPoint | BoilingPoint | Density | |
---|---|---|---|---|---|---|---|---|---|
count | 118.000000 | 118.000000 | 95.000000 | 99.000000 | 102.000000 | 57.000000 | 103.000000 | 93.000000 | 96.000000 |
mean | 59.500000 | 146.540281 | 1.732316 | 209.464646 | 7.997255 | 1.072140 | 1273.740553 | 2536.212473 | 7.608001 |
std | 34.207699 | 89.768356 | 0.635187 | 38.569130 | 3.339066 | 0.879163 | 888.853859 | 1588.410919 | 5.878692 |
min | 1.000000 | 1.008000 | 0.700000 | 120.000000 | 3.894000 | 0.079000 | 0.950000 | 4.220000 | 0.000090 |
25% | 30.250000 | 66.480750 | 1.290000 | 187.000000 | 6.020500 | 0.470000 | 516.040000 | 1180.000000 | 2.572500 |
50% | 59.500000 | 142.573830 | 1.620000 | 209.000000 | 6.960000 | 0.754000 | 1191.000000 | 2792.000000 | 7.072000 |
75% | 88.750000 | 226.777165 | 2.170000 | 232.000000 | 8.998500 | 1.350000 | 1806.500000 | 3618.000000 | 10.275250 |
max | 118.000000 | 295.216000 | 3.980000 | 348.000000 | 24.587000 | 3.617000 | 3823.000000 | 5869.000000 | 22.570000 |
#Check for missing data for each column
df_periodictable.isnull().sum()
AtomicNumber 0
Name 0
AtomicMass 0
CPKHexColor 10
ElectronConfiguration 0
Electronegativity 23
AtomicRadius 19
IonizationEnergy 16
ElectronAffinity 61
OxidationStates 1
StandardState 0
MeltingPoint 15
BoilingPoint 25
Density 22
GroupBlock 0
YearDiscovered 0
dtype: int64
Extracting Information from DataFrame#
df_periodictable.loc['Cr','ElectronConfiguration']
'[Ar]3d5 4s1'
# Define the list of halogens
halogens = ['F', 'Cl', 'Br', 'I', 'At']
# Select ionization energy for halogens
halogen_ionization = df_periodictable.loc[df_periodictable.index.isin(halogens), "IonizationEnergy"]
# Display results
#display(halogen_ionization.to_frame()) # Ensures it shows as a DataFrame
print(halogen_ionization)
print(type(halogen_ionization))
Symbol
F 17.423
Cl 12.968
Br 11.814
I 10.451
At 9.500
Name: IonizationEnergy, dtype: float64
<class 'pandas.core.series.Series'>
Explaining the above code
Create a list of halogens
halogens = ['F', 'Cl', 'Br', 'I', 'At']
Create a boolean mask that is True for halogens
df_periodictable.index.isin(halogens)
Select rows where boolean mask is True and return only “IonizationEnergy”
.loc[df_periodictable.index.isin(halogens), "IonizationEnergy"]
Assign to variable halogen_ionization
halogen_ionization = df_periodictable.loc[df_periodictable.index.isin(halogens), "IonizationEnergy"]
display(halogen_ionization.to_frame()) # Ensures it shows as a DataFrame
Display is a jupyter function and will only work in notebooks, you need to spring to a string if you want to print it out
print(halogen_ionization)
Symbol
F 17.423
Cl 12.968
Br 11.814
I 10.451
At 9.500
Name: IonizationEnergy, dtype: float64
print(halogen_ionization.to_string())
Symbol
F 17.423
Cl 12.968
Br 11.814
I 10.451
At 9.500
print(halogen_ionization.to_frame())
print(type(halogen_ionization.to_frame()))
IonizationEnergy
Symbol
F 17.423
Cl 12.968
Br 11.814
I 10.451
At 9.500
<class 'pandas.core.frame.DataFrame'>
print(halogen_ionization.to_frame().to_string())
print(type(halogen_ionization.to_frame().to_string()))
IonizationEnergy
Symbol
F 17.423
Cl 12.968
Br 11.814
I 10.451
At 9.500
<class 'str'>
halogen_melt_boil_pts = df_periodictable.loc[df_periodictable.index.isin(halogens), ["MeltingPoint", "BoilingPoint"]]
print(halogen_melt_boil_pts)
print(type(halogen_melt_boil_pts))
MeltingPoint BoilingPoint
Symbol
F 53.53 85.03
Cl 171.65 239.11
Br 265.95 331.95
I 386.85 457.55
At 575.00 NaN
<class 'pandas.core.frame.DataFrame'>
Acknowledgements#
This content was developed with assistance from Perplexity AI and Chat GPT. Multiple queries were made during the Fall 2024 and the Spring 2025.