7. Seaborn Part 3

7. Seaborn Part 3#

3. Seaborn Plots (cont)#

3.4 Regression Plots#

Regression Plot (sns.regplot())#

Plots scatter points of your data.
Overlays a linear regression line (by default).
Uses NumPy and SciPy internally to calculate the least squares best fit line. $$ y = mx + b $$

Where:

m is the slope.
b is the intercept.

It finds the line that minimizes the squared vertical distances (errors) from the data points to the line: $$ \min \sum (y_i - (mx_i + b))^2 $$

Seaborn Option	Method Used	Function Called
`order=1` (default)	Linear regression	`scipy.stats.linregress()`
`order > 1`	Polynomial regression	`numpy.polyfit()`
`lowess=True`	Locally weighted smoothing	`statsmodels.nonparametric.lowess()`

lowess = LOcally WEighted Scatterplot Smoothing

Parameter	Type	Description
`x`	str or array	Variable for x-axis
`y`	str or array	Variable for y-axis
`data`	DataFrame	Data source containing `x` and `y`
`fit_reg`	bool	Whether to draw the regression line (`True` by default)
`ci`	int or None	Size of confidence interval in percent (default=95). Use `None` to hide.
`n_boot`	int	Number of bootstrap samples to estimate `ci`
`line_kws`	dict	Keyword args for customizing the regression line (e.g., color, linewidth)
`scatter_kws`	dict	Keyword args for customizing the scatterplot (e.g., s=size, alpha)
`order`	int	Degree of polynomial regression (1 = linear)
`logx`	bool	Set `True` for log-scale x-axis
`x_jitter`	float	Add jitter to x-values to reduce overlap
`y_jitter`	float	Add jitter to y-values
`truncate`	bool	Whether to truncate the regression line to data range
`dropna`	bool	Whether to ignore missing values
`lowess`	bool	Fit a nonparametric LOWESS regression curve
`color`	str	Color for both points and line
`ax`	matplotlib.axes	Existing axis to draw the plot on

import pandas as pd

alkanes = pd.DataFrame({
    "Alkane": ["Methane", "Ethane", "Propane", "Butane", "Pentane",
               "Hexane", "Heptane", "Octane", "Nonane", "Decane"],
    "Carbons": list(range(1, 11)),
    "MolarMass": [16.04, 30.07, 44.10, 58.12, 72.15,
                  86.18, 100.21, 114.23, 128.26, 142.29],
    "BoilingPoint": [-161.5, -88.6, -42.1, -0.5, 36.1,
                     68.7, 98.4, 125.6, 150.8, 174.1]
})

import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style="whitegrid")

# Regression plot: Boiling Point vs. Molar Mass
plt.figure(figsize=(8, 6))
sns.regplot(data=alkanes, x="MolarMass", y="BoilingPoint")

plt.title("Boiling Point vs. Molar Mass for Linear Alkanes")
plt.xlabel("Molar Mass (g/mol)")
plt.ylabel("Boiling Point (°C)")
plt.show()

../../_images/e50282af7a850924db15e38dfeb82806fac7569808f16c11ca0750d87cdd00af.png

from scipy.stats import linregress

# Extract x and y data from your DataFrame
x = alkanes["MolarMass"]
y = alkanes["BoilingPoint"]

# Perform linear regression
slope, intercept, r_value, p_value, std_err = linregress(x, y)

# Print slope and intercept
print(f"Slope: {slope:.3f}")
print(f"Intercept: {intercept:.3f}")
print(f"R²: {r_value**2:.3f}")
print(f"p value: {p_value**2:.3f}, if p value <0.05 slope is statistically significant")
print(f"std_err: {std_err**2:.3f}, (smaller the better)")

Slope: 2.534
Intercept: -164.469
R²: 0.972
p value: 0.000, if p value <0.05 slope is statistically significant
std_err: 0.023, (smaller the better)

import seaborn as sns
import matplotlib.pyplot as plt

# Create the plot
plt.figure(figsize=(8, 6))
ax = sns.regplot(x=x, y=y)

# Annotate with equation
equation = f"y = {slope:.2f}x + {intercept:.2f}\nR² = {r_value**2:.3f}"
ax.text(0.05, 0.95, equation, transform=ax.transAxes,
        fontsize=12, verticalalignment='top', bbox=dict(boxstyle="round", facecolor="white", alpha=0.7))

# Labels
ax.set_title("Boiling Point vs. Molar Mass for Linear Alkanes")
ax.set_xlabel("Molar Mass (g/mol)")
ax.set_ylabel("Boiling Point (°C)")

plt.show()

../../_images/db80c405aa3a637d8e1d33c87513883ab1bd5eda87079f9031b06b2ae772d453.png

Logistics Regression (sns.lmplot())#

Carbon Atoms	Alkane	Molar Mass (g/mol)	Boiling Point (°C)	Alcohol	Molar Mass (g/mol)	Boiling Point (°C)
1	Methane	16.04	-161.5	Methanol	32.04	64.7
2	Ethane	30.07	-88.6	Ethanol	46.07	78.4
3	Propane	44.10	-42.1	1-Propanol	60.10	97.0
4	Butane	58.12	-0.5	1-Butanol	74.12	117.7
5	Pentane	72.15	36.1	1-Pentanol	88.15	137.9
6	Hexane	86.18	68.7	1-Hexanol	102.18	157.5
7	Heptane	100.21	98.4	1-Heptanol	116.21	176.9
8	Octane	114.23	125.7	1-Octanol	130.23	195.0
9	Nonane	128.26	150.8	1-Nonanol	144.26	212.6
10	Decane	142.29	174.0	1-Decanol	158.29	229.7

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Data preparation
data = {
    'CarbonAtoms': list(range(1, 11)) * 2,
    'MolarMass': [16.04, 30.07, 44.10, 58.12, 72.15, 86.18, 100.21, 114.23, 128.26, 142.29,
                  32.04, 46.07, 60.10, 74.12, 88.15, 102.18, 116.21, 130.23, 144.26, 158.29],
    'BoilingPoint': [-161.5, -88.6, -42.1, -0.5, 36.1, 68.7, 98.4, 125.7, 150.8, 174.0,
                     64.7, 78.4, 97.0, 117.7, 137.9, 157.5, 176.9, 195.0, 212.6, 229.7],
    'CompoundType': ['Alkane'] * 10 + ['Alcohol'] * 10
}

df = pd.DataFrame(data)

# Plotting
sns.set_theme(style="whitegrid")
g = sns.lmplot(
    data=df,
    x='MolarMass',
    y='BoilingPoint',
    hue='CompoundType',
    height=6,
    aspect=1.5,
    markers=['o', 's'],
    palette='muted',
    ci=None
)

# Titles and labels
g.set_axis_labels('Molar Mass (g/mol)', 'Boiling Point (°C)')
g.fig.suptitle('Boiling Point vs. Molar Mass for Alkanes and Primary Alcohols', fontsize=16)
plt.subplots_adjust(top=0.9)

# Show plot
plt.show()

../../_images/8afe77d740c35bf1d8463b466331f81123abdc9306a335f71f083ae3258c5550.png

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import linregress

# Prepare the data
data = {
    'CarbonAtoms': list(range(1, 11)) * 2,
    'MolarMass': [16.04, 30.07, 44.10, 58.12, 72.15, 86.18, 100.21, 114.23, 128.26, 142.29,
                  32.04, 46.07, 60.10, 74.12, 88.15, 102.18, 116.21, 130.23, 144.26, 158.29],
    'BoilingPoint': [-161.5, -88.6, -42.1, -0.5, 36.1, 68.7, 98.4, 125.7, 150.8, 174.0,
                     64.7, 78.4, 97.0, 117.7, 137.9, 157.5, 176.9, 195.0, 212.6, 229.7],
    'CompoundType': ['Alkane'] * 10 + ['Alcohol'] * 10
}

df = pd.DataFrame(data)

# Perform regression analysis for each compound type
results = {}
for compound in ['Alkane', 'Alcohol']:
    subset = df[df['CompoundType'] == compound]
    slope, intercept, r_value, p_value, std_err = linregress(subset['MolarMass'], subset['BoilingPoint'])
    results[compound] = {
        'slope': slope,
        'intercept': intercept,
        'r_squared': r_value**2
    }

# Create the plot
sns.set_theme(style="whitegrid")
g = sns.lmplot(
    data=df,
    x='MolarMass',
    y='BoilingPoint',
    hue='CompoundType',
    height=6,
    aspect=1.5,
    markers=['o', 's'],
    palette='muted',
    ci=None
)

# Annotate regression equations
ax = g.ax  # Get the Matplotlib Axes object
text_y = {
    'Alkane': -10,
    'Alcohol': -70
}
for compound, res in results.items():
    label = f"{compound}:\n$y = {res['slope']:.2f}x + {res['intercept']:.2f}$\n$R^2 = {res['r_squared']:.3f}$"
    x_pos = 90 if compound == "Alkane" else 105
    ax.text(x_pos, text_y[compound], label, fontsize=11, bbox=dict(boxstyle="round", facecolor="white", alpha=0.8))

# Final plot tweaks
g.set_axis_labels('Molar Mass (g/mol)', 'Boiling Point (°C)')
g.fig.suptitle('Boiling Point vs. Molar Mass for Alkanes and Primary Alcohols', fontsize=16)
plt.subplots_adjust(top=0.9)
plt.show()

../../_images/583f5c3a2a52ce49856341deaed1d2a6d33175c285ecf955ba3f732913eb2517.png

3.5 Heatmaps/grids#

Heatmaps#

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os

# Set path to your aquation.csv file
aquation_csv_datapath = os.path.expanduser("~/data/spectra/aquation.csv")

# Step 1: Load the CSV
df = pd.read_csv(aquation_csv_datapath)

# Step 2: Promote first row to column headers (time values in minutes)
df.columns = df.iloc[0]
df = df.drop(index=0)

# Step 3: Rename first column to "Wavelength" and set it as index
df = df.rename(columns={df.columns[0]: "Wavelength"})
df.set_index("Wavelength", inplace=True)

# Step 4: Convert all values to numeric (in case of any formatting issues)
df = df.apply(pd.to_numeric, errors='coerce')

# Optional: Sort index just in case
df.index = df.index.astype(float)
df = df.sort_index()

# Step 5: Plot the heatmap
plt.figure(figsize=(14, 6))
sns.heatmap(
    df,
    cmap="magma",
    cbar_kws={'label': 'Absorbance'},
    xticklabels=10,  # Show every 10th time point on x-axis
    yticklabels=10   # Show every 10th wavelength on y-axis
)

plt.title("Time-Resolved UV-VIS Absorbance Heatmap")
plt.xlabel("Time (minutes)")
plt.ylabel("Wavelength (nm)")
plt.tight_layout()
plt.show()

../../_images/b4bbde8dcdd6e532a2ebe738cd10ac1f571b90f74c36e5f70cd295397db1da56.png

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import os

# Load PubChem periodic table data
periodictable_csv_datapath = os.path.expanduser("~/data/pubchem_data/PubChemElements_all.csv")
df = pd.read_csv(periodictable_csv_datapath)

# Build periodic table layout (main group + transition elements)
pt_layout = [
    # Period 1
    (1, 1, "H"), (2, 18, "He"),
    # Period 2
    (3, 1, "Li"), (4, 2, "Be"), (5, 13, "B"), (6, 14, "C"), (7, 15, "N"), (8, 16, "O"), (9, 17, "F"), (10, 18, "Ne"),
    # Period 3
    (11, 1, "Na"), (12, 2, "Mg"), (13, 13, "Al"), (14, 14, "Si"), (15, 15, "P"), (16, 16, "S"), (17, 17, "Cl"), (18, 18, "Ar"),
    # Period 4
    (19, 1, "K"), (20, 2, "Ca"), (21, 3, "Sc"), (22, 4, "Ti"), (23, 5, "V"), (24, 6, "Cr"), (25, 7, "Mn"), (26, 8, "Fe"),
    (27, 9, "Co"), (28, 10, "Ni"), (29, 11, "Cu"), (30, 12, "Zn"), (31, 13, "Ga"), (32, 14, "Ge"), (33, 15, "As"),
    (34, 16, "Se"), (35, 17, "Br"), (36, 18, "Kr"),
    # Period 5
    (37, 1, "Rb"), (38, 2, "Sr"), (39, 3, "Y"), (40, 4, "Zr"), (41, 5, "Nb"), (42, 6, "Mo"), (43, 7, "Tc"), (44, 8, "Ru"),
    (45, 9, "Rh"), (46, 10, "Pd"), (47, 11, "Ag"), (48, 12, "Cd"), (49, 13, "In"), (50, 14, "Sn"), (51, 15, "Sb"),
    (52, 16, "Te"), (53, 17, "I"), (54, 18, "Xe"),
    # Period 6 (no lanthanides)
    (55, 1, "Cs"), (56, 2, "Ba"), (72, 4, "Hf"), (73, 5, "Ta"), (74, 6, "W"), (75, 7, "Re"), (76, 8, "Os"),
    (77, 9, "Ir"), (78, 10, "Pt"), (79, 11, "Au"), (80, 12, "Hg"), (81, 13, "Tl"), (82, 14, "Pb"), (83, 15, "Bi"),
    (84, 16, "Po"), (85, 17, "At"), (86, 18, "Rn"),
    # Period 7 (no actinides)
    (87, 1, "Fr"), (88, 2, "Ra"), (104, 4, "Rf"), (105, 5, "Db"), (106, 6, "Sg"), (107, 7, "Bh"), (108, 8, "Hs"),
    (109, 9, "Mt"), (110, 10, "Ds"), (111, 11, "Rg"), (112, 12, "Cn"), (113, 13, "Nh"), (114, 14, "Fl"),
    (115, 15, "Mc"), (116, 16, "Lv"), (117, 17, "Ts"), (118, 18, "Og")
]

# Create layout DataFrame
template_df = pd.DataFrame(pt_layout, columns=["AtomicNumber", "Group", "Symbol"])
template_df["Period"] = 1  # Fill with placeholder

# Assign periods by atomic number range
for i, rng in enumerate([(1, 2), (3, 10), (11, 18), (19, 36), (37, 54), (55, 86), (87, 118)], start=1):
    template_df.loc[(template_df["AtomicNumber"] >= rng[0]) & (template_df["AtomicNumber"] <= rng[1]), "Period"] = i

# Merge with PubChem electronegativity data
merged = pd.merge(template_df, df[["AtomicNumber", "Electronegativity"]], on="AtomicNumber", how="left")

# Create 7x18 grid
heatmap_data = pd.DataFrame(np.nan, index=range(1, 8), columns=range(1, 19))
label_data = pd.DataFrame("", index=range(1, 8), columns=range(1, 19))

# Populate grid with electronegativity and element symbol
for _, row in merged.iterrows():
    period = int(row["Period"])
    group = int(row["Group"])
    heatmap_data.at[period, group] = row["Electronegativity"]
    label_data.at[period, group] = row["Symbol"]

# Plot the heatmap
plt.figure(figsize=(14, 6))
sns.heatmap(
    heatmap_data,
    annot=label_data,
    fmt='',
    cmap="viridis",
    linewidths=0.5,
    linecolor='gray',
    cbar_kws={'label': 'Electronegativity'},
    square=True,
    mask=heatmap_data.isnull()
)

plt.title("Electronegativity Heatmap of the Periodic Table (excluding Lanthanides/Actinides)")
plt.xlabel("Group")
plt.ylabel("Period")
plt.yticks(rotation=0)
plt.show()

../../_images/2a81c4e2c841735dcf444ad175f5ca19f183d20f44fb98f3933c3cb313ea1bf5.png

Cluster Map#

Cluster maps utilize a dendogram, which is a dichotomous tree

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# Step 1: Define simplified chemistry dataset with Xenon replacing Helium
data = {
    "Element": [
        "Li", "Na", "K", "Rb",             # Alkali metals
        "Be", "Mg", "Ca", "Sr",            # Alkaline earth metals
        "F", "Cl", "Br", "I",              # Halogens
        "Ne", "Ar", "Kr", "Xe"             # Noble gases
    ],
    "Group": [
        "Alkali", "Alkali", "Alkali", "Alkali",
        "AlkalineEarth", "AlkalineEarth", "AlkalineEarth", "AlkalineEarth",
        "Halogen", "Halogen", "Halogen", "Halogen",
        "NobleGas", "NobleGas", "NobleGas", "NobleGas"
    ],
    "Valence_s": [
        1, 1, 1, 1,
        2, 2, 2, 2,
        2, 2, 2, 2,
        2, 2, 2, 2
    ],
    "Valence_p": [
        0, 0, 0, 0,
        0, 0, 0, 0,
        5, 5, 5, 5,
        6, 6, 6, 6
    ]
}

# Step 2: Create DataFrame
df = pd.DataFrame(data)
df.set_index("Element", inplace=True)

# Step 3: Extract features and standardize
features = df[["Valence_s", "Valence_p"]]
scaled = StandardScaler().fit_transform(features)
scaled_df = pd.DataFrame(scaled, index=features.index, columns=features.columns)

# Step 4: Create color map for row labels
group_colors = {
    "Alkali": "#d62728",         # red
    "AlkalineEarth": "#1f77b4",  # blue
    "Halogen": "#2ca02c",        # green
    "NobleGas": "#9467bd"        # purple
}
row_colors = df["Group"].map(group_colors)

# Step 5: Create clustermap with row colors
sns.clustermap(
    scaled_df,
    cmap="vlag",
    annot=True,
    figsize=(9, 6),
    linewidths=0.5,
    row_cluster=True,
    col_cluster=False,
    row_colors=row_colors
)

plt.suptitle("Clustermap of Elements by Valence Shell Configuration", y=1.05)
plt.show()

../../_images/d19e99e127779e00159f526f862c9e1270658680e3772bc23c6046227b455d19.png

import numpy as np

data = [
    1, 1, 1, 1,
    2, 2, 2, 2,
    2, 2, 2, 2,
    2, 2, 2, 2
]

std_dev = np.std(data)
print("Standard Deviation:", std_dev)
mean = np.mean(data)
print("Mean:", mean)
print(f"z-score for valence_s alkali metal: {(1-mean)/std_dev:.2f}")
print(f"z-score for valence_s all others: {(2-mean)/std_dev:.2f}")

Standard Deviation: 0.4330127018922193
Mean: 1.75
z-score for valence_s alkali metal: -1.73
z-score for valence_s all others: 0.58

Understanding the dendogram#

These are standardized values (z-scores), meaning: $$ z = \frac{x - \text{mean}}{\text{std dev}} $$ z = \frac{x - \mu}{\sigma} $$ You divide by the standard deviation to normalize the value so that large data sets do not weigh more (there are only 2 electrons in the s orbitals but up to 6 in the p, so the difference between a value and the mean can be larger for the p)

We are only showing the z values for each of the two features (valence_s and valence_p electrons) and in the next code cell we will add a third feature column for the total configuration

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# Step 1: Define dataset with valence features
data = {
    "Element": [
        "Li", "Na", "K", "Rb",             # Alkali metals
        "Be", "Mg", "Ca", "Sr",            # Alkaline earth metals
        "F", "Cl", "Br", "I",              # Halogens
        "Ne", "Ar", "Kr", "Xe"             # Noble gases
    ],
    "Group": [
        "Alkali", "Alkali", "Alkali", "Alkali",
        "AlkalineEarth", "AlkalineEarth", "AlkalineEarth", "AlkalineEarth",
        "Halogen", "Halogen", "Halogen", "Halogen",
        "NobleGas", "NobleGas", "NobleGas", "NobleGas"
    ],
    "Valence_s": [
        1, 1, 1, 1,
        2, 2, 2, 2,
        2, 2, 2, 2,
        2, 2, 2, 2
    ],
    "Valence_p": [
        0, 0, 0, 0,
        0, 0, 0, 0,
        5, 5, 5, 5,
        6, 6, 6, 6
    ]
}

# Step 2: Create DataFrame
df = pd.DataFrame(data)
df.set_index("Element", inplace=True)

# Step 3: Add combined valence total
df["Valence_Total"] = df["Valence_s"] + df["Valence_p"]

# Step 4: Standardize all three features
features = df[["Valence_s", "Valence_p", "Valence_Total"]]
scaled = StandardScaler().fit_transform(features)
scaled_df = pd.DataFrame(scaled, index=features.index, columns=features.columns)

# Step 5: Map groups to colors for visual labeling
group_colors = {
    "Alkali": "#d62728",         # red
    "AlkalineEarth": "#1f77b4",  # blue
    "Halogen": "#2ca02c",        # green
    "NobleGas": "#9467bd"        # purple
}
row_colors = df["Group"].map(group_colors)

# Step 6: Create clustermap
sns.clustermap(
    scaled_df,
    cmap="vlag",
    annot=True,
    figsize=(10, 6),
    linewidths=0.5,
    row_cluster=True,
    col_cluster=False,
    row_colors=row_colors
)

plt.suptitle("Clustermap of Valence Electron Configuration (s, p, total)", y=1.05)
plt.show()

../../_images/b35d4a87321a9a2775082f35a2e271d23f2f236655b5d1684bf03b146a4167c6.png

clutermap() and Euclidean distance#

Clustermap() calculates the Euclidean distance across data values for each feature. First it normalizes each column as some may have more values than others (there are on two s electrons, six p electrons and eight total electrons).

Treats each row (element) as a vector: $$ \vec{x} = [z_{s}, z_{p}, z_{\text{total}}] $$
To compare how similar two elements are (say, Li and Be), we compute the Euclidean distance between their feature vectors:

\[ d(\vec{x}, \vec{y}) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + (x_3 - y_3)^2} \]

This is just the 3D Pythagorean theorem, measuring how far apart two points are in the space defined by:

$ x_1, y_1 $ → Valence_s
$ x_2, y_2 $ → Valence_p
$ x_3, y_3 $ → Valence_Total

import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import StandardScaler

# Step 1: Define simplified valence electron dataset
data = {
    "Element": [
        "Li", "Na", "K", "Rb",             # Alkali metals
        "Be", "Mg", "Ca", "Sr",            # Alkaline earth metals
        "F", "Cl", "Br", "I",              # Halogens
        "Ne", "Ar", "Kr", "Xe"             # Noble gases
    ],
    "Group": [
        "Alkali", "Alkali", "Alkali", "Alkali",
        "AlkalineEarth", "AlkalineEarth", "AlkalineEarth", "AlkalineEarth",
        "Halogen", "Halogen", "Halogen", "Halogen",
        "NobleGas", "NobleGas", "NobleGas", "NobleGas"
    ],
    "Valence_s": [
        1, 1, 1, 1,
        2, 2, 2, 2,
        2, 2, 2, 2,
        2, 2, 2, 2
    ],
    "Valence_p": [
        0, 0, 0, 0,
        0, 0, 0, 0,
        5, 5, 5, 5,
        6, 6, 6, 6
    ]
}

# Step 2: Create DataFrame and compute valence total
df = pd.DataFrame(data)
df.set_index("Element", inplace=True)
df["Valence_Total"] = df["Valence_s"] + df["Valence_p"]

# Step 3: Standardize the features
features = df[["Valence_s", "Valence_p", "Valence_Total"]]
scaler = StandardScaler()
scaled = scaler.fit_transform(features)
scaled_df = pd.DataFrame(scaled, index=df.index, columns=features.columns)

# Step 4: Create 3D scatter plot
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')

# Extract standardized coordinates
x = scaled_df["Valence_s"]
y = scaled_df["Valence_p"]
z = scaled_df["Valence_Total"]

# Plot the points
ax.scatter(x, y, z, s=100, c="tomato", edgecolor="k")

# Annotate each element
# Smarter label offsetting to reduce overlap
offsets = [
    (0.07, 0.04, 0.04),
    (0.07, -0.04, 0.04),
    (-0.07, 0.04, 0.04),
    (-0.07, -0.04, 0.04),
    (0.04, 0.07, -0.04),
    (-0.04, 0.07, -0.04),
    (0.04, -0.07, -0.04),
    (-0.04, -0.07, -0.04),
] * 2  # Extend if needed

for (element, xs, ys, zs), (dx, dy, dz) in zip(scaled_df.itertuples(), offsets):
    ax.text(xs + dx, ys + dy, zs + dz, element, fontsize=9, ha='left', va='bottom')


# Axis labels
ax.set_xlabel("Standardized Valence s")
ax.set_ylabel("Standardized Valence p")
ax.set_zlabel("Standardized Valence Total")
ax.set_title("3D Feature Space of Elements (Standardized)")

plt.tight_layout()
plt.show()

../../_images/b50102c5641ad50b28eafd9b54ac868ab75d00110137fc3edb99785229e39635.png

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from scipy.spatial.distance import pdist, squareform

# Step 1: Define simplified chemistry dataset (replacing He with Xe)
data = {
    "Element": [
        "Li", "Na", "K", "Rb",             # Alkali metals
        "Be", "Mg", "Ca", "Sr",            # Alkaline earth metals
        "F", "Cl", "Br", "I",              # Halogens
        "Ne", "Ar", "Kr", "Xe"             # Noble gases
    ],
    "Group": [
        "Alkali", "Alkali", "Alkali", "Alkali",
        "AlkalineEarth", "AlkalineEarth", "AlkalineEarth", "AlkalineEarth",
        "Halogen", "Halogen", "Halogen", "Halogen",
        "NobleGas", "NobleGas", "NobleGas", "NobleGas"
    ],
    "Valence_s": [
        1, 1, 1, 1,
        2, 2, 2, 2,
        2, 2, 2, 2,
        2, 2, 2, 2
    ],
    "Valence_p": [
        0, 0, 0, 0,
        0, 0, 0, 0,
        5, 5, 5, 5,
        6, 6, 6, 6
    ]
}

# Step 2: Create DataFrame
df = pd.DataFrame(data)
df.set_index("Element", inplace=True)

# Step 3: Add combined valence column
df["Valence_Total"] = df["Valence_s"] + df["Valence_p"]

# Step 4: Standardize features
features = df[["Valence_s", "Valence_p", "Valence_Total"]]
scaler = StandardScaler()
scaled = scaler.fit_transform(features)
scaled_df = pd.DataFrame(scaled, index=df.index, columns=features.columns)

# Step 5: Compute pairwise Euclidean distances
dist_matrix = squareform(pdist(scaled_df, metric="euclidean"))
dist_df = pd.DataFrame(dist_matrix, index=scaled_df.index, columns=scaled_df.index)

# Step 6: Plot the distance matrix as a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(dist_df, cmap="mako", annot=True, fmt=".2f", square=True, linewidths=0.5, linecolor='gray')

plt.title("Euclidean Distance Matrix Between Elements (Standardized Features)")
plt.xlabel("Element")
plt.ylabel("Element")
plt.tight_layout()
plt.show()

../../_images/2bbc2e125063e20819778da75e53cf1b3f9ce4f51dde2c87183514b5ee751c4c.png

Pair Plots (sns.pairplot())#

The Seaborn function pairplot() is an exploratory data analysis (EDA) tool for visualizing pairwise relationships in a dataset. It creates a grid of scatter plots (and histograms on the diagonals) for each pair of numerical variables.

Plots every numerical variable against every other numerical variable.
The diagonal shows univariate distributions (histograms or KDEs).
Can color points by a categorical variable using the hue parameter.
Can facet rows and columns using row and col parameters.
Automatically skips non-numeric columns unless used for coloring or faceting

import pandas as pd
import os
import seaborn as sns
import matplotlib.pyplot as plt

base_data_dir = os.path.expanduser("~/data")  # Parent directory
pubchem_data_dir = os.path.join(base_data_dir, "pubchem_data")  # Subdirectory for PubChem
os.makedirs(pubchem_data_dir, exist_ok=True)  # Ensure directories exist
periodictable_csv_datapath = os.path.join(pubchem_data_dir, "PubChemElements_all.csv")
df = pd.read_csv(periodictable_csv_datapath, index_col=1)
df.head()

sns.pairplot(data=df, hue='GroupBlock', diag_kind='hist')
plt.show()

../../_images/3f9c6d7ebf8b1593b400d4e346c918cf2507f6e7d49664e40e29a385f009e7b1.png

import seaborn as sns
import matplotlib.pyplot as plt

# Make sure you select only the relevant columns + 'GroupBlock'
selected_columns = [
    'AtomicNumber', 'AtomicMass', 'AtomicRadius', 'MeltingPoint', 
    'BoilingPoint', 'Density', 'GroupBlock'
]

# Subset the DataFrame
df_subset = df[selected_columns].dropna()  # drop rows with missing data

# Create the pairplot
sns.pairplot(df_subset, hue='GroupBlock', diag_kind='hist', corner=True)

# Show the plot
plt.show()

../../_images/bf212f748816fa449318aaa627524e741501845092950f51bbfd1888a9f37f2b.png

import seaborn as sns
import matplotlib.pyplot as plt

# List the numeric columns and 'GroupBlock'
selected_columns = [
    'AtomicNumber', 'AtomicMass', 'AtomicRadius', 'MeltingPoint', 
    'BoilingPoint', 'Density', 'GroupBlock'
]

# Subset the DataFrame and filter only halogens and noble gases
df_filtered = df[selected_columns].dropna()
df_filtered = df_filtered[df_filtered['GroupBlock'].isin(['Halogen', 'Noble gas'])]

# Create the pairplot
sns.pairplot(
    df_filtered,
    hue='GroupBlock',
    diag_kind='hist',
    corner=True,
    plot_kws={'alpha': 0.7, 's': 40}
)

# Show the plot
plt.show()

../../_images/cb8d7601e733b5237b240d9e9fa4a3f1d49be7583cfe8d2a5d3e2188f90d7c3c.png

import seaborn as sns
import matplotlib.pyplot as plt

# Select and filter your data
selected_columns = [
    'AtomicMass', 'AtomicRadius', 'MeltingPoint', 
    'BoilingPoint', 'Density', 'GroupBlock'
]

df_filtered = df[selected_columns].dropna()
df_filtered = df_filtered[df_filtered['GroupBlock'].isin(['Halogen', 'Noble gas'])]

# Create pairplot with regression lines
sns.pairplot(
    df_filtered,
    hue='GroupBlock',
    kind='reg',              # <-- this adds regression fits
    diag_kind='hist',
    corner=True,
    plot_kws={'scatter_kws': {'alpha': 0.7, 's': 40}}  # control point appearance
)

plt.show()

../../_images/1a0456d0fd5e486cc2a0b3cca0c697defd20cd53d3f2e4391af7a4be75a14cc5.png

Joint Plot (sns.jointplot())#

import pandas as pd
import os
import seaborn as sns
import matplotlib.pyplot as plt

base_data_dir = os.path.expanduser("~/data")  # Parent directory
pubchem_data_dir = os.path.join(base_data_dir, "pubchem_data")  # Subdirectory for PubChem
os.makedirs(pubchem_data_dir, exist_ok=True)  # Ensure directories exist
periodictable_csv_datapath = os.path.join(pubchem_data_dir, "PubChemElements_all.csv")
df = pd.read_csv(periodictable_csv_datapath, index_col=1)
print(df.head())

# Select and filter your data
selected_columns = [
    'AtomicMass', 'AtomicRadius', 'MeltingPoint', 'IonizationEnergy', 
    'BoilingPoint', 'Density', 'GroupBlock', 'Electronegativity'
]

df_filtered = df[selected_columns].dropna()
df_filtered = df_filtered[df_filtered['GroupBlock'].isin(['Halogen', 'Noble gas'])]

sns.jointplot(data=df, x="MeltingPoint", y="BoilingPoint", kind="reg")
sns.jointplot(data=df, x="AtomicRadius", y="IonizationEnergy", kind="scatter")
sns.jointplot(data=df, x="Electronegativity", y="IonizationEnergy", kind="reg")
sns.jointplot(data=df_filtered, x="Electronegativity", y="IonizationEnergy", kind="reg")
plt.show()

        AtomicNumber       Name  AtomicMass CPKHexColor ElectronConfiguration  \
Symbol                                                                          
H                  1   Hydrogen    1.008000      FFFFFF                   1s1   
He                 2     Helium    4.002600      D9FFFF                   1s2   
Li                 3    Lithium    7.000000      CC80FF               [He]2s1   
Be                 4  Beryllium    9.012183      C2FF00               [He]2s2   
B                  5      Boron   10.810000      FFB5B5           [He]2s2 2p1   

        Electronegativity  AtomicRadius  IonizationEnergy  ElectronAffinity  \
Symbol                                                                        
H                    2.20         120.0            13.598             0.754   
He                    NaN         140.0            24.587               NaN   
Li                   0.98         182.0             5.392             0.618   
Be                   1.57         153.0             9.323               NaN   
B                    2.04         192.0             8.298             0.277   

       OxidationStates StandardState  MeltingPoint  BoilingPoint   Density  \
Symbol                                                                       
H               +1, -1           Gas         13.81         20.28  0.000090   
He                   0           Gas          0.95          4.22  0.000179   
Li                  +1         Solid        453.65       1615.00  0.534000   
Be                  +2         Solid       1560.00       2744.00  1.850000   
B                   +3         Solid       2348.00       4273.00  2.370000   

                  GroupBlock YearDiscovered  
Symbol                                       
H                   Nonmetal           1766  
He                 Noble gas           1868  
Li              Alkali metal           1817  
Be      Alkaline earth metal           1798  
B                  Metalloid           1808  

../../_images/a8eda21de7cfdbce988988d8f607cf25b8d29eceeec946f479746a9fc65b5753.png

../../_images/f8adcab216fa4eead1c7db81226241923f084ce3b76647a2671adf7374b8f82a.png

../../_images/a1635cb1e88e85d6e4c07726f2f02b9b1e7fe0a9b47fcf336ca126e53e2618da.png

../../_images/9cfab4d6866369df81b095a3dc611402c02c99e4e0adffcff95056aafb9924d6.png

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Simulate concentration and absorbance with increasing variance
np.random.seed(42)
concentration = np.linspace(0.001, 1.0, 200)
absorbance = 1.5 * concentration + np.random.normal(0, 0.15 * concentration, size=200)
#
#absorbance = 1.5 * concentration + np.random.normal(0, 0.15 * np.exp(concentration))

df = pd.DataFrame({
    "Concentration": concentration,
    "Absorbance": absorbance
})


sns.jointplot(data=df, x="Concentration", y="Absorbance", kind="reg")
plt.show()

../../_images/8331b67c72f2b4e4d9d7854eda37e158f828b22092f212b2eab9790823377c15.png

Acknowledgements#

This content was developed with assistance from Perplexity AI and Chat GPT. Multiple queries were made during the Fall 2024 and the Spring 2025.