10.2: Naive Bayes and Model Construction#
Learning Objectives
Purpose
Introduce students to the manual construction of a supervised classification
model using Naïve Bayes, emphasizing how data preparation, feature selection,
and training decisions affect model behavior and evaluation.
Students Learn
- Define a fixed feature space by removing invariant fingerprint bits and preserving the feature mask.
- Create, save, and reload stratified training and test splits along with supporting metadata.
- Diagnose class imbalance in a training dataset and apply downsampling as a model-specific preprocessing step.
- Build a probabilistic Naïve Bayes classifier from training data.
- Generate and interpret confusion matrices for classification-based inference.
- Generate and interpret ROC curves for probability-based model evaluation.
Core Activities
- Organize model inputs, splits, and metadata into a reproducible directory structure.
- Construct a Naïve Bayes classifier step-by-step using saved training data.
- Evaluate model performance using both class predictions and predicted probabilities.
Prior Knowledge
- Complete Module 10.1: Data Preparation and Feature Engineering
- Complete Appendix A10.2: Bayes' Theorem: From Inference to Models
1. Preparation for model building#
We are building supervised learning models to predict the biological activity of small molecules. Each molecule is represented by a set of MACCS keys, and each molecule has an associated activity label: 1 for active and 0 for inactive.
For a single compound, the model can be written as:
\( y = f(X) \)
where:
(X) is a feature vector containing the MACCS keys for that compound,
(y) is a scalar prediction (0 or 1),
(f) is the learned model that maps features to an activity prediction.
For an entire dataset of compounds, this generalizes to:
\( \mathbf{y} = f(\mathbf{X}) \)
where:
\((\mathbf{X})\) is a feature matrix of shape (n_samples, n_features),
\((\mathbf{y})\) is a label vector of length n_samples.
We use uppercase X to indicate that the input consists of multiple descriptors.
In practice, the symbol X is used generically to denote the input features. Depending on context, X may refer to:
- a single feature vector for one compound
- a feature matrix containing feature vectors for many compounds
Explanation
The use of uppercase X and lowercase y follows a long-standing convention in mathematics, statistics, and machine learning.
- X (uppercase) typically represents a matrix of features when working with a dataset
  - Shape: (n_samples, n_features)
  - Each row corresponds to one compound (observation)
  - Each column corresponds to one descriptor or fingerprint bit
- y (lowercase) represents a vector of target values (labels)
  - Shape: (n_samples,)
  - Each element is the activity associated with one compound
In linear-algebra terms:
- X is a 2-dimensional object (a matrix)
- y is a 1-dimensional object (a vector)
This naming convention reflects the canonical supervised-learning equation:
\( y = f(\mathbf{X}) \)
where:
- the function f represents a learned model
- the model takes a matrix of input features (X)
- and produces a vector of outputs (y)
In short:
- Uppercase X → feature matrix
- Lowercase y → target vector
This convention is not enforced by Python, but it is widely adopted and helps communicate the structure of the data at a glance.
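The convention can be illustrated with a small sketch. The toy values and the names `X_toy` and `y_toy` are assumptions made up for illustration; they are not part of the assay data.

```python
import numpy as np

# Hypothetical toy data: 4 compounds described by 3 fingerprint bits each
X_toy = np.array([
    [1, 0, 1],
    [0, 0, 1],
    [1, 1, 0],
    [0, 1, 0],
])
# One activity label per compound, in the same row order as X_toy
y_toy = np.array([1, 0, 1, 0])

print(X_toy.ndim, X_toy.shape)  # 2-D: (n_samples, n_features)
print(y_toy.ndim, y_toy.shape)  # 1-D: (n_samples,)

# Rows of X and elements of y must stay aligned
assert X_toy.shape[0] == y_toy.shape[0]
```

The only structural requirement is that row i of the feature matrix and element i of the label vector describe the same compound.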
1.1 Loading the data into X and y.#
import pandas as pd
# read fingerprints with activities csv file into pandas dataframe
df_data = pd.read_csv("data/AID743139/features/AID743139_MACCS_activities_noSalt_20260205.csv") # Change to your file path if you need to load the dataframe
# we now have a dataframe with CIDS, activities and maccs keys
print(df_data.shape)
df_data.head(3)
(6793, 170)
| cid | activity | clean_smiles | maccs000 | maccs001 | maccs002 | maccs003 | maccs004 | maccs005 | maccs006 | ... | maccs157 | maccs158 | maccs159 | maccs160 | maccs161 | maccs162 | maccs163 | maccs164 | maccs165 | maccs166 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12850184 | 0 | O=C(CO)[C@@H](O)[C@H](O)[C@@H](O)C(=O)[O-] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 89753 | 0 | O=C([O-])[C@H](O)[C@@H](O)[C@H](O)[C@H](O)CO | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 9403 | 0 | C[C@]12CC[C@@H]3c4ccc(O)cc4CC[C@H]3[C@@H]1CC[C... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 |
3 rows × 170 columns
We will put the MACCS Keys into a variable called X_MACCS, the feature Matrix.
We will put the activity values into a variable called y, the label vector.
X_MACCS = df_data.iloc[:,3:] # this is dropping cid, activity, and clean_smiles and creating a new variable for maccs data
y = df_data['activity'].values # note, df_data is a pd dataframe, but .values makes it a np array
# Print head of feature matrix
X_MACCS.head(3)
| maccs000 | maccs001 | maccs002 | maccs003 | maccs004 | maccs005 | maccs006 | maccs007 | maccs008 | maccs009 | ... | maccs157 | maccs158 | maccs159 | maccs160 | maccs161 | maccs162 | maccs163 | maccs164 | maccs165 | maccs166 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 |
3 rows × 167 columns
import numpy as np
print(y[:5])# prints first 5 values of the label vector
print(np.unique(y)) # prints all unique values of label vector
[0 0 0 0 0]
[0 1]
# Write the code to print length of label vector (y) and number of active compounds (hint, they are zeros and ones, so you can sum them)
print(len(y))
print(y.sum())
6793
743
1.2 Feature Selection: Remove zero-variance features#
Some features in X are not helpful in distinguishing actives from inactives, because they are set ON for all compounds or OFF for all compounds. Such features need to be removed because they would consume more computational resources without improving the model.
We will use the VarianceThreshold method of sklearn to identify which features have a variance of zero or very low. Variance in data represents how spread out the values of a feature are. The threshold parameter is set to 0.0 by default, meaning only features with zero variance (constant values across all samples, 100% identical values) are removed.
What if a feature has ≥99% identical values?
Explanation
Let’s say a feature is 1 in 99.5% of rows and 0 in the remaining 0.5%. It does not have zero variance, but the variance is very low.
If you want to remove such near-constant features, you need to set threshold accordingly. For a binary feature, the variance is calculated as:
\( \mathrm{Var} = p(1 - p) = 0.995 \times 0.005 = 0.004975 \)
where p is the probability of the feature being 1.
So to remove this feature, your threshold must be greater than 0.004975, for example:
VarianceThreshold(threshold=0.005)
It might be interesting to see how our models, or the time needed to build them, change if we prefilter the features by adjusting the threshold.
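The arithmetic above can be checked with a short sketch. The two columns below are synthetic stand-ins (an assumption for illustration, not real MACCS bits): one is ON in 99.5% of rows and the other in 50% of rows.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical near-constant bit: ON in 99.5% of rows, OFF in 0.5%
near_constant = np.r_[np.ones(995), np.zeros(5)]
# Hypothetical informative bit: ON in half of the rows
balanced = np.r_[np.ones(500), np.zeros(500)]
X_demo = np.column_stack([near_constant, balanced])

p = near_constant.mean()
print(p * (1 - p))         # Bernoulli variance p(1 - p)
print(X_demo.var(axis=0))  # population variance of each column

kept_default = VarianceThreshold(threshold=0.0).fit(X_demo).get_support()
kept_strict = VarianceThreshold(threshold=0.005).fit(X_demo).get_support()
print(kept_default)  # both columns vary, so both are kept
print(kept_strict)   # the near-constant column falls below 0.005
```

With the default threshold both columns survive; only the stricter threshold drops the near-constant one.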
from sklearn.feature_selection import VarianceThreshold
X_MACCS.shape #- Before removal
(6793, 167)
import numpy as np
from sklearn.feature_selection import VarianceThreshold
# Apply variance threshold
sel = VarianceThreshold(threshold=0.0)
X_MACCS_filtered = sel.fit_transform(X_MACCS)
# Boolean mask of retained features
mask = sel.get_support()
# Human-readable feature names
kept_features = X_MACCS.columns[mask]
removed_features = X_MACCS.columns[~mask]
print("Features removed:", list(removed_features))
print("Filtered feature matrix shape:", X_MACCS_filtered.shape)
# ---- FEATURE METADATA (generated here) ----
feature_metadata = {
    "fingerprint": "MACCS",
    "original_bits": X_MACCS.shape[1],
    "selected_bits": int(mask.sum()),
    "removed_bits": int((~mask).sum()),
    "selection_method": "VarianceThreshold",
    "threshold": 0.0,
    "removed_feature_names": list(removed_features),
    # Should name the CSV file actually loaded above
    "source_file": "AID743139_MACCS_activities_noSalt_20260205.csv",
    "notes": "Invariant MACCS bits removed prior to modeling"
}
Features removed: ['maccs000', 'maccs001', 'maccs002', 'maccs004', 'maccs166']
Filtered feature matrix shape: (6793, 162)
A MACCS fingerprint has 166 bit positions, but the RDKit MACCS fingerprint has 167. The extra position (maccs000) is zero for all molecules and is removed regardless of the threshold. Can you think of why this position is always set to zero?
Answer
The maccs000 position is always zero because it is a dummy bit added by RDKit for bookkeeping purposes.
The canonical MACCS fingerprint defines 166 chemically meaningful keys numbered 1–166 (not 167).
By including a zero-valued bit at position 0, RDKit allows the bit index to match the MACCS key number directly (e.g., bit 1 → MACCS key 1), avoiding off-by-one confusion when labeling features.
Because this dummy bit carries no chemical information and has zero variance across all molecules, it is removed during feature preprocessing.
1.3 Feature Selection: Freezing the Feature Definition (Mask and Metadata)#
Up to this point, we have generated MACCS fingerprints for all compounds and examined which fingerprint bits vary across the dataset. When we remove invariant bits using a variance filter, we are no longer just manipulating data—we are defining the feature space that every downstream model will use. From this moment forward, a “feature” has a specific meaning: it is one of the MACCS bits that survived this filtering step. This process is known as feature selection: deciding which descriptors are informative enough to be included in the model and which should be excluded.
Because this feature selection step is learned from the data, it must be treated as part of the scientific record. Any model trained on these data—whether Naive Bayes, Decision Trees, or future methods—must use exactly the same feature definition in order to be valid and comparable. To ensure reproducibility and avoid hidden assumptions, we explicitly save the feature mask (a Boolean array indicating which MACCS bits were kept or removed) as a numpy file, along with descriptive metadata explaining how and why the selection was performed as a json file.
By freezing the feature definition here (meaning it will not change for this dataset), we create a clear boundary between data preparation and modeling. All subsequent notebooks and models will load and reuse this saved feature definition rather than recomputing it, guaranteeing that results remain consistent even across kernel restarts, new environments, or alternative machine-learning methods.
from pathlib import Path
import json
import numpy as np
from cinf26pk.core import make_filename, make_fixed_filename
#PROJECT_ROOT = Path.cwd() # current working directory
FEATURES = Path("data/AID743139/features")
FEATURES.mkdir(parents=True, exist_ok=True)
# Save variance mask
mask_fname = make_filename(
prefix="AID743139_MACCS_variance_mask",
ext="npy"
)
np.save(FEATURES / mask_fname, mask)
# Save metadata JSON
meta_fname = make_filename(
prefix="AID743139_MACCS_feature_metadata",
ext="json"
)
with open(FEATURES / meta_fname, "w") as f:
    json.dump(feature_metadata, f, indent=2)
print(f"[Saved] Variance mask → {FEATURES/mask_fname}")
print(f"[Saved] Feature metadata → {FEATURES/meta_fname}")
[Saved] Variance mask → data/AID743139/features/AID743139_MACCS_variance_mask_20260205_v1.npy
[Saved] Feature metadata → data/AID743139/features/AID743139_MACCS_feature_metadata_20260205_v1.json
Explain the role of the following three files in the features directory. These three files together define the feature representation and labeling scheme used for model training and evaluation. They must be treated as a matched set.
AID743139_MACCS_activities_noSalt_.csv
This CSV file contains the core machine-learning dataset. Each row corresponds to a single PubChem compound (CID) and includes:
- The PubChem Compound ID (CID)
- A binary activity label (0 = inactive, 1 = active) derived from the assay outcomes
- A MACCS fingerprint vector, which will be used to construct the feature matrix (X) for model training
AID743139_MACCS_feature_metadata_.json
This JSON file records the feature-engineering decisions applied to the MACCS fingerprints. It serves as documentation and provenance for how the feature matrix was constructed.
Typical contents include:
- Which MACCS bit positions were removed (e.g., invariant bits such as MACCS000)
- The reason for removal (e.g., zero variance across all compounds)
- The original fingerprint length and the final feature count
- Any parameters or assumptions used during fingerprint processing
This record makes it possible to:
- Reproduce the feature matrix exactly
- Understand why certain bits are missing
- Apply the same preprocessing rules to future datasets or external test compounds
AID743139_MACCS_variance_mask_.npy
This NumPy file contains the Boolean variance mask used to filter the MACCS fingerprints. The mask is a 1-D array with one entry per original MACCS bit:
- True → keep this bit
- False → drop this bit
The mask is used to:
- Transform raw MACCS fingerprints into the final feature matrix
- Ensure consistency between training data, test data, and future predictions
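As a rough sketch of how a frozen mask can be reused, the same Boolean array can filter fingerprints of compounds that were not in the original dataset. The arrays below are stand-ins generated for illustration (an assumption); in practice the mask would be loaded from the saved .npy file and the fingerprints would come from the same MACCS generation step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the saved mask: 167 entries, with the five removed bits
# reported above (maccs000, 001, 002, 004, 166) set to False
mask_demo = np.ones(167, dtype=bool)
mask_demo[[0, 1, 2, 4, 166]] = False

# Stand-in for raw 167-bit MACCS fingerprints of 3 external compounds
X_external = rng.integers(0, 2, size=(3, 167))

# Guard against applying a mask built for a different fingerprint length
assert X_external.shape[1] == mask_demo.shape[0]

# Boolean indexing on the column axis keeps only the retained bits
X_external_filtered = X_external[:, mask_demo]
print(X_external_filtered.shape)
```

Note that the mask is applied, not recomputed: a bit that happens to be constant in the new compounds is still kept if it was kept in the original feature definition.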
1.4 Reload Data#
Where we are in the Supervised Learning Pipeline
In this module, we are continuing a supervised learning pipeline that began with data preparation. We first downloaded BioAssay data from PubChem and stored the unmodified exports in the /raw directory. We then curated these data to remove invalid records and standardize chemical representations, saving the cleaned results in /curated. From this curated dataset, we generated molecular fingerprints (MACCS keys) and applied feature-level filtering to remove invariant bits, producing a refined feature representation. The resulting fingerprint dataset, along with the feature-selection mask and metadata documenting how the features were constructed, was saved in the /features directory as non-volatile .csv and .npy files. In the current notebook, we reload these saved feature artifacts and use them to define train/test splits, which establish the experimental framework for the modeling activities that follow. These splits will support multiple supervised learning models—such as Naive Bayes, Decision Trees, Random Forests, and k-Nearest Neighbors—while ensuring that all models are trained and evaluated on a consistent molecular representation.
Note: In this notebook, the term pipeline refers to a conceptual workflow, a sequence of data transformations and modeling steps applied consistently. In later notebooks, this workflow will be formalized into reusable and scalable pipeline objects.
1.4.1 Regenerate X_MACCS#
X_MACCS is the original unfiltered feature matrix
This is the “ground truth representation”
# Step 1 Load source feature data
import pandas as pd
df_data = pd.read_csv(
"data/AID743139/features/AID743139_MACCS_activities_noSalt_20260205.csv"
)# adjust code to your file
# Separate labels
y = df_data["activity"].values
# Reconstruct full MACCS feature matrix
X_MACCS = df_data.iloc[:, 3:] # CID, activity, clean_smiles dropped
print(X_MACCS.shape)
(6793, 167)
1.4.2 Load and apply saved variance mask#
The mask is a 1-D array, and its length must equal X_MACCS.shape[1].
import numpy as np
# Load saved variance mask
mask = np.load(
"data/AID743139/features/AID743139_MACCS_variance_mask_20260205_v1.npy"
)# adjust code to your file path
print("Mask shape:", mask.shape)
print(mask.sum(), "features retained")
# Safety check: ensure mask matches feature matrix
assert X_MACCS.shape[1] == mask.shape[0], (
"Incompatible feature mask: "
"mask length does not match number of MACCS features. "
"Check that the CSV and mask were generated from the same dataset."
)
Mask shape: (167,)
162 features retained
1.4.3 Reconstruct the filtered feature matrix X_MACCS_filtered#
In this step, we reconstruct the filtered MACCS feature matrix by reapplying a previously learned transformation. No feature selection is performed here; instead, we reuse the saved variance mask to ensure that the exact same feature definition is recovered after a kernel restart. The reconstruction is performed using a Pandas DataFrame so that feature alignment and consistency can be verified. Once feature semantics are finalized, the filtered feature matrix is explicitly converted to a NumPy array in preparation for downstream modeling, where only numeric operations are required.
X_MACCS_filtered = X_MACCS.loc[:, mask]
print(X_MACCS_filtered.shape)
# Decide representation for modeling
X_MACCS_filtered = X_MACCS_filtered.to_numpy() # <-- optional but explicit
(6793, 162)
1.5 Train-Test-Split (a 9:1 ratio)#
Now that we’ve prepared the dataset, the next step is to divide it into two parts: one for training the model and one for testing it. This is important because we want to evaluate how well the model performs on unseen data, and not just the data it was trained on.
This is typically done by splitting the dataset into two subsets using a specified ratio. Common splits include 80:20 or 70:30, where the larger portion is used for training and the smaller for testing. When the dataset is small or the model requires more examples to learn effectively, a 90:10 split can be helpful.
In the next code section, we will split the data so that 90% goes into the training set and 10% into the test set. The training set is used to build the model, while the test set is used to evaluate how well the model generalizes to new data. Sklearn’s train_test_split creates several NumPy arrays at once by applying the same randomized split to aligned data. We include the DataFrame index so we can recover which chemical compounds ended up in the training and test sets.
It is important that this train–test split is performed before any model-building decisions, such as class balancing, downsampling, or reweighting. These operations will later be applied only to the training data, never to the test data. By splitting the dataset at this stage and saving the resulting arrays, we preserve an untouched test set that represents the original data distribution. This ensures that model evaluation reflects true generalization performance rather than artifacts introduced during training.
The following table summarizes the key parameters used in the upcoming train_test_split() call. Review these options before running the code, as they determine how the training and test sets are constructed.
| Parameter | Value Used | Purpose in This Workflow |
|---|---|---|
| X_MACCS_filtered | feature matrix | The input feature matrix containing MACCS fingerprints after variance filtering. |
| y | label vector | Binary activity labels (0 = inactive, 1 = active) aligned row-by-row with X_MACCS_filtered. |
| df_data.index | row identifiers | Preserves the original row identity so compounds can be traced after splitting. |
| test_size | 0.1 | Holds out 10% of the data for testing, leaving 90% for training. |
| shuffle | True | Randomizes the order of samples before splitting to avoid ordering bias. |
| random_state | 3100 | Fixes the random number generator seed so the split is reproducible. |
| stratify | y | Ensures the training and test sets retain the same active/inactive class ratio as the original dataset. |
In particular, the stratify=y argument is critical for classification problems with class imbalance. It guarantees that both the training and test sets reflect the original class distribution, preventing accidental bias in model evaluation caused by uneven splits.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
X_MACCS_filtered,
y,
df_data.index, # preserve row identity across the split
shuffle=True,
random_state=3100, # make the split reproducible
stratify=y, # preserve class balance
test_size=0.1 # 10% of data held out for testing
)
print("Training set shape:", X_train.shape, y_train.shape)
print("where there are", X_train.shape[0], "samples, and", X_train.shape[1], "features")
print("and", y_train.shape[0], "activities associated with the training set.")
print()
print("Test set shape:", X_test.shape, y_test.shape)
print("where there are", X_test.shape[0], "samples, and", X_test.shape[1], "features")
print("and", y_test.shape[0], "activities associated with the test set.")
print()
print("Number of active compounds in training set:", y_train.sum())
print("Number of active compounds in test set:", y_test.sum())
Training set shape: (6113, 162) (6113,)
where there are 6113 samples, and 162 features
and 6113 activities associated with the training set.
Test set shape: (680, 162) (680,)
where there are 680 samples, and 162 features
and 680 activities associated with the test set.
Number of active compounds in training set: 669
Number of active compounds in test set: 74
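The counts printed above can be used to confirm that stratification worked. The sketch below only does arithmetic on the numbers reported in this notebook; the variable names are illustrative.

```python
# Counts taken from the outputs printed in this notebook
n_total, n_active_total = 6793, 743
n_train, n_active_train = 6113, 669
n_test, n_active_test = 680, 74

# The two subsets should account for every compound and every active
assert n_train + n_test == n_total
assert n_active_train + n_active_test == n_active_total

frac_total = n_active_total / n_total
frac_train = n_active_train / n_train
frac_test = n_active_test / n_test
print(f"active fraction  full: {frac_total:.4f}  "
      f"train: {frac_train:.4f}  test: {frac_test:.4f}")
```

All three fractions are close to 11%, which is what stratify=y is meant to guarantee.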
What does this line of code do?
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
X_MACCS_filtered,
y,
df_data.index, # preserve row identity across the split
shuffle=True,
random_state=3100, # make the split reproducible
stratify=y, # preserve class balance
test_size=0.1 # 10% of data held out for testing
)
Explanation
This line calls scikit-learn’s train_test_split() function to partition the data into training and test sets in a way that is safe for machine learning.
Before this line:
X_MACCS_filtered is a 2-D numerical array (or array-like object) containing molecular features (rows = compounds, columns = MACCS fingerprint bits).
y is a 1-D numerical array containing the corresponding activity labels (one value per compound).
What train_test_split() does:
Splits the rows of X and y together so that each compound’s feature vector stays paired with its activity label.
Creates six aligned outputs:
X_train – feature matrix used to train the model
X_test – feature matrix held back for evaluation
y_train – activity labels for the training set
y_test – activity labels for the test set
idx_train - row numbers of training samples
idx_test - row numbers of test samples
Uses NumPy-style array slicing internally, which is why the outputs expose NumPy attributes like .shape.
Shuffles the data before splitting (shuffle=True). This prevents ordering artifacts (e.g., all actives grouped together).
Preserves class balance (stratify=y). The fraction of active vs. inactive compounds is maintained in both the training and test sets.
Ensures reproducibility (random_state=3100). The same split will be generated every time this code is run.
Reserves 10% of the data for testing (test_size=0.1). The remaining 90% is used for model training.
Because the CID and SMILES columns were dropped when building the feature matrix, the arrays alone cannot tell us which compound is which. idx_train and idx_test preserve the original row identifiers so each sample can be traced back to df_data.
In short, this single line converts your feature table and labels into aligned training and test arrays, plus the row indices needed for traceability, structured exactly the way scikit-learn expects.
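The behavior described above can be tried on a small synthetic dataset. Everything in this sketch (the sizes, the random features, the names `X_demo`, `y_demo`, and `row_ids`) is an assumption for illustration; only the train_test_split call mirrors the one used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Hypothetical dataset: 100 compounds, 5 binary features, 20% actives
X_demo = rng.integers(0, 2, size=(100, 5))
y_demo = np.array([1] * 20 + [0] * 80)
row_ids = np.arange(100)  # stand-in for df_data.index

X_tr, X_te, y_tr, y_te, idx_tr, idx_te = train_test_split(
    X_demo, y_demo, row_ids,
    test_size=0.1, shuffle=True, stratify=y_demo, random_state=0,
)

print(X_tr.shape, X_te.shape)    # 90 training rows, 10 test rows
print(y_tr.mean(), y_te.mean())  # active fraction in each subset

# The returned indices point back to the original rows
assert np.array_equal(X_demo[idx_te], X_te)
assert np.array_equal(y_demo[idx_tr], y_tr)
```

Passing the row identifiers as a third array is what makes the last two checks possible: the same shuffled split is applied to features, labels, and identifiers.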
1.5.1 Save the test/train split as np arrays#
from pathlib import Path
import json
import numpy as np
# ------------------------------------------------------------
# Define split configuration
# ------------------------------------------------------------
SPLIT_NAME = "90_10"
SPLITS = Path("data/AID743139/splits") / SPLIT_NAME
SPLITS.mkdir(parents=True, exist_ok=True)
# Optional subfolder for arrays (keeps things tidy)
ARRAYS = SPLITS / "arrays"
ARRAYS.mkdir(exist_ok=True)
# ------------------------------------------------------------
# Save NumPy arrays
# ------------------------------------------------------------
np.save(ARRAYS / "X_train.npy", X_train)
np.save(ARRAYS / "X_test.npy", X_test)
np.save(ARRAYS / "y_train.npy", y_train)
np.save(ARRAYS / "y_test.npy", y_test)
# Save index mappings (critical for traceability)
np.save(SPLITS / "train_idx.npy", idx_train)
np.save(SPLITS / "test_idx.npy", idx_test)
# ------------------------------------------------------------
# Save CID traceability
# ------------------------------------------------------------
df_data.loc[idx_train, ["cid"]].to_csv(
SPLITS / "train_cids.csv",
index=False
)
df_data.loc[idx_test, ["cid"]].to_csv(
SPLITS / "test_cids.csv",
index=False
)
# ------------------------------------------------------------
# Save split metadata
# ------------------------------------------------------------
split_metadata = {
"assay": "AID743139",
"split_name": SPLIT_NAME,
"description": "Stratified 90/10 train/test split",
"test_fraction": 0.10,
"train_fraction": 0.90,
"stratified": True,
"random_state": 3100,
"labels": "Active vs Inactive",
"features": "MACCS (variance-filtered)",
"source_feature_file": "AID743139_MACCS_activities_noSalt_20260205.csv"
}
with open(SPLITS / "split_metadata.json", "w") as f:
    json.dump(split_metadata, f, indent=2)
print(f"[Saved] Train/test split → {SPLITS}")
[Saved] Train/test split → data/AID743139/splits/90_10
The train/test split produces several files that serve different roles in the machine-learning workflow. Some are used directly for model training, while others preserve traceability, reproducibility, and interpretability. The table below summarizes what was saved and why.
| Purpose | Files | Why They Exist |
|---|---|---|
| Model inputs | X_train.npy, X_test.npy, y_train.npy, y_test.npy | Numerical feature matrices and label vectors used directly for model training and evaluation. |
| Operational split definition | train_idx.npy, test_idx.npy | Exact row indices used to construct the split, allowing it to be verified, reconstructed, or reused. |
| Chemical identity traceability | train_cids.csv, test_cids.csv | Maps rows in the NumPy arrays back to compound identifiers for inspection, reporting, and auditing. |
| Provenance & metadata | split_metadata.json | Documents how the split was created, including parameters and source files, as part of the scientific record. |
Not all of these files are used immediately in this notebook. Some exist to support later model evaluation, interpretation, or reproducibility, and will be revisited in subsequent sections and notebooks.
Explain the role of the following files in the splits/90_10 directory. Be sure to open the files! You can open the *.csv and *.json files directly from the JupyterLab file browser. You can open the *.npy files by executing the following in a code cell (altering the file name as appropriate).
SPLIT_NAME = "90_10"
SPLITS = Path("data/AID743139/splits") / SPLIT_NAME
SPLITS.mkdir(parents=True, exist_ok=True)
ARRAYS = SPLITS / "arrays"
ARRAYS.mkdir(exist_ok=True)
X_test_loaded = np.load(ARRAYS / "X_test.npy")
X_test_loaded
The first three file types define the feature representation and labeling scheme used for model training and evaluation. They must be treated as a matched set.
split_metadata.json
Documents how the split was performed:
- Train/test ratio (90/10)
- Random seed used
- Whether stratification was applied (maintaining the same active/inactive ratio in the training and test sets as in the original set)
- Source feature file name
train_cids.csv and test_cids.csv
Identify which compounds (CIDs) ended up in the training and test sets. These files are for:
- Inspection
- Reporting
- Debugging
train_idx.npy and test_idx.npy
The train and test index files define the data split operationally by storing the exact row positions used to slice the feature matrix (X) and label vector (y). Applying these indices guarantees that every model, even when developed in different notebooks or with different algorithms, is trained and evaluated on the exact same compounds. Chemical identifiers such as CIDs are stored separately for interpretation and auditing, but the model itself only ever sees rows of numerical features selected by these indices.
Once the split has been applied and saved as training and test arrays, the index files are no longer needed for routine model training or evaluation. They are retained as permanent artifacts to support traceability, auditing, and reproducibility, allowing the split to be verified, reconstructed, or reused if the feature representation is regenerated in the future.
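A minimal sketch of reconstructing a split from saved index files is shown below. The feature matrix, the chosen rows, and the temporary directory are hypothetical stand-ins; only the file names train_idx.npy and test_idx.npy follow this notebook.

```python
import tempfile
from pathlib import Path

import numpy as np

rng = np.random.default_rng(2)

# Hypothetical feature matrix and a hypothetical set of saved test rows
X_full = rng.integers(0, 2, size=(20, 4))
test_rows = np.array([3, 7, 11])
train_rows = np.setdiff1d(np.arange(20), test_rows)

with tempfile.TemporaryDirectory() as tmp:
    split_dir = Path(tmp)
    np.save(split_dir / "train_idx.npy", train_rows)
    np.save(split_dir / "test_idx.npy", test_rows)

    # Later (for example in another notebook): reload the indices
    train_idx = np.load(split_dir / "train_idx.npy")
    test_idx = np.load(split_dir / "test_idx.npy")

# Re-slice the regenerated feature matrix with the saved indices
X_train_rebuilt = X_full[train_idx]
X_test_rebuilt = X_full[test_idx]

# Every row lands in exactly one subset
assert len(np.intersect1d(train_idx, test_idx)) == 0
assert len(train_idx) + len(test_idx) == X_full.shape[0]
print(X_train_rebuilt.shape, X_test_rebuilt.shape)
```

This only works if the regenerated feature matrix has the same row order as the one used when the indices were saved, which is one reason the source file name is recorded in the split metadata.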
X_train.npy and y_train.npy
These files contain the training data used to build machine-learning models. X_train.npy is a NumPy array representing the feature matrix, where each row corresponds to a compound and each column corresponds to a MACCS fingerprint bit retained after variance filtering. y_train.npy is the label vector, containing the binary activity labels (0 = inactive, 1 = active) for the same compounds, in the same row order.
These arrays are the direct inputs to model fitting (e.g., model.fit(X_train, y_train)). The training data may later be balanced, weighted, or resampled, depending on the modeling strategy, but these files preserve the original, stratified train split exactly as defined. To open and view the contents of one of these files, you can use the following commands:
import numpy as np
y_train = np.load("y_train.npy")
y_train[:5]
(Note: you have to set the path to the correct file location)
X_test.npy and y_test.npy
These files contain the held-out test data used to evaluate model performance. X_test.npy is the feature matrix for the test compounds, and y_test.npy contains their corresponding activity labels, aligned row-by-row with the feature matrix.
The test arrays are never modified (no balancing or resampling) and are used only for prediction and evaluation (e.g., confusion matrices, ROC curves, precision/recall). Because they were created using a stratified split, they reflect the natural class distribution of the dataset and provide an unbiased assessment of model generalization.
# Write code here to view a *.npy files in the arrays folder (see above hints)
import numpy as np
y_train = np.load(SPLITS/"arrays/y_train.npy")
y_train[:5]
array([0, 0, 1, 0, 0])
Looking Ahead: Preparing Data for Model Building
In the next section, we will reload these saved training and test arrays and begin preparing the training data for model construction. One common challenge in biological activity datasets is class imbalance, where the number of inactive compounds greatly exceeds the number of active compounds. When left unaddressed, this imbalance can cause a model to favor the dominant (majority) class rather than learning meaningful patterns associated with activity.
We will explore strategies for addressing this issue using only the training data, while keeping the test set unchanged so that model evaluation remains unbiased.
2. Preparing the Training Data for Model Construction#
In this section, we focus exclusively on the training data and the preprocessing steps that are permitted during model construction. Unlike earlier steps, where we defined the feature space and created a fixed train–test split, the operations introduced here may intentionally modify the training data to help the model learn more effectively.
A key principle governs all steps in this section: the test set is never altered. Any transformations, resampling, or adjustments are applied only to the training data, ensuring that model evaluation remains unbiased and reflects performance on unseen data.
2.1 Reload the Saved Training and Test Arrays#
If you are continuing directly from Section 1 without restarting the kernel, the training and test arrays may already be in memory. However, it is important that you know how to reload these arrays from disk in case the kernel has been restarted or you are returning to the notebook at a later time. For that reason, we begin this section by explicitly loading the saved NumPy arrays. If the arrays are already in memory, you may treat this step as a demonstration of how the data can be recovered when needed.
from pathlib import Path
import numpy as np
# Define arrays directory (relative to notebook)
SPLIT_ROOT = Path("data/AID743139/splits/90_10/arrays")
# Load arrays into active memory
X_train = np.load(SPLIT_ROOT / "X_train.npy")
y_train = np.load(SPLIT_ROOT / "y_train.npy")
X_test = np.load(SPLIT_ROOT / "X_test.npy")
y_test = np.load(SPLIT_ROOT / "y_test.npy")
X_train.shape, y_train.shape
((6113, 162), (6113,))
The following code loads two objects into memory. Explain what these objects are, where they come from, and how they are used in the machine-learning workflow.
X_train = np.load(SPLIT_ROOT / "X_train.npy")
y_train = np.load(SPLIT_ROOT / "y_train.npy")
Answer
The files X_train.npy and y_train.npy store NumPy arrays
on disk. These files are persistent, meaning they remain available even
after the kernel is restarted or the notebook is closed.
When np.load() is called, the data in these files are read from disk
and loaded into active memory as the variables X_train and
y_train. At this point, they are ordinary NumPy arrays
(numpy.ndarray) that can be passed directly to machine-learning
algorithms.
Although one representation exists on disk (nonvolatile) and the other exists in memory (volatile), they are structurally identical. Once loaded, the model cannot distinguish whether a NumPy array was created by computation or loaded from a file.
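The round trip between disk and memory can be checked directly. The following is a minimal sketch that uses a small synthetic array and a temporary directory (both are illustrative assumptions, not part of this notebook's data). It saves an array with np.save, reloads it with np.load, and confirms that values, shape, and dtype are unchanged.

```python
import tempfile
from pathlib import Path

import numpy as np

# Toy stand-in for a fingerprint matrix (4 compounds x 3 bits); illustrative only
X_demo = np.array([[1, 0, 1],
                   [0, 0, 1],
                   [1, 1, 0],
                   [0, 1, 0]], dtype=np.uint8)

with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "X_demo.npy"   # hypothetical file name
    np.save(demo_path, X_demo)             # memory -> disk
    X_reloaded = np.load(demo_path)        # disk -> memory

# Compare the in-memory original with the array read back from disk
same_values = np.array_equal(X_demo, X_reloaded)
same_dtype = X_demo.dtype == X_reloaded.dtype
print(type(X_reloaded).__name__, X_reloaded.shape, same_values, same_dtype)
```

The same kind of check can be applied to X_train and y_train after loading, for example by confirming that both arrays have the same number of rows.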
2.2 Examine class imbalance#
2.2.1 What is class imbalance and why it matters#
Before training a classification model, it is important to examine the distribution of class labels in the training set. In many biological datasets, the number of inactive compounds far exceeds the number of active compounds. This situation is known as class imbalance.
When class imbalance is present, a model may appear to perform well overall while failing to correctly identify the minority class. For example, a model trained on highly imbalanced data may learn to predict the majority class (typically inactives) most of the time, resulting in misleading accuracy and poor detection of the minority class (typically active compounds). By quantifying the class distribution early, we can assess the severity of imbalance and decide whether additional steps are needed to support effective model learning.
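To see why accuracy alone can mislead, consider a "model" that ignores the features and always predicts the majority class. The sketch below uses hypothetical class counts (900 inactives, 100 actives) chosen only to make the arithmetic easy; they are not the counts for this dataset.

```python
import numpy as np

# Hypothetical labels: 900 inactives (0) and 100 actives (1)
y_true = np.concatenate((np.zeros(900, dtype=int), np.ones(100, dtype=int)))

# A trivial "classifier" that predicts inactive for every compound
y_always_inactive = np.zeros_like(y_true)

# Fraction of all predictions that are correct
accuracy = (y_always_inactive == y_true).mean()

# Fraction of true actives that were predicted active (recall for the active class)
actives_found = np.sum((y_always_inactive == 1) & (y_true == 1))
recall_active = actives_found / np.sum(y_true == 1)

print(f"accuracy = {accuracy:.2f}")
print(f"recall for actives = {recall_active:.2f}")
```

The accuracy looks respectable, yet not a single active compound is identified. This is the failure mode that the balancing strategies below are meant to reduce.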
Strategies
Many classification problems involve imbalanced datasets, where one class occurs much more frequently than another.
If untreated, imbalance can cause models to:
- Favor the majority class
- Misclassify rare but important samples
- Produce misleading accuracy metrics
Common strategies
There are three broad approaches to handling class imbalance:
1. Downsampling (Under-sampling)
- Randomly remove samples from the majority class
- Simple and effective
- Risk: loss of potentially useful information
2. Oversampling
- Increase the size of the minority class
- Can be done by duplication or synthetic methods (e.g., SMOTE)
- Risk: overfitting to replicated or artificial samples
3. Cost-sensitive learning
- Penalize misclassification of the minority class more heavily
- Often implemented via class weights
- Keeps all data but changes the learning objective
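This lesson uses downsampling, but for comparison, the second strategy (oversampling by duplication) can be sketched with the same NumPy indexing ideas. The toy arrays and variable names below are assumptions made for illustration; synthetic methods such as SMOTE are not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set: 8 inactives (0) and 2 actives (1), 3 fingerprint bits each
X_toy = rng.integers(0, 2, size=(10, 3))
y_toy = np.array([0] * 8 + [1] * 2)

idx_majority = np.where(y_toy == 0)[0]
idx_minority = np.where(y_toy == 1)[0]

# Draw minority rows WITH replacement until both classes are the same size
idx_minority_over = rng.choice(idx_minority, size=len(idx_majority), replace=True)

# Keep every majority row and append the resampled minority rows
idx_all = np.concatenate((idx_majority, idx_minority_over))
X_over = X_toy[idx_all]
y_over = y_toy[idx_all]

print("class counts after oversampling:", np.bincount(y_over))
print("X shape:", X_over.shape)
```

Because the minority rows are repeated rather than newly measured, the model sees the same few actives many times, which is the overfitting risk noted above.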
Key takeaway
Handling class imbalance is not about maximizing accuracy; it is about shaping the decision boundary so that the assignment of class labels y to feature vectors X is driven by meaningful feature patterns, not by class frequency alone.
2.2.2 Measuring Class Imbalance in the Training Set#
To understand the severity of class imbalance in this dataset, we examine the number of inactive and active compounds in the training set and compute their ratio.
print("# inactives in training set: ", len(y_train) - y_train.sum())
print("# actives in training set: ", y_train.sum())
ratio = (len(y_train) - y_train.sum())/y_train.sum()
print("the ratio of inactive to active in training set=", ratio)
# inactives in training set: 5444
# actives in training set: 669
the ratio of inactive to active in training set= 8.137518684603886
How to interpret class balance ratios (inactive : active).
| Ratio | How to Interpret the Value |
|---|---|
| 1 | Balanced: roughly equal numbers of actives and inactives. Ideal for training. |
| 2 | Mild imbalance: 1 active for every 2 inactives. Still manageable, but performance on the minority class should be monitored. |
| 5 | Severe imbalance: the model may predict the majority class most of the time and ignore the minority class. |
Before applying any balancing strategy, it is important to keep the following workflow principles in mind:
- Always save unbalanced train/test splits
- Never balance the test set
- Treat balancing as part of the model, not the data
- Expect some variability when downsampling
- Use fixed seeds when debugging, variable seeds when evaluating robustness
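The last two principles can be made concrete with a small helper that takes the seed as an argument. This is a sketch under stated assumptions: downsample_majority is a hypothetical function name, class 0 is assumed to be the majority class, and np.random.default_rng is used here in place of the np.random.seed call that appears in the next section.

```python
import numpy as np

def downsample_majority(X, y, seed):
    """Return a balanced copy of (X, y) by downsampling class 0 to the size of class 1.

    Hypothetical helper for illustration; assumes class 0 is the majority class.
    """
    rng = np.random.default_rng(seed)
    idx_inactives = np.where(y == 0)[0]
    idx_actives = np.where(y == 1)[0]
    # Sample inactives without replacement so no row is used twice
    idx_keep = rng.choice(idx_inactives, size=len(idx_actives), replace=False)
    idx_bal = np.concatenate((idx_keep, idx_actives))
    return X[idx_bal], y[idx_bal], idx_keep

# Toy data: 20 inactives and 5 actives
y_toy = np.array([0] * 20 + [1] * 5)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)   # one column holding the row number

_, y_a, keep_a = downsample_majority(X_toy, y_toy, seed=0)
_, y_b, keep_b = downsample_majority(X_toy, y_toy, seed=0)
_, y_c, keep_c = downsample_majority(X_toy, y_toy, seed=1)

print("same seed, same rows:     ", np.array_equal(keep_a, keep_b))
print("different seed, same rows:", np.array_equal(np.sort(keep_a), np.sort(keep_c)))
```

A fixed seed makes a run repeatable while debugging. Repeating the fit over several seeds gives a sense of how much the results depend on which inactives happened to be kept.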
2.3 Balance the Training Set by Downsampling#
When a dataset has a strong class imbalance, a common mitigation strategy is downsampling the majority class. In downsampling, we randomly select a subset of the majority class so that its size is comparable to that of the minority class. This encourages the model to learn decision boundaries that treat both classes more equitably, rather than defaulting to predictions dominated by the majority class. While downsampling reduces the total amount of training data and may discard potentially informative samples from the majority class, it often improves model performance on the minority class and reduces bias toward the majority class.
In the code cell below, we randomly select a subset of the inactive compounds such that their number matches the number of active compounds.
# Indices of each class
idx_inactives = np.where(y_train == 0)[0]
idx_actives = np.where(y_train == 1)[0]
# Number of observations in each class
num_inactives = len(idx_inactives)
num_actives = len(idx_actives)
# Downsample inactives to match number of actives
np.random.seed(0)
idx_inactives_downsampled = np.random.choice(
idx_inactives,
size=num_actives,
replace=False
)
# Create balanced training set
X_train_bal = np.vstack((
X_train[idx_inactives_downsampled],
X_train[idx_actives]
))
y_train_bal = np.hstack((
y_train[idx_inactives_downsampled],
y_train[idx_actives]
))
# Confirm balancing worked
print("# inactives:", len(y_train_bal) - y_train_bal.sum())
print("# actives: ", y_train_bal.sum())
ratio = (len(y_train_bal) - y_train_bal.sum()) / y_train_bal.sum()
print("ratio inactive to active =", ratio)
print("\nTraining set shape:")
print("X:", X_train_bal.shape)
print("y:", y_train_bal.shape)
# inactives: 669
# actives: 669
ratio inactive to active = 1.0
Training set shape:
X: (1338, 162)
y: (1338,)
The following is sample output from this dataset before and after downsampling. Your exact values may differ if a different random seed or dataset version is used.
| Training Set | # Inactives | # Actives | Inactive : Active Ratio | X Shape | y Shape |
|---|---|---|---|---|---|
| Original (Unbalanced) | 5444 | 669 | ≈ 8.1 : 1 | (6113, 162) | (6113,) |
| Balanced (Downsampled) | 669 | 669 | 1 : 1 | (1338, 162) | (1338,) |
1. Was the original training set balanced or unbalanced? Identify the majority and minority classes.
Answer
The original training set was highly imbalanced, with about 8 inactives (majority class) for each active (minority class).
2. Using the data, explain how downsampling worked.
Answer
Initially there were 5,444 inactives and 669 actives in the training set. Downsampling reduced the number of inactives to match the number of actives, giving 669 compounds of each class and a balanced training set of 1,338 compounds.
3. Can you explain the shape of the feature matrix and label vector and what data they contain?
Answer
- The feature matrix
- Contains 162 columns representing the MACCS fingerprints after the zero-variance mask was applied.
- Each row represents a compound
- The original training set contains 6113 compounds
- The balanced (downsampled) training set contains 1338 compounds
- The label vector contains the Boolean value to indicate a compound is active (1) or inactive (0)
4. Do the balanced feature matrix (X_train_bal) or label vector
(y_train_bal) contain the chemical identity (e.g., CID) of each
compound? If not, why is this information absent, and how could the chemical
identity of a specific row be recovered if needed?
Answer
No. Neither the feature matrix (X_train_bal) nor the label vector (y_train_bal) contains explicit chemical identity information such as CIDs or SMILES strings. These NumPy arrays store only numerical data: fingerprint bits in X and binary activity labels in y.
This separation is intentional. Machine-learning models operate on numerical feature vectors and labels, not on compound identifiers. Including chemical identity directly in the feature or label arrays would mix metadata with model inputs and violate standard machine-learning design principles.
Chemical identity is instead preserved in separate artifacts created earlier in section 1.5.1 of this workflow, such as the train_cids.csv file and the corresponding index arrays (e.g., train_idx.npy). These files allow individual
rows in the NumPy arrays to be traced back to specific compounds without embedding identity information into the model’s inputs.
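One way to keep identity recoverable after downsampling is to apply the same row indices to an identifier array that is aligned with the training arrays. The sketch below uses toy values; train_cids_toy stands in for whatever row-aligned identifier array was saved earlier, and its contents are hypothetical.

```python
import numpy as np

# Toy stand-ins (illustrative values, not real data)
y_train_toy = np.array([0, 0, 1, 0, 1, 0])
train_cids_toy = np.array([101, 102, 103, 104, 105, 106])  # hypothetical CIDs, row-aligned with y_train_toy

idx_inactives = np.where(y_train_toy == 0)[0]
idx_actives = np.where(y_train_toy == 1)[0]

rng = np.random.default_rng(0)
idx_inactives_downsampled = rng.choice(idx_inactives, size=len(idx_actives), replace=False)

# Same stacking order as the balanced arrays: sampled inactives first, then actives
idx_bal = np.concatenate((idx_inactives_downsampled, idx_actives))
y_bal_toy = y_train_toy[idx_bal]
cids_bal_toy = train_cids_toy[idx_bal]

# Row k of the balanced arrays corresponds to cids_bal_toy[k]
for cid, label in zip(cids_bal_toy, y_bal_toy):
    print(cid, label)
```

The key design choice is to carry the index array (idx_bal) alongside the data, so identity can be looked up when needed without ever entering the feature matrix.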
Explanation
This code performs downsampling using NumPy array operations rather than Pandas DataFrames. The goal is to construct a new, balanced training set from the original training data.
Identifying class membership
The calls to np.where() locate the row indices corresponding to
inactive (y = 0) and active (y = 1) compounds in the
training label vector.
idx_inactives = np.where(y_train == 0)[0]
idx_actives = np.where(y_train == 1)[0]
Random downsampling of the majority class
A subset of inactive indices is randomly selected without replacement so that the number of inactive samples matches the number of active samples.
idx_inactives_downsampled = np.random.choice(
idx_inactives,
size=num_actives,
replace=False
)
Reassembling the balanced training arrays
The balanced feature matrix and label vector are constructed by stacking the selected inactive samples together with all active samples.
X_train_bal = np.vstack((
X_train[idx_inactives_downsampled],
X_train[idx_actives]
))
y_train_bal = np.hstack((
y_train[idx_inactives_downsampled],
y_train[idx_actives]
))
The result is a balanced training dataset in which both classes are equally represented and aligned row-by-row.
Summary: Imbalanced vs. Balanced Training Data
At this point, we have two versions of the training data:
- The original training set, which reflects the natural class imbalance of the dataset.
- A balanced training set, created by downsampling the majority class.
Both versions are valid representations of the training data, but they serve different purposes. The imbalanced training set preserves the original data distribution, while the balanced training set emphasizes equal representation of both classes during model learning.
In the next section, we will build and evaluate a classification model using the balanced training data and compare its behavior to models trained on unbalanced data. This comparison will help illustrate how class imbalance and balancing strategies influence model predictions and performance.
3. Build a model using the training set.#
Now we are ready to build predictive models using machine learning algorithms available in the scikit-learn library (https://scikit-learn.org/). This notebook will use Naïve Bayes because it is relatively fast and simple.
3.1 Naïve Bayes#
Naïve Bayes is a family of probabilistic classification algorithms that are particularly well suited for datasets where features are represented as binary indicators. In cheminformatics, this commonly occurs when molecules are encoded using molecular fingerprints, where each feature answers a yes/no question, such as whether a specific substructure is present in the molecule. Bernoulli Naïve Bayes is the variant designed specifically for this situation. It treats each fingerprint bit as a binary feature (0 or 1) and learns how often each feature appears in each class (for example, active versus inactive compounds). During prediction, the model combines this information across all features to estimate which class best matches the observed pattern of fingerprint bits.
A simplifying assumption made by Naïve Bayes is that each feature contributes independently to the classification decision. In other words, the model treats each fingerprint bit as providing its own piece of evidence, without explicitly accounting for relationships between bits. This assumption is more reasonable for some types of molecular fingerprints than others. For example, MACCS keys are based on a fixed set of predefined structural patterns, and many of these bits represent distinct chemical features. In contrast, Morgan fingerprints are generated from overlapping atomic environments, so multiple bits may be activated by the same underlying substructure. As a result, Morgan fingerprint bits tend to be more strongly correlated with one another.
In practice, the independence assumption is rarely strictly true, especially in chemistry, where molecular features are inherently related. However, Naïve Bayes often performs well despite this simplification, particularly for high-dimensional, sparse feature vectors such as molecular fingerprints. The algorithm’s simplicity can make it surprisingly robust, even when its assumptions are only approximately satisfied.
In addition to modeling how features behave within each class, Naïve Bayes also incorporates how common each class is in the training data. When the dataset is balanced, the model treats each class as equally plausible before considering any molecular features, so classification decisions are driven primarily by how well the fingerprint pattern matches each class. When the dataset is imbalanced, the model naturally favors the more common class unless the feature evidence strongly supports the alternative. This interaction between feature evidence (X) and class prevalence (y) plays an important role in how Naïve Bayes behaves and motivates strategies such as downsampling.
Please see Appendix A10.2, Bayes’ Theorem: From Inference to Models, for a walkthrough of how Naïve Bayes works.
Explanation
Bernoulli Naïve Bayes is a probabilistic model, meaning it uses probabilities to decide which class label is most consistent with the observed features. At a high level, the model answers the question: “Given the fingerprint bits for this molecule, which class is more likely, 1 (active) or 0 (inactive)?”
The core idea: updating beliefs
The model starts with a baseline expectation about each class (for example, how common active compounds are overall) and then updates that expectation based on the observed fingerprint bits. This idea is formalized by Bayes’ Theorem:
\[ P(y \mid X) = \frac{P(X \mid y)\, P(y)}{P(X)} \]
Where:
\( y \) is the class label (e.g., active or inactive)
\( X \) represents all fingerprint bits for a molecule
\( P(y \mid X) \) is the probability of class \( y \) given the observed features
In practice, the model compares this quantity across classes and selects the most likely one.
What each term means (intuitively)
\( P(y) \) — the prior
This represents how common each class is before looking at any molecular features. If most compounds are inactive, the prior probability of “inactive” is higher.
\( P(X \mid y) \) — the likelihood
This measures how well the observed fingerprint bits match what the model has learned about class \( y \). This is where the fingerprint information is used.
\( P(X) \) — the evidence
This term ensures probabilities are properly scaled. Because it is the same for all classes, it does not affect which class is chosen and is usually ignored during classification.
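The roles of the three terms can be illustrated with a short numeric sketch. All numbers below are hypothetical and were chosen only to show the arithmetic; they do not come from this dataset.

```python
# Hypothetical numbers chosen only to illustrate the arithmetic
prior = {"active": 0.10, "inactive": 0.90}          # P(y)
likelihood = {"active": 0.020, "inactive": 0.001}   # P(X | y) for one observed fingerprint

# Numerator of Bayes' theorem for each class: P(X | y) * P(y)
joint = {c: likelihood[c] * prior[c] for c in prior}

# Evidence P(X): the sum of the numerators over all classes
evidence = sum(joint.values())

# Posterior P(y | X): each numerator divided by the shared evidence term
posterior = {c: joint[c] / evidence for c in joint}

best_by_joint = max(joint, key=joint.get)
best_by_posterior = max(posterior, key=posterior.get)

print(posterior)
print("same winner with or without P(X):", best_by_joint == best_by_posterior)
```

In this example the prior favors "inactive", but the fingerprint is twenty times more likely under "active", so the posterior ends up favoring "active". Dividing by the evidence rescales both classes by the same amount and therefore does not change which class wins.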
The “naïve” assumption: breaking features apart
Computing \( P(X \mid y) \) directly would be extremely difficult for molecular fingerprints, because many bits are related to one another. Naïve Bayes simplifies this by assuming that each fingerprint bit can be treated independently when estimating probabilities. This allows the likelihood to be written as a product:
\[ P(X \mid y) = \prod_{i=1}^{n} P(x_i \mid y) \]
Here \( x_i \) is the value of fingerprint bit \( i \) and \( n \) is the number of bits.
Each fingerprint bit contributes its own small piece of evidence toward the final decision. Together, these pieces of evidence are combined to determine which class the molecule most closely resembles overall.
Why it is called Bernoulli Naïve Bayes
In Bernoulli Naïve Bayes, each fingerprint bit is treated as a Bernoulli random variable, meaning it can take only two values:
- Value = 1 → feature is present
- Value = 0 → feature is absent
The class label \( y \) also takes on discrete values (for example, 0 = inactive and 1 = active). During training, the model looks at all molecules in each class and learns how often each fingerprint bit is present or absent within that class. For each bit and each class, the model learns a probability such as:
\[ P(x_i = 1 \mid y) \]
This expression describes how common it is, among molecules in class \(y\), for bit \(i\) to be present (\(x_i = 1\)).
During prediction:
- If a bit is present, the model uses the probability of presence
- If a bit is absent, the model uses the probability of absence
These probabilities are multiplied together across all bits to form an overall score for each class.
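The training and prediction steps described above can be sketched from scratch in NumPy. This is an illustrative re-implementation on toy data, not scikit-learn's code. It assumes Laplace smoothing with alpha = 1 and adds log-probabilities instead of multiplying probabilities, which is mathematically equivalent and numerically safer.

```python
import numpy as np

# Toy training data: 6 molecules x 3 fingerprint bits (illustrative values)
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [0, 0, 1],
              [0, 1, 1]])
y = np.array([1, 1, 1, 0, 0, 0])
alpha = 1.0   # smoothing strength (assumption for this sketch)

classes = np.unique(y)

# Prior P(y): fraction of training molecules in each class
log_prior = np.log(np.array([(y == c).mean() for c in classes]))

# P(x_i = 1 | y) for each class (rows) and bit (columns), with smoothing
p_present = np.array([
    (X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
    for c in classes
])

def class_scores(x):
    """Return log[P(y) * prod_i P(x_i | y)] for each class, for one fingerprint x."""
    # Present bits use P(x_i = 1 | y); absent bits use 1 - P(x_i = 1 | y)
    log_lik = x * np.log(p_present) + (1 - x) * np.log(1 - p_present)
    return log_prior + log_lik.sum(axis=1)

x_new = np.array([1, 0, 0])
scores = class_scores(x_new)

# Normalize the scores into posterior probabilities that sum to 1
posterior = np.exp(scores - scores.max())
posterior /= posterior.sum()

print("predicted class:", classes[np.argmax(scores)])
print("posterior:", posterior)
```

For the new fingerprint, bit 0 is present and bits 1 and 2 are absent. That pattern is more typical of the toy "active" rows, so class 1 receives the higher score.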
Why the math still works in practice
Even though fingerprint bits are not truly independent—especially for Morgan fingerprints—the model often performs well because:
- Many weak signals combine into a strong overall pattern
- Errors introduced by the independence assumption tend to cancel out
- Classification depends on relative likelihoods, not exact probabilities
As a result, Naïve Bayes often provides good classification performance even when its assumptions are only approximately true.
For a more detailed explanation, see Appendix A10.2, Bayes’ Theorem: From Inference to Models.
from sklearn.naive_bayes import BernoulliNB #-- Naïve Bayes
# set up the NB classification model. Bernoulli is specific for binary features (0,1)
clf_NB = BernoulliNB()
In scikit-learn examples, the prefix clf is commonly used as shorthand for classifier. In this notebook, clf_NB refers to a Bernoulli Naïve Bayes classifier object. This naming convention is descriptive only; it is not required by Python or by scikit-learn. Its purpose is to make the role of the object explicit within the machine-learning workflow.
At this point in the course, two perspectives are coming together:
The scientific/statistical model, introduced conceptually in Appendix A-10.2
The scikit-learn implementation, which represents that model as a Python object with a defined interface and behavior
The BernoulliNB class in scikit-learn is a concrete implementation of the Naïve Bayes probabilistic model described in the Appendix.
The call
clf_NB.fit(X_train_bal, y_train_bal)
trains the Bernoulli Naïve Bayes model using the training data. In scikit-learn terminology, fitting a model means estimating the parameters of the statistical model from the data and storing them inside the classifier object.
During training, the classifier learns how molecular feature vectors are associated with activity labels. Specifically, it estimates two categories of quantities from the training set:
- Feature–class likelihoods: For each feature (each column in X_train_bal) and for each class label in y_train_bal, the model estimates how frequently that feature occurs among samples belonging to that class. Because Bernoulli Naïve Bayes is designed for binary features, this corresponds to learning how often each fingerprint bit is equal to 1 within each class.
- Class prior probabilities: The model also estimates how common each class is in the training data. These class frequencies form the prior probabilities, which represent how likely each class is before considering any molecular features.
Together, these learned quantities define the decision rule used by the classifier. After training is complete, the model has all the information required to evaluate new molecular feature vectors. At this stage, the model is not yet making predictions. Instead, it is establishing and storing the statistical relationships that will later be used during evaluation and inference.
Conceptually, the trained BernoulliNB object is the computational realization of the probabilistic model described in Appendix A-10.2. Once trained, it exposes two primary interfaces:
- predict_proba, which returns class probabilities
- predict, which returns discrete class assignments
These interfaces will be used explicitly in later sections when we evaluate model performance and apply the model to new data.
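The two categories of learned quantities are stored on the fitted object and can be inspected. The sketch below fits a BernoulliNB on a small toy array (illustrative values, not the course data) and reads the fitted attributes class_log_prior_ and feature_log_prob_, which hold the logarithms of the class priors and of the per-class bit probabilities.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy training data: 6 molecules x 3 fingerprint bits (illustrative values)
X_toy = np.array([[1, 0, 1],
                  [1, 1, 0],
                  [1, 0, 0],
                  [0, 1, 1],
                  [0, 0, 1],
                  [0, 1, 1]])
y_toy = np.array([1, 1, 1, 0, 0, 0])

clf_toy = BernoulliNB()          # default settings, including alpha=1.0
clf_toy.fit(X_toy, y_toy)

# Exponentiate the stored log values to recover ordinary probabilities
class_priors = np.exp(clf_toy.class_log_prior_)
bit_probs = np.exp(clf_toy.feature_log_prob_)   # shape: (n_classes, n_features)

print("classes:", clf_toy.classes_)
print("class priors:", class_priors)
print("P(bit = 1 | class):")
print(bit_probs)
```

Each row of bit_probs corresponds to one class, in the order given by classes_, and each column to one fingerprint bit.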
# Train the model by fitting it to the data.
clf_NB.fit( X_train_bal, y_train_bal)
print("Model expects features:", clf_NB.n_features_in_)
clf_NB
Model expects features: 162
BernoulliNB()
Parameters
| Parameter | Value |
|---|---|
| alpha | 1.0 |
| force_alpha | True |
| binarize | 0.0 |
| fit_prior | True |
| class_prior | None |
Explanation
After fitting the model, Jupyter displays a summary of the BernoulliNB object. This output does not show predictions or results; instead, it reports the configuration parameters that define how the model learned from the data and how it will behave during prediction. These parameters control how probabilities are estimated and how the learned decision rule is constructed.
Model type: BernoulliNB
This confirms that the classifier is designed for binary features (0 or 1), such as molecular fingerprint bits. Each feature is treated as a Bernoulli random variable indicating presence or absence.
alpha = 1.0 — smoothing parameter
This parameter controls Laplace smoothing, which prevents estimated probabilities from becoming exactly 0 or 1.
If a fingerprint bit never appears within a given class (that is, \( x_i = 0 \) for all training samples with a particular class label \( y \)), then without smoothing the model would assign a probability of zero to that feature for that class. During prediction, this would cause the entire class to be ruled out whenever that bit is present. Laplace smoothing instead assigns a small, nonzero probability.
This avoids numerical issues and makes the model more robust, especially when working with limited or sparse data.
This is different from the earlier feature selection step, where we removed features that did not change across the entire dataset (for example, the MACCS bit position that was always zero for all molecules).
Smoothing ensures that no single missing feature completely rules out a class.
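The effect of smoothing on a single bit can be sketched with a few lines of arithmetic. The helper name below is hypothetical. The formula is the standard additive-smoothing estimate for a binary feature, where the 2 * alpha in the denominator accounts for the two possible outcomes (bit = 0 and bit = 1).

```python
def smoothed_bit_probability(n_present, n_class, alpha=1.0):
    """Estimate P(x_i = 1 | y) with additive (Laplace) smoothing for a binary feature.

    Hypothetical helper: n_present is the number of class members with the bit set,
    and n_class is the total number of training samples in that class.
    """
    return (n_present + alpha) / (n_class + 2 * alpha)

n_class = 669  # size of each class in the balanced training set

# A bit that never appears in the class
p_raw = 0 / n_class                                  # unsmoothed estimate: exactly zero
p_smooth = smoothed_bit_probability(0, n_class)      # small but nonzero

# A bit that appears in every member of the class
p_all = smoothed_bit_probability(n_class, n_class)   # close to, but below, one

print(f"unsmoothed: {p_raw}")
print(f"smoothed (never present):  {p_smooth:.5f}")
print(f"smoothed (always present): {p_all:.5f}")
```

With smoothing, a bit that was never seen in a class makes that class much less likely when the bit appears, but it no longer rules the class out entirely.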
force_alpha = True
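This tells the model to use the specified value of alpha exactly as given. If it were set to False, scikit-learn would raise any alpha smaller than 1e-10 up to 1e-10 to avoid numerical errors.
- Has no effect in this workflow, because alpha = 1.0
- Mainly a safeguard for very small smoothing values
For most users, this parameter does not need to be changed.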
binarize = 0.0
This parameter controls whether input features are automatically converted to binary values.
- Values greater than 0.0 → 1
- Values less than or equal to 0.0 → 0
In this workflow:
- Molecular fingerprints are already binary
- This parameter effectively has no impact
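The thresholding rule can be mimicked in NumPy to see what it would do to non-binary input. The count-valued array below is a made-up example; this is a sketch of the rule as described above, not a call into scikit-learn.

```python
import numpy as np

threshold = 0.0   # same value as the binarize setting shown above

# Hypothetical count-style features (e.g., how many times a pattern occurs)
X_counts = np.array([[0, 2, 1, 0],
                     [3, 0, 0, 1]])

# Values above the threshold become 1, everything else becomes 0
X_binary = (X_counts > threshold).astype(int)
print(X_binary)

# Features that are already 0/1 pass through unchanged
X_maccs_like = np.array([[1, 0, 1, 1]])
unchanged = np.array_equal((X_maccs_like > threshold).astype(int), X_maccs_like)
print("binary input unchanged:", unchanged)
```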
fit_prior = True
This tells the model to learn class priors from the training labels (y_train).
- If the training data are balanced, the learned priors will be equal
- If the training data are imbalanced, the learned priors will reflect that imbalance, giving more weight to the more common class.
As a result, when fit_prior=True, class frequency influences how strongly the model favors one class over another. In an imbalanced dataset, the model will require stronger feature evidence to predict the minority class. Setting fit_prior=False forces the model to treat all classes as equally likely. This may be useful in some cases, but it can lead to unreliable probability estimates when the training data are highly imbalanced, which is why the default is True.
class_prior = None
This indicates that no manual class priors were provided.
Because fit_prior = True and class_prior = None, the model estimates class priors automatically from the training labels passed to fit() (here, y_train_bal). In other words, the frequency of each class in the training data determines how strongly the model expects each class before considering any features.
It is also possible to override this behavior by supplying class priors explicitly when the model is created (for example, to force equal weighting of classes). However, in this lesson we allow the model to infer priors from the training data so that the connection between data preparation (such as downsampling) and model behavior remains transparent.
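For reference, the difference between learned and explicitly supplied priors can be sketched on a small imbalanced toy set. The data are random and purely illustrative. class_prior is the constructor argument described above, and class_log_prior_ is the fitted attribute that stores the logarithm of the priors the model will use.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy imbalanced data: 8 inactives and 2 actives, 4 random fingerprint bits
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(10, 4))
y_toy = np.array([0] * 8 + [1] * 2)

# Priors learned from the label frequencies (default behavior)
clf_learned = BernoulliNB().fit(X_toy, y_toy)

# Priors supplied explicitly, forcing equal weighting of the two classes
clf_equal = BernoulliNB(class_prior=[0.5, 0.5]).fit(X_toy, y_toy)

priors_learned = np.exp(clf_learned.class_log_prior_)
priors_equal = np.exp(clf_equal.class_log_prior_)

print("learned priors: ", priors_learned)
print("supplied priors:", priors_equal)
```

Only the priors differ between the two models. The per-class bit probabilities are estimated from the same data in both cases.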
Big-picture takeaway
Together, these parameters define how Bernoulli Naïve Bayes:
- Estimates feature probabilities within each class
- Incorporates class balance through priors
- Constructs a decision rule that maps feature vectors X to class labels y
The model has now learned this decision rule internally. In the next step, we will apply it to new data to see how well it generalizes beyond the training set.
3.2 From Training to Inference: What does the Classifier Produce?#
Model evaluation depends on what type of output the classifier produces: class labels or probabilities.
Once a classifier has been trained, it can produce two different kinds of outputs, depending on which method is used. Understanding this distinction is essential, because different evaluation metrics require different types of outputs.
The method .predict() returns discrete class labels. For a binary classifier, these are integers such as 0 (inactive) or 1 (active). Calling .predict() answers the question: “Which class does the model assign to each compound?” These class labels are used to construct confusion matrices and to compute metrics such as accuracy, precision, recall, and F1-score.
In contrast, .predict_proba() returns probabilistic scores. For each compound, the model reports its estimated probability of belonging to each class. For binary classification, this means a pair of values that sum to 1. These probabilities answer a different question: “How confident is the model in its prediction?” Probabilistic outputs are required for threshold-independent metrics such as ROC curves and ROC–AUC.
The difference can be seen directly by running the following code. We will examine each type of output in the next two sections.
# Discrete class predictions (0 or 1)
y_pred = clf_NB.predict(X_test)
# Probabilistic predictions (confidence scores)
y_proba = clf_NB.predict_proba(X_test)
y_pred[:5], y_proba[:5]
(array([0, 0, 1, 0, 0]),
array([[9.99990160e-01, 9.83971842e-06],
[5.73893051e-01, 4.26106949e-01],
[6.58510978e-03, 9.93414890e-01],
[9.48421570e-01, 5.15784297e-02],
[8.86150183e-01, 1.13849817e-01]]))
Explanation
The scikit-learn library follows the X/y
conventions consistently when fitting models and generating predictions.
- Models are trained using
model.fit(X, y) Xis treated as a feature matrixyis treated as a target vector
For probabilistic classifiers such as Naïve Bayes, prediction occurs in two conceptual steps:
X → P(y | X) → y_hat
-
X— the feature matrix- Numerical representation of the compounds
- Each row corresponds to one compound
- Each column corresponds to one descriptor or fingerprint bit
-
P(y | X)— the posterior class probabilities- The model’s estimated probability for each class
- Computed internally using Bayes’ theorem
- Returned by
model.predict_proba(X)
-
y_hat— the predicted class label- The single class assigned to each compound
- Chosen as the class with the highest probability
- Returned by
model.predict(X)
In short, predict_proba() computes probabilities, while
predict() applies a decision rule based on those probabilities.
Even when Pandas DataFrames or Series are supplied, scikit-learn converts them
internally to NumPy arrays, preserving the matrix–vector structure
implied by X and y.
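The link between the two outputs can be checked using the five probability rows printed above. The sketch below copies those values into an array, applies the "highest probability wins" rule, and then applies a different cutoff to show why probabilities carry more information than labels. The 0.40 cutoff is a hypothetical value chosen for illustration.

```python
import numpy as np

# First five rows of y_proba as printed above: columns are P(inactive), P(active)
y_proba_first5 = np.array([
    [9.99990160e-01, 9.83971842e-06],
    [5.73893051e-01, 4.26106949e-01],
    [6.58510978e-03, 9.93414890e-01],
    [9.48421570e-01, 5.15784297e-02],
    [8.86150183e-01, 1.13849817e-01],
])
classes = np.array([0, 1])

# Decision rule used by predict(): pick the class with the highest probability
y_hat = classes[np.argmax(y_proba_first5, axis=1)]
print(y_hat)       # [0 0 1 0 0], matching y_pred[:5] above

# A different rule: call a compound active if P(active) >= 0.40 (hypothetical cutoff)
p_active = y_proba_first5[:, 1]
y_hat_040 = (p_active >= 0.40).astype(int)
print(y_hat_040)   # [0 1 1 0 0]
```

The second compound changes from inactive to active under the lower cutoff because its probability of being active (about 0.43) is close to the decision boundary. Class labels alone would hide this.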
4. Classification-Based Inference (.predict)#
With a trained classifier in hand, we now turn to its use for classification-based inference. In this mode, the model assigns each compound to a discrete class, such as inactive (0) or active (1), based on the learned decision rule. Evaluating the model at this stage means examining how these class assignments compare to known reference labels and analyzing the types of classification errors that occur.
In scikit-learn, classification evaluation is performed by comparing two aligned arrays:
- the known class labels for a dataset, often referred to conceptually as “ground truth” (y_train_bal or y_test)
- the class labels predicted by the model using .predict() (y_pred)
Rather than introducing a separate variable such as y_true, this notebook consistently uses the existing dataset labels (y_train_bal or y_test) to make it explicit which dataset is being evaluated and to reinforce the distinction between training and test data. This section focuses on classification outcomes and begins with the confusion matrix, which provides a complete summary of how predicted class labels compare to known labels. Summary metrics derived from the confusion matrix are introduced afterward.
4.1 Confusion Matrix for Classification Evaluation#
To evaluate how well the trained Bernoulli Naïve Bayes model classifies compounds, we compare the model’s predicted class labels to the known reference labels using a confusion matrix. A confusion matrix always compares two one-dimensional arrays:
- the reference labels for a dataset (y_train_bal or y_test)
- the predicted class labels produced by the model for the same samples (y_pred)
In this section, evaluation is performed using the test set, which provides an unbiased estimate of classification performance on unseen data. The known labels are therefore y_test, and the predicted labels (y_pred) are generated by applying the trained model to X_test.
The resulting confusion matrix organizes predictions into four categories: true positives, true negatives, false positives, and false negatives. These counts form the foundation for many commonly used classification metrics and provide direct insight into the types of errors the model makes.
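The four categories can be counted by hand with Boolean masks, which makes the definition of each cell explicit. The two label arrays below are toy values invented for illustration. The layout of the resulting matrix matches the [[TN, FP], [FN, TP]] arrangement used for binary labels 0 and 1.

```python
import numpy as np

# Toy reference labels and predictions (illustrative values only)
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_pred_toy = np.array([0, 1, 0, 0, 1, 0, 1, 1])

# Each category is a combination of the true label and the predicted label
TP = int(np.sum((y_true_toy == 1) & (y_pred_toy == 1)))  # active, predicted active
TN = int(np.sum((y_true_toy == 0) & (y_pred_toy == 0)))  # inactive, predicted inactive
FP = int(np.sum((y_true_toy == 0) & (y_pred_toy == 1)))  # inactive, predicted active
FN = int(np.sum((y_true_toy == 1) & (y_pred_toy == 0)))  # active, predicted inactive

# Rows are true classes (0, 1); columns are predicted classes (0, 1)
cm_toy = np.array([[TN, FP],
                   [FN, TP]])
print(cm_toy)
```

Every sample falls into exactly one of the four cells, so the entries always sum to the number of samples evaluated.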
4.2 Load Feature Arrays and Generate the Confusion Matrix#
# Imports and paths
from pathlib import Path
import numpy as np
SPLIT_ROOT = Path("data/AID743139/splits/90_10/arrays")
# Load saved feature arrays and labels
X_train = np.load(SPLIT_ROOT / "X_train.npy")
X_test = np.load(SPLIT_ROOT / "X_test.npy")
y_train = np.load(SPLIT_ROOT / "y_train.npy")
y_test = np.load(SPLIT_ROOT / "y_test.npy")
# Balance the training set (downsample inactives)
idx_inactives = np.where(y_train == 0)[0]
idx_actives = np.where(y_train == 1)[0]
np.random.seed(0)
idx_inactives_down = np.random.choice(
    idx_inactives,
    size=len(idx_actives),
    replace=False
)
X_train_bal = np.vstack((
    X_train[idx_inactives_down],
    X_train[idx_actives]
))
y_train_bal = np.hstack((
    y_train[idx_inactives_down],
    y_train[idx_actives]
))
# Train the classifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import confusion_matrix, classification_report
clf_NB = BernoulliNB()
clf_NB.fit(X_train_bal, y_train_bal)
# Generate test-set predictions
y_test_pred = clf_NB.predict(X_test)
# Confusion matrix
cm = confusion_matrix(y_test, y_test_pred)
TN, FP, FN, TP = cm.ravel()
print("Test-set Confusion Matrix")
print(cm)
# Classification report
# print("\nTest-set Classification Report")
# print(classification_report(y_test, y_test_pred))
Test-set Confusion Matrix
[[400 206]
[ 25 49]]
These two lines perform the core evaluation step.
y_pred = clf_NB.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm) # [[TN, FP],
# [FN, TP]]
# Extracting TN, FP, FN, TP from the confusion matrix
TN = cm[0, 0] # True Negatives
FP = cm[0, 1] # False Positives
FN = cm[1, 0] # False Negatives
TP = cm[1, 1] # True Positives
print("True Negatives (TN):", TN)
print("False Positives (FP):", FP)
print("False Negatives (FN):", FN)
print("True Positives (TP):", TP)
print("Total predictions:", TN + FP + FN + TP)
[[400 206]
[ 25 49]]
True Negatives (TN): 400
False Positives (FP): 206
False Negatives (FN): 25
True Positives (TP): 49
Total predictions: 680
# -------------------------------------------------
# Result persistence setup
# -------------------------------------------------
from pathlib import Path
import json
import numpy as np
import pandas as pd
from datetime import date
RESULTS_ROOT = Path("results/AID743139/nb")
RESULTS_ROOT.mkdir(parents=True, exist_ok=True)
print("Results will be saved to:")
print(RESULTS_ROOT.resolve())
# -------------------------------------------------
# Save test-set predictions
# -------------------------------------------------
# y_test : known labels (ground truth)
# y_test_pred : model predictions
df_test_pred = pd.DataFrame({
    "y_true": y_test,       # y_true : reference labels for evaluation (here: y_test)
    "y_pred": y_test_pred   # y_pred : labels predicted by the model
})
pred_path = RESULTS_ROOT / "test_predictions.csv"
df_test_pred.to_csv(pred_path, index=False)
print(f"Saved test-set predictions to {pred_path.name}")
df_test_pred.head()
Results will be saved to:
/home/rebelford/jupyterbooks/cinf26book/content/modules/10_SupervisedML/results/AID743139/nb
Saved test-set predictions to test_predictions.csv
| | y_true | y_pred |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 0 | 0 |
| 2 | 0 | 1 |
| 3 | 0 | 0 |
| 4 | 0 | 0 |
4.3 Saving Confusion Matrices and Metrics#
# -------------------------------------------------
# Save confusion matrix
# -------------------------------------------------
cm_path = RESULTS_ROOT / "confusion_matrix_test.npy"
np.save(cm_path, cm)
print(f"Saved confusion matrix to {cm_path}")
print("Confusion matrix contents:")
print(cm)
Saved confusion matrix to results/AID743139/nb/confusion_matrix_test.npy
Confusion matrix contents:
[[400 206]
[ 25 49]]
4.3.1 Recording Experimental Metadata#
# -------------------------------------------------
# Save evaluation metrics and metadata
# -------------------------------------------------
from sklearn.metrics import accuracy_score
metrics = {
    "accuracy": accuracy_score(y_test, y_test_pred),
    "TN": int(TN),
    "FP": int(FP),
    "FN": int(FN),
    "TP": int(TP)
}
metrics_path = RESULTS_ROOT / "metrics_test.json"
with open(metrics_path, "w") as f:
    json.dump(metrics, f, indent=2)
metadata = {
    "assay": "AID743139",
    "model": "BernoulliNB",
    "features": "MACCS",
    "split": "90_10",
    "training_balanced": True,
    "evaluation_set": "test",
    "date": date.today().isoformat(),
    "notebook": "10.2_naive_bayes.ipynb"
}
metadata_path = RESULTS_ROOT / "results_metadata.json"
with open(metadata_path, "w") as f:
    json.dump(metadata, f, indent=2)
print("Saved metrics and metadata:")
print(f" - {metrics_path}")
print(f" - {metadata_path}")
Saved metrics and metadata:
- results/AID743139/nb/metrics_test.json
- results/AID743139/nb/results_metadata.json
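Saved artifacts are only useful if they can be read back later. The sketch below shows one possible round trip using np.save / np.load and json.dump / json.load. It reuses the confusion-matrix counts printed above and writes to a temporary directory, which is an assumption made here so the example does not touch the course results/ tree.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

# Counts copied from the test-set confusion matrix printed above
cm = np.array([[400, 206],
               [25, 49]])
TN, FP, FN, TP = (int(v) for v in cm.ravel())
metrics = {"TN": TN, "FP": FP, "FN": FN, "TP": TP}

# A temporary directory stands in for RESULTS_ROOT (illustrative only)
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    np.save(root / "confusion_matrix_test.npy", cm)
    with open(root / "metrics_test.json", "w") as f:
        json.dump(metrics, f, indent=2)

    # Reload both files and confirm nothing changed on disk
    cm_loaded = np.load(root / "confusion_matrix_test.npy")
    with open(root / "metrics_test.json") as f:
        metrics_loaded = json.load(f)

print(np.array_equal(cm, cm_loaded))
print(metrics_loaded == metrics)
```

In the real workflow the same np.load and json.load calls would point at RESULTS_ROOT instead of the temporary directory.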
4.4 Interpreting the Confusion Matrix#
There is no single number that fully describes how a classifier performs. Instead of immediately collapsing model behavior into a summary statistic, we begin by examining the confusion matrix, which records how predictions are distributed across correct and incorrect outcomes. By separating true positives, false positives, true negatives, and false negatives, the confusion matrix makes explicit which types of mistakes the model makes and how often they occur. This structured view provides the foundation for all subsequent evaluation metrics.
Historically, the confusion matrix is so named because it explicitly shows where a model “confuses” one class for another, rather than hiding those errors inside a single summary value. Simply put, a false positive occurs when the model predicts that an inactive compound is active, and a false negative occurs when the model predicts that an active compound is inactive.
In the sections that follow, we will use the confusion matrix to derive quantitative evaluation measures. Each metric is computed directly from these counts, and their meaning is best understood only after the structure of the confusion matrix is clear.
In this class we will use the scikit-learn convention above, where labels are sorted \([0,1]\) and align with binary encoding 0 (inactive) and 1 (active).
| | Predicted 0 | Predicted 1 |
|---|---|---|
| Actual 0 | TN (True Negative) | FP (False Positive) |
| Actual 1 | FN (False Negative) | TP (True Positive) |
Be aware that an alternate convention places the positive class first, using a \([1,0]\) label ordering.
Explanation
Many sources place the positive class first and the negative class second, which reverses the order of both the rows and the columns, as shown in the following confusion matrix.
| | Predicted 1 | Predicted 0 |
|---|---|---|
| Actual 1 | TP (True Positive) | FN (False Negative) |
| Actual 0 | FP (False Positive) | TN (True Negative) |
We will not use this convention in this class, but when comparing data you need to be aware of which convention is being used.
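To make the two orderings concrete, here is a minimal sketch using made-up labels (not assay data). It relies on the labels parameter of scikit-learn's confusion_matrix to request the positive-first ordering, and checks that the result is the default matrix with both rows and columns reversed.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (illustrative only): 1 = active, 0 = inactive
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 0, 1, 1])

# Default scikit-learn ordering, labels sorted as [0, 1]:
# [[TN, FP],
#  [FN, TP]]
cm_01 = confusion_matrix(y_true, y_pred)

# Alternate ordering with the positive class first:
# [[TP, FN],
#  [FP, TN]]
cm_10 = confusion_matrix(y_true, y_pred, labels=[1, 0])

print(cm_01)
print(cm_10)

# Same four counts, with rows and columns both reversed
print(np.array_equal(cm_10, cm_01[::-1, ::-1]))
```

Because TN, FP, FN, TP = cm.ravel() assumes the \([0,1]\) ordering, unpacking a matrix built with the other convention would silently swap the counts.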
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
plt.figure(figsize=(6, 5))
sns.heatmap(
    cm,
    annot=True,
    fmt="d",
    cmap="Greens",
    cbar=False,
    xticklabels=["Predicted 0 (Inactive)", "Predicted 1 (Active)"],
    yticklabels=["Actual 0 (Inactive)", "Actual 1 (Active)"],
    linewidths=1,
    linecolor="black"
)
plt.title("Confusion Matrix — Bernoulli Naïve Bayes")
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
plt.tight_layout()
plt.show()
The diagonal cells represent correct predictions, while the off-diagonal cells represent errors. The confusion matrix makes it visually clear that different types of errors occur, which is why we need multiple evaluation metrics rather than a single accuracy value.
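As a small illustration of the diagonal versus off-diagonal reading, the sketch below rebuilds the test-set matrix from the counts printed earlier and separates correct predictions from errors. The variable names are chosen for this example only.

```python
import numpy as np

# Test-set confusion matrix from the output above: [[TN, FP], [FN, TP]]
cm = np.array([[400, 206],
               [25, 49]])

correct = int(np.trace(cm))        # diagonal: TN + TP
errors = int(cm.sum()) - correct   # off-diagonal: FP + FN
accuracy = correct / int(cm.sum())

print("Correct predictions:", correct)
print("Errors:", errors)
print(f"Accuracy: {accuracy:.4f}")

# The two error types are far from equal in size
print("False positives:", int(cm[0, 1]))
print("False negatives:", int(cm[1, 0]))
```

Accuracy treats all 231 errors alike, even though 206 of them are false positives and only 25 are false negatives.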
4.5 Confusion Matrix Based Evaluation Metrics#
The confusion matrix reveals which kinds of mistakes the model makes.
| | Predicted Inactive | Predicted Active |
|---|---|---|
| Actual Inactive | TN (True Negative) | FP (False Positive) |
| Actual Active | FN (False Negative) | TP (True Positive) |
4.5.1 Table of Confusion Matrix Metrics#
All of the metrics below are derived from TN, TP, FN, and FP. Each one summarizes the model’s correct predictions and mistakes from a different scientific perspective.
| Metric | Mathematical Definition | What This Metric Tells Us |
|---|---|---|
| Accuracy | \(\displaystyle \frac{TP + TN}{TP + TN + FP + FN}\) | What fraction of all predictions (active and inactive) were correct overall. |
| Precision | \(\displaystyle \frac{TP}{TP + FP}\) | When the model predicts “active,” how often is it actually correct? |
| Sensitivity (Recall) | \(\displaystyle \frac{TP}{TP + FN}\) | Of all truly active compounds, how many did the model successfully identify? |
| Specificity | \(\displaystyle \frac{TN}{TN + FP}\) | Of all truly inactive compounds, how many did the model correctly reject? |
| Balanced Accuracy | \(\displaystyle \frac{\text{Sensitivity} + \text{Specificity}}{2}\) | How well the model performs across both classes, even when the dataset is imbalanced. |
| F1 Score | \(\displaystyle \frac{2\,(\text{Precision}\times\text{Sensitivity})}{\text{Precision} + \text{Sensitivity}}\) | A single score that balances finding actives with avoiding false positives. |
# --- Test metrics ---
acc_test = accuracy_score(y_test, y_test_pred)
prec_test = TP / (TP + FP) if (TP + FP) else 0.0
sens_test = TP / (TP + FN) if (TP + FN) else 0.0
spec_test = TN / (TN + FP) if (TN + FP) else 0.0
bacc_test = (sens_test + spec_test) / 2
f1_test = (2 * prec_test * sens_test / (prec_test + sens_test)) if (prec_test + sens_test) else 0.0
print("TEST SET PERFORMANCE")
print(f"Accuracy = {acc_test:.4f}")
print(f"Precision = {prec_test:.4f}")
print(f"Sensitivity = {sens_test:.4f}")
print(f"Specificity = {spec_test:.4f}")
print(f"Balanced Accuracy = {bacc_test:.4f}")
print(f"F1 Score = {f1_test:.4f}")
TEST SET PERFORMANCE
Accuracy = 0.6603
Precision = 0.1922
Sensitivity = 0.6622
Specificity = 0.6601
Balanced Accuracy = 0.6611
F1 Score = 0.2979
The key takeaway from this discussion is that confusion-matrix-based metrics describe model performance at a single decision threshold, and some of those metrics, especially precision and F1, are highly sensitive to class imbalance. As a result, a model may contain useful discriminatory information while still appearing unreliable at a particular cutoff. Before deciding whether a model is truly useful, we therefore need a way to evaluate its ability to separate actives from inactives independently of any single threshold. This motivates a ranking-based evaluation using the ROC curve and its AUC, which assesses whether the model assigns higher scores to active compounds than to inactive ones across all possible decision thresholds.
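To see why precision depends so strongly on class balance, the following sketch holds sensitivity and specificity fixed at the test-set values reported above and varies only the fraction of actives. The helper name precision_at_prevalence is hypothetical, and the formula is simply the definition of precision rewritten in terms of rates.

```python
def precision_at_prevalence(sens, spec, prevalence):
    """Precision implied by a fixed sensitivity and specificity
    when a given fraction of the compounds is truly active."""
    tp_rate = sens * prevalence                # expected true positives per compound
    fp_rate = (1 - spec) * (1 - prevalence)    # expected false positives per compound
    return tp_rate / (tp_rate + fp_rate)

# Test-set values from the confusion matrix above
sens = 49 / (49 + 25)      # about 0.6622
spec = 400 / (400 + 206)   # about 0.6601

# 74 / 680 is the fraction of actives in the test set
for prevalence in (74 / 680, 0.25, 0.50):
    p = precision_at_prevalence(sens, spec, prevalence)
    print(f"active fraction = {prevalence:.3f} -> precision = {p:.3f}")
```

With identical sensitivity and specificity, precision is about 0.19 at the test-set active fraction and about 0.66 when the classes are balanced, so a low precision here reflects the scarcity of actives as much as the quality of the model.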
5. Probability-Based Inference (.predict_proba)#
Up to this point, we have evaluated our classifier using a confusion matrix and metrics derived from it (accuracy, sensitivity, specificity, etc.). All of those metrics are based on a single decision rule: the model assigns each compound to a class (active or inactive) using a fixed threshold. This raises an important question:
What if the model is capable of separating actives from inactives, but the particular threshold we chose is not ideal?
The key limitation of a confusion matrix is that it evaluates the model at only one operating point—that is, at a single decision threshold. If that threshold is poorly chosen, the model may appear to perform badly even when it has learned meaningful structure in the data.
The ROC curve (Receiver Operating Characteristic curve) addresses this limitation by evaluating the model across all possible decision thresholds. At one extreme, when the threshold is set to 1, no compounds are predicted active; at the other extreme, when the threshold is set to 0, all compounds are predicted active. As the threshold is gradually lowered between these extremes, the ROC curve traces how the true positive rate (sensitivity) increases in relation to the false positive rate, revealing the trade-off between correctly identifying actives and incorrectly flagging inactives.
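The threshold sweep described above can be sketched by hand on a few made-up scores. The data and the list of thresholds are illustrative assumptions. A cutoff of 1.01 is used for the upper extreme so that even a score of exactly 1.0 would not be called active under the >= rule.

```python
import numpy as np

# Toy data (illustrative only): 1 = active, 0 = inactive, with made-up scores
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90])

n_pos = int((y_true == 1).sum())
n_neg = int((y_true == 0).sum())

roc_points = []
for threshold in (1.01, 0.75, 0.50, 0.25, 0.00):
    # Predict "active" for every compound scoring at or above the cutoff
    y_pred = (y_score >= threshold).astype(int)
    tpr = int(((y_pred == 1) & (y_true == 1)).sum()) / n_pos   # sensitivity
    fpr = int(((y_pred == 1) & (y_true == 0)).sum()) / n_neg   # 1 - specificity
    roc_points.append((fpr, tpr))
    print(f"threshold = {threshold:.2f}  TPR = {tpr:.2f}  FPR = {fpr:.2f}")
```

Each (FPR, TPR) pair is one point on the ROC curve. The sweep starts at (0, 0) and ends at (1, 1), and lowering the cutoff never decreases either rate.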
Historically, the term “receiver operating characteristic” comes from signal-detection theory, where it described how a signal receiver’s performance changed as its detection sensitivity was adjusted. In machine learning, the same idea applies: the ROC curve characterizes how a classifier’s behavior changes as we vary the decision rule that converts scores into classifications.
The AUC (Area Under the Curve) then condenses this threshold-sweep into a single number, representing the integrated area under the ROC curve. Conceptually, AUC measures how well the model can rank compounds independent of any particular cutoff: if active compounds tend to receive higher scores than inactive ones, the AUC will be high, even if no single threshold has yet been chosen.
In this way, ROC–AUC answers a different question than the confusion matrix:
Confusion matrices ask: “How did the model perform at this particular decision rule?”
ROC–AUC asks: “Across all possible decision rules, does the model separate actives from inactives at all?”
This distinction is especially important in imbalanced datasets, where accuracy and related metrics can be misleading. A model may appear weak at a single threshold while still containing useful discriminatory information. ROC–AUC allows us to detect whether that underlying signal exists before we commit to a specific classification cutoff.
In the next steps, we will compute ROC–AUC using the same trained model as before. The only change will be how we interpret its output: instead of hard class labels, we will use the model’s continuous scores. Once we understand how well the model separates the classes in general, we can then return to the question of choosing an appropriate threshold for a specific scientific or practical objective.
Discussion question:
When we talk about a “threshold” in ROC analysis, are we talking about the same kind of threshold that defines whether a compound is active or inactive in a bioassay?
Explanation
No — these are fundamentally different thresholds.
Experimental (bioassay) threshold:
In an assay, a threshold is part of how the labels are defined. For example, a compound might be called “active” if its IC50 is below a certain concentration. This threshold is tied to biology, chemistry, and experimental design. Once chosen, it defines the ground truth labels used for modeling.
Model (decision) threshold:
In machine learning, the threshold does not change the experimental labels. Instead, it controls how the model’s continuous output (such as a predicted probability of being active) is converted into a hard class prediction (active vs inactive).
For example, a model might assign a compound a score of 0.62 for being active. If we choose a decision threshold of 0.50, the model predicts “active.” If we choose a threshold of 0.70, the same compound would be predicted “inactive.” The underlying experimental label has not changed — only the model’s decision rule has.
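In code, a decision threshold is just a comparison applied to the scores. The sketch below uses three hypothetical probabilities, including the 0.62 from the example above. In practice such scores would come from something like clf.predict_proba(X_test)[:, 1].

```python
import numpy as np

# Hypothetical predicted probabilities of being active for three compounds
y_score = np.array([0.62, 0.45, 0.81])

# Same scores, two different decision rules
pred_at_050 = (y_score >= 0.50).astype(int)
pred_at_070 = (y_score >= 0.70).astype(int)

print(pred_at_050)  # [1 0 1]
print(pred_at_070)  # [0 0 1]
```

For a two-class model such as the one above, .predict() picks the class with the higher predicted probability, which amounts to a cutoff of about 0.5 on the active-class probability.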
Why this matters for ROC–AUC:
ROC–AUC is fundamentally a ranking-based evaluation. Rather than measuring performance at a single cutoff, it evaluates whether the model’s scores induce a ranked ordering in which truly active compounds tend to receive higher scores than truly inactive ones. The higher the AUC, the greater the probability that a randomly chosen active compound outranks a randomly chosen inactive compound.
As the decision threshold is varied from 1 to 0, we are effectively sliding a cutoff down this ranked list. Each possible threshold corresponds to a different point on the ROC curve. If the ranking meaningfully separates actives from inactives, then there will exist one or more thresholds that produce useful classification performance.
The AUC summarizes this idea by measuring how consistently the model ranks active compounds ahead of inactive ones, independent of where the cutoff is ultimately placed. If AUC is close to 0.5, the ranking is essentially random and no threshold will meaningfully separate the classes. If AUC is high, the ranking contains real discriminatory information, and the remaining task becomes choosing an appropriate threshold to turn that ranking into decisions.
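The ranking interpretation can be checked directly by comparing every active with every inactive. This is a minimal sketch on made-up scores, with ties counted as half a win. It is meant to illustrate the idea and is not how scikit-learn computes AUC internally.

```python
from itertools import product

# Toy data (illustrative only): scores for known actives and known inactives
active_scores = [0.35, 0.60, 0.70, 0.90]
inactive_scores = [0.10, 0.30, 0.40, 0.80]

wins = 0.0
for a, i in product(active_scores, inactive_scores):
    if a > i:
        wins += 1.0    # active ranked above inactive
    elif a == i:
        wins += 0.5    # a tie counts as half

n_pairs = len(active_scores) * len(inactive_scores)
auc_pairwise = wins / n_pairs
print(auc_pairwise)  # 0.75
```

Here 12 of the 16 active/inactive pairs are ordered correctly, giving 0.75. Passing the same labels and scores to roc_auc_score should give the same value, since the area under the ROC curve equals this pairwise ranking probability.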
# 1. Imports
from pathlib import Path
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    roc_auc_score,
    roc_curve
)
# 2. Load arrays (uncomment the next line if SPLIT_ROOT is not in memory)
# SPLIT_ROOT = Path("data/AID743139/splits/90_10/arrays")
X_train = np.load(SPLIT_ROOT / "X_train.npy")
X_test = np.load(SPLIT_ROOT / "X_test.npy")
y_train = np.load(SPLIT_ROOT / "y_train.npy")
y_test = np.load(SPLIT_ROOT / "y_test.npy")
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
# 3. Balance the training set
idx_inactives = np.where(y_train == 0)[0]
idx_actives = np.where(y_train == 1)[0]
num_actives = len(idx_actives)
np.random.seed(0)
idx_inactives_downsampled = np.random.choice(
    idx_inactives,
    size=num_actives,
    replace=False
)
X_train_bal = np.vstack((
    X_train[idx_inactives_downsampled],
    X_train[idx_actives]
))
y_train_bal = np.hstack((
    y_train[idx_inactives_downsampled],
    y_train[idx_actives]
))
print(X_train_bal.shape, y_train_bal.shape)
(6113, 162) (6113,)
(680, 162) (680,)
(1338, 162) (1338,)
# 4. Fit model
clf = BernoulliNB()
clf.fit(X_train_bal, y_train_bal)
# 5. Predict hard labels
y_pred_test = clf.predict(X_test)
y_pred_train = clf.predict(X_train_bal)
# 6. Create confusion matrices (train and test)
cm_test = confusion_matrix(y_test, y_pred_test)
cm_train = confusion_matrix(y_train_bal, y_pred_train)
TN, FP, FN, TP = cm_test.ravel()
TN_tr, FP_tr, FN_tr, TP_tr = cm_train.ravel()
print("Test confusion matrix:\n", cm_test)
print("\nTraining confusion matrix:\n", cm_train)
Test confusion matrix:
[[400 206]
[ 25 49]]
Training confusion matrix:
[[457 212]
[187 482]]
# 7. Metrics from confusion matrices
# --- Test metrics ---
acc_test = accuracy_score(y_test, y_pred_test)
prec_test = TP / (TP + FP) if (TP + FP) else 0.0
sens_test = TP / (TP + FN) if (TP + FN) else 0.0
spec_test = TN / (TN + FP) if (TN + FP) else 0.0
bacc_test = (sens_test + spec_test) / 2
f1_test = (2 * prec_test * sens_test / (prec_test + sens_test)) if (prec_test + sens_test) else 0.0
print("TEST SET PERFORMANCE")
print(f"Accuracy = {acc_test:.4f}")
print(f"Precision = {prec_test:.4f}")
print(f"Sensitivity = {sens_test:.4f}")
print(f"Specificity = {spec_test:.4f}")
print(f"Balanced Accuracy = {bacc_test:.4f}")
print(f"F1 Score = {f1_test:.4f}")
# --- Training metrics ---
acc_train = accuracy_score(y_train_bal, y_pred_train)
prec_train = TP_tr / (TP_tr + FP_tr) if (TP_tr + FP_tr) else 0.0
sens_train = TP_tr / (TP_tr + FN_tr) if (TP_tr + FN_tr) else 0.0
spec_train = TN_tr / (TN_tr + FP_tr) if (TN_tr + FP_tr) else 0.0
bacc_train = (sens_train + spec_train) / 2
f1_train = (2 * prec_train * sens_train / (prec_train + sens_train)) if (prec_train + sens_train) else 0.0
print("\nTRAINING SET PERFORMANCE")
print(f"Accuracy = {acc_train:.4f}")
print(f"Precision = {prec_train:.4f}")
print(f"Sensitivity = {sens_train:.4f}")
print(f"Specificity = {spec_train:.4f}")
print(f"Balanced Accuracy = {bacc_train:.4f}")
print(f"F1 Score = {f1_train:.4f}")
TEST SET PERFORMANCE
Accuracy = 0.6603
Precision = 0.1922
Sensitivity = 0.6622
Specificity = 0.6601
Balanced Accuracy = 0.6611
F1 Score = 0.2979
TRAINING SET PERFORMANCE
Accuracy = 0.7018
Precision = 0.6945
Sensitivity = 0.7205
Specificity = 0.6831
Balanced Accuracy = 0.7018
F1 Score = 0.7073
# 8. Get probability scores for the active class (column 1)
y_score_test = clf.predict_proba(X_test)[:, 1]
y_score_train = clf.predict_proba(X_train_bal)[:, 1]
# 9. Compute ROC-AUC (train and test)
auc_test = roc_auc_score(y_test, y_score_test)
auc_train = roc_auc_score(y_train_bal, y_score_train)
print("ROC–AUC")
print(f"Test AUC = {auc_test:.4f}")
print(f"Training AUC = {auc_train:.4f}")
ROC–AUC
Test AUC = 0.7279
Training AUC = 0.7546
fpr_test, tpr_test, thresholds_test = roc_curve(y_test, y_score_test)
fpr_train, tpr_train, thresholds_train = roc_curve(y_train_bal, y_score_train)
print(
    f"ROC curve arrays (test set):\n"
    f" fpr_test = {fpr_test.shape} # False Positive Rate values\n"
    f" tpr_test = {tpr_test.shape} # True Positive Rate values\n"
    f" thresholds_test = {thresholds_test.shape} # Decision thresholds\n"
)
print(
    f"ROC curve arrays (training set):\n"
    f" fpr_train = {fpr_train.shape} # False Positive Rate values\n"
    f" tpr_train = {tpr_train.shape} # True Positive Rate values\n"
    f" thresholds_train = {thresholds_train.shape} # Decision thresholds\n"
)
ROC curve arrays (test set):
fpr_test = (144,) # False Positive Rate values
tpr_test = (144,) # True Positive Rate values
thresholds_test = (144,) # Decision thresholds
ROC curve arrays (training set):
fpr_train = (587,) # False Positive Rate values
tpr_train = (587,) # True Positive Rate values
thresholds_train = (587,) # Decision thresholds
import matplotlib.pyplot as plt
plt.figure(figsize=(6, 6))
# Plot ROC curves
plt.plot(fpr_test, tpr_test, label=f"Test ROC (AUC = {auc_test:.3f})")
plt.plot(fpr_train, tpr_train, label=f"Train ROC (AUC = {auc_train:.3f})")
# Reference line: random classifier
plt.plot([0, 1], [0, 1], linestyle='--', label="Random (AUC = 0.5)")
# Labels and title
plt.xlabel("False Positive Rate (1 − Specificity)")
plt.ylabel("True Positive Rate (Sensitivity)")
plt.title("ROC Curve: BernoulliNB with MACCS Fingerprints")
plt.legend(loc="lower right")
plt.grid(True)
plt.show()
The ROC curves for both the training and test sets lie well above the diagonal reference line, indicating that the model has learned a meaningful ranking signal rather than behaving like a random classifier. The training AUC (0.7546) is higher than the test AUC (0.7279), as expected, but the gap between them is modest, suggesting limited overfitting and reasonable generalization. The test-set AUC of approximately 0.73 confirms that active compounds tend to receive higher scores than inactive compounds, even though no single decision threshold has yet been chosen. At the same time, the gradual rise of the curve highlights an inherent trade-off: lowering the threshold to gain sensitivity also raises the false positive rate. This reinforces the idea that ROC–AUC evaluates whether useful separation exists at all, while the choice of a specific threshold must be guided by scientific or practical priorities.
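As one illustration of how a practical priority can be turned into a threshold, the sketch below uses the arrays returned by roc_curve to find the strictest cutoff that still reaches a target sensitivity. The toy data and the 0.75 target are assumptions made for this example. This is only one of several reasonable selection rules and is not the method prescribed by this notebook.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Toy data (illustrative only): 1 = active, 0 = inactive, with made-up scores
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90])

fpr, tpr, thresholds = roc_curve(y_true, y_score)

target_sensitivity = 0.75  # hypothetical requirement: find at least 75% of actives

# roc_curve returns thresholds in decreasing order, so the first index that
# meets the target corresponds to the highest (strictest) usable cutoff
idx = int(np.argmax(tpr >= target_sensitivity))
chosen_threshold = float(thresholds[idx])

print(f"threshold = {chosen_threshold:.2f}, "
      f"TPR = {tpr[idx]:.2f}, FPR = {fpr[idx]:.2f}")
```

On these toy scores, the target is first met at a cutoff of 0.60, where the false positive rate is 0.25. A project that cared more about avoiding false positives would instead cap the FPR and accept whatever sensitivity that allows.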
Homework
Problem 1: hw_10.2my_AID.ipynb
Make a new notebook called hw_10.2my_AID.ipynb and place it in the directory you have been running this notebook in.
You now need to create a confusion matrix and an ROC curve, and print out the AUC values. I do not want you to generate all of the narrative of this notebook, just the code cells, headers for navigation, and a few markdown statements where you think they are needed. There are several approaches you can take. For example, you could make a copy of this notebook, rename it, delete the bulk of the markdown cells, and then use refactoring to change the AID used in this notebook to your AID. Or you may want to open the Jupyter book alongside your new notebook and copy just the cells you need for the workflow. You need to use the path techniques and cinf26pk that we have been using, so that I can load your completed notebook and generate the data artifacts within the directory architecture we are developing in this class.