10.0: Introduction#

Supervised Machine Learning and the Tox21 Aromatase Bioassay#

Learning Objectives

Purpose
Introduce machine learning as a scientific workflow for building predictive models from experimental data, and situate the Tox21 Aromatase BioAssay as the case study used throughout this module.

Students Learn

  • What a supervised machine-learning problem is
  • What a model represents in a scientific context
  • How experimental data are transformed into inputs for modeling
  • Why data preparation and representation decisions matter before learning begins

Core Activities

  • Review basic machine-learning terminology and concepts
  • Examine the biological role of Aromatase and the meaning of assay activity
  • Identify the major stages of a machine-learning workflow
  • Preview how data, models, and results will be organized across the module

1. How This Module Is Organized#

This module is organized as a sequence of notebooks that follow this introduction, each focused on a distinct stage of the machine-learning workflow.

| Notebook | Description |
|----------|-------------|
| 10.0 | Introduction to supervised machine learning, the Tox21 Aromatase bioassay, and how this module is organized |
| 10.1 | Data preparation and molecular representation, including chemical curation and fingerprint generation |
| 10.2 | Manual model construction workflow using Naive Bayes, including data splitting, class imbalance, and feature selection |
| 10.3 | Evaluation and interpretation of model behavior using confusion matrices, ROC curves, and thresholds |
| 10.4 | Inference pipelines and model reuse, focusing on applying trained models safely to new data |
| 10.5 | Systematic comparison of molecular representations and learning algorithms within a controlled experimental framework |

2. Machine Learning Basics#

Machine learning (ML) is a branch of artificial intelligence in which computers learn patterns from data and use those patterns to make predictions or decisions, often in a probabilistic rather than purely deterministic manner. Instead of being explicitly programmed with fixed, rule-based logic, a machine-learning system infers statistical relationships directly from examples.

This distinction is especially important in scientific applications. Chemical and biological data are rarely exact: experimental measurements contain noise, variability, and uncertainty. Machine learning embraces this reality by modeling likelihoods and trends rather than assuming perfectly deterministic behavior. In this sense, ML is not replacing scientific reasoning; it is extending statistical reasoning to complex, high-dimensional data.

In chemistry and drug discovery, machine learning is used to analyze large collections of molecular structures, experimental assay results, and measured properties in order to make informed predictions. Common applications include virtual screening of compound libraries, prediction of biological activity, identification of potential toxicity, and exploration of structure–activity relationships (SAR). In all cases, ML functions as a computational proxy for experiments, helping prioritize where experimental effort should be focused.

Deterministic vs. Probabilistic Models

Traditional computer programs operate deterministically: given the same inputs, they always produce the same outputs by following predefined rules. Many chemical calculations, such as stoichiometric bookkeeping or unit conversions, fall into this category.

Machine-learning models, by contrast, are typically probabilistic. Rather than asserting that a compound is or is not active with absolute certainty, a model estimates the likelihood of activity based on patterns learned from prior data. This probabilistic framing reflects how experimental science actually works: conclusions are drawn from evidence, not certainty. Understanding this distinction will be important throughout this module, particularly when interpreting model predictions and evaluation metrics.
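
The contrast can be sketched in a few lines of Python. This is a toy illustration only: the frequency-based "model" and its data are hypothetical, not a real classifier.

```python
# Deterministic calculation: the same inputs always give the same output.
def grams_to_moles(grams, molar_mass):
    return grams / molar_mass

# Probabilistic "prediction": estimate P(active) from prior labeled examples
# that share the same feature value (a toy stand-in for a trained model).
def estimate_activity_probability(feature_present, history):
    matches = [label for feat, label in history if feat == feature_present]
    if not matches:
        return 0.5  # no evidence either way
    return sum(matches) / len(matches)

# Hypothetical "training data": (has_substructure, active) pairs
history = [(True, 1), (True, 1), (True, 0), (False, 0), (False, 0)]

print(grams_to_moles(18.02, 18.02))                  # always exactly 1.0
print(estimate_activity_probability(True, history))  # a likelihood, ~0.67
```

The first function returns a certainty; the second returns an estimate that would shift if more labeled examples were added, which is the behavior we will see from real classifiers later in the module.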

2.1 Basic Categories of Machine Learning#

Machine-learning methods are commonly classified based on how they learn from data and the type of feedback they receive during training. While this module focuses on supervised learning, it is useful to understand how it fits within the broader landscape of machine learning and artificial intelligence.

2.1.1 Supervised Learning

Supervised learning uses labeled data, meaning that each example includes both an input and a known outcome. The model learns a statistical relationship between inputs and outputs and is evaluated by how well it predicts outcomes for new, unseen data.

Examples include:

  • Classifying compounds as active or inactive in a bioassay

  • Predicting numerical properties such as solubility, toxicity, or binding affinity

Because experimental chemical datasets are typically labeled, supervised learning is the most common and practical form of machine learning in chemistry. In this module, supervised learning provides the conceptual and technical foundation for understanding how predictive models are built, applied, and evaluated.

2.1.2 Unsupervised Learning

Unsupervised learning works with unlabeled data. Instead of predicting known outcomes, the goal is to identify structure, patterns, or organization within the data itself.

Examples include:

  • Clustering compounds based on structural similarity

  • Dimensionality reduction for visualizing chemical space

  • Identifying subpopulations or trends in large molecular datasets

Unsupervised methods are often used for exploratory analysis rather than direct prediction. While they are widely used in cheminformatics, they play a supporting role relative to supervised learning in this module.

2.1.3 Reinforcement Learning (brief overview)

Reinforcement learning involves an agent that learns by interacting with an environment and receiving feedback in the form of rewards or penalties. Rather than learning from fixed labeled examples, the model learns from sequences of actions and outcomes.

Examples include:

  • Iterative molecular design

  • Optimization of reaction pathways

  • Control of experimental or simulation workflows

These methods are powerful but conceptually more complex, and they require a different mathematical and computational framework. They are beyond the scope of this module.

2.1.4 Large Language Models and Generative AI

Large language models (LLMs) represent a different class of machine-learning systems. Rather than learning from structured numerical features, they are trained on massive collections of text and code to model patterns in language.

Examples include:

  • Natural-language question answering

  • Code generation and explanation

  • Scientific text summarization

In this course, students are encouraged to use LLM-based tools as assistive technologies, for example, to help write, debug, or understand code. However, it is important to distinguish their role from the models we build in this module:

  • LLMs are general-purpose, pre-trained systems

  • The models in this module are task-specific, trained from experimental chemical data

  • LLMs assist humans; supervised learning models make scientific predictions

Understanding this distinction helps clarify what it means to “build a model” versus “use an AI tool.”

3. From Experimental Data to Predictive Models#

3.1. Supervised Learning as a Reproducible Modeling Pipeline#

Machine learning is often introduced through algorithms—Naive Bayes, decision trees, neural networks—but in scientific practice, machine learning is better understood as a structured workflow that transforms experimental data into defensible predictions. In this module, we treat supervised learning not as a collection of algorithms, but as a reproducible sequence of data transformations, modeling decisions, and evaluation steps.

At its core, supervised learning involves learning a relationship between:

  • Inputs: measured or derived features (for example, molecular fingerprints), and

  • Outputs: known experimental outcomes (for example, active vs. inactive labels from a bioassay).

The defining characteristic of supervised learning is that the correct answers are known during training. Models are trained on labeled data and evaluated on how well they generalize to unseen examples.
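
A minimal picture of what "labeled data" means, with all values hypothetical (4-bit toy "fingerprints" rather than real molecular descriptors):

```python
# Inputs X: one feature vector per compound (hypothetical 4-bit fingerprints)
X = [
    [1, 0, 1, 0],  # compound A
    [1, 1, 0, 0],  # compound B
    [0, 0, 1, 1],  # compound C
]
# Outputs y: the known experimental labels (1 = active, 0 = inactive)
y = [1, 1, 0]

def predict_nearest(x_new, X, y):
    """Toy predictor: return the label of the most similar training
    example, where similarity = number of matching bits."""
    def matches(a, b):
        return sum(ai == bi for ai, bi in zip(a, b))
    best = max(range(len(X)), key=lambda i: matches(x_new, X[i]))
    return y[best]

print(predict_nearest([1, 0, 0, 0], X, y))  # most like A/B -> predicts 1
```

Real models replace the nearest-match rule with learned statistical relationships, but the shape of the problem, paired `X` and `y`, is exactly this.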

In chemistry and drug discovery, supervised learning is commonly used to:

  • Classify compounds as active or inactive in a biological assay,

  • Predict physical or biological properties from molecular structure,

  • Prioritize compounds for further experimental testing.

Importantly, the predictive model itself is only one component of the overall workflow. The reliability of any prediction depends just as much on how the data were collected, curated, represented, and partitioned as on the choice of algorithm. Later in this module, we will introduce inference pipelines, which are software objects that bundle learned preprocessing steps together with trained models to ensure consistent reuse. But our initial focus will be on understanding the workflow that produces those pipelines.
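
The pipeline idea can be previewed with a plain-Python stand-in. This is a conceptual sketch only, not the scikit-learn pipelines used later in the module, and all names here are hypothetical.

```python
class TinyPipeline:
    """Conceptual pipeline: preprocessing state learned from training data
    is stored alongside the model, so new data is transformed identically."""
    def __init__(self, keep_columns, model):
        self.keep_columns = keep_columns  # learned preprocessing state
        self.model = model                # trained predictor

    def _transform(self, x):
        # Apply the same feature selection that was used during training
        return [x[i] for i in self.keep_columns]

    def predict(self, x):
        return self.model(self._transform(x))

# Hypothetical example: keep features 0 and 2; the "model" is a simple rule.
pipe = TinyPipeline(keep_columns=[0, 2], model=lambda v: int(sum(v) >= 1))
print(pipe.predict([1, 0, 0, 1]))  # transforms to [1, 0], predicts 1
```

The point is the bundling: if preprocessing and model travel together, new data can never be transformed inconsistently with the training data.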

3.2. Organizing a Machine-Learning Project#

Before we write any code, download any data, or train any models, we need to decide where things live and why. Machine-learning workflows generate many artifacts: raw datasets, curated datasets, feature arrays, trained models, and evaluation results. Without a clear organizational strategy, it quickly becomes impossible to know which files were produced when, from which inputs, and for what purpose. This notebook deliberately begins with project organization because reproducibility in machine learning is not achieved by clever algorithms; it is achieved by disciplined data management.

3.2.1: Why Project Organization Matters#

In this module, we will repeatedly:

  • Download raw experimental data from PubChem,

  • Generate derived datasets through curation and filtering,

  • Create numerical representations suitable for machine learning,

  • Train and save multiple models using different algorithms,

  • Evaluate those models and compare their performance.

Each of these steps produces files that must be saved to disk. If those files are scattered across ad-hoc folders or referenced using hard-coded filenames, it becomes difficult—or impossible—to:

  • Reproduce results at a later date,

  • Determine which model used which version of the data,

  • Compare algorithms fairly,

  • Debug unexpected behavior.

In this module, each notebook is responsible for producing or consuming specific artifacts. Maintaining a clear directory structure allows us to move between notebooks without ambiguity about which data, models, or results are being used.

3.2.2: Clear Directory Hierarchy and Naming Conventions#

We need a directory hierarchy that separates:

  • Algorithm-agnostic data artifacts from

  • Algorithm-specific model artifacts, and

  • Executable notebooks from saved results.

We also need a naming convention that tells us what each artifact is and how it was generated. Over the course of this module we will develop our own Python helper-function package to assist in this endeavor. Below is an outline of the directory hierarchy, along with some file names that use conventions we will develop.

10_ML/
├── 10_introduction.ipynb
├── 10_1_data_prep.ipynb
├── 10_2_NB_model_construction_workflow.ipynb
├── 10_3_model_eval_interpretation.ipynb
├── 10_4_pipelines_and_inference.ipynb
├── 10_5_model_comparison_exp_design.ipynb
├── data/                      ← algorithm-agnostic artifacts
│   └── AID743139/
│       ├── raw/               ← raw PubChem exports
│       │   └── AID743139_pubchem_raw_20251224.csv
│       ├── curated/           ← chemically curated datasets
│       │   └── AID743139_Activity_CID_20260111_v1.csv
│       ├── features/          ← feature-level artifacts (post-representation)
│       │   ├── AID743139_MACCS_activities_noSalt_20260104_v1.csv
│       │   ├── maccs_variance_mask_20260104_v1.npy
│       │   └── feature_metadata_20260104_v1.json
│       └── splits/            ← experimental splits
│           └── 90_10/
│               ├── arrays/
│               │   ├── X_train.npy
│               │   ├── X_test.npy
│               │   ├── y_train.npy
│               │   └── y_test.npy
│               └── split_metadata.json
│
├── models/                    ← algorithm-specific artifacts
│   └── AID743139/
│       ├── nb_maccs_20260109_v1.joblib
│       ├── dt_maccs_20260110_v1.joblib
│       └── rf_maccs_20260112_v1.joblib
│
└── results/                   ← evaluation outputs
    └── AID743139/
        ├── nb/
        ├── dt/
        └── comparison_tables/
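
One way such a naming helper might look. This is a hypothetical sketch of the `<AID>_<content>_<YYYYMMDD>_v<version>.<ext>` pattern seen in the file names above, not the actual package we will build.

```python
from datetime import date
from pathlib import Path

def artifact_name(aid, content, version=1, ext="csv", when=None):
    """Build a file name encoding assay ID, content, date stamp, and version.
    Hypothetical helper illustrating the convention used in this module."""
    stamp = (when or date.today()).strftime("%Y%m%d")
    return f"{aid}_{content}_{stamp}_v{version}.{ext}"

# Reconstruct one of the curated-file paths from the tree above
curated = Path("data") / "AID743139" / "curated" / artifact_name(
    "AID743139", "Activity_CID", version=1, when=date(2026, 1, 11)
)
print(curated.as_posix())
# data/AID743139/curated/AID743139_Activity_CID_20260111_v1.csv
```

Encoding the date and version in the name means two artifacts generated from different inputs can never silently overwrite one another.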

3.3 Path-Based Directory Structure#

Path.cwd() returns the current working directory of the Python kernel. When you open a notebook in a directory such as ~/cinf2026book/content/modules/10_SupervisedML, Jupyter starts the kernel in that folder, and it becomes the notebook’s current working directory (cwd).

In this class each module is self-contained:

  • The notebooks live directly inside the module folder

  • Data, models, and results are stored in subdirectories of that same folder

This means the module directory itself acts as the project root, so we can set up the paths for each notebook by running the following cell at the top of the notebook.

from pathlib import Path

# Module-level project root (where this notebook lives)
PROJECT_ROOT = Path.cwd()

DATA_DIR    = PROJECT_ROOT / "data"
MODELS_DIR  = PROJECT_ROOT / "models"
RESULTS_DIR = PROJECT_ROOT / "results"

# Create directories if needed
for d in (DATA_DIR, MODELS_DIR, RESULTS_DIR):
    d.mkdir(parents=True, exist_ok=True)

We will then build everything else up from there.
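
From these roots, deeper per-assay paths can be built the same way. The directory names follow the hierarchy sketched earlier; `AID` here is just an illustrative variable.

```python
from pathlib import Path

# Assumes the same root setup as the previous cell
PROJECT_ROOT = Path.cwd()
DATA_DIR = PROJECT_ROOT / "data"

AID = "AID743139"
RAW_DIR     = DATA_DIR / AID / "raw"      # raw PubChem exports
CURATED_DIR = DATA_DIR / AID / "curated"  # chemically curated datasets

# Create the per-assay folders if they do not exist yet
for d in (RAW_DIR, CURATED_DIR):
    d.mkdir(parents=True, exist_ok=True)

print(RAW_DIR.relative_to(PROJECT_ROOT).as_posix())  # data/AID743139/raw
```

Because every path is derived from `PROJECT_ROOT`, moving the whole module folder never breaks a notebook, and no hard-coded absolute paths are needed.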

4. Biological and Data Context for This Study#

In this module, we will develop a supervised machine-learning model to predict the inhibitory activity of small molecules against human Aromatase (Cytochrome P450 19A1, CYP19A1) using real experimental data from PubChem. The goal is to connect molecular structure to measured biological activity, and to understand how experimental assay data can be transformed into predictive computational models.

Students will work with bioactivity results from the Tox21 Aromatase assay (AID 743139) and pair those outcomes with molecular fingerprints generated from each compound’s SMILES representation. By combining cheminformatics representations, curated public bioassay data, and supervised learning methods, we will build models that classify compounds based on their likelihood of inhibiting Aromatase activity.

4.1 Cytochrome P450 and Aromatase#

Cytochrome P450 enzymes—collectively referred to as P450s—form a large and ancient superfamily of heme-containing monooxygenases found across all domains of life. The name “P450” originates from a characteristic absorbance peak at 450 nm observed when the reduced enzyme is bound to carbon monoxide, reflecting the unique thiolate ligation of the heme iron by a conserved cysteine residue.

Biologically, P450 enzymes catalyze oxidative reactions that insert one atom of molecular oxygen into a substrate while reducing the other atom to water. Their substrates include steroids, fatty acids, retinoids, and xenobiotics, making them central to hormone biosynthesis, detoxification pathways, and drug metabolism.

Among human P450 enzymes, Aromatase (CYP19A1) plays a particularly important role. Aromatase catalyzes the final, rate-limiting step in estrogen biosynthesis by converting androgens such as testosterone and androstenedione into estrogens. Dysregulation of this process is implicated in hormone-dependent cancers, and Aromatase is therefore a major target in drug discovery and clinical therapy. Aromatase inhibition is both a therapeutic objective and a toxicological concern. Several FDA-approved drugs (e.g., anastrozole, letrozole, exemestane) intentionally inhibit Aromatase to treat estrogen-dependent breast cancer. This makes Aromatase an ideal biological target for introducing machine learning using real experimental data.

4.2 Aromatase apo and holo crystal structures#

In the following code cell, we will download crystal structures of Aromatase from the RCSB Protein Data Bank (PDB). The holo structures contain bound ligands, whereas the apo structure represents the ligand-free protein. You can substitute different PDB IDs in the code to explore how various ligands interact with the protein.

| Purpose | PDB ID | Description |
|---------|--------|-------------|
| Native ligand bound | 3EQM | Aromatase + androstenedione (true substrate) |
| Drug-bound (letrozole) | 4KQ8 | Aromatase + clinical inhibitor |
| Drug-bound (exemestane) | 3S7S | Aromatase + steroidal inhibitor |
| Apo structure | 1C8K | Older structure, no ligand |

import requests
import py3Dmol

# Aromatase structure example: 3EQM contains its native substrate (androstenedione, AND)
pdb_id = "3EQM"
url = f"https://files.rcsb.org/download/{pdb_id}.cif"

# Download the structure in mmCIF format (modern PDB standard)
response = requests.get(url)
response.raise_for_status()
cif_block = response.text

# Create viewer
view = py3Dmol.view(width=600, height=400)
view.addModel(cif_block, "cif")

# Cartoon for protein
view.setStyle({"cartoon": {"color": "spectrum"}})

# Sticks for ligands (hetflag=True catches ligands, heme, drugs, etc.)
view.setStyle({"hetflag": True}, {"stick": {}})

# Optional: highlight the heme group in Aromatase
view.setStyle({"resn":"HEM"}, {"stick":{"colorscheme":"greenCarbon"}})

view.zoomTo()
view.show()


4.3 The Tox21 Aromatase Bioassay and Its Data Context#

The bioactivity data used in this module are drawn from the Tox21 Aromatase inhibition assay, archived in PubChem as BioAssay AID 743139 (https://pubchem.ncbi.nlm.nih.gov/bioassay/743139).

This assay is part of the Tox21 (Toxicology in the 21st Century) program, a multi-agency U.S. government initiative involving the National Center for Advancing Translational Sciences (NCATS), the National Toxicology Program (NTP), the Environmental Protection Agency (EPA), and the Food and Drug Administration (FDA). The goal of Tox21 is to modernize toxicity testing by replacing slow, animal-based assays with mechanism-driven, quantitative high-throughput screening (qHTS) combined with computational modeling.

4.3.1 PubChem, Tox21, and the Molecular Libraries Project#

To understand why these data are publicly available and so well suited for machine learning, it is helpful to understand their origin.

The infrastructure that makes datasets like AID 743139 possible was established by the NIH Molecular Libraries Initiative (MLI). Launched in the early 2000s, MLI was a landmark public-science effort designed to generate large-scale high-throughput screening data and release it openly. As part of this effort, the NIH Molecular Libraries Small Molecule Repository (MLSMR) was created, and PubChem was established as the central public database for chemical structures and HTS bioassay results.

The NIH Chemical Genomics Center (NCGC), which later became part of NCATS, developed the robotic qHTS platforms, data standards, and analysis pipelines that are still used today. When the Tox21 program was established, it directly built upon this infrastructure, reusing compound libraries, screening technology, and data dissemination pathways. PubChem therefore serves as the common data commons linking MLI-era screening, Tox21 toxicology assays, and ongoing community data deposition. This historical continuity is why PubChem contains thousands of large, well-annotated qHTS datasets that can be used directly for supervised learning.

4.3.2 What the Tox21 Aromatase Assay Measures#

Endocrine-disrupting chemicals (EDCs) interfere with the biosynthesis and normal function of steroid hormones, including estrogens and androgens. Aromatase (CYP19A1) catalyzes the conversion of androgens into estrogens and plays a central role in maintaining hormonal balance in many EDC-sensitive tissues.

The Tox21 Aromatase screening campaign used a human breast carcinoma cell line (MCF-7 aro ERE) that was stably transfected with a luciferase reporter under the control of estrogen response elements (EREs). In this system:

  • Aromatase activity produces estrogens from androgen precursors

  • Estrogens activate the estrogen receptor

  • Estrogen receptor activation drives luciferase expression

Compounds that inhibit Aromatase reduce estrogen production and therefore decrease the reporter signal. This cell-based design allows Aromatase inhibition to be detected in a biologically relevant cellular context, rather than through a purified enzyme assay alone.

4.3.3 Summary Assays and Component Assays (AID 743139)#

The dataset used in this module (AID 743139) is a summary assay, not a single experimental screen. It integrates results from two underlying component assays:

  • AID 743083 – Aromatase antagonist mode assay

  • AID 743084 – Cell viability counter screen

The purpose of this design is to distinguish true Aromatase inhibition from other effects such as cytotoxicity. The cell viability counter screen measures whether a compound simply kills or damages cells. A compound that reduces luciferase signal because the cells are no longer viable should not be interpreted as a genuine Aromatase inhibitor. By combining these assays, the summary dataset attempts to retain biologically meaningful inhibitors while flagging ambiguous or artifactual results.

In AID 743139:

  • Inactive compounds are assigned a PUBCHEM_ACTIVITY_SCORE of 0

  • Active compounds show clear antagonist behavior and receive scores between 40 and 100, based on potency and efficacy

  • Inconclusive compounds form a heterogeneous category that may include active agonists (biologically real, but opposite in direction), cytotoxic compounds, and compounds with ambiguous or atypical dose–response curves, often reflecting borderline activity or potential assay interference

Inconclusive outcomes are common in qHTS datasets and must be handled deliberately during data preparation.
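
One simple way to encode these scoring rules during data preparation is sketched below. Dropping inconclusives is only one of several reasonable choices, and the score values shown are hypothetical.

```python
# Map PUBCHEM_ACTIVITY_SCORE to binary labels per the rules above:
# score 0 -> inactive, 40-100 -> active, anything else -> inconclusive.
def score_to_label(score):
    if score == 0:
        return 0       # inactive
    if 40 <= score <= 100:
        return 1       # active
    return None        # inconclusive: excluded from training here

scores = [0, 85, 15, 40, 100, 7]  # hypothetical score values
labeled = [(s, score_to_label(s)) for s in scores]
kept = [(s, lab) for s, lab in labeled if lab is not None]
print(kept)  # [(0, 0), (85, 1), (40, 1), (100, 1)]
```

Making the rule an explicit function keeps the inclusion criteria auditable: anyone rerunning the workflow can see exactly which compounds were excluded and why.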

4.3.4 Data Quality, Interpretation, and Caution#

Although these data are high quality and professionally curated, they are not error-free. As noted in the assay documentation, artifacts can arise from:

  • Nonspecific signal interference

  • Compound fluorescence or quenching

  • Cytotoxicity effects

  • Limitations of curve-fitting models

The activity calls and curve fits archived in PubChem reflect the NCATS analysis pipeline, but alternative interpretations and reanalyses exist and are available through EPA and NTP resources. This reality underscores an important lesson for this module: machine learning models inherit the assumptions and limitations of the experimental data they are trained on.

Despite these caveats, the Tox21 Aromatase dataset is exceptionally well suited for teaching supervised machine learning:

  • It contains thousands of unique chemical structures

  • Each compound is explicitly labeled

  • The assay targets a clearly defined biological mechanism

  • The data are public, citable, and reproducible

  • The class imbalance and inconclusive outcomes reflect real scientific data, not toy examples

In the following modules, we will treat this dataset as a reference case study. Students will then be asked to identify and analyze a different PubChem BioAssay of their choosing, applying the same computational workflow to a new biological context.

Homework

You will identify a PubChem bioassay to work up through future modules and write a 2-3 page description of the assay. The assay must be approved by your instructor, and you should use the PubChem AID Programmatic Selector Program located in the appendices folder (/appendices/A_10_AID_Selector.ipynb) to identify an assay suitable for this module. Your write-up should have the following sections:
  1. Summary Assay Overview - Describe the biological target of the assay, the biological relevance of the target, and the experimental method used to measure activity. Identify the actual component assays that the summary page describes.
  2. Data Summary - How many compounds were tested, and how many were active, inactive, and inconclusive?
  3. Data Quality and Interpretation - Are there any known issues with the data? Are there any special considerations when interpreting activity calls?