10.1: Data Preparation and Feature Engineering#

Learning Objectives

Purpose Introduce students to the data preparation steps needed to create a dataset suitable for machine learning, and to feature engineering and the generation of binary fingerprints from SMILES strings.

Students Learn

  • How to obtain BioAssay data from PubChem

  • How to curate the data

  • How to save data and maintain the data’s provenance

Core Activities

  • Load PubChem Data

  • Curate compounds (salts, largest fragments, conflicts…)

  • Generate fingerprints

  • Save compounds and metadata

Prior Knowledge

  • Programmatic access to PubChem

  • Familiarity with SMILES and InChI

  • Familiarity with RDKit fingerprint generation

  • Familiarity with Pandas

Reading links

1. Import bioactivity data from PubChem#

In this notebook, we will develop a prediction model for a small molecule’s activity against human aromatase (https://pubchem.ncbi.nlm.nih.gov/protein/EAW77416), which is encoded by the CYP19A1 gene (https://pubchem.ncbi.nlm.nih.gov/gene/1588). The model will predict the activity of a molecule based on the structure of the molecule (represented with molecular fingerprints).

Endocrine disrupting chemicals (EDCs) interfere with the biosynthesis and normal functions of steroid hormones including estrogen and androgen in the body. Aromatase catalyzes the conversion of androgen to estrogen and plays a key role in maintaining the androgen and estrogen balance in many of the EDC-sensitive organs.

For model development, we will use the Tox21 bioassay data for human aromatase, archived in PubChem (https://pubchem.ncbi.nlm.nih.gov/bioassay/743139). The bioactivity data presented on that page can be downloaded by clicking the “Download” button and then reading the file into a data frame. Alternatively, you can load the data directly into a data frame as shown in the cell below.

import pandas as pd
import numpy as np

url = 'https://pubchem.ncbi.nlm.nih.gov/assay/pcget.cgi?query=download&record_type=datatable&actvty=all&response_type=save&aid=743139'
df_raw = pd.read_csv(url)
df_raw.shape
(10489, 16)

1.1 Directory Architecture#

In this activity we are going to generate multiple fingerprints and eventually run different ML programs on different BioAssays. Below is a sample directory hierarchy for each bioassay; in this class it will live within your module working directory (10_SupervisedML).

10_SupervisedML/
├── 10_introduction.ipynb
├── 10_1_data_prep.ipynb
├── 10_2_NB_model_construction_workflow.ipynb
├── 10_3_model_eval_interpretation.ipynb
├── 10_4_pipelines_and_inference.ipynb
├── 10_5_model_comparison_exp_design.ipynb
├── data/                      ← algorithm-agnostic artifacts
│   └── AID743139/
│       ├── raw/               ← raw PubChem exports
│       │   └── AID743139_pubchem_raw_20251224.csv
│       ├── curated/           ← chemically curated datasets
│       │   └── AID743139_Activity_CID_20260111_v1.csv
│       ├── features/          ← feature-level artifacts (post-representation)
│       │   ├── AID743139_MACCS_activities_noSalt_20260104_v1.csv
│       │   ├── maccs_variance_mask_20260104_v1.npy
│       │   └── feature_metadata_20260104_v1.json
│       └── splits/            ← experimental splits
│           └── 90_10/
│               ├── arrays/
│               │   ├── X_train.npy
│               │   ├── X_test.npy
│               │   ├── y_train.npy
│               │   └── y_test.npy
│               └── split_metadata.json
│
├── models/                    ← algorithm-specific artifacts
│   └── AID743139/
│       ├── nb_maccs_20260109_v1.joblib
│       ├── dt_maccs_20260110_v1.joblib
│       └── rf_maccs_20260112_v1.joblib
│
├── results/                   ← evaluation outputs
│   └── AID743139/
│       ├── nb/
│       ├── dt/
│       └── comparison_tables/
│

1.2 Helper Function: make_fixed_filename#

In the following code cell we are going to use the helper function make_fixed_filename to save this data in the raw data folder of the directory structure above. These helper functions are provided in the cinf26pk package we have been building in this class, and the code is provided in the green box below. If you do not have access to the package, you can substitute its code for the import statement in the cell below.

Note, we will be using two helper functions for saving files and both will generate the directory architecture based on your current working directory:

  • make_fixed_filename - for deterministic filenames with a date stamp. This lets us detect whether the data file at PubChem has changed in the future, but does not write a new version if we re-download it multiple times on the same day.

  • make_filename - for versioned filenames with a date stamp. This lets us save our curated data and features with version numbers so we can experiment without losing our work.

Helper Function: make_fixed_filename

The following function was imported from the class package and used in the preceding script

Code

The `make_fixed_filename` helper function was created in the class package pvhelpfct.py. It produces deterministic names (in contrast to versioned names), but does add a date stamp to the filename. We use this helper function for data sets we download. After we work them up, we use the `make_filename` helper function, which versions the filename. Below is the code for the function, although we will import it from the class package in this example.

from datetime import datetime

def make_fixed_filename(
    stem: str,
    ext: str = "csv",
    add_date: bool = True
) -> str:
    """
    Create a deterministic filename of the form:
    stem[_YYYYMMDD].ext

    Intended for canonical or authoritative data artifacts
    that should not auto-version.
    """
    if add_date:
        date_str = datetime.now().strftime("%Y%m%d")
        return f"{stem}_{date_str}.{ext}"
    else:
        return f"{stem}.{ext}"

Note: For the following code cell to generate the above directory structure, your notebook must be run from the 10_SupervisedML folder. This code will create the data/AID743139/raw directory path if it does not exist. If you are working with a different bioassay you need to change the AID in the path to the new one.

from pathlib import Path
from cinf26pk.core import make_fixed_filename

RAW = Path("data/AID743139/raw")
RAW.mkdir(parents=True, exist_ok=True)

fname = make_fixed_filename(
    stem="AID743139_pubchem_raw",
    ext="csv",
    add_date=True
)

df_raw.to_csv(RAW / fname, index=False)
Python Check

The above code saves the raw PubChem dataset to a well-defined location using a consistent naming policy.

Explain the code.

Answer

There are two things going on here.

  • We are creating a path
  • We are dating our filenames
The first thing we need to do is understand the following script


PROJECT_ROOT = Path.cwd() # project root (10_SupervisedML)
RAW = PROJECT_ROOT / "data" / "AID743139" / "raw"
RAW.mkdir(parents=True, exist_ok=True)
  

  • The first line defines the project root as the current working directory. Since this notebook is run from the 10_SupervisedML directory, that directory becomes the project root. We capitalize PROJECT_ROOT because it represents an invariant configuration value.
  • The second line constructs the path from the project root to the directory where the raw data will be stored.
  • The third line creates the directory (and any missing parent directories) if they do not already exist, making the code safe to rerun.
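The `/` operator on Path objects is what makes the second line work; a minimal illustration (the directory names here are just the ones from this project):

```python
from pathlib import Path

root = Path("10_SupervisedML")
raw = root / "data" / "AID743139" / "raw"  # the / operator joins path segments
print(raw.as_posix())                      # 10_SupervisedML/data/AID743139/raw
```

Unlike string concatenation, Path joining handles separators for you and works the same on Windows and Unix.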

The helper function make_fixed_filename is then used to construct the output filename:

Create deterministic filename with date stamp

from cinf26pk.core import make_fixed_filename

fname = make_fixed_filename(
    stem="AID743139_pubchem_raw",
    ext="csv",
    add_date=True
)

This uses the helper function `make_fixed_filename`, which is described in the next dropdown box

Finally, the DataFrame is written to disk:


df_raw.to_csv(RAW / fname, index=False)
  
  • Use named path objects to represent meaningful locations in the workflow, rather than embedding directory structures inside string literals.
  • Use helper functions to enforce consistent naming policies for saved artifacts, so filenames encode intent rather than ad-hoc decisions.
  1. Restart the kernel and load the saved data into a dataframe called df_raw (see green box below for help)

#Write Code here 
import pandas as pd
from pathlib import Path

RAW = Path("data/AID743139/raw")
RAW.mkdir(parents=True, exist_ok=True)

df_raw = pd.read_csv(RAW / "AID743139_pubchem_raw_20260117.csv")
df_raw.head(7)  # gives the first 7 rows of the dataframe
PUBCHEM_RESULT_TAG PUBCHEM_SID PUBCHEM_CID PUBCHEM_EXT_DATASOURCE_SMILES PUBCHEM_ACTIVITY_OUTCOME PUBCHEM_ACTIVITY_SCORE PUBCHEM_ACTIVITY_URL PUBCHEM_ASSAYDATA_COMMENT Activity Summary Antagonist Activity Antagonist Potency (uM) Antagonist Efficacy (%) Viability Activity Viability Potency (uM) Viability Efficacy (%) Sample Source
0 RESULT_TYPE NaN NaN NaN NaN NaN NaN NaN STRING STRING FLOAT FLOAT STRING FLOAT FLOAT STRING
1 RESULT_DESCR NaN NaN NaN NaN NaN NaN NaN Type of compound activity based on both the ar... Type of compound activity in the aromatase ant... The concentration of sample yielding half-maxi... Percent inhibition of aromatase. Type of compound activity in the cell viabilit... The concentration of sample yielding half-maxi... Percent inhibition of cell viability. Where sample was obtained.
2 RESULT_UNIT NaN NaN NaN NaN NaN NaN NaN NaN NaN MICROMOLAR PERCENT NaN MICROMOLAR PERCENT NaN
3 1 144203552.0 12850184.0 C(C(=O)[C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O.C(... Inactive 0.0 NaN NaN inactive inactive NaN 0 inactive NaN 0 NCI
4 2 144203553.0 89753.0 C([C@H]([C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O)O... Inactive 0.0 NaN NaN inactive inactive NaN 0 inactive NaN 0 NCI
5 3 144203554.0 9403.0 C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@H]2OC(=O)CCC... Inactive 0.0 NaN NaN inactive inactive NaN 0 inactive NaN 0 NCI
6 4 144203555.0 13218779.0 C[C@@]12CC[C@@H](C1(C)C)C[C@H]2OC(=O)CSC#N Inactive 0.0 NaN NaN inactive inactive NaN 0 inactive NaN 0 NCI
Pandas Check

How do you open the raw csv file in Pandas?

Answer There are two approaches: use the predefined path variable, or write the entire path as a string. For the first to work you must define the path, which was done in a preceding cell. Of course, you must use the filename of the file you saved.

import pandas as pd
df_raw = pd.read_csv(RAW / "AID743139_pubchem_raw_20251230.csv")
#df_raw = pd.read_csv("data/AID743139/raw/AID743139_pubchem_raw_20251230.csv")  
Note, if you hard code the path as a string, it is written relative to the notebook's working directory (`10_SupervisedML`), so it starts with `data/` and descends into the `10_SupervisedML/data` directory, just as the RAW path object does.
  1. Display the first 5 rows of the df_raw

#Write Code here 
import pandas as pd
df_raw.head() 
PUBCHEM_RESULT_TAG PUBCHEM_SID PUBCHEM_CID PUBCHEM_EXT_DATASOURCE_SMILES PUBCHEM_ACTIVITY_OUTCOME PUBCHEM_ACTIVITY_SCORE PUBCHEM_ACTIVITY_URL PUBCHEM_ASSAYDATA_COMMENT Activity Summary Antagonist Activity Antagonist Potency (uM) Antagonist Efficacy (%) Viability Activity Viability Potency (uM) Viability Efficacy (%) Sample Source
0 RESULT_TYPE NaN NaN NaN NaN NaN NaN NaN STRING STRING FLOAT FLOAT STRING FLOAT FLOAT STRING
1 RESULT_DESCR NaN NaN NaN NaN NaN NaN NaN Type of compound activity based on both the ar... Type of compound activity in the aromatase ant... The concentration of sample yielding half-maxi... Percent inhibition of aromatase. Type of compound activity in the cell viabilit... The concentration of sample yielding half-maxi... Percent inhibition of cell viability. Where sample was obtained.
2 RESULT_UNIT NaN NaN NaN NaN NaN NaN NaN NaN NaN MICROMOLAR PERCENT NaN MICROMOLAR PERCENT NaN
3 1 144203552.0 12850184.0 C(C(=O)[C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O.C(... Inactive 0.0 NaN NaN inactive inactive NaN 0 inactive NaN 0 NCI
4 2 144203553.0 89753.0 C([C@H]([C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O)O... Inactive 0.0 NaN NaN inactive inactive NaN 0 inactive NaN 0 NCI
Pandas Check

Why are the CID and SID columns floats and not integers

Answer

Pandas infers column data types when a CSV file is read. Even though CIDs and SIDs are integers, the first few rows contain missing values (there is no CID or SID), so Pandas inserts NaN (Not a Number), which is an IEEE floating-point value that allows Pandas to represent missing numbers. As a result, Pandas promotes the column to float64 in order to accommodate NaN. If all values were present, the column would be inferred as integers. Once those rows are removed, the columns can safely be converted to integers using the .astype(int) method. This conversion will fail if any values are still missing (NaN). For more information see Section 4.3 Handling Missing Data in the Py4Sci Pandas Chapter.

Note that a column in a DataFrame is actually a Pandas Series, and the column label is the Series name. A DataFrame can be viewed as a table of Series objects that all share a common row index. Therefore, the following operation is calling a Series method on a column, and this could convert the values to integers, but will not work if one of the values is NaN.

df['column-label'] = df['column-label'].astype(int)

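The promotion described above can be reproduced on a toy CSV (the column names here are simplified, not the actual PubChem headers):

```python
import pandas as pd
from io import StringIO

# row 2 has no CID, so the column cannot stay integer
csv_text = "cid,label\n12850184,Inactive\n,Inconclusive\n89753,Inactive\n"
df = pd.read_csv(StringIO(csv_text))

print(df["cid"].dtype)   # float64: the missing value forced promotion

# after dropping rows with missing CIDs, the integer conversion is safe
clean = df.dropna(subset=["cid"])
print(clean["cid"].astype(int).tolist())
```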

2 Data Cleaning#

Data cleaning is a critical step in our workflow that ensures the quality, consistency, and reliability of a dataset before we build a model. Raw data often contains missing values, duplicates, inconsistent formatting, or mislabeled entries that can lead to incorrect conclusions or poorly performing models. By identifying and resolving these issues at the start of our model design, we create a dataset that accurately reflects the information needed for meaningful interpretation and predictive modeling.

2.1 Inspect data and remove first 3 rows#

Looking at our imported data, lines 0-2 provide the descriptions for each column (data type, descriptions, units, etc). These rows need to be removed. We will do so with a slice operation to select only those rows that contain the data we will use in the model (See Py4Sci: using .iloc on a dataframe.)

df_curated = df_raw.iloc[3:].copy()  # [3:] is a slice operation on pandas dataframes.
# It selects all rows starting at position 3 through the end; we then save the
# result to a new dataframe, effectively removing rows 0, 1 and 2.
df_curated.head(5)
PUBCHEM_RESULT_TAG PUBCHEM_SID PUBCHEM_CID PUBCHEM_EXT_DATASOURCE_SMILES PUBCHEM_ACTIVITY_OUTCOME PUBCHEM_ACTIVITY_SCORE PUBCHEM_ACTIVITY_URL PUBCHEM_ASSAYDATA_COMMENT Activity Summary Antagonist Activity Antagonist Potency (uM) Antagonist Efficacy (%) Viability Activity Viability Potency (uM) Viability Efficacy (%) Sample Source
3 1 144203552.0 12850184.0 C(C(=O)[C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O.C(... Inactive 0.0 NaN NaN inactive inactive NaN 0 inactive NaN 0 NCI
4 2 144203553.0 89753.0 C([C@H]([C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O)O... Inactive 0.0 NaN NaN inactive inactive NaN 0 inactive NaN 0 NCI
5 3 144203554.0 9403.0 C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@H]2OC(=O)CCC... Inactive 0.0 NaN NaN inactive inactive NaN 0 inactive NaN 0 NCI
6 4 144203555.0 13218779.0 C[C@@]12CC[C@@H](C1(C)C)C[C@H]2OC(=O)CSC#N Inactive 0.0 NaN NaN inactive inactive NaN 0 inactive NaN 0 NCI
7 5 144203556.0 142766.0 C1=CC=C(C(=C1)C(=O)O)O.C1=CC2=C(C(=C1)O)N=CC=C2 Inconclusive 25.0 NaN NaN inconclusive antagonist (cytotoxic) active antagonist 15.5454 -115.803 active antagonist 14.9601 -76.8218 NCI

Now that we have removed the first three rows we can convert the CIDs and SIDs to integers using Pandas nullable "Int64" type that allows NaN (this is not int64, which does not accept null values).
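The difference between the nullable "Int64" type and plain int64 can be checked on a toy Series:

```python
import pandas as pd

s = pd.Series([144203552.0, float("nan"), 144203554.0])

nullable = s.astype("Int64")   # works: the missing value becomes <NA>
print(nullable.isna().sum())   # 1

try:
    s.astype("int64")          # fails: NaN has no integer representation
except ValueError as e:
    print("int64 cast failed:", type(e).__name__)
```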

df_curated["PUBCHEM_SID"] = df_curated["PUBCHEM_SID"].astype("Int64")
df_curated["PUBCHEM_CID"] = df_curated["PUBCHEM_CID"].astype("Int64")
df_curated.head()
PUBCHEM_RESULT_TAG PUBCHEM_SID PUBCHEM_CID PUBCHEM_EXT_DATASOURCE_SMILES PUBCHEM_ACTIVITY_OUTCOME PUBCHEM_ACTIVITY_SCORE PUBCHEM_ACTIVITY_URL PUBCHEM_ASSAYDATA_COMMENT Activity Summary Antagonist Activity Antagonist Potency (uM) Antagonist Efficacy (%) Viability Activity Viability Potency (uM) Viability Efficacy (%) Sample Source
3 1 144203552 12850184 C(C(=O)[C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O.C(... Inactive 0.0 NaN NaN inactive inactive NaN 0 inactive NaN 0 NCI
4 2 144203553 89753 C([C@H]([C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O)O... Inactive 0.0 NaN NaN inactive inactive NaN 0 inactive NaN 0 NCI
5 3 144203554 9403 C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@H]2OC(=O)CCC... Inactive 0.0 NaN NaN inactive inactive NaN 0 inactive NaN 0 NCI
6 4 144203555 13218779 C[C@@]12CC[C@@H](C1(C)C)C[C@H]2OC(=O)CSC#N Inactive 0.0 NaN NaN inactive inactive NaN 0 inactive NaN 0 NCI
7 5 144203556 142766 C1=CC=C(C(=C1)C(=O)O)O.C1=CC2=C(C(=C1)O)N=CC=C2 Inconclusive 25.0 NaN NaN inconclusive antagonist (cytotoxic) active antagonist 15.5454 -115.803 active antagonist 14.9601 -76.8218 NCI
# Check to see if there are any null (NaN) values.
df_curated[["PUBCHEM_SID", "PUBCHEM_CID"]].isna().sum()
PUBCHEM_SID      0
PUBCHEM_CID    149
dtype: int64
Check Your Understanding

Why are there 149 NaN CIDs, and no NaN SIDs?

Answer Not all SIDs could be resolved into a CID. That is, a depositor submits a substance, but PubChem's standardization process may not have been able to resolve that substance into an identified compound with a known structure. Thus the value remains null.
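When we later restrict the dataset to rows with a resolved CID, the operation will look like the following sketch (toy data with simplified column names, not the actual curation step):

```python
import pandas as pd

df = pd.DataFrame({
    "sid": [144203552, 144203553, 144203554],
    "cid": [12850184, None, 9403],
})

# keep only substances that resolved to a compound
usable = df.dropna(subset=["cid"])
print(len(usable))   # 2
```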

2.2 Save Versioned DataFrame as CSV#

You now want to save this as a file in your curated folder. Because you will be working on and experimenting with this file, you want to use the versioning helper function. You could just write the inline command df_curated.to_csv("Path-To-Directory/filename.csv", index=False), but we are developing a project that can have many files, so we need to set up protocols that use path objects and embed dates and versions in the file name.

2.2.1 Helper Function: make_filename#

In the following code cell we are going to use the helper function make_filename to save this data with a dated version number appended to the file name. Like make_fixed_filename, we will import this from the cinf26pk package. If you do not have access to the package you can substitute its code (in the green box below) for the import statement in the code cell where it is called.

Helper Function: make_filename

The following function was imported from the class package and used in the preceding script

Code

The `make_filename` helper function was created in the class package pvhelpfct.py. It is for versioning file names with a date stamp. This is useful when you are working up data, as it allows you to save multiple versions of the file as you alter it. That is, you may remove something, then realize you should not have removed it, and you can go back and recover the original file.


from pathlib import Path
from datetime import datetime

def make_filename(prefix: str, ext: str = "csv", folder: str = "downloads") -> str:
    """
    Create a unique versioned filename of the form:
    prefix_YYYYMMDD_vN.ext
    """
    folder_path = Path(folder)
    folder_path.mkdir(exist_ok=True)

    date_str = datetime.now().strftime("%Y%m%d")
    base = f"{prefix}_{date_str}"

    existing = [
        f for f in folder_path.iterdir()
        if f.name.startswith(base) and f.suffix == f".{ext}"
    ]

    version = len(existing) + 1
    filename = f"{base}_v{version}.{ext}"

    return str(folder_path / filename)
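To see the versioning behavior, the sketch below redefines the function locally (so it runs without the class package) and calls it twice in a temporary folder:

```python
from pathlib import Path
from datetime import datetime
import tempfile

def make_filename(prefix: str, ext: str = "csv", folder: str = "downloads") -> str:
    # versioned name: prefix_YYYYMMDD_vN.ext (same logic as above)
    folder_path = Path(folder)
    folder_path.mkdir(exist_ok=True)
    date_str = datetime.now().strftime("%Y%m%d")
    base = f"{prefix}_{date_str}"
    existing = [
        f for f in folder_path.iterdir()
        if f.name.startswith(base) and f.suffix == f".{ext}"
    ]
    version = len(existing) + 1
    return str(folder_path / f"{base}_v{version}.{ext}")

with tempfile.TemporaryDirectory() as tmp:
    first = make_filename("demo", folder=tmp)
    Path(first).touch()                  # simulate saving the first version
    second = make_filename("demo", folder=tmp)
    print(Path(first).name)              # demo_<date>_v1.csv
    print(Path(second).name)             # demo_<date>_v2.csv
```

Each save on the same day increments the version counter, so no earlier work is overwritten.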

Note: For the following code cell to generate the above directory structure, your notebook must be run from the 10_SupervisedML folder. This cell will generate the curated folder within the data/AID743139 path you created earlier. If you are working with a different bioassay you will need to change the AID to the new one.

from pathlib import Path
from cinf26pk.core import make_filename

# Generate path (relative to notebook)
CUR = Path("data/AID743139/curated")
CUR.mkdir(parents=True, exist_ok=True)

# Create versioned filename with date stamp (returns the full path as a string)
fpath = make_filename(
    prefix="AID743139_pc_curated",
    ext="csv",
    folder=str(CUR)
)

# Save curated dataset
df_curated.to_csv(fpath, index=False)

print(f"[Saved] Curated PubChem data → {fpath}")
[Saved] Curated PubChem data → data/AID743139/curated/AID743139_pc_curated_20260205_v1.csv

2.2.2 Reload saved file into dataframe#

Load your file into a new dataframe called df_curated, and use the path variable for this new path, in contrast to the string.

# Write code here: (see green box below for guidance)
# Load your saved csv file into a new pandas dataframe called df_curated
import pandas as pd
#from pathlib import Path


df_curated = pd.read_csv(CUR / "AID743139_pc_curated_20260205_v1.csv")
Pandas Check

How do you open the curated csv file in Pandas?

Answer There are two approaches: use the predefined path variable, or write the entire path as a string. For the first to work you must define the path, which was done in a preceding cell. Of course, you must use the filename of the file you saved. If you restarted the kernel you need to redefine the path by uncommenting the commented lines of code.

import pandas as pd
#from pathlib import Path

#PROJECT_ROOT = Path.cwd()  # project root (10_SupervisedML)
#CUR = PROJECT_ROOT / "data" / "AID743139" / "curated"

df_curated = pd.read_csv(CUR / "AID743139_pc_curated_20260118_v1.csv")

Inspect by displaying head

df_curated.head()
PUBCHEM_RESULT_TAG PUBCHEM_SID PUBCHEM_CID PUBCHEM_EXT_DATASOURCE_SMILES PUBCHEM_ACTIVITY_OUTCOME PUBCHEM_ACTIVITY_SCORE PUBCHEM_ACTIVITY_URL PUBCHEM_ASSAYDATA_COMMENT Activity Summary Antagonist Activity Antagonist Potency (uM) Antagonist Efficacy (%) Viability Activity Viability Potency (uM) Viability Efficacy (%) Sample Source
0 1 144203552 12850184.0 C(C(=O)[C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O.C(... Inactive 0.0 NaN NaN inactive inactive NaN 0.000 inactive NaN 0.0000 NCI
1 2 144203553 89753.0 C([C@H]([C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O)O... Inactive 0.0 NaN NaN inactive inactive NaN 0.000 inactive NaN 0.0000 NCI
2 3 144203554 9403.0 C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@H]2OC(=O)CCC... Inactive 0.0 NaN NaN inactive inactive NaN 0.000 inactive NaN 0.0000 NCI
3 4 144203555 13218779.0 C[C@@]12CC[C@@H](C1(C)C)C[C@H]2OC(=O)CSC#N Inactive 0.0 NaN NaN inactive inactive NaN 0.000 inactive NaN 0.0000 NCI
4 5 144203556 142766.0 C1=CC=C(C(=C1)C(=O)O)O.C1=CC2=C(C(=C1)O)N=CC=C2 Inconclusive 25.0 NaN NaN inconclusive antagonist (cytotoxic) active antagonist 15.5454 -115.803 active antagonist 14.9601 -76.8218 NCI
Check Your Understanding

Notice when you reload the dataframe from the csv the CIDs were converted back to floats, while the SIDs stayed as integers. Can you explain what is going on?

Explanation CSV is a plain-text format and stores values only. When Pandas opens a CSV file it tries to infer each column's data type; any column that has missing values gets NaN inserted and is inferred as a float. Since there were zero NaNs for the SIDs, those values were inferred as integers, but since there were 149 missing CIDs, those values were inferred as floats. This shows that we are too early in the pipeline to convert CIDs to integers; we will revisit this later.
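The round trip can be demonstrated without touching the disk, using an in-memory buffer in place of a file:

```python
import pandas as pd
from io import StringIO

df = pd.DataFrame({
    "sid": [144203552, 144203553],
    "cid": pd.array([12850184, None], dtype="Int64"),  # nullable integers
})

buf = StringIO()
df.to_csv(buf, index=False)      # CSV stores only the text of the values
buf.seek(0)

reloaded = pd.read_csv(buf)
print(reloaded["sid"].dtype)     # int64   (no missing values)
print(reloaded["cid"].dtype)     # float64 (the NaN forced promotion)
```

The "Int64" dtype is a property of the in-memory DataFrame, not of the CSV text, so it does not survive the save/load cycle.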

2.3 Inspect column names#

df_curated.columns #print out the column names. 
Index(['PUBCHEM_RESULT_TAG', 'PUBCHEM_SID', 'PUBCHEM_CID',
       'PUBCHEM_EXT_DATASOURCE_SMILES', 'PUBCHEM_ACTIVITY_OUTCOME',
       'PUBCHEM_ACTIVITY_SCORE', 'PUBCHEM_ACTIVITY_URL',
       'PUBCHEM_ASSAYDATA_COMMENT', 'Activity Summary', 'Antagonist Activity',
       'Antagonist Potency (uM)', 'Antagonist Efficacy (%)',
       'Viability Activity', 'Viability Potency (uM)',
       'Viability Efficacy (%)', 'Sample Source'],
      dtype='object')

2.3.1 Clean column names#

Some column names in this data frame are long and contain white spaces and special characters. Rename the columns (no spaces or special characters except for the “_” character) using a python dictionary as a map.

col_names_map = {'PUBCHEM_RESULT_TAG' : 'pc_result_tag', 
                 'PUBCHEM_SID' : 'sid', 
                 'PUBCHEM_CID' : 'cid',
                 'PUBCHEM_ACTIVITY_OUTCOME' : 'activity_outcome', 
                 'PUBCHEM_ACTIVITY_SCORE' : 'activity_score',
                 'PUBCHEM_ACTIVITY_URL' : 'activity_url', 
                 'PUBCHEM_ASSAYDATA_COMMENT' : 'assay_data_comment', 
                 'Activity Summary' : 'activity_summary',
                 'Antagonist Activity' : 'antagonist_activity', 
                 'Antagonist Potency (uM)' : 'antagonist_potency', 
                 'Antagonist Efficacy (%)' : 'antagonist_efficacy',
                 'Viability Activity' : 'viability_activity', 
                 'Viability Potency (uM)' : 'viability_potency',
                 'Viability Efficacy (%)' : 'viability_efficacy', 
                 'Sample Source' : 'sample_source' }
df_curated = df_curated.rename(columns = col_names_map)
df_curated.columns
Index(['pc_result_tag', 'sid', 'cid', 'PUBCHEM_EXT_DATASOURCE_SMILES',
       'activity_outcome', 'activity_score', 'activity_url',
       'assay_data_comment', 'activity_summary', 'antagonist_activity',
       'antagonist_potency', 'antagonist_efficacy', 'viability_activity',
       'viability_potency', 'viability_efficacy', 'sample_source'],
      dtype='object')
Check your understanding

Why is it important to rename the column names without spaces or special characters?

Answer

Renaming columns to remove spaces and special characters creates clean, consistent labels that are easier to use in Python. We do this by creating a dictionary that maps each original column name to a standardized version and then passing it to df.rename(columns=…). This ensures that columns can be accessed reliably with either bracket notation or dot notation (e.g., df['activity_score'] or df.activity_score) without causing syntax errors. Clean names also improve compatibility with functions, plotting libraries, and automated workflows that expect simple identifiers. In short, renaming columns makes your code more readable, robust, and less error-prone.
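For wide tables, the rename dictionary can also be generated programmatically. Below is a hypothetical sketch of such a normalizer (the function name snake_case is our own, not from the class package):

```python
import re

def snake_case(name: str) -> str:
    # replace runs of spaces and special characters with "_" and lowercase
    cleaned = re.sub(r"[^0-9A-Za-z]+", "_", name.strip())
    return cleaned.strip("_").lower()

print(snake_case("Antagonist Potency (uM)"))  # antagonist_potency_um
print(snake_case("Sample Source"))            # sample_source
# usage: df = df.rename(columns={c: snake_case(c) for c in df.columns})
```

Note that the automated names differ slightly from the hand-written map above (e.g., the unit suffix is kept), which is why a hand-curated dictionary is often preferred for small tables.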

Data Dive:

The following pop-up describes each column in the dataframe. It should be noted that BioAssay 743139 is a summary assay for measuring inhibition of aromatase, so a positive assay result means a chemical is an inhibitor (antagonist) of the enzyme. The assay also measures agonist activity, but agonists are not considered active in this assay; they are classified as inconclusive. In other words, only antagonists are considered active in this assay.

Explanation of the data

  • pc_result_tag: Internal PubChem result index. A simple numeric tag identifying the order of assay records. Not biologically meaningful.

  • sid (Substance ID): PubChem Substance ID; identifies the submitted sample tested in the assay (may map to multiple CIDs).

  • cid (Compound ID): PubChem Compound ID; normalized chemical structure identifier used across PubChem. Common structure-level identifier for cheminformatics.

  • PUBCHEM_EXT_DATASOURCE_SMILES: A SMILES string representing the chemical structure as submitted by the data source. May differ from canonical PubChem SMILES.

  • activity_outcome: High-level BioAssay classification: Active, Inactive, or Inconclusive. This is the main result label for each tested substance. This is a summary table, and the result depends on a biological threshold of the underlying assays.

  • activity_score: Numeric score assigned by PubChem summarizing activity strength; depends on assay scoring rules. Often 0 = inactive, higher = stronger activity.

  • activity_url: URL linking to additional PubChem assay information for this result. Frequently missing (NaN) in simplified datasets.

  • assay_data_comment: Optional textual note describing assay-specific observations (rare in many datasets).

  • activity_summary: More detailed subtype of activity_outcome (e.g., active antagonist, inconclusive agonist, inconclusive antagonist (cytotoxic)). Provides mechanism- or phenotype-level interpretation.

  • antagonist_activity: Qualitative label identifying whether the compound acted as an antagonist in the assay (active/inactive).

  • antagonist_potency: Numeric potency value for antagonist activity (often in μM). Indicates concentration required for effect.

  • antagonist_efficacy: Percent efficacy of antagonist behavior. Measures the magnitude of response relative to assay controls.

  • viability_activity: Qualitative label describing cell viability response (active/inactive). Helps detect cytotoxic compounds.

  • viability_potency: Potency value for viability-related effects (e.g., concentration at which viability decreases).

  • viability_efficacy: Percent efficacy for viability effects (extent of cell killing or protection).

  • sample_source: Origin of the tested sample (e.g., NCI). Refers to the organization or project that submitted the substance to PubChem.

3. Inspect Activity Outcomes#

First, we need to understand the structure of our data, particularly the activity class of the tested compounds, since our goal is to develop a model that classifies small molecules based on their activity against a target. This information is found in the activity_outcome and activity_summary columns of the dataframe. The activity outcome has three values: Active, Inactive, and Inconclusive.

To explore this, we use the groupby method, which splits the data into groups based on one or more column values, enabling us to perform calculations on each group independently. In the code cell below, we group by activity_outcome and use the count method to determine how many entries fall into each outcome category:

df_curated.groupby(['activity_outcome']).count()
pc_result_tag sid cid PUBCHEM_EXT_DATASOURCE_SMILES activity_score activity_url assay_data_comment activity_summary antagonist_activity antagonist_potency antagonist_efficacy viability_activity viability_potency viability_efficacy sample_source
activity_outcome
Active 379 379 378 378 379 0 0 379 379 378 379 379 115 359 379
Inactive 7562 7562 7466 7466 7562 0 0 7562 7562 0 7562 7562 324 7449 7562
Inconclusive 2545 2545 2493 2493 2545 0 0 2545 2545 2111 2136 2545 1206 2450 2545
Check your understanding

1. Explain this line of code: `df_curated.groupby(['activity_outcome']).count()`

Answer
  1. This represents two chained Pandas methods; .groupby() followed by .count().

    • .groupby('activity_outcome') splits the DataFrame into groups based on the category labels in the activity_outcome column; namely “Active,” “Inactive,” and “Inconclusive.”

    • .count() then counts how many non-missing values appear in each of the other columns within each category. This provides a quick summary of how many compounds fall into each activity class and reveals whether key fields (such as CID or SMILES) contain missing data within any category.

2. How many active, inactive and inconclusive compounds are there in the dataset based on the activity_outcome column?

Answer 2. Based on the data in the activity_outcome column, there are 379 actives, 7562 inactives, and 2545 inconclusive compounds in the dataset.

3. Scroll through the rest of the data. Compare these numbers to other columns such as CID and SMILES. What problems might arise if these counts are not consistent?

Answer 3. If the counts of compounds differ across columns (for example, if a compound has an activity_outcome but is missing a CID or SMILES), it means important information is missing. This causes problems such as:
  • Compounds with missing CIDs/SMILES cannot be used for structural analysis, descriptor generation, or machine learning.
  • The dataset can become biased if missing data are unevenly distributed across activity classes, which can affect model performance and interpretation.
  • Results become harder to interpret because activity labels may exist without the structural or potency data needed to contextualize them.
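The mismatch between row counts and non-missing counts is exactly what distinguishes `.size()` from `.count()` in a groupby; a toy illustration (simplified column names):

```python
import pandas as pd

df = pd.DataFrame({
    "activity_outcome": ["Active", "Active", "Inactive"],
    "cid": [12850184, None, 9403],   # one Active row lacks a CID
})

print(df.groupby("activity_outcome").size())          # rows per class
print(df.groupby("activity_outcome")["cid"].count())  # non-missing CIDs per class
```

Comparing the two outputs immediately shows which classes carry missing structural identifiers.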

3.1 Inspect Activity summaries#

If we group by both activity_outcome and activity_summary we get more insight into the data. In particular, it reveals that there are subcategories for “Inconclusive”, giving us a better understanding of how these results are classified.

df_curated.groupby(['activity_outcome','activity_summary']).count()
pc_result_tag sid cid PUBCHEM_EXT_DATASOURCE_SMILES activity_score activity_url assay_data_comment antagonist_activity antagonist_potency antagonist_efficacy viability_activity viability_potency viability_efficacy sample_source
activity_outcome activity_summary
Active active antagonist 379 379 378 378 379 0 0 379 378 379 379 115 359 379
Inactive inactive 7562 7562 7466 7466 7562 0 0 7562 0 7562 7562 324 7449 7562
Inconclusive active agonist 612 612 571 571 612 0 0 612 612 612 612 60 590 612
inconclusive 44 44 44 44 44 0 0 44 0 0 44 19 42 44
inconclusive agonist 414 414 409 409 414 0 0 414 212 223 414 12 397 414
inconclusive agonist (cytotoxic) 59 59 59 59 59 0 0 59 41 45 59 59 59 59
inconclusive antagonist 367 367 364 364 367 0 0 367 227 230 367 8 313 367
inconclusive antagonist (cytotoxic) 1049 1049 1046 1046 1049 0 0 1049 1019 1026 1049 1048 1049 1049

Note that the display above can be visually misleading. Run the next code cell and you will see that the activity_outcome values Active and Inactive each map to a single activity_summary, while Inconclusive maps to many, one of which is active agonist.

for outcome, summaries in (
    df_curated
    .groupby('activity_outcome')['activity_summary']
    .unique()
    .items()
):
    print(f"\n{outcome}:")
    for s in sorted(summaries):
        print(f"  - {s}")
Active:
  - active antagonist

Inactive:
  - inactive

Inconclusive:
  - active agonist
  - inconclusive
  - inconclusive agonist
  - inconclusive agonist (cytotoxic)
  - inconclusive antagonist
  - inconclusive antagonist (cytotoxic)
print('there are',len(df_curated),'compounds in the dataframe that are defined as active antagonist, inactive, or inconclusive.')
there are 10486 compounds in the dataframe that are defined as active antagonist, inactive, or inconclusive.

Now, we can see that, in the activity_summary column, the inconclusive compounds are further classified into subclasses, which include:

  • active agonist

  • inconclusive

  • inconclusive agonist

  • inconclusive antagonist

  • inconclusive agonist (cytotoxic)

  • inconclusive antagonist (cytotoxic)

As indicated by the title of this PubChem assay record (AID 743139), the purpose of the experiment was to identify aromatase inhibitors. For this reason, the original submitter labeled all compounds showing antagonist behavior as active in the activity_outcome column. The assay also detected 612 agonists, but these were marked as inconclusive because the submitter defined activity strictly in terms of inhibitory, not stimulatory, effects. In this dataset, inactive compounds are those that show neither agonist nor antagonist behavior.

It is important to remember that “active” and “inactive” are not universal labels; their meaning depends entirely on the assay design and the submitter’s interpretation. For our purposes, we redefine these categories to support a binary classification task: any compound that changes the function of the target, whether by increasing or decreasing its activity, is considered active, and compounds that do not alter the target are considered inactive. In short, we are grouping all function-altering interactions together as active and treating the remainder as inactive, so the model can focus on distinguishing compounds that affect the enzyme from those that do not.

Check Your Understanding: Defining “Active” as a Modeling Decision

As we prepare to collapse assay outcomes into a binary classification for model building (active vs inactive), we must decide how to handle compounds labeled as inconclusive. In this dataset, there are 379 active antagonists, 7,562 inactive compounds, and 2,545 inconclusive compounds. Notably, 612 of the inconclusive compounds are labeled as active agonists, substantially more than the number of active antagonists. Rather than asking whether a particular choice is correct, this discussion is intended to surface the consequences of the choices we make and to emphasize the importance of recognizing that a choice is being made at all.

Discussion Points

The deeper lesson: recognizing that a decision was made

  • At what point in the workflow does the definition of “active” enter the modeling process?
  • How does collapsing rich experimental outcomes into a binary label necessarily simplify the biology?
  • What assumptions are we making about mechanism, relevance, and similarity when we redefine labels?
  • How might a different research goal lead to a different but equally reasonable decision using the same data?

Why documentation may matter more than the decision itself

  • Could someone reuse this model without realizing how “active” was defined?
  • How would performance metrics be interpreted differently if agonists were excluded rather than included?
  • How does documenting this choice enable reproducibility, comparison, and critical evaluation?
  • What risks arise when modeling assumptions are implicit rather than explicit?

Factors that influence this decision

  • Biological considerations: agonists are clearly bioactive and demonstrate target engagement, even if the mechanism differs.
  • Chemical considerations: agonists and antagonists may share structural features relevant to binding and recognition.
  • Data considerations: there are more active agonists (612) than active antagonists (379), and excluding them would further reduce an already small positive class.
  • Modeling considerations: extreme class imbalance can negatively affect learning, evaluation, and generalization.

Consequences of including active agonists as “active”

  • The model becomes a general bioactivity classifier rather than a mechanism-specific antagonist detector.
  • Predictions answer the question “is this compound active against the target?” rather than “is this compound an antagonist?”
  • Mechanistic distinctions are deferred to later analysis rather than embedded in the initial model.

Consequences of excluding active agonists

  • The active class becomes very small relative to the inactive class, increasing imbalance.
  • Biologically active compounds are treated as negatives, potentially obscuring meaningful structure–activity relationships.
  • The model becomes more specialized but less general and less transferable.

Connecting this decision to the next step

  • In the next section, we will explicitly define active and inactive classes for binary classification.
  • All inconclusive compounds will be removed except those labeled as active agonists.
  • This decision should be understood as a deliberate modeling choice shaped by biological goals, data constraints, and downstream use.

The central takeaway is not that one labeling strategy is universally correct, but that supervised learning models are built on decisions that shape what they can learn, what they ignore, and how their predictions should be interpreted. Recognizing, justifying, and documenting those decisions is a core part of responsible model building.

4. Select active/inactive compounds for model building#

4.1 Select active and inactive compounds#

Now we want to select only the active and inactive compounds from the data frame (that is, active agonists, active antagonists, and inactives based on the “activity summary” column). This code filters the DataFrame so that only rows labeled as active agonist, active antagonist, or inactive are kept, using the | operator to combine multiple Boolean conditions with “OR.” This removes all inconclusive entries, ensuring that the resulting DataFrame contains only compounds with clear activity classifications suitable for model training.

# create a new dataframe with active agonists, active antagonists, and inactives
# remove any molecules that are inconclusive and not active agonists
df = df_curated[ (df_curated['activity_summary'] == 'active agonist' ) | 
             (df_curated['activity_summary'] == 'active antagonist' ) |
             (df_curated['activity_summary'] == 'inactive' ) ]

print("The number of total molecules that are active or inactive is:", len(df))
The number of total molecules that are active or inactive is: 8553
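The same filter can be written more compactly with `.isin()`, which is easier to extend if the set of kept labels ever changes. A self-contained sketch on a hypothetical mini version of the assay table:

```python
import pandas as pd

# hypothetical mini version of the assay table
toy = pd.DataFrame({
    'cid': [1, 2, 3, 4, 5],
    'activity_summary': ['inactive', 'active agonist', 'inconclusive',
                         'active antagonist', 'inconclusive antagonist'],
})

# equivalent to chaining three == comparisons with the | operator
keep = ['active agonist', 'active antagonist', 'inactive']
kept = toy[toy['activity_summary'].isin(keep)]
print(kept['activity_summary'].tolist())   # ['inactive', 'active agonist', 'active antagonist']
```

Both forms build the same Boolean mask; `.isin()` simply tests membership in a list instead of OR-ing individual comparisons.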

Since we will be obtaining structural data from PubChem, it’s important to examine how many unique PubChem Substance IDs (SIDs) and Compound IDs (CIDs) are present in our dataset. Recall the distinction:

  • Substances (SIDs) represent depositor-submitted records. Multiple SIDs may refer to the same chemical structure if submitted by different sources.

  • Compounds (CIDs) are standardized, unique chemical structures derived by PubChem from submitted substances through a process of structure normalization and deduplication.

Understanding how many unique SIDs and CIDs we have will help us assess redundancy and ensure we’re working with non-duplicated chemical structures when building models or visualizing data.

# identify Compounds IDs and Substance IDs. 
print('total number of unique Substance IDs=',len(df['sid'].unique()))
print('total number of unique Compound IDs= ',len(df['cid'].unique()))
total number of unique Substance IDs= 8553
total number of unique Compound IDs=  6858

This data indicates several important points:

  1. The number of unique Substance IDs matches the number of active and inactive molecules in the dataset.

  2. Not all substances have an associated Compound ID (CID); recall that we identified 149 such records earlier. This means some structures have not been standardized by PubChem.

  3. Some substances may have the same CID, indicating that different submissions or assay runs may correspond to the same chemical structure.

Since our goal is to build a model that uses chemical structure to predict biological activity, we must remove substances without CIDs as these lack standardized structural information necessary for modeling. We must also remove duplicate SIDs for the same CID, as they are the same structure. By cleaning our data based on CID and activity type, we ensure the model learns from reliable, unambiguous data where each chemical structure is uniquely associated with a single activity classification.

4.2 Drop substances without associated CIDs.#

First, check whether there are substances without associated CIDs. We will use the .isna() method to identify all rows with missing (not available) values. Note that the following code cell uses chaining to call multiple methods sequentially on a pandas object (we chain .isna() with .sum() to count how many values are missing).

df.isna().sum()  
pc_result_tag                       0
sid                                 0
cid                               138
PUBCHEM_EXT_DATASOURCE_SMILES     138
activity_outcome                    0
activity_score                      0
activity_url                     8553
assay_data_comment               8553
activity_summary                    0
antagonist_activity                 0
antagonist_potency               7563
antagonist_efficacy                 0
viability_activity                  0
viability_potency                8054
viability_efficacy                155
sample_source                       0
dtype: int64
Check your understanding

How many records lack an associated CID? Why does this number differ from the 149 NaNs identified in section 4.1?

Answer

138 records lack a CID. There were originally 149; the number is smaller now because some of the CID-less records were among the inconclusive compounds (other than active agonists) that we removed.

The records without CID are not going to provide us with standardized structural information, so we will remove them from the dataframe.

This is accomplished by using the .dropna() method. Since we only want to remove rows where the CID is missing, we can use the subset parameter to specify we are targeting only those NA values in the ‘cid’ column.

Look at the output of the previous code cell. If we called dropna() without specifying a subset, we could lose all of our data, because activity_url and assay_data_comment are null for every row in the dataframe!

print(len(df))   # in the previous cell we saw 138 rows without a CID
df = df.dropna( subset=['cid'] )
print(len(df))   # this value should be 138 fewer
8553
8415
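The danger of calling `dropna()` without `subset` can be reproduced on a toy frame (hypothetical data) where one column is entirely null, just like activity_url and assay_data_comment above:

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({
    'cid':          [1.0, np.nan, 3.0],
    'activity_url': [np.nan, np.nan, np.nan],   # entirely null, like activity_url above
})

print(len(toy.dropna()))                 # 0: every row has a NaN somewhere
print(len(toy.dropna(subset=['cid'])))   # 2: only the missing-CID row is dropped
```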

Check if the NULL values disappeared in the cid column. The value should be 0 for that column now.

df.isna().sum()
pc_result_tag                       0
sid                                 0
cid                                 0
PUBCHEM_EXT_DATASOURCE_SMILES       0
activity_outcome                    0
activity_score                      0
activity_url                     8415
assay_data_comment               8415
activity_summary                    0
antagonist_activity                 0
antagonist_potency               7467
antagonist_efficacy                 0
viability_activity                  0
viability_potency                7919
viability_efficacy                154
sample_source                       0
dtype: int64

Now that we have removed all rows with NaN CIDs, we can safely convert the cid column to integers and not worry about the values reverting to floats when we reload a saved CSV file.

#Write code here to convert CIDs to integers and print the head
df["cid"] = df["cid"].astype("Int64")
df.head()
pc_result_tag sid cid PUBCHEM_EXT_DATASOURCE_SMILES activity_outcome activity_score activity_url assay_data_comment activity_summary antagonist_activity antagonist_potency antagonist_efficacy viability_activity viability_potency viability_efficacy sample_source
0 1 144203552 12850184 C(C(=O)[C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O.C(... Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 NCI
1 2 144203553 89753 C([C@H]([C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O)O... Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 NCI
2 3 144203554 9403 C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@H]2OC(=O)CCC... Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 NCI
3 4 144203555 13218779 C[C@@]12CC[C@@H](C1(C)C)C[C@H]2OC(=O)CSC#N Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 NCI
5 6 144203557 16043 CC(C)(C)C1=C(C=CC(=C1)O)O Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 NCI
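The reason for `"Int64"` (capital I, pandas' nullable integer dtype) rather than plain `"int64"` is that a NumPy integer column cannot hold NaN; any missing value silently promotes the column to float. A short sketch with made-up values:

```python
import pandas as pd
import numpy as np

s = pd.Series([12850184, np.nan, 9403])
print(s.dtype)   # float64: the NaN forces the column to float

s_nullable = s.astype('Int64')   # pandas nullable integer; NaN becomes <NA>
print(s_nullable.dtype)          # Int64
print(s_nullable.iloc[0])        # 12850184 (no trailing .0)
```

In our dataframe the NaN CIDs are already gone, but using the nullable dtype also protects against float promotion if a missing value ever reappears on reload.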
# identify Compounds IDs and Substance IDs. 
print('total number of Substance IDs=',len(df['sid'].unique()))
print('total number of Compound IDs= ',len(df['cid'].unique()))
total number of Substance IDs= 8415
total number of Compound IDs=  6857
Check your understanding

Why can the number of SIDs be greater than the number of CIDs?

Answer

The SID is tied to the submitter, and multiple assays may have been performed on the same chemical. This means we can have multiple entries for the same chemical compound (CID), which could also result in conflicting activities.

4.3 Remove CIDs with conflicting activities#

This code iterates over the unique CIDs and checks whether multiple distinct values exist in the activity_summary column for each. If conflicting activity summaries are found, the code stores the CID in the cid_conflict list and records the corresponding row indices in idx_conflict. Finally, the total number of CIDs with conflicts and the total number of affected rows are output.

cid_conflict = [] # list to store CIDs with conflicting activities
idx_conflict = [] # list to store indices of rows with conflicting activities

for mycid in df['cid'].unique() : # iterate over each unique Compound ID
    
    outcomes = df[ df.cid == mycid ].activity_summary.unique() # unique activity summaries for this CID
    
    if len(outcomes) > 1 : # if there are multiple unique activity summaries for this CID
        
        idx_tmp = df.index[ df.cid == mycid ].tolist() # get the indices of these rows
        idx_conflict.extend(idx_tmp) # add these indices to the conflict list
        cid_conflict.append(mycid) # add the CID to the conflict list

print("#", len(cid_conflict), "CIDs with conflicting activities [associated with", len(idx_conflict), "rows (SIDs).]")
print(idx_conflict[:10]) # print first 10 CID index numbers of conflicting activities
# 66 CIDs with conflicting activities [associated with 150 rows (SIDs).]
[5, 5953, 6847, 49, 6127, 63, 9115, 103, 5917, 6961]
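The same conflict scan can be written without an explicit loop using `groupby(...).nunique()`. This is a sketch on toy data (CIDs borrowed from the examples above, activity values invented), but the idea applies directly to `df`; note that, unlike the loop, the grouped version reports CIDs in sorted order rather than encounter order:

```python
import pandas as pd

toy = pd.DataFrame({
    'cid':              [16043, 16043, 16043, 443939, 443939, 51],
    'activity_summary': ['inactive', 'active antagonist', 'inactive',
                         'inactive', 'active antagonist', 'inactive'],
})

# number of distinct activity summaries per CID
n_unique = toy.groupby('cid')['activity_summary'].nunique()

# CIDs with more than one distinct summary are conflicting
conflicting = n_unique[n_unique > 1].index.tolist()

# all row indices belonging to those CIDs
rows = toy.index[toy['cid'].isin(conflicting)].tolist()

print(conflicting)   # [16043, 443939]
print(rows)          # [0, 1, 2, 3, 4]
```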

To examine which CIDs have conflicting activity data, we can display a portion of the dataframe using the .loc[] method. This allows us to select specific rows. In this case, we used the ones listed in idx_conflict, which correspond to compounds with inconsistent activity summaries. We’ll chain with .head(10) to view just the first 10 of these rows for quick inspection.

This helps us verify the nature of the conflicts and better understand why these entries need to be removed before modeling.

df.loc[idx_conflict,:].head(10)
pc_result_tag sid cid PUBCHEM_EXT_DATASOURCE_SMILES activity_outcome activity_score activity_url assay_data_comment activity_summary antagonist_activity antagonist_potency antagonist_efficacy viability_activity viability_potency viability_efficacy sample_source
5 6 144203557 16043 CC(C)(C)C1=C(C=CC(=C1)O)O Inactive 0.0 NaN NaN inactive inactive NaN 0.0000 inactive NaN 0.0000 NCI
5953 5954 144209507 16043 CC(C)(C)C1=C(C=CC(=C1)O)O Active 43.0 NaN NaN active antagonist active antagonist 54.48270 -73.4024 inconclusive antagonist NaN NaN SigmaAldrich
6847 6848 144210401 16043 CC(C)(C)C1=C(C=CC(=C1)O)O Inactive 0.0 NaN NaN inactive inactive NaN 0.0000 inactive NaN 0.0000 SIGMA
49 50 144203601 443939 C[C@H]1[C@H]([C@H](C[C@@H](O1)O[C@H]2C[C@@](CC... Inactive 0.0 NaN NaN inactive inactive NaN 0.0000 inactive NaN 0.0000 NCI
6127 6128 144209681 443939 C[C@H]1[C@H]([C@H](C[C@@H](O1)O[C@H]2C[C@@](CC... Active 61.0 NaN NaN active antagonist active antagonist 1.65519 -115.9320 active antagonist 12.1763 -120.5980 Toronto Research
63 64 144203615 2170 C1CN(CCN1)C2=NC3=CC=CC=C3OC4=C2C=C(C=C4)Cl Inactive 0.0 NaN NaN inactive inactive NaN 0.0000 inactive NaN 0.0000 BIOMOL
9115 9116 144212669 2170 C1CN(CCN1)C2=NC3=CC=CC=C3OC4=C2C=C(C=C4)Cl Active 50.0 NaN NaN active antagonist active antagonist 16.58030 -115.2020 inconclusive antagonist 61.1306 -80.7706 SIGMA
103 104 144203655 2554 C1=CC=C2C(=C1)C=CC3=CC=CC=C3N2C(=O)N Inconclusive 20.0 NaN NaN active agonist active agonist 2.87255 73.7025 inactive NaN 0.0000 SigmaAldrich
5917 5918 144209471 2554 C1=CC=C2C(=C1)C=CC3=CC=CC=C3N2C(=O)N Inactive 0.0 NaN NaN inactive inactive NaN 0.0000 inactive NaN 0.0000 SIGMA
6961 6962 144210515 2554 C1=CC=C2C(=C1)C=CC3=CC=CC=C3N2C(=O)N Inactive 0.0 NaN NaN inactive inactive NaN 0.0000 inactive NaN 0.0000 SIGMA

We will build a binary classifier that distinguishes active from inactive compounds, lumping active antagonists and active agonists together as active. Because active agonists carry an activity_outcome of Inconclusive while active antagonists carry Active, a compound tested as an agonist in one sample and an antagonist in another has two different activity_summary values and is flagged as a conflict, even though we intend to treat both outcomes as active.

This situation should be rare: one assay reports the compound as an active antagonist (activity_outcome = Active, activity_summary = active antagonist) and another as an active agonist (activity_outcome = Inconclusive, activity_summary = active agonist). The following code cell identifies these compounds and returns them as a Python set named allowed_cids.

# Identify conflicting CIDs using activity_summary (already computed earlier)
# idx_conflict comes from the earlier conflict-detection cell

# Define biologically allowed exceptions:
# CIDs that have both active antagonist and active agonist summaries
allowed_cids = []

for cid, g in df.groupby('cid'):
    summaries = set(g['activity_summary'].str.lower())

    if (
        'active antagonist' in summaries
        and 'active agonist' in summaries
    ):
        allowed_cids.append(cid)

allowed_cids = set(allowed_cids)

print(f'allowed exceptions: {allowed_cids}')

# Optional inspection
df.loc[df['cid'].isin(allowed_cids)] \
  .sort_values(['cid', 'sid']) \
  .head(20)
allowed exceptions: {np.int64(10868), np.int64(54454)}
pc_result_tag sid cid PUBCHEM_EXT_DATASOURCE_SMILES activity_outcome activity_score activity_url assay_data_comment activity_summary antagonist_activity antagonist_potency antagonist_efficacy viability_activity viability_potency viability_efficacy sample_source
6644 6645 144210198 10868 C1CCC(CC1)N=C=NC2CCCCC2 Inconclusive 20.0 NaN NaN active agonist active agonist 4.19020 40.5990 inactive NaN 0.0 SIGMA
10000 10001 144213554 10868 C1CCC(CC1)N=C=NC2CCCCC2 Active 50.0 NaN NaN active antagonist active antagonist 5.04574 -68.1018 inconclusive antagonist NaN NaN LightBiologicals
695 696 144204247 54454 CCC(C)(C)C(=O)O[C@H]1C[C@H](C=C2[C@H]1[C@H]([C... Inconclusive 20.0 NaN NaN active agonist active agonist 4.35177 54.7546 inactive NaN 0.0 Prestwick Chemical, Inc.
7164 7165 144210718 54454 CCC(C)(C)C(=O)O[C@H]1C[C@H](C=C2[C@H]1[C@H]([C... Active 45.0 NaN NaN active antagonist active antagonist 17.22890 -54.6818 inactive NaN 0.0 Toronto Research
Deeper Dive Explanation

Go to the DYAD pages for AID_743139-SID_144210198 and AID_743139-SID_144213554, both of which correspond to CID 10868. Explain the results of the output above.

Answer

Although the assay was designed to detect antagonism, it measures receptor signaling in the presence of a reference agonist. Compounds such as CIDs 10868 and 54454 can both activate the receptor on their own and reduce signaling relative to a stronger agonist, consistent with partial agonism. PubChem therefore records agonist-like signal increases as “active agonist” (often marked inconclusive) and signal suppression as “active antagonist.” Both outcomes indicate real target engagement and should be grouped as Active in our binary classification.

Check your understanding
  1. Examine the portion of the dataframe above. Why should CID 16043 be removed from the analysis?

  2. Can you explain the sequence of the output of df.loc[idx_conflict,:].head(10)? Is this data being sorted?

Answer
  1. Three SIDs share the common CID 16043, and their results conflict (one was active, two were inactive).

  2. This is not sorting; the apparent grouping by CID occurred because the for loop iterated through unique CIDs in the order they appear in the dataframe. When a CID with conflicting activity was found, all rows with that CID were appended to idx_conflict at once. So index 5 (CID 16043) came first, immediately followed by indices 5953 and 6847, which share that CID; the loop then reached the next conflicting CID, contributing indices 49 and 6127, then index 63 and its partner, and so on through the rest of the list.

# Drop rows with conflicting data, except for allowed biological exceptions
idx_conflict_fixed = df.index[
    df.index.isin(idx_conflict) &
    ~df["cid"].isin(allowed_cids)
]

df = df.drop(idx_conflict_fixed)
df.head()
pc_result_tag sid cid PUBCHEM_EXT_DATASOURCE_SMILES activity_outcome activity_score activity_url assay_data_comment activity_summary antagonist_activity antagonist_potency antagonist_efficacy viability_activity viability_potency viability_efficacy sample_source
0 1 144203552 12850184 C(C(=O)[C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O.C(... Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 NCI
1 2 144203553 89753 C([C@H]([C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O)O... Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 NCI
2 3 144203554 9403 C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@H]2OC(=O)CCC... Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 NCI
3 4 144203555 13218779 C[C@@]12CC[C@@H](C1(C)C)C[C@H]2OC(=O)CSC#N Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 NCI
9 10 144203561 637566 CC(=CCC/C(=C/CO)/C)C Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 NCI
# After removing the conflicting data, we can check the counts of each activity summary again
df.groupby('activity_summary').count()
pc_result_tag sid cid PUBCHEM_EXT_DATASOURCE_SMILES activity_outcome activity_score activity_url assay_data_comment antagonist_activity antagonist_potency antagonist_efficacy viability_activity viability_potency viability_efficacy sample_source
activity_summary
active agonist 539 539 539 539 539 539 0 0 539 539 539 539 58 519 539
active antagonist 344 344 344 344 344 344 0 0 344 343 344 344 108 327 344
inactive 7386 7386 7386 7386 7386 7386 0 0 7386 0 7386 7386 318 7275 7386
# identify Compounds IDs and Substance IDs. 
print('total number of Substance IDs=',len(df['sid'].unique()))
print('total number of Compound IDs= ',len(df['cid'].unique()))
total number of Substance IDs= 8269
total number of Compound IDs=  6793

4.4 Remove redundant data#

The code cells in section 4.3 do not remove compounds tested multiple times when the testing results are consistent (e.g., active agonist in all samples (substances)). The rows corresponding to these compounds are redundant, so we will remove them, keeping only one row per compound. To illustrate this, let's create a sorted dataframe to see some of the redundancies.

sorted_df = df.sort_values(by=['cid','sid','activity_summary'],
                           ascending=[True,True,True])
sorted_df.head(10)
pc_result_tag sid cid PUBCHEM_EXT_DATASOURCE_SMILES activity_outcome activity_score activity_url assay_data_comment activity_summary antagonist_activity antagonist_potency antagonist_efficacy viability_activity viability_potency viability_efficacy sample_source
8467 8468 144212021 4 CC(CN)O Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 Enamine
6110 6111 144209664 11 C(CCl)Cl Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 SIGMA
4955 4956 144208509 13 C1=CC(=C(C=C1Cl)Cl)Cl Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 SIGMA
7326 7327 144210880 13 C1=CC(=C(C=C1Cl)Cl)Cl Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 LightBiologicals
7915 7916 144211469 33 C(C=O)Cl Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 LightBiologicals
6809 6810 144210363 34 C(CCl)O Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 Enamine
15 16 144203567 51 C(CC(=O)O)C(=O)C(=O)O Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 NCI
4562 4563 144208116 51 C(CC(=O)O)C(=O)C(=O)O Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 SIGMA
5533 5534 144209087 66 C1=CC(=C(C=C1Cl)O)Cl Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 SIGMA
9816 9817 144213370 66 C1=CC(=C(C=C1Cl)O)Cl Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 SIGMA
Explore the sorted dataframe displayed above. Identify any CIDs that are repeated, and check whether they have different SID or activity_outcome values.

We will use the drop_duplicates() method in pandas to remove duplicate rows from our dataframe. The subset parameter specifies which column(s) to consider when identifying duplicates; in our case, we need unique identifiers to retrieve structural data, so we use 'cid'.

df_clean = df.drop_duplicates(subset='cid')  # remove duplicate rows except for the first occurring row.
print('total number of Substance IDs=',len(df_clean['sid'].unique()))
print('total number of Compound IDs= ',len(df_clean['cid'].unique()))
total number of Substance IDs= 6793
total number of Compound IDs=  6793
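`drop_duplicates` keeps the first occurrence by default (`keep='first'`). A toy sketch (SIDs and CIDs invented, mirroring the sorted table above) shows that only the first row per CID survives:

```python
import pandas as pd

toy = pd.DataFrame({
    'cid': [13, 13, 51, 51, 66],
    'sid': [4956, 7327, 16, 4563, 5534],
})

# keep='first' is the default: the first row for each CID survives
deduped = toy.drop_duplicates(subset='cid')
print(deduped['sid'].tolist())   # [4956, 16, 5534]
```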

Save as AID743139_clean_CID…#

Save the dataframe in your curated data folder with a date stamp, using the stem "AID743139_clean_CID".

# Save cleaned CID data (curated)
from pathlib import Path
from cinf26pk.core import make_fixed_filename

# Generate path (relative to notebook)
CUR = Path("data/AID743139/curated")
CUR.mkdir(parents=True, exist_ok=True)

# Create deterministic filename with date stamp
fname = make_fixed_filename(
    stem="AID743139_clean_CID",
    ext="csv",
    add_date=True
)

# Save curated dataset
df_clean.to_csv(CUR / fname, index=False)

print(f"[Saved] Curated PubChem data → {CUR / fname}")
[Saved] Curated PubChem data → data/AID743139/curated/AID743139_clean_CID_20260205.csv
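`make_fixed_filename` comes from the course's `cinf26pk` package. If that package is unavailable, a minimal stand-in (a hypothetical implementation, covering only the behavior used here) might look like:

```python
from datetime import date

def make_fixed_filename(stem: str, ext: str, add_date: bool = False) -> str:
    """Build a deterministic filename, optionally date-stamped as YYYYMMDD."""
    if add_date:
        return f"{stem}_{date.today():%Y%m%d}.{ext}"
    return f"{stem}.{ext}"

print(make_fixed_filename(stem="AID743139_clean_CID", ext="csv", add_date=True))
```

Encoding the date in the filename preserves provenance: each curated snapshot is distinguishable, and nothing is silently overwritten across sessions run on different days.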
#write your code here to open the file as a dataframe called df
#double-check that the number of CIDs and SIDs has not changed and that inconclusives and duplicates have been removed
import pandas as pd
df = pd.read_csv(CUR/"AID743139_clean_CID_20260205.csv")
print('total number of Substance IDs=',len(df['sid'].unique()))
print('total number of Compound IDs= ',len(df['cid'].unique()))
df.head()
total number of Substance IDs= 6793
total number of Compound IDs=  6793
pc_result_tag sid cid PUBCHEM_EXT_DATASOURCE_SMILES activity_outcome activity_score activity_url assay_data_comment activity_summary antagonist_activity antagonist_potency antagonist_efficacy viability_activity viability_potency viability_efficacy sample_source
0 1 144203552 12850184 C(C(=O)[C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O.C(... Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 NCI
1 2 144203553 89753 C([C@H]([C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O)O... Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 NCI
2 3 144203554 9403 C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@H]2OC(=O)CCC... Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 NCI
3 4 144203555 13218779 C[C@@]12CC[C@@H](C1(C)C)C[C@H]2OC(=O)CSC#N Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 NCI
4 10 144203561 637566 CC(=CCC/C(=C/CO)/C)C Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 NCI

Note from the above that CIDs are now integers, and the number of CIDs equals the number of SIDs.

4.5 Adding “numeric” activity classes#

In general, machine learning algorithms require both inputs and outputs to be in numerical form. In our case the inputs will be molecular structures: we have been cleaning our data to ensure we have valid CIDs that we can use to retrieve SMILES strings and subsequently calculate binary fingerprints as input features. However, the output, biological activity, is currently stored as the text labels active or inactive. To make this data suitable for modeling, we will create a new column called 'activity' that encodes these labels as numeric values:

  • 1 for actives (including both active agonists and antagonists)

  • 0 for inactives

We combine the two types of active compounds into a single class because our goal is to train a binary classifier that distinguishes between compounds that interact with the target (active) and those that do not (inactive).

df.groupby(['activity_outcome','activity_summary']).count()
pc_result_tag sid cid PUBCHEM_EXT_DATASOURCE_SMILES activity_score activity_url assay_data_comment antagonist_activity antagonist_potency antagonist_efficacy viability_activity viability_potency viability_efficacy sample_source
activity_outcome activity_summary
Active active antagonist 290 290 290 290 290 0 0 290 289 290 290 88 275 290
Inactive inactive 6050 6050 6050 6050 6050 0 0 6050 0 6050 6050 269 5964 6050
Inconclusive active agonist 453 453 453 453 453 0 0 453 453 453 453 44 434 453
df['activity'] = [ 0 if x == 'inactive' else 1 for x in df['activity_summary'] ]
df.head() #new column is added to end of column list, so scroll right in the resulting cell output.
pc_result_tag sid cid PUBCHEM_EXT_DATASOURCE_SMILES activity_outcome activity_score activity_url assay_data_comment activity_summary antagonist_activity antagonist_potency antagonist_efficacy viability_activity viability_potency viability_efficacy sample_source activity
0 1 144203552 12850184 C(C(=O)[C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O.C(... Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 NCI 0
1 2 144203553 89753 C([C@H]([C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O)O... Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 NCI 0
2 3 144203554 9403 C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@H]2OC(=O)CCC... Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 NCI 0
3 4 144203555 13218779 C[C@@]12CC[C@@H](C1(C)C)C[C@H]2OC(=O)CSC#N Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 NCI 0
4 10 144203561 637566 CC(=CCC/C(=C/CO)/C)C Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 NCI 0

Double-check the count of active/inactive compounds.

df.groupby('activity_summary').count()
pc_result_tag sid cid PUBCHEM_EXT_DATASOURCE_SMILES activity_outcome activity_score activity_url assay_data_comment antagonist_activity antagonist_potency antagonist_efficacy viability_activity viability_potency viability_efficacy sample_source activity
activity_summary
active agonist 453 453 453 453 453 453 0 0 453 453 453 453 44 434 453 453
active antagonist 290 290 290 290 290 290 0 0 290 289 290 290 88 275 290 290
inactive 6050 6050 6050 6050 6050 6050 0 0 6050 0 6050 6050 269 5964 6050 6050
df.groupby('activity').count() 
pc_result_tag sid cid PUBCHEM_EXT_DATASOURCE_SMILES activity_outcome activity_score activity_url assay_data_comment activity_summary antagonist_activity antagonist_potency antagonist_efficacy viability_activity viability_potency viability_efficacy sample_source
activity
0 6050 6050 6050 6050 6050 6050 0 0 6050 6050 0 6050 6050 269 5964 6050
1 743 743 743 743 743 743 0 0 743 743 742 743 743 132 709 743
Does the number of actives in this output equal the sum of the active agonists and active antagonists above?
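A quick arithmetic check using the counts shown in the groupby outputs above:

```python
# Counts taken from the groupby outputs above.
n_agonist = 453
n_antagonist = 290
n_active = 743

# The two active classes should sum to the single active class (activity == 1).
assert n_agonist + n_antagonist == n_active
print("Active counts are consistent:", n_agonist + n_antagonist)
```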

4.6 Create a smaller data frame that only contains CIDs and activities.#

Let’s create a smaller data frame that only contains CIDs, activities and the PubChem External Data Sources. This data frame will be merged with a data frame containing molecular fingerprint information.

df_activity = df[['cid','activity','PUBCHEM_EXT_DATASOURCE_SMILES']]  #also included the external datasource SMILES to compare later with PubChem validated CID SMILES
df_activity.head(5)
cid activity PUBCHEM_EXT_DATASOURCE_SMILES
0 12850184 0 C(C(=O)[C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O.C(...
1 89753 0 C([C@H]([C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O)O...
2 9403 0 C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@H]2OC(=O)CCC...
3 13218779 0 C[C@@]12CC[C@@H](C1(C)C)C[C@H]2OC(=O)CSC#N
4 637566 0 CC(=CCC/C(=C/CO)/C)C
df_activity.shape
(6793, 3)

While we could have used the SMILES that were already in our dataframe, the label PUBCHEM_EXT_DATASOURCE_SMILES indicates that the SMILES may have been provided by the submitter and may not be canonical. To ensure we have quality data, we will retrieve SMILES from PubChem.

print("CIDs are stored in original df dataframe as", df['cid'].dtype)
print("CIDs are stored in df_activity dataframe as", df_activity['cid'].dtype)
cids = df.cid.astype(int).tolist()
print(cids[:10])
print("CIDs are now stored as", type(cids[0]))
print(f"CIDs is type {type(cids)}")
CIDs are stored in original df dataframe as int64
CIDs are stored in df_activity dataframe as int64
[12850184, 89753, 9403, 13218779, 637566, 4766, 3080, 7048801, 51, 66435]
CIDs are now stored as <class 'int'>
CIDs is type <class 'list'>

4.6.1 Save as AID743139_activity_CID#

Save the dataframe in your curated data folder with date stamp and version number using prefix="AID743139_activity_CID".

#write code here
# Complete the code to save your work as a versioned filename in the folder

from pathlib import Path
from cinf26pk.core import make_fixed_filename

# Generate path (relative to notebook)
CUR = Path("data/AID743139/curated")
CUR.mkdir(parents=True, exist_ok=True)

# Create deterministic filename with date stamp
fname = make_fixed_filename(
    stem="AID743139_Activity_CID",
    ext="csv",
    add_date=True
)

# Save curated dataset
df_activity.to_csv(CUR / fname, index=False)

print(f"[Saved] Curated PubChem data → {CUR / fname}")
[Saved] Curated PubChem data → data/AID743139/curated/AID743139_Activity_CID_20260205.csv

5 Acquire standardized PubChem SMILES#

Now that we have a list of CIDs as integers, we can retrieve the SMILES from PubChem using the PUG REST API. We will use these to generate RDKit fingerprints.

Once we have retrieved the SMILES we will store them and their respective CIDs into a new df_smiles dataframe.

Code Dive

The following code uses several new functions from the cinf26pk package, along with the tqdm package. If you do not have access to the package, you can insert the cinf26pk functions directly into the code cell (after removing the import statements).

chunk_list() - a function of the cinf26pk.core module
      
def chunk_list(lst: List[Any], size: int = 100):
    """
    Yield successive fixed-size chunks from a list.

    Parameters
    ----------
    lst : list
        List to split.
    size : int, default=100
        Chunk size.

    Yields
    ------
    list
        List chunk.
    """
    for i in range(0, len(lst), size):
        yield lst[i:i + size]
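A quick usage sketch of chunk_list, redefined inline here so it runs without the package:

```python
from typing import Any, List

def chunk_list(lst: List[Any], size: int = 100):
    """Yield successive fixed-size chunks from a list."""
    for i in range(0, len(lst), size):
        yield lst[i:i + size]

# 250 fake CIDs split into chunks of 100 -> sizes 100, 100, 50
fake_cids = list(range(250))
chunks = list(chunk_list(fake_cids, size=100))
print([len(c) for c in chunks])  # [100, 100, 50]
```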

pubchem_get() - a function of the cinf26pk.pubchem.http module

def pubchem_get(url, retries=DEFAULT_RETRIES, timeout=DEFAULT_TIMEOUT,
                base_delay=2, return_json=True):
    """
    Safely performs a GET request to a PubChem PUG-REST endpoint with retry/backoff.

    Parameters
    ----------
    url : str
        Full URL for the GET request.
    retries : int, default=3
        Number of retry attempts.
    timeout : int, default=30
        Timeout (seconds) for request.
    base_delay : int, default=2
        Backoff base delay.
    return_json : bool, default=True
        Return JSON or raw text.

    Returns
    -------
    dict or str or None
        Parsed response, or None if all retries fail.
    """
    for attempt in range(1, retries + 1):
        try:
            r = requests.get(url, timeout=timeout)
            r.raise_for_status()
            return r.json() if return_json else r.text

        except Exception as e:
            print(f"[Warning] GET attempt {attempt}/{retries} failed.")
            print(f"         URL: {url}")
            print(f"         Error: {type(e).__name__}: {e}")
            time.sleep(base_delay * attempt)

    print(f"[Error] All GET attempts failed for URL: {url}")
    return None

tqdm

tqdm is a Python package that provides fast, extensible progress bars for loops and iterable-based operations. It is commonly used in data science and machine learning workflows to give real-time feedback on long-running tasks such as file I/O, model training, or batch processing, without requiring changes to the underlying logic of the code. tqdm integrates cleanly with standard Python iterables, Pandas operations, and Jupyter notebooks.

To install it, activate your environment and install from conda-forge:
conda install -c conda-forge tqdm
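A minimal tqdm sketch, with a no-op fallback so the cell still runs if tqdm is not installed:

```python
try:
    from tqdm import tqdm
except ImportError:
    # Fallback: behave like a plain iterator if tqdm is unavailable.
    def tqdm(iterable, **kwargs):
        return iterable

# The progress bar wraps any iterable without changing the loop's logic.
total = 0
for i in tqdm(range(5), desc="demo"):
    total += i

print(total)  # 10
```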

import time
from io import StringIO
from pathlib import Path

import pandas as pd
from tqdm import tqdm

from cinf26pk.core import chunk_list, make_fixed_filename
from cinf26pk.pubchem import http


# ------------------------------------------------------------
# 1. Reload curated activity data
# ------------------------------------------------------------
df_activity = pd.read_csv(
    "data/AID743139/curated/AID743139_Activity_CID_20260205.csv"
)

# Extract CIDs as integers
cids = df_activity["cid"].astype(int).tolist()
print("# CIDs:", len(cids))


# ------------------------------------------------------------
# 2. Retrieve standardized PubChem SMILES
# ------------------------------------------------------------
chunk_size = 100
sleep_time = 0.2   # 5 calls per second
base_url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid"

df_chunks = []
failed_chunks = []

chunks = list(chunk_list(cids, size=chunk_size))
print("# Total chunks:", len(chunks))

for i, chunk in enumerate(tqdm(chunks)):
    
    cid_str = ",".join(str(x) for x in chunk)
    url = f"{base_url}/{cid_str}/property/SMILES/TXT"

    text = http.pubchem_get(url, return_json=False)
    
    if text is None:
        print(f"[Warning] Failed chunk {i}")
        failed_chunks.append(i)
        continue
    
    # Convert to DataFrame
    df_chunk = pd.read_csv(
        StringIO(text),
        header=None,
        names=["smiles"]
    )
    df_chunks.append(df_chunk)
    
    # Rate limit compliance
    time.sleep(sleep_time)


# ------------------------------------------------------------
# 3. Combine results
# ------------------------------------------------------------
df_smiles = pd.concat(df_chunks, ignore_index=True)
df_smiles["cid"] = cids

print("\nPreview of combined SMILES:")
print(df_smiles.head())

print(f"\nFailed chunks: {failed_chunks}")


# ------------------------------------------------------------
# 4. Save canonical SMILES dataset (overwrite-on-rerun)
# ------------------------------------------------------------
CUR = Path("data/AID743139/curated")
CUR.mkdir(parents=True, exist_ok=True)

fname = make_fixed_filename(
    stem="AID743139_pcSMILES_act",
    ext="csv",
    add_date=True
)

df_smiles.to_csv(CUR / fname, index=False)

print(f"[Saved] Curated PubChem SMILES → {CUR / fname}")
# CIDs: 6793
# Total chunks: 68
100%|███████████████████████████████████████████████████████████████████████████████████| 68/68 [05:54<00:00,  5.21s/it]
Preview of combined SMILES:
                                              smiles       cid
0  C(C(=O)[C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O.C(...  12850184
1  C([C@H]([C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O)O...     89753
2  C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@H]2OC(=O)CCC...      9403
3         C[C@@]12CC[C@@H](C1(C)C)C[C@H]2OC(=O)CSC#N  13218779
4                               CC(=CCC/C(=C/CO)/C)C    637566

Failed chunks: []
[Saved] Curated PubChem SMILES → data/AID743139/curated/AID743139_pcSMILES_act_20260205.csv

df_smiles.head()
smiles cid
0 C(C(=O)[C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O.C(... 12850184
1 C([C@H]([C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O)O... 89753
2 C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@H]2OC(=O)CCC... 9403
3 C[C@@]12CC[C@@H](C1(C)C)C[C@H]2OC(=O)CSC#N 13218779
4 CC(=CCC/C(=C/CO)/C)C 637566

5.1 Deal with salts and mixtures#

We will be generating fingerprints from the SMILES strings. To ensure the quality of the fingerprints, we will clean the SMILES strings to remove salts and keep only the largest fragment of each compound. This is important because salts and mixtures can introduce noise and variability into the fingerprint data, which can negatively impact model performance.

SaltRemover is a utility module within RDKit that removes counterions and small disconnected fragments from molecular structures. Many chemical records represent compounds in their salted or formulated forms (e.g., hydrochloride salts, sodium salts, solvates). While chemically valid, these extra components can interfere with structure-based analysis, fingerprint generation, and machine-learning workflows. The purpose of SaltRemover is therefore not to change the underlying chemical identity of a compound, but to isolate the core molecular structure that is typically responsible for biological activity.

In the following code, each SMILES string is first converted into an RDKit molecule object using Chem.MolFromSmiles. The SaltRemover is then applied via StripMol, which removes known salt fragments while ensuring that at least one fragment remains (dontRemoveEverything=True). After salt stripping, the molecule may still consist of multiple disconnected fragments (for example, residual solvents or small ions). To resolve this, the code explicitly extracts all fragments and retains only the largest fragment by atom count, under the assumption that this fragment represents the biologically relevant parent compound. Finally, the cleaned molecule is converted back into a SMILES string using Chem.MolToSmiles, and the result is stored as clean_smiles. The net effect is a standardized, salt-free representation of each compound suitable for fingerprint generation, comparison, and machine-learning analysis.

# reload the file you just saved and print the head to ensure it was correctly saved
import pandas as pd
df_smiles = pd.read_csv("data/AID743139/curated/AID743139_pcSMILES_act_20260205.csv")
df_smiles.head()
smiles cid
0 C(C(=O)[C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O.C(... 12850184
1 C([C@H]([C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O)O... 89753
2 C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@H]2OC(=O)CCC... 9403
3 C[C@@]12CC[C@@H](C1(C)C)C[C@H]2OC(=O)CSC#N 13218779
4 CC(=CCC/C(=C/CO)/C)C 637566
Code Dive

Consider this line from the following code cell:

df_smiles['clean_smiles'] = df_smiles['smiles'].apply(clean_smiles)
Explanation: In the code cell below we create a Python function called clean_smiles(). The apply method lets us apply it to the "smiles" series and generate a new "clean_smiles" series within the DataFrame, which is a column of salt-free SMILES. Although it behaves like a loop, it is written in a Pandas-native way that preserves index alignment and avoids manual iteration.

To identify how many SMILES needed to be cleaned, you could compare the SMILES in the ‘smiles’ column with those in the ‘clean_smiles’ column, but due to the noncanonical nature of SMILES, those generated by RDKit may differ from those obtained from PubChem even if the structure has undergone no modification.
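As a minimal illustration of the apply pattern on a toy series (the "cleaning" function here just uppercases the string, standing in for the real clean_smiles() function):

```python
import pandas as pd

# Toy stand-in for a cleaning function: here it simply uppercases the string.
def toy_clean(sm):
    return sm.upper()

s = pd.Series(["cco", "c1ccccc1"])
cleaned = s.apply(toy_clean)   # element-wise, index-aligned
print(cleaned.tolist())  # ['CCO', 'C1CCCCC1']
```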

from rdkit import Chem
from rdkit.Chem import SaltRemover

remover = SaltRemover.SaltRemover()

def clean_smiles(sm):
    mol = Chem.MolFromSmiles(sm)
    if mol is None:
        return None

    mol = remover.StripMol(mol, dontRemoveEverything=True)

    # keep only largest fragment
    frags = Chem.GetMolFrags(mol, asMols=True)
    mol = max(frags, key=lambda m: m.GetNumAtoms())

    return Chem.MolToSmiles(mol)

df_smiles['clean_smiles'] = df_smiles['smiles'].apply(clean_smiles)

print(df_smiles.head(5))
print(df_smiles.shape)
[12:31:32] WARNING: not removing hydrogen atom without neighbors
                                              smiles       cid  \
0  C(C(=O)[C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O.C(...  12850184   
1  C([C@H]([C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O)O...     89753   
2  C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@H]2OC(=O)CCC...      9403   
3         C[C@@]12CC[C@@H](C1(C)C)C[C@H]2OC(=O)CSC#N  13218779   
4                               CC(=CCC/C(=C/CO)/C)C    637566   

                                        clean_smiles  
0         O=C(CO)[C@@H](O)[C@H](O)[C@@H](O)C(=O)[O-]  
1       O=C([O-])[C@H](O)[C@@H](O)[C@H](O)[C@H](O)CO  
2  C[C@]12CC[C@@H]3c4ccc(O)cc4CC[C@H]3[C@@H]1CC[C...  
3       CC1(C)[C@@H]2CC[C@@]1(C)[C@H](OC(=O)CSC#N)C2  
4                                CC(C)=CCC/C(C)=C/CO  
(6793, 3)
Check Your Understanding

In this section, we transformed the original SMILES strings into clean_smiles by stripping salts and retaining only the largest fragment. This step is often treated as “routine preprocessing,” but it is better understood as a deliberate modeling decision that changes the data we carry forward into fingerprint generation. The goal here is not to memorize a single correct approach, but to recognize that multiple reasonable approaches exist, and that the approach you choose should be justified and recorded. Keep in mind that this decision does not only affect the data we are using now: any future dataset scored with the model we build from these fingerprints will implicitly be assumed to have been processed in the same way.

What assumptions did we make? — recognizing that a decision was made

  • What is more important for modeling purposes: the full formulation, or the parent structure?
  • What assumptions are embedded in the phrase “biologically relevant parent compound”?
  • How might the meaning of “clean” change depending on the assay context, the intended model, or the chemistry?

Why documentation may matter more than the decision itself

  • If someone else reproduces this work, what do they need to know to obtain the same clean_smiles values?
  • What information about the original records is lost once we overwrite or replace the original SMILES with cleaned structures?
  • How could a future comparison between two models be confounded if they used different salt or mixture handling rules?
  • Why is it important to keep versioned files as the dataset evolves through these transformations?

Looking ahead: applying the model to new data

  • When this model is applied to new compounds, will those SMILES need to be cleaned in the same way?
  • What could happen if new data represent salts or mixtures differently than the training data?
  • How might inconsistent preprocessing affect fingerprint generation and model predictions?
  • What parts of this preprocessing step should eventually be treated as part of the model pipeline rather than as one-time data cleaning?

The main takeaway is that preprocessing steps create a new version of your dataset and define the assumptions your model will carry forward. Once fingerprints are generated, the model’s behavior depends not only on the algorithm, but also on whether future data are transformed in a compatible way.

df_smiles['is_identical'] = df_smiles['smiles'] == df_smiles['clean_smiles']
print(df_smiles['is_identical'].sum())
df_smiles['is_identical'].value_counts()
778
is_identical
False    6015
True      778
Name: count, dtype: int64
# make sure the number of rows in df_smiles is equal to the number of unique CIDs we had previously
df_activity = pd.read_csv("data/AID743139/curated/AID743139_Activity_CID_20260205.csv")
print("Number of CIDs in df_smiles:", len(df_smiles))
print("Number of unique CIDs in original df:", len(df_activity['cid'].unique()))

print("The number of CIDs in df_smiles is equal to the number of unique CIDs in the original df:",
       len(df_smiles) == len(df_activity['cid'].unique()))
df_smiles.head()
Number of CIDs in df_smiles: 6793
Number of unique CIDs in original df: 6793
The number of CIDs in df_smiles is equal to the number of unique CIDs in the original df: True
smiles cid clean_smiles is_identical
0 C(C(=O)[C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O.C(... 12850184 O=C(CO)[C@@H](O)[C@H](O)[C@@H](O)C(=O)[O-] False
1 C([C@H]([C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O)O... 89753 O=C([O-])[C@H](O)[C@@H](O)[C@H](O)[C@H](O)CO False
2 C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@H]2OC(=O)CCC... 9403 C[C@]12CC[C@@H]3c4ccc(O)cc4CC[C@H]3[C@@H]1CC[C... False
3 C[C@@]12CC[C@@H](C1(C)C)C[C@H]2OC(=O)CSC#N 13218779 CC1(C)[C@@H]2CC[C@@]1(C)[C@H](OC(=O)CSC#N)C2 False
4 CC(=CCC/C(=C/CO)/C)C 637566 CC(C)=CCC/C(C)=C/CO False
Check Your Understanding

After generating clean_smiles, we compared the original SMILES strings to the cleaned ones and found that only 778 entries were identical, while 6,015 were different. At first glance, it is tempting to conclude that this means 6,015 compounds were “changed” during cleaning. This check is useful, but it is also an opportunity to think carefully about what our comparison is actually measuring.

What question does this comparison really answer?

  • Does a difference between smiles and clean_smiles always mean the underlying chemical structure changed?
  • What assumptions are we making when we treat SMILES strings as if they were canonical identifiers?
  • How might RDKit-generated SMILES differ from PubChem SMILES even when no salt or fragment was removed?

Why this result can be misleading at first glance

  • SMILES are not unique representations; the same molecule can be written in multiple valid ways.
  • RDKit often reorders atoms or normalizes bonding when generating SMILES, even if the structure is unchanged.
  • As a result, a False value here may reflect a change in representation, not a change in chemistry.

Connecting this to data provenance and decision making

  • What evidence do we actually need to support the claim that a compound was chemically altered during cleaning?
  • How does the choice of representation affect our interpretation of “before” and “after” datasets?
  • Why is it important to distinguish between a structural change and a formatting or normalization change?

Looking ahead to better comparisons

  • What properties of a molecule should remain invariant if the underlying structure is truly the same?
  • How might a representation designed to be canonical help answer this question more reliably?
  • In later steps, how could misleading checks like this propagate confusion if they are not carefully interpreted?

The key takeaway is that comparing SMILES strings is a convenient first check, but it is not a definitive test of whether a compound changed during preprocessing. Understanding what a comparison actually measures is just as important as running the comparison itself, especially when those results inform downstream modeling decisions.

5.2 Comparing structures using InChI#

In the following code cell we will convert both the original and cleaned SMILES strings to InChIKeys using RDKit. InChIKeys are unique, fixed-length identifiers derived from the International Chemical Identifier (InChI) representation of a molecule. They provide a standardized way to compare chemical structures, as two identical molecules will produce the same InChIKey.

from rdkit import Chem

def smiles_to_inchi(sm):
    mol = Chem.MolFromSmiles(sm)
    if mol is None:
        return None
    return Chem.MolToInchi(mol)

def smiles_to_inchikey(sm):
    mol = Chem.MolFromSmiles(sm)
    if mol is None:
        return None
    return Chem.MolToInchiKey(mol)

df_smiles['inchikey_original'] = df_smiles['smiles'].apply(smiles_to_inchikey)
df_smiles['inchikey_clean']    = df_smiles['clean_smiles'].apply(smiles_to_inchikey)

df_smiles['is_identical'] = (
    df_smiles['inchikey_original'] == df_smiles['inchikey_clean']
)

print(df_smiles['is_identical'].sum())
df_smiles['is_identical'].value_counts()
[12:33:40] WARNING: not removing hydrogen atom without neighbors
5336
is_identical
True     5336
False    1457
Name: count, dtype: int64
Check Your Understanding

We repeated the comparison using InChI-derived identifiers and found that 5,336 entries were identical, while 1,457 were different. Discuss why this result looks very different from the SMILES-based comparison; it points to an important concept in cheminformatics: not all molecular representations are intended to answer the same question.

Why does the InChI comparison behave differently?

  • InChI (International Chemical Identifier) is an IUPAC-led, open-source standard designed to uniquely describe a chemical structure.
  • Two molecules that are chemically identical should, by design, generate the same InChI, even if their SMILES strings differ.
  • This makes InChI (or InChIKey) much better suited for identity checks than raw SMILES strings.

What this tells us about SMILES

  • SMILES are flexible, human-readable encodings, not unique identifiers.
  • There is no single, universal “canonical SMILES” standard; different software packages use different canonicalization algorithms.
  • As a result, two SMILES strings can differ even when they describe the same molecule.

A well-known illustration of this flexibility is that there are thousands of valid SMILES representations for a single molecule such as caffeine. For a fun and eye-opening example, see this blog post: How do I write thee? Let me count the ways.

Key takeaway

  • SMILES are excellent for storing structures and generating fingerprints, but they are not reliable for determining molecular identity.
  • InChI was created specifically to solve this problem and should be used when the question is “are these two molecules the same?”
  • Choosing the right representation is not a modeling decision that changes the data, but a correctness issue that affects how results are interpreted.

The main lesson here is that tools in cheminformatics are purpose-built. Understanding what question a representation is designed to answer is essential for using it correctly and avoiding misleading conclusions.
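A pure-string illustration of the pitfall: two valid SMILES spellings of ethanol compare unequal as text even though they describe the same molecule, which is exactly why identity checks need a canonical representation such as an InChIKey.

```python
# Two valid SMILES spellings of ethanol.
smiles_a = "CCO"
smiles_b = "OCC"

# Naive string comparison says they differ...
print(smiles_a == smiles_b)  # False

# ...even though both describe the same molecule. A canonical
# representation (e.g., an InChIKey) is needed for identity checks.
```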

Our last step before fingerprint generation will be to drop the columns we generated while analyzing the data and save just the cid and clean_smiles columns for creating fingerprints.

df_smiles.head()
smiles cid clean_smiles is_identical inchikey_original inchikey_clean
0 C(C(=O)[C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O.C(... 12850184 O=C(CO)[C@@H](O)[C@H](O)[C@@H](O)C(=O)[O-] False HNXKHQGRMMZMKP-UINDWOIOSA-L IZSRJDGCGRAUAR-MROZADKFSA-M
1 C([C@H]([C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O)O... 89753 O=C([O-])[C@H](O)[C@@H](O)[C@H](O)[C@H](O)CO False PECPTSUHUFWKDG-IYEMJOQQSA-L RGHNJXZEOKUKBD-SQOUGZDYSA-M
2 C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@H]2OC(=O)CCC... 9403 C[C@]12CC[C@@H]3c4ccc(O)cc4CC[C@H]3[C@@H]1CC[C... True UOACKFBJUYNSLK-XRKIENNPSA-N UOACKFBJUYNSLK-XRKIENNPSA-N
3 C[C@@]12CC[C@@H](C1(C)C)C[C@H]2OC(=O)CSC#N 13218779 CC1(C)[C@@H]2CC[C@@]1(C)[C@H](OC(=O)CSC#N)C2 True IXEVGHXRXDBAOB-BREBYQMCSA-N IXEVGHXRXDBAOB-BREBYQMCSA-N
4 CC(=CCC/C(=C/CO)/C)C 637566 CC(C)=CCC/C(C)=C/CO True GLZPCOQZEFWAFX-JXMROGBWSA-N GLZPCOQZEFWAFX-JXMROGBWSA-N
# reorder columns to have 'cid' first, then 'clean_smiles' and remove columns used to check for differences

df_smiles = df_smiles[['cid','clean_smiles']]
print(df_smiles.shape)
print(df_smiles.dtypes)
df_smiles.head()
(6793, 2)
cid              int64
clean_smiles    object
dtype: object
cid clean_smiles
0 12850184 O=C(CO)[C@@H](O)[C@H](O)[C@@H](O)C(=O)[O-]
1 89753 O=C([O-])[C@H](O)[C@@H](O)[C@H](O)[C@H](O)CO
2 9403 C[C@]12CC[C@@H]3c4ccc(O)cc4CC[C@H]3[C@@H]1CC[C...
3 13218779 CC1(C)[C@@H]2CC[C@@]1(C)[C@H](OC(=O)CSC#N)C2
4 637566 CC(C)=CCC/C(C)=C/CO

Save as AID743139_noSalt_pcSMILES…#

Save the dataframe in your curated data folder with date stamp and version number using prefix="AID743139_noSalt_pcSMILES".

# write your code here:
# Save to data/ with versioning
from pathlib import Path
from cinf26pk.core import make_fixed_filename

# Generate path (relative to notebook)
CUR = Path("data/AID743139/curated")
CUR.mkdir(parents=True, exist_ok=True)

# Create deterministic filename with date stamp
fname = make_fixed_filename(
    stem="AID743139_noSalt_pcSMILES",
    ext="csv",
    add_date=True
)

# Save curated dataset
df_smiles.to_csv(CUR / fname, index=False)

print(f"[Saved] Curated PubChem data → {CUR / fname}")
[Saved] Curated PubChem data → data/AID743139/curated/AID743139_noSalt_pcSMILES_20260205.csv
Check your understanding
  1. When we requested the SMILES data from PubChem through PUG REST, why did we chunk our data and use time.sleep(.2)?

  2. Briefly explain the steps and importance for cleaning the data in the notebook thus far.

  3. How many compounds were salts or multicomponent mixtures and how did we handle them?

Answer
  1. As noted in the Introduction to PUG REST notebook, PubChem enforces a 30-second time limit per request. If we attempt to fetch properties for too many compounds at once, the request may exceed this limit and result in a timeout error. Since our dataset includes nearly 7,000 molecules, we split the list of CIDs into smaller chunks and process each chunk separately. The time.sleep(.2) call between requests keeps us at no more than five requests per second, in line with PubChem's usage policy. Together, these measures keep each request efficient, reduce the likelihood of failure, and ensure smooth and successful data retrieval from the API.

  2. Cleaning the data has involved several key steps: removing entries without CIDs (since structural information cannot be retrieved without them), resolving conflicting activity labels to ensure each compound is consistently classified, and converting text-based activity labels into numerical values suitable for machine learning. These steps are critical to ensure that the model is trained on accurate, consistent, and well-structured data, which improves its performance and prevents errors during modeling.

  3. When we compared SMILES from PubChem with the salt-stripped SMILES generated by RDKit, we found 6,015 that were not equal. But SMILES are not canonical, and RDKit and PubChem can generate different SMILES for the same molecule. So we then converted them to InChIKeys, which are canonical, and found that 1,457 had actually changed. For these compounds, the cleaning step kept only the fragment with the largest number of atoms, assuming this represented the parent compound.
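The chunk count and the minimum wait imposed by rate limiting can be checked with quick arithmetic (numbers taken from the retrieval run above):

```python
import math

n_cids = 6793
chunk_size = 100

# Number of PUG REST requests needed to cover all CIDs.
n_chunks = math.ceil(n_cids / chunk_size)
print(n_chunks)  # 68

# At 0.2 s sleep per chunk, rate limiting alone adds ~13.6 s of wait time.
print(round(n_chunks * 0.2, 1))  # 13.6
```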

6 Convert SMILES to Binary Data for Model Input#

6.1 Generate MACCS keys from SMILES#

At this point in the workflow, we have completed the chemical and biological curation of the dataset. We have:

  • Selected which compounds to include

  • Resolved conflicting assay outcomes

  • Defined a binary activity label

  • Standardized molecular representations by removing salts and mixtures

The remaining task in this module is to translate chemical structure into a numerical form that a machine-learning algorithm can consume. To do this, we convert each cleaned SMILES string into an RDKit molecule object and generate MACCS structural keys. Each MACCS key encodes the presence (1) or absence (0) of a predefined substructure pattern. The result is a fixed-length binary vector that represents chemical structure in a form suitable for supervised learning. Rather than working with RDKit bitvector objects directly, we explicitly convert these fingerprints into tabular numerical data and save them to disk. The resulting file contains:

  • Compound identifiers (cid)

  • The curated binary activity label (activity)

  • The cleaned SMILES representation (clean_smiles)

  • One column for each MACCS key (e.g., MACCS000 through MACCS166)

This file represents the final output of Module 10.1. It serves as the starting point for:

  • Model construction in 10.2 Naive Bayes: Model Construction Workflow

  • Pipeline construction, reuse, and inference in 10.3 Pipelines and Inference

Once fingerprints are generated, the model no longer “sees” molecules; it only sees numbers. Any chemical assumptions made earlier (salt removal, fragment selection, fingerprint choice) are now frozen into the feature representation. This is why the outputs of this step are treated as durable artifacts and reused across later modeling workflows. By saving this representation explicitly, we make the transition from “chemical data preparation” to “machine-learning experimentation” clear, reproducible, and inspectable. Our directory architecture will support this by moving the fingerprints out of data/curated and into a new data/features directory.

# load ..._noSalt_pcSMILES... into df_smiles if you restarted the kernel
df_smiles.head()
cid clean_smiles
0 12850184 O=C(CO)[C@@H](O)[C@H](O)[C@@H](O)C(=O)[O-]
1 89753 O=C([O-])[C@H](O)[C@@H](O)[C@H](O)[C@H](O)CO
2 9403 C[C@]12CC[C@@H]3c4ccc(O)cc4CC[C@H]3[C@@H]1CC[C...
3 13218779 CC1(C)[C@@H]2CC[C@@]1(C)[C@H](OC(=O)CSC#N)C2
4 637566 CC(C)=CCC/C(C)=C/CO
from rdkit import Chem
from rdkit.Chem import MACCSkeys

fps = dict()

for idx, row in df_smiles.iterrows():

    mol = Chem.MolFromSmiles(row.clean_smiles)

    if mol is None:
        print("Can't generate MOL object:", "CID", row.cid, row.clean_smiles)
    else:
        fps[row.cid] = [row.cid] + list(MACCSkeys.GenMACCSKeys(mol).ToBitString())

print(f"Dictionary contains {len(fps)} key–value pairs")

# Print first two key-value pairs of the dictionary fps
def dict_head(d, n=2):
    return list(d.items())[:n]
print(dict_head(fps))
Dictionary contains 6793 key–value pairs
[(12850184, [12850184, '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '1', '0', '0', '0', '1', '1', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '1', '0', '0', '0', '0', '0', '0', '0', '0', '0', '1', '0', '0', '0', '0', '0', '0', '1', '1', '1', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '1', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '1', '0', '0', '0', '0', '0', '0', '0', '1', '1', '0', '0', '0', '1', '0', '0', '1', '1', '0', '0', '0', '0', '0', '1', '0', '0', '0', '0', '0', '1', '1', '1', '1', '0', '1', '0', '1', '0', '0', '0', '0', '1', '0', '0']), (89753, [89753, '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '1', '0', '0', '0', '1', '1', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '1', '0', '0', '0', '0', '0', '0', '0', '0', '0', '1', '0', '0', '0', '0', '0', '0', '1', '1', '1', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '1', '0', '0', '0', '0', '1', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '1', '0', '0', '0', '0', '0', '0', '0', '1', '1', '0', '0', '0', '0', '0', '0', '1', '1', '0', '0', '0', '0', '0', '1', '0', '0', '0', '0', '0', '1', '1', '1', '1', '0', '1', '0', '1', '0', '0', '0', '0', '1', '0', '0'])]
# Create a dataframe of the fingerprints generated from the noSalt SMILES
# Note: any CIDs whose fingerprints could not be generated are absent from fps
# Generate column names
fpbitnames = []

fpbitnames.append('cid')

for i in range(0,167):   # from MACCS000 to MACCS166
    fpbitnames.append( "maccs" + str(i).zfill(3) )

df_fps = pd.DataFrame.from_dict(fps, orient='index', columns=fpbitnames)

print(df_fps.shape)
df_fps.head(5)
(6793, 168)
cid maccs000 maccs001 maccs002 maccs003 maccs004 maccs005 maccs006 maccs007 maccs008 ... maccs157 maccs158 maccs159 maccs160 maccs161 maccs162 maccs163 maccs164 maccs165 maccs166
12850184 12850184 0 0 0 0 0 0 0 0 0 ... 1 0 1 0 0 0 0 1 0 0
89753 89753 0 0 0 0 0 0 0 0 0 ... 1 0 1 0 0 0 0 1 0 0
9403 9403 0 0 0 0 0 0 0 0 0 ... 1 0 1 1 0 1 1 1 1 0
13218779 13218779 0 0 0 0 0 0 0 0 0 ... 1 0 1 1 1 0 1 1 1 0
637566 637566 0 0 0 0 0 0 0 0 0 ... 1 0 0 1 0 0 0 1 0 0

5 rows × 168 columns
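The `orient='index'` argument tells pandas to treat each dictionary key as a row label and each value list as that row's cells. A toy sketch of the same pattern (hypothetical CIDs, two fingerprint bits instead of 166):

```python
import pandas as pd

# Toy version of the fps dictionary: cid -> [cid, bit0, bit1]
toy_fps = {
    101: [101, '0', '1'],
    202: [202, '1', '0'],
}

# Each key becomes a row label; each list fills that row's columns
df_toy = pd.DataFrame.from_dict(toy_fps, orient='index',
                                columns=['cid', 'maccs000', 'maccs001'])
print(df_toy.shape)   # (2, 3); the dict keys become the index
```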

# Merge DataFrames (attach clean_smiles to the end of df_fps)
df_fps = df_fps.merge(
    df_smiles[['cid', 'clean_smiles']],
    on='cid',
    how='left'
)
df_fps.head(3)
cid maccs000 maccs001 maccs002 maccs003 maccs004 maccs005 maccs006 maccs007 maccs008 ... maccs158 maccs159 maccs160 maccs161 maccs162 maccs163 maccs164 maccs165 maccs166 clean_smiles
0 12850184 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 1 0 0 O=C(CO)[C@@H](O)[C@H](O)[C@@H](O)C(=O)[O-]
1 89753 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 1 0 0 O=C([O-])[C@H](O)[C@@H](O)[C@H](O)[C@H](O)CO
2 9403 0 0 0 0 0 0 0 0 0 ... 0 1 1 0 1 1 1 1 0 C[C@]12CC[C@@H]3c4ccc(O)cc4CC[C@H]3[C@@H]1CC[C...

3 rows × 169 columns

# Arrange columns (move cid and clean_smiles to the front)
front = ['cid', 'clean_smiles']
rest = [c for c in df_fps.columns if c not in front]

df_fps = df_fps[front + rest]

df_fps.head(3)
cid clean_smiles maccs000 maccs001 maccs002 maccs003 maccs004 maccs005 maccs006 maccs007 ... maccs157 maccs158 maccs159 maccs160 maccs161 maccs162 maccs163 maccs164 maccs165 maccs166
0 12850184 O=C(CO)[C@@H](O)[C@H](O)[C@@H](O)C(=O)[O-] 0 0 0 0 0 0 0 0 ... 1 0 1 0 0 0 0 1 0 0
1 89753 O=C([O-])[C@H](O)[C@@H](O)[C@H](O)[C@H](O)CO 0 0 0 0 0 0 0 0 ... 1 0 1 0 0 0 0 1 0 0
2 9403 C[C@]12CC[C@@H]3c4ccc(O)cc4CC[C@H]3[C@@H]1CC[C... 0 0 0 0 0 0 0 0 ... 1 0 1 1 0 1 1 1 1 0

3 rows × 169 columns
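The merge-then-reorder pattern above generalizes: a left merge keeps every row of the left frame and appends the new columns at the end, and reindexing with `front + rest` moves chosen columns to the front without dropping any. A self-contained toy sketch (hypothetical column names):

```python
import pandas as pd

left = pd.DataFrame({'cid': [1, 2], 'bit0': [0, 1]})
right = pd.DataFrame({'cid': [1, 2], 'smiles': ['CCO', 'CC']})

merged = left.merge(right, on='cid', how='left')   # smiles lands at the end

front = ['cid', 'smiles']
rest = [c for c in merged.columns if c not in front]
merged = merged[front + rest]                      # move key columns to the front

print(list(merged.columns))   # ['cid', 'smiles', 'bit0']
```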

Save as features/AID743139_MACCS_noSalt…#

Let's now save this in a new folder, data/AID743139/features, and give it a versioned name, AID743139_MACCS_noSalt.

from pathlib import Path
from cinf26pk.core import make_fixed_filename

# Generate path (relative to notebook)
FEATURES = Path("data/AID743139/features")
FEATURES.mkdir(parents=True, exist_ok=True)

# Create deterministic filename with date stamp
fname = make_fixed_filename(
    stem="AID743139_MACCS_noSalt",
    ext="csv",
    add_date=True
)

# Save MACCS fingerprint dataset
df_fps.to_csv(FEATURES / fname, index=False)

print(f"[Saved] MACCS features → {FEATURES / fname}")
[Saved] MACCS features → data/AID743139/features/AID743139_MACCS_noSalt_20260205.csv
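`make_fixed_filename` comes from the course package `cinf26pk`, so its exact behavior may differ, but a minimal stand-in that produces the same kind of date-stamped name might look like this (the function body here is an assumption, not the course implementation):

```python
from datetime import date

def make_fixed_filename_sketch(stem: str, ext: str, add_date: bool = True) -> str:
    """Hypothetical stand-in for cinf26pk.core.make_fixed_filename."""
    if add_date:
        # Append today's date as YYYYMMDD before the extension
        return f"{stem}_{date.today():%Y%m%d}.{ext}"
    return f"{stem}.{ext}"

name = make_fixed_filename_sketch("AID743139_MACCS_noSalt", "csv")
print(name)   # e.g. AID743139_MACCS_noSalt_20260205.csv
```

Deterministic, date-stamped names like this are what let you trace each saved file back to the notebook run that produced it.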

6.2 Merge activity data and fingerprint information#

Note: you may need to load the data into df_activity, which you generated in step 4.6.1 (with the file prefix AID743139_Activity_CID_).

#df_activity=pd.read_csv("data/AID743139/curated/AID743139_Activity_CID_20260118_v1.csv")

print(df_activity.shape)
df_activity.head(3)
(6793, 3)
cid activity PUBCHEM_EXT_DATASOURCE_SMILES
0 12850184 0 C(C(=O)[C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O.C(...
1 89753 0 C([C@H]([C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O)O...
2 9403 0 C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@H]2OC(=O)CCC...
print(df_fps.shape)
df_fps.head(3)
(6793, 169)
cid clean_smiles maccs000 maccs001 maccs002 maccs003 maccs004 maccs005 maccs006 maccs007 ... maccs157 maccs158 maccs159 maccs160 maccs161 maccs162 maccs163 maccs164 maccs165 maccs166
0 12850184 O=C(CO)[C@@H](O)[C@H](O)[C@@H](O)C(=O)[O-] 0 0 0 0 0 0 0 0 ... 1 0 1 0 0 0 0 1 0 0
1 89753 O=C([O-])[C@H](O)[C@@H](O)[C@H](O)[C@H](O)CO 0 0 0 0 0 0 0 0 ... 1 0 1 0 0 0 0 1 0 0
2 9403 C[C@]12CC[C@@H]3c4ccc(O)cc4CC[C@H]3[C@@H]1CC[C... 0 0 0 0 0 0 0 0 ... 1 0 1 1 0 1 1 1 1 0

3 rows × 169 columns

# For each row in df_activity, attach the fingerprint
# columns from df_fps that share the same CID
df_data = df_activity.join(df_fps.set_index('cid'), on='cid')
df_data.head(3)
cid activity PUBCHEM_EXT_DATASOURCE_SMILES clean_smiles maccs000 maccs001 maccs002 maccs003 maccs004 maccs005 ... maccs157 maccs158 maccs159 maccs160 maccs161 maccs162 maccs163 maccs164 maccs165 maccs166
0 12850184 0 C(C(=O)[C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O.C(... O=C(CO)[C@@H](O)[C@H](O)[C@@H](O)C(=O)[O-] 0 0 0 0 0 0 ... 1 0 1 0 0 0 0 1 0 0
1 89753 0 C([C@H]([C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O)O... O=C([O-])[C@H](O)[C@@H](O)[C@H](O)[C@H](O)CO 0 0 0 0 0 0 ... 1 0 1 0 0 0 0 1 0 0
2 9403 0 C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@H]2OC(=O)CCC... C[C@]12CC[C@@H]3c4ccc(O)cc4CC[C@H]3[C@@H]1CC[C... 0 0 0 0 0 0 ... 1 0 1 1 0 1 1 1 1 0

3 rows × 171 columns

# Drop the original PubChem source SMILES (clean_smiles is retained)
df_data = df_data.drop(columns=['PUBCHEM_EXT_DATASOURCE_SMILES'])
print(df_data.shape)
df_data.head(3)
(6793, 170)
cid activity clean_smiles maccs000 maccs001 maccs002 maccs003 maccs004 maccs005 maccs006 ... maccs157 maccs158 maccs159 maccs160 maccs161 maccs162 maccs163 maccs164 maccs165 maccs166
0 12850184 0 O=C(CO)[C@@H](O)[C@H](O)[C@@H](O)C(=O)[O-] 0 0 0 0 0 0 0 ... 1 0 1 0 0 0 0 1 0 0
1 89753 0 O=C([O-])[C@H](O)[C@@H](O)[C@H](O)[C@H](O)CO 0 0 0 0 0 0 0 ... 1 0 1 0 0 0 0 1 0 0
2 9403 0 C[C@]12CC[C@@H]3c4ccc(O)cc4CC[C@H]3[C@@H]1CC[C... 0 0 0 0 0 0 0 ... 1 0 1 1 0 1 1 1 1 0

3 rows × 170 columns
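`DataFrame.join` with `on='cid'` matches the left frame's cid column against the right frame's index, which is why df_fps is reindexed by cid first; for this data the result is equivalent to a left merge on cid. A toy sketch illustrating the equivalence:

```python
import pandas as pd

activity = pd.DataFrame({'cid': [1, 2], 'activity': [0, 1]})
fps = pd.DataFrame({'cid': [1, 2], 'bit0': [0, 1]})

# join: left column 'cid' against the right frame's index
joined = activity.join(fps.set_index('cid'), on='cid')

# merge: same left join expressed with a shared key column
merged = activity.merge(fps, on='cid', how='left')

print(joined.equals(merged))   # True: the two approaches agree here
```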

Check to see if there are any CIDs for which the MACCS keys could not be generated. They need to be removed from df_data.

df_data[df_data.isna().any(axis=1)]
cid activity clean_smiles maccs000 maccs001 maccs002 maccs003 maccs004 maccs005 maccs006 ... maccs157 maccs158 maccs159 maccs160 maccs161 maccs162 maccs163 maccs164 maccs165 maccs166

0 rows × 170 columns

print(f"before drop nan: {len(df_data)}")
df_data = df_data.dropna()
print(f"after drop nan: {len(df_data)}")
df_data.head(3)
before drop nan: 6793
after drop nan: 6793
cid activity clean_smiles maccs000 maccs001 maccs002 maccs003 maccs004 maccs005 maccs006 ... maccs157 maccs158 maccs159 maccs160 maccs161 maccs162 maccs163 maccs164 maccs165 maccs166
0 12850184 0 O=C(CO)[C@@H](O)[C@H](O)[C@@H](O)C(=O)[O-] 0 0 0 0 0 0 0 ... 1 0 1 0 0 0 0 1 0 0
1 89753 0 O=C([O-])[C@H](O)[C@@H](O)[C@H](O)[C@H](O)CO 0 0 0 0 0 0 0 ... 1 0 1 0 0 0 0 1 0 0
2 9403 0 C[C@]12CC[C@@H]3c4ccc(O)cc4CC[C@H]3[C@@H]1CC[C... 0 0 0 0 0 0 0 ... 1 0 1 1 0 1 1 1 1 0

3 rows × 170 columns
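`isna().any(axis=1)` flags rows containing at least one missing value; here none were found, so `dropna()` removed nothing. A toy sketch with a deliberately missing fingerprint bit shows both steps doing work:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'cid': [1, 2, 3],
                   'maccs000': [0, np.nan, 1]})

flagged = df[df.isna().any(axis=1)]   # rows with at least one NaN
cleaned = df.dropna()                 # rows with no missing values

print(len(flagged))   # 1 (the row for cid 2)
print(len(cleaned))   # 2
```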

Save df_data in CSV for future use using the prefix=”AID743139_MACCS_activites_noSalt”

from pathlib import Path
from cinf26pk.core import make_fixed_filename

# Generate path (relative to notebook)
FEAT = Path("data/AID743139/features")
FEAT.mkdir(parents=True, exist_ok=True)

# Create deterministic filename with date stamp
fname = make_fixed_filename(
    stem="AID743139_MACCS_activities_noSalt",
    ext="csv",
    add_date=True
)

# Save feature dataset
df_data.to_csv(FEAT / fname, index=False)

print(f"[Saved] Features CSV data → {FEAT / fname}")
[Saved] Features CSV data → data/AID743139/features/AID743139_MACCS_activities_noSalt_20260205.csv

At this point, we have completed the primary objective of this module: transforming curated chemical and biological data into a feature-level dataset suitable for supervised machine learning. The file saved in data/AID743139/features/ captures the outcome of all decisions made in this module—compound selection, activity labeling, structural standardization, and fingerprint representation—encoded as numerical input that models can consume. Importantly, this file is not an end product, but a handoff artifact. In the next modules, it will be reused to construct models, build pipelines, and perform inference. Likewise, in the homework, you will repeat this workflow with a different assay and explore an alternative fingerprinting strategy, reinforcing the idea that feature generation is one step in a broader, repeatable process rather than a one-off task. The goal is not to produce a single “correct” dataset, but to understand how each representation fits into a reproducible machine-learning pipeline.

Homework#

Problem 1: hw_10.1a_fp.ipynb

Create a new notebook in your notebooks directory called hw_10.1a_fp.ipynb and use it to create a new .csv fingerprint file in the /data/AID743139/features directory. You must get instructor approval of the fingerprint type and parameters before proceeding, and you must use the modern generator function approach to create the fingerprints.

The file naming convention needs to identify:

  • AID
  • Fingerprint type (ecfp, fcfp...)
  • Algorithm parameters (radius, size)
  • Curation decisions
  • Date and version
Example: AID743139ecfp4_1024_activities_noSalt_20260111_v1, which describes the first version of a 1024-bit extended-connectivity fingerprint (ECFP4, i.e., Morgan radius 2) of desalted compounds with binary activity data for AID743139, generated on January 11, 2026.
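The "modern generator function approach" refers to RDKit's `rdFingerprintGenerator` module, which replaces the older per-fingerprint helper functions. A minimal sketch of a Morgan (ECFP-style) generator follows; the radius and fpSize values here are illustrative only, not the approved homework parameters:

```python
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

# Illustrative parameters only; get instructor approval for your actual choices
gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=1024)

mol = Chem.MolFromSmiles('c1ccccc1O')   # phenol, an arbitrary test molecule
fp = gen.GetFingerprint(mol)            # ExplicitBitVect of length 1024

print(fp.GetNumBits())                  # 1024
```

One generator object can fingerprint every molecule in your dataset, which keeps the parameters in a single place and makes them easy to record in the filename.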

Problem 2: hw_10.1_data_prep.ipynb

Create a new notebook in your notebooks directory called hw_10.1_data_prep.ipynb and repeat the work of this notebook with the bioassay you chose in your first homework assignment (AID999999 in the directory structure below). You need to generate the directory structure with your notebook, the way we did it with this one.


10_SupervisedML/
├── data/
│   ├── AID743139/
│   └── AID999999/   (your AID)
└── models/
    ├── AID743139/
    └── AID999999/   (your AID)