10.1: Data Preparation and Feature Engineering#

Learning Objectives

Purpose Introduce students to the data preparation steps needed to create a dataset suitable for machine learning, and to feature engineering and the generation of binary fingerprints from SMILES strings.

Students Learn

  • How to obtain BioAssay data from PubChem

  • How to curate the data

  • How to save data and maintain the data’s provenance

Core Activities

  • Load PubChem Data

  • Curate compounds (salts, largest fragments, conflicts…)

  • Generate fingerprints

  • Save compounds and metadata

Prior Knowledge

  • Programmatic access to PubChem

  • Familiarity with SMILES and InChI

  • Familiarity with RDKit fingerprint generation

  • Familiarity with Pandas

Reading links

1. Import bioactivity data from PubChem#

In this notebook, we will develop a prediction model for a small molecule’s activity against human aromatase (https://pubchem.ncbi.nlm.nih.gov/protein/EAW77416), which is encoded by the CYP19A1 gene (https://pubchem.ncbi.nlm.nih.gov/gene/1588). The model will predict the activity of a molecule based on the structure of the molecule (represented with molecular fingerprints).

Endocrine disrupting chemicals (EDCs) interfere with the biosynthesis and normal functions of steroid hormones including estrogen and androgen in the body. Aromatase catalyzes the conversion of androgen to estrogen and plays a key role in maintaining the androgen and estrogen balance in many of the EDC-sensitive organs.

For model development, we will use the Tox21 bioassay data for human aromatase, archived in PubChem (https://pubchem.ncbi.nlm.nih.gov/bioassay/743139). The bioactivity data presented on that page can be downloaded by clicking the “Download” button and then reading the file into a data frame. Alternatively, you can load the data directly into a data frame as shown in the cell below.

import pandas as pd
import numpy as np

url = 'https://pubchem.ncbi.nlm.nih.gov/assay/pcget.cgi?query=download&record_type=datatable&actvty=all&response_type=save&aid=743139'
df_raw = pd.read_csv(url)
df_raw.shape
(10489, 16)

1.1 Directory Architecture#

In this activity we are going to generate multiple fingerprints and eventually run different ML programs on different BioAssays. Below is a sample directory hierarchy for each bioassay; in this class it will live within your module working directory (10_SupervisedML).

10_SupervisedML/
├── 10_introduction.ipynb
├── 10_1_data_prep.ipynb
├── 10_2_NB_model_construction_workflow.ipynb
├── 10_3_model_eval_interpretation.ipynb
├── 10_4_pipelines_and_inference.ipynb
├── 10_5_model_comparison_exp_design.ipynb
├── data/                      ← algorithm-agnostic artifacts
│   └── AID743139/
│       ├── raw/               ← raw PubChem exports
│       │   └── AID743139_pubchem_raw_20251224.csv
│       ├── curated/           ← chemically curated datasets
│       │   └── AID743139_Activity_CID_20260111_v1.csv
│       ├── features/          ← feature-level artifacts (post-representation)
│       │   ├── AID743139_MACCS_activities_noSalt_20260104_v1.csv
│       │   ├── maccs_variance_mask_20260104_v1.npy
│       │   └── feature_metadata_20260104_v1.json
│       └── splits/            ← experimental splits
│           └── 90_10/
│               ├── arrays/
│               │   ├── X_train.npy
│               │   ├── X_test.npy
│               │   ├── y_train.npy
│               │   └── y_test.npy
│               └── split_metadata.json
│
├── models/                    ← algorithm-specific artifacts
│   └── AID743139/
│       ├── nb_maccs_20260109_v1.joblib
│       ├── dt_maccs_20260110_v1.joblib
│       └── rf_maccs_20260112_v1.joblib
│
├── results/                   ← evaluation outputs
│   └── AID743139/
│       ├── nb/
│       ├── dt/
│       └── comparison_tables/
│

1.2 Helper Function: make_fixed_filename#

In the following code cell we are going to use the helper function make_fixed_filename to save this data in the raw data folder of the directory structure above. These helper functions are provided in the cinf26pk package we have been building in this class, and the code is provided in the green box below. If you do not have access to the package, you can substitute its code for the import statement in the cell below.

Note, we will be using two helper functions for saving files and both will generate the directory architecture based on your current working directory:

  • make_fixed_filename - for deterministic filenames with a date stamp. This lets us detect whether the data file at PubChem has changed in the future, but does not write a new version if we re-download it multiple times on the same day.

  • make_filename - for versioned filenames with a date stamp. This lets us save our curated data and features with version numbers so we can experiment without losing our work.

Helper Function: make_fixed_filename

The following function was imported from the class package and used in the preceding script

Code

The `make_fixed_filename` helper function was created in the class package pvhelpfct.py. It produces deterministic names (in contrast to versioned names), but does add a date stamp to the filename. We use this helper function for data sets we download. After we work them up, we use the `make_filename` helper function, which versions the filename. Below is the code for the function, although we will import it from the class package in this example.

from datetime import datetime

def make_fixed_filename(
    stem: str,
    ext: str = "csv",
    add_date: bool = True
) -> str:
    """
    Create a deterministic filename of the form:
    stem[_YYYYMMDD].ext

    Intended for canonical or authoritative data artifacts
    that should not auto-version.
    """
    if add_date:
        date_str = datetime.now().strftime("%Y%m%d")
        return f"{stem}_{date_str}.{ext}"
    else:
        return f"{stem}.{ext}"

Note: For the following code cell to generate the above directory structure, your notebook must be run from the 10_SupervisedML folder. This code will create the data/AID743139/raw directory path if it does not exist. If you are working with a different bioassay you need to change the AID in the path to the new one.

from pathlib import Path
from cinf26pk.core import make_fixed_filename

RAW = Path("data/AID743139/raw")
RAW.mkdir(parents=True, exist_ok=True)

fname = make_fixed_filename(
    stem="AID743139_pubchem_raw",
    ext="csv",
    add_date=True
)

df_raw.to_csv(RAW / fname, index=False)
Python Check

The above code saves the raw PubChem dataset to a well-defined location using a consistent naming policy.

Explain the code.

Answer

There are two things going on here.

  • We are creating a path
  • We are dating our filenames
The first thing we need to do is understand the following script


PROJECT_ROOT = Path.cwd() # project root (10_SupervisedML)
RAW = PROJECT_ROOT / "data" / "AID743139" / "raw"
RAW.mkdir(parents=True, exist_ok=True)
  

  • The first line defines the project root as the current working directory. Since this notebook is run from the 10_SupervisedML directory, that directory becomes the project root. We capitalize PROJECT_ROOT because it represents an invariant configuration value.
  • The second line constructs the path from the project root to the directory where the raw data will be stored.
  • The third line creates the directory (and any missing parent directories) if they do not already exist, making the code safe to rerun.
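The `/` operator on Path objects is what makes the second line work; a minimal illustration (the directory names here are just the ones from this project):

```python
from pathlib import Path

root = Path("10_SupervisedML")
raw = root / "data" / "AID743139" / "raw"  # the / operator joins path segments
print(raw.as_posix())                      # 10_SupervisedML/data/AID743139/raw
```

Unlike string concatenation, Path joining handles separators for you and works the same on Windows and Unix.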

The helper function make_fixed_filename is then used to construct the output filename:

Create deterministic filename with date stamp

from cinf26pk.core import make_fixed_filename

fname = make_fixed_filename(
    stem="AID743139_pubchem_raw",
    ext="csv",
    add_date=True
)

This uses the helper function `make_fixed_filename`, which is described in the next dropdown box

Finally, the DataFrame is written to disk:


df_raw.to_csv(RAW / fname, index=False)
  
  • Use named path objects to represent meaningful locations in the workflow, rather than embedding directory structures inside string literals.
  • Use helper functions to enforce consistent naming policies for saved artifacts, so filenames encode intent rather than ad-hoc decisions.
  1. Restart the kernel and load the saved data into a dataframe called df_raw (see green box below for help)

#Write Code here 
import pandas as pd
from pathlib import Path

RAW = Path("data/AID743139/raw")
RAW.mkdir(parents=True, exist_ok=True)

df_raw = pd.read_csv(RAW / "AID743139_pubchem_raw_20260117.csv")
df_raw.head(7)  # gives the first 7 rows of the dataframe
PUBCHEM_RESULT_TAG PUBCHEM_SID PUBCHEM_CID PUBCHEM_EXT_DATASOURCE_SMILES PUBCHEM_ACTIVITY_OUTCOME PUBCHEM_ACTIVITY_SCORE PUBCHEM_ACTIVITY_URL PUBCHEM_ASSAYDATA_COMMENT Activity Summary Antagonist Activity Antagonist Potency (uM) Antagonist Efficacy (%) Viability Activity Viability Potency (uM) Viability Efficacy (%) Sample Source
0 RESULT_TYPE NaN NaN NaN NaN NaN NaN NaN STRING STRING FLOAT FLOAT STRING FLOAT FLOAT STRING
1 RESULT_DESCR NaN NaN NaN NaN NaN NaN NaN Type of compound activity based on both the ar... Type of compound activity in the aromatase ant... The concentration of sample yielding half-maxi... Percent inhibition of aromatase. Type of compound activity in the cell viabilit... The concentration of sample yielding half-maxi... Percent inhibition of cell viability. Where sample was obtained.
2 RESULT_UNIT NaN NaN NaN NaN NaN NaN NaN NaN NaN MICROMOLAR PERCENT NaN MICROMOLAR PERCENT NaN
3 1 144203552.0 12850184.0 C(C(=O)[C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O.C(... Inactive 0.0 NaN NaN inactive inactive NaN 0 inactive NaN 0 NCI
4 2 144203553.0 89753.0 C([C@H]([C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O)O... Inactive 0.0 NaN NaN inactive inactive NaN 0 inactive NaN 0 NCI
5 3 144203554.0 9403.0 C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@H]2OC(=O)CCC... Inactive 0.0 NaN NaN inactive inactive NaN 0 inactive NaN 0 NCI
6 4 144203555.0 13218779.0 C[C@@]12CC[C@@H](C1(C)C)C[C@H]2OC(=O)CSC#N Inactive 0.0 NaN NaN inactive inactive NaN 0 inactive NaN 0 NCI
Pandas Check

How do you open the raw csv file in Pandas?

Answer There are two approaches: use the predefined path variable, or write the entire path as a string. For the first to work you must define the path, which was done in a preceding cell. Of course, you must use the filename of the file you saved.

import pandas as pd
df_raw = pd.read_csv(RAW / "AID743139_pubchem_raw_20251230.csv")
#df_raw = pd.read_csv("data/AID743139/raw/AID743139_pubchem_raw_20251230.csv")  
Note, if you hard code the path as a string, it is written relative to the notebook's working directory (`10_SupervisedML`), so it starts with `data/` and descends into the `10_SupervisedML/data` directory, just as the RAW path object does.
  1. Display the first 5 rows of the df_raw

#Write Code here 
import pandas as pd
df_raw.head() 
PUBCHEM_RESULT_TAG PUBCHEM_SID PUBCHEM_CID PUBCHEM_EXT_DATASOURCE_SMILES PUBCHEM_ACTIVITY_OUTCOME PUBCHEM_ACTIVITY_SCORE PUBCHEM_ACTIVITY_URL PUBCHEM_ASSAYDATA_COMMENT Activity Summary Antagonist Activity Antagonist Potency (uM) Antagonist Efficacy (%) Viability Activity Viability Potency (uM) Viability Efficacy (%) Sample Source
0 RESULT_TYPE NaN NaN NaN NaN NaN NaN NaN STRING STRING FLOAT FLOAT STRING FLOAT FLOAT STRING
1 RESULT_DESCR NaN NaN NaN NaN NaN NaN NaN Type of compound activity based on both the ar... Type of compound activity in the aromatase ant... The concentration of sample yielding half-maxi... Percent inhibition of aromatase. Type of compound activity in the cell viabilit... The concentration of sample yielding half-maxi... Percent inhibition of cell viability. Where sample was obtained.
2 RESULT_UNIT NaN NaN NaN NaN NaN NaN NaN NaN NaN MICROMOLAR PERCENT NaN MICROMOLAR PERCENT NaN
3 1 144203552.0 12850184.0 C(C(=O)[C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O.C(... Inactive 0.0 NaN NaN inactive inactive NaN 0 inactive NaN 0 NCI
4 2 144203553.0 89753.0 C([C@H]([C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O)O... Inactive 0.0 NaN NaN inactive inactive NaN 0 inactive NaN 0 NCI
Pandas Check

Why are the CID and SID columns floats and not integers

Answer

Pandas infers column data types when a CSV file is read. Even though CIDs and SIDs are integers, the first few rows contain missing values (there is no CID or SID), so Pandas inserts NaN (Not a Number), which is an IEEE floating-point value that allows Pandas to represent missing numbers. As a result, Pandas promotes the column to float64 in order to accommodate NaN. If all values were present, the column would be inferred as integers. Once those rows are removed, the columns can safely be converted to integers using the .astype(int) method. This conversion will fail if any values are still missing (NaN). For more information see Section 4.3 Handling Missing Data in the Py4Sci Pandas Chapter.

Note that a column in a DataFrame is actually a Pandas Series, and the column label is the Series name. A DataFrame can be viewed as a table of Series objects that all share a common row index. Therefore, the following operation is calling a Series method on a column, and this could convert the values to integers, but will not work if one of the values is NaN.

df['column-label'] = df['column-label'].astype(int)

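The promotion described above can be reproduced on a toy CSV (the column names here are simplified, not the actual PubChem headers):

```python
import pandas as pd
from io import StringIO

# row 2 has no CID, so the column cannot stay integer
csv_text = "cid,label\n12850184,Inactive\n,Inconclusive\n89753,Inactive\n"
df = pd.read_csv(StringIO(csv_text))

print(df["cid"].dtype)   # float64: the missing value forced promotion

# after dropping rows with missing CIDs, the integer conversion is safe
clean = df.dropna(subset=["cid"])
print(clean["cid"].astype(int).tolist())
```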

2 Data Cleaning#

Data cleaning is a critical step in our workflow that ensures the quality, consistency, and reliability of a dataset before we build a model. Raw data often contains missing values, duplicates, inconsistent formatting, or mislabeled entries that can lead to incorrect conclusions or poorly performing models. By identifying and resolving these issues at the start of our model design, we create a dataset that accurately reflects the information needed for meaningful interpretation and predictive modeling.

2.1 Inspect data and remove first 3 rows#

Looking at our imported data, lines 0-2 provide the descriptions for each column (data type, descriptions, units, etc). These rows need to be removed. We will do so with a slice operation to select only those rows that contain the data we will use in the model (See Py4Sci: using .iloc on a dataframe.)

df_curated = df_raw.iloc[3:].copy()  # [3:] is a slice operation on pandas dataframes.
# It selects all rows starting at position 3 through the end; we then save the
# result to a new dataframe, effectively removing rows 0, 1 and 2.
df_curated.head(5)
PUBCHEM_RESULT_TAG PUBCHEM_SID PUBCHEM_CID PUBCHEM_EXT_DATASOURCE_SMILES PUBCHEM_ACTIVITY_OUTCOME PUBCHEM_ACTIVITY_SCORE PUBCHEM_ACTIVITY_URL PUBCHEM_ASSAYDATA_COMMENT Activity Summary Antagonist Activity Antagonist Potency (uM) Antagonist Efficacy (%) Viability Activity Viability Potency (uM) Viability Efficacy (%) Sample Source
3 1 144203552.0 12850184.0 C(C(=O)[C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O.C(... Inactive 0.0 NaN NaN inactive inactive NaN 0 inactive NaN 0 NCI
4 2 144203553.0 89753.0 C([C@H]([C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O)O... Inactive 0.0 NaN NaN inactive inactive NaN 0 inactive NaN 0 NCI
5 3 144203554.0 9403.0 C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@H]2OC(=O)CCC... Inactive 0.0 NaN NaN inactive inactive NaN 0 inactive NaN 0 NCI
6 4 144203555.0 13218779.0 C[C@@]12CC[C@@H](C1(C)C)C[C@H]2OC(=O)CSC#N Inactive 0.0 NaN NaN inactive inactive NaN 0 inactive NaN 0 NCI
7 5 144203556.0 142766.0 C1=CC=C(C(=C1)C(=O)O)O.C1=CC2=C(C(=C1)O)N=CC=C2 Inconclusive 25.0 NaN NaN inconclusive antagonist (cytotoxic) active antagonist 15.5454 -115.803 active antagonist 14.9601 -76.8218 NCI

Now that we have removed the first three rows we can convert the CIDs and SIDs to integers using Pandas nullable "Int64" type that allows NaN (this is not int64, which does not accept null values).
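The difference between the nullable "Int64" type and plain int64 can be checked on a toy Series:

```python
import pandas as pd

s = pd.Series([144203552.0, float("nan"), 144203554.0])

nullable = s.astype("Int64")   # works: the missing value becomes <NA>
print(nullable.isna().sum())   # 1

try:
    s.astype("int64")          # fails: NaN has no integer representation
except ValueError as e:
    print("int64 cast failed:", type(e).__name__)
```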

df_curated["PUBCHEM_SID"] = df_curated["PUBCHEM_SID"].astype("Int64")
df_curated["PUBCHEM_CID"] = df_curated["PUBCHEM_CID"].astype("Int64")
df_curated.head()
PUBCHEM_RESULT_TAG PUBCHEM_SID PUBCHEM_CID PUBCHEM_EXT_DATASOURCE_SMILES PUBCHEM_ACTIVITY_OUTCOME PUBCHEM_ACTIVITY_SCORE PUBCHEM_ACTIVITY_URL PUBCHEM_ASSAYDATA_COMMENT Activity Summary Antagonist Activity Antagonist Potency (uM) Antagonist Efficacy (%) Viability Activity Viability Potency (uM) Viability Efficacy (%) Sample Source
3 1 144203552 12850184 C(C(=O)[C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O.C(... Inactive 0.0 NaN NaN inactive inactive NaN 0 inactive NaN 0 NCI
4 2 144203553 89753 C([C@H]([C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O)O... Inactive 0.0 NaN NaN inactive inactive NaN 0 inactive NaN 0 NCI
5 3 144203554 9403 C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@H]2OC(=O)CCC... Inactive 0.0 NaN NaN inactive inactive NaN 0 inactive NaN 0 NCI
6 4 144203555 13218779 C[C@@]12CC[C@@H](C1(C)C)C[C@H]2OC(=O)CSC#N Inactive 0.0 NaN NaN inactive inactive NaN 0 inactive NaN 0 NCI
7 5 144203556 142766 C1=CC=C(C(=C1)C(=O)O)O.C1=CC2=C(C(=C1)O)N=CC=C2 Inconclusive 25.0 NaN NaN inconclusive antagonist (cytotoxic) active antagonist 15.5454 -115.803 active antagonist 14.9601 -76.8218 NCI
# Check to see if there are any null (NaN) values.
df_curated[["PUBCHEM_SID", "PUBCHEM_CID"]].isna().sum()
PUBCHEM_SID      0
PUBCHEM_CID    149
dtype: int64
Check Your Understanding

Why are there 149 NaN CIDs, and no NaN SIDs?

Answer Not all SIDs could be resolved into a CID. That is, a depositor submits a substance, but PubChem's standardization process may not have been able to resolve that substance into an identified compound with a known structure. Thus the value remains null.
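When we later restrict the dataset to rows with a resolved CID, the operation will look like the following sketch (toy data with simplified column names, not the actual curation step):

```python
import pandas as pd

df = pd.DataFrame({
    "sid": [144203552, 144203553, 144203554],
    "cid": [12850184, None, 9403],
})

# keep only substances that resolved to a compound
usable = df.dropna(subset=["cid"])
print(len(usable))   # 2
```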

2.2 Save Versioned DataFrame as CSV#

You now want to save this as a file in your curated folder. Because you will be working on and experimenting with this file, you want to use the versioning helper function. You could just write the inline command df_curated.to_csv("Path-To-Directory/filename.csv", index=False), but we are developing a project that can have many files, so we need to set up protocols that use path objects and embed dates and versions in the file name.

2.2.1 Helper Function: make_filename#

In the following code cell we are going to use the helper function make_filename to save this data with a dated version number appended to the file name. Like make_fixed_filename, we will import this from the cinf26pk package. If you do not have access to the package you can substitute its code (in the green box below) for the import statement in the code cell where it is called.

Helper Function: make_filename

The following function was imported from the class package and used in the preceding script

Code

The `make_filename` helper function was created in the class package pvhelpfct.py. It is for versioning file names with a date stamp. This is useful when you are working up data, as it allows you to save multiple versions of the file as you alter it. That is, you may remove something, then realize you should not have removed it, and you can go back and recover the original file.


from pathlib import Path
from datetime import datetime

def make_filename(prefix: str, ext: str = "csv", folder: str = "downloads") -> str:
    """
    Create a unique versioned filename of the form:
    prefix_YYYYMMDD_vN.ext
    """
    folder_path = Path(folder)
    folder_path.mkdir(exist_ok=True)

    date_str = datetime.now().strftime("%Y%m%d")
    base = f"{prefix}_{date_str}"

    existing = [
        f for f in folder_path.iterdir()
        if f.name.startswith(base) and f.suffix == f".{ext}"
    ]

    version = len(existing) + 1
    filename = f"{base}_v{version}.{ext}"

    return str(folder_path / filename)
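To see the versioning behavior, the sketch below redefines the function locally (so it runs without the class package) and calls it twice in a temporary folder:

```python
from pathlib import Path
from datetime import datetime
import tempfile

def make_filename(prefix: str, ext: str = "csv", folder: str = "downloads") -> str:
    # versioned name: prefix_YYYYMMDD_vN.ext (same logic as above)
    folder_path = Path(folder)
    folder_path.mkdir(exist_ok=True)
    date_str = datetime.now().strftime("%Y%m%d")
    base = f"{prefix}_{date_str}"
    existing = [
        f for f in folder_path.iterdir()
        if f.name.startswith(base) and f.suffix == f".{ext}"
    ]
    version = len(existing) + 1
    return str(folder_path / f"{base}_v{version}.{ext}")

with tempfile.TemporaryDirectory() as tmp:
    first = make_filename("demo", folder=tmp)
    Path(first).touch()                  # simulate saving the first version
    second = make_filename("demo", folder=tmp)
    print(Path(first).name)              # demo_<date>_v1.csv
    print(Path(second).name)             # demo_<date>_v2.csv
```

Each save on the same day increments the version counter, so no earlier work is overwritten.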

Note: For the following code cell to generate the above directory structure, your notebook must be run from the 10_SupervisedML folder. This cell will generate the curated folder within the data/AID743139 path you created earlier. If you are working with a different bioassay you will need to change the AID to the new one.

from pathlib import Path
from cinf26pk.core import make_filename

# Generate path (relative to notebook)
CUR = Path("data/AID743139/curated")
CUR.mkdir(parents=True, exist_ok=True)

# Create versioned filename with date stamp (returns the full path as a string)
fpath = make_filename(
    prefix="AID743139_pc_curated",
    ext="csv",
    folder=str(CUR)
)

# Save curated dataset
df_curated.to_csv(fpath, index=False)

print(f"[Saved] Curated PubChem data → {fpath}")
[Saved] Curated PubChem data → data/AID743139/curated/AID743139_pc_curated_20260205_v1.csv

2.2.2 Reload saved file into dataframe#

Load your file into a new dataframe called df_curated, and use the path variable for this new path, in contrast to the string.

# Write code here: (see green box below for guidance)
# Load your saved csv file into a new pandas dataframe called df_curated
import pandas as pd
#from pathlib import Path


df_curated = pd.read_csv(CUR / "AID743139_pc_curated_20260205_v1.csv")
Pandas Check

How do you open the curated csv file in Pandas?

Answer There are two approaches: use the predefined path variable, or write the entire path as a string. For the first to work you must define the path, which was done in a preceding cell. Of course, you must use the filename of the file you saved. If you restarted the kernel you need to redefine the path by uncommenting the commented lines of code.

import pandas as pd
#from pathlib import Path

#PROJECT_ROOT = Path.cwd()  # project root (10_SupervisedML)
#CUR = PROJECT_ROOT / "data" / "AID743139" / "curated"

df_curated = pd.read_csv(CUR / "AID743139_pc_curated_20260118_v1.csv")

Inspect by displaying head

df_curated.head()
PUBCHEM_RESULT_TAG PUBCHEM_SID PUBCHEM_CID PUBCHEM_EXT_DATASOURCE_SMILES PUBCHEM_ACTIVITY_OUTCOME PUBCHEM_ACTIVITY_SCORE PUBCHEM_ACTIVITY_URL PUBCHEM_ASSAYDATA_COMMENT Activity Summary Antagonist Activity Antagonist Potency (uM) Antagonist Efficacy (%) Viability Activity Viability Potency (uM) Viability Efficacy (%) Sample Source
0 1 144203552 12850184.0 C(C(=O)[C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O.C(... Inactive 0.0 NaN NaN inactive inactive NaN 0.000 inactive NaN 0.0000 NCI
1 2 144203553 89753.0 C([C@H]([C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O)O... Inactive 0.0 NaN NaN inactive inactive NaN 0.000 inactive NaN 0.0000 NCI
2 3 144203554 9403.0 C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@H]2OC(=O)CCC... Inactive 0.0 NaN NaN inactive inactive NaN 0.000 inactive NaN 0.0000 NCI
3 4 144203555 13218779.0 C[C@@]12CC[C@@H](C1(C)C)C[C@H]2OC(=O)CSC#N Inactive 0.0 NaN NaN inactive inactive NaN 0.000 inactive NaN 0.0000 NCI
4 5 144203556 142766.0 C1=CC=C(C(=C1)C(=O)O)O.C1=CC2=C(C(=C1)O)N=CC=C2 Inconclusive 25.0 NaN NaN inconclusive antagonist (cytotoxic) active antagonist 15.5454 -115.803 active antagonist 14.9601 -76.8218 NCI
Check Your Understanding

Notice when you reload the dataframe from the csv the CIDs were converted back to floats, while the SIDs stayed as integers. Can you explain what is going on?

Explanation CSV is a plain-text format and stores values only. When Pandas opens a CSV file it tries to infer each column's data type; any column that has missing values gets NaN inserted and is inferred as a float. Since there were zero NaNs for the SIDs, those values were inferred as integers, but since there were 149 missing CIDs, those values were inferred as floats. This shows that we are too early in the pipeline to convert CIDs to integers; we will revisit this later.
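The round trip can be demonstrated without touching the disk, using an in-memory buffer in place of a file:

```python
import pandas as pd
from io import StringIO

df = pd.DataFrame({
    "sid": [144203552, 144203553],
    "cid": pd.array([12850184, None], dtype="Int64"),  # nullable integers
})

buf = StringIO()
df.to_csv(buf, index=False)      # CSV stores only the text of the values
buf.seek(0)

reloaded = pd.read_csv(buf)
print(reloaded["sid"].dtype)     # int64   (no missing values)
print(reloaded["cid"].dtype)     # float64 (the NaN forced promotion)
```

The "Int64" dtype is a property of the in-memory DataFrame, not of the CSV text, so it does not survive the save/load cycle.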

2.3 Inspect column names#

df_curated.columns #print out the column names. 
Index(['PUBCHEM_RESULT_TAG', 'PUBCHEM_SID', 'PUBCHEM_CID',
       'PUBCHEM_EXT_DATASOURCE_SMILES', 'PUBCHEM_ACTIVITY_OUTCOME',
       'PUBCHEM_ACTIVITY_SCORE', 'PUBCHEM_ACTIVITY_URL',
       'PUBCHEM_ASSAYDATA_COMMENT', 'Activity Summary', 'Antagonist Activity',
       'Antagonist Potency (uM)', 'Antagonist Efficacy (%)',
       'Viability Activity', 'Viability Potency (uM)',
       'Viability Efficacy (%)', 'Sample Source'],
      dtype='object')

2.3.1 Clean column names#

Some column names in this data frame are long and contain white spaces and special characters. Rename the columns (no spaces or special characters except for the “_” character) using a python dictionary as a map.

col_names_map = {'PUBCHEM_RESULT_TAG' : 'pc_result_tag', 
                 'PUBCHEM_SID' : 'sid', 
                 'PUBCHEM_CID' : 'cid',
                 'PUBCHEM_ACTIVITY_OUTCOME' : 'activity_outcome', 
                 'PUBCHEM_ACTIVITY_SCORE' : 'activity_score',
                 'PUBCHEM_ACTIVITY_URL' : 'activity_url', 
                 'PUBCHEM_ASSAYDATA_COMMENT' : 'assay_data_comment', 
                 'Activity Summary' : 'activity_summary',
                 'Antagonist Activity' : 'antagonist_activity', 
                 'Antagonist Potency (uM)' : 'antagonist_potency', 
                 'Antagonist Efficacy (%)' : 'antagonist_efficacy',
                 'Viability Activity' : 'viability_activity', 
                 'Viability Potency (uM)' : 'viability_potency',
                 'Viability Efficacy (%)' : 'viability_efficacy', 
                 'Sample Source' : 'sample_source' }
df_curated = df_curated.rename(columns = col_names_map)
df_curated.columns
Index(['pc_result_tag', 'sid', 'cid', 'PUBCHEM_EXT_DATASOURCE_SMILES',
       'activity_outcome', 'activity_score', 'activity_url',
       'assay_data_comment', 'activity_summary', 'antagonist_activity',
       'antagonist_potency', 'antagonist_efficacy', 'viability_activity',
       'viability_potency', 'viability_efficacy', 'sample_source'],
      dtype='object')
Check your understanding

Why is it important to rename the column names without spaces or special characters?

Answer

Renaming columns to remove spaces and special characters creates clean, consistent labels that are easier to use in Python. We do this by creating a dictionary that maps each original column name to a standardized version and then passing it to df.rename(columns=…). This ensures that columns can be accessed reliably with either bracket notation or dot notation (e.g., df['activity_score'] or df.activity_score) without causing syntax errors. Clean names also improve compatibility with functions, plotting libraries, and automated workflows that expect simple identifiers. In short, renaming columns makes your code more readable, robust, and less error-prone.
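For wide tables, the rename dictionary can also be generated programmatically. Below is a hypothetical sketch of such a normalizer (the function name snake_case is our own, not from the class package):

```python
import re

def snake_case(name: str) -> str:
    # replace runs of spaces and special characters with "_" and lowercase
    cleaned = re.sub(r"[^0-9A-Za-z]+", "_", name.strip())
    return cleaned.strip("_").lower()

print(snake_case("Antagonist Potency (uM)"))  # antagonist_potency_um
print(snake_case("Sample Source"))            # sample_source
# usage: df = df.rename(columns={c: snake_case(c) for c in df.columns})
```

Note that the automated names differ slightly from the hand-written map above (e.g., the unit suffix is kept), which is why a hand-curated dictionary is often preferred for small tables.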

Data Dive:

The following pop-up describes each column in the dataframe. It should be noted that BioAssay 743139 is a summary assay for measuring inhibition of aromatase, so a positive assay result means a chemical is an inhibitor (antagonist) of the enzyme. The assay also measures agonist activity, but agonists are not considered active in this assay; they are classified as inconclusive. In other words, only antagonists are considered active in this assay.

Explanation of the data

  • pc_result_tag: Internal PubChem result index. A simple numeric tag identifying the order of assay records. Not biologically meaningful.

  • sid (Substance ID): PubChem Substance ID; identifies the submitted sample tested in the assay (may map to multiple CIDs).

  • cid (Compound ID): PubChem Compound ID; normalized chemical structure identifier used across PubChem. Common structure-level identifier for cheminformatics.

  • PUBCHEM_EXT_DATASOURCE_SMILES: A SMILES string representing the chemical structure as submitted by the data source. May differ from canonical PubChem SMILES.

  • activity_outcome: High-level BioAssay classification: Active, Inactive, or Inconclusive. This is the main result label for each tested substance. This is a summary table, and the result depends on a biological threshold of the underlying assays.

  • activity_score: Numeric score assigned by PubChem summarizing activity strength; depends on assay scoring rules. Often 0 = inactive, higher = stronger activity.

  • activity_url: URL linking to additional PubChem assay information for this result. Frequently missing (NaN) in simplified datasets.

  • assay_data_comment: Optional textual note describing assay-specific observations (rare in many datasets).

  • activity_summary: More detailed subtype of activity_outcome (e.g., active antagonist, inconclusive agonist, inconclusive antagonist (cytotoxic)). Provides mechanism- or phenotype-level interpretation.

  • antagonist_activity: Qualitative label identifying whether the compound acted as an antagonist in the assay (active/inactive).

  • antagonist_potency: Numeric potency value for antagonist activity (often in μM). Indicates concentration required for effect.

  • antagonist_efficacy: Percent efficacy of antagonist behavior. Measures the magnitude of response relative to assay controls.

  • viability_activity: Qualitative label describing cell viability response (active/inactive). Helps detect cytotoxic compounds.

  • viability_potency: Potency value for viability-related effects (e.g., concentration at which viability decreases).

  • viability_efficacy: Percent efficacy for viability effects (extent of cell killing or protection).

  • sample_source: Origin of the tested sample (e.g., NCI). Refers to the organization or project that submitted the substance to PubChem.

3. Inspect Activity Outcomes#

First, we need to understand the structure of our data, particularly the activity class of the tested compounds, since our goal is to develop a model that classifies small molecules based on their activity against a target. This information is found in the activity_outcome and activity_summary columns of the dataframe. The activity outcome has three values: Active, Inactive, and Inconclusive.

To explore this, we use the groupby method, which splits the data into groups based on one or more column values, enabling us to perform calculations on each group independently. In the code cell below, we group by activity_outcome and use the count method to determine how many entries fall into each outcome category:

df_curated.groupby(['activity_outcome']).count()
pc_result_tag sid cid PUBCHEM_EXT_DATASOURCE_SMILES activity_score activity_url assay_data_comment activity_summary antagonist_activity antagonist_potency antagonist_efficacy viability_activity viability_potency viability_efficacy sample_source
activity_outcome
Active 379 379 378 378 379 0 0 379 379 378 379 379 115 359 379
Inactive 7562 7562 7466 7466 7562 0 0 7562 7562 0 7562 7562 324 7449 7562
Inconclusive 2545 2545 2493 2493 2545 0 0 2545 2545 2111 2136 2545 1206 2450 2545
Check your understanding

1. Explain this line of code: `df_curated.groupby(['activity_outcome']).count()`

Answer
  1. This represents two chained Pandas methods; .groupby() followed by .count().

    • .groupby('activity_outcome') splits the DataFrame into groups based on the category labels in the activity_outcome column; namely “Active,” “Inactive,” and “Inconclusive.”

    • .count() then counts how many non-missing values appear in each of the other columns within each category. This provides a quick summary of how many compounds fall into each activity class and reveals whether key fields (such as CID or SMILES) contain missing data within any category.

2. How many active, inactive and inconclusive compounds are there in the dataset based on the activity_outcome column?

Answer 2. Based on the data in the activity_outcome column, there are 379 actives, 7562 inactives, and 2545 inconclusive compounds in the dataset.

3. Scroll through the rest of the data. Compare these numbers to other columns such as CID and SMILES. What problems might arise if these counts are not consistent?

Answer 3. If the counts of compounds differ across columns (for example, if a compound has an activity_outcome but is missing a CID or SMILES), it means important information is missing. This causes problems such as:
  • Compounds with missing CIDs/SMILES cannot be used for structural analysis, descriptor generation, or machine learning.
  • The dataset can become biased if missing data are unevenly distributed across activity classes, which can affect model performance and interpretation.
  • Results become harder to interpret because activity labels may exist without the structural or potency data needed to contextualize them.
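The mismatch between row counts and non-missing counts is exactly what distinguishes `.size()` from `.count()` in a groupby; a toy illustration (simplified column names):

```python
import pandas as pd

df = pd.DataFrame({
    "activity_outcome": ["Active", "Active", "Inactive"],
    "cid": [12850184, None, 9403],   # one Active row lacks a CID
})

print(df.groupby("activity_outcome").size())          # rows per class
print(df.groupby("activity_outcome")["cid"].count())  # non-missing CIDs per class
```

Comparing the two outputs immediately shows which classes carry missing structural identifiers.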

3.1 Inspect Activity summaries#

If we group by both activity_outcome and activity_summary we get more insight into the data. In particular, it reveals that there are subcategories for “Inconclusive”, giving us a better understanding of how these results are classified.

df_curated.groupby(['activity_outcome','activity_summary']).count()
pc_result_tag sid cid PUBCHEM_EXT_DATASOURCE_SMILES activity_score activity_url assay_data_comment antagonist_activity antagonist_potency antagonist_efficacy viability_activity viability_potency viability_efficacy sample_source
activity_outcome activity_summary
Active active antagonist 379 379 378 378 379 0 0 379 378 379 379 115 359 379
Inactive inactive 7562 7562 7466 7466 7562 0 0 7562 0 7562 7562 324 7449 7562
Inconclusive active agonist 612 612 571 571 612 0 0 612 612 612 612 60 590 612
inconclusive 44 44 44 44 44 0 0 44 0 0 44 19 42 44
inconclusive agonist 414 414 409 409 414 0 0 414 212 223 414 12 397 414
inconclusive agonist (cytotoxic) 59 59 59 59 59 0 0 59 41 45 59 59 59 59
inconclusive antagonist 367 367 364 364 367 0 0 367 227 230 367 8 313 367
inconclusive antagonist (cytotoxic) 1049 1049 1046 1046 1049 0 0 1049 1019 1026 1049 1048 1049 1049

Note that the display above can be visually misleading. Run the next code cell and you will see that the activity_outcome values Active and Inactive each map to a single activity_summary, while Inconclusive maps to many, one of which is active agonist.

for outcome, summaries in (
    df_curated
    .groupby('activity_outcome')['activity_summary']
    .unique()
    .items()
):
    print(f"\n{outcome}:")
    for s in sorted(summaries):
        print(f"  - {s}")
Active:
  - active antagonist

Inactive:
  - inactive

Inconclusive:
  - active agonist
  - inconclusive
  - inconclusive agonist
  - inconclusive agonist (cytotoxic)
  - inconclusive antagonist
  - inconclusive antagonist (cytotoxic)
print('there are',len(df_curated),'compounds in the dataframe that are defined as active antagonist, inactive, or inconclusive.')
there are 10486 compounds in the dataframe that are defined as active antagonist, inactive, or inconclusive.

Now, we can see that, in the activity_summary column, the inconclusive compounds are further classified into subclasses, which include:

  • active agonist

  • inconclusive

  • inconclusive agonist

  • inconclusive antagonist

  • inconclusive agonist (cytotoxic)

  • inconclusive antagonist (cytotoxic)

As indicated by the title of this PubChem assay record (AID 743139), the purpose of the experiment was to identify aromatase inhibitors. For this reason, the original submitter labeled all compounds showing antagonist behavior as active in the activity_outcome column. The assay also detected 612 agonists, but these were marked as inconclusive because the submitter defined activity strictly in terms of inhibitory, not stimulatory, effects. In this dataset, inactive compounds are those that show neither agonist nor antagonist behavior.

It is important to remember that “active” and “inactive” are not universal labels; their meaning depends entirely on the assay design and the submitter’s interpretation. For our purposes, we redefine these categories to support a binary classification task: any compound that changes the function of the target, whether by increasing or decreasing its activity, is considered active, and compounds that do not alter the target are considered inactive. In short, we are grouping all function-altering interactions together as active and treating the remainder as inactive, so the model can focus on distinguishing compounds that affect the enzyme from those that do not.

Check Your Understanding: Defining “Active” as a Modeling Decision

As we prepare to collapse assay outcomes into a binary classification for model building (active vs inactive), we must decide how to handle compounds labeled as inconclusive. In this dataset, there are 379 active antagonists, 7,562 inactive compounds, and 2,545 inconclusive compounds. Notably, 612 of the inconclusive compounds are labeled as active agonists, substantially more than the number of active antagonists. Rather than asking whether a particular choice is correct, this discussion is intended to surface the consequences of the choices we make and to emphasize the importance of recognizing that a choice is being made at all.

Discussion Points

The deeper lesson: recognizing that a decision was made

  • At what point in the workflow does the definition of “active” enter the modeling process?
  • How does collapsing rich experimental outcomes into a binary label necessarily simplify the biology?
  • What assumptions are we making about mechanism, relevance, and similarity when we redefine labels?
  • How might a different research goal lead to a different but equally reasonable decision using the same data?

Why documentation may matter more than the decision itself

  • Could someone reuse this model without realizing how “active” was defined?
  • How would performance metrics be interpreted differently if agonists were excluded rather than included?
  • How does documenting this choice enable reproducibility, comparison, and critical evaluation?
  • What risks arise when modeling assumptions are implicit rather than explicit?

Factors that influence this decision

  • Biological considerations: agonists are clearly bioactive and demonstrate target engagement, even if the mechanism differs.
  • Chemical considerations: agonists and antagonists may share structural features relevant to binding and recognition.
  • Data considerations: there are more active agonists (612) than active antagonists (379), and excluding them would further reduce an already small positive class.
  • Modeling considerations: extreme class imbalance can negatively affect learning, evaluation, and generalization.

Consequences of including active agonists as “active”

  • The model becomes a general bioactivity classifier rather than a mechanism-specific antagonist detector.
  • Predictions answer the question “is this compound active against the target?” rather than “is this compound an antagonist?”
  • Mechanistic distinctions are deferred to later analysis rather than embedded in the initial model.

Consequences of excluding active agonists

  • The active class becomes very small relative to the inactive class, increasing imbalance.
  • Biologically active compounds are treated as negatives, potentially obscuring meaningful structure–activity relationships.
  • The model becomes more specialized but less general and less transferable.

Connecting this decision to the next step

  • In the next section, we will explicitly define active and inactive classes for binary classification.
  • All inconclusive compounds will be removed except those labeled as active agonists.
  • This decision should be understood as a deliberate modeling choice shaped by biological goals, data constraints, and downstream use.

The central takeaway is not that one labeling strategy is universally correct, but that supervised learning models are built on decisions that shape what they can learn, what they ignore, and how their predictions should be interpreted. Recognizing, justifying, and documenting those decisions is a core part of responsible model building.

4. Select active/inactive compounds for model building#

4.1 Select active and inactive compounds#

Now we want to select only the active and inactive compounds from the data frame (that is, active agonists, active antagonists, and inactives based on the “activity summary” column). This code filters the DataFrame so that only rows labeled as active agonist, active antagonist, or inactive are kept, using the | operator to combine multiple Boolean conditions with “OR.” This removes all inconclusive entries, ensuring that the resulting DataFrame contains only compounds with clear activity classifications suitable for model training.

# create a new dataframe with active agonists, active antagonists, and inactives
# remove any molecules that are inconclusive and not active agonists
df = df_curated[ (df_curated['activity_summary'] == 'active agonist' ) | 
             (df_curated['activity_summary'] == 'active antagonist' ) |
             (df_curated['activity_summary'] == 'inactive' ) ]

print("The number of total molecules that are active or inactive is:", len(df))
The number of total molecules that are active or inactive is: 8553
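The same filter can be written more compactly with `.isin()`, which is easier to extend if the set of kept labels ever changes. A self-contained sketch on a hypothetical mini version of the assay table:

```python
import pandas as pd

# hypothetical mini version of the assay table
toy = pd.DataFrame({
    'cid': [1, 2, 3, 4, 5],
    'activity_summary': ['inactive', 'active agonist', 'inconclusive',
                         'active antagonist', 'inconclusive antagonist'],
})

# equivalent to chaining three == comparisons with the | operator
keep = ['active agonist', 'active antagonist', 'inactive']
kept = toy[toy['activity_summary'].isin(keep)]
print(kept['activity_summary'].tolist())   # ['inactive', 'active agonist', 'active antagonist']
```

Both forms build the same Boolean mask; `.isin()` simply tests membership in a list instead of OR-ing individual comparisons.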

Since we will be obtaining structural data from PubChem, it’s important to examine how many unique PubChem Substance IDs (SIDs) and Compound IDs (CIDs) are present in our dataset. Recall the distinction:

  • Substances (SIDs) represent depositor-submitted records. Multiple SIDs may refer to the same chemical structure if submitted by different sources.

  • Compounds (CIDs) are standardized, unique chemical structures derived by PubChem from submitted substances through a process of structure normalization and deduplication.

Understanding how many unique SIDs and CIDs we have will help us assess redundancy and ensure we’re working with non-duplicated chemical structures when building models or visualizing data.

# identify Compounds IDs and Substance IDs. 
print('total number of unique Substance IDs=',len(df['sid'].unique()))
print('total number of unique Compound IDs= ',len(df['cid'].unique()))
total number of unique Substance IDs= 8553
total number of unique Compound IDs=  6858

This data indicates several important points:

  1. The number of unique Substance IDs matches the number of active and inactive molecules in the dataset.

  2. Not all substances have an associated Compound ID (CID); recall that we identified 149 such records earlier. This means some structures have not been standardized by PubChem.

  3. Some substances may have the same CID, indicating that different submissions or assay runs may correspond to the same chemical structure.

Since our goal is to build a model that uses chemical structure to predict biological activity, we must remove substances without CIDs as these lack standardized structural information necessary for modeling. We must also remove duplicate SIDs for the same CID, as they are the same structure. By cleaning our data based on CID and activity type, we ensure the model learns from reliable, unambiguous data where each chemical structure is uniquely associated with a single activity classification.

4.2 Drop substances without associated CIDs.#

First, check whether there are substances without associated CIDs. We will use the .isna() method to identify all rows with missing (not available) values. Note that the following code cell uses chaining to call multiple methods sequentially on a pandas object (we chain .isna() with .sum() to count how many values are missing).

df.isna().sum()  
pc_result_tag                       0
sid                                 0
cid                               138
PUBCHEM_EXT_DATASOURCE_SMILES     138
activity_outcome                    0
activity_score                      0
activity_url                     8553
assay_data_comment               8553
activity_summary                    0
antagonist_activity                 0
antagonist_potency               7563
antagonist_efficacy                 0
viability_activity                  0
viability_potency                8054
viability_efficacy                155
sample_source                       0
dtype: int64
Check your understanding

How many records lack an associated CID? Why does this number differ from the 149 NaNs identified in section 4.1?

Answer

138 records lack a CID. There were originally 149; the number is smaller now because some of the CID-less records were among the inconclusive compounds (other than active agonists) that we removed.

The records without CID are not going to provide us with standardized structural information, so we will remove them from the dataframe.

This is accomplished by using the .dropna() method. Since we only want to remove rows where the CID is missing, we can use the subset parameter to specify we are targeting only those NA values in the ‘cid’ column.

Look at the output of the previous code cell. If we called dropna() without specifying a subset, we could lose all of our data, because activity_url and assay_data_comment are null for every row in the dataframe!

print(len(df))   # in the previous cell we saw 138 rows without a CID
df = df.dropna( subset=['cid'] )
print(len(df))   # this value should be 138 fewer
8553
8415
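The danger of calling `dropna()` without `subset` can be reproduced on a toy frame (hypothetical data) where one column is entirely null, just like activity_url and assay_data_comment above:

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({
    'cid':          [1.0, np.nan, 3.0],
    'activity_url': [np.nan, np.nan, np.nan],   # entirely null, like activity_url above
})

print(len(toy.dropna()))                 # 0: every row has a NaN somewhere
print(len(toy.dropna(subset=['cid'])))   # 2: only the missing-CID row is dropped
```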

Check if the NULL values disappeared in the cid column. The value should be 0 for that column now.

df.isna().sum()
pc_result_tag                       0
sid                                 0
cid                                 0
PUBCHEM_EXT_DATASOURCE_SMILES       0
activity_outcome                    0
activity_score                      0
activity_url                     8415
assay_data_comment               8415
activity_summary                    0
antagonist_activity                 0
antagonist_potency               7467
antagonist_efficacy                 0
viability_activity                  0
viability_potency                7919
viability_efficacy                154
sample_source                       0
dtype: int64

Now that we have removed all rows with NaN CIDs, we can safely convert the cid column to integers and not worry about the values reverting to floats when we reload a saved CSV file.

#Write code here to convert CIDs to integers and print the head
df["cid"] = df["cid"].astype("Int64")
df.head()
pc_result_tag sid cid PUBCHEM_EXT_DATASOURCE_SMILES activity_outcome activity_score activity_url assay_data_comment activity_summary antagonist_activity antagonist_potency antagonist_efficacy viability_activity viability_potency viability_efficacy sample_source
0 1 144203552 12850184 C(C(=O)[C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O.C(... Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 NCI
1 2 144203553 89753 C([C@H]([C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O)O... Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 NCI
2 3 144203554 9403 C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@H]2OC(=O)CCC... Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 NCI
3 4 144203555 13218779 C[C@@]12CC[C@@H](C1(C)C)C[C@H]2OC(=O)CSC#N Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 NCI
5 6 144203557 16043 CC(C)(C)C1=C(C=CC(=C1)O)O Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 NCI
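The reason for `"Int64"` (capital I, pandas' nullable integer dtype) rather than plain `"int64"` is that a NumPy integer column cannot hold NaN; any missing value silently promotes the column to float. A short sketch with made-up values:

```python
import pandas as pd
import numpy as np

s = pd.Series([12850184, np.nan, 9403])
print(s.dtype)   # float64: the NaN forces the column to float

s_nullable = s.astype('Int64')   # pandas nullable integer; NaN becomes <NA>
print(s_nullable.dtype)          # Int64
print(s_nullable.iloc[0])        # 12850184 (no trailing .0)
```

In our dataframe the NaN CIDs are already gone, but using the nullable dtype also protects against float promotion if a missing value ever reappears on reload.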
# identify Compounds IDs and Substance IDs. 
print('total number of Substance IDs=',len(df['sid'].unique()))
print('total number of Compound IDs= ',len(df['cid'].unique()))
total number of Substance IDs= 8415
total number of Compound IDs=  6857
Check your understanding

Why can the number of SIDs be greater than the number of CIDs?

Answer

The SID is tied to the submitter, and multiple assays may have been performed on the same chemical. This means we can have multiple entries for the same chemical compound (CID), which could also result in conflicting activities.

4.3 Remove CIDs with conflicting activities#

This code iterates over the unique CIDs and checks whether multiple distinct values exist in the activity_summary column for each. If conflicting activity summaries are found, the code stores the CID in the cid_conflict list and records the corresponding row indices in idx_conflict. Finally, the total number of CIDs with conflicts and the total number of affected rows are output.

cid_conflict = [] # list to store CIDs with conflicting activities
idx_conflict = [] # list to store indices of rows with conflicting activities

for mycid in df['cid'].unique() : # iterate over each unique Compound ID
    
    outcomes = df[ df.cid == mycid ].activity_summary.unique() # unique activity summaries for this CID
    
    if len(outcomes) > 1 : # if there are multiple unique activity summaries for this CID
        
        idx_tmp = df.index[ df.cid == mycid ].tolist() # get the indices of these rows
        idx_conflict.extend(idx_tmp) # add these indices to the conflict list
        cid_conflict.append(mycid) # add the CID to the conflict list

print("#", len(cid_conflict), "CIDs with conflicting activities [associated with", len(idx_conflict), "rows (SIDs).]")
print(idx_conflict[:10]) # print first 10 CID index numbers of conflicting activities
# 66 CIDs with conflicting activities [associated with 150 rows (SIDs).]
[5, 5953, 6847, 49, 6127, 63, 9115, 103, 5917, 6961]
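The same conflict scan can be written without an explicit loop using `groupby(...).nunique()`. This is a sketch on toy data (CIDs borrowed from the examples above, activity values invented), but the idea applies directly to `df`; note that, unlike the loop, the grouped version reports CIDs in sorted order rather than encounter order:

```python
import pandas as pd

toy = pd.DataFrame({
    'cid':              [16043, 16043, 16043, 443939, 443939, 51],
    'activity_summary': ['inactive', 'active antagonist', 'inactive',
                         'inactive', 'active antagonist', 'inactive'],
})

# number of distinct activity summaries per CID
n_unique = toy.groupby('cid')['activity_summary'].nunique()

# CIDs with more than one distinct summary are conflicting
conflicting = n_unique[n_unique > 1].index.tolist()

# all row indices belonging to those CIDs
rows = toy.index[toy['cid'].isin(conflicting)].tolist()

print(conflicting)   # [16043, 443939]
print(rows)          # [0, 1, 2, 3, 4]
```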

To examine which CIDs have conflicting activity data, we can display a portion of the dataframe using the .loc[] method. This allows us to select specific rows. In this case, we used the ones listed in idx_conflict, which correspond to compounds with inconsistent activity summaries. We’ll chain with .head(10) to view just the first 10 of these rows for quick inspection.

This helps us verify the nature of the conflicts and better understand why these entries need to be removed before modeling.

df.loc[idx_conflict,:].head(10)
pc_result_tag sid cid PUBCHEM_EXT_DATASOURCE_SMILES activity_outcome activity_score activity_url assay_data_comment activity_summary antagonist_activity antagonist_potency antagonist_efficacy viability_activity viability_potency viability_efficacy sample_source
5 6 144203557 16043 CC(C)(C)C1=C(C=CC(=C1)O)O Inactive 0.0 NaN NaN inactive inactive NaN 0.0000 inactive NaN 0.0000 NCI
5953 5954 144209507 16043 CC(C)(C)C1=C(C=CC(=C1)O)O Active 43.0 NaN NaN active antagonist active antagonist 54.48270 -73.4024 inconclusive antagonist NaN NaN SigmaAldrich
6847 6848 144210401 16043 CC(C)(C)C1=C(C=CC(=C1)O)O Inactive 0.0 NaN NaN inactive inactive NaN 0.0000 inactive NaN 0.0000 SIGMA
49 50 144203601 443939 C[C@H]1[C@H]([C@H](C[C@@H](O1)O[C@H]2C[C@@](CC... Inactive 0.0 NaN NaN inactive inactive NaN 0.0000 inactive NaN 0.0000 NCI
6127 6128 144209681 443939 C[C@H]1[C@H]([C@H](C[C@@H](O1)O[C@H]2C[C@@](CC... Active 61.0 NaN NaN active antagonist active antagonist 1.65519 -115.9320 active antagonist 12.1763 -120.5980 Toronto Research
63 64 144203615 2170 C1CN(CCN1)C2=NC3=CC=CC=C3OC4=C2C=C(C=C4)Cl Inactive 0.0 NaN NaN inactive inactive NaN 0.0000 inactive NaN 0.0000 BIOMOL
9115 9116 144212669 2170 C1CN(CCN1)C2=NC3=CC=CC=C3OC4=C2C=C(C=C4)Cl Active 50.0 NaN NaN active antagonist active antagonist 16.58030 -115.2020 inconclusive antagonist 61.1306 -80.7706 SIGMA
103 104 144203655 2554 C1=CC=C2C(=C1)C=CC3=CC=CC=C3N2C(=O)N Inconclusive 20.0 NaN NaN active agonist active agonist 2.87255 73.7025 inactive NaN 0.0000 SigmaAldrich
5917 5918 144209471 2554 C1=CC=C2C(=C1)C=CC3=CC=CC=C3N2C(=O)N Inactive 0.0 NaN NaN inactive inactive NaN 0.0000 inactive NaN 0.0000 SIGMA
6961 6962 144210515 2554 C1=CC=C2C(=C1)C=CC3=CC=CC=C3N2C(=O)N Inactive 0.0 NaN NaN inactive inactive NaN 0.0000 inactive NaN 0.0000 SIGMA

We will build a binary classifier that distinguishes active from inactive compounds, lumping active antagonists and active agonists together as active. Because active agonists carry an activity_outcome of Inconclusive while active antagonists carry Active, a compound tested as an agonist in one sample and an antagonist in another has two different activity_summary values and is flagged as a conflict, even though we intend to treat both outcomes as active.

This situation should be rare: one assay reports the compound as an active antagonist (activity_outcome = Active, activity_summary = active antagonist) and another as an active agonist (activity_outcome = Inconclusive, activity_summary = active agonist). The following code cell identifies these compounds and returns them as a Python set named allowed_cids.

# Identify conflicting CIDs using activity_summary (already computed earlier)
# idx_conflict comes from the earlier conflict-detection cell

# Define biologically allowed exceptions:
# CIDs that have both active antagonist and active agonist summaries
allowed_cids = []

for cid, g in df.groupby('cid'):
    summaries = set(g['activity_summary'].str.lower())

    if (
        'active antagonist' in summaries
        and 'active agonist' in summaries
    ):
        allowed_cids.append(cid)

allowed_cids = set(allowed_cids)

print(f'allowed exceptions: {allowed_cids}')

# Optional inspection
df.loc[df['cid'].isin(allowed_cids)] \
  .sort_values(['cid', 'sid']) \
  .head(20)
allowed exceptions: {np.int64(10868), np.int64(54454)}
pc_result_tag sid cid PUBCHEM_EXT_DATASOURCE_SMILES activity_outcome activity_score activity_url assay_data_comment activity_summary antagonist_activity antagonist_potency antagonist_efficacy viability_activity viability_potency viability_efficacy sample_source
6644 6645 144210198 10868 C1CCC(CC1)N=C=NC2CCCCC2 Inconclusive 20.0 NaN NaN active agonist active agonist 4.19020 40.5990 inactive NaN 0.0 SIGMA
10000 10001 144213554 10868 C1CCC(CC1)N=C=NC2CCCCC2 Active 50.0 NaN NaN active antagonist active antagonist 5.04574 -68.1018 inconclusive antagonist NaN NaN LightBiologicals
695 696 144204247 54454 CCC(C)(C)C(=O)O[C@H]1C[C@H](C=C2[C@H]1[C@H]([C... Inconclusive 20.0 NaN NaN active agonist active agonist 4.35177 54.7546 inactive NaN 0.0 Prestwick Chemical, Inc.
7164 7165 144210718 54454 CCC(C)(C)C(=O)O[C@H]1C[C@H](C=C2[C@H]1[C@H]([C... Active 45.0 NaN NaN active antagonist active antagonist 17.22890 -54.6818 inactive NaN 0.0 Toronto Research
Deeper Dive Explanation

Go to the DYAD pages for AID_743139-SID_144210198 and AID_743139-SID_144213554, both of which correspond to CID 10868. Explain the results of the output above.

Answer

Although the assay was designed to detect antagonism, it measures receptor signaling in the presence of a reference agonist. Compounds such as CIDs 10868 and 54454 can both activate the receptor on their own and reduce signaling relative to a stronger agonist, consistent with partial agonism. PubChem therefore records agonist-like signal increases as “active agonist” (often marked inconclusive) and signal suppression as “active antagonist.” Both outcomes indicate real target engagement and should be grouped as Active in our binary classification.

Check your understanding
  1. Examine the portion of the dataframe above. Why should CID 16043 be removed from the analysis?

  2. Can you explain the sequence of the output of df.loc[idx_conflict,:].head(10)? Is this data being sorted?

Answer
  1. Three SIDs share the common CID 16043, and their results conflict (one was active, two were inactive).

  2. This is not sorting; the apparent grouping by CID occurred because the for loop iterated through unique CIDs in the order they appear in the dataframe. When a CID with conflicting activity was found, all rows with that CID were appended to idx_conflict at once. So index 5 (CID 16043) came first, immediately followed by indices 5953 and 6847, which share that CID; the loop then reached the next conflicting CID, contributing indices 49 and 6127, then index 63 and its partner, and so on through the rest of the list.

# Drop rows with conflicting data, except for allowed biological exceptions
idx_conflict_fixed = df.index[
    df.index.isin(idx_conflict) &
    ~df["cid"].isin(allowed_cids)
]

df = df.drop(idx_conflict_fixed)
df.head()
pc_result_tag sid cid PUBCHEM_EXT_DATASOURCE_SMILES activity_outcome activity_score activity_url assay_data_comment activity_summary antagonist_activity antagonist_potency antagonist_efficacy viability_activity viability_potency viability_efficacy sample_source
0 1 144203552 12850184 C(C(=O)[C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O.C(... Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 NCI
1 2 144203553 89753 C([C@H]([C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O)O... Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 NCI
2 3 144203554 9403 C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@H]2OC(=O)CCC... Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 NCI
3 4 144203555 13218779 C[C@@]12CC[C@@H](C1(C)C)C[C@H]2OC(=O)CSC#N Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 NCI
9 10 144203561 637566 CC(=CCC/C(=C/CO)/C)C Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 NCI
# After removing the conflicting data, we can check the counts of each activity summary again
df.groupby('activity_summary').count()
pc_result_tag sid cid PUBCHEM_EXT_DATASOURCE_SMILES activity_outcome activity_score activity_url assay_data_comment antagonist_activity antagonist_potency antagonist_efficacy viability_activity viability_potency viability_efficacy sample_source
activity_summary
active agonist 539 539 539 539 539 539 0 0 539 539 539 539 58 519 539
active antagonist 344 344 344 344 344 344 0 0 344 343 344 344 108 327 344
inactive 7386 7386 7386 7386 7386 7386 0 0 7386 0 7386 7386 318 7275 7386
# identify Compounds IDs and Substance IDs. 
print('total number of Substance IDs=',len(df['sid'].unique()))
print('total number of Compound IDs= ',len(df['cid'].unique()))
total number of Substance IDs= 8269
total number of Compound IDs=  6793

4.4 Remove redundant data#

The code cells in section 4.3 do not remove compounds tested multiple times when the testing results are consistent (e.g., active agonist in all samples (substances)). The rows corresponding to these compounds are redundant, so we will remove them, keeping only one row per compound. To illustrate this, let's create a sorted dataframe to see some of the redundancies.

sorted_df = df.sort_values(by=['cid','sid','activity_summary'],
                           ascending=[True,True,True])
sorted_df.head(10)
pc_result_tag sid cid PUBCHEM_EXT_DATASOURCE_SMILES activity_outcome activity_score activity_url assay_data_comment activity_summary antagonist_activity antagonist_potency antagonist_efficacy viability_activity viability_potency viability_efficacy sample_source
8467 8468 144212021 4 CC(CN)O Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 Enamine
6110 6111 144209664 11 C(CCl)Cl Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 SIGMA
4955 4956 144208509 13 C1=CC(=C(C=C1Cl)Cl)Cl Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 SIGMA
7326 7327 144210880 13 C1=CC(=C(C=C1Cl)Cl)Cl Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 LightBiologicals
7915 7916 144211469 33 C(C=O)Cl Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 LightBiologicals
6809 6810 144210363 34 C(CCl)O Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 Enamine
15 16 144203567 51 C(CC(=O)O)C(=O)C(=O)O Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 NCI
4562 4563 144208116 51 C(CC(=O)O)C(=O)C(=O)O Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 SIGMA
5533 5534 144209087 66 C1=CC(=C(C=C1Cl)O)Cl Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 SIGMA
9816 9817 144213370 66 C1=CC(=C(C=C1Cl)O)Cl Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 SIGMA
Explore the sorted dataframe displayed above. Identify any CIDs that are repeated, and check whether they have different SID or activity_outcome values.

We will use the drop_duplicates() method in pandas to remove duplicate rows from our dataframe. The subset parameter specifies which column(s) to consider when identifying duplicates; in our case, we need unique identifiers to retrieve structural data, so we use 'cid'.

df_clean = df.drop_duplicates(subset='cid')  # remove duplicate rows except for the first occurring row.
print('total number of Substance IDs=',len(df_clean['sid'].unique()))
print('total number of Compound IDs= ',len(df_clean['cid'].unique()))
total number of Substance IDs= 6793
total number of Compound IDs=  6793
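`drop_duplicates` keeps the first occurrence by default (`keep='first'`). A toy sketch (SIDs and CIDs invented, mirroring the sorted table above) shows that only the first row per CID survives:

```python
import pandas as pd

toy = pd.DataFrame({
    'cid': [13, 13, 51, 51, 66],
    'sid': [4956, 7327, 16, 4563, 5534],
})

# keep='first' is the default: the first row for each CID survives
deduped = toy.drop_duplicates(subset='cid')
print(deduped['sid'].tolist())   # [4956, 16, 5534]
```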

Save as AID743139_clean_CID…#

Save the dataframe in your curated data folder with a date stamp, using the stem "AID743139_clean_CID".

# Save cleaned CID data (curated)
from pathlib import Path
from cinf26pk.core import make_fixed_filename

# Generate path (relative to notebook)
CUR = Path("data/AID743139/curated")
CUR.mkdir(parents=True, exist_ok=True)

# Create deterministic filename with date stamp
fname = make_fixed_filename(
    stem="AID743139_clean_CID",
    ext="csv",
    add_date=True
)

# Save curated dataset
df_clean.to_csv(CUR / fname, index=False)

print(f"[Saved] Curated PubChem data → {CUR / fname}")
[Saved] Curated PubChem data → data/AID743139/curated/AID743139_clean_CID_20260205.csv
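`make_fixed_filename` comes from the course's `cinf26pk` package. If that package is unavailable, a minimal stand-in (a hypothetical implementation, covering only the behavior used here) might look like:

```python
from datetime import date

def make_fixed_filename(stem: str, ext: str, add_date: bool = False) -> str:
    """Build a deterministic filename, optionally date-stamped as YYYYMMDD."""
    if add_date:
        return f"{stem}_{date.today():%Y%m%d}.{ext}"
    return f"{stem}.{ext}"

print(make_fixed_filename(stem="AID743139_clean_CID", ext="csv", add_date=True))
```

Encoding the date in the filename preserves provenance: each curated snapshot is distinguishable, and nothing is silently overwritten across sessions run on different days.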
#write your code here to open the file as a dataframe called df
#double-check that the number of CIDs and SIDs has not changed and that inconclusives and duplicates have been removed
import pandas as pd
df = pd.read_csv(CUR/"AID743139_clean_CID_20260205.csv")
print('total number of Substance IDs=',len(df['sid'].unique()))
print('total number of Compound IDs= ',len(df['cid'].unique()))
df.head()
total number of Substance IDs= 6793
total number of Compound IDs=  6793
pc_result_tag sid cid PUBCHEM_EXT_DATASOURCE_SMILES activity_outcome activity_score activity_url assay_data_comment activity_summary antagonist_activity antagonist_potency antagonist_efficacy viability_activity viability_potency viability_efficacy sample_source
0 1 144203552 12850184 C(C(=O)[C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O.C(... Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 NCI
1 2 144203553 89753 C([C@H]([C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O)O... Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 NCI
2 3 144203554 9403 C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@H]2OC(=O)CCC... Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 NCI
3 4 144203555 13218779 C[C@@]12CC[C@@H](C1(C)C)C[C@H]2OC(=O)CSC#N Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 NCI
4 10 144203561 637566 CC(=CCC/C(=C/CO)/C)C Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 NCI

Note from the above that CIDs are now integers, and the number of CIDs equals the number of SIDs.

4.5 Adding “numeric” activity classes#

In general, machine learning algorithms require both inputs and outputs to be in numerical form. In our case the inputs will be molecular structures: we have been cleaning our data to ensure we have valid CIDs that we can use to retrieve SMILES strings and subsequently calculate binary fingerprints as input features. However, the output, biological activity, is currently stored as the text labels active or inactive. To make this data suitable for modeling, we will create a new column called 'activity' that encodes these labels as numeric values:

  • 1 for actives (including both active agonists and antagonists)

  • 0 for inactives

We combine the two types of active compounds into a single class because our goal is to train a binary classifier that distinguishes between compounds that interact with the target (active) and those that do not (inactive).

df.groupby(['activity_outcome','activity_summary']).count()
pc_result_tag sid cid PUBCHEM_EXT_DATASOURCE_SMILES activity_score activity_url assay_data_comment antagonist_activity antagonist_potency antagonist_efficacy viability_activity viability_potency viability_efficacy sample_source
activity_outcome activity_summary
Active active antagonist 290 290 290 290 290 0 0 290 289 290 290 88 275 290
Inactive inactive 6050 6050 6050 6050 6050 0 0 6050 0 6050 6050 269 5964 6050
Inconclusive active agonist 453 453 453 453 453 0 0 453 453 453 453 44 434 453
df['activity'] = [ 0 if x == 'inactive' else 1 for x in df['activity_summary'] ]
df.head() #new column is added to end of column list, so scroll right in the resulting cell output.
pc_result_tag sid cid PUBCHEM_EXT_DATASOURCE_SMILES activity_outcome activity_score activity_url assay_data_comment activity_summary antagonist_activity antagonist_potency antagonist_efficacy viability_activity viability_potency viability_efficacy sample_source activity
0 1 144203552 12850184 C(C(=O)[C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O.C(... Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 NCI 0
1 2 144203553 89753 C([C@H]([C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O)O... Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 NCI 0
2 3 144203554 9403 C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@H]2OC(=O)CCC... Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 NCI 0
3 4 144203555 13218779 C[C@@]12CC[C@@H](C1(C)C)C[C@H]2OC(=O)CSC#N Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 NCI 0
4 10 144203561 637566 CC(=CCC/C(=C/CO)/C)C Inactive 0.0 NaN NaN inactive inactive NaN 0.0 inactive NaN 0.0 NCI 0

Double-check the count of active/inactive compounds.

df.groupby('activity_summary').count()
pc_result_tag sid cid PUBCHEM_EXT_DATASOURCE_SMILES activity_outcome activity_score activity_url assay_data_comment antagonist_activity antagonist_potency antagonist_efficacy viability_activity viability_potency viability_efficacy sample_source activity
activity_summary
active agonist 453 453 453 453 453 453 0 0 453 453 453 453 44 434 453 453
active antagonist 290 290 290 290 290 290 0 0 290 289 290 290 88 275 290 290
inactive 6050 6050 6050 6050 6050 6050 0 0 6050 0 6050 6050 269 5964 6050 6050
df.groupby('activity').count() 
pc_result_tag sid cid PUBCHEM_EXT_DATASOURCE_SMILES activity_outcome activity_score activity_url assay_data_comment activity_summary antagonist_activity antagonist_potency antagonist_efficacy viability_activity viability_potency viability_efficacy sample_source
activity
0 6050 6050 6050 6050 6050 6050 0 0 6050 6050 0 6050 6050 269 5964 6050
1 743 743 743 743 743 743 0 0 743 743 742 743 743 132 709 743
Does the number of actives in this output equal the sum of the active agonists and active antagonists above?
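A quick arithmetic check using the counts shown in the groupby outputs above:

```python
# Counts taken from the groupby outputs above.
n_agonist = 453
n_antagonist = 290
n_active = 743

# The two active classes should sum to the single active class (activity == 1).
assert n_agonist + n_antagonist == n_active
print("Active counts are consistent:", n_agonist + n_antagonist)
```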

4.6 Create a smaller data frame that only contains CIDs and activities.#

Let’s create a smaller data frame that only contains CIDs, activities and the PubChem External Data Sources. This data frame will be merged with a data frame containing molecular fingerprint information.

df_activity = df[['cid','activity','PUBCHEM_EXT_DATASOURCE_SMILES']]  #also included the external datasource SMILES to compare later with PubChem validated CID SMILES
df_activity.head(5)
cid activity PUBCHEM_EXT_DATASOURCE_SMILES
0 12850184 0 C(C(=O)[C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O.C(...
1 89753 0 C([C@H]([C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O)O...
2 9403 0 C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@H]2OC(=O)CCC...
3 13218779 0 C[C@@]12CC[C@@H](C1(C)C)C[C@H]2OC(=O)CSC#N
4 637566 0 CC(=CCC/C(=C/CO)/C)C
df_activity.shape
(6793, 3)

While we could have used the SMILES that were already in our dataframe, the label PUBCHEM_EXT_DATASOURCE_SMILES indicates that the SMILES may have been provided by the submitter and may not be canonical. To ensure we have quality data, we will retrieve SMILES from PubChem.

print("CIDs are stored in original df dataframe as", df['cid'].dtype)
print("CIDs are stored in df_activity dataframe as", df_activity['cid'].dtype)
cids = df.cid.astype(int).tolist()
print(cids[:10])
print("CIDs are now stored as", type(cids[0]))
print(f"CIDs is type {type(cids)}")
CIDs are stored in original df dataframe as int64
CIDs are stored in df_activity dataframe as int64
[12850184, 89753, 9403, 13218779, 637566, 4766, 3080, 7048801, 51, 66435]
CIDs are now stored as <class 'int'>
CIDs is type <class 'list'>

4.6.1 Save as AID743139_activity_CID#

Save the dataframe in your curated data folder with date stamp and version number using prefix="AID743139_activity_CID".

#write code here
# Complete the code to save your work as a versioned filename in the folder

from pathlib import Path
from cinf26pk.core import make_fixed_filename

# Generate path (relative to notebook)
CUR = Path("data/AID743139/curated")
CUR.mkdir(parents=True, exist_ok=True)

# Create deterministic filename with date stamp
fname = make_fixed_filename(
    stem="AID743139_Activity_CID",
    ext="csv",
    add_date=True
)

# Save curated dataset
df_activity.to_csv(CUR / fname, index=False)

print(f"[Saved] Curated PubChem data → {CUR / fname}")
[Saved] Curated PubChem data → data/AID743139/curated/AID743139_Activity_CID_20260205.csv

5 Acquire standardized PubChem SMILES#

Now that we have a list of CIDs as integers, we can retrieve the SMILES from PubChem using the PUG REST API. We will use these to generate RDKit fingerprints.

Once we have retrieved the SMILES we will store them and their respective CIDs into a new df_smiles dataframe.

Code Dive

The following code uses several new functions from the cinf26pk package, along with the tqdm package. If you do not have access to the package, you can insert the cinf26pk functions directly into the code cell (after removing the import statements).

chunk_list() - a function of the cinf26pk.core module
      
def chunk_list(lst: List[Any], size: int = 100):
    """
    Yield successive fixed-size chunks from a list.

    Parameters
    ----------
    lst : list
        List to split.
    size : int, default=100
        Chunk size.

    Yields
    ------
    list
        List chunk.
    """
    for i in range(0, len(lst), size):
        yield lst[i:i + size]
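A quick usage sketch of chunk_list, redefined inline here so it runs without the package:

```python
from typing import Any, List

def chunk_list(lst: List[Any], size: int = 100):
    """Yield successive fixed-size chunks from a list."""
    for i in range(0, len(lst), size):
        yield lst[i:i + size]

# 250 fake CIDs split into chunks of 100 -> sizes 100, 100, 50
fake_cids = list(range(250))
chunks = list(chunk_list(fake_cids, size=100))
print([len(c) for c in chunks])  # [100, 100, 50]
```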

pubchem_get() - a function of the cinf26pk.pubchem.http module

def pubchem_get(url, retries=DEFAULT_RETRIES, timeout=DEFAULT_TIMEOUT,
                base_delay=2, return_json=True):
    """
    Safely performs a GET request to a PubChem PUG-REST endpoint with retry/backoff.

    Parameters
    ----------
    url : str
        Full URL for the GET request.
    retries : int, default=3
        Number of retry attempts.
    timeout : int, default=30
        Timeout (seconds) for request.
    base_delay : int, default=2
        Backoff base delay.
    return_json : bool, default=True
        Return JSON or raw text.

    Returns
    -------
    dict or str or None
        Parsed response, or None if all retries fail.
    """
    for attempt in range(1, retries + 1):
        try:
            r = requests.get(url, timeout=timeout)
            r.raise_for_status()
            return r.json() if return_json else r.text

        except Exception as e:
            print(f"[Warning] GET attempt {attempt}/{retries} failed.")
            print(f"         URL: {url}")
            print(f"         Error: {type(e).__name__}: {e}")
            time.sleep(base_delay * attempt)

    print(f"[Error] All GET attempts failed for URL: {url}")
    return None

tqdm

tqdm is a Python package that provides fast, extensible progress bars for loops and iterable-based operations. It is commonly used in data science and machine learning workflows to give real-time feedback on long-running tasks such as file I/O, model training, or batch processing, without requiring changes to the underlying logic of the code. tqdm integrates cleanly with standard Python iterables, Pandas operations, and Jupyter notebooks.

To install it, activate your environment and install from conda-forge:
conda install -c conda-forge tqdm
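A minimal tqdm sketch, with a no-op fallback so the cell still runs if tqdm is not installed:

```python
try:
    from tqdm import tqdm
except ImportError:
    # Fallback: behave like a plain iterator if tqdm is unavailable.
    def tqdm(iterable, **kwargs):
        return iterable

# The progress bar wraps any iterable without changing the loop's logic.
total = 0
for i in tqdm(range(5), desc="demo"):
    total += i

print(total)  # 10
```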

import time
from io import StringIO
from pathlib import Path

import pandas as pd
from tqdm import tqdm

from cinf26pk.core import chunk_list, make_fixed_filename
from cinf26pk.pubchem import http


# ------------------------------------------------------------
# 1. Reload curated activity data
# ------------------------------------------------------------
df_activity = pd.read_csv(
    "data/AID743139/curated/AID743139_Activity_CID_20260205.csv"
)

# Extract CIDs as integers
cids = df_activity["cid"].astype(int).tolist()
print("# CIDs:", len(cids))


# ------------------------------------------------------------
# 2. Retrieve standardized PubChem SMILES
# ------------------------------------------------------------
chunk_size = 100
sleep_time = 0.2   # 5 calls per second
base_url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid"

df_chunks = []
failed_chunks = []

chunks = list(chunk_list(cids, size=chunk_size))
print("# Total chunks:", len(chunks))

for i, chunk in enumerate(tqdm(chunks)):
    
    cid_str = ",".join(str(x) for x in chunk)
    url = f"{base_url}/{cid_str}/property/SMILES/TXT"

    text = http.pubchem_get(url, return_json=False)
    
    if text is None:
        print(f"[Warning] Failed chunk {i}")
        failed_chunks.append(i)
        continue
    
    # Convert to DataFrame
    df_chunk = pd.read_csv(
        StringIO(text),
        header=None,
        names=["smiles"]
    )
    df_chunks.append(df_chunk)
    
    # Rate limit compliance
    time.sleep(sleep_time)


# ------------------------------------------------------------
# 3. Combine results
# ------------------------------------------------------------
df_smiles = pd.concat(df_chunks, ignore_index=True)
df_smiles["cid"] = cids

print("\nPreview of combined SMILES:")
print(df_smiles.head())

print(f"\nFailed chunks: {failed_chunks}")


# ------------------------------------------------------------
# 4. Save canonical SMILES dataset (overwrite-on-rerun)
# ------------------------------------------------------------
CUR = Path("data/AID743139/curated")
CUR.mkdir(parents=True, exist_ok=True)

fname = make_fixed_filename(
    stem="AID743139_pcSMILES_act",
    ext="csv",
    add_date=True
)

df_smiles.to_csv(CUR / fname, index=False)

print(f"[Saved] Curated PubChem SMILES → {CUR / fname}")
# CIDs: 6793
# Total chunks: 68
100%|███████████████████████████████████████████████████████████████████████████████████| 68/68 [05:54<00:00,  5.21s/it]
Preview of combined SMILES:
                                              smiles       cid
0  C(C(=O)[C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O.C(...  12850184
1  C([C@H]([C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O)O...     89753
2  C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@H]2OC(=O)CCC...      9403
3         C[C@@]12CC[C@@H](C1(C)C)C[C@H]2OC(=O)CSC#N  13218779
4                               CC(=CCC/C(=C/CO)/C)C    637566

Failed chunks: []
[Saved] Curated PubChem SMILES → data/AID743139/curated/AID743139_pcSMILES_act_20260205.csv

df_smiles.head()
smiles cid
0 C(C(=O)[C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O.C(... 12850184
1 C([C@H]([C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O)O... 89753
2 C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@H]2OC(=O)CCC... 9403
3 C[C@@]12CC[C@@H](C1(C)C)C[C@H]2OC(=O)CSC#N 13218779
4 CC(=CCC/C(=C/CO)/C)C 637566

5.1 Deal with salts and mixtures#

We will be generating fingerprints from the SMILES strings. To ensure the quality of the fingerprints, we will clean the SMILES strings to remove salts and keep only the largest fragment of each compound. This is important because salts and mixtures can introduce noise and variability into the fingerprint data, which can negatively impact model performance.

SaltRemover is a utility module within RDKit that removes counterions and small disconnected fragments from molecular structures. Many chemical records represent compounds in their salted or formulated forms (e.g., hydrochloride salts, sodium salts, solvates). While chemically valid, these extra components can interfere with structure-based analysis, fingerprint generation, and machine-learning workflows. The purpose of SaltRemover is therefore not to change the underlying chemical identity of a compound, but to isolate the core molecular structure that is typically responsible for biological activity.

In the following code, each SMILES string is first converted into an RDKit molecule object using Chem.MolFromSmiles. The SaltRemover is then applied via StripMol, which removes known salt fragments while ensuring that at least one fragment remains (dontRemoveEverything=True). After salt stripping, the molecule may still consist of multiple disconnected fragments (for example, residual solvents or small ions). To resolve this, the code explicitly extracts all fragments and retains only the largest fragment by atom count, under the assumption that this fragment represents the biologically relevant parent compound. Finally, the cleaned molecule is converted back into a SMILES string using Chem.MolToSmiles, and the result is stored as clean_smiles. The net effect is a standardized, salt-free representation of each compound suitable for fingerprint generation, comparison, and machine-learning analysis.

# reload the file you just saved and print the head to ensure it was correctly saved
import pandas as pd
df_smiles = pd.read_csv("data/AID743139/curated/AID743139_pcSMILES_act_20260205.csv")
df_smiles.head()
smiles cid
0 C(C(=O)[C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O.C(... 12850184
1 C([C@H]([C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O)O... 89753
2 C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@H]2OC(=O)CCC... 9403
3 C[C@@]12CC[C@@H](C1(C)C)C[C@H]2OC(=O)CSC#N 13218779
4 CC(=CCC/C(=C/CO)/C)C 637566
Code Dive

Consider this line from the following code cell:

df_smiles['clean_smiles'] = df_smiles['smiles'].apply(clean_smiles)
Explanation: In the code cell below we create a Python function called clean_smiles(). The apply method lets us apply it to the "smiles" series and generate a new "clean_smiles" series within the DataFrame, which is a column of salt-free SMILES. Although it behaves like a loop, it is written in a Pandas-native way that preserves index alignment and avoids manual iteration.

To identify how many SMILES needed to be cleaned, you could compare the SMILES in the ‘smiles’ column with those in the ‘clean_smiles’ column, but due to the noncanonical nature of SMILES, those generated by RDKit may differ from those obtained from PubChem even if the structure has undergone no modification.
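As a minimal illustration of the apply pattern on a toy series (the "cleaning" function here just uppercases the string, standing in for the real clean_smiles() function):

```python
import pandas as pd

# Toy stand-in for a cleaning function: here it simply uppercases the string.
def toy_clean(sm):
    return sm.upper()

s = pd.Series(["cco", "c1ccccc1"])
cleaned = s.apply(toy_clean)   # element-wise, index-aligned
print(cleaned.tolist())  # ['CCO', 'C1CCCCC1']
```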

from rdkit import Chem
from rdkit.Chem import SaltRemover

remover = SaltRemover.SaltRemover()

def clean_smiles(sm):
    mol = Chem.MolFromSmiles(sm)
    if mol is None:
        return None

    mol = remover.StripMol(mol, dontRemoveEverything=True)

    # keep only largest fragment
    frags = Chem.GetMolFrags(mol, asMols=True)
    mol = max(frags, key=lambda m: m.GetNumAtoms())

    return Chem.MolToSmiles(mol)

df_smiles['clean_smiles'] = df_smiles['smiles'].apply(clean_smiles)

print(df_smiles.head(5))
print(df_smiles.shape)
[12:31:32] WARNING: not removing hydrogen atom without neighbors
                                              smiles       cid  \
0  C(C(=O)[C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O.C(...  12850184   
1  C([C@H]([C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O)O...     89753   
2  C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@H]2OC(=O)CCC...      9403   
3         C[C@@]12CC[C@@H](C1(C)C)C[C@H]2OC(=O)CSC#N  13218779   
4                               CC(=CCC/C(=C/CO)/C)C    637566   

                                        clean_smiles  
0         O=C(CO)[C@@H](O)[C@H](O)[C@@H](O)C(=O)[O-]  
1       O=C([O-])[C@H](O)[C@@H](O)[C@H](O)[C@H](O)CO  
2  C[C@]12CC[C@@H]3c4ccc(O)cc4CC[C@H]3[C@@H]1CC[C...  
3       CC1(C)[C@@H]2CC[C@@]1(C)[C@H](OC(=O)CSC#N)C2  
4                                CC(C)=CCC/C(C)=C/CO  
(6793, 3)
Check Your Understanding

In this section, we transformed the original SMILES strings into clean_smiles by stripping salts and retaining only the largest fragment. This step is often treated as “routine preprocessing,” but it is better understood as a deliberate modeling decision that changes the data we carry forward into fingerprint generation. The goal here is not to memorize a single correct approach, but to recognize that multiple reasonable approaches exist, and that the approach you choose should be justified and recorded. Keep in mind that this decision does not only affect the data we are using now: any future dataset scored with the model we build from these fingerprints will implicitly be assumed to have been processed in the same way.

What assumptions did we make? — recognizing that a decision was made

  • What is more important for modeling purposes: the full formulation, or the parent structure?
  • What assumptions are embedded in the phrase “biologically relevant parent compound”?
  • How might the meaning of “clean” change depending on the assay context, the intended model, or the chemistry?

Why documentation may matter more than the decision itself

  • If someone else reproduces this work, what do they need to know to obtain the same clean_smiles values?
  • What information about the original records is lost once we overwrite or replace the original SMILES with cleaned structures?
  • How could a future comparison between two models be confounded if they used different salt or mixture handling rules?
  • Why is it important to keep versioned files as the dataset evolves through these transformations?

Looking ahead: applying the model to new data

  • When this model is applied to new compounds, will those SMILES need to be cleaned in the same way?
  • What could happen if new data represent salts or mixtures differently than the training data?
  • How might inconsistent preprocessing affect fingerprint generation and model predictions?
  • What parts of this preprocessing step should eventually be treated as part of the model pipeline rather than as one-time data cleaning?

The main takeaway is that preprocessing steps create a new version of your dataset and define the assumptions your model will carry forward. Once fingerprints are generated, the model’s behavior depends not only on the algorithm, but also on whether future data are transformed in a compatible way.

df_smiles['is_identical'] = df_smiles['smiles'] == df_smiles['clean_smiles']
print(df_smiles['is_identical'].sum())
df_smiles['is_identical'].value_counts()
778
is_identical
False    6015
True      778
Name: count, dtype: int64
# make sure the number of rows in df_smiles is equal to the number of unique CIDs we had previously
df_activity = pd.read_csv("data/AID743139/curated/AID743139_Activity_CID_20260205.csv")
print("Number of CIDs in df_smiles:", len(df_smiles))
print("Number of unique CIDs in original df:", len(df_activity['cid'].unique()))

print("The number of CIDs in df_smiles is equal to the number of unique CIDs in the original df:",
       len(df_smiles) == len(df_activity['cid'].unique()))
df_smiles.head()
Number of CIDs in df_smiles: 6793
Number of unique CIDs in original df: 6793
The number of CIDs in df_smiles is equal to the number of unique CIDs in the original df: True
smiles cid clean_smiles is_identical
0 C(C(=O)[C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O.C(... 12850184 O=C(CO)[C@@H](O)[C@H](O)[C@@H](O)C(=O)[O-] False
1 C([C@H]([C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O)O... 89753 O=C([O-])[C@H](O)[C@@H](O)[C@H](O)[C@H](O)CO False
2 C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@H]2OC(=O)CCC... 9403 C[C@]12CC[C@@H]3c4ccc(O)cc4CC[C@H]3[C@@H]1CC[C... False
3 C[C@@]12CC[C@@H](C1(C)C)C[C@H]2OC(=O)CSC#N 13218779 CC1(C)[C@@H]2CC[C@@]1(C)[C@H](OC(=O)CSC#N)C2 False
4 CC(=CCC/C(=C/CO)/C)C 637566 CC(C)=CCC/C(C)=C/CO False
Check Your Understanding

After generating clean_smiles, we compared the original SMILES strings to the cleaned ones and found that only 778 entries were identical, while 6,015 were different. At first glance, it is tempting to conclude that this means 6,015 compounds were “changed” during cleaning. This check is useful, but it is also an opportunity to think carefully about what our comparison is actually measuring.

What question does this comparison really answer?

  • Does a difference between smiles and clean_smiles always mean the underlying chemical structure changed?
  • What assumptions are we making when we treat SMILES strings as if they were canonical identifiers?
  • How might RDKit-generated SMILES differ from PubChem SMILES even when no salt or fragment was removed?

Why this result can be misleading at first glance

  • SMILES are not unique representations; the same molecule can be written in multiple valid ways.
  • RDKit often reorders atoms or normalizes bonding when generating SMILES, even if the structure is unchanged.
  • As a result, a False value here may reflect a change in representation, not a change in chemistry.

Connecting this to data provenance and decision making

  • What evidence do we actually need to support the claim that a compound was chemically altered during cleaning?
  • How does the choice of representation affect our interpretation of “before” and “after” datasets?
  • Why is it important to distinguish between a structural change and a formatting or normalization change?

Looking ahead to better comparisons

  • What properties of a molecule should remain invariant if the underlying structure is truly the same?
  • How might a representation designed to be canonical help answer this question more reliably?
  • In later steps, how could misleading checks like this propagate confusion if they are not carefully interpreted?

The key takeaway is that comparing SMILES strings is a convenient first check, but it is not a definitive test of whether a compound changed during preprocessing. Understanding what a comparison actually measures is just as important as running the comparison itself, especially when those results inform downstream modeling decisions.

5.2 Comparing structures using InChI#

In the following code cell we will convert both the original and cleaned SMILES strings to InChIKeys using RDKit. InChIKeys are unique, fixed-length identifiers derived from the International Chemical Identifier (InChI) representation of a molecule. They provide a standardized way to compare chemical structures, as two identical molecules will produce the same InChIKey.

from rdkit import Chem

def smiles_to_inchi(sm):
    mol = Chem.MolFromSmiles(sm)
    if mol is None:
        return None
    return Chem.MolToInchi(mol)

def smiles_to_inchikey(sm):
    mol = Chem.MolFromSmiles(sm)
    if mol is None:
        return None
    return Chem.MolToInchiKey(mol)

df_smiles['inchikey_original'] = df_smiles['smiles'].apply(smiles_to_inchikey)
df_smiles['inchikey_clean']    = df_smiles['clean_smiles'].apply(smiles_to_inchikey)

df_smiles['is_identical'] = (
    df_smiles['inchikey_original'] == df_smiles['inchikey_clean']
)

print(df_smiles['is_identical'].sum())
df_smiles['is_identical'].value_counts()
[12:33:40] WARNING: not removing hydrogen atom without neighbors
5336
is_identical
True     5336
False    1457
Name: count, dtype: int64
Check Your Understanding

We repeated the comparison using InChI-derived identifiers and found that 5,336 entries were identical, while 1,457 were different. Discuss why this result looks very different from the SMILES-based comparison; it points to an important concept in cheminformatics: not all molecular representations are intended to answer the same question.

Why does the InChI comparison behave differently?

  • InChI (International Chemical Identifier) is an IUPAC-led, open-source standard designed to uniquely describe a chemical structure.
  • Two molecules that are chemically identical should, by design, generate the same InChI, even if their SMILES strings differ.
  • This makes InChI (or InChIKey) much better suited for identity checks than raw SMILES strings.

What this tells us about SMILES

  • SMILES are flexible, human-readable encodings, not unique identifiers.
  • There is no single, universal “canonical SMILES” standard; different software packages use different canonicalization algorithms.
  • As a result, two SMILES strings can differ even when they describe the same molecule.

A well-known illustration of this flexibility is that there are thousands of valid SMILES representations for a single molecule such as caffeine. For a fun and eye-opening example, see this blog post: How do I write thee? Let me count the ways.

Key takeaway

  • SMILES are excellent for storing structures and generating fingerprints, but they are not reliable for determining molecular identity.
  • InChI was created specifically to solve this problem and should be used when the question is “are these two molecules the same?”
  • Choosing the right representation is not a modeling decision that changes the data, but a correctness issue that affects how results are interpreted.

The main lesson here is that tools in cheminformatics are purpose-built. Understanding what question a representation is designed to answer is essential for using it correctly and avoiding misleading conclusions.
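A pure-string illustration of the pitfall: two valid SMILES spellings of ethanol compare unequal as text even though they describe the same molecule, which is exactly why identity checks need a canonical representation such as an InChIKey.

```python
# Two valid SMILES spellings of ethanol.
smiles_a = "CCO"
smiles_b = "OCC"

# Naive string comparison says they differ...
print(smiles_a == smiles_b)  # False

# ...even though both describe the same molecule. A canonical
# representation (e.g., an InChIKey) is needed for identity checks.
```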

Our last step before fingerprint generation will be to drop the columns we generated while analyzing the data and save just the cid and clean_smiles columns for creating fingerprints.

df_smiles.head()
smiles cid clean_smiles is_identical inchikey_original inchikey_clean
0 C(C(=O)[C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O.C(... 12850184 O=C(CO)[C@@H](O)[C@H](O)[C@@H](O)C(=O)[O-] False HNXKHQGRMMZMKP-UINDWOIOSA-L IZSRJDGCGRAUAR-MROZADKFSA-M
1 C([C@H]([C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O)O... 89753 O=C([O-])[C@H](O)[C@@H](O)[C@H](O)[C@H](O)CO False PECPTSUHUFWKDG-IYEMJOQQSA-L RGHNJXZEOKUKBD-SQOUGZDYSA-M
2 C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@H]2OC(=O)CCC... 9403 C[C@]12CC[C@@H]3c4ccc(O)cc4CC[C@H]3[C@@H]1CC[C... True UOACKFBJUYNSLK-XRKIENNPSA-N UOACKFBJUYNSLK-XRKIENNPSA-N
3 C[C@@]12CC[C@@H](C1(C)C)C[C@H]2OC(=O)CSC#N 13218779 CC1(C)[C@@H]2CC[C@@]1(C)[C@H](OC(=O)CSC#N)C2 True IXEVGHXRXDBAOB-BREBYQMCSA-N IXEVGHXRXDBAOB-BREBYQMCSA-N
4 CC(=CCC/C(=C/CO)/C)C 637566 CC(C)=CCC/C(C)=C/CO True GLZPCOQZEFWAFX-JXMROGBWSA-N GLZPCOQZEFWAFX-JXMROGBWSA-N
# reorder columns to have 'cid' first, then 'clean_smiles' and remove columns used to check for differences

df_smiles = df_smiles[['cid','clean_smiles']]
print(df_smiles.shape)
print(df_smiles.dtypes)
df_smiles.head()
(6793, 2)
cid              int64
clean_smiles    object
dtype: object
cid clean_smiles
0 12850184 O=C(CO)[C@@H](O)[C@H](O)[C@@H](O)C(=O)[O-]
1 89753 O=C([O-])[C@H](O)[C@@H](O)[C@H](O)[C@H](O)CO
2 9403 C[C@]12CC[C@@H]3c4ccc(O)cc4CC[C@H]3[C@@H]1CC[C...
3 13218779 CC1(C)[C@@H]2CC[C@@]1(C)[C@H](OC(=O)CSC#N)C2
4 637566 CC(C)=CCC/C(C)=C/CO

Save as AID743139_noSalt_pcSMILES…#

Save the dataframe in your curated data folder with date stamp and version number using prefix="AID743139_noSalt_pcSMILES".

# write your code here:
# Save to data/ with versioning
from pathlib import Path
from cinf26pk.core import make_fixed_filename

# Generate path (relative to notebook)
CUR = Path("data/AID743139/curated")
CUR.mkdir(parents=True, exist_ok=True)

# Create deterministic filename with date stamp
fname = make_fixed_filename(
    stem="AID743139_noSalt_pcSMILES",
    ext="csv",
    add_date=True
)

# Save curated dataset
df_smiles.to_csv(CUR / fname, index=False)

print(f"[Saved] Curated PubChem data → {CUR / fname}")
[Saved] Curated PubChem data → data/AID743139/curated/AID743139_noSalt_pcSMILES_20260205.csv
Check your understanding
  1. When we requested the SMILES data from PubChem through PUG REST, why did we chunk our data and use time.sleep(.2)?

  2. Briefly explain the steps and importance for cleaning the data in the notebook thus far.

  3. How many compounds were salts or multicomponent mixtures and how did we handle them?

Answer
  1. As noted in the Introduction to PUG REST notebook, PubChem enforces a 30-second time limit per request. If we attempt to fetch properties for too many compounds at once, the request may exceed this limit and result in a timeout error. Since our dataset includes nearly 7,000 molecules, we split the list of CIDs into smaller chunks and process each chunk separately. The time.sleep(.2) call between requests keeps us at no more than five requests per second, in line with PubChem's usage policy. Together, these measures keep each request efficient, reduce the likelihood of failure, and ensure smooth and successful data retrieval from the API.

  2. Cleaning the data has involved several key steps: removing entries without CIDs (since structural information cannot be retrieved without them), resolving conflicting activity labels to ensure each compound is consistently classified, and converting text-based activity labels into numerical values suitable for machine learning. These steps are critical to ensure that the model is trained on accurate, consistent, and well-structured data, which improves its performance and prevents errors during modeling.

  3. When we compared SMILES from PubChem with the salt-stripped SMILES generated by RDKit, we found 6,015 that were not equal. But SMILES are not canonical, and RDKit and PubChem can generate different SMILES for the same molecule. So we then converted them to InChIKeys, which are canonical, and found that 1,457 had actually changed. For these compounds, the cleaning step kept only the fragment with the largest number of atoms, assuming this represented the parent compound.
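The chunk count and the minimum wait imposed by rate limiting can be checked with quick arithmetic (numbers taken from the retrieval run above):

```python
import math

n_cids = 6793
chunk_size = 100

# Number of PUG REST requests needed to cover all CIDs.
n_chunks = math.ceil(n_cids / chunk_size)
print(n_chunks)  # 68

# At 0.2 s sleep per chunk, rate limiting alone adds ~13.6 s of wait time.
print(round(n_chunks * 0.2, 1))  # 13.6
```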

6 Convert SMILES to Binary Data for Model Input#

6.1 Generate MACCS keys from SMILES#

At this point in the workflow, we have completed the chemical and biological curation of the dataset. We have:

  • Selected which compounds to include

  • Resolved conflicting assay outcomes

  • Defined a binary activity label

  • Standardized molecular representations by removing salts and mixtures

The remaining task in this module is to translate chemical structure into a numerical form that a machine-learning algorithm can consume. To do this, we convert each cleaned SMILES string into an RDKit molecule object and generate MACCS structural keys. Each MACCS key encodes the presence (1) or absence (0) of a predefined substructure pattern. The result is a fixed-length binary vector that represents chemical structure in a form suitable for supervised learning. Rather than working with RDKit bitvector objects directly, we explicitly convert these fingerprints into tabular numerical data and save them to disk. The resulting file contains:

  • Compound identifiers (cid)

  • The curated binary activity label (activity)

  • The cleaned SMILES representation (clean_smiles)

  • One column for each MACCS key (e.g., MACCS000 through MACCS166)

This file represents the final output of Module 10.1. It serves as the starting point for:

  • Model construction in 10.2 Naive Bayes: Model Construction Workflow

  • Pipeline construction, reuse, and inference in 10.3 Pipelines and Inference

Once fingerprints are generated, the model no longer “sees” molecules; it only sees numbers. Any chemical assumptions made earlier (salt removal, fragment selection, fingerprint choice) are now frozen into the feature representation. This is why the outputs of this step are treated as durable artifacts and reused across later modeling workflows. By saving this representation explicitly, we make the transition from “chemical data preparation” to “machine-learning experimentation” clear, reproducible, and inspectable. Our directory architecture will support this by moving the fingerprints out of data/curated and into a new data/features directory.

# load ..._noSalt_pcSMILES... into df_smiles if you restarted the kernel
df_smiles.head()
cid clean_smiles
0 12850184 O=C(CO)[C@@H](O)[C@H](O)[C@@H](O)C(=O)[O-]
1 89753 O=C([O-])[C@H](O)[C@@H](O)[C@H](O)[C@H](O)CO
2 9403 C[C@]12CC[C@@H]3c4ccc(O)cc4CC[C@H]3[C@@H]1CC[C...
3 13218779 CC1(C)[C@@H]2CC[C@@]1(C)[C@H](OC(=O)CSC#N)C2
4 637566 CC(C)=CCC/C(C)=C/CO
from rdkit import Chem
from rdkit.Chem import MACCSkeys

fps = dict()

for idx, row in df_smiles.iterrows():

    mol = Chem.MolFromSmiles(row.clean_smiles)

    if mol is None:
        print("Can't generate MOL object:", "CID", row.cid, row.clean_smiles)
    else:
        fps[row.cid] = [row.cid] + list(MACCSkeys.GenMACCSKeys(mol).ToBitString())

print(f"Dictionary contains {len(fps)} key–value pairs")

# Print first two key-value pairs of the dictionary fps
def dict_head(d, n=2):
    return list(d.items())[:n]
print(dict_head(fps))
Dictionary contains 6793 key–value pairs
[(12850184, [12850184, '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '1', '0', '0', '0', '1', '1', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '1', '0', '0', '0', '0', '0', '0', '0', '0', '0', '1', '0', '0', '0', '0', '0', '0', '1', '1', '1', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '1', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '1', '0', '0', '0', '0', '0', '0', '0', '1', '1', '0', '0', '0', '1', '0', '0', '1', '1', '0', '0', '0', '0', '0', '1', '0', '0', '0', '0', '0', '1', '1', '1', '1', '0', '1', '0', '1', '0', '0', '0', '0', '1', '0', '0']), (89753, [89753, '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '1', '0', '0', '0', '1', '1', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '1', '0', '0', '0', '0', '0', '0', '0', '0', '0', '1', '0', '0', '0', '0', '0', '0', '1', '1', '1', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '1', '0', '0', '0', '0', '1', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '1', '0', '0', '0', '0', '0', '0', '0', '1', '1', '0', '0', '0', '0', '0', '0', '1', '1', '0', '0', '0', '0', '0', '1', '0', '0', '0', '0', '0', '1', '1', '1', '1', '0', '1', '0', '1', '0', '0', '0', '0', '1', '0', '0'])]
# Create a dataframe of the fingerprints generated from the noSalt SMILES
# Note: any CIDs whose fingerprints could not be generated are absent from fps
# Generate column names
fpbitnames = []

fpbitnames.append('cid')

for i in range(0,167):   # from MACCS000 to MACCS166
    fpbitnames.append( "maccs" + str(i).zfill(3) )

df_fps = pd.DataFrame.from_dict(fps, orient='index', columns=fpbitnames)

print(df_fps.shape)
df_fps.head(5)
(6793, 168)
cid maccs000 maccs001 maccs002 maccs003 maccs004 maccs005 maccs006 maccs007 maccs008 ... maccs157 maccs158 maccs159 maccs160 maccs161 maccs162 maccs163 maccs164 maccs165 maccs166
12850184 12850184 0 0 0 0 0 0 0 0 0 ... 1 0 1 0 0 0 0 1 0 0
89753 89753 0 0 0 0 0 0 0 0 0 ... 1 0 1 0 0 0 0 1 0 0
9403 9403 0 0 0 0 0 0 0 0 0 ... 1 0 1 1 0 1 1 1 1 0
13218779 13218779 0 0 0 0 0 0 0 0 0 ... 1 0 1 1 1 0 1 1 1 0
637566 637566 0 0 0 0 0 0 0 0 0 ... 1 0 0 1 0 0 0 1 0 0

5 rows × 168 columns
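The `orient='index'` argument tells pandas to treat each dictionary key as a row label and each value list as that row's cells. A toy sketch of the same pattern (hypothetical CIDs, two fingerprint bits instead of 166):

```python
import pandas as pd

# Toy version of the fps dictionary: cid -> [cid, bit0, bit1]
toy_fps = {
    101: [101, '0', '1'],
    202: [202, '1', '0'],
}

# Each key becomes a row label; each list fills that row's columns
df_toy = pd.DataFrame.from_dict(toy_fps, orient='index',
                                columns=['cid', 'maccs000', 'maccs001'])
print(df_toy.shape)   # (2, 3); the dict keys become the index
```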

# Merge DataFrames (attach clean_smiles to the end of df_fps)
df_fps = df_fps.merge(
    df_smiles[['cid', 'clean_smiles']],
    on='cid',
    how='left'
)
df_fps.head(3)
cid maccs000 maccs001 maccs002 maccs003 maccs004 maccs005 maccs006 maccs007 maccs008 ... maccs158 maccs159 maccs160 maccs161 maccs162 maccs163 maccs164 maccs165 maccs166 clean_smiles
0 12850184 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 1 0 0 O=C(CO)[C@@H](O)[C@H](O)[C@@H](O)C(=O)[O-]
1 89753 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 1 0 0 O=C([O-])[C@H](O)[C@@H](O)[C@H](O)[C@H](O)CO
2 9403 0 0 0 0 0 0 0 0 0 ... 0 1 1 0 1 1 1 1 0 C[C@]12CC[C@@H]3c4ccc(O)cc4CC[C@H]3[C@@H]1CC[C...

3 rows × 169 columns

# Arrange columns (move cid and clean_smiles to the front)
front = ['cid', 'clean_smiles']
rest = [c for c in df_fps.columns if c not in front]

df_fps = df_fps[front + rest]

df_fps.head(3)
cid clean_smiles maccs000 maccs001 maccs002 maccs003 maccs004 maccs005 maccs006 maccs007 ... maccs157 maccs158 maccs159 maccs160 maccs161 maccs162 maccs163 maccs164 maccs165 maccs166
0 12850184 O=C(CO)[C@@H](O)[C@H](O)[C@@H](O)C(=O)[O-] 0 0 0 0 0 0 0 0 ... 1 0 1 0 0 0 0 1 0 0
1 89753 O=C([O-])[C@H](O)[C@@H](O)[C@H](O)[C@H](O)CO 0 0 0 0 0 0 0 0 ... 1 0 1 0 0 0 0 1 0 0
2 9403 C[C@]12CC[C@@H]3c4ccc(O)cc4CC[C@H]3[C@@H]1CC[C... 0 0 0 0 0 0 0 0 ... 1 0 1 1 0 1 1 1 1 0

3 rows × 169 columns
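The merge-then-reorder pattern above generalizes: a left merge keeps every row of the left frame and appends the new columns at the end, and reindexing with `front + rest` moves chosen columns to the front without dropping any. A self-contained toy sketch (hypothetical column names):

```python
import pandas as pd

left = pd.DataFrame({'cid': [1, 2], 'bit0': [0, 1]})
right = pd.DataFrame({'cid': [1, 2], 'smiles': ['CCO', 'CC']})

merged = left.merge(right, on='cid', how='left')   # smiles lands at the end

front = ['cid', 'smiles']
rest = [c for c in merged.columns if c not in front]
merged = merged[front + rest]                      # move key columns to the front

print(list(merged.columns))   # ['cid', 'smiles', 'bit0']
```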

Save as features/AID743139_MACCS_noSalt…#

Let's now save this in a new folder, data/AID743139/features, and give it a versioned name, AID743139_MACCS_noSalt.

from pathlib import Path
from cinf26pk.core import make_fixed_filename

# Generate path (relative to notebook)
FEATURES = Path("data/AID743139/features")
FEATURES.mkdir(parents=True, exist_ok=True)

# Create deterministic filename with date stamp
fname = make_fixed_filename(
    stem="AID743139_MACCS_noSalt",
    ext="csv",
    add_date=True
)

# Save MACCS fingerprint dataset
df_fps.to_csv(FEATURES / fname, index=False)

print(f"[Saved] MACCS features → {FEATURES / fname}")
[Saved] MACCS features → data/AID743139/features/AID743139_MACCS_noSalt_20260205.csv
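`make_fixed_filename` comes from the course package `cinf26pk`, so its exact behavior may differ, but a minimal stand-in that produces the same kind of date-stamped name might look like this (the function body here is an assumption, not the course implementation):

```python
from datetime import date

def make_fixed_filename_sketch(stem: str, ext: str, add_date: bool = True) -> str:
    """Hypothetical stand-in for cinf26pk.core.make_fixed_filename."""
    if add_date:
        # Append today's date as YYYYMMDD before the extension
        return f"{stem}_{date.today():%Y%m%d}.{ext}"
    return f"{stem}.{ext}"

name = make_fixed_filename_sketch("AID743139_MACCS_noSalt", "csv")
print(name)   # e.g. AID743139_MACCS_noSalt_20260205.csv
```

Deterministic, date-stamped names like this are what let you trace each saved file back to the notebook run that produced it.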

6.2 Merge activity data and fingerprint information#

Note: you may need to load the data into df_activity, which you generated in step 4.6.1 (with the file prefix AID743139_Activity_CID_).

#df_activity=pd.read_csv("data/AID743139/curated/AID743139_Activity_CID_20260118_v1.csv")

print(df_activity.shape)
df_activity.head(3)
(6793, 3)
cid activity PUBCHEM_EXT_DATASOURCE_SMILES
0 12850184 0 C(C(=O)[C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O.C(...
1 89753 0 C([C@H]([C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O)O...
2 9403 0 C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@H]2OC(=O)CCC...
print(df_fps.shape)
df_fps.head(3)
(6793, 169)
cid clean_smiles maccs000 maccs001 maccs002 maccs003 maccs004 maccs005 maccs006 maccs007 ... maccs157 maccs158 maccs159 maccs160 maccs161 maccs162 maccs163 maccs164 maccs165 maccs166
0 12850184 O=C(CO)[C@@H](O)[C@H](O)[C@@H](O)C(=O)[O-] 0 0 0 0 0 0 0 0 ... 1 0 1 0 0 0 0 1 0 0
1 89753 O=C([O-])[C@H](O)[C@@H](O)[C@H](O)[C@H](O)CO 0 0 0 0 0 0 0 0 ... 1 0 1 0 0 0 0 1 0 0
2 9403 C[C@]12CC[C@@H]3c4ccc(O)cc4CC[C@H]3[C@@H]1CC[C... 0 0 0 0 0 0 0 0 ... 1 0 1 1 0 1 1 1 1 0

3 rows × 169 columns

# For each row in df_activity, attach the fingerprint
# columns from df_fps that share the same CID
df_data = df_activity.join(df_fps.set_index('cid'), on='cid')
df_data.head(3)
cid activity PUBCHEM_EXT_DATASOURCE_SMILES clean_smiles maccs000 maccs001 maccs002 maccs003 maccs004 maccs005 ... maccs157 maccs158 maccs159 maccs160 maccs161 maccs162 maccs163 maccs164 maccs165 maccs166
0 12850184 0 C(C(=O)[C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O.C(... O=C(CO)[C@@H](O)[C@H](O)[C@@H](O)C(=O)[O-] 0 0 0 0 0 0 ... 1 0 1 0 0 0 0 1 0 0
1 89753 0 C([C@H]([C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O)O... O=C([O-])[C@H](O)[C@@H](O)[C@H](O)[C@H](O)CO 0 0 0 0 0 0 ... 1 0 1 0 0 0 0 1 0 0
2 9403 0 C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@H]2OC(=O)CCC... C[C@]12CC[C@@H]3c4ccc(O)cc4CC[C@H]3[C@@H]1CC[C... 0 0 0 0 0 0 ... 1 0 1 1 0 1 1 1 1 0

3 rows × 171 columns

# Drop the original PubChem source SMILES (clean_smiles is retained)
df_data = df_data.drop(columns=['PUBCHEM_EXT_DATASOURCE_SMILES'])
print(df_data.shape)
df_data.head(3)
(6793, 170)
cid activity clean_smiles maccs000 maccs001 maccs002 maccs003 maccs004 maccs005 maccs006 ... maccs157 maccs158 maccs159 maccs160 maccs161 maccs162 maccs163 maccs164 maccs165 maccs166
0 12850184 0 O=C(CO)[C@@H](O)[C@H](O)[C@@H](O)C(=O)[O-] 0 0 0 0 0 0 0 ... 1 0 1 0 0 0 0 1 0 0
1 89753 0 O=C([O-])[C@H](O)[C@@H](O)[C@H](O)[C@H](O)CO 0 0 0 0 0 0 0 ... 1 0 1 0 0 0 0 1 0 0
2 9403 0 C[C@]12CC[C@@H]3c4ccc(O)cc4CC[C@H]3[C@@H]1CC[C... 0 0 0 0 0 0 0 ... 1 0 1 1 0 1 1 1 1 0

3 rows × 170 columns
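`DataFrame.join` with `on='cid'` matches the left frame's cid column against the right frame's index, which is why df_fps is reindexed by cid first; for this data the result is equivalent to a left merge on cid. A toy sketch illustrating the equivalence:

```python
import pandas as pd

activity = pd.DataFrame({'cid': [1, 2], 'activity': [0, 1]})
fps = pd.DataFrame({'cid': [1, 2], 'bit0': [0, 1]})

# join: left column 'cid' against the right frame's index
joined = activity.join(fps.set_index('cid'), on='cid')

# merge: same left join expressed with a shared key column
merged = activity.merge(fps, on='cid', how='left')

print(joined.equals(merged))   # True: the two approaches agree here
```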

Check to see if there are any CIDs for which the MACCS keys could not be generated. They need to be removed from df_data.

df_data[df_data.isna().any(axis=1)]
cid activity clean_smiles maccs000 maccs001 maccs002 maccs003 maccs004 maccs005 maccs006 ... maccs157 maccs158 maccs159 maccs160 maccs161 maccs162 maccs163 maccs164 maccs165 maccs166

0 rows × 170 columns

print(f"before drop nan: {len(df_data)}")
df_data = df_data.dropna()
print(f"after drop nan: {len(df_data)}")
df_data.head(3)
before drop nan: 6793
after drop nan: 6793
cid activity clean_smiles maccs000 maccs001 maccs002 maccs003 maccs004 maccs005 maccs006 ... maccs157 maccs158 maccs159 maccs160 maccs161 maccs162 maccs163 maccs164 maccs165 maccs166
0 12850184 0 O=C(CO)[C@@H](O)[C@H](O)[C@@H](O)C(=O)[O-] 0 0 0 0 0 0 0 ... 1 0 1 0 0 0 0 1 0 0
1 89753 0 O=C([O-])[C@H](O)[C@@H](O)[C@H](O)[C@H](O)CO 0 0 0 0 0 0 0 ... 1 0 1 0 0 0 0 1 0 0
2 9403 0 C[C@]12CC[C@@H]3c4ccc(O)cc4CC[C@H]3[C@@H]1CC[C... 0 0 0 0 0 0 0 ... 1 0 1 1 0 1 1 1 1 0

3 rows × 170 columns
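`isna().any(axis=1)` flags rows containing at least one missing value; here none were found, so `dropna()` removed nothing. A toy sketch with a deliberately missing fingerprint bit shows both steps doing work:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'cid': [1, 2, 3],
                   'maccs000': [0, np.nan, 1]})

flagged = df[df.isna().any(axis=1)]   # rows with at least one NaN
cleaned = df.dropna()                 # rows with no missing values

print(len(flagged))   # 1 (the row for cid 2)
print(len(cleaned))   # 2
```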

Save df_data in CSV for future use using the prefix=”AID743139_MACCS_activites_noSalt”

from pathlib import Path
from cinf26pk.core import make_fixed_filename

# Generate path (relative to notebook)
FEAT = Path("data/AID743139/features")
FEAT.mkdir(parents=True, exist_ok=True)

# Create deterministic filename with date stamp
fname = make_fixed_filename(
    stem="AID743139_MACCS_activities_noSalt",
    ext="csv",
    add_date=True
)

# Save feature dataset
df_data.to_csv(FEAT / fname, index=False)

print(f"[Saved] Features CSV data → {FEAT / fname}")
[Saved] Features CSV data → data/AID743139/features/AID743139_MACCS_activities_noSalt_20260205.csv

At this point, we have completed the primary objective of this module: transforming curated chemical and biological data into a feature-level dataset suitable for supervised machine learning. The file saved in data/AID743139/features/ captures the outcome of all decisions made in this module—compound selection, activity labeling, structural standardization, and fingerprint representation—encoded as numerical input that models can consume. Importantly, this file is not an end product, but a handoff artifact. In the next modules, it will be reused to construct models, build pipelines, and perform inference. Likewise, in the homework, you will repeat this workflow with a different assay and explore an alternative fingerprinting strategy, reinforcing the idea that feature generation is one step in a broader, repeatable process rather than a one-off task. The goal is not to produce a single “correct” dataset, but to understand how each representation fits into a reproducible machine-learning pipeline.

Homework#

Problem 1: hw_10.1a_fp.ipynb

Create a new notebook in your notebooks directory called hw_10.1a_fp.ipynb and use it to create a new .csv fingerprint file in the /data/AID743139/features directory. You must get instructor approval of the fingerprint type and parameters before proceeding, and you must use the modern generator function approach to create the fingerprints.

The file naming convention needs to identify:

  • AID
  • Fingerprint type (ecfp, fcfp...)
  • Algorithm parameters (radius, size)
  • Curation decisions
  • Date and version
Example: AID743139ecfp4_1024_activities_noSalt_20260111_v1, which describes the first version of a 1024-bit extended-connectivity fingerprint (ECFP4, i.e., Morgan radius 2) of desalted compounds with binary activity data for AID743139, generated on January 11, 2026.
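The "modern generator function approach" refers to RDKit's `rdFingerprintGenerator` module, which replaces the older per-fingerprint helper functions. A minimal sketch of a Morgan (ECFP-style) generator follows; the radius and fpSize values here are illustrative only, not the approved homework parameters:

```python
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

# Illustrative parameters only; get instructor approval for your actual choices
gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=1024)

mol = Chem.MolFromSmiles('c1ccccc1O')   # phenol, an arbitrary test molecule
fp = gen.GetFingerprint(mol)            # ExplicitBitVect of length 1024

print(fp.GetNumBits())                  # 1024
```

One generator object can fingerprint every molecule in your dataset, which keeps the parameters in a single place and makes them easy to record in the filename.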

Problem 2: hw_10.1_data_prep.ipynb

Create a new notebook in your notebooks directory called hw_10.1_data_prep.ipynb and repeat the work of this notebook with the bioassay you chose in your first homework assignment (AID999999 in the directory structure below). You need to generate the directory structure with your notebook, the way we did it with this one.


10_SupervisedML/
├── data/
│   ├── AID743139/
│   └── AID999999/   (your AID)
└── models/
    ├── AID743139/
    └── AID999999/   (your AID)