10. RDKit and Descriptors

10. RDKit and Descriptors#

Desriptors and neurmuscular blockers#

The following is a list of neuromuscular blockers:

Drug Name	Notes
Atracurium	Benchmark; undergoes Hofmann elimination
Cisatracurium	Isomer of atracurium; fewer side effects
Vecuronium	Intermediate-acting aminosteroid
Rocuronium	Rapid onset; widely used in surgery
Pancuronium	Long-acting aminosteroid
Mivacurium	Short-acting benzylisoquinolinium
Tubocurarine	Natural product; historical use

Create a DataFrame of names and SMILES#

import requests
import pandas as pd


# List of drug names to retrieve
drug_names = ["Atracurium", "Cisatracurium", "Vecuronium", 
              "Rocuronium", "Pancuronium", "Mivacurium", "Tubocurarine"]

# Prepare lists to store results
smiles_list = []

# Loop through names and retrieve Canonical SMILES
for name in drug_names:
    try:
        url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{name}/property/ConnectivitySMILES/JSON"
        res = requests.get(url)
        res.raise_for_status()
        smiles = res.json()['PropertyTable']['Properties'][0]['ConnectivitySMILES']
        smiles_list.append(smiles)
    except Exception as e:
        smiles_list.append(None)
        print(f"Could not retrieve SMILES for {name}: {e}")

# Create DataFrame

df = pd.DataFrame({"Name": drug_names, "SMILES": smiles_list})


# Display the DataFrame
df

	Name	SMILES
0	Atracurium	C[N+]1(CCC2=CC(=C(C=C2C1CC3=CC(=C(C=C3)OC)OC)O...
1	Cisatracurium	C[N+]1(CCC2=CC(=C(C=C2C1CC3=CC(=C(C=C3)OC)OC)O...
2	Vecuronium	CC(=O)OC1CC2CCC3C(C2(CC1N4CCCCC4)C)CCC5(C3CC(C...
3	Rocuronium	CC(=O)OC1C(CC2C1(CCC3C2CCC4C3(CC(C(C4)O)N5CCOC...
4	Pancuronium	CC(=O)OC1CC2CCC3C(C2(CC1[N+]4(CCCCC4)C)C)CCC5(...
5	Mivacurium	C[N+]1(CCC2=CC(=C(C=C2C1CC3=CC(=C(C(=C3)OC)OC)...
6	Tubocurarine	CN1CCC2=CC(=C3C=C2C1CC4=CC=C(C=C4)OC5=C6C(CC7=...

Add Field for RDKit molecualar objects#

from rdkit import Chem
from rdkit.Chem import PandasTools
from rdkit.Chem.Draw import IPythonConsole  # Enables RDKit molecule display in Jupyter
# Use pandas broadcasting (apply over the SMILES column) to generate RDKit Mol objects
df["Mol"] = df["SMILES"].apply(lambda s: Chem.MolFromSmiles(s) if pd.notna(s) else None)

# Optional: enable molecule image rendering in the DataFrame (for Jupyter notebooks)
PandasTools.RenderImagesInAllDataFrames(images=True)

# Display the DataFrame
df

	Name	SMILES
0	Atracurium	C[N+]1(CCC2=CC(=C(C=C2C1CC3=CC(=C(C=C3)OC)OC)O...
1	Cisatracurium	C[N+]1(CCC2=CC(=C(C=C2C1CC3=CC(=C(C=C3)OC)OC)O...
2	Vecuronium	CC(=O)OC1CC2CCC3C(C2(CC1N4CCCCC4)C)CCC5(C3CC(C...
3	Rocuronium	CC(=O)OC1C(CC2C1(CCC3C2CCC4C3(CC(C(C4)O)N5CCOC...
4	Pancuronium	CC(=O)OC1CC2CCC3C(C2(CC1[N+]4(CCCCC4)C)C)CCC5(...
5	Mivacurium	C[N+]1(CCC2=CC(=C(C=C2C1CC3=CC(=C(C(=C3)OC)OC)...
6	Tubocurarine	CN1CCC2=CC(=C3C=C2C1CC4=CC=C(C=C4)OC5=C6C(CC7=...

What is the Data Type of the Mol column with the image

# Render dataframe (shows molecule images)
display(df)

# Print out Python types under each column
print("Column data types (based on first row):")
for col in df.columns:
    print(f"{col}: {type(df[col].iloc[0])}")

	Name	SMILES
0	Atracurium	C[N+]1(CCC2=CC(=C(C=C2C1CC3=CC(=C(C=C3)OC)OC)O...
1	Cisatracurium	C[N+]1(CCC2=CC(=C(C=C2C1CC3=CC(=C(C=C3)OC)OC)O...
2	Vecuronium	CC(=O)OC1CC2CCC3C(C2(CC1N4CCCCC4)C)CCC5(C3CC(C...
3	Rocuronium	CC(=O)OC1C(CC2C1(CCC3C2CCC4C3(CC(C(C4)O)N5CCOC...
4	Pancuronium	CC(=O)OC1CC2CCC3C(C2(CC1[N+]4(CCCCC4)C)C)CCC5(...
5	Mivacurium	C[N+]1(CCC2=CC(=C(C=C2C1CC3=CC(=C(C(=C3)OC)OC)...
6	Tubocurarine	CN1CCC2=CC(=C3C=C2C1CC4=CC=C(C=C4)OC5=C6C(CC7=...

Column data types (based on first row):
Name: <class 'str'>
SMILES: <class 'str'>
Mol: <class 'rdkit.Chem.rdchem.Mol'>

Drop SMILES Column#

Let’s remove the SMILES string since we no longer need it

df.drop(columns=["SMILES"], inplace=True)
df

	Name	Mol
0	Atracurium
1	Cisatracurium
2	Vecuronium
3	Rocuronium
4	Pancuronium
5	Mivacurium
6	Tubocurarine

Lipinski Rule of Five:#

A compound is likely to have good oral bioavailability if:

Rule	Threshold	Why it matters
≤ 5 hydrogen bond donors	H-bond donors = NH, OH groups	Too many → low permeability
≤ 10 hydrogen bond acceptors	H-bond acceptors = N, O atoms	Too many → low absorption
MW ≤ 500 Da	Molecular weight	Too large → poor absorption
logP ≤ 5	Partition coefficient (lipophilicity)	Too greasy → poor solubility

Note: It’s called “Rule of 5” because the numbers involved (5, 10, 500, logP 5) are all multiples of 5.

We are now going to add 4 new columns by applying a function (Descriptors.X) to each Mol in the column using apply(). We can do this because the mol column is not a column of images but a column pointers to the location of the rdkit molecular object in memory

“MolWt” – Molecular Weight
“LogP” – Octanol-water partition coefficient
“HBD” – Number of H-bond donors
“HBA” – Number of H-bond acceptors

from rdkit.Chem import Descriptors

df["MolWt"] = df["Mol"].apply(Descriptors.MolWt)
df["LogP"] = df["Mol"].apply(Descriptors.MolLogP)
df["HBD"] = df["Mol"].apply(Descriptors.NumHDonors)
df["HBA"] = df["Mol"].apply(Descriptors.NumHAcceptors)
df

	Name	MolWt	LogP	HBD	HBA
0	Atracurium	929.161	8.0655	0	12
1	Cisatracurium	929.161	8.0655	0	12
2	Vecuronium	557.840	5.9659	0	5
3	Rocuronium	529.786	4.4075	1	5
4	Pancuronium	572.875	6.1105	0	4
5	Mivacurium	1029.278	9.0290	0	14
6	Tubocurarine	609.743	6.7010	2	7

Create Boolean Mask#

Now we want to set up a Boolean Mask based on Lipinski’s Rules. These columns are a set of True/False values based on a criteria.

Property	Threshold
Molecular Weight	≤ 500
LogP	≤ 5
HBD	≤ 5
HBA	≤ 10

mw_ok   = df["MolWt"] <= 500
logp_ok = df["LogP"]  <= 5
hbd_ok  = df["HBD"]   <= 5
hba_ok  = df["HBA"]   <= 10
print(mw_ok)
print(type(mw_ok))

  False
  False
  False
  False
  False
  False
  False
Name: MolWt, dtype: bool
<class 'pandas.core.series.Series'>

Lets now add new columns for each of the Boolean Series we created

df["MW_OK"]   = mw_ok
df["LogP_OK"] = logp_ok
df["HBD_OK"]  = hbd_ok
df["HBA_OK"]  = hba_ok
df

	Name	MolWt	LogP	HBD	HBA	MW_OK	LogP_OK	HBD_OK	HBA_OK
0	Atracurium	929.161	8.0655	0	12	False	False	True	False
1	Cisatracurium	929.161	8.0655	0	12	False	False	True	False
2	Vecuronium	557.840	5.9659	0	5	False	False	True	True
3	Rocuronium	529.786	4.4075	1	5	False	True	True	True
4	Pancuronium	572.875	6.1105	0	4	False	False	True	True
5	Mivacurium	1029.278	9.0290	0	14	False	False	True	False
6	Tubocurarine	609.743	6.7010	2	7	False	False	True	True

Now lets make a Lipinski Score for the total pass count based on the 4 conditions

df["LipinskiPassCount"] = df[["MW_OK", "LogP_OK", "HBD_OK", "HBA_OK"]].sum(axis=1)
df["LipinskiFailCount"] = 4 - df["LipinskiPassCount"]
df

	Name	MolWt	LogP	HBD	HBA	MW_OK	LogP_OK	HBD_OK	HBA_OK	LipinskiPassCount	LipinskiFailCount
0	Atracurium	929.161	8.0655	0	12	False	False	True	False	1	3
1	Cisatracurium	929.161	8.0655	0	12	False	False	True	False	1	3
2	Vecuronium	557.840	5.9659	0	5	False	False	True	True	2	2
3	Rocuronium	529.786	4.4075	1	5	False	True	True	True	3	1
4	Pancuronium	572.875	6.1105	0	4	False	False	True	True	2	2
5	Mivacurium	1029.278	9.0290	0	14	False	False	True	False	1	3
6	Tubocurarine	609.743	6.7010	2	7	False	False	True	True	2	2

rule_cols = ["MW_OK", "LogP_OK", "HBD_OK", "HBA_OK"]
pass_fail_matrix = df[rule_cols]

pass_fail_matrix

	MW_OK	LogP_OK	HBD_OK	HBA_OK
0	False	False	True	False
1	False	False	True	False
2	False	False	True	True
3	False	True	True	True
4	False	False	True	True
5	False	False	True	False
6	False	False	True	True

pass_fail_numeric = pass_fail_matrix.astype(int)
pass_fail_numeric

	LogP_OK	HBD_OK	HBA_OK
0	0	1	0
1	0	1	0
2	0	1	1
3	1	1	1
4	0	1	1
5	0	1	0
6	0	1	1

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 5))
sns.heatmap(pass_fail_numeric, annot=True, cmap="YlGnBu", cbar=False, linewidths=0.5)

plt.title("Lipinski Rule Pass/Fail Matrix")
plt.xlabel("Lipinski Rule")
plt.ylabel("Molecule Index")
plt.tight_layout()
plt.show()

../../_images/0f542e8c722e109ead492dca2bc02e312f3744de3da1703acccd51c9e7462fb7.png