9. RDKit#

1. Introduction#

RDKit (Rational Discovery) is a cheminformatics toolbox for which Greg Landrum is the lead developer.

RDKit provides tools for:

  • Creating and parsing molecular structures (SMILES, SDF, etc.)

  • Computing chemical descriptors and fingerprints

  • Drawing chemical structures and highlighting substructures

  • Performing substructure and similarity searches

  • Working with reactions, 3D conformers, and force fields

1.1 RDkit Resources#

Resource

Link

Description

GitHub Repository

github.com/rdkit/rdkit

Source code, issue tracking, and installation instructions

Documentation (ReadTheDocs)

rdkit.readthedocs.io

Core documentation and API references

Official Website

www.rdkit.org

Central hub with installation links, tutorials, and project info

Install rdkit

conda install -c rdkit rdkit

2. RDkit Major Submodules#

Submodule

Description

Chem

Core chemical functionality (molecules, atoms, bonds, reactions)

DataStructs

Fingerprints and similarity metrics

ML

Machine learning tools, often used with molecular descriptors

Geometry

2D and 3D geometry operations and vector math

SimDivFilters

Diversity filtering for compound selection

VLib

Virtual screening and filtering tools

2.1 RDKit Functions and Classes#

Name

Type

Module

Description

Mol

Class

Chem.rdchem

Core molecule object

RWMol

Class

Chem.rdchem

Read/write version of Mol for editing

Atom

Class

Chem.rdchem

Atom object in a molecule

Bond

Class

Chem.rdchem

Bond between two atoms

Conformer

Class

Chem.rdchem

Stores 3D atom coordinates

MolFromSmiles(smiles)

Function

Chem

Creates a Mol from a SMILES string

MolToSmiles(mol)

Function

Chem

Converts a molecule back to SMILES

MolFromMolFile(path)

Function

Chem

Loads a molecule from .mol or .sdf file

MolToMolFile(mol, path)

Function

Chem

Saves a molecule to file

AddHs(mol)

Function

Chem

Adds explicit hydrogens

RemoveHs(mol)

Function

Chem

Removes hydrogens

SanitizeMol(mol)

Function

Chem

Standardizes valences, charges, etc.

Compute2DCoords(mol)

Function

AllChem

Generates 2D coordinates for drawing

EmbedMolecule(mol)

Function

AllChem

Generates 3D coordinates (distance geometry)

EmbedMultipleConfs(mol)

Function

AllChem

Creates multiple 3D conformers

UFFOptimizeMolecule(mol)

Function

AllChem

Minimizes energy using UFF

MMFFOptimizeMolecule(mol)

Function

AllChem

Minimizes energy using MMFF94

ReactionFromSmarts(smartstr)

Function

AllChem

Creates a reaction object from SMARTS

RunReactants((mol1, mol2))

Function

AllChem

Applies a reaction to reactants

GetMorganFingerprint(mol, radius)

Function

AllChem

Generates circular (ECFP-like) fingerprint

GetMorganFingerprintAsBitVect()

Function

AllChem

Bit vector version for similarity search

TanimotoSimilarity(fp1, fp2)

Function

DataStructs

Computes similarity between fingerprints

MolToImage(mol)

Function

Draw

Returns a PIL image of the molecule

MolsToGridImage([mol1, mol2])

Function

Draw

Displays multiple molecules as a grid

IPythonConsole.ipython_useSVG=True

Setting

Draw

Enables high-quality SVG rendering in Jupyter

Descriptors.MolWt(mol)

Function

Descriptors

Computes molecular weight

Descriptors.MolLogP(mol)

Function

Descriptors

Computes logP

Descriptors.NumRotatableBonds()

Function

Descriptors

Rotatable bond count

rdMolDescriptors.CalcTPSA()

Function

rdMolDescriptors

Calculates topological polar surface area

rdMolDescriptors.CalcNumRings()

Function

rdMolDescriptors

Counts rings in a molecule

rdMolDescriptors.CalcExactMolWt()

Function

rdMolDescriptors

Computes monoisotopic molecular weight

AlignMol(mol, refMol)

Function

AllChem

Aligns one molecule to another (3D)

3. rdkit.Chem#

Submodule

Description

AllChem

3D coordinates, conformers, reaction SMARTS, molecular mechanics

Draw

Visualization and rendering tools (e.g., SVGs for notebooks)

Descriptors

Standard descriptors (MolWt, LogP, TPSA, etc.)

rdFingerprintGenerator

Tools for creating fingerprints (e.g., Morgan, MACCS)

rdMolDescriptors

More specific descriptors (e.g., fragment counts, topological descriptors)

rdChemReactions

Reaction objects, SMARTS parsing

rdMolAlign, rdForceFieldHelpers

Tools for alignment and force-field calculations

rdmolfiles, rdmolops

File I/O and general molecule operations


3.1 AllChem (rdkit.chem.AllChem)#

Key Functions in rdkit.Chem.AllChem#

Function Name

Description

EmbedMolecule()

Generates a 3D conformer from a molecule (uses distance geometry)

EmbedMultipleConfs()

Generates multiple conformers (for conformer search)

UFFOptimizeMolecule()

Optimizes geometry using UFF force field

MMFFOptimizeMolecule()

Optimizes geometry using MMFF94 force field

MMFFGetMoleculeProperties()

Returns MMFF-specific parameters for a molecule

UFFGetMoleculeForceField()

Returns a UFF force field object

MMFFGetMoleculeForceField()

Returns a MMFF force field object

GenerateDepictionMatching2DStructure()

Matches 2D coordinates across conformers

GetMorganFingerprint()

Computes circular Morgan fingerprints (similar to ECFP)

GetMorganFingerprintAsBitVect()

Computes a bit vector version of the Morgan fingerprint

ReactionFromSmarts()

Creates a chemical reaction object from a SMARTS string

RunReactants()

Applies a chemical reaction object to a set of reactants

AssignBondOrdersFromTemplate()

Transfers bond order info to a generated 3D molecule

AlignMol()

Aligns one molecule to another (conformer or scaffold alignment)

Compute2DCoords()

Generates 2D coordinates for drawing (used in visualization)

Generate 3D structure from SMILES#

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw

mol = Chem.MolFromSmiles("CCO")        # Ethanol
mol = Chem.AddHs(mol)                  # Add hydrogens (important for 3D)
AllChem.EmbedMolecule(mol)             # Generate 3D coordinates
AllChem.UFFOptimizeMolecule(mol)       # Minimize energy with UFF force field

# Visualize 2D structure (you can't see the 3D here, but it's encoded)
Draw.MolToImage(mol)
../../_images/aab5380ae952f1563340c807f31bfab610568d652d9baf1c250d3211a1913436.png

Generating RDkit Molecules from SMILES

caffeine = Chem.MolFromSmiles('Cn1cnc2c1c(=O)n(C)c(=O)n2C')
caffeine
../../_images/366adff2f19ec1b4c1f22ed895e71277f2029880ffc1551ebabc869aab69ff96.png
caffeine
type(caffeine)
rdkit.Chem.rdchem.Mol
from rdkit import Chem
from rdkit.Chem import Draw
from IPython.display import display

# Display the molecule
display(Draw.MolToImage(caffeine))

# Print the type below it
print(type(caffeine))
../../_images/cf354682325750119fd4ae9b1d33d1f111f4a63314429611817879f46fb419d8.png
<class 'rdkit.Chem.rdchem.Mol'>
Chem.Draw.MolToFile(caffeine, 'caffeine.svg',
                    size=(500,500),
                    imageType='svg')

To install jcamp you need to open your terminal

conda activate your_env_name
pip install jcamp

The IR spectra of caffeine can be downloaded from NIST WebBook as a jcmap-dx file,

from rdkit import Chem
from rdkit.Chem import Draw
import matplotlib.pyplot as plt
from jcamp import jcamp_readfile
from mpl_toolkits.axes_grid1.inset_locator import inset_axes
import numpy as np

# Create molecule object
caffeine = Chem.MolFromSmiles('Cn1cnc2c1c(=O)n(c(=O)n2C)C')
mol_image = Draw.MolToImage(caffeine, size=(200, 200))

# Read JCAMP-DX spectrum
spectrum = jcamp_readfile('data/caffeine_IR.jdx')
x = np.array(spectrum['x'])
y = np.array(spectrum['y'])

# Ensure x and y are the same length
min_len = min(len(x), len(y))
x = x[:min_len]
y = y[:min_len]

# Filter to 400–4000 cm⁻¹ range
mask = (x >= 400) & (x <= 4000)
x_filtered = x[mask]
y_filtered = y[mask]

# Plot with inset image
fig, ax = plt.subplots(figsize=(10, 6), constrained_layout=True)

# Plot your main data
ax.plot(x_filtered, y_filtered, color='darkred')
ax.set_xlabel('Wavenumber (cm⁻¹)')
ax.set_ylabel('Absorbance')
ax.set_title('IR Spectrum of Caffeine')
ax.invert_xaxis()
ax.grid(True)

# Add inset axes
ax_inset = inset_axes(ax, width="30%", height="30%", loc='upper right')
ax_inset.imshow(mol_image)
ax_inset.axis('off')

plt.show()
../../_images/d92d002bfbb2efeff4e7d2e61f53dc0e7fc00d7861145cb556ae3ff31135eb23.png

3.2 Draw (rdkit.Chem.Draw)#

The following material is adapted from Charles Weiss First we will read a text file with amino acids as smiles strings into a Pandas DataFame

import pandas as pd
df = pd.read_csv('data/amino_acid_SMILES.txt', skiprows=2)
df.head()
name SMILES
0 alanine C[C@@H](C(=O)[O-])[NH3+]
1 arginine [NH3+][C@@H](CCCNC(=[NH2+])N)C(=O)[O-]
2 asparagine O=C(N)C[C@H]([NH3+])C(=O)[O-]
3 aspartate C([C@@H](C(=O)[O-])[NH3+])C(=O)[O-]
4 cysteine C([C@@H](C(=O)[O-])[NH3+])S

Now we will use list-comprehension to create a list of rdkit molecule objects from the SMILES Strings in the dataframe

AminoAcids = [Chem.MolFromSmiles(SMILES) for SMILES in df['SMILES']]
print(AminoAcids)
print(type(AminoAcids))
[<rdkit.Chem.rdchem.Mol object at 0x7c31c9f2adc0>, <rdkit.Chem.rdchem.Mol object at 0x7c31c9f2ae30>, <rdkit.Chem.rdchem.Mol object at 0x7c31c9f2aea0>, <rdkit.Chem.rdchem.Mol object at 0x7c31c9f2af10>, <rdkit.Chem.rdchem.Mol object at 0x7c31c9f2af80>, <rdkit.Chem.rdchem.Mol object at 0x7c31c9f2aff0>, <rdkit.Chem.rdchem.Mol object at 0x7c31c9f2b060>, <rdkit.Chem.rdchem.Mol object at 0x7c31c9f2b0d0>, <rdkit.Chem.rdchem.Mol object at 0x7c31c9f2b140>, <rdkit.Chem.rdchem.Mol object at 0x7c31c9f2b1b0>, <rdkit.Chem.rdchem.Mol object at 0x7c31c9f2b220>, <rdkit.Chem.rdchem.Mol object at 0x7c31c9f2b290>, <rdkit.Chem.rdchem.Mol object at 0x7c31c9f2b300>, <rdkit.Chem.rdchem.Mol object at 0x7c31c9f2b370>, <rdkit.Chem.rdchem.Mol object at 0x7c31c9f2b3e0>, <rdkit.Chem.rdchem.Mol object at 0x7c31c9f2b450>, <rdkit.Chem.rdchem.Mol object at 0x7c31c9f2b4c0>, <rdkit.Chem.rdchem.Mol object at 0x7c31c9f2b530>, <rdkit.Chem.rdchem.Mol object at 0x7c31c9f2b5a0>, <rdkit.Chem.rdchem.Mol object at 0x7c31c9f2b610>]
<class 'list'>

MolsToGridImage#

This has one positional argument which is a container data type containing the molecular objects you want to plot. It is followed by optional keyword arguments. Note for the legends we used the “name” column of the dataframe, and we used SVG images.

Chem.Draw.MolsToGridImage(AminoAcids, 
                          molsPerRow=4,
                          subImgSize=(200,200),
                          legends=list(df['name']),
                          useSVG=True) 
../../_images/7db0dc9fa290ab18567b4441860ce1d94dc88f0cdc7a1f3db8dbb0e548c0d804.svg

4. Stereochemistry#

Sterochemistry in SMILES#

  • @ or @@ to represent tetrahedral chirality

  • / and \ for cis/trans (E/Z) stereochemistry around double bonds

lactic1 = Chem.MolFromSmiles("C[C@H](O)C(=O)O")
lactic2 = Chem.MolFromSmiles("C[C@@H](O)C(=O)O")
#Draw.MolToImage(lactic)

cis_butene = Chem.MolFromSmiles("C/C=C/C")  # cis
trans_butene = Chem.MolFromSmiles("C/C=C\\C")  # trans

Draw.MolsToGridImage([lactic1, lactic2, cis_butene, trans_butene], legends=["chiral", "chiral", "cis", "trans"])
../../_images/5aa6661b20f90f4fc223a0f6636e3e2951a09ff0621e5f302cf6f00d57ab8db4.png
  1. Create molecule objects from SMILES strings.

from rdkit import Chem
from rdkit.Chem import Draw
import requests

# Define SMILES strings for the molecules
smiles_list = [
    "C[C@H](O)C(=O)O",    # Lactic acid (R)
    "C[C@@H](O)C(=O)O",   # Lactic acid (S)
    "C/C=C/C",            # cis-2-butene
    "C/C=C\\C"            # trans-2-butene
]

# Convert SMILES strings to RDKit molecule objects
molecules = [Chem.MolFromSmiles(smiles) for smiles in smiles_list]
molecules
[<rdkit.Chem.rdchem.Mol at 0x7c31c99f4660>,
 <rdkit.Chem.rdchem.Mol at 0x7c31c99f4820>,
 <rdkit.Chem.rdchem.Mol at 0x7c31c99f6490>,
 <rdkit.Chem.rdchem.Mol at 0x7c31c99f6500>]
  1. Retrieve IUPAC Names Using NCI Chemical Resolver

def get_iupac_name(smiles):
    try:
        url = f"https://cactus.nci.nih.gov/chemical/structure/{smiles}/iupac_name"
        response = requests.get(url)
        response.raise_for_status()
        return response.text
    except requests.RequestException:
        return "IUPAC name not found"

iupac_names = [get_iupac_name(smiles) for smiles in smiles_list]
print(iupac_names)
['(2S)-2-hydroxypropanoic acid', '(2R)-2-hydroxypropanoic acid', '(E)-But-2-ene', '(Z)-But-2-ene']
  1. Generate Iosmeric Smiles Strings

isomeric_smiles = [Chem.MolToSmiles(mol, isomericSmiles=True) for mol in molecules]
  1. Create Legend Combining IUPAC Names and SMILES

legends = [f"{name}\n{smiles}" for name, smiles in zip(iupac_names, isomeric_smiles)]
  1. Organize Molecules and Legends into a 2x2 Grid

# Arrange molecules and legends into a 2x2 grid
mols_matrix = [molecules[:2], molecules[2:]]
legends_matrix = [legends[:2], legends[2:]]

Generate and Display the Grid

from IPython.display import display

from rdkit.Chem.Draw import MolsMatrixToGridImage

# Generate the grid image
img = MolsMatrixToGridImage(
    molsMatrix=mols_matrix,
    legendsMatrix=legends_matrix,
    subImgSize=(300, 300)
)

# Display the image (in Jupyter Notebook)
#img
display(img)

# To save as image, uncomment the next line
#  img.save("molecule_grid.png")
../../_images/b8372bfbfaa358067ed38f48723d8cf276e66e4874a0aef579ed9d2f24dfc893.png

Steroisomers of Glucose#

Glucose has 4 chiral centers and so \(2^4\) = 16 sterioisomers

image.png

Generate a Grid of the stereoisomers with both their IUPAC name and isomeric SMILES Strings

  1. Generate a tuple containing the 16 rdkit molecule objects representing each stereoisomer

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.EnumerateStereoisomers import EnumerateStereoisomers, StereoEnumerationOptions

# Base SMILES for D-glucose without specified stereochemistry
base_smiles = "OC[C@H]1O[C@H](O)[C@H](O)[C@H](O)[C@H]1O"
base_mol = Chem.MolFromSmiles(base_smiles)

# Set up options to enumerate all stereoisomers
opts = StereoEnumerationOptions(onlyUnassigned=False, unique=True)
isomers = tuple(EnumerateStereoisomers(base_mol, options=opts))
import requests
from rdkit import Chem

def get_iupac_name(smiles):
    try:
        url = f"https://cactus.nci.nih.gov/chemical/structure/{smiles}/iupac_name"
        response = requests.get(url, timeout=5)
        response.raise_for_status()
        return response.text.strip()
    except requests.RequestException:
        return "IUPAC name not found"

# Prepare data for each isomer
isomer_data = []
for mol in isomers:
    smiles = Chem.MolToSmiles(mol, isomericSmiles=True)
    name = get_iupac_name(smiles)
    isomer_data.append((mol, name, smiles))
print(isomer_data)
print(type(isomer_data))
print(f"There are {len(isomer_data)} tuples in this list, meaning there are {len(isomer_data)} isomers." )
[(<rdkit.Chem.rdchem.Mol object at 0x7c31c9859fd0>, '(2S,3R,4R,5S,6S)-6-(hydroxymethyl)oxane-2,3,4,5-tetrol', 'OC[C@@H]1O[C@H](O)[C@H](O)[C@H](O)[C@@H]1O'), (<rdkit.Chem.rdchem.Mol object at 0x7c31c985a070>, '(2S,3R,4R,5S,6R)-6-(hydroxymethyl)oxane-2,3,4,5-tetrol', 'OC[C@H]1O[C@H](O)[C@H](O)[C@H](O)[C@@H]1O'), (<rdkit.Chem.rdchem.Mol object at 0x7c31c985a110>, '(2R,3R,4R,5S,6S)-6-(hydroxymethyl)oxane-2,3,4,5-tetrol', 'OC[C@@H]1O[C@@H](O)[C@H](O)[C@H](O)[C@@H]1O'), (<rdkit.Chem.rdchem.Mol object at 0x7c31c985a1b0>, '(2R,3R,4R,5S,6R)-6-(hydroxymethyl)oxane-2,3,4,5-tetrol', 'OC[C@H]1O[C@@H](O)[C@H](O)[C@H](O)[C@@H]1O'), (<rdkit.Chem.rdchem.Mol object at 0x7c31c985a250>, 'IUPAC name not found', 'OC[C@@H]1O[C@H](O)[C@@H](O)[C@H](O)[C@@H]1O'), (<rdkit.Chem.rdchem.Mol object at 0x7c31c985a2f0>, '(2S,3S,4R,5S,6R)-6-(hydroxymethyl)oxane-2,3,4,5-tetrol', 'OC[C@H]1O[C@H](O)[C@@H](O)[C@H](O)[C@@H]1O'), (<rdkit.Chem.rdchem.Mol object at 0x7c31c985a390>, '(2R,3S,4R,5S,6S)-6-(hydroxymethyl)oxane-2,3,4,5-tetrol', 'OC[C@@H]1O[C@@H](O)[C@@H](O)[C@H](O)[C@@H]1O'), (<rdkit.Chem.rdchem.Mol object at 0x7c31c985a430>, '(2R,3S,4R,5S,6R)-6-(hydroxymethyl)oxane-2,3,4,5-tetrol', 'OC[C@H]1O[C@@H](O)[C@@H](O)[C@H](O)[C@@H]1O'), (<rdkit.Chem.rdchem.Mol object at 0x7c31c985a4d0>, '(2S,3R,4S,5S,6S)-6-(hydroxymethyl)oxane-2,3,4,5-tetrol', 'OC[C@@H]1O[C@H](O)[C@H](O)[C@@H](O)[C@@H]1O'), (<rdkit.Chem.rdchem.Mol object at 0x7c31c985a570>, 'IUPAC name not found', 'OC[C@H]1O[C@H](O)[C@H](O)[C@@H](O)[C@@H]1O'), (<rdkit.Chem.rdchem.Mol object at 0x7c31c985a610>, '(2R,3R,4S,5S,6S)-6-(hydroxymethyl)oxane-2,3,4,5-tetrol', 'OC[C@@H]1O[C@@H](O)[C@H](O)[C@@H](O)[C@@H]1O'), (<rdkit.Chem.rdchem.Mol object at 0x7c31c985a6b0>, '(2R,3R,4S,5S,6R)-6-(hydroxymethyl)oxane-2,3,4,5-tetrol', 'OC[C@H]1O[C@@H](O)[C@H](O)[C@@H](O)[C@@H]1O'), (<rdkit.Chem.rdchem.Mol object at 0x7c31c985a750>, 'IUPAC name not found', 'OC[C@@H]1O[C@H](O)[C@@H](O)[C@@H](O)[C@@H]1O'), (<rdkit.Chem.rdchem.Mol object at 0x7c31c985a7f0>, '(2S,3S,4S,5S,6R)-6-(hydroxymethyl)oxane-2,3,4,5-tetrol', 'OC[C@H]1O[C@H](O)[C@@H](O)[C@@H](O)[C@@H]1O'), (<rdkit.Chem.rdchem.Mol object at 0x7c31c985a890>, '(2R,3S,4S,5S,6S)-6-(hydroxymethyl)oxane-2,3,4,5-tetrol', 'OC[C@@H]1O[C@@H](O)[C@@H](O)[C@@H](O)[C@@H]1O'), (<rdkit.Chem.rdchem.Mol object at 0x7c31c985a930>, '(2R,3S,4S,5S,6R)-6-(hydroxymethyl)oxane-2,3,4,5-tetrol', 'OC[C@H]1O[C@@H](O)[C@@H](O)[C@@H](O)[C@@H]1O'), (<rdkit.Chem.rdchem.Mol object at 0x7c31c985a9d0>, '(2S,3R,4R,5R,6S)-6-(hydroxymethyl)oxane-2,3,4,5-tetrol', 'OC[C@@H]1O[C@H](O)[C@H](O)[C@H](O)[C@H]1O'), (<rdkit.Chem.rdchem.Mol object at 0x7c31c985aa70>, '(2S,3R,4R,5R,6R)-6-(hydroxymethyl)oxane-2,3,4,5-tetrol', 'OC[C@H]1O[C@H](O)[C@H](O)[C@H](O)[C@H]1O'), (<rdkit.Chem.rdchem.Mol object at 0x7c31c985ab10>, 'IUPAC name not found', 'OC[C@@H]1O[C@@H](O)[C@H](O)[C@H](O)[C@H]1O'), (<rdkit.Chem.rdchem.Mol object at 0x7c31c985abb0>, '(2R,3R,4R,5R,6R)-6-(hydroxymethyl)oxane-2,3,4,5-tetrol', 'OC[C@H]1O[C@@H](O)[C@H](O)[C@H](O)[C@H]1O'), (<rdkit.Chem.rdchem.Mol object at 0x7c31c9c067f0>, '(2S,3S,4R,5R,6S)-6-(hydroxymethyl)oxane-2,3,4,5-tetrol', 'OC[C@@H]1O[C@H](O)[C@@H](O)[C@H](O)[C@H]1O'), (<rdkit.Chem.rdchem.Mol object at 0x7c31c985ac50>, '(2S,3S,4R,5R,6R)-6-(hydroxymethyl)oxane-2,3,4,5-tetrol', 'OC[C@H]1O[C@H](O)[C@@H](O)[C@H](O)[C@H]1O'), (<rdkit.Chem.rdchem.Mol object at 0x7c31c985ad40>, 'IUPAC name not found', 'OC[C@@H]1O[C@@H](O)[C@@H](O)[C@H](O)[C@H]1O'), (<rdkit.Chem.rdchem.Mol object at 0x7c31c985ade0>, '(2R,3S,4R,5R,6R)-6-(hydroxymethyl)oxane-2,3,4,5-tetrol', 'OC[C@H]1O[C@@H](O)[C@@H](O)[C@H](O)[C@H]1O'), (<rdkit.Chem.rdchem.Mol object at 0x7c31c985ae80>, 'IUPAC name not found', 'OC[C@@H]1O[C@H](O)[C@H](O)[C@@H](O)[C@H]1O'), (<rdkit.Chem.rdchem.Mol object at 0x7c31c985af20>, '(2S,3R,4S,5R,6R)-6-(hydroxymethyl)oxane-2,3,4,5-tetrol', 'OC[C@H]1O[C@H](O)[C@H](O)[C@@H](O)[C@H]1O'), (<rdkit.Chem.rdchem.Mol object at 0x7c31c985afc0>, '(2R,3R,4S,5R,6S)-6-(hydroxymethyl)oxane-2,3,4,5-tetrol', 'OC[C@@H]1O[C@@H](O)[C@H](O)[C@@H](O)[C@H]1O'), (<rdkit.Chem.rdchem.Mol object at 0x7c31c985b060>, '(2R,3R,4S,5R,6R)-6-(hydroxymethyl)oxane-2,3,4,5-tetrol', 'OC[C@H]1O[C@@H](O)[C@H](O)[C@@H](O)[C@H]1O'), (<rdkit.Chem.rdchem.Mol object at 0x7c31d18b31a0>, '(2S,3S,4S,5R,6S)-6-(hydroxymethyl)oxane-2,3,4,5-tetrol', 'OC[C@@H]1O[C@H](O)[C@@H](O)[C@@H](O)[C@H]1O'), (<rdkit.Chem.rdchem.Mol object at 0x7c31c985b100>, '(2S,3S,4S,5R,6R)-6-(hydroxymethyl)oxane-2,3,4,5-tetrol', 'OC[C@H]1O[C@H](O)[C@@H](O)[C@@H](O)[C@H]1O'), (<rdkit.Chem.rdchem.Mol object at 0x7c31c985b1f0>, '(2R,3S,4S,5R,6S)-6-(hydroxymethyl)oxane-2,3,4,5-tetrol', 'OC[C@@H]1O[C@@H](O)[C@@H](O)[C@@H](O)[C@H]1O'), (<rdkit.Chem.rdchem.Mol object at 0x7c31c985b290>, '(2R,3S,4S,5R,6R)-6-(hydroxymethyl)oxane-2,3,4,5-tetrol', 'OC[C@H]1O[C@@H](O)[C@@H](O)[C@@H](O)[C@H]1O')]
<class 'list'>
There are 32 tuples in this list, meaning there are 32 isomers.

So where is the 5th stereocenter?

from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import rdMolDraw2D

# Define the SMILES string for glucose
smiles = 'OC[C@H]1O[C@H](O)[C@H](O)[C@H](O)[C@H]1O'
mol = Chem.MolFromSmiles(smiles)

# Assign stereochemistry to the molecule
Chem.AssignStereochemistry(mol, force=True, cleanIt=True)

# Find chiral centers
chiral_centers = Chem.FindMolChiralCenters(mol, includeUnassigned=True)
print(f"Chiral centers: {chiral_centers}")

# Prepare the molecule for drawing
drawer = rdMolDraw2D.MolDraw2DCairo(500, 500)
drawer.drawOptions().addAtomIndices = True
drawer.drawOptions().addStereoAnnotation = True
rdMolDraw2D.PrepareAndDrawMolecule(drawer, mol)
drawer.FinishDrawing()

# Save the image
with open('glucose_chiral_centers.png', 'wb') as f:
    f.write(drawer.GetDrawingText())
Chiral centers: [(2, 'R'), (4, 'S'), (6, 'R'), (8, 'R'), (10, 'R')]
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole

# Enable SVG rendering in Jupyter
IPythonConsole.ipython_useSVG = True
IPythonConsole.drawOptions.addAtomIndices = True
IPythonConsole.drawOptions.addStereoAnnotation = True

# Define the SMILES string for D-glucose
smiles = 'OC[C@H]1O[C@H](O)[C@H](O)[C@H](O)[C@H]1O'
mol = Chem.MolFromSmiles(smiles)

# Display the molecule
mol
from rdkit.Chem import Draw

# Extract molecules and legends
mols = [data[0] for data in isomer_data]
legends = [f"{data[1]}\n{data[2]}" for data in isomer_data]

# Generate and display the grid image
img = Draw.MolsToGridImage(mols, legends=legends, molsPerRow=4, subImgSize=(200, 200))
img
../../_images/c2d1e8a2dfb0cea544552428a00b51376e1e54cb90d4365fefa7db994aec22c6.svg
from rdkit.Chem import Draw

# Extract molecules and legends
mols = [data[0] for data in isomer_data]
legends = [f"{data[1]}\n{data[2]}" for data in isomer_data]

# Generate and display the grid image
img = Draw.MolsToGridImage(mols, legends=legends, molsPerRow=4, subImgSize=(200, 200))
img
../../_images/c2d1e8a2dfb0cea544552428a00b51376e1e54cb90d4365fefa7db994aec22c6.svg

Open Tautomer Form#

from rdkit import Chem
from rdkit.Chem import Draw

# Explicitly defined stereochemistry
smiles_open_chain = "OC[C@H](O)[C@H](O)[C@H](O)[C@H](O)C=O"
smiles_alpha = "OC[C@H]1O[C@@H](O)[C@H](O)[C@H](O)[C@H]1O"  # α-D-glucopyranose
smiles_beta = "OC[C@H]1O[C@H](O)[C@H](O)[C@H](O)[C@H]1O"   # β-D-glucopyranose

mols = [Chem.MolFromSmiles(smi) for smi in [smiles_open_chain, smiles_alpha, smiles_beta]]

# Optional: add hydrogens for better 2D layout
mols = [Chem.AddHs(mol) for mol in mols]

# Generate image
Draw.MolsToImage(mols, legends=[
    "Open-Chain D-Glucose",
    "alpha-D-Glucopyranose (axial OH at C1)",
    "beta-D-Glucopyranose (equatorial OH at C1)"
], subImgSize=(300, 300))
../../_images/1c1e3caa271e67bf95cb0a0ab9e2a88050d8b39acbaf14530a698cf0815320ae.png
open_chain_smiles = 'OC[C@H](O)[C@H](O)[C@H](O)[C@H](O)C=O'
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole

# Enable SVG rendering in Jupyter
IPythonConsole.ipython_useSVG = True

# Create the molecule from the SMILES string
mol = Chem.MolFromSmiles(open_chain_smiles)

# Display the molecule
mol
from rdkit import Chem
from rdkit.Chem.EnumerateStereoisomers import EnumerateStereoisomers, StereoEnumerationOptions

# Define the open-chain SMILES for D-glucose
open_chain_smiles = 'OC[C@H](O)[C@H](O)[C@H](O)[C@H](O)C=O'
mol = Chem.MolFromSmiles(open_chain_smiles)

# Set up stereoisomer enumeration options
opts = StereoEnumerationOptions(onlyUnassigned=False, unique=True)

# Generate stereoisomers
stereoisomers = tuple(EnumerateStereoisomers(mol, options=opts))

print(f"Number of open-chain stereoisomers generated: {len(stereoisomers)}")
Number of open-chain stereoisomers generated: 16
from rdkit import Chem
from rdkit.Chem.EnumerateStereoisomers import EnumerateStereoisomers, StereoEnumerationOptions

# Define the open-chain SMILES for D-glucose
open_chain_smiles = 'OC[C@H](O)[C@H](O)[C@H](O)[C@H](O)C=O'
mol = Chem.MolFromSmiles(open_chain_smiles)

# Set up stereoisomer enumeration options
opts = StereoEnumerationOptions(onlyUnassigned=False, unique=True)

# Generate stereoisomers
stereoisomers = tuple(EnumerateStereoisomers(mol, options=opts))

print(f"Number of open-chain stereoisomers generated: {len(stereoisomers)}")
Number of open-chain stereoisomers generated: 16
from rdkit.Chem import Draw

# Generate images with SMILES as legends
images = [Chem.MolToSmiles(mol, isomericSmiles=True) for mol in stereoisomers]

# Create a grid image
img = Draw.MolsToGridImage(stereoisomers, molsPerRow=4, subImgSize=(200, 200), legends=images)

# Display the image
img
../../_images/7bca8f081b51d44bd92f4b9b4df060654eabd0e22883ceb08b3f2d54d41f8b1b.svg

5. Descriptors (rdkit.Chem.Descriptors)#

  • Physicochemical Properties: Basic molecular properties like size, lipophilicity, and polarity (e.g. molecular weight, LogP, polar surface area). These influence solubility, permeability, and drug-likeness.

  • Topological Indices: Graph-theoretic indices that quantify molecular shape and connectivity (e.g. chi connectivity indices, kappa shape indices, Balaban’s J index). They measure structural complexity and are useful in QSAR modeling.

  • Structural Features: Simple counts of structural moieties (e.g. ring counts, rotatable bond count, functional group counts). These help profile molecular flexibility, aromaticity, and the presence of particular substructures (used in filters like Lipinski’s Rule of 5).

  • Electronic Properties: Descriptors capturing electronic distribution (e.g. partial charge extremes, EState indices). They reflect how electron-rich or electron-poor parts of a molecule are, which can affect reactivity and interactions.

  • Fragment Counts: Indicators for specific functional groups or substructures (usually prefixed with fr_ in RDKit). These count occurrences of motifs like benzene rings, halogens, nitro groups, etc., often used as structural alerts or to calculate fragment-based scores.

  • Drug-Likeness Scores: Composite indices that combine multiple descriptors into a single score evaluating how “drug-like” a molecule is (e.g. QED score). These aid in prioritizing compounds likely to have favorable pharmacokinetic and safety profiles.

Physicochemical Properties#

Descriptor Name

Description

MolWt

Molecular weight (average mass of the molecule)

ExactMolWt

Exact molecular weight (monoisotopic mass of the molecule)

MolLogP

Wildman–Crippen estimated LogP (octanol–water partition coefficient)

MolMR

Wildman–Crippen molar refractivity (volume-dependent refractivity)

TPSA

Topological polar surface area (sum of fragment-based polar surface areas)

LabuteASA

Labute’s approximate solvent-accessible surface area (MOE ASA)

Topological Indices#

Descriptor Name

Description

Chi0, Chi1 (Chi Indices)

Molecular connectivity indices of order 0 and 1 (higher orders Chi2–Chi4 also available, including valence (v) and normalized (n) forms) (Getting Started with the RDKit in Python — The RDKit 2025.03.1 documentation). These indices are derived from molecular graph connectivity and help describe molecular topology.

Kappa1, Kappa2, Kappa3 (Kappa Indices)

Kier & Hall molecular shape indices of first, second, and third order (Getting Started with the RDKit in Python — The RDKit 2025.03.1 documentation). They quantify shape by comparing the molecule’s graph to idealized linear or ring structures (i.e. capturing branching and complexity).

BalabanJ

Balaban’s J index (distance connectivity index) – a topological index accounting for distances in the molecular graph (Molecular descriptors Datagrok). It is used as an overall metric of molecular connectivity and complexity.

BertzCT

Bertz’s complexity index – a topological index quantifying molecular complexity (rdkit.Chem.GraphDescriptors module — The RDKit 2025.03.1 documentation) (sums contributions from bonding patterns and heteroatom distribution (rdkit.Chem.GraphDescriptors module — The RDKit 2025.03.1 documentation)).

Ipc

Information content index – Shannon information content of the adjacency matrix’s characteristic polynomial coefficients (measures structural information content) (rdkit.Chem.GraphDescriptors module — The RDKit 2025.03.1 documentation). RDKit provides both total (Ipc) and average (AvgIpc) variants of this descriptor.

WienerIndex

Wiener index – the sum of all shortest-path distances between pairs of atoms in the molecule (a classic topological index indicating overall connectivity).

Structural Features#

Descriptor Name

Description

HeavyAtomCount

Number of heavy atoms (non-hydrogen atoms) in the molecule ([Molecular descriptors

NumHeteroatoms

Number of heteroatoms (atoms that are not carbon or hydrogen) (rdkit.Chem.Lipinski module — The RDKit 2025.03.1 documentation).

RingCount

Total number of rings in the molecule (SSSR count). Note: RDKit also provides specific ring counts like NumAromaticRings, NumSaturatedRings, NumHeterocycles, etc. for more detailed ring analysis.

NumRotatableBonds

Number of rotatable bonds (single bonds that can rotate freely, excluding trivial rotations) (rdkit.Chem.Lipinski module — The RDKit 2025.03.1 documentation).

FractionCSP3

Fraction of carbon atoms that are sp³-hybridized (a ratio indicating saturated carbon content) .

NumHBA / NumHBD (NumHAcceptors / NumHDonors)

Number of hydrogen bond acceptors and donors, respectively (rdkit.Chem.Lipinski module — The RDKit 2025.03.1 documentation). (Commonly used in Lipinski’s Rule of 5 for drug-likeness.)

Electronic Properties#

Descriptor Name

Description

MaxPartialCharge / MinPartialCharge

Maximum and minimum partial atomic charge on any atom in the molecule (most positive and most negative partial charges). These are computed (e.g. via Gasteiger charges) to gauge charge distribution.

MaxAbsPartialCharge / MinAbsPartialCharge

Maximum and minimum absolute partial charge on any atom – i.e. the largest magnitude partial charges (positive or negative).

MaxEStateIndex / MinEStateIndex

Maximum and minimum EState indices in the molecule – the highest and lowest electrotopological state values of any atom . EState indices integrate atom electronegativity with topological environment, indicating reactive centers.

MaxAbsEStateIndex / MinAbsEStateIndex

Maximum and minimum absolute EState index values in the molecule . These represent the extremal magnitudes of electrotopological state, regardless of sign.

Fragment Counts (Functional Group Indicators)#

Descriptor Name

Description (counts of specific substructures in the molecule)

fr_benzene

Number of benzene rings (aromatic C6 rings) (

fr_imidazole

Number of imidazole rings (five-membered aromatic rings with two nitrogens) (

fr_halogen

Number of halogen atoms present (F, Cl, Br, or I) .

fr_nitro

Number of nitro groups (–NO₂) (

fr_ether

Number of ether groups (oxygen atoms connecting two carbons, incl. phenoxy

fr_amide

Number of amide functional groups (–C(=O)NH–)

Drug-Likeness Scores#

Descriptor Name

Description

qed

QED (Quantitative Estimate of Drug-likeness) – a unitless score from 0 to 1 assessing drug-likeness ([Molecular descriptors

Summary: The above descriptors provide a quantitative profile of molecules. Physicochemical and electronic descriptors relate to ADME properties and reactivity, topological and structural descriptors capture 2D structure and complexity, fragment counts flag particular functional groups, and drug-likeness indices combine these factors to guide drug discovery decisions. Chemoinformaticians often use these in QSAR models, filtering libraries (e.g. Lipinski’s Rule of 5 uses MolWt, LogP, HBD, HBA, etc.), and comparing molecules during lead optimization (Decoding Molecular Weight: A Dive into Molecular Descriptors | by Ayush Gharat | Medium). By organizing RDKit’s descriptor functions into categories, researchers can more easily select relevant descriptors for their specific needs in cheminformatics and drug design.

from rdkit.Chem import Descriptors

# Access the list of descriptor name-function pairs
descriptor_list = Descriptors._descList

# Display the total number of descriptors
print(f"Total descriptors available: {len(descriptor_list)}\n")

# Print the first 10 descriptors as an example
for name, func in descriptor_list[:10]:
    print(f"{name}: {func}")
Total descriptors available: 217

MaxAbsEStateIndex: <function MaxAbsEStateIndex at 0x7c31c941e560>
MaxEStateIndex: <function MaxEStateIndex at 0x7c31c941e3b0>
MinAbsEStateIndex: <function MinAbsEStateIndex at 0x7c31c941e5f0>
MinEStateIndex: <function MinEStateIndex at 0x7c31c941e4d0>
qed: <function qed at 0x7c31c96093f0>
SPS: <function SPS at 0x7c31c9609870>
MolWt: <function <lambda> at 0x7c31c9609ea0>
HeavyAtomMolWt: <function HeavyAtomMolWt at 0x7c31c9609f30>
ExactMolWt: <function <lambda> at 0x7c31c9609fc0>
NumValenceElectrons: <function NumValenceElectrons at 0x7c31c960a050>
# List all descriptor names
descriptor_names = [name for name, _ in Descriptors._descList]
print(len(descriptor_names))
print(descriptor_names)
217
['MaxAbsEStateIndex', 'MaxEStateIndex', 'MinAbsEStateIndex', 'MinEStateIndex', 'qed', 'SPS', 'MolWt', 'HeavyAtomMolWt', 'ExactMolWt', 'NumValenceElectrons', 'NumRadicalElectrons', 'MaxPartialCharge', 'MinPartialCharge', 'MaxAbsPartialCharge', 'MinAbsPartialCharge', 'FpDensityMorgan1', 'FpDensityMorgan2', 'FpDensityMorgan3', 'BCUT2D_MWHI', 'BCUT2D_MWLOW', 'BCUT2D_CHGHI', 'BCUT2D_CHGLO', 'BCUT2D_LOGPHI', 'BCUT2D_LOGPLOW', 'BCUT2D_MRHI', 'BCUT2D_MRLOW', 'AvgIpc', 'BalabanJ', 'BertzCT', 'Chi0', 'Chi0n', 'Chi0v', 'Chi1', 'Chi1n', 'Chi1v', 'Chi2n', 'Chi2v', 'Chi3n', 'Chi3v', 'Chi4n', 'Chi4v', 'HallKierAlpha', 'Ipc', 'Kappa1', 'Kappa2', 'Kappa3', 'LabuteASA', 'PEOE_VSA1', 'PEOE_VSA10', 'PEOE_VSA11', 'PEOE_VSA12', 'PEOE_VSA13', 'PEOE_VSA14', 'PEOE_VSA2', 'PEOE_VSA3', 'PEOE_VSA4', 'PEOE_VSA5', 'PEOE_VSA6', 'PEOE_VSA7', 'PEOE_VSA8', 'PEOE_VSA9', 'SMR_VSA1', 'SMR_VSA10', 'SMR_VSA2', 'SMR_VSA3', 'SMR_VSA4', 'SMR_VSA5', 'SMR_VSA6', 'SMR_VSA7', 'SMR_VSA8', 'SMR_VSA9', 'SlogP_VSA1', 'SlogP_VSA10', 'SlogP_VSA11', 'SlogP_VSA12', 'SlogP_VSA2', 'SlogP_VSA3', 'SlogP_VSA4', 'SlogP_VSA5', 'SlogP_VSA6', 'SlogP_VSA7', 'SlogP_VSA8', 'SlogP_VSA9', 'TPSA', 'EState_VSA1', 'EState_VSA10', 'EState_VSA11', 'EState_VSA2', 'EState_VSA3', 'EState_VSA4', 'EState_VSA5', 'EState_VSA6', 'EState_VSA7', 'EState_VSA8', 'EState_VSA9', 'VSA_EState1', 'VSA_EState10', 'VSA_EState2', 'VSA_EState3', 'VSA_EState4', 'VSA_EState5', 'VSA_EState6', 'VSA_EState7', 'VSA_EState8', 'VSA_EState9', 'FractionCSP3', 'HeavyAtomCount', 'NHOHCount', 'NOCount', 'NumAliphaticCarbocycles', 'NumAliphaticHeterocycles', 'NumAliphaticRings', 'NumAmideBonds', 'NumAromaticCarbocycles', 'NumAromaticHeterocycles', 'NumAromaticRings', 'NumAtomStereoCenters', 'NumBridgeheadAtoms', 'NumHAcceptors', 'NumHDonors', 'NumHeteroatoms', 'NumHeterocycles', 'NumRotatableBonds', 'NumSaturatedCarbocycles', 'NumSaturatedHeterocycles', 'NumSaturatedRings', 'NumSpiroAtoms', 'NumUnspecifiedAtomStereoCenters', 'Phi', 'RingCount', 'MolLogP', 'MolMR', 'fr_Al_COO', 'fr_Al_OH', 'fr_Al_OH_noTert', 'fr_ArN', 'fr_Ar_COO', 'fr_Ar_N', 'fr_Ar_NH', 'fr_Ar_OH', 'fr_COO', 'fr_COO2', 'fr_C_O', 'fr_C_O_noCOO', 'fr_C_S', 'fr_HOCCN', 'fr_Imine', 'fr_NH0', 'fr_NH1', 'fr_NH2', 'fr_N_O', 'fr_Ndealkylation1', 'fr_Ndealkylation2', 'fr_Nhpyrrole', 'fr_SH', 'fr_aldehyde', 'fr_alkyl_carbamate', 'fr_alkyl_halide', 'fr_allylic_oxid', 'fr_amide', 'fr_amidine', 'fr_aniline', 'fr_aryl_methyl', 'fr_azide', 'fr_azo', 'fr_barbitur', 'fr_benzene', 'fr_benzodiazepine', 'fr_bicyclic', 'fr_diazo', 'fr_dihydropyridine', 'fr_epoxide', 'fr_ester', 'fr_ether', 'fr_furan', 'fr_guanido', 'fr_halogen', 'fr_hdrzine', 'fr_hdrzone', 'fr_imidazole', 'fr_imide', 'fr_isocyan', 'fr_isothiocyan', 'fr_ketone', 'fr_ketone_Topliss', 'fr_lactam', 'fr_lactone', 'fr_methoxy', 'fr_morpholine', 'fr_nitrile', 'fr_nitro', 'fr_nitro_arom', 'fr_nitro_arom_nonortho', 'fr_nitroso', 'fr_oxazole', 'fr_oxime', 'fr_para_hydroxylation', 'fr_phenol', 'fr_phenol_noOrthoHbond', 'fr_phos_acid', 'fr_phos_ester', 'fr_piperdine', 'fr_piperzine', 'fr_priamide', 'fr_prisulfonamd', 'fr_pyridine', 'fr_quatN', 'fr_sulfide', 'fr_sulfonamd', 'fr_sulfone', 'fr_term_acetylene', 'fr_tetrazole', 'fr_thiazole', 'fr_thiocyan', 'fr_thiophene', 'fr_unbrch_alkane', 'fr_urea']
ptx = Chem.MolFromSmiles('[H][C@@]12CC[C@](O)(C(=O)CO)[C@@]1(C)CC(=O)[C@@]1([H])[C@@]2([H])CCC2=CC(=O)C=C[C@]12C')
Chem.Draw.MolToImage(ptx, size=(500,500))
../../_images/ac345d8858f427fd1d2acd7ba3a4dc6a0c7d9d620ba1c08f865fba86e2a24140.png
import urllib.parse

smiles = "[H][C@@]12CC[C@](O)(C(=O)CO)[C@@]1(C)CC(=O)[C@@]1([H])[C@@]2([H])CCC2=CC(=O)C=C[C@]12C"
encoded_smiles = urllib.parse.quote(smiles)

print(encoded_smiles)
%5BH%5D%5BC%40%40%5D12CC%5BC%40%5D%28O%29%28C%28%3DO%29CO%29%5BC%40%40%5D1%28C%29CC%28%3DO%29%5BC%40%40%5D1%28%5BH%5D%29%5BC%40%40%5D2%28%5BH%5D%29CCC2%3DCC%28%3DO%29C%3DC%5BC%40%5D12C

Acknowldegments#

Material in this activty has been adapted from chapter 3 of Charles Weiss’s book “Scientific Computing for Chemists with Python”, CC BY-NC-SA 4.0.

Additional content was developed with assistance from Perplexity AI and Chat GPT. Multiple queries were made during the Fall 2024 and the Spring 2025.