SMARTS Primer

SMARTS Primer#

1. What SMARTS is#

SMARTS = SMiles ARbitrary Target Specification. It extends the SMILES syntax so you can search for patterns (substructures) inside molecules instead of describing a single, fully defined molecule.

A SMILES string → describes one molecule. A SMARTS pattern → describes all molecules that match that pattern

2. The “anatomy” of a SMARTS string#

A SMARTS pattern is made of:

Component	Meaning	Example
Atom expressions	Describe what kind of atom can match.	`[C]`, `[O]`, `[N+]`, `[!C]`
Bond expressions	Describe how atoms are connected.	`-`, `=`, `#`, `:`, `~`
Logical operators	Combine conditions.	`&` (AND), `,` (OR), `!` (NOT)
Parentheses `()`	Group subpatterns.	`[$(C(=O)O)]`
Recursion `$()`	A pattern within a pattern (powerful).	`[$([CX4][OX2H])]`

a. Atom Expressions#

Atom Expressions

Inside square brackets [...], you can specify which atoms match and how they’re bonded.

Token	Meaning	Example	Matches	Excludes
`C`, `N`, `O`	Element symbol (default valence inferred)	`[C]`	any carbon	—
`A`	any aliphatic atom (non-aromatic)	`[A]`	sp³ C, N, O…	aromatic atoms
`a`	any aromatic atom	`[a]`	benzene ring atoms	aliphatic atoms
`*`	any atom	`[*]`	everything	—
`#n`	atomic number =n	`[#6]`	carbon	—
`Xn`	connectivity (number of attached atoms)	`[CX4]`	sp³ carbon	carbonyl C (X3)
`Hn`	explicit hydrogen count	`[OH1]`	hydroxyl oxygen	ether oxygen (H0)
`v n`	valence electrons (rarely used)	`[Nv3]`	neutral amine N	quaternary N⁺
`R`	atom is in a ring (boolean)	`[CR]`	cyclohexane C	chain C
`r n`	in a ring of size n	`[Cr6]`	benzene C	cyclopropane C (r3)
`+n`, `-n`	charge	`[N+]`	ammonium	neutral N
`!`	logical NOT	`[!O]`	everything but O	O
`;`	logical AND (within atom expr.)	`[C;H3]`	methyl C	secondary C
`,`	logical OR	`[O,N]`	O or N	others
`$()`	recursive subpattern	`[$([CX4][OX2H])]`	alcohol	carbonyl

b. Bond Expressions#

Outside atom brackets, characters define bond types:

Symbol	Meaning	Example	Matches
`-`	single	`CC` or `C-C`	sigma bonds
`=`	double	`C=O`	carbonyl
`#`	triple	`C#N`	nitrile
`:`	aromatic bond	`c1ccccc1`	benzene
`~`	any bond (wildcard)	`C~O`	single/double/etc.
`@`	ring bond (stereo)	`[C@H]`	stereochem.
`!`	NOT a bond type	`!@`	non-ring bond

c. Logic and Grouping#

Syntax	Meaning	Example
`[C,N]`	carbon OR nitrogen	`[O,N,S]` = heteroatoms
`[C;H3]`	carbon AND 3 hydrogens	methyl carbon
`[!O;!N]`	not O AND not N	hydrocarbon atoms
`$( ... )`	recursive/embedded subpattern	`[$(C(=O)O)]` = carboxyl group
`()`	parentheses for order grouping	`[C;$(=O)]`

Meaning of “X” and related symbols

Symbol	Meaning	Example
X	connectivity number (count of σ-bonded neighbors)	`[CX4]` = sp³ carbon
v	valence (number of bonded electrons)	`[Nv3]` = trivalent nitrogen
H	number of attached hydrogens	`[OH1]` = hydroxyl O
R	ring membership (1 = in a ring)	`[CR1]` = ring carbon
r	ring size	`[Cr6]` = in a 6-membered ring
+ / −	charge	`[N+]`, `[O-]`
!	NOT operator	`[!O]` = any atom except O
& , ;	AND/OR connectors	`[C;H3]`, `[C,N]`

3. SMARTS Patterns#

Common Patterns#

SMARTS Pattern	Meaning	Example Match
`*`	Any atom	Matches all atoms
`#6`	Carbon (atomic number 6)	Matches any carbon atom
`C`	Aliphatic carbon	Matches non-aromatic C
`c`	Aromatic carbon	Matches benzene ring carbon
`N`	Aliphatic nitrogen	Matches amines, etc.
`n`	Aromatic nitrogen	Matches pyridine-type N
`[O;H1]`	Hydroxyl oxygen	Matches -OH group
`[CX3]=O`	Carbon (sp2) double bonded to O	Carbonyl group
`[OX1H]`	Single-bonded OH group	Alcohol or acid OH
`[$([NH2]),$([NH])`	Primary or secondary amine	Amine groups
`[!#6]`	Any atom except carbon	e.g., O, N, S, etc.
`[C;H3]`	Methyl carbon	Terminal CH₃ group
`[R]`	Any atom in a ring	Benzene, cyclohexane atoms
`[R0]`	Atom not in a ring	Linear chain atoms
`[cH]`	Aromatic hydrogen	e.g., hydrogen on benzene ring
`C(=O)O`	Carboxylic acid group (non-generic)	Acetic acid
`[C@H](N)(C)`	Chiral carbon (with wedge notation)	Specific stereochemistry

Functional Group Patterns#

Functional Group	SMARTS Pattern	Explanation
Alcohol	`[CX4][OX2H]`	sp³ C bonded to hydroxyl
Ether	`[OD2]([CX4])[CX4]`	O with two sp³ C neighbors
Phenol	`[c][OX2H]`	aromatic C–OH
Aldehyde	`[CX3H1](=O)[#6]`	terminal carbonyl H
Ketone	`[CX3](=O)[#6]`	internal carbonyl
Carboxylic acid	`[CX3](=O)[OX2H1]`	–C(=O)OH
Ester	`[CX3](=O)[OX2][CX4]`	–C(=O)O–
Amide	`[NX3][CX3](=O)`	N attached to carbonyl
Amine (primary/secondary)	`[NX3;H2,H1;!$(NC=O)]`	excludes amides
Aromatic ring	`c1ccccc1`	benzene 6-ring
Halogen	`[F,Cl,Br,I]`	halogen atoms
Thiol	`[SX2H]`	–SH
Disulfide	`[SX2]S`	–S–S–
Carboxylate anion	`[CX3](=O)[O-]`	deprotonated acid

4. Using SMARTS in RDKit#

SMARTS is a notation and language specification—a symbolic way to describe structural patterns in molecules. However, what actually happens when you run a SMARTS query depends on how the SMARTS parser and matcher are implemented in the cheminformatics toolkit you are using. Different toolkits such as RDKit, Open Babel, or Daylight, may interpret certain SMARTS features differently based on how they handle aromaticity, implicit hydrogens, or valence rules.

In this section, we focus on how RDKit interprets and applies SMARTS patterns. Understanding these nuances is essential because the same SMARTS expression can yield different results across toolkits. By exploring RDKit’s parsing behavior, matching functions, and query options, you’ll gain insight into how SMARTS are operationalized within a specific chemical perception engine—and why awareness of these implementation details matters in cheminformatics workflows.

When RDKit reads a SMARTS string, it doesn’t treat it as plain text — it compiles it into a special internal object called a query molecule. This object defines a set of atom and bond constraints that RDKit can test against other molecules. The process works conceptually as follows:

Parsing – The SMARTS string (e.g., [O;D2]) is read by Chem.MolFromSmarts(), which checks for valid syntax and builds an RDKit molecule object. This is the same type of object you get when you read a SMILES string, but it has special properties for querying. It has the same Python type as a normal molecule (rdkit.Chem.rdchem.Mol), but internally its atoms are QueryAtoms and its bonds are QueryBonds—specialized objects that store logical rules instead of fixed chemical properties.
Compilation – RDKit translates these symbolic queries into a query graph, a data structure optimized for substructure searching.
Matching – When you call mol.HasSubstructMatch(patt) or mol.GetSubstructMatches(patt), RDKit walks this query graph, comparing each atom and bond in the target molecule against the constraints in the query molecule.
Result Handling – Matches are returned as tuples of atom indices (for example, which atoms in ethanol match the [OX2H] pattern).

Because RDKit relies on its own definitions of aromaticity, valence, and hydrogen handling, the results of a SMARTS query reflect RDKit’s chemical perception model rather than an absolute SMARTS truth.

4.1 RDKit Functions and Methods for SMARTS#

Working with SMARTS in RDKit involves a mix of functions (usually from rdkit.Chem) and methods (attached to molecule objects). The table below summarizes the most common ones used interactively with SMARTS patterns in Jupyter Lab.

Note: RDKit functions that start with Chem. generally create or convert molecule objects, while methods called on a molecule (e.g., mol.HasSubstructMatch(patt)) act upon those objects.

Table: Common RDKit Functions and Methods for SMARTS Workflows#

Category	Function / Method	Description	Typical Use Case
Create / Parse	`Chem.MolFromSmarts(s)`	Parses a SMARTS string into a query molecule (`Mol` with `QueryAtoms`). Returns `None` if invalid.	Create a SMARTS pattern: `patt = Chem.MolFromSmarts("[O;D2]")`
	`Chem.MolToSmarts(mol)`	Converts a molecule back into its SMARTS string representation.	Inspect or export a query molecule.
	`Chem.MolFromSmiles(smi)`	Parses a SMILES into a regular molecule (for targets).	Define target molecules for substructure searching.
Search / Match	`mol.HasSubstructMatch(patt)`	Returns `True` if the molecule matches the SMARTS pattern.	Quick Boolean tests.
	`mol.GetSubstructMatches(patt)`	Returns tuples of atom indices that match the SMARTS pattern.	Locate and highlight matched atoms.
	`mol.GetSubstructMatch(patt)`	Returns the first match only (single tuple).	Simpler alternative for small molecules.
Query Introspection	`atom.DescribeQuery()`	Returns a text description of the logical tests for a query atom.	Explore how RDKit interprets your SMARTS internally.
	`bond.DescribeQuery()`	Describes bond-level query logic.	Examine SMARTS bond conditions.
	`Chem.MolToMolBlock(patt)`	Prints the query molecule in MOL-block format, including SMARTS info.	Debugging or visualization.
Visualization (Jupyter-friendly)	`from rdkit.Chem import Draw`	RDKit drawing submodule.	Must import to render structures in notebooks.
	`Draw.MolToImage(mol, size=(200,200))`	Creates a PIL image of a molecule or SMARTS match.	Display molecules inline in Jupyter Lab.
	`Draw.MolsToGridImage([...])`	Displays multiple molecules side-by-side (optionally with highlights or legends).	Compare pattern matches visually.
Batch Searching	`Chem.SubstructLibrary()`	Builds an indexed library for fast bulk substructure queries.	Screening many molecules against one SMARTS.
Reactions / Advanced	`Chem.ReactionFromSmarts(s)`	Parses reaction SMARTS into an RDKit reaction object.	For SMARTS-based transformations.
	`rxn.RunReactants((mol1, ...))`	Applies reaction SMARTS to one or more molecules.	Explore reaction mapping or retrosynthesis.

Jupyter-Specific Notes#

In Jupyter Lab, RDKit’s Draw module automatically integrates with IPython’s display system. Simply placing a molecule as the last line in a code cell will render it as an image:
```
from rdkit import Chem
from rdkit.Chem import Draw

mol = Chem.MolFromSmiles("CCO")
mol  # ← rendered automatically in Jupyter
```
Outside Jupyter (e.g., VS Code terminal or a script), you must explicitly call Draw.MolToImage() or Draw.ShowMol() to view molecules.

To highlight SMARTS matches in teaching examples, you can use:

patt = Chem.MolFromSmarts("[OX2H]")
matches = mol.GetSubstructMatches(patt)
Draw.MolToImage(mol, highlightAtoms=[a for m in matches for a in m])

from rdkit import Chem
from rdkit.Chem import Draw
from IPython.display import Image

# Step 1: Create the target molecule from SMILES
mol = Chem.MolFromSmiles("C(CO)CO")  # ethanol

# Step 2: Create the SMARTS query and store its string form
patt = Chem.MolFromSmarts("[OX2H]")  # alcohol oxygen pattern
smarts_string = Chem.MolToSmarts(patt).replace("&","")  # convert to string for display

# Step 3: Test for substructure match
if mol.HasSubstructMatch(patt):
    print("Match found!")

# Step 4: Get substructure matches and print them
matches = mol.GetSubstructMatches(patt)
print("Matched atom indices:", matches)

# Step 5: Inspect object types
print("mol type:", type(mol))
print("patt type:", type(patt))
print("smarts_string type:", type(smarts_string))

# Step 6: Highlight matching atoms
highlight_atoms = [a for m in matches for a in m]

# Step 7: Draw molecule with atom indices, highlights, and SMARTS legend
drawer = Draw.MolDraw2DCairo(300, 300)
opts = drawer.drawOptions()
opts.addAtomIndices = True  # show atom index numbers

# ✅ Add the legend argument *inside* DrawMolecule()
legend_text = f"SMARTS: {smarts_string}"
drawer.DrawMolecule(mol, highlightAtoms=highlight_atoms, legend=legend_text)
drawer.FinishDrawing()

# Step 8: Display image in Jupyter
Image(drawer.GetDrawingText())

Match found!
Matched atom indices: ((2,), (4,))
mol type: <class 'rdkit.Chem.rdchem.Mol'>
patt type: <class 'rdkit.Chem.rdchem.Mol'>
smarts_string type: <class 'str'>

../../_images/8b482931f54fad1801afbf5a248d6588084a88a8179cb55a4c959e29c105d461.png

from rdkit import Chem

molecules = {
    "ethanol": "CCO",
    "acetone": "CC(=O)C",
    "acetic_acid": "CC(=O)O",
    "ethyl_acetate": "CCOC(=O)C",
    "aniline": "c1ccccc1N"
}

patterns = {
    "alcohol": "[CX4][OX2H]",
    "ketone": "[CX3](=O)[#6]",
    "ester": "[CX3](=O)[OX2][CX4]",
    "amine": "[NX3;H2,H1;!$(NC=O)]"
}

for name, smi in molecules.items():
    mol = Chem.MolFromSmiles(smi)
    print(f"\n{name}: {smi}")
    for pname, smarts in patterns.items():
        patt = Chem.MolFromSmarts(smarts)
        print(f"  {pname:8s} -> {mol.HasSubstructMatch(patt)}")

ethanol: CCO
  alcohol  -> True
  ketone   -> False
  ester    -> False
  amine    -> False

acetone: CC(=O)C
  alcohol  -> False
  ketone   -> True
  ester    -> False
  amine    -> False

acetic_acid: CC(=O)O
  alcohol  -> False
  ketone   -> True
  ester    -> False
  amine    -> False

ethyl_acetate: CCOC(=O)C
  alcohol  -> False
  ketone   -> True
  ester    -> True
  amine    -> False

aniline: c1ccccc1N
  alcohol  -> False
  ketone   -> False
  ester    -> False
  amine    -> True

5. SMARTS Explorer Activity#

Run the following code and then use the SMARTS explorer to answer the following questions.

molecules = {
    "Ethanol": "CCO",
    "Explicit Ethanol": "CC[OH]",
    "Acetone": "CC(=O)C",
    "Acetic acid": "CC(=O)O",
    "Ethyl acetate": "CCOC(=O)C",
    "Phenol": "c1ccc(cc1)O",
    "Aniline": "c1ccccc1N",
    "Dimethyl ether": "COC",
    "Toluene": "Cc1ccccc1",
    "Nitrobenzene": "c1ccc(cc1)[N+](=O)[O-]",
    "2-nitropropane": "CC(C)[N+](=O)[O-]",
    "Methanol": "CO",
    "Acetamide": "CC(=O)N",   
}

from rdkit import Chem
from rdkit.Chem import Draw
from IPython.display import display, Markdown
import ipywidgets as widgets



def smarts_lab(smarts_pattern):
    patt = Chem.MolFromSmarts(smarts_pattern)
    if patt is None:
        display(Markdown(f"❌ Invalid SMARTS pattern: `{smarts_pattern}`"))
        return
    images = []
    legends = []
    for name, smi in molecules.items():
        mol = Chem.MolFromSmiles(smi)
        matches = mol.GetSubstructMatches(patt)
        highlight = [a for m in matches for a in m]
        img = Draw.MolToImage(mol, size=(200,200), highlightAtoms=highlight)
        images.append(img)
        legends.append(name)
    display(Markdown(f"### Pattern: `{smarts_pattern}`"))
    display(Draw.MolsToGridImage(
        [Chem.MolFromSmiles(s) for s in molecules.values()],
        legends=legends,
        highlightAtomLists=[
            [a for m in Chem.MolFromSmiles(s).GetSubstructMatches(patt) for a in m]
            for s in molecules.values()
        ],
        subImgSize=(200,200),
        molsPerRow=4
    ))

widgets.interact(
    smarts_lab,
    smarts_pattern=widgets.Text(
        value='[CX4][OX2H]',
        description='SMARTS:',
        placeholder='Type a SMARTS pattern…',
        continuous_update=False
    )
)

<function __main__.smarts_lab(smarts_pattern)>

Comparing groups#

Paste the following groups into the SMARTS Explorer and see what they match.

Topic	Try this SMARTS	Expected Outcome
Alcohol vs Carbonyl	`[CX4][OX2H]`	Highlights ethanol & phenol but not acetone
Carbonyl group	`[CX3]=O`	Highlights acetone, acetic acid, ethyl acetate
Ketone group	`[#6][CX3](=O)[#6]`	Highlights acetone only
Carboxylic acid	`[CX3](=O)[OX2H1]`	Only acetic acid
Ester linkage	`[CX3](=O)[OX2][CX4]`	Only ethyl acetate
Amine only	`[NX3;!$(NC=O);!$(N=O)]`	Aniline only
Amide only	`[NX3][CX3](=O)`	Only acetamide
Hydrocarbon chain C	`[CX4;!R]`	Aliphatic chains only
Aromatic ring	`c1ccccc1`	All benzene derivatives

Comparing ether patterns#

Paste the following ehter patterns into the SMARTS Explorer to see how there are different ways of generating SMARTS patterns.

SMARTS Pattern	type	comments
[O;D2]	minimalist	Ether oxygen with 2 carbons
[#6X4]-[OX2]-[#6X4]	Bond based	each carbon has 4 bonds
`[OD2]([CX4])[CX4]`	Branch based	Oxygen has a C branch

Activity#

Provide SMARTS patterns that do the following. You can test them in the SMARTS Explorer above.

Show nitro group of nitrobenzene and 2-nitropropane
Show the nitro group of nitrobenzene but not 2-nitropropane
Show the nitro group of 2-nitropropane but not nitrobenzene
Show just the nitrogen of the nitro group in nitrobenzene and 2-nitropropane (use recursive SMARTS)
Show the alcohol group of all alcohols
Show the alcohol group of ethanol but not phenol
Show the alcohol group of phenol but not ethanol
Show the ether group of dimethyl ether but not ethyl acetate
Show all C=O bonds
Show amines and amides but not nitro or nitroso compounds

6. Assignment#

You should use an AI to assist you in this assignment. The following code cell takes a dictionary of amino acids and their SMILES strings and uses the RDKit library to create a polypeptide from 5 randomly selected amino acids. Your job is to create a new Jupyter notebook called Amino Acid Functional Group Explorer that identifies which functional groups are present in your amino acid, and displays an image for each functional group with the name of the group as the label and the group(s) highlighted on the polypeptide.

Two assist you I have also created a dictionary of common functional groups and their SMARTS patterns. You can use this to identify the functional groups in your polypeptide.

#This program will build and display a polypeptide from 5 randomly selected amino acids from a python dictionary

import random
from rdkit import Chem
from rdkit.Chem import AllChem, Draw

# Step 1: Define amino acids (simplified SMILES, unprotected)
amino_acids = {
    "Gly": "NCC(=O)O",
    "Ala": "NCC(C)C(=O)O",
    "Val": "NCC(C(C)C)C(=O)O",
    "Leu": "NCC(CC(C)C)C(=O)O",
    "Ile": "NCC(C(C)C)C(=O)O",
    "Ser": "NCC(CO)C(=O)O",
    "Thr": "NCC(C(O)C)C(=O)O",
    "Asp": "NCC(C(=O)O)C(=O)O",
    "Glu": "NCC(CC(=O)O)C(=O)O",
    "Lys": "NCC(CCCCN)C(=O)O",
    "Arg": "NCC(CCCNC(N)=N)C(=O)O",
    "Phe": "NCC(Cc1ccccc1)C(=O)O",
    "Tyr": "NCC(Cc1ccc(O)cc1)C(=O)O",
    "Trp": "NCC(Cc1c[nH]c2ccccc12)C(=O)O",
    "His": "NCC(Cc1cncn1)C(=O)O",
    "Asn": "NCC(C(=O)N)C(=O)O",
    "Gln": "NCC(CC(=O)N)C(=O)O",
    "Met": "NCC(CSC)C(=O)O",
    "Cys": "NCC(CS)C(=O)O",
    "Pro": "N1CCC[C@H](C(=O)O)N1"
}

# Step 2: Randomly select 5 amino acids
selected = random.sample(list(amino_acids.items()), 5)
print("Selected amino acids:", [aa for aa, smi in selected])

# Step 3: Convert to RDKit molecules
mols = [Chem.MolFromSmiles(smi) for aa, smi in selected]

# Step 4: Use RDKit’s peptide builder (requires RDKit >=2022)
# if not available, can use a manual connection function
from rdkit.Chem import rdChemReactions

# Define generic peptide coupling reaction: acid + amine → amide + H2O
rxn = rdChemReactions.ReactionFromSmarts("[C:1](=O)[O:2].[N:3]>>[C:1](=O)[N:3]")

# Step 5: Iteratively build the chain
peptide = mols[0]
for next_mol in mols[1:]:
    prod = rxn.RunReactants((peptide, next_mol))
    peptide = prod[0][0]  # take first product
    Chem.SanitizeMol(peptide)

# Step 6: Display the final random pentapeptide
Draw.MolToImage(peptide, size=(400, 300))

Selected amino acids: ['Asp', 'His', 'Ile', 'Gly', 'Cys']

[17:34:20] Can't kekulize mol.  Unkekulized atoms: 4 5 6 7 8
[17:34:20] mapped atoms in the reactants were not mapped in the products.
  unmapped numbers are: 2 

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[5], line 48
     46 peptide = mols[0]
     47 for next_mol in mols[1:]:
---> 48     prod = rxn.RunReactants((peptide, next_mol))
     49     peptide = prod[0][0]  # take first product
     50     Chem.SanitizeMol(peptide)

ValueError: reaction called with None reactants

functional_groups = {
    # Core functional groups
    "Amine (primary/secondary/tertiary)": "[NX3;!$(NC=O);!$([N]~[!#1;!#6])]",
    "Carboxylic acid": "[CX3](=O)[OX2H1]",
    "Amide": "[CX3](=O)[NX3]",
    "Alcohol": "[OX2H][CX4]",
    "Phenol": "c[OX2H]",
    "Thiol": "[SX2H]",
    "Thioether": "[SX2][CX4]",
    "Ether": "[OD2]([#6])[#6]",
    "Ketone": "[#6][CX3](=O)[#6]",
    "Aldehyde": "[CX3H1](=O)[#6]",
    "Carboxylate (deprotonated acid)": "[CX3](=O)[O-]",
    
    # Nitrogen-rich groups
    "Guanidinium": "NC(=[NH2+])N",
    "Amidinium": "NC(=N)N",
    "Imidazole": "c1ncnc1",
    "Indole": "c1ccc2c(c1)[nH]cc2",
    "Amine (aromatic)": "c[NX3;!$(NC=O)]",
    
    # Sulfur groups
    "Disulfide": "[SX2][SX2]",
    "Thioester": "[CX3](=O)[SX2]",
    
    # Acid derivatives
    "Ester": "[CX3](=O)[OX2][CX4]",
    "Carbamate": "[NX3][CX3](=O)[OX2]",
    
    # Aromatic ring
    "Aromatic ring": "a1aaaaa1"
}