Generating Molecular Fingerprints
Generate structural keys
Generate hashed fingerprints
Many useful documents/papers describe various aspects of molecular similarity, including molecular fingerprints and similarity measures. Please read these if you need more details.
Getting Started with the RDKit in Python (https://www.rdkit.org/docs/GettingStartedInPython.html#fingerprinting-and-molecular-similarity)
Fingerprint Generation, GraphSim Toolkit 2.4.2 (https://docs.eyesopen.com/toolkits/python/graphsimtk/fingerprint.html)
Chemical Fingerprints (https://docs.chemaxon.com/display/docs/Chemical+Fingerprints)
Extended-Connectivity Fingerprints (https://doi.org/10.1021/ci100050t)
Fingerprint Generation
Molecular fingerprints are molecular descriptors that encode a molecule’s structure as a bit string. Each bit in the string indicates the presence or absence of a structural feature. This notebook explores generation of two major types of molecular fingerprints:
structural keys
hashed fingerprints
from rdkit import Chem
mol = Chem.MolFromSmiles('CC(C)C1=C(C(=C(N1CC[C@H](C[C@H](CC(=O)O)O)O)C2=CC=C(C=C2)F)C3=CC=CC=C3)C(=O)NC4=CC=CC=C4')  # atorvastatin
mol
Structural Keys
1: MACCS keys
The MACCS key is a binary fingerprint (a string of 0’s and 1’s) with a total length of 166 bits. Each bit position represents the presence (=1) or absence (=0) of a pre-defined structural feature. The feature definitions for the MACCS keys are available at:
https://github.com/rdkit/rdkit/blob/master/rdkit/Chem/MACCSkeys.py
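To make the idea of a structural key concrete before using the real MACCS keys, here is a minimal pure-Python sketch. The four features and the helper name `toy_structural_key` are invented for illustration and are not part of the MACCS dictionary or of RDKit; the molecules are described only by hand-written lists of heavy-atom symbols.

```python
# Toy structural key: each bit position is tied to one predefined feature.
# The feature list is made up for this example; it is NOT the MACCS dictionary.
FEATURES = [
    ("contains nitrogen", lambda atoms: "N" in atoms),
    ("contains oxygen", lambda atoms: "O" in atoms),
    ("contains a halogen", lambda atoms: any(a in {"F", "Cl", "Br", "I"} for a in atoms)),
    ("has more than six carbons", lambda atoms: atoms.count("C") > 6),
]


def toy_structural_key(atoms):
    """Return a bit string with one bit per entry in FEATURES."""
    return "".join("1" if test(atoms) else "0" for _, test in FEATURES)


# Heavy atoms of aspirin (C9H8O4) and 1-bromobutane (C4H9Br), written out by hand.
aspirin_atoms = ["C"] * 9 + ["O"] * 4
bromobutane_atoms = ["Br"] + ["C"] * 4

aspirin_key = toy_structural_key(aspirin_atoms)
bromobutane_key = toy_structural_key(bromobutane_atoms)
print(aspirin_key)      # 0101
print(bromobutane_key)  # 0010
```

The real MACCS keys work the same way in spirit, but with 166 predefined substructure patterns instead of four crude element checks.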
from rdkit.Chem import MACCSkeys  # module for generating MACCS keys with RDKit
fp = MACCSkeys.GenMACCSKeys(mol)  # fp is an RDKit bit vector holding the MACCS keys of mol
print('fp is of type:', type(fp))
print()
# Print the value of each bit in the fingerprint object
# by looping over the bit positions one at a time
print('printing the values of the bitstring as a loop:')
for i in range(len(fp)):
    print(fp[i], end='')
print()
# Alternative, easier way to convert it to a bitstring for output.
print('printing the values of the bitstring using the ToBitString method in RDKit')
fp.ToBitString()
print(len(fp)) #one way to get the number of bits in the fingerprint
print(fp.GetNumBits()) # another way to get the number of bits in the fingerprint
Note that the MACCS key is 166 bits long, but RDKit generates a 167-bit fingerprint. This is because the index of a list/vector in many programming languages (including Python) begins at 0. To keep the original numbering of the MACCS keys (1-166) rather than 0-165, RDKit implements the MACCS keys as a 167-bit fingerprint in which Bit 0 is always zero. Because Bit 0 is set to OFF for all compounds, it does not affect the evaluation of molecular similarity.
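The numbering convention can be illustrated without RDKit. The sketch below works on a made-up 167-character bit string (not a real molecule) and uses a hypothetical helper, `on_key_numbers`, to show that with the placeholder at index 0 the string index and the MACCS key number coincide.

```python
def on_key_numbers(bitstring):
    """Return the MACCS key numbers (1-166) that are ON in a 167-character bit string."""
    if len(bitstring) != 167:
        raise ValueError("expected a 167-character RDKit-style MACCS bit string")
    if bitstring[0] != "0":
        raise ValueError("bit 0 is a placeholder and should always be 0")
    # Because of the placeholder at index 0, the string index equals the key number.
    return [i for i, bit in enumerate(bitstring) if bit == "1"]


# Made-up example: switch on keys 1, 165 and 166.
bits = ["0"] * 167
for key in (1, 165, 166):
    bits[key] = "1"
example = "".join(bits)

keys_on = on_key_numbers(example)
original_166 = example[1:]  # drop the placeholder to recover a 166-bit string
print(keys_on, len(original_166))
```

Dropping the first character, as in `example[1:]`, is one way to obtain a 166-bit string if another tool expects the keys without the placeholder bit.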
These are some methods that allow you to get some additional information on the MACCS Keys.
print(fp.GetNumBits())     # get the total number of bits
print(fp.GetNumOffBits())  # get the number of bits with value 0
print(fp.GetNumOnBits())   # get the number of bits with value 1
print(fp.ToBinary())       # get the fingerprint in its serialized binary (bytes) form
Run the following code cells.
The first cell contains SMILES for a series of molecules and displays their structures
The second cell displays the MACCS keys for each molecule in the list called smiles.
# display a series of smiles
smiles = ['C1=CC=CC=C1',  # benzene (Kekule)
          'c1ccccc1',     # benzene ("aromatized" carbons)
          'C1=CC=NC=C1',  # pyridine
          'C1CCCCC1',     # cyclohexane
          'C1CCNCC1']     # piperidine
from rdkit.Chem import Draw
mols = []
for x in smiles:
    mols.append(Chem.MolFromSmiles(x))
#mols = [ Chem.MolFromSmiles(x) for x in smiles ]
Draw.MolsToGridImage(mols, molsPerRow=5, subImgSize=(100,100), legends=[str(x) for x in smiles])
# generate MACCS keys from SMILES
for smile in smiles:
    print(smile)
    mol = Chem.MolFromSmiles(smile)
    fp = MACCSkeys.GenMACCSKeys(mol)
    fp = fp.ToBitString()
    print(fp)
    # report the position of every bit that is ON
    for index, val in enumerate(fp):
        if val == '1':
            print("index is %d and value is %s" % (index, val))
    print()
In general what structural features do benzene and pyridine have in common?
Which MACCS keys do they have in common?
Write the fragment definition of the bits ON that are in common for benzene and pyridine (one is already provided for you as an example).
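When comparing two fingerprints by eye becomes tedious, a small helper can list the positions that are ON in both bit strings. This is a minimal sketch: the helper name `common_on_bits` is hypothetical, and the two short bit strings are toy inputs rather than real MACCS keys.

```python
def common_on_bits(bitstring_a, bitstring_b):
    """Return the positions that are '1' in both bit strings."""
    if len(bitstring_a) != len(bitstring_b):
        raise ValueError("bit strings must have the same length")
    return [i for i, (a, b) in enumerate(zip(bitstring_a, bitstring_b))
            if a == "1" and b == "1"]


# Toy 7-bit strings, chosen only to show the mechanics.
toy_a = "0110100"
toy_b = "0100110"
shared = common_on_bits(toy_a, toy_b)
print(shared)  # [1, 4]
```

The same function can be applied to the strings returned by ToBitString() for any two molecules in the list above.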
2: PubChem Fingerprint
The PubChem Fingerprint is an 881-bit-long binary fingerprint (ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.pdf). Similar to the MACCS keys, it uses a pre-defined fragment dictionary. The PubChem fingerprint for each compound in PubChem can be downloaded from PubChem.
# the following code cell generates a PUG REST request to obtain the PubChem Fingerprint
import requests
prolog = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
cid ="2244" # CID for aspirin
url = prolog + "/compound/cid/" + cid + "/property/Fingerprint2D/TXT"
res = requests.get(url)
pcfp_base64 = res.text #pcfp = PubChemFingerPrint
print(pcfp_base64)
Notice that the above output is not a binary bitstring. PubChem fingerprints are provided as Base64-encoded strings. The fingerprint itself consists of 881 bits representing structural features, but instead of transmitting these bits directly, PubChem encodes them as Base64 text. Text is safer and easier to transmit through application programming interfaces (APIs) such as PUG REST, and databases generally handle text better than raw binary data. When retrieved through PUG REST, the fingerprint appears as a 156-character string (the TXT response may also end with a newline character). Because the fingerprints are Base64-encoded, they should be decoded into binary bitstrings or bitvectors before use.
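The length of the Base64 string can be worked out with a little arithmetic. The sketch below assumes the layout given in the PubChem fingerprint specification: a 4-byte prefix storing the fingerprint length, followed by the 881 fingerprint bits, padded to a whole number of bytes. The cross-check encodes an all-zero payload of the same size; it is not a real fingerprint.

```python
import base64
import math

FP_BITS = 881     # structural-feature bits in a PubChem fingerprint
PREFIX_BITS = 32  # assumed 4-byte prefix holding the fingerprint length

payload_bits = PREFIX_BITS + FP_BITS       # bits that carry information
n_bytes = math.ceil(payload_bits / 8)      # bytes needed to hold them
padding_bits = n_bytes * 8 - payload_bits  # unused bits in the last byte
n_chars = 4 * math.ceil(n_bytes / 3)       # Base64 emits 4 characters per 3 bytes

# Cross-check with the standard library on an all-zero payload of the same size.
encoded = base64.b64encode(bytes(n_bytes)).decode("ascii")
print(n_bytes, padding_bits, n_chars, len(encoded))
```

Under these assumptions the encoded fingerprint has 156 Base64 characters; a trailing newline in the response text would add one more to its measured length.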
Details about how to decode base64-encoded PubChem fingerprints are described on the last page of the PubChem Fingerprint specification (https://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.pdf). Below is a user-defined function that decodes a PubChem fingerprint into a bit string.
from base64 import b64decode
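def PCFP_BitString(pcfp_base64):
    # Decode Base64 to bytes, write each byte as 8 bits, then drop the
    # 32-bit length prefix and the trailing padding bits.
    pcfp_bitstring = "".join(["{:08b}".format(x) for x in b64decode(pcfp_base64)])[32:913]
    return pcfp_bitstring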
Using the user-defined function we created above, we can convert the Base64 output to a bit string.
pcfp_bitstring = PCFP_BitString(pcfp_base64) # use the user defined function to convert the PubChem FingerPrint to a bitstring
print(len(pcfp_bitstring))
print(pcfp_bitstring)
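One way to convince yourself that the slice [32:913] recovers exactly the 881 fingerprint bits is a round trip on synthetic data. The sketch below is an illustration under the assumption that the layout is a 4-byte big-endian length prefix, the 881 bits, and 7 padding bits; `encode_like_pubchem` and `decode_pubchem` are hypothetical helper names, and the bit pattern is made up rather than taken from a real compound.

```python
from base64 import b64decode, b64encode


def encode_like_pubchem(bitstring):
    """Pack an 881-character bit string into the assumed PubChem layout."""
    if len(bitstring) != 881:
        raise ValueError("expected 881 bits")
    prefix = "{:032b}".format(881)         # 4-byte length prefix
    padded = prefix + bitstring + "0" * 7  # pad to 920 bits = 115 bytes
    raw = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
    return b64encode(raw).decode("ascii")


def decode_pubchem(pcfp_base64):
    """Same logic as the decoding function above: drop the prefix and the padding."""
    bits = "".join("{:08b}".format(byte) for byte in b64decode(pcfp_base64))
    return bits[32:913]


# Synthetic pattern: every third bit is ON.
synthetic = "".join("1" if i % 3 == 0 else "0" for i in range(881))
encoded = encode_like_pubchem(synthetic)
decoded = decode_pubchem(encoded)
print(len(encoded), decoded == synthetic)
```

Note that the encoded string begins with "AAAD", which is the Base64 form of the first three prefix bytes.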
The generated bitstring can be converted to a bitvector that can be used for molecular similarity computation in RDKit (to be discussed in part 2).
from rdkit import DataStructs
bitvect = DataStructs.CreateFromBitString(PCFP_BitString(pcfp_base64))
type(bitvect)
Hashed Fingerprints
1: Circular Fingerprints
MACCS Keys and PubChem Fingerprints are examples of structural keys, which use a fixed-length bit vector (166 and 881 bits, respectively). Each bit in the vector corresponds to the presence (1) or absence (0) of a predefined chemical feature (atoms, bonds) or substructure (aromatic rings, carbonyl). In contrast, extended-connectivity fingerprints are generated algorithmically by exploring the neighborhood of each atom up to a given radius. Each environment is then encoded into a hashed identifier. Structural keys rely on a fixed dictionary of features, whereas extended-connectivity fingerprints are more flexible and capture local atomic environments without the need for predefined substructures.
Circular fingerprints are hashed fingerprints. They are generated by exhaustively enumerating “circular” fragments (containing all atoms within a given radius from each heavy atom of the molecule) and then hashing these fragments into a fixed-length bitstring. (Here, the “radius” from an atom is measured by the number of bonds that separate two atoms).
Examples of circular fingerprints are the extended-connectivity fingerprints (ECFPs) and their variant called FCFPs (Functional-Class Fingerprints), originally described in a paper by Rogers and Hahn (https://doi.org/10.1021/ci100050t). The RDKit implementation of these fingerprints is called “Morgan Fingerprints” (https://www.rdkit.org/docs/GettingStartedInPython.html#morgan-fingerprints-circular-fingerprints).
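The two steps (collect the atoms within a given radius, then hash the environment into a bit position) can be sketched in pure Python. This is a deliberately simplified illustration and not the ECFP or RDKit Morgan algorithm: the molecule is a hand-written graph of the heavy atoms of 1-bromobutane, the environment label is a crude text string, and the hash (SHA-256 modulo the fingerprint size) is an arbitrary choice, so the resulting bit numbers will not match RDKit's.

```python
import hashlib
from collections import deque

# Hand-written heavy-atom graph of 1-bromobutane: Br-C-C-C-C
atoms = ["Br", "C", "C", "C", "C"]
bonds = [(0, 1), (1, 2), (2, 3), (3, 4)]

neighbors = {i: [] for i in range(len(atoms))}
for a, b in bonds:
    neighbors[a].append(b)
    neighbors[b].append(a)


def environment(center, radius):
    """Map each atom within `radius` bonds of `center` to its bond distance."""
    distance = {center: 0}
    queue = deque([center])
    while queue:
        current = queue.popleft()
        if distance[current] == radius:
            continue  # do not grow past the requested radius
        for nxt in neighbors[current]:
            if nxt not in distance:
                distance[nxt] = distance[current] + 1
                queue.append(nxt)
    return distance


def toy_bit(center, radius, fp_size=1024):
    """Hash a crude text label of the environment into a bit position."""
    env = environment(center, radius)
    label = ";".join(sorted(f"{d}:{atoms[i]}" for i, d in env.items()))
    digest = hashlib.sha256(label.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % fp_size


on_bits = set()
for center in range(len(atoms)):
    for radius in range(3):  # radii 0, 1 and 2
        on_bits.add(toy_bit(center, radius))

env_around_atom_2 = sorted(environment(2, 1))
print(env_around_atom_2)  # atoms within one bond of atom 2
print(sorted(on_bits))
```

Because the label depends only on what the environment looks like, two different atoms with identical surroundings produce the same label and therefore set the same bit.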
fluoxetine = Chem.MolFromSmiles('CNCCC(C1=CC=CC=C1)OC2=CC=C(C=C2)C(F)(F)F')
fluoxetine
from rdkit.Chem import rdFingerprintGenerator
mfpgen = rdFingerprintGenerator.GetMorganGenerator(radius=2,fpSize=2048)
fp2 = mfpgen.GetFingerprint(fluoxetine)
bitstring= fp2.ToBitString()
print(bitstring)
print("The fingerprint length is",len(bitstring))
When comparing RDKit’s Morgan fingerprints with ECFP/FCFP fingerprints, it is important to remember that the names of ECFP/FCFP fingerprints are suffixed with the diameter of the atom environments considered, while Morgan fingerprints take a radius parameter (e.g., the argument radius=2 passed to GetMorganGenerator() in the above code cell). The Morgan fingerprint generated above (with a radius of 2) is comparable to the ECFP4 fingerprint (with a diameter of 4).
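The radius/diameter relationship is simple enough to capture in two small helpers. Both function names are hypothetical and exist only to make the arithmetic explicit; they do not call RDKit.

```python
def ecfp_style_name(radius, family="ECFP"):
    """Return the ECFP/FCFP-style name that corresponds to a Morgan radius."""
    if radius < 0:
        raise ValueError("radius must be non-negative")
    return f"{family}{2 * radius}"  # the suffix is the diameter


def morgan_radius(name):
    """Return the Morgan radius for a name such as 'ECFP4' or 'ECFP6'."""
    family, diameter = name[:4].upper(), int(name[4:])
    if family not in ("ECFP", "FCFP") or diameter % 2 != 0:
        raise ValueError(f"unrecognized fingerprint name: {name}")
    return diameter // 2


name_for_radius_2 = ecfp_style_name(2)
radius_for_ecfp6 = morgan_radius("ECFP6")
print(name_for_radius_2, radius_for_ecfp6)  # ECFP4 3
```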
To get a better idea of how these fingerprints are generated, in this next section you will explore 1-bromobutane and 1-chlorobutane.
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
IPythonConsole.ipython_useSVG = True
IPythonConsole.drawOptions.addAtomIndices = True #this will add numbers to the image to help identify carbons later
IPythonConsole.drawOptions.addStereoAnnotation = False
mol = Chem.MolFromSmiles("BrCCCC") # 1-bromobutane
mol
Let’s generate its RDKit Morgan fingerprint (radius = 2). This is comparable to the ECFP4 fingerprint.
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator
# Step 1: Create the molecule from SMILES
mol = Chem.MolFromSmiles('BrCCCC') # 1-bromobutane
#mol = Chem.MolFromSmiles('CNC[C@H](O)c1ccc(O)c(O)c1') #epinephrine
# Step 2: Initialize the Morgan fingerprint generator. We are setting the radius to 2 and total bit size to 1024
morgan_gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=1024)
# Step 3: Prepare the AdditionalOutput object to capture bit information
additional_output = rdFingerprintGenerator.AdditionalOutput()
additional_output.AllocateBitInfoMap()
# Step 4: Generate the fingerprint with additional output
fp1 = morgan_gen.GetFingerprint(mol, additionalOutput=additional_output)
print("The Morgan radius = 2 fingerprint (ECFP4) for the molecule is:")
print(fp1.ToBitString())
print()
# Step 5: Retrieve and display the bit information
bit_info = additional_output.GetBitInfoMap()
for bit_id, atom_radius_list in bit_info.items():
    print(f"Bit {bit_id} is set by:")
    for atom_idx, radius in atom_radius_list:
        atom_symbol = mol.GetAtomWithIdx(atom_idx).GetSymbol()
        print(f" - Atom index {atom_idx} ({atom_symbol}), Radius {radius}")
We can also display an image of the fragment that caused the bit to be equal to 1.
Some notes about rendering:
The molecule fragment is drawn with the atoms in the same positions as in the original molecule.
The central atom is highlighted in blue.
Aromatic atoms are highlighted in yellow
Aliphatic ring atoms are highlighted in dark gray
Atoms/bonds that are drawn in light gray indicate pieces of the structure that influence the atoms’ connectivity invariants but that are not directly part of the fingerprint.
As an example, we can draw the fragment for bit 80 below (change the bit number to draw other fragments from the list of bits printed above).
IPythonConsole.drawOptions.addAtomIndices = False
mfp2_svg = Draw.DrawMorganBit(mol, 80, bit_info, useSVG=True)
mfp2_svg
While the above code can display a single fingerprint fragment, it is more useful to display all fragments simultaneously:
# Create a list of tuples for visualization
tpls = [(mol, bit_id, bit_info) for bit_id in fp1.GetOnBits()]
# Generate legends for each bit
legends = [str(bit_id) for bit_id in fp1.GetOnBits()]
# Visualize the bits
Draw.DrawMorganBits(tpls, molsPerRow=4, legends=legends)
Which bit above represents the bromine with a radius of 0?
Why do three carbons in 1-bromobutane result in an “on” bit for fragment 80?
If you changed 1-bromobutane to 1-chlorobutane, which fragments above (33, 80, 251, 294, 375, 495, 591, 640, 728, 794, 887) would you still expect to have a value of 1? (write this prediction down in the next cell for later)
# Write your prediction: which bits would you still expect to have a value of 1 after changing the molecule from 1-bromobutane to 1-chlorobutane?
Let’s compare 1-bromobutane with 1-chlorobutane to determine whether they have any fragments in common.
IPythonConsole.drawOptions.addAtomIndices = True
mol = Chem.MolFromSmiles('ClCCCC') # 1-chlorobutane
mol
# Step 1: Create the molecule from SMILES
mol = Chem.MolFromSmiles('ClCCCC') # 1-chlorobutane
# Step 2: Initialize the Morgan fingerprint generator. We are setting the radius to 2 and total bit size to 1024
morgan_gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=1024)
# Step 3: Prepare the AdditionalOutput object to capture bit information
additional_output = rdFingerprintGenerator.AdditionalOutput()
additional_output.AllocateBitInfoMap()
# Step 4: Generate the fingerprint with additional output
fp2 = morgan_gen.GetFingerprint(mol, additionalOutput=additional_output)
print("The Morgan radius = 2 fingerprint (ECFP4) for the molecule is:")
print(fp2.ToBitString())
print()
# Step 5: Retrieve and display the bit information
bit_info = additional_output.GetBitInfoMap()
for bit_id, atom_radius_list in bit_info.items():
    print(f"Bit {bit_id} is set by:")
    for atom_idx, radius in atom_radius_list:
        atom_symbol = mol.GetAtomWithIdx(atom_idx).GetSymbol()
        print(f" - Atom index {atom_idx} ({atom_symbol}), Radius {radius}")
IPythonConsole.drawOptions.addAtomIndices = False
# Create a list of tuples for visualization
tpls = [(mol, bit_id, bit_info) for bit_id in fp2.GetOnBits()]
# Generate legends for each bit
legends = [str(bit_id) for bit_id in fp2.GetOnBits()]
# Visualize the bits
Draw.DrawMorganBits(tpls, molsPerRow=4, legends=legends)
Now let’s identify the bits that 1-bromobutane and 1-chlorobutane have in common!
common_bits = set(fp1.GetOnBits()) & set(fp2.GetOnBits())
print(f"Common bits: {sorted(common_bits)}")
Did your prediction from the previous Check Your Understanding question hold true? If not, review the data to make sure you understand which fragments are on in both molecules.
2: Path-Based Fingerprints
Path-based fingerprints are also hashed fingerprints. They are generated by enumerating linear fragments of a given length and hashing them into a fixed-length bitstring. An example is RDKit’s topological fingerprint. As described in the RDKit documentation (https://www.rdkit.org/docs/GettingStartedInPython.html#topological-fingerprints), while this fingerprint can be generated using FingerprintMols.FingerprintMol(), it is recommended to use rdmolops.RDKFingerprint() to generate the fingerprint using non-default parameter values.
The RDKFingerprint() function accepts a number of arguments, including:
mol: the molecule to use.
fpSize: (optional) number of bits in the fingerprint. Defaults to 2048.
minPath: (optional) minimum number of bonds to include in the subgraphs. Defaults to 1.
maxPath: (optional) maximum number of bonds to include in the subgraphs. Defaults to 7.
# path based fingerprints
from rdkit.Chem import rdmolops
mol = Chem.MolFromSmiles("CCOC(=O)N1CCC(=C2C3=C(CCC4=C2N=CC=C4)C=C(C=C3)Cl)CC1") # loratadine
fp = rdmolops.RDKFingerprint(mol, fpSize=2048, minPath=1, maxPath=7).ToBitString()
print(fp)
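To see what enumerating linear fragments between a minimum and maximum number of bonds means, the following pure-Python sketch walks every simple path in a hand-written graph of 1-bromobutane and hashes a text label for each path into a small bit string. It is an illustration only: the labels ignore bond types, the hash (SHA-256 modulo the fingerprint size) is an arbitrary choice, and RDKit's own implementation differs in its details, so the bits will not match RDKFingerprint().

```python
import hashlib

# Hand-written heavy-atom graph of 1-bromobutane: Br-C-C-C-C
atoms = ["Br", "C", "C", "C", "C"]
bonds = [(0, 1), (1, 2), (2, 3), (3, 4)]

neighbors = {i: [] for i in range(len(atoms))}
for a, b in bonds:
    neighbors[a].append(b)
    neighbors[b].append(a)


def linear_paths(min_path=1, max_path=7):
    """Enumerate simple paths containing min_path..max_path bonds."""
    found = set()

    def extend(path):
        n_bonds = len(path) - 1
        if min_path <= n_bonds <= max_path:
            # A path and its reverse are the same fragment; keep one orientation.
            found.add(min(path, path[::-1]))
        if n_bonds < max_path:
            for nxt in neighbors[path[-1]]:
                if nxt not in path:
                    extend(path + (nxt,))

    for start in range(len(atoms)):
        extend((start,))
    return sorted(found)


def toy_path_fingerprint(fp_size=64, min_path=1, max_path=7):
    """Hash each path's element sequence into one position of a bit string."""
    bits = ["0"] * fp_size
    for path in linear_paths(min_path, max_path):
        symbols = [atoms[i] for i in path]
        label = min("-".join(symbols), "-".join(reversed(symbols)))
        digest = hashlib.sha256(label.encode("utf-8")).digest()
        bits[int.from_bytes(digest[:4], "big") % fp_size] = "1"
    return "".join(bits)


all_paths = linear_paths()              # 1 to 7 bonds
short_paths = linear_paths(max_path=2)  # 1 to 2 bonds only
toy_fp = toy_path_fingerprint()
print(len(all_paths), len(short_paths))  # 10 7
print(toy_fp)
```

Lowering max_path reduces the number of fragments that contribute bits, which is why the path-length limits affect how densely the fingerprint is populated.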
Homework
Problem 1: MACCS Keys
For the list of terpene SMILES in the next code cell:
Display the structures
Generate the MACCS Keys for each molecule
Calculate number of “on” bits and “off” bits
terpenes = ['CC(=CCCC(C)(C=C)O)C',   # linalool
            'CC(=CCC/C(=C/CO)/C)C',  # geraniol
            'CC1=CCC(CC1)C(=C)C',    # limonene
            'CC1=CCC(=C(C)C)CC1',    # terpinolene
            'CC1=CCC2CC1C2(C)C']     # alpha-pinene
# write your code here to display the molecules in Problem 1
# write your code here to generate MACCS Keys and off/on bits in Problem 1
Problem 2
For the list of PubChem compound ID numbers in the next code cell:
Use the PUG REST API to obtain their SMILES and PubChem Fingerprints and store them in lists
Convert the PubChem Fingerprints to bitstrings and display them
Display their structures
CIDS = [ 4980, # psilocin
1615, # methylenedioxymethamphetamine
10257, # bufotenin
360252, # 5-bromo-DMT
4076, # mescaline
98527, # 2C-B
5761 ] # LSD
# write your code here to obtain SMILES and PubChem Fingerprints
# write your code here to convert fingerprints from Base64 to bitstring
# write your code here to display structures
Problem 3
For the molecules below, generate the 512-bit-long Morgan Fingerprint comparable to the FCFP6 fingerprint.
Use the PUG REST API to search for the compounds by name and get their SMILES strings.
Generate the molecular fingerprints from the SMILES strings.
Print the generated fingerprints.
synonyms = [ 'diphenhydramine', 'cetirizine', 'fexofenadine', 'loratadine' ]
# Write your code in this cell