Virtual Screening#

This module is under construction

Objectives
  • Perform virtual screening against PubChem using ligand-based approach

  • Apply filters to prioritize virtual screening hit list.

  • Learn how to use pandas’ data frame.

In this notebook, we perform virtual screening against PubChem using a set of known ligands for muscle glycogen phosphorylase. Compound filters will be applied to identify drug-like compounds and unique structures in terms of connectivity SMILES (no stereochemistry encoded) will be selected to remove redundant structures. For some top-ranked compounds in the list, their binding mode will be predicted using machine learning (which will be covered in Module 10: Supervised machine learning).

1. Read known ligands from a file.#

As a starting point, let’s download a set of known ligands against muscle glycogen phosphorylase. These data are obtained from the DUD-E (Directory of Useful Decoys, Enhanced) data sets (http://dude.docking.org/), which contain known actives and inactives for 102 protein targets. The DUD-E sets are widely used in benchmarking studies that compare the performance of different virtual screening approaches (https://doi.org/10.1021/jm300687e).

Go to the DUD-E target page (http://dude.docking.org/targets) and find muscle glycogen phosphorylase (Target Name: PYGM, PDB ID: 1c8k) from the target list. Clicking the target name “PYGM” directs you to the page that lists various files (http://dude.docking.org/targets/pygm). Download file “actives_final.ism”, which contains the SMILES strings of known actives. Rename the file name as “pygm_1c8k_actives.ism”. [Open the file in a text viewer/editor to check the format of this file].

import pandas as pd
colnames = ['smiles','DUD-E ID', 'CHEMBL_ID']
df_act = pd.read_csv("https://dude.docking.org/targets/pygm/actives_final.ism", sep=" ", names=colnames)
df_act.head(5)
print(len(df_act))    # Show how many structures are in the "data frame"

2. Similarity Search against PubChem#

Now, let’s perform similarity search against PubChem using each known active compound as a query. There are a few things to mention in this step:

  • The SMILES were obtained from the DUD-E database. Since there are multiple ways to generate SMILES, PubChem will run an algorithm to convert the SMILES in the query to the SMILES in the PubChem database.

  • The SMILES string is available for each query compound. This string will be used to specify the input structure, so HTTP POST should be used. (Please review lecture02-structure-inputs.ipynb)

  • During PubChem’s similarity search, molecular similarity is evaluated using the PubChem fingerprints and Tanimoto coefficient. By default, similarity search will return compounds with Tanimoto scores of 0.9 or higher. While we will use the default threshold in this practice, it is noteworthy that it is adjustable. If you use a higher threshold (e.g., 0.99), you will get a fewer hits, which are too similar to the query compounds. If you use a lower threshold (e.g., 0.88), you will get more hits, but they will include more false positives.

  • PubChem’s similarity search does not return the similarity scores between the query and hit compounds. Only the hit compound list is returned, which makes it difficult to rank the hit compounds for compound selection. To address this issue, for each hit compound, we compute the number of query compounds that returned that compound as a hit. [Because we are using multiple query compounds for similarity search, it is possible for different query compounds to return the same compound as a hit. That is, the hit compound may be similar to multiple query compounds. The underlying assumption is that hit compounds returned multiple times from different queries are more likely to be active than those returned only once from a single query.]

  • Add “time.sleep()” to avoid overloading PubChem servers and getting blocked.

smiles_act = df_act.smiles.to_list()
# Note: 
# This code block may take a while to run, as it queries PubChem for each structure.
# It took about 2 minutes in testing

import time
import requests

prolog = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

cids_hit = dict()

for idx, mysmiles in enumerate(smiles_act) :
    
    mydata = { 'smiles' : mysmiles }
    
    url = prolog + "/compound/fastsimilarity_2d/smiles/cids/txt"
    res = requests.post(url, data=mydata)

    if ( res.status_code == 200 ) :
        cids = res.text.split()
        cids = [ int(x) for x in cids ]    # Convert CIDs from string to integer.
    else :
        print("Error at", idx, ":", df_act.loc[idx,'CHEMBL_ID'], mysmiles )
        print(res.status_code)
        print(res.content)
    
    for mycid in cids:
        cids_hit[mycid] = cids_hit.get(mycid, 0) + 1
    
    time.sleep(0.2)
len(cids_hit)    # Show the number of compounds returned from any query.

In the above code cells, the returned hits are stored in a dictionary, along with the number of times they are returned. Let’s print the top 10 compounds that are returned the most number of times from the search.

sorted_by_freq = [ (v, k) for k, v in cids_hit.items() ]
sorted_by_freq.sort(reverse=True)

for v, k in enumerate(sorted_by_freq) :

    if v == 10 : 
        break
    
    print(v, k) # Print (frequency, CID)
    

3. Exclude the query compounds from the hits#

In the previous step, we repeated similarity searches using multiple query molecules. This may result in a query molecule being returned as a hit from similarity search using another query molecule. Therefore, we want to check if the hit compound list has any query compounds and if any, we want to remove them. Below, we search PubChem for compounds identical to the query molecules and remove them from the hit compound list.

Note that the identity_type parameter in the PUG-REST request is set to “same_connectivity”, which will return compounds with the same connectivity with the query molecule (ignoring stereochemistry and isotope information). The default for this parameter is “same_stereo_isotope”, which returns compounds with the same stereochemistry AND isotope information.

# This code cell make take some time to run, as it queries PubChem for each structure.
# It took about 1 minute in testing

cids_query = dict()

for idx, mysmiles in enumerate(smiles_act) :
    
    mydata = { 'smiles' : mysmiles }
    url = prolog + "/compound/fastidentity/smiles/cids/txt?identity_type=same_connectivity"
    res = requests.post(url, data=mydata)

    if ( res.status_code == 200 ) :
        cids = res.text.split()
        cids = [ int(x) for x in cids]
    else :
        print("Error at", idx, ":", df_COX2_act.loc[idx,'CHEMBL_ID'], mysmiles )
        print(res.status_code)
        print(res.content)
       
    for mycid in cids:
        cids_query[mycid] = cids_query.get(mycid, 0) + 1
    
    time.sleep(0.2)
len(cids_query.keys())    # Show the number of CIDs that represent the query compounds.

Now remove the query compounds from the hit list (if they are found in the list)

for mycid in cids_query.keys() :
    
    cids_hit.pop(mycid, None)
len(cids_hit)

Print the top 10 compounds in the current hit list and compare them with the old ones.

sorted_by_freq = [ (v, k) for k, v in cids_hit.items() ]
sorted_by_freq.sort(reverse=True)

for v, k in enumerate(sorted_by_freq) :
    
    if v == 10 : 
        break
    
    print(v, k)    # Print (frequency, CID)
Check Your Understanding
  • How many molecules were returned from the original query?

  • How many molecules were returned were the query compounds?

  • Does the number of molecules in len(cids_hits) confirm you have removed the query compounds from the original list?

  • Have the top 10 compounds in hit list changed? Does this confirm you have removed the query compounds from the original list?

4. Filtering out non-drug-like compounds#

We want to identify problematic compounds to exclude from virtual screening. These are molecules that are not absorbed well and may not be good for an orally administered drug.

A team at Pfizer led by Christopher Lipinski evaluated 2245 compounds that reached phase II clinical trials and determined most orally administered drugs are relatively small and moderately lipophilic. This lead to the ”Rule of 5” for “drug-likeness” and helps determine whether a molecule has good solubility and permeability. The team suggests that a good lead follow has characteristics based on the number 5:

  • Molecular weight < 500

  • no more than 5 hydrogen bond donors

  • no more than 10 hydrogen bond acceptors

  • a calculated octanol-water partician coefficent (XLogP) <5

In this step, non-drug-like compounds are filtered out from the list using these criteria. To do that, these four molecular properties are downloaded from PubChem and stored in CSV.

# This code cell may take a while to run, as it queries PubChem for each structure.
# It took about 4 minutes in testing

chunk_size = 100

if ( len(cids_hit) % chunk_size == 0 ) :
    num_chunks = len(cids_hit) // chunk_size 
else :
    num_chunks = len(cids_hit) // chunk_size + 1

cids_list = list(cids_hit.keys())
    
print("# Number of chunks:", num_chunks )

csv = ""   #sets a variable called csv to save the comma separated output

for i in range(num_chunks) :
    
    print(i, end=" ")
    
    idx1 = chunk_size * i
    idx2 = chunk_size * (i + 1)

    cids_str = ",".join([ str(x) for x in cids_list[idx1:idx2] ]) # build pug input for chunks of data
    url = prolog + "/compound/cid/" + cids_str + "/property/HBondDonorCount,HBondAcceptorCount,MolecularWeight,XLogP,ConnectivitySMILES,SMILES/csv"
    
    res = requests.get(url)
    
    if ( i == 0 ) : # if this is the first request, store result in empty csv variable
        csv = res.text 
    else :          # if this is a subsequent request, add the request to the csv variable adding a new line between chunks
        csv = csv + "\n".join(res.text.split()[1:]) + "\n" 
      
    time.sleep(0.2)

#print(csv)

Downloaded data (in CSV) are loaded into a pandas data frame.

from io import StringIO

csv_file = StringIO(csv)

df_raw = pd.read_csv(csv_file, sep=",")

df_raw.shape    # Show the shape (dimension) of the data frame
df_raw.head(5)    # Show the first 5 rows of the data frame

Note that some compounds do not have computed XLogP values (because XLogP algorithm cannot handle inorganic compounds, salts, and mixtures) and we want to remove them.

df_raw.isna().sum()    # Check if there are any NULL values.
len(df_raw)    # Check the number of rows (which is equals to the number of CIDs)

For convenience, add the information contained in the cids_hit dictionary to this data frame

# First load the cids_hit dictionary into a data frame.
df_freq = pd.DataFrame( cids_hit.items(), columns=['CID','HitFreq'])
df_freq.head(5)
# Double-check if the data are loaded correctly
# Compare the data with those from Cell [12]
df_freq.sort_values(by=['HitFreq', 'CID'], ascending=False).head(10)
# Create a new data frame called "df" by joining the df and df_freq data frames
df = df_raw.join(df_freq.set_index('CID'), on='CID')
df.shape
df.sort_values(by=['HitFreq', 'CID'], ascending=False).head(10)

Now identify and remove those compounds that satisfy all criteria of Lipinski’s rule of five.

len(df[ df['HBondDonorCount'] <= 5 ])
len(df[ df['HBondAcceptorCount'] <= 10 ])
len(df[ df['MolecularWeight'] <= 500 ])
len(df[ df['XLogP'] < 5 ])
df = df[ ( df['HBondDonorCount'] <= 5) &
         ( df['HBondAcceptorCount'] <= 10 ) &
         ( df['MolecularWeight'] <= 500 ) &
         ( df['XLogP'] < 5 ) ]
len(df)
df.isna().sum()    # Check if there are any NULL values.
Check Your Understanding
  • How many molecules were in the raw dataset of molecules that had Tanimoto Similarity of 0.9?

  • How many molecules were removed because they didn’t follow the Rule of 5?

  • How were null values treated in the applying the Rule of 5?

5. Draw the structures of the top 10 compounds#

Let’s check the structure of the top 10 compounds in the hit list.

cids_top = df.sort_values(by=['HitFreq', 'CID'], ascending=False).head(10).CID.to_list()
from rdkit import Chem
from rdkit.Chem import Draw

mols = []

for mycid in cids_top :
    
    mysmiles = df[ df.CID==mycid ].SMILES.item()
    
    mol = Chem.MolFromSmiles( mysmiles )
    Chem.FindPotentialStereoBonds(mol)    # Identify potential stereo bonds!
    mols.append(mol)

mylegends = [ "CID " + str(x) for x in cids_top ]
img = Draw.MolsToGridImage(mols, molsPerRow=4, subImgSize=(200,200), legends=mylegends)
display(img)

An important observation from these images is that the hit list contains multiple compounds with the same connectivity. For example, CIDs 93077065 and 93077064 are stereoisomers of each other and CID 53013349 has the same connectivity as the two CIDs, but with its stereocenter being unspecified. When performing a screening with limited resources in the early stage of drug discovery, you may want to test as diverse molecules as possible, avoiding testing too similar structures.

To do so, let’s look into PubChem’s Connectivity SMILES strings, which do not encode the stereochemisry and isotope information. Chemicals with the same connectivity but with different stereochemistry/isotopes should have the same connectivity SMILES. In the next section, we select unique compounds in terms of connectivity SMILES to reduce the number of compounds to screen.

6. Extract unique compounds in terms of connectivity SMILES#

The next few cells show how to get unique values within a column (in this case, unique connectivity SMILES).

# Show the number of rows in the data frame
len(df) 
# Show the number of unique structures in the data frame
len(df.ConnectivitySMILES.unique()) 
# Get unique structures
connectivity_smiles = df.ConnectivitySMILES.unique() 
# Show all the entries for the first unique structure
df[ df.ConnectivitySMILES == connectivity_smiles[0] ] 
# Show all the entries for the second unique structure
df[ df.ConnectivitySMILES == connectivity_smiles[1] ] 
# Show all the stereoisomers for the first unique structure
df[ df.ConnectivitySMILES == connectivity_smiles[0] ].SMILES.to_list() 

Now let’s generate a list of unique compounds in terms of connectivity SMILES. If multiple compounds have the same connectivity SMILES, the one that appears very first will be included in the unique compound list.

idx_to_include = []

for mysmiles in connectivity_smiles :

    myidx = df[ df.ConnectivitySMILES == mysmiles ].index.to_list()[0]
    
    idx_to_include.append( myidx )
len(idx_to_include)
# Create a new column 'Include' 
# All values initialized to 0 (not include)
df['Include'] = 0 
print(df['Include'].sum()) # Show the total number of included compounds, should be 0 now
# Now the "Include" column's value is modified if the record is in the idx_to_include list.
df.loc[idx_to_include,'Include'] = 1 # Set "Include" to 1 for the selected records
print(df['Include'].sum()) # Show the total number of included compounds, should be equal to the number of unique structures
df[['CID','Include']].head(15) # Show the first 15 rows of the "CID" and "Include" columns

Now draw the top 10 unique compounds (in terms of connectivity SMILES). Note the, the structure figures are drawn using isomeric SMILES, but connectivity SMILES strings could be used.

cids_top = df[ df['Include'] == 1].sort_values(by=['HitFreq', 'CID'], ascending=False).head(10).CID.to_list()
mols = []

for mycid in cids_top :
    
    mysmiles = df[ df.CID==mycid ].SMILES.item()
    
    mol = Chem.MolFromSmiles( mysmiles )
    Chem.FindPotentialStereoBonds(mol)    # Identify potential stereo bonds!
    mols.append(mol)

mylegends = [ "CID " + str(x) for x in cids_top ]
img = Draw.MolsToGridImage(mols, molsPerRow=5, subImgSize=(200,200), legends=mylegends)
display(img)
Check Your Understanding
  • Recall from cell [46] that row 15 CID 70155849 is an isomer of row 3. In Cell [55] how do you know this was excluded because it was an isomer?

  • Previously we showed CIDs 93077065 and 93077064 are stereoisomers of each other and CID 53013349 has the same connectivity as the two CIDs, but with its stereocenter being unspecified. All three were in our original display of the top 10 unique molecules. Which of these are in our new list?

7. Saving molecules in files#

Now save the molecules in the cids_top list in files, which will be used in molecular docking experiments. For simplicity, we will use only the top 3 compounds in the list.

from rdkit.Chem import AllChem

for idx, mycid in enumerate( cids_top ) :

    if idx == 3 :
        break
        
    mysmiles = df[ df['CID'] == mycid ].SMILES.item()

    mymol = Chem.MolFromSmiles(mysmiles)
    mymol = Chem.AddHs(mymol)
    AllChem.EmbedMolecule(mymol)
    AllChem.MMFFOptimizeMolecule(mymol)

    filename = "pygm_lig" + str(idx) + "_" + str(mycid) + ".mol"
    Chem.MolToMolFile(mymol, filename)

To save all data in the df data frame (in CSV)

df.to_csv('pygm_df.csv')

Homework

In this activity you used filtered by Lipinski’s Rule of Five(Ro5) and saved those molecules in pygm_df.csv. The Ro5 focuses on drug-likeness helping to identify favorable properties for oral bioavailability which is important in the later stages of the drug development pipeline.

If we are interested in creating a list of molecues that are potentially interesting in the early phases of drug discovery, there is another set of rules that emphasizes “lead-likeness”. Leads are molecules that serve as a starting point modify and create drugs for testing.

Miles Congreve proposed the ”Rule of 3” to create a list of lead like filters rather than drug like filters.

Congreve’s rule of 3

  • The number of hydrogen bond donors ≤3

  • The number of hydrogen bond acceptors ≤ 3

  • Molecular weight < 300

  • XlogP ≤ 3

Step 1#

Load the actives_final.ism for cyclooxygenase-2 (Target Name: PGH2, PDB ID: 3ln1) from the DUD-E database into a pandas DataFrame called df_COX2_act. After loading the data, show the following information:

  • the number of rows of the data frame

  • the first five rows of the data frame

# Write your code here

Step 2#

Perform similarity search using each of the isomeric SMILES contained in the loaded data frame.

  • As we did for PYGM ligands in this notebook, track the number of times a particular hit is returned from multiple queries, using a dictionary named cids_hit (CIDs as keys and the frequencies as values). This information will be used to rank the hit compounds.

  • Make sure that the CIDs are recognized as integers when they are used as keys in the dictionary.

  • Print the total number of hits returned from this step (which is the same as the number of CIDs in cids_hit).

  • Add time.sleep() to avoid overloading PubChem servers.

# Write your code here

Step 3#

The hit list from the above step may contain the query compounds themselves. Get the CIDs of the query compounds through idenitity search and remove them from the hit list.

  • Set the optional parameter “identity_type” to “same_connectivity”.

  • Add time.sleep() to avoid overloading PubChem servers.

  • Print the number of CIDs corresponding to the query compounds.

  • Print the number of the remaining hit compounds, after removing the query compounds from the hit list.

# Write your code here

Step 4#

Download the hydrogen donor and acceptor counts, molecular weights, XlogP, and connectivity and isomeric SMILES for each compound in cids_hit. Load the downloaded data into a new data frame called df_COX2_raw. Print the size (or dimension) of the data frame using .shape.

# Write your code here

Step 5#

Create a new data frame called df_COX2_combined, which combines the data stored in cids_hit and df_COX2_raw.

  • First load the frequency data into a new data frame called df_COX2_freq

  • Join df_COX2_raw and df_COX2_freq into df_COX2_combined

  • Print the shape (dimension) of df_COX2_combined

# Write your code here

Step 6#

Remove from df_COX2_combined all compounds that violate any criterion of Congreve’s rule of 3 and show the number of remaining compounds.

# Write your code here

Step 7#

Get the unique connectivity SMILES strings from the df_COX2_combined. Add to df_COX2_combined a column named ‘Include’, which contains a flag set to 1 for the lowest CID associated with each unique CID and set to 0 for other CIDs. Show the number of compounds for which this flag is set to 1.

# Write your code here

Step 8#

Among those with the “Include” flag set to 1, identify the top 10 compounds that were returned from the largest number of query compounds.

  • Sort the data frame by the number of times returned (in descending order) and then by CID (in ascending order)

  • For each of the 10 compounds, print its CID, isomeric SMILES, and the number of times it was returned.

  • For each of the 10 compounds, draw its structure (using isomeric SMILES).

# Write your code here

Step 9#

Now save the top 3 compounds with CID and SMILES in a file called COX2_congreve.csv.

# Write your code here