(modules/05-Searching-Chemical-Databases/05-02-structure-inputs)=

# 5.1 List Conversions

PUG-REST can be used to retrieve PubChem records related to other PubChem records.  It takes an input list of records in one of the three PubChem databases (Compound, Substance, and BioAssay) and returns a list of related records in the same or a different database.  The kind of relationship between the input and output records may be specified using an optional parameter.  This makes it possible to retrieve, among other things:

- Depositor-provided records (i.e., substances) that are standardized to a given compound.
- Mixture compounds that contain a given component compound.
- Stereoisomers/isotopomers of a given compound.
- Compounds that are tested to be active in a given assay.
- Compounds that have similar structures to a given compound.

## 1. Getting depositor-provided records for a given compound

First let's import the requests package necessary to make a web service request.

In [1]:
import requests

The code snippet below retrieves the substance record associated with a given CID (CID 129825914).

In [2]:
prolog    = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

pr_input  = "compound/cid/129825914"
pr_oper   = "sids"
pr_output = "txt"
url       = prolog + '/' + pr_input + '/' + pr_oper + '/' + pr_output

res = requests.get(url)
print(res.text)

341669951



It is also possible to provide a comma-separated list of CIDs as input identifiers.

In [3]:
pugin   = "compound/cid/129825914,129742624,129783988"
pugoper = "sids"
pugout  = "txt"
url     = prolog + '/' + pugin + '/' + pugoper + '/' + pugout

res = requests.get(url)
print(res.text)

341669951
341492923
341577059
368769438



In the example above, the input list has three CIDs, but the PUG-REST request returned four SIDs.  This means that at least one CID is associated with multiple SIDs, but it is hard to tell which one.  Therefore, we want the SIDs grouped by the corresponding CIDs.  This can be done using the optional parameter "__list_return=grouped__" and changing the output format to __json__.

In [4]:
pugin   = "compound/cid/129825914,129742624,129783988"
pugoper = "sids"
pugout  = "json"
pugopt  = "list_return=grouped"
url     = prolog + '/' + pugin + '/' + pugoper + '/' + pugout + "?" + pugopt

res = requests.get(url)
print(res.text)

{
  "InformationList": {
    "Information": [
      {
        "CID": 129825914,
        "SID": [
          341669951
        ]
      },
      {
        "CID": 129742624,
        "SID": [
          341492923
        ]
      },
      {
        "CID": 129783988,
        "SID": [
          341577059,
          368769438
        ]
      }
    ]
  }
}



Note that the __json__ output format is used in the above request.  The "__txt__" output format in PUG-REST returns data in a single column, so it cannot represent the grouped result of the above request.
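
Once the grouped JSON is parsed, it is easy to see which CID maps to which SIDs.  The sketch below assumes the response follows the `InformationList` / `Information` layout shown above.  The helper name `group_sids_by_cid` is our own, and the sample text is copied from the output above rather than fetched live.

```python
import json

# Sample response text, copied from the grouped output shown above.
sample_text = """
{
  "InformationList": {
    "Information": [
      {"CID": 129825914, "SID": [341669951]},
      {"CID": 129742624, "SID": [341492923]},
      {"CID": 129783988, "SID": [341577059, 368769438]}
    ]
  }
}
"""

def group_sids_by_cid(json_text):
    """Return a dict mapping each CID to its list of SIDs (hypothetical helper)."""
    data = json.loads(json_text)
    records = data["InformationList"]["Information"]
    # Assumption: a record might lack the "SID" key, so fall back to an empty list.
    return {rec["CID"]: rec.get("SID", []) for rec in records}

sid_map = group_sids_by_cid(sample_text)
for cid, sids in sid_map.items():
    print(cid, len(sids), sids)
```

With a live request, `res.json()` from the requests package could replace the `json.loads()` call.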

If you want output records to be "flattened", rather than being grouped by the input identifiers, use "**list_return=flat**".

In [5]:
pugopt  = "list_return=flat"
url     = prolog + '/' + pugin + '/' + pugoper + '/' + pugout + "?" + pugopt

res = requests.get(url)
print(res.text)

{
  "IdentifierList": {
    "SID": [
      341492923,
      341577059,
      341669951,
      368769438
    ]
  }
}



The default value for the "list_return" parameter is: 
- "flat" when the output format is TXT 
- "grouped" when the output format is JSON and XML
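
Because the default depends on the output format, it may be safer to state "list_return" explicitly in every request.  The following is a minimal sketch of a URL builder that does this.  The function name `build_pug_url` and its arguments are illustrative assumptions, not part of PUG-REST, and no request is actually sent.

```python
from urllib.parse import urlencode

PROLOG = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def build_pug_url(pug_input, operation, output, **options):
    """Join the PUG-REST URL parts; keyword arguments become the query string.

    Hypothetical helper for illustration only.
    """
    url = "/".join([PROLOG, pug_input, operation, output])
    if options:
        url += "?" + urlencode(options)
    return url

cid_input = "compound/cid/129825914,129742624,129783988"

# No options: the server applies its default for the chosen output format.
print(build_pug_url(cid_input, "sids", "txt"))

# Explicit option: the intended behavior is visible in the URL itself.
print(build_pug_url(cid_input, "sids", "json", list_return="flat"))
```

The resulting string can be passed to `requests.get()` in the same way as the hand-built URLs above.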

It is also possible to specify the input list **implicitly**, rather than providing the input identifiers explicitly.  For example, the following example uses a chemical name to specify the input list.

In [6]:
# Input CIDs are provided using a chemical name
url = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/lactose/cids/txt'
res = requests.get(url)
cids = res.text.split()
print("# CIDs returned:", len(cids))
print(",".join(cids))

# Input CIDs are provided using the name, then converted to SIDs.
url = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/lactose/sids/txt'
res = requests.get(url)
sids = res.text.split()
print("# SIDs returned (method 1):", len(sids))
#print(",".join(sids))

# Input *SIDs* are provided using the name and returned as they are (no conversion).
url = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/substance/name/lactose/sids/txt'
res = requests.get(url)
sids = res.text.split()
print("# SIDs returned (method 2):", len(sids))
#print(",".join(sids))

# CIDs returned: 2
6134,440995
# SIDs returned (method 1): 225
# SIDs returned (method 2): 166


The above example illustrates how the list conversion works.  
- In the first request, the name "lactose" is searched for against the Compound database and the resulting 2 CIDs are returned.
- If you change the operation part from "cids" to "sids" (as in the second request), the same name search is done first against the __Compound__ database, followed by the list conversion from the resulting 2 CIDs to the 225 associated SIDs.
- In the third request, the name search is performed against the __Substance__ database, and the resulting 166 SIDs are returned.


**Exercise 1a** Statins are a class of drugs that lower cholesterol levels in the blood.  Retrieve in **JSON** the substance records associated with the compounds whose names contain the string "statin". 

- Make only one PUG-REST request.
- For partial name matching, set the *name_type* parameter to "word" (See the PUG-REST document for an example). 
- Group the substances by the corresponding compound records.
- Print the json output using print()

In [None]:
# Write your code in this cell.




## 2. Getting mixture/component molecules for a given molecule

The list interconversion may be used to retrieve mixtures that contain a given molecule as a component.  To do this, the input molecule should be a single-component compound (that is, with only one covalently-bound unit), and the optional parameter "**cids_type=component**" should be provided.

In [7]:
prolog    = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

url = prolog + "/compound/name/tylenol/cids/txt?cids_type=component"
res = requests.get(url)
cids = res.text.split()
print(len(cids))
print( cids )

671
['176477247', '176079080', '176022543', '175984358', '175840151', '175563451', '175558915', '175558910', '175474320', '175457622', '175455693', '175423291', '175422481', '175384108', '175313751', '175228627', '175190561', '175190560', '175161737', '175056836', '175056834', '175043516', '175016706', '174943509', '174943239', '174935408', '174932963', '174928335', '174910732', '174885752', '174868651', '174828309', '174818171', '174813759', '174809979', '174809741', '174806838', '174802486', '174802484', '174798608', '174798294', '174795810', '174792216', '174790751', '174779388', '174756506', '174745631', '174732680', '174729216', '174728309', '174726576', '174712054', '174666827', '174633154', '174630777', '174619777', '174617442', '174568247', '174567984', '174538403', '174460833', '174444355', '174436784', '174407973', '174353959', '174059328', '173791507', '173791485', '173779165', '173682868', '173682864', '173682863', '173682862', '173682860', '173682859', '173682858', '173682

Note that if the input molecule is a multi-component compound, the option "**cids_type=component**" returns the components of that compound.  For example, the following code gets all components of the first molecule in the "cids" list generated in the previous example.

In [8]:
url = prolog + "/compound/cid/" + cids[0] + "/cids/txt?cids_type=component"
res = requests.get(url)
component_cids = res.text.split()
print( "CID:", cids[0])
print( "Number of Components", len(component_cids))
print( component_cids )

CID: 176477247
Number of Components 3
['5462222', '1983', '1118']


**Exercise 2a:** Many over-the-counter drugs contain more than one active ingredient.  In this exercise, we want to find component molecules that occur together with three common painkillers (aspirin, tylenol, advil) in a mixture.

__Step 1.__ Define a list that contains three drug names (aspirin, tylenol, advil).

In [None]:
# Write your code in this cell.




__Step 2.__ Using a for loop, retrieve PubChem CIDs corresponding to the three drugs and store them in a new list.  In order not to overload the PubChem servers, pause the program for 0.2 seconds in each iteration of the for loop (using sleep()).

In [None]:
# Write your code in this cell.




__Step 3.__ Using another for loop, do the following things for each drug:
- Get the PubChem CIDs of the mixture compounds that contain each drug and store them in a list.
- Get the PubChem CIDs of the components that occur in any of the returned mixtures, by setting the "list_return" parameter to "flat".  This can be done with a single request.  
- Print all the components.
- Pause the code for 0.2 seconds using sleep() each time a PUG-REST request is made.

In [None]:
# Write your code in this cell.




## 3. Getting compounds tested in a given assay

PUG-REST may be used to retrieve compounds tested in a given assay.  For example, the following code cell shows how to get all compounds tested in AID 1207599.

In [9]:
url = prolog + "/assay/aid/" + "1207599" + "/cids/txt"
res = requests.get(url)
cids = res.text.split()
print(len(cids))
print(cids)

791
['6175', '6197', '8547', '10219', '14169', '17558', '21389', '68050', '84677', '95783', '95996', '142779', '177894', '180548', '182792', '241056', '253602', '302770', '348623', '379338', '408190', '427456', '453048', '456183', '458959', '463795', '467892', '467895', '467898', '467900', '467902', '468692', '493035', '540335', '615754', '628093', '653020', '658095', '659146', '659572', '660337', '660996', '661700', '664853', '665381', '670727', '678644', '679624', '684193', '686636', '692799', '696459', '697239', '701785', '705510', '709466', '711950', '718105', '722343', '726776', '728907', '732311', '742641', '745456', '746602', '759319', '763219', '780973', '783532', '787413', '787416', '805487', '807557', '819039', '819041', '826058', '826108', '826140', '865238', '866779', '871153', '876820', '879749', '899915', '929152', '933766', '934186', '935739', '939076', '940283', '945743', '951335', '951809', '962627', '972880', '973099', '991453', '1000261', '1036940', '1042562', '10466

If you are interested in only the compounds that are tested "active" in a given assay, set the "**cids_type**" parameter to "**active**", as shown in the code below.

In [10]:
url = prolog + "/assay/aid/" + "1207599" + "/cids/txt?cids_type=active"
res = requests.get(url)
cids = res.text.split()
print(len(cids))
print(cids)

393
['6197', '10219', '14169', '17558', '68050', '177894', '182792', '253602', '348623', '453048', '456183', '458959', '463795', '467895', '467898', '467900', '540335', '697239', '701785', '742641', '745456', '807557', '826140', '972880', '973099', '1092462', '1104215', '1104245', '1187199', '1272562', '1330474', '1507416', '1591101', '1929483', '1931935', '2226126', '2229100', '2454286', '2788193', '2826655', '2840340', '2840651', '2871881', '2876588', '2877655', '2897031', '2923731', '2946841', '3010592', '3020289', '3098392', '3114195', '3304735', '3351585', '3732278', '4524296', '4827679', '4970781', '5065884', '5311382', '5322214', '5322341', '5328733', '6404647', '6603435', '7086352', '7292609', '7292627', '7292667', '7292689', '7294801', '7294819', '9549410', '9549480', '9802843', '10066728', '10173796', '10215271', '10237991', '10432767', '11237028', '11534555', '11953179', '13751046', '16012811', '16032335', '16192614', '16192765', '16193792', '17325420', '17388866', '18566671
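
Comparing the two lists shows which compounds were tested but not flagged as active.  The sketch below uses only the first few CIDs copied from the two outputs above, so the numbers are illustrative rather than the full result (393 of the 791 tested compounds are active, roughly half).  Keep in mind that "not active" is broader than "inactive", since PubChem also records outcomes such as inconclusive or unspecified.

```python
# First few CIDs copied from the two outputs above (all tested vs. active only).
tested_cids = ['6175', '6197', '8547', '10219', '14169', '17558']
active_cids = ['6197', '10219', '14169', '17558']

# A set gives fast membership tests.
active_set = set(active_cids)

# Filter the tested list, preserving its original order.
not_active = [cid for cid in tested_cids if cid not in active_set]

print("Tested but not active:", not_active)
print("Fraction active: {:.2f}".format(len(active_cids) / len(tested_cids)))
```

The same comparison applies unchanged to the full `cids` lists returned by the two requests, as long as they are stored under different variable names.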

It is also possible to specify the input assay list implicitly.  For example, the following code cell retrieves compounds tested in any assay targeting human carbonic anhydrase 2 (CA2), whose accession number is P00918.

In [11]:
url = prolog + "/assay/target/accession/" + "P00918" + "/cids/txt"
res = requests.get(url)
cids = res.text.split()
print(len(cids))
#print(cids)

30613


**Exercise 3a:** Find compounds that are tested to be active against human acetylcholinesterase (accession: P22303) and retrieve SMILES strings for those compounds.<br>  
- Split the CID list into smaller chunks (with a chunk size of 100).
- Print the retrieved data in a CSV format (CID and SMILES strings in the first and second columns, respectively).

In [None]:
# Write your code in this cell.


