Exploratory Data Analysis 2:
Chemical Modulation of Renin-Angiotensin-Aldosterone system (RAAS)#
A biological pathway is a series of interactions among molecules (e.g., chemicals, genes, and proteins) in a cell that leads to a certain product or a change in a cell.
In this notebook, we will learn how to access information for a given pathway, with the human Renin-Angiotensin-Aldosterone System (RAAS) as an example. In PubChem, there are multiple records for the RAAS pathway, derived from different data sources. Among them, we will use the RAAS record from WikiPathways (WikiPathways ID: WP4756). The summary page of this record can be accessed at the following URL:
https://pubchem.ncbi.nlm.nih.gov/pathway/WikiPathways:WP4756
Among various annotations available on this page, we will use the list of genes downloaded from the Genes section:
https://pubchem.ncbi.nlm.nih.gov/pathway/WikiPathways:WP4756#section=Genes
The downloaded data file is already stored in the current directory.
file_pathway_genes = "pubchem_pathwayid_1295579_pcget_pathway_gene.csv"
Let’s import necessary libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
import time
import json
from IPython.display import HTML
Loading the gene list into a data frame.#
First, create a data frame containing the list of genes involved in the RAAS pathway.
df = pd.read_csv(file_pathway_genes)
df
To quickly access information on the individual genes, we want to add the URLs to their summary pages.
df["url"] = "https://pubchem.ncbi.nlm.nih.gov/gene/" + df['geneid'].astype(str)
df
In the output of the above code cell, the URLs are not clickable. To make them clickable, display the data frame as an HTML page.
HTML(df.to_html(render_links=True))
Click the URL for each gene and read the “Description” in the box presented at the top of the summary page to learn about the gene.
Getting chemicals tested against the genes#
It is noteworthy that the above gene summary pages have the bioactivity data for chemicals tested against the genes. Therefore, you can find chemical modulators of a given pathway, by looking into the bioactivity of the chemicals against the genes involved in that pathway. In this section, we will get the bioactivity data for the genes involved in the RAAS pathway programmatically through PUG-REST.
Getting chemicals tested against a single gene.#
url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/gene/geneid/183/concise/CSV"
datatype = { "CID":"Int64", "RNAi":"Int64" }
df_gene_183 = pd.read_csv(url, dtype=datatype)
df_gene_183.head()
df_gene_183.shape # Checking the dimension of the data frame
The downloaded data set does not have the target gene information. If we download the bioactivity data for multiple genes and merge them into a single data frame, it may be inconveninent to find which gene a given row is for. Therefore, we want to add the gene information to the data frame.
df_gene_183.insert(loc=0, column='Gene_ID', value='183')
df_gene_183.insert(loc=1, column='Gene_Symbol', value='AGT')
df_gene_183.insert(loc=2, column='Gene_Name', value='angiotensinogen')
df_gene_183.head()
Getting bioactivity data for all genes#
Now generate a single data frame containing the bioactivity data for all seven genes involved in the RAAS pathway.
First, create an empty list called
list_dfs
.Use a
for
loop to get the bioactivity data for the seven genes.To get the gene IDs, loop over the rows of the
df
data frame, usingiterrows()
.For each gene, add new columns called
Gene_ID
,Gene_Symbol
, andGene_Name
, containing relevant information.For each loop, add the generated data frame to the
list_df
list.At the end of each loop, add a
time.sleep(0.5)
command, not to overload the PubChem server.
After adding the bioactivity data for all genes to the
list_df
list, create a single data frame by usingpd.concat
fromlist_dfs
# Write your own code here
list_dfs = []
for index, row in df.iterrows() :
url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/gene/geneid/" + str(row['geneid']) + "/concise/CSV"
print("Getting data for GeneID ", row['geneid'], "from ", url)
df_tmp = pd.read_csv(url, dtype=datatype)
df_tmp.insert(0, column='Gene_ID', value=row['geneid'])
df_tmp.insert(1, column='Gene_Symbol', value=row['genesymbol'])
df_tmp.insert(2, column='Gene_Name', value=row['genename'])
list_dfs.append( df_tmp )
time.sleep(0.5)
df_bioactivity = pd.concat(list_dfs, axis=0)
df_bioactivity.head()
df_bioactivity.shape
Getting Descriptive Statistics of the data#
The BioAssay records in PubChem can be classified into two groups: small molecule assays and RNAi screenings:
small molecule assay: an assay where small molecules are tested against a target macromolecule (e.g., protein).
RNAi screening: a high-throughput experimental approach used to systematically identify genes that regulate a specific cellular process or phenotype by silencing them using RNA interference (RNAi).
In this lesson, we are intrested in small molecule assays, so we want to remove RNAi screening data. They are flagged in the RNAi column of the data frame.
df_bioactivity[ df_bioactivity['RNAi'] == 1 ]
indices_rnai = df_bioactivity[df_bioactivity['RNAi'] == 1].index
df_activity_small = df_bioactivity.drop(indices_rnai)
df_activity_small.shape
Check the number of bioactivity data points for each gene.
df_activity_small.groupby(by='Gene_Symbol').size()
Visualize the above count information, using the seaborn countplot()
. Add the data label for each bar.
# Write your own code.
ax = sns.countplot(data=df_activity_small, x='Gene_Symbol')
for container in ax.containers:
ax.bar_label(container)
In the above count plot, we see that CTSG has the most data points (more than 62 thousands). Let’s look into this graph further. First, group the data points by Gene_symbol
and Activity Outcome
(using .groupby()
) and displays the number of data points for each combination of Gene_Symbol
and Activity Outcome
(using size()
).
# Write your own code
df_activity_small.groupby(by=['Gene_Symbol', 'Activity Outcome']).size()
Generate a count plot that shows the number of data points grouped by Gene_Symbol
and Activity_Outcome
.
Add the data labels to individual bars.
Make the bars horizontal (not vertical).
Use the logarithmic scale to make small bars easier-to-recognize.
# Write your own code
#ax = sns.countplot(data=df_activity_small, x='Gene_Symbol', hue='Activity Outcome')
ax = sns.countplot(data=df_activity_small, y='Gene_Symbol', hue='Activity Outcome')
for container in ax.containers:
ax.bar_label(container, fontsize=8)
ax.set_xscale('log')
The majority of data points for CTSG are declared “inactive”. What we observe here implies that there may be a high-throughput screening (HTS) assay that tested a large number of chemicals against CTSG, because a HTS experiment usally results in a small number of active chemicals with a large number of inactive chemicals.
Find the assay with the most data points tested against CTSG, using groupby()
and size()
.
df_activity_small[ df_activity_small['Gene_Symbol'] == 'CTSG' ].groupby('AID').size()
For the assay identified in the above code cell, show the number of data points for each activity outcome. Use groupby()
and size()
. Alternatively, you can use value_counts()
.
#df_activity_small[ df_activity_small['AID'] == 581 ]['Activity Outcome'].value_counts()
df_activity_small[ df_activity_small['AID'] == 581 ].groupby('Activity Outcome').size()
Quantitative Activity#
Now let’s look into what kind of activity values are in the df_activity_small
data frame.
df_activity_small.groupby(by=['Activity Name'], dropna=True).size().sort_values(ascending=False)
The most common activity types in the data set is IC50
, followed by Ki
. We will focus on these two activity types in the remaining part of this notebook.
It is also worth mentioning that, if you use dropna=False
(rather than dropna=True
), you will see that more than 60 thousand data points do not have activity names. Most of these data points without activity names correspond to the 62+ thousand inactive chemicals from the previously identified HTS assay.
Let’s convert the activity values into a negative log scale.
df_activity_small['p_Activity_Value'] = (-1) * np.log10( df_activity_small['Activity Value [uM]'] * 10**(-6) )
Note that some activity data in the data frame are already in a log scale (e.g., log(1/C), pIC50, pKb, pKi). In this notebook, we will ignore them because we are looking into IC50 and Ki values. If we were to look into all activity types (including those already in a log scale), we would need to a different (and more complex) code.
IC50#
Now let’s look into IC50 values.
df_activity_ic50 = df_activity_small[(df_activity_small['Activity Name'] == "IC50") &
(df_activity_small['Activity Value [uM]'].notnull()) &
((df_activity_small['Activity Qualifier'] == "=") | (df_activity_small['Activity Qualifier'].isnull() ))]
The last part of the above code deals with the activity qualifiers (e.g., =
, <
, >
, >=
, <=
). These qualifiers are crucial for correctly interpreting bioactivity data. For example, a compound with an IC50 of 1 μM has a different binding affinity from a compound with an IC50 < 1 μM or IC50 > 1 μM. However, without the qualifiers, they would be all indistinguishable, potentially leading to misinterpretation. These qualifiers are provided by the assay data depositors. When a data point does not have a qualifier, it is assumed to have the =
qualifier.
Because we want to generate a chart for the activity values, we use only those values with the =
qualifiers (and those with the =
qualifers omitted), ignoring those with other qualifiers.
df_activity_ic50.shape
Generate box plots that show the distribution of pIC50 values for each gene.
sns.boxplot(data=df_activity_ic50, x='p_Activity_Value', hue='Gene_Symbol')
#sns.catplot(data=df_activity_ic50, x='p_Activity_Value', y='Gene_Symbol', kind="box") # Alternative ways to create a chart similar to the agove one.
Generate a multipanel figure containing the histogram of pIC50 values for each gene, using seaborn’s FacetGrid()
.
g = sns.FacetGrid(df_activity_ic50, col="Gene_Symbol", height=2.5, col_wrap=4)
g.map(sns.histplot, "p_Activity_Value")
Ki#
Now repeat the analysis, this time, for the Ki values. First, create a new data frame called df_activity_ki
from df_activity_small
. Make sure to use only the activity values with the =
qualifiers and those with the =
qualifers omitted, excluding those with other qualifiers.
# Write your own code
df_activity_ki = df_activity_small[(df_activity_small['Activity Name'] == "Ki") &
(df_activity_small['Activity Value [uM]'].notnull()) &
((df_activity_small['Activity Qualifier'] == "=") | (df_activity_small['Activity Qualifier'].isnull() ))]
Check the dimension of the resulting data frame.
df_activity_ki.shape
Generate box plots that show the distribution of pKi values for each gene.
sns.boxplot(data=df_activity_ki, x='p_Activity_Value', hue="Gene_Symbol")
#sns.histplot(data=df_activity_ki, x='p_Activity_Value', hue="Gene_Symbol")
Generate a multipanel figure containing the histogram of pKi values for each gene, using seaborn’s FacetGrid()
.
g = sns.FacetGrid(df_activity_ki, col="Gene_Symbol", height=2.5, col_wrap=4)
g.map(sns.histplot, "p_Activity_Value")
Identifying known drugs from the tested compounds#
Now let’s see if any of the tested chemicals in the data set are known drug compounds, and if yes, what they are used for. To do that, we need a list of known drugs, which will be used to compare with the chemicals in the data set. There are multiple ways to get a known drug list with their indications, but in this notebook, we are going to use the list of compounds annotated with the terms in the World Health Organization (WHO) Anatomical Therapeutic Chemical (ATC) classification (https://www.who.int/tools/atc-ddd-toolkit/atc-classification).
Getting one page of the annotation data#
First, let’s get the WHO ATC annotation data from PubChem using PUG-View. Then, load the downloaded json into a dictionary called dict_drugs
url="https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/annotations/heading/JSON/?source=WHO%20Anatomical%20Therapeutic%20Chemical%20(ATC)%20Classification&heading_type=Compound&heading=ATC%20Code&page=1"
res = requests.get(url)
dict_drugs = json.loads(res.text)
#dict_drugs # Uncomment to see the content of the downloaded json
PUG-View supports downloading up to 1,000 annotations at a time. To download a data set with more than 1,000 annotations, the result will be paginated and the PUG-View request will return data for only one specified page (page 1 by default).
Therefore, whenever you download annotations through PUG-View, you should check the total number of pages for the annotatnion data.
dict_drugs['Annotations'].keys()
total_pages = dict_drugs['Annotations']['TotalPages']
print("Current page:", dict_drugs['Annotations']['Page'])
print("Total number of pages:", dict_drugs['Annotations']['TotalPages'])
In the above PUG-View request, we downloaded the first page of the annotation data (out of six pages). Therefore, we should make six PUG-View requests to get drug information.
Extracting the ATC code from the annotation data#
The downloaded data in json format was loaded into a (somewhat complicated) dictionary. From this dictionary, we want to extract necessary information (e.g.,drug name, drug classification, and PubChem Compound ID (CID)) and organize it in a tabular form (e.g., data frame). The following code cells show how this task can be done.
#dict_drugs # Uncomment if you want to check the structure of the downloaded data.
list_annotations = dict_drugs['Annotations']['Annotation']
count = 0
drug_atc_code = []
for annotation in list_annotations :
anid = annotation.get("ANID") # Unique ID for the annotation
atc_code = annotation.get("SourceID") # ATC code
drug_name = annotation.get("Name")
cids = []
value_strings = []
# Get CIDs for the drug. (Note: a drug can be mapped to multiple CIDs)
linked_records = annotation.get('LinkedRecords')
if linked_records :
cids = linked_records.get('CID') # cids : list of CIDs
if len(cids) > 1:
print("# Warning! more than one CIDs:", anid, atc_code, drug_name, cids)
# Get annotation data. (Note: there may be multiple items in this part).
data_tree_level0 = annotation.get('Data')
if data_tree_level0 :
for node_level0 in data_tree_level0:
value_strings_with_markup = node_level0.get("Value").get('StringWithMarkup')
for value_string_with_markup in value_strings_with_markup:
value_strings.append(value_string_with_markup.get('String'))
#print(, mySourceID, myName, cids, value_string)
# Some rows do not have CIDs. For the CID columns of those rows, add an empty string as a spaceholder.
if not cids:
cids.append("")
for cid in cids:
for index, value_string in enumerate(value_strings):
drug_atc_code.append([anid, cid, drug_name, atc_code, index, value_string])
# if count == 10 :
# break
count = count + 1
The above code cell stores the drug information in a list of lists. You can view this list of lists as a 2-D matrix, where rows correspond to individaul drugs and columns correspond to piece of information for each drug. Let’s load df_drug_atc
into a data frame.
column_names = ['ANID', 'CID', 'Drug_Name', 'ATC_Code', 'ATC_Level', 'ATC_Description']
df_drug_atc_tmp = pd.DataFrame(drug_atc_code, columns=column_names)
df_drug_atc_tmp.head(10)
Getting all pages of the annotation data.#
The above code cells demonstrate how to download the first page of the WHO-ATC annotation data, extract the desired drug information, and organize it in a dataframe. Now write a code that repeat this process to get all six pages of the WHO-ATC annotation data.
# Write your own code here
drug_atc_code = []
for page in range(1, total_pages + 1 ):
print("Getting page " + str(page) +" of " + str(total_pages))
url="https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/annotations/heading/JSON/?source=WHO%20Anatomical%20Therapeutic%20Chemical%20(ATC)%20Classification&heading_type=Compound&heading=ATC%20Code&page=" + str(page)
res = requests.get(url)
dict_drugs = json.loads(res.text)
time.sleep(0.5)
count = 0
list_annotations = dict_drugs['Annotations']['Annotation']
for annotation in list_annotations :
anid = annotation.get("ANID") # Unique ID for the annotation
atc_code = annotation.get("SourceID") # ATC code
drug_name = annotation.get("Name")
cids = []
value_strings = []
# Get CIDs for the drug. (Note: a drug can be mapped to multiple CIDs)
linked_records = annotation.get('LinkedRecords')
if linked_records :
cids = linked_records.get('CID') # cids : list of CIDs
if len(cids) > 1:
print("# Warning! more than one CIDs:", anid, atc_code, drug_name, cids)
# Get annotation data. (Note: there may be multiple items in this part).
data_tree_level0 = annotation.get('Data')
if data_tree_level0 :
for node_level0 in data_tree_level0:
value_strings_with_markup = node_level0.get("Value").get('StringWithMarkup')
for value_string_with_markup in value_strings_with_markup:
value_strings.append(value_string_with_markup.get('String'))
#print(, mySourceID, myName, cids, value_string)
# Some rows do not have CIDs. For the CID columns of those rows, add an empty string as a spaceholder.
if not cids:
cids.append("")
for cid in cids:
for index, value_string in enumerate(value_strings):
drug_atc_code.append([anid, cid, drug_name, atc_code, index, value_string])
# if count == 10 :
# break
# count = count + 1
column_names = ['ANID', 'CID', 'Drug_Name', 'ATC_Code', 'ATC_Level', 'ATC_Description']
df_drug_atc_tmp = pd.DataFrame(drug_atc_code, columns=column_names)
df_drug_atc_tmp.head(10)
df_drug_atc_tmp.shape # Check the shape of the data frame
df_drug_atc_tmp['Drug_Name'].nunique() # Check the number of unique drugs
df_drug_atc_tmp['CID'].nunique() # Check the number of unique CIDs.
As explained in the WHO-ATC classification web page, the WHO-ATC classifiction has a hierarchical structure with five levels. For this work, let’s use the middle-level nodes to classify the drugs.
df_drug_atc_lv2 = df_drug_atc_tmp[df_drug_atc_tmp['ATC_Level'] == 2 ]
The resulting data frame (df_drug_atc_lv2
) will be compared with the data frame containing the compounds tested against the genes involved in the RAAS pathway.
Appending drug information to the bioactivity data frame.#
Now create a new data frame by merging df_activity_small
with df_drug_atc_lv2
.
df_activity_small_lv2 = df_activity_small.merge(df_drug_atc_lv2, how='left', on='CID')
Check the first few rows of the new data set.
df_activity_small_lv2.head()
Let’s check how many chemicals in the data set are drugs.
df_activity_small_lv2['ATC_Code'].notnull().sum()
There are more than 700 rows annotated with WHO-ATC codes. However, the actual number of known drugs in the data frame may be smaller because they can be appear multiple times in the df_activity_small_lv2
data frame. It is worth noting that a chemical can be tested multiple times in different samples and different assays. So, we want to check the unique number of CIDs in the data frame to get the number of tested compounds that are known drugs.
df_activity_small_lv2[ df_activity_small_lv2['ATC_Code'].notnull() ]['CID'].nunique()
The output of the above cell indicates that more than 275 drugs were tested against any of the genes involved in the RAAS pathway. However, we don’t know how many of these known drugs actually show decent activity against the genes involved in the RAAS pathway. While some drugs may target the genes involved in the RASS pathway, others may not. Therefore, we want to look into those with strong activity against the target genes involved in the RAAS pathway.
Identifying known drugs from submicromolar activity#
Now create a new data frame called df_submicro_drug_lv2
, by selecting only the rows with submicromolar activities AND ATC_Code annotations from the df_activity_small_lv2
data frame.
df_submicro_drug_lv2 = df_activity_small_lv2[ (df_activity_small_lv2['Activity Value [uM]'] < 1) & (df_activity_small_lv2['ATC_Code'].notnull()) ]
df_submicro_drug_lv2.shape
Show the number of unique CIDs in df_submicro_drug_lv2
df_submicro_drug_lv2['CID'].nunique()
Show the list of unique CIDs, along with their drug names, ATC code, ATC Description, and the number of data points.
df_submicro_drug_lv2[['CID', 'Drug_Name', 'ATC_Code', 'ATC_Description']].value_counts()
Get the unique ATC_Descriptions and the number of data points associated with them, using value_counts()
df_submicro_drug_lv2['ATC_Description'].value_counts()
Now let’s look into a few drugs in the list.
df_losartan = df_submicro_drug_lv2[ df_submicro_drug_lv2['Drug_Name'] == 'Losartan' ]
df_losartan.head()
df_losartan['Gene_Symbol'].value_counts()
df_losartan['Activity Name'].value_counts()
Generate boxplots that compare losartan’s activity against ATGR1 and ATGR2 in terms of IC50 and Kd
fig, axes = plt.subplots(1, 2, figsize=(8,4), sharey=True)
sns.boxplot(data=df_losartan[df_losartan['Activity Name'] == 'IC50'], x='Gene_Symbol', y='p_Activity_Value', hue='Gene_Symbol', fill=False, ax=axes[0])
sns.boxplot(data=df_losartan[df_losartan['Activity Name'] == 'Kd'], x='Gene_Symbol', y='p_Activity_Value', hue='Gene_Symbol', fill=False, ax=axes[1])
axes[0].set_title='IC50'
axes[1].set_title='Kd'
Now let’s look into another drug Captopril. Create a new dataframe, called df_captopril
, containing the rows for drug ‘Captopril’.
# Write your own code here
df_captopril = df_submicro_drug_lv2[ df_submicro_drug_lv2['Drug_Name'] == 'Captopril' ]
Check the number of data points for the gene symbols targeted by captopril.
# Write your own code here
df_captopril['Gene_Symbol'].value_counts()
Check the number of data points for the activity types available for captopril.
# Write your own code here
df_captopril['Activity Name'].value_counts()
Generate boxplots that compare captopril’s activity against ACE and REN in terms of IC50 and Ki.
# Write your own code here
fig, axes = plt.subplots(1, 2, figsize=(8,4), sharey=True)
sns.boxplot(data=df_captopril[df_captopril['Activity Name'] == 'IC50'], x='Gene_Symbol', y='p_Activity_Value', hue='Gene_Symbol', fill=False, ax=axes[0])
sns.boxplot(data=df_captopril[df_captopril['Activity Name'] == 'Ki'], x='Gene_Symbol', y='p_Activity_Value', hue='Gene_Symbol', fill=False, ax=axes[1])
axes[0].set_title='IC50'
axes[1].set_title='Ki'
Discussion#
How many genes are involved in the RAAS pathway, based on the WikiPathways Record WP4756?
# Your answer here -> 7 genes
How many compounds with ATC codes showed submicromolar activities against any gene(s) involved in the RAAS?
# Your answer here -> 23 compounds (CIDs)
What genes does losartan show submicromolar activity against?
# Your answer here -> AGTR1, AGTR2
What genes does catopril show submicromolar activity against?
# Your answer here -> ACE, REN
Visit the summary pages of those compounds and read the descriptions in the box at the top.
losartan (CID 3961) : https://pubchem.ncbi.nlm.nih.gov/compound/3961
captopril (CID 44093) : https://pubchem.ncbi.nlm.nih.gov/compound/44093
What disease are these drugs used to treat?
# Your answer here -> hypertension
Read the following article and summarize the role and clinical importance of the RAAS pathway in 50-60 words.
Physiology, Renin Angiotensin System
By John H. Fountain, Jasleen Kaur & Sarah L. Lappin
https://www.ncbi.nlm.nih.gov/books/NBK470410/
# Write your summary in this markdown cell: