6.1 Exploratory Data Analysis - Protein#
Exploratory Data Analysis 1:
Concordance of Thyroid Hormone Receptor Activity β in Rats and Humans#
In the preclinical stage of drug development, drug candidates are tested in non-human species. One of the commonly used animals in the preclinical stage is rodents (e.g., mice and rats).
The underlying assumption here is that the bioactivity (e.g., toxicity and efficacy) of a drug candidate in animals is similar to its activity in humans.
In this notebook, we will look into the validity of this assumption by comparing the bioactivity data available in PubChem for human and rat thyroid hormone receptor β.
Recall we looked at two PubChem protein summary pages for accessions NP_001257783
and P10828
which were for the Norway Rat and Human thyroid hormone receptor β proteins.
NP_001257783
https://pubchem.ncbi.nlm.nih.gov/protein/NP_001257783 (rat)P10828
https://pubchem.ncbi.nlm.nih.gov/protein/P10828 (Human)
Let’s import necessary libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
from scipy.stats import linregress
#import time
#import json
Checking the amount of available data#
This section focuses on assessing the availability of the bioactivity data for the thyroid hormone receptor in both human and rat models. The data sets were downloaded in August 2025 from the PubChem Protein summary pages for protein assesions NP_001257783
(rats) and P10828
(humans).
NP_001257783
(rats)
https://pubchem.ncbi.nlm.nih.gov/protein/NP_001257783#section=Chemicals-and-BioactivitiesP10828
(humans)
https://pubchem.ncbi.nlm.nih.gov/protein/P10828#section=Chemicals-and-Bioactivities
The names of the downloaded files are:
file_protein_rat = "pubchem_protacxn_NP_001257783_bioactivity_protein.zip"
file_protein_human = "pubchem_protacxn_P10828_bioactivity_protein.zip"
Loading data for rat thyroid hormone receptor#
The following cells load and provide an initial exploration of the bioactivity data for the rat thyroid hormone receptor.
This cell defines the data type for the “COMPOUND_CID” column to ensure accurate data loading and to prevent memory-related issues.
# Define CID data type for reading CSV files to optimize memory usage
mydtype = {"Compound_CID" : "Int64"}
#mydtype = {"Compound_CID" : "Int64", "pmid":"Int64", "taxid":"object", "targettaxid":"object"}
We can now read our zip files containing bioactivity data in csv format for the rat thyroid hormone receptor into a Pandas DataFrame and then display the dimensions of the DataFrame. Notice that the read_csv method allows for direct reading of compressed files directly by using the compression = 'zip'
argument.
df_rat = pd.read_csv('pubchem_protacxn_NP_001257783_bioactivity_protein.zip', compression='zip', dtype=mydtype,low_memory=False)
df_rat.shape
(31458, 32)
Display a first few rows of the rat bioactivity DataFrame to get a quick overview of the data’s structure and content.
df_rat.head()
Bioactivity_ID | Activity | BioAssay_AID | Substance_SID | Compound_CID | refsid | Gene_ID | PMID | Aid_Type | Last_Modified_Date | ... | Representative_Protein_Accession | Taxonomy_ID | Cell_ID | Target_Taxonomy_ID | Anatomy_ID | Anatomy | dois | pmcids | pclids | citations | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 394010671 | Active | 743066 | 144203580 | 5920 | NaN | 24831 | NaN | Confirmatory | 20140820 | ... | NP_001257783 | 10116 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 394011094 | Active | 743066 | 144204055 | 92044441 | NaN | 24831 | NaN | Confirmatory | 20140820 | ... | NP_001257783 | 10116 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 394011748 | Active | 743066 | 144203868 | 5803 | NaN | 24831 | NaN | Confirmatory | 20140820 | ... | NP_001257783 | 10116 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 394012257 | Active | 743066 | 144204363 | 444795 | NaN | 24831 | NaN | Confirmatory | 20140820 | ... | NP_001257783 | 10116 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 394012468 | Active | 743066 | 144204799 | 57333099 | NaN | 24831 | NaN | Confirmatory | 20140820 | ... | NP_001257783 | 10116 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 32 columns
df_rat.columns # Uncomment to check the column names.
Index(['Bioactivity_ID', 'Activity', 'BioAssay_AID', 'Substance_SID',
'Compound_CID', 'refsid', 'Gene_ID', 'PMID', 'Aid_Type',
'Last_Modified_Date', 'Has_Dose_Response_Curve', 'RNAi_BioAssay',
'Protein_Accession', 'Activity_Type', 'Activity_Qualifier',
'Activity_Value', 'Bioassay_Data_Source', 'BioAssay_Name',
'Compound_Name', 'Target_Name', 'Target_Link', 'ECs',
'Representative_Protein_Accession', 'Taxonomy_ID', 'Cell_ID',
'Target_Taxonomy_ID', 'Anatomy_ID', 'Anatomy', 'dois', 'pmcids',
'pclids', 'citations'],
dtype='object')
Loading data for human thyroid hormone receptor
In the next two code cells:
Read the CSV file containing bioactivity data for the human thyroid hormone receptor into a pandas DataFrame, called
df_human
and then display the dimensions of the DataFrame.Display the first 5 rows of the dataframe.
# Write your code here to load human data into the datafram and display dimensions
# Write your code here to display the first 5 rows of the human dataframe
#df_human.columns # Uncomment to check the column names.
Checking the type and amount of available data for each species#
Here, we compare the types of bioactivity data available for both rat and human thyroid hormone receptors.
This cell counts the occurrences of each unique value in the 'Activity_Type'
column of the rat bioactivity DataFrame.
df_rat['Activity_Type'].value_counts(dropna=False)
Activity_Type
Potency 31458
Name: count, dtype: int64
Get the occurrences of each unique value in the
'Activity_Type'
column of the human bioactivity DataFrame in the next code cell.Comment on the Activity types found in the rat vs. human data.
# Write your own code here
Explanation of the code
A method is a function bound to an object and uses the .method
notation to attach it to the object.
Here the object is a Pandas DataFrame called df_human
.
Some methods perform actions on the DataFrame and require parentheses with optional arguments, e.g.
df_human.head(10)
returns the first 10 rows.Some properties are attributes rather than methods, so you access them without parentheses, e.g.
df_human.shape
returns the dimensions of the DataFrame.df_human.shape
→ Returns a tuple(rows, columns)
with dataset size.
Example:(5000, 20)
means 5000 rows × 20 columns.df_human.head()
→ Displays the first 5 rows (or more withhead(n)
).
Great for previewing structure and sample values.df_human.columns
→ Lists the column names as an Index object.
Helps you see available variables for analysis.df_human['Activity_Type'].value_counts(dropna=False)
→ Counts frequency of each unique value.
dropna=False
ensures missing values (NaN
) are included in the count.
Sometimes, visualizations can make patterns in the data easier to see. Let’s create bar charts to compare the distribution of activity types in the rat and human bioactivity datasets. The following code uses the Python Seaborn statistical data visualization package to create two side-by-side bar charts, one for the rat data and one for the human data. The sharey=True
argument ensures both charts use the same y-axis scale for easier comparison.
fig, axes = plt.subplots(1, 2, figsize=(10,4), sharey=True)
sns.countplot(data=df_rat, y='Activity_Type', ax=axes[0])
sns.countplot(data=df_human, y='Activity_Type', ax=axes[1])
axes[0].set_xlabel("Number of Data Points (rat)")
axes[1].set_xlabel("Number of Data Points (human)")
for ax in axes : # loop over each chart.
ax.set_xscale('log')
for container in ax.containers:
ax.bar_label(container)
fig.tight_layout()
What do you notice about the activity types available in the rat dataset compared to the human dataset?
If we want to compare small-molecule bioactivity between rats and humans, which activity type should we use, and why?
Reviewing the available bioassays#
Recall from the Bioactivity Primer we found there were 3 bioassays (AIDs 743065, 743066, 743067) associated with the rat thyroid receptor β: https://pubchem.ncbi.nlm.nih.gov/protein/NP_001257783#section=Chemicals-and-Bioactivities
We can now find the unique bioassays programmatically from the downloaded CSV files in our dataframe from both the rat and human receptors.
First, let’s get the number of unique AIDs for rat thyroid hormone receptor.
# the .nunique() function gives the number of unique values in a dataframe column
df_rat['BioAssay_AID'].nunique()
3
Let’s get the number of data points for rat thyroid hormone receptor for each assay. These will be the total tested substances in the assay regardless of active or inactive.
df_rat['BioAssay_AID'].value_counts()
BioAssay_AID
743066 10486
743065 10486
743067 10486
Name: count, dtype: int64
Confirm that the totals above match what is in the bioassay information at Pubchem: https://pubchem.ncbi.nlm.nih.gov/protein/NP_001257783#section=Chemicals-and-Bioactivities
In the code cells below check the number of unique assays and the number of data points for the human thyroid hormone receptor beta (THRβ).
# Write your own code for number of unique assays in human receptor
# Write your own code to determine number of assay data points (tested compounds) for each unique assay in human receptor
Notice that there are more than one hundred assays for human THRβ. However, we are only interested in the assays with the potency values, because only the potency data is available for both rats and humans. Therefore, we will focus on those assays.
df_rat[df_rat['Activity_Type'] == 'Potency'][['BioAssay_AID', 'BioAssay_Name']].value_counts()
BioAssay_AID BioAssay_Name
743065 qHTS assay to identify small molecule antagonists of the thyroid receptor (TR) signaling pathway 10486
743066 qHTS assay to identify small molecule agonists of the thyroid receptor (TR) signaling pathway 10486
743067 qHTS assay to identify small molecule antagonists of the thyroid receptor (TR) signaling pathway: Summary 10486
Name: count, dtype: int64
Explanation of the code
df_rat[df_rat['Activity_Type'] == 'Potency'][['BioAssay_AID', 'BioAssay_Name']].value_counts()
df_rat[df_rat['Activity_Type'] == 'Potency']
This actualy creates a mask and applies it to the dataframe, returning the filtered rows. There are two processes going on.df_rat['Activity_Type'] == 'Potency'
- This creates a Boolean mask: a series of True/False values indicating whether each row meets the condition. We could write this asmask = df_rat['Activity_Type'] == 'Potency'
df_rat[mask]
then filters the DataFrame and selects the rows from df_rat where the mask is True.
[['BioAssay_AID', 'BioAssay_Name']]
selects only the ‘BioAssay_AID’ and ‘BioAssay_Name’ columns from the filtered DataFrame..value_counts()
is a Pandas method that counts the occurrences of each unique combination of ‘BioAssay_AID’ and ‘BioAssay_Name’.Returns a Series with the counts of each unique combination.
The index is a MultiIndex with ‘BioAssay_AID’ and ‘BioAssay_Name’, this returns all observed combinations of these two columns and their counts.
# Write your code here to filter the data frame for human receptor and count the number of assays with potency data.
Considering the title and the data point counts for the returned assays, the three assays for rat thyroid hormone receptor are high-throughput screening (HTS) assays. Therefore, we also need HTS assays for human thyroid hormone receptor.
In the remaining part of this notebook, we will use the qHTS assays for antagonists of rat and human thyroid hormone receptors.
AID 743065: Confirmatory Assay (Antagonists) in rat receptor
AID 588547: Confirmatory (Antagonists) in human receptor
They were chosen based on the following considerations.
The four assays for human thyroid hormone receptors with more than 1000 data points (i.e., 1469, 1479, 588547, 588545) are HTS screening.
Strictly speaking, AIDs 1469 and 1479 are not targeting the human thyroid hormone receptor, but its interaction with steroid receptor coregulator.
At the summary page of AID 743067, it is explained that the assay summarizes from an confirmatory antagonist assay (AID 743065) and a cell viability counter screening (AID 743064).
More active compounds were identified in the antagonist HTS assays for both species (AIDs 743065 and 588547, compared to the agonist assays (AIDs 743066 and 588545).
AID |
Assay Type |
species |
Number of tested substances |
number of actives |
---|---|---|---|---|
743065 |
Antagonist |
rat 🐀 |
10,486 |
1880 |
588547 |
Antagonist |
human 🧑🏽🔬 |
2,858 |
12 |
743066 |
Agonist |
rat 🐀 |
10,486 |
64 |
588545 |
Agonist |
human🧑🏽🔬 |
2,825 |
10 |
Comparing HTS screening data for Rat and human#
The following section uses the quantitative HTS (qHTS) data for thyroid hormone receptor antagonsts. The data files were downloaded from their BioAssay record pages: (Note: Links below take you to the Download csv dropdown in the data table.)
AID 743065: Confirmatory Assay (Antagonists)
AID 588547: Confirmatory (antagonists)
We have provided you with the data files as compressed ‘zip’ files, but you can also download them directly as .csv files. The names of the downloaded files are:
#file_aid_rat = "AID_743065_datatable_all.csv" #uncomment to use uncompressed csv files
#file_aid_human = "AID_588547_datatable_all.csv" #uncomment to use uncompressed csv files
file_aid_rat = "AID_743065_datatable_all.zip"
file_aid_human = "AID_588547_datatable_all.zip"
Loading the HTS screening data for Rats#
Now let’s load the HTS data for the rat thyroid hormone receptor.
The data type for the
PUBCHEM_CID
column is explicitly specifiedmydtype = {'PUBCHEM_CID':'Int64' }
to ensure it is read as an integer, which is important for accurate data handling.The first 5 rows are skipped
skiprows=[1,2,3,4,5]
because they do not contain data, but meta data (e.g., the description of each column).
mydtype = {'PUBCHEM_CID':'Int64' }
df_aid_rat = pd.read_csv(file_aid_rat, compression='zip', skiprows=[1,2,3,4,5], dtype=mydtype, low_memory=False)
df_aid_rat.head()
PUBCHEM_RESULT_TAG | PUBCHEM_SID | PUBCHEM_CID | PUBCHEM_EXT_DATASOURCE_SMILES | PUBCHEM_ACTIVITY_OUTCOME | PUBCHEM_ACTIVITY_SCORE | PUBCHEM_ACTIVITY_URL | PUBCHEM_ASSAYDATA_COMMENT | Phenotype-Replicate_1 | Potency-Replicate_1 | ... | Activity at 0.762 uM-Replicate_46 | Activity at 1.702 uM-Replicate_46 | Activity at 3.793 uM-Replicate_46 | Activity at 8.439 uM-Replicate_46 | Activity at 18.15 uM-Replicate_46 | Activity at 32.49 uM-Replicate_46 | Activity at 70.35 uM-Replicate_46 | Activity at 94.96 uM-Replicate_46 | Activity at 260.0 uM-Replicate_46 | Activity at 920.3 uM-Replicate_46 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 144203552 | 12850184 | C(C(=O)[C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O.C(... | Inactive | 0 | http://assay.nih.gov/htsws/rest/display/tox21-... | NaN | Inactive | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 2 | 144203553 | 89753 | C([C@H]([C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O)O... | Inactive | 0 | http://assay.nih.gov/htsws/rest/display/tox21-... | NaN | Inactive | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 3 | 144203554 | 9403 | C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@H]2OC(=O)CCC... | Inconclusive | 10 | http://assay.nih.gov/htsws/rest/display/tox21-... | NaN | Inhibitor | 33.4915 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 4 | 144203555 | 13218779 | C[C@@]12CC[C@@H](C1(C)C)C[C@H]2OC(=O)CSC#N | Inactive | 0 | http://assay.nih.gov/htsws/rest/display/tox21-... | NaN | Inactive | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 5 | 144203556 | 142766 | C1=CC=C(C(=C1)C(=O)O)O.C1=CC2=C(C(=C1)O)N=CC=C2 | Active | 83 | http://assay.nih.gov/htsws/rest/display/tox21-... | NaN | Inhibitor | 8.4127 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 1803 columns
df_aid_rat.shape
(10486, 1803)
This code cell identifies and counts all columns that do not have values.
nan_columns = df_aid_rat.columns[df_aid_rat.isna().all()].tolist()
print(len(nan_columns))
389
Explanation of the code
nan_columns = df_aid_rat.columns[df_aid_rat.isna().all()].tolist()
df_aid_rat.isna()
Returns a DataFrame of the same shapeEvery cell is
True
if NaN, otherwiseFalse
.
.all()
→ Checks if all values in each column are True (i.e., all values are NaN).applied column-wise (by default axis=0), so it goes down each column to see if it is true, so there is one Boolean value for each column.
the result is a Pandas Series of Booleans, where the index is the name of the column, and each value is a Boolean True or False based on the values of the entire column.
df_aid_rat.columns[...]
Filters the DataFrame’s columns (a Pandas Series) to keep only those where the condition is True..tolist()
→ Converts the filtered column names to a list.nan_columns = df_aid_rat.columns[df_aid_rat.isna().all()].tolist()
Assigns the list of column names with all NaN values to the variablenan_columns
.len(nan_columns)
Returns the number of columns that have all values as NaN.print(len(nan_columns)) Prints the number of columns with all NaN values.
The following cell removes those columns and displays the updated dimensions.
df_aid_rat = df_aid_rat.dropna(how='all', axis=1)
df_aid_rat.shape
(10486, 1414)
Explanation of the code
df_aid_rat = df_aid_rat.dropna(how='all', axis=1)
.dropna()
is a Pandas method that removes missing values (NaN) from a DataFrame.it only looks for cells with missing data (NaN), not for strings like “active” or “CCO”. A column with text values will not be dropped unless every single value in it is missing (NaN)
how='all'
specifies that only columns where all values are NaN should be dropped.how
is a keyword parameter and ‘all’ is the argument that you pass to it.axis
is another keyword parameter that specifies whether to drop rows (axis=0) or columns (axis=1), where 0 and 1 are the arguments you pass to itQuick Definitions
Function → A reusable block of code that does something when you call it. Example:
print("hello")
.Method → A function that belongs to an object and is called using dot notation. Example:
df.dropna()
.Parameter → The placeholder names defined inside the function/method (e.g.
axis
,how
).Argument → The actual values you pass to those parameters when you call the function/method (e.g.
axis=1
,"all"
).Positional vs Keyword arguments →
Positional: arguments matched by their order. Example:
df.dropna(1, "all")
(first goes toaxis
, second goes tohow
).Keyword: arguments matched by name, not order. Example:
df.dropna(axis=1, how="all")
.
NaN (“Not a Number”) is a special floating-point value defined by the IEEE 754 standard, used in Python to represent missing, undefined, or invalid numerical data.
df_aid_rat.head()
PUBCHEM_RESULT_TAG | PUBCHEM_SID | PUBCHEM_CID | PUBCHEM_EXT_DATASOURCE_SMILES | PUBCHEM_ACTIVITY_OUTCOME | PUBCHEM_ACTIVITY_SCORE | PUBCHEM_ACTIVITY_URL | Phenotype-Replicate_1 | Potency-Replicate_1 | Efficacy-Replicate_1 | ... | Activity at 0.153 uM-Replicate_46 | Activity at 0.341 uM-Replicate_46 | Activity at 0.762 uM-Replicate_46 | Activity at 1.702 uM-Replicate_46 | Activity at 3.793 uM-Replicate_46 | Activity at 8.439 uM-Replicate_46 | Activity at 18.15 uM-Replicate_46 | Activity at 32.49 uM-Replicate_46 | Activity at 70.35 uM-Replicate_46 | Activity at 94.96 uM-Replicate_46 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 144203552 | 12850184 | C(C(=O)[C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O.C(... | Inactive | 0 | http://assay.nih.gov/htsws/rest/display/tox21-... | Inactive | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 2 | 144203553 | 89753 | C([C@H]([C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O)O... | Inactive | 0 | http://assay.nih.gov/htsws/rest/display/tox21-... | Inactive | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 3 | 144203554 | 9403 | C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@H]2OC(=O)CCC... | Inconclusive | 10 | http://assay.nih.gov/htsws/rest/display/tox21-... | Inhibitor | 33.4915 | 71.8523 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 4 | 144203555 | 13218779 | C[C@@]12CC[C@@H](C1(C)C)C[C@H]2OC(=O)CSC#N | Inactive | 0 | http://assay.nih.gov/htsws/rest/display/tox21-... | Inactive | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 5 | 144203556 | 142766 | C1=CC=C(C(=C1)C(=O)O)O.C1=CC2=C(C(=C1)O)N=CC=C2 | Active | 83 | http://assay.nih.gov/htsws/rest/display/tox21-... | Inhibitor | 8.4127 | 111.3000 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 1414 columns
Loading the HTS data for humans
In the following code cells you will read in the human HTS data as we did for the rat.
Read the zip (or csv) file for the human HTS data into a dataframe called df_aid_human
. Make sure to skip the initial descriptive rows and set the data type for ‘PUBCHEM_CID’ to ‘Int64’.
# Write your own code here.
Output the first 5 rows of the human AID dataframe to check that it properly loaded.
# Write your own code here.
Display the dimensions of the human HTS DataFrame.
# Write your own code here.
In the following cell remove the columns containing no values and displays the updated dimensions.
# Write your own code here.
df_aid_human.head()
Substance-level analysis#
This section compares the HTS data for rats and humans at the substance level
df_aid_rat.columns # Uncomment this line if you want to see what columns the data frame has.
Index(['PUBCHEM_RESULT_TAG', 'PUBCHEM_SID', 'PUBCHEM_CID',
'PUBCHEM_EXT_DATASOURCE_SMILES', 'PUBCHEM_ACTIVITY_OUTCOME',
'PUBCHEM_ACTIVITY_SCORE', 'PUBCHEM_ACTIVITY_URL',
'Phenotype-Replicate_1', 'Potency-Replicate_1', 'Efficacy-Replicate_1',
...
'Activity at 0.153 uM-Replicate_46',
'Activity at 0.341 uM-Replicate_46',
'Activity at 0.762 uM-Replicate_46',
'Activity at 1.702 uM-Replicate_46',
'Activity at 3.793 uM-Replicate_46',
'Activity at 8.439 uM-Replicate_46',
'Activity at 18.15 uM-Replicate_46',
'Activity at 32.49 uM-Replicate_46',
'Activity at 70.35 uM-Replicate_46',
'Activity at 94.96 uM-Replicate_46'],
dtype='object', length=1414)
df_aid_human.columns # Uncomment this line if you want to see what columns the data frame has.
Activity outcomes and phenotypes#
This part of the analysis focuses on comparing the activity outcomes and observed phenotypes from the HTS data for both species.
First, let’s generate two bar charts to visually compare the distribution of PUBCHEM_ACTIVITY_OUTCOME
in the rat and human HTS datasets.
fig, axes = plt.subplots(1, 2, figsize=(10,4), sharey=True)
sns.countplot(data=df_aid_rat, y='PUBCHEM_ACTIVITY_OUTCOME', ax=axes[0])
sns.countplot(data=df_aid_human, y='PUBCHEM_ACTIVITY_OUTCOME', ax=axes[1])
axes[0].set_xlabel("Number of Data Points (rat)")
axes[1].set_xlabel("Number of Data Points (human)")
for ax in axes : # loop over each chart.
for container in ax.containers:
ax.bar_label(container)
fig.tight_layout()
What is the majority activity data type in each species?
How many actives are found for each species?
Is an active molecule an agonist or antagonist in this assay? (recall each of the bioassay titles)
Both data sets have a column containing information on the type of activity found for the substance tested. These columns are:
Phenotype-Replicate_1
for ratsPhenotype
for humans
It is worth mentioning that the column for rats is suffixed with -Replicate_1
. The rat data set contains data from repeated assay experiments, which are indicated as Replicate_1, Replicate_2, Replicate_3, and so on. For simplicity, we will use the data for -Replicate_1
, but if we were to perform a stricter analysis, we should consider data from all replicates.
Now let’s create two bar charts to compare the distribution of phenotypes (‘Phenotype-Replicate_1’ for rat and ‘Phenotype’ for human) across the two HTS datasets.
# Write your own code
In each case the phenotype indicates the type of activity observed: inhibitor, activator, fluorescent, cytotoxic, inactive, or inconclusive.
To find the relationship between the activity outcomes and phenotypes, let’s count the occurrences of each unique combination of ‘PUBCHEM_ACTIVITY_OUTCOME’ and ‘Phenotype-Replicate_1’ in the rat HTS data.
df_aid_rat[['PUBCHEM_ACTIVITY_OUTCOME', 'Phenotype-Replicate_1' ]].value_counts()
PUBCHEM_ACTIVITY_OUTCOME Phenotype-Replicate_1
Inactive Inactive 7041
Active Inhibitor 1880
Inconclusive Inhibitor 1038
Inactive Activator 527
Name: count, dtype: int64
.groupby()
and .size()
can be used to get the same information (but sorted differently).
df_aid_rat.groupby(by=['PUBCHEM_ACTIVITY_OUTCOME', 'Phenotype-Replicate_1']).size()
PUBCHEM_ACTIVITY_OUTCOME Phenotype-Replicate_1
Active Inhibitor 1880
Inactive Activator 527
Inactive 7041
Inconclusive Inhibitor 1038
dtype: int64
Explanation of the code
df_aid_rat.groupby(by=['PUBCHEM_ACTIVITY_OUTCOME', 'Phenotype-Replicate_1']).size()
groupby(by=[...])
Splits the DataFrame into groups based on the unique combinations of two columns:
PUBCHEM_ACTIVITY_OUTCOME
(e.g., Active, Inactive, Inconclusive)Phenotype-Replicate_1
(e.g., Activator, Inhibitor, Inactive)
Each unique pair of values is a group key.
.size()
Counts the number of rows in each group.
The result is a Pandas Series, where:
The index = the group keys (a MultiIndex of outcome + phenotype).
The values = the row counts for each group.
The result is a Pandas Series with a MultiIndex built from the two grouping columns, with values being the counts (the output of .size()).
Level 0 → PUBCHEM_ACTIVITY_OUTCOME
Level 1 → Phenotype-Replicate_1
So internally it looks like this:
index = MultiIndex([
('Active', 'Inhibitor'),
('Inactive', 'Activator'),
('Inactive', 'Inactive'),
('Inconclusive', 'Inhibitor')
], names=['PUBCHEM_ACTIVITY_OUTCOME','Phenotype-Replicate_1'])
values = [1880, 527, 7041, 1038]
It does not print the ‘Inactive’ label again for the second third level because it is the same as the second level, so it is not repeated in the output.
PUBCHEM_ACTIVITY_OUTCOME Phenotype-Replicate_1
Active Inhibitor 1880
Inactive Activator 527
Inactive 7041
Inconclusive Inhibitor 1038
dtype: int64
Active, Inhibitor → 1880 substances
Inactive, Activator → 527 substances
Inactive, Inactive → 7041 substances (the blank under “Inactive” just means “same as above” — Pandas doesn’t repeat the label)
Inconclusive, Inhibitor → 1038 substances
Now let’s count the occurrences of each unique combination of ‘PUBCHEM_ACTIVITY_OUTCOME’ and ‘Phenotype’ in the human HTS data.
df_aid_human[['PUBCHEM_ACTIVITY_OUTCOME', 'Phenotype' ]].value_counts()
Get the same information using .groupby()
and .size()
df_aid_human.groupby(by=['PUBCHEM_ACTIVITY_OUTCOME', 'Phenotype']).size()
Look at the objective of the rat and human assays. What type of compounds were the assays designed to identify?
In your rat dataset, how are compounds with an Activator phenotype classified? How is this different from the human dataset?
Why do you think Activators are not declared as “Active” even though they show biological activity?
In the human dataset, some compounds are labeled Inconclusive. What are possible reasons a compound might be classified this way rather than as Active or Inactive?
The “Inconclusive” category in the human dataset includes additional phenotypes such as Cytotoxic and Fluorescent.
What does it mean if a compound is labeled Cytotoxic?
Why might a fluorescent compound interfere with interpreting assay results?
What issues with a dose-response curve could make a compound labeled as an Inhibitor still end up in the Inconclusive category?
(Hint for these: We explored the test methods in the BioActivity Primier activity.)
Compare the way activators are treated in the rat and human datasets. How could these differences affect your ability to compare results between species?
If you were the scientist in charge of designing the assay, how might you justify the choice to classify activators as Inactive versus Inconclusive?
Efficacy#
In this section, we compare the efficacy data from the HTS assays for both rat and human thyroid hormone receptors.
For agonist-mode assays, high efficacy = strong activation relative to baseline.
For this antagonist-mode assay, efficacy instead reflects the maximum inhibition observed relative to the agonist control (thyroid hormone T3).
An efficacy near -100% means the compound fully blocked the receptor’s activity back to baseline (complete antagonist).
An efficacy near 0% means little or no inhibition.
Intermediate values (e.g., -40%, -60%) mean partial inhibition.
So in this dataset, the Efficacy column tells you “how much did the compound reduce the T3 signal at its maximum effect?”
Let’s look at some examples from the rat dataset to get a sense of the type of data here:
cids_of_interest = [6, 142766, 89753, 102428, 142766, 2724411, 12850184, 89753, 13218779, 774, 5803]
df_aid_rat[df_aid_rat['PUBCHEM_CID'].isin(cids_of_interest)] \
.drop_duplicates(subset='PUBCHEM_CID') \
.loc[:, ['PUBCHEM_CID', 'PUBCHEM_ACTIVITY_OUTCOME', 'Phenotype-Replicate_1', 'Efficacy-Replicate_1']] \
.sort_values(by='PUBCHEM_ACTIVITY_OUTCOME')
PUBCHEM_CID | PUBCHEM_ACTIVITY_OUTCOME | Phenotype-Replicate_1 | Efficacy-Replicate_1 | |
---|---|---|---|---|
4 | 142766 | Active | Inhibitor | 111.3000 |
7 | 2724411 | Active | Inhibitor | 105.6700 |
5600 | 6 | Active | Inhibitor | 121.3320 |
0 | 12850184 | Inactive | Inactive | NaN |
1 | 89753 | Inactive | Inactive | NaN |
3 | 13218779 | Inactive | Inactive | NaN |
165 | 774 | Inactive | Activator | 21.8768 |
316 | 5803 | Inactive | Activator | 49.7967 |
1312 | 102428 | Inactive | Activator | 19.7701 |
Explanation of the code
`df_aid_rat ['PUBCHEM_CID'].isin(cids_of_interest).drop_duplicates(subset='PUBCHEM_CID').loc[:, ['PUBCHEM_CID', 'PUBCHEM_ACTIVITY_OUTCOME', 'Phenotype-Replicate_1', 'Efficacy-Replicate_1']].sort_values(by='PUBCHEM_ACTIVITY_OUTCOME')
` Note, in Python the ‘’ is used to indicate that the line continues on the next line, so you can break long lines into multiple lines for readability, and the code box is really two lines of code, one defining the list and the other what is done with it.
df_aid_rat
is a Pandas Dataframedf_aid_rat['PUBCHEM_CID']
is a Pandas Series of made of the PubChem_CID field of the Dataframe.df_aid_rat['PUBCHEM_CID'].isin(cids_of_interest)
noteisin()
is a series method operating on the CID coluns to create a Boolean mask that is True for rows where the PUBCHEM_CID is in the list of cids_of_interest.df_aid_rat[df_aid_rat['PUBCHEM_CID'].isin(cids_of_interest)]
creates a new filtered DataFrame that keeps row where the mask is True. It does not overwrite the original dataframe..drop_duplicates(subset='PUBCHEM_CID')
removes duplicate rows based on the ‘PUBCHEM_CID’ column, keeping only the first occurrence..loc[:, ['PUBCHEM_CID', 'PUBCHEM_ACTIVITY_OUTCOME', 'Phenotype-Replicate_1', 'Efficacy-Replicate_1']]
selects only the specified columns from the filtered DataFrame..sort_values(by='PUBCHEM_ACTIVITY_OUTCOME')
sorts the DataFrame by the ‘PUBCHEM_ACTIVITY_OUTCOME’ column in ascending order.
To be labeled as an active antagonist in this model, what must the efficacy value be? Which CIDs in the data subset are antagonists? How is that indicated by phenotype?
How does the activity of inactive compounds compare to active compounds?
Some inhibitors have efficacy values greater than 100%, meaning they reduce the assay signal below the baseline. What experimental or biological factors could cause an antagonist to produce a signal lower than baseline?
Activators such as CID 774, 5803, and 102428 only show partial inhibition of the assay signal. Could these compounds be acting as partial agonists rather than true activators or antagonists? How might partial agonists appear in this type of antagonist assay?
We can see there is similar data in the human dataset:
cids_of_interest = [457193, 15910, 443939, 5995, 5281575,7529]
df_aid_human[df_aid_human['PUBCHEM_CID'].isin(cids_of_interest)] \
.drop_duplicates(subset='PUBCHEM_CID') \
.loc[:, ['PUBCHEM_CID', 'PUBCHEM_ACTIVITY_OUTCOME', 'Phenotype', 'Efficacy']] \
.sort_values(by='PUBCHEM_ACTIVITY_OUTCOME')
Plotting the data:
A histogram groups data into bins to display how frequently values occur withing each range. It is best used for large datasets and shows the shape of the distribution. This can show overall patterns in data, not the individual values. This can be very useful for comparing data like potency or efficacy distributions.
Now that we have an overview of the types of data in the rat dataset, let’s generate two histograms to visualize and compare the distribution of efficacy values (‘Efficacy-Replicate_1’ for rat and ‘Efficacy’ for human) for different activity outcomes.
fig, axes = plt.subplots(1, 2, figsize=(10,4), sharey=True)
color_dict = {"Active": "red", "Inconclusive": "purple", "Conflicting": "blue", "Inactive": "green"}
sns.histplot(data=df_aid_rat, x='Efficacy-Replicate_1', hue='PUBCHEM_ACTIVITY_OUTCOME',ax=axes[0], palette=color_dict)
sns.histplot(data=df_aid_human, x='Efficacy', hue='PUBCHEM_ACTIVITY_OUTCOME', ax=axes[1], palette=color_dict)
axes[0].set_xlabel("Efficacy (%, rat)")
axes[1].set_xlabel("Efficacy (%, human)")
fig.tight_layout()
Notice in the above, it is clear we can see three distinct Activity Outcome categories in the rat data. However, with the human data being a smaller set, it is challenging to see which molecules are labeled as active or inconclusive.
Another way to visualize this data is with a strip plot. It will show every individual data point along an axis. If two datapoints are the same, the program may provide slight “jitter” to reduce overap. This plot is best used for showing discrete observations and identifying clusters, outliers or repeated values. It can also useful for comparing disbributions across categories. In this case we are looking at efficacy by activity outcome.
fig, axes = plt.subplots(1, 2, figsize=(10,4), sharey=True)
hue_order = ['Active', 'Inconclusive', 'Inactive']
sns.stripplot(data=df_aid_rat, x='Efficacy-Replicate_1', y='PUBCHEM_ACTIVITY_OUTCOME', hue='PUBCHEM_ACTIVITY_OUTCOME', hue_order=hue_order, ax=axes[0])
sns.stripplot(data=df_aid_human, x='Efficacy', y='PUBCHEM_ACTIVITY_OUTCOME', hue='PUBCHEM_ACTIVITY_OUTCOME', hue_order=hue_order, ax=axes[1])
axes[0].set_xlabel("Efficacy (%, rat)")
axes[1].set_xlabel("Efficacy (%, human)")
Another useful visualization tool is the box plot which summarizes the distribution of a datase by showing five key statistics:
minimum
Q1 - The First quartile (25th percentile of values)
Q2 - The Second quartile (50th percentile of values)
Median- The value exactly in the middle of the values
Q3 - The Third quartile (75th percentile of values)
Maximum - The highest non-outlier value
Outliers - points beyond the “whiskers” that are anomolous data points.
You can use it to summarize variables (like efficacy or potency) and how they are distributed across categories (active, inactive, inconclusive), how to compare medians and spread between groups, and spot outliers quickly without visualizing every individual point.
Let’s create box plots of the activity and efficacy data for rats and humans.
fig, axes = plt.subplots(1, 2, figsize=(10,4), sharey=True)
hue_order = ['Active', 'Inconclusive', 'Inactive']
sns.boxplot(data=df_aid_rat, x='Efficacy-Replicate_1', y='PUBCHEM_ACTIVITY_OUTCOME', hue='PUBCHEM_ACTIVITY_OUTCOME', hue_order=hue_order, ax=axes[0])
sns.boxplot(data=df_aid_human, x='Efficacy', y='PUBCHEM_ACTIVITY_OUTCOME', hue='PUBCHEM_ACTIVITY_OUTCOME', hue_order=hue_order, ax=axes[1])
axes[0].set_xlabel("Efficacy (%, rat)")
axes[1].set_xlabel("Efficacy (%, human)")
Based on the various plots of the data, were efficacy values determined for all compounds in the human dataset?
The box plots of Efficacy for the rat and human datasets show that the Inconclusive and Inactive outcomes have many more outliers than the Active outcomes.
Why do you think the Inconclusive and Inactive groups show such a wide spread of efficacy values?
What does this suggest about how compounds are classified in assays compared to those labeled Active?
Potency#
Now visualize the potency of the tested chemicals on the rat and human thyroid hormone receptors.
Generate two histograms to visualize and compare the distribution of potency values for different activity outcomes in the rat and human HTS datasets.
# Write your own code
Generate two strip plots to visualize the potency values for different activity outcomes in the rat and human HTS datasets.
# Write your own code
Generate two box plots to visualize the potency data for different activity outcomes in the rat and human HTS datasets.
# Write your own code
Based on the various plots of the data, were potency values determined for all compounds in the human dataset?
When comparing the rat and human datasets, the box and strip plots for Potency show fewer outliers than those for Efficacy.
Why do you think fewer outliers appear in the potency data?
What does this tell you about the type of compounds that have measurable potency values compared to those that only have efficacy values?
We can compute summary statistics such as the mean, median, minimum, and maximum for the efficacy and potency values across different activity outcomes (Active, Inactive, Inconclusive) in both the rat and human datasets. These summaries help describe how the data are distributed and how results compare between species.
df_aid_rat.groupby('PUBCHEM_ACTIVITY_OUTCOME')[['Efficacy-Replicate_1', 'Potency-Replicate_1']].mean()
Efficacy-Replicate_1 | Potency-Replicate_1 | |
---|---|---|
PUBCHEM_ACTIVITY_OUTCOME | ||
Active | 110.130054 | 22.882127 |
Inactive | 50.140904 | 27.487615 |
Inconclusive | 58.654658 | 43.571585 |
df_aid_rat.groupby('PUBCHEM_ACTIVITY_OUTCOME')[['Efficacy-Replicate_1', 'Potency-Replicate_1']].median()
Efficacy-Replicate_1 | Potency-Replicate_1 | |
---|---|---|
PUBCHEM_ACTIVITY_OUTCOME | ||
Active | 113.98150 | 18.83360 |
Inactive | 39.19415 | 24.44795 |
Inconclusive | 49.11970 | 33.49150 |
df_aid_rat.groupby('PUBCHEM_ACTIVITY_OUTCOME')[['Efficacy-Replicate_1', 'Potency-Replicate_1']].min()
Efficacy-Replicate_1 | Potency-Replicate_1 | |
---|---|---|
PUBCHEM_ACTIVITY_OUTCOME | ||
Active | 49.0617 | 0.0007 |
Inactive | 13.5294 | 0.0019 |
Inconclusive | 19.5822 | 0.0033 |
df_aid_rat.groupby('PUBCHEM_ACTIVITY_OUTCOME')[['Efficacy-Replicate_1', 'Potency-Replicate_1']].max()
Efficacy-Replicate_1 | Potency-Replicate_1 | |
---|---|---|
PUBCHEM_ACTIVITY_OUTCOME | ||
Active | 246.592 | 159.683 |
Inactive | 246.847 | 107.353 |
Inconclusive | 199.769 | 343.762 |
Now it’s your turn to compute the mean, median, minimum, maximum of the efficacy and potency values for the different outcomes of the human data set.
Explanation of the code
`df_aid_rat.groupby('PUBCHEM_ACTIVITY_OUTCOME')[['Efficacy-Replicate_1', 'Potency-Replicate_1']].mean()`df_aid_rat
This is your Pandas DataFrame containing PubChem assay results. It has many columns, including:PUBCHEM_ACTIVITY_OUTCOME
(Inactive, Active, Inconclusive, etc.)replicate values like
Efficacy-Replicate_1
andPotency-Replicate_1
.
.groupby('PUBCHEM_ACTIVITY_OUTCOME')
Splits your DataFrame into sub-tables (groups), one for each unique value in
PUBCHEM_ACTIVITY_OUTCOME
.For example:
Group 1: all rows with
Inactive
Group 2: all rows with
Active
Group 3: all rows with
Inconclusive
Each group can now be analyzed independently.
[['Efficacy-Replicate_1', 'Potency-Replicate_1']]
This selects just two numeric columns out of the full DataFrame.
That way, the statistics are calculated only on these measures, not on all columns.
.mean()
Applies the arithmetic mean function (average) to each column, within each group.
Statistics Methods
.mean()
→ Calculates the average value for each group..median()
→ Calculates the median value for each group..min()
→ Finds the minimum value for each group..max()
→ Finds the maximum value for each group..std()
→ Calculates the standard deviation for each group..var()
→ Calculates the variance for each group..count()
→ Counts the number of non-null values in each group..size()
→ Counts the total number of rows in each group (including NaN)..quantile([0.25, 0.5, 0.75])
→ Calculates the 25th, 50th (median), and 75th percentiles for each group..describe()
→ Provides a summary of statistics (count, mean, std, min, 25%, 50%, 75%, max) for each group..agg()
→ Allows you to apply multiple aggregation functions at once, e.g.df.groupby('col').agg(['mean', 'std', 'min', 'max'])
.
# Write your own code for mean
# Write your own code for median
# Write your own code for minimum
# Write your own code for maximum
How would you expect the mean and median efficacy values of Active compounds to compare with those of Inactive or Inconclusive compounds? Is that portrayed in the data?
Did you observe that active compounds have higher average efficacy in one species than the other? What might that suggest about species differences in thyroid hormone receptor response?
Recall the lower the potency value, a lower amount of compound is needed to cause an effect. Are there species differences in minimum potency values?
Efficacy vs Potency#
This section explores the relationship between the efficacy and potency of the tested compounds for both the rat and human thyroid hormone receptors.
The following code creates two scatter plots, one for the rat data and one for the human data, to visualize the relationship between potency and efficacy, with points colored by their activity outcome.
fig, axes = plt.subplots(1, 2, figsize=(10,4), sharey=False)
hue_order = ['Active', 'Inconclusive', 'Inactive']
sns.scatterplot(data=df_aid_rat, x='Potency-Replicate_1', y='Efficacy-Replicate_1', hue='PUBCHEM_ACTIVITY_OUTCOME', hue_order=hue_order, ax=axes[0])
sns.scatterplot(data=df_aid_human, x='Potency', y='Efficacy', hue='PUBCHEM_ACTIVITY_OUTCOME', hue_order=hue_order, ax=axes[1])
axes[0].set_xlabel("Potency (micro-M, rat)")
axes[1].set_xlabel("Potency (micro-M, human)")
axes[0].set_ylabel("Efficacy (%, rat)")
axes[1].set_ylabel("Efficacy (%, human)")
# Uncomment next three lines to limit the x-axis to 200.
#for ax in axes : # loop over each panel
# ax.set_xlim(0, 200)
# ax.set_ylim(0, 200)
Compound-level Analysis#
In PubChem, substances are chemical records provided by individual data contributors and compounds are unique chemical structures extracted from substance records.
In many assays, substances can be viewed as “samples” of chemicals. A chemical can be tested in multiple samples containing that chemical in a single assay. The samples are not necessary the same as each other. For example, they may be different in terms of its concentration, temperature, solvent used, etc.
Therefore, if you want to determine the activity of a chemical, you may need to collapse the data from multiple samples (i.e., substances) into a single data point for the chemical.
The code cells below checks how many unique compounds in the data set and how many of them are associated with multiple substances.
df_rat_cidcount = pd.DataFrame(df_aid_rat['PUBCHEM_CID'].value_counts()).reset_index()
df_human_cidcount = pd.DataFrame(df_aid_human['PUBCHEM_CID'].value_counts()).reset_index()
print("#-- Rat data set")
print("# CIDs in total :", len(df_rat_cidcount) )
print("# CIDs associated with multiple SIDs :", len(df_rat_cidcount[df_rat_cidcount['count'] > 1]) )
print("------------------------------------------------")
print("#-- Human data set")
print("# CIDs in total :", len(df_human_cidcount) )
print("# CIDs associated with multiple SIDs :", len(df_human_cidcount[df_human_cidcount['count'] > 1]) )
fig, axes = plt.subplots(1, 2, figsize=(10,4), sharey=False)
sns.countplot(data=df_rat_cidcount, x='count', ax=axes[0])
sns.countplot(data=df_human_cidcount, x='count', ax=axes[1])
axes[0].set_xlabel("Number of SIDs per CID")
axes[1].set_xlabel("Number of SIDs per CID")
axes[0].set_ylabel("Number of CIDs")
axes[1].set_ylabel("Number of CIDs")
axes[0].set_title("Rat")
axes[1].set_title("Human")
Aggregating substance data rats#
Now we aggregate the substance data for each unique compound (identified by its CID).
This code groups the rat HTS data by ‘PUBCHEM_CID’ and store them in a data frame with three columns (PUBCHEM_CID
, Potency-Replicate_1
, and Efficacy-Replicate_1
). Note that some CIDs are associated with multiple SIDs, and hence, multiple activity values. For those cases, we are going to take the average of the multiple activity values.
df_aid_rat_cid = df_aid_rat.groupby('PUBCHEM_CID').mean(numeric_only=True).reset_index()
df_aid_rat_cid = df_aid_rat_cid[['PUBCHEM_CID', 'Potency-Replicate_1', 'Efficacy-Replicate_1']]
df_aid_rat_cid.head()
PUBCHEM_CID | Potency-Replicate_1 | Efficacy-Replicate_1 | |
---|---|---|---|
0 | 4 | NaN | NaN |
1 | 6 | 4.6248 | 118.2025 |
2 | 11 | NaN | NaN |
3 | 13 | NaN | NaN |
4 | 33 | NaN | NaN |
df_aid_rat_cid.shape
(8099, 3)
This code defines a function, get_cid_outcome
, that determines a single activity outcome for each unique compound ID (CID) in a given DataFrame. If a compound has multiple, conflicting outcomes recorded, it is labeled as “Conflicting”.
def get_cid_outcome(df) :
outcomes = {}
for cid in df['PUBCHEM_CID'].dropna().unique() : # loop over unique CIDs (without Null)
outcome =""
unique_outcomes = df[ df['PUBCHEM_CID'] == cid ]['PUBCHEM_ACTIVITY_OUTCOME'].unique()
if ( len(unique_outcomes) == 1 ) :
outcome = unique_outcomes[0]
elif ( len(unique_outcomes) > 1 ) :
outcome = "Conflicting"
else :
print("Warning: Something is wrong", cid, unique_outcomes)
outcomes[ cid ] = outcome
df_outcomes = pd.DataFrame.from_dict(outcomes, orient='index', columns=['PUBCHEM_ACTIVITY_OUTCOME']).reset_index(names='PUBCHEM_CID')
return(df_outcomes)
This line of code applies the get_cid_outcome function to the rat HTS data to generate a DataFrame that summarizes the activity outcome for each unique compound.
df_cid_outcome_rat = get_cid_outcome(df_aid_rat)
df_cid_outcome_rat.head()
PUBCHEM_CID | PUBCHEM_ACTIVITY_OUTCOME | |
---|---|---|
0 | 12850184 | Inactive |
1 | 89753 | Inactive |
2 | 9403 | Inconclusive |
3 | 13218779 | Inactive |
4 | 142766 | Active |
df_cid_outcome_rat.shape
(8099, 2)
Now let’s merge the aggregated rat compound data (containing mean potency and efficacy) with the summarized compound outcomes.
df_aid_rat_cid = df_aid_rat_cid.merge(df_cid_outcome_rat, on='PUBCHEM_CID', how='left')
df_aid_rat_cid.head()
PUBCHEM_CID | Potency-Replicate_1 | Efficacy-Replicate_1 | PUBCHEM_ACTIVITY_OUTCOME | |
---|---|---|---|---|
0 | 4 | NaN | NaN | Inactive |
1 | 6 | 4.6248 | 118.2025 | Active |
2 | 11 | NaN | NaN | Inactive |
3 | 13 | NaN | NaN | Inactive |
4 | 33 | NaN | NaN | Inactive |
This cell counts the occurrences of each unique ‘PUBCHEM_ACTIVITY_OUTCOME’ in the aggregated rat compound data.
df_aid_rat_cid['PUBCHEM_ACTIVITY_OUTCOME'].value_counts()
PUBCHEM_ACTIVITY_OUTCOME
Inactive 5745
Active 1383
Inconclusive 690
Conflicting 281
Name: count, dtype: int64
Aggregating substance data for humans#
Now let’s repeat the data aggregation process, this time for the human thyroid hormone receptor HTS data.
Group the human HTS data by ‘PUBCHEM_CID’ and store the mean efficacy and potency values in a data frame called ‘df_aid_human_cid’.
# Write your own code
Show a first few line of the df_aid_human_cid
data frame.
# Write your own code
Use the get_cid_outcome
function to the human HTS data to generate a DataFrame that summarizes the activity outcome for each unique compound.
# Write your own code
This command displays the first few rows of the DataFrame containing the summarized compound outcomes for the human data.
# Write your own code
This code merges the aggregated human compound data with the summarized compound outcomes.
# Write your own code
This command displays the first few rows of the merged human compound DataFrame.
# Write your own code
Use .value_counts()
the occurrences of each unique ‘PUBCHEM_ACTIVITY_OUTCOME’ in the aggregated human compound data.
# Write your own code
Merging rat data and human data#
This subsection focuses on combining the aggregated compound-level data from both the rat and human HTS assays to facilitate a direct comparison.
This command displays the column names of the aggregated rat compound DataFrame.
df_aid_rat_cid.columns
Let’s rename the columns in the rat compound DataFrame to have consistent naming with the human compound DataFrame before merging.
df_aid_rat_cid = df_aid_rat_cid.rename(columns={'Potency-Replicate_1':'Potency' , 'Efficacy-Replicate_1':'Efficacy', })
This command displays the first three rows of the renamed rat compound DataFrame to verify the column name changes.
df_aid_rat_cid.head(3)
Now merge the aggregated rat and human compound DataFrames based on ‘PUBCHEM_CID’, keeping only the compounds that are present in both datasets.
df_aid_merged_cid = df_aid_rat_cid.merge(df_aid_human_cid, on='PUBCHEM_CID', how='inner', suffixes=('_rat','_human'))
Let’s check if the merged data frame contains desired information.
df_aid_merged_cid.shape
df_aid_merged_cid
This cell counts the occurrences of each unique ‘PUBCHEM_ACTIVITY_OUTCOME’ for the rat data within the merged DataFrame.
df_aid_merged_cid['PUBCHEM_ACTIVITY_OUTCOME_rat'].value_counts()
This cell counts the occurrences of each unique ‘PUBCHEM_ACTIVITY_OUTCOME’ for the human data within the merged DataFrame.
df_aid_merged_cid['PUBCHEM_ACTIVITY_OUTCOME_human'].value_counts()
Compound-level comparison#
Let’s compare the bioactivity of the same compounds between the rat and human thyroid hormone receptors.
The following code cell generates two box plots to compare the distribution of efficacy for different activity outcomes between the rat and human data in the merged dataset.
fig, axes = plt.subplots(1, 2, figsize=(10,4), sharey=True)
hue_order = ['Active', 'Inactive', 'Conflicting', 'Inconclusive']
sns.boxplot(data=df_aid_merged_cid, x='Efficacy_rat', y='PUBCHEM_ACTIVITY_OUTCOME_rat', hue='PUBCHEM_ACTIVITY_OUTCOME_rat', hue_order=hue_order, ax=axes[0])
sns.boxplot(data=df_aid_merged_cid, x='Efficacy_human', y='PUBCHEM_ACTIVITY_OUTCOME_human', hue='PUBCHEM_ACTIVITY_OUTCOME_human', hue_order=hue_order, ax=axes[1])
axes[0].set_xlabel("Efficacy (%, rat)")
axes[1].set_xlabel("Efficacy (%, human)")
axes[0].set_ylabel("PUBCHEM_ACTIVITY_OUTCOME")
fig.tight_layout()
Generate two box plots to compare the distribution of potency for different activity outcomes between the rat and human data in the merged dataset.
# Write your own code
This code creates two scatter plots to visualize the relationship between potency and efficacy for the rat and human data in the merged dataset.
fig, axes = plt.subplots(1, 2, figsize=(10,4), sharey=False)
color_dict = {"Active": "red", "Inconclusive": "purple", "Conflicting": "blue", "Inactive": "green"}
sns.scatterplot(data=df_aid_merged_cid, x='Potency_rat', y='Efficacy_rat', hue='PUBCHEM_ACTIVITY_OUTCOME_rat', ax=axes[0], palette=color_dict)
sns.scatterplot(data=df_aid_merged_cid, x='Potency_human', y='Efficacy_human', hue='PUBCHEM_ACTIVITY_OUTCOME_human', ax=axes[1], palette=color_dict)
axes[0].set_xlabel("Potency (micro-M, rat)")
axes[1].set_xlabel("Potency (micro-M, human)")
axes[0].set_ylabel("Efficacy (%, rat)")
axes[1].set_ylabel("Efficacy (%, human)")
# Uncomment the next three lines to limit the x and y axes to 200.
#for ax in axes : # loop over each panel
# ax.set_xlim(0, 200)
# ax.set_ylim(0, 200)
The above graph output is challenging to interpret. Another useful tool for visualizing the data is a heatmap. In a heatmap, each cell’s color corresponds to the magnitude of a value. This makes it easier to see patterns, trends or correllations at a glance. They are particularly useful for comparing multiple variable (such as potency or efficacy in differen species) because you can quickly see clusters of high or low values.
Now we will generate a heatmap that shows the counts of rat and human activity outcome pairs. (For example, how many chemicals are active in both species, and how many chemicals are active in one species and inactive in the other?).
To do that we want to get the outcome pair count information from the df_aid_merged_cid
data frame.
outcome_counts = pd.DataFrame(df_aid_merged_cid.groupby(['PUBCHEM_ACTIVITY_OUTCOME_human', 'PUBCHEM_ACTIVITY_OUTCOME_rat']).size(), columns=['Count']).reset_index()
outcome_counts
Then, use .pivot()
to change the ‘outcome_counts’ DataFrame into a square matrix form, where the rows are human activity outcomes, the columns are rat activity outcomes, and the values are the counts of compounds.
heatmap_data = outcome_counts.pivot(index='PUBCHEM_ACTIVITY_OUTCOME_human', columns='PUBCHEM_ACTIVITY_OUTCOME_rat', values='Count')
heatmap_data
Generate a heatmap to visualize the activity outcome pair counts.
sns.heatmap(heatmap_data, cmap="flare", annot=True, fmt='4.0f')
This code creates two scatter plots to directly compare the efficacy and potency of compounds between the rat and human assays.
fig, axes = plt.subplots(1, 2, figsize=(10,4), sharey=False)
hue_order = ['Active', 'Inactive', 'Inconclusive', 'Conflicting']
sns.scatterplot(data=df_aid_merged_cid, x='Efficacy_rat', y='Efficacy_human', hue='PUBCHEM_ACTIVITY_OUTCOME_rat', hue_order=hue_order, ax=axes[0], legend=None)
sns.scatterplot(data=df_aid_merged_cid, x='Potency_rat', y='Potency_human', hue='PUBCHEM_ACTIVITY_OUTCOME_rat', hue_order=hue_order, ax=axes[1])
axes[0].set_xlabel("Efficacy (%, rat)")
axes[0].set_ylabel("Efficacy (%, human)")
axes[0].set_title("Efficacy")
axes[1].set_xlabel("Potency (micro-M, rat)")
axes[1].set_ylabel("Potency (micro-M, human)")
axes[1].set_title("Potency")
axes[1].legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0)
# Uncomment to limit the axes to 200.
#for ax in axes : # loop over each panel
# ax.set_xlim(0, 200)
# ax.set_ylim(0, 200)
In the above graphs, it is not easy to see if there is any correlation between rat and human activities (in terms of the efficacy and potency).
Let’s focus only on the active compounds in both the rat and human assays.
df_aid_merged_cid_active = df_aid_merged_cid[(df_aid_merged_cid['PUBCHEM_ACTIVITY_OUTCOME_rat'] == 'Active') & (df_aid_merged_cid['PUBCHEM_ACTIVITY_OUTCOME_human'] == 'Active')]
This code generates two scatter plots to compare the efficacy and potency of the compounds that were active in both rat and human assays.
fig, axes = plt.subplots(1, 2, figsize=(10,4), sharey=False)
sns.scatterplot(data=df_aid_merged_cid_active, x='Efficacy_rat', y='Efficacy_human', ax=axes[0])
sns.scatterplot(data=df_aid_merged_cid_active, x='Potency_rat', y='Potency_human', ax=axes[1])
axes[0].set_title("Efficacy")
axes[0].set_xlabel("Efficacy (%, rat)")
axes[0].set_ylabel("Efficacy (%, human)")
axes[1].set_title("Potency")
axes[1].set_xlabel("Potency (micro-M, rat)")
axes[1].set_ylabel("Potency (micro-M, human)")
res_efficacy = linregress(df_aid_merged_cid_active['Efficacy_rat'], df_aid_merged_cid_active['Efficacy_human'])
res_potency = linregress(df_aid_merged_cid_active['Potency_rat'], df_aid_merged_cid_active['Potency_human'])
print(f" Efficacy\tPotency")
print(f"Slope : {res_efficacy.slope:4.4f}\t{res_potency.slope:<4.4f}")
print(f"intercept : {res_efficacy.intercept:4.4f}\t{res_potency.intercept:4.4f}")
print(f"r-value : {res_efficacy.rvalue:4.4f}\t{res_potency.rvalue:4.4f}")
print(f"p-value : {res_efficacy.pvalue:4.4f}\t{res_potency.pvalue:4.4f}")
print(f"std-err (slope) : {res_efficacy.stderr:4.4f}\t{res_potency.stderr:4.4f}")
print(f"std-err (intercept) : {res_efficacy.intercept_stderr:4.4f}\t{res_potency.intercept_stderr:4.4f}")
When we consider the compounds that are active in both species, there appears to be a positive correlation between the rat and human data. However, there are only six compounds active in both species, we cannot confidently say there is a definitive relationship.
Homework
Answer the following questions, based on the analysis of the compounds tested in both rat and human thyroid hormone receptors (i.e., the compounds in the df_aid_merged_cid
data frame).
(1) How many compounds were tested against both rat thyroid hormone receptor?
# Provide your answer here
(2) Among the compounds tested in both species, how many were active against rat thyroid hormone receptor?
# Provide your answer here
(3) Among the compounds active in rats (i.e., those in (2)), how many were active against human thyroid hormone receptor?
# Provide your answer here
(4) Read the following paper (open access) on the concordance of the toxicity of drugs between animals and humans and summarize its main conclusion.
Limitations of Animal Studies for Predicting Toxicity in Clinical Trials: Is it Time to Rethink Our Current Approach?
Van Norman GA.
JACC Basic Transl Sci. 2019 Nov 25;4(7):845-854.
doi: 10.1016/j.jacbts.2019.10.008. PMID: 31998852; PMCID: PMC6978558.
# Write your summary in this markdown cell: