Exploratory Data Analysis 1:

Concordance of Thyroid Hormone Receptor Activity in Rats and Humans#

In the preclinical stage of drug development, drug candidates are tested against in non-human species. One of the commonly used animals in the preclinical stage is rodents (e.g., mice and rats).
The underlying assumption here is that the bioactivity (e.g., toxicity and efficacy) of a drug candidate in animals is similar to its activity in humans.

In this notebook, we will look into the validity of this assumption by comparing the bioactivity data available in PubChem for human and rat thyroid hormone receptor.

Let’s import necessary libraries.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
from scipy.stats import linregress
#import time
#import json

Checking the amount of available data#

This section focuses on assessing the availability of the bioactivity data for the thyroid hormone receptor in both human and rat models. The data sets were downloaded (in August 2025) from the PubChem Protein summary pages for protein assesions NP_001257783 (rats) and P10828 (humans).

  • NP_001257783 (rats)
    https://pubchem.ncbi.nlm.nih.gov/protein/NP_001257783#section=Chemicals-and-Bioactivities

  • P10828 (humans)
    https://pubchem.ncbi.nlm.nih.gov/protein/P10828#section=Chemicals-and-Bioactivities

The names of the downloaded files are:

Loading data for rat thyroid hormone receptor#

The following cells load and provide an initial exploration of the bioactivity data for the rat thyroid hormone receptor.

This cell defines the data type for the “COMPOUND_CID” column to ensure accurate data loading and to prevent memory-related issues.

mydtype = {"Compound_CID" : "Int64"}
#mydtype = {"Compound_CID" : "Int64", "pmid":"Int64", "taxid":"object", "targettaxid":"object"}

Now read a CSV file containing bioactivity data for the rat thyroid hormone receptorinto a pandas DataFrame and then displays the dimensions of the DataFrame.

df_rat = pd.read_csv('pubchem_protacxn_NP_001257783_bioactivity_protein.zip', compression='zip', dtype=mydtype,low_memory=False)
df_rat.shape

Display a first few rows of the rat bioactivity DataFrame to get a quick overview of the data’s structure and content.

df_rat.head()
#df_rat.columns    # Uncomment to check the column names.

Loading data for human thyroid hormone receptor#

Now read a CSV file containing bioactivity data for the human thyroid hormone receptor into a pandas DataFrame, called df_human, and then displays the dimensions of the DataFrame.

# Write your code here

Displays a first few rows of the new data frame.

# Write your code here
#df_human.columns    # Uncomment to check the column names.

Checking the amount of available data#

Here, we compare the types of bioactivity data available for both rat and human thyroid hormone receptors.

This cell counts the occurrences of each unique value in the ‘Activity_Type’ column of the rat bioactivity DataFrame.

df_rat['Activity_Type'].value_counts(dropna=False)

Now get the occurrences of each unique value in the ‘Activity_Type’ column of the human bioactivity DataFrame.

# Write your own code here

Let’s generate a pair of bar charts to visually compare the distribution of different activity types in the rat and human bioactivity datasets.

fig, axes = plt.subplots(1, 2, figsize=(10,4), sharey=True)
sns.countplot(data=df_rat, y='Activity_Type', ax=axes[0])
sns.countplot(data=df_human, y='Activity_Type', ax=axes[1])

axes[0].set_xlabel("Number of Data Points (rat)")
axes[1].set_xlabel("Number of Data Points (human)")

for ax in axes :    # loop over each chart.

    ax.set_xscale('log')
    
    for container in ax.containers:
        ax.bar_label(container)

fig.tight_layout()

To compare the small-molecule bioactivity between rats and humans, we need to use the same activity measures. The rat data set has only the potency values, while the human data has other types of activities (e.g., EC50, IC50, Ki, Kd, …). Because the potency is the only activity type available for both species, we will use the potency data for comparison.

Checking the available bioassays#

This section identifies the unique bioassays in which the rat and human thyroid hormone receptor were targeted.

First, let’s get the number of unique AIDs for rat thyroid hormone receptor.

df_rat['BioAssay_AID'].nunique()

Let’s get the number of data points for rat thyroid hormone receptor.

df_rat['BioAssay_AID'].value_counts()

Now check the number of unique assays and the number of data points for each unique assay for human thyroid hormone receptor.

# Write your own code
# Write your own code

There are more than one hundred assays for human thyroid hormone receptor. However, we are interested in the assays with the potency values, because only the potency data is available for both rats and humans. Therefore, let’s focus on them.

df_rat[ df_rat['Activity_Type'] == 'Potency'][['BioAssay_AID', 'BioAssay_Name']].value_counts()
df_human[ df_human['Activity_Type'] == 'Potency'][['BioAssay_AID', 'BioAssay_Name']].value_counts()

Considering the title and the data point counts for the returned assays, the three assays for rat thyroid hormone receptor are high-throughput screening (HTS) assays. Therefore, we also need HTS assays for human thyroid hormone receptor.

In the remaining part of this notebook, we will use the qHTS assays for antagonists of rat and human thyroid hormone receptors.

They were chosen based on the following considerations.

  • The four assays for human thyroid hormone receptors with more than 1000 data points (i.e., 1469, 1479, 588547, 588545) are HTS screening.

  • Strictly speaking, AIDs 1469 and 1479 are not targeting the human thyroid hormone receptor, but its interaction with steroid receptor coregulator.

  • At the summary page of AID 743067, it is explained that the assay summarizes from an confirmatory antagonist assay (AID 743065) and a cell viability counter screening (AID 743064).

  • More active compounds were identified in the antagonist HTS assays for both species (AIDs 743065 and 588547), compared to the agonist assays (AIDs 743066 and 588545).

Comparing HTS screening data for Rat and human#

The following section uses the quantitative HTS (qHTS) data for thyroid hormone receptor antagonsts. The data files were downlonaded from their BioAssay record pages:

The names of the downloaded files are:

Loading the HTS screening data for Rats#

Now let’s load the HTS data for the rat thyroid hormone receptor.

  • The data type for the PUBCHEM_CID column is explicitly specified to ensure it is read as an integer, which is important for accurate data handling.

  • The first 5 rows are skipped because they do not contain data, but meta data (e.g., the description of each column).

mydtype = {'PUBCHEM_CID':'Int64' }
df_aid_rat = pd.read_csv('AID_743065_datatable_all.zip', compression='zip', skiprows=[1,2,3,4,5], dtype=mydtype, low_memory=False)
df_aid_rat.head()
df_aid_rat.shape

This data frame has many columns that don’t have any values. The following cell removes those columns and displays the updated dimensions.

df_aid_rat = df_aid_rat.dropna(how='all', axis=1)
df_aid_rat.shape

Loading the HTS data for humans#

Now read the CSV file for the human HTS data into a dataframe called df_aid_human. Make sure to skip the initial descriptive rows and set the data type for ‘PUBCHEM_CID’ to ‘Int64’.

# Write your own code here.

Check a first few lines of the new data frame.

# Write your own code here.

Displays the dimensions of the human HTS DataFrame.

# Write your own code here.

The following cell Remove the columns containing no values and displays the updated dimensions.

# Write your own code here.

Substance-level analysis#

This section compares the HTS data for rats and humans at the substance level

#df_aid_rat.columns    # Uncomment this line if you want to see what columns the data frame has.
#df_aid_human.columns

Activity outcomes and phenotypes#

This part of the analysis focuses on comparing the activity outcomes and observed phenotypes from the HTS data for both species.

First, let’s generate two bar charts to visually compare the distribution of PUBCHEM_ACTIVITY_OUTCOME in the rat and human HTS datasets.

fig, axes = plt.subplots(1, 2, figsize=(10,4), sharey=True)
sns.countplot(data=df_aid_rat, y='PUBCHEM_ACTIVITY_OUTCOME', ax=axes[0])
sns.countplot(data=df_aid_human, y='PUBCHEM_ACTIVITY_OUTCOME', ax=axes[1])

axes[0].set_xlabel("Number of Data Points (rat)")
axes[1].set_xlabel("Number of Data Points (human)")

for ax in axes :    # loop over each chart.

    for container in ax.containers:
        ax.bar_label(container)

fig.tight_layout()
  • The majority of the data points in the two data sets are declared in Inactive.

  • There are 1880 Active data points in the rat data set and 12 Active data points in the human data set.

Both data sets have a column containing information on the observed characteristics of the chemicals tested in the assays. These columns are:

  • Phenotype-Replicate_1 for rats

  • Phenotype for humans

It is worth mentioning that the column for rats is suffixed with -Replicate_1. The rat data set contains data from repeated assay experiments, which are indicated as Replicate_1, Replicate_2, Replicate_3, and so on. For simplicity, we will use the data for -Replicate_1, but if we were to perform a stricter analysis, we should consider data from all replicates.

Now let’s create two bar charts to compare the distribution of phenotypes (‘Phenotype-Replicate_1’ for rat and ‘Phenotype’ for human) across the two HTS datasets.

# Write your own code

To find the relationship between the activity outcomes and phenotypes, let’s count the occurrences of each unique combination of ‘PUBCHEM_ACTIVITY_OUTCOME’ and ‘Phenotype-Replicate_1’ in the rat HTS data.

df_aid_rat[['PUBCHEM_ACTIVITY_OUTCOME', 'Phenotype-Replicate_1' ]].value_counts()

.groupby() and .size() can be used to get the same information (but sorted differently).

df_aid_rat.groupby(by=['PUBCHEM_ACTIVITY_OUTCOME', 'Phenotype-Replicate_1']).size()

Now counts the occurrences of each unique combination of ‘PUBCHEM_ACTIVITY_OUTCOME’ and ‘Phenotype’ in the human HTS data.

df_aid_human[['PUBCHEM_ACTIVITY_OUTCOME', 'Phenotype' ]].value_counts()

Get the same information using .groupby() and .size()

df_aid_human.groupby(by=['PUBCHEM_ACTIVITY_OUTCOME', 'Phenotype']).size()

There are several noteworthy things.

  • The objective of the two assays considered here (i.e., AID 743065 and AID 588547) was to identify antagonists (i.e., inhibitors) of thyroid hormone receptor. With that said, the chemicals with the Activitor phenotype (i.e., agonist) are not declared as Active because the activator identification was not the objective of the assays. They are declared as Inactive in the rat data set and as Inconclusive in the human data set. Note that the two assays are different experiments and whether to view activators as inactive or inconclusive is up to those who performed the experiment.

  • In the human data set, the compounds with the “inconclusive” outcomes have additional phenotypes, including Cytotoxic, Flurescent, Inconclusive, Inhibitor and Activator.

    • cytotoxic: some chemicals are so toxic that they kill cells.

    • fluorescent: some assays measure the activity of a chemical against the target protein by measuring the change in the fluorescence intensity caused by the protein-ligand interaction. However, some chemicals can absorb light energy and emit light through flourescence even without interaction with the target proteins.

    • inconclusive or inhibitor: some chemicals did not show a clear dose-response curve to compute the quantitative inhibitory activities (e.g., efficacy and potency), or did not show enough inhibitory activity.

Efficacy#

In this section, we compare the efficacy data from the HTS assays for both rat and human thyroid hormone receptors.

First, let’s generate two histograms to visualize and compare the distribution of efficacy values (‘Efficacy-Replicate_1’ for rat and ‘Efficacy’ for human) for different activity outcomes.

fig, axes = plt.subplots(1, 2, figsize=(10,4), sharey=True)

sns.histplot(data=df_aid_rat, x='Efficacy-Replicate_1', hue='PUBCHEM_ACTIVITY_OUTCOME',ax=axes[0])
sns.histplot(data=df_aid_human, x='Efficacy', hue='PUBCHEM_ACTIVITY_OUTCOME', ax=axes[1])

axes[0].set_xlabel("Efficacy (%, rat)")
axes[1].set_xlabel("Efficacy (%, human)")

fig.tight_layout()

This time, visualize the same data using strip plot.

fig, axes = plt.subplots(1, 2, figsize=(10,4), sharey=True)

hue_order = ['Active',  'Inconclusive', 'Inactive']

sns.stripplot(data=df_aid_rat, x='Efficacy-Replicate_1', y='PUBCHEM_ACTIVITY_OUTCOME', hue='PUBCHEM_ACTIVITY_OUTCOME', hue_order=hue_order, ax=axes[0])
sns.stripplot(data=df_aid_human, x='Efficacy', y='PUBCHEM_ACTIVITY_OUTCOME', hue='PUBCHEM_ACTIVITY_OUTCOME', hue_order=hue_order, ax=axes[1])

axes[0].set_xlabel("Efficacy (micro-M, rat)")
axes[1].set_xlabel("Efficacy (micro-M, human)")

You can use box plots to visualize the minimum, maximum, median, and outliers of the efficacy values.

fig, axes = plt.subplots(1, 2, figsize=(10,4), sharey=True)

hue_order = ['Active',  'Inconclusive', 'Inactive']

sns.boxplot(data=df_aid_rat, x='Efficacy-Replicate_1', y='PUBCHEM_ACTIVITY_OUTCOME', hue='PUBCHEM_ACTIVITY_OUTCOME', hue_order=hue_order, ax=axes[0])
sns.boxplot(data=df_aid_human, x='Efficacy', y='PUBCHEM_ACTIVITY_OUTCOME', hue='PUBCHEM_ACTIVITY_OUTCOME', hue_order=hue_order, ax=axes[1])

axes[0].set_xlabel("Efficacy (micro-M, rat)")
axes[1].set_xlabel("Efficacy (micro-M, human)")

The above histograms, strip plots, and box plots indicate that the assay against the human thyroid hormone receptor did not determine efficacy values for inactive compounds.

Potency#

Now visualize the potency of the tested chemicals on the rat and human thyroid hormone receptors.

Generate two histograms to visualize and compare the distribution of potency values for different activity outcomes in the rat and human HTS datasets.

# Write your own code

Generate two strip plots to visualize the potency values for different activity outcomes in the rat and human HTS datasets.

# Write your own code

Generate two box plots to visualize the potency data for different activity outcomes in the rat and human HTS datasets.

# Write your own code

Compute the mean, median, minimum, maximum of the efficacy and potency values for the different outcomes of the rat data set.

df_aid_rat.groupby('PUBCHEM_ACTIVITY_OUTCOME')[['Efficacy-Replicate_1', 'Potency-Replicate_1']].mean()
df_aid_rat.groupby('PUBCHEM_ACTIVITY_OUTCOME')[['Efficacy-Replicate_1', 'Potency-Replicate_1']].median()
df_aid_rat.groupby('PUBCHEM_ACTIVITY_OUTCOME')[['Efficacy-Replicate_1', 'Potency-Replicate_1']].min()
df_aid_rat.groupby('PUBCHEM_ACTIVITY_OUTCOME')[['Efficacy-Replicate_1', 'Potency-Replicate_1']].max()

Compute the mean, median, minimum, maximum of the efficacy and potency values for the different outcomes of the human data set.

# Write your own code for mean
# Write your own code for median
# Write your own code for minimum
# Write your own code for maximum

Efficacy vs Potency#

This section explores the relationship between the efficacy and potency of the tested compounds for both the rat and human thyroid hormone receptors.

The following code creates two scatter plots, one for the rat data and one for the human data, to visualize the relationship between potency and efficacy, with points colored by their activity outcome.

fig, axes = plt.subplots(1, 2, figsize=(10,4), sharey=False)

hue_order = ['Active',  'Inconclusive', 'Inactive']

sns.scatterplot(data=df_aid_rat, x='Potency-Replicate_1', y='Efficacy-Replicate_1', hue='PUBCHEM_ACTIVITY_OUTCOME', hue_order=hue_order, ax=axes[0])
sns.scatterplot(data=df_aid_human, x='Potency', y='Efficacy', hue='PUBCHEM_ACTIVITY_OUTCOME', hue_order=hue_order, ax=axes[1])

axes[0].set_xlabel("Potency (micro-M, rat)")
axes[1].set_xlabel("Potency (micro-M, human)")

axes[0].set_ylabel("Efficacy (%, rat)")
axes[1].set_ylabel("Efficacy (%, human)")

# Uncomment next three lines to limit the x-axis to 200.
#for ax in axes :    # loop over each panel
#    ax.set_xlim(0, 200)
#    ax.set_ylim(0, 200)

Compound-level Analysis#

In PubChem, substances are chemical records provided by individual data contributors and compounds are unique chemical structures extracted from substance records.
In many assays, substances can be viewed as “samples” of chemicals. A chemical can be tested in multiple samples containing that chemical in a single assay. The samples are not necessary the same as each other. For example, they may be different in terms of its concentration, temperature, solvent used, etc.
Therefore, if you want to determine the activity of a chemical, you may need to collapse the data from multiple samples (i.e., substances) into a single data point for the chemical.

The code cells below checks how many unique compounds in the data set and how many of them are associated with multiple substances.

df_rat_cidcount = pd.DataFrame(df_aid_rat['PUBCHEM_CID'].value_counts()).reset_index()
df_human_cidcount = pd.DataFrame(df_aid_human['PUBCHEM_CID'].value_counts()).reset_index()
print("#-- Rat data set")
print("# CIDs in total                      :", len(df_rat_cidcount) )
print("# CIDs associated with multiple SIDs :", len(df_rat_cidcount[df_rat_cidcount['count'] > 1]) )
print("------------------------------------------------")
print("#-- Human data set")
print("# CIDs in total                      :", len(df_human_cidcount) )
print("# CIDs associated with multiple SIDs :", len(df_human_cidcount[df_human_cidcount['count'] > 1]) )
fig, axes = plt.subplots(1, 2, figsize=(10,4), sharey=False)

sns.countplot(data=df_rat_cidcount, x='count',  ax=axes[0])
sns.countplot(data=df_human_cidcount, x='count',  ax=axes[1])

axes[0].set_xlabel("Number of SIDs per CID")
axes[1].set_xlabel("Number of SIDs per CID")

axes[0].set_ylabel("Number of CIDs")
axes[1].set_ylabel("Number of CIDs")

axes[0].set_title("Rat")
axes[1].set_title("Human")

Aggregating substance data rats#

Now we aggregate the substance data for each unique compound (identified by its CID).

This code groups the rat HTS data by ‘PUBCHEM_CID’ and store them in a data frame with three columns (PUBCHEM_CID, Potency-Replicate_1, and Efficacy-Replicate_1). Note that some CIDs are associated with multiple SIDs, and hence, multiple activity values. For those cases, we are going to take the average of the multiple activity values.

df_aid_rat_cid = df_aid_rat.groupby('PUBCHEM_CID').mean(numeric_only=True).reset_index()
df_aid_rat_cid = df_aid_rat_cid[['PUBCHEM_CID', 'Potency-Replicate_1', 'Efficacy-Replicate_1']]
df_aid_rat_cid.head()

This code defines a function, get_cid_outcome, that determines a single activity outcome for each unique compound ID (CID) in a given DataFrame. If a compound has multiple, conflicting outcomes recorded, it is labeled as “Conflicting”.

def get_cid_outcome(df) :

    outcomes = {}

    for cid in df['PUBCHEM_CID'].dropna().unique() :   # loop over unique CIDs (without Null)

        outcome =""
    
        unique_outcomes = df[ df['PUBCHEM_CID'] == cid ]['PUBCHEM_ACTIVITY_OUTCOME'].unique()
    
        if ( len(unique_outcomes) == 1 ) :
            outcome = unique_outcomes[0]
        elif ( len(unique_outcomes) > 1 ) :
            outcome = "Conflicting"
        else :
            print("Warning: Something is wrong", cid, unique_outcomes)

        outcomes[ cid ] = outcome

    df_outcomes = pd.DataFrame.from_dict(outcomes, orient='index', columns=['PUBCHEM_ACTIVITY_OUTCOME']).reset_index(names='PUBCHEM_CID')

    return(df_outcomes)

This line of code applies the get_cid_outcome function to the rat HTS data to generate a DataFrame that summarizes the activity outcome for each unique compound.

df_cid_outcome_rat = get_cid_outcome(df_aid_rat)
df_cid_outcome_rat.head()

Now let’s merge the aggregated rat compound data (containing mean potency and efficacy) with the summarized compound outcomes.

df_aid_rat_cid = df_aid_rat_cid.merge(df_cid_outcome_rat, on='PUBCHEM_CID', how='left')
df_aid_rat_cid.head()

This cell counts the occurrences of each unique ‘PUBCHEM_ACTIVITY_OUTCOME’ in the aggregated rat compound data.

df_aid_rat_cid['PUBCHEM_ACTIVITY_OUTCOME'].value_counts()

Aggregating substance data for humans#

Now let’s repeat the data aggregation process, this time for the human thyroid hormone receptor HTS data.

Group the human HTS data by ‘PUBCHEM_CID’ and store the mean efficacy and potency values in a data frame called ‘df_aid_human_cid’.

# Write your own code

Show a first few line of the df_aid_human_cid data frame.

# Write your own code

Use the get_cid_outcome function to the human HTS data to generate a DataFrame that summarizes the activity outcome for each unique compound.

# Write your own code

This command displays the first few rows of the DataFrame containing the summarized compound outcomes for the human data.

# Write your own code

This code merges the aggregated human compound data with the summarized compound outcomes.

# Write your own code

This command displays the first few rows of the merged human compound DataFrame.

# Write your own code

Use .value_counts() the occurrences of each unique ‘PUBCHEM_ACTIVITY_OUTCOME’ in the aggregated human compound data.

# Write your own code

Merging rat data and human data#

This subsection focuses on combining the aggregated compound-level data from both the rat and human HTS assays to facilitate a direct comparison.

This command displays the column names of the aggregated rat compound DataFrame.

df_aid_rat_cid.columns

Let’s rename the columns in the rat compound DataFrame to have consistent naming with the human compound DataFrame before merging.

df_aid_rat_cid = df_aid_rat_cid.rename(columns={'Potency-Replicate_1':'Potency' , 'Efficacy-Replicate_1':'Efficacy', })

This command displays the first three rows of the renamed rat compound DataFrame to verify the column name changes.

df_aid_rat_cid.head(3)

Now merge the aggregated rat and human compound DataFrames based on ‘PUBCHEM_CID’, keeping only the compounds that are present in both datasets.

df_aid_merged_cid = df_aid_rat_cid.merge(df_aid_human_cid, on='PUBCHEM_CID', how='inner', suffixes=('_rat','_human'))

Let’s check if the merged data frame contains desired information.

df_aid_merged_cid.shape
df_aid_merged_cid

This cell counts the occurrences of each unique ‘PUBCHEM_ACTIVITY_OUTCOME’ for the rat data within the merged DataFrame.

df_aid_merged_cid['PUBCHEM_ACTIVITY_OUTCOME_rat'].value_counts()

This cell counts the occurrences of each unique ‘PUBCHEM_ACTIVITY_OUTCOME’ for the human data within the merged DataFrame.

df_aid_merged_cid['PUBCHEM_ACTIVITY_OUTCOME_human'].value_counts()

Compound-level comparison#

Let’s compare the bioactivity of the same compounds between the rat and human thyroid hormone receptors.

The following code cell generates two box plots to compare the distribution of efficacy for different activity outcomes between the rat and human data in the merged dataset.

fig, axes = plt.subplots(1, 2, figsize=(10,4), sharey=True)

hue_order = ['Active',  'Inactive', 'Conflicting', 'Inconclusive']

sns.boxplot(data=df_aid_merged_cid, x='Efficacy_rat', y='PUBCHEM_ACTIVITY_OUTCOME_rat', hue='PUBCHEM_ACTIVITY_OUTCOME_rat', hue_order=hue_order, ax=axes[0])
sns.boxplot(data=df_aid_merged_cid, x='Efficacy_human', y='PUBCHEM_ACTIVITY_OUTCOME_human', hue='PUBCHEM_ACTIVITY_OUTCOME_human', hue_order=hue_order, ax=axes[1])

axes[0].set_xlabel("Efficacy (micro-M, rat)")
axes[1].set_xlabel("Efficacy (micro-M, human)")

axes[0].set_ylabel("PUBCHEM_ACTIVITY_OUTCOME")

fig.tight_layout()

Generate two box plots to compare the distribution of potency for different activity outcomes between the rat and human data in the merged dataset.

# Write your own code

This code creates two scatter plots to visualize the relationship between potency and efficacy for the rat and human data in the merged dataset.

fig, axes = plt.subplots(1, 2, figsize=(10,4), sharey=False)

#hue_order = ['Active',  'Inconclusive', 'Inactive']

sns.scatterplot(data=df_aid_merged_cid, x='Potency_rat', y='Efficacy_rat', hue='PUBCHEM_ACTIVITY_OUTCOME_rat', ax=axes[0])
sns.scatterplot(data=df_aid_merged_cid, x='Potency_human', y='Efficacy_human', hue='PUBCHEM_ACTIVITY_OUTCOME_human', ax=axes[1])

axes[0].set_xlabel("Potency (micro-M, rat)")
axes[1].set_xlabel("Potency (micro-M, human)")

axes[0].set_ylabel("Efficacy (%, rat)")
axes[1].set_ylabel("Efficacy (%, human)")

# Uncomment the next three lines to limit the x and y axes to 200.
#for ax in axes :    # loop over each panel
#    ax.set_xlim(0, 200)
#    ax.set_ylim(0, 200)

Now we want to generate a heatmap that shows the counts of rat and human activity outcome pairs. (For example, how many chemicals are active in both species, and how many chemicals are active in one species and inactive in the other?).
To do that we want to get the outcome pair count information from the df_aid_merged_cid data frame.

outcome_counts = pd.DataFrame(df_aid_merged_cid.groupby(['PUBCHEM_ACTIVITY_OUTCOME_human', 'PUBCHEM_ACTIVITY_OUTCOME_rat']).size(), columns=['Count']).reset_index()
outcome_counts

Then, use .pivot() to change the ‘outcome_counts’ DataFrame into a square matrix form, where the rows are human activity outcomes, the columns are rat activity outcomes, and the values are the counts of compounds.

heatmap_data = outcome_counts.pivot(index='PUBCHEM_ACTIVITY_OUTCOME_human', columns='PUBCHEM_ACTIVITY_OUTCOME_rat', values='Count')
heatmap_data

Generate a heatmap to visualize the activity outcome pair counts.

sns.heatmap(heatmap_data, cmap="flare", annot=True, fmt='4.0f')

This code creates two scatter plots to directly compare the efficacy and potency of compounds between the rat and human assays.

fig, axes = plt.subplots(1, 2, figsize=(10,4), sharey=False)

hue_order = ['Active',  'Inactive', 'Inconclusive', 'Conflicting']

sns.scatterplot(data=df_aid_merged_cid, x='Efficacy_rat', y='Efficacy_human', hue='PUBCHEM_ACTIVITY_OUTCOME_rat', hue_order=hue_order, ax=axes[0], legend=None)
sns.scatterplot(data=df_aid_merged_cid, x='Potency_rat', y='Potency_human', hue='PUBCHEM_ACTIVITY_OUTCOME_rat', hue_order=hue_order, ax=axes[1])

axes[0].set_xlabel("Efficacy (%, rat)")
axes[0].set_ylabel("Efficacy (%, human)")
axes[0].set_title("Efficacy")

axes[1].set_xlabel("Potency (micro-M, rat)")
axes[1].set_ylabel("Potency (micro-M, human)")
axes[1].set_title("Potency")

axes[1].legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0)

# Uncomment to limit the axes to 200.
#for ax in axes :    # loop over each panel
#   ax.set_xlim(0, 200)
#    ax.set_ylim(0, 200)

In the above graphs, it is not easy to see if there is any correlation between rat and human activities (in terms of the efficacy and potency).

Let’s focus only on the active compounds in both the rat and human assays.

df_aid_merged_cid_active = df_aid_merged_cid[(df_aid_merged_cid['PUBCHEM_ACTIVITY_OUTCOME_rat'] == 'Active') & (df_aid_merged_cid['PUBCHEM_ACTIVITY_OUTCOME_human'] == 'Active')]

This code generates two scatter plots to compare the efficacy and potency of the compounds that were active in both rat and human assays.

fig, axes = plt.subplots(1, 2, figsize=(10,4), sharey=False)

sns.scatterplot(data=df_aid_merged_cid_active, x='Efficacy_rat', y='Efficacy_human', ax=axes[0])
sns.scatterplot(data=df_aid_merged_cid_active, x='Potency_rat', y='Potency_human', ax=axes[1])

axes[0].set_title("Efficacy")
axes[0].set_xlabel("Efficacy (%, rat)")
axes[0].set_ylabel("Efficacy (%, human)")

axes[1].set_title("Potency")
axes[1].set_xlabel("Potency (micro-M, rat)")
axes[1].set_ylabel("Potency (micro-M, human)")
res_efficacy = linregress(df_aid_merged_cid_active['Efficacy_rat'], df_aid_merged_cid_active['Efficacy_human'])
res_potency = linregress(df_aid_merged_cid_active['Potency_rat'], df_aid_merged_cid_active['Potency_human'])
print(f"                      Efficacy\tPotency")
print(f"Slope               : {res_efficacy.slope:4.4f}\t{res_potency.slope:<4.4f}")
print(f"intercept           : {res_efficacy.intercept:4.4f}\t{res_potency.intercept:4.4f}")
print(f"r-value             : {res_efficacy.rvalue:4.4f}\t{res_potency.rvalue:4.4f}")
print(f"p-value             : {res_efficacy.pvalue:4.4f}\t{res_potency.pvalue:4.4f}")
print(f"std-err (slope)     : {res_efficacy.stderr:4.4f}\t{res_potency.stderr:4.4f}")
print(f"std-err (intercept) : {res_efficacy.intercept_stderr:4.4f}\t{res_potency.intercept_stderr:4.4f}")

When we consider the compounds that are active in both species, there appears to be a positive correlation between the rat and human data. However, there are only six compounds active in both species, we cannot confidently say there is a definitive relationship.

Discussion#

Answer the following questions, based on the analysis of the compounds tested in both rat and human thyroid hormone receptors (i.e., the compounds in the df_aid_merged_cid data frame).

(1) How many compounds were tested against both rat thyroid hormone receptor?

# Provide your answer here -> 

(2) Among the compounds tested in both species, how many were active against rat thyroid hormone receptor?

# Provide your answer here -> 

(3) Among the compounds active in rats (i.e., those in (2)), how many were active against human thyroid hormone receptor?

# Provide your answer here -> 

(4) Read the following paper on the concordance of the toxicity of drugs between animals and humans and summarize its main conclusion (in 50-60 words).

Limitations of Animal Studies for Predicting Toxicity in Clinical Trials: Is it Time to Rethink Our Current Approach?
Van Norman GA.
JACC Basic Transl Sci. 2019 Nov 25;4(7):845-854.
doi: 10.1016/j.jacbts.2019.10.008. PMID: 31998852; PMCID: PMC6978558.

# Write your summary in this markdown cell: