6.1 Exploratory Data Analysis - Protein

6.1 Exploratory Data Analysis - Protein#

Exploratory Data Analysis 1:

Concordance of Thyroid Hormone Receptor Activity β in Rats and Humans#

In the preclinical stage of drug development, drug candidates are tested in non-human species. One of the commonly used animals in the preclinical stage is rodents (e.g., mice and rats).

The underlying assumption here is that the bioactivity (e.g., toxicity and efficacy) of a drug candidate in animals is similar to its activity in humans.

In this notebook, we will look into the validity of this assumption by comparing the bioactivity data available in PubChem for human and rat thyroid hormone receptor β.

Recall we looked at two PubChem protein summary pages for accessions NP_001257783 and P10828 which were for the Norway Rat and Human thyroid hormone receptor β proteins.

NP_001257783 https://pubchem.ncbi.nlm.nih.gov/protein/NP_001257783 (rat)
P10828 https://pubchem.ncbi.nlm.nih.gov/protein/P10828 (Human)

Let’s import necessary libraries.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
from scipy.stats import linregress
#import time
#import json

Checking the amount of available data#

This section focuses on assessing the availability of the bioactivity data for the thyroid hormone receptor in both human and rat models. The data sets were downloaded in August 2025 from the PubChem Protein summary pages for protein assesions NP_001257783 (rats) and P10828 (humans).

NP_001257783 (rats)
https://pubchem.ncbi.nlm.nih.gov/protein/NP_001257783#section=Chemicals-and-Bioactivities
P10828 (humans)
https://pubchem.ncbi.nlm.nih.gov/protein/P10828#section=Chemicals-and-Bioactivities

The names of the downloaded files are:

file_protein_rat = "pubchem_protacxn_NP_001257783_bioactivity_protein.zip"
file_protein_human = "pubchem_protacxn_P10828_bioactivity_protein.zip"

Note that the csv files are very large so we have provided the data in a compressed format. We will use Pandas Dataframes for our exploratory analysis, which can directly open a zip file, so there is no need to unzip them after you download them.

Loading data for rat thyroid hormone receptor#

The following cells load and provide an initial exploration of the bioactivity data for the rat thyroid hormone receptor.

This cell defines the data type for the “COMPOUND_CID” column to ensure accurate data loading and to prevent memory-related issues.

# Define CID data type for reading CSV files to optimize memory usage
mydtype = {"Compound_CID" : "Int64"}
#mydtype = {"Compound_CID" : "Int64", "pmid":"Int64", "taxid":"object", "targettaxid":"object"}

We can now read our zip files containing bioactivity data in csv format for the rat thyroid hormone receptor into a Pandas DataFrame and then display the dimensions of the DataFrame. Notice that the read_csv method allows for direct reading of compressed files directly by using the compression = 'zip' argument.

df_rat = pd.read_csv('pubchem_protacxn_NP_001257783_bioactivity_protein.zip', compression='zip', dtype=mydtype,low_memory=False)
df_rat.shape

(31458, 32)

Display a first few rows of the rat bioactivity DataFrame to get a quick overview of the data’s structure and content.

df_rat.head()

	Bioactivity_ID	Activity	BioAssay_AID	Substance_SID	Compound_CID	refsid	Gene_ID	PMID	Aid_Type	Last_Modified_Date	...	Representative_Protein_Accession	Taxonomy_ID	Cell_ID	Target_Taxonomy_ID	Anatomy_ID	Anatomy	dois	pmcids	pclids	citations
0	394010671	Active	743066	144203580	5920	NaN	24831	NaN	Confirmatory	20140820	...	NP_001257783	10116	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	394011094	Active	743066	144204055	92044441	NaN	24831	NaN	Confirmatory	20140820	...	NP_001257783	10116	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	394011748	Active	743066	144203868	5803	NaN	24831	NaN	Confirmatory	20140820	...	NP_001257783	10116	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	394012257	Active	743066	144204363	444795	NaN	24831	NaN	Confirmatory	20140820	...	NP_001257783	10116	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	394012468	Active	743066	144204799	57333099	NaN	24831	NaN	Confirmatory	20140820	...	NP_001257783	10116	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 32 columns

df_rat.columns    # Uncomment to check the column names.

Index(['Bioactivity_ID', 'Activity', 'BioAssay_AID', 'Substance_SID',
       'Compound_CID', 'refsid', 'Gene_ID', 'PMID', 'Aid_Type',
       'Last_Modified_Date', 'Has_Dose_Response_Curve', 'RNAi_BioAssay',
       'Protein_Accession', 'Activity_Type', 'Activity_Qualifier',
       'Activity_Value', 'Bioassay_Data_Source', 'BioAssay_Name',
       'Compound_Name', 'Target_Name', 'Target_Link', 'ECs',
       'Representative_Protein_Accession', 'Taxonomy_ID', 'Cell_ID',
       'Target_Taxonomy_ID', 'Anatomy_ID', 'Anatomy', 'dois', 'pmcids',
       'pclids', 'citations'],
      dtype='object')

Check your understanding

Loading data for human thyroid hormone receptor

In the next two code cells:

Read the CSV file containing bioactivity data for the human thyroid hormone receptor into a pandas DataFrame, called df_human and then display the dimensions of the DataFrame.
Display the first 5 rows of the dataframe.

# Write your code here to load human data into the datafram and display dimensions

# Write your code here to display the first 5 rows of the human dataframe

#df_human.columns    # Uncomment to check the column names.

Checking the type and amount of available data for each species#

Here, we compare the types of bioactivity data available for both rat and human thyroid hormone receptors.

This cell counts the occurrences of each unique value in the 'Activity_Type' column of the rat bioactivity DataFrame.

df_rat['Activity_Type'].value_counts(dropna=False)

Activity_Type
Potency    31458
Name: count, dtype: int64

Check your understanding

Get the occurrences of each unique value in the 'Activity_Type' column of the human bioactivity DataFrame in the next code cell.
Comment on the Activity types found in the rat vs. human data.

# Write your own code here

Python Dive: In the last few lines of code you used Pandas DataFrame methods to explore the data. Let’s take a closer look.

Explanation of the code

A method is a function bound to an object and uses the .method notation to attach it to the object.
Here the object is a Pandas DataFrame called df_human.

Some methods perform actions on the DataFrame and require parentheses with optional arguments, e.g. df_human.head(10) returns the first 10 rows.
Some properties are attributes rather than methods, so you access them without parentheses, e.g. df_human.shape returns the dimensions of the DataFrame.
df_human.shape → Returns a tuple (rows, columns) with dataset size.
Example: (5000, 20) means 5000 rows × 20 columns.
df_human.head() → Displays the first 5 rows (or more with head(n)).
Great for previewing structure and sample values.
df_human.columns → Lists the column names as an Index object.
Helps you see available variables for analysis.
df_human['Activity_Type'].value_counts(dropna=False) → Counts frequency of each unique value.
dropna=False ensures missing values (NaN) are included in the count.

Sometimes, visualizations can make patterns in the data easier to see. Let’s create bar charts to compare the distribution of activity types in the rat and human bioactivity datasets. The following code uses the Python Seaborn statistical data visualization package to create two side-by-side bar charts, one for the rat data and one for the human data. The sharey=True argument ensures both charts use the same y-axis scale for easier comparison.

fig, axes = plt.subplots(1, 2, figsize=(10,4), sharey=True)
sns.countplot(data=df_rat, y='Activity_Type', ax=axes[0])
sns.countplot(data=df_human, y='Activity_Type', ax=axes[1])

axes[0].set_xlabel("Number of Data Points (rat)")
axes[1].set_xlabel("Number of Data Points (human)")

for ax in axes :    # loop over each chart.

    ax.set_xscale('log')
    
    for container in ax.containers:
        ax.bar_label(container)

fig.tight_layout()

Check your understanding

What do you notice about the activity types available in the rat dataset compared to the human dataset?
If we want to compare small-molecule bioactivity between rats and humans, which activity type should we use, and why?

Reviewing the available bioassays#

Recall from the Bioactivity Primer we found there were 3 bioassays (AIDs 743065, 743066, 743067) associated with the rat thyroid receptor β: https://pubchem.ncbi.nlm.nih.gov/protein/NP_001257783#section=Chemicals-and-Bioactivities

We can now find the unique bioassays programmatically from the downloaded CSV files in our dataframe from both the rat and human receptors.

First, let’s get the number of unique AIDs for rat thyroid hormone receptor.

# the .nunique() function gives the number of unique values in a dataframe column
df_rat['BioAssay_AID'].nunique()

Let’s get the number of data points for rat thyroid hormone receptor for each assay. These will be the total tested substances in the assay regardless of active or inactive.

df_rat['BioAssay_AID'].value_counts()

BioAssay_AID
743066    10486
743065    10486
743067    10486
Name: count, dtype: int64

Check your understanding

Confirm that the totals above match what is in the bioassay information at Pubchem: https://pubchem.ncbi.nlm.nih.gov/protein/NP_001257783#section=Chemicals-and-Bioactivities
In the code cells below check the number of unique assays and the number of data points for the human thyroid hormone receptor beta (THRβ).

# Write your own code for number of unique assays in human receptor

# Write your own code to determine number of assay data points (tested compounds) for each unique assay in human receptor

Notice that there are more than one hundred assays for human THRβ. However, we are only interested in the assays with the potency values, because only the potency data is available for both rats and humans. Therefore, we will focus on those assays.

df_rat[df_rat['Activity_Type'] == 'Potency'][['BioAssay_AID', 'BioAssay_Name']].value_counts()

BioAssay_AID  BioAssay_Name                                                                                            
743065        qHTS assay to identify small molecule antagonists of the thyroid receptor (TR) signaling pathway             10486
743066        qHTS assay to identify small molecule agonists of the thyroid receptor (TR) signaling pathway                10486
743067        qHTS assay to identify small molecule antagonists of the thyroid receptor (TR) signaling pathway: Summary    10486
Name: count, dtype: int64

Python Dive: The above code is creating a Boolean mask to filter the DataFrame for rows where the 'Activity_Type' is 'Potency'. It then selects the 'BioAssay_AID' and 'BioAssay_Name' columns and counts the occurrences of each unique combination.

Explanation of the code

df_rat[df_rat['Activity_Type'] == 'Potency'][['BioAssay_AID', 'BioAssay_Name']].value_counts()

df_rat[df_rat['Activity_Type'] == 'Potency'] This actualy creates a mask and applies it to the dataframe, returning the filtered rows. There are two processes going on.
- df_rat['Activity_Type'] == 'Potency' - This creates a Boolean mask: a series of True/False values indicating whether each row meets the condition. We could write this as mask = df_rat['Activity_Type'] == 'Potency'
- df_rat[mask] then filters the DataFrame and selects the rows from df_rat where the mask is True.
[['BioAssay_AID', 'BioAssay_Name']] selects only the ‘BioAssay_AID’ and ‘BioAssay_Name’ columns from the filtered DataFrame.
.value_counts() is a Pandas method that counts the occurrences of each unique combination of ‘BioAssay_AID’ and ‘BioAssay_Name’.
- Returns a Series with the counts of each unique combination.
- The index is a MultiIndex with ‘BioAssay_AID’ and ‘BioAssay_Name’, this returns all observed combinations of these two columns and their counts.

# Write your code here to filter the data frame for human receptor and count the number of assays with potency data.

Considering the title and the data point counts for the returned assays, the three assays for rat thyroid hormone receptor are high-throughput screening (HTS) assays. Therefore, we also need HTS assays for human thyroid hormone receptor.

In the remaining part of this notebook, we will use the qHTS assays for antagonists of rat and human thyroid hormone receptors.

AID 743065: Confirmatory Assay (Antagonists) in rat receptor
AID 588547: Confirmatory (Antagonists) in human receptor

They were chosen based on the following considerations.

The four assays for human thyroid hormone receptors with more than 1000 data points (i.e., 1469, 1479, 588547, 588545) are HTS screening.
Strictly speaking, AIDs 1469 and 1479 are not targeting the human thyroid hormone receptor, but its interaction with steroid receptor coregulator.
At the summary page of AID 743067, it is explained that the assay summarizes from an confirmatory antagonist assay (AID 743065) and a cell viability counter screening (AID 743064).
More active compounds were identified in the antagonist HTS assays for both species (AIDs 743065 and 588547, compared to the agonist assays (AIDs 743066 and 588545).

AID	Assay Type	species	Number of tested substances	number of actives
743065	Antagonist	rat 🐀	10,486	1880
588547	Antagonist	human 🧑🏽‍🔬	2,858	12
743066	Agonist	rat 🐀	10,486	64
588545	Agonist	human🧑🏽‍🔬	2,825	10

Comparing HTS screening data for Rat and human#

The following section uses the quantitative HTS (qHTS) data for thyroid hormone receptor antagonsts. The data files were downloaded from their BioAssay record pages: (Note: Links below take you to the Download csv dropdown in the data table.)

AID 743065: Confirmatory Assay (Antagonists)
AID 588547: Confirmatory (antagonists)

We have provided you with the data files as compressed ‘zip’ files, but you can also download them directly as .csv files. The names of the downloaded files are:

#file_aid_rat = "AID_743065_datatable_all.csv"  #uncomment to use uncompressed csv files
#file_aid_human = "AID_588547_datatable_all.csv" #uncomment to use uncompressed csv files

file_aid_rat = "AID_743065_datatable_all.zip"
file_aid_human = "AID_588547_datatable_all.zip"

Loading the HTS screening data for Rats#

Now let’s load the HTS data for the rat thyroid hormone receptor.

The data type for the PUBCHEM_CID column is explicitly specified mydtype = {'PUBCHEM_CID':'Int64' } to ensure it is read as an integer, which is important for accurate data handling.
The first 5 rows are skipped skiprows=[1,2,3,4,5] because they do not contain data, but meta data (e.g., the description of each column).

mydtype = {'PUBCHEM_CID':'Int64' }
df_aid_rat = pd.read_csv(file_aid_rat, compression='zip', skiprows=[1,2,3,4,5], dtype=mydtype, low_memory=False)

df_aid_rat.head()

	PUBCHEM_RESULT_TAG	PUBCHEM_SID	PUBCHEM_CID	PUBCHEM_EXT_DATASOURCE_SMILES	PUBCHEM_ACTIVITY_OUTCOME	PUBCHEM_ACTIVITY_SCORE	PUBCHEM_ACTIVITY_URL	PUBCHEM_ASSAYDATA_COMMENT	Phenotype-Replicate_1	Potency-Replicate_1	...	Activity at 0.762 uM-Replicate_46	Activity at 1.702 uM-Replicate_46	Activity at 3.793 uM-Replicate_46	Activity at 8.439 uM-Replicate_46	Activity at 18.15 uM-Replicate_46	Activity at 32.49 uM-Replicate_46	Activity at 70.35 uM-Replicate_46	Activity at 94.96 uM-Replicate_46	Activity at 260.0 uM-Replicate_46	Activity at 920.3 uM-Replicate_46
0	1	144203552	12850184	C(C(=O)[C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O.C(...	Inactive	0	http://assay.nih.gov/htsws/rest/display/tox21-...	NaN	Inactive	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	2	144203553	89753	C([C@H]([C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O)O...	Inactive	0	http://assay.nih.gov/htsws/rest/display/tox21-...	NaN	Inactive	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	3	144203554	9403	C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@H]2OC(=O)CCC...	Inconclusive	10	http://assay.nih.gov/htsws/rest/display/tox21-...	NaN	Inhibitor	33.4915	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	4	144203555	13218779	C[C@@]12CC[C@@H](C1(C)C)C[C@H]2OC(=O)CSC#N	Inactive	0	http://assay.nih.gov/htsws/rest/display/tox21-...	NaN	Inactive	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	5	144203556	142766	C1=CC=C(C(=C1)C(=O)O)O.C1=CC2=C(C(=C1)O)N=CC=C2	Active	83	http://assay.nih.gov/htsws/rest/display/tox21-...	NaN	Inhibitor	8.4127	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 1803 columns

df_aid_rat.shape

(10486, 1803)

This code cell identifies and counts all columns that do not have values.

nan_columns = df_aid_rat.columns[df_aid_rat.isna().all()].tolist()
print(len(nan_columns))

Python Dive: The above code tells us there are 389 columns with no values.

Explanation of the code

nan_columns = df_aid_rat.columns[df_aid_rat.isna().all()].tolist()

df_aid_rat.isna() Returns a DataFrame of the same shape
- Every cell is True if NaN, otherwise False.
.all() → Checks if all values in each column are True (i.e., all values are NaN).
- applied column-wise (by default axis=0), so it goes down each column to see if it is true, so there is one Boolean value for each column.
- the result is a Pandas Series of Booleans, where the index is the name of the column, and each value is a Boolean True or False based on the values of the entire column.
df_aid_rat.columns[...] Filters the DataFrame’s columns (a Pandas Series) to keep only those where the condition is True.
.tolist() → Converts the filtered column names to a list.
nan_columns = df_aid_rat.columns[df_aid_rat.isna().all()].tolist() Assigns the list of column names with all NaN values to the variable nan_columns.
len(nan_columns) Returns the number of columns that have all values as NaN.
print(len(nan_columns)) Prints the number of columns with all NaN values.

The following cell removes those columns and displays the updated dimensions.

df_aid_rat = df_aid_rat.dropna(how='all', axis=1)
df_aid_rat.shape

(10486, 1414)

Python Dive: The above code is an important step in cleaning data where you drop all values that are *Not a Number* and we reduced our columns from 1803 to 1414.

Explanation of the code

df_aid_rat = df_aid_rat.dropna(how='all', axis=1)

.dropna()is a Pandas method that removes missing values (NaN) from a DataFrame.
- it only looks for cells with missing data (NaN), not for strings like “active” or “CCO”. A column with text values will not be dropped unless every single value in it is missing (NaN)
how='all' specifies that only columns where all values are NaN should be dropped.
- how is a keyword parameter and ‘all’ is the argument that you pass to it.
- axis is another keyword parameter that specifies whether to drop rows (axis=0) or columns (axis=1), where 0 and 1 are the arguments you pass to it
- Quick Definitions
  - Function → A reusable block of code that does something when you call it. Example: print("hello").
  - Method → A function that belongs to an object and is called using dot notation. Example: df.dropna().
  - Parameter → The placeholder names defined inside the function/method (e.g. axis, how).
  - Argument → The actual values you pass to those parameters when you call the function/method (e.g. axis=1, "all").
  - Positional vs Keyword arguments →
    - Positional: arguments matched by their order. Example: df.dropna(1, "all") (first goes to axis, second goes to how).
    - Keyword: arguments matched by name, not order. Example: df.dropna(axis=1, how="all").
  - NaN (“Not a Number”) is a special floating-point value defined by the IEEE 754 standard, used in Python to represent missing, undefined, or invalid numerical data.

df_aid_rat.head()

	PUBCHEM_RESULT_TAG	PUBCHEM_SID	PUBCHEM_CID	PUBCHEM_EXT_DATASOURCE_SMILES	PUBCHEM_ACTIVITY_OUTCOME	PUBCHEM_ACTIVITY_SCORE	PUBCHEM_ACTIVITY_URL	Phenotype-Replicate_1	Potency-Replicate_1	Efficacy-Replicate_1	...	Activity at 0.153 uM-Replicate_46	Activity at 0.341 uM-Replicate_46	Activity at 0.762 uM-Replicate_46	Activity at 1.702 uM-Replicate_46	Activity at 3.793 uM-Replicate_46	Activity at 8.439 uM-Replicate_46	Activity at 18.15 uM-Replicate_46	Activity at 32.49 uM-Replicate_46	Activity at 70.35 uM-Replicate_46	Activity at 94.96 uM-Replicate_46
0	1	144203552	12850184	C(C(=O)[C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O.C(...	Inactive	0	http://assay.nih.gov/htsws/rest/display/tox21-...	Inactive	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	2	144203553	89753	C([C@H]([C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O)O...	Inactive	0	http://assay.nih.gov/htsws/rest/display/tox21-...	Inactive	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	3	144203554	9403	C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@H]2OC(=O)CCC...	Inconclusive	10	http://assay.nih.gov/htsws/rest/display/tox21-...	Inhibitor	33.4915	71.8523	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	4	144203555	13218779	C[C@@]12CC[C@@H](C1(C)C)C[C@H]2OC(=O)CSC#N	Inactive	0	http://assay.nih.gov/htsws/rest/display/tox21-...	Inactive	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	5	144203556	142766	C1=CC=C(C(=C1)C(=O)O)O.C1=CC2=C(C(=C1)O)N=CC=C2	Active	83	http://assay.nih.gov/htsws/rest/display/tox21-...	Inhibitor	8.4127	111.3000	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 1414 columns

Loading the HTS data for humans

In the following code cells you will read in the human HTS data as we did for the rat.

Read the zip (or csv) file for the human HTS data into a dataframe called df_aid_human. Make sure to skip the initial descriptive rows and set the data type for ‘PUBCHEM_CID’ to ‘Int64’.

# Write your own code here.

Output the first 5 rows of the human AID dataframe to check that it properly loaded.

# Write your own code here.

Display the dimensions of the human HTS DataFrame.

# Write your own code here.

In the following cell remove the columns containing no values and displays the updated dimensions.

# Write your own code here.

df_aid_human.head()

Substance-level analysis#

This section compares the HTS data for rats and humans at the substance level

df_aid_rat.columns    # Uncomment this line if you want to see what columns the data frame has.

Index(['PUBCHEM_RESULT_TAG', 'PUBCHEM_SID', 'PUBCHEM_CID',
       'PUBCHEM_EXT_DATASOURCE_SMILES', 'PUBCHEM_ACTIVITY_OUTCOME',
       'PUBCHEM_ACTIVITY_SCORE', 'PUBCHEM_ACTIVITY_URL',
       'Phenotype-Replicate_1', 'Potency-Replicate_1', 'Efficacy-Replicate_1',
       ...
       'Activity at 0.153 uM-Replicate_46',
       'Activity at 0.341 uM-Replicate_46',
       'Activity at 0.762 uM-Replicate_46',
       'Activity at 1.702 uM-Replicate_46',
       'Activity at 3.793 uM-Replicate_46',
       'Activity at 8.439 uM-Replicate_46',
       'Activity at 18.15 uM-Replicate_46',
       'Activity at 32.49 uM-Replicate_46',
       'Activity at 70.35 uM-Replicate_46',
       'Activity at 94.96 uM-Replicate_46'],
      dtype='object', length=1414)

df_aid_human.columns # Uncomment this line if you want to see what columns the data frame has.

Activity outcomes and phenotypes#

This part of the analysis focuses on comparing the activity outcomes and observed phenotypes from the HTS data for both species.

First, let’s generate two bar charts to visually compare the distribution of PUBCHEM_ACTIVITY_OUTCOME in the rat and human HTS datasets.

fig, axes = plt.subplots(1, 2, figsize=(10,4), sharey=True)
sns.countplot(data=df_aid_rat, y='PUBCHEM_ACTIVITY_OUTCOME', ax=axes[0])
sns.countplot(data=df_aid_human, y='PUBCHEM_ACTIVITY_OUTCOME', ax=axes[1])

axes[0].set_xlabel("Number of Data Points (rat)")
axes[1].set_xlabel("Number of Data Points (human)")

for ax in axes :    # loop over each chart.

    for container in ax.containers:
        ax.bar_label(container)

fig.tight_layout()

Check your understanding

What is the majority activity data type in each species?
How many actives are found for each species?
Is an active molecule an agonist or antagonist in this assay? (recall each of the bioassay titles)

Both data sets have a column containing information on the type of activity found for the substance tested. These columns are:

Phenotype-Replicate_1 for rats
Phenotype for humans

It is worth mentioning that the column for rats is suffixed with -Replicate_1. The rat data set contains data from repeated assay experiments, which are indicated as Replicate_1, Replicate_2, Replicate_3, and so on. For simplicity, we will use the data for -Replicate_1, but if we were to perform a stricter analysis, we should consider data from all replicates.

Now let’s create two bar charts to compare the distribution of phenotypes (‘Phenotype-Replicate_1’ for rat and ‘Phenotype’ for human) across the two HTS datasets.

# Write your own code

In each case the phenotype indicates the type of activity observed: inhibitor, activator, fluorescent, cytotoxic, inactive, or inconclusive.

To find the relationship between the activity outcomes and phenotypes, let’s count the occurrences of each unique combination of ‘PUBCHEM_ACTIVITY_OUTCOME’ and ‘Phenotype-Replicate_1’ in the rat HTS data.

df_aid_rat[['PUBCHEM_ACTIVITY_OUTCOME', 'Phenotype-Replicate_1' ]].value_counts()

PUBCHEM_ACTIVITY_OUTCOME  Phenotype-Replicate_1
Inactive                  Inactive                 7041
Active                    Inhibitor                1880
Inconclusive              Inhibitor                1038
Inactive                  Activator                 527
Name: count, dtype: int64

.groupby() and .size() can be used to get the same information (but sorted differently).

df_aid_rat.groupby(by=['PUBCHEM_ACTIVITY_OUTCOME', 'Phenotype-Replicate_1']).size()

PUBCHEM_ACTIVITY_OUTCOME  Phenotype-Replicate_1
Active                    Inhibitor                1880
Inactive                  Activator                 527
                          Inactive                 7041
Inconclusive              Inhibitor                1038
dtype: int64

Python Dive: The above code uses the pandas method groupby to group the data fields of PUBCHEM_ACTIVITY_OUTCOME and Phenotype-Replicate_1, and then uses .size() method to count the occurrences of each unique combination.

Explanation of the code

df_aid_rat.groupby(by=['PUBCHEM_ACTIVITY_OUTCOME', 'Phenotype-Replicate_1']).size()

groupby(by=[...])
- Splits the DataFrame into groups based on the unique combinations of two columns:
  - PUBCHEM_ACTIVITY_OUTCOME (e.g., Active, Inactive, Inconclusive)
  - Phenotype-Replicate_1 (e.g., Activator, Inhibitor, Inactive)
- Each unique pair of values is a group key.
.size()
- Counts the number of rows in each group.
- The result is a Pandas Series, where:
  - The index = the group keys (a MultiIndex of outcome + phenotype).
  - The values = the row counts for each group.

The result is a Pandas Series with a MultiIndex built from the two grouping columns, with values being the counts (the output of .size()).

Level 0 → PUBCHEM_ACTIVITY_OUTCOME
Level 1 → Phenotype-Replicate_1

So internally it looks like this:

index = MultiIndex([
    ('Active', 'Inhibitor'),
    ('Inactive', 'Activator'),
    ('Inactive', 'Inactive'),
    ('Inconclusive', 'Inhibitor')
], names=['PUBCHEM_ACTIVITY_OUTCOME','Phenotype-Replicate_1'])

values = [1880, 527, 7041, 1038]

It does not print the ‘Inactive’ label again for the second third level because it is the same as the second level, so it is not repeated in the output.

Understand the output:

PUBCHEM_ACTIVITY_OUTCOME  Phenotype-Replicate_1
Active                    Inhibitor                1880
Inactive                  Activator                 527
                          Inactive                 7041
Inconclusive              Inhibitor                1038
dtype: int64

Active, Inhibitor → 1880 substances
Inactive, Activator → 527 substances
Inactive, Inactive → 7041 substances (the blank under “Inactive” just means “same as above” — Pandas doesn’t repeat the label)
Inconclusive, Inhibitor → 1038 substances

Now let’s count the occurrences of each unique combination of ‘PUBCHEM_ACTIVITY_OUTCOME’ and ‘Phenotype’ in the human HTS data.

df_aid_human[['PUBCHEM_ACTIVITY_OUTCOME', 'Phenotype' ]].value_counts()

Get the same information using .groupby() and .size()

df_aid_human.groupby(by=['PUBCHEM_ACTIVITY_OUTCOME', 'Phenotype']).size()

Check your understanding

Look at the objective of the rat and human assays. What type of compounds were the assays designed to identify?
In your rat dataset, how are compounds with an Activator phenotype classified? How is this different from the human dataset?
Why do you think Activators are not declared as “Active” even though they show biological activity?
In the human dataset, some compounds are labeled Inconclusive. What are possible reasons a compound might be classified this way rather than as Active or Inactive?
The “Inconclusive” category in the human dataset includes additional phenotypes such as Cytotoxic and Fluorescent.
- What does it mean if a compound is labeled Cytotoxic?
- Why might a fluorescent compound interfere with interpreting assay results?
- What issues with a dose-response curve could make a compound labeled as an Inhibitor still end up in the Inconclusive category?
- (Hint for these: We explored the test methods in the BioActivity Primier activity.)
Compare the way activators are treated in the rat and human datasets. How could these differences affect your ability to compare results between species?
If you were the scientist in charge of designing the assay, how might you justify the choice to classify activators as Inactive versus Inconclusive?

Efficacy#

In this section, we compare the efficacy data from the HTS assays for both rat and human thyroid hormone receptors.

For agonist-mode assays, high efficacy = strong activation relative to baseline.

For this antagonist-mode assay, efficacy instead reflects the maximum inhibition observed relative to the agonist control (thyroid hormone T3).

An efficacy near -100% means the compound fully blocked the receptor’s activity back to baseline (complete antagonist).
An efficacy near 0% means little or no inhibition.
Intermediate values (e.g., -40%, -60%) mean partial inhibition.

So in this dataset, the Efficacy column tells you “how much did the compound reduce the T3 signal at its maximum effect?”

Let’s look at some examples from the rat dataset to get a sense of the type of data here:

cids_of_interest = [6, 142766, 89753, 102428, 142766,  2724411, 12850184, 89753, 13218779, 774, 5803]
df_aid_rat[df_aid_rat['PUBCHEM_CID'].isin(cids_of_interest)] \
    .drop_duplicates(subset='PUBCHEM_CID') \
    .loc[:, ['PUBCHEM_CID', 'PUBCHEM_ACTIVITY_OUTCOME', 'Phenotype-Replicate_1', 'Efficacy-Replicate_1']] \
    .sort_values(by='PUBCHEM_ACTIVITY_OUTCOME')

	PUBCHEM_CID	PUBCHEM_ACTIVITY_OUTCOME	Phenotype-Replicate_1	Efficacy-Replicate_1
4	142766	Active	Inhibitor	111.3000
7	2724411	Active	Inhibitor	105.6700
5600	6	Active	Inhibitor	121.3320
0	12850184	Inactive	Inactive	NaN
1	89753	Inactive	Inactive	NaN
3	13218779	Inactive	Inactive	NaN
165	774	Inactive	Activator	21.8768
316	5803	Inactive	Activator	49.7967
1312	102428	Inactive	Activator	19.7701

Python Dive: The above code extracts your compounds of interest, removes duplicates, narrows the view to the 4 most relevant columns, and sorts them by assay outcome so you can quickly compare Active vs. Inactive compounds along with their phenotypes and efficacy values.

Explanation of the code

`df_aid_rat ['PUBCHEM_CID'].isin(cids_of_interest).drop_duplicates(subset='PUBCHEM_CID').loc[:, ['PUBCHEM_CID', 'PUBCHEM_ACTIVITY_OUTCOME', 'Phenotype-Replicate_1', 'Efficacy-Replicate_1']].sort_values(by='PUBCHEM_ACTIVITY_OUTCOME')

` Note, in Python the ‘’ is used to indicate that the line continues on the next line, so you can break long lines into multiple lines for readability, and the code box is really two lines of code, one defining the list and the other what is done with it.

df_aid_rat is a Pandas Dataframe
df_aid_rat['PUBCHEM_CID'] is a Pandas Series of made of the PubChem_CID field of the Dataframe.
df_aid_rat['PUBCHEM_CID'].isin(cids_of_interest) note isin()is a series method operating on the CID coluns to create a Boolean mask that is True for rows where the PUBCHEM_CID is in the list of cids_of_interest.
df_aid_rat[df_aid_rat['PUBCHEM_CID'].isin(cids_of_interest)] creates a new filtered DataFrame that keeps row where the mask is True. It does not overwrite the original dataframe.
.drop_duplicates(subset='PUBCHEM_CID') removes duplicate rows based on the ‘PUBCHEM_CID’ column, keeping only the first occurrence.
.loc[:, ['PUBCHEM_CID', 'PUBCHEM_ACTIVITY_OUTCOME', 'Phenotype-Replicate_1', 'Efficacy-Replicate_1']] selects only the specified columns from the filtered DataFrame.
.sort_values(by='PUBCHEM_ACTIVITY_OUTCOME') sorts the DataFrame by the ‘PUBCHEM_ACTIVITY_OUTCOME’ column in ascending order.

Check your understanding

To be labeled as an active antagonist in this model, what must the efficacy value be? Which CIDs in the data subset are antagonists? How is that indicated by phenotype?
How does the activity of inactive compounds compare to active compounds?
Some inhibitors have efficacy values greater than 100%, meaning they reduce the assay signal below the baseline. What experimental or biological factors could cause an antagonist to produce a signal lower than baseline?
Activators such as CID 774, 5803, and 102428 only show partial inhibition of the assay signal. Could these compounds be acting as partial agonists rather than true activators or antagonists? How might partial agonists appear in this type of antagonist assay?

We can see there is similar data in the human dataset:

cids_of_interest = [457193, 15910, 443939, 5995, 5281575,7529]
df_aid_human[df_aid_human['PUBCHEM_CID'].isin(cids_of_interest)] \
    .drop_duplicates(subset='PUBCHEM_CID') \
    .loc[:, ['PUBCHEM_CID', 'PUBCHEM_ACTIVITY_OUTCOME', 'Phenotype', 'Efficacy']] \
    .sort_values(by='PUBCHEM_ACTIVITY_OUTCOME')

Plotting the data:

A histogram groups data into bins to display how frequently values occur withing each range. It is best used for large datasets and shows the shape of the distribution. This can show overall patterns in data, not the individual values. This can be very useful for comparing data like potency or efficacy distributions.

Now that we have an overview of the types of data in the rat dataset, let’s generate two histograms to visualize and compare the distribution of efficacy values (‘Efficacy-Replicate_1’ for rat and ‘Efficacy’ for human) for different activity outcomes.

fig, axes = plt.subplots(1, 2, figsize=(10,4), sharey=True)

color_dict = {"Active": "red", "Inconclusive": "purple", "Conflicting": "blue", "Inactive": "green"}

sns.histplot(data=df_aid_rat, x='Efficacy-Replicate_1', hue='PUBCHEM_ACTIVITY_OUTCOME',ax=axes[0], palette=color_dict)
sns.histplot(data=df_aid_human, x='Efficacy', hue='PUBCHEM_ACTIVITY_OUTCOME', ax=axes[1], palette=color_dict)

axes[0].set_xlabel("Efficacy (%, rat)")
axes[1].set_xlabel("Efficacy (%, human)")

fig.tight_layout()

Notice in the above, it is clear we can see three distinct Activity Outcome categories in the rat data. However, with the human data being a smaller set, it is challenging to see which molecules are labeled as active or inconclusive.

Another way to visualize this data is with a strip plot. It will show every individual data point along an axis. If two datapoints are the same, the program may provide slight “jitter” to reduce overap. This plot is best used for showing discrete observations and identifying clusters, outliers or repeated values. It can also useful for comparing disbributions across categories. In this case we are looking at efficacy by activity outcome.

fig, axes = plt.subplots(1, 2, figsize=(10,4), sharey=True)

hue_order = ['Active',  'Inconclusive', 'Inactive']

sns.stripplot(data=df_aid_rat, x='Efficacy-Replicate_1', y='PUBCHEM_ACTIVITY_OUTCOME', hue='PUBCHEM_ACTIVITY_OUTCOME', hue_order=hue_order, ax=axes[0])
sns.stripplot(data=df_aid_human, x='Efficacy', y='PUBCHEM_ACTIVITY_OUTCOME', hue='PUBCHEM_ACTIVITY_OUTCOME', hue_order=hue_order, ax=axes[1])

axes[0].set_xlabel("Efficacy (%, rat)")
axes[1].set_xlabel("Efficacy (%, human)")

Another useful visualization tool is the box plot which summarizes the distribution of a datase by showing five key statistics:

minimum
Q1 - The First quartile (25th percentile of values)
Q2 - The Second quartile (50th percentile of values)
Median- The value exactly in the middle of the values
Q3 - The Third quartile (75th percentile of values)
Maximum - The highest non-outlier value
Outliers - points beyond the “whiskers” that are anomolous data points.

You can use it to summarize variables (like efficacy or potency) and how they are distributed across categories (active, inactive, inconclusive), how to compare medians and spread between groups, and spot outliers quickly without visualizing every individual point.

Let’s create box plots of the activity and efficacy data for rats and humans.

fig, axes = plt.subplots(1, 2, figsize=(10,4), sharey=True)

hue_order = ['Active',  'Inconclusive', 'Inactive']

sns.boxplot(data=df_aid_rat, x='Efficacy-Replicate_1', y='PUBCHEM_ACTIVITY_OUTCOME', hue='PUBCHEM_ACTIVITY_OUTCOME', hue_order=hue_order, ax=axes[0])
sns.boxplot(data=df_aid_human, x='Efficacy', y='PUBCHEM_ACTIVITY_OUTCOME', hue='PUBCHEM_ACTIVITY_OUTCOME', hue_order=hue_order, ax=axes[1])

axes[0].set_xlabel("Efficacy (%, rat)")
axes[1].set_xlabel("Efficacy (%, human)")

Check your understanding

Based on the various plots of the data, were efficacy values determined for all compounds in the human dataset?
The box plots of Efficacy for the rat and human datasets show that the Inconclusive and Inactive outcomes have many more outliers than the Active outcomes.
- Why do you think the Inconclusive and Inactive groups show such a wide spread of efficacy values?
- What does this suggest about how compounds are classified in assays compared to those labeled Active?

Potency#

Now visualize the potency of the tested chemicals on the rat and human thyroid hormone receptors.

Generate two histograms to visualize and compare the distribution of potency values for different activity outcomes in the rat and human HTS datasets.

# Write your own code

Generate two strip plots to visualize the potency values for different activity outcomes in the rat and human HTS datasets.

# Write your own code

Generate two box plots to visualize the potency data for different activity outcomes in the rat and human HTS datasets.

# Write your own code

Check your understanding

Based on the various plots of the data, were potency values determined for all compounds in the human dataset?
When comparing the rat and human datasets, the box and strip plots for Potency show fewer outliers than those for Efficacy.
- Why do you think fewer outliers appear in the potency data?
- What does this tell you about the type of compounds that have measurable potency values compared to those that only have efficacy values?

We can compute summary statistics such as the mean, median, minimum, and maximum for the efficacy and potency values across different activity outcomes (Active, Inactive, Inconclusive) in both the rat and human datasets. These summaries help describe how the data are distributed and how results compare between species.

df_aid_rat.groupby('PUBCHEM_ACTIVITY_OUTCOME')[['Efficacy-Replicate_1', 'Potency-Replicate_1']].mean()

	Efficacy-Replicate_1	Potency-Replicate_1
PUBCHEM_ACTIVITY_OUTCOME
Active	110.130054	22.882127
Inactive	50.140904	27.487615
Inconclusive	58.654658	43.571585

df_aid_rat.groupby('PUBCHEM_ACTIVITY_OUTCOME')[['Efficacy-Replicate_1', 'Potency-Replicate_1']].median()

	Efficacy-Replicate_1	Potency-Replicate_1
PUBCHEM_ACTIVITY_OUTCOME
Active	113.98150	18.83360
Inactive	39.19415	24.44795
Inconclusive	49.11970	33.49150

df_aid_rat.groupby('PUBCHEM_ACTIVITY_OUTCOME')[['Efficacy-Replicate_1', 'Potency-Replicate_1']].min()

	Efficacy-Replicate_1	Potency-Replicate_1
PUBCHEM_ACTIVITY_OUTCOME
Active	49.0617	0.0007
Inactive	13.5294	0.0019
Inconclusive	19.5822	0.0033

df_aid_rat.groupby('PUBCHEM_ACTIVITY_OUTCOME')[['Efficacy-Replicate_1', 'Potency-Replicate_1']].max()

	Efficacy-Replicate_1	Potency-Replicate_1
PUBCHEM_ACTIVITY_OUTCOME
Active	246.592	159.683
Inactive	246.847	107.353
Inconclusive	199.769	343.762

Now it’s your turn to compute the mean, median, minimum, maximum of the efficacy and potency values for the different outcomes of the human data set.

Python Dive: In the above coded cells Pandas is calling numpy statistics functions to compute the mean, median, minimum and maximum values for the efficacy and potency columns grouped by the activity outcome.

Explanation of the code

`df_aid_rat.groupby('PUBCHEM_ACTIVITY_OUTCOME')[['Efficacy-Replicate_1', 'Potency-Replicate_1']].mean()`

df_aid_rat This is your Pandas DataFrame containing PubChem assay results. It has many columns, including:
- PUBCHEM_ACTIVITY_OUTCOME (Inactive, Active, Inconclusive, etc.)
- replicate values like Efficacy-Replicate_1 and Potency-Replicate_1.
.groupby('PUBCHEM_ACTIVITY_OUTCOME')
- Splits your DataFrame into sub-tables (groups), one for each unique value in PUBCHEM_ACTIVITY_OUTCOME.
- For example:
  - Group 1: all rows with Inactive
  - Group 2: all rows with Active
  - Group 3: all rows with Inconclusive
- Each group can now be analyzed independently.
[['Efficacy-Replicate_1', 'Potency-Replicate_1']]
- This selects just two numeric columns out of the full DataFrame.
- That way, the statistics are calculated only on these measures, not on all columns.
.mean()
- Applies the arithmetic mean function (average) to each column, within each group.

Statistics Methods

.mean() → Calculates the average value for each group.
.median() → Calculates the median value for each group.
.min() → Finds the minimum value for each group.
.max() → Finds the maximum value for each group.
.std() → Calculates the standard deviation for each group.
.var() → Calculates the variance for each group.
.count() → Counts the number of non-null values in each group.
.size() → Counts the total number of rows in each group (including NaN).
.quantile([0.25, 0.5, 0.75]) → Calculates the 25th, 50th (median), and 75th percentiles for each group.
.describe() → Provides a summary of statistics (count, mean, std, min, 25%, 50%, 75%, max) for each group.
.agg() → Allows you to apply multiple aggregation functions at once, e.g. df.groupby('col').agg(['mean', 'std', 'min', 'max']).

# Write your own code for mean

# Write your own code for median

# Write your own code for minimum

# Write your own code for maximum

Check your understanding

How would you expect the mean and median efficacy values of Active compounds to compare with those of Inactive or Inconclusive compounds? Is that portrayed in the data?
Did you observe that active compounds have higher average efficacy in one species than the other? What might that suggest about species differences in thyroid hormone receptor response?
Recall the lower the potency value, a lower amount of compound is needed to cause an effect. Are there species differences in minimum potency values?

Efficacy vs Potency#

This section explores the relationship between the efficacy and potency of the tested compounds for both the rat and human thyroid hormone receptors.

The following code creates two scatter plots, one for the rat data and one for the human data, to visualize the relationship between potency and efficacy, with points colored by their activity outcome.

fig, axes = plt.subplots(1, 2, figsize=(10,4), sharey=False)

hue_order = ['Active',  'Inconclusive', 'Inactive']

sns.scatterplot(data=df_aid_rat, x='Potency-Replicate_1', y='Efficacy-Replicate_1', hue='PUBCHEM_ACTIVITY_OUTCOME', hue_order=hue_order, ax=axes[0])
sns.scatterplot(data=df_aid_human, x='Potency', y='Efficacy', hue='PUBCHEM_ACTIVITY_OUTCOME', hue_order=hue_order, ax=axes[1])

axes[0].set_xlabel("Potency (micro-M, rat)")
axes[1].set_xlabel("Potency (micro-M, human)")

axes[0].set_ylabel("Efficacy (%, rat)")
axes[1].set_ylabel("Efficacy (%, human)")

# Uncomment next three lines to limit the x-axis to 200.
#for ax in axes :    # loop over each panel
#    ax.set_xlim(0, 200)
#    ax.set_ylim(0, 200)

Compound-level Analysis#

In PubChem, substances are chemical records provided by individual data contributors and compounds are unique chemical structures extracted from substance records.

In many assays, substances can be viewed as “samples” of chemicals. A chemical can be tested in multiple samples containing that chemical in a single assay. The samples are not necessary the same as each other. For example, they may be different in terms of its concentration, temperature, solvent used, etc.

Therefore, if you want to determine the activity of a chemical, you may need to collapse the data from multiple samples (i.e., substances) into a single data point for the chemical.

The code cells below checks how many unique compounds in the data set and how many of them are associated with multiple substances.

df_rat_cidcount = pd.DataFrame(df_aid_rat['PUBCHEM_CID'].value_counts()).reset_index()
df_human_cidcount = pd.DataFrame(df_aid_human['PUBCHEM_CID'].value_counts()).reset_index()

print("#-- Rat data set")
print("# CIDs in total                      :", len(df_rat_cidcount) )
print("# CIDs associated with multiple SIDs :", len(df_rat_cidcount[df_rat_cidcount['count'] > 1]) )
print("------------------------------------------------")
print("#-- Human data set")
print("# CIDs in total                      :", len(df_human_cidcount) )
print("# CIDs associated with multiple SIDs :", len(df_human_cidcount[df_human_cidcount['count'] > 1]) )

fig, axes = plt.subplots(1, 2, figsize=(10,4), sharey=False)

sns.countplot(data=df_rat_cidcount, x='count',  ax=axes[0])
sns.countplot(data=df_human_cidcount, x='count',  ax=axes[1])

axes[0].set_xlabel("Number of SIDs per CID")
axes[1].set_xlabel("Number of SIDs per CID")

axes[0].set_ylabel("Number of CIDs")
axes[1].set_ylabel("Number of CIDs")

axes[0].set_title("Rat")
axes[1].set_title("Human")

Aggregating substance data rats#

Now we aggregate the substance data for each unique compound (identified by its CID).

This code groups the rat HTS data by ‘PUBCHEM_CID’ and store them in a data frame with three columns (PUBCHEM_CID, Potency-Replicate_1, and Efficacy-Replicate_1). Note that some CIDs are associated with multiple SIDs, and hence, multiple activity values. For those cases, we are going to take the average of the multiple activity values.

df_aid_rat_cid = df_aid_rat.groupby('PUBCHEM_CID').mean(numeric_only=True).reset_index()
df_aid_rat_cid = df_aid_rat_cid[['PUBCHEM_CID', 'Potency-Replicate_1', 'Efficacy-Replicate_1']]

df_aid_rat_cid.head()

	PUBCHEM_CID	Potency-Replicate_1	Efficacy-Replicate_1
0	4	NaN	NaN
1	6	4.6248	118.2025
2	11	NaN	NaN
3	13	NaN	NaN
4	33	NaN	NaN

df_aid_rat_cid.shape

(8099, 3)

This code defines a function, get_cid_outcome, that determines a single activity outcome for each unique compound ID (CID) in a given DataFrame. If a compound has multiple, conflicting outcomes recorded, it is labeled as “Conflicting”.

def get_cid_outcome(df) :

    outcomes = {}

    for cid in df['PUBCHEM_CID'].dropna().unique() :   # loop over unique CIDs (without Null)

        outcome =""
    
        unique_outcomes = df[ df['PUBCHEM_CID'] == cid ]['PUBCHEM_ACTIVITY_OUTCOME'].unique()
    
        if ( len(unique_outcomes) == 1 ) :
            outcome = unique_outcomes[0]
        elif ( len(unique_outcomes) > 1 ) :
            outcome = "Conflicting"
        else :
            print("Warning: Something is wrong", cid, unique_outcomes)

        outcomes[ cid ] = outcome

    df_outcomes = pd.DataFrame.from_dict(outcomes, orient='index', columns=['PUBCHEM_ACTIVITY_OUTCOME']).reset_index(names='PUBCHEM_CID')

    return(df_outcomes)

This line of code applies the get_cid_outcome function to the rat HTS data to generate a DataFrame that summarizes the activity outcome for each unique compound.

df_cid_outcome_rat = get_cid_outcome(df_aid_rat)

df_cid_outcome_rat.head()

	PUBCHEM_CID	PUBCHEM_ACTIVITY_OUTCOME
0	12850184	Inactive
1	89753	Inactive
2	9403	Inconclusive
3	13218779	Inactive
4	142766	Active

df_cid_outcome_rat.shape

(8099, 2)

Now let’s merge the aggregated rat compound data (containing mean potency and efficacy) with the summarized compound outcomes.

df_aid_rat_cid = df_aid_rat_cid.merge(df_cid_outcome_rat, on='PUBCHEM_CID', how='left')

df_aid_rat_cid.head()

	PUBCHEM_CID	Potency-Replicate_1	Efficacy-Replicate_1	PUBCHEM_ACTIVITY_OUTCOME
0	4	NaN	NaN	Inactive
1	6	4.6248	118.2025	Active
2	11	NaN	NaN	Inactive
3	13	NaN	NaN	Inactive
4	33	NaN	NaN	Inactive

This cell counts the occurrences of each unique ‘PUBCHEM_ACTIVITY_OUTCOME’ in the aggregated rat compound data.

df_aid_rat_cid['PUBCHEM_ACTIVITY_OUTCOME'].value_counts()

PUBCHEM_ACTIVITY_OUTCOME
Inactive        5745
Active          1383
Inconclusive     690
Conflicting      281
Name: count, dtype: int64

Aggregating substance data for humans#

Now let’s repeat the data aggregation process, this time for the human thyroid hormone receptor HTS data.

Group the human HTS data by ‘PUBCHEM_CID’ and store the mean efficacy and potency values in a data frame called ‘df_aid_human_cid’.

# Write your own code

Show a first few line of the df_aid_human_cid data frame.

# Write your own code

Use the get_cid_outcome function to the human HTS data to generate a DataFrame that summarizes the activity outcome for each unique compound.

# Write your own code

This command displays the first few rows of the DataFrame containing the summarized compound outcomes for the human data.

# Write your own code

This code merges the aggregated human compound data with the summarized compound outcomes.

# Write your own code

This command displays the first few rows of the merged human compound DataFrame.

# Write your own code

Use .value_counts() the occurrences of each unique ‘PUBCHEM_ACTIVITY_OUTCOME’ in the aggregated human compound data.

# Write your own code

Merging rat data and human data#

This subsection focuses on combining the aggregated compound-level data from both the rat and human HTS assays to facilitate a direct comparison.

This command displays the column names of the aggregated rat compound DataFrame.

df_aid_rat_cid.columns

Let’s rename the columns in the rat compound DataFrame to have consistent naming with the human compound DataFrame before merging.

df_aid_rat_cid = df_aid_rat_cid.rename(columns={'Potency-Replicate_1':'Potency' , 'Efficacy-Replicate_1':'Efficacy', })

This command displays the first three rows of the renamed rat compound DataFrame to verify the column name changes.

df_aid_rat_cid.head(3)

Now merge the aggregated rat and human compound DataFrames based on ‘PUBCHEM_CID’, keeping only the compounds that are present in both datasets.

df_aid_merged_cid = df_aid_rat_cid.merge(df_aid_human_cid, on='PUBCHEM_CID', how='inner', suffixes=('_rat','_human'))

Let’s check if the merged data frame contains desired information.

df_aid_merged_cid.shape

df_aid_merged_cid

This cell counts the occurrences of each unique ‘PUBCHEM_ACTIVITY_OUTCOME’ for the rat data within the merged DataFrame.

df_aid_merged_cid['PUBCHEM_ACTIVITY_OUTCOME_rat'].value_counts()

This cell counts the occurrences of each unique ‘PUBCHEM_ACTIVITY_OUTCOME’ for the human data within the merged DataFrame.

df_aid_merged_cid['PUBCHEM_ACTIVITY_OUTCOME_human'].value_counts()

Compound-level comparison#

Let’s compare the bioactivity of the same compounds between the rat and human thyroid hormone receptors.

The following code cell generates two box plots to compare the distribution of efficacy for different activity outcomes between the rat and human data in the merged dataset.

fig, axes = plt.subplots(1, 2, figsize=(10,4), sharey=True)

hue_order = ['Active',  'Inactive', 'Conflicting', 'Inconclusive']

sns.boxplot(data=df_aid_merged_cid, x='Efficacy_rat', y='PUBCHEM_ACTIVITY_OUTCOME_rat', hue='PUBCHEM_ACTIVITY_OUTCOME_rat', hue_order=hue_order, ax=axes[0])
sns.boxplot(data=df_aid_merged_cid, x='Efficacy_human', y='PUBCHEM_ACTIVITY_OUTCOME_human', hue='PUBCHEM_ACTIVITY_OUTCOME_human', hue_order=hue_order, ax=axes[1])

axes[0].set_xlabel("Efficacy (%, rat)")
axes[1].set_xlabel("Efficacy (%, human)")

axes[0].set_ylabel("PUBCHEM_ACTIVITY_OUTCOME")

fig.tight_layout()

Generate two box plots to compare the distribution of potency for different activity outcomes between the rat and human data in the merged dataset.

# Write your own code

This code creates two scatter plots to visualize the relationship between potency and efficacy for the rat and human data in the merged dataset.

fig, axes = plt.subplots(1, 2, figsize=(10,4), sharey=False)

color_dict = {"Active": "red", "Inconclusive": "purple", "Conflicting": "blue", "Inactive": "green"}



sns.scatterplot(data=df_aid_merged_cid, x='Potency_rat', y='Efficacy_rat', hue='PUBCHEM_ACTIVITY_OUTCOME_rat', ax=axes[0], palette=color_dict)
sns.scatterplot(data=df_aid_merged_cid, x='Potency_human', y='Efficacy_human', hue='PUBCHEM_ACTIVITY_OUTCOME_human', ax=axes[1], palette=color_dict)

axes[0].set_xlabel("Potency (micro-M, rat)")
axes[1].set_xlabel("Potency (micro-M, human)")

axes[0].set_ylabel("Efficacy (%, rat)")
axes[1].set_ylabel("Efficacy (%, human)")

# Uncomment the next three lines to limit the x and y axes to 200.
#for ax in axes :    # loop over each panel
#    ax.set_xlim(0, 200)
#    ax.set_ylim(0, 200)

The above graph output is challenging to interpret. Another useful tool for visualizing the data is a heatmap. In a heatmap, each cell’s color corresponds to the magnitude of a value. This makes it easier to see patterns, trends or correllations at a glance. They are particularly useful for comparing multiple variable (such as potency or efficacy in differen species) because you can quickly see clusters of high or low values.

Now we will generate a heatmap that shows the counts of rat and human activity outcome pairs. (For example, how many chemicals are active in both species, and how many chemicals are active in one species and inactive in the other?).

To do that we want to get the outcome pair count information from the df_aid_merged_cid data frame.

outcome_counts = pd.DataFrame(df_aid_merged_cid.groupby(['PUBCHEM_ACTIVITY_OUTCOME_human', 'PUBCHEM_ACTIVITY_OUTCOME_rat']).size(), columns=['Count']).reset_index()
outcome_counts

Then, use .pivot() to change the ‘outcome_counts’ DataFrame into a square matrix form, where the rows are human activity outcomes, the columns are rat activity outcomes, and the values are the counts of compounds.

heatmap_data = outcome_counts.pivot(index='PUBCHEM_ACTIVITY_OUTCOME_human', columns='PUBCHEM_ACTIVITY_OUTCOME_rat', values='Count')
heatmap_data

Generate a heatmap to visualize the activity outcome pair counts.

sns.heatmap(heatmap_data, cmap="flare", annot=True, fmt='4.0f')

This code creates two scatter plots to directly compare the efficacy and potency of compounds between the rat and human assays.

fig, axes = plt.subplots(1, 2, figsize=(10,4), sharey=False)

hue_order = ['Active',  'Inactive', 'Inconclusive', 'Conflicting']

sns.scatterplot(data=df_aid_merged_cid, x='Efficacy_rat', y='Efficacy_human', hue='PUBCHEM_ACTIVITY_OUTCOME_rat', hue_order=hue_order, ax=axes[0], legend=None)
sns.scatterplot(data=df_aid_merged_cid, x='Potency_rat', y='Potency_human', hue='PUBCHEM_ACTIVITY_OUTCOME_rat', hue_order=hue_order, ax=axes[1])

axes[0].set_xlabel("Efficacy (%, rat)")
axes[0].set_ylabel("Efficacy (%, human)")
axes[0].set_title("Efficacy")

axes[1].set_xlabel("Potency (micro-M, rat)")
axes[1].set_ylabel("Potency (micro-M, human)")
axes[1].set_title("Potency")

axes[1].legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0)

# Uncomment to limit the axes to 200.
#for ax in axes :    # loop over each panel
#   ax.set_xlim(0, 200)
#    ax.set_ylim(0, 200)

In the above graphs, it is not easy to see if there is any correlation between rat and human activities (in terms of the efficacy and potency).

Let’s focus only on the active compounds in both the rat and human assays.

df_aid_merged_cid_active = df_aid_merged_cid[(df_aid_merged_cid['PUBCHEM_ACTIVITY_OUTCOME_rat'] == 'Active') & (df_aid_merged_cid['PUBCHEM_ACTIVITY_OUTCOME_human'] == 'Active')]

This code generates two scatter plots to compare the efficacy and potency of the compounds that were active in both rat and human assays.

fig, axes = plt.subplots(1, 2, figsize=(10,4), sharey=False)

sns.scatterplot(data=df_aid_merged_cid_active, x='Efficacy_rat', y='Efficacy_human', ax=axes[0])
sns.scatterplot(data=df_aid_merged_cid_active, x='Potency_rat', y='Potency_human', ax=axes[1])

axes[0].set_title("Efficacy")
axes[0].set_xlabel("Efficacy (%, rat)")
axes[0].set_ylabel("Efficacy (%, human)")

axes[1].set_title("Potency")
axes[1].set_xlabel("Potency (micro-M, rat)")
axes[1].set_ylabel("Potency (micro-M, human)")

res_efficacy = linregress(df_aid_merged_cid_active['Efficacy_rat'], df_aid_merged_cid_active['Efficacy_human'])
res_potency = linregress(df_aid_merged_cid_active['Potency_rat'], df_aid_merged_cid_active['Potency_human'])

print(f"                      Efficacy\tPotency")
print(f"Slope               : {res_efficacy.slope:4.4f}\t{res_potency.slope:<4.4f}")
print(f"intercept           : {res_efficacy.intercept:4.4f}\t{res_potency.intercept:4.4f}")
print(f"r-value             : {res_efficacy.rvalue:4.4f}\t{res_potency.rvalue:4.4f}")
print(f"p-value             : {res_efficacy.pvalue:4.4f}\t{res_potency.pvalue:4.4f}")
print(f"std-err (slope)     : {res_efficacy.stderr:4.4f}\t{res_potency.stderr:4.4f}")
print(f"std-err (intercept) : {res_efficacy.intercept_stderr:4.4f}\t{res_potency.intercept_stderr:4.4f}")

When we consider the compounds that are active in both species, there appears to be a positive correlation between the rat and human data. However, there are only six compounds active in both species, we cannot confidently say there is a definitive relationship.

Homework

Answer the following questions, based on the analysis of the compounds tested in both rat and human thyroid hormone receptors (i.e., the compounds in the df_aid_merged_cid data frame).

(1) How many compounds were tested against both rat thyroid hormone receptor?

# Provide your answer here

(2) Among the compounds tested in both species, how many were active against rat thyroid hormone receptor?

# Provide your answer here

(3) Among the compounds active in rats (i.e., those in (2)), how many were active against human thyroid hormone receptor?

# Provide your answer here

(4) Read the following paper (open access) on the concordance of the toxicity of drugs between animals and humans and summarize its main conclusion.

Limitations of Animal Studies for Predicting Toxicity in Clinical Trials: Is it Time to Rethink Our Current Approach?
Van Norman GA.
JACC Basic Transl Sci. 2019 Nov 25;4(7):845-854.
doi: 10.1016/j.jacbts.2019.10.008. PMID: 31998852; PMCID: PMC6978558.

# Write your summary in this markdown cell:

6.1 Exploratory Data Analysis - Protein

Contents

6.1 Exploratory Data Analysis - Protein#

Exploratory Data Analysis 1: Concordance of Thyroid Hormone Receptor Activity β in Rats and Humans#

Checking the amount of available data#

Loading data for rat thyroid hormone receptor#

Checking the type and amount of available data for each species#

Reviewing the available bioassays#

Comparing HTS screening data for Rat and human#

Loading the HTS screening data for Rats#

Loading the HTS data for humans

Substance-level analysis#

Activity outcomes and phenotypes#

Efficacy#

Potency#

Efficacy vs Potency#

Compound-level Analysis#

Aggregating substance data rats#

Aggregating substance data for humans#

Merging rat data and human data#

Compound-level comparison#

Homework

Exploratory Data Analysis 1:

Concordance of Thyroid Hormone Receptor Activity β in Rats and Humans#