2.1 PubChem Data Types#
PubChem Training Course#
This notebook is designed to follow the NLM training course on the use of PubChem. Each student will have their own copy of this notebook and should edit it as they see fit. That is, part of the goal of this initial activity is to assist students in creating content within Jupyter Notebooks. This is also a chance for students to develop skills with vibe coding.
PubChem Data Types#
The terms “compound” and “substance” refer to two different data types in PubChem; see “What is the difference between a substance and a compound in PubChem?”
Substance (SID)#
Substances are the records associated with data uploaded to PubChem by individual depositors (sources). They include metadata such as who the source is, the batch, and the testing conditions. Each substance has a unique identifier, its SID. This allows PubChem to track data provenance and ensures the traceability, accountability, and contextual richness of the chemical data. In addition to providing the data, the substance record answers:
Who submitted the data
Where the data came from
How the data was generated
Under what conditions or protocols it was collected
One chemical (compound) can have multiple SIDs as more than one source can upload data related to a chemical.
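The one-to-many relationship between a compound and its substances can be explored programmatically. The sketch below is a minimal illustration, assuming the `sids` operation of the PUG REST compound endpoint and its `InformationList` response layout. The sample response is hand-written and its SIDs are placeholders, not real PubChem data.

```python
import json

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def sids_url(cid: int) -> str:
    """Build the PUG REST URL that lists the SIDs linked to one CID."""
    return f"{PUG_REST}/compound/cid/{cid}/sids/JSON"


def extract_sids(response_text: str) -> dict:
    """Map each CID in a 'sids' response to its list of SIDs."""
    data = json.loads(response_text)
    records = data["InformationList"]["Information"]
    # A CID with no linked substances may lack the "SID" key entirely.
    return {rec["CID"]: rec.get("SID", []) for rec in records}


# Hand-written sample in the assumed response layout (placeholder SIDs).
sample = '{"InformationList": {"Information": [{"CID": 2244, "SID": [1, 2, 3]}]}}'

sids_by_cid = extract_sids(sample)
print(sids_url(2244))
print(len(sids_by_cid[2244]), "substances map to CID 2244 in this sample")
```

Pasting the printed URL into a browser is a quick way to see how many depositors have contributed a record for the same chemical.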
Compound (CID)#
If a substance can be described by a chemical structure it can be assigned to a PubChem compound record. This requires a structure standardization pipeline and involves chemical identifiers like the IUPAC InChI. In essence a compound record aggregates:
Data from multiple SIDs
Computed Properties (LogP, molecular weight,…)
Synonyms
Cross-links to literature, patents, spectra and other databases
Computed Properties#
Not all properties are always present — they depend on successful structure processing.
You can access them via:
The PubChem Compound Summary Page under “Computed Properties”
The PubChem REST API using:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/property/MolecularWeight,InChIKey/JSON
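The URL above can be assembled for any CID and property list. The following sketch shows one way to do that in Python with only the standard library. The helper names are illustrative, and the `PropertyTable` response layout and the sample values are assumptions written by hand for this example. Because not every property is always present, the parser reads values with `.get()` so a missing property becomes `None` instead of raising an error.

```python
import json
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def property_url(cid, properties):
    """Build a PUG REST URL requesting the given properties for one CID."""
    names = ",".join(properties)
    return f"{PUG_REST}/compound/cid/{cid}/property/{names}/JSON"


def parse_properties(response_text, wanted):
    """Return {cid: {property: value or None}} from a property response."""
    rows = json.loads(response_text)["PropertyTable"]["Properties"]
    return {row["CID"]: {name: row.get(name) for name in wanted} for row in rows}


def fetch_properties(cid, properties, timeout=10.0):
    """Download and parse properties (requires network; not called here)."""
    with urlopen(property_url(cid, properties), timeout=timeout) as resp:
        return parse_properties(resp.read().decode("utf-8"), properties)


# Hand-written sample in the assumed layout; XLogP is deliberately absent.
sample = json.dumps({"PropertyTable": {"Properties": [
    {"CID": 2244, "MolecularWeight": "180.16",
     "InChIKey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"}
]}})

wanted = ["MolecularWeight", "InChIKey", "XLogP"]
print(property_url(2244, ["MolecularWeight", "InChIKey"]))
print(parse_properties(sample, wanted)[2244])
```

In a notebook with internet access, `fetch_properties(2244, ["MolecularWeight", "InChIKey"])` would perform the live request.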
1. Structural Identifiers#
| Property | Description |
|---|---|
| InChI | IUPAC International Chemical Identifier, a textual representation of a compound’s structure. |
| InChIKey | A hashed version of the InChI, fixed-length and easier for indexing/search. |
| Canonical SMILES | A unique SMILES string for a compound (canonicalized). |
| Isomeric SMILES | A SMILES string including stereochemistry and isotopes. |
| IUPAC Name | Systematic name as per IUPAC rules (can be multiple variants). |
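Because the InChIKey is fixed-length, its shape is easy to check before using it in a search. A standard InChIKey has 27 characters: a 14-letter block derived from the connectivity, a 10-letter block covering the remaining layers plus flag characters, and a final letter, separated by hyphens. The sketch below only tests that shape. It cannot tell whether a key corresponds to a real structure, and the function names are illustrative.

```python
import re

# 14 uppercase letters, 10 uppercase letters, 1 uppercase letter.
_INCHIKEY_SHAPE = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def looks_like_inchikey(text: str) -> bool:
    """Return True if text has the 14-10-1 block layout of an InChIKey."""
    return _INCHIKEY_SHAPE.fullmatch(text) is not None


def connectivity_block(inchikey: str) -> str:
    """Return the first block, which is shared by stereoisomers."""
    if not looks_like_inchikey(inchikey):
        raise ValueError(f"not InChIKey-shaped: {inchikey!r}")
    return inchikey.split("-")[0]


aspirin_key = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"  # InChIKey of CID 2244 (aspirin)
print(looks_like_inchikey(aspirin_key))
print(connectivity_block(aspirin_key))
```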
2. Molecular Properties#
| Property | Description | Units |
|---|---|---|
| Molecular Weight | Sum of atomic weights (average, not monoisotopic). | g/mol |
| Exact Mass | Mass of a single isotopic species of the molecule (an isotope-specific mass rather than an isotopic average). | Da |
| Monoisotopic Mass | Mass computed using the most abundant isotope of each element; matches Exact Mass for most small molecules. | Da |
| Heavy Atom Count | Number of non-hydrogen atoms. | unitless |
| Atom Count | Total number of atoms, including H. | unitless |
| Isotope Atom Count | Atoms with isotopic specification. | unitless |
| Defined Atom Stereocenter Count | Number of chiral centers with defined stereochemistry. | unitless |
| Undefined Atom Stereocenter Count | Chiral centers without defined stereochemistry. | unitless |
| Defined Bond Stereocenter Count | Bonds with defined E/Z (cis/trans) stereochemistry. | unitless |
| Undefined Bond Stereocenter Count | Bonds with undefined stereochemistry. | unitless |
| Covalently-Bonded Unit Count | Number of disconnected molecular units (e.g., salts). | unitless |
| Component Count | Number of discrete parts in a compound (e.g., ion pairs). | unitless |
| Hydrogen Bond Donor Count | Number of -OH or -NH groups that can donate hydrogen. | unitless |
| Hydrogen Bond Acceptor Count | Number of atoms that can accept hydrogen bonds. | unitless |
| Rotatable Bond Count | Bonds that can rotate freely (single bonds between non-terminal heavy atoms). | unitless |
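The difference between an average molecular weight and a monoisotopic mass is easiest to see with a worked calculation. The sketch below uses aspirin (CID 2244, formula C9H8O4) and a small hand-entered table of rounded atomic masses; it is an illustration of the arithmetic, not PubChem's own computation.

```python
# Rounded standard atomic weights (isotopic average), in g/mol.
AVERAGE_MASS = {"C": 12.011, "H": 1.008, "O": 15.999}

# Masses of the most abundant isotopes (12C, 1H, 16O), in Da.
MONOISOTOPIC_MASS = {"C": 12.0, "H": 1.007825032, "O": 15.994914620}


def formula_mass(composition, mass_table):
    """Sum element masses weighted by how often each element occurs."""
    return sum(mass_table[element] * count
               for element, count in composition.items())


aspirin = {"C": 9, "H": 8, "O": 4}

molecular_weight = formula_mass(aspirin, AVERAGE_MASS)
monoisotopic_mass = formula_mass(aspirin, MONOISOTOPIC_MASS)

print(f"Molecular weight:  {molecular_weight:.3f} g/mol")
print(f"Monoisotopic mass: {monoisotopic_mass:.5f} Da")
```

The two values differ by roughly 0.12 because the average includes the heavier, less abundant isotopes such as 13C.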
3. Charge and Partitioning#
| Property | Description | Units |
|---|---|---|
| Formal Charge | Net integer charge assigned to the molecule. | unitless |
| Topological Polar Surface Area (TPSA) | Approximate surface area involved in polar interactions. | Ų |
| XLogP3 | Predicted logP (partition coefficient octanol/water) using XLogP3 method. | unitless |
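Since logP is the base-10 logarithm of the octanol/water partition coefficient, a predicted value converts directly into a concentration ratio. The sketch below shows that arithmetic. The fraction calculation assumes equal volumes of the two phases and a neutral, non-ionizing solute, and the input value 1.2 is only an illustrative number.

```python
def partition_ratio(log_p: float) -> float:
    """Convert logP to the octanol/water concentration ratio P."""
    return 10.0 ** log_p


def fraction_in_octanol(log_p: float) -> float:
    """Fraction of solute in octanol, assuming equal phase volumes."""
    ratio = partition_ratio(log_p)
    return ratio / (1.0 + ratio)


example_log_p = 1.2  # illustrative value, not looked up from PubChem
print(f"P = {partition_ratio(example_log_p):.1f}")
print(f"Fraction in octanol = {fraction_in_octanol(example_log_p):.2f}")
```

A logP of 0 means the compound distributes evenly between the phases, and each additional unit means a tenfold stronger preference for octanol.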
4. Electronic Properties (Less Common)#
| Property | Description | Units |
|---|---|---|
| Complexity | A computed measure of structural complexity (based on rings, branches, etc.). | unitless |
| Charge | Calculated net molecular charge (can vary by protonation state). | unitless |
5. Geometric and Topological Descriptors#
| Property | Description |
|---|---|
| Feature Count | Total count of 2D/3D features like rings, stereocenters. |
| Ring Count | Number of rings detected in structure. |
| Tautomer Count | Number of tautomers (computed or predicted). |
| Bond Count | Total number of bonds in the structure. |
Bioassays (AID)#
BioAssays in PubChem are biological test results that evaluate the activity or behavior of chemical substances (usually small molecules, sometimes RNAi or biologics) in a biological system (e.g., a cell line, protein target, or organism). They represent experimentally derived bioactivity data, which are critical for understanding drug efficacy, toxicity, and mechanism of action. Compounds are typically classified as active, inactive, or inconclusive under the conditions of the bioassay.
Each BioAssay has:
AID: Assay Identifier
CID / SID associations: Shows which compound/substance was tested
Activity Outcome: Active / Inactive / Inconclusive
Concentration-response: EC50, IC50, etc.
Target/Organism: What was tested
Protocol Description
Links to depositor source, gene/protein, PubMed ID
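Once assay results are in hand, a common first step is to count how many tested compounds fall into each activity outcome. The sketch below tallies the three outcomes named above. The record layout and every AID and CID in it are hypothetical, made up for this example and not taken from PubChem.

```python
from collections import Counter

ALLOWED_OUTCOMES = {"active", "inactive", "inconclusive"}


def tally_outcomes(records):
    """Count activity outcomes, rejecting labels outside the allowed set."""
    counts = Counter()
    for rec in records:
        outcome = rec["outcome"].strip().lower()
        if outcome not in ALLOWED_OUTCOMES:
            raise ValueError(f"unexpected outcome: {rec['outcome']!r}")
        counts[outcome] += 1
    return counts


# Hypothetical results for a single assay (identifiers are placeholders).
records = [
    {"aid": 1, "cid": 101, "outcome": "Active"},
    {"aid": 1, "cid": 102, "outcome": "Inactive"},
    {"aid": 1, "cid": 103, "outcome": "Active"},
    {"aid": 1, "cid": 104, "outcome": "Inconclusive"},
]

counts = tally_outcomes(records)
for outcome in sorted(counts):
    print(outcome, counts[outcome])
```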
Bioassay Categories#
| Assay Type | Description |
|---|---|
| Primary Screening | High-throughput screens (HTS) used to identify compounds with potential biological activity. |
| Confirmatory Assays | Follow-up tests that validate and refine initial screening hits. |
| Summary Assays | Integrative results or analyses across multiple primary/confirmatory assays. |
| Counter Screens | Used to rule out false positives or identify interference (e.g., cytotoxicity not target-specific). |
| RNAi Assays | Use gene knockdown to assess how silencing a gene affects biological pathways. |
| ADMET Assays | Evaluate Absorption, Distribution, Metabolism, Excretion, and Toxicity characteristics. |
| Mechanism-of-Action | Designed to reveal how a compound works (e.g., kinase inhibition, GPCR activation). |
| Structure-Activity Relationship (SAR) | Explore how structural changes in molecules affect bioactivity. |
Potency Metrics (Half-Maximal Effect Values)#
| Metric | Meaning | Common Use |
|---|---|---|
| EC₅₀ | Effective Concentration at which 50% of maximum effect is observed. | Agonists or stimulators |
| IC₅₀ | Inhibitory Concentration that reduces response (e.g., enzyme activity, cell viability) by 50%. | Antagonists, inhibitors |
| AC₅₀ | Activity Concentration at which 50% of measured biological activity is seen. | Often used in PubChem for general bioactivity assays |
| GI₅₀ | Concentration that causes 50% growth inhibition in a cell population. | Oncology / cytotoxicity studies |
| LD₅₀ | Lethal Dose at which 50% of test organisms die (usually in vivo). | Toxicology (often animals) |
| TD₅₀ | Toxic Dose causing toxic effects in 50% of subjects. | Clinical/toxicology threshold |
| ED₅₀ | Effective Dose at which 50% of a population shows a therapeutic effect (typically in vivo). | Pharmacology, drug response in organisms |
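To make the "half-maximal" idea concrete, the sketch below models a concentration-response curve with the Hill (four-parameter logistic) equation. This model is a common choice for fitting such data but is an assumption here, as are all parameter names; it is not described in the tutorial itself. The second helper converts a half-maximal concentration in mol/L to its negative base-10 logarithm, a scale often used to compare potencies.

```python
import math


def hill_response(conc, ec50, top=100.0, bottom=0.0, hill=1.0):
    """Response at a concentration under a four-parameter Hill model."""
    if conc < 0 or ec50 <= 0:
        raise ValueError("conc must be >= 0 and ec50 must be > 0")
    if conc == 0:
        return bottom  # no compound, baseline response
    return bottom + (top - bottom) / (1.0 + (ec50 / conc) ** hill)


def p_value_from_molar(half_max_molar):
    """Negative log10 of a half-maximal concentration given in mol/L."""
    if half_max_molar <= 0:
        raise ValueError("concentration must be positive")
    return -math.log10(half_max_molar)


ec50 = 1e-6  # 1 micromolar, an illustrative value
for conc in (1e-8, 1e-7, 1e-6, 1e-5, 1e-4):
    print(f"{conc:.0e} M -> {hill_response(conc, ec50):5.1f} % of maximum")
print(f"pEC50 = {p_value_from_molar(ec50):.1f}")
```

At a concentration equal to the EC₅₀ the modeled response sits exactly halfway between `bottom` and `top`, which is the definition given in the table.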
Gene Records#
Derived from NCBI Gene
Each gene has a Gene ID, symbol, synonyms, genomic context
PubChem Gene records list:
BioAssays where this gene is the target or regulated
Compounds known to affect this gene (activators, inhibitors, etc.)
Associated diseases or biological pathways
Example: TP53 Gene Record in PubChem
Protein Records#
Derived from NCBI Protein and UniProt databases
Indexed using UniProt IDs
Each protein has a Protein Accession Number (e.g., NP_000537)
PubChem Protein records describe:
Structure and function of the protein
Protein–compound interactions (from BioAssays)
Sequence information and domain structures
Example: SARS-CoV-2 Spike glycoprotein (S)
Pathways#
PubChem Pathways describe information related to biological pathways (see the PubChem Pathways Documentation). These are maps of biochemical reactions or processes that occur in living organisms, such as the citric acid cycle. As of May 2025 there are over 250,000 pathways available through PubChem. These are not stored in PubChem, so the URL identifies the source and the source's ID number: https://pubchem.ncbi.nlm.nih.gov/pathway/SOURCE:ExternalID.
Sources include:
| SOURCE | Description |
|---|---|
| Reactome | Expert-curated human pathways |
| KEGG | Pathways across species, includes metabolic maps |
| WikiPathways | Community-curated, often includes disease pathways |
| NCBI BioSystems (legacy) | Original PubChem pathways source (now deprecated) |
Examples:#
Reactome apoptosis pathway: https://pubchem.ncbi.nlm.nih.gov/pathway/Reactome:R-HSA-109581
WikiPathways cholesterol biosynthesis: https://pubchem.ncbi.nlm.nih.gov/pathway/WikiPathways:WP197
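The SOURCE:ExternalID pattern can be wrapped in two small helpers, one to build a pathway URL and one to take it apart again. This is a minimal sketch with illustrative function names; it does not check that the source or the ID actually exists in PubChem.

```python
PATHWAY_BASE = "https://pubchem.ncbi.nlm.nih.gov/pathway"


def pathway_url(source: str, external_id: str) -> str:
    """Build a PubChem pathway URL from a source name and its own ID."""
    return f"{PATHWAY_BASE}/{source}:{external_id}"


def split_pathway_url(url: str) -> tuple:
    """Recover (source, external_id) from a PubChem pathway URL."""
    prefix = PATHWAY_BASE + "/"
    if not url.startswith(prefix):
        raise ValueError(f"not a PubChem pathway URL: {url!r}")
    source, separator, external_id = url[len(prefix):].partition(":")
    if not separator or not source or not external_id:
        raise ValueError(f"expected SOURCE:ExternalID in {url!r}")
    return source, external_id


print(pathway_url("Reactome", "R-HSA-109581"))
print(split_pathway_url(pathway_url("WikiPathways", "WP197")))
```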
Cell Lines#
A cell line is a population of cells grown in a lab (in vitro) that originates from a single source, such as human tissue, animal tissue, or cancer cells, and is capable of continuous division. Information on cell lines can be found at [PubChem Cell Lines](https://pubchem.ncbi.nlm.nih.gov/docs/cell-lines), and as of May 2025 there are over 2,000 cell lines within PubChem.
Features of Cell Lines#
| Feature | Description |
|---|---|
| Clonal origin | All cells in the line descend from a single cell |
| In vitro growth | Can be grown indefinitely under lab conditions |
| Reproducibility | Provide consistent biological behavior, ideal for experiments |
| Defined characteristics | Origin (e.g., human lung), type (e.g., epithelial), cancerous or not |
Applications of Cell Lines#
| Use Case | Purpose |
|---|---|
| Drug screening | See how a compound affects cancer or healthy cells |
| Toxicity testing | Determine if a chemical is harmful to cells |
| Mechanism of action | Reveal how a drug alters cell growth, gene expression, metabolism |
| Virology and infection | Study how viruses enter or replicate in host cells |
| Gene silencing (RNAi) | Understand the function of specific genes in a controlled setting |
Taxonomies#
Taxonomy records aggregate data by a specific organism (taxon), such as humans. The URL is https://pubchem.ncbi.nlm.nih.gov/taxonomy/TAXON, where TAXON is the organism's name, as in https://pubchem.ncbi.nlm.nih.gov/taxonomy/human. PubChem also gets data from the NCBI Taxonomy Database (https://www.ncbi.nlm.nih.gov/taxonomy), and you can use the taxonomy ID in the link instead: https://pubchem.ncbi.nlm.nih.gov/taxonomy/9606.
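Both forms of the taxonomy link can be produced by one helper that accepts either a name or a numeric NCBI Taxonomy ID. The sketch below is illustrative. Percent-encoding names that contain spaces is a precaution, and whether PubChem resolves every such name is an assumption that has not been verified here.

```python
from urllib.parse import quote

TAXONOMY_BASE = "https://pubchem.ncbi.nlm.nih.gov/taxonomy"


def taxonomy_url(taxon) -> str:
    """Build a PubChem taxonomy URL from a name or an NCBI Taxonomy ID."""
    text = str(taxon).strip()
    if not text:
        raise ValueError("taxon must not be empty")
    # quote() leaves letters and digits alone and encodes spaces as %20.
    return f"{TAXONOMY_BASE}/{quote(text)}"


print(taxonomy_url("human"))
print(taxonomy_url(9606))
```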
Types of Data
| Data Type | How It Relates to the Organism |
|---|---|
| Substances (SIDs) | Extracted from or derived from the organism |
| Compounds (CIDs) | Mapped from substances or associated with assays targeting the organism |
| BioAssays (AIDs) | Run using the organism’s cells, tissues, or proteins |
| Proteins / Genes | From the organism’s genome |
| Pathways | Mapped from the organism’s molecular biology |
| Literature | References about those biological systems |
Taxonomies allow you to filter for species-specific drug effects, understand which species are most studied, and compare activities across species.
Patents#
A patent is a legal right granted by a government that gives the inventor exclusive control over how an invention is used — for a limited time — in exchange for publishing how it works.
Patents and IP (Intellectual Property)
| Feature | Explanation |
|---|---|
| Exclusive rights | The patent holder can stop others from making, using, or selling the invention |
| Time-limited | Usually 20 years from filing (varies by country and patent type) |
| Public disclosure | In exchange, the invention’s details are published, contributing to scientific knowledge |
| Enforced nationally | Patents are jurisdiction-specific, meaning they only apply in countries where they are granted |
Even though PubChem is a scientific database, it includes patent links because:
Many small molecules are first disclosed in patents — especially in drug discovery.
Patents often describe bioactivity, structure–activity relationships (SAR), and pharmacological targets, even before academic publication.
Scientists and companies need to know:
Is this compound already patented?
Has it been claimed for a therapeutic use?
Can I freely study or commercialize it?
PubChem aggregates patent–compound links from:
| Source | Description |
|---|---|
| SureChEMBL | Text-mined chemical structures from full-text patents, curated by EMBL-EBI |
| IBM/NIH Open Access | A legacy collaboration using machine learning to extract structures |
| PatentsView | U.S. patent metadata, searchable at https://patentsview.org |
| Depositor-provided | Some data submitters include patent associations explicitly |
Acknowledgements#
This content was developed using a vibe coding type of process where the author interacted with an AI to generate the content. This content follows the PubChem Tutorial with multiple queries to Perplexity AI and ChatGPT being conducted over the summer of 2025. Copyright CC 0.0 by Bob Belford