2.1 PubChem Data Types

PubChem Training Course#

This notebook is designed to follow the NLM training course on the use of PubChem. Each student will have their own copy of this notebook and should edit it as they see fit. That is, part of the goal of this initial activity is to assist students in creating content within Jupyter Notebooks. This is also a chance for students to develop skills with vibe coding.

PubChem Data Types#

The terms “compound” and “substance” refer to two different data types in PubChem. For details, see the PubChem help page “What is the difference between a substance and a compound in PubChem?”

Substance (SID)#

Substances are the records associated with data uploaded to PubChem by individual depositors (sources). They include metadata such as who the source is, the batch, and the testing conditions. Each substance has a unique identifier, its SID. This allows PubChem to track data provenance and ensures the traceability, accountability and contextual richness of the chemical data. In addition to providing the data, the substance record answers:

Who submitted the data
Where the data came from
How the data was generated
Under what conditions or protocols it was collected

One chemical (compound) can have multiple SIDs as more than one source can upload data related to a chemical.

Compound (CID)#

If a substance can be described by a chemical structure it can be assigned to a PubChem compound record. This requires a structure standardization pipeline and involves chemical identifiers like the IUPAC InChI. In essence a compound record aggregates:

Data from multiple SIDs
Computed Properties (LogP, molecular weight,…)
Synonyms
Cross-links to literature, patents, spectra and other databases
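The relationship between substances and compounds can be pictured as a grouping step: several deposited substance records that resolve to the same standardized structure are collected under one compound. The sketch below is only a toy model of that idea. The `Substance` class, the SID values, the depositor names and the `structure_key` strings are all hypothetical, and PubChem's actual standardization pipeline is far more involved than a simple key lookup.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Substance:
    sid: int             # hypothetical SID, for illustration only
    source: str          # hypothetical depositor name
    structure_key: str   # stand-in for the standardized structure


def group_by_structure(substances):
    """Collect the SIDs that share one standardized structure."""
    groups = defaultdict(list)
    for sub in substances:
        groups[sub.structure_key].append(sub.sid)
    return dict(groups)


deposits = [
    Substance(sid=1, source="Vendor A", structure_key="STRUCTURE-A"),
    Substance(sid=2, source="Lab B", structure_key="STRUCTURE-A"),
    Substance(sid=3, source="Vendor A", structure_key="STRUCTURE-B"),
]

compounds = group_by_structure(deposits)
for key, sids in compounds.items():
    print(key, sids)
```

In this toy data, two depositors submitted the same structure, so one "compound" ends up with two SIDs while the other has one.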

Computed Properties#

Not all properties are always present — they depend on successful structure processing.
You can access them via:
- The PubChem Compound Summary Page under “Computed Properties”
- The PubChem REST API using: https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/property/MolecularWeight,InChIKey/JSON
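As a minimal sketch of the REST route, the snippet below builds the same kind of URL shown above and reads a property record from a JSON response. The helper names `property_url` and `parse_properties` are made up for this example, and the sample response text was typed by hand to mirror the `PropertyTable` layout that PUG REST is understood to return; it was not fetched live. Because a property may be missing for some compounds, the record is read with `.get()` rather than direct indexing.

```python
import json

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def property_url(cid, properties):
    """Build a PUG REST URL requesting computed properties for one CID."""
    return f"{BASE}/compound/cid/{cid}/property/{','.join(properties)}/JSON"


def parse_properties(payload):
    """Return the first property record from a PropertyTable response."""
    return json.loads(payload)["PropertyTable"]["Properties"][0]


url = property_url(2244, ["MolecularWeight", "InChIKey"])
print(url)

# Illustrative response text (typed by hand, not fetched live).
sample = (
    '{"PropertyTable": {"Properties": [{"CID": 2244, '
    '"MolecularWeight": "180.16", '
    '"InChIKey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"}]}}'
)
record = parse_properties(sample)
print(record["MolecularWeight"])
print(record.get("XLogP", "not present"))  # property absent from this record
```

To try it against the live service, the text returned by `urllib.request.urlopen(url).read()` could be passed to `parse_properties` in place of `sample`.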

1. Structural Identifiers#

Property	Description
InChI	IUPAC International Chemical Identifier, a textual representation of a compound’s structure.
InChIKey	A hashed version of the InChI, fixed-length and easier for indexing/search.
Canonical SMILES	A unique SMILES string for a compound (canonicalized).
Isomeric SMILES	A SMILES string including stereochemistry and isotopes.
IUPAC Name	Systematic name as per IUPAC rules (can be multiple variants).

2. Molecular Properties#

Property	Description	Units
Molecular Weight	Sum of atomic weights (average, not monoisotopic).	g/mol
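Exact Mass	Mass computed from the isotopes specified in the structure (the most abundant isotope is used where none is specified).	Da
Monoisotopic Mass	Mass computed from the most abundant isotope of each element; equal to Exact Mass unless the structure specifies isotopes.	Da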
Heavy Atom Count	Number of non-hydrogen atoms.	unitless
Atom Count	Total number of atoms, including H.	unitless
Isotope Atom Count	Atoms with isotopic specification.	unitless
Defined Atom Stereocenter Count	Number of chiral centers with defined stereochemistry.	unitless
Undefined Atom Stereocenter Count	Chiral centers without defined stereochemistry.	unitless
Defined Bond Stereocenter Count	Bonds with defined E/Z (cis/trans) stereochemistry.	unitless
Undefined Bond Stereocenter Count	Bonds with undefined stereochemistry.	unitless
Covalently-Bonded Unit Count	Number of disconnected molecular units (e.g., salts).	unitless
Component Count	Number of discrete parts in a compound (e.g., ion pairs).	unitless
Hydrogen Bond Donor Count	Number of -OH or -NH groups that can donate hydrogen.	unitless
Hydrogen Bond Acceptor Count	Number of atoms that can accept hydrogen bonds.	unitless
Rotatable Bond Count	Bonds that can rotate freely (single bonds between non-terminal heavy atoms).	unitless
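The difference between Molecular Weight and Monoisotopic Mass in the table above can be checked with a little arithmetic. The sketch below uses aspirin (C9H8O4, which is CID 2244) and rounded atomic masses typed in by hand; the `mass` helper and the dictionary names are made up for this example, and PubChem's own calculation may use more precise constants.

```python
# Average atomic weights (natural isotope mixture), rounded.
AVERAGE = {"C": 12.011, "H": 1.008, "O": 15.999}

# Masses of the most abundant isotope of each element (12C, 1H, 16O).
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "O": 15.99491462}


def mass(formula, table):
    """Sum atomic masses over an element -> count mapping."""
    return sum(table[element] * count for element, count in formula.items())


aspirin = {"C": 9, "H": 8, "O": 4}

molecular_weight = mass(aspirin, AVERAGE)
monoisotopic_mass = mass(aspirin, MONOISOTOPIC)

print(f"Molecular weight:  {molecular_weight:.2f} g/mol")
print(f"Monoisotopic mass: {monoisotopic_mass:.4f} Da")
```

The two values differ by roughly 0.1 because the average weights include the heavier, less abundant isotopes such as 13C.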

3. Charge and Partitioning#

Property	Description	Units
Formal Charge	Net integer charge assigned to the molecule.	unitless
Topological Polar Surface Area (TPSA)	Approximate surface area involved in polar interactions.	Å²
XLogP3	Predicted logP (partition coefficient octanol/water) using XLogP3 method.	unitless

4. Electronic Properties (Less Common)#

Property	Description	Units
Complexity	A computed measure of structural complexity (based on rings, branches, etc.).	unitless
Charge	Calculated net molecular charge (can vary by protonation state).	unitless

5. Geometric and Topological Descriptors#

Property	Description
Feature Count	Total count of 2D/3D features like rings, stereocenters.
Ring Count	Number of rings detected in structure.
Tautomer Count	Number of tautomers (computed or predicted).
Bond Count	Total number of bonds in the structure.

Bioassays (AID)#

BioAssays in PubChem are biological test results that evaluate the activity or behavior of chemical substances (usually small molecules, sometimes RNAi or biologics) in a biological system (e.g., a cell line, protein target, or organism). They represent experimentally derived bioactivity data, which are critical for understanding drug efficacy, toxicity, and mechanism of action. Compounds are typically classified as active, inactive or inconclusive under the conditions of the bioassay.

Each BioAssay has:

AID: Assay Identifier
CID / SID associations: Shows which compound/substance was tested
Activity Outcome: Active / Inactive / Inconclusive
Concentration-response: EC50, IC50, etc.
Target/Organism: What was tested
Protocol Description
Links to depositor source, gene/protein, PubMed ID
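One way to picture the fields listed above is as a small record type, with one entry per tested compound. The sketch below is not PubChem's data model: the `AssayResult` class, the AID, the CIDs and the IC50 values are all hypothetical and exist only to show how activity outcomes might be tallied once results are in hand.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class AssayResult:
    aid: int                          # assay identifier (hypothetical)
    cid: int                          # tested compound (hypothetical)
    outcome: str                      # "active", "inactive" or "inconclusive"
    ic50_um: Optional[float] = None   # potency in micromolar, if measured


results = [
    AssayResult(aid=1, cid=101, outcome="active", ic50_um=0.8),
    AssayResult(aid=1, cid=102, outcome="inactive"),
    AssayResult(aid=1, cid=103, outcome="inconclusive"),
    AssayResult(aid=1, cid=104, outcome="active", ic50_um=12.5),
]

# Count how many compounds fall into each activity outcome.
outcome_counts = Counter(result.outcome for result in results)

# List the active compounds, most potent (lowest IC50) first.
actives = sorted(
    (r for r in results if r.outcome == "active"),
    key=lambda r: r.ic50_um,
)

print(dict(outcome_counts))
print([r.cid for r in actives])
```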

Bioassay Categories#

Assay Type	Description
Primary Screening	High-throughput screens (HTS) used to identify compounds with potential biological activity.
Confirmatory Assays	Follow-up tests that validate and refine initial screening hits.
Summary Assays	Integrative results or analyses across multiple primary/confirmatory assays.
Counter Screens	Used to rule out false positives or identify interference (e.g., cytotoxicity not target-specific).
RNAi Assays	Use gene knockdown to assess how silencing a gene affects biological pathways.
ADMET Assays	Evaluate Absorption, Distribution, Metabolism, Excretion, and Toxicity characteristics.
Mechanism-of-Action	Designed to reveal how a compound works (e.g., kinase inhibition, GPCR activation).
Structure-Activity Relationship (SAR)	Explore how structural changes in molecules affect bioactivity.

Potency Metrics (Half-Maximal Effect Values)#

Metric	Meaning	Common Use
EC₅₀	Effective Concentration at which 50% of maximum effect is observed.	Agonists or stimulators
IC₅₀	Inhibitory Concentration that reduces response (e.g., enzyme activity, cell viability) by 50%.	Antagonists, inhibitors
AC₅₀	Activity Concentration at which 50% of measured biological activity is seen.	Often used in PubChem for general bioactivity assays
GI₅₀	Concentration that causes 50% growth inhibition in a cell population.	Oncology / cytotoxicity studies
LD₅₀	Lethal Dose at which 50% of test organisms die (usually in vivo).	Toxicology (often animals)
TD₅₀	Toxic Dose causing toxic effects in 50% of subjects.	Clinical/toxicology threshold
ED₅₀	Effective Dose at which 50% of a population shows a therapeutic effect (typically in vivo).	Pharmacology, drug response in organisms
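To see what "half-maximal" means in practice, the sketch below evaluates a simple Hill-type concentration-response curve. This model is a common textbook choice and is an assumption here, since assays in PubChem may fit their data differently. The function name `percent_inhibition`, the IC50 value and the concentrations are hypothetical.

```python
def percent_inhibition(conc, ic50, hill=1.0):
    """Percent inhibition at a given concentration under a Hill model.

    conc and ic50 must share the same units; hill is the slope factor.
    """
    return 100.0 * conc**hill / (ic50**hill + conc**hill)


ic50 = 2.0  # micromolar, hypothetical

for conc in (0.2, 2.0, 20.0):
    print(f"{conc:5.1f} uM -> {percent_inhibition(conc, ic50):.1f}% inhibition")
```

When the concentration equals the IC50, the model gives exactly half of the maximum response, which is the definition used in the table above. Concentrations ten-fold below and above the IC50 give roughly 9% and 91% with a Hill slope of 1.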

Gene Records#

Derived from NCBI Gene
Each gene has a Gene ID, symbol, synonyms, genomic context
PubChem Gene records list:
- BioAssays where this gene is the target or regulated
- Compounds known to affect this gene (activators, inhibitors, etc.)
- Associated diseases or biological pathways

Example: TP53 Gene Record in PubChem

Protein Records#

Derived from NCBI Protein and UniProt databases
Indexed using UniProt IDs
Each protein has a Protein Accession Number (e.g., NP_000537)
PubChem Protein records describe:
- Structure and function of the protein
- Protein–compound interactions (from BioAssays)
- Sequence information and domain structures

Example: SARS-CoV-2 Spike glycoprotein (S)

Pathways#

PubChem Pathways describe biological pathways (see the PubChem Pathways Documentation). These are maps of biochemical reactions or processes that occur in living organisms, such as the citric acid cycle. As of May 2025 there are over 250,000 pathways available through PubChem. These are not stored in PubChem, and the URL identifies both the source and the source's ID number: https://pubchem.ncbi.nlm.nih.gov/pathway/SOURCE:ExternalID.

Sources include:

SOURCE	Description
Reactome	Expert-curated human pathways
KEGG	Pathways across species, includes metabolic maps
WikiPathways	Community-curated, often includes disease pathways
NCBI BioSystems (legacy)	Original PubChem pathways source (now deprecated)

Examples:#

Reactome apoptosis pathway: https://pubchem.ncbi.nlm.nih.gov/pathway/Reactome:R-HSA-109581
WikiPathways cholesterol biosynthesis: https://pubchem.ncbi.nlm.nih.gov/pathway/WikiPathways:WP197
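The SOURCE:ExternalID pattern can be assembled and taken apart with a few lines of Python. The helper names `pathway_url` and `split_pathway_url` are made up for this sketch, which simply follows the URL pattern shown in the two examples above.

```python
PATHWAY_BASE = "https://pubchem.ncbi.nlm.nih.gov/pathway"


def pathway_url(source, external_id):
    """Build a PubChem pathway URL from a source name and that source's ID."""
    return f"{PATHWAY_BASE}/{source}:{external_id}"


def split_pathway_url(url):
    """Recover (source, external_id) from a PubChem pathway URL."""
    last_segment = url.rsplit("/", 1)[1]
    source, external_id = last_segment.split(":", 1)
    return source, external_id


reactome_url = pathway_url("Reactome", "R-HSA-109581")
wikipathways_url = pathway_url("WikiPathways", "WP197")

print(reactome_url)
print(split_pathway_url(wikipathways_url))
```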

Cell Lines#

A cell line is a population of cells grown in a lab (in vitro) that originates from a single source, such as human tissue, animal tissue, or cancer cells, and is capable of continuous division. Information on cell lines can be found at [PubChem Cell Lines](https://pubchem.ncbi.nlm.nih.gov/docs/cell-lines), and as of May 2025 there are over 2,000 cell lines within PubChem.

Features of Cell Lines#

Feature	Description
Clonal origin	All cells in the line descend from a single cell
In vitro growth	Can be grown indefinitely under lab conditions
Reproducibility	Provide consistent biological behavior — ideal for experiments
Defined characteristics	Origin (e.g., human lung), type (e.g., epithelial), cancerous or not

Applications of Cell Lines#

Use Case	Purpose
Drug screening	See how a compound affects cancer or healthy cells
Toxicity testing	Determine if a chemical is harmful to cells
Mechanism of action	Reveal how a drug alters cell growth, gene expression, metabolism
Virology and infection	Study how viruses enter or replicate in host cells
Gene silencing (RNAi)	Understand the function of specific genes in a controlled setting

Taxonomies#

Taxonomy records aggregate data by a specific organism (taxon), such as humans. The URL is https://pubchem.ncbi.nlm.nih.gov/taxonomy/TAXON, where TAXON is the organism's name, for example https://pubchem.ncbi.nlm.nih.gov/taxonomy/human. PubChem also gets data from the NCBI Taxonomy Database (https://www.ncbi.nlm.nih.gov/taxonomy), so you can use the NCBI taxonomy ID in the link instead: https://pubchem.ncbi.nlm.nih.gov/taxonomy/9606
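A small helper can build taxonomy links from either an organism name or a numeric taxonomy ID. The function name `taxonomy_url` is made up for this sketch. Percent-encoding the name with `urllib.parse.quote` is a precaution added here for names that contain spaces; it is an assumption and is not described in the PubChem documentation quoted above.

```python
from urllib.parse import quote

TAXONOMY_BASE = "https://pubchem.ncbi.nlm.nih.gov/taxonomy"


def taxonomy_url(taxon):
    """Build a PubChem taxonomy URL from an organism name or a taxonomy ID."""
    return f"{TAXONOMY_BASE}/{quote(str(taxon))}"


by_name = taxonomy_url("human")
by_id = taxonomy_url(9606)

print(by_name)
print(by_id)
```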

Types of Data

Data Type	How It Relates to the Organism
Substances (SIDs)	Extracted from or derived from the organism
Compounds (CIDs)	Mapped from substances or associated with assays targeting the organism
BioAssays (AIDs)	Run using the organism’s cells, tissues, or proteins
Proteins / Genes	From the organism’s genome
Pathways	Mapped from the organism’s molecular biology
Literature	References about those biological systems

Taxonomies allow you to filter for species-specific drug effects, understand which species are most studied, and compare activities across species.

Patents#

A patent is a legal right granted by a government that gives the inventor exclusive control over how an invention is used — for a limited time — in exchange for publishing how it works.

Patents and IP (Intellectual Property)

Feature	Explanation
Exclusive rights	The patent holder can stop others from making, using, or selling the invention
Time-limited	Usually 20 years from filing (varies by country and patent type)
Public disclosure	In exchange, the invention’s details are published, contributing to scientific knowledge
Enforced nationally	Patents are jurisdiction-specific, meaning they only apply in countries where they are granted

Even though PubChem is a scientific database, it includes patent links because:

Many small molecules are first disclosed in patents — especially in drug discovery.
Patents often describe bioactivity, structure–activity relationships (SAR), and pharmacological targets, even before academic publication.
Scientists and companies need to know:
- Is this compound already patented?
- Has it been claimed for a therapeutic use?
- Can I freely study or commercialize it?

PubChem aggregates patent–compound links from:

Source	Description
SureChEMBL	Text-mined chemical structures from full-text patents, curated by EMBL-EBI
IBM/NIH Open Access	A legacy collaboration using machine learning to extract structures
PatentsView	U.S. patent metadata, searchable at https://patentsview.org
Depositor-provided	Some data submitters include patent associations explicitly

Acknowledgements#

This content was developed using a vibe coding type of process where the author interacted with an AI to generate the content. This content follows the PubChem Tutorial with multiple queries to Perplexity AI and ChatGPT being conducted over the summer of 2025. Copyright CC 0.0 by Bob Belford