2.1 PubChem Data Types#

PubChem Training Course#

This notebook is designed to follow the NLM training course on the use of PubChem. Each student will have their own copy of this notebook and should edit it as they see fit. That is, part of the goal of this initial activity is to assist students in creating content within Jupyter Notebooks. This is also a chance for students to develop skills with vibe coding.

PubChem Data Types#

The terms “compound” and “substance” refer to two different data types in PubChem; see “What is the difference between a substance and a compound in PubChem?”

Substance (SID)#

Substances are the records associated with data uploaded to PubChem by individual depositors (sources). They include metadata such as who the source is, the batch, and the testing conditions. Each substance has a unique identifier, its SID. This allows PubChem to track data provenance and ensures the traceability, accountability, and contextual richness of the chemical data. In addition to providing the data, the record answers:

  • Who submitted the data

  • Where the data came from

  • How the data were generated

  • Under what conditions or protocols the data were collected

One chemical (compound) can have multiple SIDs as more than one source can upload data related to a chemical.
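As a rough sketch of how this one-compound-to-many-substances relationship shows up in practice, the Python below builds a PUG REST request for the SIDs linked to a CID and parses the reply. The `sids` operation and the `InformationList` layout reflect my understanding of PUG REST and should be checked against the current documentation. The sample reply is hand-written, and its SID values are placeholders, not real data.

```python
import json
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def sids_url(cid: int) -> str:
    """Build the PUG REST URL that lists the SIDs linked to one CID."""
    return f"{PUG_REST}/compound/cid/{cid}/sids/JSON"


def parse_sids(payload: dict) -> dict[int, list[int]]:
    """Map each CID in an InformationList reply to its list of SIDs."""
    records = payload.get("InformationList", {}).get("Information", [])
    return {rec["CID"]: rec.get("SID", []) for rec in records}


def fetch_sids(cid: int) -> dict[int, list[int]]:
    """Query PubChem over the network (defined here, but not called)."""
    with urlopen(sids_url(cid), timeout=30) as response:
        return parse_sids(json.load(response))


# Hand-written sample reply; the SID numbers are placeholders.
sample_reply = {
    "InformationList": {"Information": [{"CID": 2244, "SID": [111, 222, 333]}]}
}
sids_by_cid = parse_sids(sample_reply)
print(len(sids_by_cid[2244]), "substances map to CID 2244 in the sample")
```

Calling `fetch_sids(2244)` in a notebook cell with internet access should return the live list, which is typically much longer than the sample.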

Compound (CID)#

If a substance can be described by a chemical structure, it can be assigned to a PubChem compound record. This requires a structure standardization pipeline and involves chemical identifiers like the IUPAC InChI. In essence, a compound record aggregates:

  • Data from multiple SIDs

  • Computed Properties (LogP, molecular weight,…)

  • Synonyms

  • Cross-links to literature, patents, spectra and other databases
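The aggregation idea can be illustrated with a toy model. The sketch below groups made-up substance records by a shared structure key to form compound records. It is a deliberate simplification: the class names, SIDs, and depositor names are hypothetical, and an InChIKey stands in for PubChem's far more involved standardization pipeline.

```python
from dataclasses import dataclass, field


@dataclass
class Substance:
    sid: int        # depositor-specific record ID (hypothetical values below)
    source: str     # who deposited the record
    inchikey: str   # stand-in for the standardized structure


@dataclass
class Compound:
    inchikey: str
    sids: list[int] = field(default_factory=list)
    sources: set[str] = field(default_factory=set)


def aggregate(substances: list[Substance]) -> dict[str, Compound]:
    """Group substance records that share the same structure key."""
    compounds: dict[str, Compound] = {}
    for sub in substances:
        compound = compounds.setdefault(sub.inchikey, Compound(sub.inchikey))
        compound.sids.append(sub.sid)
        compound.sources.add(sub.source)
    return compounds


ASPIRIN = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"
CAFFEINE = "RYYVLZVUVIJVGH-UHFFFAOYSA-N"

deposits = [
    Substance(1001, "Vendor A", ASPIRIN),
    Substance(1002, "Lab B", ASPIRIN),
    Substance(1003, "Vendor A", CAFFEINE),
]

compounds = aggregate(deposits)
for key, compound in compounds.items():
    print(key, compound.sids, sorted(compound.sources))
```

Two deposits of the same structure from different sources end up in one compound record, which is the relationship between SIDs and a CID described above.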

Computed Properties#

  • Not all properties are always present — they depend on successful structure processing.

  • You can access them via:

    • The PubChem Compound Summary Page under “Computed Properties”

    • The PubChem REST API using: https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/property/MolecularWeight,InChIKey/JSON
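The REST request above can also be assembled and parsed from Python. This is a minimal sketch using only the standard library. The helper names are my own, the `PropertyTable` layout is what I expect PUG REST to return, and the sample reply is hand-written to mirror that shape rather than captured from a live call.

```python
import json
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def property_url(cid: int, properties: list[str]) -> str:
    """Build a PUG REST URL requesting computed properties for one CID."""
    return f"{PUG_REST}/compound/cid/{cid}/property/{','.join(properties)}/JSON"


def parse_properties(payload: dict) -> list[dict]:
    """Return the list of per-compound property rows from a reply."""
    return payload.get("PropertyTable", {}).get("Properties", [])


def fetch_properties(cid: int, properties: list[str]) -> list[dict]:
    """Query PubChem over the network (defined here, but not called)."""
    with urlopen(property_url(cid, properties), timeout=30) as response:
        return parse_properties(json.load(response))


url = property_url(2244, ["MolecularWeight", "InChIKey"])
print(url)

# Hand-written sample reply shaped like the expected JSON.
sample_reply = {
    "PropertyTable": {
        "Properties": [
            {
                "CID": 2244,
                "MolecularWeight": "180.16",
                "InChIKey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
            }
        ]
    }
}

for row in parse_properties(sample_reply):
    # Properties may be missing, so .get() avoids a KeyError.
    print(row["CID"], row.get("MolecularWeight"), row.get("XLogP"))
```

Using `.get()` when reading each row reflects the point that not every property is present for every compound.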

1. Structural Identifiers#

| Property | Description |
| --- | --- |
| InChI | IUPAC International Chemical Identifier, a textual representation of a compound’s structure. |
| InChIKey | A hashed version of the InChI, fixed-length and easier for indexing/search. |
| Canonical SMILES | A unique SMILES string for a compound (canonicalized). |
| Isomeric SMILES | A SMILES string including stereochemistry and isotopes. |
| IUPAC Name | Systematic name as per IUPAC rules (can be multiple variants). |
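Because the InChIKey is fixed-length, its shape is easy to check before using it in a search. The sketch below tests for the usual 27-character layout (14 letters, 10 letters, 1 letter, separated by hyphens) and extracts the first block, which is derived from the connectivity part of the InChI. The function names are my own, and a passing check only means the text looks like an InChIKey, not that the compound exists.

```python
import re

# 14 uppercase letters, 10 uppercase letters, 1 uppercase letter.
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def looks_like_inchikey(text: str) -> bool:
    """Return True if the text has the standard InChIKey layout."""
    return INCHIKEY_PATTERN.fullmatch(text) is not None


def connectivity_block(inchikey: str) -> str:
    """Return the first 14-letter block of a well-formed InChIKey."""
    if not looks_like_inchikey(inchikey):
        raise ValueError(f"not an InChIKey: {inchikey!r}")
    return inchikey.split("-")[0]


aspirin_key = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"
print(looks_like_inchikey(aspirin_key))
print(looks_like_inchikey("aspirin"))
print(connectivity_block(aspirin_key))
```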


2. Molecular Properties#

| Property | Description | Units |
| --- | --- | --- |
| Molecular Weight | Sum of atomic weights (average, not monoisotopic). | g/mol |
| Exact Mass | Mass computed from the specific isotopes in the structure (the most abundant isotope where none is specified). | g/mol |
| Monoisotopic Mass | Mass computed from the most abundant isotope of each element; equal to Exact Mass unless the structure specifies other isotopes. | g/mol |
| Heavy Atom Count | Number of non-hydrogen atoms. | unitless |
| Atom Count | Total number of atoms, including H. | unitless |
| Isotope Atom Count | Atoms with isotopic specification. | unitless |
| Defined Atom Stereocenter Count | Number of chiral centers with defined stereochemistry. | unitless |
| Undefined Atom Stereocenter Count | Chiral centers without defined stereochemistry. | unitless |
| Defined Bond Stereocenter Count | Bonds with defined E/Z (cis/trans) stereochemistry. | unitless |
| Undefined Bond Stereocenter Count | Bonds with undefined stereochemistry. | unitless |
| Covalently-Bonded Unit Count | Number of disconnected molecular units (e.g., salts). | unitless |
| Component Count | Number of discrete parts in a compound (e.g., ion pairs). | unitless |
| Hydrogen Bond Donor Count | Number of -OH or -NH groups that can donate a hydrogen bond. | unitless |
| Hydrogen Bond Acceptor Count | Number of atoms that can accept hydrogen bonds. | unitless |
| Rotatable Bond Count | Bonds that can rotate freely (single bonds between non-terminal heavy atoms). | unitless |
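The difference between the average Molecular Weight and the Monoisotopic Mass is easiest to see with a worked calculation. The sketch below uses aspirin (C9H8O4) with rounded atomic weights and isotope masses for C, H, and O that I have typed in by hand. It illustrates the arithmetic and is not how PubChem computes these values.

```python
# Standard (average) atomic weights, rounded.
AVERAGE_WEIGHT = {"C": 12.011, "H": 1.008, "O": 15.999}

# Masses of the most abundant isotopes (12C, 1H, 16O), rounded.
MONOISOTOPIC_MASS = {"C": 12.0, "H": 1.00782503, "O": 15.99491462}


def formula_mass(counts: dict[str, int], table: dict[str, float]) -> float:
    """Sum the per-element masses for a molecular formula."""
    return sum(table[element] * n for element, n in counts.items())


aspirin = {"C": 9, "H": 8, "O": 4}

mw = formula_mass(aspirin, AVERAGE_WEIGHT)
mono = formula_mass(aspirin, MONOISOTOPIC_MASS)
heavy_atoms = sum(n for element, n in aspirin.items() if element != "H")

print(f"Molecular weight:  {mw:.2f}")
print(f"Monoisotopic mass: {mono:.4f}")
print(f"Heavy atom count:  {heavy_atoms}")
```

The two masses differ by about 0.12 because the average weights include the heavier, less abundant isotopes such as 13C.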


3. Charge and Partitioning#

| Property | Description | Units |
| --- | --- | --- |
| Formal Charge | Net integer charge assigned to the molecule. | unitless |
| Topological Polar Surface Area (TPSA) | Approximate surface area involved in polar interactions. | Ų |
| XLogP3 | Predicted logP (partition coefficient octanol/water) using XLogP3 method. | unitless |


4. Electronic Properties (Less Common)#

| Property | Description | Units |
| --- | --- | --- |
| Complexity | A computed measure of structural complexity (based on rings, branches, etc.). | unitless |
| Charge | Calculated net molecular charge (can vary by protonation state). | unitless |


5. Geometric and Topological Descriptors#

| Property | Description |
| --- | --- |
| Feature Count | Total count of 2D/3D features like rings, stereocenters. |
| Ring Count | Number of rings detected in structure. |
| Tautomer Count | Number of tautomers (computed or predicted). |
| Bond Count | Total number of bonds in the structure. |


Bioassays (AID)#

BioAssays in PubChem are biological test results that evaluate the activity or behavior of chemical substances (usually small molecules, sometimes RNAi or biologics) in a biological system (e.g., a cell line, protein target, or organism). They represent experimentally derived bioactivity data, which are critical for understanding drug efficacy, toxicity, and mechanism of action. Compounds are typically classified as active, inactive, or inconclusive under the conditions of the bioassay.

Each BioAssay has:

  • AID: Assay Identifier

  • CID / SID associations: Shows which compound/substance was tested

  • Activity Outcome: Active / Inactive / Inconclusive

  • Concentration-response: EC50, IC50, etc.

  • Target/Organism: What was tested

  • Protocol Description

  • Links to depositor source, gene/protein, PubMed ID
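To make the fields listed above more concrete, the sketch below tallies activity outcomes and ranks the active results by potency. The records are entirely hypothetical: the AIDs, the IC50 values, and the dictionary layout were invented for illustration and do not reflect PubChem's actual data format.

```python
from collections import Counter

# Hypothetical assay results for one compound (all values are placeholders).
results = [
    {"aid": 101, "cid": 2244, "outcome": "Active", "ic50_um": 12.5},
    {"aid": 102, "cid": 2244, "outcome": "Inactive", "ic50_um": None},
    {"aid": 103, "cid": 2244, "outcome": "Inconclusive", "ic50_um": None},
    {"aid": 104, "cid": 2244, "outcome": "Active", "ic50_um": 3.2},
]


def outcome_counts(rows: list[dict]) -> Counter:
    """Count how many results fall into each activity outcome."""
    return Counter(row["outcome"] for row in rows)


def most_potent_actives(rows: list[dict]) -> list[dict]:
    """Return active results with a reported IC50, lowest IC50 first."""
    active = [
        row for row in rows
        if row["outcome"] == "Active" and row["ic50_um"] is not None
    ]
    return sorted(active, key=lambda row: row["ic50_um"])


counts = outcome_counts(results)
ranked = most_potent_actives(results)

print(dict(counts))
for row in ranked:
    print(f"AID {row['aid']}: IC50 = {row['ic50_um']} µM")
```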

Bioassay Categories#

| Assay Type | Description |
| --- | --- |
| Primary Screening | High-throughput screens (HTS) used to identify compounds with potential biological activity. |
| Confirmatory Assays | Follow-up tests that validate and refine initial screening hits. |
| Summary Assays | Integrative results or analyses across multiple primary/confirmatory assays. |
| Counter Screens | Used to rule out false positives or identify interference (e.g., cytotoxicity not target-specific). |
| RNAi Assays | Use gene knockdown to assess how silencing a gene affects biological pathways. |
| ADMET Assays | Evaluate Absorption, Distribution, Metabolism, Excretion, and Toxicity characteristics. |
| Mechanism-of-Action | Designed to reveal how a compound works (e.g., kinase inhibition, GPCR activation). |
| Structure-Activity Relationship (SAR) | Explore how structural changes in molecules affect bioactivity. |

Potency Metrics (Half-Maximal Effect Values)#

| Metric | Meaning | Common Use |
| --- | --- | --- |
| EC₅₀ | Effective Concentration at which 50% of maximum effect is observed. | Agonists or stimulators |
| IC₅₀ | Inhibitory Concentration that reduces response (e.g., enzyme activity, cell viability) by 50%. | Antagonists, inhibitors |
| AC₅₀ | Activity Concentration at which 50% of measured biological activity is seen. | Often used in PubChem for general bioactivity assays |
| GI₅₀ | Concentration that causes 50% growth inhibition in a cell population. | Oncology / cytotoxicity studies |
| LD₅₀ | Lethal Dose at which 50% of test organisms die (usually in vivo). | Toxicology (often animals) |
| TD₅₀ | Toxic Dose causing toxic effects in 50% of subjects. | Clinical/toxicology threshold |
| ED₅₀ | Effective Dose at which 50% of a population shows a therapeutic effect (typically in vivo). | Pharmacology, drug response in organisms |
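What "half-maximal" means is easiest to see on a concentration-response curve. The sketch below uses the Hill equation, a common model for such curves, to show that the response is halfway between its minimum and maximum when the concentration equals the EC₅₀. The choice of model, the parameter names, and the example EC₅₀ of 2 µM are assumptions made for illustration. Real assays fit these parameters to measured data.

```python
def hill_response(
    conc: float,
    ec50: float,
    hill: float = 1.0,
    bottom: float = 0.0,
    top: float = 100.0,
) -> float:
    """Response (as % of maximum) at a given concentration, Hill model."""
    if conc <= 0:
        return bottom
    return bottom + (top - bottom) / (1.0 + (ec50 / conc) ** hill)


EC50_UM = 2.0  # hypothetical EC50 in micromolar

for conc_um in (0.02, 0.2, 2.0, 20.0, 200.0):
    response = hill_response(conc_um, EC50_UM)
    print(f"{conc_um:>7.2f} µM -> {response:5.1f} % of maximum effect")
```

For an inhibitor, the same curve can be read as percent inhibition, in which case the midpoint concentration is reported as an IC₅₀ instead.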

Gene Records#

  • Derived from NCBI Gene

  • Each gene has a Gene ID, symbol, synonyms, genomic context

  • PubChem Gene records list:

    • BioAssays where this gene is the target or regulated

    • Compounds known to affect this gene (activators, inhibitors, etc.)

    • Associated diseases or biological pathways

Example: TP53 Gene Record in PubChem


Protein Records#

  • Derived from NCBI Protein and UniProt databases

  • Indexed using UniProt IDs

  • Each protein has a Protein Accession Number (e.g., NP_000537)

  • PubChem Protein records describe:

    • Structure and function of the protein

    • Protein–compound interactions (from BioAssays)

    • Sequence information and domain structures

Example: SARS-CoV-2 Spike glycoprotein (S)

Pathways#

PubChem Pathways describe biological pathways (see the PubChem Pathways documentation). These are maps of biochemical reactions or processes that occur in living organisms, like the citric acid cycle. As of May 2025 there are over 250,000 pathways available through PubChem. These are not stored in PubChem, so the URL combines the name of the source and the source's ID number: https://pubchem.ncbi.nlm.nih.gov/pathway/SOURCE:ExternalID.

Sources include:

| SOURCE | Description |
| --- | --- |
| Reactome | Expert-curated human pathways |
| KEGG | Pathways across species, includes metabolic maps |
| WikiPathways | Community-curated, often includes disease pathways |
| NCBI BioSystems (legacy) | Original PubChem pathways source (now deprecated) |

Examples:#
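As a small illustration of the SOURCE:ExternalID pattern, the sketch below builds a pathway URL and splits an identifier back into its parts. The Reactome ID `R-HSA-71403` is used on the assumption that it is the citric acid cycle entry, and should be verified in a browser. The function names are my own.

```python
from urllib.parse import quote

PATHWAY_BASE = "https://pubchem.ncbi.nlm.nih.gov/pathway"


def pathway_url(source: str, external_id: str) -> str:
    """Build a PubChem pathway URL from a source name and its own ID."""
    return f"{PATHWAY_BASE}/{quote(source)}:{quote(external_id)}"


def split_pathway_id(pathway_id: str) -> tuple[str, str]:
    """Split 'SOURCE:ExternalID' into (source, external_id)."""
    source, separator, external_id = pathway_id.partition(":")
    if not separator or not source or not external_id:
        raise ValueError(f"expected SOURCE:ExternalID, got {pathway_id!r}")
    return source, external_id


# Assumed example ID; verify that it points at the intended pathway.
example_url = pathway_url("Reactome", "R-HSA-71403")
print(example_url)
print(split_pathway_id("Reactome:R-HSA-71403"))
```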

Cell Lines#

A cell line is a population of cells grown in a lab (in vitro) that originates from a single source (such as human tissue, animal tissue, or cancer cells) and is capable of continuous division. Information on cell lines can be found at [PubChem Cell Lines](https://pubchem.ncbi.nlm.nih.gov/docs/cell-lines), and as of May 2025 there are over 2,000 cell lines within PubChem.

Features of Cell Lines#

| Feature | Description |
| --- | --- |
| Clonal origin | All cells in the line descend from a single cell |
| In vitro growth | Can be grown indefinitely under lab conditions |
| Reproducibility | Provide consistent biological behavior, ideal for experiments |
| Defined characteristics | Origin (e.g., human lung), type (e.g., epithelial), cancerous or not |

Applications of Cell Lines#

| Use Case | Purpose |
| --- | --- |
| Drug screening | See how a compound affects cancer or healthy cells |
| Toxicity testing | Determine if a chemical is harmful to cells |
| Mechanism of action | Reveal how a drug alters cell growth, gene expression, metabolism |
| Virology and infection | Study how viruses enter or replicate in host cells |
| Gene silencing (RNAi) | Understand the function of specific genes in a controlled setting |

Taxonomies#

Taxonomy records aggregate data by a specific organism (taxon), such as humans. The URL is https://pubchem.ncbi.nlm.nih.gov/taxonomy/TAXON, where TAXON is the organism's name, as in https://pubchem.ncbi.nlm.nih.gov/taxonomy/human. PubChem also gets data from the NCBI Taxonomy Database (https://www.ncbi.nlm.nih.gov/taxonomy), so you can use the taxonomy ID in the link instead: https://pubchem.ncbi.nlm.nih.gov/taxonomy/9606
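Both forms of the taxonomy URL can be produced by one small helper. In the sketch below, the function name is my own, and percent-encoding names that contain spaces is an assumption about how PubChem resolves them, not something confirmed here.

```python
from typing import Union
from urllib.parse import quote

TAXONOMY_BASE = "https://pubchem.ncbi.nlm.nih.gov/taxonomy"


def taxonomy_url(taxon: Union[str, int]) -> str:
    """Build a PubChem taxonomy URL from a name or an NCBI Taxonomy ID."""
    return f"{TAXONOMY_BASE}/{quote(str(taxon))}"


print(taxonomy_url("human"))
print(taxonomy_url(9606))
```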

Types of Data

| Data Type | How It Relates to the Organism |
| --- | --- |
| Substances (SIDs) | Extracted from or derived from the organism |
| Compounds (CIDs) | Mapped from substances or associated with assays targeting the organism |
| BioAssays (AIDs) | Run using the organism’s cells, tissues, or proteins |
| Proteins / Genes | From the organism’s genome |
| Pathways | Mapped from the organism’s molecular biology |
| Literature | References about those biological systems |

Taxonomy records allow you to filter for species-specific drug effects, see which species are most studied, and compare activities across species.

Patents#

A patent is a legal right granted by a government that gives the inventor exclusive control over how an invention is used — for a limited time — in exchange for publishing how it works.

Patents and IP (Intellectual Property)

| Feature | Explanation |
| --- | --- |
| Exclusive rights | The patent holder can stop others from making, using, or selling the invention |
| Time-limited | Usually 20 years from filing (varies by country and patent type) |
| Public disclosure | In exchange, the invention’s details are published, contributing to scientific knowledge |
| Enforced nationally | Patents are jurisdiction-specific, meaning they only apply in countries where they are granted |

Even though PubChem is a scientific database, it includes patent links because:

  1. Many small molecules are first disclosed in patents — especially in drug discovery.

  2. Patents often describe bioactivity, structure–activity relationships (SAR), and pharmacological targets, even before academic publication.

  3. Scientists and companies need to know:

    • Is this compound already patented?

    • Has it been claimed for a therapeutic use?

    • Can I freely study or commercialize it?

PubChem aggregates patent–compound links from:

| Source | Description |
| --- | --- |
| SureChEMBL | Text-mined chemical structures from full-text patents, curated by EMBL-EBI |
| IBM/NIH Open Access | A legacy collaboration using machine learning to extract structures |
| PatentsView | U.S. patent metadata, searchable at https://patentsview.org |
| Depositor-provided | Some data submitters include patent associations explicitly |

Acknowledgements#

This content was developed using a vibe coding type of process where the author interacted with an AI to generate the content. This content follows the PubChem Tutorial with multiple queries to Perplexity AI and ChatGPT being conducted over the summer of 2025. Copyright CC 0.0 by Bob Belford