Module 1.1: Public Compound Databases

Public Compound Databases

This course focuses on PubChem, but we begin with a survey of public compound databases. In the tables that follow, each database name links to the database itself, and the API Available column links to an API resource guide where one exists. These links should help you discover new data and find methods for integrating that data programmatically into your workflow. Because of time constraints, the rest of the class concentrates on PubChem.

Table 1: Public Chemical Databases (with API Resources)

The following table highlights a set of widely used, freely accessible chemical databases that serve as foundational resources for modern cheminformatics, bioinformatics, and data-driven chemistry. These platforms are hosted by academic institutions, government agencies, or non-profit organizations. Each database has a distinct focus and most offer programmatic access through APIs, enabling integration into computational workflows.

Database Name	Description	API Available
PubChem	The largest open chemistry database, hosted by NCBI/NIH. Includes chemical structures, properties, bioactivity, safety data, and links to literature and patents.	Yes – PUG REST API Guide
EPA CompTox Chemicals Dashboard	Provides environmental chemistry and toxicology data, including experimental and predicted properties, exposure pathways, and use cases.	Yes – CompTox API Overview
ChemSpider	Aggregates chemical data from many sources including structures, spectra, and literature references. Managed by the Royal Society of Chemistry.	Yes – ChemSpider API Guide (SOAP/XML) (registration required)
ChEMBL	A database of bioactive drug-like small molecules and their targets, curated from the scientific literature. Includes activity and assay data.	Yes – ChEMBL API Docs
DrugBank	Integrates drug and drug target information with detailed pharmacology, interactions, and mechanisms of action. Free for academic use.	Yes – DrugBank API (free account required)
Protein Data Bank (RCSB PDB)	Repository of 3D structural data of biomolecules such as proteins, DNA, and small ligands. Essential for structural bioinformatics.	Yes – RCSB PDB API Docs
UniProt	A comprehensive protein sequence and functional information resource, integrating data from genomics, proteomics, and structural biology.	Yes – UniProt API Docs
MassBank Europe	Open access database of mass spectra for small chemical compounds, useful for metabolomics and analytical chemistry.	Yes – MassBank API
ZINC Database	A free database of commercially available compounds for virtual screening. Hosted by the Irwin and Shoichet labs at UCSF.	Yes – ZINC API (Subset)
HMDB (Human Metabolome Database)	Contains detailed information about human metabolites including spectral data, pathways, and concentrations in biological fluids.	Yes – HMDB API Guide

1.1 PubChem

Launched in 2004 by the U.S. National Institutes of Health (NIH), PubChem was originally created to support the NIH Molecular Libraries Program and the wider biomedical research community. It has since evolved into one of the largest public repositories of chemical information in the world, integrating data on chemical structures, bioactivities, toxicology, and literature links. Managed by the National Center for Biotechnology Information (NCBI), its mission is to provide open access to high-quality chemical data and facilitate structure-based search and computational reuse.
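As a rough illustration of the programmatic access mentioned in Table 1, the sketch below builds a PUG REST request that looks up a compound by name and asks for two computed properties. The helper names (`pug_property_url`, `fetch_properties`) are made up for this example, and the network call is wrapped in a function so nothing is fetched until you call it.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def pug_property_url(name, properties, output="JSON"):
    """Build a PUG REST URL: compound looked up by name, properties as a comma list."""
    if not properties:
        raise ValueError("request at least one property")
    encoded_name = quote(name, safe="")  # names may contain spaces or slashes
    return f"{PUG_REST_BASE}/compound/name/{encoded_name}/property/{','.join(properties)}/{output}"


def fetch_properties(name, properties):
    """Send the request (needs network access) and return the parsed JSON."""
    with urlopen(pug_property_url(name, properties), timeout=30) as response:
        return json.load(response)


url = pug_property_url("aspirin", ["MolecularFormula", "MolecularWeight"])
print(url)
```

Calling `fetch_properties("aspirin", ["MolecularFormula"])` would send the request. PubChem asks users to keep request rates modest, so add a short pause between calls when looping over many compounds.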

1.2 EPA CompTox Chemicals Dashboard

Developed and maintained by the U.S. Environmental Protection Agency (EPA), the CompTox Dashboard emerged from the ToxCast and Tox21 initiatives in the early 2010s. Its goal is to provide integrated access to environmental and toxicological data for tens of thousands of chemicals. It supports chemical safety assessments and green chemistry research by combining QSAR models, in vitro bioassays, predicted physicochemical properties, and regulatory data in a single, searchable platform.

1.3 ChemSpider

ChemSpider began as a hobby project of Antony Williams, was formally announced in 2007, and was acquired in 2009 by the Royal Society of Chemistry (RSC), which has developed it since. It is a curated chemical structure database that aggregates data from over 270 data sources, including chemical vendors, academic repositories, and publications. It supports structure and substructure searching, links to spectral and biological data, and offers a community curation model. It was built to bridge the gap between chemical publication and chemical search, offering identifier resolution, structure drawing, and compound annotation tools.

1.5 ChEMBL

Originally developed by Galapagos NV and later adopted by the European Bioinformatics Institute (EBI), ChEMBL has been publicly available since 2009. It is a manually curated database of bioactive small molecules and their biological targets, extracted primarily from peer-reviewed medicinal chemistry literature. Its mission is to support drug discovery, cheminformatics, and predictive modeling by making SAR and assay data machine-readable and openly accessible.
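To make "machine-readable" concrete, here is a minimal sketch of how ChEMBL web service URLs can be assembled for a molecule record and for its activity data. The function names are illustrative, and the filter fields shown (`molecule_chembl_id`, `standard_type`, `limit`) are a small assumed subset; check the ChEMBL API documentation for the full list.

```python
from urllib.parse import urlencode

CHEMBL_BASE = "https://www.ebi.ac.uk/chembl/api/data"


def chembl_molecule_url(chembl_id):
    """URL for a single molecule record in JSON form."""
    return f"{CHEMBL_BASE}/molecule/{chembl_id}.json"


def chembl_activity_url(chembl_id, standard_type="IC50", limit=20):
    """URL for activity records of one molecule, filtered by measurement type."""
    params = {
        "molecule_chembl_id": chembl_id,
        "standard_type": standard_type,
        "limit": limit,
    }
    return f"{CHEMBL_BASE}/activity.json?{urlencode(params)}"


# CHEMBL25 is the ChEMBL identifier for aspirin.
print(chembl_molecule_url("CHEMBL25"))
print(chembl_activity_url("CHEMBL25"))
```

Results are paginated, so a real script would follow the paging information in each response rather than rely on a single request.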

1.6 DrugBank

First released in 2006 by researchers at the University of Alberta, DrugBank is a hybrid database that combines chemical, pharmacological, and molecular biology data about drugs and their targets. It integrates FDA-approved drugs, experimental compounds, pharmacokinetic data, and biochemical interactions. While commercial licensing is available, DrugBank remains freely accessible for academic use and is a staple in bioinformatics pipelines and drug repurposing studies.

1.7 Protein Data Bank (RCSB PDB)

Founded in 1971 and now managed by the Worldwide Protein Data Bank (wwPDB) consortium, the PDB is the definitive open-access archive of 3D structures of biological macromolecules, including proteins, DNA, RNA, and small ligand complexes. Originally hosted at Brookhaven National Laboratory, its U.S. operations are now coordinated by RCSB. It serves structural biologists, chemists, and drug designers by enabling structure-based drug design and molecular docking research.
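For docking or structure-based design work you typically need either the entry metadata or the coordinate file. The sketch below builds both kinds of RCSB URL from a PDB ID. The helper names are hypothetical, and the ID check assumes the classic four-character format (a digit followed by three letters or digits).

```python
import re


def _check_pdb_id(pdb_id):
    """Normalize to upper case and verify the classic 4-character PDB ID shape."""
    pdb_id = pdb_id.strip().upper()
    if not re.fullmatch(r"[0-9][A-Z0-9]{3}", pdb_id):
        raise ValueError(f"not a 4-character PDB ID: {pdb_id!r}")
    return pdb_id


def rcsb_entry_url(pdb_id):
    """RCSB Data API URL returning entry-level metadata as JSON."""
    return f"https://data.rcsb.org/rest/v1/core/entry/{_check_pdb_id(pdb_id)}"


def rcsb_file_url(pdb_id, fmt="cif"):
    """Download URL for the coordinate file in mmCIF ('cif') or legacy PDB format."""
    if fmt not in ("cif", "pdb"):
        raise ValueError("fmt must be 'cif' or 'pdb'")
    return f"https://files.rcsb.org/download/{_check_pdb_id(pdb_id)}.{fmt}"


# 4HHB is a well-known deoxyhemoglobin entry.
print(rcsb_entry_url("4hhb"))
print(rcsb_file_url("4hhb", "pdb"))
```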

1.8 UniProt

UniProt (Universal Protein Resource) is a globally recognized protein sequence and annotation database co-managed by EMBL-EBI (Europe), SIB (Switzerland), and PIR (USA). It evolved from earlier efforts like Swiss-Prot and TrEMBL, consolidating them into a single resource in 2002. UniProt is dedicated to the systematic curation of protein function, structure, domain architecture, and evolutionary relationships, supporting a wide spectrum of bioinformatics workflows.
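A small sketch of how a UniProtKB entry can be addressed by accession follows. The function names are invented for illustration, the list of formats is a subset assumed here, and the FASTA header is shown only to illustrate the `db|accession|entry_name` layout.

```python
UNIPROT_BASE = "https://rest.uniprot.org/uniprotkb"
SUPPORTED_FORMATS = ("json", "fasta", "txt", "xml")  # assumed subset


def uniprot_entry_url(accession, fmt="json"):
    """URL for one UniProtKB entry in the requested format."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    return f"{UNIPROT_BASE}/{accession.strip().upper()}.{fmt}"


def parse_fasta_header(header):
    """Split a UniProt FASTA header into database, accession and entry name."""
    db, accession, rest = header.lstrip(">").split("|", 2)
    entry_name = rest.split(" ", 1)[0]
    return {"db": db, "accession": accession, "entry_name": entry_name}


print(uniprot_entry_url("P69905", "fasta"))

# Illustrative header in the UniProt FASTA layout.
example_header = ">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens"
print(parse_fasta_header(example_header))
```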

1.9 MassBank Europe

MassBank is an open-access spectral database launched in Japan and expanded into Europe through collaborative efforts. The MassBank Europe node provides high-resolution mass spectra for small molecules, including environmental chemicals, metabolites, and synthetic compounds. Its mission is to support metabolomics, environmental analysis, and analytical chemistry through community-contributed, validated mass spectral records.

1.10 ZINC Database

Created by the Irwin and Shoichet labs at UCSF, ZINC is a curated collection of commercially available chemical compounds optimized for virtual screening. First launched in 2004 and now in its ZINC15 and ZINC20 iterations, the database focuses on delivering ready-to-dock 3D structures and annotations to computational chemists and drug discovery researchers. Its mission is to make molecular docking more accessible and reproducible in the academic and biotech space.

1.11 HMDB (Human Metabolome Database)

Launched in 2007 by the University of Alberta, HMDB is a comprehensive resource for human endogenous metabolites, including their biological roles, spectral signatures, disease associations, and concentrations in fluids and tissues. It was developed under the Metabolomics Innovation Centre and supports clinical metabolomics, nutritional biochemistry, and systems biology. Its goal is to bridge the gap between metabolomics data generation and biomedical interpretation.

Table 2: Crystallography, Materials Science and Spectroscopy Databases

Table 2 highlights a selection of public databases that support research in solid-state chemistry, crystallography, computational materials science, and molecular spectroscopy. These resources are essential for understanding the structural, electronic, and vibrational properties of chemical compounds and materials. Many originated from national laboratories, research consortia, or international collaborations, and they provide access to both experimental and computed data, including CIF files, DFT calculations, and reference spectra. These databases enable researchers to reproduce results, benchmark computational methods, and explore the relationship between structure and function in materials and molecules.

Database Name	Description	API Available
Crystallography Open Database (COD)	A repository of crystal structure data for small molecules and minerals. Freely available CIF files.	Partial – COD FTP Access
Materials Project	Provides computed materials properties using high-throughput density functional theory (DFT) calculations. Great for solid-state chemistry.	Yes – Materials API Docs (free account required)
NIST Chemistry WebBook	Contains thermochemical, IR spectra, mass spectra, and other data for thousands of substances.	No traditional API, but bulk downloads and URL query interface exist.
CCDC Access Structures	Offers free access to individual crystal structures from the Cambridge Structural Database (CSD).	Limited – structure lookup and download via web interface only.
AFLOWlib	A repository of ab initio calculated materials data, including elastic, electronic, and thermodynamic properties.	Yes – AFLOW API Docs
NOMAD Repository	European-hosted repository of raw and processed materials science data from electronic structure calculations.	Yes – NOMAD API
Spectral Database for Organic Compounds (SDBS)	NMR, IR, MS, and UV-VIS spectra for thousands of organic compounds. Hosted by AIST Japan.	No API, but individual spectra can be accessed via parameterized URLs.
IRUG Spectral Database	Infrared and Raman spectra of artist materials (pigments, binders, resins), useful in conservation science.	No API, search via web interface.
Open Crystallography Database (OCD)	A newer, open initiative to unify crystallographic data sharing.	In development – Project site outlines roadmap.

2.1 Crystallography Open Database (COD)

Established in 2003, COD is a fully open-access database of crystallographic information files (CIFs) for organic, inorganic, and metal-organic compounds. It was created in response to the growing need for a free and transparent alternative to commercial crystallography databases, such as the Cambridge Structural Database (CSD). Its mission is to promote reproducible science by making crystal structures available to all researchers without licensing restrictions. COD is maintained by an international team of scientists and relies on community contributions.
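Because COD entries are plain CIF files, they are easy to pull into a script. The sketch below builds the download URL for a COD entry and reads the unit-cell lengths from CIF text. The helper names are hypothetical, the ID check assumes seven-digit COD IDs, and `TOY_CIF` is a made-up fragment rather than a real entry. For real work a dedicated CIF parser is a safer choice than this line-by-line reader.

```python
import re


def cod_cif_url(cod_id):
    """Download URL for a COD entry, assuming the usual 7-digit COD ID."""
    cod_id = str(cod_id)
    if not re.fullmatch(r"\d{7}", cod_id):
        raise ValueError(f"expected a 7-digit COD ID, got {cod_id!r}")
    return f"https://www.crystallography.net/cod/{cod_id}.cif"


def read_cell_lengths(cif_text):
    """Return the unit-cell lengths a, b, c found in simple 'tag value' CIF lines."""
    wanted = ("_cell_length_a", "_cell_length_b", "_cell_length_c")
    lengths = {}
    for line in cif_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in wanted:
            # Drop a trailing standard uncertainty such as "(3)" before converting.
            value = parts[1].split("(")[0]
            lengths[parts[0][-1]] = float(value)
    return lengths


# Made-up fragment for illustration only, not a real COD record.
TOY_CIF = """data_toy
_cell_length_a 5.6402(3)
_cell_length_b 5.6402(3)
_cell_length_c 5.6402(3)
"""

print(cod_cif_url(1000041))
print(read_cell_lengths(TOY_CIF))
```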

2.2 Materials Project

Launched in 2011 by the Lawrence Berkeley National Laboratory (LBNL) and the Department of Energy (DOE), the Materials Project aims to accelerate materials discovery through high-throughput density functional theory (DFT) calculations. It provides computed properties for thousands of materials, including band gaps, elastic constants, and formation energies. Its web interface and API make computational data accessible to experimentalists, theorists, and machine learning researchers alike. The platform is a flagship of the Materials Genome Initiative.

2.3 NIST Chemistry WebBook

Developed and maintained by the National Institute of Standards and Technology (NIST), this web resource has been online since 1996. It provides access to a wide range of reference data for gas-phase thermochemistry, IR spectra, UV/Vis spectra, mass spectra, and more. The WebBook is built on the long history of NIST’s physical chemistry data programs and is aimed at serving both educational and industrial research needs. It has become a gold standard for validated spectroscopic and thermochemical reference data.
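Table 2 notes that the WebBook has no traditional API but does accept URL queries. As a hedged sketch, the function below turns a CAS Registry Number into a WebBook species URL. It assumes the common pattern in which the `ID` parameter is the letter C followed by the CAS number without hyphens; the function name is invented for this example.

```python
import re
from urllib.parse import urlencode

WEBBOOK_CGI = "https://webbook.nist.gov/cgi/cbook.cgi"


def webbook_url(cas_number, units="SI"):
    """Build a WebBook species URL from a CAS number such as '64-17-5'."""
    if not re.fullmatch(r"\d{2,7}-\d{2}-\d", cas_number):
        raise ValueError(f"not a CAS Registry Number: {cas_number!r}")
    webbook_id = "C" + cas_number.replace("-", "")  # assumed ID convention
    return f"{WEBBOOK_CGI}?{urlencode({'ID': webbook_id, 'Units': units})}"


# 64-17-5 is the CAS number of ethanol.
print(webbook_url("64-17-5"))
```

The page returned is HTML meant for people, so any scraping built on it may break when the site layout changes.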

2.4 CCDC Access Structures

The Cambridge Crystallographic Data Centre (CCDC) is home to the Cambridge Structural Database (CSD), the world’s most comprehensive repository of small-molecule organic and metal-organic crystal structures. While full access to CSD is subscription-based, Access Structures offers free downloads of individual crystal structures deposited by researchers. This service supports open science, reproducibility, and transparency in crystallography by making peer-reviewed structural data freely available on a case-by-case basis.

2.5 AFLOWlib

The Automatic FLOW for Materials Discovery (AFLOW) database is a high-throughput repository of ab initio calculations for thousands of crystalline compounds. Founded by Stefano Curtarolo’s group at Duke University, AFLOW grew from the need to systematically explore the materials design space using automation and machine learning. It provides computed data such as elastic moduli, electronic band structures, and thermal properties, all freely accessible through a RESTful API.

2.6 NOMAD Repository

The Novel Materials Discovery (NOMAD) repository is a European open science platform funded by the EU’s Horizon 2020 program. It collects, processes, and shares raw and derived data from electronic structure calculations, making it one of the most comprehensive platforms for FAIR (Findable, Accessible, Interoperable, Reusable) computational materials science. It supports standardization efforts in computational chemistry and offers tools for visualization and data mining.

2.7 Spectral Database for Organic Compounds (SDBS)

Developed by the National Institute of Advanced Industrial Science and Technology (AIST), Japan, SDBS is a long-standing online repository of experimental NMR, IR, MS, and UV/Vis spectra for thousands of organic compounds. It has served the analytical chemistry community since the early 2000s and is designed for use in chemical education, structure elucidation, and instrument calibration. Though it lacks an API, spectra are individually accessible through URL parameters.

2.8 IRUG Spectral Database

Managed by the Infrared and Raman Users Group (IRUG), this specialized database focuses on vibrational spectra of artist materials, including pigments, binders, and resins. It supports the conservation science and cultural heritage communities by enabling non-destructive analysis of artworks and archaeological materials. IRUG’s mission is to foster collaboration and data sharing among museums, universities, and conservation labs worldwide.

2.9 Open Crystallography Database (OCD)

Still under development, OCD is a next-generation initiative aiming to unify and expand the accessibility of crystallographic data through an open framework. It builds on lessons from COD and other projects and seeks to provide more consistent metadata, linked data integration, and programmatic access. Its long-term goal is to enable interoperable and machine-readable crystallographic knowledge for the next era of data-driven chemistry and materials science.

Table 3: Natural Products and Bioactive Compounds

Natural products have long served as a foundation for pharmaceutical discovery, agrochemical innovation, and chemical ecology. Table 3 presents a selection of public databases that catalog the chemical structures, biological activities, species origins, and commercial availability of natural compounds. These resources are especially valuable in drug discovery, biodiversity studies, and cheminformatics workflows. Several are the result of collaborative academic projects, while others serve as centralized hubs for previously fragmented data sources. The emphasis here is on open-access resources that support exploration of natural chemical diversity across microbial, plant, and marine origins.

Database Name	Description	API Available
NPAtlas	A manually curated collection of microbial natural products with chemical structures, taxonomy, and references.	No formal API, but data downloads are available.
COCONUT	The largest open-source collection of natural products, integrated from many sources with structure, taxonomy, and metadata.	Yes – COCONUT API Docs
LOTUS	Linked Open Data for natural products, combining multiple repositories and Wikidata links.	Yes – SPARQL Endpoint for linked data queries.
SuperNatural II	A curated database of purchasable natural products, optimized for virtual screening and drug discovery.	No API; search available via web interface.
NPASS	Natural Product Activity and Species Source database with quantitative bioactivity data and species origin.	No API, but batch search and bulk downloads are available.

3.1 NPAtlas

The Natural Products Atlas (NPAtlas) was launched in 2019 by researchers at Simon Fraser University and collaborators in the microbial natural products community. It focuses specifically on microbial metabolites, including those produced by bacteria and fungi. The data is manually curated from peer-reviewed literature and includes chemical structures, taxonomy, and links to the original references. The mission of NPAtlas is to provide a transparent, open-access, and up-to-date database to support microbial natural products research, particularly in the context of antibiotics and bioactive discovery.

3.2 COCONUT

COlleCtion of Open Natural prodUcTs (COCONUT) is the largest freely accessible repository of natural product structures. It was created by a consortium of researchers in the cheminformatics and drug discovery community, with the goal of aggregating structures from disparate public and semi-public sources. Launched around 2019–2020, it provides standardized, deduplicated chemical structure data along with metadata such as taxonomic source, publication references, and vendor availability. COCONUT is an essential tool for virtual screening and scaffold diversity analysis in natural product chemistry.

3.3 LOTUS

LOTUS (Linked Open data for Taxonomy and Organic Substances) is a Wikidata-integrated initiative launched by the same team behind COCONUT, in partnership with the Wikidata and cheminformatics communities. The database seeks to represent natural product knowledge in a linked open data format, enabling semantic querying across species, compound classes, and biological functions. LOTUS builds on SPARQL queries and RDF triples to allow users to query information on natural product occurrences and provenance. Its long-term vision is to make natural product data machine-interoperable within the Semantic Web.
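To show what a semantic query of this kind can look like, the sketch below assembles a SPARQL query for the Wikidata query service that lists compounds recorded as found in a given taxon. It relies on the Wikidata properties P225 (taxon name), P703 (found in taxon), and P235 (InChIKey). The function names are illustrative, and the query is one possible formulation rather than an official LOTUS recipe.

```python
from urllib.parse import urlencode

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"


def compounds_in_taxon_query(taxon_name, limit=10):
    """SPARQL text: compounds (with InChIKey) linked to a taxon by 'found in taxon'."""
    if '"' in taxon_name or "\\" in taxon_name:
        raise ValueError("taxon name must not contain quotes or backslashes")
    return (
        "SELECT ?compound ?inchikey WHERE {\n"
        f'  ?taxon wdt:P225 "{taxon_name}" .\n'
        "  ?compound wdt:P703 ?taxon ;\n"
        "            wdt:P235 ?inchikey .\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


def sparql_request_url(query):
    """GET URL for the Wikidata SPARQL endpoint, asking for JSON results."""
    return f"{WIKIDATA_SPARQL}?{urlencode({'query': query, 'format': 'json'})}"


query = compounds_in_taxon_query("Arabidopsis thaliana")
print(query)
print(sparql_request_url(query))
```

The same query text can be pasted into the Wikidata Query Service web interface to explore results interactively before scripting anything.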

3.4 SuperNatural II

Developed by researchers at Charité – Universitätsmedizin Berlin, SuperNatural II is a curated library of purchasable natural product-like compounds, optimized for computational drug discovery and virtual screening. The current version (SuperNatural II) includes over 300,000 entries, combining true natural products with their synthetic analogs. Though it lacks an API, the resource supports structure-based search and physicochemical property filtering, making it useful for pharmacophore modeling and early-stage screening.

3.5 NPASS

The Natural Product Activity and Species Source (NPASS) database was developed by the Bioinformatics and Drug Design (BIDD) Group at the National University of Singapore. It provides a unique focus on the quantitative bioactivities of natural products, such as IC₅₀, Kd, and MIC values, and ties these to their species of origin. The database supports both compound-based and species-based searches and offers batch retrieval tools. NPASS helps bridge the gap between biodiversity and pharmacological activity, aiding in lead prioritization and mechanism-of-action studies.

Table 4: Environmental and Toxicology Databases

Table 4 lists public databases that cover environmental chemistry and toxicology: toxicity, exposure, and use data for environmental chemicals, curated ecological toxicity data for aquatic and terrestrial organisms, and high-throughput screening results across hundreds of assays. Several are maintained by the U.S. EPA, and one (TOXNET) is a retired service whose data survive in archived form.

Database Name	Description	API Available
EPA CompTox Dashboard	Environmental chemicals with toxicity, exposure, and use data. Integrates QSAR, experimental, and regulatory sources.	Yes – CompTox API
TOXNET Archive	Legacy databases on toxicology and environmental health (e.g., HSDB, IRIS). Now archived, with some data available through other services.	No current API (TOXNET retired); data preserved in downloadable format.
ECOTOX	US EPA database with curated ecological toxicity data for aquatic and terrestrial organisms.	No public API; CSV downloads and search available.
ToxCast Dashboard	High-throughput screening data on thousands of chemicals across hundreds of assays.	Yes – TOXCAST in CompTox API
HELM	Health and environmental low dose data for emerging materials. Supports green chemistry and risk reduction.	No API, but metadata tools and documents downloadable.

Table 5: Analytical Chemistry and Spectral Repositories

Spectral data play a foundational role in identifying, characterizing, and quantifying chemical compounds. Table 5 introduces a range of public databases that specialize in mass spectrometry (MS), nuclear magnetic resonance (NMR), infrared (IR), UV/Vis, and other analytical techniques. These repositories serve a variety of fields, including metabolomics, pharmacology, natural product chemistry, and analytical instrumentation. Some are community-curated and crowd-sourced, while others are developed by government or academic institutions. Together, they reflect a global effort to make spectral reference data freely available for chemical identification and validation.

Database Name	Description	API Available
MassBank	High-quality mass spectral database for metabolites, drugs, and other small molecules.	Yes – MassBank API Docs
GNPS (Global Natural Products Social)	Crowdsourced MS/MS spectral library for natural products and metabolomics with workflow tools.	Yes – GNPS REST API
SDBS	Japanese spectral database including NMR, IR, MS, and UV-VIS for organic compounds.	No API; individual spectra accessible via parameterized queries.
NMRShiftDB2	Open NMR database with chemical structure-based query and prediction tools.	Yes – NMRShiftDB2 API (via NFDI4Chem)
MetaboLights	European metabolomics repository with experimental metadata, raw and processed data.	Yes – MetaboLights API

5.1 MassBank

MassBank was launched in 2006 as the first public repository for high-resolution mass spectra of small molecules. Initially developed in Japan and later expanded through European collaborations, MassBank is now a core part of the MassBank Europe initiative. It supports the needs of metabolomics, environmental chemistry, and drug screening by providing validated MS/MS spectra, with full metadata on instruments, ionization modes, and collision energies. Its mission is to promote reproducibility and transparency in analytical science by enabling researchers to search, compare, and download high-quality reference spectra.

5.3 SDBS (Spectral Database for Organic Compounds)

Developed and maintained by the National Institute of Advanced Industrial Science and Technology (AIST) in Japan, the SDBS is one of the oldest and most comprehensive web-based spectral databases. It was made public in the early 2000s and offers NMR (¹H and ¹³C), IR, MS, and UV/Vis spectra for thousands of organic compounds. The data are collected under controlled experimental conditions and accompanied by detailed metadata. While it does not offer a modern API, individual spectra can be retrieved via parameterized URLs, making it a go-to resource for students and educators in analytical chemistry.

5.4 NMRShiftDB2

NMRShiftDB was originally started in the early 2000s as an academic project to crowdsource NMR shift data for organic molecules, enabling prediction and validation of chemical structures. The second-generation platform, NMRShiftDB2, is now part of the German NFDI4Chem initiative, which aims to build FAIR (Findable, Accessible, Interoperable, and Reusable) infrastructures for chemical research. The database includes prediction algorithms, chemical structure search, and user-contributed data. Its mission is to improve accessibility and interoperability of NMR reference data for computational and bench chemists alike.

5.5 MetaboLights

MetaboLights is a metabolomics data repository hosted by the European Bioinformatics Institute (EBI). It was launched in 2012 to support the archiving and sharing of experimental metabolomics data, including raw instrument files, processed spectra, and associated metadata such as experimental design, organism, and analytical method. It is cross-linked with other omics resources (e.g., ChEBI, UniProt) and supports both targeted and untargeted studies. MetaboLights facilitates data reuse, method validation, and meta-analyses in systems biology, environmental science, and nutrition research.

Table 6: Chemical Reactions and Synthesis Databases

Chemical reaction databases vary widely in their accessibility and purpose. Some, like the Open Reaction Database and USPTO bulk data, are fully public, supporting open science and machine learning applications. Others, such as SYNTHIA or NextMove’s NameRxn, are proprietary commercial tools, though some offer limited academic trials or free user accounts. When working with these resources, it’s important to consider licensing, usage rights, and availability of structured data formats or APIs. The table below includes an access column to help distinguish open-access tools from subscription-based platforms.

Database Name	Description	API Available	Access
Open Reaction Database (ORD)	Structured open repository of chemical reactions, optimized for machine learning and informatics.	Yes – ORD API Docs	Public
USPTO Bulk Data	Reaction and compound information extracted from US patent filings.	No API; bulk XML/JSON downloads.	Public
NextMove NameRxn Dataset	Annotated reaction types from patents and literature; limited academic use.	No API; available as downloadable files.	Commercial
SYNTHIA Free Trial	AI-guided retrosynthesis tool with proprietary algorithms. Offers limited public access for academic users.	No public API; account required.	Academic Trial
IBM RXN for Chemistry	AI-based reaction prediction and retrosynthesis tool with a graphical interface and programmatic access.	Yes – IBM RXN API Docs (account required)	Free with Account

6.1 Open Reaction Database (ORD)

The Open Reaction Database (ORD) was launched in 2021 by a consortium of academics and industry partners (including Pfizer and MIT) with the goal of creating a structured, open-access repository of chemical reactions. It was designed from the ground up to support machine learning and cheminformatics applications, using standardized data schemas to capture reaction conditions, yields, and substrates. ORD’s mission is to break the cycle of reaction data being locked away in private corporate archives by making high-quality, machine-readable data available to the broader community.

6.2 USPTO Bulk Data

The United States Patent and Trademark Office (USPTO) provides bulk downloads of full-text patent grants and applications, which include a vast amount of reaction and compound information. These datasets are published in XML and JSON formats and have been widely mined to build reaction databases and to support text-mining projects. Although USPTO data is not natively structured for chemistry, it remains a crucial open source of synthetic precedent because it covers a broad swath of the chemical patent literature.

6.3 NextMove NameRxn Dataset

Developed by NextMove Software, the NameRxn dataset is a collection of annotated reaction types extracted from literature and patent sources. It classifies reactions into well-known transformations (e.g., Suzuki coupling, Friedel-Crafts alkylation) using a proprietary classification engine. Although primarily commercial, NameRxn has been used in academic collaborations and projects. Its mission is to help chemists quickly identify and categorize reactions for synthetic planning and knowledge management.

6.4 SYNTHIA Free Trial

SYNTHIA (formerly Chematica) is an AI-guided retrosynthesis platform originally developed by Prof. Bartosz Grzybowski and now commercialized by MilliporeSigma. It uses a curated set of reaction rules and algorithms to design optimized synthetic routes. The free academic trial allows students and researchers to explore retrosynthetic strategies for common targets. SYNTHIA’s mission is to combine chemical knowledge and computational heuristics to accelerate synthesis planning.

Table 7: Biochemical Pathways and Systems Biology Databases#

Biochemical pathway databases are essential for understanding the networked nature of metabolism, signal transduction, and cellular regulation. These resources provide curated or computationally inferred pathways that map how molecules interact in living systems. They are used in systems biology, pharmacology, metabolic engineering, and bioinformatics. Table 7 features a set of databases that offer interactive maps, cross-species data integration, and API access, helping researchers explore metabolism at both molecular and systems levels.

Database Name	Description	API Available	Access
KEGG	Kyoto Encyclopedia of Genes and Genomes for metabolic pathways, compounds, reactions, and enzymes.	Yes – KEGG REST API	⚠️ Free for academic use; some tools restricted
Reactome	Curated biological pathways and reactions across many species; includes human metabolism and disease maps.	Yes – Reactome Content Service API	✅ Fully Open
BioCyc	Collection of pathway/genome databases for many organisms, including EcoCyc and MetaCyc.	Yes – BioCyc Web Services (free for public tier)	⚠️ Tiered: Public + subscription options
MetaNetX	Reconciles metabolites and reactions across multiple genome-scale metabolic networks.	No API; downloadable TSVs for integration.	✅ Fully Open
PathBank	Visual, searchable pathway database for small molecule and metabolite interactions.	Yes – PathBank API Info	✅ Fully Open

7.1 KEGG #

The Kyoto Encyclopedia of Genes and Genomes (KEGG) was initiated in 1995 by Dr. Minoru Kanehisa in Japan to support genome interpretation and annotation. It is one of the earliest and most influential bioinformatics projects and provides metabolic and signaling pathways, gene/protein annotations, compound libraries, and reaction networks. KEGG pathways are manually curated and extensively referenced in genomics, metabolomics, and pharmacology. Its mission is to transform genome data into system-level functional information using standardized, graphical pathway maps.
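
As a rough illustration of the programmatic access listed for KEGG in Table 7, the sketch below builds KEGG REST URLs of the form `https://rest.kegg.jp/<operation>/<argument>` and parses tab-delimited output. The helper names (`kegg_url`, `parse_kegg_table`, `fetch_text`) are hypothetical and not part of any KEGG library, and the sample response is illustrative text rather than real server output. Check the KEGG REST documentation and its academic-use terms before relying on any of this.

```python
from urllib.parse import quote
from urllib.request import urlopen

KEGG_BASE = "https://rest.kegg.jp"  # base URL of the KEGG REST service


def kegg_url(operation, *args):
    """Build a URL such as <base>/get/C00031 or <base>/list/pathway/hsa."""
    parts = [operation, *args]
    return KEGG_BASE + "/" + "/".join(quote(str(p), safe=":+") for p in parts)


def parse_kegg_table(text):
    """Split tab-delimited KEGG output into (identifier, description) pairs."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        identifier, _, description = line.partition("\t")
        rows.append((identifier, description))
    return rows


def fetch_text(url, timeout=30):
    """Download a URL and return its body as text (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Offline demonstration: the sample below is illustrative, not real output.
sample = "cpd:C00031\tD-Glucose; Grape sugar\ncpd:C00221\tbeta-D-Glucose\n"
compounds = parse_kegg_table(sample)
for identifier, description in compounds:
    print(identifier, "->", description)
```

With network access, `parse_kegg_table(fetch_text(kegg_url("list", "pathway", "hsa")))` would be one way to list human pathways, assuming the service returns tab-delimited lines for that request.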

7.2 Reactome #

Reactome is an open-source, manually curated pathway database launched in 2003 and managed by a collaboration of the Ontario Institute for Cancer Research (OICR), EMBL-EBI, and other institutions. It focuses on human biology, but also includes inferred pathways for other species. Reactome’s strength lies in its high-quality curation of biochemical reactions, gene-protein-pathway relationships, and disease mechanisms. It integrates with omics tools for pathway enrichment and is designed to support reproducible and interoperable systems biology research.
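
The Reactome Content Service listed in Table 7 returns JSON, so a pathway lookup can be sketched in a few lines. The example below assumes a `data/query/<stable id>` endpoint and the JSON fields `stId`, `displayName`, and `schemaClass`; treat both as assumptions to verify against the Content Service documentation. The function names are hypothetical, and the sample record uses a made-up identifier so that the parsing step runs without network access.

```python
import json
from urllib.request import Request, urlopen

# Assumed base URL of the Reactome Content Service.
REACTOME_BASE = "https://reactome.org/ContentService"


def query_url(stable_id):
    """Build the (assumed) lookup URL for a Reactome stable identifier."""
    return f"{REACTOME_BASE}/data/query/{stable_id}"


def summarize(record):
    """Keep a few fields from a Content Service record (field names assumed)."""
    return {
        "id": record.get("stId"),
        "name": record.get("displayName"),
        "type": record.get("schemaClass"),
    }


def fetch_json(url, timeout=30):
    """Request a URL and decode the JSON body (needs network access)."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Offline demonstration with a made-up record and placeholder identifier.
sample = json.loads(
    '{"stId": "R-HSA-0000000", "displayName": "Example pathway", '
    '"schemaClass": "Pathway"}'
)
summary = summarize(sample)
print(summary)
```

Using `.get()` rather than direct indexing means a record that lacks one of the assumed fields yields `None` instead of raising an error.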

7.3 BioCyc #

Developed by SRI International, the BioCyc collection includes over 20,000 pathway/genome databases, ranging from curated resources like EcoCyc (for E. coli) and MetaCyc (a reference for metabolism) to automatically generated databases for thousands of organisms. First launched in the 1990s, BioCyc’s goal is to link genomic data with metabolic and regulatory network models. It offers visualization tools, downloadable data, and computational modeling interfaces. Access is tiered: many features are free in the public tier, with enhanced features for subscribers.

7.4 MetaNetX #

MetaNetX is a project from the Swiss Institute of Bioinformatics (SIB) that aims to reconcile differences between various metabolic network reconstructions. It provides a unified namespace to map metabolites, reactions, and pathways across models such as KEGG, BioCyc, and ModelSEED. MetaNetX supports genome-scale metabolic modeling and comparative systems biology by providing downloadable TSVs for integration with metabolic network simulators. Though it lacks an API, it plays a vital role in harmonizing biochemical knowledge across platforms.
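
Because MetaNetX is distributed as downloadable TSVs, integration usually starts with reading a cross-reference file into a lookup table. The sketch below assumes a three-column layout (external identifier, MetaNetX identifier, description) with `#`-prefixed comment lines. The column names, the function names, and the sample rows are illustrative assumptions, so check the header of the file you actually download.

```python
import csv
import io


def read_mnx_tsv(handle, columns):
    """Yield one dict per data row, skipping blank and '#' comment lines."""
    data_lines = (
        line for line in handle if line.strip() and not line.startswith("#")
    )
    # QUOTE_NONE keeps any quote characters in descriptions as literal text.
    reader = csv.reader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield dict(zip(columns, row))


def build_xref_index(rows):
    """Map each external identifier to its unified MetaNetX identifier."""
    return {row["source"]: row["mnx_id"] for row in rows}


# Illustrative sample standing in for a downloaded cross-reference file.
sample = io.StringIO(
    "#source\tID\tdescription\n"
    "kegg.compound:C00001\tMNXM2\twater\n"
    "chebi:15377\tMNXM2\twater\n"
)
index = build_xref_index(
    read_mnx_tsv(sample, ["source", "mnx_id", "description"])
)

# Two identifiers from different databases resolve to the same entry.
print(index["kegg.compound:C00001"] == index["chebi:15377"])
```

For a real file, pass an open file handle in place of the `io.StringIO` sample. The generator reads row by row, so large files do not need to fit in memory before indexing.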

7.5 PathBank #

Launched in 2019 by The Metabolomics Innovation Centre (TMIC) in Canada, PathBank is a visual and searchable repository of over 100,000 small-molecule and metabolite-centric pathways across multiple organisms. It includes metabolism, drug action, nutrition, and disease pathways, integrating with other TMIC resources like HMDB and DrugBank. Its mission is to provide highly detailed, manually curated visualizations for educators, clinical researchers, and bioinformaticians. API access and bulk downloads support integration into pathway analysis pipelines.

Public vs. Fee-Based Chemical Databases#

In the pre-digital era, databases were printed documents whose data were often excerpted from the primary literature. They grew out of the Gutenberg era of scientific communication, in which information was disseminated through printed journals. Many of today's databases evolved from this model, and most of these are fee-based. The following table summarizes some of the differences between public and fee-based databases.

Feature	Public Databases (e.g., PubChem, ChEMBL)	Fee-Based Databases (e.g., SciFinderⁿ, Reaxys)
Access	Free, open access (no subscription required)	Institutional or personal subscription required
Hosting Organizations	Government agencies, academic consortia, open science communities	Commercial publishers (ACS, Elsevier, Clarivate, Wiley)
Primary Content Source	Voluntarily submitted data, open literature, patents, regulatory databases	Manually curated and extracted from paywalled literature and patents
Coverage of Literature	Limited or indirect (open-access journals, PubMed, patent offices)	Comprehensive, including legacy journals and full-text patents
Structure Searching	Available (e.g., PubChem, ChemSpider)	Available, often with more advanced search filters and algorithms
Reaction Data	Available (e.g., Open Reaction Database, ChEMBL assays)	Comprehensive (e.g., CASREACT, Reaxys synthetic pathways)
Bioactivity & SAR	Available (e.g., ChEMBL, BindingDB)	Deeply curated and linked with pharmacological context
Spectral Data	Available (e.g., MassBank, NIST WebBook)	Often not included directly; sometimes linked
Export & Reuse Licensing	Open licenses (e.g., CC-BY, CC0), often downloadable	Restricted reuse; data often under publisher copyright
APIs & Interoperability	Well-documented APIs, bulk downloads, linked data support	Often proprietary formats; limited interoperability
Update Frequency	Varies (some real-time, others infrequent)	Regular updates, with professional curation teams
Learning Curve	Easier for newcomers; more open documentation	High functionality, but requires training or institutional onboarding

Table of Fee-Based Databases#

The following is an incomplete list of some of the most common fee-based databases. These databases typically require institutional subscriptions and can be accessed through university libraries.

Database Name	Description
SciFinderⁿ (CAS)	Produced by the American Chemical Society’s Chemical Abstracts Service. Offers curated chemical and bibliographic data from journals, patents, reactions, bioactivity, spectra, and regulatory info. Includes CAS Registry (substances) and CASREACT (reactions).
Reaxys (Elsevier)	A comprehensive chemistry database indexing millions of reactions, compounds, and experimental procedures extracted from journals and patents. Includes chemical structures, properties, and synthesis routes.
PsycINFO (APA) with Chemical Supplementation	Although psychology-focused, this database sometimes includes chemical or pharmacological context in behavioral studies. (Not a core chemistry DB, but intersects with neurochemistry and psychopharmacology.)
Derwent Innovation (Clarivate)	Proprietary patent database with advanced chemical indexing, including Derwent Chemistry Resource (DCR). Focused on innovation tracking across chemical/pharma sectors.
Minesoft PatBase	Global patent database with chemical structure search capabilities. Especially useful for competitive intelligence in pharma and materials sectors.
GOSTAR (Excelra)	Comprehensive database of structure–activity relationships (SAR), pharmacokinetics, and bioactivity data, curated from literature and patents.
Synthesis Digital Library (Morgan & Claypool)	Fee-based access to digital monographs in chemistry and engineering with a focus on advanced education and research. Not a data repository per se, but highly relevant.
PerkinElmer Signals Notebook	Commercial ELN (Electronic Lab Notebook) with chemistry data indexing, internal data management, and integration with proprietary or public datasets.
Wiley ChemPlanner (formerly InfoChem)	A retrosynthetic analysis tool incorporating curated reaction data, reaction prediction, and synthesis route planning.

Acknowledgements#

This content was developed with assistance from ChatGPT. Multiple queries were made during the summer of 2025, and this material is in the public domain. Please contact Bob Belford (rebelford@ualr.edu) if you have any questions or concerns about this content.
