9. Web Scraping as External Data Structures#
Web Scraping and Databases: A Hybrid Approach#
While web scraping is often used to extract data for immediate use, it does not store that data. A powerful workflow (sketched in the example after this list) would be to:
Scrape data from online sources.
Store it in a structured database (SQL or NoSQL) for long-term analysis.
Query it later instead of repeatedly scraping.
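Here is a minimal sketch of that workflow using Python's built-in sqlite3 module; fetch_melting_point() is a hypothetical stand-in for any of the scrapers developed later in this section, and the value it returns is made up:

import sqlite3

def fetch_melting_point(element_name):
    """Placeholder for a real scraper; assumed to return a string such as '13.99 K'."""
    return "13.99 K"  # made-up value for illustration only

# 1. Scrape (here: the placeholder above)
value = fetch_melting_point("Hydrogen")

# 2. Store it in a structured database for long-term analysis
conn = sqlite3.connect("elements.db")
conn.execute("CREATE TABLE IF NOT EXISTS properties (element TEXT, property TEXT, value TEXT)")
conn.execute("INSERT INTO properties VALUES (?, ?, ?)", ("Hydrogen", "Melting point", value))
conn.commit()

# 3. Query it later instead of repeatedly scraping
for row in conn.execute("SELECT element, value FROM properties WHERE property = 'Melting point'"):
    print(row)
conn.close()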
Web Scraping as a Data Acquisition Method#
Web scraping is a method of extracting external data from structured or semi-structured sources on the web and transforming it into a usable format. Unlike databases or file storage, web scraping does not inherently store data—it is a way to retrieve and structure data from the web dynamically. It allows access to data stored in HTML web pages that might not be available via an API.
Web Scraping vs. APIs#
| Feature | Web Scraping | APIs |
|---|---|---|
| Access | Extracts data from web pages (HTML tables, text, lists) | Queries structured data from a service (often JSON or XML) |
| Structure | Often semi-structured (needs parsing) | Well-structured |
| Reliability | Pages may change, breaking the scraper | More stable (unless the API changes) |
| Use Case | Extracting tables, research data, metadata from articles | Accessing structured datasets (PubChem, NCBI, weather data) |
Thus, web scraping is an alternative to APIs when structured access is unavailable.
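As a concrete contrast, here is a minimal sketch of the two routes; the PubChem PUG REST URL is my assumption about that service's interface, and both calls hit live sites, so run them in your webscrape environment:

import requests
from bs4 import BeautifulSoup

# API route: the service returns structured JSON directly (assumed PubChem PUG REST endpoint)
api_url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/chlorine/property/MolecularWeight/JSON"
print(requests.get(api_url, timeout=30).json())

# Scraping route: you fetch raw HTML and must parse the structure yourself
html = requests.get("https://en.wikipedia.org/wiki/Chlorine", timeout=30).content
soup = BeautifulSoup(html, "html.parser")
print(soup.find("table", class_="infobox") is not None)  # True if an infobox table was located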
Web Scraping as a Bridge from Classical Literature to Structured Data#
Scientific data has historically been communicated through journal articles, textbooks, and reports. Many modern scientific knowledge repositories (e.g., Wikipedia, research databases) still store information in text-based formats rather than structured databases. Web scraping allows you to:
Extract tabular data (like chemical properties from Wikipedia or patents).
Retrieve text-based metadata (such as author names, abstracts, and citations).
Collect non-tabular structured information (like structured web pages with lists of elements).
By applying text parsing, table extraction, and structured storage, web scraping allows researchers to convert human-readable content into machine-readable data.
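For the tabular case, pandas can pull every HTML table on a page into DataFrames in one call. A minimal sketch, assuming pandas is installed in your environment (it is not part of the conda install shown in the next section):

import pandas as pd

# read_html returns a list of DataFrames, one per <table> element found on the page
tables = pd.read_html("https://en.wikipedia.org/wiki/Halogen")
print(f"Found {len(tables)} tables")
print(tables[0].head())  # inspect the first table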
2. Setting Up Your Environment#
conda activate your-webscrape-env
conda install -c conda-forge beautifulsoup4 requests lxml
beautifulsoup4: The core parsing library.
requests: To fetch web pages.
lxml: A fast and efficient HTML/XML parser (optional but recommended for performance). A quick check of all three installs follows below.
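A quick, offline check that the parsing stack works, and where the lxml parser name actually gets used (the HTML snippet is made up):

from bs4 import BeautifulSoup

html = "<table><tr><th>Symbol</th><td>H</td></tr></table>"  # made-up snippet for the check
# "lxml" selects the faster lxml parser; "html.parser" (stdlib) would also work, just slower
print(BeautifulSoup(html, "lxml").find("td").get_text())     # prints: H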
The image below shows how I set up a webscrape environment while also connecting its kernel to JupyterLab running in the base environment.
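For reference, registering the environment's kernel so it shows up in a JupyterLab running elsewhere typically looks something like the commands below (a sketch of standard ipykernel usage, not a transcription of the image; the display name is my choice):

conda install -c conda-forge ipykernel
python -m ipykernel install --user --name your-webscrape-env --display-name "Python (webscrape)"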
Webscrape a Wikipedia element infobox
import os
import json
import requests
from bs4 import BeautifulSoup
from datetime import datetime, timezone  # <- make sure this is at the top of your script
# Define the directory and file path
wikipedia_dir = os.path.expanduser("~/data/wikipedia/")
os.makedirs(wikipedia_dir, exist_ok=True) # Ensure the directory exists
carbon_wikipedia_path = os.path.join(wikipedia_dir, "carbon_data.json")
def scrape_infobox(element_name):
    """Scrapes the Wikipedia infobox for a given element and returns a dictionary of properties."""
    url = f"https://en.wikipedia.org/wiki/{element_name}"
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # Find the infobox table element
    infobox = soup.find('table', class_='infobox')
    # Dictionary to store element properties
    element_data = {}
    # Loop through table rows and extract key-value pairs
    for row in infobox.find_all('tr'):
        header = row.find('th')
        value = row.find('td')
        if header and value:
            key = header.get_text(strip=True)      # Get property name
            val = value.get_text(" ", strip=True)  # Extract value, keeping spaces
            element_data[key] = val
    # Add provenance metadata
    element_data["_source"] = {
        "Wikipedia_URL": url,
        "Scraped_from": "Wikipedia Infobox Element",
        #"Scraped_on": requests.get("https://worldtimeapi.org/api/timezone/Etc/UTC").json()['datetime']
        "Scraped_on": datetime.now(timezone.utc).isoformat()  # UTC time with explicit timezone
    }
    return {element_name: element_data}
# Scrape data for Carbon
carbon_data = scrape_infobox("Carbon")
# Save JSON data to file
with open(carbon_wikipedia_path, "w", encoding="utf-8") as f:
    json.dump(carbon_data, f, indent=4, ensure_ascii=False)
print(f"Data saved successfully to: {carbon_wikipedia_path}")
Data saved successfully to: /home/rebelford/data/wikipedia/carbon_data.json
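The scraper above assumes the HTTP request succeeds and that the page actually has an infobox table; if either assumption fails, soup.find() returns None and the loop raises an AttributeError. A short defensive check you could run (or fold into the function) before parsing; these checks are additions for illustration:

import requests
from bs4 import BeautifulSoup

response = requests.get("https://en.wikipedia.org/wiki/Carbon", timeout=30)
response.raise_for_status()                         # surface HTTP errors (404, 503, ...) early
soup = BeautifulSoup(response.content, "html.parser")
infobox = soup.find("table", class_="infobox")
if infobox is None:                                 # find() returns None when no match exists
    raise ValueError("No infobox table found on this page")
print("Infobox located with", len(infobox.find_all("tr")), "rows")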
Print out the available properties for hydrogen (this assumes hydrogen_data.json was created earlier with the same scrape_infobox() function, this time called on "Hydrogen")
import os
import json
# Define file path
wikipedia_dir = os.path.expanduser("~/data/wikipedia/")
hydrogen_wikipedia_path = os.path.join(wikipedia_dir, "hydrogen_data.json")
# Load the JSON file
with open(hydrogen_wikipedia_path, "r", encoding="utf-8") as f:
    hydrogen_data = json.load(f)
# Extract Hydrogen properties
hydrogen_properties = hydrogen_data.get("Hydrogen", {})
# Print all keys (property names)
print("Properties available for Hydrogen:")
for key in hydrogen_properties.keys():
    print("-", key)
print("\n")
# Print value for a specific key ("Named by") with provenance
key_to_lookup = "Named by"
value = hydrogen_properties.get(key_to_lookup, "Property not found")
source_url = hydrogen_properties.get("_source", {}).get("Wikipedia_URL", "Unknown Source")
print(f"{key_to_lookup}: {value}")
print(f"(Source: {source_url})")
Properties available for Hydrogen:
- Appearance
-
- Atomic number(Z)
- Group
- Period
- Block
- Electron configuration
- Electrons per shell
- PhaseatSTP
- Melting point
- Boiling point
- Density(at STP)
- when liquid (atm.p.)
- when liquid (atb.p.)
- Triple point
- Critical point
- Heat of fusion
- Heat of vaporization
- Molar heat capacity
- P(Pa)
- atT(K)
- Oxidation states
- Electronegativity
- Ionization energies
- Covalent radius
- Van der Waals radius
- Natural occurrence
- Crystal structure
- Lattice constants
- Thermal conductivity
- Magnetic ordering
- Molar magnetic susceptibility
- Speed of sound
- CAS Number
- Discoveryand first isolation
- Named by
- Recognized as anelementby
- Main isotopes
- 1H
- 2H
- 3H
- _source
Named by: Property not found
(Source: https://en.wikipedia.org/wiki/Hydrogen)
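The "Property not found" result above is most likely because Wikipedia infobox labels often contain non-breaking spaces (\xa0), so the stored key is not byte-for-byte equal to "Named by" even though it prints identically. A minimal sketch that normalizes the keys before the lookup (the normalization step is my addition, not part of the scraper above):

import json, os

wikipedia_dir = os.path.expanduser("~/data/wikipedia/")
with open(os.path.join(wikipedia_dir, "hydrogen_data.json"), "r", encoding="utf-8") as f:
    hydrogen_properties = json.load(f).get("Hydrogen", {})

# Replace non-breaking spaces (\xa0) with ordinary spaces in every key
normalized = {key.replace("\xa0", " "): value for key, value in hydrogen_properties.items()}

print("Named by:", normalized.get("Named by", "Property not found"))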
Scrape the chem infoboxes of the halogens
import os
import json
import requests
from bs4 import BeautifulSoup
from datetime import datetime, timezone
# Define the list of halogens (either names or symbols)
halogens = ["Fluorine", "Chlorine", "Bromine", "Iodine", "Astatine"]
# Define file path for saving the scraped data
wikipedia_dir = os.path.expanduser("~/data/wikipedia/")
os.makedirs(wikipedia_dir, exist_ok=True) # Ensure directory exists
halogens_wikipedia_path = os.path.join(wikipedia_dir, "halogens_data.json")
def scrape_infobox(element_name):
    """Scrapes the Wikipedia infobox for a given element and returns a dictionary of properties."""
    url = f"https://en.wikipedia.org/wiki/{element_name}"
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # Find the infobox table element
    infobox = soup.find('table', class_='infobox')
    # Dictionary to store element properties
    element_data = {}
    # Loop through table rows and extract key-value pairs
    for row in infobox.find_all('tr'):
        header = row.find('th')
        value = row.find('td')
        if header and value:
            key = header.get_text(strip=True)      # Get property name
            val = value.get_text(" ", strip=True)  # Extract value, preserving spaces
            element_data[key] = val
    # Add provenance metadata
    element_data["_source"] = {
        "Wikipedia_URL": url,
        "Scraped_from": "Wikipedia Infobox Element",
        "Scraped_on": datetime.now(timezone.utc).isoformat()  # UTC time with explicit timezone
    }
    return {element_name: element_data}
# Create dictionary of dictionaries for all halogens
halogens_data = {}
for element in halogens:
    halogens_data.update(scrape_infobox(element))
# Save JSON data to file
with open(halogens_wikipedia_path, "w", encoding="utf-8") as f:
    json.dump(halogens_data, f, indent=4, ensure_ascii=False)
print(f"Data saved successfully to: {halogens_wikipedia_path}")
Data saved successfully to: /home/rebelford/data/wikipedia/halogens_data.json
Now go to ~/data/wikipedia/, open the JSON file for the halogens, and explore it. Alternatively, you can explore it from Python, as sketched below.
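A minimal sketch, assuming the halogens file was just written as above and normalizing the non-breaking spaces in the keys (the Electronegativity lookup is just an example property):

import json, os

halogens_wikipedia_path = os.path.expanduser("~/data/wikipedia/halogens_data.json")
with open(halogens_wikipedia_path, "r", encoding="utf-8") as f:
    halogens_data = json.load(f)

# Compare one property across all five halogens
for element, properties in halogens_data.items():
    normalized = {k.replace("\xa0", " "): v for k, v in properties.items()}
    print(f"{element}: Electronegativity = {normalized.get('Electronegativity', 'not found')}")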