7. Databases as External Data Structures#

How Databases and Web Services Fit Into the Picture#

They are an interesting hybrid, they store non-volatile data from the perspective of the python program, but access to them introduces elements of volatility:

  1. Non-Volatile: Data persists when the Python kernel/interpreter is shut down.

  2. Volatile: Unlike files, a database can be unavailable (e.g., server downtime), and data integrity can be affected by concurrent access.

In a structured progression from volatile to non-volatile data from the perspective of a program, databases fit in as external but queryable structures:

  • Unlike files, they support efficient querying (e.g., SQL queries).

  • Unlike in-memory Python data structures, they persist beyond execution of a program.

  • Unlike JSON or CSV files, they allow dynamic updates and complex relationships.

1. How a Computer Stores Data: Directory vs. Database#

So far we have looked at directories (file systems) and we need to understand the difference between a file system and a database.

1.1 Directory (File Storage)#

  • A directory is a hierarchical tree structure where files and folders are organized.

  • Files are stored as binary data on disk, and each file has metadata (name, size, type, last modified date).

  • You can perform basic searches (filename, date modified).

  • No structured relations between files, beyond directory hierarchies.

Example: How a Directory Stores Chemistry Data#

/research_data/
│── periodic_table.csv
│── elements.json
│── hydrogen.png
│── lab_notes.txt
  • You can find “hydrogen.png” by looking in /research_data/ and searching filenames.

  • But if you want to ask, “What are all the elements with atomic number < 10?”, the directory itself does not support such structured queries.

1.2 Database Storage#

A database is an organized system for storing and retrieving data with structured querying.

  • Data is stored in a structured format (tables, documents, graphs).

  • Indexing enables fast queries (retrieving data without scanning everything).

  • Relationships exist between data.

2. Database Models: SQL, NoSQL, and Beyond#

There are multiple ways to structure and organize data in a database. The most common types include:

2.1 Relational Databases (SQL)#

  • Data is stored in tables with predefined columns (fields).

  • Tables are related to each other through keys.

  • Uses Structured Query Language (SQL) to manage and query data.

  • Example: SQLite, PostgreSQL, MySQL.

  • Example Schema:

    
      Element Table:
      | AtomicNumber | Symbol | Name  | AtomicWeight |
      |-------------|--------|-------|--------------|
      | 1           | H      | Hydrogen | 1.008      |
      | 2           | He     | Helium   | 4.0026     |
    
    
    
  • Relationships allow for structured queries like:

    SELECT * FROM Element WHERE AtomicNumber = 1;
    

2.2 NoSQL Databases#

  • Designed for flexibility and scale, handling semi-structured or unstructured data.

  • No fixed table structure; data is stored as documents, key-value pairs, graphs, or columns.

Example JSON document in MongoDB:

{
  "AtomicNumber": 1,
  "Symbol": "H",
  "Name": "Hydrogen",
  "AtomicWeight": 1.008
}

Queries use JSON-based query languages, e.g.:

db.elements.find({"Symbol": "H"})

2.3 Graph Databases (RDF and Triple Stores)#

  • Used to model complex relationships, often found in semantic web and linked data applications.

  • Data is stored as triples: (subject, predicate, object).

  • Example:

    ("Hydrogen", "isElement", "True")
    ("Hydrogen", "atomicNumber", "1")
    ("Hydrogen", "bondsWith", "Oxygen")
    
  • Queries use SPARQL instead of SQL.

Understanding Databases#

The core role of a database is to efficiently and persistently store data and in a structured and useful manner that allows access to the data. These goals can be achieved through 5 key database activities.

Key Database Activity

Purpose

1. Transactions (Data Modification & Integrity)

Manage data changes (INSERT, UPDATE, DELETE) ensuring consistency with commit & rollback mechanisms.

2. Schema Management

Define tables, columns, relationships, and constraints to structure the data.

3. Data Querying & Retrieval

Access stored data without modifying it (SELECT queries), allowing filtering, aggregation, and reporting.

4. Indexing & Optimization

Speed up data access and searches by using indexes and optimizing queries for performance.

5. Concurrency & Multi-User Access

Manage simultaneous access to prevent conflicts and ensure data consistency across multiple users.

3. SQL & Declarative Programming#

In a declarative programming language (SQL)you describe what you want, now how to get it. In an object oriented procedural approach (Python) you specify step-by-step how to proceed with a task

3.1 Comparing SQL to Python#

Feature

SQL (Database Language)

Python (Interpreted Language)

Purpose

Query, store, and manipulate structured data

General-purpose programming

Execution

Runs inside a database engine (SQLite, MySQL)

Runs in an interpreter (CPython, PyPy)

Typing

Strongly typed schema (e.g., INTEGER, TEXT)

Dynamically typed (int, str, etc.)

Flow Control

Limited (CASE, loops in stored procedures)

Full control (if, for, while)

Usage

Optimized for fast data retrieval and transactions

Used for programming logic, automation, and data processing

3.2: SQL Breakdown Using Core Database Activities#

Each database activity maps to a set of SQL commands (syntax).

Database Activity

Purpose

SQL Syntax to Learn

1. Transactions (Data Modification & Integrity)

Modify data while ensuring consistency.

INSERT, UPDATE, DELETE, COMMIT, ROLLBACK

2. Schema Management

Define the structure of the database.

CREATE TABLE, ALTER TABLE, DROP TABLE, PRAGMA (SQLite-specific)

3. Data Querying & Retrieval

Retrieve data without modifying it.

SELECT, WHERE, ORDER BY, GROUP BY, HAVING, LIMIT

4. Indexing & Optimization

Improve query performance.

CREATE INDEX, EXPLAIN QUERY PLAN, ANALYZE

5. Concurrency & Multi-User Access

Ensure consistency in multi-user environments.

SQLite handles this differently, but BEGIN TRANSACTION, LOCKING concepts apply.

3.2a: SQL Data Structures#

“How Data is Stored”, this connects to schema management (Activity 2).

Concept

Python Equivalent

SQL Equivalent (Schema Management)

Lists & Dicts

list, dict (collections of data)

Tables (collections of structured data)

Attributes & Keys

Dictionary keys, object attributes

Columns & Primary Keys

Relationships

Dictionary of dictionaries, nested lists

Foreign keys, joins

Example:

CREATE TABLE elements (
    AtomicNumber INTEGER PRIMARY KEY,
    Name TEXT NOT NULL,
    Symbol TEXT NOT NULL,
    AtomicMass REAL NOT NULL
);
  • This defines a structured storage format just like defining a Python dict.

3.2b: SQL Operations#

“How Data is Manipulated”, this connects to transactions (Activity 1) and querying (Activity 3).

Python Concept

SQL Equivalent

Example Query

Adding data (append)

INSERT INTO

INSERT INTO elements VALUES (1, 'Hydrogen', 'H', 1.008);

Updating values

UPDATE

UPDATE elements SET AtomicMass = 1.0079 WHERE Symbol = 'H';

Removing items (del, pop)

DELETE FROM

DELETE FROM elements WHERE AtomicNumber = 1;

Filtering (if, list comprehensions)

WHERE

SELECT * FROM elements WHERE AtomicNumber > 10;

Sorting (sorted function)

ORDER BY

SELECT * FROM elements ORDER BY AtomicMass DESC;


3.2c: SQL Query Execution#

“How Queries Work”, this connects to Indexing & Optimization (Activity 4).

  • In Python, functions improve code efficiency.

  • In SQL, indexes speed up searches.

Example:

CREATE INDEX idx_symbol ON elements(Symbol);
  • This acts like a Python dictionary lookup for faster queries on Symbol.


3.2d: SQL Transactions & Multi-User Access#

This aligns with Transactions (Activity 1) and Concurrency (Activity 5).

  • In Python, you use try-except blocks to handle errors.

  • In SQL, transactions ensure integrity.

Example:

BEGIN TRANSACTION;
UPDATE elements SET AtomicMass = 1.0079 WHERE Symbol = 'H';
COMMIT;
  • BEGIN TRANSACTION acts like starting a Python function.

  • COMMIT saves the changes (like return in Python).

  • ROLLBACK undoes changes if there’s an error (like try-except in Python).

4. SQLite#

SQLite is a single file SQL database that can be generated and manipulated with the built-in Python sqlite3 module. Unlike larger SQL databases it does not require running a database server, and so it is an excellent tool to introduce basic database functions with.

The Role of SQLite in the SQL Ecosystem#

Feature

SQLite

Server-Based SQL (MySQL, PostgreSQL)

Setup

No setup needed (file-based)

Requires installation & server

Best For

Local storage, embedded apps, small projects

Large-scale web applications, enterprise systems

Performance

Fast for small datasets

Scales well with huge data

Concurrency

Limited multi-user access

Handles multiple users simultaneously

Portability

Just a single .db file

Requires database servers & configurations

4.1 SQLite Objects (Classes)#

Databases involve several key operations and there are several SQLite python objects (classes) that we need to understand in order to communicate with and manipulate data within a SQLite database. These are

  1. Connection: Handles database access of a session (open, close, commit, rollback).

  2. Cursor: Handles query execution and navigation, you can have multiple cursor objects in a connection.

  3. Row Allows you to access values by column name instead of index.

Since SQLite is, making it a great tool for small to medium datasets, research projects, and prototyping.

4.1a: Connection Methods (sqlite3.Connection)#

Method

Description

connect(database)

Opens a connection to the SQLite database file.

close()

Closes the connection to free up resources.

commit()

Saves (commits) changes to the database.

rollback()

Rolls back uncommitted changes.

cursor()

Creates a cursor object to execute SQL commands.

execute(sql, params)

Executes a single SQL command.

executemany(sql, seq_of_params)

Executes a SQL command multiple times with different values.


4.1b Cursor Methods (sqlite3.Cursor)#

Method

Description

execute(sql, params)

Executes a SQL query.

executemany(sql, seq_of_params)

Executes a batch of SQL queries.

fetchone()

Fetches the next row of the result set.

fetchall()

Fetches all remaining rows of the result set.

fetchmany(size)

Fetches size number of rows.

description

A tuple containing metadata about the columns.


4.2Common Database Tasks in SQLite#

Task

SQL Query

SQLite (sqlite3) Example

Show all tables

SELECT name FROM sqlite_master WHERE type='table';

cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")

Show table structure

PRAGMA table_info(elements);

cursor.execute("PRAGMA table_info(elements);")

Get all records

SELECT * FROM elements;

cursor.execute("SELECT * FROM elements;")

Get first 5 records

SELECT * FROM elements LIMIT 5;

cursor.execute("SELECT * FROM elements LIMIT 5;")

Find specific element

SELECT * FROM elements WHERE symbol='H';

cursor.execute("SELECT * FROM elements WHERE symbol='H';")

Filter by condition

SELECT name, symbol FROM elements WHERE boiling_point < 25;

cursor.execute("SELECT name, symbol FROM elements WHERE boiling_point < 25;")

Sort results

SELECT * FROM elements ORDER BY atomic_mass DESC;

cursor.execute("SELECT * FROM elements ORDER BY atomic_mass DESC;")

Insert new record

INSERT INTO elements (symbol, name, atomic_mass) VALUES ('Xx', 'XenonX', 200);

cursor.execute("INSERT INTO elements (symbol, name, atomic_mass) VALUES (?, ?, ?)", ('Xx', 'XenonX', 200))

Update a record

UPDATE elements SET atomic_mass=18.02 WHERE symbol='H2O';

cursor.execute("UPDATE elements SET atomic_mass=? WHERE symbol=?", (18.02, 'H2O'))

Delete a record

DELETE FROM elements WHERE symbol='Xx';

cursor.execute("DELETE FROM elements WHERE symbol=?", ('Xx',))

5. SQLite Activity#

5.1: Create SQL Database#

We will start by creating a simple database called halogens.db in the sandbox with a table called elements that contains 4 fields (AtomicNumber, Name, Symbol and AtomicMass), along with specifying the type of data in each field.

import sqlite3
import os

# Define the directory structure
sandbox_dir = os.path.expanduser("~/sandbox")  # Parent directory
os.makedirs(sandbox_dir, exist_ok=True)  # Ensure directories exist
halogens_db_sbpath = os.path.join(sandbox_dir, "halogens.db")  # SQLite database file

# Connect to SQLite database (creates file if it doesn't exist)
conn = sqlite3.connect(halogens_db_sbpath)
cursor = conn.cursor()

# Create a table called elements
cursor.execute('''
    CREATE TABLE IF NOT EXISTS elements (
        AtomicNumber INTEGER PRIMARY KEY,
        Name TEXT NOT NULL,
        Symbol TEXT NOT NULL,
        AtomicMass REAL NOT NULL
    )
''')

# Commit changes and close connection
conn.commit()
conn.close()
import sqlite3

conn = sqlite3.connect(halogens_db_sbpath)
cursor = conn.cursor()

cursor.execute("PRAGMA table_info(elements)")
schema = cursor.fetchall()
for col in schema:
    print(col)

conn.close()
(0, 'Element', 'TEXT', 0, None, 0)
(1, 'AtomicNumber', 'INTEGER', 0, None, 0)
(2, 'AtomicMass', 'REAL', 0, None, 0)
(3, 'Electronegativity', 'REAL', 0, None, 0)
import sqlite3

# Connect to your database
conn = sqlite3.connect(halogens_db_sbpath)
cursor = conn.cursor()

# Select all rows from the elements table
cursor.execute("SELECT * FROM elements")
rows = cursor.fetchall()

# Print each row
for row in rows:
    print(row)

# Close the connection
conn.close()
('F', 9, 18.998, 3.98)
('Cl', 17, 35.45, 3.16)
('Br', 35, 79.904, 2.96)
('I', 53, 126.9, 2.66)
('At', 85, 210.0, 2.2)
('Ts', 117, 294.0, None)

5.2: Insert Elements#

import sqlite3

# Reconnect to the database
conn = sqlite3.connect(halogens_db_sbpath)
cursor = conn.cursor()

# Insert element data
elements_data = [
    ("H", 1, 1.008, 2.4),
    ("Li", 3, 6.94, 1.0)
]

cursor.executemany("INSERT INTO elements VALUES (?, ?, ?, ?)", elements_data)

# Commit changes and close connection
conn.commit()
conn.close()

Reopen DB and Show Inserted Elements

import sqlite3

# Connect to your database
conn = sqlite3.connect(halogens_db_sbpath)
cursor = conn.cursor()

# Select all rows from the elements table
cursor.execute("SELECT * FROM elements")
rows = cursor.fetchall()

# Print each row
for row in rows:
    print(row)

# Close the connection
conn.close()
('F', 9, 18.998, 3.98)
('Cl', 17, 35.45, 3.16)
('Br', 35, 79.904, 2.96)
('I', 53, 126.9, 2.66)
('At', 85, 210.0, 2.2)
('Ts', 117, 294.0, None)
('H', 1, 1.008, 2.4)
('Li', 3, 6.94, 1.0)

5.3: Changing Data Value#

import sqlite3

# Connect to your database
conn = sqlite3.connect(halogens_db_sbpath)
cursor = conn.cursor()

# Update Electronegativity for Hydrogen and Lithium
cursor.execute("""
    UPDATE elements
    SET Electronegativity = 2.2
    WHERE Element = 'H'
""")

cursor.execute("""
    UPDATE elements
    SET Electronegativity = 0.98
    WHERE Element = 'Li'
""")

# Commit changes and close connection
conn.commit()
conn.close()

Show Values have Changed

import sqlite3

# Connect to your database
conn = sqlite3.connect(halogens_db_sbpath)
cursor = conn.cursor()

# Select all rows from the elements table
cursor.execute("SELECT * FROM elements")
rows = cursor.fetchall()

# Print each row
for row in rows:
    print(row)

# Close the connection
conn.close()
('F', 9, 18.998, 3.98)
('Cl', 17, 35.45, 3.16)
('Br', 35, 79.904, 2.96)
('I', 53, 126.9, 2.66)
('At', 85, 210.0, 2.2)
('Ts', 117, 294.0, None)
('H', 1, 1.008, 2.2)
('Li', 3, 6.94, 0.98)

5.4: Deleting Data Value#

Hydrogen and Lithium are not halogens so lets remove them from the database

import sqlite3

conn = sqlite3.connect(halogens_db_sbpath)
cursor = conn.cursor()

# Delete incorrect rows based on suspicious 'Element' values
cursor.execute("DELETE FROM elements WHERE Element IN ('H', 'Li')")

conn.commit()
conn.close()

Show Hydrogen and Lithium are removed from the data base

5.5: Query Data#

Find all elements with Atomic Number >= 10.

import sqlite3

# Connect to your database
conn = sqlite3.connect(halogens_db_sbpath)
cursor = conn.cursor()

# Select all rows from the elements table
cursor.execute("SELECT * FROM elements")
rows = cursor.fetchall()

# Print each row
for row in rows:
    print(row)

# Close the connection
conn.close()
('F', 9, 18.998, 3.98)
('Cl', 17, 35.45, 3.16)
('Br', 35, 79.904, 2.96)
('I', 53, 126.9, 2.66)
('At', 85, 210.0, 2.2)
('Ts', 117, 294.0, None)
import sqlite3

# Reconnect and create a cursor
conn = sqlite3.connect(halogens_db_sbpath)
cursor = conn.cursor()

# Run a query
#cursor.execute("SELECT Element FROM elements WHERE AtomicNumber >= 10")
cursor.execute("SELECT Element, AtomicNumber FROM elements WHERE AtomicNumber >= 10")


# Fetch and print results
results = cursor.fetchall()
for row in results:
    print(row)  # Each row is a tuple (Name, Symbol)

# Close connection
conn.close()
('Cl', 17)
('Br', 35)
('I', 53)
('At', 85)
('Ts', 117)

5.6: Convert to Pandas DataFrame#

This converts the data to a Pandas DataFrame and prints top 5 rows

import sqlite3
import pandas as pd

# Reconnect to the database
conn = sqlite3.connect(halogens_db_sbpath)

# Read SQL query into DataFrame
df = pd.read_sql_query("SELECT * FROM elements", conn)

# Display DataFrame
print(df.head(10))

# Close connection
conn.close()
  Element  AtomicNumber  AtomicMass  Electronegativity
0       F             9      18.998               3.98
1      Cl            17      35.450               3.16
2      Br            35      79.904               2.96
3       I            53     126.900               2.66
4      At            85     210.000               2.20
5      Ts           117     294.000                NaN

5.7: Create a Database from halogens.csv#

Note, the workflow: we convert csv file to pandas dataframe and then the dataframe to the sql database, and we are replacing the data in the elements table, not appending to it

import sqlite3
import pandas as pd
import os

# Define the directory structure
sandbox_dir = os.path.expanduser("~/sandbox")  # Parent directory
os.makedirs(sandbox_dir, exist_ok=True)  # Ensure directories exist

# Define file paths
halogens_csv_sbpath = os.path.join(sandbox_dir, "halogens.csv")
halogens_db_sbpath = os.path.join(sandbox_dir, "halogens.db")  # SQLite database file

# Load CSV file into Pandas DataFrame
df = pd.read_csv(halogens_csv_sbpath)

# Connect to SQLite database (creates the file if it doesn’t exist)
conn = sqlite3.connect(halogens_db_sbpath)
cursor = conn.cursor()

# Create table (adjust column names to match CSV headers)
cursor.execute("""
    CREATE TABLE IF NOT EXISTS elements (
        Element TEXT NOT NULL,
        atomic_number INTEGER PRIMARY KEY,
        atomic_mass REAL,
        Electronegativity REAL
    )
""")

# Insert data from DataFrame into SQLite table, this replaces  the data
df.to_sql("elements", conn, if_exists="replace", index=False)

# Commit changes and close connection
conn.commit()
conn.close()

5.11: Find Elements with mass less than 50 amu.#

conn = sqlite3.connect(halogens_db_sbpath)
cursor = conn.cursor()

cursor.execute("SELECT Element FROM elements WHERE AtomicMass < 50;")
light_elements = cursor.fetchall()

print("Elements that weight less than 50 amu:")
for element in light_elements:
    print(element)
cursor.execute("SELECT AtomicMass FROM elements WHERE Element LIKE 'Cl';")
var=cursor.fetchall()
conn.close()
print(var)
print(type(var))
Elements that weight less than 50 amu:
('F',)
('Cl',)
[(35.45,)]
<class 'list'>

Why Does fetchall() Return [(35.45,)]?

  • fetchall() always returns a list of tuples, even if there’s only one result.

  • Each row is returned as a tuple, and fetchall() collects all rows into a list.

var = cursor.fetchall()
atomic_mass = var[0][0]  # Extract first row (tuple) and first column (value)
print(atomic_mass)  # Output: 35.45
print(type(atomic_mass))  # Output: <class 'float'>
  • var[0] gives the first row(35.45,)

  • var[0][0] gives the first value35.45

Query Method

Output

How to Extract Float

fetchall()

[(35.45,)]

var[0][0]

fetchone()

(35.45,)

var[0]

conn = sqlite3.connect(halogens_db_sbpath)
cursor = conn.cursor()

cursor.execute("SELECT AtomicMass FROM elements WHERE Element LIKE 'Cl';")
var = cursor.fetchone()  # Use fetchone() for a single value
conn.close()
atomic_mass = var[0]  # Extract float value
print(atomic_mass, type(atomic_mass))  # Output: 35.45 <class 'float'>
35.45 <class 'float'>

5.12: Find Element with largest Atomic Mass#

conn = sqlite3.connect(halogens_db_sbpath)
cursor = conn.cursor()

cursor.execute("SELECT Element, AtomicMass FROM elements ORDER BY AtomicMass DESC LIMIT 1;")
heaviest_element = cursor.fetchone()

print("Heaviest element:", heaviest_element)
conn.close()
Heaviest element: ('Ts', 294.0)

5.13: Convert db to pandas dataframe#

import sqlite3
import pandas as pd

# Connect to SQLite database
conn = sqlite3.connect(halogens_db_sbpath)

# Load entire 'elements' table into a DataFrame
df = pd.read_sql_query("SELECT * FROM elements;", conn)

# Close connection
conn.close()

# Display DataFrame
print(df)
  Element  AtomicNumber  AtomicMass  Electronegativity
0       F             9      18.998               3.98
1      Cl            17      35.450               3.16
2      Br            35      79.904               2.96
3       I            53     126.900               2.66
4      At            85     210.000               2.20
5      Ts           117     294.000                NaN

Acknowledgements#

This content was developed with assistance from Perplexity AI and Chat GPT. Multiple queries were made during the Fall 2024 and the Spring 2025.