7. Databases as External Data Structures#
How Databases and Web Services Fit Into the Picture#
They are an interesting hybrid, they store non-volatile data from the perspective of the python program, but access to them introduces elements of volatility:
Non-Volatile: Data persists when the Python kernel/interpreter is shut down.
Volatile: Unlike files, a database can be unavailable (e.g., server downtime), and data integrity can be affected by concurrent access.
In a structured progression from volatile to non-volatile data from the perspective of a program, databases fit in as external but queryable structures:
Unlike files, they support efficient querying (e.g., SQL queries).
Unlike in-memory Python data structures, they persist beyond execution of a program.
Unlike JSON or CSV files, they allow dynamic updates and complex relationships.
1. How a Computer Stores Data: Directory vs. Database#
So far we have looked at directories (file systems) and we need to understand the difference between a file system and a database.
1.1 Directory (File Storage)#
A directory is a hierarchical tree structure where files and folders are organized.
Files are stored as binary data on disk, and each file has metadata (name, size, type, last modified date).
You can perform basic searches (filename, date modified).
No structured relations between files, beyond directory hierarchies.
Example: How a Directory Stores Chemistry Data#
/research_data/
│── periodic_table.csv
│── elements.json
│── hydrogen.png
│── lab_notes.txt
You can find “hydrogen.png” by looking in
/research_data/
and searching filenames.But if you want to ask, “What are all the elements with atomic number < 10?”, the directory itself does not support such structured queries.
1.2 Database Storage#
A database is an organized system for storing and retrieving data with structured querying.
Data is stored in a structured format (tables, documents, graphs).
Indexing enables fast queries (retrieving data without scanning everything).
Relationships exist between data.
2. Database Models: SQL, NoSQL, and Beyond#
There are multiple ways to structure and organize data in a database. The most common types include:
2.1 Relational Databases (SQL)#
Data is stored in tables with predefined columns (fields).
Tables are related to each other through keys.
Uses Structured Query Language (SQL) to manage and query data.
Example: SQLite, PostgreSQL, MySQL.
Example Schema:
Element Table: | AtomicNumber | Symbol | Name | AtomicWeight | |-------------|--------|-------|--------------| | 1 | H | Hydrogen | 1.008 | | 2 | He | Helium | 4.0026 |
Relationships allow for structured queries like:
SELECT * FROM Element WHERE AtomicNumber = 1;
2.2 NoSQL Databases#
Designed for flexibility and scale, handling semi-structured or unstructured data.
No fixed table structure; data is stored as documents, key-value pairs, graphs, or columns.
Example JSON document in MongoDB:
{
"AtomicNumber": 1,
"Symbol": "H",
"Name": "Hydrogen",
"AtomicWeight": 1.008
}
Queries use JSON-based query languages, e.g.:
db.elements.find({"Symbol": "H"})
2.3 Graph Databases (RDF and Triple Stores)#
Used to model complex relationships, often found in semantic web and linked data applications.
Data is stored as triples: (subject, predicate, object).
Example:
("Hydrogen", "isElement", "True") ("Hydrogen", "atomicNumber", "1") ("Hydrogen", "bondsWith", "Oxygen")
Queries use SPARQL instead of SQL.
Understanding Databases#
The core role of a database is to efficiently and persistently store data and in a structured and useful manner that allows access to the data. These goals can be achieved through 5 key database activities.
Key Database Activity |
Purpose |
---|---|
1. Transactions (Data Modification & Integrity) |
Manage data changes ( |
2. Schema Management |
Define tables, columns, relationships, and constraints to structure the data. |
3. Data Querying & Retrieval |
Access stored data without modifying it ( |
4. Indexing & Optimization |
Speed up data access and searches by using indexes and optimizing queries for performance. |
5. Concurrency & Multi-User Access |
Manage simultaneous access to prevent conflicts and ensure data consistency across multiple users. |
3. SQL & Declarative Programming#
In a declarative programming language (SQL)you describe what you want, now how to get it. In an object oriented procedural approach (Python) you specify step-by-step how to proceed with a task
3.1 Comparing SQL to Python#
Feature |
SQL (Database Language) |
Python (Interpreted Language) |
---|---|---|
Purpose |
Query, store, and manipulate structured data |
General-purpose programming |
Execution |
Runs inside a database engine (SQLite, MySQL) |
Runs in an interpreter (CPython, PyPy) |
Typing |
Strongly typed schema (e.g., |
Dynamically typed ( |
Flow Control |
Limited ( |
Full control ( |
Usage |
Optimized for fast data retrieval and transactions |
Used for programming logic, automation, and data processing |
3.2: SQL Breakdown Using Core Database Activities#
Each database activity maps to a set of SQL commands (syntax).
Database Activity |
Purpose |
SQL Syntax to Learn |
---|---|---|
1. Transactions (Data Modification & Integrity) |
Modify data while ensuring consistency. |
|
2. Schema Management |
Define the structure of the database. |
|
3. Data Querying & Retrieval |
Retrieve data without modifying it. |
|
4. Indexing & Optimization |
Improve query performance. |
|
5. Concurrency & Multi-User Access |
Ensure consistency in multi-user environments. |
SQLite handles this differently, but |
3.2a: SQL Data Structures#
“How Data is Stored”, this connects to schema management (Activity 2).
Concept |
Python Equivalent |
SQL Equivalent (Schema Management) |
---|---|---|
Lists & Dicts |
|
Tables (collections of structured data) |
Attributes & Keys |
Dictionary keys, object attributes |
Columns & Primary Keys |
Relationships |
Dictionary of dictionaries, nested lists |
Foreign keys, joins |
Example:
CREATE TABLE elements (
AtomicNumber INTEGER PRIMARY KEY,
Name TEXT NOT NULL,
Symbol TEXT NOT NULL,
AtomicMass REAL NOT NULL
);
This defines a structured storage format just like defining a Python
dict
.
3.2b: SQL Operations#
“How Data is Manipulated”, this connects to transactions (Activity 1) and querying (Activity 3).
Python Concept |
SQL Equivalent |
Example Query |
---|---|---|
Adding data (append) |
|
|
Updating values |
|
|
Removing items (del, pop) |
|
|
Filtering (if, list comprehensions) |
|
|
Sorting (sorted function) |
|
|
3.2c: SQL Query Execution#
“How Queries Work”, this connects to Indexing & Optimization (Activity 4).
In Python, functions improve code efficiency.
In SQL, indexes speed up searches.
Example:
CREATE INDEX idx_symbol ON elements(Symbol);
This acts like a Python dictionary lookup for faster queries on
Symbol
.
3.2d: SQL Transactions & Multi-User Access#
This aligns with Transactions (Activity 1) and Concurrency (Activity 5).
In Python, you use
try-except
blocks to handle errors.In SQL, transactions ensure integrity.
Example:
BEGIN TRANSACTION;
UPDATE elements SET AtomicMass = 1.0079 WHERE Symbol = 'H';
COMMIT;
BEGIN TRANSACTION
acts like starting a Python function.COMMIT
saves the changes (likereturn
in Python).ROLLBACK
undoes changes if there’s an error (liketry-except
in Python).
4. SQLite#
SQLite is a single file SQL database that can be generated and manipulated with the built-in Python sqlite3 module. Unlike larger SQL databases it does not require running a database server, and so it is an excellent tool to introduce basic database functions with.
The Role of SQLite in the SQL Ecosystem#
Feature |
SQLite |
Server-Based SQL (MySQL, PostgreSQL) |
---|---|---|
Setup |
No setup needed (file-based) |
Requires installation & server |
Best For |
Local storage, embedded apps, small projects |
Large-scale web applications, enterprise systems |
Performance |
Fast for small datasets |
Scales well with huge data |
Concurrency |
Limited multi-user access |
Handles multiple users simultaneously |
Portability |
Just a single |
Requires database servers & configurations |
4.1 SQLite Objects (Classes)#
Databases involve several key operations and there are several SQLite python objects (classes) that we need to understand in order to communicate with and manipulate data within a SQLite database. These are
Connection: Handles database access of a session (open, close, commit, rollback).
Cursor: Handles query execution and navigation, you can have multiple cursor objects in a connection.
Row Allows you to access values by column name instead of index.
Since SQLite is, making it a great tool for small to medium datasets, research projects, and prototyping.
4.1a: Connection Methods (sqlite3.Connection
)#
Method |
Description |
---|---|
|
Opens a connection to the SQLite database file. |
|
Closes the connection to free up resources. |
|
Saves (commits) changes to the database. |
|
Rolls back uncommitted changes. |
|
Creates a cursor object to execute SQL commands. |
|
Executes a single SQL command. |
|
Executes a SQL command multiple times with different values. |
4.1b Cursor Methods (sqlite3.Cursor
)#
Method |
Description |
---|---|
|
Executes a SQL query. |
|
Executes a batch of SQL queries. |
|
Fetches the next row of the result set. |
|
Fetches all remaining rows of the result set. |
|
Fetches |
|
A tuple containing metadata about the columns. |
4.2Common Database Tasks in SQLite#
Task |
SQL Query |
SQLite ( |
---|---|---|
Show all tables |
|
|
Show table structure |
|
|
Get all records |
|
|
Get first 5 records |
|
|
Find specific element |
|
|
Filter by condition |
|
|
Sort results |
|
|
Insert new record |
|
|
Update a record |
|
|
Delete a record |
|
|
5. SQLite Activity#
5.1: Create SQL Database#
We will start by creating a simple database called halogens.db in the sandbox with a table called elements that contains 4 fields (AtomicNumber, Name, Symbol and AtomicMass), along with specifying the type of data in each field.
import sqlite3
import os
# Define the directory structure
sandbox_dir = os.path.expanduser("~/sandbox") # Parent directory
os.makedirs(sandbox_dir, exist_ok=True) # Ensure directories exist
halogens_db_sbpath = os.path.join(sandbox_dir, "halogens.db") # SQLite database file
# Connect to SQLite database (creates file if it doesn't exist)
conn = sqlite3.connect(halogens_db_sbpath)
cursor = conn.cursor()
# Create a table called elements
cursor.execute('''
CREATE TABLE IF NOT EXISTS elements (
AtomicNumber INTEGER PRIMARY KEY,
Name TEXT NOT NULL,
Symbol TEXT NOT NULL,
AtomicMass REAL NOT NULL
)
''')
# Commit changes and close connection
conn.commit()
conn.close()
import sqlite3
conn = sqlite3.connect(halogens_db_sbpath)
cursor = conn.cursor()
cursor.execute("PRAGMA table_info(elements)")
schema = cursor.fetchall()
for col in schema:
print(col)
conn.close()
(0, 'Element', 'TEXT', 0, None, 0)
(1, 'AtomicNumber', 'INTEGER', 0, None, 0)
(2, 'AtomicMass', 'REAL', 0, None, 0)
(3, 'Electronegativity', 'REAL', 0, None, 0)
import sqlite3
# Connect to your database
conn = sqlite3.connect(halogens_db_sbpath)
cursor = conn.cursor()
# Select all rows from the elements table
cursor.execute("SELECT * FROM elements")
rows = cursor.fetchall()
# Print each row
for row in rows:
print(row)
# Close the connection
conn.close()
('F', 9, 18.998, 3.98)
('Cl', 17, 35.45, 3.16)
('Br', 35, 79.904, 2.96)
('I', 53, 126.9, 2.66)
('At', 85, 210.0, 2.2)
('Ts', 117, 294.0, None)
5.2: Insert Elements#
import sqlite3
# Reconnect to the database
conn = sqlite3.connect(halogens_db_sbpath)
cursor = conn.cursor()
# Insert element data
elements_data = [
("H", 1, 1.008, 2.4),
("Li", 3, 6.94, 1.0)
]
cursor.executemany("INSERT INTO elements VALUES (?, ?, ?, ?)", elements_data)
# Commit changes and close connection
conn.commit()
conn.close()
Reopen DB and Show Inserted Elements
import sqlite3
# Connect to your database
conn = sqlite3.connect(halogens_db_sbpath)
cursor = conn.cursor()
# Select all rows from the elements table
cursor.execute("SELECT * FROM elements")
rows = cursor.fetchall()
# Print each row
for row in rows:
print(row)
# Close the connection
conn.close()
('F', 9, 18.998, 3.98)
('Cl', 17, 35.45, 3.16)
('Br', 35, 79.904, 2.96)
('I', 53, 126.9, 2.66)
('At', 85, 210.0, 2.2)
('Ts', 117, 294.0, None)
('H', 1, 1.008, 2.4)
('Li', 3, 6.94, 1.0)
5.3: Changing Data Value#
import sqlite3
# Connect to your database
conn = sqlite3.connect(halogens_db_sbpath)
cursor = conn.cursor()
# Update Electronegativity for Hydrogen and Lithium
cursor.execute("""
UPDATE elements
SET Electronegativity = 2.2
WHERE Element = 'H'
""")
cursor.execute("""
UPDATE elements
SET Electronegativity = 0.98
WHERE Element = 'Li'
""")
# Commit changes and close connection
conn.commit()
conn.close()
Show Values have Changed
import sqlite3
# Connect to your database
conn = sqlite3.connect(halogens_db_sbpath)
cursor = conn.cursor()
# Select all rows from the elements table
cursor.execute("SELECT * FROM elements")
rows = cursor.fetchall()
# Print each row
for row in rows:
print(row)
# Close the connection
conn.close()
('F', 9, 18.998, 3.98)
('Cl', 17, 35.45, 3.16)
('Br', 35, 79.904, 2.96)
('I', 53, 126.9, 2.66)
('At', 85, 210.0, 2.2)
('Ts', 117, 294.0, None)
('H', 1, 1.008, 2.2)
('Li', 3, 6.94, 0.98)
5.4: Deleting Data Value#
Hydrogen and Lithium are not halogens so lets remove them from the database
import sqlite3
conn = sqlite3.connect(halogens_db_sbpath)
cursor = conn.cursor()
# Delete incorrect rows based on suspicious 'Element' values
cursor.execute("DELETE FROM elements WHERE Element IN ('H', 'Li')")
conn.commit()
conn.close()
Show Hydrogen and Lithium are removed from the data base
5.5: Query Data#
Find all elements with Atomic Number >= 10.
import sqlite3
# Connect to your database
conn = sqlite3.connect(halogens_db_sbpath)
cursor = conn.cursor()
# Select all rows from the elements table
cursor.execute("SELECT * FROM elements")
rows = cursor.fetchall()
# Print each row
for row in rows:
print(row)
# Close the connection
conn.close()
('F', 9, 18.998, 3.98)
('Cl', 17, 35.45, 3.16)
('Br', 35, 79.904, 2.96)
('I', 53, 126.9, 2.66)
('At', 85, 210.0, 2.2)
('Ts', 117, 294.0, None)
import sqlite3
# Reconnect and create a cursor
conn = sqlite3.connect(halogens_db_sbpath)
cursor = conn.cursor()
# Run a query
#cursor.execute("SELECT Element FROM elements WHERE AtomicNumber >= 10")
cursor.execute("SELECT Element, AtomicNumber FROM elements WHERE AtomicNumber >= 10")
# Fetch and print results
results = cursor.fetchall()
for row in results:
print(row) # Each row is a tuple (Name, Symbol)
# Close connection
conn.close()
('Cl', 17)
('Br', 35)
('I', 53)
('At', 85)
('Ts', 117)
5.6: Convert to Pandas DataFrame#
This converts the data to a Pandas DataFrame and prints top 5 rows
import sqlite3
import pandas as pd
# Reconnect to the database
conn = sqlite3.connect(halogens_db_sbpath)
# Read SQL query into DataFrame
df = pd.read_sql_query("SELECT * FROM elements", conn)
# Display DataFrame
print(df.head(10))
# Close connection
conn.close()
Element AtomicNumber AtomicMass Electronegativity
0 F 9 18.998 3.98
1 Cl 17 35.450 3.16
2 Br 35 79.904 2.96
3 I 53 126.900 2.66
4 At 85 210.000 2.20
5 Ts 117 294.000 NaN
5.7: Create a Database from halogens.csv#
Note, the workflow: we convert csv file to pandas dataframe and then the dataframe to the sql database, and we are replacing the data in the elements table, not appending to it
import sqlite3
import pandas as pd
import os
# Define the directory structure
sandbox_dir = os.path.expanduser("~/sandbox") # Parent directory
os.makedirs(sandbox_dir, exist_ok=True) # Ensure directories exist
# Define file paths
halogens_csv_sbpath = os.path.join(sandbox_dir, "halogens.csv")
halogens_db_sbpath = os.path.join(sandbox_dir, "halogens.db") # SQLite database file
# Load CSV file into Pandas DataFrame
df = pd.read_csv(halogens_csv_sbpath)
# Connect to SQLite database (creates the file if it doesn’t exist)
conn = sqlite3.connect(halogens_db_sbpath)
cursor = conn.cursor()
# Create table (adjust column names to match CSV headers)
cursor.execute("""
CREATE TABLE IF NOT EXISTS elements (
Element TEXT NOT NULL,
atomic_number INTEGER PRIMARY KEY,
atomic_mass REAL,
Electronegativity REAL
)
""")
# Insert data from DataFrame into SQLite table, this replaces the data
df.to_sql("elements", conn, if_exists="replace", index=False)
# Commit changes and close connection
conn.commit()
conn.close()
5.8: Print Tables in Database#
import sqlite3
conn = sqlite3.connect(halogens_db_sbpath)
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
tables = cursor.fetchall()
print("Tables in the database:", tables)
conn.close()
Tables in the database: [('elements',)]
5.9: Print Fields (Columns) in Database#
conn = sqlite3.connect(halogens_db_sbpath)
cursor = conn.cursor()
cursor.execute("PRAGMA table_info(elements);")
fields = [row[1] for row in cursor.fetchall()]
print("Fields in the 'elements' table:", fields)
conn.close()
Fields in the 'elements' table: ['Element', 'Atomic Number', 'Atomic Mass', 'Electronegativity']
Warning the column names ‘Atomic Number’ and “Atomic Mass’ have two words and we need to make them one word or SQLite will give a name error if we seek to use them. That is, the following would throw an error:
conn = sqlite3.connect(halogens_db_file_path)
cursor = conn.cursor()
cursor.execute("PRAGMA table_info(elements);")
fields = [row[1] for row in cursor.fetchall()]
print("Fields in the 'elements' table:", fields)
conn.close()
conn = sqlite3.connect(halogens_db_sbpath)
cursor = conn.cursor()
cursor.execute('ALTER TABLE elements RENAME COLUMN "Atomic Mass" TO AtomicMass;')
cursor.execute('ALTER TABLE elements RENAME COLUMN "Atomic Number" TO AtomicNumber;')
cursor.execute("PRAGMA table_info(elements);")
fields = [row[1] for row in cursor.fetchall()]
print("Fields in the 'elements' table:", fields)
conn.close()
Fields in the 'elements' table: ['Element', 'AtomicNumber', 'AtomicMass', 'Electronegativity']
5.10: Print first 5 Records (rows)#
conn = sqlite3.connect(halogens_db_sbpath)
cursor = conn.cursor()
cursor.execute("SELECT * FROM elements LIMIT 5;")
records = cursor.fetchall()
for record in records:
print(record)
conn.close()
('F', 9, 18.998, 3.98)
('Cl', 17, 35.45, 3.16)
('Br', 35, 79.904, 2.96)
('I', 53, 126.9, 2.66)
('At', 85, 210.0, 2.2)
5.11: Find Elements with mass less than 50 amu.#
conn = sqlite3.connect(halogens_db_sbpath)
cursor = conn.cursor()
cursor.execute("SELECT Element FROM elements WHERE AtomicMass < 50;")
light_elements = cursor.fetchall()
print("Elements that weight less than 50 amu:")
for element in light_elements:
print(element)
cursor.execute("SELECT AtomicMass FROM elements WHERE Element LIKE 'Cl';")
var=cursor.fetchall()
conn.close()
print(var)
print(type(var))
Elements that weight less than 50 amu:
('F',)
('Cl',)
[(35.45,)]
<class 'list'>
Why Does fetchall()
Return [(35.45,)]
?
fetchall()
always returns a list of tuples, even if there’s only one result.Each row is returned as a tuple, and
fetchall()
collects all rows into a list.
var = cursor.fetchall()
atomic_mass = var[0][0] # Extract first row (tuple) and first column (value)
print(atomic_mass) # Output: 35.45
print(type(atomic_mass)) # Output: <class 'float'>
var[0]
gives the first row →(35.45,)
var[0][0]
gives the first value →35.45
Query Method |
Output |
How to Extract Float |
---|---|---|
|
|
|
|
|
|
conn = sqlite3.connect(halogens_db_sbpath)
cursor = conn.cursor()
cursor.execute("SELECT AtomicMass FROM elements WHERE Element LIKE 'Cl';")
var = cursor.fetchone() # Use fetchone() for a single value
conn.close()
atomic_mass = var[0] # Extract float value
print(atomic_mass, type(atomic_mass)) # Output: 35.45 <class 'float'>
35.45 <class 'float'>
5.12: Find Element with largest Atomic Mass#
conn = sqlite3.connect(halogens_db_sbpath)
cursor = conn.cursor()
cursor.execute("SELECT Element, AtomicMass FROM elements ORDER BY AtomicMass DESC LIMIT 1;")
heaviest_element = cursor.fetchone()
print("Heaviest element:", heaviest_element)
conn.close()
Heaviest element: ('Ts', 294.0)
5.13: Convert db to pandas dataframe#
import sqlite3
import pandas as pd
# Connect to SQLite database
conn = sqlite3.connect(halogens_db_sbpath)
# Load entire 'elements' table into a DataFrame
df = pd.read_sql_query("SELECT * FROM elements;", conn)
# Close connection
conn.close()
# Display DataFrame
print(df)
Element AtomicNumber AtomicMass Electronegativity
0 F 9 18.998 3.98
1 Cl 17 35.450 3.16
2 Br 35 79.904 2.96
3 I 53 126.900 2.66
4 At 85 210.000 2.20
5 Ts 117 294.000 NaN
Acknowledgements#
This content was developed with assistance from Perplexity AI and Chat GPT. Multiple queries were made during the Fall 2024 and the Spring 2025.