6. External Data Structures#

Learning Objectives#

Content#

Nature of NonVolatile Memory
Nature of Input/Output Operations
Understand Interfaces
Non-Volatile Data Structures

Type	Source Example	Data Format	Storage/Query Method
File-Based Data	JSON, CSV, Pickle, HDF5	Structured	File I/O (pandas, NumPy)
Relational Databases	PubChem SQL, NOAA Climate Database	SQL (Structured)	Queries (SQL, Relational Model)
NoSQL Databases	MongoDB, Firebase	JSON, BSON (Semi-Structured)	Queries (NoSQL, Document Store)
APIs (REST)	PubChem API, OpenWeather API, NASA API	JSON, XML, CSV	RESTful Requests (`GET`, `POST`)
SPARQL APIs (Linked Data)	Wikidata, PubChem RDF, DBpedia, Gene Ontology	RDF/XML, Turtle, SPARQL-JSON	SPARQL Queries (Graph-Based)
Graph Databases (RDF Stores)	Blazegraph, Virtuoso, Neo4j (for Linked Data)	RDF, GraphML	Queries (SPARQL, Cypher)
Web Scraping	Wikipedia, Research Articles, Google Scholar	HTML (Semi-Structured)	Parsing (BeautifulSoup, Scrapy)
Streaming Data	EPA RSIG, IoT Sensor Networks	JSON, Avro, Protobuf	WebSockets, Kafka, MQTT

Process#

I/O Operations
- Directory Hierarchies (read/write to specific locations)
Working with:
- .txt files
- .csv files
- .json files

Introduction: Non-Volatile External Data Structures#

Python’s built-in data structures (lists, tuples, dictionaries, etc.) reside in volatile memory (RAM) and persist only as long as the Python interpreter is running. Once the interpreter exits, these structures are lost. In contrast, external data structures stored in non-volatile memory (such as files, databases, or serialized objects) persist beyond both the current Python session and hardware restarts, allowing data to be recovered even after the system reboots.

File-Based Data Structures
- Text-Based Files (Structured/Unstructured)
  - .txt: Raw text files (unstructured).
  - .csv: Tabular data (structured, human-readable, but limited).
  - .json: Key-value format (structured, hierarchical, readable).
- Binary Files
  - .pickle: Stores Python objects in a serialized format (not human-readable).
  - .npy/.npz: NumPy’s binary storage for efficient numerical data.
Databases (Structured, Queryable External Data)
- SQL Databases (Relational)
  - Store data in structured tables with defined relationships.
  - Examples: SQLite, PostgreSQL, MySQL.
- NoSQL Databases (Flexible, Key-Value, Document-Based)
  - Store unstructured or semi-structured data in key-value or document formats.
  - Examples: MongoDB (documents), Redis (key-value pairs).
Web APIs and Networks as External Data Sources
- Accessing data from remote servers (e.g., PubChem, weather services).
- Often return data in JSON, XML, or other standardized formats.
- Unlike local files or databases, APIs require a network connection.
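
To make the volatile versus non-volatile distinction concrete, here is a minimal sketch (not part of the original lesson) that writes a dictionary to a JSON file and reads it back. The file name element.json and the temporary directory are assumptions, chosen so the example does not touch your own folders.

```python
import json
import os
import tempfile

# A dictionary lives in RAM and disappears when the interpreter exits
element = {"symbol": "Cl", "atomic_number": 17, "atomic_mass": 35.45}

# Hypothetical location: a temporary directory, used only for illustration
tmp_dir = tempfile.mkdtemp()
json_path = os.path.join(tmp_dir, "element.json")

# Write the dictionary to non-volatile storage
with open(json_path, "w", encoding="utf-8") as file:
    json.dump(element, file)

# Simulate losing the in-memory copy, then recover it from the file
del element
with open(json_path, "r", encoding="utf-8") as file:
    restored = json.load(file)

print(restored)
```

The file on disk would still be there after a kernel restart, whereas the variable restored would not.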

Understanding Input/Output (I/O)#

At its core, Input/Output (I/O) refers to any communication between a program and the outside world. It is not limited to data storage and retrieval; it also includes interactions with users, hardware, and network resources. I/O operations can be broadly categorized into:

User Interaction
- Input: Receiving user input via input() or GUI elements.
- Output: Displaying text via print(), rendering graphics, or updating a UI.
File I/O (Non-volatile Storage)
- Reading and writing data to files (e.g., .txt, .csv, .json, .pickle).
- Persistent storage that remains available after the program terminates.
Network I/O
- Communicating with remote servers, APIs, or databases over the internet.
- Sending and receiving data over sockets (e.g., accessing PubChem via an API).
Inter-process and Hardware I/O
- Communicating with external devices like sensors, databases, or microcontrollers.
- Data exchange between different programs or services.

Understanding Interfaces#

Inherent in I/O operations is the interface between two entities or systems, so we need to introduce the concept of an API (Application Programming Interface). If you look at the I/O categories above, you will see that there are human, hardware and software components, and so there are different types of interfaces. The following table gives an overview of several interfaces; as a human, you have already used both CLIs and GUIs in this class.

Interface Type	Example	Who/What Interacts?
Graphical User Interface (GUI)	Windows, Web Apps	User ↔ System (via visual elements like buttons, menus)
Command Line Interface (CLI)	Terminal, Bash, Python REPL	User ↔ System (via text commands)
Application Programming Interface (API)	REST API, Database API	Software ↔ Software (via structured requests & responses)
Hardware Interfaces	USB, HDMI, Bluetooth	Physical Devices ↔ System

1. File Based Data Structures#

Before we proceed, we are going to install two new third-party packages: Seaborn and pandas. Seaborn is a visualization package built on Matplotlib, and it provides access to a series of example datasets we can use for various data explorations. pandas is a data manipulation package built on NumPy and is widely used to handle structured data (CSV, JSON, SQL, Excel, ...).

Text Files vs. Binary Files#

Feature	Text Files	Binary Files
Storage Format	Human-readable characters	Machine-readable bytes
Encoding	Requires a character encoding (e.g., UTF-8)	Stored as raw data
Editing	Can be edited with a simple text editor	Requires specialized software
Examples	`.txt`, `.csv`, `.json`, `.xml`	`.exe`, `.jpg`, `.png`, `.dat`
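
The difference between the two columns can be seen by reading the same file in text mode and in binary mode. The following is a small sketch under assumed names (formula.txt in a temporary directory); it is not taken from the lesson's own scripts.

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()  # hypothetical scratch location
path = os.path.join(tmp_dir, "formula.txt")

# "\u2082" is the subscript-two character, so the string reads H2O with a subscript
formula = "H\u2082O"

# Writing in text mode encodes the characters to bytes using UTF-8
with open(path, "w", encoding="utf-8") as file:
    file.write(formula)

# Text mode decodes the bytes back into characters
with open(path, "r", encoding="utf-8") as file:
    as_text = file.read()

# Binary mode ("rb") returns the raw bytes with no decoding
with open(path, "rb") as file:
    as_bytes = file.read()

print(len(as_text))   # number of characters
print(len(as_bytes))  # number of bytes; the subscript takes 3 bytes in UTF-8
print(as_bytes)
```

The text has 3 characters but the file holds 5 bytes, which is why a text file always needs a character encoding to be interpreted correctly.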

Comparing CSV, JSON, and Pickle#

Feature	CSV (Comma-Separated Values) 📝	JSON (JavaScript Object Notation) 🌐	Pickle (Python Object Serialization) 🥒
Human-readable?	✅ Yes (text-based, tabular format)	✅ Yes (nested, structured format)	❌ No (binary format)
Supports structured data?	❌ No (flat, lacks hierarchy)	✅ Yes (nested, supports dictionaries/lists)	✅ Yes (fully supports Python objects)
Supports non-string data?	❌ No (everything is a string)	✅ Yes (numbers, lists, dicts)	✅ Yes (numbers, lists, dicts, tuples, objects)
Portable across languages?	✅ Yes (universal)	✅ Yes (used in APIs, databases)	❌ No (Python-specific)
Best used for?	Storing tabular data (like spreadsheets)	Storing hierarchical, structured data	Saving entire Python objects for fast reloading
Ideal for?	Data exchange (Excel, databases)	Config files, APIs, web apps	Machine learning models, Pandas DataFrames
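
The rows of this table can be checked with a short round-trip experiment. This is an illustrative sketch using in-memory data rather than files; the record and its field names are made up for the example.

```python
import csv
import io
import json
import pickle

record = {"symbol": "Br", "atomic_number": 35, "isotopes": (79, 81)}

# CSV: flat text; every value comes back as a string
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow([record["symbol"], record["atomic_number"]])
buffer.seek(0)
csv_row = next(csv.reader(buffer))

# JSON: keeps numbers and nesting, but tuples come back as lists
json_record = json.loads(json.dumps(record))

# Pickle: binary and Python-specific, restores the original types
pickle_record = pickle.loads(pickle.dumps(record))

print(csv_row)
print(json_record)
print(pickle_record)
```

Note that pickle files should only be loaded from sources you trust, because unpickling can execute arbitrary code.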

File Access Modes in Python#

Note: to work with binary files you add a b to the mode (for example 'rb' or 'wb'). We will not be modifying binary files, so we will use the modes below with text files.

Mode	Meaning	Behavior
`'r'`	Read mode	Opens a file for reading (default). File must exist.
`'w'`	Write mode	Opens a file for writing. If the file exists, it is erased!
`'a'`	Append mode	Opens a file for writing, but does not erase existing content.
`'x'`	Exclusive creation	Creates a new file. Fails if the file already exists.
`'r+'`	Read + Write	Opens a file for both reading and writing. File must exist.
`'w+'`	Write + Read	Opens a file for both, but erases the file first.
`'a+'`	Append + Read	Opens a file for reading and writing, preserving content.
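
The behavior of the most common modes can be tried out safely in a scratch location. The sketch below assumes a hypothetical file modes_demo.txt in a temporary directory, so nothing in your own folders is erased.

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "modes_demo.txt")

# "x" creates the file and fails if it already exists
with open(path, "x", encoding="utf-8") as file:
    file.write("first line\n")

# "a" keeps the existing content and adds to the end
with open(path, "a", encoding="utf-8") as file:
    file.write("second line\n")

with open(path, "r", encoding="utf-8") as file:
    after_append = file.read()

# "w" erases the existing content before writing
with open(path, "w", encoding="utf-8") as file:
    file.write("replaced\n")

with open(path, "r", encoding="utf-8") as file:
    after_write = file.read()

# A second "x" on the same path raises FileExistsError
try:
    open(path, "x", encoding="utf-8")
    second_x_failed = False
except FileExistsError:
    second_x_failed = True

print(repr(after_append))
print(repr(after_write))
print(second_x_failed)
```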

Text Formats#

A text file contains human-readable characters encoded using a standard like UTF-8 or ASCII. This contrasts with binary files, which store data in machine-readable formats that cannot be read directly by a human. Examples of binary files are JPEG image files, many types of instrumental data files, and pickle files, which are a special way of preserving Python objects. Most of our data files will be in text-based formats because we want the data to be human-readable.

Directory Hierarchy for Read/Writing Files#

When you read (input) or write (output) a file, the default location is the directory you are running the script from. This means files can end up stored all over the place, so we are going to create a directory hierarchy to organize our files as we read and write them. The structure will be programmatically generated in the next several scripts, and at the end of this module we will have four folders in the user directory that we use in this class: the miniconda3 and projects directories we created in the first lesson, and two new directories for files, one for our data and the other a sandbox to play in. We will define variables for the file paths that make it easy to read and write to these locations, no matter where our program is located.

/home/user/
    ├── data/
    │   ├── project_a/
    │   └── project_b/
    ├── miniconda3/
    ├── projects/
    │   ├── py4sci/ (.git repo)
    │   └── other_project/ (.git repo)
    └── sandbox/

There is another reason we are doing this: I am pushing the class files from my py4sci git repository to GitHub, and I want the project data files outside of the git repository so that I do not fill up the repository with a bunch of data files. If you wish to push data files to GitHub, you will need to make another data folder within your git repository.
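
As a rough sketch of how such a hierarchy can be generated programmatically, the following builds the data and sandbox folders with os.makedirs. It is written under one loud assumption: a temporary directory stands in for the home directory, so running it does not change your real home folder. The project_a and project_b names are placeholders taken from the layout above.

```python
import os
import tempfile

# Assumption: a temporary directory stands in for the home directory.
# Replace with os.path.expanduser("~") to build the folders in your real home.
base_dir = tempfile.mkdtemp()

# Sub-directories from the layout (project names are placeholders)
sub_dirs = [
    os.path.join("data", "project_a"),
    os.path.join("data", "project_b"),
    "sandbox",
]

for sub_dir in sub_dirs:
    # exist_ok=True makes the call safe to repeat
    os.makedirs(os.path.join(base_dir, sub_dir), exist_ok=True)

# Walk the tree to confirm what was created
for current_dir, dir_names, _ in os.walk(base_dir):
    print(current_dir, sorted(dir_names))
```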

*.txt Files#

If we want to read a file in the current working directory we would simply use

with open("example.txt", "r", encoding="utf-8") as file:
    content = file.read()
    print(content)

But that code would throw an error as the file does not exist.

In the following activity we first check our home directory (~/) to see if the folder sandbox exists and if it contains a file called chemical_safety_log.txt. Whatever is missing we will create. Throughout this class I will be writing files to two folders, my sandbox and my data folders. These folders are outside of the class git directory, so the files in them will not be pushed to GitHub; they represent files I want you to create yourself rather than clone or pull from GitHub.

Create .txt File in a Sandbox Directory#

Let's look at the following script before running it.

sandbox_dir = os.path.expanduser("~/sandbox/")
sb_file_path = os.path.join(sandbox_dir, "chemical_safety_log.txt")

os.path.expanduser("~") expands the tilde to the full path of your home directory (for example /home/username). Because it is part of os.path, it works on Windows, macOS and Linux systems.

sandbox_dir = os.path.expanduser("~/sandbox/") assigns the full path /home/username/sandbox/ to the variable sandbox_dir.

os.path.join(sandbox_dir, "chemical_safety_log.txt") joins the file name to the directory path, so sb_file_path = os.path.join(sandbox_dir, "chemical_safety_log.txt") assigns the full path of the file in the sandbox (sb) directory to the variable sb_file_path.

NOTE: This script will adjust to different users and operating systems, and so should work without modification for anyone who downloads the script from github.

os.makedirs(sandbox_dir, exist_ok=True)

This line creates the directory given by sandbox_dir on the user's computer if it does not already exist; exist_ok=True prevents an error if the directory is already there.

if not os.path.exists(sb_file_path):
    with open(sb_file_path, "w", encoding="utf-8") as file:
        file.write("Chemical Safety Log\n")
        file.write("="*30 + "\n")
        file.write("Date: 2025-02-13\n")
        file.write("Chemical: Acetone\n")
        file.write("Incident: Small spill in lab. Cleaned with absorbent pads.\n")
        file.write("Preventive Action: Ensure lid is tightly sealed after use.\n\n")
    print("Initial Safety Log created")

Normally an if statement runs its block of code when the condition is true, but here we use the Boolean operator not to negate the condition. So if the path stored in the variable sb_file_path does NOT exist, the block of code is executed, and the block is skipped if the path does exist. In other words, if the file chemical_safety_log.txt already exists in the directory ~/sandbox/ nothing is written; otherwise the file is created and the first incident is written to it.

The final block of code reads the file back. In the complete script below the path variable is named chem_safety_log_sbpath instead of sb_file_path.

import os

# Define file path in ~/sandbox/
# Assign to variable chem_safety_log_sbpath (chemical safety log sandbox path)
sandbox_dir = os.path.expanduser("~/sandbox/")
chem_safety_log_sbpath = os.path.join(sandbox_dir, "chemical_safety_log.txt")

# Ensure the ~/sandbox/ directory exists
os.makedirs(sandbox_dir, exist_ok=True)

# Create the file if it does not exist
if not os.path.exists(chem_safety_log_sbpath):
    with open(chem_safety_log_sbpath, "w", encoding="utf-8") as file:
        file.write("Chemical Safety Log\n")
        file.write("="*30 + "\n")
        file.write("Date: 2025-02-13\n")
        file.write("Chemical: Acetone\n")
        file.write("Incident: Small spill in lab. Cleaned with absorbent pads.\n")
        file.write("Preventive Action: Ensure lid is tightly sealed after use.\n\n")
    print("Initial Safety Log created")

# Now, read the file safely
with open(chem_safety_log_sbpath, "r", encoding="utf-8") as file:
    content = file.read()
    print("File Contents:\n")
    print(content)

File Contents:

Chemical Safety Log
==============================
Date: 2025-02-13
Chemical: Acetone
Incident: Small spill in lab. Cleaned with absorbent pads.
Preventive Action: Ensure lid is tightly sealed after use.

The above code did several things.

- Created the directory sandbox if it did not exist.
- Created the file chemical_safety_log.txt if it did not exist.
- Created a global variable chem_safety_log_sbpath that holds the path to the chemical_safety_log.txt file in the sandbox and can be used by other scripts during this Python session.

Open the File Browser in JupyterLab, navigate to the new folder, and verify that the file chemical_safety_log.txt has been created. You can double-click it to open it as a text file in the JupyterLab interface.

Note, in the jupyter lab the following does not work:

with open("~/sandbox/chemical_safety_log.txt", "r", encoding="utf-8") as file:
    content = file.read()
    print(content)

(If you run that code you get a FileNotFoundError. The next code cell, which expands the ~ first, works.)

import os

file_path = os.path.expanduser("~/sandbox/chemical_safety_log.txt")

with open(file_path, "r", encoding="utf-8") as file:
    content = file.read()
    print(content)

Chemical Safety Log
==============================
Date: 2025-02-13
Chemical: Acetone
Incident: Small spill in lab. Cleaned with absorbent pads.
Preventive Action: Ensure lid is tightly sealed after use.

What is going on? It is clear that the directory and file exist, and the issue is that Python does not automatically expand ~ to the home directory when inside of open(). There are two solutions.

- Use the full path, for example '/home/rebelford/sandbox/chemical_safety_log.txt'.
- Use a variable that already holds the expanded path, such as chem_safety_log_sbpath defined earlier.
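
The behavior described here can be confirmed with a short check. This is only a sketch: it assumes there is no directory literally named ~ in your current working directory, which is what makes the unexpanded path fail.

```python
import os

literal_path = "~/sandbox/chemical_safety_log.txt"
expanded_path = os.path.expanduser(literal_path)

print(literal_path)   # the tilde is still there
print(expanded_path)  # the tilde has been replaced by your home directory

# open() treats "~" as an ordinary directory name in the current working directory
try:
    with open(literal_path, "r", encoding="utf-8") as file:
        file.read()
    literal_open_worked = True
except FileNotFoundError:
    literal_open_worked = False

print(literal_open_worked)
```

The shell expands ~ for you on the command line, but inside Python the expansion only happens when you ask for it with os.path.expanduser().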

# Option 1, be sure to replace rebelford with your username
with open("/home/rebelford/sandbox/chemical_safety_log.txt", "r", encoding="utf-8") as file:
    content = file.read()
    print(content)

Chemical Safety Log
==============================
Date: 2025-02-13
Chemical: Acetone
Incident: Small spill in lab. Cleaned with absorbent pads.
Preventive Action: Ensure lid is tightly sealed after use.

# Option 2, use the variable that holds the expanded path
with open(chem_safety_log_sbpath, "r", encoding="utf-8") as file:
    content = file.read()
    print(content)

Chemical Safety Log
==============================
Date: 2025-02-13
Chemical: Acetone
Incident: Small spill in lab. Cleaned with absorbent pads.
Preventive Action: Ensure lid is tightly sealed after use.

RESTART KERNEL and Clear All Output. Now run the next code cell. It fails with a NameError, even though it just worked, because the variable chem_safety_log_sbpath existed only in memory and was lost when you restarted the kernel. Redefining the variable, as in the cell after it, fixes the problem.

with open(chem_safety_log_sbpath, "r", encoding="utf-8") as file:
    content = file.read()
    print(content)

import os

chem_safety_log_sbpath = os.path.expanduser("~/sandbox/chemical_safety_log.txt")  # Expands '~' to full path

with open(chem_safety_log_sbpath, "r", encoding="utf-8") as file:
    content = file.read()
    print(content)

Chemical Safety Log
==============================
Date: 2025-02-13
Chemical: Acetone
Incident: Small spill in lab. Cleaned with absorbent pads.
Preventive Action: Ensure lid is tightly sealed after use.

Quick Review: We are using the variable chem_safety_log_sbpath so that we can direct our scripts to the file we generated in the sandbox. If we did not do this, we would clutter our working directory with files. It is a best practice to organize where you read from and write to. You can always replace chem_safety_log_sbpath with the actual path, or with just the file name if the file is in the current working directory (the directory where the notebook you are running resides).

Appending a File#

We need to add a new incident to our accident log, and we do so by changing the file access mode from 'w' to 'a'.

# Appending a new safety log entry
# Note, we are using the variable chem_safety_log_sbpath instead of the path to the file
with open(chem_safety_log_sbpath, "a", encoding="utf-8") as file:
    file.write("Date: 2025-02-14\n")
    file.write("Chemical: Hydrochloric Acid (HCl)\n")
    file.write("Incident: Minor exposure on gloves. No injury.\n")
    file.write("Preventive Action: Double-check gloves for leaks before use.\n\n")

print("New entry added to safety log.")

New entry added to safety log.

# Read the file using the predefined variable
with open(chem_safety_log_sbpath, "r", encoding="utf-8") as file:
    content = file.read()
    print(content)

Chemical Safety Log
==============================
Date: 2025-02-13
Chemical: Acetone
Incident: Small spill in lab. Cleaned with absorbent pads.
Preventive Action: Ensure lid is tightly sealed after use.

Date: 2025-02-14
Chemical: Hydrochloric Acid (HCl)
Incident: Minor exposure on gloves. No injury.
Preventive Action: Double-check gloves for leaks before use.

Best Practices#

Use the with statement to ensure the file is properly closed:

with open('datasets_and_images_2022/namd/namd_subset.csv', 'r', encoding='utf-8') as file:
    content = file.read()

Specify the encoding explicitly to avoid potential issues with different systems:

with open('datasets_and_images_2022/namd/namd_subset.csv', 'r', encoding='utf-8') as file:
    content = file.read()

Use appropriate mode for your needs. For reading a CSV file, ‘r’ is sufficient.

Handle potential exceptions, especially FileNotFoundError:

try:
    with open('datasets_and_images_2022/namd/namd_subset.csv', 'r', encoding='utf-8') as file:
        content = file.read()
except FileNotFoundError:
    print("The file was not found.")

Ad Hoc Convention for Naming File Paths#

A file path variable needs to identify two things, the file and the directory it lives in, so we will use a two-part naming convention: the file name, followed by the directory name and the word path. That way the variable name tells you which file it points to and where that file is. For example, for a spectral file called uv_vis1.csv in the data directory, a good name would be uv_vis1_datapath, and the following script creates it:

data_dir = os.path.expanduser("~/data/")
uv_vis1_datapath = os.path.join(data_dir, "uv_vis1.csv")

CSV (Comma-Separated Values)#

CSV files store tabular data (rows and columns) in a plain-text format, where values are separated by commas. It is one of the simplest and most widely used file formats for data exchange.

Best for flat, table-like data (rows and columns).
Lacks structure: everything is a string, requiring manual conversions (e.g., numbers remain strings).
Easy to share, but loses relationships between data (e.g., nested structures are hard to represent).
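
The point about manual conversions can be illustrated with a few lines. This sketch reads CSV text from an in-memory string instead of a file (an assumption made to keep it self-contained) and converts the numeric columns explicitly.

```python
import csv
import io

# In-memory stand-in for a CSV file
csv_text = "Element,Atomic Number,Atomic Mass\nF,9,18.998\nCl,17,35.45\n"

reader = csv.reader(io.StringIO(csv_text))
header = next(reader)   # first row holds the column names

converted = []
for symbol, number, mass in reader:
    # Every field arrives as a string, so convert explicitly
    converted.append([symbol, int(number), float(mass)])

print(header)
print(converted)
```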

csv module (built-in)#

Python has a built-in csv (Comma-Separated Values) module that we will use for working with CSV files. Later we will use features of pandas, but for now we want to stick to built-in Python features.

Method/Function/Object	Description
`csv.reader()`	Creates a reader object for reading CSV data
`csv.writer()`	Creates a writer object for writing CSV data
`csv.DictReader()`	Creates a dictionary-based reader object
`csv.DictWriter()`	Creates a dictionary-based writer object
`csv.Dialect`	Base class for defining CSV dialects
`csv.register_dialect()`	Registers a custom CSV dialect
`csv.get_dialect()`	Retrieves a registered dialect
`csv.list_dialects()`	Lists all registered dialects
`csv.field_size_limit()`	Returns the current maximum field size, and sets a new limit if one is given
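
Several of these functions can be seen working together in a short sketch. The dialect name tsv_demo and the two sample rows are hypothetical, and an in-memory buffer is used in place of a file.

```python
import csv
import io

rows = [
    {"Element": "F", "Atomic Number": 9},
    {"Element": "Cl", "Atomic Number": 17},
]

# Register a hypothetical tab-separated dialect named "tsv_demo"
csv.register_dialect("tsv_demo", delimiter="\t", lineterminator="\n")

# Write dictionaries as rows, with the keys as the header
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Element", "Atomic Number"], dialect="tsv_demo")
writer.writeheader()
writer.writerows(rows)

tsv_text = buffer.getvalue()
print(tsv_text)

# Read it back; each row becomes a dictionary keyed by the header
buffer.seek(0)
read_back = list(csv.DictReader(buffer, dialect="tsv_demo"))
print(read_back)
print("tsv_demo" in csv.list_dialects())
```

Notice that the atomic numbers come back as strings, which is the usual behavior when reading CSV data.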

CSV Activity - Create a Dictionary of Halogen Properties#

We will use an activity to learn how to work with CSV files and the csv module. In this activity we will convert a CSV file into a Python dictionary of dictionaries for the halogens. The keys of the outer dictionary are the element symbols, and each value is a dictionary for that element, whose keys are the names of elemental properties and whose values are the values of those properties.

Overview#

Generate a csv file
Read the CSV file
Convert to a list of lists
Separate the Headers (keys) from the data (values)
Use csv.DictReader to convert each row to a dictionary

1. Create CSV file from list of tuples#

csv.writer()#

import csv
import os

# Define file path in ~/sandbox/
sandbox_dir = os.path.expanduser("~/sandbox/")
halogen_csv_sbpath = os.path.join(sandbox_dir, "halogens.csv")

# Data lists
halogens = ['F', 'Cl', 'Br', 'I', 'At', 'Ts']
atomic_numbers = [9, 17, 35, 53, 85, 117]
atomic_masses = [18.998, 35.45, 79.904, 126.9, 210, 294]
electronegativities = [3.98, 3.16, 2.96, 2.66, 2.2, None]

# Combine into rows
data = list(zip(halogens, atomic_numbers, atomic_masses, electronegativities))  # Convert to list to avoid exhaustion
print(data)  # Debugging: Check if data is structured correctly

# Write to CSV file
with open(halogen_csv_sbpath, mode="w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Element", "Atomic Number", "Atomic Mass", "Electronegativity"])  # Header
    writer.writerows(data)  # Data rows

print(f"CSV file created successfully at: {halogen_csv_sbpath}")

[('F', 9, 18.998, 3.98), ('Cl', 17, 35.45, 3.16), ('Br', 35, 79.904, 2.96), ('I', 53, 126.9, 2.66), ('At', 85, 210, 2.2), ('Ts', 117, 294, None)]
CSV file created successfully at: /home/rebelford/sandbox/halogens.csv

2. Read csv file#

Restart your kernel and clear all output.

csv.reader()#

print each row#

import csv
import os

# Define file path in ~/sandbox/
sandbox_dir = os.path.expanduser("~/sandbox/")
halogen_csv_sbpath = os.path.join(sandbox_dir, "halogens.csv")

with open(halogen_csv_sbpath, mode="r", newline="") as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)  # Each row is a list
print(type(reader))

['Element', 'Atomic Number', 'Atomic Mass', 'Electronegativity']
['F', '9', '18.998', '3.98']
['Cl', '17', '35.45', '3.16']
['Br', '35', '79.904', '2.96']
['I', '53', '126.9', '2.66']
['At', '85', '210', '2.2']
['Ts', '117', '294', '']
<class '_csv.reader'>

Create a list of lists representing the csv file#

Now that we know how to create a csv reader object, we can convert it to a list of lists, with one inner list per row.

with open(halogen_csv_sbpath, mode="r", newline="") as file:
    reader = csv.reader(file)
    halogens_data = list(reader)

# Display the data

print(halogens_data)
print(type(halogens_data))

[['Element', 'Atomic Number', 'Atomic Mass', 'Electronegativity'], ['F', '9', '18.998', '3.98'], ['Cl', '17', '35.45', '3.16'], ['Br', '35', '79.904', '2.96'], ['I', '53', '126.9', '2.66'], ['At', '85', '210', '2.2'], ['Ts', '117', '294', '']]
<class 'list'>

Create lists for keys and values#

with open(halogen_csv_sbpath, mode="r", newline="") as file:
    reader = csv.reader(file)
    header = next(reader)  # Reads the first row (header)
    halogens_data = list(reader)  # Reads the remaining rows
print(header,'\n')    
print(halogens_data)

['Element', 'Atomic Number', 'Atomic Mass', 'Electronegativity'] 

[['F', '9', '18.998', '3.98'], ['Cl', '17', '35.45', '3.16'], ['Br', '35', '79.904', '2.96'], ['I', '53', '126.9', '2.66'], ['At', '85', '210', '2.2'], ['Ts', '117', '294', '']]
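
Once the header and the data rows are separated, each row can be paired with the header by hand. The sketch below does this with zip() for the first two rows shown above. It is only meant to illustrate the idea that csv.DictReader automates in the next step.

```python
header = ["Element", "Atomic Number", "Atomic Mass", "Electronegativity"]
halogens_data = [
    ["F", "9", "18.998", "3.98"],
    ["Cl", "17", "35.45", "3.16"],
]

# Pair each header entry with the value in the same column
row_dicts = [dict(zip(header, row)) for row in halogens_data]

for row_dict in row_dicts:
    print(row_dict)
```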

3. Convert Each Row to a Dictionary#

csv.DictReader()#

with open(halogen_csv_sbpath, mode="r", newline="") as file:
    reader = csv.DictReader(file)
    for row in reader:
        print(row)  # Each row is a dictionary

{'Element': 'F', 'Atomic Number': '9', 'Atomic Mass': '18.998', 'Electronegativity': '3.98'}
{'Element': 'Cl', 'Atomic Number': '17', 'Atomic Mass': '35.45', 'Electronegativity': '3.16'}
{'Element': 'Br', 'Atomic Number': '35', 'Atomic Mass': '79.904', 'Electronegativity': '2.96'}
{'Element': 'I', 'Atomic Number': '53', 'Atomic Mass': '126.9', 'Electronegativity': '2.66'}
{'Element': 'At', 'Atomic Number': '85', 'Atomic Mass': '210', 'Electronegativity': '2.2'}
{'Element': 'Ts', 'Atomic Number': '117', 'Atomic Mass': '294', 'Electronegativity': ''}

4. Create Single Key Dictionary of Dictionaries for each Halogen with symbols as keys#

import csv

# Initialize an empty dictionary to store each halogen's data
halogens_dict = {}

# Open and read the CSV file
with open(halogen_csv_sbpath, mode="r", newline="") as file:
    # Create a DictReader with specified field names
    reader = csv.DictReader(file, fieldnames=["symbol", "Atomic Number", "Atomic Mass", "Electronegativity"])
    next(reader)  # Skip the header row, which would otherwise be read as data
    for row in reader:
        # Use the symbol as the key and store the rest of the data as a dictionary
        symbol = row.pop("symbol")
        halogens_dict[symbol] = row

# Display the data
for symbol, data in halogens_dict.items():
    print(f"{symbol}: {data}")

F: {'Atomic Number': '9', 'Atomic Mass': '18.998', 'Electronegativity': '3.98'}
Cl: {'Atomic Number': '17', 'Atomic Mass': '35.45', 'Electronegativity': '3.16'}
Br: {'Atomic Number': '35', 'Atomic Mass': '79.904', 'Electronegativity': '2.96'}
I: {'Atomic Number': '53', 'Atomic Mass': '126.9', 'Electronegativity': '2.66'}
At: {'Atomic Number': '85', 'Atomic Mass': '210', 'Electronegativity': '2.2'}
Ts: {'Atomic Number': '117', 'Atomic Mass': '294', 'Electronegativity': ''}

5. Combine into a Dictionary of Dictionaries#

import csv

# create empty dictionary
halogen_dict = {}

with open(halogen_csv_sbpath, mode="r", newline="") as file:
    reader = csv.DictReader(file)
    for row in reader:
        symbol = row.pop("Element")  # Extract the symbol and remove it from the row dictionary
        halogen_dict[symbol] = row   # Assign the remaining data to the symbol key

print(halogen_dict)

{'F': {'Atomic Number': '9', 'Atomic Mass': '18.998', 'Electronegativity': '3.98'}, 'Cl': {'Atomic Number': '17', 'Atomic Mass': '35.45', 'Electronegativity': '3.16'}, 'Br': {'Atomic Number': '35', 'Atomic Mass': '79.904', 'Electronegativity': '2.96'}, 'I': {'Atomic Number': '53', 'Atomic Mass': '126.9', 'Electronegativity': '2.66'}, 'At': {'Atomic Number': '85', 'Atomic Mass': '210', 'Electronegativity': '2.2'}, 'Ts': {'Atomic Number': '117', 'Atomic Mass': '294', 'Electronegativity': ''}}

print(f"The atomic mass of Iodine is {halogen_dict['I']['Atomic Mass']} amu.")

The atomic mass of Iodine is 126.9 amu.

In the next activity we will convert a halogen dictionary of dictionaries to a json file that will allow us to store a functioning dictionary of halogen properties.

JSON (JavaScript Object Notation)#

Supports dictionaries, lists, numbers, and strings.
Maintains data structure (nested objects, hierarchy).
Works across different languages (Python, JavaScript, C, etc.).
Limitation: it does not support Python-specific objects such as sets, and tuples are saved as lists.
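
The limitation with tuples and sets can be checked directly. In this small sketch the values are illustrative only: a tuple survives the round trip but comes back as a list, while a set cannot be serialized by the default encoder at all.

```python
import json

# A tuple is written as a JSON array and comes back as a list
isotopes = (35, 37)
restored = json.loads(json.dumps(isotopes))
print(restored, type(restored))

# A set has no JSON equivalent, so json.dumps raises TypeError
try:
    json.dumps({"F", "Cl"})
    set_supported = True
except TypeError as error:
    set_supported = False
    print("TypeError:", error)
```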

JSON Structure#

{
  "chemicals": [
    {"name": "Water", "melting_point": 0, "boiling_point": 100},
    {"name": "Ethanol", "melting_point": -114, "boiling_point": 78},
    {"name": "Acetone", "melting_point": -95, "boiling_point": 56}
  ]
}

Uses key-value pairs.
Supports hierarchical data (nested dictionaries and lists).
More flexible than CSV.
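
To see how this hierarchy maps onto Python objects, the sketch below parses the JSON text shown above with json.loads and walks through the nested list of dictionaries. The source does not state units for the temperatures, so the printed message just says "degrees".

```python
import json

json_text = """
{
  "chemicals": [
    {"name": "Water", "melting_point": 0, "boiling_point": 100},
    {"name": "Ethanol", "melting_point": -114, "boiling_point": 78},
    {"name": "Acetone", "melting_point": -95, "boiling_point": 56}
  ]
}
"""

data = json.loads(json_text)  # outer dict -> list -> inner dicts

for chemical in data["chemicals"]:
    liquid_range = chemical["boiling_point"] - chemical["melting_point"]
    print(f"{chemical['name']}: liquid over a range of {liquid_range} degrees")
```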

json module (built-in)#

Method/Function	Description
`json.dumps()`	Serialize Python object to a JSON formatted string
`json.loads()`	Deserialize JSON string to a Python object
`json.dump()`	Serialize Python object to a JSON formatted stream (file)
`json.load()`	Deserialize JSON formatted stream (file) to a Python object
`json.JSONEncoder`	Base class for custom JSON encoders
`json.JSONDecoder`	Base class for custom JSON decoders
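
As a sketch of how these pieces fit together, the example below uses json.dumps and json.loads for an in-memory round trip and subclasses json.JSONEncoder so that a set can be written as a sorted list. The class name SetEncoder and the sample dictionary are hypothetical and exist only for this illustration.

```python
import json

class SetEncoder(json.JSONEncoder):
    """Hypothetical encoder that writes sets as sorted lists."""

    def default(self, obj):
        if isinstance(obj, set):
            return sorted(obj)
        return super().default(obj)  # fall back to the usual TypeError

group = {"name": "halogens", "symbols": {"F", "Cl", "Br"}}

json_string = json.dumps(group, cls=SetEncoder)  # Python object -> str
print(json_string)

decoded = json.loads(json_string)  # str -> Python object (the set is now a list)
print(decoded["symbols"])
```

The set does not come back as a set. Restoring it would require converting the list again after loading, or a custom decoder.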

Writing to a json file#

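import json

# chemical_safety_data stands for any JSON-serializable object;
# here we reuse the halogen_dict built in the CSV activity above
chemical_safety_data = halogen_dict

with open("chemical_safety.json", "w", encoding="utf-8") as file:
    json.dump(chemical_safety_data, file, indent=4)

print("Data saved to chemical_safety.json")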

The JSON Activity below reads the csv file we made earlier and converts it to a json file.

Reading a json file#

with open("chemical_safety.json", "r", encoding="utf-8") as file:
    data = json.load(file)
    print(data)

JSON Activity#

In this activity we will:

Convert the halogens.csv file to a dictionary
Save the dictionary as a json file
Open the json file as a dictionary

import os
import csv
import json
# Define file path in ~/sandbox/
sandbox_dir = os.path.expanduser("~/sandbox/")
halogen_csv_sbpath = os.path.join(sandbox_dir, "halogens.csv")
halogen_json_sbpath = os.path.join(sandbox_dir, "halogens.json")

# Halogen name mappings (to match symbol to full element name)
halogen_names = {
    "F": "Fluorine",
    "Cl": "Chlorine",
    "Br": "Bromine",
    "I": "Iodine",
    "At": "Astatine",
    "Ts": "Tennessine"
}

# Read CSV and convert to a dictionary of dictionaries
halogen_data = {}
with open(halogen_csv_sbpath, mode="r", newline="", encoding="utf-8") as csv_file:
    reader = csv.DictReader(csv_file, delimiter=",")  # Use ',' if comma-separated
    for row in reader:
        element_name = halogen_names[row["Element"]]  # Convert symbol to full name
        halogen_data[element_name] = {
            "symbol": row["Element"],  # Include symbol inside the dictionary
            "atomic_number": int(row["Atomic Number"]),
            "atomic_mass": float(row["Atomic Mass"]),
            "electronegativity": float(row["Electronegativity"]) if row["Electronegativity"] else None
        }
print(halogen_data)
# Write to JSON
with open(halogen_json_sbpath, mode="w", encoding="utf-8") as json_file:
    json.dump(halogen_data, json_file, indent=4)

print(f"JSON file '{halogen_json_sbpath}' has been created.")

{'Fluorine': {'symbol': 'F', 'atomic_number': 9, 'atomic_mass': 18.998, 'electronegativity': 3.98}, 'Chlorine': {'symbol': 'Cl', 'atomic_number': 17, 'atomic_mass': 35.45, 'electronegativity': 3.16}, 'Bromine': {'symbol': 'Br', 'atomic_number': 35, 'atomic_mass': 79.904, 'electronegativity': 2.96}, 'Iodine': {'symbol': 'I', 'atomic_number': 53, 'atomic_mass': 126.9, 'electronegativity': 2.66}, 'Astatine': {'symbol': 'At', 'atomic_number': 85, 'atomic_mass': 210.0, 'electronegativity': 2.2}, 'Tennessine': {'symbol': 'Ts', 'atomic_number': 117, 'atomic_mass': 294.0, 'electronegativity': None}}
JSON file '/home/rebelford/sandbox/halogens.json' has been created.

Important note: the data in a csv file is all strings, so when we built the dictionary for the json file we converted the atomic numbers to integers and the atomic masses and electronegativities to floats (an empty electronegativity becomes None).
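
Unlike CSV, JSON keeps these types when the data is saved and reloaded. The short sketch below checks this in memory for the tennessine entry, whose missing electronegativity is written as null and read back as None.

```python
import json

tennessine = {
    "symbol": "Ts",
    "atomic_number": 117,
    "atomic_mass": 294.0,
    "electronegativity": None,
}

json_text = json.dumps(tennessine)
print(json_text)  # None is written as null

restored = json.loads(json_text)
print(type(restored["atomic_number"]), type(restored["atomic_mass"]))
print(restored["electronegativity"])
```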

Now that you have created the json file, restart the kernel and clear the output of all cells.

import os
import json
# Define file path in ~/sandbox/
sandbox_dir = os.path.expanduser("~/sandbox/")
halogen_json_sbpath = os.path.join(sandbox_dir, "halogens.json")

with open(halogen_json_sbpath, "r") as file:
    halogen_dict = json.load(file)
    print(halogen_dict,'\n')

{'Fluorine': {'symbol': 'F', 'atomic_number': 9, 'atomic_mass': 18.998, 'electronegativity': 3.98}, 'Chlorine': {'symbol': 'Cl', 'atomic_number': 17, 'atomic_mass': 35.45, 'electronegativity': 3.16}, 'Bromine': {'symbol': 'Br', 'atomic_number': 35, 'atomic_mass': 79.904, 'electronegativity': 2.96}, 'Iodine': {'symbol': 'I', 'atomic_number': 53, 'atomic_mass': 126.9, 'electronegativity': 2.66}, 'Astatine': {'symbol': 'At', 'atomic_number': 85, 'atomic_mass': 210.0, 'electronegativity': 2.2}, 'Tennessine': {'symbol': 'Ts', 'atomic_number': 117, 'atomic_mass': 294.0, 'electronegativity': None}} 

print(halogen_dict.keys(),"\n")

# All inner dictionaries have the same keys, so the first one is enough
properties = next(iter(halogen_dict.values()))

print("\nProperties:", list(properties.keys()))
print(f"\nThe atomic mass of Chlorine is {halogen_dict['Chlorine']['atomic_mass']} amu.")

dict_keys(['Fluorine', 'Chlorine', 'Bromine', 'Iodine', 'Astatine', 'Tennessine']) 

Properties: ['symbol', 'atomic_number', 'atomic_mass', 'electronegativity']

The atomic mass of Chlorine is 35.45 amu.

Pickle Files (.pkl, Python Object Serialization)#

Can store most Python objects (dictionaries, lists, NumPy arrays, custom objects).
Often faster and more compact than JSON because it is a binary format.
Not human-readable and not cross-language compatible.
Security risk: Do not load a Pickle file from an untrusted source, it could execute malicious code.
Great for saving Python-specific data efficiently (e.g., Pandas DataFrames, machine learning models).
Not a universal format (not for data sharing).
Only use Pickle when working within Python projects.
Pickle does not preserve variable names; pickle a dictionary to preserve names.
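
The security warning above follows from how pickle works: a pickle can name a function for the loader to call. The sketch below is a harmless demonstration with a hypothetical class called Sneaky, whose pickle asks the loader to call the built-in len. A malicious file could name a far more dangerous function in the same way.

```python
import pickle

class Sneaky:
    """Hypothetical class whose pickle asks the loader to call a function."""

    def __reduce__(self):
        # (callable, arguments): pickle.loads will call len("payload")
        return (len, ("payload",))

data = pickle.dumps(Sneaky())
result = pickle.loads(data)  # the call happens here, during loading

print(result)  # the return value of len("payload"), not a Sneaky object
```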

Function	Description
`dump(obj, file)`	Write a pickled representation of obj to the open file object
`dumps(obj)`	Return the pickled representation of the object as a bytes object
`load(file)`	Read a pickled object from the open file object
`loads(bytes_object)`	Read a pickled object from a bytes object
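
A quick way to try these functions without touching the file system is an in-memory round trip with dumps and loads. In this sketch the sample record is made up for illustration. Note that, unlike JSON, pickle returns the tuple and the set with their original types.

```python
import pickle

record = {"symbols": ("F", "Cl"), "atomic_numbers": {9, 17}}

payload = pickle.dumps(record)    # Python object -> bytes
restored = pickle.loads(payload)  # bytes -> Python object

print(type(payload))
print(restored == record)
print(type(restored["symbols"]), type(restored["atomic_numbers"]))
```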

When unpickling, you do not need to manually re-import the modules that define your objects, but those modules must be available and compatible in the unpickling environment. The pickle module handles the import automatically, but it stores only a reference to the module and class, not the module's code.
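
This can be observed with an object from the standard library. In the sketch below the date is an arbitrary example value: the pickled bytes contain the name of the datetime module, and loading succeeds because that module can be imported in the current environment.

```python
import datetime
import pickle

measured_on = datetime.date(2025, 1, 15)  # arbitrary example date
payload = pickle.dumps(measured_on)

# The bytes name the module and class, but do not contain their code
print(b"datetime" in payload)

restored = pickle.loads(payload)  # works because datetime can be imported here
print(restored)
```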

Using a Dictionary to Preserve Names#

When you unpickle a list of objects, the original variable names are not preserved. The pickle module only serializes the object data, not the variable names used to refer to those objects. Therefore, there’s no direct way to retrieve the original variable names from the unpickled list.

However, you can achieve a similar result by pickling a dictionary instead of a list. Each object is then associated with a meaningful key that represents its name. The general pattern is:

Pickle the Dictionary#

import os
import pickle

# Define the Pickle file path (placeholder path; replace with your own)
pickle_file_path = os.path.expanduser("~/filepath/filename.pkl")

# Save the dictionary as a Pickle file
# (my_dict stands for the dictionary you want to save)
with open(pickle_file_path, "wb") as pickle_file:
    pickle.dump(my_dict, pickle_file)

print(f"Periodic table data pickled at: {pickle_file_path}")

Unpickle and Reload Later#

# Load the Pickle file
with open(pickle_file_path, "rb") as pickle_file:
    unpickled_file = pickle.load(pickle_file)
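
The pattern above can be tried end to end with a small runnable sketch. Here two lists are stored under keys that record their names, the dictionary is pickled to a temporary file, and the names are recovered on loading. The variable names and the temporary location are assumptions made for this illustration.

```python
import os
import pickle
import tempfile

atomic_numbers = [9, 17, 35]
symbols = ["F", "Cl", "Br"]

# Keys record the names that a plain list would lose
named_objects = {"atomic_numbers": atomic_numbers, "symbols": symbols}

# Temporary location used only for this illustration
pickle_file_path = os.path.join(tempfile.mkdtemp(), "named_objects.pkl")

with open(pickle_file_path, "wb") as pickle_file:
    pickle.dump(named_objects, pickle_file)

with open(pickle_file_path, "rb") as pickle_file:
    unpickled = pickle.load(pickle_file)

for name, value in unpickled.items():
    print(name, "->", value)
```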

See code below for examples.

import os
import pickle
import json

# Define paths in ~/sandbox/
sandbox_dir = os.path.expanduser("~/sandbox/")
halogen_pkl_sbpath = os.path.join(sandbox_dir, "halogen.pkl")
halogen_json_sbpath = os.path.join(sandbox_dir, "halogens.json")

# Load the dictionary from the JSON file
with open(halogen_json_sbpath, "r") as file:
    halogen_dict = json.load(file)
    print(halogen_dict,'\n')

# Save the dictionary as a Pickle file
with open(halogen_pkl_sbpath, "wb") as pickle_file:
    pickle.dump(halogen_dict, pickle_file)

print(f"Periodic table data pickled at: {halogen_pkl_sbpath}")

{'Fluorine': {'symbol': 'F', 'atomic_number': 9, 'atomic_mass': 18.998, 'electronegativity': 3.98}, 'Chlorine': {'symbol': 'Cl', 'atomic_number': 17, 'atomic_mass': 35.45, 'electronegativity': 3.16}, 'Bromine': {'symbol': 'Br', 'atomic_number': 35, 'atomic_mass': 79.904, 'electronegativity': 2.96}, 'Iodine': {'symbol': 'I', 'atomic_number': 53, 'atomic_mass': 126.9, 'electronegativity': 2.66}, 'Astatine': {'symbol': 'At', 'atomic_number': 85, 'atomic_mass': 210.0, 'electronegativity': 2.2}, 'Tennessine': {'symbol': 'Ts', 'atomic_number': 117, 'atomic_mass': 294.0, 'electronegativity': None}} 

Periodic table data pickled at: /home/rebelford/sandbox/halogen.pkl

Unpickle file#

Now that you have pickled the file, let’s unpickle it, but first restart the kernel.

import os
import pickle

# Define path in ~/sandbox/
sandbox_dir = os.path.expanduser("~/sandbox/")
halogen_pkl_sbpath = os.path.join(sandbox_dir, "halogen.pkl")

# Load the Pickle file
with open(halogen_pkl_sbpath, "rb") as pickle_file:
    unpickled_halogens = pickle.load(pickle_file)

# Display the full dictionary
print(unpickled_halogens)
# Oxygen is not a halogen, so .get() returns the default message
print("\nData for Oxygen (from Pickle):")
print(unpickled_halogens.get("Oxygen", "Element not found!"))
print("\nData for chlorine (from Pickle):")
print(unpickled_halogens.get("Chlorine", "Element not found!"))
print("\nElectronegativity data for chlorine (from Pickle):")
print(unpickled_halogens['Chlorine']['electronegativity'])

{'Fluorine': {'symbol': 'F', 'atomic_number': 9, 'atomic_mass': 18.998, 'electronegativity': 3.98}, 'Chlorine': {'symbol': 'Cl', 'atomic_number': 17, 'atomic_mass': 35.45, 'electronegativity': 3.16}, 'Bromine': {'symbol': 'Br', 'atomic_number': 35, 'atomic_mass': 79.904, 'electronegativity': 2.96}, 'Iodine': {'symbol': 'I', 'atomic_number': 53, 'atomic_mass': 126.9, 'electronegativity': 2.66}, 'Astatine': {'symbol': 'At', 'atomic_number': 85, 'atomic_mass': 210.0, 'electronegativity': 2.2}, 'Tennessine': {'symbol': 'Ts', 'atomic_number': 117, 'atomic_mass': 294.0, 'electronegativity': None}}

Data for Oxygen (from Pickle):
Element not found!

Data for chlorine (from Pickle):
{'symbol': 'Cl', 'atomic_number': 17, 'atomic_mass': 35.45, 'electronegativity': 3.16}

Electronegativity data for chlorine (from Pickle):
3.16

Download csv file from PubChem Periodic Table#

Code to download table#

The following code creates a data directory in your home directory (~/data), where we will keep real data files this semester. It then creates a subdirectory called pubchem_data (stored in the variable pubchem_data_dir), where we will keep files dealing with PubChem. Into it we will download a csv file of 17 different properties of the elements in the periodic table. To do this we will use the built-in urllib.request module (the requests module will be discussed in the next session) and the pandas package, which we will be using in the near future.
If you navigate to the PubChem Periodic Table you will see a download tab that takes you to a series of Machine-Readable Periodic Table Data files. You can obtain the link to the csv file by right-clicking on the CSV-Save file and choosing ‘copy link address’; in the code cell below we have assigned this link to the variable “file_url”.

If you have not installed pandas you need to do so now.

Install Pandas#

Pandas is a very powerful data science package that we will be using extensively as soon as we finish this primer tutorial on Python. If you have not already installed it in your class environment, open the Ubuntu terminal, activate your virtual environment (changing py4sci to the name of your environment), and install pandas:

conda activate py4sci
conda install -c conda-forge pandas

We will then use Pandas to print the first 5 rows of the file.

import os
import urllib.request
import pandas as pd

# Define the directory structure
base_data_dir = os.path.expanduser("~/data")  # Parent directory
pubchem_data_dir = os.path.join(base_data_dir, "pubchem_data")  # Subdirectory for PubChem
os.makedirs(pubchem_data_dir, exist_ok=True)  # Ensure directories exist

# Define file URL and local path
file_url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/periodictable/CSV?response_type=save&response_basename=PubChemElements_all"
local_file_path = os.path.join(pubchem_data_dir, "PubChemElements_all.csv")

# Download the file
print(f"Downloading PubChem CSV to: {local_file_path} ...")
urllib.request.urlretrieve(file_url, local_file_path)
print("Download complete!")

# Verify if the file was saved
if os.path.exists(local_file_path):
    print(f"File successfully saved at: {local_file_path}")

    # Load into Pandas DataFrame
    df = pd.read_csv(local_file_path)
    print("\nFirst few rows of the dataset:")
    print(df.head())  # Display first few rows

else:
    print("Download failed!")

Downloading PubChem CSV to: /home/rebelford/data/pubchem_data/PubChemElements_all.csv ...

Download complete!
File successfully saved at: /home/rebelford/data/pubchem_data/PubChemElements_all.csv

First few rows of the dataset:
   AtomicNumber Symbol       Name  AtomicMass CPKHexColor  \
           1      H   Hydrogen    1.008000      FFFFFF   
           2     He     Helium    4.002600      D9FFFF   
           3     Li    Lithium    7.000000      CC80FF   
           4     Be  Beryllium    9.012183      C2FF00   
           5      B      Boron   10.810000      FFB5B5   

  ElectronConfiguration  Electronegativity  AtomicRadius  IonizationEnergy  \
                 1s1               2.20         120.0            13.598   
                 1s2                NaN         140.0            24.587   
             [He]2s1               0.98         182.0             5.392   
             [He]2s2               1.57         153.0             9.323   
         [He]2s2 2p1               2.04         192.0             8.298   

   ElectronAffinity OxidationStates StandardState  MeltingPoint  BoilingPoint  \
           0.754          +1, -1           Gas         13.81         20.28   
             NaN               0           Gas          0.95          4.22   
           0.618              +1         Solid        453.65       1615.00   
             NaN              +2         Solid       1560.00       2744.00   
           0.277              +3         Solid       2348.00       4273.00   

    Density            GroupBlock YearDiscovered  
0.000090              Nonmetal           1766  
0.000179             Noble gas           1868  
0.534000          Alkali metal           1817  
1.850000  Alkaline earth metal           1798  
2.370000             Metalloid           1808  

Assignment:#

Open Workbook 4, download the periodic table from PubChem, and then make yourself a data dictionary like the one you made above for four properties of the halogens, but now do it for the entire periodic table with the 17 properties in the csv file you just downloaded. Once done, save it as a JSON file and upload it to your class Google Drive.

Acknowledgements#

This content was developed with assistance from Perplexity AI and ChatGPT. Multiple queries were made during the Fall 2024 and Spring 2025 semesters.