What is a bit vector?
Recall from the last activity that we initially downloaded PubChem fingerprints as Base64-encoded strings. We then decoded them into bit strings, which are human-readable sequences of 1s and 0s. PubChem stores fingerprints in Base64 format because it allows for safer and more efficient transmission through application programming interfaces (APIs) like PUG REST. Additionally, databases can handle text-based formats like Base64 more easily than raw binary data.
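As a quick reminder of what that decoding step does, here is a minimal sketch using only Python's standard library. The two bytes in `raw` are arbitrary values chosen for illustration, not a real fingerprint.

```python
from base64 import b64encode, b64decode

raw = bytes([0b10110000, 0b00001111])      # two arbitrary bytes, not a real fingerprint
encoded = b64encode(raw).decode("ascii")   # plain text that is safe to send through an API
decoded = b64decode(encoded)               # back to the original raw bytes

# write each byte as eight binary digits and join them into one bit string
bitstring = "".join("{:08b}".format(byte) for byte in decoded)

print("Base64 text:", encoded)
print("Bit string: ", bitstring)
```

The Base64 text contains only printable characters, which is why it travels well through web requests, while the bit string shows the same data one bit at a time.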
A bit vector is a data structure designed to efficiently store and manipulate binary data. While a bit string is human readable, it isn’t optimized for computation. A bit vector allows the computer to quickly determine which bits are set on or off, or to compare two fingerprints for differences.
You can think of a bit string as a literal sequence of 1s and 0s. It has no structure or behavior. A bit vector, in contrast, is like a structured container in which each bit is stored in its own box, along with information about what each box holds. These boxes of data can be acted upon computationally, which we will explore in the Molecular Similarity Part 2 notebook.
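To make the "boxes" idea concrete, the sketch below uses a plain Python integer as a tiny bit vector. This is only an illustration of the concept; it is not how RDKit stores its bits, and the helper names `set_bit` and `get_bit` are made up for this example.

```python
# A Python int used as a tiny bit vector (illustration only, not RDKit's implementation)
n_bits = 16
bits = 0                            # start with every bit off

def set_bit(vec, i):
    """Return vec with bit i switched on."""
    return vec | (1 << i)

def get_bit(vec, i):
    """Return True if bit i is on."""
    return (vec >> i) & 1 == 1

for position in (1, 4, 9):          # arbitrary positions chosen for the example
    bits = set_bit(bits, position)

# list the positions of all bits that are on
on_bits = [i for i in range(n_bits) if get_bit(bits, i)]

print("Is bit 4 on?", get_bit(bits, 4))
print("Is bit 5 on?", get_bit(bits, 5))
print("On bits:", on_bits)
```

Each bit can be asked about directly by its position, which is something a bit string only offers by treating the data as text.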
Bit Vectors in RDKit
Instead of using a bit string, RDKit uses bit vectors to store molecular fingerprints. Each 1 and 0 in a bit vector takes up just 1 bit of memory, whereas storing the same data as a bit string requires much more memory because each character typically uses a full byte (1 byte = 8 bits).
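The arithmetic behind this can be sketched in a few lines. The example counts only the data itself (one byte per character versus eight bits per byte) and ignores any per-object overhead that Python or RDKit adds; the all-zero bit string is just a placeholder.

```python
n_bits = 1024                       # same length as the Morgan fingerprint in this notebook
bitstring = "0" * n_bits            # placeholder bit string with every bit off

bytes_as_text = len(bitstring.encode("ascii"))   # one byte per character
bytes_packed = (n_bits + 7) // 8                 # eight bits per byte, rounded up

print("Stored as characters:", bytes_as_text, "bytes")
print("Stored as packed bits:", bytes_packed, "bytes")
print("Ratio:", bytes_as_text / bytes_packed)
```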
Another key advantage is that the bit vector allows for efficient indexing and manipulation of the individual bits. This makes comparing two fingerprints very fast and memory efficient.
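The following sketch hints at why comparison is fast when bits are stored as bits. It uses Python integers and bitwise operators rather than RDKit, and the two fingerprints are short made-up patterns, not real molecules.

```python
# Two short made-up fingerprints, written as bit strings and packed into integers
fp_a = int("1011001110", 2)
fp_b = int("1001011010", 2)

shared = fp_a & fp_b        # bits that are on in both fingerprints
differing = fp_a ^ fp_b     # bits that are on in exactly one fingerprint

n_shared = bin(shared).count("1")
n_differing = bin(differing).count("1")

print("Bits on in both:", n_shared)
print("Bits that differ:", n_differing)
```

A single bitwise operation handles every position at once, whereas comparing two bit strings would mean stepping through them character by character.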
In the last activity, you generated MACCS keys and Morgan fingerprints using RDKit’s built-in functions. You also downloaded PubChem fingerprints. In this notebook, we will review the code and explore the rdkit.DataStructs package for working directly with bit vectors.
# generate MACCS keys from a SMILES string
from rdkit import Chem
from rdkit.Chem import MACCSkeys
SMILES ="CC(C)C1=C(C(=C(N1CC[C@H](C[C@H](CC(=O)O)O)O)C2=CC=C(C=C2)F)C3=CC=CC=C3)C(=O)NC4=CC=CC=C4" # atorvastatin
mol = Chem.MolFromSmiles(SMILES)
fp_MACCS = MACCSkeys.GenMACCSKeys(mol)
print(type(fp_MACCS))
# generate Morgan Fingerprint of same SMILES string
from rdkit.Chem import rdFingerprintGenerator
morgan_gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=1024)
fp_morgan = morgan_gen.GetFingerprint(mol)
print(type(fp_morgan))
What class are the MACCS keys and Morgan fingerprints stored as?
Let’s compare the memory usage of a bit string versus a bit vector for the Morgan fingerprint we generated:
import sys
bitstring = fp_morgan.ToBitString()
print("The bit vector is of type",type(fp_morgan))
print("The bit string is of type",type(bitstring))
print("The bit string is human readable as", bitstring)
print("Bit Vector size (bytes):", sys.getsizeof(fp_morgan))
print("Bit String size (bytes):", sys.getsizeof(bitstring))
Does the code above show that we are using less memory by using a Bit Vector?
Let’s review code for downloading and converting a PubChem fingerprint to a bit vector:
# download the PubChem fingerprint as base64 encoded text
import requests
prolog = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
url = prolog + "/compound/smiles/" + SMILES + "/property/Fingerprint2D/TXT"
res = requests.get(url)
pcfp_base64 = res.text #pcfp = PubChemFingerPrint
print(pcfp_base64)
print(type(pcfp_base64))
from base64 import b64decode
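def PCFP_BitString(pcfp_base64):
    # decode Base64 to bytes, write each byte as 8 binary digits, then keep only
    # the 881 fingerprint bits (skip the 32-bit length prefix and the trailing padding)
    pcfp_bitstring = "".join(["{:08b}".format(x) for x in b64decode(pcfp_base64)])[32:913]
    return pcfp_bitstring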
pcfp_bitstring = PCFP_BitString(pcfp_base64) # use the user defined function to convert the PubChem FingerPrint to a bitstring
print("The number of bits in the bitstring is", len(pcfp_bitstring))
print("The bit string is human readable as", pcfp_bitstring)
from rdkit import DataStructs
PCFP_bitvect = DataStructs.CreateFromBitString(PCFP_BitString(pcfp_base64)) #direct from base64
print(type(PCFP_bitvect))
print("Bit Vector size (bytes):", sys.getsizeof(PCFP_bitvect))
print("Bit String size (bytes):", sys.getsizeof(pcfp_bitstring))
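The [32:913] slice in PCFP_BitString can look arbitrary, so here is a sketch of where the numbers come from. It assumes the layout described in PubChem's fingerprint specification: a 4-byte prefix holding the bit length (881), followed by the fingerprint bits padded up to a whole number of bytes. The payload below is synthetic (all zeros) and `fake_base64` is a made-up name; no real compound is involved.

```python
from base64 import b64encode, b64decode

n_bits = 881
prefix = n_bits.to_bytes(4, "big")     # assumed 4-byte length prefix
payload = bytes(111)                   # 111 bytes = 888 bits, all zero (synthetic data)
fake_base64 = b64encode(prefix + payload).decode("ascii")

# same decoding step as PCFP_BitString, but without the slice
all_bits = "".join("{:08b}".format(byte) for byte in b64decode(fake_base64))

declared_length = int(all_bits[:32], 2)                 # the prefix read back as a number
fingerprint_bits = all_bits[32:32 + declared_length]    # equivalent to [32:913]
padding_bits = len(all_bits) - 32 - declared_length

print("Total decoded bits:", len(all_bits))
print("Declared fingerprint length:", declared_length)
print("Fingerprint bits kept:", len(fingerprint_bits))
print("Padding bits dropped:", padding_bits)
```

Under that assumption, the first 32 bits are bookkeeping, the next 881 are the fingerprint, and the last 7 only fill out the final byte.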
What are bit vectors?
Why are they useful?
Which fingerprints are automatically generated as bit vectors in RDKit and which fingerprints need to be converted to bit vectors?
Note: The exact byte sizes reported by sys.getsizeof() may vary with the operating system, Python version, and RDKit build.
While the absolute byte size may vary, the trend is always the same. Storing 1024 Morgan fingerprint bits (or 881 PubChem or 167 MACCS keys) as a bit vector will always be more memory-efficient than storing as a bit string.
Run the next code cell.
print("Bit Vector size MACCS (bytes):", sys.getsizeof(fp_MACCS))
print("Bit Vector size Morgan(bytes):", sys.getsizeof(fp_morgan))
print("Bit Vector size PubChem(bytes):", sys.getsizeof(PCFP_bitvect))
Even though the fingerprints we generated differ in bit length (167, 1024, and 881, respectively), RDKit stores each of them in the same ExplicitBitVect structure. This structure includes fixed-size chunks of memory and object overhead, so the sys.getsizeof() function reports a similar size for all three even though the number of bits stored varies.
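One way to see how an object can report the same size regardless of how much data it holds is the pure-Python analogy below. The `Wrapper` class is hypothetical and stands in for any object that keeps its data in a separate place. Whether RDKit's Python objects behave in exactly this way is an assumption here, not something this notebook verifies.

```python
import sys

class Wrapper:
    """Hypothetical stand-in for an object whose data lives in a separate place."""
    __slots__ = ("payload",)

    def __init__(self, n_bytes):
        self.payload = bytearray(n_bytes)   # the actual data, held in its own object

small = Wrapper(21)     # room for roughly 167 bits
large = Wrapper(128)    # room for 1024 bits

# sys.getsizeof() looks only at the wrapper, not at what it points to
print("Wrapper sizes:", sys.getsizeof(small), sys.getsizeof(large))
# measuring the payloads directly shows the real difference
print("Payload sizes:", sys.getsizeof(small.payload), sys.getsizeof(large.payload))
```

The two wrappers report the same size even though one holds about six times as much data, because sys.getsizeof() does not follow references to other objects.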