Introduction to Pandas#
Become familiar with the Pandas data analysis library in python
Read data into pandas DataFrames from files
Manipulate data in the DataFrame
Export data as CSV
Pandas is a flexible open-source data analysis and manipulation library for Python. It provides data structures in series and DataFrames. A series is like a list and a DataFrame is like a spreadsheet. The key is that this library allows users to easily load, explore, clean, transform and analyzed structured data. It is widely used in data science, machine learning and statitical analysis.
This activity is intended to give you a brief overview of DataFrames and introduce you to pandas concepts we are using in the next notebook.
# load the pandas library
# the pandas community as agreed that the alias for pandas is 'pd'.
# Loading pandas with this alias allows us to use its functions and classes with a shorter name.
import pandas as pd
Much of the data we will be using is availabe as comma separated values files from PubChem or other resources. Pandas has the ability to read in csv data either locally from a downloaded file or as a request if we know the source of the file. In the code cell below, we will download the periodic table from PubChem and store it into a DataFrame variable call df
.
# load the periodic table data from a URL into a pandas DataFrame.
# We do not need to use the `requests` library here because pandas can directly read from URLs.
url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/periodictable/CSV" #url for the periodic table data in CSV format
df = pd.read_csv(url) # read the CSV data from the URL into a pandas DataFrame
df # display the DataFrame to see its contents
AtomicNumber | Symbol | Name | AtomicMass | CPKHexColor | ElectronConfiguration | Electronegativity | AtomicRadius | IonizationEnergy | ElectronAffinity | OxidationStates | StandardState | MeltingPoint | BoilingPoint | Density | GroupBlock | YearDiscovered | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | H | Hydrogen | 1.008000 | FFFFFF | 1s1 | 2.20 | 120.0 | 13.598 | 0.754 | +1, -1 | Gas | 13.81 | 20.28 | 0.000090 | Nonmetal | 1766 |
1 | 2 | He | Helium | 4.002600 | D9FFFF | 1s2 | NaN | 140.0 | 24.587 | NaN | 0 | Gas | 0.95 | 4.22 | 0.000179 | Noble gas | 1868 |
2 | 3 | Li | Lithium | 7.000000 | CC80FF | [He]2s1 | 0.98 | 182.0 | 5.392 | 0.618 | +1 | Solid | 453.65 | 1615.00 | 0.534000 | Alkali metal | 1817 |
3 | 4 | Be | Beryllium | 9.012183 | C2FF00 | [He]2s2 | 1.57 | 153.0 | 9.323 | NaN | +2 | Solid | 1560.00 | 2744.00 | 1.850000 | Alkaline earth metal | 1798 |
4 | 5 | B | Boron | 10.810000 | FFB5B5 | [He]2s2 2p1 | 2.04 | 192.0 | 8.298 | 0.277 | +3 | Solid | 2348.00 | 4273.00 | 2.370000 | Metalloid | 1808 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
113 | 114 | Fl | Flerovium | 290.192000 | NaN | [Rn]7s2 7p2 5f14 6d10 (predicted) | NaN | NaN | NaN | NaN | 6, 4,2, 1, 0 | Expected to be a Solid | NaN | NaN | NaN | Post-transition metal | 1998 |
114 | 115 | Mc | Moscovium | 290.196000 | NaN | [Rn]7s2 7p3 5f14 6d10 (predicted) | NaN | NaN | NaN | NaN | 3, 1 | Expected to be a Solid | NaN | NaN | NaN | Post-transition metal | 2003 |
115 | 116 | Lv | Livermorium | 293.205000 | NaN | [Rn]7s2 7p4 5f14 6d10 (predicted) | NaN | NaN | NaN | NaN | +4, +2, -2 | Expected to be a Solid | NaN | NaN | NaN | Post-transition metal | 2000 |
116 | 117 | Ts | Tennessine | 294.211000 | NaN | [Rn]7s2 7p5 5f14 6d10 (predicted) | NaN | NaN | NaN | NaN | +5, +3, +1, -1 | Expected to be a Solid | NaN | NaN | NaN | Halogen | 2010 |
117 | 118 | Og | Oganesson | 295.216000 | NaN | [Rn]7s2 7p6 5f14 6d10 (predicted) | NaN | NaN | NaN | NaN | +6, +4, +2, +1, 0, -1 | Expected to be a Gas | NaN | NaN | NaN | Noble gas | 2006 |
118 rows × 17 columns
Notice that the DataFrame output indicates it has 118 rows and 17 columns. There are a few ways to get information about rows and columns in our DataFrame(named df
):
print(df.shape)
print(len(df))
print(df.shape[0])
print(df.shape[1])
print(len(df.columns))
(118, 17)
118
118
17
17
Write comments next to each line in the code above to indicate what the code is doing.
Run the next four code cells to explore the columns property and head, tail and info methods of a DataFrame.
df.columns # display the columns of the DataFrame to understand its structure
Index(['AtomicNumber', 'Symbol', 'Name', 'AtomicMass', 'CPKHexColor',
'ElectronConfiguration', 'Electronegativity', 'AtomicRadius',
'IonizationEnergy', 'ElectronAffinity', 'OxidationStates',
'StandardState', 'MeltingPoint', 'BoilingPoint', 'Density',
'GroupBlock', 'YearDiscovered'],
dtype='object')
df.head(3) # display the first 3 rows of the DataFrame to get a glimpse of the data
AtomicNumber | Symbol | Name | AtomicMass | CPKHexColor | ElectronConfiguration | Electronegativity | AtomicRadius | IonizationEnergy | ElectronAffinity | OxidationStates | StandardState | MeltingPoint | BoilingPoint | Density | GroupBlock | YearDiscovered | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | H | Hydrogen | 1.0080 | FFFFFF | 1s1 | 2.20 | 120.0 | 13.598 | 0.754 | +1, -1 | Gas | 13.81 | 20.28 | 0.000090 | Nonmetal | 1766 |
1 | 2 | He | Helium | 4.0026 | D9FFFF | 1s2 | NaN | 140.0 | 24.587 | NaN | 0 | Gas | 0.95 | 4.22 | 0.000179 | Noble gas | 1868 |
2 | 3 | Li | Lithium | 7.0000 | CC80FF | [He]2s1 | 0.98 | 182.0 | 5.392 | 0.618 | +1 | Solid | 453.65 | 1615.00 | 0.534000 | Alkali metal | 1817 |
df.tail(5) # display the last 5 rows of the DataFrame to see the end of the data
AtomicNumber | Symbol | Name | AtomicMass | CPKHexColor | ElectronConfiguration | Electronegativity | AtomicRadius | IonizationEnergy | ElectronAffinity | OxidationStates | StandardState | MeltingPoint | BoilingPoint | Density | GroupBlock | YearDiscovered | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
113 | 114 | Fl | Flerovium | 290.192 | NaN | [Rn]7s2 7p2 5f14 6d10 (predicted) | NaN | NaN | NaN | NaN | 6, 4,2, 1, 0 | Expected to be a Solid | NaN | NaN | NaN | Post-transition metal | 1998 |
114 | 115 | Mc | Moscovium | 290.196 | NaN | [Rn]7s2 7p3 5f14 6d10 (predicted) | NaN | NaN | NaN | NaN | 3, 1 | Expected to be a Solid | NaN | NaN | NaN | Post-transition metal | 2003 |
115 | 116 | Lv | Livermorium | 293.205 | NaN | [Rn]7s2 7p4 5f14 6d10 (predicted) | NaN | NaN | NaN | NaN | +4, +2, -2 | Expected to be a Solid | NaN | NaN | NaN | Post-transition metal | 2000 |
116 | 117 | Ts | Tennessine | 294.211 | NaN | [Rn]7s2 7p5 5f14 6d10 (predicted) | NaN | NaN | NaN | NaN | +5, +3, +1, -1 | Expected to be a Solid | NaN | NaN | NaN | Halogen | 2010 |
117 | 118 | Og | Oganesson | 295.216 | NaN | [Rn]7s2 7p6 5f14 6d10 (predicted) | NaN | NaN | NaN | NaN | +6, +4, +2, +1, 0, -1 | Expected to be a Gas | NaN | NaN | NaN | Noble gas | 2006 |
df.info() # display information about the DataFrame, such as the number of entries, column names, and data types
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 118 entries, 0 to 117
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 AtomicNumber 118 non-null int64
1 Symbol 118 non-null object
2 Name 118 non-null object
3 AtomicMass 118 non-null float64
4 CPKHexColor 108 non-null object
5 ElectronConfiguration 118 non-null object
6 Electronegativity 95 non-null float64
7 AtomicRadius 99 non-null float64
8 IonizationEnergy 102 non-null float64
9 ElectronAffinity 57 non-null float64
10 OxidationStates 117 non-null object
11 StandardState 118 non-null object
12 MeltingPoint 103 non-null float64
13 BoilingPoint 93 non-null float64
14 Density 96 non-null float64
15 GroupBlock 118 non-null object
16 YearDiscovered 118 non-null object
dtypes: float64(8), int64(1), object(8)
memory usage: 15.8+ KB
When displaying a DataFrame, what is the default display?
If you only want to see the first 4 lines of the DataFrame, what command would you type?
If you want to see the last line only of the DataFrame, what command would you type?
Why is hydrogen, atomic number 1, listed as row 0?
Why might some of the data be null for these elements?
Accessing Subsets of Data in the DataFrame#
Now that our data is in a DataFrame, we can access a subset of the data.
# display the row with index 17 to see the details of the element at that index
df.loc[17]
AtomicNumber 18
Symbol Ar
Name Argon
AtomicMass 39.9
CPKHexColor 80D1E3
ElectronConfiguration [Ne]3s2 3p6
Electronegativity NaN
AtomicRadius 188.0
IonizationEnergy 15.76
ElectronAffinity NaN
OxidationStates 0
StandardState Gas
MeltingPoint 83.8
BoilingPoint 87.3
Density 0.001784
GroupBlock Noble gas
YearDiscovered 1894
Name: 17, dtype: object
# This will show the symbol of the element at index 17
df.loc[17, "Symbol"]
'Ar'
# this code cell will display the name and Atomic Radius of the element at index 17
# it uses the `loc` method to access specific rows and columns in the DataFrame
# this require the name of the columns to be specified as a list
df.loc[17, ["Name", "AtomicRadius"]]
Name Argon
AtomicRadius 188.0
Name: 17, dtype: object
# Display the name and atomic radius of the element at index 17 using column indexing
# it uses the `iloc` method to access specific rows and columns by their integer index
# you do not need to specify the column names, just their integer positions
# This can be useful when you know the position of the columns but not their names or the names are long.
df.iloc[17, [0, 7]]
AtomicNumber 18
AtomicRadius 188.0
Name: 17, dtype: object
Sometimes we want to have a subset of the data. This is achieved by identifying which columns we want.
df[["Name", "GroupBlock"]].head(3) # display the first 3 rows of the "Name" and "GroupBlock" columns
Name | GroupBlock | |
---|---|---|
0 | Hydrogen | Nonmetal |
1 | Helium | Noble gas |
2 | Lithium | Alkali metal |
We can also filter the data when we only want a subset of the data.
df[df["AtomicNumber"] < 10.0 ]# display the rows where the "AtomicNumber" is less than 10.0
AtomicNumber | Symbol | Name | AtomicMass | CPKHexColor | ElectronConfiguration | Electronegativity | AtomicRadius | IonizationEnergy | ElectronAffinity | OxidationStates | StandardState | MeltingPoint | BoilingPoint | Density | GroupBlock | YearDiscovered | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | H | Hydrogen | 1.008000 | FFFFFF | 1s1 | 2.20 | 120.0 | 13.598 | 0.754 | +1, -1 | Gas | 13.81 | 20.28 | 0.000090 | Nonmetal | 1766 |
1 | 2 | He | Helium | 4.002600 | D9FFFF | 1s2 | NaN | 140.0 | 24.587 | NaN | 0 | Gas | 0.95 | 4.22 | 0.000179 | Noble gas | 1868 |
2 | 3 | Li | Lithium | 7.000000 | CC80FF | [He]2s1 | 0.98 | 182.0 | 5.392 | 0.618 | +1 | Solid | 453.65 | 1615.00 | 0.534000 | Alkali metal | 1817 |
3 | 4 | Be | Beryllium | 9.012183 | C2FF00 | [He]2s2 | 1.57 | 153.0 | 9.323 | NaN | +2 | Solid | 1560.00 | 2744.00 | 1.850000 | Alkaline earth metal | 1798 |
4 | 5 | B | Boron | 10.810000 | FFB5B5 | [He]2s2 2p1 | 2.04 | 192.0 | 8.298 | 0.277 | +3 | Solid | 2348.00 | 4273.00 | 2.370000 | Metalloid | 1808 |
5 | 6 | C | Carbon | 12.011000 | 909090 | [He]2s2 2p2 | 2.55 | 170.0 | 11.260 | 1.263 | +4, +2, -4 | Solid | 3823.00 | 4098.00 | 2.267000 | Nonmetal | Ancient |
6 | 7 | N | Nitrogen | 14.007000 | 3050F8 | [He] 2s2 2p3 | 3.04 | 155.0 | 14.534 | NaN | +5, +4, +3, +2, +1, -1, -2, -3 | Gas | 63.15 | 77.36 | 0.001251 | Nonmetal | 1772 |
7 | 8 | O | Oxygen | 15.999000 | FF0D0D | [He]2s2 2p4 | 3.44 | 152.0 | 13.618 | 1.461 | -2 | Gas | 54.36 | 90.20 | 0.001429 | Nonmetal | 1774 |
8 | 9 | F | Fluorine | 18.998403 | 90E050 | [He]2s2 2p5 | 3.98 | 135.0 | 17.423 | 3.339 | -1 | Gas | 53.53 | 85.03 | 0.001696 | Halogen | 1670 |
# display rows where AtomicNumber is greater than 10 and less than 20
df[(df["AtomicNumber"] > 10) & (df["AtomicNumber"] < 20)]
AtomicNumber | Symbol | Name | AtomicMass | CPKHexColor | ElectronConfiguration | Electronegativity | AtomicRadius | IonizationEnergy | ElectronAffinity | OxidationStates | StandardState | MeltingPoint | BoilingPoint | Density | GroupBlock | YearDiscovered | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10 | 11 | Na | Sodium | 22.989769 | AB5CF2 | [Ne]3s1 | 0.93 | 227.0 | 5.139 | 0.548 | +1 | Solid | 370.950 | 1156.00 | 0.970000 | Alkali metal | 1807 |
11 | 12 | Mg | Magnesium | 24.305000 | 8AFF00 | [Ne]3s2 | 1.31 | 173.0 | 7.646 | NaN | +2 | Solid | 923.000 | 1363.00 | 1.740000 | Alkaline earth metal | 1808 |
12 | 13 | Al | Aluminum | 26.981538 | BFA6A6 | [Ne]3s2 3p1 | 1.61 | 184.0 | 5.986 | 0.441 | +3 | Solid | 933.437 | 2792.00 | 2.700000 | Post-transition metal | Ancient |
13 | 14 | Si | Silicon | 28.085000 | F0C8A0 | [Ne]3s2 3p2 | 1.90 | 210.0 | 8.152 | 1.385 | +4, +2, -4 | Solid | 1687.000 | 3538.00 | 2.329600 | Metalloid | 1854 |
14 | 15 | P | Phosphorus | 30.973762 | FF8000 | [Ne]3s2 3p3 | 2.19 | 180.0 | 10.487 | 0.746 | +5, +3, -3 | Solid | 317.300 | 553.65 | 1.820000 | Nonmetal | 1669 |
15 | 16 | S | Sulfur | 32.070000 | FFFF30 | [Ne]3s2 3p4 | 2.58 | 180.0 | 10.360 | 2.077 | +6, +4, -2 | Solid | 388.360 | 717.75 | 2.067000 | Nonmetal | Ancient |
16 | 17 | Cl | Chlorine | 35.450000 | 1FF01F | [Ne]3s2 3p5 | 3.16 | 175.0 | 12.968 | 3.617 | +7, +5, +1, -1 | Gas | 171.650 | 239.11 | 0.003214 | Halogen | 1774 |
17 | 18 | Ar | Argon | 39.900000 | 80D1E3 | [Ne]3s2 3p6 | NaN | 188.0 | 15.760 | NaN | 0 | Gas | 83.800 | 87.30 | 0.001784 | Noble gas | 1894 |
18 | 19 | K | Potassium | 39.098300 | 8F40D4 | [Ar]4s1 | 0.82 | 275.0 | 4.341 | 0.501 | +1 | Solid | 336.530 | 1032.00 | 0.890000 | Alkali metal | 1807 |
Notice in the series called GroupBlock has repeating values. If we want to display how many different values are in that column, we can identify how many are .unique
, and then show how many elements in the DataFrame are classified as Noble gasses using different methods.
# display the unique values in the "GroupBlock" column
print("Unique values in 'GroupBlock':")
print(df.GroupBlock.unique())
print()
# display the total count of unique values in the "GroupBlock" column
print("Total unique values in 'GroupBlock':")
print(df["GroupBlock"].nunique()) # nunique(): Returns the count of unique values.
print(len(df["GroupBlock"].unique())) # len(): Returns the count of unique values in the Series.
print()
# display the total count of rows where the "GroupBlock" is "Noble Gas" using the `shape` attribute
print("Total count of rows where 'GroupBlock' is 'Noble gas':")
print(df[df["GroupBlock"] == "Noble gas"].shape[0])
# display the total count of rows where the "GroupBlock" is "Noble Gas" using the length function
print(len(df[df["GroupBlock"] == "Noble gas"]))
# display the total count of rows where the "GroupBlock" is "Noble Gas" using the .sum()` method
print(df["GroupBlock"].eq("Noble gas").sum())
print()
#create a list of elements that are Noble Gases
noble_gases = df[df["GroupBlock"] == "Noble gas"]["Name"].tolist()
# display the list of Noble Gases
print("The noble gases are:", noble_gases)
# display the DataFrame filtered to show only rows where the "GroupBlock" is "Noble gas"
df[(df["GroupBlock"].eq("Noble gas"))]
Unique values in 'GroupBlock':
['Nonmetal' 'Noble gas' 'Alkali metal' 'Alkaline earth metal' 'Metalloid'
'Halogen' 'Post-transition metal' 'Transition metal' 'Lanthanide'
'Actinide']
Total unique values in 'GroupBlock':
10
10
Total count of rows where 'GroupBlock' is 'Noble gas':
7
7
7
The noble gases are: ['Helium', 'Neon', 'Argon', 'Krypton', 'Xenon', 'Radon', 'Oganesson']
AtomicNumber | Symbol | Name | AtomicMass | CPKHexColor | ElectronConfiguration | Electronegativity | AtomicRadius | IonizationEnergy | ElectronAffinity | OxidationStates | StandardState | MeltingPoint | BoilingPoint | Density | GroupBlock | YearDiscovered | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2 | He | Helium | 4.00260 | D9FFFF | 1s2 | NaN | 140.0 | 24.587 | NaN | 0 | Gas | 0.95 | 4.22 | 0.000179 | Noble gas | 1868 |
9 | 10 | Ne | Neon | 20.18000 | B3E3F5 | [He]2s2 2p6 | NaN | 154.0 | 21.565 | NaN | 0 | Gas | 24.56 | 27.07 | 0.000900 | Noble gas | 1898 |
17 | 18 | Ar | Argon | 39.90000 | 80D1E3 | [Ne]3s2 3p6 | NaN | 188.0 | 15.760 | NaN | 0 | Gas | 83.80 | 87.30 | 0.001784 | Noble gas | 1894 |
35 | 36 | Kr | Krypton | 83.80000 | 5CB8D1 | [Ar]4s2 3d10 4p6 | 3.0 | 202.0 | 14.000 | NaN | 0 | Gas | 115.79 | 119.93 | 0.003733 | Noble gas | 1898 |
53 | 54 | Xe | Xenon | 131.29000 | 429EB0 | [Kr]5s2 4d10 5p6 | 2.6 | 216.0 | 12.130 | NaN | 0 | Gas | 161.36 | 165.03 | 0.005887 | Noble gas | 1898 |
85 | 86 | Rn | Radon | 222.01758 | 428296 | [Xe]6s2 4f14 5d10 6p6 | NaN | 220.0 | 10.745 | NaN | 0 | Gas | 202.00 | 211.45 | 0.009730 | Noble gas | 1900 |
117 | 118 | Og | Oganesson | 295.21600 | NaN | [Rn]7s2 7p6 5f14 6d10 (predicted) | NaN | NaN | NaN | NaN | +6, +4, +2, +1, 0, -1 | Expected to be a Gas | NaN | NaN | NaN | Noble gas | 2006 |
Sometimes we find that there is data in columns that may be irrelevant to your data analysis. This column can be removed using the .drop()
method. Looking at the data in the DataFrame above, we probably don’t need CPKHexColor column. Let’s remove that data.
df = df.drop('CPKHexColor', axis=1) # drop the 'CPKHexColor' column from the DataFrame
df.head(1)
AtomicNumber | Symbol | Name | AtomicMass | ElectronConfiguration | Electronegativity | AtomicRadius | IonizationEnergy | ElectronAffinity | OxidationStates | StandardState | MeltingPoint | BoilingPoint | Density | GroupBlock | YearDiscovered | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | H | Hydrogen | 1.008 | 1s1 | 2.2 | 120.0 | 13.598 | 0.754 | +1, -1 | Gas | 13.81 | 20.28 | 0.00009 | Nonmetal | 1766 |
Sorting Data#
Pandas provides the sort_values
method to sort data in your series. The default is to sort by ascending value in a column. We can use a keyword argument in the sort_values
method to set optional parameters. In this case using ascending=False
sorts in desending order and ascending=True
sorts in ascending order.
# sort the DataFrame by "AtomicNumber" in descending order (acending=True will sort in ascending order)
df.sort_values(by="AtomicNumber", ascending=False)
AtomicNumber | Symbol | Name | AtomicMass | ElectronConfiguration | Electronegativity | AtomicRadius | IonizationEnergy | ElectronAffinity | OxidationStates | StandardState | MeltingPoint | BoilingPoint | Density | GroupBlock | YearDiscovered | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
117 | 118 | Og | Oganesson | 295.216000 | [Rn]7s2 7p6 5f14 6d10 (predicted) | NaN | NaN | NaN | NaN | +6, +4, +2, +1, 0, -1 | Expected to be a Gas | NaN | NaN | NaN | Noble gas | 2006 |
116 | 117 | Ts | Tennessine | 294.211000 | [Rn]7s2 7p5 5f14 6d10 (predicted) | NaN | NaN | NaN | NaN | +5, +3, +1, -1 | Expected to be a Solid | NaN | NaN | NaN | Halogen | 2010 |
115 | 116 | Lv | Livermorium | 293.205000 | [Rn]7s2 7p4 5f14 6d10 (predicted) | NaN | NaN | NaN | NaN | +4, +2, -2 | Expected to be a Solid | NaN | NaN | NaN | Post-transition metal | 2000 |
114 | 115 | Mc | Moscovium | 290.196000 | [Rn]7s2 7p3 5f14 6d10 (predicted) | NaN | NaN | NaN | NaN | 3, 1 | Expected to be a Solid | NaN | NaN | NaN | Post-transition metal | 2003 |
113 | 114 | Fl | Flerovium | 290.192000 | [Rn]7s2 7p2 5f14 6d10 (predicted) | NaN | NaN | NaN | NaN | 6, 4,2, 1, 0 | Expected to be a Solid | NaN | NaN | NaN | Post-transition metal | 1998 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4 | 5 | B | Boron | 10.810000 | [He]2s2 2p1 | 2.04 | 192.0 | 8.298 | 0.277 | +3 | Solid | 2348.00 | 4273.00 | 2.370000 | Metalloid | 1808 |
3 | 4 | Be | Beryllium | 9.012183 | [He]2s2 | 1.57 | 153.0 | 9.323 | NaN | +2 | Solid | 1560.00 | 2744.00 | 1.850000 | Alkaline earth metal | 1798 |
2 | 3 | Li | Lithium | 7.000000 | [He]2s1 | 0.98 | 182.0 | 5.392 | 0.618 | +1 | Solid | 453.65 | 1615.00 | 0.534000 | Alkali metal | 1817 |
1 | 2 | He | Helium | 4.002600 | 1s2 | NaN | 140.0 | 24.587 | NaN | 0 | Gas | 0.95 | 4.22 | 0.000179 | Noble gas | 1868 |
0 | 1 | H | Hydrogen | 1.008000 | 1s1 | 2.20 | 120.0 | 13.598 | 0.754 | +1, -1 | Gas | 13.81 | 20.28 | 0.000090 | Nonmetal | 1766 |
118 rows × 16 columns
df.sort_values(by="AtomicRadius")
AtomicNumber | Symbol | Name | AtomicMass | ElectronConfiguration | Electronegativity | AtomicRadius | IonizationEnergy | ElectronAffinity | OxidationStates | StandardState | MeltingPoint | BoilingPoint | Density | GroupBlock | YearDiscovered | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | H | Hydrogen | 1.008000 | 1s1 | 2.20 | 120.0 | 13.598 | 0.754 | +1, -1 | Gas | 13.81 | 20.28 | 0.000090 | Nonmetal | 1766 |
8 | 9 | F | Fluorine | 18.998403 | [He]2s2 2p5 | 3.98 | 135.0 | 17.423 | 3.339 | -1 | Gas | 53.53 | 85.03 | 0.001696 | Halogen | 1670 |
29 | 30 | Zn | Zinc | 65.400000 | [Ar]4s2 3d10 | 1.65 | 139.0 | 9.394 | NaN | +2 | Solid | 692.68 | 1180.00 | 7.134000 | Transition metal | 1746 |
1 | 2 | He | Helium | 4.002600 | 1s2 | NaN | 140.0 | 24.587 | NaN | 0 | Gas | 0.95 | 4.22 | 0.000179 | Noble gas | 1868 |
28 | 29 | Cu | Copper | 63.550000 | [Ar]4s1 3d10 | 1.90 | 140.0 | 7.726 | 1.228 | +2, +1 | Solid | 1357.77 | 2835.00 | 8.933000 | Transition metal | Ancient |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
113 | 114 | Fl | Flerovium | 290.192000 | [Rn]7s2 7p2 5f14 6d10 (predicted) | NaN | NaN | NaN | NaN | 6, 4,2, 1, 0 | Expected to be a Solid | NaN | NaN | NaN | Post-transition metal | 1998 |
114 | 115 | Mc | Moscovium | 290.196000 | [Rn]7s2 7p3 5f14 6d10 (predicted) | NaN | NaN | NaN | NaN | 3, 1 | Expected to be a Solid | NaN | NaN | NaN | Post-transition metal | 2003 |
115 | 116 | Lv | Livermorium | 293.205000 | [Rn]7s2 7p4 5f14 6d10 (predicted) | NaN | NaN | NaN | NaN | +4, +2, -2 | Expected to be a Solid | NaN | NaN | NaN | Post-transition metal | 2000 |
116 | 117 | Ts | Tennessine | 294.211000 | [Rn]7s2 7p5 5f14 6d10 (predicted) | NaN | NaN | NaN | NaN | +5, +3, +1, -1 | Expected to be a Solid | NaN | NaN | NaN | Halogen | 2010 |
117 | 118 | Og | Oganesson | 295.216000 | [Rn]7s2 7p6 5f14 6d10 (predicted) | NaN | NaN | NaN | NaN | +6, +4, +2, +1, 0, -1 | Expected to be a Gas | NaN | NaN | NaN | Noble gas | 2006 |
118 rows × 16 columns
Notice in the above DataFrame display that after sorting by Atomic Radius, the tail 5 entries have NaN
displayed. This is Not a Number (NaN) and in pandas represents missing data.
If you want to identify NaN values, you can use the isna()
method. Identifying the NaN values then allows us to remove them from the DataFrame. This is accomplished through the dropna()
method. It drops any rows that contain NaN
values. This looks for values that are False
or not True
.
Alternatively, you could identify only True
values by using the notna()
method.
In the following code cells identify which Rows have NaN values for Atomic Radius and then we will create a new DataFrame that has only Atomic Number and Atomic Radii without with NaN values.
# display the number of NaN values in each column
print(df.isna().sum())
print()
# display the number of NaN values in only the "Atomic Radius" column
print("The number of rows where Atomic Radius is NaN is:",df.AtomicRadius.isna().sum())
AtomicNumber 0
Symbol 0
Name 0
AtomicMass 0
ElectronConfiguration 0
Electronegativity 23
AtomicRadius 19
IonizationEnergy 16
ElectronAffinity 61
OxidationStates 1
StandardState 0
MeltingPoint 15
BoilingPoint 25
Density 22
GroupBlock 0
YearDiscovered 0
dtype: int64
The number of rows where Atomic Radius is NaN is: 19
print(df.notna().sum()) # display the number of non-NaN values in each column
AtomicNumber 118
Symbol 118
Name 118
AtomicMass 118
ElectronConfiguration 118
Electronegativity 95
AtomicRadius 99
IonizationEnergy 102
ElectronAffinity 57
OxidationStates 117
StandardState 118
MeltingPoint 103
BoilingPoint 93
Density 96
GroupBlock 118
YearDiscovered 118
dtype: int64
# Identify and print all rows where AtomicRadius is NaN
df[df['AtomicRadius'].isna()]
AtomicNumber | Symbol | Name | AtomicMass | ElectronConfiguration | Electronegativity | AtomicRadius | IonizationEnergy | ElectronAffinity | OxidationStates | StandardState | MeltingPoint | BoilingPoint | Density | GroupBlock | YearDiscovered | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
99 | 100 | Fm | Fermium | 257.09511 | [Rn] 5f12 7s2 | 1.3 | NaN | 6.50 | NaN | +3 | Solid | 1800.0 | NaN | NaN | Actinide | 1952 |
100 | 101 | Md | Mendelevium | 258.09843 | [Rn]7s2 5f13 | 1.3 | NaN | 6.58 | NaN | +3, +2 | Solid | 1100.0 | NaN | NaN | Actinide | 1955 |
101 | 102 | No | Nobelium | 259.10100 | [Rn]7s2 5f14 | 1.3 | NaN | 6.65 | NaN | +3, +2 | Solid | 1100.0 | NaN | NaN | Actinide | 1957 |
102 | 103 | Lr | Lawrencium | 266.12000 | [Rn]7s2 5f14 6d1 | 1.3 | NaN | NaN | NaN | +3 | Solid | 1900.0 | NaN | NaN | Actinide | 1961 |
103 | 104 | Rf | Rutherfordium | 267.12200 | [Rn]7s2 5f14 6d2 | NaN | NaN | NaN | NaN | +4 | Solid | NaN | NaN | NaN | Transition metal | 1964 |
104 | 105 | Db | Dubnium | 268.12600 | [Rn]7s2 5f14 6d3 | NaN | NaN | NaN | NaN | 5, 4, 3 | Solid | NaN | NaN | NaN | Transition metal | 1967 |
105 | 106 | Sg | Seaborgium | 269.12800 | [Rn]7s2 5f14 6d4 | NaN | NaN | NaN | NaN | 6, 5, 4, 3, 0 | Solid | NaN | NaN | NaN | Transition metal | 1974 |
106 | 107 | Bh | Bohrium | 270.13300 | [Rn]7s2 5f14 6d5 | NaN | NaN | NaN | NaN | 7, 5, 4, 3 | Solid | NaN | NaN | NaN | Transition metal | 1976 |
107 | 108 | Hs | Hassium | 269.13360 | [Rn]7s2 5f14 6d6 | NaN | NaN | NaN | NaN | 8, 6, 5, 4, 3, 2 | Solid | NaN | NaN | NaN | Transition metal | 1984 |
108 | 109 | Mt | Meitnerium | 277.15400 | [Rn]7s2 5f14 6d7 (calculated) | NaN | NaN | NaN | NaN | 9, 8, 6, 4, 3, 1 | Solid | NaN | NaN | NaN | Transition metal | 1982 |
109 | 110 | Ds | Darmstadtium | 282.16600 | [Rn]7s2 5f14 6d8 (predicted) | NaN | NaN | NaN | NaN | 8, 6, 4, 2, 0 | Expected to be a Solid | NaN | NaN | NaN | Transition metal | 1994 |
110 | 111 | Rg | Roentgenium | 282.16900 | [Rn]7s2 5f14 6d9 (predicted) | NaN | NaN | NaN | NaN | 5, 3, 1, -1 | Expected to be a Solid | NaN | NaN | NaN | Transition metal | 1994 |
111 | 112 | Cn | Copernicium | 286.17900 | [Rn]7s2 5f14 6d10 (predicted) | NaN | NaN | NaN | NaN | 2, 1, 0 | Expected to be a Solid | NaN | NaN | NaN | Transition metal | 1996 |
112 | 113 | Nh | Nihonium | 286.18200 | [Rn]5f14 6d10 7s2 7p1 (predicted) | NaN | NaN | NaN | NaN | NaN | Expected to be a Solid | NaN | NaN | NaN | Post-transition metal | 2004 |
113 | 114 | Fl | Flerovium | 290.19200 | [Rn]7s2 7p2 5f14 6d10 (predicted) | NaN | NaN | NaN | NaN | 6, 4,2, 1, 0 | Expected to be a Solid | NaN | NaN | NaN | Post-transition metal | 1998 |
114 | 115 | Mc | Moscovium | 290.19600 | [Rn]7s2 7p3 5f14 6d10 (predicted) | NaN | NaN | NaN | NaN | 3, 1 | Expected to be a Solid | NaN | NaN | NaN | Post-transition metal | 2003 |
115 | 116 | Lv | Livermorium | 293.20500 | [Rn]7s2 7p4 5f14 6d10 (predicted) | NaN | NaN | NaN | NaN | +4, +2, -2 | Expected to be a Solid | NaN | NaN | NaN | Post-transition metal | 2000 |
116 | 117 | Ts | Tennessine | 294.21100 | [Rn]7s2 7p5 5f14 6d10 (predicted) | NaN | NaN | NaN | NaN | +5, +3, +1, -1 | Expected to be a Solid | NaN | NaN | NaN | Halogen | 2010 |
117 | 118 | Og | Oganesson | 295.21600 | [Rn]7s2 7p6 5f14 6d10 (predicted) | NaN | NaN | NaN | NaN | +6, +4, +2, +1, 0, -1 | Expected to be a Gas | NaN | NaN | NaN | Noble gas | 2006 |
# Create a new DataFrame with AtomicNumber and AtomicRadius, dropping rows where AtomicRadius is NaN
df_atomic_radius_dropped = df[['AtomicNumber', 'AtomicRadius']].dropna(subset=['AtomicRadius'])
df_atomic_radius_dropped
AtomicNumber | AtomicRadius | |
---|---|---|
0 | 1 | 120.0 |
1 | 2 | 140.0 |
2 | 3 | 182.0 |
3 | 4 | 153.0 |
4 | 5 | 192.0 |
... | ... | ... |
94 | 95 | 244.0 |
95 | 96 | 245.0 |
96 | 97 | 244.0 |
97 | 98 | 245.0 |
98 | 99 | 245.0 |
99 rows × 2 columns
In the cell block below, write code to generate a new DataFrame called df_drop_density that only has Names and Density values that have values.
#write your Check your Understanding code here
Working with data files#
Once you have maninpulated your data, it is a good idea to save this data as a comma separated values (.csv) file. This can be accessed later and has all of your data cleaned for the next project. For example, you now have a DataFrame that is just atomic number and atomic radius. You can write that data to a new csv file using the to_csv()
method.
Your DataFrame not only has columns(series) of data, but also index numbers. Those index numbers may not be meaningful in other contexts. When you save the data as csv, you can choose to include the index value. The default is to include index, but setting an argument to index=False, you will exclude the index an only include the column data.
df_atomic_radius_dropped.to_csv('atomic_radius.csv')
df_atomic_radius_dropped.to_csv('atomic_radius_noindex.csv', index=False)
You can also upload that data from your file into a new DataFrame. One thing to note is that you need to specify its location. In this example, you wrote the csv file into the same directory as the original jupyter notebook, so you don’t need to specify where it is located.
df_fromfile = pd.read_csv("atomic_radius_noindex.csv")
df_fromfile.head(3)
AtomicNumber | AtomicRadius | |
---|---|---|
0 | 1 | 120.0 |
1 | 2 | 140.0 |
2 | 3 | 182.0 |
You have a DataFrame that loaded in your saved atomic_radius_noindex.csv file.
Load the atomic_radius.csv file into another DataFrame called df_index.
Display the first three rows of data.
If there is an unnamed column, remove that column from df_index in another code cell.
# Write your code here
Solution
df_index = pd.read_csv("atomic_radius.csv") df_index.head(3)
New Code cell
df_index = df_index.drop(columns=["Unnamed: 0"], axis =1)
In the next notebook, we will be accessing data from a website. The data is provided as space delimited file meaning the column data is separted by spaces rather than commas. When we load the data we can specify the separation of data with the sep = " "
argument.
The data from this website also does not have column names for the data. We can specify those names as we bring the data in.
colnames = ['smiles','dat', 'id']
df_act = pd.read_csv("https://dude.docking.org/targets/aa2ar/actives_final.ism", sep=" ", names=colnames)
df_act
smiles | dat | id | |
---|---|---|---|
0 | C#CCOc3nc(c1ccccc1)nc4sc2CCCc2c34 | 630004 | CHEMBL1087078 |
1 | OC[C@H]3OC(n1cnc2c(NCCS(O)(=O)=O)ncnc12)[C@H](... | 607385 | CHEMBL610435 |
2 | CCNC(=O)[C@H]4OC(n2cnc3c(N)nc(C#Cc1ccc(CCC(O)=... | 605505 | CHEMBL605065 |
3 | Ic5cccc(n1c(=S)sc3c1ncn4nc(c2ccco2)nc34)c5 | 623098 | CHEMBL1093479 |
4 | C=CCn1c(=S)sc3c1ncn4nc(c2ccco2)nc34 | 623027 | CHEMBL1092271 |
... | ... | ... | ... |
477 | CCNC(=O)[C@H]4OC(n2cnc3c(N)nc(C#Cc1nccs1)nc23)... | 605440 | CHEMBL605265 |
478 | CCCn4c(=O)n(CCC)c3cc(c2ccc(COC(=O)Nc1ccccc1)cc... | 479497 | CHEMBL466021 |
479 | Nc4nc(N2CCN(c1c(F)cc(F)cc1F)CC2)nc5nc(c3ccco3)... | 309346 | CHEMBL360619 |
480 | Cc4ccc(\C=N\Nc3nc(N)c2ncn(C1O[C@H](CO)[C@@H](O... | 606942 | CHEMBL611571 |
481 | Cc3sc2nc(c1ccccc1)n(CC#C)c(=O)c2c3C | 629987 | CHEMBL1088398 |
482 rows × 3 columns
Once we have a DataFrame, we can also save that DataFrame locally as a csv file to load later.
#write the DataFrame to a CSV file named 'aa2ar_actives.csv'
df_act.to_csv('aa2ar_actives.csv')
Joining two DataFrames#
A common operation when working with data is joining two DataFrames together. This is especially useful when information about a common set of molecules is stored in two different datasets. For example, you might have one DataFrame that has SMILES strings and names and a second that has names and experimental data. The .join()
method combines the two DataFrames together using an index value that allows you to analyze all relevant data in a single DataFrame.
DataFrame1 = pd.DataFrame({
'xLogP3': [1.2, 3.5, 3.3],
'MolecularWeight': [180.16, 206.28, 230.26]
}, index=['Aspirin', 'Ibuprofen', 'Naproxen'])
DataFrame2 = pd.DataFrame({
'TPSA': [63.6, 37.3, 46.5],
'SMILES': ['CC(=O)OC1=CC=CC=C1C(=O)O', 'CC(C)CC1=CC=C(C=C1)C(C)C(=O)O', 'C[C@@H](C1=CC2=C(C=C1)C=C(C=C2)OC)C(=O)O']
}, index=['Aspirin', 'Ibuprofen', 'Naproxen'])
# since these are small DataFrames, we can display them side by side in a Jupyter Notebook using HTML
from IPython.display import display, HTML
df1_html = DataFrame1.to_html()
df2_html = DataFrame2.to_html()
display(HTML(f"""
<table>
<tr>
<th><H3>DataFrame1</H3></th>
<th><H3>DataFrame2</H3></th>
</tr>
<tr>
<td>{df1_html}</td>
<td style="padding-left: 30px;">{df2_html}</td>
</tr>
</table>
"""))
DataFrame1 |
DataFrame2 |
||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
DataFramecombined = DataFrame1.join(DataFrame2)
DataFramecombined
xLogP3 | MolecularWeight | TPSA | SMILES | |
---|---|---|---|---|
Aspirin | 1.2 | 180.16 | 63.6 | CC(=O)OC1=CC=CC=C1C(=O)O |
Ibuprofen | 3.5 | 206.28 | 37.3 | CC(C)CC1=CC=C(C=C1)C(C)C(=O)O |
Naproxen | 3.3 | 230.26 | 46.5 | C[C@@H](C1=CC2=C(C=C1)C=C(C=C2)OC)C(=O)O |
Pandas is quite powerful and we have only scratched the surface. To get more savvy with pandas, you can access tutorials available at https://pandas.pydata.org/pandas-docs/version/0.15/tutorials.html
Homework
Familiarize yourself with the DUD-E (Directory of Useful Decoys, Enhanced) data sets (http://dude.docking.org/), which contain known actives and inactives for 102 protein targets.The DUD-E sets are widely used in benchmarking studies that compare the performance of different virtual screening approaches (https://doi.org/10.1021/jm300687e).
Go to the DUD-E target page (http://dude.docking.org/targets) and find muscle glycogen phosphorylase (Target Name: PYGM, PDB ID: 1c8k) from the target list. Clicking the target name “PYGM” directs you to the page that lists various files (http://dude.docking.org/targets/pygm). Download file “actives_final.ism”, which contains the SMILES strings of known actives. Rename the file name as “pygm_1c8k_actives.ism”. [Open the file in a text viewer/editor to check the format of this file].
Identify the data what appropriate column names should be.
Load the pygm_1c8k_actives.ism data into a DataFrame with appropriate column labels.
Create a variable that has the number of rows in the DataFrame.
Create a variable that has the number of columns in the DataFrame.
Identify how many different structures are included in the dataset.
Save this file locally as a .csv file with name: pygm_1c8k_actives_clean.csv without DataFrame index values.
You will be using this .csv file in the next activity.
# Write your code here.
colnames = ['smiles','DUD-E ID', 'CHEMBL_ID']
df_pygm_actives = pd.read_csv("https://dude.docking.org/targets/pygm/actives_final.ism", sep=" ", names=colnames)
# Create a variable that has the number of rows in the DataFrame.
num_rows = df_pygm_actives.shape[0] # or len(df_pygm_actives)
# Create a variable that has the number of columns in the DataFrame.
num_cols = df_pygm_actives.shape[1] # or df_pygm_actives.columns.size
# Identify number of unique SMILES strings in the df_pygm_actives DataFrame
print(len(df_pygm_actives.smiles.unique()))
# Save this file locally as a .csv file with name: pygm_1c8k_actives_clean.csv
df_pygm_actives.to_csv("pygm_1c8k_actives_clean.csv", index=False)
77