Introduction to Pandas#

Objectives
  • Become familiar with the Pandas data analysis library in python

  • Read data into pandas DataFrames from files

  • Manipulate data in the DataFrame

  • Export data as CSV

Pandas is a flexible open-source data analysis and manipulation library for Python. It provides data structures in series and DataFrames. A series is like a list and a DataFrame is like a spreadsheet. The key is that this library allows users to easily load, explore, clean, transform and analyzed structured data. It is widely used in data science, machine learning and statitical analysis.

This activity is intended to give you a brief overview of DataFrames and introduce you to pandas concepts we are using in the next notebook.

# load the pandas library 
# the pandas community as agreed that the alias for pandas is 'pd'. 
# Loading pandas with this alias allows us to use its functions and classes with a shorter name.
import pandas as pd

Much of the data we will be using is availabe as comma separated values files from PubChem or other resources. Pandas has the ability to read in csv data either locally from a downloaded file or as a request if we know the source of the file. In the code cell below, we will download the periodic table from PubChem and store it into a DataFrame variable call df.

# load the periodic table data from a URL into a pandas DataFrame. 
# We do not need to use the `requests` library here because pandas can directly read from URLs.
url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/periodictable/CSV" #url for the periodic table data in CSV format
df = pd.read_csv(url) # read the CSV data from the URL into a pandas DataFrame
df # display the DataFrame to see its contents
AtomicNumber Symbol Name AtomicMass CPKHexColor ElectronConfiguration Electronegativity AtomicRadius IonizationEnergy ElectronAffinity OxidationStates StandardState MeltingPoint BoilingPoint Density GroupBlock YearDiscovered
0 1 H Hydrogen 1.008000 FFFFFF 1s1 2.20 120.0 13.598 0.754 +1, -1 Gas 13.81 20.28 0.000090 Nonmetal 1766
1 2 He Helium 4.002600 D9FFFF 1s2 NaN 140.0 24.587 NaN 0 Gas 0.95 4.22 0.000179 Noble gas 1868
2 3 Li Lithium 7.000000 CC80FF [He]2s1 0.98 182.0 5.392 0.618 +1 Solid 453.65 1615.00 0.534000 Alkali metal 1817
3 4 Be Beryllium 9.012183 C2FF00 [He]2s2 1.57 153.0 9.323 NaN +2 Solid 1560.00 2744.00 1.850000 Alkaline earth metal 1798
4 5 B Boron 10.810000 FFB5B5 [He]2s2 2p1 2.04 192.0 8.298 0.277 +3 Solid 2348.00 4273.00 2.370000 Metalloid 1808
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
113 114 Fl Flerovium 290.192000 NaN [Rn]7s2 7p2 5f14 6d10 (predicted) NaN NaN NaN NaN 6, 4,2, 1, 0 Expected to be a Solid NaN NaN NaN Post-transition metal 1998
114 115 Mc Moscovium 290.196000 NaN [Rn]7s2 7p3 5f14 6d10 (predicted) NaN NaN NaN NaN 3, 1 Expected to be a Solid NaN NaN NaN Post-transition metal 2003
115 116 Lv Livermorium 293.205000 NaN [Rn]7s2 7p4 5f14 6d10 (predicted) NaN NaN NaN NaN +4, +2, -2 Expected to be a Solid NaN NaN NaN Post-transition metal 2000
116 117 Ts Tennessine 294.211000 NaN [Rn]7s2 7p5 5f14 6d10 (predicted) NaN NaN NaN NaN +5, +3, +1, -1 Expected to be a Solid NaN NaN NaN Halogen 2010
117 118 Og Oganesson 295.216000 NaN [Rn]7s2 7p6 5f14 6d10 (predicted) NaN NaN NaN NaN +6, +4, +2, +1, 0, -1 Expected to be a Gas NaN NaN NaN Noble gas 2006

118 rows × 17 columns

Notice that the DataFrame output indicates it has 118 rows and 17 columns. There are a few ways to get information about rows and columns in our DataFrame(named df):

print(df.shape)
print(len(df))
print(df.shape[0])  
print(df.shape[1])  
print(len(df.columns))
(118, 17)
118
118
17
17
Check your understanding

Write comments next to each line in the code above to indicate what the code is doing.

Run the next four code cells to explore the columns property and head, tail and info methods of a DataFrame.

df.columns # display the columns of the DataFrame to understand its structure
Index(['AtomicNumber', 'Symbol', 'Name', 'AtomicMass', 'CPKHexColor',
       'ElectronConfiguration', 'Electronegativity', 'AtomicRadius',
       'IonizationEnergy', 'ElectronAffinity', 'OxidationStates',
       'StandardState', 'MeltingPoint', 'BoilingPoint', 'Density',
       'GroupBlock', 'YearDiscovered'],
      dtype='object')
df.head(3) # display the first 3 rows of the DataFrame to get a glimpse of the data
AtomicNumber Symbol Name AtomicMass CPKHexColor ElectronConfiguration Electronegativity AtomicRadius IonizationEnergy ElectronAffinity OxidationStates StandardState MeltingPoint BoilingPoint Density GroupBlock YearDiscovered
0 1 H Hydrogen 1.0080 FFFFFF 1s1 2.20 120.0 13.598 0.754 +1, -1 Gas 13.81 20.28 0.000090 Nonmetal 1766
1 2 He Helium 4.0026 D9FFFF 1s2 NaN 140.0 24.587 NaN 0 Gas 0.95 4.22 0.000179 Noble gas 1868
2 3 Li Lithium 7.0000 CC80FF [He]2s1 0.98 182.0 5.392 0.618 +1 Solid 453.65 1615.00 0.534000 Alkali metal 1817
df.tail(5)  # display the last 5 rows of the DataFrame to see the end of the data
AtomicNumber Symbol Name AtomicMass CPKHexColor ElectronConfiguration Electronegativity AtomicRadius IonizationEnergy ElectronAffinity OxidationStates StandardState MeltingPoint BoilingPoint Density GroupBlock YearDiscovered
113 114 Fl Flerovium 290.192 NaN [Rn]7s2 7p2 5f14 6d10 (predicted) NaN NaN NaN NaN 6, 4,2, 1, 0 Expected to be a Solid NaN NaN NaN Post-transition metal 1998
114 115 Mc Moscovium 290.196 NaN [Rn]7s2 7p3 5f14 6d10 (predicted) NaN NaN NaN NaN 3, 1 Expected to be a Solid NaN NaN NaN Post-transition metal 2003
115 116 Lv Livermorium 293.205 NaN [Rn]7s2 7p4 5f14 6d10 (predicted) NaN NaN NaN NaN +4, +2, -2 Expected to be a Solid NaN NaN NaN Post-transition metal 2000
116 117 Ts Tennessine 294.211 NaN [Rn]7s2 7p5 5f14 6d10 (predicted) NaN NaN NaN NaN +5, +3, +1, -1 Expected to be a Solid NaN NaN NaN Halogen 2010
117 118 Og Oganesson 295.216 NaN [Rn]7s2 7p6 5f14 6d10 (predicted) NaN NaN NaN NaN +6, +4, +2, +1, 0, -1 Expected to be a Gas NaN NaN NaN Noble gas 2006
df.info() # display information about the DataFrame, such as the number of entries, column names, and data types
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 118 entries, 0 to 117
Data columns (total 17 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   AtomicNumber           118 non-null    int64  
 1   Symbol                 118 non-null    object 
 2   Name                   118 non-null    object 
 3   AtomicMass             118 non-null    float64
 4   CPKHexColor            108 non-null    object 
 5   ElectronConfiguration  118 non-null    object 
 6   Electronegativity      95 non-null     float64
 7   AtomicRadius           99 non-null     float64
 8   IonizationEnergy       102 non-null    float64
 9   ElectronAffinity       57 non-null     float64
 10  OxidationStates        117 non-null    object 
 11  StandardState          118 non-null    object 
 12  MeltingPoint           103 non-null    float64
 13  BoilingPoint           93 non-null     float64
 14  Density                96 non-null     float64
 15  GroupBlock             118 non-null    object 
 16  YearDiscovered         118 non-null    object 
dtypes: float64(8), int64(1), object(8)
memory usage: 15.8+ KB
Check your understanding
  • When displaying a DataFrame, what is the default display?

  • If you only want to see the first 4 lines of the DataFrame, what command would you type?

  • If you want to see the last line only of the DataFrame, what command would you type?

  • Why is hydrogen, atomic number 1, listed as row 0?

  • Why might some of the data be null for these elements?

Accessing Subsets of Data in the DataFrame#

Now that our data is in a DataFrame, we can access a subset of the data.

# display the row with index 17 to see the details of the element at that index
df.loc[17]
AtomicNumber                      18
Symbol                            Ar
Name                           Argon
AtomicMass                      39.9
CPKHexColor                   80D1E3
ElectronConfiguration    [Ne]3s2 3p6
Electronegativity                NaN
AtomicRadius                   188.0
IonizationEnergy               15.76
ElectronAffinity                 NaN
OxidationStates                    0
StandardState                    Gas
MeltingPoint                    83.8
BoilingPoint                    87.3
Density                     0.001784
GroupBlock                 Noble gas
YearDiscovered                  1894
Name: 17, dtype: object
# This will show the symbol of the element at index 17
df.loc[17, "Symbol"] 
'Ar'
# this code cell will display the name and Atomic Radius of the element at index 17
# it uses the `loc` method to access specific rows and columns in the DataFrame
# this require the name of the columns to be specified as a list
df.loc[17, ["Name", "AtomicRadius"]]
Name            Argon
AtomicRadius    188.0
Name: 17, dtype: object
# Display the name and atomic radius of the element at index 17 using column indexing
# it uses the `iloc` method to access specific rows and columns by their integer index
# you do not need to specify the column names, just their integer positions
# This can be useful when you know the position of the columns but not their names or the names are long.
df.iloc[17, [0, 7]]
AtomicNumber       18
AtomicRadius    188.0
Name: 17, dtype: object

Sometimes we want to have a subset of the data. This is achieved by identifying which columns we want.

df[["Name", "GroupBlock"]].head(3) # display the first 3 rows of the "Name" and "GroupBlock" columns    
Name GroupBlock
0 Hydrogen Nonmetal
1 Helium Noble gas
2 Lithium Alkali metal

We can also filter the data when we only want a subset of the data.

df[df["AtomicNumber"] < 10.0 ]# display the rows where the "AtomicNumber" is less than 10.0
AtomicNumber Symbol Name AtomicMass CPKHexColor ElectronConfiguration Electronegativity AtomicRadius IonizationEnergy ElectronAffinity OxidationStates StandardState MeltingPoint BoilingPoint Density GroupBlock YearDiscovered
0 1 H Hydrogen 1.008000 FFFFFF 1s1 2.20 120.0 13.598 0.754 +1, -1 Gas 13.81 20.28 0.000090 Nonmetal 1766
1 2 He Helium 4.002600 D9FFFF 1s2 NaN 140.0 24.587 NaN 0 Gas 0.95 4.22 0.000179 Noble gas 1868
2 3 Li Lithium 7.000000 CC80FF [He]2s1 0.98 182.0 5.392 0.618 +1 Solid 453.65 1615.00 0.534000 Alkali metal 1817
3 4 Be Beryllium 9.012183 C2FF00 [He]2s2 1.57 153.0 9.323 NaN +2 Solid 1560.00 2744.00 1.850000 Alkaline earth metal 1798
4 5 B Boron 10.810000 FFB5B5 [He]2s2 2p1 2.04 192.0 8.298 0.277 +3 Solid 2348.00 4273.00 2.370000 Metalloid 1808
5 6 C Carbon 12.011000 909090 [He]2s2 2p2 2.55 170.0 11.260 1.263 +4, +2, -4 Solid 3823.00 4098.00 2.267000 Nonmetal Ancient
6 7 N Nitrogen 14.007000 3050F8 [He] 2s2 2p3 3.04 155.0 14.534 NaN +5, +4, +3, +2, +1, -1, -2, -3 Gas 63.15 77.36 0.001251 Nonmetal 1772
7 8 O Oxygen 15.999000 FF0D0D [He]2s2 2p4 3.44 152.0 13.618 1.461 -2 Gas 54.36 90.20 0.001429 Nonmetal 1774
8 9 F Fluorine 18.998403 90E050 [He]2s2 2p5 3.98 135.0 17.423 3.339 -1 Gas 53.53 85.03 0.001696 Halogen 1670
# display rows where AtomicNumber is greater than 10 and less than 20
df[(df["AtomicNumber"] > 10) & (df["AtomicNumber"] < 20)] 
AtomicNumber Symbol Name AtomicMass CPKHexColor ElectronConfiguration Electronegativity AtomicRadius IonizationEnergy ElectronAffinity OxidationStates StandardState MeltingPoint BoilingPoint Density GroupBlock YearDiscovered
10 11 Na Sodium 22.989769 AB5CF2 [Ne]3s1 0.93 227.0 5.139 0.548 +1 Solid 370.950 1156.00 0.970000 Alkali metal 1807
11 12 Mg Magnesium 24.305000 8AFF00 [Ne]3s2 1.31 173.0 7.646 NaN +2 Solid 923.000 1363.00 1.740000 Alkaline earth metal 1808
12 13 Al Aluminum 26.981538 BFA6A6 [Ne]3s2 3p1 1.61 184.0 5.986 0.441 +3 Solid 933.437 2792.00 2.700000 Post-transition metal Ancient
13 14 Si Silicon 28.085000 F0C8A0 [Ne]3s2 3p2 1.90 210.0 8.152 1.385 +4, +2, -4 Solid 1687.000 3538.00 2.329600 Metalloid 1854
14 15 P Phosphorus 30.973762 FF8000 [Ne]3s2 3p3 2.19 180.0 10.487 0.746 +5, +3, -3 Solid 317.300 553.65 1.820000 Nonmetal 1669
15 16 S Sulfur 32.070000 FFFF30 [Ne]3s2 3p4 2.58 180.0 10.360 2.077 +6, +4, -2 Solid 388.360 717.75 2.067000 Nonmetal Ancient
16 17 Cl Chlorine 35.450000 1FF01F [Ne]3s2 3p5 3.16 175.0 12.968 3.617 +7, +5, +1, -1 Gas 171.650 239.11 0.003214 Halogen 1774
17 18 Ar Argon 39.900000 80D1E3 [Ne]3s2 3p6 NaN 188.0 15.760 NaN 0 Gas 83.800 87.30 0.001784 Noble gas 1894
18 19 K Potassium 39.098300 8F40D4 [Ar]4s1 0.82 275.0 4.341 0.501 +1 Solid 336.530 1032.00 0.890000 Alkali metal 1807

Notice in the series called GroupBlock has repeating values. If we want to display how many different values are in that column, we can identify how many are .unique, and then show how many elements in the DataFrame are classified as Noble gasses using different methods.

# display the unique values in the "GroupBlock" column
print("Unique values in 'GroupBlock':")
print(df.GroupBlock.unique()) 
print()
# display the total count of unique values in the "GroupBlock" column
print("Total unique values in 'GroupBlock':")
print(df["GroupBlock"].nunique()) # nunique(): Returns the count of unique values.
print(len(df["GroupBlock"].unique())) # len(): Returns the count of unique values in the Series.
print()
# display the total count of rows where the "GroupBlock" is "Noble Gas" using the `shape` attribute
print("Total count of rows where 'GroupBlock' is 'Noble gas':")
print(df[df["GroupBlock"] == "Noble gas"].shape[0])
# display the total count of rows where the "GroupBlock" is "Noble Gas" using the length function
print(len(df[df["GroupBlock"] == "Noble gas"]))
# display the total count of rows where the "GroupBlock" is "Noble Gas" using the .sum()` method
print(df["GroupBlock"].eq("Noble gas").sum())
print()

#create a list of elements that are Noble Gases
noble_gases = df[df["GroupBlock"] == "Noble gas"]["Name"].tolist()
# display the list of Noble Gases
print("The noble gases are:", noble_gases)
# display the DataFrame filtered to show only rows where the "GroupBlock" is "Noble gas"
df[(df["GroupBlock"].eq("Noble gas"))] 
Unique values in 'GroupBlock':
['Nonmetal' 'Noble gas' 'Alkali metal' 'Alkaline earth metal' 'Metalloid'
 'Halogen' 'Post-transition metal' 'Transition metal' 'Lanthanide'
 'Actinide']

Total unique values in 'GroupBlock':
10
10

Total count of rows where 'GroupBlock' is 'Noble gas':
7
7
7

The noble gases are: ['Helium', 'Neon', 'Argon', 'Krypton', 'Xenon', 'Radon', 'Oganesson']
AtomicNumber Symbol Name AtomicMass CPKHexColor ElectronConfiguration Electronegativity AtomicRadius IonizationEnergy ElectronAffinity OxidationStates StandardState MeltingPoint BoilingPoint Density GroupBlock YearDiscovered
1 2 He Helium 4.00260 D9FFFF 1s2 NaN 140.0 24.587 NaN 0 Gas 0.95 4.22 0.000179 Noble gas 1868
9 10 Ne Neon 20.18000 B3E3F5 [He]2s2 2p6 NaN 154.0 21.565 NaN 0 Gas 24.56 27.07 0.000900 Noble gas 1898
17 18 Ar Argon 39.90000 80D1E3 [Ne]3s2 3p6 NaN 188.0 15.760 NaN 0 Gas 83.80 87.30 0.001784 Noble gas 1894
35 36 Kr Krypton 83.80000 5CB8D1 [Ar]4s2 3d10 4p6 3.0 202.0 14.000 NaN 0 Gas 115.79 119.93 0.003733 Noble gas 1898
53 54 Xe Xenon 131.29000 429EB0 [Kr]5s2 4d10 5p6 2.6 216.0 12.130 NaN 0 Gas 161.36 165.03 0.005887 Noble gas 1898
85 86 Rn Radon 222.01758 428296 [Xe]6s2 4f14 5d10 6p6 NaN 220.0 10.745 NaN 0 Gas 202.00 211.45 0.009730 Noble gas 1900
117 118 Og Oganesson 295.21600 NaN [Rn]7s2 7p6 5f14 6d10 (predicted) NaN NaN NaN NaN +6, +4, +2, +1, 0, -1 Expected to be a Gas NaN NaN NaN Noble gas 2006

Sometimes we find that there is data in columns that may be irrelevant to your data analysis. This column can be removed using the .drop() method. Looking at the data in the DataFrame above, we probably don’t need CPKHexColor column. Let’s remove that data.

df = df.drop('CPKHexColor', axis=1)  # drop the 'CPKHexColor' column from the DataFrame
df.head(1)
AtomicNumber Symbol Name AtomicMass ElectronConfiguration Electronegativity AtomicRadius IonizationEnergy ElectronAffinity OxidationStates StandardState MeltingPoint BoilingPoint Density GroupBlock YearDiscovered
0 1 H Hydrogen 1.008 1s1 2.2 120.0 13.598 0.754 +1, -1 Gas 13.81 20.28 0.00009 Nonmetal 1766

Sorting Data#

Pandas provides the sort_values method to sort data in your series. The default is to sort by ascending value in a column. We can use a keyword argument in the sort_values method to set optional parameters. In this case using ascending=False sorts in desending order and ascending=True sorts in ascending order.

# sort the DataFrame by "AtomicNumber" in descending order (acending=True will sort in ascending order)
df.sort_values(by="AtomicNumber", ascending=False) 
AtomicNumber Symbol Name AtomicMass ElectronConfiguration Electronegativity AtomicRadius IonizationEnergy ElectronAffinity OxidationStates StandardState MeltingPoint BoilingPoint Density GroupBlock YearDiscovered
117 118 Og Oganesson 295.216000 [Rn]7s2 7p6 5f14 6d10 (predicted) NaN NaN NaN NaN +6, +4, +2, +1, 0, -1 Expected to be a Gas NaN NaN NaN Noble gas 2006
116 117 Ts Tennessine 294.211000 [Rn]7s2 7p5 5f14 6d10 (predicted) NaN NaN NaN NaN +5, +3, +1, -1 Expected to be a Solid NaN NaN NaN Halogen 2010
115 116 Lv Livermorium 293.205000 [Rn]7s2 7p4 5f14 6d10 (predicted) NaN NaN NaN NaN +4, +2, -2 Expected to be a Solid NaN NaN NaN Post-transition metal 2000
114 115 Mc Moscovium 290.196000 [Rn]7s2 7p3 5f14 6d10 (predicted) NaN NaN NaN NaN 3, 1 Expected to be a Solid NaN NaN NaN Post-transition metal 2003
113 114 Fl Flerovium 290.192000 [Rn]7s2 7p2 5f14 6d10 (predicted) NaN NaN NaN NaN 6, 4,2, 1, 0 Expected to be a Solid NaN NaN NaN Post-transition metal 1998
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4 5 B Boron 10.810000 [He]2s2 2p1 2.04 192.0 8.298 0.277 +3 Solid 2348.00 4273.00 2.370000 Metalloid 1808
3 4 Be Beryllium 9.012183 [He]2s2 1.57 153.0 9.323 NaN +2 Solid 1560.00 2744.00 1.850000 Alkaline earth metal 1798
2 3 Li Lithium 7.000000 [He]2s1 0.98 182.0 5.392 0.618 +1 Solid 453.65 1615.00 0.534000 Alkali metal 1817
1 2 He Helium 4.002600 1s2 NaN 140.0 24.587 NaN 0 Gas 0.95 4.22 0.000179 Noble gas 1868
0 1 H Hydrogen 1.008000 1s1 2.20 120.0 13.598 0.754 +1, -1 Gas 13.81 20.28 0.000090 Nonmetal 1766

118 rows × 16 columns

df.sort_values(by="AtomicRadius")
AtomicNumber Symbol Name AtomicMass ElectronConfiguration Electronegativity AtomicRadius IonizationEnergy ElectronAffinity OxidationStates StandardState MeltingPoint BoilingPoint Density GroupBlock YearDiscovered
0 1 H Hydrogen 1.008000 1s1 2.20 120.0 13.598 0.754 +1, -1 Gas 13.81 20.28 0.000090 Nonmetal 1766
8 9 F Fluorine 18.998403 [He]2s2 2p5 3.98 135.0 17.423 3.339 -1 Gas 53.53 85.03 0.001696 Halogen 1670
29 30 Zn Zinc 65.400000 [Ar]4s2 3d10 1.65 139.0 9.394 NaN +2 Solid 692.68 1180.00 7.134000 Transition metal 1746
1 2 He Helium 4.002600 1s2 NaN 140.0 24.587 NaN 0 Gas 0.95 4.22 0.000179 Noble gas 1868
28 29 Cu Copper 63.550000 [Ar]4s1 3d10 1.90 140.0 7.726 1.228 +2, +1 Solid 1357.77 2835.00 8.933000 Transition metal Ancient
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
113 114 Fl Flerovium 290.192000 [Rn]7s2 7p2 5f14 6d10 (predicted) NaN NaN NaN NaN 6, 4,2, 1, 0 Expected to be a Solid NaN NaN NaN Post-transition metal 1998
114 115 Mc Moscovium 290.196000 [Rn]7s2 7p3 5f14 6d10 (predicted) NaN NaN NaN NaN 3, 1 Expected to be a Solid NaN NaN NaN Post-transition metal 2003
115 116 Lv Livermorium 293.205000 [Rn]7s2 7p4 5f14 6d10 (predicted) NaN NaN NaN NaN +4, +2, -2 Expected to be a Solid NaN NaN NaN Post-transition metal 2000
116 117 Ts Tennessine 294.211000 [Rn]7s2 7p5 5f14 6d10 (predicted) NaN NaN NaN NaN +5, +3, +1, -1 Expected to be a Solid NaN NaN NaN Halogen 2010
117 118 Og Oganesson 295.216000 [Rn]7s2 7p6 5f14 6d10 (predicted) NaN NaN NaN NaN +6, +4, +2, +1, 0, -1 Expected to be a Gas NaN NaN NaN Noble gas 2006

118 rows × 16 columns

Notice in the above DataFrame display that after sorting by Atomic Radius, the tail 5 entries have NaN displayed. This is Not a Number (NaN) and in pandas represents missing data.

If you want to identify NaN values, you can use the isna() method. Identifying the NaN values then allows us to remove them from the DataFrame. This is accomplished through the dropna() method. It drops any rows that contain NaN values. This looks for values that are False or not True.

Alternatively, you could identify only True values by using the notna() method.

In the following code cells identify which Rows have NaN values for Atomic Radius and then we will create a new DataFrame that has only Atomic Number and Atomic Radii without with NaN values.

 # display the number of NaN values in each column
print(df.isna().sum()) 
print()
# display the number of NaN values in only the "Atomic Radius" column
print("The number of rows where Atomic Radius is NaN is:",df.AtomicRadius.isna().sum()) 
AtomicNumber              0
Symbol                    0
Name                      0
AtomicMass                0
ElectronConfiguration     0
Electronegativity        23
AtomicRadius             19
IonizationEnergy         16
ElectronAffinity         61
OxidationStates           1
StandardState             0
MeltingPoint             15
BoilingPoint             25
Density                  22
GroupBlock                0
YearDiscovered            0
dtype: int64

The number of rows where Atomic Radius is NaN is: 19
print(df.notna().sum())  # display the number of non-NaN values in each column
AtomicNumber             118
Symbol                   118
Name                     118
AtomicMass               118
ElectronConfiguration    118
Electronegativity         95
AtomicRadius              99
IonizationEnergy         102
ElectronAffinity          57
OxidationStates          117
StandardState            118
MeltingPoint             103
BoilingPoint              93
Density                   96
GroupBlock               118
YearDiscovered           118
dtype: int64
# Identify and print all rows where AtomicRadius is NaN
df[df['AtomicRadius'].isna()]
AtomicNumber Symbol Name AtomicMass ElectronConfiguration Electronegativity AtomicRadius IonizationEnergy ElectronAffinity OxidationStates StandardState MeltingPoint BoilingPoint Density GroupBlock YearDiscovered
99 100 Fm Fermium 257.09511 [Rn] 5f12 7s2 1.3 NaN 6.50 NaN +3 Solid 1800.0 NaN NaN Actinide 1952
100 101 Md Mendelevium 258.09843 [Rn]7s2 5f13 1.3 NaN 6.58 NaN +3, +2 Solid 1100.0 NaN NaN Actinide 1955
101 102 No Nobelium 259.10100 [Rn]7s2 5f14 1.3 NaN 6.65 NaN +3, +2 Solid 1100.0 NaN NaN Actinide 1957
102 103 Lr Lawrencium 266.12000 [Rn]7s2 5f14 6d1 1.3 NaN NaN NaN +3 Solid 1900.0 NaN NaN Actinide 1961
103 104 Rf Rutherfordium 267.12200 [Rn]7s2 5f14 6d2 NaN NaN NaN NaN +4 Solid NaN NaN NaN Transition metal 1964
104 105 Db Dubnium 268.12600 [Rn]7s2 5f14 6d3 NaN NaN NaN NaN 5, 4, 3 Solid NaN NaN NaN Transition metal 1967
105 106 Sg Seaborgium 269.12800 [Rn]7s2 5f14 6d4 NaN NaN NaN NaN 6, 5, 4, 3, 0 Solid NaN NaN NaN Transition metal 1974
106 107 Bh Bohrium 270.13300 [Rn]7s2 5f14 6d5 NaN NaN NaN NaN 7, 5, 4, 3 Solid NaN NaN NaN Transition metal 1976
107 108 Hs Hassium 269.13360 [Rn]7s2 5f14 6d6 NaN NaN NaN NaN 8, 6, 5, 4, 3, 2 Solid NaN NaN NaN Transition metal 1984
108 109 Mt Meitnerium 277.15400 [Rn]7s2 5f14 6d7 (calculated) NaN NaN NaN NaN 9, 8, 6, 4, 3, 1 Solid NaN NaN NaN Transition metal 1982
109 110 Ds Darmstadtium 282.16600 [Rn]7s2 5f14 6d8 (predicted) NaN NaN NaN NaN 8, 6, 4, 2, 0 Expected to be a Solid NaN NaN NaN Transition metal 1994
110 111 Rg Roentgenium 282.16900 [Rn]7s2 5f14 6d9 (predicted) NaN NaN NaN NaN 5, 3, 1, -1 Expected to be a Solid NaN NaN NaN Transition metal 1994
111 112 Cn Copernicium 286.17900 [Rn]7s2 5f14 6d10 (predicted) NaN NaN NaN NaN 2, 1, 0 Expected to be a Solid NaN NaN NaN Transition metal 1996
112 113 Nh Nihonium 286.18200 [Rn]5f14 6d10 7s2 7p1 (predicted) NaN NaN NaN NaN NaN Expected to be a Solid NaN NaN NaN Post-transition metal 2004
113 114 Fl Flerovium 290.19200 [Rn]7s2 7p2 5f14 6d10 (predicted) NaN NaN NaN NaN 6, 4,2, 1, 0 Expected to be a Solid NaN NaN NaN Post-transition metal 1998
114 115 Mc Moscovium 290.19600 [Rn]7s2 7p3 5f14 6d10 (predicted) NaN NaN NaN NaN 3, 1 Expected to be a Solid NaN NaN NaN Post-transition metal 2003
115 116 Lv Livermorium 293.20500 [Rn]7s2 7p4 5f14 6d10 (predicted) NaN NaN NaN NaN +4, +2, -2 Expected to be a Solid NaN NaN NaN Post-transition metal 2000
116 117 Ts Tennessine 294.21100 [Rn]7s2 7p5 5f14 6d10 (predicted) NaN NaN NaN NaN +5, +3, +1, -1 Expected to be a Solid NaN NaN NaN Halogen 2010
117 118 Og Oganesson 295.21600 [Rn]7s2 7p6 5f14 6d10 (predicted) NaN NaN NaN NaN +6, +4, +2, +1, 0, -1 Expected to be a Gas NaN NaN NaN Noble gas 2006
# Create a new DataFrame with AtomicNumber and AtomicRadius, dropping rows where AtomicRadius is NaN
df_atomic_radius_dropped = df[['AtomicNumber', 'AtomicRadius']].dropna(subset=['AtomicRadius'])
df_atomic_radius_dropped
AtomicNumber AtomicRadius
0 1 120.0
1 2 140.0
2 3 182.0
3 4 153.0
4 5 192.0
... ... ...
94 95 244.0
95 96 245.0
96 97 244.0
97 98 245.0
98 99 245.0

99 rows × 2 columns

Check Your Understanding

In the cell block below, write code to generate a new DataFrame called df_drop_density that only has Names and Density values that have values.

#write your Check your Understanding code here

Working with data files#

Once you have maninpulated your data, it is a good idea to save this data as a comma separated values (.csv) file. This can be accessed later and has all of your data cleaned for the next project. For example, you now have a DataFrame that is just atomic number and atomic radius. You can write that data to a new csv file using the to_csv() method.

Your DataFrame not only has columns(series) of data, but also index numbers. Those index numbers may not be meaningful in other contexts. When you save the data as csv, you can choose to include the index value. The default is to include index, but setting an argument to index=False, you will exclude the index an only include the column data.

df_atomic_radius_dropped.to_csv('atomic_radius.csv') 
df_atomic_radius_dropped.to_csv('atomic_radius_noindex.csv', index=False) 

You can also upload that data from your file into a new DataFrame. One thing to note is that you need to specify its location. In this example, you wrote the csv file into the same directory as the original jupyter notebook, so you don’t need to specify where it is located.

df_fromfile = pd.read_csv("atomic_radius_noindex.csv")
df_fromfile.head(3)
AtomicNumber AtomicRadius
0 1 120.0
1 2 140.0
2 3 182.0
Check Your Understanding

You have a DataFrame that loaded in your saved atomic_radius_noindex.csv file.

  • Load the atomic_radius.csv file into another DataFrame called df_index.

  • Display the first three rows of data.

  • If there is an unnamed column, remove that column from df_index in another code cell.

# Write your code here
Solution
df_index = pd.read_csv("atomic_radius.csv")
df_index.head(3) 

New Code cell
df_index = df_index.drop(columns=["Unnamed: 0"], axis =1) 

In the next notebook, we will be accessing data from a website. The data is provided as space delimited file meaning the column data is separted by spaces rather than commas. When we load the data we can specify the separation of data with the sep = " " argument.

The data from this website also does not have column names for the data. We can specify those names as we bring the data in.

colnames = ['smiles','dat', 'id']
df_act = pd.read_csv("https://dude.docking.org/targets/aa2ar/actives_final.ism", sep=" ", names=colnames)
df_act
smiles dat id
0 C#CCOc3nc(c1ccccc1)nc4sc2CCCc2c34 630004 CHEMBL1087078
1 OC[C@H]3OC(n1cnc2c(NCCS(O)(=O)=O)ncnc12)[C@H](... 607385 CHEMBL610435
2 CCNC(=O)[C@H]4OC(n2cnc3c(N)nc(C#Cc1ccc(CCC(O)=... 605505 CHEMBL605065
3 Ic5cccc(n1c(=S)sc3c1ncn4nc(c2ccco2)nc34)c5 623098 CHEMBL1093479
4 C=CCn1c(=S)sc3c1ncn4nc(c2ccco2)nc34 623027 CHEMBL1092271
... ... ... ...
477 CCNC(=O)[C@H]4OC(n2cnc3c(N)nc(C#Cc1nccs1)nc23)... 605440 CHEMBL605265
478 CCCn4c(=O)n(CCC)c3cc(c2ccc(COC(=O)Nc1ccccc1)cc... 479497 CHEMBL466021
479 Nc4nc(N2CCN(c1c(F)cc(F)cc1F)CC2)nc5nc(c3ccco3)... 309346 CHEMBL360619
480 Cc4ccc(\C=N\Nc3nc(N)c2ncn(C1O[C@H](CO)[C@@H](O... 606942 CHEMBL611571
481 Cc3sc2nc(c1ccccc1)n(CC#C)c(=O)c2c3C 629987 CHEMBL1088398

482 rows × 3 columns

Once we have a DataFrame, we can also save that DataFrame locally as a csv file to load later.

#write the DataFrame to a CSV file named 'aa2ar_actives.csv'
df_act.to_csv('aa2ar_actives.csv') 

Joining two DataFrames#

A common operation when working with data is joining two DataFrames together. This is especially useful when information about a common set of molecules is stored in two different datasets. For example, you might have one DataFrame that has SMILES strings and names and a second that has names and experimental data. The .join() method combines the two DataFrames together using an index value that allows you to analyze all relevant data in a single DataFrame.

DataFrame1 = pd.DataFrame({
    'xLogP3': [1.2, 3.5, 3.3],
    'MolecularWeight': [180.16, 206.28, 230.26]
}, index=['Aspirin', 'Ibuprofen', 'Naproxen'])

DataFrame2 = pd.DataFrame({
    'TPSA': [63.6, 37.3, 46.5],
    'SMILES': ['CC(=O)OC1=CC=CC=C1C(=O)O', 'CC(C)CC1=CC=C(C=C1)C(C)C(=O)O', 'C[C@@H](C1=CC2=C(C=C1)C=C(C=C2)OC)C(=O)O']
}, index=['Aspirin', 'Ibuprofen', 'Naproxen'])
# since these are small DataFrames, we can display them side by side in a Jupyter Notebook using HTML
from IPython.display import display, HTML

df1_html = DataFrame1.to_html()
df2_html = DataFrame2.to_html()

display(HTML(f"""
<table>
<tr>
    <th><H3>DataFrame1</H3></th>
    <th><H3>DataFrame2</H3></th>
</tr>
<tr>
    <td>{df1_html}</td>
    <td style="padding-left: 30px;">{df2_html}</td>
</tr>
</table>
"""))

DataFrame1

DataFrame2

xLogP3 MolecularWeight
Aspirin 1.2 180.16
Ibuprofen 3.5 206.28
Naproxen 3.3 230.26
TPSA SMILES
Aspirin 63.6 CC(=O)OC1=CC=CC=C1C(=O)O
Ibuprofen 37.3 CC(C)CC1=CC=C(C=C1)C(C)C(=O)O
Naproxen 46.5 C[C@@H](C1=CC2=C(C=C1)C=C(C=C2)OC)C(=O)O
DataFramecombined = DataFrame1.join(DataFrame2)
DataFramecombined
xLogP3 MolecularWeight TPSA SMILES
Aspirin 1.2 180.16 63.6 CC(=O)OC1=CC=CC=C1C(=O)O
Ibuprofen 3.5 206.28 37.3 CC(C)CC1=CC=C(C=C1)C(C)C(=O)O
Naproxen 3.3 230.26 46.5 C[C@@H](C1=CC2=C(C=C1)C=C(C=C2)OC)C(=O)O

Pandas is quite powerful and we have only scratched the surface. To get more savvy with pandas, you can access tutorials available at https://pandas.pydata.org/pandas-docs/version/0.15/tutorials.html

Homework

Familiarize yourself with the DUD-E (Directory of Useful Decoys, Enhanced) data sets (http://dude.docking.org/), which contain known actives and inactives for 102 protein targets.The DUD-E sets are widely used in benchmarking studies that compare the performance of different virtual screening approaches (https://doi.org/10.1021/jm300687e).

Go to the DUD-E target page (http://dude.docking.org/targets) and find muscle glycogen phosphorylase (Target Name: PYGM, PDB ID: 1c8k) from the target list. Clicking the target name “PYGM” directs you to the page that lists various files (http://dude.docking.org/targets/pygm). Download file “actives_final.ism”, which contains the SMILES strings of known actives. Rename the file name as “pygm_1c8k_actives.ism”. [Open the file in a text viewer/editor to check the format of this file].

Identify the data what appropriate column names should be.

  1. Load the pygm_1c8k_actives.ism data into a DataFrame with appropriate column labels.

  2. Create a variable that has the number of rows in the DataFrame.

  3. Create a variable that has the number of columns in the DataFrame.

  4. Identify how many different structures are included in the dataset.

  5. Save this file locally as a .csv file with name: pygm_1c8k_actives_clean.csv without DataFrame index values.

You will be using this .csv file in the next activity.

# Write your code here.

colnames = ['smiles','DUD-E ID', 'CHEMBL_ID']
df_pygm_actives = pd.read_csv("https://dude.docking.org/targets/pygm/actives_final.ism", sep=" ", names=colnames)
# Create a variable that has the number of rows in the DataFrame.
num_rows = df_pygm_actives.shape[0]  # or len(df_pygm_actives)
# Create a variable that has the number of columns in the DataFrame.
num_cols = df_pygm_actives.shape[1]  # or df_pygm_actives.columns.size

# Identify number of unique SMILES strings in the df_pygm_actives DataFrame
print(len(df_pygm_actives.smiles.unique()))
# Save this file locally as a .csv file with name: pygm_1c8k_actives_clean.csv
df_pygm_actives.to_csv("pygm_1c8k_actives_clean.csv", index=False)
77