Introduction to Pandas

Introduction to Pandas#

Objectives

Become familiar with the Pandas data analysis library in python
Read data into pandas DataFrames from files
Manipulate data in the DataFrame
Export data as CSV

Pandas is a flexible open-source data analysis and manipulation library for Python. It provides data structures in series and DataFrames. A series is like a list and a DataFrame is like a spreadsheet. The key is that this library allows users to easily load, explore, clean, transform and analyzed structured data. It is widely used in data science, machine learning and statitical analysis.

This activity is intended to give you a brief overview of DataFrames and introduce you to pandas concepts we are using in the next notebook.

# load the pandas library 
# the pandas community as agreed that the alias for pandas is 'pd'. 
# Loading pandas with this alias allows us to use its functions and classes with a shorter name.
import pandas as pd

Much of the data we will be using is availabe as comma separated values files from PubChem or other resources. Pandas has the ability to read in csv data either locally from a downloaded file or as a request if we know the source of the file. In the code cell below, we will download the periodic table from PubChem and store it into a DataFrame variable call df.

# load the periodic table data from a URL into a pandas DataFrame. 
# We do not need to use the `requests` library here because pandas can directly read from URLs.
url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/periodictable/CSV" #url for the periodic table data in CSV format
df = pd.read_csv(url) # read the CSV data from the URL into a pandas DataFrame
df # display the DataFrame to see its contents

	AtomicNumber	Symbol	Name	AtomicMass	CPKHexColor	ElectronConfiguration	Electronegativity	AtomicRadius	IonizationEnergy	ElectronAffinity	OxidationStates	StandardState	MeltingPoint	BoilingPoint	Density	GroupBlock	YearDiscovered
0	1	H	Hydrogen	1.008000	FFFFFF	1s1	2.20	120.0	13.598	0.754	+1, -1	Gas	13.81	20.28	0.000090	Nonmetal	1766
1	2	He	Helium	4.002600	D9FFFF	1s2	NaN	140.0	24.587	NaN	0	Gas	0.95	4.22	0.000179	Noble gas	1868
2	3	Li	Lithium	7.000000	CC80FF	[He]2s1	0.98	182.0	5.392	0.618	+1	Solid	453.65	1615.00	0.534000	Alkali metal	1817
3	4	Be	Beryllium	9.012183	C2FF00	[He]2s2	1.57	153.0	9.323	NaN	+2	Solid	1560.00	2744.00	1.850000	Alkaline earth metal	1798
4	5	B	Boron	10.810000	FFB5B5	[He]2s2 2p1	2.04	192.0	8.298	0.277	+3	Solid	2348.00	4273.00	2.370000	Metalloid	1808
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
113	114	Fl	Flerovium	290.192000	NaN	[Rn]7s2 7p2 5f14 6d10 (predicted)	NaN	NaN	NaN	NaN	6, 4,2, 1, 0	Expected to be a Solid	NaN	NaN	NaN	Post-transition metal	1998
114	115	Mc	Moscovium	290.196000	NaN	[Rn]7s2 7p3 5f14 6d10 (predicted)	NaN	NaN	NaN	NaN	3, 1	Expected to be a Solid	NaN	NaN	NaN	Post-transition metal	2003
115	116	Lv	Livermorium	293.205000	NaN	[Rn]7s2 7p4 5f14 6d10 (predicted)	NaN	NaN	NaN	NaN	+4, +2, -2	Expected to be a Solid	NaN	NaN	NaN	Post-transition metal	2000
116	117	Ts	Tennessine	294.211000	NaN	[Rn]7s2 7p5 5f14 6d10 (predicted)	NaN	NaN	NaN	NaN	+5, +3, +1, -1	Expected to be a Solid	NaN	NaN	NaN	Halogen	2010
117	118	Og	Oganesson	295.216000	NaN	[Rn]7s2 7p6 5f14 6d10 (predicted)	NaN	NaN	NaN	NaN	+6, +4, +2, +1, 0, -1	Expected to be a Gas	NaN	NaN	NaN	Noble gas	2006

118 rows × 17 columns

Notice that the DataFrame output indicates it has 118 rows and 17 columns. There are a few ways to get information about rows and columns in our DataFrame(named df):

print(df.shape)
print(len(df))
print(df.shape[0])  
print(df.shape[1])  
print(len(df.columns))

Check your understanding

Write comments next to each line in the code above to indicate what the code is doing.

Run the next four code cells to explore the columns property and head, tail and info methods of a DataFrame.

df.columns # display the columns of the DataFrame to understand its structure

Index(['AtomicNumber', 'Symbol', 'Name', 'AtomicMass', 'CPKHexColor',
       'ElectronConfiguration', 'Electronegativity', 'AtomicRadius',
       'IonizationEnergy', 'ElectronAffinity', 'OxidationStates',
       'StandardState', 'MeltingPoint', 'BoilingPoint', 'Density',
       'GroupBlock', 'YearDiscovered'],
      dtype='object')

df.head(3) # display the first 3 rows of the DataFrame to get a glimpse of the data

	AtomicNumber	Symbol	Name	AtomicMass	CPKHexColor	ElectronConfiguration	Electronegativity	AtomicRadius	IonizationEnergy	ElectronAffinity	OxidationStates	StandardState	MeltingPoint	BoilingPoint	Density	GroupBlock	YearDiscovered
0	1	H	Hydrogen	1.0080	FFFFFF	1s1	2.20	120.0	13.598	0.754	+1, -1	Gas	13.81	20.28	0.000090	Nonmetal	1766
1	2	He	Helium	4.0026	D9FFFF	1s2	NaN	140.0	24.587	NaN	0	Gas	0.95	4.22	0.000179	Noble gas	1868
2	3	Li	Lithium	7.0000	CC80FF	[He]2s1	0.98	182.0	5.392	0.618	+1	Solid	453.65	1615.00	0.534000	Alkali metal	1817

df.tail(5)  # display the last 5 rows of the DataFrame to see the end of the data

	AtomicNumber	Symbol	Name	AtomicMass	CPKHexColor	ElectronConfiguration	Electronegativity	AtomicRadius	IonizationEnergy	ElectronAffinity	OxidationStates	StandardState	MeltingPoint	BoilingPoint	Density	GroupBlock	YearDiscovered
113	114	Fl	Flerovium	290.192	NaN	[Rn]7s2 7p2 5f14 6d10 (predicted)	NaN	NaN	NaN	NaN	6, 4,2, 1, 0	Expected to be a Solid	NaN	NaN	NaN	Post-transition metal	1998
114	115	Mc	Moscovium	290.196	NaN	[Rn]7s2 7p3 5f14 6d10 (predicted)	NaN	NaN	NaN	NaN	3, 1	Expected to be a Solid	NaN	NaN	NaN	Post-transition metal	2003
115	116	Lv	Livermorium	293.205	NaN	[Rn]7s2 7p4 5f14 6d10 (predicted)	NaN	NaN	NaN	NaN	+4, +2, -2	Expected to be a Solid	NaN	NaN	NaN	Post-transition metal	2000
116	117	Ts	Tennessine	294.211	NaN	[Rn]7s2 7p5 5f14 6d10 (predicted)	NaN	NaN	NaN	NaN	+5, +3, +1, -1	Expected to be a Solid	NaN	NaN	NaN	Halogen	2010
117	118	Og	Oganesson	295.216	NaN	[Rn]7s2 7p6 5f14 6d10 (predicted)	NaN	NaN	NaN	NaN	+6, +4, +2, +1, 0, -1	Expected to be a Gas	NaN	NaN	NaN	Noble gas	2006

df.info() # display information about the DataFrame, such as the number of entries, column names, and data types

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 118 entries, 0 to 117
Data columns (total 17 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 AtomicNumber           118 non-null    int64  
 Symbol                 118 non-null    object 
 Name                   118 non-null    object 
 AtomicMass             118 non-null    float64
 CPKHexColor            108 non-null    object 
 ElectronConfiguration  118 non-null    object 
 Electronegativity      95 non-null     float64
 AtomicRadius           99 non-null     float64
 IonizationEnergy       102 non-null    float64
 ElectronAffinity       57 non-null     float64
OxidationStates        117 non-null    object 
StandardState          118 non-null    object 
MeltingPoint           103 non-null    float64
BoilingPoint           93 non-null     float64
Density                96 non-null     float64
GroupBlock             118 non-null    object 
YearDiscovered         118 non-null    object 
dtypes: float64(8), int64(1), object(8)
memory usage: 15.8+ KB

Check your understanding

When displaying a DataFrame, what is the default display?
If you only want to see the first 4 lines of the DataFrame, what command would you type?
If you want to see the last line only of the DataFrame, what command would you type?
Why is hydrogen, atomic number 1, listed as row 0?
Why might some of the data be null for these elements?

Accessing Subsets of Data in the DataFrame#

Now that our data is in a DataFrame, we can access a subset of the data.

# display the row with index 17 to see the details of the element at that index
df.loc[17]

AtomicNumber                      18
Symbol                            Ar
Name                           Argon
AtomicMass                      39.9
CPKHexColor                   80D1E3
ElectronConfiguration    [Ne]3s2 3p6
Electronegativity                NaN
AtomicRadius                   188.0
IonizationEnergy               15.76
ElectronAffinity                 NaN
OxidationStates                    0
StandardState                    Gas
MeltingPoint                    83.8
BoilingPoint                    87.3
Density                     0.001784
GroupBlock                 Noble gas
YearDiscovered                  1894
Name: 17, dtype: object

# This will show the symbol of the element at index 17
df.loc[17, "Symbol"] 

'Ar'

# this code cell will display the name and Atomic Radius of the element at index 17
# it uses the `loc` method to access specific rows and columns in the DataFrame
# this require the name of the columns to be specified as a list
df.loc[17, ["Name", "AtomicRadius"]]

Name            Argon
AtomicRadius    188.0
Name: 17, dtype: object

# Display the name and atomic radius of the element at index 17 using column indexing
# it uses the `iloc` method to access specific rows and columns by their integer index
# you do not need to specify the column names, just their integer positions
# This can be useful when you know the position of the columns but not their names or the names are long.
df.iloc[17, [0, 7]]

AtomicNumber       18
AtomicRadius    188.0
Name: 17, dtype: object

Sometimes we want to have a subset of the data. This is achieved by identifying which columns we want.

df[["Name", "GroupBlock"]].head(3) # display the first 3 rows of the "Name" and "GroupBlock" columns    

	Name	GroupBlock
0	Hydrogen	Nonmetal
1	Helium	Noble gas
2	Lithium	Alkali metal

We can also filter the data when we only want a subset of the data.

df[df["AtomicNumber"] < 10.0 ]# display the rows where the "AtomicNumber" is less than 10.0

	AtomicNumber	Symbol	Name	AtomicMass	CPKHexColor	ElectronConfiguration	Electronegativity	AtomicRadius	IonizationEnergy	ElectronAffinity	OxidationStates	StandardState	MeltingPoint	BoilingPoint	Density	GroupBlock	YearDiscovered
0	1	H	Hydrogen	1.008000	FFFFFF	1s1	2.20	120.0	13.598	0.754	+1, -1	Gas	13.81	20.28	0.000090	Nonmetal	1766
1	2	He	Helium	4.002600	D9FFFF	1s2	NaN	140.0	24.587	NaN	0	Gas	0.95	4.22	0.000179	Noble gas	1868
2	3	Li	Lithium	7.000000	CC80FF	[He]2s1	0.98	182.0	5.392	0.618	+1	Solid	453.65	1615.00	0.534000	Alkali metal	1817
3	4	Be	Beryllium	9.012183	C2FF00	[He]2s2	1.57	153.0	9.323	NaN	+2	Solid	1560.00	2744.00	1.850000	Alkaline earth metal	1798
4	5	B	Boron	10.810000	FFB5B5	[He]2s2 2p1	2.04	192.0	8.298	0.277	+3	Solid	2348.00	4273.00	2.370000	Metalloid	1808
5	6	C	Carbon	12.011000	909090	[He]2s2 2p2	2.55	170.0	11.260	1.263	+4, +2, -4	Solid	3823.00	4098.00	2.267000	Nonmetal	Ancient
6	7	N	Nitrogen	14.007000	3050F8	[He] 2s2 2p3	3.04	155.0	14.534	NaN	+5, +4, +3, +2, +1, -1, -2, -3	Gas	63.15	77.36	0.001251	Nonmetal	1772
7	8	O	Oxygen	15.999000	FF0D0D	[He]2s2 2p4	3.44	152.0	13.618	1.461	-2	Gas	54.36	90.20	0.001429	Nonmetal	1774
8	9	F	Fluorine	18.998403	90E050	[He]2s2 2p5	3.98	135.0	17.423	3.339	-1	Gas	53.53	85.03	0.001696	Halogen	1670

# display rows where AtomicNumber is greater than 10 and less than 20
df[(df["AtomicNumber"] > 10) & (df["AtomicNumber"] < 20)] 

	AtomicNumber	Symbol	Name	AtomicMass	CPKHexColor	ElectronConfiguration	Electronegativity	AtomicRadius	IonizationEnergy	ElectronAffinity	OxidationStates	StandardState	MeltingPoint	BoilingPoint	Density	GroupBlock	YearDiscovered
10	11	Na	Sodium	22.989769	AB5CF2	[Ne]3s1	0.93	227.0	5.139	0.548	+1	Solid	370.950	1156.00	0.970000	Alkali metal	1807
11	12	Mg	Magnesium	24.305000	8AFF00	[Ne]3s2	1.31	173.0	7.646	NaN	+2	Solid	923.000	1363.00	1.740000	Alkaline earth metal	1808
12	13	Al	Aluminum	26.981538	BFA6A6	[Ne]3s2 3p1	1.61	184.0	5.986	0.441	+3	Solid	933.437	2792.00	2.700000	Post-transition metal	Ancient
13	14	Si	Silicon	28.085000	F0C8A0	[Ne]3s2 3p2	1.90	210.0	8.152	1.385	+4, +2, -4	Solid	1687.000	3538.00	2.329600	Metalloid	1854
14	15	P	Phosphorus	30.973762	FF8000	[Ne]3s2 3p3	2.19	180.0	10.487	0.746	+5, +3, -3	Solid	317.300	553.65	1.820000	Nonmetal	1669
15	16	S	Sulfur	32.070000	FFFF30	[Ne]3s2 3p4	2.58	180.0	10.360	2.077	+6, +4, -2	Solid	388.360	717.75	2.067000	Nonmetal	Ancient
16	17	Cl	Chlorine	35.450000	1FF01F	[Ne]3s2 3p5	3.16	175.0	12.968	3.617	+7, +5, +1, -1	Gas	171.650	239.11	0.003214	Halogen	1774
17	18	Ar	Argon	39.900000	80D1E3	[Ne]3s2 3p6	NaN	188.0	15.760	NaN	0	Gas	83.800	87.30	0.001784	Noble gas	1894
18	19	K	Potassium	39.098300	8F40D4	[Ar]4s1	0.82	275.0	4.341	0.501	+1	Solid	336.530	1032.00	0.890000	Alkali metal	1807

Notice in the series called GroupBlock has repeating values. If we want to display how many different values are in that column, we can identify how many are .unique, and then show how many elements in the DataFrame are classified as Noble gasses using different methods.

# display the unique values in the "GroupBlock" column
print("Unique values in 'GroupBlock':")
print(df.GroupBlock.unique()) 
print()
# display the total count of unique values in the "GroupBlock" column
print("Total unique values in 'GroupBlock':")
print(df["GroupBlock"].nunique()) # nunique(): Returns the count of unique values.
print(len(df["GroupBlock"].unique())) # len(): Returns the count of unique values in the Series.
print()
# display the total count of rows where the "GroupBlock" is "Noble Gas" using the `shape` attribute
print("Total count of rows where 'GroupBlock' is 'Noble gas':")
print(df[df["GroupBlock"] == "Noble gas"].shape[0])
# display the total count of rows where the "GroupBlock" is "Noble Gas" using the length function
print(len(df[df["GroupBlock"] == "Noble gas"]))
# display the total count of rows where the "GroupBlock" is "Noble Gas" using the .sum()` method
print(df["GroupBlock"].eq("Noble gas").sum())
print()

#create a list of elements that are Noble Gases
noble_gases = df[df["GroupBlock"] == "Noble gas"]["Name"].tolist()
# display the list of Noble Gases
print("The noble gases are:", noble_gases)
# display the DataFrame filtered to show only rows where the "GroupBlock" is "Noble gas"
df[(df["GroupBlock"].eq("Noble gas"))] 

Unique values in 'GroupBlock':
['Nonmetal' 'Noble gas' 'Alkali metal' 'Alkaline earth metal' 'Metalloid'
 'Halogen' 'Post-transition metal' 'Transition metal' 'Lanthanide'
 'Actinide']

Total unique values in 'GroupBlock':
10
10

Total count of rows where 'GroupBlock' is 'Noble gas':
7
7
7

The noble gases are: ['Helium', 'Neon', 'Argon', 'Krypton', 'Xenon', 'Radon', 'Oganesson']

	AtomicNumber	Symbol	Name	AtomicMass	CPKHexColor	ElectronConfiguration	Electronegativity	AtomicRadius	IonizationEnergy	ElectronAffinity	OxidationStates	StandardState	MeltingPoint	BoilingPoint	Density	GroupBlock	YearDiscovered
1	2	He	Helium	4.00260	D9FFFF	1s2	NaN	140.0	24.587	NaN	0	Gas	0.95	4.22	0.000179	Noble gas	1868
9	10	Ne	Neon	20.18000	B3E3F5	[He]2s2 2p6	NaN	154.0	21.565	NaN	0	Gas	24.56	27.07	0.000900	Noble gas	1898
17	18	Ar	Argon	39.90000	80D1E3	[Ne]3s2 3p6	NaN	188.0	15.760	NaN	0	Gas	83.80	87.30	0.001784	Noble gas	1894
35	36	Kr	Krypton	83.80000	5CB8D1	[Ar]4s2 3d10 4p6	3.0	202.0	14.000	NaN	0	Gas	115.79	119.93	0.003733	Noble gas	1898
53	54	Xe	Xenon	131.29000	429EB0	[Kr]5s2 4d10 5p6	2.6	216.0	12.130	NaN	0	Gas	161.36	165.03	0.005887	Noble gas	1898
85	86	Rn	Radon	222.01758	428296	[Xe]6s2 4f14 5d10 6p6	NaN	220.0	10.745	NaN	0	Gas	202.00	211.45	0.009730	Noble gas	1900
117	118	Og	Oganesson	295.21600	NaN	[Rn]7s2 7p6 5f14 6d10 (predicted)	NaN	NaN	NaN	NaN	+6, +4, +2, +1, 0, -1	Expected to be a Gas	NaN	NaN	NaN	Noble gas	2006

Sometimes we find that there is data in columns that may be irrelevant to your data analysis. This column can be removed using the .drop() method. Looking at the data in the DataFrame above, we probably don’t need CPKHexColor column. Let’s remove that data.

df = df.drop('CPKHexColor', axis=1)  # drop the 'CPKHexColor' column from the DataFrame

df.head(1)

	AtomicNumber	Symbol	Name	AtomicMass	ElectronConfiguration	Electronegativity	AtomicRadius	IonizationEnergy	ElectronAffinity	OxidationStates	StandardState	MeltingPoint	BoilingPoint	Density	GroupBlock	YearDiscovered
0	1	H	Hydrogen	1.008	1s1	2.2	120.0	13.598	0.754	+1, -1	Gas	13.81	20.28	0.00009	Nonmetal	1766

Sorting Data#

Pandas provides the sort_values method to sort data in your series. The default is to sort by ascending value in a column. We can use a keyword argument in the sort_values method to set optional parameters. In this case using ascending=False sorts in desending order and ascending=True sorts in ascending order.

# sort the DataFrame by "AtomicNumber" in descending order (acending=True will sort in ascending order)
df.sort_values(by="AtomicNumber", ascending=False) 

	AtomicNumber	Symbol	Name	AtomicMass	ElectronConfiguration	Electronegativity	AtomicRadius	IonizationEnergy	ElectronAffinity	OxidationStates	StandardState	MeltingPoint	BoilingPoint	Density	GroupBlock	YearDiscovered
117	118	Og	Oganesson	295.216000	[Rn]7s2 7p6 5f14 6d10 (predicted)	NaN	NaN	NaN	NaN	+6, +4, +2, +1, 0, -1	Expected to be a Gas	NaN	NaN	NaN	Noble gas	2006
116	117	Ts	Tennessine	294.211000	[Rn]7s2 7p5 5f14 6d10 (predicted)	NaN	NaN	NaN	NaN	+5, +3, +1, -1	Expected to be a Solid	NaN	NaN	NaN	Halogen	2010
115	116	Lv	Livermorium	293.205000	[Rn]7s2 7p4 5f14 6d10 (predicted)	NaN	NaN	NaN	NaN	+4, +2, -2	Expected to be a Solid	NaN	NaN	NaN	Post-transition metal	2000
114	115	Mc	Moscovium	290.196000	[Rn]7s2 7p3 5f14 6d10 (predicted)	NaN	NaN	NaN	NaN	3, 1	Expected to be a Solid	NaN	NaN	NaN	Post-transition metal	2003
113	114	Fl	Flerovium	290.192000	[Rn]7s2 7p2 5f14 6d10 (predicted)	NaN	NaN	NaN	NaN	6, 4,2, 1, 0	Expected to be a Solid	NaN	NaN	NaN	Post-transition metal	1998
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
4	5	B	Boron	10.810000	[He]2s2 2p1	2.04	192.0	8.298	0.277	+3	Solid	2348.00	4273.00	2.370000	Metalloid	1808
3	4	Be	Beryllium	9.012183	[He]2s2	1.57	153.0	9.323	NaN	+2	Solid	1560.00	2744.00	1.850000	Alkaline earth metal	1798
2	3	Li	Lithium	7.000000	[He]2s1	0.98	182.0	5.392	0.618	+1	Solid	453.65	1615.00	0.534000	Alkali metal	1817
1	2	He	Helium	4.002600	1s2	NaN	140.0	24.587	NaN	0	Gas	0.95	4.22	0.000179	Noble gas	1868
0	1	H	Hydrogen	1.008000	1s1	2.20	120.0	13.598	0.754	+1, -1	Gas	13.81	20.28	0.000090	Nonmetal	1766

118 rows × 16 columns

df.sort_values(by="AtomicRadius")

	AtomicNumber	Symbol	Name	AtomicMass	ElectronConfiguration	Electronegativity	AtomicRadius	IonizationEnergy	ElectronAffinity	OxidationStates	StandardState	MeltingPoint	BoilingPoint	Density	GroupBlock	YearDiscovered
0	1	H	Hydrogen	1.008000	1s1	2.20	120.0	13.598	0.754	+1, -1	Gas	13.81	20.28	0.000090	Nonmetal	1766
8	9	F	Fluorine	18.998403	[He]2s2 2p5	3.98	135.0	17.423	3.339	-1	Gas	53.53	85.03	0.001696	Halogen	1670
29	30	Zn	Zinc	65.400000	[Ar]4s2 3d10	1.65	139.0	9.394	NaN	+2	Solid	692.68	1180.00	7.134000	Transition metal	1746
1	2	He	Helium	4.002600	1s2	NaN	140.0	24.587	NaN	0	Gas	0.95	4.22	0.000179	Noble gas	1868
28	29	Cu	Copper	63.550000	[Ar]4s1 3d10	1.90	140.0	7.726	1.228	+2, +1	Solid	1357.77	2835.00	8.933000	Transition metal	Ancient
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
113	114	Fl	Flerovium	290.192000	[Rn]7s2 7p2 5f14 6d10 (predicted)	NaN	NaN	NaN	NaN	6, 4,2, 1, 0	Expected to be a Solid	NaN	NaN	NaN	Post-transition metal	1998
114	115	Mc	Moscovium	290.196000	[Rn]7s2 7p3 5f14 6d10 (predicted)	NaN	NaN	NaN	NaN	3, 1	Expected to be a Solid	NaN	NaN	NaN	Post-transition metal	2003
115	116	Lv	Livermorium	293.205000	[Rn]7s2 7p4 5f14 6d10 (predicted)	NaN	NaN	NaN	NaN	+4, +2, -2	Expected to be a Solid	NaN	NaN	NaN	Post-transition metal	2000
116	117	Ts	Tennessine	294.211000	[Rn]7s2 7p5 5f14 6d10 (predicted)	NaN	NaN	NaN	NaN	+5, +3, +1, -1	Expected to be a Solid	NaN	NaN	NaN	Halogen	2010
117	118	Og	Oganesson	295.216000	[Rn]7s2 7p6 5f14 6d10 (predicted)	NaN	NaN	NaN	NaN	+6, +4, +2, +1, 0, -1	Expected to be a Gas	NaN	NaN	NaN	Noble gas	2006

118 rows × 16 columns

Notice in the above DataFrame display that after sorting by Atomic Radius, the tail 5 entries have NaN displayed. This is Not a Number (NaN) and in pandas represents missing data.

If you want to identify NaN values, you can use the isna() method. Identifying the NaN values then allows us to remove them from the DataFrame. This is accomplished through the dropna() method. It drops any rows that contain NaN values. This looks for values that are False or not True.

Alternatively, you could identify only True values by using the notna() method.

In the following code cells identify which Rows have NaN values for Atomic Radius and then we will create a new DataFrame that has only Atomic Number and Atomic Radii without with NaN values.

 # display the number of NaN values in each column
print(df.isna().sum()) 
print()
# display the number of NaN values in only the "Atomic Radius" column
print("The number of rows where Atomic Radius is NaN is:",df.AtomicRadius.isna().sum()) 

AtomicNumber              0
Symbol                    0
Name                      0
AtomicMass                0
ElectronConfiguration     0
Electronegativity        23
AtomicRadius             19
IonizationEnergy         16
ElectronAffinity         61
OxidationStates           1
StandardState             0
MeltingPoint             15
BoilingPoint             25
Density                  22
GroupBlock                0
YearDiscovered            0
dtype: int64

The number of rows where Atomic Radius is NaN is: 19

print(df.notna().sum())  # display the number of non-NaN values in each column

AtomicNumber             118
Symbol                   118
Name                     118
AtomicMass               118
ElectronConfiguration    118
Electronegativity         95
AtomicRadius              99
IonizationEnergy         102
ElectronAffinity          57
OxidationStates          117
StandardState            118
MeltingPoint             103
BoilingPoint              93
Density                   96
GroupBlock               118
YearDiscovered           118
dtype: int64

# Identify and print all rows where AtomicRadius is NaN
df[df['AtomicRadius'].isna()]

	AtomicNumber	Symbol	Name	AtomicMass	ElectronConfiguration	Electronegativity	AtomicRadius	IonizationEnergy	ElectronAffinity	OxidationStates	StandardState	MeltingPoint	BoilingPoint	Density	GroupBlock	YearDiscovered
99	100	Fm	Fermium	257.09511	[Rn] 5f12 7s2	1.3	NaN	6.50	NaN	+3	Solid	1800.0	NaN	NaN	Actinide	1952
100	101	Md	Mendelevium	258.09843	[Rn]7s2 5f13	1.3	NaN	6.58	NaN	+3, +2	Solid	1100.0	NaN	NaN	Actinide	1955
101	102	No	Nobelium	259.10100	[Rn]7s2 5f14	1.3	NaN	6.65	NaN	+3, +2	Solid	1100.0	NaN	NaN	Actinide	1957
102	103	Lr	Lawrencium	266.12000	[Rn]7s2 5f14 6d1	1.3	NaN	NaN	NaN	+3	Solid	1900.0	NaN	NaN	Actinide	1961
103	104	Rf	Rutherfordium	267.12200	[Rn]7s2 5f14 6d2	NaN	NaN	NaN	NaN	+4	Solid	NaN	NaN	NaN	Transition metal	1964
104	105	Db	Dubnium	268.12600	[Rn]7s2 5f14 6d3	NaN	NaN	NaN	NaN	5, 4, 3	Solid	NaN	NaN	NaN	Transition metal	1967
105	106	Sg	Seaborgium	269.12800	[Rn]7s2 5f14 6d4	NaN	NaN	NaN	NaN	6, 5, 4, 3, 0	Solid	NaN	NaN	NaN	Transition metal	1974
106	107	Bh	Bohrium	270.13300	[Rn]7s2 5f14 6d5	NaN	NaN	NaN	NaN	7, 5, 4, 3	Solid	NaN	NaN	NaN	Transition metal	1976
107	108	Hs	Hassium	269.13360	[Rn]7s2 5f14 6d6	NaN	NaN	NaN	NaN	8, 6, 5, 4, 3, 2	Solid	NaN	NaN	NaN	Transition metal	1984
108	109	Mt	Meitnerium	277.15400	[Rn]7s2 5f14 6d7 (calculated)	NaN	NaN	NaN	NaN	9, 8, 6, 4, 3, 1	Solid	NaN	NaN	NaN	Transition metal	1982
109	110	Ds	Darmstadtium	282.16600	[Rn]7s2 5f14 6d8 (predicted)	NaN	NaN	NaN	NaN	8, 6, 4, 2, 0	Expected to be a Solid	NaN	NaN	NaN	Transition metal	1994
110	111	Rg	Roentgenium	282.16900	[Rn]7s2 5f14 6d9 (predicted)	NaN	NaN	NaN	NaN	5, 3, 1, -1	Expected to be a Solid	NaN	NaN	NaN	Transition metal	1994
111	112	Cn	Copernicium	286.17900	[Rn]7s2 5f14 6d10 (predicted)	NaN	NaN	NaN	NaN	2, 1, 0	Expected to be a Solid	NaN	NaN	NaN	Transition metal	1996
112	113	Nh	Nihonium	286.18200	[Rn]5f14 6d10 7s2 7p1 (predicted)	NaN	NaN	NaN	NaN	NaN	Expected to be a Solid	NaN	NaN	NaN	Post-transition metal	2004
113	114	Fl	Flerovium	290.19200	[Rn]7s2 7p2 5f14 6d10 (predicted)	NaN	NaN	NaN	NaN	6, 4,2, 1, 0	Expected to be a Solid	NaN	NaN	NaN	Post-transition metal	1998
114	115	Mc	Moscovium	290.19600	[Rn]7s2 7p3 5f14 6d10 (predicted)	NaN	NaN	NaN	NaN	3, 1	Expected to be a Solid	NaN	NaN	NaN	Post-transition metal	2003
115	116	Lv	Livermorium	293.20500	[Rn]7s2 7p4 5f14 6d10 (predicted)	NaN	NaN	NaN	NaN	+4, +2, -2	Expected to be a Solid	NaN	NaN	NaN	Post-transition metal	2000
116	117	Ts	Tennessine	294.21100	[Rn]7s2 7p5 5f14 6d10 (predicted)	NaN	NaN	NaN	NaN	+5, +3, +1, -1	Expected to be a Solid	NaN	NaN	NaN	Halogen	2010
117	118	Og	Oganesson	295.21600	[Rn]7s2 7p6 5f14 6d10 (predicted)	NaN	NaN	NaN	NaN	+6, +4, +2, +1, 0, -1	Expected to be a Gas	NaN	NaN	NaN	Noble gas	2006

# Create a new DataFrame with AtomicNumber and AtomicRadius, dropping rows where AtomicRadius is NaN
df_atomic_radius_dropped = df[['AtomicNumber', 'AtomicRadius']].dropna(subset=['AtomicRadius'])
df_atomic_radius_dropped

	AtomicNumber	AtomicRadius
0	1	120.0
1	2	140.0
2	3	182.0
3	4	153.0
4	5	192.0
...	...	...
94	95	244.0
95	96	245.0
96	97	244.0
97	98	245.0
98	99	245.0

99 rows × 2 columns

Check Your Understanding

In the cell block below, write code to generate a new DataFrame called df_drop_density that only has Names and Density values that have values.

#write your Check your Understanding code here

Working with data files#

Once you have maninpulated your data, it is a good idea to save this data as a comma separated values (.csv) file. This can be accessed later and has all of your data cleaned for the next project. For example, you now have a DataFrame that is just atomic number and atomic radius. You can write that data to a new csv file using the to_csv() method.

Your DataFrame not only has columns(series) of data, but also index numbers. Those index numbers may not be meaningful in other contexts. When you save the data as csv, you can choose to include the index value. The default is to include index, but setting an argument to index=False, you will exclude the index an only include the column data.

df_atomic_radius_dropped.to_csv('atomic_radius.csv') 
df_atomic_radius_dropped.to_csv('atomic_radius_noindex.csv', index=False) 

You can also upload that data from your file into a new DataFrame. One thing to note is that you need to specify its location. In this example, you wrote the csv file into the same directory as the original jupyter notebook, so you don’t need to specify where it is located.

df_fromfile = pd.read_csv("atomic_radius_noindex.csv")
df_fromfile.head(3)

	AtomicNumber	AtomicRadius
0	1	120.0
1	2	140.0
2	3	182.0

Check Your Understanding

You have a DataFrame that loaded in your saved atomic_radius_noindex.csv file.

Load the atomic_radius.csv file into another DataFrame called df_index.
Display the first three rows of data.
If there is an unnamed column, remove that column from df_index in another code cell.

# Write your code here

Solution

df_index = pd.read_csv("atomic_radius.csv")
df_index.head(3)

New Code cell

df_index = df_index.drop(columns=["Unnamed: 0"], axis =1)

In the next notebook, we will be accessing data from a website. The data is provided as space delimited file meaning the column data is separted by spaces rather than commas. When we load the data we can specify the separation of data with the sep = " " argument.

The data from this website also does not have column names for the data. We can specify those names as we bring the data in.

colnames = ['smiles','dat', 'id']
df_act = pd.read_csv("https://dude.docking.org/targets/aa2ar/actives_final.ism", sep=" ", names=colnames)
df_act

	smiles	dat	id
0	C#CCOc3nc(c1ccccc1)nc4sc2CCCc2c34	630004	CHEMBL1087078
1	OC[C@H]3OC(n1cnc2c(NCCS(O)(=O)=O)ncnc12)[C@H](...	607385	CHEMBL610435
2	CCNC(=O)[C@H]4OC(n2cnc3c(N)nc(C#Cc1ccc(CCC(O)=...	605505	CHEMBL605065
3	Ic5cccc(n1c(=S)sc3c1ncn4nc(c2ccco2)nc34)c5	623098	CHEMBL1093479
4	C=CCn1c(=S)sc3c1ncn4nc(c2ccco2)nc34	623027	CHEMBL1092271
...	...	...	...
477	CCNC(=O)[C@H]4OC(n2cnc3c(N)nc(C#Cc1nccs1)nc23)...	605440	CHEMBL605265
478	CCCn4c(=O)n(CCC)c3cc(c2ccc(COC(=O)Nc1ccccc1)cc...	479497	CHEMBL466021
479	Nc4nc(N2CCN(c1c(F)cc(F)cc1F)CC2)nc5nc(c3ccco3)...	309346	CHEMBL360619
480	Cc4ccc(\C=N\Nc3nc(N)c2ncn(C1O[C@H](CO)[C@@H](O...	606942	CHEMBL611571
481	Cc3sc2nc(c1ccccc1)n(CC#C)c(=O)c2c3C	629987	CHEMBL1088398

482 rows × 3 columns

Once we have a DataFrame, we can also save that DataFrame locally as a csv file to load later.

#write the DataFrame to a CSV file named 'aa2ar_actives.csv'
df_act.to_csv('aa2ar_actives.csv') 

Joining two DataFrames#

A common operation when working with data is joining two DataFrames together. This is especially useful when information about a common set of molecules is stored in two different datasets. For example, you might have one DataFrame that has SMILES strings and names and a second that has names and experimental data. The .join() method combines the two DataFrames together using an index value that allows you to analyze all relevant data in a single DataFrame.

DataFrame1 = pd.DataFrame({
    'xLogP3': [1.2, 3.5, 3.3],
    'MolecularWeight': [180.16, 206.28, 230.26]
}, index=['Aspirin', 'Ibuprofen', 'Naproxen'])

DataFrame2 = pd.DataFrame({
    'TPSA': [63.6, 37.3, 46.5],
    'SMILES': ['CC(=O)OC1=CC=CC=C1C(=O)O', 'CC(C)CC1=CC=C(C=C1)C(C)C(=O)O', 'C[C@@H](C1=CC2=C(C=C1)C=C(C=C2)OC)C(=O)O']
}, index=['Aspirin', 'Ibuprofen', 'Naproxen'])

# since these are small DataFrames, we can display them side by side in a Jupyter Notebook using HTML
from IPython.display import display, HTML

df1_html = DataFrame1.to_html()
df2_html = DataFrame2.to_html()

display(HTML(f"""
<table>
<tr>
    <th><H3>DataFrame1</H3></th>
    <th><H3>DataFrame2</H3></th>
</tr>
<tr>
    <td>{df1_html}</td>
    <td style="padding-left: 30px;">{df2_html}</td>
</tr>
</table>
"""))

DataFrame1

DataFrame2

	xLogP3	MolecularWeight
Aspirin	1.2	180.16
Ibuprofen	3.5	206.28
Naproxen	3.3	230.26

	TPSA	SMILES
Aspirin	63.6	CC(=O)OC1=CC=CC=C1C(=O)O
Ibuprofen	37.3	CC(C)CC1=CC=C(C=C1)C(C)C(=O)O
Naproxen	46.5	C[C@@H](C1=CC2=C(C=C1)C=C(C=C2)OC)C(=O)O

DataFramecombined = DataFrame1.join(DataFrame2)
DataFramecombined

	xLogP3	MolecularWeight	TPSA	SMILES
Aspirin	1.2	180.16	63.6	CC(=O)OC1=CC=CC=C1C(=O)O
Ibuprofen	3.5	206.28	37.3	CC(C)CC1=CC=C(C=C1)C(C)C(=O)O
Naproxen	3.3	230.26	46.5	C[C@@H](C1=CC2=C(C=C1)C=C(C=C2)OC)C(=O)O

Pandas is quite powerful and we have only scratched the surface. To get more savvy with pandas, you can access tutorials available at https://pandas.pydata.org/pandas-docs/version/0.15/tutorials.html

Homework

Familiarize yourself with the DUD-E (Directory of Useful Decoys, Enhanced) data sets (http://dude.docking.org/), which contain known actives and inactives for 102 protein targets.The DUD-E sets are widely used in benchmarking studies that compare the performance of different virtual screening approaches (https://doi.org/10.1021/jm300687e).

Go to the DUD-E target page (http://dude.docking.org/targets) and find muscle glycogen phosphorylase (Target Name: PYGM, PDB ID: 1c8k) from the target list. Clicking the target name “PYGM” directs you to the page that lists various files (http://dude.docking.org/targets/pygm). Download file “actives_final.ism”, which contains the SMILES strings of known actives. Rename the file name as “pygm_1c8k_actives.ism”. [Open the file in a text viewer/editor to check the format of this file].

Identify the data what appropriate column names should be.

Load the pygm_1c8k_actives.ism data into a DataFrame with appropriate column labels.
Create a variable that has the number of rows in the DataFrame.
Create a variable that has the number of columns in the DataFrame.
Identify how many different structures are included in the dataset.
Save this file locally as a .csv file with name: pygm_1c8k_actives_clean.csv without DataFrame index values.

You will be using this .csv file in the next activity.

# Write your code here.

colnames = ['smiles','DUD-E ID', 'CHEMBL_ID']
df_pygm_actives = pd.read_csv("https://dude.docking.org/targets/pygm/actives_final.ism", sep=" ", names=colnames)
# Create a variable that has the number of rows in the DataFrame.
num_rows = df_pygm_actives.shape[0]  # or len(df_pygm_actives)
# Create a variable that has the number of columns in the DataFrame.
num_cols = df_pygm_actives.shape[1]  # or df_pygm_actives.columns.size

# Identify number of unique SMILES strings in the df_pygm_actives DataFrame
print(len(df_pygm_actives.smiles.unique()))
# Save this file locally as a .csv file with name: pygm_1c8k_actives_clean.csv
df_pygm_actives.to_csv("pygm_1c8k_actives_clean.csv", index=False)