Learning Outcomes
By the end of this section, you should be able to
- 1.5.1 Load data to Python.
- 1.5.2 Perform basic data analysis using Python.
- 1.5.3 Use visualization principles to graphically plot data using Python.
Multiple tools are available for writing and executing Python programs. Jupyter Notebook is one convenient and user-friendly tool. The next section explains how to set up the Jupyter Notebook environment using Google Colaboratory (Colab) and then provides the basics of two open-source Python libraries named Pandas and Matplotlib. These libraries are specialized for data analysis and data visualization, respectively.
Exploring Further
Python Programming
In the discussion below, we assume you are familiar with basic Python syntax and know how to write a simple program using Python. If you need a refresher on the basics, please refer to Das, U., Lawson, A., Mayfield, C., & Norouzi, N. (2024). Introduction to Python Programming. OpenStax. https://openstax.org/books/introduction-python-programming/pages/1-introduction.
Jupyter Notebook on Google Colaboratory
Jupyter Notebook is a web-based environment that allows you to run a Python program more interactively, combining programming code, math equations, visualizations, and plain text. Multiple web applications and software packages can edit a Jupyter Notebook, but in this textbook we will use Google’s free application named Google Colaboratory, often abbreviated as Colab. It is a cloud-based platform, which means that you can open, edit, run, and save a Jupyter Notebook on your Google Drive.
Setting up Colab is simple. On your Google Drive, click New > More. If Colab has already been installed on your Google Drive, you will see Colaboratory under More. If not, click “Connect more apps” and install Colab by searching for “Colaboratory” in the app store (Figure 1.15). For further information, see the Google Colaboratory Ecosystem animation.
Now click New > More > Google Colaboratory. A new, empty Jupyter Notebook will show up as in Figure 1.16.
The gray area with the play button is called a cell. A cell is a block where you can type either code or plain text. Notice that there are two buttons on top of the first cell—“+ Code” and “+ Text.” These two buttons add a code or text cell, respectively. A code cell is for the code you want to run; a text cell is to add any text description or note.
Let’s run a Python program on Colab. Type the following code in a code cell.
Python Code
print ("hello world!")
The resulting output will look like this:
hello world!
You can write a Python program across multiple cells and put text cells in between. Colab treats all the code cells as part of a single program, running from the top to the bottom of the current Jupyter Notebook. For example, the two code cells below run as if they were a single program.
When running one cell at a time from the top, we see the following outputs under each cell.
Python Code
a = 1
print ("The a value in the first cell:", a)
The resulting output will look like this:
The a value in the first cell: 1
Python Code
b = 3
print ("a in the second cell:", a)
print ("b in the second cell:", b)
a + b
The resulting output will look like this:
a in the second cell: 1
b in the second cell: 3
4
Conventional Python versus Jupyter Notebook Syntax
While conventional Python requires print() to print something to the program console, Jupyter Notebook does not. On Jupyter Notebook, the line a+b instead of print(a+b) also prints the value of a+b as an output. But keep in mind that if multiple lines of code in a cell would each display a value this way, only the output from the last line will show.
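The display rule above can be tried in a single code cell. The following is a minimal sketch; the variable name total is introduced here only for illustration. If the same lines are run as a plain Python script rather than in a notebook, only the print() line produces output.

```python
a = 1
b = 3

a + b                     # evaluated, but a notebook does not display it: it is not the last line

total = a + b
print("a + b =", total)   # print() always writes its output, wherever it appears in the cell

a * b                     # last line of the cell: a notebook displays this value
```

When you need to see several values from one cell, wrapping each of them in print() is the safer habit.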
You can also run multiple cells in bulk. Click Runtime on the menu, and you will see there are multiple ways of running multiple cells at once (Figure 1.17). The two commonly used ones are “Run all” and “Run before.” “Run all” runs all the cells in order from the top; “Run before” runs all the cells before the currently selected one.
One thing to keep in mind is that being able to split a long program into multiple blocks and run one block at a time raises the chance of user error. Let’s look at a modified version of the code from the previous example.
Python Code
a = 1
print ("the value in the first cell:", a)
The resulting output will look like this:
the value in the first cell: 1
Python Code
b = 3
print ("a in the second cell:", a)
print ("b in the second cell:", b)
a + b
The resulting output will look like this:
a in the second cell: 1
b in the second cell: 3
4
Python Code
a = 2
a + b
The resulting output will look like this:
5
The modified code has an additional cell at the end, updating a from 1 to 2. Notice that a+b now returns 5 because a has been changed to 2. Now suppose you need to rerun the second cell for some reason.
Python Code
a = 1
print ("the a value in the first cell:", a)
The resulting output will look like this:
the a value in the first cell: 1
Python Code
b = 3
print ("a in the second cell:", a)
print ("b in the second cell:", b)
a + b
The resulting output will look like this:
a in the second cell: 2
b in the second cell: 3
5
Python Code
a = 2
a + b
The resulting output will look like this:
5
The value of a printed by the second cell has changed to 2. This implies that the execution order of the cells matters! If you run the third cell before the second cell, a will hold the value from the third cell even though the third cell is located below the second cell. Therefore, it is recommended to use “Run all” or “Run before” after you make changes across multiple cells of code. This way your code is guaranteed to run sequentially from the top.
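The effect of execution order can also be reproduced outside a notebook. The sketch below is only an illustration, not how Colab is implemented: it stores the three cells as strings in a hypothetical list named cells and runs them with Python's built-in exec() against one shared dictionary, which stands in for the notebook's memory. The names run_cells, second_cell_sum, and third_cell_sum are made up for this example.

```python
# Each string plays the role of one code cell from the example above.
cells = [
    "a = 1",
    "b = 3\nsecond_cell_sum = a + b",
    "a = 2\nthird_cell_sum = a + b",
]

def run_cells(order):
    """Run the cells in the given order, sharing variables like a notebook does."""
    namespace = {}
    for index in order:
        exec(cells[index], namespace)
    return namespace

top_to_bottom = run_cells([0, 1, 2])       # like "Run all"
rerun_second = run_cells([0, 1, 2, 1])     # run all, then run the second cell again

print(top_to_bottom["second_cell_sum"])    # 4, because a was still 1
print(rerun_second["second_cell_sum"])     # 5, because a had already been set to 2
```

The second cell's code never changed, yet its result did, because the variable a carried over from the cell that ran most recently.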
Python Pandas
One of the strengths of Python is that it includes a variety of free, open-source libraries. Libraries are a set of already-implemented methods that a programmer can refer to, allowing a programmer to avoid building common functions from scratch.
Pandas is a Python library specialized for data manipulation and analysis, and it is very commonly used among data scientists. It offers a variety of ready-made methods that data scientists can apply quickly to data analysis. You will learn how to analyze data using Pandas throughout this textbook.
Colab already has Pandas installed, so you just need to import Pandas and you are set to use all the methods in Pandas. Note that it is convention to abbreviate pandas to pd so that when you call a method from Pandas, you can do so by using pd instead of having to type out pandas every time. It offers a bit of convenience for a programmer!
Python Code
# import Pandas and assign an abbreviated identifier "pd"
import pandas as pd
Exploring Further
Installing Pandas on Your Computer
If you wish to install Pandas on your own computer, refer to the installation page of the Pandas website.
Load Data Using Python Pandas
The first step for data analysis is to load the data of your interest to your Notebook. Let’s create a folder on Google Drive where you can keep a CSV file for the dataset and a Notebook for data analysis. Download a public dataset, ch1-movieprofit.csv, and store it in a Google Drive folder. Then open a new Notebook in that folder by entering that folder and clicking New > More > Google Colaboratory.
Open the Notebook and allow it to access files in your Google Drive by following these steps:
First, click the Files icon on the side tab (Figure 1.18).
Then click the Mount Drive icon (Figure 1.19) and select “Connect to Google Drive” on the pop-up window.
Notice that a new cell has been inserted on the Notebook as a result (Figure 1.20).
Connect your Google Drive by running the cell, and now your Notebook file can access all the files under content/drive. Navigate folders under drive to find your Notebook and ch1-movieprofit.csv files. Then click “…” > Copy Path (Figure 1.21).
Now replace [Path] with the copied path in the code below. Run the code and you will see that the dataset has been loaded as a table and stored as a Python variable named data.
Python Code
# import Pandas and assign an abbreviated identifier "pd"
import pandas as pd
data = pd.read_csv("[Path]")
data
The resulting output will look like this:
The read_csv() method in Pandas loads a CSV file and stores it as a DataFrame. A DataFrame is a data type that Pandas uses to store multi-column tabular data. Therefore, the variable data holds the table in ch1-movieprofit.csv in the form of a Pandas DataFrame.
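If the path is mistyped or Google Drive is not mounted, read_csv() raises a FileNotFoundError. The sketch below shows one possible way to check the path first. The helper name load_csv_if_present and the folder in the example path are hypothetical; replace the path with the one you copied in Colab.

```python
from pathlib import Path

import pandas as pd

def load_csv_if_present(path_text):
    """Return a DataFrame for the CSV at path_text, or None if no file is there."""
    path = Path(path_text)
    if not path.is_file():
        print("No file found at:", path)
        return None
    return pd.read_csv(path)

# Hypothetical location; replace it with your own copied path.
data = load_csv_if_present("/content/drive/MyDrive/example-folder/ch1-movieprofit.csv")
```

If the message appears, check that the Drive is mounted and that the path was copied from the Files panel rather than typed by hand.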
DataFrame versus Series
Pandas defines two data types for tabular data: DataFrame and Series. While DataFrame is used for multi-column tabular data, Series is used for single-column data. Many methods in Pandas support both DataFrame and Series, but some are only for one or the other. It is always good to check if the method you are using works as you expect. For more information, refer to the Pandas documentation or Das, U., Lawson, A., Mayfield, C., & Norouzi, N. (2024). Introduction to Python Programming. OpenStax. https://openstax.org/books/introduction-python-programming/pages/1-introduction.
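The difference between the two types can be checked directly with type(). The sketch below uses a tiny made-up table read from a string, so it runs without any file; the titles and numbers are invented for illustration and do not come from ch1-movieprofit.csv.

```python
import io

import pandas as pd

# A made-up three-row CSV, used only to illustrate the two data types.
csv_text = """Title,Year,US_Gross_Million
Movie A,2001,120.5
Movie B,2010,87.25
Movie C,2015,301.0
"""

data = pd.read_csv(io.StringIO(csv_text))

one_column = data["US_Gross_Million"]     # a single column name gives a Series
two_columns = data[["Title", "Year"]]     # a list of column names gives a DataFrame

print(type(data).__name__)
print(type(one_column).__name__)
print(type(two_columns).__name__)
```

Selecting with a list of names keeps the result a DataFrame, even when the list holds only one name.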
Example 1.9
Problem
Remember the Iris dataset we used in Data and Datasets? Load the dataset ch1-iris.csv to a Python program using Pandas.
Solution
The following code loads the ch1-iris.csv that is stored in a Google Drive. Make sure to replace the path with the actual path to ch1-iris.csv on your Google Drive.
Python Code
import pandas as pd
data = pd.read_csv("[Path to ch1-iris.csv]") # Replace the path
data
The resulting output will look like this:
Exploring Further
Can I load a file that is uploaded to someone else’s Google Drive and shared with me?
Yes! This is useful especially when your Google Drive runs out of space. Simply add the shortcut of the shared file to your own drive. Right-click > Organize > Add Shortcut will let you select where to store the shortcut. Once done, you can call pd.read_csv() using the path of the shortcut.
Summarize Data Using Python Pandas
You can compute basic statistics for data quite quickly by using the DataFrame.describe() method. Add and run the following code in a new cell. It calls the describe() method on data, the DataFrame we defined earlier with ch1-movieprofit.csv.
Python Code
data = pd.read_csv("[Path to ch1-movieprofit.csv]")
data.describe()
The resulting output will look like this:
describe() returns a table whose columns are a subset of the columns in the entire dataset (by default, the numeric ones) and whose rows are different statistics. The statistics include the number of non-missing values in a column (count), mean (mean), standard deviation (std), minimum and maximum values (min/max), and different quartiles (25%/50%/75%), which you will learn about in Measures of Variation. Using this representation, you can compute such statistics of different columns easily.
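To see what describe() reports, it can help to run it on a table small enough to check by hand. The sketch below uses a made-up four-row table with one missing rating; the column names and values are invented for illustration. In the summary, count is the number of non-missing entries in each column.

```python
import pandas as pd

# Made-up data: one text column and one numeric column with a missing value.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "rating": [7.0, 8.0, None, 9.0],
})

summary = toy.describe()

print(list(summary.columns))            # only the numeric column is summarized by default
print(summary.loc["count", "rating"])   # 3.0: the missing rating is not counted
print(summary.loc["mean", "rating"])    # 8.0: the mean of 7, 8, and 9
```

Because summary is itself a DataFrame, individual statistics can be read with summary.loc[row, column] as shown.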
Example 1.10
Problem
Using describe(), summarize the IRIS dataset, ch1-iris.csv, that you loaded in the previous example.
Solution
The following code in a new cell returns the summary of the dataset.
Python Code
data = pd.read_csv("[Path to ch1-iris.csv]")
data.describe()
The resulting output will look like this:
Select Data Using Python Pandas
The Pandas DataFrame allows a programmer to use the column name itself when selecting a column. For example, the following code prints all the values in the “US_Gross_Million” column in the form of a Series (remember that the data from a single column is stored in the Series type in Pandas).
Python Code
data = pd.read_csv("[Path to ch1-movieprofit.csv]")
data["US_Gross_Million"]
The resulting output will look like this:
0 760.51
1 858.37
2 659.33
3 936.66
4 678.82
...
961 77.22
962 177.20
963 102.31
964 106.89
965 75.47
Name: US_Gross_Million, Length: 966, dtype: float64
DataFrame.iloc[] enables a more powerful selection: it lets a programmer select by both column and row, using column and row indices. Let’s look at some code examples below.
Python Code
data.iloc[:, 2] # select all values in the third column (index 2)
The resulting output will look like this:
0 2009
1 2019
2 1997
3 2015
4 2018
...
961 2010
962 1982
963 1993
964 1999
965 2017
Name: Year, Length: 966, dtype: object
Python Code
data.iloc[2,:] # select all values in the third row
The resulting output will look like this:
Unnamed: 0 3
Title Titanic
Year 1997
Genre Drama
Rating 7.9
Duration 194
US_Gross_Million 659.33
Worldwide_Gross_Million 2201.65
Votes 1,162,142
Name: 2, dtype: object
To pinpoint a specific value within the “US_Gross_Million” column, you can use an index number.
Python Code
print (data["US_Gross_Million"][0]) # index 0 refers to the top row
print (data["US_Gross_Million"][2]) # index 2 refers to the third row
The resulting output will look like this:
760.51
659.33
You can also use DataFrame.iloc[] to select a specific group of cells on the table. The example code below shows different ways of using iloc[]. There are many ways of using iloc[], and this chapter introduces only a couple of common ones. You will learn more techniques for working with data throughout this textbook.
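Positional selection is easiest to follow on a table small enough to read at a glance. The sketch below builds a made-up four-row table (the titles and numbers are invented for illustration) and applies the same iloc[] patterns. Indices start at 0, and a slice such as 0:3 stops before index 3.

```python
import pandas as pd

# Made-up table used only to illustrate positional selection.
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Year": [2001, 2010, 2015, 2020],
    "Rating": [7.1, 6.4, 8.0, 5.9],
    "Duration": [100, 95, 130, 88],
})

second_column = toy.iloc[:, 1]          # all rows, column at index 1 ("Year")
third_row = toy.iloc[2, :]              # row at index 2, all columns
block = toy.iloc[[1, 3], [2, 3]]        # rows 1 and 3, columns 2 and 3
first_three = toy.iloc[0:3, 0:2]        # rows 0 to 2, columns 0 to 1

print(second_column.name)
print(third_row["Title"])
print(block)
print(first_three.shape)
```

A single row or column comes back as a Series, while a selection of several rows and columns comes back as a DataFrame.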
Python Code
data.iloc[:, 1] # select all values in the second column (index 1)
The resulting output will look like this:
0 Avatar
1 Avengers: Endgame
2 Titanic
3 Star Wars: Episode VII - The Force Awakens
4 Avengers: Infinity War
...
961 The A-Team
962 Tootsie
963 In the Line of Fire
964 Analyze This
965 The Hitman's Bodyguard
Name: Title, Length: 966, dtype: object
Python Code
data.iloc[[1, 3], [2, 3]]
# select the rows at index 1 and 3, the columns at index 2 and 3
The resulting output will look like this:
Example 1.11
Problem
Select the “sepal_width” column of the IRIS dataset using the column name.
Solution
The following code in a new cell returns the “sepal_width” column.
Python Code
data = pd.read_csv("[Path to ch1-iris.csv]")
data["sepal_width"]
The resulting output will look like this:
0 3.5
1 3.0
2 3.2
3 3.1
4 3.6
...
145 3.0
146 2.5
147 3.0
148 3.4
149 3.0
Name: sepal_width, Length: 150, dtype: float64
Example 1.12
Problem
Select the “petal_length” column of the IRIS dataset using iloc[].
Solution
The following code in a new cell returns the “petal_length” column.
Python Code
data.iloc[:, 2]
The resulting output will look like this:
0 1.4
1 1.4
2 1.3
3 1.5
4 1.4
...
145 5.2
146 5.0
147 5.2
148 5.4
149 5.1
Name: petal_length, Length: 150, dtype: float64
Search Data Using Python Pandas
To search for data entries that fulfill specific criteria (i.e., filter), you can use DataFrame.loc[] of Pandas. When you indicate the filtering criteria inside the brackets, [], the output returns the filtered rows within the DataFrame. For example, the code below keeps only the rows whose genre is comedy. Notice that the output only has 307 out of the full 966 rows. You can check the output on your own, and you will see their Genre values are all “Comedy.”
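It may help to see what the criterion inside the brackets evaluates to. The sketch below uses a made-up four-row table (titles and genres are invented for illustration): the comparison produces a Series of True/False values, one per row, and loc[] keeps the rows marked True. The name mask is just a label chosen for this example.

```python
import pandas as pd

# Made-up table used only to illustrate filtering.
toy = pd.DataFrame({
    "Title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "Genre": ["Comedy", "Drama", "Comedy", "Action"],
})

mask = toy["Genre"] == "Comedy"     # one True/False value per row
print(mask.tolist())                # [True, False, True, False]

comedies = toy.loc[mask]            # keep only the rows where mask is True
print(len(comedies), "of", len(toy), "rows kept")
print(comedies.index.tolist())      # [0, 2]
```

The filtered rows keep their original index labels, which is why the result is labeled 0 and 2 rather than 0 and 1.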
Python Code
data = pd.read_csv("[Path to ch1-movieprofit.csv]")
data.loc[data['Genre'] == 'Comedy']
The resulting output will look like this:
Example 1.13
Problem
Using DataFrame.loc[], search for all the items of Iris-virginica species in the IRIS dataset.
Solution
The following code returns a filtered DataFrame whose species are Iris-virginica. All such rows show up as an output.
Python Code
data = pd.read_csv("[Path to ch1-iris.csv]")
data.loc[data['species'] == 'Iris-virginica']
The resulting output will look like this:
(Rows 109 through 149 not shown.)
Example 1.14
Problem
This time, search for all the items whose species is Iris-virginica and whose sepal width is wider than 3.2.
Solution
You can use a Boolean expression (in other words, an expression that evaluates to either True or False) inside data.loc[].
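Conditions on Pandas columns are combined with & (and), | (or), and ~ (not) rather than the Python keywords and, or, and not. Each comparison needs its own parentheses because & and | bind more tightly than == and >. The sketch below illustrates this on a made-up four-row table; the measurements are invented and are not taken from ch1-iris.csv.

```python
import pandas as pd

# Made-up measurements used only to illustrate combining conditions.
toy = pd.DataFrame({
    "species": ["Iris-virginica", "Iris-virginica", "Iris-setosa", "Iris-setosa"],
    "sepal_width": [3.3, 2.9, 3.5, 3.0],
})

is_virginica = toy["species"] == "Iris-virginica"
is_wide = toy["sepal_width"] > 3.2

both = toy.loc[is_virginica & is_wide]      # rows meeting both conditions
either = toy.loc[is_virginica | is_wide]    # rows meeting at least one condition
not_virginica = toy.loc[~is_virginica]      # rows where the first condition is False

print(both.index.tolist())            # [0]
print(either.index.tolist())          # [0, 1, 2]
print(not_virginica.index.tolist())   # [2, 3]
```

Storing each condition in its own named variable, as done here, is optional but makes longer filters easier to read than one long bracketed expression.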
Python Code
data.loc[(data['species'] == 'Iris-virginica') & (data['sepal_width'] > 3.2)]
The resulting output will look like this:
Visualize Data Using Python Matplotlib
There are multiple ways to draw plots of data in Python. The most common and straightforward way is to import another library, Matplotlib, which is specialized for data visualization. Matplotlib is a huge library, but to draw the plots you only need to import a submodule named pyplot.
Type the following import statement in a new cell. Note it is convention to denote matplotlib.pyplot with plt, similarly to denoting Pandas with pd.
Python Code
import matplotlib.pyplot as plt
Matplotlib offers a method for each type of plot, and you will learn the Matplotlib methods for all of the commonly used types throughout this textbook. In this chapter, however, let’s briefly look at how to draw a plot using Matplotlib in general.
Suppose you want to draw a scatterplot between “US_Gross_Million” and “Worldwide_Gross_Million” of the movie profit dataset (ch1-movieprofit.csv). You will investigate scatterplots in more detail in Correlation and Linear Regression Analysis. The example code below draws such a scatterplot using the method scatter(). scatter() takes the two columns of interest, data["US_Gross_Million"] and data["Worldwide_Gross_Million"], as the inputs and assigns them to the x- and y-axes, respectively.
Python Code
data = pd.read_csv("[Path to ch1-movieprofit.csv]")
# draw a scatterplot using matplotlib’s scatter()
plt.scatter(data["US_Gross_Million"], data["Worldwide_Gross_Million"])
The resulting output will look like this:
Notice that it simply has a set of dots on a white plane. The plot itself does not show what each axis represents, what this plot is about, etc. Without them, it is difficult to capture what the plot shows. You can set these with the following code. The resulting plot below indicates that there is a positive correlation between domestic gross and worldwide gross.
Python Code
# draw a scatterplot
plt.scatter(data["US_Gross_Million"], data["Worldwide_Gross_Million"])
# set the title
plt.title("Domestic vs. Worldwide Gross")
# set the x-axis label
plt.xlabel("Domestic")
# set the y-axis label
plt.ylabel("Worldwide")
The resulting output will look like this:
You can also change the range of numbers along the x- and y-axes with plt.xlim() and plt.ylim(). Add the following two lines of code to the cell in the previous Python code example, which plots the scatterplot.
Python Code
# draw a scatterplot
plt.scatter(data["US_Gross_Million"], data["Worldwide_Gross_Million"])
# set the title
plt.title("Domestic vs. Worldwide Gross")
# set the x-axis label
plt.xlabel("Domestic")
# set the y-axis label
plt.ylabel("Worldwide")
# set the range of values of the x- and y-axes
plt.xlim(1*10**2, 3*10**2) # x axis: 100 to 300
plt.ylim(1*10**2, 1*10**3) # y axis: 100 to 1,000
The resulting output will look like this:
The resulting plot with the additional lines of code has a narrower range of values along the x- and y-axes.
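The same steps can be tried without the CSV file. The sketch below uses two short made-up lists of values (not taken from ch1-movieprofit.csv) and then reads the axis limits back from the current axes with plt.gca(), which is one way to confirm that xlim() and ylim() took effect. The variable names domestic, worldwide, and ax are chosen only for this example.

```python
import matplotlib.pyplot as plt

# Made-up values in millions of dollars, used only for illustration.
domestic = [120, 150, 210, 260, 400]
worldwide = [300, 280, 650, 900, 1500]

plt.scatter(domestic, worldwide)
plt.title("Domestic vs. Worldwide Gross (toy data)")
plt.xlabel("Domestic")
plt.ylabel("Worldwide")

# Restrict the visible range; the point (400, 1500) falls outside it.
plt.xlim(100, 300)
plt.ylim(100, 1000)

ax = plt.gca()          # the axes object that the plt calls above have been drawing on
print(ax.get_xlim())
print(ax.get_ylim())
```

Setting the limits only changes which part of the plane is shown. Points outside the range are still in the data; they are simply not visible in the plot.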
Example 1.15
Problem
Using the Iris dataset, draw a scatterplot between petal length and petal width of Setosa Iris. Set the title, x-axis label, and y-axis label properly as well.
Solution
Python Code
import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_csv("[Path to ch1-iris.csv]")
# select the rows whose species are Setosa Iris
setosa = data.loc[(data['species'] == 'Iris-setosa')]
# draw a scatterplot
plt.scatter(setosa["petal_length"], setosa["petal_width"])
# set the title
plt.title("Petal Length vs. Petal Width of Setosa Iris")
# set the x-axis label
plt.xlabel("Petal Length")
# set the y-axis label
plt.ylabel("Petal Width")
The resulting output will look like this:
Datasets
Note: The primary datasets referenced in the chapter code may also be downloaded here.