Principles of Data Science

1.5 Data Science with Python

Learning Outcomes

By the end of this section, you should be able to

  • 1.5.1 Load data to Python.
  • 1.5.2 Perform basic data analysis using Python.
  • 1.5.3 Use visualization principles to graphically plot data using Python.

Multiple tools are available for writing and executing Python programs. Jupyter Notebook is one convenient and user-friendly tool. The next section explains how to set up the Jupyter Notebook environment using Google Colaboratory (Colab) and then provides the basics of two open-source Python libraries named Pandas and Matplotlib. These libraries are specialized for data analysis and data visualization, respectively.

Exploring Further

Python Programming

In the discussion below, we assume you are familiar with basic Python syntax and know how to write a simple program using Python. If you need a refresher on the basics, please refer to Das, U., Lawson, A., Mayfield, C., & Norouzi, N. (2024). Introduction to Python Programming. OpenStax. https://openstax.org/books/introduction-python-programming/pages/1-introduction.

Jupyter Notebook on Google Colaboratory

Jupyter Notebook is a web-based environment that lets you run Python programs interactively, combining code, math equations, visualizations, and plain text in one document. Several web applications and software packages can edit a Jupyter Notebook, but this textbook uses Google’s free application, Google Colaboratory (Colab). Colab is a cloud-based platform, which means that you can open, edit, run, and save a Jupyter Notebook directly from your Google Drive.

Setting up Colab is simple. On your Google Drive, click New > More. If Colab has already been installed on your Google Drive, you will see Colaboratory under More. If not, click “Connect more apps” and install Colab by searching for “Colaboratory” in the app store (Figure 1.15). For further information, see the Google Colaboratory Ecosystem animation.

A screenshot of the Google Drive icon with an add new button. Below is the Google Workspace Marketplace menu with a magnifying glass and Colaboratory highlighted.
Figure 1.15 Install Google Colaboratory (Colab)

Now click New > More > Google Colaboratory. A new, empty Jupyter Notebook will show up as in Figure 1.16.

A screenshot of an empty Jupyter Notebook within Google Colaboratory. The notebook is named Untitled.ipynb.
Figure 1.16 Google Colaboratory Notebook

The gray area with the play button is called a cell. A cell is a block where you can type either code or plain text. Notice that there are two buttons on top of the first cell—“+ Code” and “+ Text.” These two buttons add a code or text cell, respectively. A code cell is for the code you want to run; a text cell is to add any text description or note.

Let’s run a Python program on Colab. Type the following code in a code cell.

Python Code

      print("hello world!")
      

The resulting output will look like this:

hello world!

You can write a Python program across multiple cells and put text cells in between. Colab treats all the code cells as part of a single program, running from the top to the bottom of the current Jupyter Notebook. For example, the two code cells below run as if they were a single program.

When running one cell at a time from the top, we see the following outputs under each cell.

Python Code

      
      a = 1
      print("The a value in the first cell:", a)
      

The resulting output will look like this:

The a value in the first cell: 1

Python Code

      b = 3
      print("a in the second cell:", a)
      print("b in the second cell:", b)
      a + b

The resulting output will look like this:

a in the second cell: 1
b in the second cell: 3
4

Conventional Python versus Jupyter Notebook Syntax

While conventional Python requires a print() call to print something to the program console, Jupyter Notebook does not. In a Jupyter Notebook, ending a cell with the line a+b instead of print(a+b) also displays the value of a+b as output. But keep in mind that if several lines in a cell would each display a value this way, only the value from the last line is shown; use print() for the others.
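
As a small illustration of this note, the sketch below reuses the variables a and b from the earlier cells. It is meant to be pasted into a single notebook cell. In a plain Python script, only the print() lines would produce output.

```python
a = 1
b = 3

# In a notebook, this value is computed but not displayed,
# because it is not the last expression in the cell.
a + b

# print() displays a value no matter where the line sits in the cell.
print("a + b =", a + b)
print("a * b =", a * b)

# As the last expression in the cell, this value is displayed automatically
# in a notebook (a plain Python script would show nothing for it).
a - b
```

If you need to see several values from one cell, calling print() for each of them is the dependable choice.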

You can also run multiple cells in bulk. Click Runtime on the menu, and you will see there are multiple ways of running multiple cells at once (Figure 1.17). The two commonly used ones are “Run all” and “Run before.” “Run all” runs all the cells in order from the top; “Run before” runs all the cells before the currently selected one.

A screenshot of the Runtime menu in Google Colab with options to Run all, Run before, Run the focused cell, Run selection, and Run after with the keyboard shortcuts for each.
Figure 1.17 Multiple Ways of Running Cells on Colab

One thing to keep in mind is that being able to split a long program into multiple blocks and run one block at a time raises the chance of user error. Let’s look at a modified version of the code from the previous example.

Python Code

      a = 1
      print("the value in the first cell:", a)
      

The resulting output will look like this:

the value in the first cell: 1

Python Code

      b = 3
      print("a in the second cell:", a)
      print("b in the second cell:", b)
      a + b
    

The resulting output will look like this:

a in the second cell: 1
b in the second cell: 3
4

Python Code

      a = 2
      a + b
    

The resulting output will look like this:

5

The modified code has an additional cell at the end, updating a from 1 to 2. Notice that now a+b returns 5 as a has been changed to 2. Now suppose you need to run the second cell for some reason, so you run the second cell again.

Python Code

      a = 1
      print("the a value in the first cell:", a)
      

The resulting output will look like this:

the a value in the first cell: 1

Python Code

      b = 3
      print("a in the second cell:", a)
      print("b in the second cell:", b)
      a + b
      

The resulting output will look like this:

a in the second cell: 2
b in the second cell: 3
5

Python Code

      a = 2
      a + b
      

The resulting output will look like this:

5

The value of a in the second cell’s output has changed to 2. This shows that the execution order of the cells matters! If you run the third cell before the second cell, a holds the value assigned in the third cell, even though the third cell is located below the second cell. Therefore, it is recommended to use “Run all” or “Run before” after you make changes across multiple cells of code. This way your code is guaranteed to run sequentially from the top.
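
The effect of execution order can also be imitated in ordinary Python. The sketch below is only an analogy, not how Colab is implemented: the dictionary cells and the helper run_cells() are hypothetical names made up for this illustration, and each string stands in for one of the three code cells above (with the sum stored in a variable named total so it can be compared).

```python
# Each string stands in for one notebook code cell (an illustrative setup).
cells = {
    1: "a = 1",
    2: "b = 3\ntotal = a + b",
    3: "a = 2\ntotal = a + b",
}


def run_cells(order):
    """Run the given cells in order in one shared namespace, like one notebook session."""
    namespace = {}
    for number in order:
        exec(cells[number], namespace)
    return namespace["total"]


only_first_two = run_cells([1, 2])        # cell 2 sees a = 1
top_to_bottom = run_cells([1, 2, 3])      # the order "Run all" would use
rerun_second = run_cells([1, 2, 3, 2])    # cell 2 run again after cell 3

print(only_first_two, top_to_bottom, rerun_second)
```

The same second cell produces 4 when it runs right after the first cell, but 5 when it is run again after the third cell, because all cells share one set of variables.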

Python Pandas

One of the strengths of Python is that a wide variety of free, open-source libraries is available for it. A library is a set of already-implemented methods that a programmer can call, which avoids building common functions from scratch.

Pandas is a Python library specialized for data manipulation and analysis, and it is very commonly used among data scientists. It offers a variety of methods, which allows data scientists to quickly use them for data analysis. You will learn how to analyze data using Pandas throughout this textbook.

Colab already has Pandas installed, so you just need to import Pandas and you are set to use all of its methods. Note that it is convention to abbreviate pandas as pd so that when you call a method from Pandas, you can type pd instead of pandas every time. It offers a bit of convenience for a programmer!

Python Code

      # import Pandas and assign an abbreviated identifier "pd"
      import pandas as pd
      

Exploring Further

Installing Pandas on Your Computer

If you wish to install Pandas on your own computer, refer to the installation page of the Pandas website.

Load Data Using Python Pandas

The first step for data analysis is to load the data of your interest to your Notebook. Let’s create a folder on Google Drive where you can keep a CSV file for the dataset and a Notebook for data analysis. Download a public dataset, ch1-movieprofit.csv, and store it in a Google Drive folder. Then open a new Notebook in that folder by entering that folder and clicking New > More > Google Colaboratory.

Open the Notebook and allow it to access files in your Google Drive by following these steps:

First, click the Files icon on the side tab (Figure 1.18).

A screenshot of the side tab of Google Colab showing the following icons: hamburger menu, magnifying glass, x in brackets, key, and folder. The folder icon is highlighted and the word “Files” has popped up.
Figure 1.18 Side Tab of Colab

Then click the Mount Drive icon (Figure 1.19) and select “Connect to Google Drive” on the pop-up window.

A screenshot of the Files popup menu on Google Colab. The menu includes four icons, and the Mount Drive icon is selected.
Figure 1.19 Features under Files on Colab

Notice that a new cell has been inserted on the Notebook as a result (Figure 1.20).

Code snippet in Google Colab displaying a Python command to mount Google Drive. The command imports the 'drive' module and uses the 'mount' function with the path '/content/drive'.
Figure 1.20 An Inserted Cell to Mount Your Google Drive
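
Figure 1.20 is described only in words here. Based on that description, the inserted cell presumably resembles the sketch below. The exact text that Colab generates may differ. The google.colab module is available only inside Colab, so the try/except guard is an addition that lets the sketch run elsewhere without an error.

```python
# The google.colab package exists only in the Colab environment.
try:
    from google.colab import drive
except ImportError:
    drive = None  # not running on Colab

if drive is not None:
    # Make the files in your Google Drive visible under /content/drive.
    drive.mount("/content/drive")
else:
    print("Not running on Colab; skipping the Google Drive mount.")
```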

Connect your Google Drive by running the cell; your Notebook can now access all the files under /content/drive. Navigate the folders under drive to find your Notebook and ch1-movieprofit.csv files. Then click “…” > Copy Path (Figure 1.21).

A screenshot showing how to copy the path of a C S V file located in a Google Drive folder.
Figure 1.21 Copying the Path of a CSV File Located in a Google Drive Folder

Now replace [Path] with the copied path in the code below. Run the code and you will see that the dataset has been loaded as a table and stored in a Python variable named data.

Python Code

        # import Pandas and assign an abbreviated identifier "pd"
        import pandas as pd
        
        data = pd.read_csv("[Path]")
        data
       

The resulting output will look like this:

A Python output table displaying movie data, including title, year, genre, rating, duration, US gross, worldwide gross, and votes. The table is sorted by worldwide gross in descending order.

The read_csv() method in Pandas loads a CSV file and stores it as a DataFrame. A DataFrame is a data type that Pandas uses to store multi-column tabular data. Therefore, the variable data holds the table in ch1-movieprofit.csv in the form of a Pandas DataFrame.
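
To try read_csv() without any file on Google Drive, the minimal sketch below feeds it a few lines of CSV text through io.StringIO, which stands in for a file path. The three rows and three columns are copied from outputs shown elsewhere in this section. The variable names csv_text and sample are chosen only for this illustration.

```python
import io

import pandas as pd

# A few rows in CSV form, taken from the outputs shown in this section.
# io.StringIO stands in for a file path so the example needs no file on disk.
csv_text = """Title,Year,US_Gross_Million
Avatar,2009,760.51
Avengers: Endgame,2019,858.37
Titanic,1997,659.33
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(type(sample))   # the DataFrame type
print(sample.shape)   # (number of rows, number of columns)
sample
```

The first line of the CSV text becomes the column names, and every following line becomes one row of the DataFrame.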

DataFrame versus Series

Pandas defines two data types for tabular data—DataFrame and Series. While DataFrame is used for multi-column tabular data, Series is used for single-column data. Many methods in Pandas support both DataFrame and Series, but some are only for one or the other. It is always good to check if the method you are using works as you expect. For more information, refer to the Pandas documentation or Das, U., Lawson, A., Mayfield, C., & Norouzi, N. (2024). Introduction to Python Programming. OpenStax. https://openstax.org/books/introduction-python-programming/pages/1-introduction.
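
The difference between the two types can be checked directly. The sketch below builds a small DataFrame by hand (values copied from outputs in this section; the variable names are illustrative) and selects the same column in two ways.

```python
import pandas as pd

sample = pd.DataFrame({
    "Title": ["Avatar", "Avengers: Endgame", "Titanic"],
    "US_Gross_Million": [760.51, 858.37, 659.33],
})

# A single column name inside the brackets gives a Series.
one_column = sample["US_Gross_Million"]

# A list of column names, even a list of one, gives a DataFrame.
one_column_table = sample[["US_Gross_Million"]]

print(type(one_column).__name__)
print(type(one_column_table).__name__)
```

Checking type() like this is a quick way to find out which of the two a method has returned.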

Example 1.9

Problem

Remember the Iris dataset we used in Data and Datasets? Load the dataset ch1-iris.csv to a Python program using Pandas.

Exploring Further

Can I load a file that is uploaded to someone else’s Google Drive and shared with me?

Yes! This is useful especially when your Google Drive runs out of space. Simply add the shortcut of the shared file to your own drive. Right-click > Organize > Add Shortcut will let you select where to store the shortcut. Once done, you can call pd.read_csv() using the path of the shortcut.

Summarize Data Using Python Pandas

You can compute basic statistics for data quite quickly by using the DataFrame.describe() method. Add and run the following code in a new cell. It calls the describe() method upon data, the DataFrame we defined earlier with ch1-movieprofit.csv.

Python Code

      data = pd.read_csv("[Path to ch1-movieprofit.csv]")
      data.describe()
      

The resulting output will look like this:

A Python output table displaying descriptive statistics for movie data, including count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum values for rating, duration, US gross, and worldwide gross in millions.

describe() returns a table whose columns are a subset of the columns in the entire dataset (by default, the numeric ones) and whose rows are different statistics. The statistics include the number of non-missing values in a column (count), mean (mean), standard deviation (std), minimum and maximum values (min/max), and different quartiles (25%/50%/75%), which you will learn about in Measures of Variation. Using this representation, you can compare such statistics across columns easily.
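
The meaning of the count row is easy to misread, so here is a minimal sketch with made-up values (they are not taken from ch1-movieprofit.csv). The column contains one repeated value and one missing value, which makes it possible to tell apart the number of non-missing entries and the number of distinct values.

```python
import pandas as pd

# Made-up values for illustration: one repeated value and one missing value.
toy = pd.DataFrame({"Rating": [7.0, 7.0, 8.0, None]})

summary = toy.describe()

print(summary.loc["count", "Rating"])  # number of non-missing entries
print(toy["Rating"].nunique())         # number of distinct non-missing values
print(summary.loc["mean", "Rating"])   # mean of the non-missing entries
```

Here count is 3 because three of the four entries are present, while nunique() reports 2 because only the values 7.0 and 8.0 occur.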

Example 1.10

Problem

Using describe(), summarize the Iris dataset (ch1-iris.csv) that you loaded in the previous example.

Select Data Using Python Pandas

The Pandas DataFrame allows a programmer to use the column name itself when selecting a column. For example, the following code prints all the values in the “US_Gross_Million” column in the form of a Series (remember the data from a single column is stored in the Series type in Pandas).

Python Code

      data = pd.read_csv("[Path to ch1-movieprofit.csv]")
      
      data["US_Gross_Million"]
      

The resulting output will look like this:

0   760.51
1   858.37
2   659.33
3   936.66
4   678.82
    ...
961   77.22
962  177.20
963  102.31
964  106.89
965   75.47
Name: US_Gross_Million, Length: 966, dtype: float64

DataFrame.iloc[] enables a more powerful selection—it lets a programmer select by both column and row, using column and row indices. Let’s look at some code examples below.

Python Code

      data.iloc[:, 2] # select all values in the third column (index 2)
      

The resulting output will look like this:

0   2009
1   2019
2   1997
3   2015
4   2018
    ...
961  2010
962  1982
963  1993
964  1999
965  2017
Name: Year, Length: 966, dtype: object

Python Code

      data.iloc[2,:] # select all values in the third row
      

The resulting output will look like this:

Unnamed: 0             3
Title            Titanic
Year              1997
Genre             Drama
Rating              7.9
Duration             194
US_Gross_Million       659.33
Worldwide_Gross_Million   2201.65
Votes           1,162,142
Name: 2, dtype: object

To pinpoint a specific value within the “US_Gross_Million” column, you can use an index number.

Python Code

      
      print(data["US_Gross_Million"][0]) # index 0 refers to the top row
      print(data["US_Gross_Million"][2]) # index 2 refers to the third row
      

The resulting output will look like this:

760.51
659.33
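
One caveat is worth a short sketch. The number in data["US_Gross_Million"][2] is matched against the row labels, which happen to equal the row positions in a freshly loaded table. After rows have been filtered out, labels and positions no longer agree, and iloc[] is the position-based alternative. The example below uses three values from the output above. The names gross and below_800 are illustrative.

```python
import pandas as pd

gross = pd.Series([760.51, 858.37, 659.33], name="US_Gross_Million")

# Keep only the values below 800; the remaining rows keep their labels 0 and 2.
below_800 = gross[gross < 800]

print(list(below_800.index))   # the surviving row labels
print(below_800[2])            # look up by label 2
print(below_800.iloc[1])       # look up by position 1 (the second remaining row)
```

Both lookups return 659.33, but they reach it differently: label 2 is now at position 1.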

You can also use DataFrame.iloc[] to select a specific group of cells in the table. The example code below shows different ways of using iloc[]. iloc[] can be used in many ways; this chapter introduces only a couple of common ones. You will learn more techniques for working with data throughout this textbook.

Python Code

      
      data.iloc[:, 1] # select all values in the second column (index 1)
      

The resulting output will look like this:

0                     Avatar
1                Avengers: Endgame
2                     Titanic
3   Star Wars: Episode VII - The Force Awakens
4             Avengers: Infinity War
             ...          
961                  The A-Team
962                    Tootsie
963              In the Line of Fire
964                 Analyze This
965            The Hitman's Bodyguard
Name: Title, Length: 966, dtype: object

Python Code

      data.iloc[[1, 3], [2, 3]]  
      # select the rows at index 1 and 3, the columns at index 2 and 3

The resulting output will look like this:

A Python output table with two columns and two rows. The first column is labeled “Year,” and the second column is labeled “Genre.” The first row contains the values “2019” and “Action,” and the second row contains the values “2015” and “Action.” There are two icons to the right of the table, one that looks like a calendar and one that looks like a bar chart.
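
In addition to single indices and lists of indices, iloc[] accepts slices written as start:stop, where the stop position is excluded. The sketch below shows this on a small hand-built table whose values are copied from outputs in this section. The variable names are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    "Title": ["Avatar", "Avengers: Endgame", "Titanic"],
    "Year": [2009, 2019, 1997],
    "US_Gross_Million": [760.51, 858.37, 659.33],
})

first_two_rows = sample.iloc[0:2, :]     # rows at positions 0 and 1, all columns
last_two_columns = sample.iloc[:, 1:3]   # all rows, columns at positions 1 and 2
single_cell = sample.iloc[2, 0]          # row position 2, column position 0

print(first_two_rows.shape)
print(list(last_two_columns.columns))
print(single_cell)
```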

Example 1.11

Problem

Select the “sepal_width” column of the Iris dataset using the column name.

Example 1.12

Problem

Select the “petal_length” column of the Iris dataset using iloc[].

Search Data Using Python Pandas

To search for data entries that fulfill specific criteria (i.e., to filter the data), you can use DataFrame.loc[] of Pandas. When you put the filtering criteria inside the brackets, [], the output contains only the rows of the DataFrame that satisfy them. For example, the code below keeps only the rows whose genre is comedy. Notice that the output has only 307 of the full 966 rows. You can check the output on your own, and you will see their Genre values are all “Comedy.”

Python Code

      data = pd.read_csv("[Path to ch1-movieprofit.csv]")
      
      data.loc[data['Genre'] == 'Comedy']
      

The resulting output will look like this:

A Python output table displaying movie data, including title, year, genre, rating, duration, US gross, worldwide gross, and votes. The table is sorted by worldwide gross in descending order.
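
It may help to see what goes inside the brackets of loc[]. The comparison data['Genre'] == 'Comedy' produces a Series of True/False values, one per row, and loc[] keeps the rows marked True. Several such conditions can be combined. The sketch below uses made-up rows (they are not taken from ch1-movieprofit.csv), and all variable names are illustrative.

```python
import pandas as pd

# Made-up rows for illustration only; they are not taken from ch1-movieprofit.csv.
toy = pd.DataFrame({
    "Title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "Genre": ["Comedy", "Drama", "Comedy", "Action"],
    "Rating": [6.5, 8.1, 7.4, 7.0],
})

is_comedy = toy["Genre"] == "Comedy"   # a Series of True/False values, one per row
comedies = toy.loc[is_comedy]

# Combine conditions with & (and) or | (or); each condition needs its own parentheses.
good_comedies = toy.loc[(toy["Genre"] == "Comedy") & (toy["Rating"] > 7.0)]

print(len(comedies))
print(list(good_comedies["Title"]))
```

The parentheses matter because & binds more tightly than == and >, so leaving them out changes how Python groups the expression.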

Example 1.13

Problem

Using DataFrame.loc[], search for all the items of Iris-virginica species in the IRIS dataset.

Example 1.14

Problem

This time, search for all the items whose species is Iris-virginica and whose sepal width is wider than 3.2.

Visualize Data Using Python Matplotlib

There are multiple ways to draw plots of data in Python. The most common and straightforward way is to import another library, Matplotlib, which is specialized for data visualization. Matplotlib is a large library, but to draw the plots in this section you only need to import its submodule named pyplot.

Type the following import statement in a new cell. Note it is convention to denote matplotlib.pyplot with plt, similarly to denoting Pandas with pd.

Python Code

      import matplotlib.pyplot as plt
      

Matplotlib offers a method for each type of plot, and you will learn the Matplotlib methods for all of the commonly used types throughout this textbook. In this chapter, however, let’s briefly look at how to draw a plot using Matplotlib in general.

Suppose you want to draw a scatterplot between “US_Gross_Million” and “Worldwide_Gross_Million” of the movie profit dataset (ch1-movieprofit.csv). You will investigate scatterplots in more detail in Correlation and Linear Regression Analysis. The example code below draws such a scatterplot using the method scatter(). scatter() takes the two columns of your interest—data["US_Gross_Million"] and data["Worldwide_Gross_Million"]—as the inputs and assigns them for the x- and y-axes, respectively.

Python Code

      data = pd.read_csv("[Path to ch1-movieprofit.csv]")
      
      # draw a scatterplot using matplotlib’s scatter()
      plt.scatter(data["US_Gross_Million"], data["Worldwide_Gross_Million"])
      

The resulting output will look like this:

An unlabeled scatter plot. The X axis ranges from 0 to 1,000. The Y axis ranges from 0 to 3,000. Data points are clustered toward the lower left corner, with a general upward trend indicating that a higher value on the X axis tends to correlate with a higher value on the Y axis.

Notice that it simply has a set of dots on a white plane. The plot itself does not show what each axis represents, what this plot is about, etc. Without them, it is difficult to capture what the plot shows. You can set these with the following code. The resulting plot below indicates that there is a positive correlation between domestic gross and worldwide gross.

Python Code

      # draw a scatterplot
      plt.scatter(data["US_Gross_Million"], data["Worldwide_Gross_Million"])
      
      # set the title
      plt.title("Domestic vs. Worldwide Gross")
      
      # set the x-axis label
      plt.xlabel("Domestic")
      
      # set the y-axis label
      plt.ylabel("Worldwide")
      

The resulting output will look like this:

A scatter plot comparing the domestic gross versus worldwide gross of movies. The x-axis represents domestic gross and ranges from 0 to 1,000, and the y-axis represents worldwide gross and ranges from 0 to 3,000. Each data point is a blue dot representing a movie. The plot shows a general positive correlation between domestic and worldwide gross, indicating that movies with higher domestic gross tend to also have higher worldwide gross.  Data points are clustered toward the lower left corner, with a general upward trend.

You can also change the range of numbers along the x- and y-axes with plt.xlim() and plt.ylim(). Add the following two lines of code to the cell in the previous Python code example, which plots the scatterplot.

Python Code

      # draw a scatterplot
      plt.scatter(data["US_Gross_Million"], data["Worldwide_Gross_Million"])
      
      # set the title
      plt.title("Domestic vs. Worldwide Gross")
      
      # set the x-axis label
      plt.xlabel("Domestic")
      
      # set the y-axis label
      plt.ylabel("Worldwide")
      
      # set the range of values of the x- and y-axes
      plt.xlim(1*10**2, 3*10**2) # x axis: 100 to 300
      plt.ylim(1*10**2, 1*10**3) # y axis: 100 to 1,000
      

The resulting output will look like this:

A scatter plot comparing the domestic gross versus worldwide gross of movies. The x-axis represents domestic gross and ranges from 100 to 300, and the y-axis represents worldwide gross and ranges from 100 to 1,000. Each data point is a blue dot representing a movie. The plot shows a general positive correlation between domestic and worldwide gross, indicating that movies with higher domestic gross tend to also have higher worldwide gross.  Data points are clustered toward the lower left corner, with a general upward trend.

The resulting plot with the additional lines of code has a narrower range of values along the x- and y-axes.
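
To experiment with these calls without the movie dataset, the self-contained sketch below plots four made-up points and then reads the axis ranges back with get_xlim() and get_ylim(). The data values and variable names are invented for this illustration.

```python
import matplotlib.pyplot as plt

# Made-up values for illustration only.
x = [50, 150, 250, 350]
y = [120, 400, 700, 1200]

plt.scatter(x, y)
plt.title("Toy scatterplot")
plt.xlabel("x")
plt.ylabel("y")

# Zoom in: points outside these ranges remain in the data but fall outside the view.
plt.xlim(100, 300)
plt.ylim(100, 1000)

axes = plt.gca()          # the axes that the calls above drew on
print(axes.get_xlim())    # current x-axis range
print(axes.get_ylim())    # current y-axis range
```

Note that narrowing the ranges only changes what is visible. Here the points at x = 50 and x = 350 are still part of the plot but lie outside the displayed area.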

Example 1.15

Problem

Using the Iris dataset, draw a scatterplot between the petal length and petal width of Iris setosa. Set the title, x-axis label, and y-axis label properly as well.

Datasets

Note: The primary datasets referenced in the chapter code may also be downloaded here.

Citation/Attribution

This book may not be used in the training of large language models or otherwise be ingested into large language models or generative AI offerings without OpenStax's permission.

Want to cite, share, or modify this book? This book uses the Creative Commons Attribution-NonCommercial-ShareAlike License and you must attribute OpenStax.

Attribution information
  • If you are redistributing all or part of this book in a print format, then you must include on every physical page the following attribution:
    Access for free at https://openstax.org/books/principles-data-science/pages/1-introduction
  • If you are redistributing all or part of this book in a digital format, then you must include on every digital page view the following attribution:
    Access for free at https://openstax.org/books/principles-data-science/pages/1-introduction
Citation information

© Dec 19, 2024 OpenStax. Textbook content produced by OpenStax is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License . The OpenStax name, OpenStax logo, OpenStax book covers, OpenStax CNX name, and OpenStax CNX logo are not subject to the Creative Commons license and may not be reproduced without the prior and express written consent of Rice University.