Udayan Das; Aubrey Lawson; Chris Mayfield; Narges Norouzi

Learning objectives

By the end of this section you should be able to

Describe the Pandas library.
Create a DataFrame and a Series object.
Choose appropriate Pandas functions to gain insight from heterogeneous data.

Pandas library

Pandas is an open-source Python library used for data cleaning, processing, and analysis. It provides data structures and tools for working with structured data efficiently. The name "Pandas" is derived from the term "panel data," which refers to multidimensional structured datasets. Key features of Pandas include:

Data structures: Pandas implements two main data structures:
- Series: A Series is a one-dimensional labeled array.
- DataFrame: A DataFrame is a two-dimensional labeled data structure that consists of rows and columns. A DataFrame can be thought of as a spreadsheet-like structure in which each column is a Series. A DataFrame is heterogeneous: each column can have a different data type.
Data processing functionality: Pandas provides various functionalities for data processing, such as data selection, filtering, slicing, sorting, merging, joining, and reshaping.
Integration with other libraries: Pandas integrates well with other Python libraries, such as NumPy. The integration capability allows for data exchange between different data analysis and visualization tools.
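
As a minimal sketch of the features listed above, the following code selects, filters, sorts, and merges data, and then hands a column to NumPy. Here pd is the conventional alias for the pandas library. The countries table and its values are hypothetical and are introduced only to illustrate merging.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Emma", "Gireeja", "Sophia"],
    "Age": [15, 28, 22],
    "City": ["Dubai", "London", "San Jose"],
})

# Hypothetical second table, used only to demonstrate merging
countries = pd.DataFrame({
    "City": ["Dubai", "London", "San Jose"],
    "Country": ["UAE", "UK", "USA"],
})

# Selection: pick a single column (the result is a Series)
names = people["Name"]

# Filtering: keep only the rows where Age is greater than 18
adults = people[people["Age"] > 18]

# Sorting: order the rows by Age, youngest first
by_age = people.sort_values("Age")

# Merging: combine the two tables on their shared "City" column
merged = people.merge(countries, on="City")

# Integration with NumPy: export the Age column as a NumPy array
ages = people["Age"].to_numpy()

print(adults)
print(merged)
```

Each operation returns a new object and leaves the original people DataFrame unchanged.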

By convention, Pandas is imported under the alias pd, using the statement import pandas as pd. Examples of DataFrame and Series objects are shown below.

DataFrame example

	Name	Age	City
0	Emma	15	Dubai
1	Gireeja	28	London
2	Sophia	22	San Jose

Series example

0	Emma
1	Gireeja
2	Sophia
dtype: object

Table 15.1
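
A Series like the one in Table 15.1 can be created directly with pd.Series(). The sketch below is one possible illustration: it builds a Series with default labels, builds a second Series with custom labels (the choice of names as labels is an assumption made for this example), and shows that selecting one column of a DataFrame produces a Series.

```python
import pandas as pd

# A Series built from a list; Pandas assigns the default labels 0, 1, 2
names = pd.Series(["Emma", "Gireeja", "Sophia"])

# A Series with custom labels (the labels here are illustrative)
ages = pd.Series([15, 28, 22], index=["Emma", "Gireeja", "Sophia"])

# Look up a value by its label
gireeja_age = ages["Gireeja"]

# Selecting one column of a DataFrame yields a Series
df = pd.DataFrame({
    "Name": ["Emma", "Gireeja", "Sophia"],
    "Age": [15, 28, 22],
})
age_column = df["Age"]

print(names)
print(gireeja_age)
print(type(age_column).__name__)
```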

Data input and output

A DataFrame can be created from a dictionary, a list, a NumPy array, or a file such as a CSV or Excel file. Column names and column data types can be specified at the time of DataFrame instantiation.
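
As a small sketch of specifying column names and data types at instantiation, the code below passes the columns and dtype parameters to pd.DataFrame(). The dtype parameter applies one type to every column, so the example also uses astype() afterwards to set a type for a single column. The values and labels are made up for illustration.

```python
import pandas as pd

rows = [[1, 2], [3, 4]]

# Column labels and a single data type for all columns, set at creation
df = pd.DataFrame(rows, columns=["A", "B"], dtype="float64")

# A per-column type can be set afterwards with astype()
df = df.astype({"A": "int64"})

print(df.dtypes)
```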

Description	Example	Output	Explanation

DataFrame from a dictionary

import pandas as pd

# Create a dictionary of columns
data = {
  "Name": ["Emma", "Gireeja", "Sophia"],
  "Age": [15, 28, 22],
  "City": ["Dubai", "London", "San Jose"]
}

# Create a DataFrame from the dictionary
df = pd.DataFrame(data)

# Display the DataFrame
df

	Name	Age	City
0	Emma	15	Dubai
1	Gireeja	28	London
2	Sophia	22	San Jose

The pd.DataFrame() function takes a dictionary and converts it into a DataFrame. The dictionary keys become the column labels, and each key's list of values becomes the data in that column.

DataFrame from a list

import pandas as pd

# Create a list of rows
data = [
  ["Emma", 15, "Dubai"],
  ["Gireeja", 28, "London"],
  ["Sophia", 22, "San Jose"]
]

# Define column labels
columns = ["Name", "Age", "City"]

# Create a DataFrame from list using column labels
df = pd.DataFrame(data, columns=columns)

# Display the DataFrame
df

	Name	Age	City
0	Emma	15	Dubai
1	Gireeja	28	London
2	Sophia	22	San Jose

The pd.DataFrame() function takes a list of rows, where each inner list is one record, along with a list of column labels, and creates a DataFrame with those rows and column labels.

DataFrame from a NumPy array

import numpy as np
import pandas as pd

# Create a NumPy array
data = np.array([
  [1, 0, 0],
  [0, 1, 0],
  [2, 3, 4]
])

# Define column labels
columns = ["A", "B", "C"]

# Create a DataFrame from the NumPy array
df = pd.DataFrame(data, columns=columns)

# Display the DataFrame
df

	A	B	C
0	1	0	0
1	0	1	0
2	2	3	4

A NumPy array, along with a list of column labels, is passed to the pd.DataFrame() function to create a DataFrame object.

DataFrame from a CSV file

import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv("data.csv")

# Display the DataFrame
df

The content of the CSV file is displayed in a tabular format.

The pd.read_csv() function reads a CSV file into a DataFrame and organizes the content in a tabular format.

DataFrame from an Excel file

import pandas as pd

# Read the Excel file into a DataFrame
df = pd.read_excel("data.xlsx")

# Display the DataFrame
df

The content of the Excel file is displayed in a tabular format.

The pd.read_excel() function reads an Excel file into a DataFrame and organizes the content in a tabular format. Reading .xlsx files requires an optional engine package, such as openpyxl, to be installed.

Table 15.2 DataFrame creation.
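
The file-based examples in Table 15.2 need a data.csv or data.xlsx file to exist. As a self-contained sketch, the code below reads CSV text from an in-memory buffer instead, since pd.read_csv() accepts a file-like object as well as a file path. The CSV content is hypothetical and simply mirrors the sample data used earlier in this section.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file such as data.csv
csv_text = (
    "Name,Age,City\n"
    "Emma,15,Dubai\n"
    "Gireeja,28,London\n"
    "Sophia,22,San Jose\n"
)

# read_csv() accepts a file path or a file-like object
df = pd.read_csv(io.StringIO(csv_text))

# to_csv() with no path returns the CSV text as a string;
# index=False leaves out the row labels
round_trip = df.to_csv(index=False)

print(df)
print(round_trip)
```

The first line of the CSV content is used as the column labels, and the Age column is read as integers.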

Concepts in Practice

Pandas basics

1.

Which of the following is a Pandas data structure?

Series
dictionary
list

2.

A DataFrame object can be considered as a collection of Series objects.

true
false

3.

What are the benefits of Pandas over NumPy?

Pandas provides integration with other libraries.
Pandas supports heterogeneous data whereas NumPy supports homogeneous numerical data.
Pandas supports both one-dimensional and two-dimensional data structures.

Pandas for data manipulation and analysis

The Pandas library provides functions and techniques to explore, manipulate, and gain insights from data. Table 15.4 describes key DataFrame functions, using the DataFrame created by the following code as an example.

    import pandas as pd
    
    # Create a sample DataFrame
    days = {
      'Season': ['Summer', 'Summer', 'Fall', 'Winter', 'Fall', 'Winter'],
      'Month': ['July', 'June', 'September', 'January', 'October', 'February'],
      'Month-day': [1, 12, 3, 7, 20, 28],
      'Year': [2000, 1990, 2020, 1998, 2001, 2022]
    }
    df = pd.DataFrame(days)

	Season	Month	Month-day	Year
0	Summer	July	1	2000
1	Summer	June	12	1990
2	Fall	September	3	2020
3	Winter	January	7	1998
4	Fall	October	20	2001
5	Winter	February	28	2022

Table 15.3

Function name	Explanation	Example	Output

head(n)

Returns the first n rows. If a value is not passed, the first 5 rows will be shown.

df.head(4)

	Season	Month	Month-day	Year
0	Summer	July	1	2000
1	Summer	June	12	1990
2	Fall	September	3	2020
3	Winter	January	7	1998

tail(n)

Returns the last n rows. If a value is not passed, the last 5 rows will be shown.

df.tail(3)

	Season	Month	Month-day	Year
3	Winter	January	7	1998
4	Fall	October	20	2001
5	Winter	February	28	2022

info()

Prints a summary of the DataFrame, including the column names, data types, and the number of non-null values. The summary also reports the DataFrame's memory usage.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
 #    Column        Non-Null Count    Dtype
---   ------        --------------    -----
 0    Season        6 non-null        object
 1    Month         6 non-null        object
 2    Month-day     6 non-null        int64
 3    Year          6 non-null        int64
dtypes: int64(2), object(2)
memory usage: 320.0+ bytes

describe()

Generates summary statistics for the numeric columns: count, mean, standard deviation, minimum, maximum, and quartiles.

df.describe()

	Month-day	Year
count	6.000000	6.000000
mean	11.833333	2005.166667
std	10.457852	12.875040
min	1.000000	1990.000000
25%	4.000000	1998.500000
50%	9.500000	2000.500000
75%	18.000000	2015.250000
max	28.000000	2022.000000

value_counts()

Counts how many times each unique value occurs in the column passed as an argument and presents the counts in descending order.

df.value_counts('Season')

Season
Fall  2
Summer  2
Winter  2
dtype: int64

unique()

When called on a column, returns an array of the unique values in that column, in order of first appearance.

df['Season'].unique()

['Summer' 'Fall' 'Winter']

Table 15.4 DataFrame functions.
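
The functions in Table 15.4 can be combined with filtering and sorting. The sketch below rebuilds the sample DataFrame from this section and contrasts unique(), which returns the distinct values, with nunique(), which returns how many distinct values there are. The threshold year 2000 is an arbitrary choice made for illustration.

```python
import pandas as pd

# The sample DataFrame used throughout this section
days = {
    "Season": ["Summer", "Summer", "Fall", "Winter", "Fall", "Winter"],
    "Month": ["July", "June", "September", "January", "October", "February"],
    "Month-day": [1, 12, 3, 7, 20, 28],
    "Year": [2000, 1990, 2020, 1998, 2001, 2022],
}
df = pd.DataFrame(days)

season = df["Season"]

# How many times each season occurs
counts = season.value_counts()

# The distinct values versus the number of distinct values
distinct = season.unique()
n_distinct = season.nunique()

# Keep the rows from 2000 onward, then sort them by year
recent = df[df["Year"] >= 2000].sort_values("Year")

print(counts)
print(distinct, n_distinct)
print(recent)
```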

Concepts in Practice

DataFrame operations

4.

Which of the following returns the top five rows of a DataFrame?

df.head()
df.head(5)
both

5.

What does the unique() function do in a DataFrame when applied to a column?

returns the number of unique columns
returns the number of unique values in the given column
returns the unique values in the given column

6.

Which function generates statistical information of columns with numerical data types?

describe()
info()
unique()

Exploring further

Please refer to the Pandas user guide for more information about the Pandas library.

Pandas User Guide

Programming practice with Google

Use the Google Colaboratory document below to practice using Pandas functions to extract insights from a dataset.

Google Colaboratory document

15.3 Pandas