
Learning objectives

By the end of this section you should be able to

  • Describe the Pandas library.
  • Create a DataFrame and a Series object.
  • Choose appropriate Pandas functions to gain insight from heterogeneous data.

Pandas library

Pandas is an open-source Python library used for data cleaning, processing, and analysis. Pandas provides data structures and data analysis tools to analyze structured data efficiently. The name "Pandas" is derived from the term "panel data," which refers to multidimensional structured datasets. Key features of Pandas include:

  • Data structure: Pandas implements two main data structures:
    • Series: A Series is a one-dimensional labeled array.
    • DataFrame: A DataFrame is a two-dimensional labeled data structure that consists of columns and rows. A DataFrame can be thought of as a spreadsheet-like data structure where each column is a Series. A DataFrame is heterogeneous: each column can have a different data type.
  • Data processing functionality: Pandas provides various functionalities for data processing, such as data selection, filtering, slicing, sorting, merging, joining, and reshaping.
  • Integration with other libraries: Pandas integrates well with other Python libraries, such as NumPy. The integration capability allows for data exchange between different data analysis and visualization tools.
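
A few of the processing features listed above can be sketched in a handful of lines. This is a minimal illustration rather than a complete tour: the data mirrors the example used in this section, and the variable names (names, adults, by_age, ages) are assumptions chosen for the sketch.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Emma", "Gireeja", "Sophia"],
    "Age": [15, 28, 22],
    "City": ["Dubai", "London", "San Jose"],
})

# Selection: a single column is returned as a Series
names = df["Name"]

# Filtering: keep only the rows where Age is at least 18
adults = df[df["Age"] >= 18]

# Sorting: order the rows by Age, smallest first
by_age = df.sort_values("Age")

# NumPy integration: convert a column to a NumPy array
ages = df["Age"].to_numpy()

print(adults)
print(by_age["Name"].tolist())
print(ages.mean())
```

Filtering and sorting return new DataFrames and leave df unchanged, so the original data can be reused for further steps.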

The conventional alias for Pandas is pd; that is, the library is typically imported with the statement import pandas as pd. Examples of DataFrame and Series objects are shown below.

DataFrame example
   Name     Age  City
0  Emma     15   Dubai
1  Gireeja  28   London
2  Sophia   22   San Jose

Series example
0  Emma
1  Gireeja
2  Sophia
dtype: object

Table 15.1
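
A Series like the one in the example above could be created along the following lines. This is a minimal sketch: the variable names (names, age_column) are illustrative, and it also shows that selecting one column of a DataFrame yields a Series.

```python
import pandas as pd

# A Series built from a list; the index defaults to 0, 1, 2, ...
names = pd.Series(["Emma", "Gireeja", "Sophia"])

# Selecting a single DataFrame column returns a Series
df = pd.DataFrame({"Name": ["Emma", "Gireeja", "Sophia"], "Age": [15, 28, 22]})
age_column = df["Age"]

print(names)
print(type(age_column))
```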

Data input and output

A DataFrame can be created from a dictionary, list, NumPy array, or a CSV file. Column names and column data types can be specified at the time of DataFrame instantiation.
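
As a small sketch of specifying column names and a data type at instantiation, the example below passes both columns and dtype to pd.DataFrame(). The data values are illustrative. Note that the dtype argument applies a single type to every column; assigning different types to different columns would need a separate step, such as the astype() method.

```python
import pandas as pd

data = [[1, 0, 0], [0, 1, 0], [2, 3, 4]]

# columns sets the labels; dtype applies one data type to every column
df = pd.DataFrame(data, columns=["A", "B", "C"], dtype="float64")

# Show the data type of each column
print(df.dtypes)
```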

Description Example Output Explanation
DataFrame from a dictionary
import pandas as pd

# Create a dictionary of columns
data = {
  "Name": ["Emma", "Gireeja", "Sophia"],
  "Age": [15, 28, 22],
  "City": ["Dubai", "London", "San Jose"]
}

# Create a DataFrame from the dictionary
df = pd.DataFrame(data)

# Display the DataFrame
df
Name Age City
0 Emma 15 Dubai
1 Gireeja 28 London
2 Sophia 22 San Jose

The pd.DataFrame() function takes a dictionary and converts it into a DataFrame. The dictionary keys become the column labels, and each key's list of values fills the corresponding column.

DataFrame from a list
import pandas as pd

# Create a list of rows
data = [
  ["Emma", 15, "Dubai"],
  ["Gireeja", 28, "London"],
  ["Sophia", 22, "San Jose"]
]

# Define column labels
columns = ["Name", "Age", "City"]

# Create a DataFrame from list using column labels
df = pd.DataFrame(data, columns=columns)

# Display the DataFrame
df
Name Age City
0 Emma 15 Dubai
1 Gireeja 28 London
2 Sophia 22 San Jose

The pd.DataFrame() function takes a list of rows, where each row is itself a list of values, along with a list of column labels, and creates a DataFrame with the given rows and column labels.

DataFrame from a NumPy array
import numpy as np
import pandas as pd

# Create a NumPy array
data = np.array([
  [1, 0, 0],
  [0, 1, 0],
  [2, 3, 4]
])

# Define column labels
columns = ["A", "B", "C"]

# Create a DataFrame from the NumPy array
df = pd.DataFrame(data, columns=columns)

# Display the DataFrame
df
A B C
0 1 0 0
1 0 1 0
2 2 3 4

A NumPy array, along with column labels, is passed to the pd.DataFrame() function to create a DataFrame object.

DataFrame from a CSV file
import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv("data.csv")

# Display the DataFrame
df
The content of the CSV file will be printed in a tabular format.

The pd.read_csv() function reads a CSV file into a DataFrame and organizes the content in a tabular format.

DataFrame from an Excel file
import pandas as pd

# Read the Excel file into a DataFrame
df = pd.read_excel("data.xlsx")

# Display the DataFrame
df
The content of the Excel file will be printed in a tabular format.

The pd.read_excel() function reads an Excel file into a DataFrame and organizes the content in a tabular format.

Table 15.2 DataFrame creation.
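
The CSV example in the table depends on a file named data.csv being present. To try pd.read_csv() without a file on disk, one option is to read from an in-memory text buffer instead. The sketch below assumes illustrative CSV content that matches the earlier examples; csv_text is a hypothetical stand-in for the file's contents.

```python
import io

import pandas as pd

# Stand-in for the contents of a file such as data.csv (illustrative data)
csv_text = """Name,Age,City
Emma,15,Dubai
Gireeja,28,London
Sophia,22,San Jose
"""

# read_csv accepts a file path or a file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df)
print(df.shape)
```

The first line of the CSV text is used as the column labels, and the Age column is read as integers because every value in it is a whole number.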

Concepts in Practice

Pandas basics

1.
Which of the following is a Pandas data structure?
  1. Series
  2. dictionary
  3. list
2.
A DataFrame object can be considered as a collection of Series objects.
  1. true
  2. false
3.
What are the benefits of Pandas over NumPy?
  1. Pandas provides integration with other libraries.
  2. Pandas supports heterogeneous data whereas NumPy supports homogeneous numerical data.
  3. Pandas supports both one-dimensional and two-dimensional data structures.

Pandas for data manipulation and analysis

The Pandas library provides functions for exploring, manipulating, and gaining insight from data. The code below creates a sample DataFrame, shown in Table 15.3. Table 15.4 describes key DataFrame functions and applies each one to this DataFrame.

    import pandas as pd
    import numpy as np
    
    # Create a sample DataFrame
    days = {
      'Season': ['Summer', 'Summer', 'Fall', 'Winter', 'Fall', 'Winter'],
      'Month': ['July', 'June', 'September', 'January', 'October', 'February'],
      'Month-day': [1, 12, 3, 7, 20, 28],
      'Year': [2000, 1990, 2020, 1998, 2001, 2022]
    }
    df = pd.DataFrame(days)
Season Month Month-day Year
0 Summer July 1 2000
1 Summer June 12 1990
2 Fall September 3 2020
3 Winter January 7 1998
4 Fall October 20 2001
5 Winter February 28 2022
Table 15.3
Function name Explanation Example Output

head(n)

Returns the first n rows. If a value is not passed, the first 5 rows will be shown.

df.head(4)
Season Month Month-day Year
0 Summer July 1 2000
1 Summer June 12 1990
2 Fall September 3 2020
3 Winter January 7 1998

tail(n)

Returns the last n rows. If a value is not passed, the last 5 rows will be shown.

df.tail(3)
Season Month Month-day Year
3 Winter January 7 1998
4 Fall October 20 2001
5 Winter February 28 2022

info()

Prints a summary of the DataFrame, including the column names, data types, and the number of non-null values. The summary also reports the DataFrame's memory usage.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
 #    Column        Non-Null Count    Dtype
---   ------        --------------    -----
 0    Season        6 non-null        object
 1    Month         6 non-null        object
 2    Month-day     6 non-null        int64
 3    Year          6 non-null        int64
dtypes: int64(2), object(2)
memory usage: 320.0+ bytes

describe()

Generates the column count, mean, standard deviation, minimum, maximum, and quartiles.
df.describe()
Month-day Year
count 6.000000 6.000000
mean 11.833333 2005.166667
std 10.457852 12.875040
min 1.000000 1990.000000
25% 4.000000 1998.500000
50% 9.500000 2000.500000
75% 18.000000 2015.250000
max 28.000000 2022.000000

value_counts()

Counts how many times each unique value occurs in the column passed as an argument and presents the counts in descending order.
df.value_counts('Season')
Season
Fall  2
Summer  2
Winter  2
dtype: int64

unique()

Returns an array of unique values in a column when called on a column.
df['Season'].unique()
['Summer' 'Fall' 'Winter']
Table 15.4 DataFrame functions.
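
The functions in the table can be combined in a short exploratory script. The following is a minimal sketch using the same sample DataFrame; the variable names are illustrative. It also uses nunique(), which is not covered in the table, to contrast the number of unique values with the unique values themselves.

```python
import pandas as pd

days = {
    "Season": ["Summer", "Summer", "Fall", "Winter", "Fall", "Winter"],
    "Month": ["July", "June", "September", "January", "October", "February"],
    "Month-day": [1, 12, 3, 7, 20, 28],
    "Year": [2000, 1990, 2020, 1998, 2001, 2022],
}
df = pd.DataFrame(days)

# First two rows of the DataFrame
first_two = df.head(2)

# How many times each season appears
season_counts = df["Season"].value_counts()

# The distinct seasons, in order of first appearance
seasons = df["Season"].unique()

# The number of distinct seasons (nunique is not in the table above)
n_seasons = df["Season"].nunique()

# A single statistic pulled out of the describe() summary
mean_year = df.describe().loc["mean", "Year"]

print(first_two)
print(season_counts)
print(seasons, n_seasons)
print(mean_year)
```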

Concepts in Practice

DataFrame operations

4.
Which of the following returns the top five rows of a DataFrame?
  1. df.head()
  2. df.head(5)
  3. both
5.
What does the unique() function do in a DataFrame when applied to a column?
  1. returns the number of unique columns
  2. returns the number of unique values in the given column
  3. returns the unique values in the given column
6.
Which function generates statistical information of columns with numerical data types?
  1. describe()
  2. info()
  3. unique()

Exploring further

Please refer to the Pandas user guide for more information about the Pandas library.

Programming practice with Google

Use the Google Colaboratory document below to practice Pandas functionalities to extract insights from a dataset.

Google Colaboratory document

Citation/Attribution

This book may not be used in the training of large language models or otherwise be ingested into large language models or generative AI offerings without OpenStax's permission.

Want to cite, share, or modify this book? This book uses the Creative Commons Attribution License and you must attribute OpenStax.

Attribution information
  • If you are redistributing all or part of this book in a print format, then you must include on every physical page the following attribution:
    Access for free at https://openstax.org/books/introduction-python-programming/pages/1-introduction
  • If you are redistributing all or part of this book in a digital format, then you must include on every digital page view the following attribution:
    Access for free at https://openstax.org/books/introduction-python-programming/pages/1-introduction
Citation information

© Jul 30, 2024 OpenStax. Textbook content produced by OpenStax is licensed under a Creative Commons Attribution License . The OpenStax name, OpenStax logo, OpenStax book covers, OpenStax CNX name, and OpenStax CNX logo are not subject to the Creative Commons license and may not be reproduced without the prior and express written consent of Rice University.

This book utilizes the OpenStax Python Code Runner. The code runner is developed by Wiley and is All Rights Reserved.