Learning objectives
By the end of this section you should be able to:
- Describe the Pandas library.
- Create a DataFrame and a Series object.
- Choose appropriate Pandas functions to gain insight from heterogeneous data.
Pandas library
Pandas is an open-source Python library used for data cleaning, processing, and analysis. Pandas provides data structures and data analysis tools to analyze structured data efficiently. The name "Pandas" is derived from the term "panel data," which refers to multidimensional structured datasets. Key features of Pandas include:
- Data structures: Pandas implements two main data structures:
  - Series: A Series is a one-dimensional labeled array.
  - DataFrame: A DataFrame is a two-dimensional labeled data structure that consists of columns and rows. A DataFrame can be thought of as a spreadsheet-like data structure where each column is a Series. A DataFrame is heterogeneous: each column can have a different data type.
- Data processing functionality: Pandas provides various functionalities for data processing, such as data selection, filtering, slicing, sorting, merging, joining, and reshaping.
- Integration with other libraries: Pandas integrates well with other Python libraries, such as NumPy. The integration capability allows for data exchange between different data analysis and visualization tools.
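As a rough illustration of the data processing functionality listed above, the following sketch selects, filters, sorts, and merges two small DataFrames, then passes a column to NumPy. The table contents and the names `students` and `cities` are invented for this example and are not part of the original text.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, invented for this illustration
students = pd.DataFrame({
    "Name": ["Emma", "Gireeja", "Sophia"],
    "Age": [15, 28, 22],
    "City": ["Dubai", "London", "San Jose"],
})
cities = pd.DataFrame({
    "City": ["Dubai", "London", "San Jose"],
    "Country": ["UAE", "UK", "USA"],
})

# Selection: pick a single column (the result is a Series)
ages = students["Age"]

# Filtering: keep only the rows where Age is greater than 18
adults = students[students["Age"] > 18]

# Sorting: order the rows by Age, largest first
by_age = students.sort_values("Age", ascending=False)

# Merging: combine the two tables on their shared City column
merged = students.merge(cities, on="City")

# NumPy integration: convert a column to a NumPy array and apply a NumPy function
mean_age = np.mean(ages.to_numpy())

print(adults)
print(by_age)
print(merged)
print(mean_age)
```

Each operation returns a new object and leaves `students` unchanged, which makes it easy to try operations one at a time.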
The conventional alias for importing Pandas is pd. In other words, Pandas is imported as import pandas as pd. Examples of DataFrame and Series objects are shown below.
[Figure: DataFrame example | Series example]
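A minimal sketch of the two data structures follows. The values and the variable names `ages` and `people` are assumptions chosen for illustration.

```python
import pandas as pd

# A Series: a one-dimensional labeled array (default labels are 0, 1, 2)
ages = pd.Series([15, 28, 22], name="Age")

# A DataFrame: two-dimensional, and each column may hold a different data type
people = pd.DataFrame({
    "Name": ["Emma", "Gireeja", "Sophia"],
    "Age": [15, 28, 22],
})

print(ages)
print(people)

# Selecting one column of a DataFrame gives back a Series
print(type(people["Age"]))
```

The last line shows the relationship described above: each column of a DataFrame is itself a Series.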
Data input and output
A DataFrame can be created from a dictionary, a list, a NumPy array, or a CSV or Excel file. Column names and column data types can be specified at the time of DataFrame instantiation.
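One way to set column names and data types at creation time is sketched here. The data and the names `rows`, `df_float`, and `df_mixed` are hypothetical. Note that the `dtype` argument of `pd.DataFrame()` applies a single type to every column, so the sketch uses `astype()` as one option when columns need different types.

```python
import pandas as pd

# Hypothetical rows, invented for this illustration
rows = [[1, 2.5], [3, 4.5]]

# columns= sets the column labels; dtype= applies one data type to every column
df_float = pd.DataFrame(rows, columns=["A", "B"], dtype="float64")

# For per-column types, one option is to convert right after creation
df_mixed = pd.DataFrame(rows, columns=["A", "B"]).astype(
    {"A": "int64", "B": "float64"}
)

print(df_float.dtypes)
print(df_mixed.dtypes)
```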
Description | Example | Output | Explanation |
---|---|---|---|
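DataFrame from a dictionary | import pandas as pd
# Create a dictionary of columns
data = {
    "Name": ["Emma", "Gireeja", "Sophia"],
    "Age": [15, 28, 22],
    "City": ["Dubai", "London", "San Jose"]
}
# Create a DataFrame from the dictionary
df = pd.DataFrame(data)
# Display the DataFrame
df |
A DataFrame with the columns Name, Age, and City and three rows labeled 0 to 2. |
The pd.DataFrame() function creates a DataFrame from the dictionary. Each key becomes a column label, and each list supplies that column's values. |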
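DataFrame from a list | import pandas as pd
# Create a list of rows
data = [
    ["Emma", 15, "Dubai"],
    ["Gireeja", 28, "London"],
    ["Sophia", 22, "San Jose"]
]
# Define column labels
columns = ["Name", "Age", "City"]
# Create a DataFrame from list using column labels
df = pd.DataFrame(data, columns=columns)
# Display the DataFrame
df |
A DataFrame with the columns Name, Age, and City and three rows labeled 0 to 2. |
The pd.DataFrame() function creates a DataFrame from the list. Each inner list becomes one row, and the columns argument supplies the column labels. |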
DataFrame from a NumPy array | import numpy as np
import pandas as pd
# Create a NumPy array
data = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [2, 3, 4]
])
# Define column labels
columns = ["A", "B", "C"]
# Create a DataFrame from the NumPy array
df = pd.DataFrame(data, columns=columns)
# Display the DataFrame
df |
A DataFrame with the columns A, B, and C and three rows labeled 0 to 2. |
A NumPy array, along with column labels, is passed to the pd.DataFrame() function to create a DataFrame. |
DataFrame from a CSV file | import pandas as pd
# Read the CSV file into a DataFrame
df = pd.read_csv("data.csv")
# Display the DataFrame
df |
The content of the CSV file is displayed in a tabular format. | The pd.read_csv() function reads the CSV file into a DataFrame. |
DataFrame from an Excel file | import pandas as pd
# Read the Excel file into a DataFrame
df = pd.read_excel("data.xlsx")
# Display the DataFrame
df |
The content of the Excel file is displayed in a tabular format. | The pd.read_excel() function reads the Excel file into a DataFrame. |
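The table above covers input. For output, a DataFrame can be written back out with `to_csv()`. The sketch below round-trips a small DataFrame through CSV text. To stay self-contained it uses an in-memory `io.StringIO` buffer instead of a file on disk, which is an assumption of this example and not something the text prescribes.

```python
import io

import pandas as pd

# Hypothetical sample data, invented for this illustration
df = pd.DataFrame({
    "Name": ["Emma", "Gireeja", "Sophia"],
    "Age": [15, 28, 22],
})

# Output: write the DataFrame as CSV text (index=False omits the row labels)
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Input: read the same CSV text back into a new DataFrame
df_back = pd.read_csv(io.StringIO(csv_text))
print(df_back)
```

Passing a path such as "data.csv" in place of the buffer would write to and read from a file instead.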
Pandas for data manipulation and analysis
The Pandas library provides functions and techniques to explore, manipulate, and gain insights from data. The table after the sample code below describes key DataFrame functions, each applied to the DataFrame df that this code creates.
import pandas as pd
import numpy as np
# Create a sample DataFrame
days = {
    'Season': ['Summer', 'Summer', 'Fall', 'Winter', 'Fall', 'Winter'],
    'Month': ['July', 'June', 'September', 'January', 'October', 'February'],
    'Month-day': [1, 12, 3, 7, 20, 28],
    'Year': [2000, 1990, 2020, 1998, 2001, 2022]
}
df = pd.DataFrame(days)
|   | Season | Month | Month-day | Year |
|---|---|---|---|---|
| 0 | Summer | July | 1 | 2000 |
| 1 | Summer | June | 12 | 1990 |
| 2 | Fall | September | 3 | 2020 |
| 3 | Winter | January | 7 | 1998 |
| 4 | Fall | October | 20 | 2001 |
| 5 | Winter | February | 28 | 2022 |
Function name | Explanation | Example | Output |
---|---|---|---|
head(n) | Returns the first n rows. If n is not given, the first 5 rows are returned. | df.head(4) | Rows 0 through 3 of df. |
tail(n) | Returns the last n rows. If n is not given, the last 5 rows are returned. | df.tail(3) | Rows 3 through 5 of df. |
info() | Prints a summary of the DataFrame, including the column names, data types, and the number of non-null values. The summary also reports the DataFrame's memory usage. | df.info() |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Season     6 non-null      object
 1   Month      6 non-null      object
 2   Month-day  6 non-null      int64
 3   Year       6 non-null      int64
dtypes: int64(2), object(2)
memory usage: 320.0+ bytes |
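describe() | Generates summary statistics for each numeric column: count, mean, standard deviation, minimum, maximum, and quartiles. | df.describe() | A table of count, mean, std, min, 25%, 50%, 75%, and max for the Month-day and Year columns. |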
value_counts() | Counts the occurrences of each unique value in a column, when a column is passed as an argument, and presents the counts in descending order. | df.value_counts('Season') |
Season
Fall 2
Summer 2
Winter 2
dtype: int64 |
unique() | Returns an array of the unique values in a column when called on that column. | df['Season'].unique() | ['Summer' 'Fall' 'Winter'] |
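The functions in the table can be tried together on the sample DataFrame. The sketch below rebuilds `df` from the code above and stores each result in a variable; the variable names are assumptions for illustration. It calls `value_counts()` on the Season column itself, which is an alternative to passing the column name to `df.value_counts()`.

```python
import pandas as pd

# Recreate the sample DataFrame from this section
days = {
    'Season': ['Summer', 'Summer', 'Fall', 'Winter', 'Fall', 'Winter'],
    'Month': ['July', 'June', 'September', 'January', 'October', 'February'],
    'Month-day': [1, 12, 3, 7, 20, 28],
    'Year': [2000, 1990, 2020, 1998, 2001, 2022]
}
df = pd.DataFrame(days)

first_four = df.head(4)       # first 4 rows
last_three = df.tail(3)       # last 3 rows
stats = df.describe()         # summary statistics for the numeric columns

# Count how often each season appears, and list the distinct seasons
season_counts = df['Season'].value_counts()
seasons = df['Season'].unique()

print(first_four)
print(last_three)
print(stats)
print(season_counts)
print(seasons)

# info() prints its summary directly instead of returning a value
df.info()
```

Individual statistics can be read from the result of `describe()` by label, for example `stats.loc['max', 'Year']`.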
Exploring further
Please refer to the Pandas user guide for more information about the Pandas library.
Programming practice with Google
Use the Google Colaboratory document below to practice Pandas functionalities to extract insights from a dataset.