Introduction to Python Programming

15.4 Exploratory data analysis

Learning objectives

By the end of this section you should be able to:

  • Describe exploratory data analysis.
  • Inspect DataFrame entries through appropriate indexing.
  • Use filtering and slicing to obtain a subset of a DataFrame.
  • Identify Null values in a DataFrame.
  • Remove or replace Null values in a DataFrame.

Exploratory data analysis

Exploratory Data Analysis (EDA) is the task of analyzing data to gain insights, identify patterns, and understand the underlying structure of the data. During EDA, data scientists visually and statistically examine data to uncover relationships, anomalies, and trends, and to generate hypotheses for further analysis. The main goal of EDA is to become familiar with the data and assess the quality of the data. Once data are understood and cleaned, data scientists may perform feature creation and hypothesis formation. A feature is an individual variable or attribute that is calculated from the raw data in the dataset.
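
As a minimal sketch of a first look at a dataset followed by feature creation, consider the example below. The flights DataFrame, its column names, and its values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical data: three flights with distance (km) and duration (hours)
flights = pd.DataFrame({
    "flight": ["F1", "F2", "F3"],
    "distance_km": [300, 1200, 800],
    "duration_h": [0.5, 2.0, 1.0],
})

# Become familiar with the data: size, column types, summary statistics
print(flights.shape)       # (number of rows, number of columns)
print(flights.dtypes)      # data type of each column
print(flights.describe())  # summary statistics for the numeric columns

# Feature creation: average speed is calculated from two raw columns
flights["speed_kmh"] = flights["distance_km"] / flights["duration_h"]
print(flights)
```

Here speed_kmh is a feature: it is not stored in the raw data but is derived from distance_km and duration_h.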

Data indexing can be used to select and access specific rows and columns. Data indexing is essential in examining a dataset. In Pandas, two types of indexing methods exist:

  • Label-based indexing using loc[]: loc[] allows you to access data in a DataFrame using row/column labels. Ex: df.loc[row_label, column_label] returns specific data at the intersection of row_label and column_label.
  • Integer-based indexing using iloc[]: iloc[] allows you to access data in a DataFrame using integer positions, which start at 0. Ex: df.iloc[row_index, column_index] returns the entry at row position row_index and column position column_index.
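
The difference between the two methods is easiest to see when row labels are not integers. The sketch below assumes a small hypothetical DataFrame with string row labels; the names and values are illustrative only.

```python
import pandas as pd

# Hypothetical sample data with string row labels, so that labels and
# integer positions are clearly different
df = pd.DataFrame(
    {"name": ["Ada", "Ben", "Cy"], "score": [90, 75, 82]},
    index=["r1", "r2", "r3"],
)

by_label = df.loc["r2", "score"]  # row labeled "r2", column labeled "score"
by_position = df.iloc[1, 1]       # second row, second column (positions start at 0)

print(by_label)       # 75
print(by_position)    # 75
print(df.loc["r3"])   # the entire row labeled "r3", returned as a Series
```

When a DataFrame is created without an explicit index, as in the exercises in this section, the row labels are the integers 0, 1, 2, ..., so loc[] and iloc[] accept the same row values.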

Concepts in Practice

DataFrame indexing

Given the following code, respond to the questions below.

    import pandas as pd

    # Create sample data
    data = {
      "A": ["a", "b", "c", "d"],
      "B": [12, 20, 5, -10],
      "C": ["C", "C", "C", "C"]
    }

    df = pd.DataFrame(data)

1. What is the output of print(df.iloc[0, 0])?
  1. a
  2. b
  3. IndexError
2. What is the output of print(df.iloc[1, 1])?
  1. a
  2. b
  3. 20
3. What is the output of print(df.loc[2, 'A'])?
  1. c
  2. C
  3. b

Data slicing and filtering

Data slicing and filtering involve selecting specific subsets of data based on certain conditions or index/label ranges. Data slicing refers to selecting a subset of rows and/or columns from a DataFrame. Slicing can be performed using ranges, lists, or Boolean conditions.

  • Slicing using ranges: Ex: df.loc[start_row:end_row, start_column:end_column] selects rows and columns within the specified ranges. Unlike ordinary Python slices, a loc[] range includes both its start and end labels.
  • Slicing using a list: Ex: df.loc[[label1, label2, ...], :] selects rows that are in the list [label1, label2, ...] and includes all columns since all columns are selected by the colon operator.
  • Slicing based on a condition: df[condition] selects only the rows that meet the given condition.
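
The range and list forms of slicing can be sketched as follows. The DataFrame is a small illustrative example, and the iloc[] line is added only to contrast label ranges with position ranges.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3, 4], "B": [5, 6, 7, 8], "C": [9, 10, 11, 12]}
)

# Range slice with labels: rows 1 through 2 and columns "A" through "B".
# With loc, both ends of each range are included.
range_slice = df.loc[1:2, "A":"B"]

# List slice: rows labeled 0 and 3, all columns
list_slice = df.loc[[0, 3], :]

# Range slice with positions: iloc excludes the end position,
# so 1:3 selects positions 1 and 2 only
position_slice = df.iloc[1:3, 0:2]

print(range_slice)
print(list_slice)
print(position_slice)
```

In this example range_slice and position_slice contain the same rows and columns, even though the ranges are written differently.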

Data filtering involves selecting rows or columns based on certain conditions. Ex: In the expression df[df['column_name'] > threshold], the condition df['column_name'] > threshold produces a Boolean Series with one True or False value per row, and the selection operator ([]) uses that Series to filter the DataFrame df. Only the rows for which the condition is True are returned.
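
A minimal sketch of filtering with a Boolean condition is shown below. The city and temp_c columns and the threshold value are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Oslo", "Lima", "Cairo", "Pune"],
    "temp_c": [4, 19, 28, 31],
})
threshold = 20

# The condition alone is a Boolean Series with one value per row
mask = df["temp_c"] > threshold
print(mask)

# Passing the Boolean Series to [] keeps only the rows where it is True
warm = df[mask]
print(warm)

# Conditions can be combined with & (and) or | (or);
# each condition must be wrapped in parentheses
mild = df[(df["temp_c"] > 10) & (df["temp_c"] < 30)]
print(mild)
```

Filtering returns a new DataFrame and keeps the original row labels, so warm still carries the labels 2 and 3.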

Concepts in Practice

DataFrame slicing and filtering

Given the following code, respond to the questions.

    import pandas as pd

    # Create sample data
    data = {
      "A": [1, 2, 3, 4],
      "B": [5, 6, 7, 8],
      "C": [9, 10, 11, 12]
    }

    df = pd.DataFrame(data)

4. Which of the following returns the first three rows of column A?
  1. df.loc[1:3, "A"]
  2. df.loc[0:2, "A"]
  3. df.loc[0:3, "A"]
5. Which of the following returns the second row?
  1. df.iloc[2]
  2. df[2]
  3. df.loc[1]
6. Which of the following returns the first column?
  1. df.loc[0, :]
  2. df.iloc[:, 0]
  3. df.loc["A"]
7. Which of the following results in selecting the second and fourth rows of the DataFrame?
  1. df[df.loc[:, "A"] % 2 == 0]
  2. df[df[:, "A"] % 2 == 0]
  3. df[df.loc["A"] % 2 == 0]

Handling missing data

Missing values in a dataset can occur when data are not available or are not recorded properly. Identifying and handling missing values is an important step in data cleaning and preprocessing. A data scientist should keep ethical considerations in mind throughout the EDA process, especially when handling missing data, and might ask questions such as "Why are the data missing?", "Whose data are missing?", and "Considering the missing data, is the dataset still a representative sample of the population under study?". The functions below are useful in understanding and analyzing missing data.

  • isnull(): The isnull() function can be used to identify Null entries in a DataFrame. The return value of the function is a Boolean DataFrame, with the same dimensions as the original DataFrame with True values where missing values exist.
  • dropna(): The dropna() function can be used to drop rows with Null values.
  • fillna(): The fillna() function can be used to replace Null values with a provided substitute value. Ex: For a DataFrame df whose columns are all numeric, df.fillna(df.mean()) replaces the Null values in each column with the average of that column's non-Null values.
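
The three functions can be sketched together as follows. The column names and values are hypothetical, and numeric_only=True is used here as one way to compute means when the DataFrame also contains a text column.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "height": [1.5, np.nan, 1.7, 1.6],
    "weight": [60.0, 70.0, np.nan, np.nan],
})

# Count the Null entries in each column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Replace Nulls in the numeric columns with each column's mean.
# numeric_only=True skips the text column "name" when computing the means.
filled = df.fillna(df.mean(numeric_only=True))
print(filled)

# Alternatively, drop every row that contains at least one Null
dropped = df.dropna()
print(dropped)
```

Both fillna() and dropna() return a new DataFrame and leave df unchanged, so the result must be assigned to a variable to be kept.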

To define a Null value in a DataFrame, you can use the np.nan value from the NumPy library. Functions that aid in identifying and removing null entries are described in the table below the following code.

    import pandas as pd
    import numpy as np

    # Create sample data
    data = {
      "Column 1": ["A", "B", "C", "D", "E"],
      "Column 2": [np.nan, 200, 500, 0, -10],
      "Column 3": [True, True, False, np.nan, np.nan]
    }
    
    df = pd.DataFrame(data)

      Column 1  Column 2  Column 3
    0        A       NaN      True
    1        B     200.0      True
    2        C     500.0     False
    3        D       0.0       NaN
    4        E     -10.0       NaN

Table 15.5

Function: isnull()
Example:
    df.isnull()
Output:
      Column 1  Column 2  Column 3
    0    False      True     False
    1    False     False     False
    2    False     False     False
    3    False     False      True
    4    False     False      True
Explanation: df.isnull() returns a Boolean DataFrame whose values indicate whether each entry is Null.

Function: fillna()
Example:
    df["Column 2"] = df["Column 2"].fillna(df["Column 2"].mean())
Output:
      Column 1  Column 2  Column 3
    0        A     172.5      True
    1        B     200.0      True
    2        C     500.0     False
    3        D       0.0       NaN
    4        E     -10.0       NaN
Explanation: Null values in Column 2 are replaced with the mean of the non-Null values in the column.

Function: dropna()
Example:
    # Applied after running the fillna() example above
    df = df.dropna()
Output:
      Column 1  Column 2  Column 3
    0        A     172.5      True
    1        B     200.0      True
    2        C     500.0     False
Explanation: All rows containing a Null value are removed from the DataFrame.

Table 15.6 Null identification and removal examples.

Concepts in Practice

Missing value treatment

8. Which of the following is used to check the DataFrame for Null values?
  1. isnan()
  2. isnull()
  3. isnone()
9. Assuming that a DataFrame df is given, which of the following replaces Null values with zeros?
  1. df.fillna(0)
  2. df.replacena(0)
  3. df.fill(0)
10. Assuming that a DataFrame df is given, what does the expression df.isnull().sum() do?
  1. Calculates sum of the non-Null values in each column
  2. Calculates the number of Null values in the DataFrame
  3. Calculates the number of Null values in each column

Programming practice with Google

Use the Google Colaboratory document below to practice EDA on a given dataset.

Google Colaboratory document

Citation/Attribution

This book may not be used in the training of large language models or otherwise be ingested into large language models or generative AI offerings without OpenStax's permission.

Want to cite, share, or modify this book? This book uses the Creative Commons Attribution License and you must attribute OpenStax.

Attribution information
  • If you are redistributing all or part of this book in a print format, then you must include on every physical page the following attribution:
    Access for free at https://openstax.org/books/introduction-python-programming/pages/1-introduction
  • If you are redistributing all or part of this book in a digital format, then you must include on every digital page view the following attribution:
    Access for free at https://openstax.org/books/introduction-python-programming/pages/1-introduction
Citation information

© Jul 30, 2024 OpenStax. Textbook content produced by OpenStax is licensed under a Creative Commons Attribution License . The OpenStax name, OpenStax logo, OpenStax book covers, OpenStax CNX name, and OpenStax CNX logo are not subject to the Creative Commons license and may not be reproduced without the prior and express written consent of Rice University.

This book utilizes the OpenStax Python Code Runner. The code runner is developed by Wiley and is All Rights Reserved.