Learning Outcomes
By the end of this section, you should be able to:
- 9.1.1 Visualize data with properly labeled boxplots, histograms, and Pareto charts.
- 9.1.2 Use Python to generate various data visualizations for univariate data.
We’ve seen that data may originate from surveys, experiments, polls, questionnaires, sensors, or other sources. Data may be represented as numeric values, such as age or salary, or as categories, such as hair color, political affiliation, etc. Once data is collected and organized, a data scientist is interested in creating various visualizations, graphs, and charts to facilitate communication and assist in detecting trends and patterns.
Encoding text involves converting the text into a numerical representation for machine learning algorithms to process (see Data Cleaning and Preprocessing). Univariate data includes observations or measurements on a single characteristic or attribute, and the data need to be converted into a structured format that is suitable for analysis and visualization. Bivariate data refers to data collected on two (possibly related) variables such as “years of experience” and “salary.” The data visualization method chosen will depend on whether the data is univariate or bivariate, as discussed later in the chapter. For example, boxplots might be used to graph univariate data, whereas a scatterplot might be used to graph bivariate data.
Univariate data can be visualized in many ways, including as bar graphs (or bar charts). Bar graphs present categorical data in a summarized form based on frequency or relative frequency. For example, a college administrator might be interested in creating a bar chart to show the geographic diversity of students enrolling in the college during a certain semester.
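The frequencies and relative frequencies behind a bar chart can be tallied in a few lines. The following is a minimal sketch using only the Python standard library; the list of home regions is hypothetical and serves only to illustrate the counting step.

```python
from collections import Counter

# Hypothetical home regions for ten enrolled students (illustrative data only)
regions = ["Northeast", "South", "West", "South", "Midwest",
           "South", "West", "Northeast", "South", "West"]

# Frequency: how many students come from each region
counts = Counter(regions)

# Relative frequency: each count divided by the total number of students
total = sum(counts.values())
relative = {region: count / total for region, count in counts.items()}

# Print the categories from most to least frequent
for region, count in counts.most_common():
    print(f"{region:<10} {count:>2}  {relative[region]:.0%}")
```

The category names and counts could then be passed to a bar-plotting routine, such as the bar function in Matplotlib, to draw the chart.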
In this section, we discuss three commonly used visualization methods—boxplots, histograms, and Pareto charts.
Boxplots
A boxplot, or box-and-whisker plot, is a graph that displays the distribution of a dataset based on its quartiles, minimum, and maximum. Together these values are called the five-number summary, and thus a boxplot displays the minimum, Q1 (first quartile), median, Q3 (third quartile), and maximum of a dataset. A boxplot can be drawn in a horizontal or vertical format. Several boxplots can also be drawn on the same axes, which is helpful for comparing two or more variables side by side. A boxplot is also useful for identifying outliers and investigating the skewness of a dataset.
For example, a human resources administrator might want to compare salaries for men versus women employees and thus create two side-by-side boxplots showing the distribution of men’s salaries vs. women’s salaries at the company. An example of a side-by-side boxplot created using Python appears in Using Python to Create Boxplots.
The steps needed to create a horizontal boxplot are given in the following list; keep in mind that typically we use a tool such as Python or Excel to create these boxplots.
- Create a horizontal scale that extends from the minimum to the maximum of the dataset.
- Draw a rectangle above the numerical scale where the left side of the rectangle is placed at the first quartile of the dataset and the right side of the rectangle is placed at the third quartile of the dataset.
- Draw a vertical line within the rectangle located at the median.
- Draw a whisker to the left extending from the rectangle to the location of the minimum. Also, draw a whisker to the right extending from the rectangle to the location of the maximum.
- An optional step is to plot outliers on the boxplot either to the left of the left whisker or to the right of the right whisker. (See Descriptive Statistics: Statistical Measurements and Probability Distributions for a further explanation regarding identification of outliers). Typically, an outlier is plotted using an asterisk or other symbol on the boxplot. A boxplot that includes outliers is sometimes referred to as a modified boxplot.
Note that boxplots can also be created in a vertical format; an example of a vertical boxplot created using Python is shown in Example 9.1.
Boxplots provide important information to the data scientist to assist with identifying the shape of the distribution for a dataset. After creating the boxplot, the researcher can visualize the distribution to identify its shape, whether there are outliers, and if any skewness is present.
The following guidelines may be useful in determining the shape of a distribution in a boxplot:
- Variability: Recall that the length of the box represents the interquartile range, IQR. Both the IQR and the length of the whiskers can provide information about the spread and variability of the data. A narrower box and shorter whiskers indicate lower variability, while a wider box and longer whiskers indicate higher variability.
- Outliers: Boxplots can provide a visual representation of outliers in the data. Outliers are data points that can be plotted outside the whiskers of the boxplot. Identifying outliers can provide insights into the tails of the distribution and help determine whether the distribution is skewed.
- Skewness: Boxplots can help identify skewness in the distribution. Recall from Collecting and Preparing Data that skewness refers to the lack of symmetry in a distribution of data. If one whisker (either the left or right whisker) is longer than the other, it indicates skewness toward that side. For instance, if the left whisker is longer, the distribution may be left-skewed, whereas if the right whisker is longer, the distribution may be right-skewed.
- Symmetry: The symmetry of the distribution can be assessed by examining the position of the median line (the line inside the rectangle). If the median line is positioned close to the center of the box, the distribution is likely symmetric. If the median line is closer to one end of the box, the distribution may be skewed toward the opposite end; for example, a median line near the left edge of the box suggests a longer right tail and thus a right skew.
Using Python to Create Boxplots
To help with data visualization, Python offers extensive graphing capabilities through a package called Matplotlib, which we first introduced in Python Basics for Data Science.
Matplotlib is a library containing a variety of plotting routines used for creating high-quality visualizations and graphs for data exploration, analysis, and presentation. Matplotlib contains functions such as bar, plot, boxplot, and scatter that can be used to generate bar charts, time series graphs, boxplots, and scatterplots, respectively. We’ve already seen how this package is used to generate a variety of visualizations, including scatterplots (in Inferential Statistics and Regression Analysis), time series graphs (in Time and Series Forecasting), and clustering results (in Decision-Making Using Machine Learning Basics). Here we’ll show how it can be used to create boxplots, histograms, and Pareto charts.
The following Python command will import this library and label it as plt in the Python code:
import matplotlib.pyplot as plt
Import Command
The import command allows the user to create a shortened reference to a package or module, so in the preceding example, the shortened reference plt can be used to refer to the pyplot module of the Matplotlib package.
Once the import command is executed, a boxplot can be created in Python using the following syntax:
plt.boxplot([minimum, q1, median, q3, maximum], vert=False, labels=['specific label'], widths=0.1)
The parameter vert controls whether the boxplot is displayed in vertical format (True, the default) or horizontal format (False).
The parameter labels specifies a label to be placed next to the boxplot. (Newer Matplotlib releases, 3.9 and later, rename this parameter to tick_labels and may show a deprecation warning for labels.)
The parameter widths specifies the width of the rectangle.
Consider a human resources manager who collects the following salary data for a sample of employees at a company (data in US dollars):
42000, 43500, 47900, 52375, 54000, 56125, 58350, 62429, 65000, 71325, 79428, 85984, 92000, 101860, 119432, 139450, 140000
The following is the five-number summary for this dataset and the corresponding boxplot showing the distribution of salaries at a company (refer to Descriptive Statistics: Statistical Measurements and Probability Distributions for details on how to calculate the median and quartiles of a dataset):
Five-number summary for salaries at a company: minimum = $42,000; Q1 = $54,000; median = $65,000; Q3 = $92,000; maximum = $140,000
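As a quick check, the five-number summary can be computed directly from the salary data. The sketch below uses the statistics module from the Python standard library. The choice of method="inclusive" is an assumption on our part: it interpolates linearly between data points, and other quartile conventions can give slightly different values.

```python
import statistics

# Salary data from the example (US dollars)
salaries = [42000, 43500, 47900, 52375, 54000, 56125, 58350, 62429, 65000,
            71325, 79428, 85984, 92000, 101860, 119432, 139450, 140000]

# quantiles(..., n=4) returns the three quartile cut points: Q1, median, Q3
q1, median, q3 = statistics.quantiles(salaries, n=4, method="inclusive")

# Assemble the five-number summary: minimum, Q1, median, Q3, maximum
five_number_summary = (min(salaries), q1, median, q3, max(salaries))
print(five_number_summary)
```

With this quartile convention, the result agrees with the values used in the boxplot code for this example: 42,000, 54,000, 65,000, 92,000, and 140,000.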
The following Python code will generate the boxplot of salaries in a horizontal format.
Python Code
# import matplotlib
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
# Given values for the five-number summary
minimum = 42000
q1 = 54000
median = 65000
q3 = 92000
maximum = 140000
# Define a function to format the ticks with commas as thousands separators
def format_ticks(value, tick_number):
    return f'{value:,.0f}'
# Create a boxplot using the given values
plt.boxplot([[minimum, q1, median, q3, maximum]], vert=False, labels=['Salaries'], widths=0.1)
# Add labels and title
plt.xlabel('Salaries of Employees at ABC Corporation')
plt.title('Boxplot Showing Distribution of Salaries ($)')
# Apply the custom formatter to the x-axis
plt.gca().xaxis.set_major_formatter(ticker.FuncFormatter(format_ticks))
# Show the plot
plt.show()
The resulting output will look like this:
Notice in the boxplot above that the left edge and right edge of the rectangle represent the first quartile and third quartile, respectively. Recall from Measurements of Position that the interquartile range (IQR) is calculated as the third quartile less the first quartile, and thus the IQR is represented by the width of the rectangle shown in the boxplot. In this example, the first quartile is $54,000 and the third quartile is $92,000, and the IQR is thus $92,000 − $54,000, which is $38,000. The vertical line shown within the rectangle indicates the median, and in this example, the median salary is $65,000. Finally, the whisker on the left side of the boxplot extends to the minimum of the dataset and the whisker on the right side of the boxplot extends to the maximum of the dataset. In this example, the minimum salary at the company is $42,000 and the maximum salary is $140,000.
Notice from the boxplot that the vertical line within the box is not centered within the rectangle; this indicates that the distribution is not symmetric. Also, notice that the whisker on the right side of the rectangle, which extends from Q3 to the maximum, is longer than the whisker on the left side of the rectangle, which extends from Q1 to the minimum; this gives an indication of a skewed distribution (right-skew).
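The visual reading above can be backed up with simple arithmetic on the five-number summary. The following sketch only subtracts the summary values given in this example; the variable names are ours and are chosen for illustration.

```python
# Five-number summary from the salary example (US dollars)
minimum, q1, median, q3, maximum = 42000, 54000, 65000, 92000, 140000

iqr = q3 - q1                    # width of the box
left_whisker = q1 - minimum      # length of the left whisker
right_whisker = maximum - q3     # length of the right whisker
left_of_median = median - q1     # part of the box to the left of the median line
right_of_median = q3 - median    # part of the box to the right of the median line

print(f"IQR: {iqr:,}")
print(f"Whiskers: left {left_whisker:,}, right {right_whisker:,}")
print(f"Box split by the median: left {left_of_median:,}, right {right_of_median:,}")
```

Both the right whisker and the right portion of the box are longer than their left counterparts, which is consistent with a right-skewed distribution.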
Example 9.1 uses the same Python command to generate a boxplot in a vertical format.
Example 9.1
Problem
Create Python code to generate a boxplot in a vertical format for the ages of Major League Baseball players based on the following dataset: SOCR-small.csv, provided in Table 9.1. (This dataset was initially introduced in What Are Data and Data Science?.)
Name | Team | Position | Height (Inches) | Weight (Pounds) | Age (Years) |
---|---|---|---|---|---|
Paul_McAnulty | SD | Outfielder | 70 | 220 | 26.01 |
Terrmel_Sledge | SD | Outfielder | 72 | 185 | 29.95 |
Jack_Cust | SD | Outfielder | 73 | 231 | 28.12 |
Jose_Cruz_Jr. | SD | Outfielder | 72 | 210 | 32.87 |
Russell_Branyan | SD | Outfielder | 75 | 195 | 31.2 |
Mike_Cameron | SD | Outfielder | 74 | 200 | 34.14 |
Brian_Giles | SD | Outfielder | 70 | 205 | 36.11 |
Mike_Thompson | SD | Pitcher | 76 | 200 | 26.31 |
Clay_Hensley | SD | Pitcher | 71 | 190 | 27.50 |
Chris_Young | SD | Pitcher | 82 | 250 | 27.77 |
Greg_Maddux | SD | Pitcher | 72 | 185 | 40.88 |
Jake_Peavy | SD | Pitcher | 73 | 180 | 25.75 |
Solution
Here is the Python code to generate a vertical boxplot for the ages of a sample of baseball players. (Note: To generate a boxplot in a vertical format, set the parameter vert to True, which is also the default.)
Python Code
import matplotlib.pyplot as plt
# Sample data
ages = [26.01, 29.95, 28.12, 32.87, 31.2, 34.14, 36.11, 26.31, 27.5, 27.77, 40.88, 25.75]
# Creating the boxplot (vert must be the Boolean True, not the string "TRUE")
plt.boxplot(ages, vert=True)
# Adding title and labels
plt.title('Vertical Boxplot for Ages of Major League Baseball Players')
plt.ylabel('Ages')
# Display the plot
plt.show()
The resulting output will look like this:
Notice in the boxplot, the horizontal line representing the median is not centered within the rectangle, which indicates the distribution is not symmetric. Also, notice the whisker running vertically from the top of the rectangle is longer than the whisker running vertically from the bottom of the rectangle, and this gives an indication of a skewed distribution (right-skew).
The following example illustrates the generation of a side-by-side boxplot using Python.
Example 9.2
Problem
A real estate agent would like to compare the distribution of home prices in San Francisco, California, versus San Jose, California. The agent collects data for a sample of recently sold homes as follows (housing prices are in U.S. $1000s).
San Francisco Housing Prices:
1250, 1050, 972, 1479, 1550, 1000, 1499, 1140, 1388, 1050, 1465
San Jose Housing Prices:
615, 712, 879, 992, 1125, 1300, 1305, 1322, 1498, 1510, 1623
Solution
The Python code in this example uses the boxplot routine, which is part of the Matplotlib library:
Python Code
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
# Housing prices for two cities (data in thousands)
SanFran_prices = [1250, 1050, 972, 1479, 1550, 1000, 1499, 1140, 1388, 1050, 1465]
SanJose_prices = [615, 712, 879, 992, 1125, 1300, 1305, 1322, 1498, 1510, 1623]
# Put the two datasets into a list (one boxplot is drawn per entry)
home_prices = [SanFran_prices, SanJose_prices]
# Create the figure and axis
fig, ax = plt.subplots()
# Create the boxplot using boxplot routine
ax.boxplot(home_prices, labels=['San Francisco Home Prices', 'San Jose Home Prices'])
# Add title and y-axis label
ax.set_title('Side-by-Side Boxplots Comparing Housing Prices for San Francisco and San Jose')
ax.set_ylabel('Home Prices (in Thousands)')
# Define a function to format the ticks for the y-axis to include dollar signs
def format_ticks(value, tick_number):
    return f'${value:,.0f}'
# Apply the custom formatter to the y-axis
ax.yaxis.set_major_formatter(ticker.FuncFormatter(format_ticks))
# Show the plot
plt.show()
The resulting output will look like this:
Notice that in the boxplot for San Francisco housing prices, the distribution appears roughly symmetric, and the median housing price is approximately $1,250 thousand (that is, $1.25 million). The median home price for San Jose is approximately $1,300 thousand. The boxplot for San Jose home prices indicates more dispersion than the boxplot for San Francisco home prices.
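Values read from a boxplot are approximate, so it can help to compute the medians and measures of spread directly from the data. The following is a minimal sketch using the standard statistics module. The helper name median_and_iqr is ours, and method="inclusive" is an assumed quartile convention (linear interpolation between data points); other conventions may give slightly different quartiles.

```python
import statistics

# Housing prices for the two cities (in thousands of US dollars)
san_francisco = [1250, 1050, 972, 1479, 1550, 1000, 1499, 1140, 1388, 1050, 1465]
san_jose = [615, 712, 879, 992, 1125, 1300, 1305, 1322, 1498, 1510, 1623]

def median_and_iqr(prices):
    """Return (median, IQR) using linearly interpolated quartiles."""
    q1, median, q3 = statistics.quantiles(prices, n=4, method="inclusive")
    return median, q3 - q1

for city, prices in [("San Francisco", san_francisco), ("San Jose", san_jose)]:
    median, iqr = median_and_iqr(prices)
    price_range = max(prices) - min(prices)
    print(f"{city}: median = {median:,.0f}, IQR = {iqr:,.1f}, range = {price_range:,}")
```

Under this convention, both the IQR and the range come out larger for San Jose than for San Francisco, which supports the observation that the San Jose prices are more dispersed.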
Histograms
A histogram is a graphical representation of the distribution of numerical data. Histograms are particularly useful for displaying the distribution of continuous data; they provide the visualization of a frequency (or relative frequency) distribution table.
For example, Figure 9.2 displays the distribution of heights (in inches to the nearest half-inch) of 100 semiprofessional male soccer players. In this histogram, heights are plotted on the horizontal axis and relative frequency of occurrence is plotted on the vertical axis. Notice that the tallest bar in the histogram corresponds to heights between 65.95 and 67.95 inches.
Histograms are widely used in exploratory data analysis and descriptive statistics to gain insights into the distribution of numerical data, identify patterns, detect outliers, and make comparisons between different datasets.
A histogram is similar to a bar chart except that a histogram has no gaps between the bars, whereas a bar chart typically does. A histogram is constructed as a series of contiguous intervals where the height of each bar represents the frequency, or count, of data points falling within that interval. The vertical axis in a histogram is typically frequency (count) or relative frequency, and the horizontal axis is based on specific intervals for the data (called bins).
The range of values in the dataset is divided into intervals, also known as bins. These bins are usually of equal width and are created after an examination of the minimum and maximum of a dataset. There is no standard rule as to how many bins should be used, but a rule of thumb is to set up the number of bins to approximate the square root of the number of data points. For example, if there are 500 data values, then a researcher might decide to use approximately 22 bins to organize the data. Generally, the larger the dataset, the more bins are employed. A data scientist should experiment with different numbers of bins and then determine if the resulting histogram provides a reasonable visualization to represent the distribution of the variable.
The width of each bin can be determined using a simple formula based on the maximum and minimum of the dataset:
Bin width = (maximum − minimum) / (number of bins)
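The square-root rule of thumb and the bin-width calculation can be sketched in a few lines. This sketch assumes equal-width bins, so that the bin width is the range of the data divided by the number of bins. The function names suggested_bins and bin_width are ours and are not part of any library.

```python
import math

def suggested_bins(n_values):
    """Square-root rule of thumb: use about sqrt(n) bins (at least 1)."""
    return max(1, round(math.sqrt(n_values)))

def bin_width(minimum, maximum, num_bins):
    """Width of each bin when equal-width bins span the range of the data."""
    return (maximum - minimum) / num_bins

# 500 data values, as in the example in the text
print(suggested_bins(500))

# The 17 salaries from the boxplot example range from 42,000 to 140,000
bins_for_salaries = suggested_bins(17)
width_for_salaries = bin_width(42000, 140000, bins_for_salaries)
print(bins_for_salaries, width_for_salaries)
```

For 500 values the rule suggests about 22 bins. For the 17 salaries it suggests 4 bins, each 24,500 dollars wide. These are only starting points, and trying a few nearby bin counts is still worthwhile.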
Like the boxplot, a histogram can provide a visual summary of the distribution of the data, including information about central tendency, spread, skewness, and presence of outliers.
The shape of a histogram can provide characteristics of the underlying data distribution. If the histogram shows the heights of bars where the tallest bar is in the center of the histogram and the bars decrease in height to the left and right of center, this is an indication of a bell-shaped distribution—also called a normal distribution (see Figure 9.3).
If the histogram shows taller bars on the left side or taller bars on the right side, this is indicative of a skewed distribution. Taller bars on the left side imply a longer tail on the right side of the histogram, which indicates a right-skewed distribution, whereas taller bars on the right side imply a longer tail on the left side, which indicates a left-skewed distribution.
Examples of skewed distributions are shown in Figure 9.4 and Figure 9.5.
Exploring Further
Determining a Skewed or Symmetric Distribution with the Mean and Median
Another method to determine skewed versus bell-shaped (symmetric) distributions is through the use of the mean and median. In a symmetric distribution, we expect the mean and median to be approximately the same. In a right skewed distribution, the mean is typically greater than the median. In a left skewed distribution, the mean is typically less than the median. This Briefed by Data website provides animations and more detail on shapes of distributions, including examples of skewed distributions.
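As an illustration of this comparison, the sketch below computes the mean and median of the salary data from the boxplot example earlier in this section, using the standard statistics module. Comparing the mean and median is a rough indicator of skew, not a formal test.

```python
import statistics

# Salary data from the boxplot example (US dollars)
salaries = [42000, 43500, 47900, 52375, 54000, 56125, 58350, 62429, 65000,
            71325, 79428, 85984, 92000, 101860, 119432, 139450, 140000]

mean_salary = statistics.mean(salaries)
median_salary = statistics.median(salaries)

print(f"Mean:   {mean_salary:,.2f}")
print(f"Median: {median_salary:,.2f}")

# Compare the two measures of center to get a rough indication of skew
if mean_salary > median_salary:
    print("Mean exceeds median: suggests a right-skewed distribution")
elif mean_salary < median_salary:
    print("Mean is below median: suggests a left-skewed distribution")
else:
    print("Mean equals median: consistent with a symmetric distribution")
```

Here the mean (about $77,127) is greater than the median ($65,000), which agrees with the right skew seen in the boxplot of the same data.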
A histogram with multiple peaks is indicative of a bimodal or multimodal distribution—which indicates the possibility of subpopulations within the dataset.
A histogram showing bars that are all at the same approximate height is indicative of a uniform distribution, where the data points are evenly distributed across the range of values.
To create a histogram, follow these steps:
- Decide on the number of bins.
- Calculate the width of each bin.
- Set up a frequency distribution table with two columns: the first column shows the interval for each bin and the second column is the frequency for that interval, which is the count of the number of data values that fall within the given interval for each row in the table.
- Create the histogram by plotting the bin intervals on the x-axis and creating bars whose heights are determined by the frequency for each row in the table.
Using Python to Create Histograms
Typically, histograms are generated using some form of technology. Python includes a function called hist() as part of the Matplotlib library to generate histograms, as demonstrated in Example 9.3.
Example 9.3
Problem
A corporate manager is analyzing the following monthly sales data for 30 salespeople (with sales amounts in U.S. dollars):
4969, 4092, 2277, 4381, 3675, 6134, 4490, 6381, 6849, 3134, 7174, 2809, 6057, 7501, 4745, 3415, 2174, 4570, 5957, 5999, 3452, 3390, 3872, 4491, 4670, 5288, 5210, 5934, 6832, 4933
Generate a histogram using Python for this data and comment on the shape of the distribution. Use 5 bins for the histogram.
Solution
The function hist(), which is part of the Matplotlib library, can be used to generate the histogram.
The Python program uses the plt.hist() function, and the data array is specified as well as the number of bins.
Python Code
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
# Define a function to format the ticks with commas as thousands separators
def format_ticks(value, tick_number):
    return f'{value:,.0f}'
# Dataset of sales amounts
sales = [4969, 4092, 2277, 4381, 3675, 6134, 4490, 6381, 6849, 3134, 7174, 2809, 6057, 7501, 4745, 3415, 2174, 4570, 5957, 5999, 3452, 3390, 3872, 4491, 4670, 5288, 5210, 5934, 6832, 4933]
# Create histogram
# specify 5 bins to be used when creating the histogram
plt.hist(sales, bins=5, color='blue', edgecolor='black')
# Apply the custom formatter to the x-axis
plt.gca().xaxis.set_major_formatter(ticker.FuncFormatter(format_ticks))
# Add labels and title
plt.xlabel('Sales Amounts ($)')
plt.ylabel('Frequency')
plt.title('Histogram of Sales')
# Display the histogram
plt.show()
The resulting output will look like this:
Notice that the shape of the histogram appears bell-shaped and symmetric, in that the tallest bar is in the center of the histogram and the bars decrease in height to the left and right of center.
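The symmetry can also be checked numerically by building the frequency table by hand, following the steps for creating a histogram listed earlier. The sketch below uses plain Python and assumes five equal-width bins spanning the minimum to the maximum, with the maximum value counted in the last bin.

```python
# Monthly sales data for the 30 salespeople (US dollars)
sales = [4969, 4092, 2277, 4381, 3675, 6134, 4490, 6381, 6849, 3134,
         7174, 2809, 6057, 7501, 4745, 3415, 2174, 4570, 5957, 5999,
         3452, 3390, 3872, 4491, 4670, 5288, 5210, 5934, 6832, 4933]

# Steps 1 and 2: choose the number of bins and compute the bin width
num_bins = 5
minimum, maximum = min(sales), max(sales)
width = (maximum - minimum) / num_bins

# Step 3: count how many values fall in each bin
counts = [0] * num_bins
for amount in sales:
    # Index of the bin containing this value; the maximum goes in the last bin
    index = min(int((amount - minimum) / width), num_bins - 1)
    counts[index] += 1

# Print the frequency distribution table
for i, count in enumerate(counts):
    lower = minimum + i * width
    upper = lower + width
    print(f"{lower:>8,.1f} to {upper:>8,.1f}: {count}")
```

The frequencies are 4, 6, 10, 6, and 4, which rise to a single peak in the middle bin and fall off evenly on both sides.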
Pareto Charts
A Pareto chart is a bar graph for categorical data where the taller bars are on the left side of the chart and the bars are placed in a descending height from the left side of the chart to the right side of the chart.
A Pareto chart is unique in that it combines both bar and line graphs to represent data in descending order of frequency or importance, along with the cumulative percentage of the total. The chart is named after Vilfredo Pareto, an Italian economist who identified the 80/20 rule, which states that roughly 80% of effects come from 20% of causes. This type of chart is commonly used in quality improvement efforts where a researcher would like to quickly identify the major contributors to a problem.
When creating a Pareto chart, two vertical scales are typically employed. The left vertical scale is based on frequency and corresponds to the bar graph; the bars are arranged in descending order of frequency, so their heights decrease from left to right. The right vertical scale is based on cumulative percentage and corresponds to the line chart on the graph. The line chart allows the reader to quickly determine which categories contribute significantly to the overall percentage. For example, the line chart makes it easier to determine which categories make up 80% of the cumulative frequency, as shown in Example 9.4.
Example 9.4
Problem
A manufacturing engineer is interested in analyzing a high defect level on a smartphone manufacturing line and collects the following defect data over a one-month period (see Table 9.2).
Failure Category | Frequency (Number of Defects) |
---|---|
Cracked screen | 1348 |
Power button nonfunctional | 739 |
Scratches on case | 1543 |
Does not charge | 1410 |
Volume controls nonfunctional | 595 |
Create a Pareto chart for these failure categories using Python.
Solution
We can use pandas to create a DataFrame to store the failure categories and number of defects. Then we will calculate the cumulative frequency percentages for display on the Pareto chart.
The DataFrame is sorted in descending order of number of defects. The cumulative percentage is calculated using the pandas cumsum() method.
The reference to ax1 is the primary y-axis for frequency (i.e., the left-hand y-axis). The reference to ax2 is the secondary y-axis, which plots the cumulative percentage (the right-hand y-axis). The twinx method in Matplotlib creates this second y-axis so that the two y-axis scales share the same x-axis.
Python Code
import pandas as pd
import matplotlib.pyplot as plt
# Manufacturing data results for one-month time period
data = {
    'Failure_Category': ['Cracked Screen', 'Power Button', 'Scratches', 'Does Not Charge', 'Volume Controls'],
    'Frequency': [1348, 739, 1543, 1410, 595]
}
# Create a DataFrame from the data
df = pd.DataFrame(data)
# Sort the DataFrame in descending order based on the frequency
df_sorted = df.sort_values(by='Frequency', ascending=False)
# Calculate the cumulative percentage
df_sorted['Cumulative Percentage'] = (df_sorted['Frequency'].cumsum() / df_sorted['Frequency'].sum()) * 100
# Create the Pareto chart
fig, ax1 = plt.subplots()
# Plot the bars in descending order
ax1.bar(df_sorted['Failure_Category'], df_sorted['Frequency'])
ax1.set_ylabel('Frequency')
# Plot the cumulative frequency line chart
ax2 = ax1.twinx()
ax2.plot(df_sorted['Failure_Category'], df_sorted['Cumulative Percentage'], color='blue', marker='o')
ax2.set_ylabel('Cumulative Percentage (%)')
# Rotate the x-axis category labels so they do not overlap
plt.setp(ax1.get_xticklabels(), rotation=45, ha='right')
# Title
plt.title('Pareto Chart of Smartphone Manufacturing Defects')
# Show the plot
plt.show()
The resulting output will look like this:
Notice in the Pareto chart that the leftmost three bars account for approximately 76% of the overall defect level, so a quality engineer can quickly determine that the three categories of “Scratches,” “Does Not Charge,” and “Cracked Screen” are the major contributors to the defect level and should be investigated with higher priority in quality improvement efforts. The two defect categories of “Power Button” and “Volume Controls” are not major contributors and should be investigated with lower priority.
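The cumulative percentages plotted on the line chart can also be computed without pandas. The following is a minimal sketch using only the Python standard library and the defect counts from Table 9.2; the variable names are ours.

```python
from itertools import accumulate

# Defect counts from Table 9.2
defects = {
    "Cracked screen": 1348,
    "Power button nonfunctional": 739,
    "Scratches on case": 1543,
    "Does not charge": 1410,
    "Volume controls nonfunctional": 595,
}

# Sort the categories from most to fewest defects
ordered = sorted(defects.items(), key=lambda item: item[1], reverse=True)

# Running totals of the sorted counts, converted to percentages of the total
total = sum(defects.values())
running_totals = accumulate(count for _, count in ordered)
cumulative_pct = [100 * running / total for running in running_totals]

for (category, count), pct in zip(ordered, cumulative_pct):
    print(f"{category:<32} {count:>5}  {pct:5.1f}%")
```

The third row of the output shows that the top three categories together account for about 76% of the 5,635 recorded defects.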