Dr. Shaun V. Ault; Dr. Soohyun Nam Liao; Larry Musolino

Learning Outcomes

By the end of this section, you should be able to:

9.1.1 Visualize data with properly labeled boxplots, histograms, and Pareto charts.
9.1.2 Use Python to generate various data visualizations for univariate data.

We’ve seen that data may originate from surveys, experiments, polls, questionnaires, sensors, or other sources. Data may be represented as numeric values, such as age or salary, or as categories, such as hair color, political affiliation, etc. Once data is collected and organized, a data scientist is interested in creating various visualizations, graphs, and charts to facilitate communication and assist in detecting trends and patterns.

Encoding text involves converting the text into a numerical representation for machine learning algorithms to process (see Data Cleaning and Preprocessing). Univariate data includes observations or measurements on a single characteristic or attribute, and the data need to be converted into a structured format that is suitable for analysis and visualization. Bivariate data refers to data collected on two (possibly related) variables such as “years of experience” and “salary.” The data visualization method chosen will depend on whether the data is univariate or bivariate, as discussed later in the chapter. For example, boxplots might be used to graph univariate data, whereas a scatterplot might be used to graph bivariate data.

Univariate data can be visualized in many ways, including as bar graphs (or bar charts). Bar graphs present categorical data in a summarized form based on frequency or relative frequency. For example, a college administrator might be interested in creating a bar chart to show the geographic diversity of students enrolling in the college during a certain semester.

In this section, we discuss three commonly used visualization methods—boxplots, histograms, and Pareto charts.

Boxplots

A boxplot or box-and-whisker plot is a graph used to display the distribution of a dataset based on the quartiles, minimum, and maximum of the data. These five values are called the five-number summary, and thus a boxplot displays the minimum, $Q_{1}$ (first quartile), median, $Q_{3}$ (third quartile), and the maximum of a dataset. Note that a boxplot can be created using a horizontal format or vertical format. Several boxplots can also be drawn on the same axes, which is helpful for comparing two or more variables side by side. A boxplot is also useful to identify outliers and investigate skewness of a dataset.

For example, a human resources administrator might want to compare salaries for men versus women employees and thus create two side-by-side boxplots showing the distribution of men’s salaries vs. women’s salaries at the company. An example of a side-by-side boxplot created using Python appears in Using Python to Create Boxplots.

The steps needed to create a horizontal boxplot are given in the following list; keep in mind that typically we use a tool such as Python or Excel to create these boxplots.

Create a horizontal scale that extends from the minimum to the maximum of the dataset.
Draw a rectangle above the numerical scale where the left side of the rectangle is placed at the first quartile of the dataset and the right side of the rectangle is placed at the third quartile of the dataset.
Draw a vertical line within the rectangle located at the median.
Draw a whisker to the left extending from the rectangle to the location of the minimum. Also, draw a whisker to the right extending from the rectangle to the location of the maximum.
An optional step is to plot outliers on the boxplot either to the left of the left whisker or to the right of the right whisker. (See Descriptive Statistics: Statistical Measurements and Probability Distributions for a further explanation regarding identification of outliers). Typically, an outlier is plotted using an asterisk or other symbol on the boxplot. A boxplot that includes outliers is sometimes referred to as a modified boxplot.
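The optional outlier step can be sketched in Python. The sketch below assumes the common 1.5 × IQR convention for flagging outliers (the referenced chapter gives the full explanation) and uses a small made-up dataset; the variable names and values are illustrative only.

```python
import statistics

# Hypothetical sample, used only to illustrate the outlier check
data = [12, 15, 16, 18, 19, 21, 22, 24, 25, 60]

# Quartiles computed with linear interpolation between data points
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: points more than 1.5 * IQR beyond the box are outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# In a modified boxplot, whiskers stop at the most extreme non-outlier values
inside = [x for x in data if lower_fence <= x <= upper_fence]

print("Fences:", lower_fence, upper_fence)
print("Outliers:", outliers)
print("Whiskers end at:", min(inside), max(inside))
```

With this sample, only the value 60 falls outside the fences, so it would be drawn as a separate symbol and the right whisker would end at 25.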

Note that boxplots can also be created in a vertical format; an example of a vertical boxplot created using Python is shown in Example 9.1.

Boxplots provide important information to the data scientist to assist with identifying the shape of the distribution for a dataset. After creating the boxplot, the researcher can visualize the distribution to identify its shape, whether there are outliers, and if any skewness is present.

The following guidelines may be useful in determining the shape of a distribution in a boxplot:

Variability: Recall that the length of the box represents the interquartile range, IQR. Both the IQR and the length of the whiskers can provide information about the spread and variability of the data. A narrower box and shorter whiskers indicate lower variability, while a wider box and longer whiskers indicate higher variability.
Outliers: Boxplots can provide a visual representation of outliers in the data. Outliers are data points that can be plotted outside the whiskers of the boxplot. Identifying outliers can provide insights into the tails of the distribution and help determine whether the distribution is skewed.
Skewness: Boxplots can help identify skewness in the distribution. Recall from Collecting and Preparing Data that skewness refers to the lack of symmetry in a distribution of data. If one whisker (either the left or right whisker) is longer than the other, it indicates skewness toward that side. For instance, if the left whisker is longer, the distribution may be left-skewed, whereas if the right whisker is longer, the distribution may be right-skewed.
Symmetry: The symmetry of the distribution can be determined by examining the position of the median line (the line inside the rectangle). If the median line is closer to one end of the box, the distribution may be skewed in that direction. If the median line is positioned close to the center of the box, the distribution is likely symmetric.

Using Python to Create Boxplots

To help with data visualization, Python offers many graphing capabilities through a package called Matplotlib, which we first introduced in Python Basics for Data Science.

Matplotlib is a library containing a variety of plotting routines used to create high-quality visualizations and graphs for data exploration, analysis, and presentation. Matplotlib contains functions such as bar, plot, boxplot, and scatter that can be used to generate bar charts, time series graphs, boxplots, and scatterplots, respectively. We’ve already seen how this package is used to generate a variety of visualizations, including scatterplots (in Inferential Statistics and Regression Analysis), time series graphs (in Time Series and Forecasting), and clustering results (in Decision-Making Using Machine Learning Basics). Here we’ll show how it can be used to create boxplots, histograms, and Pareto charts.

The following Python command will import this library and label this as plt in the Python code:

import matplotlib.pyplot as plt

Import Command

The import command with the keyword as allows the user to create a shortened reference (an alias) to the imported module, so in the preceding example, the shortened reference plt can be used to refer to the Matplotlib plotting routines.

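Once the import command is executed, a boxplot can be created in Python using the following syntax. The first argument is the data to plot; boxplot computes the quartiles from whatever values it receives, so passing only the five summary values (as is done in this section) draws the box at those quartiles:

plt.boxplot([minimum, q1, median, q3, maximum], vert=False, labels=['specific label'], widths=0.1)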

The parameter vert controls whether the boxplot is to be displayed in vertical format or horizontal format.

The parameter labels specifies a label to be placed next to the boxplot.

The parameter widths specifies the width of the rectangle.

Consider a human resources manager who collects the following salary data for a sample of employees at a company (data in US dollars):

42000, 43500, 47900, 52375, 54000, 56125, 58350, 62429, 65000, 71325, 79428, 85984, 92000, 101860, 119432, 139450, 140000

The following is the five-number summary for this dataset and the corresponding boxplot showing the distribution of salaries at a company (refer to Descriptive Statistics: Statistical Measurements and Probability Distributions for details on how to calculate the median and quartiles of a dataset):

Five-number summary for salaries at a company:

$$\begin{array}{rcl} \text{Minimum Salary} & = & \$42{,}000 \\ \text{First Quartile Salary} & = & \$54{,}000 \\ \text{Median Salary} & = & \$65{,}000 \\ \text{Third Quartile Salary} & = & \$92{,}000 \\ \text{Maximum Salary} & = & \$140{,}000 \end{array}$$
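As a quick check, the five-number summary can be computed directly from the 17 salaries. This is a minimal sketch using Python's standard statistics module; the "inclusive" quartile method is an assumption chosen because it interpolates between data points, and other quartile conventions can give slightly different values on other datasets.

```python
import statistics

# Salary sample from the text (US dollars)
salaries = [42000, 43500, 47900, 52375, 54000, 56125, 58350, 62429, 65000,
            71325, 79428, 85984, 92000, 101860, 119432, 139450, 140000]

# quantiles() with n=4 returns the three quartile cut points
q1, median, q3 = statistics.quantiles(salaries, n=4, method="inclusive")

five_number_summary = [min(salaries), q1, median, q3, max(salaries)]
print(five_number_summary)
```

For this dataset the computed quartiles land exactly on data values, which is why the summary contains only whole-dollar amounts.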

The following Python code will generate the boxplot of salaries in a horizontal format.

Python Code

    # import matplotlib 
    
    import matplotlib.pyplot as plt
    import matplotlib.ticker as ticker 

    # Given values for the five-number summary
    minimum = 42000
    q1 = 54000
    median = 65000
    q3 = 92000
    maximum = 140000
    # Define a function to format the ticks with commas as thousands separators
    def format_ticks(value, tick_number):
        return f'{value:,.0f}'
    
    # Create a boxplot using the given values
    plt.boxplot([[minimum, q1, median, q3, maximum]], vert=False, labels=['Salaries'], widths=0.1)
    
    # Add labels and title
    plt.xlabel('Salaries of Employees at ABC Corporation')
    plt.title('Boxplot Showing Distribution of Salaries ($)')
    
    # Apply the custom formatter to the x-axis
    plt.gca().xaxis.set_major_formatter(ticker.FuncFormatter(format_ticks))
    
    # Show the plot
    plt.show()

The resulting output will look like this:

Boxplot showing the distribution of salaries at ABC Corporation. The minimum salary is $42,000, the first quartile (Q1) is $54,000, the median is $65,000, the third quartile (Q3) is $92,000, and the maximum salary is $140,000.

Notice in the boxplot above that the left edge and right edge of the rectangle represent the first quartile and third quartile, respectively. Recall from Measurements of Position that the interquartile range (IQR) is calculated as the third quartile less the first quartile, and thus the IQR is represented by the width of the rectangle shown in the boxplot. In this example, the first quartile is $54,000 and the third quartile is $92,000, and the IQR is thus $92,000 - $54,000, which is $38,000. The vertical line shown within the rectangle indicates the median, and in this example, the median salary is $65,000. Finally, the whisker on the left side of the boxplot extends to the minimum of the dataset and the whisker on the right side of the boxplot extends to the maximum of the dataset. In this example, the minimum salary at the company is $42,000 and the maximum salary is $140,000.

Notice from the boxplot that the vertical line within the box is not centered within the rectangle; this indicates that the distribution is not symmetric. Also, notice that the whisker on the right side of the rectangle that extends from $Q_{3}$ to the maximum is longer than the whisker on the left side of the rectangle that extends from $Q_{1}$ to the minimum; this gives an indication of a skewed distribution (right-skew).
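The same reading of the boxplot can be reproduced with a few subtractions. The sketch below is a rough heuristic, not a formal test of skewness: it compares the two whisker lengths and the position of the median within the box, using the five-number summary for the salary data. The variable names are illustrative.

```python
# Five-number summary for the salary example (US dollars)
minimum, q1, median, q3, maximum = 42000, 54000, 65000, 92000, 140000

# Width of the box
iqr = q3 - q1

# Whisker lengths on each side of the box
left_whisker = q1 - minimum
right_whisker = maximum - q3

# Distance from the median line to each edge of the box
median_to_q1 = median - q1
q3_to_median = q3 - median

# Heuristic: a longer right whisker and a median nearer Q1 suggest right skew
if right_whisker > left_whisker and q3_to_median > median_to_q1:
    shape = "right-skewed"
elif left_whisker > right_whisker and median_to_q1 > q3_to_median:
    shape = "left-skewed"
else:
    shape = "roughly symmetric or mixed"

print(iqr, left_whisker, right_whisker, shape)
```

Here the right whisker ($48,000) is four times the length of the left whisker ($12,000), and the median sits closer to the first quartile, which agrees with the right-skew seen in the plot.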

Example 9.1 provides an example of this, using the same Python command to generate a boxplot in a vertical format.

Example 9.1

Problem

Create Python code to generate a boxplot in a vertical format for ages of Major League baseball players based on the following dataset: SOCR-small.csv, provided in Table 9.1. (This dataset was initially introduced in What Are Data and Data Science?.)

Name	Team	Position	Height (Inches)	Weight (Pounds)	Age (Years)
Paul_McAnulty	SD	Outfielder	70	220	26.01
Terrmel_Sledge	SD	Outfielder	72	185	29.95
Jack_Cust	SD	Outfielder	73	231	28.12
Jose_Cruz_Jr.	SD	Outfielder	72	210	32.87
Russell_Branyan	SD	Outfielder	75	195	31.2
Mike_Cameron	SD	Outfielder	74	200	34.14
Brian_Giles	SD	Outfielder	70	205	36.11
Mike_Thompson	SD	Pitcher	76	200	26.31
Clay_Hensley	SD	Pitcher	71	190	27.50
Chris_Young	SD	Pitcher	82	250	27.77
Greg_Maddux	SD	Pitcher	72	185	40.88
Jake_Peavy	SD	Pitcher	73	180	25.75

Table 9.1 Baseball Player Dataset (SOCR-small.csv)
source: http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_MLB_HeightsWeights

Solution

Here is the Python code to generate a vertical boxplot for the ages of a sample of baseball players. (Note: In order to generate a boxplot using a vertical format, set the parameter vert to the Boolean value True.)

Python Code

    import matplotlib.pyplot as plt
    
    # Sample data
    ages = [26.01, 29.95, 28.12, 32.87, 31.2, 34.14, 36.11, 26.31, 27.5, 27.77, 40.88, 25.75]
    
    # Creating the boxplot
    plt.boxplot(ages, vert=True)
    
    # Adding title and labels
    plt.title('Vertical Boxplot for Ages of Major League Baseball Players')
    plt.ylabel('Ages')
    
    # Display the plot
    plt.show()

The resulting output will look like this:

A vertical boxplot labeled as showing ages of major league baseball players. The Y axis is 26 to 40. The plot shows a median age of 29, with most players falling between the ages of 28 and 32. The whisker running vertically from the top of the rectangle is longer than the whisker running vertically from the bottom of the rectangle, indicating a right-skewed distribution.

Notice in the boxplot, the horizontal line representing the median is not centered within the rectangle, which indicates the distribution is not symmetric. Also, notice the whisker running vertically from the top of the rectangle is longer than the whisker running vertically from the bottom of the rectangle, and this gives an indication of a skewed distribution (right-skew).

The following example illustrates the generation of a side-by-side boxplot using Python.

Example 9.2

Problem

A real estate agent would like to compare the distribution of home prices in San Francisco, California, versus San Jose, California. The agent collects data for a sample of recently sold homes as follows (housing prices are in U.S. $1000s).

San Francisco Housing Prices:

1250, 1050, 972, 1479, 1550, 1000, 1499, 1140, 1388, 1050, 1465

San Jose Housing Prices:

615, 712, 879, 992, 1125, 1300, 1305, 1322, 1498, 1510, 1623

Solution

The Python code in this example uses the boxplot routine, which is part of the Matplotlib library:

Python Code

    import matplotlib.pyplot as plt
    import matplotlib.ticker as ticker
    
    # Housing prices for two cities (data in thousands)
    SanFran_prices = [1250, 1050, 972, 1479, 1550, 1000, 1499, 1140, 1388, 1050, 1465]
    SanJose_prices = [615, 712, 879, 992, 1125, 1300, 1305, 1322, 1498, 1510, 1623]
    
    # Put both datasets into a list (one boxplot is drawn per inner list)
    home_prices = [SanFran_prices, SanJose_prices]
    
    # Create the figure and axis
    fig, ax = plt.subplots()
    
    # Create the boxplot using boxplot routine
    ax.boxplot(home_prices, labels=['San Francisco Home Prices', 'San Jose Home Prices'])
    
    # Add titles and labels
    plt.title('Side-by-Side Boxplots Comparing Housing Prices for San Francisco and San Jose')
    plt.ylabel('Home Prices (in Thousands)')
    
    # Define a function to format the ticks for the y-axis to include dollar signs
    def format_ticks(value, tick_number):
        return f'${value:,.0f}'
    
    # Apply the custom formatter to the y-axis
    ax.yaxis.set_major_formatter(ticker.FuncFormatter(format_ticks))
    
    # Show the plot
    plt.show()

The resulting output will look like this:

Side-by-side boxplots comparing home prices in San Francisco and San Jose. San Francisco home prices range from approximately $900,000 to $1,500,000 with a median price around $1,250,000. San Jose home prices range from approximately $600,000 to $1,600,000 with a median price around $1,300,000. San Jose has a wider price range.

Notice in the boxplot for San Francisco housing prices, the distribution appears symmetric, and the median housing price appears to be approximately $1,250 thousand (that is, $1,250,000). The median home price for San Jose appears to be approximately $1,300 thousand. The boxplot for San Jose home prices indicates more dispersion as compared to the boxplot for San Francisco home prices.
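The visual comparison can be checked numerically. The following is a minimal sketch, assuming the "inclusive" quartile method from Python's statistics module (other quartile conventions may differ slightly); the helper name summarize is hypothetical.

```python
import statistics

# Housing prices in thousands of US dollars
san_fran = [1250, 1050, 972, 1479, 1550, 1000, 1499, 1140, 1388, 1050, 1465]
san_jose = [615, 712, 879, 992, 1125, 1300, 1305, 1322, 1498, 1510, 1623]

def summarize(prices):
    """Return the median and two measures of spread for a list of prices."""
    q1, median, q3 = statistics.quantiles(prices, n=4, method="inclusive")
    return {"median": median, "iqr": q3 - q1, "range": max(prices) - min(prices)}

sf_summary = summarize(san_fran)
sj_summary = summarize(san_jose)

print("San Francisco:", sf_summary)
print("San Jose:     ", sj_summary)
```

Both the interquartile range and the overall range are larger for San Jose, which matches the taller box and longer whiskers in its boxplot.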

Histograms

A histogram is a graphical representation of the distribution of numerical data. Histograms are particularly useful for displaying the distribution of continuous data; they provide the visualization of a frequency (or relative frequency) distribution table.

For example, Figure 9.2 displays the distribution of heights (in inches to the nearest half-inch) of 100 semiprofessional male soccer players. In this histogram, heights are plotted on the horizontal axis and relative frequency of occurrence is plotted on the vertical axis. Notice the tallest bar in the histogram corresponds to heights between 65.95 and 67.95 inches.

A histogram displaying the distribution of heights (in inches to the nearest half inch) of 100 male semiprofessional soccer players. The X axis is labeled heights and ranges from 59.95 to 75.95. The Y axis is labeled relative frequency and ranges from 0 to 0.4. The tallest bar (0.4) corresponds to heights between 65.95 and 67.95 inches.

Figure 9.2 Histogram Displaying the Distribution of Heights (in inches to the nearest half-inch) of 100 Male Semiprofessional Soccer Players

Histograms are widely used in exploratory data analysis and descriptive statistics to gain insights into the distribution of numerical data, identify patterns, detect outliers, and make comparisons between different datasets.

A histogram is similar to a bar chart except that a histogram will not have any gaps between the bars whereas a bar chart typically has gaps between the bars. A histogram is constructed as a series of contiguous intervals where the height of each bar represents a frequency or count of the number of data points falling within each interval. The vertical axis in a histogram is typically frequency (count) or relative frequency, and the horizontal axis is based on specific intervals for the data (called bins).

The range of values in the dataset is divided into intervals, also known as bins. These bins are usually of equal width and are created after an examination of the minimum and maximum of a dataset. There is no standard rule as to how many bins should be used, but a rule of thumb is to set up the number of bins to approximate the square root of the number of data points. For example, if there are 500 data values, then a researcher might decide to use approximately 22 bins to organize the data. Generally, the larger the dataset, the more bins are employed. A data scientist should experiment with different numbers of bins and then determine if the resulting histogram provides a reasonable visualization to represent the distribution of the variable.

The width of each bin can be determined using a simple formula based on the maximum and minimum of the dataset:

$$\text{Width of Each Bin} = \frac{\text{max} - \text{min}}{\text{number of bins}}$$
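The square-root rule of thumb and the bin-width formula can be sketched as two small helper functions. The function names and the minimum and maximum values used below are hypothetical, chosen only to illustrate the arithmetic.

```python
import math

def suggested_bin_count(n_points):
    """Rule of thumb: number of bins is roughly the square root of n."""
    return round(math.sqrt(n_points))

def bin_width(data_min, data_max, n_bins):
    """Width of each bin when the range is split into equal intervals."""
    return (data_max - data_min) / n_bins

# 500 data values, as in the example in the text
bins = suggested_bin_count(500)

# Hypothetical minimum and maximum, for illustration only
width = bin_width(10, 120, bins)

print(bins, width)
```

The rule gives only a starting point; trying a few nearby bin counts is still worthwhile.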

Like the boxplot, a histogram can provide a visual summary of the distribution of the data, including information about central tendency, spread, skewness, and presence of outliers.

The shape of a histogram can provide characteristics of the underlying data distribution. If the histogram shows the heights of bars where the tallest bar is in the center of the histogram and the bars decrease in height to the left and right of center, this is an indication of a bell-shaped distribution—also called a normal distribution (see Figure 9.3).

If the histogram shows taller bars on the left side or taller bars on the right side, this is indicative of a skewed distribution. Taller bars on the left side imply a longer tail on the right side of the histogram, which indicates a skewed right distribution, whereas taller bars on the right side imply a longer tail on the left side of the histogram, which indicates a skewed left distribution.

Examples of skewed distributions are shown in Figure 9.4 and Figure 9.5.

A bell-shaped histogram with an X axis from 4 to 10 representing a normal distribution. The data is concentrated between 6 and 8 with the highest bar in the center at 7.

Figure 9.3 Bell-Shaped Histogram

A histogram showing a skewed left distribution. The X axis ranges from 4 to 8 and the highest bars are on the right side.

Figure 9.4 Skewed Left Histogram

A histogram showing a skewed right distribution. The X axis ranges from 6 to 10 and the highest bars are on the left side.

Figure 9.5 Skewed Right Histogram

Exploring Further

Determining a Skewed or Symmetric Distribution with the Mean and Median

Another method to determine skewed versus bell-shaped (symmetric) distributions is through the use of the mean and median. In a symmetric distribution, we expect the mean and median to be approximately the same. In a right skewed distribution, the mean is typically greater than the median. In a left skewed distribution, the mean is typically less than the median. This Briefed by Data website provides animations and more detail on shapes of distributions, including examples of skewed distributions.
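As an illustration of the mean-versus-median comparison, the sketch below reuses the employee salary sample from the boxplot discussion. It is a rough indicator only; the comparison suggests a direction of skew but does not measure it.

```python
import statistics

# Salary sample from the boxplot discussion (US dollars)
salaries = [42000, 43500, 47900, 52375, 54000, 56125, 58350, 62429, 65000,
            71325, 79428, 85984, 92000, 101860, 119432, 139450, 140000]

mean_salary = statistics.mean(salaries)
median_salary = statistics.median(salaries)

# A mean well above the median suggests a longer right tail
if mean_salary > median_salary:
    direction = "right-skewed"
elif mean_salary < median_salary:
    direction = "left-skewed"
else:
    direction = "symmetric"

print(f"mean = {mean_salary:,.0f}, median = {median_salary:,.0f}: {direction}")
```

The mean (about $77,127) exceeds the median ($65,000), which is consistent with the right-skew seen in the salary boxplot.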

A histogram with multiple peaks is indicative of a bimodal or multimodal distribution—which indicates the possibility of subpopulations within the dataset.

A histogram showing bars that are all at the same approximate height is indicative of a uniform distribution, where the data points are evenly distributed across the range of values.

To create a histogram, follow these steps:

Decide on the number of bins.
Calculate the width of each bin.
Set up a frequency distribution table with two columns: the first column shows the interval for each bin and the second column is the frequency for that interval, which is the count of the number of data values that fall within the given interval for each row in the table.
Create the histogram by plotting the bin intervals on the x-axis and creating bars whose heights are determined by the frequency for each row in the table.

Using Python to Create Histograms

Typically, histograms are generated using some form of technology. Python includes a function called hist() as part of the Matplotlib library to generate histograms, as demonstrated in Example 9.3.

Example 9.3

Problem

A corporate manager is analyzing the following monthly sales data for 30 salespeople (with sales amounts in U.S. dollars):

4969, 4092, 2277, 4381, 3675, 6134, 4490, 6381, 6849, 3134, 7174, 2809, 6057, 7501, 4745, 3415, 2174, 4570, 5957, 5999, 3452, 3390, 3872, 4491, 4670, 5288, 5210, 5934, 6832, 4933

Generate a histogram using Python for this data and comment on the shape of the distribution. Use 5 bins for the histogram.

Solution

The function hist(), which is part of the Matplotlib library, can be used to generate the histogram. The Python program calls plt.hist() with the data array and the number of bins as arguments.

Python Code

 
    import matplotlib.pyplot as plt
    import matplotlib.ticker as ticker
    
    # Define a function to format the ticks with commas as thousands separators
    def format_ticks(value, tick_number):
        return f'{value:,.0f}'
    
    
    # Dataset of sales amounts
    sales = [4969, 4092, 2277, 4381, 3675, 6134, 4490, 6381, 6849, 3134, 7174, 2809, 6057, 7501, 4745, 3415, 2174, 4570, 5957, 5999, 3452, 3390, 3872, 4491, 4670, 5288, 5210, 5934, 6832, 4933]
    
    # Create histogram
    # specify 5 bins to be used when creating the histogram
    plt.hist(sales, bins=5, color='blue', edgecolor='black')
    
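    # Add labels and title
    plt.xlabel('Sales Amounts ($)')
    plt.ylabel('Frequency')
    plt.title('Histogram of Sales')
    
    # Apply the custom formatter to the x-axis
    plt.gca().xaxis.set_major_formatter(ticker.FuncFormatter(format_ticks))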
    
    # Display the histogram
    plt.show()

The resulting output will look like this:

A bell-shaped histogram labeled histogram of sales. The X axis is labeled Sales Amounts ($) and ranges from 2,000 to 7,000. The Y axis is labeled Frequency and ranges from 0 to 10. There is a normal distribution with the highest bar in the center at 5,000.

Notice that the shape of the histogram appears bell-shaped and symmetric, in that the tallest bar is in the center of the histogram and the bars decrease in height to the left and right of center.
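The manual steps for building a histogram (choose the bins, compute the width, tally a frequency table) can be traced for this sales data. The sketch below uses plain Python and assumes equal-width bins that span the minimum to the maximum, with the maximum value counted in the last bin.

```python
sales = [4969, 4092, 2277, 4381, 3675, 6134, 4490, 6381, 6849, 3134,
         7174, 2809, 6057, 7501, 4745, 3415, 2174, 4570, 5957, 5999,
         3452, 3390, 3872, 4491, 4670, 5288, 5210, 5934, 6832, 4933]

# Steps 1 and 2: number of bins and width of each bin
num_bins = 5
low, high = min(sales), max(sales)
width = (high - low) / num_bins

# Step 3: frequency table (the maximum value is placed in the last bin)
counts = [0] * num_bins
for value in sales:
    index = min(int((value - low) / width), num_bins - 1)
    counts[index] += 1

# Print each bin interval with its frequency
for i, count in enumerate(counts):
    start = low + i * width
    print(f"{start:>8,.1f} to {start + width:>8,.1f}: {count}")
```

The frequencies come out as 4, 6, 10, 6, and 4, which is the symmetric, center-peaked pattern visible in the histogram.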

Pareto Charts

A Pareto chart is a bar graph for categorical data where the taller bars are on the left side of the chart and the bars are placed in a descending height from the left side of the chart to the right side of the chart.

A Pareto chart is unique in that it combines both bar and line graphs to represent data in descending order of frequency or importance, along with the cumulative percentage of the total. The chart is named after Vilfredo Pareto, an Italian economist who identified the 80/20 rule, which states that roughly 80% of effects come from 20% of causes. This type of chart is commonly used in quality improvement efforts where a researcher would like to quickly identify the major contributors to a problem.

When creating a Pareto chart, two vertical scales are typically employed. The left vertical scale is based on frequency, whereas the right vertical scale is based on cumulative percentage. The left vertical scale corresponds to the bar graph and the bars are then oriented in descending order according to frequency, so the height of the bars will decrease from left to right. The right vertical scale is used to show the cumulative percentage for the various categories and corresponds to the line chart on the graph. The line chart allows the reader to quickly determine which categories contribute significantly to the overall percentage. For example, the line chart makes it easier to determine which categories make up 80% of the cumulative frequency, as shown in Example 9.4.

Example 9.4

Problem

A manufacturing engineer is interested in analyzing a high defect level on a smartphone manufacturing line and collects the following defect data over a one-month period (see Table 9.2).

Failure Category	Frequency (Number of Defects)
Cracked screen	1348
Power button nonfunctional	739
Scratches on case	1543
Does not charge	1410
Volume controls nonfunctional	595

Table 9.2 Smartphone Defect Data

Create a Pareto chart for these failure categories using Python.

Solution

We can use pandas to create a DataFrame to store the failure categories and number of defects. Then we will calculate the cumulative frequency percentages for display on the Pareto chart.

The DataFrame is sorted in descending order of number of defects. The cumulative percentage is calculated using the pandas cumsum() method.

In the code, ax1 is the primary (left-hand) y-axis for frequency, and ax2 is the secondary (right-hand) y-axis, which plots the cumulative percentage. The twinx method in Matplotlib creates the second y-axis so that the two y-axis scales share the same x-axis.

Python Code

    import pandas as pd
    import matplotlib.pyplot as plt
    
    # Manufacturing data results for one-month time period
    data = {
        'Failure_Category': ['Cracked Screen', 'Power Button', 'Scratches', 'Does Not Charge', 'Volume Controls'],
        'Frequency': [1348, 739, 1543, 1410, 595]
    }
    
    # Create a DataFrame from the data
    df = pd.DataFrame(data)
    
    # Sort the DataFrame in descending order based on the frequency
    df_sorted = df.sort_values(by='Frequency', ascending=False)
    
    # Calculate the cumulative percentage
    df_sorted['Cumulative Percentage'] = (df_sorted['Frequency'].cumsum() / df_sorted['Frequency'].sum()) * 100
    
    # Create the Pareto chart
    fig, ax1 = plt.subplots()
    
    # Plot the bars in descending order
    ax1.bar(df_sorted['Failure_Category'], df_sorted['Frequency'])
    ax1.set_ylabel('Frequency')
    
    # Plot the cumulative frequency line chart
    ax2 = ax1.twinx()
    ax2.plot(df_sorted['Failure_Category'], df_sorted['Cumulative Percentage'], color='blue', marker='o')
    ax2.set_ylabel('Cumulative Percentage (%)')
    
    # Rotate the x-axis category labels so they do not overlap
    plt.setp(ax1.get_xticklabels(), rotation=45, ha='right')
    
    # Title
    plt.title('Pareto Chart of Smartphone Manufacturing Defects')
    
    # Show the plot
    plt.show()

The resulting output will look like this:

A pareto chart of smartphone manufacturing defects. The X axis has options for “Scratches,” “Does not charge,” “Cracked Screen,” “Power Button” and “Volume Controls.” The Y axis has frequency ranging from 0 to 1,600. A line chart overlays the bars, showing the cumulative percentage of defects. The chart reveals that these five defect types account for 100% of the recorded defects.

Notice in the Pareto chart that the leftmost three bars account for approximately 76% of the overall defect level (4,301 of 5,635 defects), so a quality engineer can quickly determine that the three categories of “Scratches,” “Does Not Charge,” and “Cracked Screen” account for the majority of the defects.

According to the Pareto chart that is created, the three defect categories of “Scratches,” “Does Not Charge,” and “Cracked Screen” are major contributors to the defect level and should be investigated with higher priority. The two defect categories of “Power Button” and “Volume Controls” are not major contributors and should be investigated with lower priority.
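The cumulative percentages behind the line chart can also be tabulated without plotting. The sketch below uses only the Python standard library and the defect counts from Table 9.2; the 80% threshold follows the 80/20 rule mentioned earlier, and the variable names are illustrative.

```python
from itertools import accumulate

# Defect counts from Table 9.2
defects = {
    "Cracked screen": 1348,
    "Power button nonfunctional": 739,
    "Scratches on case": 1543,
    "Does not charge": 1410,
    "Volume controls nonfunctional": 595,
}

# Sort categories by frequency, largest first
ranked = sorted(defects.items(), key=lambda item: item[1], reverse=True)
total = sum(defects.values())

# Running total of defects, converted to a percentage of all defects
cumulative = list(accumulate(count for _, count in ranked))
cumulative_pct = [100 * c / total for c in cumulative]

for (category, count), pct in zip(ranked, cumulative_pct):
    print(f"{category:<32}{count:>6}{pct:>8.1f}%")

# Number of top categories needed to reach at least 80% of all defects
categories_for_80 = next(i + 1 for i, pct in enumerate(cumulative_pct) if pct >= 80)
print("Categories needed to reach 80%:", categories_for_80)
```

The top three categories reach about 76.3% of all defects, and adding the fourth category brings the cumulative share to about 89.4%.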

9.1 Encoding Univariate Data
