Principles of Data Science

9.1 Encoding Univariate Data


Learning Outcomes

By the end of this section, you should be able to:

  • 9.1.1 Visualize data with properly labeled boxplots, histograms, and Pareto charts.
  • 9.1.2 Use Python to generate various data visualizations for univariate data.

We’ve seen that data may originate from surveys, experiments, polls, questionnaires, sensors, or other sources. Data may be represented as numeric values, such as age or salary, or as categories, such as hair color or political affiliation. Once data is collected and organized, a data scientist is interested in creating various visualizations, graphs, and charts to facilitate communication and assist in detecting trends and patterns.

Encoding text involves converting the text into a numerical representation for machine learning algorithms to process (see Data Cleaning and Preprocessing). Univariate data includes observations or measurements on a single characteristic or attribute, and the data need to be converted into a structured format that is suitable for analysis and visualization. Bivariate data refers to data collected on two (possibly related) variables such as “years of experience” and “salary.” The data visualization method chosen will depend on whether the data is univariate or bivariate, as discussed later in the chapter. For example, boxplots might be used to graph univariate data, whereas a scatterplot might be used to graph bivariate data.

Univariate data can be visualized in many ways, including as bar graphs (or bar charts). Bar graphs present categorical data in a summarized form based on frequency or relative frequency. For example, a college administrator might be interested in creating a bar chart to show the geographic diversity of students enrolling in the college during a certain semester.

In this section, we discuss three commonly used visualization methods—boxplots, histograms, and Pareto charts.

Boxplots

A boxplot, or box-and-whisker plot, is a graph used to display the distribution of a dataset based on its minimum, quartiles, and maximum. Together these values are called the five-number summary, and thus a boxplot displays the minimum, Q1 (first quartile), median, Q3 (third quartile), and maximum of a dataset. Note that a boxplot can be created using a horizontal format or vertical format. A boxplot can also be created with multiple plots, which is helpful for comparing two or more variables side by side. A boxplot is also useful for identifying outliers and investigating the skewness of a dataset.

For example, a human resources administrator might want to compare salaries for men versus women employees and thus create two side-by-side boxplots showing the distribution of men’s salaries vs. women’s salaries at the company. An example of a side-by-side boxplot created using Python appears in Using Python to Create Boxplots.

The steps needed to create a horizontal boxplot are given in the following list; keep in mind that typically we use a tool such as Python or Excel to create these boxplots.

  1. Create a horizontal scale that extends from the minimum to the maximum of the dataset.
  2. Draw a rectangle above the numerical scale where the left side of the rectangle is placed at the first quartile of the dataset and the right side of the rectangle is placed at the third quartile of the dataset.
  3. Draw a vertical line within the rectangle located at the median.
  4. Draw a whisker to the left extending from the rectangle to the location of the minimum. Also, draw a whisker to the right extending from the rectangle to the location of the maximum.
  5. An optional step is to plot outliers on the boxplot either to the left of the left whisker or to the right of the right whisker. (See Descriptive Statistics: Statistical Measurements and Probability Distributions for a further explanation regarding identification of outliers). Typically, an outlier is plotted using an asterisk or other symbol on the boxplot. A boxplot that includes outliers is sometimes referred to as a modified boxplot.

Note that boxplots can also be created in a vertical format; an example of a vertical boxplot created using Python is shown in Example 9.1.

Boxplots provide important information to the data scientist to assist with identifying the shape of the distribution for a dataset. After creating the boxplot, the researcher can visualize the distribution to identify its shape, whether there are outliers, and if any skewness is present.

The following guidelines may be useful in determining the shape of a distribution in a boxplot:

  • Variability: Recall that the length of the box represents the interquartile range, IQR. Both the IQR and the length of the whiskers can provide information about the spread and variability of the data. A narrower box and shorter whiskers indicate lower variability, while a wider box and longer whiskers indicate higher variability.
  • Outliers: Boxplots can provide a visual representation of outliers in the data. Outliers are data points that can be plotted outside the whiskers of the boxplot. Identifying outliers can provide insights into the tails of the distribution and help determine whether the distribution is skewed.
  • Skewness: Boxplots can help identify skewness in the distribution. Recall from Collecting and Preparing Data that skewness refers to the lack of symmetry in a distribution of data. If one whisker (either the left or right whisker) is longer than the other, it indicates skewness toward that side. For instance, if the left whisker is longer, the distribution may be left-skewed, whereas if the right whisker is longer, the distribution may be right-skewed.
  • Symmetry: The symmetry of the distribution can be determined by examining the position of the median line (the line inside the rectangle). If the median line is closer to one end of the box, the distribution may be skewed in that direction. If the median line is positioned close to the center of the box, the distribution is likely symmetric.

Using Python to Create Boxplots

To help with data visualization, Python offers many graphing capabilities through a package called Matplotlib, which we first introduced in Python Basics for Data Science.

Matplotlib is a library of plotting routines for creating high-quality visualizations for data exploration, analysis, and presentation. Matplotlib contains functions such as bar, plot, boxplot, and scatter that can be used to generate bar charts, time series graphs, boxplots, and scatterplots, respectively. We’ve already seen how this package is used to generate a variety of visualizations, including scatterplots (in Inferential Statistics and Regression Analysis), time series graphs (in Time Series and Forecasting), and clustering results (in Decision-Making Using Machine Learning Basics). Here we’ll show how it can be used to create boxplots, histograms, and Pareto charts.

The following Python command will import this library and label this as plt in the Python code:

import matplotlib.pyplot as plt

Import Command

The import command allows the user to create a shortened reference to the imported module, so in the preceding example, the shortened reference plt can be used to refer to Matplotlib’s pyplot module.

Once the import command is executed, a boxplot can be created in Python using the following syntax:

plt.boxplot([minimum, q1, median, q3, maximum], vert=False, labels=['specific label'], widths=0.1)

The parameter vert controls whether the boxplot is to be displayed in vertical format or horizontal format.

The parameter labels specifies a label to be placed next to the boxplot. (In Matplotlib 3.9 and later, this parameter is named tick_labels, and labels is deprecated.)

The parameter widths specifies the width of the rectangle.

Consider a human resources manager who collects the following salary data for a sample of employees at a company (data in US dollars):

42000, 43500, 47900, 52375, 54000, 56125, 58350, 62429, 65000, 71325, 79428, 85984, 92000, 101860, 119432, 139450, 140000

The following is the five-number summary for this dataset and the corresponding boxplot showing the distribution of salaries at a company (refer to Descriptive Statistics: Statistical Measurements and Probability Distributions for details on how to calculate the median and quartiles of a dataset):

Five-number summary for salaries at a company:

  • Minimum Salary = $42,000
  • First Quartile Salary = $54,000
  • Median Salary = $65,000
  • Third Quartile Salary = $92,000
  • Maximum Salary = $140,000
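As a rough check, the five-number summary above can be reproduced with Python's standard statistics module. This is a minimal sketch rather than the textbook's own method: it assumes the "inclusive" quartile convention, which matches the values listed here, whereas other quartile conventions can give slightly different results. The variable names are illustrative.

```python
import statistics

# Salary data from the text (US dollars)
salaries = [42000, 43500, 47900, 52375, 54000, 56125, 58350, 62429, 65000,
            71325, 79428, 85984, 92000, 101860, 119432, 139450, 140000]

# quantiles(..., n=4) returns the three cut points Q1, median, and Q3.
# Assumption: the "inclusive" method is the convention used for this dataset.
q1, median, q3 = statistics.quantiles(salaries, n=4, method="inclusive")

five_number_summary = (min(salaries), q1, median, q3, max(salaries))
print("Minimum, Q1, Median, Q3, Maximum:", five_number_summary)
```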

The following Python code will generate the boxplot of salaries in a horizontal format.

Python Code

    # import matplotlib 
    
    import matplotlib.pyplot as plt
    import matplotlib.ticker as ticker 

    # Given values for the five-number summary
    minimum = 42000
    q1 = 54000
    median = 65000
    q3 = 92000
    maximum = 140000
    # Define a function to format the ticks with commas as thousands separators
    def format_ticks(value, tick_number):
        return f'{value:,.0f}'
    
    # Create a boxplot using the given values
    plt.boxplot([[minimum, q1, median, q3, maximum]], vert=False, labels=['Salaries'], widths=0.1)
    
    # Add labels and title
    plt.xlabel('Salaries of Employees at ABC Corporation')
    plt.title('Boxplot Showing Distribution of Salaries ($)')
    
    # Apply the custom formatter to the x-axis
    plt.gca().xaxis.set_major_formatter(ticker.FuncFormatter(format_ticks))
    
    # Show the plot
    plt.show()
    

The resulting output will look like this:

Boxplot showing the distribution of salaries at ABC Corporation. The minimum salary is $42,000, the first quartile (Q1) is $54,000, the median is $65,000, the third quartile (Q3) is $92,000, and the maximum salary is $140,000.

Notice in the boxplot above that the left edge and right edge of the rectangle represent the first quartile and third quartile, respectively. Recall from Measurements of Position that the interquartile range (IQR) is calculated as the third quartile less the first quartile, and thus the IQR is represented by the width of the rectangle shown in the boxplot. In this example, the first quartile is $54,000 and the third quartile is $92,000, and the IQR is thus $92,000 − $54,000, which is $38,000. The vertical line shown within the rectangle indicates the median, and in this example, the median salary is $65,000. Finally, the whisker on the left side of the boxplot extends to the minimum of the dataset and the whisker on the right side of the boxplot extends to the maximum of the dataset. In this example, the minimum salary at the company is $42,000 and the maximum salary is $140,000.

Notice from the boxplot that the vertical line within the box is not centered within the rectangle; this indicates that the distribution is not symmetric. Also, notice that the whisker on the right side of the rectangle that extends from Q3 to the maximum is longer than the whisker on the left side of the rectangle that extends from Q1 to the minimum; this gives an indication of a skewed distribution (right-skew).
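The same reading of the boxplot can be checked numerically. The following is a small sketch that works only from the five-number summary given above; the variable names are illustrative, and comparing whisker lengths and median position is a heuristic for skewness rather than a formal test.

```python
# Five-number summary from the text (US dollars)
minimum, q1, median, q3, maximum = 42000, 54000, 65000, 92000, 140000

iqr = q3 - q1                  # width of the box
left_whisker = q1 - minimum    # length of the left whisker
right_whisker = maximum - q3   # length of the right whisker
median_to_q1 = median - q1     # distance from median to left edge of box
median_to_q3 = q3 - median     # distance from median to right edge of box

print(f"IQR: {iqr:,}")
print(f"Left whisker: {left_whisker:,}  Right whisker: {right_whisker:,}")
print(f"Median to Q1: {median_to_q1:,}  Median to Q3: {median_to_q3:,}")

# Heuristic: a longer right whisker and a median nearer Q1 suggest right skew
suggests_right_skew = right_whisker > left_whisker and median_to_q3 > median_to_q1
print("Suggests right skew:", suggests_right_skew)
```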

Example 9.1 provides an example of this, using the same Python command to generate a boxplot in a vertical format.

Example 9.1

Problem

Create Python code to generate a boxplot in a vertical format for ages of Major League Baseball players based on the following dataset: SOCR-small.csv, provided in Table 9.1. (This dataset was initially introduced in What Are Data and Data Science?)

Name             Team  Position    Height (Inches)  Weight (Pounds)  Age (Years)
Paul_McAnulty    SD    Outfielder  70               220              26.01
Terrmel_Sledge   SD    Outfielder  72               185              29.95
Jack_Cust        SD    Outfielder  73               231              28.12
Jose_Cruz_Jr.    SD    Outfielder  72               210              32.87
Russell_Branyan  SD    Outfielder  75               195              31.2
Mike_Cameron     SD    Outfielder  74               200              34.14
Brian_Giles      SD    Outfielder  70               205              36.11
Mike_Thompson    SD    Pitcher     76               200              26.31
Clay_Hensley     SD    Pitcher     71               190              27.50
Chris_Young      SD    Pitcher     82               250              27.77
Greg_Maddux      SD    Pitcher     72               185              40.88
Jake_Peavy       SD    Pitcher     73               180              25.75
Table 9.1 Baseball Player Dataset (SOCR-small.csv)
source: http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_MLB_HeightsWeights
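One possible approach to this problem is sketched below. It is an illustration, not the textbook's official solution: the ages are typed in directly from the Age column of Table 9.1 rather than read from SOCR-small.csv, and the axis label, tick label, and title are assumptions. The orientation argument is omitted because a vertical boxplot is Matplotlib's default.

```python
import matplotlib.pyplot as plt

# Ages (in years) taken from the Age column of Table 9.1
ages = [26.01, 29.95, 28.12, 32.87, 31.2, 34.14, 36.11,
        26.31, 27.50, 27.77, 40.88, 25.75]

fig, ax = plt.subplots()

# Vertical orientation is the default, so no orientation argument is needed
bp = ax.boxplot(ages, widths=0.3)

# Label the single box and the axes (label text is illustrative)
ax.set_xticks([1])
ax.set_xticklabels(["Players"])
ax.set_ylabel("Age (Years)")
ax.set_title("Boxplot Showing Distribution of Player Ages")

plt.show()
```

The dictionary returned by boxplot holds the drawn line objects, so a value such as the plotted median can be read back with bp["medians"][0].get_ydata() if a numeric check is wanted.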

The following example illustrates the generation of a side-by-side boxplot using Python.

Example 9.2

Problem

A real estate agent would like to compare the distribution of home prices in San Francisco, California, versus San Jose, California. The agent collects data for a sample of recently sold homes as follows (housing prices are in U.S. $1000s).

San Francisco Housing Prices:

1250, 1050, 972, 1479, 1550, 1000, 1499, 1140, 1388, 1050, 1465

San Jose Housing Prices:

615, 712, 879, 992, 1125, 1300, 1305, 1322, 1498, 1510, 1623
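A side-by-side boxplot for these two samples might be produced as follows. This is a sketch under stated assumptions, not the textbook's official solution: the titles and labels are illustrative, and the tick labels are set with set_xticklabels so the code does not depend on the name of the boxplot label parameter, which differs across Matplotlib versions.

```python
import matplotlib.pyplot as plt

# Housing prices in U.S. $1000s, from the problem statement
san_francisco = [1250, 1050, 972, 1479, 1550, 1000, 1499, 1140, 1388, 1050, 1465]
san_jose = [615, 712, 879, 992, 1125, 1300, 1305, 1322, 1498, 1510, 1623]

fig, ax = plt.subplots()

# Passing a list of datasets draws one box per dataset, side by side
bp = ax.boxplot([san_francisco, san_jose], widths=0.4)

# Boxes are drawn at x positions 1 and 2
ax.set_xticks([1, 2])
ax.set_xticklabels(["San Francisco", "San Jose"])
ax.set_ylabel("Price (U.S. $1000s)")
ax.set_title("Side-by-Side Boxplots of Housing Prices")

plt.show()
```

Placing both boxes on a shared vertical scale makes it easy to compare the medians and the spread of the two cities at a glance.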

Histograms

A histogram is a graphical representation of the distribution of numerical data. Histograms are particularly useful for displaying the distribution of continuous data; they provide the visualization of a frequency (or relative frequency) distribution table.

For example, Figure 9.2 displays the distribution of heights (in inches to the nearest half-inch) of 100 semiprofessional male soccer players. In this histogram, heights are plotted on the horizontal axis and relative frequency of occurrence is plotted on the vertical axis. Notice the tallest bar in the histogram corresponds to heights between 65.95 and 67.95 inches.

A histogram displaying the distribution of heights (in inches to the nearest half inch) of 100 male semiprofessional soccer players. The X axis is labeled heights and ranges from 59.95 to 75.95. The Y axis is labeled relative frequency and ranges from 0 to 0.4. The tallest bar (0.4) corresponds to heights between 65.95 and 67.95 inches.
Figure 9.2 Histogram Displaying the Distribution of Heights (in inches to the nearest half-inch) of 100 Male Semiprofessional Soccer Players

Histograms are widely used in exploratory data analysis and descriptive statistics to gain insights into the distribution of numerical data, identify patterns, detect outliers, and make comparisons between different datasets.

A histogram is similar to a bar chart except that a histogram will not have any gaps between the bars whereas a bar chart typically has gaps between the bars. A histogram is constructed as a series of continuous intervals where the height of each bar represents a frequency or count of the number of data points falling within each interval. The vertical axis in a histogram is typically frequency (count) or relative frequency, and the horizontal axis is based on specific intervals for the data (called bins).

The range of values in the dataset is divided into intervals, also known as bins. These bins are usually of equal width and are created after an examination of the minimum and maximum of a dataset. There is no standard rule as to how many bins should be used, but a rule of thumb is to set up the number of bins to approximate the square root of the number of data points. For example, if there are 500 data values, then a researcher might decide to use approximately 22 bins to organize the data. Generally, the larger the dataset, the more bins are employed. A data scientist should experiment with different numbers of bins and then determine if the resulting histogram provides a reasonable visualization to represent the distribution of the variable.

The width of each bin can be determined using a simple formula based on the maximum and minimum of the dataset:

\[
\text{Width of Each Bin} = \frac{\text{max} - \text{min}}{\text{number of bins}}
\]
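The square-root rule of thumb and the bin-width formula can be expressed in a few lines of Python. This is a minimal sketch; the function names are hypothetical, and rounding the square root to the nearest integer is one reasonable choice among several.

```python
import math

def suggested_bin_count(n_values):
    """Rule of thumb: number of bins is roughly the square root of the data count."""
    return round(math.sqrt(n_values))

def bin_width(minimum, maximum, n_bins):
    """Width of each bin = (max - min) / number of bins."""
    return (maximum - minimum) / n_bins

# 500 data values suggest about 22 bins, as in the text
bins_for_500 = suggested_bin_count(500)
print(bins_for_500)

# For the 17 salaries given earlier (minimum 42000, maximum 140000)
salary_bins = suggested_bin_count(17)
salary_width = bin_width(42000, 140000, salary_bins)
print(salary_bins, salary_width)
```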

Like the boxplot, a histogram can provide a visual summary of the distribution of the data, including information about central tendency, spread, skewness, and presence of outliers.

The shape of a histogram can provide characteristics of the underlying data distribution. If the histogram shows the heights of bars where the tallest bar is in the center of the histogram and the bars decrease in height to the left and right of center, this is an indication of a bell-shaped distribution—also called a normal distribution (see Figure 9.3).

If the histogram shows taller bars on the left side or taller bars on the right side, this is indicative of a skewed distribution. Taller bars on the left side imply a longer tail on the right side of the histogram, which indicates a right-skewed distribution, whereas taller bars on the right side imply a longer tail on the left side of the histogram, which indicates a left-skewed distribution.

Examples of skewed distributions are shown in Figure 9.4 and Figure 9.5.

A bell-shaped histogram with an X axis from 4 to 10 representing a normal distribution. The data is concentrated between 6 and 8 with the highest bar in the center at 7.
Figure 9.3 Bell-Shaped Histogram
A histogram showing a skewed left distribution. The X axis ranges from 4 to 8 and the highest bars are on the right side.
Figure 9.4 Skewed Left Histogram
A histogram showing a skewed right distribution. The X axis ranges from 6 to 10 and the highest bars are on the left side.
Figure 9.5 Skewed Right Histogram

Exploring Further

Determining a Skewed or Symmetric Distribution with the Mean and Median

Another method to determine skewed versus bell-shaped (symmetric) distributions is through the use of the mean and median. In a symmetric distribution, we expect the mean and median to be approximately the same. In a right skewed distribution, the mean is typically greater than the median. In a left skewed distribution, the mean is typically less than the median. This Briefed by Data website provides animations and more detail on shapes of distributions, including examples of skewed distributions.
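As an illustration of this comparison, the sketch below computes the mean and median of the salary data given earlier in this section using Python's standard statistics module. The variable names are illustrative, and the comparison is a rule of thumb rather than a formal test of skewness.

```python
import statistics

# Salary data from earlier in this section (US dollars)
salaries = [42000, 43500, 47900, 52375, 54000, 56125, 58350, 62429, 65000,
            71325, 79428, 85984, 92000, 101860, 119432, 139450, 140000]

mean_salary = statistics.mean(salaries)
median_salary = statistics.median(salaries)

print(f"Mean:   {mean_salary:,.2f}")
print(f"Median: {median_salary:,.2f}")

# Rule of thumb: a mean noticeably above the median suggests right skew,
# and a mean noticeably below the median suggests left skew
if mean_salary > median_salary:
    print("Mean exceeds median: consistent with a right-skewed distribution")
elif mean_salary < median_salary:
    print("Mean is below median: consistent with a left-skewed distribution")
else:
    print("Mean equals median: consistent with a symmetric distribution")
```

For this dataset the mean is above the median, which agrees with the right skew seen in the salary boxplot.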

A histogram with multiple peaks is indicative of a bimodal or multimodal distribution—which indicates the possibility of subpopulations within the dataset.

A histogram showing bars that are all at the same approximate height is indicative of a uniform distribution, where the data points are evenly distributed across the range of values.

To create a histogram, follow these steps:

  1. Decide on the number of bins.
  2. Calculate the width of each bin.
  3. Set up a frequency distribution table with two columns: the first column shows the interval for each bin and the second column is the frequency for that interval, which is the count of the number of data values that fall within the given interval for each row in the table.
  4. Create the histogram by plotting the bin intervals on the x-axis and creating bars whose heights are determined by the frequency for each row in the table.
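The first three steps above can be sketched in plain Python. This is an illustrative implementation under stated assumptions: the function name frequency_table is hypothetical, bins are equal-width, each bin includes its left edge, and the maximum value is placed in the last bin. The salary data from earlier in this section is used with 4 bins, which is roughly the square root of 17.

```python
def frequency_table(data, n_bins):
    """Return (edges, counts) for equal-width bins spanning min(data)..max(data)."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins                      # step 2: bin width
    counts = [0] * n_bins
    for x in data:
        # Index of the bin containing x; the maximum falls in the last bin
        idx = min(int((x - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts

salaries = [42000, 43500, 47900, 52375, 54000, 56125, 58350, 62429, 65000,
            71325, 79428, 85984, 92000, 101860, 119432, 139450, 140000]

edges, counts = frequency_table(salaries, n_bins=4)  # step 1: choose 4 bins

# Step 3: print the two-column frequency distribution table
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:>9,.0f} to {right:>9,.0f}   {count}")
```

The resulting counts are the bar heights that step 4 would plot against the bin intervals.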

Using Python to Create Histograms

Typically, histograms are generated using some form of technology. Python includes a function called hist() as part of the Matplotlib library to generate histograms, as demonstrated in Example 9.3.

Example 9.3

Problem

A corporate manager is analyzing the following monthly sales data for 30 salespeople (with sales amounts in U.S. dollars):

4969, 4092, 2277, 4381, 3675, 6134, 4490, 6381, 6849, 3134, 7174, 2809, 6057, 7501, 4745, 3415, 2174, 4570, 5957, 5999, 3452, 3390, 3872, 4491, 4670, 5288, 5210, 5934, 6832, 4933

Generate a histogram using Python for this data and comment on the shape of the distribution. Use 5 bins for the histogram.
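One way to approach this problem with hist() is sketched below. It is an illustration rather than the textbook's official solution; the axis labels, title, and edge color are assumptions chosen for readability.

```python
import matplotlib.pyplot as plt

# Monthly sales (U.S. dollars) for 30 salespeople, from the problem statement
sales = [4969, 4092, 2277, 4381, 3675, 6134, 4490, 6381, 6849, 3134,
         7174, 2809, 6057, 7501, 4745, 3415, 2174, 4570, 5957, 5999,
         3452, 3390, 3872, 4491, 4670, 5288, 5210, 5934, 6832, 4933]

fig, ax = plt.subplots()

# hist() returns the bin counts, the bin edges, and the drawn bar objects
counts, edges, _ = ax.hist(sales, bins=5, edgecolor="black")

ax.set_xlabel("Monthly Sales ($)")
ax.set_ylabel("Frequency")
ax.set_title("Histogram of Monthly Sales for 30 Salespeople")

print("Counts per bin:", counts)
print("Bin edges:", edges)

plt.show()
```

With 5 equal-width bins, the tallest bar is the middle one and the bar heights fall off evenly on both sides, which suggests a roughly symmetric, bell-shaped distribution for this sample.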

Pareto Charts

A Pareto chart is a bar graph for categorical data where the taller bars are on the left side of the chart and the bars are placed in a descending height from the left side of the chart to the right side of the chart.

A Pareto chart is unique in that it combines both bar and line graphs to represent data in descending order of frequency or importance, along with the cumulative percentage of the total. The chart is named after Vilfredo Pareto, an Italian economist who identified the 80/20 rule, which states that roughly 80% of effects come from 20% of causes. This type of chart is commonly used in quality improvement efforts where a researcher would like to quickly identify the major contributors to a problem.

When creating a Pareto chart, two vertical scales are typically employed. The left vertical scale is based on frequency, whereas the right vertical scale is based on cumulative percentage. The left vertical scale corresponds to the bar graph, and the bars are arranged in descending order according to frequency, so the height of the bars will decrease from left to right. The right vertical scale is used to show the cumulative percentage for the various categories and corresponds to the line chart on the graph. The line chart allows the reader to quickly determine which categories contribute significantly to the overall percentage. For example, the line chart makes it easier to determine which categories account for 80% of the total, as shown in Example 9.4.

Example 9.4

Problem

A manufacturing engineer is interested in analyzing a high defect level on a smartphone manufacturing line and collects the following defect data over a one-month period (see Table 9.2).

Failure Category                Frequency (Number of Defects)
Cracked screen                  1348
Power button nonfunctional      739
Scratches on case               1543
Does not charge                 1410
Volume controls nonfunctional   595
Table 9.2 Smartphone Defect Data

Create a Pareto chart for these failure categories using Python.
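A Pareto chart for this data could be built along the following lines. This is a sketch, not the textbook's official solution: Matplotlib has no dedicated Pareto function, so the chart is assembled from a bar chart on the left axis and a cumulative-percentage line on a second axis created with twinx(). The colors, figure size, label rotation, and the dashed 80% reference line are assumptions.

```python
import matplotlib.pyplot as plt

# Defect counts from Table 9.2
defects = {
    "Cracked screen": 1348,
    "Power button nonfunctional": 739,
    "Scratches on case": 1543,
    "Does not charge": 1410,
    "Volume controls nonfunctional": 595,
}

# Sort categories by frequency, largest first
ordered = sorted(defects.items(), key=lambda item: item[1], reverse=True)
categories = [name for name, _ in ordered]
counts = [count for _, count in ordered]

# Running total expressed as a percentage of all defects
total = sum(counts)
cumulative_pct = []
running = 0
for count in counts:
    running += count
    cumulative_pct.append(100 * running / total)

fig, ax = plt.subplots(figsize=(9, 5))

# Left scale: bars in descending order of frequency
ax.bar(categories, counts)
ax.set_ylabel("Frequency (Number of Defects)")
ax.tick_params(axis="x", rotation=30)

# Right scale: cumulative percentage line
ax2 = ax.twinx()
ax2.plot(categories, cumulative_pct, color="red", marker="o")
ax2.set_ylim(0, 110)
ax2.set_ylabel("Cumulative Percentage (%)")
ax2.axhline(80, color="gray", linestyle="--")  # 80% reference line

ax.set_title("Pareto Chart of Smartphone Defects")
fig.tight_layout()
plt.show()
```

For this data, the three most frequent categories (scratches on case, does not charge, and cracked screen) together account for roughly 76% of all defects, so the cumulative line crosses 80% at the fourth category.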

Citation/Attribution

This book may not be used in the training of large language models or otherwise be ingested into large language models or generative AI offerings without OpenStax's permission.

Want to cite, share, or modify this book? This book uses the Creative Commons Attribution-NonCommercial-ShareAlike License and you must attribute OpenStax.

Attribution information
  • If you are redistributing all or part of this book in a print format, then you must include on every physical page the following attribution:
    Access for free at https://openstax.org/books/principles-data-science/pages/1-introduction
  • If you are redistributing all or part of this book in a digital format, then you must include on every digital page view the following attribution:
    Access for free at https://openstax.org/books/principles-data-science/pages/1-introduction
Citation information

© Dec 19, 2024 OpenStax. Textbook content produced by OpenStax is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License . The OpenStax name, OpenStax logo, OpenStax book covers, OpenStax CNX name, and OpenStax CNX logo are not subject to the Creative Commons license and may not be reproduced without the prior and express written consent of Rice University.