
Key Concepts

8.1 Gathering and Organizing Data

  • Categorical data places units into groups (categories), while quantitative data is a numerical measure of a property of a unit.
  • The sampling method for a study depends on the way that randomization is used to select units for the sample.
  • Frequency distributions help to summarize data by counting the number of units that fall into a particular category or range of quantitative values.
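As a minimal sketch of the frequency distributions described above, the following Python snippet counts units by category and by range of values. The data, the variable names, and the bin width of 10 are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical categorical data: the favorite color reported by each unit.
colors = ["red", "blue", "blue", "green", "red", "blue"]
color_counts = Counter(colors)

# Hypothetical quantitative data: exam scores grouped into ranges of width 10.
scores = [62, 75, 78, 81, 84, 88, 90, 95]
bin_width = 10  # assumed width; other choices give other distributions
score_counts = Counter((s // bin_width) * bin_width for s in scores)

# Categorical frequency distribution: one count per category.
for category, count in color_counts.most_common():
    print(f"{category}: {count}")

# Quantitative frequency distribution: one count per range of values.
for low in sorted(score_counts):
    print(f"{low}-{low + bin_width - 1}: {score_counts[low]}")
```

Every unit lands in exactly one category or range, so the counts in each distribution add up to the number of units.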

8.2 Visualizing Data

  • Categorical data can be visualized using pie charts or bar charts; quantitative data can be visualized using stem-and-leaf plots or histograms.
  • Areas in pie charts and bar charts represent proportions of the data falling into a particular category, while areas in histograms represent proportions of the data that fall into a given range of data values (or “bins”). Stem-and-leaf plots are visual representations of entire datasets.
  • By manipulating the axes, changing widths of bars, or making bad choices for bins, we can create data visualizations that misrepresent the distribution of data.
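To illustrate how the choice of bins can change what a histogram shows, here is a rough Python sketch. The dataset, the helper name bin_counts, and the two bin widths are all assumptions made for this example.

```python
def bin_counts(values, width):
    """Count how many values fall in each bin [low, low + width)."""
    counts = {}
    for v in values:
        low = (v // width) * width
        counts[low] = counts.get(low, 0) + 1
    return dict(sorted(counts.items()))

# Hypothetical ages that form two separate clusters.
ages = [21, 22, 23, 24, 25, 41, 42, 43, 44, 45]

narrow = bin_counts(ages, 5)   # bins of width 5 keep the two clusters visible
wide = bin_counts(ages, 50)    # one bin of width 50 hides them entirely

# Print a simple text histogram for each choice of bin width.
for label, counts in (("width 5", narrow), ("width 50", wide)):
    print(label)
    for low, count in counts.items():
        print(f"  {low:>3}: {'#' * count}")
```

Both histograms describe the same data, but only the narrower bins reveal that no values fall between 26 and 40.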

8.3 Mean, Median and Mode

  • The mode of a dataset is the value that appears the most frequently. The median is a value that is greater than or equal to at least 50% of the data and less than or equal to at least 50% of the data. The mean is the sum of all the data values, divided by the number of units in the dataset.
  • The median of a dataset is largely unaffected by outliers, but the mean is pulled toward outliers. This distinction might affect which measure of centrality is used to summarize a dataset.
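The three measures of centrality, and the effect of an outlier on them, can be sketched with Python's standard statistics module. The salary figures below are hypothetical.

```python
import statistics

# Hypothetical salaries, in thousands of dollars.
salaries = [40, 42, 45, 45, 48]
with_outlier = salaries + [400]  # add one extreme value

for label, data in (("original", salaries), ("with outlier", with_outlier)):
    print(label)
    print("  mode:  ", statistics.mode(data))
    print("  median:", statistics.median(data))
    print("  mean:  ", round(statistics.mean(data), 1))
```

In this example the median stays at 45 after the outlier is added, while the mean rises from 44 to about 103.3, which is larger than every value except the outlier itself.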

8.4 Range and Standard Deviation

  • The range of a dataset is the difference between its largest and smallest values. The standard deviation is roughly the typical distance (in absolute value) between individual data values and the mean of the dataset.
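The following Python sketch computes the range and the standard deviation of a small hypothetical dataset. It shows both the population and the sample versions of the standard deviation, since which one applies is an assumption that depends on context, and it also computes the mean absolute deviation for comparison.

```python
import statistics

# Hypothetical dataset.
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)
mean = statistics.mean(data)

# Mean absolute deviation: the literal average distance from the mean.
mean_abs_dev = sum(abs(x - mean) for x in data) / len(data)

# Population standard deviation divides by n; the sample version by n - 1.
pop_sd = statistics.pstdev(data)
sample_sd = statistics.stdev(data)

print("range:", data_range)
print("mean absolute deviation:", mean_abs_dev)
print("population standard deviation:", pop_sd)
print("sample standard deviation:", round(sample_sd, 3))
```

Here the mean absolute deviation is 1.5 and the population standard deviation is 2, so the standard deviation is close to, but not the same as, the average distance from the mean.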

8.5 Percentiles

  • The percentile rank of a data value is the percentage of all values in the dataset that are less than or equal to the given value.
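The definition above translates directly into a short function. This is a minimal sketch; the function name percentile_rank and the scores are hypothetical.

```python
def percentile_rank(values, x):
    """Percentage of values in the dataset that are less than or equal to x."""
    at_or_below = sum(1 for v in values if v <= x)
    return 100 * at_or_below / len(values)

# Hypothetical exam scores.
scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]

# Six of the ten scores are less than or equal to 80.
print(percentile_rank(scores, 80))
```

With these scores, a value of 80 has percentile rank 60, because 6 of the 10 values are at or below it.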

8.6 The Normal Distribution

  • Normally distributed data follow a bell-shaped, symmetrical distribution.
  • The mean of normally distributed data falls at the peak of the distribution. The standard deviation of normally distributed data is the distance from the peak to either of the inflection points.
  • Data that are normally distributed follow the 68-95-99.7 Rule, which says that approximately 68% of the data fall within one standard deviation of the mean, 95% fall within two standard deviations, and 99.7% fall within three standard deviations.
  • The z-score for a data value is the number of standard deviations that value falls above (or below, if the z-score is negative) the mean.
  • We can use the normal distribution to estimate percentiles.
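The 68-95-99.7 Rule, z-scores, and percentile estimates can be checked numerically with statistics.NormalDist from Python's standard library (Python 3.8 or later). The mean of 100 and standard deviation of 15 are assumed values for illustration, and z_score and share_within are hypothetical helper names.

```python
from statistics import NormalDist

# Assumed parameters for a hypothetical normally distributed measurement.
mu, sigma = 100, 15
dist = NormalDist(mu=mu, sigma=sigma)

def z_score(x, mean, sd):
    """Number of standard deviations that x falls above (or below) the mean."""
    return (x - mean) / sd

def share_within(k):
    """Proportion of the distribution within k standard deviations of the mean."""
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

# Check the 68-95-99.7 Rule.
for k in (1, 2, 3):
    print(f"within {k} sd: {100 * share_within(k):.1f}%")

# Estimate the percentile of a single value from its position in the distribution.
x = 130
print("z-score:", z_score(x, mu, sigma))
print(f"estimated percentile: {100 * dist.cdf(x):.1f}")
```

A value of 130 is two standard deviations above the mean, so its estimated percentile is a little under 98, which is consistent with about 95% of the data falling within two standard deviations.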

8.7 Applications of the Normal Distribution

  • We can use z-scores to compare data values from different datasets.
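As a small sketch of such a comparison, suppose a student takes two different tests. The scores, means, and standard deviations below are hypothetical, as is the helper name z_score.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x falls above (or below) the mean."""
    return (x - mean) / sd

# Hypothetical results: (student's score, test mean, test standard deviation).
test_a = (80, 70, 5)
test_b = (85, 75, 10)

z_a = z_score(*test_a)
z_b = z_score(*test_b)

print("Test A z-score:", z_a)
print("Test B z-score:", z_b)
print("Stronger relative result:", "Test A" if z_a > z_b else "Test B")
```

Although the raw score on Test B is higher, the score on Test A is two standard deviations above its mean compared with one for Test B, so the Test A result is stronger relative to the other test takers.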

8.8 Scatter Plots, Correlation, and Regression Lines

  • If one variable affects the value of another variable, we say the first is an explanatory variable and the second is a response variable.
  • Scatter plots place a point in the xy-plane for each unit in the dataset. The x-value is the value of the explanatory variable, and the y-value is the value of the response variable.
  • The correlation coefficient r gives us information about the strength and direction of the relationship between two variables. If r is positive, the relationship is positive: an increase in the value of the explanatory variable tends to correspond to an increase in the value of the response variable. If r is negative, the relationship is negative: an increase in the value of the explanatory variable tends to correspond to a decrease in the value of the response variable. Values of r that are close to 0 indicate weak relationships, while values close to –1 or 1 indicate strong relationships.
  • The regression line for a relationship between two variables is the line that best represents the data. It can be used to predict values of the response variable for a given value of the explanatory variable.
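The correlation coefficient and regression line can be sketched in Python as follows. This example assumes the usual least-squares regression line, which is one common meaning of "the line that best represents the data". The data and the function name correlation_and_line are hypothetical.

```python
import math

def correlation_and_line(xs, ys):
    """Return (r, slope, intercept) for the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sxy / math.sqrt(sxx * syy)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return r, slope, intercept

# Hypothetical data: hours studied (explanatory) and exam score (response).
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 70, 74]

r, slope, intercept = correlation_and_line(hours, scores)
print(f"r = {r:.3f}")
print(f"regression line: y = {slope:.1f}x + {intercept:.1f}")

# Use the line to predict the response for a new value of the explanatory variable.
predicted = slope * 6 + intercept
print(f"predicted score for 6 hours: {predicted:.1f}")
```

For these values r is about 0.99, a strong positive relationship, and the line predicts a score of about 79.8 for 6 hours of study. Predictions far outside the range of the observed data should be treated with caution.
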
Citation/Attribution

This book may not be used in the training of large language models or otherwise be ingested into large language models or generative AI offerings without OpenStax's permission.

Want to cite, share, or modify this book? This book uses the Creative Commons Attribution License and you must attribute OpenStax.

Attribution information
  • If you are redistributing all or part of this book in a print format, then you must include on every physical page the following attribution:
    Access for free at https://openstax.org/books/contemporary-mathematics/pages/1-introduction
  • If you are redistributing all or part of this book in a digital format, then you must include on every digital page view the following attribution:
    Access for free at https://openstax.org/books/contemporary-mathematics/pages/1-introduction
Citation information

© Jul 25, 2024 OpenStax. Textbook content produced by OpenStax is licensed under a Creative Commons Attribution License. The OpenStax name, OpenStax logo, OpenStax book covers, OpenStax CNX name, and OpenStax CNX logo are not subject to the Creative Commons license and may not be reproduced without the prior and express written consent of Rice University.