8.1 Gathering and Organizing Data

Categorical data places units into groups (categories), while quantitative data is a numerical measure of a property of a unit.
Sampling methods for a study are distinguished by the way that randomization is used to select units for the sample.
Frequency distributions help to summarize data by counting the number of units that fall into a particular category or range of quantitative values.
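As a minimal sketch of how a frequency distribution can be built, the Python below counts units per category and per range of values. The sample data, the helper names `categorical_frequencies` and `binned_frequencies`, and the choice of bins that include their lower edge but not their upper edge are all assumptions made for illustration.

```python
from collections import Counter


def categorical_frequencies(values):
    """Count how many units fall into each category."""
    return dict(Counter(values))


def binned_frequencies(values, start, width, n_bins):
    """Count how many values fall into each bin [low, low + width)."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:  # values outside every bin are ignored
            counts[index] += 1
    return {
        (start + i * width, start + (i + 1) * width): counts[i]
        for i in range(n_bins)
    }


# Hypothetical categorical data: favorite colors of six people.
colors = ["red", "blue", "red", "green", "blue", "red"]
color_counts = categorical_frequencies(colors)

# Hypothetical quantitative data: ten test scores, grouped in bins of width 10.
scores = [52, 67, 71, 74, 78, 83, 85, 88, 91, 96]
score_counts = binned_frequencies(scores, start=50, width=10, n_bins=5)

print(color_counts)
print(score_counts)
```

The same binned counts are what a histogram would display, so changing `width` here shows how the choice of bins changes the apparent shape of the data.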

8.2 Visualizing Data

Categorical data can be visualized using pie charts or bar charts; quantitative data can be visualized using stem-and-leaf plots or histograms.
Areas in pie charts and bar charts represent proportions of the data falling into a particular category, while areas in histograms represent proportions of the data that fall into a given range of data values (or “bins”). Stem-and-leaf plots are visual representations of entire datasets.
By manipulating the axes, changing widths of bars, or making bad choices for bins, we can create data visualizations that misrepresent the distribution of data.

The mode of a dataset is the value that appears most frequently. The median is a middle value: at least 50% of the data are less than or equal to it, and at least 50% of the data are greater than or equal to it. The mean is the sum of all the data values, divided by the number of units in the dataset.
The median of a dataset is largely unaffected by outliers, while the mean is pulled toward them. This distinction can affect which measure of centrality is used to summarize a dataset.
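The effect of an outlier on these measures can be seen in a short sketch using Python's standard `statistics` module. The dataset and the outlier value of 100 are made up for illustration.

```python
import statistics

# Hypothetical dataset with no outliers.
data = [2, 3, 3, 4, 5, 6, 12]

mode_value = statistics.mode(data)      # most frequent value
median_value = statistics.median(data)  # middle value of the sorted data
mean_value = statistics.mean(data)      # sum of values divided by count

# Add a single large outlier and recompute.
with_outlier = data + [100]
median_with_outlier = statistics.median(with_outlier)
mean_with_outlier = statistics.mean(with_outlier)

print(mode_value, median_value, mean_value)
print(median_with_outlier, mean_with_outlier)
```

In this example the median moves from 4 to 4.5 when the outlier is added, while the mean jumps from 5 to 16.875.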

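The range of a dataset is the difference between its largest and smallest values. The standard deviation measures the typical distance between individual data values and the mean of the dataset. It is computed as the square root of the average squared deviation from the mean, so it is close to, but not the same as, the mean absolute difference from the mean.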

The percentile rank of a data value is the percentage of all values in the dataset that are less than or equal to the given value.
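The definition above translates directly into a few lines of Python. The function name `percentile_rank` and the list of scores are hypothetical, chosen only to illustrate the calculation.

```python
def percentile_rank(values, x):
    """Percentage of values that are less than or equal to x."""
    at_or_below = sum(1 for v in values if v <= x)
    return 100 * at_or_below / len(values)


# Hypothetical test scores.
scores = [55, 62, 70, 70, 81, 88, 93, 97, 99, 100]

rank_of_81 = percentile_rank(scores, 81)  # 5 of 10 values are <= 81
rank_of_70 = percentile_rank(scores, 70)  # 4 of 10 values are <= 70

print(rank_of_81, rank_of_70)
```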

Normally distributed data follow a bell-shaped, symmetrical distribution.
The mean of normally distributed data falls at the peak of the distribution. The standard deviation of normally distributed data is the distance from the peak to either of the inflection points.
Data that are normally distributed follow the 68-95-99.7 Rule, which says that approximately 68% of the data fall within one standard deviation of the mean, 95% fall within two standard deviations, and 99.7% fall within three standard deviations.
The $z$-score for a data value is the number of standard deviations that value falls above (or below, if the $z$-score is negative) the mean.
We can use the normal distribution to estimate percentiles.
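The 68-95-99.7 Rule, $z$-scores, and percentile estimates can all be checked with `statistics.NormalDist` from Python's standard library. The mean of 100, the standard deviation of 15, and the data value 130 are assumed values for illustration, and the helper `share_within` is a hypothetical name.

```python
from statistics import NormalDist

# Assumed parameters for a normally distributed variable.
mu, sigma = 100, 15
dist = NormalDist(mu=mu, sigma=sigma)


def share_within(k):
    """Proportion of the distribution within k standard deviations of the mean."""
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


within_one = share_within(1)    # about 0.68
within_two = share_within(2)    # about 0.95
within_three = share_within(3)  # about 0.997

# z-score: number of standard deviations a value lies above the mean.
x = 130
z = (x - mu) / sigma

# Estimated percentile: percentage of the distribution at or below x.
percentile = 100 * dist.cdf(x)

print(round(within_one, 4), round(within_two, 4), round(within_three, 4))
print(z, round(percentile, 2))
```

Here the value 130 has a $z$-score of 2, which places it at roughly the 97.7th percentile.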

If one variable affects the value of another variable, we say the first is an explanatory variable and the second is a response variable.
Scatter plots place a point in the $xy$-plane for each unit in the dataset. The $x$-value is the value of the explanatory variable, and the $y$-value is the value of the response variable.
The correlation coefficient $r$ gives us information about the strength and direction of the relationship between two variables. If $r$ is positive, the relationship is positive: an increase in the value of the explanatory variable tends to correspond to an increase in the value of the response variable. If $r$ is negative, the relationship is negative: an increase in the value of the explanatory variable tends to correspond to a decrease in the value of the response variable. Values of $r$ that are close to 0 indicate weak relationships, while values close to –1 or 1 indicate strong relationships.
The regression line for a relationship between two variables is the line that best represents the data. It can be used to predict values of the response variable for a given value of the explanatory variable.
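As a rough sketch, the correlation coefficient and the regression line can be computed from the deviations of each variable from its mean. This assumes that the line that "best represents the data" is the least-squares line, which is the usual meaning; the function name `correlation_and_line` and the data are hypothetical.

```python
import math


def correlation_and_line(xs, ys):
    """Return (r, slope, intercept) for paired data, using least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    r = sxy / math.sqrt(sxx * syy)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return r, slope, intercept


# Hypothetical data: explanatory variable xs, response variable ys.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r, slope, intercept = correlation_and_line(xs, ys)

# Use the regression line to predict the response when the explanatory value is 6.
predicted = slope * 6 + intercept

print(round(r, 4), round(slope, 2), round(intercept, 2), round(predicted, 2))
```

For this data $r$ is about 0.77, a fairly strong positive relationship, and the regression line is approximately $y = 0.6x + 2.2$.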