
Principles of Data Science

Key Terms

alternative hypothesis: a statement regarding an unknown population parameter that is complementary to (contradicts) the null hypothesis; used in hypothesis testing

analysis of variance (ANOVA): statistical method to compare three or more means and determine if the means are all statistically the same or if at least one mean is different from the others
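As a minimal sketch of how the comparison of means can be quantified, the following computes the one-way ANOVA F statistic as the ratio of between-group to within-group variability. The function name `one_way_f_statistic` and the three small groups are hypothetical, chosen only for illustration; obtaining a p-value would additionally require the F distribution, which is not shown here.

```python
from statistics import mean

def one_way_f_statistic(groups):
    """Return the one-way ANOVA F statistic for a list of samples."""
    k = len(groups)                                   # number of groups
    all_values = [x for g in groups for x in g]
    n_total = len(all_values)                         # total number of observations
    grand_mean = mean(all_values)

    # Variability of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variability of the observations around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

# Hypothetical data: three groups with means 2, 3, and 4
groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
f_stat = one_way_f_statistic(groups)
print(f"F = {f_stat:.2f}")  # F = 3.00
```

A larger F value indicates that the group means differ by more than would be expected from the variation within the groups.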

best-fit linear equation: an equation of the form $\hat{y} = a + b x$ that provides the best-fit straight line to the $(x, y)$ data points

bivariate data: data collected on two variables where the data values are paired with one another

bootstrapping: a method to construct a confidence interval that is based on repeated sampling and does not rely on any assumptions regarding the underlying distribution
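One common variant is the percentile bootstrap, sketched below under the assumption that the statistic of interest is the sample mean. The function name `bootstrap_ci`, the number of resamples, the seed, and the data values are all hypothetical choices made for illustration.

```python
import random
from statistics import mean

def bootstrap_ci(sample, stat=mean, n_resamples=5000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for a statistic."""
    rng = random.Random(seed)          # fixed seed so the sketch is reproducible
    n = len(sample)
    # Resample with replacement and recompute the statistic each time
    estimates = sorted(
        stat(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    tail = (1 - confidence) / 2
    # Approximate percentile positions in the sorted list of estimates
    lower = estimates[int(tail * n_resamples)]
    upper = estimates[int((1 - tail) * n_resamples) - 1]
    return lower, upper

# Hypothetical sample of 10 measurements
sample = [12.1, 9.8, 11.4, 10.3, 13.7, 8.9, 10.9, 12.5, 11.0, 9.6]
low, high = bootstrap_ci(sample)
print(f"95% bootstrap interval for the mean: ({low:.2f}, {high:.2f})")
```

No distributional assumption is used here: the interval comes entirely from the spread of the resampled means.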

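central limit theorem: describes the relationship between the sampling distribution of sample means and the underlying population; for sufficiently large sample sizes, the sample means are approximately normally distributed regardless of the shape of the population distribution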
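The relationship can be explored with a small simulation. The sketch below assumes a skewed population (exponential with mean 1 and standard deviation 1) and repeatedly draws samples of size 40; the population, sample size, number of samples, and seed are illustrative assumptions. The sample means should center near the population mean, with a spread close to the population standard deviation divided by $\sqrt{n}$.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(42)      # fixed seed for reproducibility
n = 40                       # size of each sample
num_samples = 2000           # number of repeated samples

# Exponential population with rate 1: mean 1, standard deviation 1 (skewed)
sample_means = [
    mean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

mean_of_means = mean(sample_means)
sd_of_means = stdev(sample_means)
expected_sd = 1.0 / sqrt(n)  # population sd divided by sqrt(n)

print(f"mean of sample means: {mean_of_means:.3f} (population mean is 1)")
print(f"sd of sample means:   {sd_of_means:.3f} (sigma/sqrt(n) is {expected_sd:.3f})")
```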

confidence interval: an interval where sample data is used to provide an estimate for a population parameter
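As a hedged illustration, the sketch below builds a confidence interval for a population mean in the simplest setting, assuming the population standard deviation is known so that a z critical value applies. The function name `z_confidence_interval`, the data values, and the value of sigma are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean

def z_confidence_interval(sample, sigma, confidence=0.95):
    """Confidence interval for a mean, assuming sigma is known."""
    n = len(sample)
    # Critical value that leaves (1 - confidence)/2 in each tail
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * sigma / sqrt(n)     # margin of error
    center = mean(sample)                 # point estimate
    return center - margin, center + margin

# Hypothetical sample of 9 values with an assumed known sigma of 3
sample = [48.2, 51.5, 49.9, 50.7, 52.3, 47.8, 50.1, 49.4, 51.0]
low, high = z_confidence_interval(sample, sigma=3.0)
print(f"95% confidence interval: ({low:.2f}, {high:.2f})")
```

The interval has the form point estimate plus or minus margin of error, so its half-width here is the critical value times $\sigma/\sqrt{n}$.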

confidence level: the long-run proportion of interval estimates that would contain the population parameter if the estimation process were repeated over and over

correlation: a measure of association between two numeric variables

correlation analysis: a statistical method used to evaluate and quantify the strength and direction of the linear relationship between two quantitative variables

correlation coefficient: a measure of the strength and direction of the linear relationship between two variables
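A minimal sketch of the Pearson correlation coefficient, one common way to measure this, is shown below. The function name `pearson_r` and the paired data (hours studied and exam scores) are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical bivariate data
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 70, 72]
r = pearson_r(hours, scores)
print(f"r = {r:.3f}")
```

Values of $r$ near $+1$ or $-1$ indicate a strong linear relationship, and the sign gives its direction.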

critical value: z-score that cuts off an area under the normal curve corresponding to a specified confidence level
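For a two-sided interval, the critical value can be found from the inverse of the standard normal cumulative distribution function. The sketch below uses Python's standard library; the helper name `z_critical` is an assumption introduced for illustration.

```python
from statistics import NormalDist

def z_critical(confidence):
    """z-score leaving (1 - confidence)/2 of the area in each tail."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z* = {z_critical(level):.3f}")
# 90% confidence: z* = 1.645
# 95% confidence: z* = 1.960
# 99% confidence: z* = 2.576
```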

dependent samples: samples from one population that can be paired or matched to the samples taken from the second population

dependent variable: in correlation analysis, the variable being studied or measured; the dependent variable is the outcome that is measured or observed to determine the impact of changes in the independent variable

F distribution: a skewed probability distribution that arises in statistical hypothesis testing, such as ANOVA analysis

hypothesis testing: a statistical method to test claims regarding population parameters using sample data

independent samples: the sample from one population that is not related to the sample taken from the second population

independent variable: in correlation analysis, the variable that is manipulated or changed in an experiment or study; the value of the independent variable is controlled or chosen by the experimenter to observe its effect on the dependent variable

inferential statistics: statistical methods that allow researchers to infer or generalize observations from samples to the larger population from which they were selected

least squares method: a method used in linear regression that generates a straight-line fit to the data values such that the sum of the squares of the residuals is as small as possible
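For a line of the form $\hat{y} = a + b x$, the least squares slope and intercept have closed-form solutions. The sketch below computes them directly; the function name `least_squares_line` and the data points are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y_hat = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx                 # slope
    a = mean_y - b * mean_x       # intercept: the line passes through (mean_x, mean_y)
    return a, b

# Hypothetical data lying close to a straight line
xs = [1, 2, 3, 4, 5]
ys = [2.9, 5.1, 7.0, 9.2, 10.8]
a, b = least_squares_line(xs, ys)
print(f"y_hat = {a:.2f} + {b:.2f} x")
```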

level of significance ($\alpha$): the maximum allowed probability of making a Type I error; the level of significance is the probability value used to determine when the sample data indicates significant evidence against the null hypothesis

linear correlation: a measure of the association between two variables that exhibit an approximate straight-line fit when plotted on a scatterplot

margin of error: an indication of the maximum error of the estimate

matched pairs: samples from one population that can be paired or matched to the samples taken from the second population

method of least squares: a mathematical method to generate a linear equation that is the “best fit” to the points on the scatterplot in the sense that the line minimizes the sum of the squared differences between the predicted values and observed values for $y$

modeling: the process of creating a mathematical representation that describes the relationship between different variables in a dataset; the model is then used to understand, explain, and predict the behavior of the data

nonparametric methods: statistical methods that do not rely on any assumptions regarding the underlying distribution

null hypothesis: statement of no effect or no change in the population

p-value: the probability of obtaining a sample statistic with a value as extreme as (or more extreme than) the value determined by the sample data under the assumption that the null hypothesis is true
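As an illustration, the sketch below computes a two-sided p-value for a test about a population mean, assuming the population standard deviation is known so that the standardized test statistic follows the standard normal distribution under the null hypothesis. The function name `two_sided_p_value` and all numeric inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p_value(sample_mean, mu0, sigma, n):
    """Return (z, p) for a two-sided z-test of H0: mu = mu0."""
    # Standardized test statistic: distance from mu0 in standard errors
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    # Probability of a result at least this extreme in either tail, given H0
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical values: sample mean 52 from n = 36, testing mu0 = 50 with sigma = 6
z, p = two_sided_p_value(sample_mean=52.0, mu0=50.0, sigma=6.0, n=36)
print(f"z = {z:.2f}, p-value = {p:.4f}")  # z = 2.00, p-value = 0.0455
```

With a level of significance of 0.05, this p-value would lead to rejecting the null hypothesis, since 0.0455 is less than 0.05.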

parametric methods: statistical methods that assume a specific form for the underlying distribution

point estimate: a sample statistic used to estimate a population parameter

prediction: a forecast for the dependent variable based on a specific value of the independent variable generated using the linear model

proportion: a measure that expresses the relationship between a part and the whole; a proportion represents the fraction or percentage of a dataset that exhibits a particular characteristic or falls into a specific category

regression analysis: a statistical technique used to model the relationship between a dependent variable and one or more independent variables

residual: the difference between an observed y-value and the predicted y-value obtained from the linear regression equation
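A short sketch of the calculation follows, assuming a fitted line $\hat{y} = a + b x$ whose coefficients are already known. The function name `residuals`, the coefficients, and the data points are hypothetical.

```python
def residuals(xs, ys, a, b):
    """Observed y minus predicted y for the line y_hat = a + b*x."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]

# Hypothetical fitted line y_hat = 1 + 2x and three observed points
xs = [1, 2, 3]
ys = [3.5, 4.5, 7.0]
res = residuals(xs, ys, a=1.0, b=2.0)
print(res)  # [0.5, -0.5, 0.0]
```

A positive residual means the observed point lies above the line, and a negative residual means it lies below.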

sample mean: the average of the sample values, used as an unbiased point estimate for the unknown population mean

sample proportion: an unbiased point estimate for the unknown population proportion, calculated as the number of successes divided by the sample size: $\hat{p} = \frac{x}{n}$

sample statistic: a numerical summary or measure that describes a characteristic of a sample, such as a sample mean or sample proportion

sampling distribution: a probability distribution of a sample statistic based on all possible random samples of a certain size from a population or the distribution of a statistic (such as the mean) that would result from taking random samples from the same population repeatedly and calculating the statistic for each sample

scatterplot (or scatter diagram): graphical display that shows values of the independent variable plotted on the $x$-axis and values of the dependent variable plotted on the $y$-axis

standard error of the mean: the standard deviation of the sample mean, calculated as the population standard deviation divided by the square root of the sample size

standardized test statistic: a numerical measure that describes how many standard deviations a particular value is from the mean of a distribution; a standardized test statistic is typically used to assess whether an observed sample statistic is significantly different from what would be expected under a null hypothesis

t-distribution: a bell-shaped, symmetric distribution similar to the normal distribution, though the t-distribution has “thicker tails” as compared to the normal distribution

test statistic: a numerical value used to assess the strength of evidence against a null hypothesis, calculated from sample data that is used in hypothesis testing

Type I error: an error made in hypothesis testing where a researcher rejects the null hypothesis when in fact the null hypothesis is actually true
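The link between Type I errors and the level of significance can be checked by simulation. The sketch below repeatedly samples from a population where the null hypothesis is true and counts how often a two-sided z-test rejects it; the rejection rate should land near the chosen level of significance. The population parameters, sample size, number of trials, and seed are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

rng = random.Random(1)                 # fixed seed for reproducibility
alpha = 0.05                           # level of significance
z_star = NormalDist().inv_cdf(1 - alpha / 2)

mu0, sigma, n = 100.0, 15.0, 25        # the null hypothesis mu = mu0 is true here
trials = 4000
rejections = 0

for _ in range(trials):
    sample = [rng.gauss(mu0, sigma) for _ in range(n)]
    z = (mean(sample) - mu0) / (sigma / sqrt(n))
    if abs(z) > z_star:                # rejecting a true null hypothesis
        rejections += 1

type_i_rate = rejections / trials
print(f"observed Type I error rate: {type_i_rate:.3f} (alpha = {alpha})")
```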

Type II error: an error made in hypothesis testing where a researcher fails to reject the null hypothesis when the null hypothesis is actually false

unbiased estimator: a statistic that provides a valid estimate for the corresponding population parameter without overestimating or underestimating the parameter

variable: a characteristic or attribute that can be measured or observed

Citation/Attribution

This book may not be used in the training of large language models or otherwise be ingested into large language models or generative AI offerings without OpenStax's permission.

Want to cite, share, or modify this book? This book uses the Creative Commons Attribution-NonCommercial-ShareAlike License and you must attribute OpenStax.

Attribution information

If you are redistributing all or part of this book in a print format, then you must include on every physical page the following attribution:
Access for free at https://openstax.org/books/principles-data-science/pages/1-introduction
If you are redistributing all or part of this book in a digital format, then you must include on every digital page view the following attribution:
Access for free at https://openstax.org/books/principles-data-science/pages/1-introduction

Citation information

Use the information below to generate a citation.
- Authors: Dr. Shaun V. Ault, Dr. Soohyun Nam Liao, Larry Musolino
- Publisher/website: OpenStax
- Book title: Principles of Data Science
- Publication date: Jan 24, 2025
- Location: Houston, Texas
- Book URL: https://openstax.org/books/principles-data-science/pages/1-introduction
- Section URL: https://openstax.org/books/principles-data-science/pages/4-key-terms

© Apr 23, 2026 OpenStax. Textbook content produced by OpenStax is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. The OpenStax name, OpenStax logo, OpenStax book covers, OpenStax CNX name, and OpenStax CNX logo are not subject to the Creative Commons license and may not be reproduced without the prior and express written consent of Rice University.