Principles of Data Science

Key Terms

Bayes’ Theorem: a method used to calculate a conditional probability when additional information is obtained to refine a probability estimate
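
As a hedged illustration of this definition, the sketch below applies Bayes' Theorem in its common two-hypothesis form, $P(A \mid B) = P(B \mid A)P(A) / P(B)$. The function name `bayes_posterior` and the screening-test numbers are made up for this example and do not come from the text.

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """Return P(A | B) from P(A), P(B | A), and P(B | not A)."""
    # Total probability of observing B, summed over A and not-A.
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical screening test: 1% prevalence, 95% sensitivity,
# 5% false-positive rate.
posterior = bayes_posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(round(posterior, 3))  # about 0.161
```

Even with a fairly accurate test, the refined probability stays low here because the prior probability is small.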

binomial distribution: a probability distribution for a discrete random variable that is the number of successes in n independent trials. Each of these trials has two possible outcomes, success or failure, with the probability of success in each trial the same.
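
A minimal sketch of the binomial probability mass function, $P(X = k) = \binom{n}{k} p^{k} (1 - p)^{n - k}$, may help make this definition concrete. The helper name `binomial_pmf` and the coin-flip numbers are illustrative assumptions.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Hypothetical example: exactly 3 heads in 10 flips of a fair coin.
prob_three_heads = binomial_pmf(3, 10, 0.5)
print(prob_three_heads)  # 120 / 1024 = 0.1171875
```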

central tendency: the statistical measure that represents the center of a dataset

coefficient of variation (CV): a measure of the variation of a dataset, calculated as the standard deviation expressed as a percentage of the mean
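
The calculation can be sketched in a few lines of Python. The data values are made up, and using the population standard deviation (`pstdev`) rather than the sample version is an assumption of this example.

```python
from statistics import mean, pstdev

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values; mean 5, population SD 2

# Standard deviation as a percentage of the mean.
cv_percent = pstdev(data) / mean(data) * 100
print(f"CV = {cv_percent:.1f}%")  # CV = 40.0%
```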

complement of an event: the set of all outcomes in the sample space that are not included in the event

conditional probability: the probability of an event given that another event has already occurred

continuous probability distribution: probability distribution that deals with random variables that can take on any value within a given range or interval

continuous random variable: a random variable that can take on any value within an interval, so that its possible values are uncountably infinite

dependent events: events where the probability of occurrence of one event is affected by the occurrence of another event

descriptive statistics: the organization, summarization, and graphical display of data

discrete probability distribution: probability distribution associated with variables that take on a finite or countably infinite number of distinct values

discrete random variable: a random variable that can take on only a finite or countably infinite number of values

empirical probability: a probability that is calculated based on data that has been collected from an experiment

empirical rule: a rule that provides the percentages of data values falling within one, two, and three standard deviations from the mean for a bell-shaped (normal) distribution
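
The percentages behind the rule (roughly 68%, 95%, and 99.7%) can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library with a standard normal distribution, which is a choice made for this example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Share of values within k standard deviations of the mean, for k = 1, 2, 3.
shares = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}
for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.4f}")
# Prints approximately 0.6827, 0.9545, and 0.9973.
```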

event: a subset of the sample space

frequency: a count of the number of times that an event or observation occurs in an experiment or study

frequency distribution: a method of organizing and summarizing a dataset that provides the frequency with which each value in the dataset occurs

independent events: events where the probability of occurrence of one event is not affected by the occurrence of another event

interquartile range (IQR): a number that indicates the spread of the middle half, or middle 50%, of the data; the difference between the third quartile ($Q_{3}$) and the first quartile ($Q_{1}$)
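
A short sketch of the calculation, using `statistics.quantiles` from the Python standard library. Quartile conventions differ between textbooks and software, so the values from its default "exclusive" method may not match a hand calculation done with another rule; the data are made up.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]  # made-up, already ordered

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1
print(q1, q3, iqr)  # 3.0 9.0 6.0
```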

mean (also called arithmetic mean): a measure of center of a dataset, calculated by adding up the data values and dividing the sum by the number of data values; average

median: the middle value in an ordered dataset; when the dataset has an even number of values, the mean of the two middle values

mode: the most frequently occurring data value in a dataset
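
The three measures of center defined above (mean, median, and mode) can be compared on one small, made-up dataset. This sketch relies only on the `statistics` module from the Python standard library.

```python
from statistics import mean, median, mode

data = [1, 3, 3, 6, 8, 9]  # made-up values with an even count

center_mean = mean(data)      # sum of values divided by the count: 30 / 6
center_median = median(data)  # average of the two middle values: (3 + 6) / 2
center_mode = mode(data)      # most frequent value

print(center_mean, center_median, center_mode)  # 5 4.5 3
```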

mutually exclusive events: events that cannot occur at the same time

normal distribution: a bell-shaped distribution curve that is used to model many measurements, including IQ scores, salaries, heights, weights, blood pressures, etc.

outcome: the result of a single trial in a probability experiment

outliers: data values that are significantly different from the other data values in a dataset

percentiles: numbers that divide an ordered dataset into hundredths; used to describe the relative standing of a particular value within a dataset by indicating the percentage of data points that fall below it
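
As a rough sketch of the "relative standing" idea, the hypothetical helper `percent_below` computes the percentage of data points that fall strictly below a given value. The scores are made up, and other percentile conventions (for example, counting ties) exist.

```python
def percent_below(values, x):
    """Percentage of data points in `values` that are strictly less than x."""
    count_below = sum(1 for v in values if v < x)
    return 100 * count_below / len(values)


scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]  # made-up exam scores
standing = percent_below(scores, 85)
print(standing)  # 60.0, since 6 of the 10 scores are below 85
```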

Poisson distribution: a probability distribution for a discrete random variable, used to calculate the probability of a given number of occurrences in a fixed interval of time or space when the average rate of occurrence is known
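
A minimal sketch of the Poisson probability mass function, $P(X = k) = \lambda^{k} e^{-\lambda} / k!$, where $\lambda$ is the average number of occurrences per interval. The helper name `poisson_pmf` and the call-center numbers are illustrative assumptions.

```python
from math import exp, factorial


def poisson_pmf(k, rate):
    """Probability of exactly k occurrences when the mean count is `rate`."""
    return rate**k * exp(-rate) / factorial(k)


# Hypothetical example: an average of 3 calls per minute; probability of
# exactly 2 calls in a given minute.
p_two_calls = poisson_pmf(2, rate=3.0)
print(round(p_two_calls, 4))  # 0.224
```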

population data: data representing all the outcomes or measurements that are of interest

population mean: the average for all measurements of interest corresponding to the entire group under study

population size: the number of measurements for the entire group under study

probability: a numerical measure that assesses the likelihood of occurrence of an event

probability analysis: the set of tools used to model, understand, and quantify uncertainty, allowing data scientists to make informed decisions from data

probability density function (PDF): a function that is used to describe the probability distribution of a continuous random variable

probability distribution: a mathematical function that assigns probabilities to various outcomes

probability mass function (PMF): a function that is used to define the probability distribution of a discrete random variable

quartiles: numbers that divide an ordered dataset into quarters; the second quartile is the same as the median

random variable: a variable that assigns a single numerical value to each outcome of an experiment

range: a measure of dispersion for a dataset calculated by subtracting the minimum from the maximum of the dataset

relative frequency probability: a method of determining the likelihood of an event occurring based on the observed frequency of its occurrence in a given sample or population

sample data: data representing outcomes or measurements collected from a subset or part of a population

sample mean: the average for a subset of the measurements of interest

sample size: the number of measurements for the subset taken from the overall population

sample space: the set of all possible outcomes in a probability experiment

standard deviation: a measure of the spread of a dataset, given in the same units as the data, that indicates how far a typical data value is from the mean

standard normal distribution: a normal distribution with mean of 0 and standard deviation of 1

statistical analysis: the science of collecting, organizing, and interpreting data to make decisions

theoretical probability: a probability that is calculated based on an assessment of equally likely outcomes

trimmed mean: a calculation for the average or mean of a dataset where some percentage of data values are removed from the lower and upper end of the dataset; typically used to mitigate the effects of outliers on the mean
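
The effect of trimming can be sketched as follows. The helper `trimmed_mean` is hypothetical, and removing the same proportion from each end (rounded down to a whole number of values) is one common convention assumed here.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    cut = int(len(ordered) * proportion)  # values removed per end
    kept = ordered[cut:len(ordered) - cut]
    return sum(kept) / len(kept)


data = [2, 3, 4, 5, 100]  # made-up values; 100 is an outlier
plain = sum(data) / len(data)
trimmed = trimmed_mean(data, 0.2)  # drops 2 and 100, averages 3, 4, 5
print(plain, trimmed)  # 22.8 4.0
```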

variance: a measure of the spread of data values in a dataset, calculated as the average of the squared deviations of the observations from the mean
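
A brief sketch using the `statistics` module from the Python standard library. It distinguishes the population variance, which divides by $n$, from the sample variance, which divides by $n - 1$; the data are made up, and the standard deviation is the square root of the variance.

```python
from statistics import pstdev, pvariance, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values with mean 5

population_variance = pvariance(data)  # sum of squared deviations / n = 32 / 8
sample_variance = variance(data)       # sum of squared deviations / (n - 1) = 32 / 7
population_sd = pstdev(data)           # square root of the population variance

print(population_variance, round(sample_variance, 3), population_sd)
```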

$z$-score: a measure of the position of a data value in the dataset, calculated by subtracting the mean from the data value and then dividing the difference by the standard deviation
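
The calculation, $z = (x - \mu) / \sigma$, can be sketched as follows. The helper name `z_score`, the made-up data, and the use of the population standard deviation are assumptions of this example.

```python
from statistics import mean, pstdev


def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma


data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values; mean 5, population SD 2
z = z_score(9, mean(data), pstdev(data))
print(z)  # 2.0, so the value 9 lies two standard deviations above the mean
```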

Citation/Attribution

This book may not be used in the training of large language models or otherwise be ingested into large language models or generative AI offerings without OpenStax's permission.

Want to cite, share, or modify this book? This book uses the Creative Commons Attribution-NonCommercial-ShareAlike License and you must attribute OpenStax.

Attribution information

If you are redistributing all or part of this book in a print format, then you must include on every physical page the following attribution:
Access for free at https://openstax.org/books/principles-data-science/pages/1-introduction
If you are redistributing all or part of this book in a digital format, then you must include on every digital page view the following attribution:
Access for free at https://openstax.org/books/principles-data-science/pages/1-introduction

Citation information

Use the information below to generate a citation. We recommend using a citation tool.
- Authors: Dr. Shaun V. Ault, Dr. Soohyun Nam Liao, Larry Musolino
- Publisher/website: OpenStax
- Book title: Principles of Data Science
- Publication date: Jan 24, 2025
- Location: Houston, Texas
- Book URL: https://openstax.org/books/principles-data-science/pages/1-introduction
- Section URL: https://openstax.org/books/principles-data-science/pages/3-key-terms

© Dec 4, 2025 OpenStax. Textbook content produced by OpenStax is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. The OpenStax name, OpenStax logo, OpenStax book covers, OpenStax CNX name, and OpenStax CNX logo are not subject to the Creative Commons license and may not be reproduced without the prior and express written consent of Rice University.