accuracy
for machine learning in general, a measure of the correctness of a machine learning model's predictions
bias
error introduced by overly simplistic or overly rigid models that do not capture important features of the data
big data
extremely large and complex datasets that require special methods to handle
binary (binomial) classification
classification of data into one of two categories
bootstrap aggregating (bagging)
resampling from the same training data multiple times to create a number of models (for example, decision trees) that all contribute to the overall model
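A minimal sketch of bagging in Python, using NumPy and scikit-learn decision trees as the individual models (the toy data, tree depth, and number of trees are illustrative assumptions):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 4))             # toy feature matrix (assumed)
    y = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy labels (assumed)

    trees = []
    for _ in range(25):
        # bootstrap sample: draw n indices with replacement from the training data
        idx = rng.integers(0, len(X), size=len(X))
        trees.append(DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx]))

    # aggregate: majority vote over the predictions of all trees
    votes = np.stack([t.predict(X) for t in trees])
    bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)
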
bootstrapping
resampling portions of the data multiple times in order to generate a distribution that determines a confidence interval for parameters in a model
centroid
geometric center of a subset of points
cluster
a subset of a dataset consisting of points that have similar characteristics or are near one another
confusion matrix
table of values indicating how data was classified correctly or incorrectly by a given model; the entry in row $i$ and column $j$ gives the number of times (or percentage) that data with label $i$ was classified by the model as label $j$
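A quick sketch using scikit-learn (the label arrays are illustrative assumptions):

    from sklearn.metrics import confusion_matrix

    y_true = [0, 0, 1, 1, 1, 0]   # actual labels (assumed)
    y_pred = [0, 1, 1, 1, 0, 0]   # model predictions (assumed)

    # entry (i, j) counts data with true label i classified as label j
    print(confusion_matrix(y_true, y_pred))
    # [[2 1]
    #  [1 2]]
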
data cleaning
process of identifying and correcting errors, typos, inconsistencies, missing data, and other anomalies in a dataset
data mining
process of discovering patterns, trends, and insights from large datasets
DBSCAN algorithm
common density-based clustering algorithm
decision tree
classifier algorithm that builds a hierarchical structure where each internal node represents a decision based on a feature of the data, and each leaf node represents a final decision, label, or prediction
density-based clustering algorithm
clustering algorithm that builds clusters of relatively dense subsets
depth
number of levels of a decision tree, or equivalently, the length of the longest branch of the tree
depth-limiting pruning
pre-pruning method that restricts the total depth (number of levels) of a decision tree
entropy
measure of the average amount of information or uncertainty
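For a discrete distribution with probabilities $p_i$, the entropy is $H = -\sum_i p_i \log_2 p_i$. A minimal NumPy sketch (the example distributions are assumptions):

    import numpy as np

    def entropy(p):
        """Shannon entropy, in bits, of a discrete probability vector p."""
        p = np.asarray(p, dtype=float)
        p = p[p > 0]                      # treat 0 * log(0) as 0
        return -np.sum(p * np.log2(p))

    print(entropy([0.5, 0.5]))   # 1.0 bit: maximum uncertainty for two outcomes
    print(entropy([0.9, 0.1]))   # about 0.47 bits: far less uncertainty
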
error-based (reduced-error) pruning
pruning method that removes branches that do not significantly improve the overall accuracy of the decision tree
F1 score
combination of precision ($p$) and recall ($r$): $F_1 = \frac{2pr}{p + r} = \frac{2\,TP}{2\,TP + FP + FN}$
facial recognition
application of machine learning that involves categorizing or labeling images of faces based on the identities of individuals depicted in those images
Gaussian naïve Bayes
classification algorithm that is useful when variables are assumed to come from normal distributions
heatmap
shading or coloring of a table to show contrasts in low versus high values
information gain
comparison of entropy change due to adding child nodes to a parent node in a decision tree
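A self-contained sketch of the computation for one candidate split (the label arrays are illustrative assumptions):

    import numpy as np

    def label_entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    parent = np.array([0, 0, 0, 1, 1, 1, 1, 1])   # labels at the parent node (assumed)
    left, right = parent[:4], parent[4:]          # labels at the two child nodes (assumed)

    # information gain = parent entropy minus the size-weighted child entropies
    gain = label_entropy(parent) - (
        len(left) / len(parent) * label_entropy(left)
        + len(right) / len(parent) * label_entropy(right)
    )
    print(gain)   # about 0.55 bits for this split
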
information theory
framework for measuring and managing the uniqueness of data, or the degree of surprise or uncertainty associated with an event or message
k-means clustering algorithm
clustering algorithm that iteratively locates centroids of clusters
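A minimal sketch with scikit-learn (the toy data and the choice of two clusters are assumptions):

    import numpy as np
    from sklearn.cluster import KMeans

    # two loose blobs of 2-D points (assumed toy data)
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(km.cluster_centers_)   # centroids located by the algorithm
    print(km.labels_[:5])        # cluster assignments for the first five points
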
labeled data
data that has classification labels
leaf-limiting pruning
pre-pruning method that restricts the total number of leaf nodes of a decision tree
likelihood
measure of how probable the observed data is under a given model; useful for setting up logistic regression models
logistic regression
modeling method that fits data to a logistic (sigmoid) function and typically performs binary classification
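A short sketch of binary classification via logistic regression in scikit-learn (the noisy toy data is an assumption):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(2)
    X = rng.normal(size=(200, 1))                                 # one input feature (assumed)
    y = (X[:, 0] + rng.normal(0, 0.5, size=200) > 0).astype(int)  # noisy 0/1 labels (assumed)

    model = LogisticRegression().fit(X, y)
    # predict_proba passes the linear score through the sigmoid, giving values in [0, 1]
    print(model.predict_proba([[1.5]]))   # class probabilities at x = 1.5
    print(model.predict([[1.5]]))         # hard 0/1 prediction
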
logit function
function of the form $\ln\left(\frac{p}{1-p}\right)$ used to compute log-odds and transform data when performing logistic regression
machine learning (ML) model
any algorithm that trains on data to determine or adjust parameters of a model for use in classification, clustering, decision making, prediction, or pattern recognition
mean absolute error (MAE)
measure of error: $\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$
mean absolute percentage error (MAPE)
measure of relative error: $\text{MAPE} = \frac{1}{n}\sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|$
mean squared error (MSE)
measure of error: $\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$
minimum description length (MDL) pruning
post-pruning method that seeks to find the least complex form of a decision tree that meets an acceptable measure of accuracy
multiclass (multinomial) classification
classification of data into more than two categories
multiple regression
regression techniques that use more than one input variable
naïve Bayes classification
also known as multinomial naïve Bayes classification, a classification algorithm that makes use of prior probabilities and Bayes’ Theorem to predict the class or label of new data
odds
probability of an event $E$ occurring divided by the probability of $E$ not occurring
one-hot encoding
replacing categorical/text values in a dataset with vectors that contain a single 1 and all other entries being 0; each category vector has the 1 in a distinct place
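A one-line sketch with pandas (the toy column is an assumption):

    import pandas as pd

    df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
    # each category becomes its own column holding a single 1 per row
    print(pd.get_dummies(df["color"], dtype=int))
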
overfitting
modeling using a method that yields high variance; the model captures too much of the noise and so may perform well on training data but very poorly on testing data
precision
ratio of true positive predictions to the total number of positive predictions: $p = \frac{TP}{TP + FP}$
prior probability
estimate of a probability, which may be updated or corrected based on Bayes’ Theorem
pruning
reducing the size of a decision tree by removing branches that split the data too finely
random forest
classifier algorithm that uses multiple decision trees and bootstrap aggregating
recall
ratio of true positive predictions to the total number of actual positives: $r = \frac{TP}{TP + FN}$
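A small sketch computing precision, recall, and the F1 score directly from confusion-matrix counts (the counts are illustrative assumptions):

    TP, FP, FN = 40, 10, 20   # assumed counts of true/false positives and false negatives

    precision = TP / (TP + FP)   # 0.8
    recall = TP / (TP + FN)      # about 0.667
    f1 = 2 * precision * recall / (precision + recall)
    print(f1)                    # about 0.727, matching 2*TP / (2*TP + FP + FN)
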
regression tree
type of decision tree in which the decisions are based on numerical comparisons of continuous data
root mean squared error (RMSE)
measure of error: $\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}$
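A NumPy sketch of the four error measures defined above, MAE, MAPE, MSE, and RMSE (the arrays of actual and predicted values are assumptions):

    import numpy as np

    y = np.array([3.0, 5.0, 2.5, 7.0])       # actual values (assumed)
    y_hat = np.array([2.5, 5.0, 3.0, 8.0])   # model predictions (assumed)

    mae = np.mean(np.abs(y - y_hat))          # mean absolute error
    mape = np.mean(np.abs((y - y_hat) / y))   # mean absolute percentage error
    mse = np.mean((y - y_hat) ** 2)           # mean squared error
    rmse = np.sqrt(mse)                       # root mean squared error
    print(mae, mape, mse, rmse)
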
sigmoid function
function useful in logistic regression: $\sigma(x) = \frac{1}{1 + e^{-x}}$
silhouette score
a measure of how well-separated the clusters are when using a clustering algorithm
supervised learning
machine learning methods that train on labeled data
testing set (or data)
portion of the dataset that is set aside and used after the training of the algorithm to test for accuracy of the model
training set (or data)
portion of the dataset that is used to train a machine learning algorithm
underfitting
modeling using a method that yields high bias; the model does not capture important features of the data
unlabeled data
data that has not been classified or for which classification data is not known yet
unsupervised learning
machine learning methods that do not require data to be labeled in order to learn; often, unsupervised learning is a first step in discovering meaningful clusters that will be used to define labels
variance
error due to an overly sensitive model that reacts to small changes in the data
weak learners
individual models that are trained on parts of the dataset and then combined in a bootstrap aggregating method such as random forest