accuracy
for machine learning in general, a measure of the correctness of a machine learning model's predictions
bias
error introduced by overly simplistic or overly rigid models that do not capture important features of the data
big data
extremely large and complex datasets that require special methods to handle
binary (binomial) classification
classification of data into one of two categories
bootstrap aggregating (bagging)
resampling from the same training data multiple times to create a number of models (for example, decision trees) that all contribute to the overall model
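A minimal sketch of bagging in Python, using NumPy and scikit-learn decision trees as the individual models (the toy data, tree depth, and number of trees are illustrative assumptions):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 4))             # toy feature matrix (assumed)
    y = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy labels (assumed)

    trees = []
    for _ in range(25):
        # bootstrap sample: draw n indices with replacement from the training data
        idx = rng.integers(0, len(X), size=len(X))
        trees.append(DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx]))

    # aggregate: majority vote over the predictions of all trees
    votes = np.stack([t.predict(X) for t in trees])
    bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)
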
bootstrapping
resampling portions of the data multiple times in order to generate a distribution that determines a confidence interval for parameters in a model
centroid
geometric center of a subset of points
cluster
a subset of a dataset consisting of points that have similar characteristics or are near one another
confusion matrix
table of values indicating how data was classified correctly or incorrectly by a given model; the entry in row $i$ and column $j$ gives the number of times (or percentage) that data with label $i$ was classified by the model as label $j$
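A quick sketch using scikit-learn (the label arrays are illustrative assumptions):

    from sklearn.metrics import confusion_matrix

    y_true = [0, 0, 1, 1, 1, 0]   # actual labels (assumed)
    y_pred = [0, 1, 1, 1, 0, 0]   # model predictions (assumed)

    # entry (i, j) counts data with true label i classified as label j
    print(confusion_matrix(y_true, y_pred))
    # [[2 1]
    #  [1 2]]
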
data cleaning
process of identifying and correcting errors, typos, inconsistencies, missing data, and other anomalies in a dataset
data mining
process of discovering patterns, trends, and insights from large datasets
DBSCAN algorithm
common density-based clustering algorithm
decision tree
classifier algorithm that builds a hierarchical structure where each internal node represents a decision based on a feature of the data, and each leaf node represents a final decision, label, or prediction
density-based clustering algorithm
clustering algorithm that builds clusters of relatively dense subsets
depth
number of levels of a decision tree, or equivalently, the length of the longest branch of the tree
depth-limiting pruning
pre-pruning method that restricts the total depth (number of levels) of a decision tree
entropy
measure of the average amount of information or uncertainty
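For a discrete distribution with probabilities $p_i$, the entropy is $H = -\sum_i p_i \log_2 p_i$. A minimal NumPy sketch (the example distributions are assumptions):

    import numpy as np

    def entropy(p):
        """Shannon entropy, in bits, of a discrete probability vector p."""
        p = np.asarray(p, dtype=float)
        p = p[p > 0]                      # treat 0 * log(0) as 0
        return -np.sum(p * np.log2(p))

    print(entropy([0.5, 0.5]))   # 1.0 bit: maximum uncertainty for two outcomes
    print(entropy([0.9, 0.1]))   # about 0.47 bits: far less uncertainty
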
error-based (reduced-error) pruning
pruning method that removes branches that do not significantly improve the overall accuracy of the decision tree
F1 score
combination of precision ($p$) and recall ($r$): $F_1 = \frac{2pr}{p + r} = \frac{2\,TP}{2\,TP + FP + FN}$
facial recognition
application of machine learning that involves categorizing or labeling images of faces based on the identities of individuals depicted in those images
Gaussian naïve Bayes
classification algorithm that is useful when variables are assumed to come from normal distributions
heatmap
shading or coloring of a table to show contrasts in low versus high values
information gain
comparison of entropy change due to adding child nodes to a parent node in a decision tree
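A self-contained sketch of the computation for one candidate split (the label arrays are illustrative assumptions):

    import numpy as np

    def label_entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    parent = np.array([0, 0, 0, 1, 1, 1, 1, 1])   # labels at the parent node (assumed)
    left, right = parent[:4], parent[4:]          # labels at the two child nodes (assumed)

    # information gain = parent entropy minus the size-weighted child entropies
    gain = label_entropy(parent) - (
        len(left) / len(parent) * label_entropy(left)
        + len(right) / len(parent) * label_entropy(right)
    )
    print(gain)   # about 0.55 bits for this split
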
information theory
framework for measuring and managing the uniqueness of data, or the degree of surprise or uncertainty associated with an event or message
k-means clustering algorithm
clustering algorithm that iteratively locates centroids of clusters
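A minimal sketch with scikit-learn (the toy data and the choice of two clusters are assumptions):

    import numpy as np
    from sklearn.cluster import KMeans

    # two loose blobs of 2-D points (assumed toy data)
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(km.cluster_centers_)   # centroids located by the algorithm
    print(km.labels_[:5])        # cluster assignments for the first five points
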
labeled data
data that has classification labels
leaf-limiting pruning
pre-pruning method that restricts the total number of leaf nodes of a decision tree
likelihood
measure of how probable the observed data is under a given model; useful for setting up logistic regression models
logistic regression
modeling method that fits data to a logistic (sigmoid) function and typically performs binary classification
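A short sketch of binary classification via logistic regression in scikit-learn (the noisy toy data is an assumption):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(2)
    X = rng.normal(size=(200, 1))                                 # one input feature (assumed)
    y = (X[:, 0] + rng.normal(0, 0.5, size=200) > 0).astype(int)  # noisy 0/1 labels (assumed)

    model = LogisticRegression().fit(X, y)
    # predict_proba passes the linear score through the sigmoid, giving values in [0, 1]
    print(model.predict_proba([[1.5]]))   # class probabilities at x = 1.5
    print(model.predict([[1.5]]))         # hard 0/1 prediction
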
logit function
function of the form $\ln\left(\frac{p}{1-p}\right)$ used to compute log-odds and transform data when performing logistic regression
machine learning (ML) model
any algorithm that trains on data to determine or adjust parameters of a model for use in classification, clustering, decision making, prediction, or pattern recognition
mean absolute error (MAE)
measure of error: $\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$
mean absolute percentage error (MAPE)
measure of relative error: $\text{MAPE} = \frac{1}{n}\sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|$
mean squared error (MSE)
measure of error: $\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$
minimum description length (MDL) pruning
post-pruning method that seeks to find the least complex form of a decision tree that meets an acceptable measure of accuracy
multiclass (multinomial) classification
classification of data into more than two categories
multiple regression
regression techniques that use more than one input variable
naïve Bayes classification
also known as multinomial naïve Bayes classification, a classification algorithm that makes use of prior probabilities and Bayes’ Theorem to predict the class or label of new data
odds
probability of an event $E$ occurring divided by the probability of $E$ not occurring
one-hot encoding
replacing categorical/text values in a dataset with vectors that contain a single 1 and all other entries being 0; each category vector has the 1 in a distinct place
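A one-line sketch with pandas (the toy column is an assumption):

    import pandas as pd

    df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
    # each category becomes its own column holding a single 1 per row
    print(pd.get_dummies(df["color"], dtype=int))
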
overfitting
modeling using a method that yields high variance; the model captures too much of the noise and so may perform well on training data but very poorly on testing data
precision
ratio of true positive predictions to the total number of positive predictions: $p = \frac{TP}{TP + FP}$
prior probability
estimate of a probability, which may be updated or corrected based on Bayes’ Theorem
pruning
reducing the size of a decision tree by removing branches that split the data too finely
random forest
classifier algorithm that uses multiple decision trees and bootstrap aggregating
recall
ratio of true positive predictions to the total number of actual positives: $r = \frac{TP}{TP + FN}$
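A small sketch computing precision, recall, and the F1 score directly from confusion-matrix counts (the counts are illustrative assumptions):

    TP, FP, FN = 40, 10, 20   # assumed counts of true/false positives and false negatives

    precision = TP / (TP + FP)   # 0.8
    recall = TP / (TP + FN)      # about 0.667
    f1 = 2 * precision * recall / (precision + recall)
    print(f1)                    # about 0.727, matching 2*TP / (2*TP + FP + FN)
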
regression tree
type of decision tree in which the decisions are based on numerical comparisons of continuous data
root mean squared error (RMSE)
measure of error: $\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}$
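A NumPy sketch of the four error measures defined above, MAE, MAPE, MSE, and RMSE (the arrays of actual and predicted values are assumptions):

    import numpy as np

    y = np.array([3.0, 5.0, 2.5, 7.0])       # actual values (assumed)
    y_hat = np.array([2.5, 5.0, 3.0, 8.0])   # model predictions (assumed)

    mae = np.mean(np.abs(y - y_hat))          # mean absolute error
    mape = np.mean(np.abs((y - y_hat) / y))   # mean absolute percentage error
    mse = np.mean((y - y_hat) ** 2)           # mean squared error
    rmse = np.sqrt(mse)                       # root mean squared error
    print(mae, mape, mse, rmse)
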
sigmoid function
function useful in logistic regression: $\sigma(x) = \frac{1}{1 + e^{-x}}$
silhouette score
a measure of how well-separated the clusters are when using a clustering algorithm
supervised learning
machine learning methods that train on labeled data
testing set (or data)
portion of the dataset that is set aside and used after the training of the algorithm to test for accuracy of the model
training set (or data)
portion of the dataset that is used to train a machine learning algorithm
underfitting
modeling using a method that yields high bias; the model does not capture important features of the data
unlabeled data
data that has not been classified or for which classification data is not known yet
unsupervised learning
machine learning methods that do not require data to be labeled in order to learn; often, unsupervised learning is a first step in discovering meaningful clusters that will be used to define labels
variance
error due to an overly sensitive model that reacts to small changes in the data
weak learners
individual models that are trained on parts of the dataset and then combined in a bootstrap aggregating method such as random forest