application programming interface (API)
set of protocols, tools, and definitions for building software applications that allow different software systems to communicate and interact with each other and enable developers to access data and services from other applications, operating systems, or platforms
BCNF (Boyce-Codd Normal Form)
a normal form that is similar to 3NF but stricter in its requirements on functional dependencies; it ensures that all attributes in a table are functionally dependent on a candidate key and nothing else
big data
complex, high-volume and varied data that are challenging to process with traditional methods
closed-ended questions
clear, structured questions with predetermined answer choices that are effective in gathering quantitative data
cloud computing
the delivery of computing services, such as storage, processing power, and software applications, over the internet
cloud storage
type of online storage that allows users and organizations to store, access, and manage data over the internet; the data is stored on remote servers managed by a cloud storage service provider
data aggregation
the process of collecting and combining information from multiple sources into a single database or dataset
data chunking
a technique used to break down large datasets into smaller, more manageable chunks and make them easier to manage, process, analyze, and store; also known as data segmentation
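A minimal sketch of chunking in Python (the data and chunk size are illustrative):

```python
def chunk(data, size):
    """Yield successive chunks of at most `size` items from `data`."""
    for start in range(0, len(data), size):
        yield data[start:start + size]

# Break ten items into chunks of four.
chunks = list(chunk(list(range(10)), 4))
```

Each chunk can then be processed or stored independently, which keeps memory use bounded for large datasets.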
data compression
the process of reducing the size of a data file or transmission to make it easier and faster to store, transmit, or manipulate
data dictionary
a data structure that stores data in key-value pairs, allowing for efficient retrieval of data using its key
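For example, in Python (the names and values here are illustrative):

```python
# A dictionary maps keys to values for fast lookup.
ages = {"alice": 34, "bob": 29}
ages["carol"] = 41       # add a new key-value pair
bob_age = ages["bob"]    # retrieve a value by its key
```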
data discretization
the process of converting continuous data into discrete categories or intervals to allow for easier analysis and processing of the data
data indexing
the process of organizing and storing data in a way that makes it easier to search and retrieve information
data integration
the process of combining data from multiple sources, such as databases, applications, and files, into a unified view
data processing
the process of collecting, organizing, and manipulating data to produce meaningful information; it involves transforming raw data into a more useful and understandable format
data standardization
the process of transforming data into a common scale or range to help remove any unit differences or variations in magnitude
data transformation
the process of modifying data to make it more suitable for the planned analysis
data validation
procedure that ensures the data is reliable and trustworthy for use in a data science project by performing checks and tests to identify and correct any errors in the data
data warehouse
a large, organized repository of integrated data from various sources used to support decision-making processes within a company or organization
DataFrame
a two-dimensional, tabular data structure provided by the pandas library in Python that is used for data manipulation and analysis
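A small example using pandas (the column names and values are illustrative):

```python
import pandas as pd

# Build a DataFrame from a dict mapping column names to column values.
df = pd.DataFrame({
    "city": ["Austin", "Houston"],
    "pop": [961_855, 2_304_580],
})
total = df["pop"].sum()  # column-wise operations work like NumPy arrays
```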
exponential transformation
a data transformation operation that involves taking the exponent of the data values
Huffman coding
reduces file size by assigning shorter binary codes to the most frequently used characters or symbols in a given dataset and longer codes to less frequently used characters
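A compact sketch of the code-assignment step in Python, using a frequency heap (a simplified illustration, not a full encoder):

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a prefix code: frequent symbols get shorter binary codes."""
    freq = Counter(text)
    if len(freq) == 1:  # edge case: a single distinct symbol
        return {next(iter(freq)): "0"}
    # Each heap entry: (frequency, tie-breaker, {symbol: code-so-far})
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)   # two least-frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

codes = huffman_codes("aaaabbc")  # 'a' is most frequent
```

Here the most frequent symbol `a` receives a one-bit code, while the rarer `b` and `c` receive two-bit codes.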
imputation
the process of estimating missing values in a dataset by substituting them with a statistically reasonable value
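A minimal sketch of mean imputation in Python, with `None` standing in for missing values (the data is illustrative):

```python
from statistics import mean

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

filled = impute_mean([2.0, None, 4.0, 6.0])  # missing value becomes 4.0
```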
log transformation
data transformation technique that involves taking the logarithm of the data values; it is often used when the data is highly skewed
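For example, in Python (the skewed values here are illustrative):

```python
import math

skewed = [1, 10, 100, 1000]                # values span three orders of magnitude
logged = [math.log10(x) for x in skewed]   # base-10 log compresses the range
```

After the transformation the values are evenly spaced, which makes skewed data easier to visualize and model.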
lossless compression
aims to reduce file size without removing any data, achieving compression by finding patterns and redundancies in the data and representing them more efficiently
lossy compression
reduces the size of data by permanently removing particular data that is considered irrelevant or redundant
measurement error
inaccuracies or discrepancies that surface during the process of collecting, recording, or analyzing data caused by human error, environmental factors, or inconsistencies in the data
metadata
data that provides information about other data
missing at random (MAR) data
missing data whose absence is related to other observed variables in the dataset but not to the missing values themselves; the missing data can be accounted for and included in the analysis through statistical techniques
missing completely at random (MCAR) data
type of missing data that is not related to any other variables, with no underlying cause for its absence
missing not at random (MNAR) data
type of data missing in a way that depends on unobserved data
network analysis
the examination of relationships and connections between users, data, or entities within a network to identify and analyze influential individuals or groups and understand patterns and trends within the network
noisy data
data that contains errors, inconsistencies, or random fluctuations that can negatively impact the accuracy and reliability of data analysis and interpretation
normal form
a guideline or set of rules used in database design to ensure that a database is well-structured, organized, and free from certain types of data irregularities, such as redundancy and inconsistency
normalization
the process of transforming numerical data into a standard or common range, usually between 0 and 1, to eliminate differences in scale and magnitude that may exist between different variables in a dataset
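A minimal sketch of min-max normalization in Python (the values are illustrative):

```python
def min_max_normalize(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max_normalize([10, 20, 30])
```

The smallest value maps to 0, the largest to 1, and everything else falls proportionally in between.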
NoSQL databases
databases designed to handle unstructured data, such as social media content or data from sensors, using non-relational data models
object storage
method of storage where data are stored as objects that consist of both data and metadata and often used for storing large volumes of unstructured data, such as images and videos
observational data
data that is collected by observing and recording natural events or behaviors in their normal setting without any manipulation or experimentation; typically collected through direct observation or through the use of instruments such as cameras or sensors
open-ended questions
survey questions that allow for more in-depth responses and provide the opportunity for unexpected insights
outlier
a data point that differs significantly from other data points in a given dataset
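One common way to flag outliers is Tukey's interquartile-range rule, sketched here in Python (the data and the 1.5 multiplier are illustrative conventions):

```python
from statistics import quantiles

def iqr_outliers(values):
    """Flag points more than 1.5 * IQR beyond the quartiles (Tukey's rule)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

outliers = iqr_outliers([10, 12, 11, 13, 12, 11, 95])  # 95 stands apart
```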
regular expressions (regex)
a sequence of characters that define a search pattern, used for matching and manipulating strings of text; commonly used for cleaning, processing, and manipulating data in a structured or unstructured format
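For example, in Python's `re` module (the text and the simplified email-like pattern are illustrative; real email validation is more involved):

```python
import re

text = "Contact: alice@example.com, bob@test.org"
# Find simple email-like tokens: word characters, '@', domain, dot, suffix.
emails = re.findall(r"[\w.]+@[\w.]+\.\w+", text)
```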
relational databases
databases that organize data in tables and use structured query language (SQL) for data retrieval and management; often used with financial data
sampling
the process of selecting a subset of a larger population to represent and gather information about that population
sampling bias
occurs when the sample used in a study isn’t representative of the population it intends to generalize to; can lead to skewed or inaccurate conclusions
sampling error
the difference between the results obtained from a sample and the true value of the population parameter it is intended to represent
sampling frame
the list or source of population members from which a sample is drawn; it serves as a guide for selecting a representative sample, determines the coverage of the survey, and ultimately affects the validity and accuracy of the results
scaling
one of the steps involved in data normalization; refers to the process of transforming the numerical values of a feature to a particular range
slicing a string
extracting a portion or section of a string based on a specified range of indices
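For example, in Python (the string is illustrative):

```python
s = "data science"
first = s[0:4]    # characters at indices 0 through 3
last = s[-7:]     # the last seven characters
```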
social listening
the process of monitoring and analyzing conversations and discussions happening online, specifically on social media platforms, to track mentions, keywords, and hashtags related to a specific topic or business to understand the sentiment, trends, and overall public perception
splitting a string
dividing a text string into smaller parts or substrings based on a specified separator
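For example, in Python (the row is illustrative):

```python
row = "2024-12-19,Austin,42"
fields = row.split(",")   # split on the comma separator
```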
square root transformation
data transformation technique that involves taking the square root of the data values; it is useful when the data contains values close to zero
standardizing data
the process of transforming data into a common scale or range to help remove any unit differences or variations in magnitude
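A common standardization is the z-score, sketched here in Python with the population standard deviation (the values are illustrative):

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize values to zero mean and unit (population) std dev."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

standardized = z_scores([2, 4, 6])  # symmetric around the mean of 4
```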
string
a data type used to represent a sequence of characters, such as letters, numbers, and symbols, in a computer program
text preprocessing
the process of cleaning, formatting, and transforming raw text data into a form that is more suitable and easier for natural language processing to understand and analyze
tokenizing
the process of breaking down a piece of text or string of characters into smaller units called tokens, which can be words, phrases, symbols, or individual characters
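A minimal word tokenizer in Python (a simplified illustration; production tokenizers handle punctuation, contractions, and Unicode more carefully):

```python
import re

sentence = "Data science is fun!"
# Lowercase the text and keep runs of alphabetic characters as tokens.
tokens = re.findall(r"[a-z]+", sentence.lower())
```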
transactional data
a type of information that records financial transactions, including purchases, sales, and payments. It includes details such as the date, time, and parties involved in the transaction.
web scraping
a method of collecting data from websites using software or scripts. It involves extracting data from the HTML code of a web page and converting it into a usable format.
Citation/Attribution

This book may not be used in the training of large language models or otherwise be ingested into large language models or generative AI offerings without OpenStax's permission.

Want to cite, share, or modify this book? This book uses the Creative Commons Attribution-NonCommercial-ShareAlike License and you must attribute OpenStax.

Attribution information
  • If you are redistributing all or part of this book in a print format, then you must include on every physical page the following attribution:
    Access for free at https://openstax.org/books/principles-data-science/pages/1-introduction
  • If you are redistributing all or part of this book in a digital format, then you must include on every digital page view the following attribution:
    Access for free at https://openstax.org/books/principles-data-science/pages/1-introduction
Citation information

© Dec 19, 2024 OpenStax. Textbook content produced by OpenStax is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License . The OpenStax name, OpenStax logo, OpenStax book covers, OpenStax CNX name, and OpenStax CNX logo are not subject to the Creative Commons license and may not be reproduced without the prior and express written consent of Rice University.