application programming interface (API)
set of protocols, tools, and definitions for building software applications that allow different software systems to communicate and interact with each other and enable developers to access data and services from other applications, operating systems, or platforms
BCNF (Boyce-Codd Normal Form)
a normal form that is similar to 3NF but stricter in its requirements on functional dependencies; it ensures that all attributes in a table are functionally dependent on a candidate key and nothing else
big data
complex, high-volume and varied data that are challenging to process with traditional methods
closed-ended questions
clear, structured questions with predetermined answer choices that are effective in gathering quantitative data
cloud computing
the delivery of computing services, such as storage, processing power, and software applications, over the internet
cloud storage
type of online storage that allows users and organizations to store, access, and manage data over the internet; the data is stored on remote servers managed by a cloud storage service provider
data aggregation
the process of collecting and combining information from multiple sources into a single database or dataset
data chunking
a technique used to break down large datasets into smaller, more manageable chunks and make them easier to manage, process, analyze, and store; also known as data segmentation
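A minimal sketch of chunking in Python (the data and chunk size are illustrative):

```python
def chunk(data, size):
    """Yield successive chunks of at most `size` items from `data`."""
    for start in range(0, len(data), size):
        yield data[start:start + size]

# Break ten items into chunks of four.
chunks = list(chunk(list(range(10)), 4))
```

Each chunk can then be processed or stored independently, which keeps memory use bounded for large datasets.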
data compression
the process of reducing the size of a data file or transmission to make it easier and faster to store, transmit, or manipulate
data dictionary
a data structure that stores data in key-value pairs, allowing for efficient retrieval of data using its key
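For example, in Python (the names and values here are illustrative):

```python
# A dictionary maps keys to values for fast lookup.
ages = {"alice": 34, "bob": 29}
ages["carol"] = 41       # add a new key-value pair
bob_age = ages["bob"]    # retrieve a value by its key
```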
data discretization
the process of converting continuous data into discrete categories or intervals to allow for easier analysis and processing of the data
data indexing
the process of organizing and storing data in a way that makes it easier to search and retrieve information
data integration
the process of combining data from multiple sources, such as databases, applications, and files, into a unified view
data processing
the process of collecting, organizing, and manipulating data to produce meaningful information; it involves transforming raw data into a more useful and understandable format
data standardization
the process of transforming data into a common scale or range to help remove any unit differences or variations in magnitude
data transformation
the process of modifying data to make it more suitable for the planned analysis
data validation
procedure that ensures the data is reliable and trustworthy for use in a data science project by performing checks and tests to identify and correct any errors in the data
data warehouse
a large, organized repository of integrated data from various sources used to support decision-making processes within a company or organization
DataFrame
a two-dimensional, tabular data structure provided by the pandas library in Python that is used for data manipulation and analysis
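A small example using pandas (the column names and values are illustrative):

```python
import pandas as pd

# Build a DataFrame from a dict mapping column names to column values.
df = pd.DataFrame({
    "city": ["Austin", "Houston"],
    "pop": [961_855, 2_304_580],
})
total = df["pop"].sum()  # column-wise operations work like NumPy arrays
```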
exponential transformation
a data transformation operation that involves taking the exponent of the data values
Huffman coding
reduces file size by assigning shorter binary codes to the most frequently used characters or symbols in a given dataset and longer codes to less frequently used characters
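A compact sketch of the code-assignment step in Python, using a frequency heap (a simplified illustration, not a full encoder):

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a prefix code: frequent symbols get shorter binary codes."""
    freq = Counter(text)
    if len(freq) == 1:  # edge case: a single distinct symbol
        return {next(iter(freq)): "0"}
    # Each heap entry: (frequency, tie-breaker, {symbol: code-so-far})
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)   # two least-frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

codes = huffman_codes("aaaabbc")  # 'a' is most frequent
```

Here the most frequent symbol `a` receives a one-bit code, while the rarer `b` and `c` receive two-bit codes.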
imputation
the process of estimating missing values in a dataset by substituting them with a statistically reasonable value
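A minimal sketch of mean imputation in Python, with `None` standing in for missing values (the data is illustrative):

```python
from statistics import mean

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

filled = impute_mean([2.0, None, 4.0, 6.0])  # missing value becomes 4.0
```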
log transformation
data transformation technique that involves taking the logarithm of the data values; it is often used when the data is highly skewed
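For example, in Python (the skewed values here are illustrative):

```python
import math

skewed = [1, 10, 100, 1000]                # values span three orders of magnitude
logged = [math.log10(x) for x in skewed]   # base-10 log compresses the range
```

After the transformation the values are evenly spaced, which makes skewed data easier to visualize and model.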
lossless compression
aims to reduce file size without removing any data, achieving compression by finding patterns and redundancies in the data and representing them more efficiently
lossy compression
reduces the size of data by permanently removing particular data that is considered irrelevant or redundant
measurement error
inaccuracies or discrepancies that surface during the process of collecting, recording, or analyzing data caused by human error, environmental factors, or inconsistencies in the data
metadata
data that provides information about other data
missing at random (MAR) data
missing data whose absence is related to other observed variables in the dataset but not to the missing values themselves; the missing data can be accounted for and included in the analysis through statistical techniques
missing completely at random (MCAR) data
type of missing data that is not related to any other variables, with no underlying cause for its absence
missing not at random (MNAR) data
type of data missing in a way that depends on unobserved data
network analysis
the examination of relationships and connections between users, data, or entities within a network to identify and analyze influential individuals or groups and understand patterns and trends within the network
noisy data
data that contains errors, inconsistencies, or random fluctuations that can negatively impact the accuracy and reliability of data analysis and interpretation
normal form
a guideline or set of rules used in database design to ensure that a database is well-structured, organized, and free from certain types of data irregularities, such as redundancy and inconsistency
normalization
the process of transforming numerical data into a standard or common range, usually between 0 and 1, to eliminate differences in scale and magnitude that may exist between different variables in a dataset
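A minimal sketch of min-max normalization in Python (the values are illustrative):

```python
def min_max_normalize(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max_normalize([10, 20, 30])
```

The smallest value maps to 0, the largest to 1, and everything else falls proportionally in between.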
NoSQL databases
databases designed to handle unstructured data, such as social media content or data from sensors, using non-relational data models
object storage
method of storage where data are stored as objects that consist of both data and metadata and often used for storing large volumes of unstructured data, such as images and videos
observational data
data that is collected by observing and recording natural events or behaviors in their normal setting without any manipulation or experimentation; typically collected through direct observation or through the use of instruments such as cameras or sensors
open-ended questions
survey questions that allow for more in-depth responses and provide the opportunity for unexpected insights
outlier
a data point that differs significantly from other data points in a given dataset
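One common way to flag outliers is Tukey's interquartile-range rule, sketched here in Python (the data and the 1.5 multiplier are illustrative conventions):

```python
from statistics import quantiles

def iqr_outliers(values):
    """Flag points more than 1.5 * IQR beyond the quartiles (Tukey's rule)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

outliers = iqr_outliers([10, 12, 11, 13, 12, 11, 95])  # 95 stands apart
```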
regular expressions (regex)
a sequence of characters that define a search pattern, used for matching and manipulating strings of text; commonly used for cleaning, processing, and manipulating data in a structured or unstructured format
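For example, in Python's `re` module (the text and the simplified email-like pattern are illustrative; real email validation is more involved):

```python
import re

text = "Contact: alice@example.com, bob@test.org"
# Find simple email-like tokens: word characters, '@', domain, dot, suffix.
emails = re.findall(r"[\w.]+@[\w.]+\.\w+", text)
```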
relational databases
databases that organize data in tables and use structured query language (SQL) for data retrieval and management; often used with financial data
sampling
the process of selecting a subset of a larger population to represent and gather information about that population
sampling bias
occurs when the sample used in a study isn’t representative of the population it intends to generalize to; can lead to skewed or inaccurate conclusions
sampling error
the difference between the results obtained from a sample and the true value of the population parameter it is intended to represent
sampling frame
the list or source of population members from which a sample is drawn; it serves as a guide for selecting a representative sample, determines the coverage of the survey, and ultimately affects the validity and accuracy of the results
scaling
one of the steps involved in data normalization; refers to the process of transforming the numerical values of a feature to a particular range
slicing a string
extracting a portion or section of a string based on a specified range of indices
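For example, in Python (the string is illustrative):

```python
s = "data science"
first = s[0:4]    # characters at indices 0 through 3
last = s[-7:]     # the last seven characters
```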
social listening
the process of monitoring and analyzing conversations and discussions happening online, specifically on social media platforms, to track mentions, keywords, and hashtags related to a specific topic or business to understand the sentiment, trends, and overall public perception
splitting a string
dividing a text string into smaller parts or substrings based on a specified separator
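For example, in Python (the row is illustrative):

```python
row = "2024-12-19,Austin,42"
fields = row.split(",")   # split on the comma separator
```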
square root transformation
data transformation technique that involves taking the square root of the data values; it is useful when the data contains values close to zero
standardizing data
the process of transforming data into a common scale or range to help remove any unit differences or variations in magnitude
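A common standardization is the z-score, sketched here in Python with the population standard deviation (the values are illustrative):

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize values to zero mean and unit (population) std dev."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

standardized = z_scores([2, 4, 6])  # symmetric around the mean of 4
```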
string
a data type used to represent a sequence of characters, such as letters, numbers, and symbols, in a computer program
text preprocessing
the process of cleaning, formatting, and transforming raw text data into a form that is more suitable and easier for natural language processing to understand and analyze
tokenizing
the process of breaking down a piece of text or string of characters into smaller units called tokens, which can be words, phrases, symbols, or individual characters
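A minimal word tokenizer in Python (a simplified illustration; production tokenizers handle punctuation, contractions, and Unicode more carefully):

```python
import re

sentence = "Data science is fun!"
# Lowercase the text and keep runs of alphabetic characters as tokens.
tokens = re.findall(r"[a-z]+", sentence.lower())
```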
transactional data
a type of information that records financial transactions, including purchases, sales, and payments. It includes details such as the date, time, and parties involved in the transaction.
web scraping
a method of collecting data from websites using software or scripts. It involves extracting data from the HTML code of a web page and converting it into a usable format.
Citation/Attribution

This book may not be used in the training of large language models or otherwise be ingested into large language models or generative AI offerings without OpenStax's permission.

Want to cite, share, or modify this book? This book uses the Creative Commons Attribution-NonCommercial-ShareAlike License and you must attribute OpenStax.

Attribution information
  • If you are redistributing all or part of this book in a print format, then you must include on every physical page the following attribution:
    Access for free at https://openstax.org/books/principles-data-science/pages/1-introduction
  • If you are redistributing all or part of this book in a digital format, then you must include on every digital page view the following attribution:
    Access for free at https://openstax.org/books/principles-data-science/pages/1-introduction
Citation information

© Dec 19, 2024 OpenStax. Textbook content produced by OpenStax is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License . The OpenStax name, OpenStax logo, OpenStax book covers, OpenStax CNX name, and OpenStax CNX logo are not subject to the Creative Commons license and may not be reproduced without the prior and express written consent of Rice University.