- application programming interface (API)
- a set of protocols, tools, and definitions for building software applications; it allows different software systems to communicate and interact with each other and enables developers to access data and services from other applications, operating systems, or platforms
- BCNF (Boyce-Codd Normal Form)
- a normal form that is similar to 3NF but stricter in its requirements on functional dependencies; it ensures that every attribute in a table is functionally dependent on a candidate key and nothing else
- big data
- complex, high-volume and varied data that are challenging to process with traditional methods
- closed-ended questions
- clear, structured questions with predetermined answer choices that are effective in gathering quantitative data
- cloud computing
- the delivery of computing services, such as storage, processing power, and software applications, over the internet
- cloud storage
- type of online storage that allows users and organizations to store, access, and manage data over the internet. The data is stored on remote servers managed by a cloud storage service provider.
- data aggregation
- the process of collecting and combining information from multiple sources into a single database or dataset
- data chunking
- a technique used to break down large datasets into smaller chunks that are easier to manage, process, analyze, and store; also known as data segmentation
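
  A minimal sketch with pandas, assuming a large CSV file (the name `sales.csv` and the `amount` column are placeholders):

  ```python
  import pandas as pd

  # Read the file in chunks of 100,000 rows instead of loading it all at once
  total = 0
  for chunk in pd.read_csv("sales.csv", chunksize=100_000):
      # Each chunk is a regular DataFrame that can be processed independently
      total += chunk["amount"].sum()

  print(total)
  ```
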
- data compression
- the process of reducing the size of a data file or transmission to make it easier and faster to store, transmit, or manipulate
- data dictionary
- a data structure that stores data in key-value pairs, allowing for efficient retrieval of data using its key
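
  In Python, this key-value idea corresponds to the built-in `dict` type; a small illustration with made-up values:

  ```python
  # Each key maps to a value; lookup by key is efficient
  record = {"name": "Ada", "age": 36, "city": "London"}

  print(record["age"])                 # 36
  record["age"] = 37                   # update an existing value
  record["email"] = "ada@example.com"  # add a new key-value pair
  ```
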
- data discretization
- the process of converting continuous data into discrete categories or intervals to allow for easier analysis and processing of the data
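
  A quick sketch with pandas, assuming ages should be bucketed into three labeled intervals (the bin edges are illustrative):

  ```python
  import pandas as pd

  ages = pd.Series([3, 17, 25, 41, 60, 78])

  # Convert continuous ages into discrete, labeled categories
  groups = pd.cut(ages, bins=[0, 18, 65, 120], labels=["child", "adult", "senior"])
  print(groups.tolist())  # ['child', 'child', 'adult', 'adult', 'adult', 'senior']
  ```
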
- data indexing
- the process of organizing and storing data in a way that makes it easier to search and retrieve information
- data integration
- the process of combining data from multiple sources, such as databases, applications, and files, into a unified view
- data processing
- the process of collecting, organizing, and manipulating data to produce meaningful information; it involves transforming raw data into a more useful and understandable format
- data standardization
- the process of transforming data into a common scale or range to help remove any unit differences or variations in magnitude
- data transformation
- the process of modifying data to make it more suitable for the planned analysis
- data validation
- procedure to ensure that the data is reliable and trustworthy for use in the data science project, done by performing checks and tests to identify and correct any errors in the data
- data warehouse
- a large, organized repository of integrated data from various sources used to support decision-making processes within a company or organization
- DataFrame
- a two-dimensional data structure in Python that is used for data manipulation and analysis
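
  A minimal example using pandas, the library the term usually refers to; the data here is invented:

  ```python
  import pandas as pd

  # Labeled columns and rows, like a small in-memory table
  df = pd.DataFrame({
      "city": ["Lagos", "Lima", "Oslo"],
      "population_m": [15.4, 10.1, 0.7],
  })

  print(df.shape)                    # (3, 2)
  print(df[df["population_m"] > 5])  # filter rows with a boolean condition
  ```
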
- exponential transformation
- a data transformation operation that involves taking the exponent of the data values
- Huffman coding
- reduces file size by assigning shorter binary codes to the most frequently used characters or symbols in a given dataset and longer codes to less frequently used characters
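
  A compact sketch of the code-building step using Python's `heapq`; the input text is made up for illustration:

  ```python
  import heapq
  from collections import Counter

  def huffman_codes(text):
      # Min-heap of [frequency, tiebreaker, [symbol, code], ...] entries
      heap = [[freq, i, [sym, ""]] for i, (sym, freq) in enumerate(Counter(text).items())]
      heapq.heapify(heap)
      counter = len(heap)
      while len(heap) > 1:
          lo = heapq.heappop(heap)   # two least frequent subtrees
          hi = heapq.heappop(heap)
          for pair in lo[2:]:        # prefix '0' on the low branch
              pair[1] = "0" + pair[1]
          for pair in hi[2:]:        # prefix '1' on the high branch
              pair[1] = "1" + pair[1]
          heapq.heappush(heap, [lo[0] + hi[0], counter] + lo[2:] + hi[2:])
          counter += 1
      return {sym: code for sym, code in heap[0][2:]}

  print(huffman_codes("mississippi"))  # frequent letters receive the shortest codes
  ```
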
- imputation
- the process of estimating missing values in a dataset by substituting them with a statistically reasonable value
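
  A small sketch with pandas, filling missing entries with the column mean (one common strategy among several):

  ```python
  import numpy as np
  import pandas as pd

  scores = pd.Series([82.0, np.nan, 91.0, 77.0, np.nan])

  # Substitute missing values with a statistically reasonable estimate: the mean
  filled = scores.fillna(scores.mean())
  print(filled.round(2).tolist())  # [82.0, 83.33, 91.0, 77.0, 83.33]
  ```
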
- log transformation
- data transformation technique that involves taking the logarithm of the data values; often used when the data is highly skewed
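
  A quick NumPy sketch on a made-up, highly skewed set of values:

  ```python
  import numpy as np

  values = np.array([1, 10, 100, 1000, 10000])

  # Taking the logarithm compresses the large values and reduces skew
  log_values = np.log10(values)
  print(log_values)  # [0. 1. 2. 3. 4.]
  ```
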
- lossless compression
- aims to reduce file size without removing any data, achieving compression by finding patterns and redundancies in the data and representing them more efficiently
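
  A round-trip sketch with Python's built-in `zlib`, showing that the original bytes are recovered exactly:

  ```python
  import zlib

  data = b"abcabcabcabcabcabcabcabcabcabc"  # repetitive input compresses well

  compressed = zlib.compress(data)
  restored = zlib.decompress(compressed)

  print(len(data), len(compressed))  # the compressed form is smaller here
  assert restored == data            # no information was lost
  ```
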
- lossy compression
- reduces the size of data by permanently discarding particular data that is considered irrelevant or redundant
- measurement error
- inaccuracies or discrepancies that surface during the process of collecting, recording, or analyzing data caused by human error, environmental factors, or inconsistencies in the data
- metadata
- data that provides information about other data
- missing at random (MAR) data
- missing data whose absence is related to other observed variables in the dataset but not to the missing values themselves; the missing data can be accounted for and included in the analysis through statistical techniques
- missing completely at random (MCAR) data
- type of missing data that is not related to any other variables, with no underlying cause for its absence
- missing not at random (MNAR) data
- type of data missing in a way that depends on the unobserved (missing) values themselves
- network analysis
- the examination of relationships and connections between users, data, or entities within a network to identify and analyze influential individuals or groups and understand patterns and trends within the network
- noisy data
- data that contains errors, inconsistencies, or random fluctuations that can negatively impact the accuracy and reliability of data analysis and interpretation
- normal form
- a guideline or set of rules used in database design to ensure that a database is well-structured, organized, and free from certain types of data irregularities, such as redundancy and inconsistency
- normalization
- the process of transforming numerical data into a standard or common range, usually between 0 and 1, to eliminate differences in scale and magnitude that may exist between different variables in a dataset
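
  A min-max scaling sketch with NumPy, mapping made-up values into the 0-1 range:

  ```python
  import numpy as np

  x = np.array([50.0, 60.0, 80.0, 100.0])

  # Min-max normalization: (x - min) / (max - min) maps values into [0, 1]
  x_norm = (x - x.min()) / (x.max() - x.min())
  print(x_norm)  # [0.  0.2 0.6 1. ]
  ```
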
- NoSQL databases
- databases designed to handle unstructured data, such as social media content or data from sensors, using non-relational data models
- object storage
- method of storage where data are stored as objects that consist of both data and metadata; often used for storing large volumes of unstructured data, such as images and videos
- observational data
- data that is collected by observing and recording natural events or behaviors in their normal setting without any manipulation or experimentation; typically collected through direct observation or through the use of instruments such as cameras or sensors
- open-ended questions
- survey questions that allow for more in-depth responses and provide the opportunity for unexpected insights
- outlier
- a data point that differs significantly from other data points in a given dataset
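
  A quick check with NumPy using the common interquartile-range (IQR) rule of thumb on made-up data:

  ```python
  import numpy as np

  data = np.array([10, 12, 11, 13, 12, 95])  # 95 sits far from the rest

  q1, q3 = np.percentile(data, [25, 75])
  iqr = q3 - q1
  # Rule of thumb: points beyond 1.5 * IQR from the quartiles are flagged as outliers
  outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
  print(outliers)  # [95]
  ```
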
- regular expressions (regex)
- a sequence of characters that define a search pattern, used for matching and manipulating strings of text; commonly used for cleaning, processing, and manipulating data in a structured or unstructured format
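
  A brief sketch with Python's `re` module, extracting dates from an invented string:

  ```python
  import re

  text = "Orders shipped on 2023-04-01 and 2023-04-15."

  # The pattern matches four digits, a dash, two digits, a dash, two digits
  dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)
  print(dates)  # ['2023-04-01', '2023-04-15']
  ```
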
- relational databases
- databases that organize data in tables and use structured query language (SQL) for data retrieval and management; often used with financial data
- sampling
- the process of selecting a subset of a larger population to represent and gather information about that population
- sampling bias
- occurs when the sample used in a study isn’t representative of the population it is intended to generalize to; can lead to skewed or inaccurate conclusions
- sampling error
- the difference between the results obtained from a sample and the true value of the population parameter it is intended to represent
- sampling frame
- the list or source of population members from which a sample is drawn; serves as a guide for selecting a representative sample, determines the coverage of the survey, and ultimately affects the validity and accuracy of the results
- scaling
- one of the steps involved in data normalization; refers to the process of transforming the numerical values of a feature to a particular range
- slicing a string
- extracting a portion or section of a string based on a specified range of indices
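
  A quick Python illustration:

  ```python
  s = "data science"

  print(s[0:4])   # 'data'     (characters at indices 0 through 3)
  print(s[5:])    # 'science'  (from index 5 to the end)
  print(s[-7:])   # 'science'  (negative indices count from the end)
  ```
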
- social listening
- the process of monitoring and analyzing conversations and discussions happening online, specifically on social media platforms, to track mentions, keywords, and hashtags related to a specific topic or business in order to understand sentiment, trends, and overall public perception
- splitting a string
- dividing a text string into smaller parts or substrings based on a specified separator
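
  For example, in Python:

  ```python
  line = "2023-04-01,Lagos,15.4"

  # Split on the comma separator into a list of substrings
  fields = line.split(",")
  print(fields)  # ['2023-04-01', 'Lagos', '15.4']
  ```
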
- square root transformation
- data transformation technique that involves taking the square root of the data values; it is useful when the data contains values close to zero
- standardizing data
- the process of transforming data into a common scale or range to help remove any unit differences or variations in magnitude
- string
- a data type used to represent a sequence of characters, such as letters, numbers, and symbols, in a computer program
- text preprocessing
- the process of cleaning, formatting, and transforming raw text data into a form that is more suitable and easier for natural language processing to understand and analyze
- tokenizing
- the process of breaking down a piece of text or string of characters into smaller units called tokens, which can be words, phrases, symbols, or individual characters
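
  A minimal word-level sketch using only the standard library; dedicated NLP tokenizers handle many more cases:

  ```python
  import re

  sentence = "Tokenizing breaks text into smaller units, called tokens."

  # Keep runs of word characters and drop punctuation
  tokens = re.findall(r"\w+", sentence.lower())
  print(tokens)
  # ['tokenizing', 'breaks', 'text', 'into', 'smaller', 'units', 'called', 'tokens']
  ```
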
- transactional data
- a type of information that records financial transactions, including purchases, sales, and payments. It includes details such as the date, time, and parties involved in the transaction.
- web scraping
- a method of collecting data from websites using software or scripts. It involves extracting data from the HTML code of a web page and converting it into a usable format.
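
  A hedged sketch using the third-party `requests` and `beautifulsoup4` packages (assumed installed); the URL is a placeholder, and real scraping should respect a site's terms of use and robots.txt:

  ```python
  import requests
  from bs4 import BeautifulSoup

  # Placeholder URL; substitute a page you are permitted to scrape
  url = "https://example.com"
  html = requests.get(url, timeout=10).text

  # Parse the HTML and extract the data of interest, e.g. all link texts
  soup = BeautifulSoup(html, "html.parser")
  links = [a.get_text(strip=True) for a in soup.find_all("a")]
  print(links)
  ```
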