Introduction to Computer Science

8.5 Data Warehousing, Data Lakes, and Business Intelligence

Learning Objectives

By the end of this section, you will be able to:

  • Outline the characteristics of data warehouses
  • Explain the extraction, transformation, and loading (ETL) process
  • Discuss data marts
  • Describe the technology behind virtual data warehouses and data marts
  • Summarize the nature of operational data stores
  • Identify data lakes and their functionality
  • Compare business intelligence and related tools

With the proliferation of data, new techniques are required to store and handle it. Data warehouses and data lakes are used for storing big data. Business intelligence analyzes the data gathered in data warehouses and data lakes to improve strategic decision-making.

Data Warehouse Characteristics

In the late 1980s, the concept of the data warehouse originated at IBM, where researchers Barry Devlin and Paul Murphy developed the first business data warehouse. A data warehouse centralizes an enterprise’s data from its databases. It supports the flow of data from operational systems to analytics/decision systems by creating a single repository of data from various sources, both internal and external. In most cases, a data warehouse is a relational database that stores processed data optimized for gathering business insights. It collects data with predetermined structures and schemas from transactional systems and business applications, and the data are typically used for operational reporting and analysis. A data warehouse is a collection of data designed to support management’s decision-making process and is meant to be:

  • Subject-oriented: data are organized around subjects such as products.
  • Integrated: data from multiple sources and in different formats are combined consistently.
  • Time-variant: it stores a time series of periodic snapshots, preserving a history of the data over time.
  • Nonvolatile: once loaded, data are read-only; they are not updated or deleted in place.

A data warehouse supports decision-making at the operational, tactical, and strategic levels. Operational decisions happen frequently as part of day-to-day operations and are structured around specific predefined roles (i.e., they are limited in scope). Tactical decisions occur less frequently (e.g., monthly) and are semistructured, depending on the data requirements. Strategic decisions are the highest level of organizational business decisions; they are unstructured and/or infrequent.

In designing a data warehouse, many schemas can be adopted, such as the star schema, snowflake schema, and fact constellation. A star schema is a data model with one large fact table connected to smaller dimension tables. The fact table contains keys referring to each dimension table, as shown in Figure 8.19.

Illustration of a star schema with Fact Table (DepKey, PrKey, StudentKey), connected to Department Table (DepKey, Dname, Dcode), Program Table (PrKey, Pname, Pcode), and Student Table (StudentKey, Snumber, Sname, DoB).
Figure 8.19 A graphic shows an example of a star schema. (attribution: Copyright Rice University, OpenStax, under CC BY 4.0 license)
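
The star schema in Figure 8.19 can be expressed directly in SQL. The following minimal sketch uses Python’s built-in sqlite3 module; the fact table’s name (Enrollment_Fact) is an assumption, since the figure labels it only as “Fact Table”:

```python
import sqlite3

# In-memory database for illustration; a production warehouse would use
# a dedicated analytical DBMS.
conn = sqlite3.connect(":memory:")

conn.executescript("""
-- One dimension table per subject, as in Figure 8.19.
CREATE TABLE Department (DepKey     INTEGER PRIMARY KEY, Dname TEXT, Dcode TEXT);
CREATE TABLE Program    (PrKey      INTEGER PRIMARY KEY, Pname TEXT, Pcode TEXT);
CREATE TABLE Student    (StudentKey INTEGER PRIMARY KEY, Snumber TEXT,
                         Sname TEXT, DoB TEXT);

-- The fact table holds a foreign key into every dimension table.
CREATE TABLE Enrollment_Fact (
    DepKey     INTEGER REFERENCES Department(DepKey),
    PrKey      INTEGER REFERENCES Program(PrKey),
    StudentKey INTEGER REFERENCES Student(StudentKey)
);
""")
```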

The snowflake schema is a data model that normalizes the dimension tables (Figure 8.20). It has one fact table, and normalization splits each dimension into smaller tables linked by primary–foreign key relationships. A fact constellation has more than one fact table connected to shared dimension tables.

Illustration of snowflake schema, connecting Sales Dimension with: Customer Dimension (Name SubDimension), Time Dimension (Month SubDimension), Product Dimension (SKU SubDimension, Category SubDimension), Company Dimension (DepartmentList SubDimension, RegionList SubDimension).
Figure 8.20 This graphic is an example of a snowflake schema. (attribution: Copyright Rice University, OpenStax, under CC BY 4.0 license)
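
Snowflaking a dimension means normalizing it into smaller tables connected by primary–foreign keys. In the same style as the star schema sketch, the Product dimension from Figure 8.20 might be split so that categories live in their own table (table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Snowflaking the Product dimension: Category moves to its own table,
-- and Product references it with a foreign key.
CREATE TABLE Category (CategoryKey INTEGER PRIMARY KEY, CategoryName TEXT);
CREATE TABLE Product  (ProductKey  INTEGER PRIMARY KEY, SKU TEXT,
                       CategoryKey INTEGER REFERENCES Category(CategoryKey));
""")
```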

Extraction, Transformation, and Loading Process

After designing the schema, the next process is extraction, transformation, and loading (ETL), a data integration process that combines data from multiple sources, standardizes its format, and loads it into the data warehouse. The first step is extracting the data from the source systems, which can be a full or an incremental extraction. Incremental extraction typically relies on change data capture (CDC), a technique that detects data update events.

The second step is transforming the data, which includes formatting the data for consistency, cleansing it to remove missing or invalid values, aggregating it by merging attributes, and enriching it by adding external data. The last step is loading the data in parallel into the fact and dimension tables of the data warehouse, as shown in Figure 8.21. Because the ETL steps change the data, those changes should be documented for better understanding and future maintenance. The documentation includes structural metadata about the data’s structure and semantic metadata about its meaning.

Illustration of RDBMS, External data, Codasyl, and Email going through ETL Process, then to Staging area, then through an ETL Process, ending at a Data Warehouse.
Figure 8.21 The extraction, transformation, and loading (ETL) steps of data are shown as the data makes its way to the data warehouse. (attribution: Copyright Rice University, OpenStax, under CC BY 4.0 license)
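
The three ETL steps can be sketched as plain Python functions. In this hypothetical example (the orders and sales_fact tables and their columns are assumptions), incremental extraction is driven by a last-load timestamp, a simple stand-in for change data capture:

```python
import sqlite3

def extract(source_conn: sqlite3.Connection, last_load_time: str):
    # Incremental extraction: a simple form of change data capture that
    # pulls only rows modified since the previous load.
    return source_conn.execute(
        "SELECT id, name, amount FROM orders WHERE modified_at > ?",
        (last_load_time,),
    ).fetchall()

def transform(rows):
    cleaned = []
    for order_id, name, amount in rows:
        if amount is None:           # cleansing: drop rows with missing data
            continue
        name = name.strip().upper()  # formatting: make values consistent
        cleaned.append((order_id, name, round(amount, 2)))
    return cleaned

def load(dw_conn: sqlite3.Connection, rows):
    # Loading: write the transformed rows into the warehouse fact table.
    dw_conn.executemany(
        "INSERT INTO sales_fact (order_id, customer, amount) VALUES (?, ?, ?)",
        rows,
    )
    dw_conn.commit()

# A single ETL run: extract since the last load, transform, then load.
# rows = extract(source_conn, "2024-01-01T00:00:00")
# load(dw_conn, transform(rows))
```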

Data Marts

A data mart is a scaled-down version of a data warehouse aimed at meeting the information needs of a small, homogeneous group of end users, such as a department or business unit. It provides focused content and improves query performance. Dependent data marts pull their data directly from a central data warehouse, while independent data marts are stand-alone systems that draw data directly from operational systems, external sources, or a combination of both.
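
A dependent data mart can be as small as one table derived from the central warehouse. A minimal sketch, assuming a warehouse fact table named sales_fact that carries a region column (both names are hypothetical):

```python
import sqlite3

dw_conn = sqlite3.connect("warehouse.db")  # hypothetical warehouse database

# A dependent data mart: materialize the marketing department's slice of
# the central warehouse (table and column names are illustrative).
dw_conn.execute("""
    CREATE TABLE marketing_mart AS
    SELECT order_id, customer, amount
    FROM   sales_fact
    WHERE  region = 'EMEA'
""")
dw_conn.commit()
```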

Global Issues in Technology

Storing Data

For the past decade, the trend has been to store data in any shape or form without knowing whether it will be used in the future. This has been facilitated by the advent of cloud technology, which allows people to store virtually unlimited amounts of data in data lakes at low cost. The amount of data stored will keep increasing over the next few decades as daily usage of computer processes keeps increasing.

What are some of the global issues you foresee with today’s ability to store an unlimited amount of data that may be analyzed at a later time?

Virtual Data Warehouses and Virtual Data Marts

A virtual data mart has no physical data of its own but provides a single point of access to a set of underlying physical data stores; data are only accessed (“pulled”) at query time. A virtual data warehouse can be built as a set of SQL views defined directly on the underlying operational data sources, or as an extra layer on top of a collection of physical independent data marts. Its metadata model contains the schema mappings between the schemas of the underlying data stores and the schema of the virtual data warehouse, and queries against the virtual warehouse must be reformulated through these mappings. A virtual data mart is usually defined as a single SQL view, and virtual data marts can be independent or dependent. The disadvantages are the extra processing load placed on the underlying (operational) data sources and the fact that it is not possible to keep track of historical data.
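
Because a virtual data mart is usually a single SQL view, it can be defined in one statement. A minimal sketch using sqlite3, with illustrative operational tables named orders and customers; no data are copied, and rows are pulled from the sources only when the view is queried:

```python
import sqlite3

conn = sqlite3.connect("operational.db")  # hypothetical operational database

# A virtual data mart as a single SQL view over operational tables.
conn.executescript("""
CREATE VIEW sales_by_region AS
SELECT   c.region, SUM(o.amount) AS total_sales
FROM     orders o
JOIN     customers c ON c.customer_id = o.customer_id
GROUP BY c.region;
""")
```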

Operational Data Stores

An operational data store (ODS) is a staging area that also provides query facilities. It suits analysis tools that need data closer to real time; more complex analyses are still conducted on the data warehouse itself.

Data Lakes

A data lake stores all of an enterprise’s structured and unstructured data at any scale. Data lakes are large data repositories that store raw data and can be set up without first defining the data structure and schema. They allow users to run analytics without moving the data to a separate analytics system, enabling businesses to gain insights from sources of data that were not previously available for analysis. For example, businesses can build machine learning models using data from log files (e.g., files generated by an application writing to the operating system event log), clickstreams (e.g., information about the pages a visitor views on the Web), social media (e.g., LinkedIn), and IoT devices (e.g., a smartwatch). By making all of the enterprise’s data readily available for analysis, computer/data scientists can answer a new set of business questions or tackle old questions with new data. Figure 8.22 shows an example of a data lake process.

Illustration of Type of Data, Structured, and Semi- and unstructured loading to a Raw data lake where data is cleaned, standardized, and aggregated to a refined data lake.
Figure 8.22 A data lake stores structured, semistructured, and unstructured data. Incoming data are first stored in a raw data lake; processes such as cleansing, standardization, and aggregation then load the data into a refined data lake. (attribution: Copyright Rice University, OpenStax, under CC BY 4.0 license)
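
A data lake is often just inexpensive file or object storage organized into zones. The following sketch uses a local directory to stand in for the lake (paths and field names are assumptions): raw JSON events land unchanged in a raw/ zone, and a refinement step cleanses and standardizes them into a refined/ zone:

```python
import json
from pathlib import Path

LAKE = Path("datalake")
RAW, REFINED = LAKE / "raw", LAKE / "refined"

def ingest(event: dict, name: str) -> None:
    # Raw zone: store the data as-is, with no predefined schema.
    RAW.mkdir(parents=True, exist_ok=True)
    (RAW / f"{name}.json").write_text(json.dumps(event))

def refine() -> None:
    # Refined zone: cleanse and standardize before analysis.
    REFINED.mkdir(parents=True, exist_ok=True)
    for path in RAW.glob("*.json"):
        event = json.loads(path.read_text())
        if "user_id" not in event:  # cleansing: skip incomplete events
            continue
        # Standardization: normalize the (assumed) source field.
        event["source"] = event.get("source", "unknown").lower()
        (REFINED / path.name).write_text(json.dumps(event))

ingest({"user_id": 42, "page": "/home", "source": "Web"}, "click-0001")
refine()
```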

Data Warehouses vs. Data Lakes

A data lake is an architectural approach specifically designed to handle data of every variety, ingestion velocity, and storage volume. A data lake allows the storage of massive amounts of data in a central location so the data are readily available to be categorized, processed, analyzed, and consumed by diverse groups within an organization. Because data can be stored as-is, there is no need to convert them to a predefined schema, as is typically required in traditional RDBMS-driven architectures. Consumer usage patterns and the sourcing of the data itself directly influence how data are collected, stored, processed, moved, transformed, automated, and visualized. Data, now generated and consumed by humans, machines, devices, sensors, and applications, are the ultimate asset, with boundless usage patterns. Some differences between data warehouses and data lakes are listed in Table 8.11.

Characteristic | Data Warehouse | Data Lake
Data | Relational | Nonrelational and relational
Schema | Schema-on-write (designed before implementation) | Schema-on-read (applied at the time of analysis)
Storage | Expensive | Low cost
Performance | Fastest query results | Query results getting faster
Users | Decision-makers | Data scientists
Analysis | Batch reporting, business intelligence, and visualization | Machine learning, predictive analytics
Table 8.11 Differences between a Data Warehouse and a Data Lake

Data Lake Development

A traditional approach to information management is no longer suitable due to cost and the inability to adapt. Seventy percent of development costs stem from the ETL effort to consolidate, prepare, standardize, and transform data for downstream analytics. Costs include the initial capital investment in hardware and software licensing for databases, data integration, and analytics platforms. Organizations need to react to the proliferation of data, new and emerging technologies, and cloud-based integrated service platforms. Traditional data storage and analytics tools can no longer provide the agility and flexibility required to deliver relevant business insights and competitive advantage. Figure 8.23 shows the timeline of data lake development.

Timeline: 1990s: EDW/data cube; 2000s: ETL, data marts/EDW/cubes; mid-2000s: ETL, Hadoop data lake; late 2000s (10% of data): cloud EDW; now (~75% of databases in the cloud by 2022): cloud EDW, lakehouse, data lake/mesh.
Figure 8.23 An illustration of a data lake shows its evolution from an on-premises tool to a cloud version to a hybrid model. (attribution: Copyright Rice University, OpenStax, under CC BY 4.0 license)

From Data Lakes to Data Swamps

A common challenge with the data lake architecture is that, without an appropriate data quality and governance framework in place, it often becomes extremely difficult to sort through the content once terabytes of structured and unstructured data have flowed in. Data lakes can turn into data swamps as the stored data become too messy to be usable. A data swamp is a data lake whose contents are stored without the organization needed to make retrieval easy. Many organizations are now calling for more data governance and metadata management practices to prevent data swamps from forming.

Business Intelligence

The main goal of the data warehouse is to support decision-making, and doing so requires intelligent tools. The set of activities, techniques, and tools aimed at understanding patterns in past data to predict the future is called business intelligence (BI). Modern BI is facilitated through tools such as Tableau, QlikView, and Microsoft Power BI. The quality of the data controls the quality of the results: bad data give bad insights (remember GIGO, garbage in, garbage out). BI techniques include query and reporting tools, which provide a graphical user interface (GUI) in which the user can graphically design a report, and the pivot table, which is a data summarization tool. Another BI tool is online analytical processing (OLAP), a computer-based approach to analyzing data that enables users to extract and view data from multiple dimensions. It is used to analyze large volumes of data and is common in BI applications. By providing a multidimensional view of the data, OLAP lets users perform complex analyses, aggregating across dimensions such as time, location, product, and customer and revealing trends and relationships that might be hidden in a traditional two-dimensional view.
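
A pivot table offers the simplest hands-on view of OLAP-style aggregation across dimensions. Here is a minimal sketch using the third-party pandas library (assumed to be installed) on invented sales data with time, location, and product dimensions:

```python
import pandas as pd

# Toy fact data: three dimensions (quarter, region, product) and one
# measure (sales).
sales = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "region":  ["East", "West", "East", "West"],
    "product": ["A", "A", "B", "B"],
    "sales":   [100, 150, 120, 180],
})

# Pivot: roll the sales measure up across the region and quarter
# dimensions, the way an OLAP roll-up would.
report = sales.pivot_table(values="sales", index="region",
                           columns="quarter", aggfunc="sum")
print(report)
```

Each cell of the result summarizes sales for one (region, quarter) pair; OLAP tools generalize exactly this kind of cross-dimensional summary to many dimensions and much larger data volumes.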
