Introduction to Computer Science

8.6 Data Management for Shallow and Deep Learning Applications


Learning Objectives

By the end of this section, you will be able to:

  • Define big data and explain its related functionality
  • Discuss big data analytics and its impact on computer systems
  • Identify and describe the tools that are used to perform shallow machine learning
  • Describe cognitive analytics and artificial intelligence
  • Identify the tools that are used to perform deep learning
  • Explain massively parallel processing (MPP) database management systems

In the last few years, the amount of data in many businesses has grown exponentially into what is known as big data. Storing, analyzing, and retrieving this data has become complex and challenging. Traditional databases and storage systems cannot handle big data, which creates new software and hardware business opportunities. Data integration aims to provide a unified view of, and/or unified access to, heterogeneous and possibly distributed data sources. Process integration deals with the sequencing of tasks in a business process and also governs the data flows in those processes. Both data and processes must therefore be considered in data integration. The emergence of BI and analytics triggered the need to consolidate data into a data warehouse.

Big Data

The term big data has been in use since the early 1990s and is credited to computer scientist John R. Mashey, who is considered the father of big data. Big data is a high volume of data in the form of structured, semistructured, or unstructured data. Every minute, more than 300,000 tweets are created, Netflix subscribers stream more than 70,000 hours of video, Apple users download 30,000 apps, and Instagram users like almost two million photos. In this section, we study the five Vs of big data, big data examples, new sources of data, data and process integration, data quality, and data governance, as well as the privacy, security, and ethical use of data.

The Five Vs of Big Data

Big data helps decision-makers improve their decisions, which in turn improves the quality of their products and services. Big data has many characteristics. Researchers define the scope of big data using five Vs (Figure 8.24):

  • volume: the amount of data; also referred to as data at rest
  • velocity: the speed at which data comes in and goes out; data in motion
  • variety: the range of data types and sources that are used; data in its many forms
  • veracity: the uncertainty of the data; data in doubt
  • value: the actual value derived using the total cost of ownership (TCO) and return on investment (ROI) of the data
5 Vs of Big Data: Velocity (Batch, Real-time, Processes, Streams), Value (Statistical, Events, Correlations, Hypothetical), Veracity (Authenticity, Origin, Availability, Accountability), Variety (Structured, Unstructured, Multi-factor), Volume (Terabytes, Records/Arch, Transactions, Tables).
Figure 8.24 The five Vs of big data are volume, velocity, variety, veracity, and value. (attribution: Copyright Rice University, OpenStax, under CC BY 4.0 license)

New Sources of Data

IoT sources become more numerous and diverse every day. Other sources of data include network data, publicly available data, macroeconomic data, textual data, audio, images, videos, fingerprints, location (GPS) data, geospatial data, RFID data, and many more. We will cover the IoT in more detail in Chapter 13 Hybrid Multicloud Digital Solutions Development.

Data and Process Integration

Traditionally, a business has various departments whose data and systems are independent of (i.e., siloed from) each other. Departments such as human resources and accounting do not integrate or share available resources, which makes it hard to answer queries or check updates. Converging analytical and operational data requires data integration, which provides a consistent view of all organizational data. There are different data integration patterns, such as data consolidation, data federation, and data propagation. The use of ETL to capture data from multiple sources and integrate it into a single store such as a data warehouse is data consolidation. The use of enterprise information integration (EII) to provide a unified view over data sources is data federation. The use of enterprise application integration (EAI), corresponding to the synchronous or asynchronous propagation of updates in a source system to a target system, is data propagation.
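As a minimal sketch of the data consolidation (ETL) pattern, the following Python example merges two hypothetical departmental extracts into one denormalized table that could be loaded into a data warehouse; the file names and column names are assumptions made for illustration.

```python
import pandas as pd

# Extract: read exports from two departmental silos (hypothetical file names).
hr = pd.read_csv("hr_employees.csv")      # assumed columns: employee_id, name, department
acct = pd.read_csv("acct_payroll.csv")    # assumed columns: employee_id, salary, cost_center

# Transform: standardize the join key and merge into one consistent view.
hr["employee_id"] = hr["employee_id"].astype(str).str.strip()
acct["employee_id"] = acct["employee_id"].astype(str).str.strip()
unified = hr.merge(acct, on="employee_id", how="inner")

# Load: write the consolidated records to a warehouse staging file.
unified.to_csv("warehouse_employee_costs.csv", index=False)
```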

A more recent approach to data integration is data virtualization, which is a technique that hides the physical location of data and uses data integration patterns to produce a unified data view. The aim is not only to integrate the data, but also to integrate the process. Process integration aims to integrate the procedures within the business process to improve performance. The main challenge of combining traditional data types with new data types is integrating the diverse data types in such a way that they can be processed and analyzed efficiently.

Data Quality and Master Data Management

Data quality involves various criteria for assessing the quality of a dataset. We need to guarantee a high quality level to achieve accurate results (garbage in, garbage out, or GIGO). Data quality dimensions include accuracy, completeness, consistency, and accessibility. The causes of data quality issues are often deeply rooted within core organizational processes and culture. Data preprocessing activities are corrective measures for dealing with data quality issues. Transparent and well-defined collaboration between data stewards and data owners is key to sustainably improving data quality. Data integration can both improve and hamper data quality (e.g., environments where different integration approaches have been combined, leading to a jungle of systems). The series of processes, policies, standards, and tools that help organizations define and provide a single point of reference for all mastered data is master data management (MDM). Setting up an MDM initiative involves many steps and tools, including identifying data sources, mapping out the systems architecture, constructing data transformation, cleansing, and normalization rules, and providing data storage, monitoring, and governance facilities. MDM is the generic term for enterprise data management (EDM). EDM is a data management platform for validating customer data; it is mainly associated with securities data.
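To illustrate how two of these dimensions might be checked programmatically, here is a small hedged sketch in Python; the file name, column names, and the reference list of valid country codes are assumptions, not part of any particular MDM product.

```python
import pandas as pd

customers = pd.read_csv("customers.csv")  # hypothetical master data extract

# Completeness: share of non-missing values per column.
completeness = customers.notna().mean()
print(completeness)

# Consistency: flag records whose country code is not in an assumed reference list.
valid_countries = {"US", "CA", "GB", "DE"}
inconsistent = customers[~customers["country"].isin(valid_countries)]
print(f"{len(inconsistent)} records fail the country-code consistency rule")
```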

Data Governance

Organizations are increasingly implementing company-wide data governance initiatives to govern and oversee data quality and data integration. In data governance, an organization aims to set up a company-wide controlled and supported approach toward data quality that is accompanied by data quality management processes (i.e., managing data as an asset rather than a liability). Different frameworks and standards have been introduced for data governance. A well-articulated data governance program is a good starting point. Approaches include centralized (i.e., a central department of data scientists handling all analytics requests), decentralized (i.e., all data scientists are directly assigned to business units), and mixed (i.e., a centrally coordinated center of analytical excellence with analytics organized at the business unit level). Businesses should aim for a top-down, data-driven culture to catalyze trickle-down effects. The board of directors and senior management should be actively involved in analytical model building, implementation, and monitoring processes.

Data governance has different standards such as Total Data Quality Management (TDQM), Capability Maturity Model Integration (CMMI), Data Management Body of Knowledge (DMBOK), Control Objectives for Information and Related Technology (COBIT), and Information Technology Infrastructure Library (ITIL). TDQM is a model that defines, measures, and improves the quality of the data in the organization. CMMI is a model that helps organizations improve reliability by decreasing risks in services and products through improved system behaviors. DMBOK is a collection of best practices for each data management process, such as data modeling, data quality, documentation, data security, and metadata. COBIT is a framework that helps organizations looking to improve and monitor their data management systems. ITIL is a framework that improves efficiency to achieve predictable services through standardized design.

Privacy and Security of Data

The concept of data security pertains to the following concerns: guaranteeing data integrity and data availability (i.e., ensuring that data are accurate and available when needed), authentication (i.e., verifying the identity of a user), access control (i.e., controlling who has access to data and what actions they can perform), guaranteeing confidentiality (i.e., ensuring that data are kept confidential and are not disclosed to unauthorized users), auditing (i.e., tracking all access and activity related to data), and vulnerabilities (i.e., identifying vulnerabilities in systems and data). To better understand privacy, we should start with the RACI model (responsible, accountable, consulted, and informed). Responsible defines who is responsible for developing the data. Accountable defines the people who decide what should be done with the data. Consulted defines those with domain expertise who advise the data scientist. Informed defines the set of people who should be kept up-to-date on the working process.

To access internal data, a data scientist should file a data access request to specify the target data and the length of time needed for access. There are many available privacy regulations such as the General Data Protection Regulation (GDPR), Privacy Act of 1974, Health Insurance Portability and Accountability Act (HIPAA) of 1996, the Electronic Communications Privacy Act (ECPA) of 1986, and the Privacy Shield. The Privacy Shield is a framework for exchanges of personal data between the European Union and the United States.

Big Data Analytics

While storage and computing needs have grown by leaps and bounds in the last several decades, traditional hardware has not advanced enough to keep up. Enterprise data no longer fits neatly into standard storage, and the computation required for most big data analytics tasks may take weeks or months on a single machine, or may be impossible to complete. To overcome this deficiency, many new technologies have evolved that use multiple computers working together, distributing the database across thousands of commodity servers. When a network of computers is connected and works together to accomplish the same task, the computers form a cluster. A cluster can be thought of as a single computer, but it can dramatically improve performance, availability, and scalability over a single, more powerful machine, at a lower cost, by using commodity hardware.

Concepts In Practice

Private Data and Analytics

All social media and search engine websites use data management for machine and deep learning applications. Today, there are major concerns related to the use of private data to perform analytics, support targeted sales, or broadcast fake news to select customers. Being able to control the use of private data is essential so it does not get manipulated and end up being misused. Data compliance attempts to address this very problem, and various regulations have already been put in place in the United Kingdom and Europe via the GDPR. The same is happening in the United States, with regulations being put in place in some states, notably California.

Analytics Process Model

An analytics process model provides a statistical analysis using a set of processes to solve system problems and find a new market opportunity. There are many sample applications for analyzing data such as risk analytics (e.g., credit scoring, fraud detection), marketing analytics (i.e., using data to evaluate the success of marketing strategies), response modeling (i.e., statistical platform to model the relationship between the customers’ responses and the predicted values), customer segmentation (i.e., dividing the customers into groups with each group sharing the same characteristics to improve the marketing strategy), recommender systems (i.e., a filtering system that provides suggestions for products based on customer rating), and text analytics (i.e., identifying a pattern to understand the data).

Creating a viable data management infrastructure for analytics applications involves much more than just building a simple machine learning model. It requires an understanding of how all the parts of the enterprise’s ecosystem work together: where and how the data flow into the data team, the environment where the data are processed and transformed, the enterprise’s conventions for visualizing and presenting data, and how the model output will be converted into input for other enterprise applications. The main goal is to build a process that is easy to maintain, where models can be iterated on, performance is reproducible, and a model’s output can be easily understood and visualized by other stakeholders so that they can make informed business decisions. Achieving these goals requires selecting the right tools as well as an understanding of what others in the industry are doing, along with best practices. Figure 8.25 shows the three-stage process of the analytics process model, which we discuss next.

Illustration of analytics process model. Preprocessing: Identify problem, Identify data sources, Select, Clean Transform. Analytics (Analyze). Post-processing (Evaluate and deploy).
Figure 8.25 The three stages of the analytics process model are shown. (attribution: Copyright Rice University, OpenStax, under CC BY 4.0 license)

Data Preprocessing

To preprocess the data, a data scientist may use denormalizing, merging, sampling, exploratory analysis, missing-value handling, and outlier detection and handling. The process of merging several normalized data tables into an aggregated, denormalized data table is called denormalizing. The merging process involves selecting information about a specific entity from different tables and then copying it into an aggregated table. Selecting a subset of historical data to build an analytical model is called sampling. The process of summarizing and visualizing data for initial insight is called exploratory analysis. Resolving a missing value involves filling the empty field or deleting the record. An outlier is a value that lies outside the population; it should be detected so that an appropriate handling process can be applied to it.
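The following minimal Python sketch shows how sampling, missing-value handling, and outlier detection might look in practice; the file name, the "amount" and "churned" columns, and the three-standard-deviation rule are assumptions made for illustration.

```python
import pandas as pd

df = pd.read_csv("transactions.csv")   # hypothetical historical data

# Sampling: keep a 10% random subset to build the analytical model on.
sample = df.sample(frac=0.10, random_state=42)

# Missing values: fill numeric gaps with the median, drop rows missing the target.
sample["amount"] = sample["amount"].fillna(sample["amount"].median())
sample = sample.dropna(subset=["churned"])

# Outlier detection: flag amounts more than 3 standard deviations from the mean.
mean, std = sample["amount"].mean(), sample["amount"].std()
outliers = sample[(sample["amount"] - mean).abs() > 3 * std]
print(f"{len(outliers)} outliers detected")
```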

Types of Analytics

After the preprocessing is finished, the analytics step starts (Figure 8.26). This step aims to extract a decision model from the preprocessed data. There are many approaches for building such a model, including predictive analytics, evaluating predictive models, descriptive analytics, and social network analytics.

Illustration of Data mining (Association analysis, Apriori/FP-growth algorithm, Classification, Decision tree, Bayesian networks, Cluster analysis, K-means/hierarchical clustering) and Machine learning (Supervised/Unsupervised learning, Regression model decision trees, Clustering association, Other, Reinforce/Deep learning).
Figure 8.26 Many techniques can be used to analyze data in order to make a business decision. (attribution: Copyright Rice University, OpenStax, under CC BY 4.0 license)

Predicting the target measure of interest using regression (e.g., linear regression, logistic regression) and classification (e.g., decision trees) is predictive analytics. Evaluating predictive models involves splitting up the dataset and applying specific performance measures. Identifying patterns of customer behavior (e.g., association rules, sequence rules, and clustering) is descriptive analytics.
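A short sketch of building and evaluating a predictive model with scikit-learn follows; the synthetic dataset stands in for something like a credit-scoring table, and the 70/30 split and AUC metric are illustrative choices, not prescriptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for a dataset with a binary target (e.g., default / no default).
X, y = make_classification(n_samples=1_000, n_features=10, random_state=0)

# Evaluating a predictive model: hold out part of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1_000).fit(X_train, y_train)
print("AUC on held-out data:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```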

Global Issues in Technology

Predictive Analytics

Predictive analytics is a set of techniques that uses data, statistics, and machine learning to make predictions about future events and/or behaviors. These predictions rely on data being accurate and unbiased to provide value. As we move toward generative artificial intelligence (GenAI) technology that makes use of existing data to create large language models, issues associated with the possibility of using biased data to train these models are becoming critical. Recent research is focusing on fairness in the decisions taken by models trained with biased data, and on designing methods to increase the transparency of automated decision-making processes so that possible bias issues may be easily spotted and “fixed” by removing bias.

Postprocessing of Analytical Models

The last step is the postprocessing step; its first activities are interpretation and validation. Business experts validate the data and detect any unknown patterns. In addition, sensitivity analysis takes place in postprocessing to verify the robustness of the created model. After that, experts approve the deployment of the model, and the production activity can start. Finally, the experts apply backtesting to be sure the model produces the correct output.

Evaluating Analytics

An analytical model should solve the business problem for which it was developed (business relevance) and should be statistically acceptable (statistical performance and validity). The analytical model should be understandable to the decision-maker (interpretability) and operationally efficient. Measuring model performance uses TCO and ROI. The total cost of ownership (TCO) represents the cost of owning and operating the analytical model over time. The return on investment (ROI) is the ratio of net profit to the investment of resources.
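A short worked illustration of the ROI calculation, using made-up figures for a hypothetical analytical model:

```python
# Hypothetical figures for an analytical model over one year.
total_cost_of_ownership = 120_000   # licenses, infrastructure, staff time
gross_benefit = 300_000             # extra revenue or savings attributed to the model

net_profit = gross_benefit - total_cost_of_ownership
roi = net_profit / total_cost_of_ownership
print(f"ROI = {roi:.0%}")           # ROI = 150%
```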

Think It Through

Machine Learning

You are given a dataset that pertains to hospital patients affected with a particular condition and you are asked to create a predictive model that could be used to assess testing new individuals for this condition.

Would you use shallow or deep machine learning to solve this problem? Explain your dataset and logic for solving the health-care problem.

Big Data Analytics Frameworks for Shallow Machine Learning

Big data analytics creates a model from a set of data, and to learn from the data, we need machine learning (ML) algorithms. This section discusses MapReduce, Hadoop framework, SQL on Hadoop, Apache Spark framework, streaming big data analytics, on-premises versus cloud solutions, and searching unstructured data and enterprise search.

MapReduce

MapReduce is a two-step computational approach for processing large (multiterabyte or greater) datasets distributed across large clusters of commodity hardware in a reliable, fault-tolerant way (Figure 8.27). The first step is distributing data across multiple computers (Map) with each performing a computation on its slice of the data in parallel. The next step combines those results in a pair-wise manner (Reduce).

Illustration of Big Data input going through Map() then Reduce(), resulting in output.
Figure 8.27 After the data input stage, Map will start then Reduce to produce the output. (attribution: Copyright Rice University, OpenStax, under CC BY 4.0 license)
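To make the two steps concrete, here is the classic word-count example as a minimal single-machine sketch in Python; it simulates the idea rather than running an actual Hadoop job, and the sample documents are made up for illustration.

```python
from collections import defaultdict
from itertools import chain

# Stand-in input slices; in a real cluster each slice lives on a different node.
documents = ["big data needs big clusters", "clusters process big data"]

# Map: each slice independently emits (word, 1) pairs.
def map_phase(doc):
    return [(word, 1) for word in doc.split()]

mapped = chain.from_iterable(map(map_phase, documents))

# Reduce: combine the pairs by key, summing the counts pair-wise.
counts = defaultdict(int)
for word, n in mapped:
    counts[word] += n

print(dict(counts))   # {'big': 3, 'data': 2, 'needs': 1, 'clusters': 2, 'process': 1}
```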

Hadoop Framework

Hadoop is a distributed data infrastructure that leverages clusters to store and process massive amounts of data. It is an open-source software framework used for distributed storage and processing of big datasets and can be set up over a cluster of computers built from normal, commodity hardware. Many vendors, such as Amazon, Cloudera, Dell, Oracle, and Microsoft, offer their own implementations of a Hadoop stack.

Hadoop leverages distribution and consists of three main components:

  • Hadoop Distributed File System (HDFS): A way to store and keep track of data across multiple (distributed) physical hard drives
  • MapReduce: A framework for processing data across distributed processors
  • Yet Another Resource Negotiator (YARN): A cluster management framework that orchestrates the distribution of CPU usage, memory, and network bandwidth allocation across distributed computers

Hadoop is built for batch rather than iterative computations. It scans massive amounts of data from disk in a single operation, distributes the processing across multiple nodes, and stores the results back on disk. Hadoop is typically used to generate complex analytics models or for high-volume data storage applications, such as:

  • retrospective and predictive analytics, which involve analyzing past data to identify trends
  • machine learning and pattern matching, which involve using algorithms to automatically identify patterns and trends in data
  • customer segmentation and churn analysis, which involve dividing customers into groups based on shared characteristics or behaviors and using this information to better understand their needs and preferences
  • active archives, which involve storing data in a way that allows it to be easily accessed and used for analytics purposes

SQL on Hadoop

Because of the complexity of expressing database queries in MapReduce, many companies created other solutions. In 2007, Hadoop included the first version of HBase as a data storage platform. HBase offers a simplified structure and query language for big data. Similar to an RDBMS, HBase organizes data in tables with rows and columns. Yahoo developed Pig, a high-level platform for creating programs that run on Hadoop; its language, Pig Latin, uses MapReduce underneath and somewhat resembles the querying facilities of SQL. Facebook developed Hive, a data warehouse solution offering SQL querying facilities on top of Hadoop. It converts SQL-like queries into a MapReduce pipeline, offers JDBC and ODBC interfaces, and can run on top of HDFS as well as other file systems.

Apache Spark Framework

MapReduce processes data in batches; therefore, it is not suitable for processing real-time data. Apache Spark is a parallel data processing tool that is optimized for speed and efficiency by processing data in memory. It operates under the same MapReduce principle but runs much faster by completing most of the computation in memory and writing to disk only when memory is full or the computation is complete.
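The same word count can be expressed with Spark's Python API. The minimal sketch below assumes a local Spark installation and uses made-up sample lines; the intermediate results stay in memory and nothing is materialized until collect() is called.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()

# Distribute the sample lines across the cluster (here, a local one).
lines = spark.sparkContext.parallelize(["big data needs big clusters",
                                        "clusters process big data"])

# Map and reduce in memory; nothing is written to disk along the way.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

print(counts.collect())
spark.stop()
```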

Streaming Big Data Analytics

Data streams come from devices, sensors, websites, social media, and applications. Streaming analytics performs an analytic process on streaming data, which is useful for real-time flow of data. There are many big data streaming analytics platforms such as Amazon Kinesis Data Firehose, which is a streaming data service to capture, process, and store data streams at any scale. The Array of Things (AoT) is an open-source network that collects and returns urban data in real time. Azure Stream Analytics is a real-time analytics event-processing engine to analyze big data streaming from multiple sources.
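As one hedged illustration of streaming analytics (using Spark Structured Streaming rather than the managed services named above), the sketch below maintains running word counts over lines arriving on a local socket; the host, port, and console output are assumptions suited to a local test, not a production setup.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

# Read an unbounded stream of text lines from an assumed local test source.
lines = (spark.readStream.format("socket")
              .option("host", "localhost").option("port", 9999).load())

# Continuously maintain word counts as new data flows in.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```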

On-Premises vs. Cloud Solutions

Another innovation that has completely transformed enterprise big data analytics capabilities is the rise of cloud services. Before cloud services were available, businesses had to buy on-premises data storage and analytics solutions from software and hardware vendors; pay up front for perpetual software license fees and annual hardware maintenance; and pay service fees along with the costs of power, cooling, security, disaster protection, and the IT staff needed to build and maintain the on-premises infrastructure. Even when it was technically possible to store and process big data, most businesses found it cost prohibitive to do so at scale. Scaling with on-premises infrastructure also requires an extensive design and procurement process, which takes a long time to implement and requires substantial up-front capital. Many potentially valuable data collection and analytics possibilities were ignored as a result. Some key benefits of cloud computing include scalability, flexibility, cost savings, reliability, and disaster recovery.

Searching Unstructured Data and Enterprise Search

Searching for information in documents using retrieval models that specify matching functions and query representations is information retrieval. Enterprises use a variety of retrieval models for intranet, web search, and analysis, such as keyword queries that use a keyword to retrieve documents, Boolean queries that use logical operators to retrieve documents, phrase queries that perform exact phrase retrieval, proximity queries that check how close multiple entities are to each other within a record, wildcard queries that support matching expressions, and natural language queries that try to formulate answers to a specific question from retrieved results. Searching unstructured data is challenging. A full-text search selects individual text documents from a collection of documents according to the presence of a single search term or a combination of search terms in the document. Indexing full-text documents is the process of adding an index entry for every search term that consists of the term and a pointer, with each pointer referring to a document that contains the term. Web search engines search a web database and gather information related to a specific term. An enterprise search is the process of making content stemming from databases and other internal sources searchable by offering tools that can be used within the enterprise.
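The indexing idea can be sketched with a toy inverted index in Python: each search term maps to pointers (here, document IDs) for the documents containing it. The three sample documents are made up, and real search engines add ranking, stemming, and many other refinements.

```python
from collections import defaultdict

documents = {
    1: "enterprise search indexes internal content",
    2: "web search engines gather information",
    3: "full text search selects documents by term",
}

# Build the inverted index: term -> set of document IDs containing the term.
index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.lower().split():
        index[term].add(doc_id)

# Keyword query: documents containing "search".
print(sorted(index["search"]))                        # [1, 2, 3]

# Boolean query: documents containing "search" AND "documents".
print(sorted(index["search"] & index["documents"]))   # [3]
```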

Cognitive Analytics and Artificial Intelligence

The technology that tries to simulate a human’s way of solving problems (e.g., Siri and Alexa) is called cognitive computing. Artificial intelligence (AI) is a system that creates intelligent ways to solve problems that previously required human interaction.

Cognitive Computing

Three components must interact to achieve AI (Figure 8.28): syntax (structure), semantics (meaning), and inference (reasoning/planning). Human-like intelligence requires that these three components rely on some form of data management, and they relate to deep learning models of brain-like functions such as restricted Boltzmann machines, stacked autoencoders, and deep belief networks. Restricted Boltzmann machines are artificial networks that can learn a probability distribution from a set of inputs. A stacked autoencoder is an artificial network used to learn efficient codings of unlabeled data. Deep belief networks are intelligent networks used to devise a solution for a specific problem when a traditional intelligent network cannot solve it.

Illustration of head with Language and Vision inside, with Semantics, Inference, and Syntax circling around.
Figure 8.28 Components of artificial intelligence include syntax, semantics, and inference. (attribution: Copyright Rice University, OpenStax, under CC BY 4.0 license)
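As a concrete taste of one of these models, here is a minimal (shallow) stacked autoencoder sketch using the Keras API mentioned later in this section; the layer sizes, random input data, and training settings are assumptions made purely for illustration.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

# Random unlabeled data standing in for real inputs (1,000 samples, 64 features).
x = np.random.rand(1000, 64).astype("float32")

# Encoder compresses the input; decoder reconstructs it (learning an efficient coding).
inputs = layers.Input(shape=(64,))
encoded = layers.Dense(32, activation="relu")(inputs)
encoded = layers.Dense(16, activation="relu")(encoded)
decoded = layers.Dense(32, activation="relu")(encoded)
decoded = layers.Dense(64, activation="sigmoid")(decoded)

autoencoder = Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(x, x, epochs=5, batch_size=32, verbose=0)   # target equals input
```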

Sample AI Applications

Amazon provides various ML services that allow developers to integrate cloud-based ML into mobile and other applications. Amazon Lex allows users to incorporate voice input into applications (an extension of Amazon’s Echo product) so that users can ask questions about everything from the weather to news and streaming music. Amazon Polly is the opposite of Lex; Polly turns text into speech in 27 languages. Amazon Rekognition, which is at the cutting edge of deep learning applications, takes an image as input and returns a textual description of the items that it sees in that image by performing detailed facial analysis and comparisons.

Reinforcement and Transfer Learning

The machine learning method based on encouraging desired behaviors and discouraging undesired behaviors is called reinforcement learning. The machine learning method based on reusing the results of one task to start a new task is called transfer learning. A deep learning network is a type of machine learning method based on artificial neural networks (Figure 8.29). Artificial neural networks simulate a network of neurons to make a computer learn and make decisions the way the human brain does (e.g., recurrent neural networks, or RNNs, are artificial neural networks that use time series data). There are various classes of deep learning frameworks and tools, such as:

  • Cloud-based deep learning frameworks such as Microsoft Cognitive Toolkit, which is an open-source toolkit for commercial-grade distributed deep learning
  • Google TensorFlow system applications for neural network computing and deep learning such as handwritten digit recognition and cognitive services
  • Predictive software libraries for cognitive applications such as Keras, which is an open-source software library that provides a Python interface for artificial neural networks
Illustration of input as purple circle, next to column of green circle, connected by arrows to blue circles, connected to a peach circle labelled Output.
Figure 8.29 The deep learning process uses artificial neural networks. (credit: modification of "NeuralNetwork" by Loxaxs/Wikimedia Commons, CC0)

Cognitive Analytics Frameworks for Deep Machine Learning

In this section, we introduce examples of cognitive analytics frameworks for deep machine learning applications such as Spark MLlib, the Amazon Machine Learning platform and MXNet, Google TensorFlow, the Azure Machine Learning platform, and the Microsoft Cognitive Toolkit.

Spark MLlib

Spark ML includes a high-level API (the Spark Machine Learning Library, or MLlib) for creating ML pipelines. Data can be fed into data frames, and the library enables the quick creation of a machine learning processing pipeline by combining transformers and estimators.
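A hedged sketch of such a pipeline with PySpark: transformers (a tokenizer and a hashing term-frequency step) feed an estimator (logistic regression). The toy labeled text data is an assumption made for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-pipeline").getOrCreate()

# Toy labeled text data loaded into a DataFrame.
train = spark.createDataFrame(
    [("spark is fast", 1.0), ("slow batch job", 0.0), ("fast in memory", 1.0)],
    ["text", "label"],
)

# Transformers and an estimator chained into a single pipeline, then fit.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)
model = Pipeline(stages=[tokenizer, tf, lr]).fit(train)

spark.stop()
```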

MXNet

MXNet is an open-source library for distributed parallel machine learning that was developed at Carnegie Mellon University, the University of Washington, and Stanford University. MXNet can be programmed with Python, Julia, R, Go, Matlab, or C++ and runs on many different platforms including clusters and GPUs; it is also now the deep learning framework of choice for Amazon.
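A brief sketch of defining and training a tiny network with MXNet's Gluon API in Python follows; the network shape, random data, and single training step are assumptions made for illustration, not a complete application.

```python
import mxnet as mx
from mxnet import nd, autograd, gluon
from mxnet.gluon import nn

# Random data standing in for one training batch (32 samples, 20 features, 2 classes).
data = nd.random.uniform(shape=(32, 20))
label = nd.array([i % 2 for i in range(32)])

# A small feed-forward network defined with Gluon.
net = nn.Sequential()
net.add(nn.Dense(64, activation="relu"), nn.Dense(2))
net.initialize(mx.init.Xavier())

loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
trainer = gluon.Trainer(net.collect_params(), "sgd", {"learning_rate": 0.1})

# One training step: forward pass, backward pass, parameter update.
with autograd.record():
    loss = loss_fn(net(data), label)
loss.backward()
trainer.step(batch_size=32)
```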

Google TensorFlow

Google’s TensorFlow is a frequently discussed and used deep learning toolkit; if you have installed the Amazon Deep Learning AMI, you already have TensorFlow installed, and you can begin experimenting right away.
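As an example of what you could run, here is a minimal handwritten-digit recognition sketch using TensorFlow's built-in Keras API and the bundled MNIST dataset; the small network and single training epoch are simplifications for illustration rather than a tuned model.

```python
import tensorflow as tf

# Load the MNIST handwritten-digit dataset bundled with Keras and scale pixels to [0, 1].
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# A small fully connected network for 10-class digit classification.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=1, verbose=0)
print(model.evaluate(x_test, y_test, verbose=0))   # [loss, accuracy]
```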

Azure Machine Learning Platform and Microsoft Cognitive Toolkit

The Microsoft Cognitive Toolkit software is available for download in a variety of formats so that deep learning examples can be run on Azure as clusters of Docker containers (i.e., multiple nodes joined using a special configuration).

Industry Spotlight

MPP Is NYSE VIP

The New York Stock Exchange (NYSE) receives 4 to 5 TB of data daily and conducts complex analytics, market surveillance, capacity planning, and monitoring. It had been using a traditional database that could not handle the workload; loading took hours, and query speed was poor. Moving to an MPP database reduced its daily analysis run time by eight hours.

Massively Parallel Processing (MPP) Databases

Similar to MapReduce, massively parallel processing (MPP) databases distribute data processing across multiple nodes, which process the data in parallel for faster speed; they are referred to as NewSQL, as opposed to NoSQL. Unlike Hadoop, MPP is used in RDBMSs and utilizes a “share-nothing” architecture. Each node processes its own slice of the data using multicore processors, making MPP databases many times faster than a traditional RDBMS. Some MPP databases, such as Pivotal Greenplum, have mature machine learning libraries that allow for in-database analytics. In an MPP system, all the nodes are interconnected, and data can be exchanged across the network.3 However, as with a traditional RDBMS, most MPP databases do not support unstructured data, and even structured data requires some processing to fit the MPP infrastructure. Therefore, it takes additional time and resources to set up the data pipeline for an MPP database. Because MPP databases are ACID-compliant and deliver much faster speeds than a traditional RDBMS, they are usually employed in high-end enterprise data warehousing solutions such as Amazon Redshift, a data warehouse cloud platform for Amazon Web Services, and Pivotal Greenplum, a big data technology based on MPP architecture combined with an open-source database.
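As a purely conceptual sketch (not tied to any particular MPP product), the snippet below illustrates the share-nothing idea: rows are hash-distributed across nodes, each node aggregates only its own slice, and the partial results are then combined. The toy trade rows and the three-node layout are assumptions.

```python
from collections import defaultdict

rows = [("NYSE", 120.5), ("NASDAQ", 98.2), ("NYSE", 130.1), ("LSE", 75.0)]  # toy rows
NUM_NODES = 3

# Distribute rows across nodes by hashing the key (share-nothing: no shared storage).
nodes = defaultdict(list)
for symbol, price in rows:
    nodes[hash(symbol) % NUM_NODES].append((symbol, price))

# Each node aggregates its own slice in parallel (simulated sequentially here) ...
partials = []
for node_rows in nodes.values():
    local = defaultdict(float)
    for symbol, price in node_rows:
        local[symbol] += price
    partials.append(local)

# ... and the partial results are merged into the final answer.
totals = defaultdict(float)
for partial in partials:
    for symbol, subtotal in partial.items():
        totals[symbol] += subtotal
print(dict(totals))
```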

Footnotes

  • 3IBM. Parallel processing technologies. Last updated March 21, 2023. Available at https://www.ibm.com/docs/en/iis/11.5?topic=topologies-parallel-processing.