Learning Objectives
By the end of this section, you will be able to:
- Discuss data management and its relation to computer science and data science
- Understand that data are the backbone of the industry
- Identify and explain key concepts in data management
- Distinguish various roles in the field of data management
- Explain the current state of data management
In the current digital world, collecting information and facts that are stored digitally by a computer, or data, is a straightforward process. There are many direct and indirect channels, such as social media, that support the data collection process. A large amount of data is collected every day, every hour, and every minute. But this amount of data is not useful unless benefits can be drawn from it, which requires knowing what to do with these data. In this context, data are like "crude oil" that must be stored digitally in order to allow the extraction of information and knowledge via information and knowledge management systems. Related structured data are stored in a database management system. The resulting knowledge supports the decision-makers in any organization in making important decisions, such as increasing sales in certain regions, changing suppliers within the supply chain, decreasing the manufacture of a specific product, modifying the hiring process, studying the community culture, and identifying customer demand. It is clear that proper data management is required to support decision-makers. The study of managing data effectively, or data management, treats data as a corporate asset, similar to physical assets such as plants and equipment. Data management requires a strategy for analyzing data: collecting the data, storing the data, cleaning the data, preprocessing the data, and preparing the data for analytics, which ultimately leads to decision-making.
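The collect, store, clean, preprocess, and prepare steps just described can be thought of as a pipeline. The following is a minimal sketch of such a pipeline in Python using pandas; the file sales.csv and the columns region and units_sold are hypothetical placeholders, not a specific system.

```python
import pandas as pd

# Collect: load raw data (here, from a hypothetical CSV export).
raw = pd.read_csv("sales.csv")

# Clean: drop exact duplicates and rows missing the key measurement.
clean = raw.drop_duplicates().dropna(subset=["units_sold"]).copy()

# Preprocess: normalize a text field so that values are consistent.
clean["region"] = clean["region"].str.strip().str.lower()

# Prepare for analytics: aggregate to the level decision-makers need.
summary = clean.groupby("region")["units_sold"].sum()
print(summary)  # e.g., input to a decision such as boosting sales in a region
```

The aggregated summary stands in for the analytics-ready output that ultimately supports a decision.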
Concepts In Practice
Managing Your Social Media Data
Facebook, Instagram, LinkedIn, Snapchat, TikTok, X, and many other social media applications control your data. You are most likely posting too much information about your life, work, and school, but it is not only you: everyone is sharing data on social media. A supermassive "Mother of all Breaches" (MOAB) disclosed in early 2024 contains data from numerous previous breaches, comprising 12 TB of user data from LinkedIn, X (formerly Twitter), Weibo, Tencent, and other platforms.
Have you ever asked yourself how service providers manage this data? How do they store the data? Where do they store it? Which databases are they using? And the big question: can we benefit from this data and manage our own data? Meta offers the option to control your Facebook experience by managing your data; all you need to do is complete a form, which gives you multiple options such as downloading your data, managing ad preferences, and removing a tag from a post.1
Data Management in Computer and Data Science
Computer science is the study of computing algorithms that receive data as inputs, process the data, and produce outputs. Computer science applies the principles of mathematics, physics, and engineering. Hardware, software, operating systems, and applications work together to perform computer science computations.
Data science is the study of extracting knowledge from information using hypothesis analysis and algorithms. Information is the result of processing raw data. Figure 8.2 illustrates the skills and knowledge needed to be successful in data science.
Data Management in the Industry
Data are the backbone of any enterprise because analyzing data improves performance and increases profit. Gathering raw data from various sources and processing it for modeling makes up about 80% of a data scientist's work.2 Enterprise projects usually involve a massive amount of data, which often makes it impractical to store locally, so many businesses move their data to the cloud. A business may also choose to run applications as well as databases and operating systems in the cloud. Data management is often handled by a separate data engineering team because most computer and data scientists do not have deep knowledge of data storage and infrastructure.
Industry Spotlight
Industry Data Management
Data management is important in every industry today. The main benefit of data management is to minimize potential errors by controlling access to data using a set of policies.
Elaborate on how useful it is to know about data management in retail industries for product marketing purposes. (Hint: Data management can help with targeted advertising, for example.)
Data Management Aspects
There are various aspects of effective data management: metadata cataloging, metadata modeling, data quality (data accuracy, data completeness, and data consistency), and data governance, which we will discuss in more detail in the following subsections.
Metadata Cataloging
Metadata are used in a database management system (DBMS), which is a system that creates, stores, and manages a set of databases. In a DBMS approach, metadata are stored in a catalog called a data dictionary. A data dictionary is a set of information describing the content, format, and structure of a database (e.g., in a relational DBMS, the catalog includes the names of all available tables and their associated fields). The metadata catalog constitutes the heart of the database system and can be part of a DBMS or a stand-alone component. Metadata cataloging is the process of collecting data about processes, people, products, and other enterprise-related data; it provides an important source of information for end users, application developers, and the DBMS itself. The catalog typically provides an extensible metadata model, import/export facilities, support for maintenance and reuse of metadata, monitoring of integrity rules, facilities for user access, and statistics about the data and their usage for the database administrator and query optimizer. Metadata cataloging improves the user experience, adds competitive advantages, and improves the efficiency of the business.
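As a concrete illustration, SQLite exposes its catalog through the built-in sqlite_master table and the table_info pragma, which together play the role of a small data dictionary. The following is a minimal sketch; the employee table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)")

# List every table the catalog knows about.
for (table,) in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'"):
    # PRAGMA table_info returns one metadata row per column:
    # (cid, name, type, notnull, default_value, pk)
    for cid, name, ctype, notnull, default, pk in conn.execute(f"PRAGMA table_info({table})"):
        print(f"{table}.{name}: {ctype} (primary key: {bool(pk)})")
```

Here the catalog is queried exactly like ordinary data, which is what makes it useful to end users, application developers, and the DBMS itself.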
Metadata Modeling
Metadata modeling is the business representation of metadata. A database design process can be used to design a conceptual model of a database storing metadata, such as an enhanced entity–relationship (EER) model (Figure 8.3) or a unified modeling language (UML) model. EER and UML are used to create models for designing large systems, and metadata modeling may cover various views of these models. For example, EER models and UML class diagrams help express views of a data model. While EER focuses purely on data modeling, the UML notation may be used to express comprehensive models of information systems using additional diagrams.
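A conceptual model can also be sketched directly in code. The following is a minimal sketch, loosely mirroring a UML class diagram with two hypothetical entity types (Customer and Order) and a one-to-many relationship between them; it is an illustration, not a modeling tool.

```python
from dataclasses import dataclass, field

@dataclass
class Order:
    order_id: int
    amount: float

@dataclass
class Customer:
    customer_id: int
    name: str
    # One-to-many relationship: a customer places many orders.
    orders: list[Order] = field(default_factory=list)

alice = Customer(customer_id=1, name="Alice")
alice.orders.append(Order(order_id=100, amount=59.99))
```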
Data Quality
The measure of how well data represent their purpose, or fitness for use, is called data quality (DQ). Data of acceptable quality in one decision context may be perceived to be of poor quality in another. Data quality determines the intrinsic value of the data to the business. Businesses often invoke the concept of garbage in, garbage out (GIGO), which means that the quality of the output is determined by the quality of the input. Poor DQ impacts organizations in many ways and at different management levels. For example, it negatively impacts day-to-day operations, and at the strategic level it makes a big difference in decision-making. DQ is a multidimensional concept in which each dimension represents a single aspect, such as a view, criterion, or measurement. DQ comprises both objective and subjective perspectives, and different dimensions of data quality can be categorized accordingly. The DQ framework has four categories: intrinsic DQ, contextual DQ, representation DQ, and access DQ, as illustrated in Figure 8.4.
In the intrinsic DQ category, data accuracy refers to whether the data values stored for an object are the correct values, and it is often correlated with other DQ dimensions. Data reliability can be counted as part of data accuracy. The degree to which all data in a specific dataset are available, with a minimum percentage of missing data, is called data completeness. It can be viewed from at least three perspectives: schema completeness (the degree to which entity types and attribute types are missing from the schema), column completeness (the degree to which values are missing in a column of a table), and population completeness (the degree to which the necessary members of a population are present). The data consistency dimension is part of the representation category and can also be viewed from several perspectives: consistency of redundant or duplicated data in one table or across multiple tables, consistency between two related data elements, and consistency of format for the same data element used in different tables. The accessibility dimension is part of the access category and reflects the ease of retrieving the data from the underlying data sources; this often involves a trade-off with security, which is also part of the access category.
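Two of these dimensions can be measured directly. The following is a minimal sketch that computes column completeness and checks format consistency for one data element; the records and the NNN-NNNN phone format are hypothetical.

```python
import re

records = [
    {"id": 1, "email": "a@example.com", "phone": "555-0101"},
    {"id": 2, "email": None,            "phone": "5550102"},
    {"id": 3, "email": "c@example.com", "phone": None},
]

# Column completeness: the share of non-missing values in each column.
for col in ("email", "phone"):
    present = sum(1 for r in records if r[col] is not None)
    print(f"{col} completeness: {present / len(records):.0%}")

# Format consistency: do all phone values follow the same NNN-NNNN pattern?
phones = [r["phone"] for r in records if r["phone"] is not None]
print("phone format consistent:", all(re.fullmatch(r"\d{3}-\d{4}", p) for p in phones))
```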
There are many common causes of bad data quality, such as the following:
- computer processing
  - duplication: multiple data sources providing the same data, which may produce duplicates
  - consistency problem: different occurrences of data or incorrect data
  - objectivity problem: data giving different results in every process
  - limited computing resources: insufficient computing resources limiting the accessibility of relevant data
  - accessibility problem: large volumes of stored data making it difficult to access needed information in a reasonable time
- human intervention
  - biased information: using human judgment in data production
  - relevance problem: different processes updating the same data
  - data quality problems: decoupling of data producers and consumers
Data Governance and Compliance
The set of clear roles, policies, and responsibilities that enables a business to manage and safeguard data quality using internally set rules and policies is called data governance. The related concept that ensures that data practices align with external legal requirements and industry standards is called data compliance. For example, the UK-GDPR (General Data Protection Regulation) is the United Kingdom's data protection regulation; modeled after the EU-GDPR, it governs and regulates how UK organizations and businesses collect, store, use, and process personal data.
Using data governance, data are managed as an asset rather than a liability. Data governance has three dimensions: technology, people, and process. It should include standard roles for quality, security, and ownership.
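In code, an internal governance policy can be enforced at the point where data leave the organization. The following is a minimal sketch, assuming a hypothetical policy that classifies name and email as personal data to be masked before sharing.

```python
# Fields classified as personal data by a hypothetical internal policy.
PERSONAL_DATA_FIELDS = {"name", "email"}

def apply_policy(record: dict) -> dict:
    """Return a copy of the record with personal-data fields masked."""
    return {k: ("***" if k in PERSONAL_DATA_FIELDS else v) for k, v in record.items()}

print(apply_policy({"id": 7, "name": "Dana", "email": "dana@example.com", "plan": "pro"}))
# {'id': 7, 'name': '***', 'email': '***', 'plan': 'pro'}
```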
In planning for data governance, we should answer four questions: what, how, why, and who. What policies should we include? How do we integrate the policies with the enterprise business process? Why do we need this policy? Who will be part of this policy?
Different frameworks have been introduced for data quality management and data quality improvement: Total Data Quality Management (TDQM), Total Quality Management (TQM), Capability Maturity Model Integration (CMMI), ISO 9000, Control Objectives for Information and Related Technology (COBIT), Data Management Body of Knowledge (DMBOK), Information Technology Infrastructure Library (ITIL), and Six Sigma. These frameworks provide organizations with guidelines that define how a product or process should meet high-quality standards. The main issue with these frameworks is that they may fail when the system does not consider all of the processes in the correct flow.
As a short-term solution, it is possible to annotate the data with data quality metadata. Unfortunately, many data governance efforts, where they exist at all, are reactive and ad hoc.
Global Issues in Technology
Hacked!
Chegg is an educational technology company based in California. It offers various online services such as textbook rentals, tutoring, homework assistance, and more to students around the country. In April 2018, an unauthorized party gained access to Chegg’s database. This database hosted user information for both Chegg and its affiliated companies (e.g., EasyBib). The hacked information included names, emails, passwords, and addresses, but did not include any financial information or social security numbers. After discovering the breach, Chegg implemented plans to notify the 40 million affected users. The motivation for the attack is still unclear, but it is likely that a third party sought information to profit from identity theft.
What are some things that Chegg users can do to protect themselves from this hack and future breaches of personal information?
Data Management Roles
Within any organization, there are various data management roles including information architect, database designer, data owner, data steward, database administrator, computer scientist, and data scientist. Their roles are outlined in the following sections.
Information Architect
An information architect, also known as a data architect or information analyst, is responsible for designing the conceptual data model (blueprints) to bridge the gap between the business processes and the IT environment.
The information architect collaborates with the database designer who may assist in choosing the type of conceptual data model (e.g., EER or UML) and the database modeling tool. Figure 8.5 shows an example of different database systems that various personnel working in data management may encounter.
Database Designer
A database designer is responsible for creating, implementing, and maintaining the database management system. Other responsibilities include translating the conceptual data model into a logical and internal data model, assisting the application developers in defining the views of the external data model, and defining company-wide uniform naming conventions when creating the various data models.
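To illustrate the translation step, the following minimal sketch turns the hypothetical Customer-Order conceptual model from earlier into a logical relational schema, with the one-to-many relationship realized as a foreign key and an external view defined for application developers. The table and view names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE customer_order (
    order_id    INTEGER PRIMARY KEY,
    amount      REAL NOT NULL,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id)
);
-- An external view for application developers: orders joined to customer names.
CREATE VIEW order_summary AS
    SELECT c.name, o.order_id, o.amount
    FROM customer c JOIN customer_order o ON c.customer_id = o.customer_id;
""")
```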
Data Owner
The data owner has the authority to ultimately decide on the access to, and usage of, the data. The data owner could be the original producer of the data, or the data could originate from a third party. A person who assumes the role of a data owner should be able to insert, edit, delete, or update data as well as check or populate the value of a field. The data owner is the one responsible for checking the quality of one or more datasets.
Data Steward
A data steward is a DQ expert who ensures that the enterprise's actual business data and the metadata are accurate, accessible, secure, and safe. The data steward performs extensive and regular data quality checks, initiates corrective measures or deeper investigation into root causes of data quality issues, and helps design preventive measures (e.g., modifications to operational information systems, integrity rules).
Database Administrator
The database administrator (DBA) is responsible for the implementation and monitoring of the database, as well as ensuring that databases run efficiently by collaborating with network and system managers.
Computer Scientist
A computer scientist is a person who has theoretical and practical knowledge of computer science. The computer scientist will solve problems using technology by applying computer science practices. Computer scientists typically focus on building end-to-end solutions for companies. For example, computer scientists can create software applications that implement complex algorithms and store and retrieve related data into and from database systems. Computer scientists may also be involved in designing and fine-tuning complex machine learning algorithms.
Data Scientist
Data scientists typically focus on creating classification and prediction models by training existing machine learning algorithms. The resulting trained models act as programs to help classify new data or predict an outcome based on input data. In general, a data scientist is a person who has theoretical and practical knowledge of managing data. A data scientist’s background combines computer science, mathematics, and statistics. A person in this role is responsible for gathering a large amount of structured, semistructured, and unstructured data to preprocess and prepare data for advanced data analysis to develop a product or to make a business decision.
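The following is a minimal sketch of that workflow: an existing machine learning algorithm (logistic regression, here from scikit-learn) is trained on a tiny hypothetical dataset, and the resulting model classifies a new data point.

```python
from sklearn.linear_model import LogisticRegression

# Features: [hours of product use per week, support tickets filed]
X = [[1, 5], [2, 4], [8, 0], [9, 1], [7, 1], [1, 6]]
y = [0, 0, 1, 1, 1, 0]  # 1 = customer renewed, 0 = customer churned

model = LogisticRegression().fit(X, y)      # train an existing algorithm
print(model.predict([[6, 1]]))              # classify a new customer
```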
Data Management Road Map
The data management road map has multiple steps, starting from collecting and storing data and ending with a final product or decision. Available databases vary in type and vendor (e.g., SQL, Hadoop, Spark, and MongoDB). Storing the data in databases is not hard, but the entire data management process most often takes place in the cloud, which adds a new set of necessary skills for computer scientists and data scientists. Massive data growth has transformed the way data are stored and analyzed, and many applications and databases are hosted on servers in remote data centers. While preparing data, data scientists and computer scientists should clean and format the data so that they can be used correctly for the intended purpose, such as marketing. We will discuss database types in more detail together with their data structure implementations (refer to 8.2 Data Management Systems).
Technology in Everyday Life
End-to-End Data Management
End-to-end data management covers the data life cycle within the system. A data life cycle is a process that helps the organization to manage the flow of data, and it includes creating the data, storing the data, and sharing the data.
A global positioning system (GPS) is an embedded system that mainly uses data to provide routes and destinations. Most of us use a GPS to check on road construction, traffic, or to find the shortest route.
How does knowledge of end-to-end data management help people in their everyday life? Consider these data from all aspects of data management, including collecting, storing, and analyzing them. Provide a few illustrative scenarios to explain your opinion.
Footnotes
- 1R. E. G. Beens, “The privacy mindset of the EU vs. the US,” Forbes. Updated April 14, 2022. https://www.forbes.com/sites/forbestechcouncil/2020/07/29/the-privacy-mindset-of-the-eu-vs-the-us/?sh=215a8a127d01
- 2ProjectPro, “Why data preparation is an important part of data science?” Updated April 11, 2024. https://www.projectpro.io/article/why-data-preparation-is-an-important-part-of-data-science/242.