Summary
8.1 Data Management Focus
- To make a decision, data should be formatted and converted to information. Knowledge comes after processing the information.
- Metadata are data about data and are stored in catalogs. The catalog provides an important source of information for end users.
- Data quality represents the measure of how well the data represents its purpose. A data quality framework categorizes the different dimensions of data quality such as intrinsic, contextual, representation, and access.
- Data governance is a set of clear roles, policies, and responsibilities that enables the enterprise to manage and safeguard data quality.
- There are various data management roles: information architect, database designer, data owner, data steward, database administrator, computer scientist, and data scientist.
- The data management road map has multiple steps, starting from collecting and storing the data to having a final product or decision.
8.2 Data Management Systems
- A database management system (DBMS) is needed to store, retrieve, edit, and maintain the related data in a database.
- A database can be defined as a collection of related data items within a specific business process or problem setting.
- A DBMS is the software package used to define, create, use, and maintain a database while considering appropriate security measures.
- There are many characteristics for DBMSs such as loose coupling, efficiency, consistency, and maintenance.
- A DBMS includes various components such as DBMS interface, connection manager, security manager, DDL compiler, query processor, storage manager and DBMS utilities.
- Logical data model categories include hierarchical DBMSs, network DBMSs, relational DBMSs, object-oriented DBMSs, XML DBMSs, and NoSQL DBMSs.
- DBMS users may be divided into actors on the scene and workers behind the scene.
- There are various types of database architectures such as centralized DBMS architecture, client server DBMS architecture, n-tier DBMS architecture, cloud DBMS architecture, federated DBMS, and in-memory DBMS.
8.3 Relational Database Management Systems
- The relational model of data is based on the mathematical concept of a relation.
- Relational database management systems (RDBMSs) are one type of DBMS that stores related data elements in a row-based table structure.
- SQL is a language for querying and managing structured data in an RDBMS. SQL is based on relational algebra, with many extensions.
- Relational algebra is a query language that uses operators to perform queries.
- The logical design is designing a database based on a specific data model but independent of physical details.
- Database normalization is the process of structuring a relational database to reduce data redundancy and improve data integrity.
- Relational database design (RDD) models data into a set of tables with rows and columns. Each row represents a record, and each column represents an attribute.
- Database tables are stored on storage media such as magnetic hard disks, flash memory, optical disks, and tapes.
- File organization and indexing are used to minimize the number of block accesses for frequent queries; the most common file organizations are sequential, relative, and indexed.
- API technologies represent database-related entities in an object-oriented (OO) way.
- Concurrency control is the coordination of transactions that execute simultaneously on the same data so that they do not cause inconsistencies due to mutual interference.
- Data replication is the storage of data in more than one site to improve the data availability and retrieval performance.
- Database recovery is the activity of setting the database in a consistent state without any data loss in the event of a failure or when a problem occurs.
- Database security uses a set of controls to secure data and guarantee a high level of confidentiality.
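Several of the ideas in this section (a normalized table design, a SQL join, and integrity enforcement) can be illustrated together. The following is only a minimal sketch using Python's built-in `sqlite3` module; the `department` and `employee` tables, their columns, and the helper functions `employees_in` and `insert_is_rejected` are hypothetical names chosen for illustration, not taken from the chapter.

```python
import sqlite3

# In-memory database; all table, column, and function names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Normalized design: department names are stored once and referenced by key,
# instead of being repeated in every employee row.
conn.executescript("""
    CREATE TABLE department (
        dept_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL
    );
    CREATE TABLE employee (
        emp_id  INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        dept_id INTEGER NOT NULL REFERENCES department(dept_id)
    );
""")
conn.executemany("INSERT INTO department VALUES (?, ?)",
                 [(1, "Sales"), (2, "Research")])
conn.executemany("INSERT INTO employee VALUES (?, ?, ?)",
                 [(10, "Ana", 1), (11, "Ben", 2), (12, "Cy", 1)])
conn.commit()


def employees_in(dept_name):
    """Join the two tables to list employee names for one department."""
    rows = conn.execute(
        """
        SELECT e.name
        FROM employee AS e
        JOIN department AS d ON d.dept_id = e.dept_id
        WHERE d.name = ?
        ORDER BY e.name
        """,
        (dept_name,),
    ).fetchall()
    return [row[0] for row in rows]


def insert_is_rejected(dept_id):
    """Try to add an employee; report whether the DBMS rejected the row."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO employee VALUES (?, ?, ?)",
                         (99, "Ghost", dept_id))
    except sqlite3.IntegrityError:
        return True
    return False
```

Calling `employees_in("Sales")` reassembles the data through a join, and `insert_is_rejected(42)` shows the DBMS refusing a row that points at a department that does not exist, which is one small way a relational system protects data integrity.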
8.4 Nonrelational Database Management Systems
- A nonrelational database is a database that does not use the tabular, row-and-column structure of relational databases for storing data.
- Flat file databases and multifile relational databases are the two main types of legacy DBMS.
- A hierarchical model is a model in which data are stored in the form of records and organized into a tree structure.
- Non-first normal form (NFNF) is a database data model that does not meet any of the conditions of database normalization defined by the relational model.
- Object persistence means that an object is not deleted until there is a need to remove it from memory.
- Persistence independence means that an object is independent of how a program manipulates it.
- The relational model has a flat structure: normalization fragments complex objects across several tables, so expensive joins are needed to reassemble the data before it can be used.
- An XML database is a data persistence system in which the data are specified and stored in XML format.
- Mapping strategies to map XML data into relational databases are table-based mapping, schema-oblivious mapping, and schema-aware mapping.
- Unstructured data are managed by key-value stores, tuple and document stores, column-oriented databases, graph-based databases, and other NoSQL databases.
- DaaS is a data management strategy that includes many technologies, such as information life cycle solutions, data modeling, replication, and content management.
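Of the XML-to-relational mapping strategies listed above, table-based mapping is the simplest to sketch. The example below assumes that the XML document already mirrors the relational structure (a `database` element containing `table` elements with `row` children); that layout, the `book` table, and the function name `load_table_based` are assumptions made for illustration. It uses only Python's standard `xml.etree.ElementTree` and `sqlite3` modules.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical XML document whose shape mirrors a relational table.
XML_DOC = """
<database>
  <table name="book">
    <row><isbn>111</isbn><title>Databases</title></row>
    <row><isbn>222</isbn><title>Data Lakes</title></row>
  </table>
</database>
"""


def load_table_based(xml_text, conn):
    """Copy each <table> element into a relational table, one <row> per record."""
    root = ET.fromstring(xml_text.strip())
    for table in root.findall("table"):
        rows = table.findall("row")
        if not rows:
            continue  # nothing to map for an empty table
        name = table.get("name")
        # Column names come from the child elements of the first row.
        columns = [child.tag for child in rows[0]]
        column_list = ", ".join(f'"{c}"' for c in columns)
        conn.execute(f'CREATE TABLE "{name}" ({column_list})')
        placeholders = ", ".join("?" for _ in columns)
        for row in rows:
            values = [row.findtext(c) for c in columns]
            conn.execute(f'INSERT INTO "{name}" VALUES ({placeholders})', values)
    conn.commit()


conn = sqlite3.connect(":memory:")
load_table_based(XML_DOC, conn)
titles = conn.execute("SELECT title FROM book ORDER BY isbn").fetchall()
```

The sketch trusts the element names in the XML and stores every value as text; a production mapping would validate names and types against a schema, which is roughly where the schema-aware strategy comes in.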
8.5 Data Warehousing, Data Lakes, and Business Intelligence
- A data warehouse is a relational database that stores processed data that are optimized for gathering business insights to support the decision-making process.
- In designing a data warehouse, many schemas can be adopted such as star schema, snowflake schema, and fact constellation.
- ETL is the data extraction, transformation, and loading process.
- A data mart is a scaled-down version of a data warehouse aimed at meeting the information needs of a homogeneous small group of end users.
- Virtualization uses middleware to create a logical or virtual data warehouse.
- An operational data store (ODS) is a staging area that provides query facilities.
- Data lakes are large data repositories that store raw data and can be set up without having to first define the data structure and schema.
- Business intelligence (BI) is the set of activities, techniques, and tools aimed at understanding patterns in past data to predict the future.
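The star schema and the ETL process summarized above can be sketched in a few lines. This is an illustrative example only, assuming a tiny hypothetical sales feed (`RAW_SALES`) and hypothetical table names (`dim_product`, `dim_date`, `fact_sales`); it uses Python's built-in `sqlite3` module rather than a real warehouse product.

```python
import sqlite3

# Hypothetical raw records as they might arrive from an operational source.
RAW_SALES = [
    {"date": "2024-01-05", "product": " widget ", "amount": "10.50"},
    {"date": "2024-01-05", "product": "Gadget", "amount": "4.00"},
    {"date": "2024-02-01", "product": "WIDGET", "amount": "2.50"},
]


def lookup_id(conn, table, id_column, key_column, key):
    """Return the surrogate key of a dimension row identified by its natural key."""
    query = f"SELECT {id_column} FROM {table} WHERE {key_column} = ?"
    return conn.execute(query, (key,)).fetchone()[0]


def build_warehouse(raw_rows):
    """Run a small extract-transform-load pass into a star schema."""
    conn = sqlite3.connect(":memory:")
    # Star schema: one fact table surrounded by dimension tables.
    conn.executescript("""
        CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, iso_date TEXT UNIQUE, month TEXT);
        CREATE TABLE fact_sales (
            product_id   INTEGER REFERENCES dim_product(product_id),
            date_id      INTEGER REFERENCES dim_date(date_id),
            amount_cents INTEGER
        );
    """)
    for row in raw_rows:  # extract: read each source record
        # Transform: normalize product names and convert amounts to integer cents.
        name = row["product"].strip().lower()
        iso_date = row["date"]
        cents = round(float(row["amount"]) * 100)
        # Load: add dimension rows if missing, then the fact row.
        conn.execute("INSERT OR IGNORE INTO dim_product (name) VALUES (?)", (name,))
        conn.execute("INSERT OR IGNORE INTO dim_date (iso_date, month) VALUES (?, ?)",
                     (iso_date, iso_date[:7]))
        product_id = lookup_id(conn, "dim_product", "product_id", "name", name)
        date_id = lookup_id(conn, "dim_date", "date_id", "iso_date", iso_date)
        conn.execute("INSERT INTO fact_sales VALUES (?, ?, ?)",
                     (product_id, date_id, cents))
    conn.commit()
    return conn


def sales_by_product(conn):
    """Typical BI-style query: total sales (in cents) per product."""
    return conn.execute("""
        SELECT p.name, SUM(f.amount_cents)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_id = f.product_id
        GROUP BY p.name
        ORDER BY p.name
    """).fetchall()


warehouse = build_warehouse(RAW_SALES)
totals = sales_by_product(warehouse)
```

The transform step is where inconsistent source values such as " widget " and "WIDGET" are reconciled, so that the aggregated query treats them as one product.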
8.6 Data Management for Shallow and Deep Learning Applications
- Data integration aims to provide a unified view and/or unified access over heterogeneous, and possibly distributed, data sources.
- Big data encompasses both structured and highly unstructured forms of data.
- The scope of big data has five Vs: Volume, Velocity, Variety, Veracity, and Value.
- Data virtualization is a technique that hides the physical location of the data and uses data integration patterns to produce a unified data view.
- Data quality involves various criteria to assess the quality of a dataset.
- The aim of data governance is to set up a company-wide controlled and supported approach toward data quality that is accompanied by data quality management processes.
- An analytics process model includes preprocessing, analytics, and postprocessing.
- Big data analytics creates a model from a set of data; to learn from the data we need machine learning algorithms.
- Streaming analytics performs analytic processes on streaming data.
- Cognitive computing is a technology that tries to simulate the way humans solve problems.
- Artificial intelligence is a system that creates intelligent ways to solve problems that previously required human interaction.
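The three stages of the analytics process model, applied to streaming data, can be sketched with Python generators. This is a toy illustration under stated assumptions: the stage functions `preprocess`, `running_mean`, and `postprocess`, the sample `readings`, and the alert `threshold` are all hypothetical, and the running mean stands in for whatever analytic a real system would compute.

```python
def preprocess(stream):
    """Preprocessing: drop missing readings and convert the rest to floats."""
    for value in stream:
        if value is not None:
            yield float(value)


def running_mean(stream):
    """Analytics: update the mean incrementally as each value arrives."""
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        mean += (value - mean) / count  # incremental update, no history kept
        yield mean


def postprocess(means, threshold):
    """Postprocessing: pair each mean with a flag for values above the threshold."""
    for mean in means:
        yield (mean, mean > threshold)


# Hypothetical sensor readings, including one missing value.
readings = [2, None, 4, 6]
result = list(postprocess(running_mean(preprocess(readings)), threshold=2.5))
```

Because each stage is a generator, values flow through one at a time and the analytic never needs the whole data set in memory, which is the essential constraint of streaming analytics.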
8.7 Informatics and Data Management
- Informatics describes the study, design, and development of information technology for the good of people, organizations, and society.
- Information systems clearly support informatics and provide an organizational context for using database systems to help collect, organize, store, analyze, preserve, retrieve, and govern data and records relevant to an organization.
- Information systems provide resources involved in collection, management, use, and dissemination of information resources of organizations.
- An information system life cycle includes feasibility analysis, requirements collection and analysis, design, implementation, and validation and acceptance testing.