Foundations of Information Systems

3.1 Data Types, Database Management Systems, and Tools for Managing Data


Learning Objectives

By the end of this section, you will be able to:

  • Define data and differentiate between types of data
  • Identify tools and techniques for managing data with database systems
  • Determine how to gather, organize, curate, and process data
  • Compare and contrast database management systems

Understanding the fundamentals of data is important for effective management. Data can be categorized into structured, semistructured, and unstructured forms, each requiring different tools and techniques to collect and analyze. Effective data management involves collecting, storing, and processing data using tools like a database management system (DBMS).

Identifying the business problem of an organization involves understanding the organization’s goals and opportunities through comprehensive needs analysis and stakeholder engagement, which involve collecting and analyzing data. By leveraging well-managed data, organizations can gather accurate user requirements to ensure their solutions align with the end users’ needs and expectations. These solutions are often built on robust database design and management practices, which are essential for creating scalable and efficient data systems. Careful planning, structuring, and management of data ensure databases support current and future needs. Effective database management includes regular tasks such as backup and recovery of data, performance optimization, and security enforcement. Additionally, as organizations continue to develop their data capabilities, the need for reliable mobile database development has become more prominent.

A mobile database is a database used on mobile devices, such as a smartphone or tablet, to store and access data. It must be lightweight to save storage, use less power, and work efficiently on limited mobile device resources. A cloud-based database is a database system stored and managed online. It allows the user to access it through the internet instead of keeping it on a single device. The development and integration of these database types have revolutionized how organizations store, manage, and analyze data, offering scalability, flexibility, and cost-effectiveness. Integrating mobile and cloud databases connects mobile devices with centralized storage, making it easy to sync, access, and manage data across platforms. This creates a smoother and more efficient way to handle data for businesses.

Types of Data

Understanding how to manage data is important for any organization in the digital world. Data, as you learned in 1.1 Introduction to Information Systems, consist of any information that can be processed or analyzed to gain insights and make decisions. Data come in two main categories: line of business data and customer behavior data. Line of business data consists of the information generated from daily business operations, such as financial records, day-to-day operations, inventory processes, and supply chain details. These data are important for running the business efficiently. Customer behavior data consists of information collected about how customers interact with the company’s products or services. This includes a customer’s frequent transactions, purchase history, browsing patterns, social media interactions, and feedback.

These types of data can appear in various forms, such as text, voice, images, and video. Text data include emails, documents, news, and social media posts. Voice data come from customer service calls or voice-activated devices. Image data include photos and scanned documents, while video data consist of recordings from security cameras, social media, or marketing videos. There are three types of data: structured, semistructured, and unstructured (Table 3.1). Data that are highly organized and easily searchable are called structured data; structured data are found in spreadsheets, tables, or databases. Another type is semistructured data, which have some organization but do not fit neatly into tables. Two examples of semistructured data are extensible markup language (XML), which is a way to organize data using tags so that both people and computers can read it, and JavaScript Object Notation (JSON), a format that transmits and stores data using human-readable text. JSON can be used, for example, to send information from a web application to a database when migrating data. Finally, unstructured data lack a predefined structure and require advanced techniques for analysis; examples include emails, videos, and social media posts.
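
To make this concrete, here is a small, hypothetical JSON document for a customer record; the field names are invented for illustration:

    {
      "customerId": 342678,
      "name": "Chris Assam",
      "country": "USA",
      "orders": [
        { "orderId": 101, "items": ["Shirt", "Shoes"], "totalPrice": 75 }
      ]
    }

The tags (field names) give the data some structure, but nothing requires every customer record to contain the same fields, which is why such data are considered semistructured rather than structured.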

Structured data
  • Definition: Highly organized and formatted data, easily searchable in databases
  • Examples: SQL databases, spreadsheets
  • Storage: Stored in tables with rows and columns
  • Ease of access: Easy to query
  • Scalability: Moderate; depends on the database system
  • Usage: Financial records, inventory systems
  • Tools: Relational database management systems
  • Performance: High for structured queries
Semistructured data
  • Definition: Partially organized data with some structure; tags often used to separate data elements
  • Examples: XML, JSON, email
  • Storage: Stored in formats that contain tags
  • Ease of access: Moderately complex
  • Scalability: High; designed to handle large volumes of data
  • Usage: Web data, configuration files
  • Tools: NoSQL databases, XML and JSON parsers
  • Performance: Moderate; depends on the complexity of the structure
Unstructured data
  • Definition: Data that lack any specific structure, making them difficult to search and analyze directly
  • Examples: Text documents, images, videos, social media posts
  • Storage: Stored as binary or text files, often in large volumes
  • Ease of access: Highly complex
  • Scalability: High, but requires significant resources
  • Usage: Multimedia content, big data analytics, social media
  • Tools: Big data platforms (Apache Hadoop, Spark)
  • Performance: Variable; depends on the tools and methods used for analysis
Table 3.1 Comparison of Structured, Semistructured, and Unstructured Data Understanding the different types and forms of data is crucial to effectively manage and use the information to drive business decisions and strategies.

In 2006, British mathematician and data scientist Clive Humby compared data in the digital age to oil in the past, highlighting the crucial role of data in organizational growth.1 Just as oil powered the industrial age, data fuels the integration of digital technology into nearly all business, production, and industry processes; this integration is known as Industry 4.0. Effective use of data is vital to the successful operation of modern organizations. Data help in understanding market trends, customer behaviors, and organizations’ internal processes, which is essential for making informed decisions, improving efficiency, enhancing customer experiences, and fostering innovation.

Managing Data in a Database Management System

A database management system (DBMS) is a software system that manages, stores, and processes data, ensuring the data are organized, accessible, and secure. For example, a large hospital network uses a DBMS to streamline patient care and administrative tasks and support operations such as the following:

  • Electronic health records (EHRs): Patient medical histories, prescriptions, lab results, and imaging files are stored in a centralized database, allowing doctors to access up-to-date information instantly.
  • Appointment scheduling: Doctor availability and patient schedules are tracked to avoid conflicts and reduce wait times.
  • Billing and insurance processing: A DBMS automates billing, tracks payments, and verifies insurance claims efficiently.
  • Real-time alerts: A DBMS sends reminders for follow-up visits or medication refills and flags potential drug interactions or critical lab results for immediate review.

When you request a ride from a company like Uber or Lyft, or make a purchase at a grocery store, a DBMS is working in the background to handle all the information needed to make things run smoothly. A ride-sharing app collects information like the following:

  • your current location and where you want to go
  • available drivers nearby
  • traffic conditions to calculate the best route and estimated time
  • payment details to process the fare

After the ride, the data don’t just sit there. They’re used to

  • figure out where and when ride requests are most common,
  • help drivers be in the right place at the right time, and
  • improve routing and reduce delays.

By organizing and analyzing data effectively, a DBMS helps businesses make better decisions, save money, and improve customer experiences.

These systems organize and store large amounts of data efficiently, allow easy access and querying of data, ensure data accuracy, control data access, handle data transactions, guarantee data reliability, support large-scale data operations, ensure high performance, and provide mechanisms to restore data in case of failures.

Managing data effectively is essential for organizations, and database systems provide a structured environment to store, retrieve, and manipulate data efficiently. Various tools and techniques are used to manage data within these systems, each catering to different types of data and requirements. At the core of effective data management are two fundamental issues: how to use a database and how to build a database. Addressing these issues involves a variety of tools and techniques that are designed to ensure a high quality of data management.

The tools and techniques for using a database include the following:

  • Structured Query Language: Structured Query Language (SQL) is the standard language used to query and manage data in relational databases. It allows users to perform a wide range of operations such as SELECT, INSERT, UPDATE, DELETE, and JOIN to handle data efficiently. For example, in a ride-sharing app like Uber or Lyft, when a user requests a ride, the app’s database must quickly identify nearby drivers (a SQL sketch of this lookup, and of the indexing example below, appears after this list).
  • Database management systems: As mentioned previously, DBMSs manage data storage, and software like MySQL, PostgreSQL, Oracle Database, and Microsoft SQL Server provide the interface and tools necessary for database interaction.
  • Database access tools: A database access tool, such as SQL Server Management Studio, provides graphical user interfaces (GUIs) to facilitate database interaction without writing extensive code.
  • Indexing: Creating indexes on database tables improves the speed of data retrieval operations. Indexes are especially useful for large databases where query performance is critical. For example, in social media platforms like Instagram or X, millions of posts are created every second. When a user opens the app, the database must quickly retrieve the most recent posts from accounts they follow. An index on the user_id column ensures quick filtering of posts by the accounts that users follow. Another index on the created_at column allows the database to instantly sort posts by the most recent time stamps. This allows the app to load the user’s feed almost instantly, even with billions of posts in the database, providing a seamless user experience.
  • Transactions: Data consistency and integrity are protected by adhering to ACID (atomicity, consistency, isolation, and durability) properties. These ACID characteristics ensure that a transaction is all-or-nothing (atomicity), keeps the database accurate and adhering to rules (consistency), prevents transactions from interfering with each other (isolation), and guarantees that once a transaction is complete, it stays saved even if the system crashes (durability).
  • Backup and recovery: Regular backups and effective recovery plans are vital for data protection. Tools like mysqldump, pg_dump, and RMAN (Oracle Recovery Manager) help in creating backups and restoring data when necessary.
  • User access management: Managing user permissions and roles ensures that only authorized users can access or manipulate the data. The DBMSs provide tools to define user roles and privileges.
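
To make the SQL and indexing bullets concrete, here is a minimal sketch; the table and column names (drivers, posts, follows, user_id, created_at) are invented for illustration, and a real ride-sharing service would use spatial indexes rather than a simple bounding box:

    -- Find available drivers near the rider (bounding-box approximation)
    SELECT driver_id, latitude, longitude
    FROM drivers
    WHERE status = 'available'
      AND latitude BETWEEN 40.70 AND 40.80
      AND longitude BETWEEN -74.05 AND -73.95;

    -- Indexes that let a social feed filter by author and sort by recency
    CREATE INDEX idx_posts_user_id ON posts (user_id);
    CREATE INDEX idx_posts_created_at ON posts (created_at);

    -- Feed query that benefits from both indexes
    SELECT post_id, content, created_at
    FROM posts
    WHERE user_id IN (SELECT followed_id FROM follows WHERE follower_id = 42)
    ORDER BY created_at DESC
    LIMIT 20;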

Building a database involves two steps—design and implementation—to ensure that a structure can efficiently store and manage data. The database design involves creating an entity relationship diagram (ERD), which is a visual representation of the database structure, including entities, attributes, and relationships. Entities are the things we are keeping data about, like “students” or “classes.” Attributes are the details about an entity, like a student’s name or date of birth. Relationships show how entities are connected, like a student being enrolled in a class.

A technique used in the design process is normalization, in which data are organized and stored only once to eliminate the duplication of data. This involves dividing large tables into smaller, related tables and defining their relationships, which helps ensure data consistency and integrity and reduces redundancy. Tools like Lucidchart can help facilitate the design process. Another important step in the design process is defining the database schema, which specifies the structure of tables, including columns and data types. A data modeling tool like Microsoft Visio can assist in creating and visualizing the database schema, helping to design both the logical and physical aspects of the database. Finally, it is important to implement data security measures such as encryption to protect sensitive data. A DBMS provides features and tools to enforce data security policies, ensuring that data remain safe and compliant with regulations. Data security policies in a DBMS can come from both the organization and federal regulations. Organizational policies are rules set by the company, like allowing only human resources to access salary records. Federal policies are legal requirements, like the Health Insurance Portability and Accountability Act (HIPAA), which protects patient health information.

Database Types

Data storage is fundamental to how information is organized and accessed in computing environments. Here are some key types of databases:

  • A relational database stores data in tables with rows and columns, making it ideal for structured data. Each table contains data about a specific type of entity, and tables can be linked using keys (MySQL is an example).
  • A NoSQL database (Not Only SQL) does not use the traditional table structure of an SQL database. Instead, it stores data in flexible formats like documents and is designed to handle a wide variety of data types and structures. It can manage large volumes of unstructured or semistructured data. NoSQL databases are useful for applications that require high flexibility, such as real-time web applications.
  • A data warehouse integrates data from various sources and stores it in a combined manner (Amazon Redshift is an example). It is a specialized database optimized for analysis and reporting rather than transaction processing. A data warehouse is designed to perform complex queries and data analysis quickly and efficiently.
  • A data lake stores large amounts of raw data in their original format until the data are needed. A data lake can handle structured, semistructured, and unstructured data (examples include Apache Hadoop and Amazon S3). A data lake is particularly useful for big data analytics.

An important database process is data retrieval, which involves obtaining specific information from a database or storage system. It queries the database using methods such as SQL in relational databases to extract needed data quickly and efficiently. The purpose of data retrieval is to access information for analysis, reporting, or decision-making, making it an essential aspect of effective data management. A technique used to improve the speed of data retrieval operations in a database is indexing. By creating an index, a database can quickly locate and access data without having to scan the entire table. The most common type of index is the B-tree index, which maintains a balanced tree structure (a tree structure where all branches are kept roughly the same height), providing efficient insertion, deletion, and lookup operations.

Another type is the hash index, which uses a hash function to map data to a fixed-size table. Hash indexes are ideal for equality searches but are less efficient for range queries (Table 3.2). For example, a hash index works well when searching for something specific, like “Find Customer ID = 342678,” because it can quickly locate that exact ID. But it’s not as good for tasks like “Find all customers with IDs between 2000 and 2200,” since hash indexes do not keep data in order, making range searches slower.
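
A brief sketch of the difference, written in PostgreSQL-style SQL against the hypothetical customers table from Table 3.2 (hash-index syntax varies by DBMS):

    -- Hash index: fast for exact-match (equality) lookups only
    CREATE INDEX idx_customers_hash ON customers USING HASH (number);
    SELECT * FROM customers WHERE number = 342678;               -- efficient

    -- B-tree index: keeps keys in sorted order, so ranges work too
    CREATE INDEX idx_customers_btree ON customers (number);
    SELECT * FROM customers WHERE number BETWEEN 2000 AND 2200;  -- efficient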

Number First Name Last Name Country
342678 Chris Assam USA
675309 Taylor Tan Australia
649568 Anthony Ray Mexico
Table 3.2 Customer Data in a Hash Index Hash indexing can map values to locations in tables but is inefficient when looking up range values.

A bitmap index uses 0s and 1s to show where a value is in a database. It’s great for columns with only a few options, like “Yes/No” or “Rent/Own/Neither,” and facilitates quick searching. Bitmap indexes are efficient for columns with a limited number of distinct values and are often used in data warehouses for complex queries on large datasets. Full-text indexes organize words in text fields to make it easier and faster to search for specific words or phrases.

Data retrieval techniques (Table 3.3) ensure that data can be efficiently retrieved, processed, and utilized for different applications, ranging from simple queries (such as using SQL to pull specific data, like finding all customers from a certain city) to complex data analysis and mining (for example, finding patterns in large datasets, like discovering what products people buy together).

Retrieval Technique Description
SQL Used to query and manipulate data in relational databases
NoSQL database Handles unstructured and semistructured data with specific query languages
Full-text search engine Indexes text data to enable complex search queries
API Provides access to data from web services and applications using HTTP methods
File-based retrieval Retrieves data stored in files using programming languages
Web scraping Extracts data from websites using tools
Data warehousing Aggregates data from multiple sources for complex queries and analytics
Indexing Improves data retrieval speed by creating quick lookup structures
In-memory data grid Stores data in random access memory for faster access, supporting distributed caching and querying
Metadata search Retrieves data based on descriptive information
Data mining Extracts patterns and knowledge from datasets
Table 3.3 Data Retrieval Techniques There are different ways to retrieve data, from simple SQL queries to quickly finding records to indexing for faster searches.

SQL operates on the following database structures:

  • Database: a collection of related data
  • Table: a collection of related data entries consisting of columns and rows
  • Column: a vertical entity in a table that contains all information associated with a specific field
  • Row: a horizontal entity in a table that contains a single record

Basic SQL commands include CREATE TABLE, INSERT, SELECT, UPDATE, and DELETE for managing databases (Table 3.4):

  • CREATE TABLE: makes a new table with specified columns and data types
  • INSERT INTO: adds new data to a table
  • SELECT: retrieves data from a table
  • UPDATE: changes existing data in a table
  • DELETE: removes data from a table
SQL Command Example
Creating a table:
    CREATE TABLE customers (
        id INT PRIMARY KEY,
        name VARCHAR(50),
        age INT,
        country VARCHAR(50)
    );
Inserting data into a table:
    INSERT INTO customers (id, name, age, country)
    VALUES (1, 'John Doe', 30, 'USA');
Retrieving data (selecting all columns; the asterisk is a wildcard for all columns):
    SELECT * FROM customers;
Retrieving data (selecting specific columns):
    SELECT name, country FROM customers;
Filtering results (the WHERE clause filters rows by a condition):
    SELECT * FROM customers WHERE country = 'USA';
Using logical operators:
    SELECT * FROM customers WHERE country = 'USA' AND age > 25;
Updating data:
    UPDATE customers SET age = 31 WHERE id = 1;
Deleting data:
    DELETE FROM customers WHERE id = 1;
Table 3.4 Basic SQL Commands Key SQL commands include CREATE TABLE to make tables, INSERT INTO to add data, SELECT to retrieve data, UPDATE to change data, and DELETE to remove data. These commands are essential for managing any database (attribution: Copyright Rice University, OpenStax, under CC BY 4.0 license).

A DBMS is a set of software programs that let users create and maintain databases. It provides a way to interact with the data, ensuring data are stored in a structured and consistent way that makes them easy to access, manage, and retrieve efficiently. DBMSs offer many advantages, such as improved data integrity and consistency, reduced redundancy, control over data independence, trusted security and access protocols, and overall management and backup protections.

Data integrity refers to the accuracy and consistency of data stored in a database. A DBMS enforces rules and constraints such as primary keys. A primary key is a unique identifier for each data entry in a database table. It ensures that no two rows in the table have the same value for this key, enforcing data integrity, and it cannot contain duplicate or null values. For example, in a “Customers” table, “CustomerID” could be the primary key because it gives every customer a unique ID. Constraints like primary keys prevent incorrect data entry and maintain the integrity of the data over time. For example, a DBMS can enforce that an email field must contain a valid email format, preventing invalid entries.

A DBMS also supports data consistency, ensuring that data remain consistent and accurate across the database. It enforces consistency through ACID properties in transactions, meaning that all operations within a transaction are completed successfully, and the database remains in a consistent state. For example, transferring money from one account to another involves two steps: debiting one account and crediting another. Both steps must succeed together; if either fails, the transaction is rolled back so the database stays consistent.
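
Written as SQL, the transfer becomes a single atomic transaction; a minimal sketch, assuming a hypothetical accounts table:

    BEGIN;                                                             -- start the transaction
    UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;  -- debit
    UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;  -- credit
    COMMIT;                                                            -- both changes persist together
    -- If either UPDATE fails, ROLLBACK undoes the partial work,
    -- leaving both balances exactly as they were.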

A DBMS reduces data redundancy, or the duplication of data, to save space and prevent inconsistencies. Using normalization, a DBMS structures the data in a way that reduces redundancy. For example, instead of storing customer details in multiple places, they are stored in only one place and referenced as needed. A DBMS also allows for data independence, meaning that data can be restructured without affecting the programs that use them. This works because DBMSs use a schema, which describes the structure of the data separately from how the data are stored.

Data security is the protection of data from unauthorized access to ensure data privacy. A DBMS provides robust security features like user authentication (verifying a user’s identity before granting access to a system), access controls, and encryption (encoding data into an unreadable form that can only be read with a key). For instance, only authorized users can access certain data in a company’s database. Concurrent access allows multiple users to access and modify the data simultaneously and without conflicts. A DBMS uses a locking mechanism to ensure that multiple operations can occur simultaneously without data corruption. For example, while one user is updating a record, another user can still view the data without interruption. Backup and recovery allow data to be backed up and restored after a failure or loss. A DBMS provides automated backup and recovery solutions.

Database Design Techniques

Database design is the process of organizing data according to a database model, which indicates how to structure the data and how the data are stored and connected. A well-designed database will store and retrieve data efficiently and ensure the data are accurate and accessible. Design involves understanding the data needs of an organization and structuring the data to minimize redundancy and maximize performance.

A database modeler designs and plans how a database will work. This person figures out what data are required to create the structure (like tables and relationships, often called entity relationship diagrams), and makes sure the database is efficient, secure, and easy to use. They also work with developers and administrators to set it up and keep it running smoothly by following specific constraints. Constraints are rules applied to database tables to ensure data integrity and reliability (for example, entity constraints that ensure each table has a primary key). A referential constraint maintains relationships between tables, making sure that a foreign key in one table matches a primary key in another table. This ensures that the relationships between records are consistent and eliminates any orphaned record, or a record that references another record that no longer exists. A check constraint specifies the values that can be entered into a column. For example, a check constraint can ensure that the values in an age column are always positive, or that the status column only contains predefined values like “Active” or “Inactive.”
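
All three kinds of constraints can be declared directly when tables are created; a sketch using hypothetical departments and employees tables:

    CREATE TABLE departments (
        id INT PRIMARY KEY                             -- entity constraint: unique identifier
    );

    CREATE TABLE employees (
        employee_id INT PRIMARY KEY,                   -- entity constraint
        department_id INT REFERENCES departments(id),  -- referential constraint
        age INT CHECK (age > 0),                       -- check constraint: positive ages only
        status VARCHAR(10) CHECK (status IN ('Active', 'Inactive'))  -- allowed values only
    );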

A database design usually proceeds through three stages: conceptual, logical, and physical design. Prior to these, the designer must conduct a requirements analysis, a study of how a business operates that determines what data should be stored and how the data should be used. It helps in understanding what users need from the database.

Next, the conceptual design creates a simple model of the data, focusing on what is needed and how it’s connected. It involves identifying key items like customers or products (entities), their details like names or prices (attributes), and how they are related, such as “customers place orders.” An ERD is often used to map this out clearly. This ensures the database fits the business or application needs. Once the concept is determined, the next step is to turn the initial models into a more detailed logical design, which defines tables, columns, primary keys, and foreign keys. A foreign key is a column or set of columns in one table that establishes a relationship with the primary key in another table. These keys are used to maintain referential integrity between the two tables, ensuring that the value in the foreign key column corresponds to an existing value in the referenced table. Finally, the physical design translates the logical design into a physical structure (via actual implementation in the database) that the DBMS can use. This involves choosing data types, indexing strategies, and planning for storage.

The database modeler also needs to attend to these needs during the design:

  • Schema refinement: Making sure the database design is free of issues such as storing the same data in multiple places, which wastes space and invites inconsistency.
  • Data integrity and validation: Ensuring the data remain accurate. This is achieved through constraints (rules applied to database columns to ensure data integrity and accuracy) and stored procedures (such as prewritten SQL programs stored in the database that perform specific tasks) that enforce business rules and validate the data.
  • Documentation: Creating detailed documentation for developers, administrators, and users.
  • Prototyping and iterative refinement: Building a prototype of the database, testing it, and refining the design based on feedback and performance results. This helps catch issues early and ensures the design meets user needs.
  • Security design: Protecting database data by managing who can access them, encrypting sensitive information, and keeping track of activity. This ensures the data stay safe and recoverable.
  • Maintenance: Ensuring the database can be maintained over time.

A functional dependency describes how one piece of data relates to another within a table. For example, in a table of employees, entering the employee ID should provide you with their name and address. One way of checking these dependencies is normalization, in which data are organized and stored only once to eliminate the duplication of data. Normalization occurs across three stages:

  1. First normal form (1NF): Ensures each column contains atomic, indivisible values. Each column has single, simple values, and rows are unique. For example, multiple phone numbers do not appear in one cell—each number gets its own row.
  2. Second normal form (2NF): Ensures the database is in 1NF and that all nonkey columns depend on the whole primary key. For example, in a table with OrderID and ProductID as the key, Quantity must depend on both, not just one.
  3. Third normal form (3NF): Ensures the database is in 2NF and that nonkey columns depend only on the primary key, not on other nonkey columns (no transitive dependencies).

For systems analysts, understanding normalization is key when working with relational databases. Normalization helps organize data to avoid problems like duplicate records or errors. However, it’s important for a systems analyst to know when to balance normalization with performance needs.

In cloud environments or large systems, sometimes denormalization, or the addition of redundant data, is used to make things run faster and meet specific requirements. Consider a systems analyst managing a database for an online store. The database has two tables: one table is Customers (with customer information like name and address), and the other table is Orders (and includes details about each order). Normally, these tables are kept separate and are linked by a customer ID. This keeps the data clean—if a customer updates their address, you only need to change it in one place. But if the store is busy, joining tables every time someone checks an order can slow things down. To fix this, the analyst might denormalize the database by combining some tables to allow data retrieval to occur more quickly. Instead of keeping customer information separate, the organization could decide to store the customer’s name and address directly in the Orders table. While this makes it faster to retrieve orders, if the customer changes their address, it will have to be updated in multiple places. This is how analysts balance keeping the data clean with making the system run more quickly.
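
In SQL terms, the trade-off looks roughly like this (the schemas are hypothetical):

    -- Normalized: clean data, but every order lookup pays for a join
    SELECT o.order_id, c.customer_name, c.address
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id;

    -- Denormalized: name and address are copied into the orders table,
    -- so reads skip the join, but an address change must be applied to
    -- every matching order row
    SELECT order_id, customer_name, address FROM orders;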

Table 3.5, Table 3.6, and Table 3.7 show how a database progresses through the first, second, and third normal forms (1NF, 2NF, 3NF) using tables for an online store’s customer and ordering data.

Beginning Data
OrderID CustomerName Address Products TotalPrice
101 John Doe 123 Main St Shirt, Shoes $75
102 Jane Smith 456 Oak Ave Hat, Bag $50
103 John Doe 123 Main St Jacket, Sunglasses $120
First Normal Form (1NF)
OrderID CustomerName Address Product TotalPrice
101 John Doe 123 Main St Shirt $75
101 John Doe 123 Main St Shoes $75
102 Jane Smith 456 Oak Ave Hat $50
102 Jane Smith 456 Oak Ave Bag $50
103 John Doe 123 Main St Jacket $120
103 John Doe 123 Main St Sunglasses $120
Table 3.5 First Normal Form (1NF) To begin, the data might have multiple values in one field. In 1NF, the products are separated into individual rows to ensure there is only one value per field.
Second Normal Form (2NF)—Orders
OrderID Product TotalPrice
101 Shirt $75
101 Shoes $75
102 Hat $50
102 Bag $50
103 Jacket $120
103 Sunglasses $120
Second Normal Form (2NF)—Customers
CustomerID CustomerName Address
1 John Doe 123 Main St
2 Jane Smith 456 Oak Ave
Table 3.6 Second Normal Form (2NF) In 2NF, partial dependencies are removed by splitting the data into two tables: one for Orders and another for Customers.
Third Normal Form (3NF)—Orders
OrderID Product TotalPrice CustomerID
101 Shirt $75 1
101 Shoes $75 1
102 Hat $50 2
102 Bag $50 2
103 Jacket $120 1
103 Sunglasses $120 1
Third Normal Form (3NF)—Customers
CustomerID CustomerName
1 John Doe
2 Jane Smith
Third Normal Form (3NF)—Addresses
AddressID Address
1 123 Main St
2 456 Oak Ave
Table 3.7 Third Normal Form (3NF) In 3NF, transitive dependencies are removed by creating a separate Addresses table.

Studying the three models of database design—conceptual, logical, and physical—is important because they help ensure a database is well planned, functional, and efficient. A conceptual model is a high-level outline of what data the database will hold and how those data relate to each other. Think of it as the blueprint of an organization’s database, often visualized using ERDs. A logical model takes the conceptual model and adds more detail, defining tables, columns, keys, and data types. The physical model describes how the database will be implemented on a specific database management system, addressing technical aspects like storage, indexing, and performance.

Gathering, Organizing, Curating, and Processing Data

Data serve as a cornerstone for decision-making and innovation across various disciplines, ranging from scientific research to business management. Whether engaged in academic research, business operations, or personal projects, knowing how to effectively gather, organize, curate, and process data is imperative.

Gathering Data

The data-gathering stage includes identifying data sources, choosing data collection methods, and selecting tools for data collection. Data sources can be categorized into primary and secondary sources. Primary data are data collected firsthand through methods such as experiments, surveys, or direct observations. Secondary data are data previously collected by other researchers or organizations and are accessible through books, articles, and databases. The choice of data collection method depends on the research objectives and the nature of the data required. Table 3.8 lists some considerations and examples for collecting primary data.

Surveys and questionnaires
  • Description: Used to gather information from large groups of people; useful for collecting data on opinions or preferences; can be conducted online, over the phone, or in person
  • Who uses them: Public health officials and educational researchers
  • When to use them: When collecting standardized data, such as opinions or behaviors, from many participants
  • Example: A school district might survey parents to assess satisfaction with online learning programs.
Experiments
  • Description: Employed to obtain specific data points under controlled conditions to test hypotheses by manipulating variables and observing the outcomes
  • Who uses them: Scientists and medical researchers
  • When to use them: When determining cause-and-effect relationships under controlled conditions
  • Example: A researcher might test what level of a drug is needed to produce a medical effect.
Observations
  • Description: Involve recording data based on direct observation by watching behaviors or events
  • Who uses them: Educators and psychologists
  • When to use them: When studying real-life interactions or behaviors as they occur
  • Example: An educator might observe students during group activities to assess collaboration skills.
Interviews
  • Description: Involve one-on-one conversations that explore a participant’s experiences or opinions in depth
  • Who uses them: Social scientists, psychologists, and health-care professionals
  • When to use them: When the researcher needs rich, detailed responses or when exploring sensitive topics
  • Example: A researcher studying stress among health-care workers might conduct interviews to capture personal stories and insights.
Table 3.8 Primary Data Collection There are several methods for gathering primary data. The source and type of data influence which method is most appropriate.

Organizing Data

Proper data organization ensures that data can be easily manipulated and analyzed in tabular, hierarchical, and graphical formats. Tabular format organizes data in rows and columns, facilitating sorting and filtering; hierarchical format structures data in a treelike format, suitable for nested information; and graphical format represents data as nodes and edges, ideal for depicting relationships and networks.

Implementing best practices in data management enhances data integrity and usability. Using consistent naming conventions helps avoid confusion by ensuring files and variables are clearly and uniformly named. Version control is essential for keeping track of different iterations of datasets, making it easier to manage changes over time. Regular backups can prevent data loss, ensuring that data remain safe and accessible.

Curating Data

Curating data involves several important steps to ensure their accuracy and reliability. Data cleaning is the first step, which includes handling missing values by deciding whether to flag or remove incomplete records, removing duplicates to ensure each record is unique, and standardizing formats for dates, units, and other data points. Data transformation is the process of making data more suitable for analysis through normalization. Ensuring data quality is the final step and involves making sure the data are accurate, consistent, complete, and timely. This is done by checking that data entries are correct and precise, verifying uniformity across different sources, ensuring no crucial data are missing, and using the most up-to-date data available.

Processing Data

Processing data involves various techniques and tools to analyze and visualize information effectively. Descriptive statistics—like mean, median, mode, and standard deviation—summarize data, while inferential statistics use hypothesis testing and confidence intervals to draw conclusions about a population. Machine learning algorithms, such as regression and classification models, provide predictive analysis. Tools like Microsoft Excel and Google Sheets are useful for basic analysis, while advanced statistical software like R, SPSS, and SAS offer capabilities that are more sophisticated. Programming languages like Python, with libraries such as Pandas, are powerful tools for data manipulation and analysis. Effective data visualization enhances understanding through charts and graphs, such as bar charts, line graphs, and scatterplots, while advanced tools like Tableau and Microsoft Power BI offer more complex visualization options. Dashboards provide interactive platforms for real-time data monitoring and decision-making.

Data Security and Access Control

Database security involves protecting the database against unauthorized access and potential threats. Measures include implementing strong access controls, using encryption, performing regular security audits, and monitoring for suspicious activities (you will learn more about these topics in Chapter 5 Information Systems Security Risk Management and Chapter 6 Enterprise Security, Data Privacy, and Risk Management).

User management includes creating, managing, and monitoring database user accounts. Proper user management ensures that only authorized personnel have access to specific data, and roles and permissions are assigned based on the principle of least privilege, which is giving a user, system, or process only the minimum access or permissions needed to perform their specific tasks. Ensuring that database transactions are processed efficiently is essential. This includes maintaining data consistency and integrity through ACID properties. Proper transaction management helps prevent issues like data corruption and ensures that operations are completed fully or not at all.

Ethics in IS

Structured Query Language Injections

In 2011, Sony faced a major data breach in its PlayStation Network when attackers used SQL injection to exploit weaknesses in its systems. An SQL injection is a type of security vulnerability that occurs when an attacker is able to insert or manipulate SQL queries within a web application’s input fields. It allows the attacker to retrieve sensitive data from the database, such as usernames, passwords, or financial information. This can happen when the application fails to properly sanitize user input before incorporating it into an SQL query, allowing the attacker to execute malicious SQL commands. In Sony’s case, the attack gave hackers access to sensitive customer information, including credit card numbers. The incident showed how critical it is for organizations to secure their databases and control access to prevent such attacks. Views and authorization are critical for preventing SQL injection attacks, where hackers use harmful inputs to access or manipulate database data.

To prevent SQL injection, the following practices are recommended:

  • Views only show users the data they need, hiding sensitive information.
  • Authorization controls ensure users can only access what their role allows.
  • Parameterized queries treat user inputs as data, not code, which blocks harmful input from being executed as SQL (see the sketch after this list).
  • Input validation checks and cleans user inputs to prevent attacks.
  • Stored procedures standardize input handling, reducing vulnerabilities.
  • Regular security checks help identify and fix weaknesses.

Using these practices together makes databases much safer from SQL injection attacks.
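
As an illustration of the parameterized-query practice, here is a sketch in MySQL-style SQL with a hypothetical users table; the bound input is treated purely as a value, never parsed as SQL:

    -- Dangerous pattern (application pseudocode): input pasted into the query text.
    -- If the input is  ' OR '1'='1  the WHERE clause matches every row:
    --   "SELECT * FROM users WHERE username = '" + user_input + "'"

    -- Safer: a prepared statement binds the input as data
    PREPARE login_stmt FROM 'SELECT * FROM users WHERE username = ?';
    SET @name = 'alice';
    EXECUTE login_stmt USING @name;
    DEALLOCATE PREPARE login_stmt;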

One important user management procedure is access control, which is the security-driven restriction of access to ensure that only authenticated users are able to interact with specific resources. There are four common access control models: mandatory, discretionary, role-based, and privileged (Table 3.9):

Mandatory access control (MAC) is highly restrictive and typically used in government and military environments. In MAC, a central authority assigns access permissions to individuals based on their security clearances. For example, only individuals with the necessary top-secret clearance can access a top-secret document in a military database. Users cannot alter permissions, ensuring strict compliance with the organization’s security protocols.

Discretionary access control (DAC) allows resource owners to decide who can access their data. For instance, if you own a file on your company’s shared drive, you can grant or deny access to other colleagues. While this model offers flexibility, it can lead to security risks if not managed carefully. For example, if an employee shares a sensitive document with a contractor without proper vetting, it could lead to unauthorized access.

Role-based access control (RBAC) assigns permissions based on a user’s role within an organization. For example, an employee in the human resources department is given access to payroll information and employee records, while someone in IT is granted access to system configurations and network settings. This model simplifies management by grouping permissions based on roles rather than individual users, making it easier to update access as roles change. For instance, when a junior developer gets promoted to a senior developer, their role-based access can be updated to include additional system privileges.

Privileged access management (PAM) focuses on controlling and monitoring access for users with elevated permissions, often referred to as privileged accounts. For example, system administrators may have the ability to install software, configure network settings, and access sensitive data. PAM solutions ensure that these high-level permissions are used appropriately and securely by providing tools for monitoring, auditing, and managing access. For instance, a PAM system can track and log every action taken by an administrator—such as changes to firewall settings—ensuring accountability and security.

Table 3.9 summarizes some of the pros and cons of each of these four access control models.

Mandatory access control (MAC)
  • Advantages: Provides robust security and restricts access permissions; reduces the risk of unauthorized access
  • Disadvantages: Can be complex to implement and manage; not easily adaptable to changing business needs
Discretionary access control (DAC)
  • Advantages: Resource owners have the freedom to grant access; easier to set up and manage for smaller environments
  • Disadvantages: Increased risk of unauthorized access because users may inadvertently grant permissions; can lead to inconsistent application of security policies across the organization
Role-based access control (RBAC)
  • Advantages: Simplifies management by grouping permissions; easier to manage as the organization grows
  • Disadvantages: Can lead to a large number of roles that need to be managed; not as adaptable to temporary changes in roles or duties
Privileged access management (PAM)
  • Advantages: Focuses on controlling and monitoring high-level permissions; provides detailed logging and auditing of activities
  • Disadvantages: Can be complex to implement and manage, requiring significant resources and expertise; may lead to frustration among users if access is too restricted
Table 3.9 Access Control Models Access control models such as MAC, DAC, RBAC, and PAM each have their own advantages and disadvantages. They can be adapted for dynamic access control, especially with the shift to remote and blended workforces.

Choosing the right access control model for an organization can be challenging. A small defense subcontractor might need to implement MAC systems for its entire operation to meet strict security requirements determined by government regulations or classified contracts. In contrast, a prime contractor, which is a large organization managing multiple subcontractors, can use a more nuanced approach, reserving MAC systems for its most sensitive operations, such as handling classified defense projects.

Some industries commonly use role-based access controls, so that different system users (such as employees, managers, or suppliers) only have access to specific information. For example, a manager might have access to employee schedules and inventory records, while a cashier may only have access to the point-of-sale system.
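
In many SQL databases, this kind of role-based access is expressed with roles and grants; a minimal sketch, with hypothetical table and role names:

    CREATE ROLE cashier;
    GRANT SELECT, INSERT ON point_of_sale TO cashier;        -- ring up sales only

    CREATE ROLE store_manager;
    GRANT SELECT ON employee_schedules TO store_manager;     -- view schedules
    GRANT SELECT, UPDATE ON inventory TO store_manager;      -- manage inventory records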

Comparing Database Management Systems

When working with database management systems, it is important to understand the different types available and how they are used in various applications. We’ll focus on comparing three major types: relational, object oriented, and NoSQL. Learning the strengths and limitations of each provides insight into when and why to use them, based on specific project needs.

A relational database management system (RDBMS) is a database management system that stores and organizes data in a structured way using tables. Each table represents an entity; each row in the table represents a record of that entity, while columns represent the attributes of the entity. An RDBMS requires a predefined schema, which is a document that defines the structure of the data in terms of tables and the relationships between them. Relationships between tables are established using foreign keys, which reference primary keys in other tables, ensuring referential integrity. The RDBMS enforces data integrity through constraints such as primary keys, foreign keys, unique constraints, and checks. SQL is the standard language used to interact with an RDBMS for defining, querying, and manipulating data. Additionally, an RDBMS adheres to ACID properties to ensure reliable transaction management. Normalization is a key practice in an RDBMS, aimed at reducing data redundancy and improving data integrity by organizing data into multiple related tables. As an example of an RDBMS, a university stores student records in tables, with a predefined schema that organizes data into entities like “Students,” “Courses,” and “Grades,” linked by relationships. The structure ensures data consistency and simplifies reporting.

Object-oriented programming principles are fundamental concepts that guide the design and implementation of programs in object-oriented languages like Java, Python, and C++. An object-oriented database management system (OODBMS) stores data in the form of objects, similar to the way data are represented in object-oriented programming languages. Each object includes both data, in the form of attributes, and behavior, in the form of methods. This approach allows for a more direct mapping between the database and the application’s data model, facilitating complex data representations and relationships. An OODBMS supports classes, inheritance, polymorphism, and encapsulation, enabling the creation of complex data types and relationships that mirror real-world entities more closely. The schema in an OODBMS is defined using object-oriented concepts, and objects can contain references to other objects, enabling rich, interconnected data structures. Querying in an OODBMS is typically done using object query languages, which are designed to operate on objects and their relationships. An OODBMS is particularly well suited for applications with complex data models that require efficient handling of complex data types and relationships, such as computer-aided design (CAD), computer-aided manufacturing, multimedia, and telecommunications. As an example, a CAD software company can use an OODBMS to store data about three-dimensional models. Each model is an object containing data and methods reflecting real-world design elements.

A NoSQL database management system is a type of database system that stores and retrieves data without relying on the traditional relational database model. NoSQL systems are built to handle large amounts of data, high-speed data processing, and diverse data types. They offer scalability, flexibility, and performance, making them ideal for modern applications that manage big data and real-time web applications. Because a NoSQL DBMS does not require a fixed schema, it permits rapid or ad hoc changes to data structures. NoSQL systems also support distributed computing, which helps manage large-scale data across multiple servers. NoSQL databases come in various types, including key-value stores, document-oriented databases, column-family stores, and graph databases, each tailored to specific types of data and use cases. Key-value stores organize data by associating a unique identifier (the key) with a corresponding value, which can be a simple string or a complex object. The simplicity of key-value stores makes them very fast for certain types of tasks, particularly those involving straightforward data access patterns; they are well suited to applications like caching, session management, and real-time data analytics. Document-oriented databases are NoSQL databases with a flexible schema that store data in the form of documents, usually using formats like JSON, BSON (binary JSON), or XML. Each document contains a self-contained data structure made up of fields and values, which can include nested documents and arrays. This flexible schema allows for storing complex data structures without a predefined schema, making it easy to adapt to changes in data requirements over time. Document-oriented databases are a good fit for applications that need hierarchical data storage, such as content management systems, e-commerce platforms, and real-time analytics. As an example, a social media platform might use a document-oriented NoSQL database to store user profiles, posts, and comments; the flexible schema easily adapts to new features, like adding reactions or multimedia support.

Footnotes

  • 1. Clive Humby, “Data Is the New Oil” (lecture, Association of National Advertisers conference, Orlando, FL, April 30–May 2, 2006).