
8.3 Relational Database Management Systems


Learning Objectives

By the end of this section, you will be able to:

  • Discuss the relational model
  • Describe the theoretical and practical aspects of Structured Query Language (SQL)
  • Explain how to create a logical relational database design
  • Demonstrate how to create a physical relational database design
  • Discuss application programming interfaces and techniques used to program relational database applications
  • Identify the various components of a relational database management system

A relational database management system (RDBMS) is a type of DBMS that stores related data elements in a row-based table structure. An RDBMS transaction is a single logical unit of work that accesses and possibly modifies the contents of a database using read and write operations. Multiple SQL commands may be performed as part of a single RDBMS transaction. For example, a student may register for three courses in one transaction using a university's registration system. To maintain consistency in an RDBMS, transactions satisfy properties known as atomicity, consistency, isolation, and durability (ACID). These properties are necessary for maintaining data consistency, integrity, and reliability while performing transactions via an RDBMS.

Atomicity ensures that multiple SQL commands that are part of the same transaction are executed atomically (i.e., as a unit). In the registration example, once the transaction completes, the student is registered for all three courses; if that is not possible because one of the courses is full, the transaction will not complete at all. Consistency ensures that only valid data are written to the database: the registration information stored in the database will be consistent with the registration transaction that was completed. Isolation guarantees that multiple users who access the RDBMS concurrently will not be affected by one another's transactions; multiple students can register concurrently without noticing any side effects of the database being accessed by others at the same time. Finally, durability guarantees that once information has been stored in the database, it will stay there permanently: if a student registers successfully, the registration information will not change unless they later perform another drop/add transaction. While the ACID transaction properties ensure that stored data are reliable and accurate, they also come at a cost; for example, guaranteeing full isolation in an RDBMS requires significant additional processing time.
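
As a concrete illustration, the registration transaction might be sketched in SQL as follows. This is a minimal sketch: the Registration table, its columns, and the course identifiers are hypothetical, and the exact transaction syntax (e.g., BEGIN versus START TRANSACTION) varies across RDBMS products.

-- Register student 1001 for three courses as one atomic unit of work.
BEGIN TRANSACTION;
INSERT INTO Registration (StudentID, CourseID) VALUES (1001, 'CS101');
INSERT INTO Registration (StudentID, CourseID) VALUES (1001, 'MATH200');
INSERT INTO Registration (StudentID, CourseID) VALUES (1001, 'ENG150');
COMMIT;  -- all three inserts become permanent together (atomicity, durability)
-- If any insert fails (e.g., a course is full), the application issues
-- ROLLBACK instead, and none of the inserts take effect.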

Because database logical schemas constrain the row-based table structure and require each field to abide by specific type constraints, RDBMSs are not well suited to storing unstructured or semistructured data. The rigid schema also makes an RDBMS more expensive to set up, maintain, and grow: it requires users to define specific use cases in advance, and any subsequent changes to the schema are usually difficult and time-consuming. In addition, traditional RDBMSs were designed to run on a single computer node, which makes them significantly slower when processing large volumes of data. Sharding is a partitioning method that splits a database into smaller components, which makes the management process faster, but using it while maintaining ACID compliance is challenging.

Relational Model

The relational model of data is based on the concept of a relation. A relation is a mathematical concept based on the ideas of sets. The model was first proposed by Dr. Edgar F. Codd of IBM Research in 1970. A relation typically contains a set of rows. The data elements in each row represent certain facts that correspond to a real-world entity or relationship. In the formal model, rows are called tuples. Each column has a column header that indicates the meaning of the data items in that column; the column header is referred to as an attribute name, or simply an attribute. Multiple attributes may be selected to form superkeys that uniquely identify individual tuples. Minimal superkeys are referred to as candidate keys; they are considered minimal because they include only the minimum number of attributes needed to uniquely identify individual tuples. For example, a car database may use the vehicle identification number (VIN) and the make of the vehicle as a candidate key. Another possible candidate key could be obtained by using the license plate and state of the vehicle. Either one of the candidate keys can be selected as a primary key, which then provides a unique identifier for each table record (Table 8.2). A constraint on the primary key is called a key constraint; constraints more generally can enforce unique values, assign a default value to a column if none is specified, and check for predefined conditions before inserting the data.

License_number      Engine_serial_number  Make        Model    Year
Texas ABC-739       A69352                Ford        Mustang  02
Florida TVP-347     B43696                Oldsmobile  Cutlass  05
New York MPO-22     X83554                Oldsmobile  Delta    01
California 432-TFY  C43742                Mercedes    190-D    99
California RSK-629  Y82935                Toyota      Camry    04
Texas RPN-624       U028365               Jaguar      XJS      04
Table 8.2 Primary Key Example in a Car Table for an RDBMS. The primary key in a table is typically underlined.

Structured Query Language

The declarative language used to manage and query structured data in an RDBMS is called Structured Query Language (SQL). SQL implements the four sublanguages mentioned earlier: DDL, DML, DQL, and DCL. SQL is based on relational algebra with many extensions, and it supports primary keys, foreign keys, inserting data, deleting data, updating data, indexing, recovery, concurrency, and security. SQL focuses on efficiency in addition to specifying the data needed. One of the main commands in SQL is CREATE TABLE, which is used to create a new table in a database, such as:

CREATE TABLE Student (
    ID int,
    LastName varchar(255),
    ZipCode int
);

which creates a table named Student with three columns: ID, LastName, and ZipCode. DROP TABLE is used to drop an existing table in a database, and ALTER TABLE is used to add, delete, or modify columns in an existing table. SQL schemas contain tables (i.e., relations), and the RDBMS catalog stores these schemas. SQL constraints are used to specify rules for the data in a table to ensure the accuracy and reliability of that data. There are additional features of SQL such as triggers, views, nested/recursive queries, privileges, and metadata.
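
For illustration, the Student table created above could be modified and eventually removed as follows. This is a sketch: the Email column is hypothetical, and some products require the keyword COLUMN after ADD.

ALTER TABLE Student ADD Email varchar(255);   -- add a new column
ALTER TABLE Student DROP COLUMN ZipCode;      -- remove an existing column
DROP TABLE Student;                           -- remove the table entirely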

Relational Algebra

The theoretical framework for querying relational databases that uses unary and binary operators on relations to manipulate and retrieve tuples is called relational algebra. The following set of relational algebra operations is a complete set: Select, Project, Union, Rename, Difference, and Cartesian product. It is a complete set because any relational algebra operation can be expressed as a sequence of operations from this set.

Relational algebra operators include set operators and relational operators. The set operators include Union, Intersection, Difference, and Product, which are all binary operators because they use two relations as operands and produce a new relation as a result. In some cases, the operand relations need to be union compatible, which implies that they have the same number of attribute types defined on the same domains. In what follows, we represent a relation as R and its cardinality or number of tuples as |R|.

Union Operator

The Union operator applied to two union-compatible relations P and Q results in a new relation R = P ∪ Q such that:

R = {t | t ∈ P OR t ∈ Q} with max(|P|, |Q|) ≤ |R| ≤ |P| + |Q|

The resulting relation R consists of the tuples that either belong to P or Q, or both. Duplicate tuples are eliminated.

Intersection Operator

The Intersection operator applied to two union-compatible relations P and Q results in a new relation R = P ∩ Q such that:

R = {t | t ∈ P AND t ∈ Q} with 0 ≤ |R| ≤ min(|P|, |Q|)

The resulting relation R consists of the tuples that belong to both P and Q.

Difference Operator

The Difference operator applied to two union-compatible relations P and Q results in a new relation R = P − Q such that:

R = {t | t ∈ P AND t ∉ Q} with 0 ≤ |R| ≤ |P|

The resulting relation R consists of the tuples that belong to P but not to Q. Note that when using the Difference operator, the order of the operands is important because the operation is noncommutative.

Product Operator

The Product operator (also known as the Cartesian product, Cross Product, or Cross Join operator) returns the Cartesian product of two relations. Consider the following two relations: a supplier relation P and a product relation Q. The Cartesian product of both relations consists of all possible combinations of tuples from P and Q. The resulting relation R = P × Q consists of a set of tuples r = pq such that:

R = {r = pq | p ∈ P AND q ∈ Q} with |R| = |P| × |Q|
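
In SQL, the set operators map onto UNION, INTERSECT, and EXCEPT (the Difference operator, called MINUS in some products), and the Cartesian product onto CROSS JOIN. A minimal sketch, assuming two union-compatible tables P and Q:

SELECT * FROM P UNION SELECT * FROM Q;      -- Union; duplicates eliminated
SELECT * FROM P INTERSECT SELECT * FROM Q;  -- Intersection
SELECT * FROM P EXCEPT SELECT * FROM Q;     -- Difference (tuples in P but not in Q)
SELECT * FROM P CROSS JOIN Q;               -- Cartesian product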

The relational operators are Select, Project, Rename, Join, and Division. The Select, Project, and Rename operators are unary operators because they only use one operand. The Join and Division operators are binary operators because they use two operands.

Select Operator

The Select operator is used to return a subset of tuples of a relation based on a selection condition. The latter can consist of predicates combined using Boolean operators (and, or, not). Every predicate consists of a comparison between the values of two domain-compatible attribute types, or between an attribute type and a constant value from the domain of the attribute type.

The result of applying a Select operator with a selection condition S on a relation P can be represented as:

R = σ_S(P) with Select operator σ

Project Operator

The Project operator is used to project a relation on a list of attribute types. The result of applying a Project operator with a list L on a relation P can be represented as:

R = π_L(P) with Project operator π

Note that applying a Project operator to a relation results in a vertical subset of the relation with |R| ≤ |P| because duplicate tuples will be eliminated.

Rename Operator

The Rename operator is used to rename an attribute type of a relation (when, for example, naming conflicts occur). The result of applying a Rename operator on a relation P to rename attribute type B to A can be represented as:

R = ρ_A|B(P) with Rename operator ρ

Join Operator

The Join operator allows combining tuples of two relations based on specific conditions, also called join conditions. The result of a join is the combined effect of a Cartesian product (×) and selection. The result of applying a Join operator to two relations P and Q with join condition j can be represented as follows:

R = P ⋈_j Q with Join operator ⋈ and join condition j.

The resulting relation R consists of a set of combinations of pq tuples such that a combined tuple r ∈ R is characterized by:

R = {r = pq | p ∈ P AND q ∈ Q AND j} with |R| ≤ |P| × |Q|

Division Operator

The Division operator is a binary operator that divides a relation P by a relation Q whose attribute types are a subset of the attribute types of P. The result of a Division operator can be represented as:

R = P ÷ Q with Division operator ÷

The resulting relation R includes the tuples from P (restricted to the attribute types unique to P) that appear in combination with every tuple in Q. Note that the Division operator is not directly implemented in SQL.

The Select (σ), Project (π), Difference (–), Product (×), and Union (∪) operators are often referred to as the five primitive operators because all other relational algebra operators can be derived from them (derived operators).
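
The unary operators likewise have direct SQL counterparts. A minimal sketch, assuming a table P with attributes A and B:

SELECT * FROM P WHERE A = 10;  -- Select: sigma with selection condition A = 10
SELECT DISTINCT A FROM P;      -- Project: pi on A; DISTINCT removes duplicates
SELECT B AS C FROM P;          -- Rename: attribute B renamed to C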

SQL also allows aggregate functions and grouping operations. For join operations, there are many options, such as the following:

  • A theta join allows merging two tables based on a theta condition; theta (θ) refers to the comparison operator in the join condition. Depending upon theta, these joins can be distinguished: greater-than-join, less-than-join, greater-than-or-equal-join, less-than-or-equal-join, and equi-join (if the comparison operator is equality). An equi-join combines tables based on matching values in specified columns; it is equivalent to a theta join, with theta being the equal comparison operator in the join condition.
  • A natural join creates an implicit join based on the common columns in two tables; it is a variant of the equi-join in which one of the shared join-attribute types is removed from the result.
  • An inner join represents the intersection of two tables; all joins discussed thus far are inner joins because they do not include tuples lacking corresponding join values in the result.
  • An outer join represents the union of two tables; the outer join operator will also include tuples lacking corresponding join values in the result.

A distinction can be made between a left outer join, a right outer join, and a full outer join. A left outer join includes all tuples from the left relation (P) in the result, either matched with a corresponding tuple from the right relation (Q) based on the join condition or augmented with null values in case no match with a tuple from Q can be made. A left outer join can be represented as R = P ⟕_j Q. A right outer join includes all tuples from the right relation (Q) in the result, either matched with a corresponding tuple from the left relation (P) based on the join condition or augmented with null values in case no match with a tuple from P can be made. A right outer join can be represented as R = P ⟖_j Q. A full outer join includes all tuples from P and Q in the result, either matched with their counterparts according to the join condition or augmented with null values in case no match can be found. A full outer join can be represented as R = P ⟗_j Q.
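
These join variants can be sketched in SQL as follows, assuming hypothetical tables P and Q that share a join column K (note that some products, such as MySQL, do not support FULL OUTER JOIN directly):

SELECT * FROM P INNER JOIN Q ON P.K = Q.K;       -- matching tuples only
SELECT * FROM P LEFT OUTER JOIN Q ON P.K = Q.K;  -- all of P; nulls for unmatched Q
SELECT * FROM P RIGHT OUTER JOIN Q ON P.K = Q.K; -- all of Q; nulls for unmatched P
SELECT * FROM P FULL OUTER JOIN Q ON P.K = Q.K;  -- all tuples from both sides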

A query tree is an example of a data structure used to represent a relational algebra expression.

Tuple Relational Calculus and Domain Relational Calculus

In a relational database, a tuple is one row with a collection of values separated by commas and enclosed in parentheses. The following example of a tuple in Python represents the employee Smith, his phone number, age, and zip code (note that tuple literals in Python use parentheses; curly braces would create a set):

employee = ('Smith', 8882355151, 50, 48505)

Data inside a tuple can be of any type, such as integer, string, float, or even another tuple.

Tuple relational calculus uses tuple variables as key building blocks. A tuple variable refers to a tuple of a corresponding relation (also called range relation). A tuple relational calculus expression specifies a range relation for each tuple variable, a condition to select (combinations of) tuples, and a list of attribute types from which the values should be retrieved. As a result, a tuple relational calculus expression looks as follows:

{t.Ai | COND(t)}

whereby t represents the tuple variable, Ai the attribute type whose value needs to be retrieved, and COND(t) the range relation and extra conditions to select relevant tuples. Consider the following example:

{t.StudentID, t.StudentName | Student(t) AND t.StudentProgram = 3}
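
For comparison, the same query expressed in SQL (assuming a Student table with these column names) would be:

SELECT StudentID, StudentName
FROM Student
WHERE StudentProgram = 3;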

A condition is specified using a well-formed (calculus) formula (wff), which can consist of various predicate calculus atoms combined using Boolean operators (and, or, not). The result of applying a formula to a tuple of a range relation can be true, false, or unknown. If the result is true, the tuple will be selected in the result. Relational calculus also includes quantifiers such as the existential quantifier (∃ or EXISTS) and the universal quantifier (∀ or FOR ALL). A formula with an existential quantifier (∃) evaluates to true if at least one tuple satisfies it. A formula with a universal quantifier (∀) evaluates to true if every tuple from the range relation satisfies the conditions stated.

Instead of tuple variables, domain relational calculus defines domain variables that range over single values of domains of attribute types. A domain calculus expression for a relation of degree n looks as follows:

{v1, v2, …, vn | COND(v1, v2, …, vn)}

where v1, v2, …vn represent the domain variables that range over domains of attribute types, and COND represents the extra conditions.

The earlier tuple relational calculus query selecting the student ID and student name of all students who study in program number 3 would now look as follows:

{a, b | (∃e)(Student(abcde) AND e = 3)}

In this case, we defined five domain variables as a for the domain of StudentID, b for the domain of StudentName, c for the domain of StudentAddress, d for the domain of StudentCity, and e for the domain of StudentProgram. The condition e = 3 is a selection condition.

The existential quantifier (∃ or EXISTS) and universal quantifier (∀ or FOR ALL) are also supported in domain relational calculus.

Query-by-Example (QBE) Language

The relational database language based on domain relational calculus is called query-by-example (QBE). QBE queries are expressed via a graphical query language, using visual tables where the user enters commands, example elements, and conditions. A parser converts the QBE graphical queries into statements expressed in a database manipulation language, such as SQL.

Logical Design

The process of logical design involves designing a database based on a specific data model but independent of physical details. The logical design for a relational DBMS includes specifications for tables, relationships, and constraints. The logical design is performed in four steps: map conceptual model to logical model components, validate the logical model using normalization, validate logical model constraints, and validate the logical model against user requirements.

To determine the quality of a relation schema design, we follow the informal design guidelines for relational schemas and use simple measures of quality such as making sure attribute semantics are clear, reducing redundant information, and reducing null values. Redundancy and null values waste storage space and increase the difficulty of performing operations.

Technology in Everyday Life

RDBMS

Airplanes have to be designed and tested for safety. Airplane wings are complex items that include a large number of parts, come in various shapes and forms (e.g., rectangular wings for small airplanes, elliptical wings, tapered wings, and trapezoidal wings), and require multiple designers to collaborate on their design. Airplane wings are typically designed using computer-aided design (CAD) software systems that use data management systems to store and retrieve data.

Is a RDBMS the best solution to store and retrieve these types of items? Why or why not? Are there other, better ways to track the development, design, and testing of mechanical parts crucial to passenger safety?

Relational Model Constraints

The constraints that ensure database correctness are called relational model constraints. The different types of constraints are domain, uniqueness, key, and entity integrity constraints. The domain constraint defines the domain of values for an attribute. The uniqueness constraint specifies that all the tuples must be unique. The key constraint specifies that all the values of the primary key must be unique. The entity integrity constraint specifies that no primary key value is null. Additionally, a functional dependency (FD) is a constraint that specifies the relationship between two sets of attributes and provides a formal tool for the analysis of relational schemas. FDs enable the detection and description of business rules in precise terms, and they can be used to help normalize relational schemas. A multivalued dependency (MVD) occurs when two attributes in a table are independent of each other but both depend on a third attribute. MVDs are relevant to achieving higher forms of normalization (e.g., 4NF). The normalization process can be automated using algorithms that convert a given relation into sets of relations in a given normal form.
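
Several of these constraint types map directly onto SQL column and table constraints. A minimal sketch with illustrative column names:

CREATE TABLE Student (
    ID int NOT NULL,            -- entity integrity: the key may not be null
    LastName varchar(255) NOT NULL,
    Age int CHECK (Age >= 16),  -- domain constraint on the allowed values
    Email varchar(255) UNIQUE,  -- uniqueness constraint
    PRIMARY KEY (ID)            -- key constraint: primary key values are unique
);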

Normalization

The process of structuring a relational database to reduce data redundancy and improve data integrity is called database normalization. A normal form applies to a table/relation, not to the database. The main types of normalization are

  • first normal form (1NF),
  • second normal form (2NF),
  • third normal form (3NF),
  • Boyce-Codd normal form (BCNF), and
  • fourth normal form (4NF).

There are additional normal forms, which are more complex. In 1NF, each cell must contain only a single (atomic) value, and every column must be uniquely named; the relational model requires tables to be in 1NF. Tables in 2NF must be in 1NF and have no partial dependencies. Tables in 3NF must be in 2NF and have no transitive functional dependencies. BCNF is a stricter version of 3NF that is used to address remaining anomalies. Tables in 4NF must be in BCNF and have no MVDs. The following example converts a table to 1NF, 2NF, and then 3NF. Table 8.3 contains the original data, which is then converted to 1NF (Table 8.4), 2NF (Table 8.5), and 3NF (Table 8.6, Table 8.7, and Table 8.8). It is assumed here that Customer Name is a composite attribute that contains First Name and Last Name and therefore must be broken down into its component attributes to be stored in a relational table that satisfies 1NF.

Customer Name Item 1 Item 2
Shirl Adam Milk Bread
Jeff Mark Juice Water
Rana Park Bread Milk
Table 8.3 Data Table
First Name Last Name Item 1 Item 2
Shirl Adam Milk Bread
Jeff Mark Juice Water
Rana Park Bread Milk
Table 8.4 Table Converted to 1NF
First Name Last Name Item
Shirl Adam Milk
Shirl Adam Bread
Jeff Mark Juice
Jeff Mark Water
Rana Park Bread
Rana Park Milk
Table 8.5 Table Converted to 2NF
Customer ID First Name Last Name
1 Shirl Adam
2 Jeff Mark
3 Rana Park
Table 8.6 Table Converted to 3NF: Customer
Item ID Item
1 Milk
2 Bread
3 Juice
4 Water
Table 8.7 Table Converted to 3NF: Item
Customer ID Item ID
1 1
1 2
2 3
2 4
Table 8.8 Table Converted to 3NF: Relational Table
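
The 3NF design in Table 8.6 through Table 8.8 could be implemented with CREATE TABLE statements such as the following sketch (the data types and the CustomerItem table name are illustrative):

CREATE TABLE Customer (
    CustomerID int PRIMARY KEY,
    FirstName varchar(255),
    LastName varchar(255)
);

CREATE TABLE Item (
    ItemID int PRIMARY KEY,
    ItemName varchar(255)
);

-- Relational (junction) table linking customers to items
CREATE TABLE CustomerItem (
    CustomerID int REFERENCES Customer(CustomerID),
    ItemID int REFERENCES Item(ItemID),
    PRIMARY KEY (CustomerID, ItemID)
);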

Relational Database Design

Modeling data into a set of tables with rows and columns is called relational database design (RDD). Each row represents a record, and each column represents an attribute. RDD includes five phases:

  1. Requirements analysis phase to assess the informational needs of an organization.
  2. Logical design phase to create a conceptual model (entity-relationship [ER]) based on phase 1. ER modeling describes interrelated entities and their attributes in a specific domain of knowledge.
  3. Physical design phase to maximize database efficiency (Unified Modeling Language [UML]). UML helps developers visualize the relationships between the different pieces.
  4. Implementation phase to convert the tables developed in the ER diagram into SQL.
  5. Monitoring and maintenance phase to ensure that the RDBMS is functioning properly.

Industry Spotlight

RDBMS and Banking

RDBMSs are ideal in banking applications for tracking and storing account numbers, orders, and payments. Most banks run Oracle applications because of the powerful integration of technology and business applications. Oracle includes built-in functionality designed specifically for banks, such as the Oracle Banking Platform, a leading choice for banks looking to change their core systems for customer-centric retail banking.

Mapping a Conceptual EER Model to a Relational Model

After designing the ER diagram, we need to convert it to relational models that can be directly implemented by any RDBMS. ER modeling helps create a conceptual model that relies on entities and their interrelationships. The degree of a relationship establishes how many entities it interrelates. Therefore, relationships can be unary, binary, ternary, and more generally n-ary.

Cardinality ratios are used to specify how many tuples are interrelated as part of a binary relationship (i.e., 1-1, 1-N, N-1, M-N).

Multivalued attributes are allowed in ER models. They are used to represent an attribute that contains a list of values. For example, a car entity may have a color attribute that is multivalued and includes a list of all the car's colors.

In ER modeling, a weak entity is a type of entity that cannot be uniquely identified based on its attributes alone and must rely on a strong entity to provide the context necessary for identification. For example, an employee in a company may have dependents for health insurance purposes. In this case, a dependent entity is a weak entity and the ID of the employee, who is the strong entity, needs to be used to fully identify a dependent.

The following steps must be followed to map an EER model to a relational model:

  1. Mapping of Regular Entity Types. Create a table that includes all of its simple attributes and select one of the key attributes as the primary key. If the key chosen is from a composite attribute in the EER model, the corresponding set of simple attributes will together form the primary key.
  2. Mapping of Weak Entity Types. For each weak entity, create a table that includes all of its simple attributes, and include the attributes that make up the key of the associated strong entity. The primary key in this case is the combination of the primary key of the associated strong entity and the partial key of the weak entity if any.
  3. Mapping of Binary 1:1 Relation Types. One solution is to create a relation for each participating entity, choose a primary key in one of the two relations, and make a foreign key referencing the primary key in the second relation. In some cases, it is more appropriate to merge both relations (from the participating entities) into one, or create a cross-reference (i.e., relationship relation) between the two relations.
  4. Mapping of Binary 1:N Relationship Types. Create a relation for each participating entity. The primary key from the “one” side of the relationship is added to the “many” side as a foreign key. An alternative approach is to use a relationship relation, but it is rarely done.
  5. Mapping of Binary M:N Relationship Types. Create a relation for each participating entity. Also create a relationship relation and include as foreign keys attributes the primary keys of the relations that represent the participating entity types. Their combination will form the primary key of the relationship relation.
  6. Mapping of Multivalued Attributes. For each multivalued attribute, create a new relation that includes an attribute corresponding to the multivalued attribute (in the original entity type) and include as a foreign key the primary key of the relation that represents the entity that includes the multivalued attribute. The primary key in the new relation created is the combination of the attribute that corresponds to the multivalued attribute (in the original entity type) and the foreign key.
  7. Mapping of n-ary Relationship Types. Create relationship relations for each n-ary relationship and include as foreign keys attributes the primary keys of the relations that represent the participating entity types.
  8. Mapping Specialization or Generalization. Convert each specialization with m subclasses {S1, S2, . . . , Sm} and generalized superclass C, where the attributes of C are {k, a1, . . . , an} and k is the (primary) key, into relational schemas using one of the four following options. Option 8A: multiple relations for the superclass and subclasses; Option 8B: multiple relations for the subclass relations only; Option 8C: a single relation with one type attribute; Option 8D: a single relation with multiple type attributes.
  9. Mapping of Union Types. For mapping a category whose defining superclass has different keys, it is customary to specify a new key attribute, called a surrogate key, when creating a relation to correspond to the category.
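
As an illustration of steps 4 and 5, the following sketch maps a hypothetical 1:N relationship (a department employs many employees) and a hypothetical M:N relationship (employees work on many projects) to SQL; all table and column names are illustrative:

-- Step 4 (1:N): the primary key of the "one" side becomes a foreign key
-- on the "many" side.
CREATE TABLE Department (
    DeptID int PRIMARY KEY,
    DeptName varchar(255)
);

CREATE TABLE Employee (
    EmpID int PRIMARY KEY,
    EmpName varchar(255),
    DeptID int REFERENCES Department(DeptID)
);

-- Step 5 (M:N): a relationship relation whose primary key combines the
-- foreign keys of the participating entity types.
CREATE TABLE Project (
    ProjID int PRIMARY KEY,
    ProjName varchar(255)
);

CREATE TABLE WorksOn (
    EmpID int REFERENCES Employee(EmpID),
    ProjID int REFERENCES Project(ProjID),
    PRIMARY KEY (EmpID, ProjID)
);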

Physical Design

Mapping logical concepts to physical constructs is called physical database design. The input to the physical design step is a logical representation of the system, and the design is a joint responsibility of the database designer and the database administrator (DBA). The main responsibility of physical design is optimizing performance while ensuring data integrity. The main issues addressed in physical design are storage media, file structures, and indexes. The database is stored on disk using a "standard" file system, not one "tailored" to the database, and the database runs on an unmodified general-purpose operating system.

Disk Storage and Basic File Structures

Database tables are stored in disk storage—the memory device that stores the data—such as hard disks, flash memory, magnetic disks, optical disks, and tapes, as shown in Figure 8.9. The records in the tables can be of fixed length, which means the records are exactly the same length, or variable length, which means the records differ in their length. Records are grouped into blocks; a record whose length exceeds the size of a block is called a spanned record. Assume that each relation is stored as a file and each tuple is a record in the file. The files can be ordered (i.e., the records are sequenced on some field) or unordered (i.e., the records are not in any order). The average record access time is the average time taken for a system to access a record. A redundant array of inexpensive disks (RAID) stores information across an array of low-cost hard disks.

[Figure: a triangle diagram of the storage hierarchy. Primary storage, from top down: CPU registers, caches, central storage. An I/O boundary separates primary storage from secondary storage: HDD, SSD, tape, and optical media.]
Figure 8.9 This storage hierarchy includes primary and secondary storage. (attribution: Copyright Rice University, OpenStax, under CC BY 4.0 license)

Physical File Organization and Indexing

File organization and indexing are used to minimize the number of block accesses for frequent queries. There are many types of file organization, with the most popular being sequential, relative, and indexed organization. In a sequential file organization, records are stored in the order in which they are inserted, and any new record is added at the end. In a relative organization, each record is assigned a numeric key, so the order of the records can be rearranged at any time. Similar to the relative organization, indexed organization uses a key, but the key is unique and fixed. In addition, there are many other file organization types such as heap, hash, B+, and cluster file organization. A heap in file organization refers to an unordered collection of records where new records are placed wherever there is free space, and no particular ordering is enforced. Figure 8.10 shows an example of a heap file organization.

Figure 8.10 This heap file organization example shows how to insert a new record (record 4) into the data blocks in the memory. (attribution: Copyright Rice University, OpenStax, under CC BY 4.0 license)

Physical Database Organization

A tablespace is where tables are physically stored (i.e., the physical container of database objects). The use of tablespaces has an impact on physical database design concerning intraquery parallelism, where different subsets of data can be searched in parallel for a complex query, and interquery parallelism (i.e., many simple queries executed in parallel).

Query Execution Concepts

Processing the query includes five steps:

  1. parsing: checking the syntax of the query, the user's privileges, and the table and attribute names
  2. translation: translating the query from the high-level language into a lower-level internal representation
  3. optimization: selecting the best plan for executing the query
  4. execution: running the selected plan
  5. evaluation: returning the query result

The query optimizer determines the lowest-cost strategy for query execution using internal representations of the query (i.e., the query tree and query graph). The mathematical technique for processing the query quickly is called heuristic optimization; the main heuristic is to first apply the operations that reduce the size of intermediate results.
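
Many RDBMS products expose the plan the optimizer selects through an EXPLAIN statement (the exact keyword and output format vary by product), which makes it possible to check, for example, whether a query uses an index or scans the whole table:

EXPLAIN SELECT * FROM employee WHERE Fname = 'Andy';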

Physical Database Design and Tuning

Assume you are looking for a specific record in a huge database and that the record you are looking for is at the end. To retrieve that record, the system moves sequentially record by record and takes a long time to process. Let us follow this SQL exercise using Table 8.9.

Employee
Number FName LName Country
1 Sam Smith USA
2 Adam Bell UK
3 Matt John USA
4 John Zack USA
5 Andy Dave UK
Table 8.9 Employee Table

We are looking for “Andy” using the following SQL statement:

SELECT * FROM employee WHERE Fname = 'Andy';

Looking for Andy takes five comparisons because Andy is at the end of the table. However, if we sort the data alphabetically (Table 8.10), searching for a name happens faster (two comparisons).

Employee_sort
FName LName Index
Adam Bell 2
Andy Dave 5
John Zack 4
Matt John 3
Sam Smith 1
Table 8.10 Employee Sort Table

An index holds the field being sorted and points from each record to the corresponding record where the data are stored (Figure 8.11).

[Figure: the Employee_sort and Employee tables shown side by side, with arrows connecting the Index column of Employee_sort to the Number column of Employee.]
Figure 8.11 Using indexing can improve the query performance. (attribution: Copyright Rice University, OpenStax, under CC BY 4.0 license)

Indexing helps improve the performance of the query process: creating an index in SQL allows data to be retrieved from the database more quickly. The following are the SQL syntaxes for creating an index and creating a unique index. Note that UNIQUE refers to an index in which all rows must be unique; that is, no two rows may have identical non-null values for all of the columns in the index.

CREATE INDEX index_name ON table_name (column1, ...);
CREATE UNIQUE INDEX index_name ON table_name (column1, ...);

In addition, you can delete the index anytime using the drop statement:

DROP INDEX index_name ON table_name;

A key is needed to index an attribute; however, an index can be constructed over one or multiple attributes. If a table requires more than one index, we need to decide on which attribute to cluster. Indexing is generally implemented using a binary tree, B-tree, or B+ tree. A binary tree is a tree in which every node has up to two children. A B-tree generalizes the binary tree by allowing nodes with more than two children. A B+ tree is a tree with a large number of children per node; in practice, B+ trees are the most commonly used index structure in database systems, although in some cases it is more efficient to use hash indexing (e.g., for equality conditions).

A normalized database is one in which data redundancy has been eliminated to enhance data integrity. Denormalization is a strategy to increase the performance of a normalized database by reintroducing redundant data, for example by adding extra attributes, adding new tables, or duplicating instances of data. The main goal is to decrease the running time of queries by making data more accessible.

Think It Through

The Relational Model and Constructors

The relational model uses two types of constructors. One is “aggregate of,” which enables the creation of tables by aggregating various columns that contain data of different types. The other is “set of,” which enables the gathering of multiple records within a table.

Is there a limitation with this approach?

RDBMS APIs and Programming Techniques

A database architecture is a representation of a database system's design that guides how the DBMS is designed, developed, implemented, and maintained. This section covers RDBMS architectures, database APIs, object persistence and object-relational mappings (ORMs), and sample RDBMS applications.

RDBMS Architectures

A relational database management system (RDBMS) architecture can be categorized as either a centralized system architecture or a tiered system architecture. In a centralized system architecture, the DBMS's responsibilities are handled by one centralized entity, the mainframe database, which has become rare because it is expensive and difficult to maintain. Tiered system architectures aim to decouple the centralized setup by combining the computing capabilities of powerful central computers with the flexibility of personal computers (PCs).

There are multiple variants of the tiered system architecture, such as the two-tier (client-server) architecture. In the fat client variant, presentation logic and application logic are handled by the client; this is common in cases where it makes sense to couple an application's workflow with its look and feel, and the DBMS then fully runs on the database server. In the thin client variant, only the presentation logic is handled by the client, and application logic and database commands are executed on the server; this is common when application logic and database logic are tightly coupled or similar. A three-tier architecture decouples application logic from the DBMS and puts it in a separate layer (i.e., an application server). Note that the application server or database server may itself consist of multiple physical, distributed machines.

Database APIs

In a tiered DBMS system architecture, client applications can query database servers and receive results. It is possible to use in-process DBMSs (e.g., SQLite) or to access stand-alone DBMS servers (e.g., Postgres). Client applications that wish to utilize the services provided by a DBMS use a specific API provided by the DBMS. This database API exposes an interface through which client applications can access and query the DBMS; the database server receives calls made by clients and executes the relevant operations before returning the results. In many cases, the client and server interfaces are implemented to work over a computer network using network sockets. Most DBMS vendors provide a proprietary, DBMS-specific API, with the disadvantage that client applications must be aware of the DBMS that will be used on the server side. Alternatively, generic, vendor-agnostic universal APIs have been proposed; they make it easy to port applications to multiple DBMSs.

APIs can be embedded or call-level. An embedded API embeds SQL statements in the host programming language, meaning that SQL statements are part of the source code. Before the program is compiled, a SQL precompiler parses the SQL-related instructions and replaces them with source code instructions native to the host programming language; the converted source code is then sent to the actual compiler. Call-level APIs (SQL/CLI) pass SQL instructions to the DBMS by means of direct calls to a series of procedures, functions, or methods provided by the API to perform the necessary actions (e.g., ODBC and JDBC).

Object Persistence and Object-Relational Mappings (ORMs)

API technologies such as JDBC and ADO.NET represent database-related entities (e.g., tables, records) in an object-oriented (OO) way. With object persistence, plain domain entities such as Book and Author are represented as objects using a programming language's representational capabilities and syntax, and the corresponding object data are persisted behind the scenes to a database or other data source. Language-integrated query technologies apply similar ideas, while object persistence APIs go a step further: they also describe the full business domain (i.e., the definition of data entities) within the host language and allow for efficient querying of objects. Such entities are frequently mapped to a relational database model using an object-relational mapper (ORM). It is not strictly necessary to utilize an ORM to enable object persistence, though most APIs tightly couple both concepts.

Sample RDBMS Applications

On the Web, there are two types of calls: asynchronous and synchronous. In the asynchronous call, the client sends a request without waiting for a response. In the synchronous call, the client sends a request and waits for a response from the service. API and HTTP calls are examples of synchronous calls. An example of a database fully managed on the Web is Oracle’s NoSQL data services.

RDBMS Features

An RDBMS includes multiple features such as transaction management, concurrency control, data distribution, distributed transaction management, distributed and parallel processing, recovery, and security.

Transaction Management

Most database systems are multiuser. While transaction management allows concurrent access to the same data by multiple users, concurrency may induce different types of anomalies, and errors may occur in the DBMS or its environment. An RDBMS must ensure that transactions support the ACID (atomicity, consistency, isolation, durability) properties, as explained in the following paragraphs.

A transaction is a set of database operations induced by a single user or application that should be considered as one undividable unit of work (e.g., transfer between two bank accounts of the same customer). A transaction either succeeds or fails in its entirety.

A transaction brings the database from one consistent state into another consistent state. The transaction manager supervises the execution of database transactions. A database transaction is a sequence of read/write operations considered to be an atomic unit. The transaction manager creates a schedule with interleaved read/write operations, guarantees the ACID properties, and can COMMIT a transaction upon successful execution or ROLLBACK a transaction upon unsuccessful execution. Delineating transactions within the transaction life cycle is called transaction management. Various components are involved in transaction management (i.e., the scheduler, recovery manager, stored data manager, and buffer manager), and RDBMSs use a log file to register the current state of each transaction (active, committed, or aborted) and to facilitate the implementation of checkpoints for rollback purposes.
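
The bank account transfer mentioned earlier makes COMMIT and ROLLBACK concrete. A minimal sketch, assuming an Account table with AccountID and Balance columns:

BEGIN TRANSACTION;
UPDATE Account SET Balance = Balance - 100 WHERE AccountID = 1;  -- withdraw
UPDATE Account SET Balance = Balance + 100 WHERE AccountID = 2;  -- deposit
COMMIT;  -- both updates become permanent together
-- On any failure, ROLLBACK undoes both updates, returning the database
-- to its previous consistent state.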

Concurrency Control

The coordination of transactions that execute simultaneously on the same data so that they do not cause inconsistencies through mutual interference is called concurrency control. A lock manager provides concurrency control, which ensures data integrity at all times. The two types of locks are read locks and write locks. The lock manager is responsible for assigning, releasing, and recording locks in the catalog, making use of a locking protocol that describes the locking rules and a lock table that contains the lock information. Concurrency problems occur when multiple transactions execute concurrently without control.
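
Applications influence concurrency control in SQL by choosing an isolation level or by acquiring explicit write locks. A minimal sketch (the placement of SET TRANSACTION and the FOR UPDATE row-locking syntax vary by product):

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;  -- request the strictest isolation
BEGIN TRANSACTION;
-- Write-lock the selected row so concurrent transactions cannot modify it
-- until this transaction commits.
SELECT Balance FROM Account WHERE AccountID = 1 FOR UPDATE;
UPDATE Account SET Balance = Balance - 100 WHERE AccountID = 1;
COMMIT;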

Data Distribution

The process of storing data at more than one site to improve data availability and retrieval performance is called data replication. Multiple databases located at different sites may need to be synchronized in real time to ensure consistency and optimal performance. Data may also be split across sites through fragmentation; there are different types of fragmentation, as sketched after the following list:

  • vertical fragmentation: consists of a subset of columns of data; global view can be retrieved with JOIN query; useful if only some of a tuple’s attributes are relevant to a node
  • horizontal fragmentation (sharding): the fragment consists of rows that satisfy a query predicate; global view can be retrieved with UNION query; common in NoSQL databases
  • mixed fragmentation: combines horizontal and vertical fragmentation; global view can be retrieved with JOIN + UNION query
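
A global view over fragments can be sketched with SQL views. The following assumes a hypothetical Customer table fragmented horizontally by region into Customer_East and Customer_West, and vertically into Customer_Base (key and name) and Customer_Profile (remaining attributes):

-- Horizontal fragmentation (sharding): global view via UNION
CREATE VIEW Customer_All AS
    SELECT * FROM Customer_East
    UNION
    SELECT * FROM Customer_West;

-- Vertical fragmentation: global view via JOIN on the shared key
CREATE VIEW Customer_Full AS
    SELECT b.CustomerID, b.Name, p.Address, p.Phone
    FROM Customer_Base b
    JOIN Customer_Profile p ON b.CustomerID = p.CustomerID;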

Distributed Transaction Management

A distributed transaction is a set of operations that are performed across multiple database systems. Distributed transaction management coordinates the resources between multiple databases. The transaction manager decides whether to commit or roll back a transaction using a two-phase commit (2PC) protocol. The steps of 2PC are described in Figure 8.12.

[Figure: the 2PC message sequence. The system sends a distributed transaction request to Database A; a prepare-to-commit request goes to Database B; a prepare-to-commit response returns to Database A; a commit request goes to Database B; a commit response returns to Database A; and the distributed transaction response is returned to the system.]
Figure 8.12 Two-phase commit protocol illustrates the six steps needed for a distributed transaction. (attribution: Copyright Rice University, OpenStax, under CC BY 4.0 license)

The transaction management server manages transactions in loosely coupled and tightly coupled modes. In the loosely coupled mode, the different database servers coordinate transactions without sharing resources; in the tightly coupled mode, resources are shared.

Parallel and Distributed Processing

Many databases, including NoSQL databases, use parallel processing, a technique in which multiple processors work simultaneously on different tasks or different parts of a task to enable concurrent processing of large amounts of data. Distributed database systems distribute data and data retrieval functionality over multiple data sources or locations. Figure 8.13 shows different architectures of distributed databases. In a shared-memory architecture, multiple interconnected processors that run the DBMS software share the same central storage and secondary storage. With a shared-disk architecture, each processor has its own central storage but shares secondary storage with the other processors using network-attached storage (NAS) or a storage area network (SAN). In a shared-nothing architecture, each processor has its own central storage and hard disk units, and data sharing occurs through the processors communicating with one another over the network.

Figure 8.13 The different distributed architectures are shared-memory architecture, shared-disk architecture, and shared-nothing architecture. (attribution: Copyright Rice University, OpenStax, under CC BY 4.0 license)

Scalability is the measure of a system's ability to increase or decrease performance and/or cost in response to changes in demand. Scalability can be achieved through vertical scalability, where the capacity of each node is increased, or horizontal scalability, where more nodes are added. Parallel databases focus on data distribution for performance and on intraquery versus interquery parallelism. Federated databases use nodes in a shared-nothing architecture, each running an independent DBMS instance, with horizontal data fragmentation.

In distributed query processing, the optimizer should consider not only the elements of a stand-alone setting but also the properties of the respective fragments, communication costs, and the location of the data in the network. Metadata may also be distributed, both globally (across all nodes) and locally (within a single node), and query optimization is needed in both settings.

Recovery

The activity of returning the database to a consistent state without any data loss in the event of a failure is called database recovery. System failure occurs when a system goes down in the middle of a transaction; media failure occurs when a database file or the device storing it stops working. There are many recovery techniques, such as mirroring (maintaining two complete copies of the database), transaction logs (recording the transactions' changes), shadow paging (dividing the database into pages), and backups. Backups are the most widely used method for recovery and include immediate backups and archival backups. An immediate backup stores copies on disk, and an archival backup stores the data on different servers/sites.

Security

Using a set of controls to secure data and guarantee a high level of confidentiality is considered database security. Security issues facing a database include human errors, such as password sharing and weak passwords; SQL injection, which involves inserting SQL attack strings into database queries; overflow attacks, which write a large amount of data into a fixed-length block of memory, causing an overflow; denial-of-service (DoS) attacks, which crash the server with a flood of requests; and internal and external attacks (Figure 8.14).

[Figure: an attacker connects to a website through a Web API server using a malicious SQL query; the data of all teachers is returned to the attacker through the Web API server.]
Figure 8.14 SQL injection gives attackers access to confidential client data such as teacher data. (attribution: Copyright Rice University, OpenStax, under CC BY 4.0 license)

To secure the database, there are many methods, such as discretionary access control, which grants or rejects object access using a set of policies; mandatory access control, which restricts the ability to grant or reject access to resource objects; statistical database security, which focuses on the protection of statistical values stored in the database; encryption, which converts data into secret code to hide its true meaning; and public key infrastructure, which is a set of roles and policies for managing the keys and certificates used to secure the database.
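
Discretionary access control is expressed in SQL with GRANT and REVOKE statements. A minimal sketch with hypothetical table and user names:

GRANT SELECT, INSERT ON Registration TO alice;  -- grant object privileges
REVOKE INSERT ON Registration FROM alice;       -- later withdraw one privilege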
