RDBMS: relational databases short review

RDBMS databases review In 1970, E. F. Codd developed the relational data model from which the RDBMS was born. The concept centered around tables, called relations or entities, to store data. Codd called the model “relational” after the name he used for tables, not the relationships you can build among them.

From his model Codd created 12 rules summarizing the features of a relational database. One rule specifies that a system must have integrity constraints to protect data. The constraints apply both to tables and table relationships. He also created a rule stating that the relational database system should have a single language that supports all the data-access and system-management needs. This rule provided the impetus for creating the universal database language called SQL.

Table of contents[Show]

Codd’s relational model was revolutionary at the time and stood in stark contrast to existing database systems. Soon after he presented his model companies began building database systems around it. Vendors also added features, such as indexing logic and custom procedural languages to enhance their systems. As a result, enterprises quickly adopted the RDBMS and it is now the default database system.

The benefits an RDBMS provides over the other database-systems helped increase its adoption rate. The following list summarizes some of the model’s advantages:

Data integrity — The relational model defines integrity rules that help guard against data corruption. That is, the data you place into an RDBMS do not change, disappear, or become corrupt.
Flexibility — In some respects an RDBMS server acts as an application-development platform. It provides an environment in which you can create data-centric applications. By definition, an RDBMS enables you to create entities and build relationships among them. More advanced RDBMSs incorporate a procedural language enabling you to build store procedures. Using these languages, in conjunction with entity relationships, enables you to model business processes and company workflow and store the resulting data in the database.
Universal data access — SQL has evolved as the default language for interacting with an RDBMS. Although some RDBMSs extend SQL to take advantage of proprietary features, the core language still remains portable.

An enterprise RDBMS is a complex system. It must provide robust data storage, incorporate integrity rules, and include server capabilities to share data. Figure 1 shows a conceptual view of a relational database system in a client-server deployment. As you can see, many pieces must come together to form the system.

Conceptual view of an RDBMS

Figure 1: Conceptual view of an RDBMS

Most RDBMSs use a client-server architecture, an architecture wherein clients communicate with the server over a network. Most often TCP/IP is the protocol for sending requests and receiving responses. The host server houses the database kernel, which consists of several internal-server processes that manage client connections, process SQL statements, and communicate with the host’s file system to manage data.

Despite the number of different enterprise RDBMSs on the market, they all share two common components: data-storage structures and data-integrity rules. The following sections provide an overview of the major components.

Understanding data storage

The purpose of a database is to safely store data. How it organizes the data on the physical storage device depends on the vendor. Regardless, an RDBMS has two basic storage structures, tables and indexes. Tables, the center of an RDBMS, hold the data. Indexes, although not technically required, improve data access performance.

A database system typically stores the tables and index data in files on a local file system. Depending upon the database, it may use either multiple files or one large file to hold the data. Some database systems also enable you to create distributive systems and store data on other hosts.

Tables

Tables consist of rows and columns. Rows represent entries like patient medical records or customer invoices. Each table row is called a record and is unique in a purely relational model. Without uniqueness, you cannot ensure consistent access to the same row. In addition, non-unique data can create data-corruption problems when you start creating relationships among tables.

Columns, often called attributes, describe each record. A table can have one or more columns. In general, columns define the information you want to track. The data type of the columns varies between databases. However, SQL standards such as SQL-92 and SQL3 define data types that most databases support. Some example data types include CHAR, NUMBER, and DATE.

As Codd mentioned, an RDBMS is based on tables. In fact, all data, even the table definitions themselves, are stored in tables. To store them, the relational database relies on system tables to completely describe itself. Storing the database metadata in tables enables database administrators (DBAs) or other authorized clients to manage the system using the same language as other users.

Indexes

Indexes help SQL queries quickly locate records. The way indexes work depends upon the index type. In general, an index ranks, or sorts, records based on a column in the table. The indexed column is known as a key. Indexes can also use composite keys consisting of multiple columns.

Without indexes a search for a record usually requires a table scan, which starts at the first row and looks sequentially at each entry for a match. Table scans do not provide quick access to a particular record or group of records, especially when a table has thousands or millions of records.

For example, imagine a phone book in which names and numbers were listed in the order in which the subscribers initially obtained their phone service. In this scenario it is nearly impossible to locate a person’s phone number. You have to perform a table scan, starting with the first entry, and look at every entry until you find a match. Fortunately, a phone book indexes entries alphabetically to help you locate the person you’re looking for. Although RDBMS indexes organize the data differently, the end result is the same: You can locate the information quickly.

Indexing is a science unto itself and many different types exist. Some common types include B-tree, clustered, non-clustered, and function-based indexes. However, most databases employ B-tree, or balanced-tree, indexes as the default. This type provides very quick access to data with minimal storage requirements. The actual implementation details of a B-tree index vary between vendors.

Figure 2 shows a conceptual view of a B-tree index for the alphabet. It consists of nodes that contain keys pointing to the location of data. The top node is called the root, subsequent nodes are branches, and the lower nodes are leaves. A B-tree index minimizes the path from the root node to any leaf node in the tree. Using the index you can locate any letter in the alphabet in three steps or fewer.

The keys in a database B-tree index point to individual records. The figure shows only three levels in the index; real indexes often have more levels.

Figure 2: Conceptual view of a B−tree index

Note Proper indexing of a table can significantly speed up “read” operations such as SQL SELECT statements. However, too many indexes can slow down INSERT or UPDATE operations. In general, a database system updates indexes whenever the column data that they represent change. Therefore new records, and updates to existing records, can cause index updates as well. The more indexes, the more updates needed, and therefore the slower the response.

Ensuring data integrity

One of the biggest benefits an RDBMS provides is that it protects the integrity of the data. Once you place data into a relational database you can remain confident that it will not change. To help provide this protection, the relational model defines integrity rules.

Different types of data corruption can occur in a database. Some, like disk-drive corruption, an RDBMS cannot guard against because it occurs at the physical layer. A whole drive does not have to become damaged for this type of corruption to occur; one bad block can cause significant problems. A DBA usually handles this situation by restoring the database from the last good backup. Unfortunately, an RDBMS can only mitigate the effects of this type of corruption, not prevent it.

However, two types of corruption exist that the RDBMS can protect against. The first occurs when the data among related tables become unsynchronized, creating orphan records. For example, suppose a Medical_Claims table contains records for patients, which requires a link between the Medical_Claims table and the Patients table. Deleting records from the Patients table without removing the corresponding records from the Medical_Claims table will create orphan records in the Patients table. Under relational-database rules, each medical record should map to a patient. In this situation you cannot retrieve any information on the patients associated with the orphan claims records.

Duplicate table records constitute the other form of data corruption. Data duplication can lead to incorrect values generated during queries. For example, when you have two or more identical line items in an Order table, reports based on that table may be flawed. Duplicate records also create problems when you define a relationship between two tables. For instance, you cannot reliably locate the correct record that provides the link between the two tables if one of the tables has duplicate entries.

Theoretically you may create database designs to allow orphan and duplicate records. However, you will rarely encounter these designs. Remember, an RDBMS provides you with the flexibility to create good data models as well as bad ones. You should always design with data integrity in mind.

Ensuring data integrity requires both entity and referential integrity. Entity integrity pertains to individual tables while referential pertains to table relationships. The following sections describe both in greater detail.

Entity integrity

Entity integrity ensures that table data remain unique, which is a requirement of Codd’s relational model. As I mentioned in the previous section, duplicate rows can lead to erroneous values on reports or in summary queries. In addition, you cannot effectively create a relationship between two tables when duplicate records exist. RDBMSs rely on data integrity as a foundation.

To ensure uniqueness a table employs a primary key, a single attribute that is unique for every record in a table. You may also define a composite primary key consisting of multiple attributes that remain unique for each record. In any given record the other columns may contain identical data, but the column(s) representing the primary key must be unique. In general, a primary key, when combined with the table name, acts as a pointer to a particular record.

Figure 3 illustrates the primary key for the Employees table I use as an example throughout this book. I have defined the column SSN, which represents an employee’s Social Security number, as the unique identifier for each row. No two rows can have the same value as this column.

Figure 3: Example of a primary key

A column that represents a primary key carries some additional constraints. For example, you cannot store NULL values in these columns. The column(s) that define a primary key must contain meaningful data. NULL values represent uncertainty. In addition, you must always specify a primary key value when adding a new record to a table. Not doing so is equivalent to adding a NULL value.

Note A database NULL value represents unknown values or indecision. Contrary to what many believe it does not represent the number zero or an empty character string.

Referential integrity

You may build RDBMS applications that contain multiple unrelated tables. However, this creates a database application good for looking up information in only one table at a time. Most RDBMS applications consist of multiple tables related in some manner to model real-world objects and processes.

To relate two tables you must create a link from one table to the other table. Foreign keys provide that link. Foreign keys are primary keys from one table used in another table. Because they are primary keys, they act as pointers to unique records in the other table.

For example, consider the Employees and Location tables in Figure 4. The Employees table has a column, Loc_Id, which holds the code from the Location table that indicates the home-office location. In this example
the Loc_Id column in the Employees table is a foreign key. Each entry points to a unique record in the Location table. Regarding nomenclature, the Employees entity is called the relational table and the Location table is known as the base table.

Figure 4: Example of a foreign key

Foreign keys also limit data duplication, which minimizes your database’s size. If you reference the primary key in a base table you do not have to include the whole record. The foreign key points to it. For example, when querying the Employees table you can easily retrieve the location by joining the two tables in a SELECT statement. I provide an example of this operation in the section "Reviewing SQL Basics" later in this chapter.

Like primary keys, foreign keys have usage rules. For instance, foreign keys must contain valid values in the base table. However, unlike primary keys, a foreign key can also contain NULL values. A table may even have multiple rows containing NULL values for the foreign key. Why? In some situations a record may not have valid value in the foreign key’s base table.

Figure 5 provides an example of a situation in which you might store a NULL value in a foreign key. Notice that in the Family table the foreign key, Pet_Type, references a Pet lookup table. In some cases a family may not have a pet, in which case a NULL value is appropriate.

Figure 5: Example of a NULL value for a foreign key

However, a NULL value introduces ambiguity because it represents the unknown. In the previous example, if you encounter a NULL value you don’t know whether the family doesn’t have a pet, forgot to answer the question, or has a type of pet that is not listed. Good database design suggests including a "no pet" or "not listed" entry in the Pet table to deal with this problem.

Table relationships

As I mentioned above, the ability to relate entities enables you to model systems and processes. Table relationships describe how records in one table map to records in another table. When creating the relationship you rely on primary and foreign keys to glue the entities together. Because the keys have integrity constraints associated with them, they also help to ensure that the relationship remains valid.

The following sections present the different table relationships with examples explaining them.

One-to-one relationships The one-to-one relationship is the simplest. It specifies that only one record in a table maps to only one record in another table. This relationship seldom occurs in database applications. However, you will need it in some obvious situations.

Figure 6 shows an example of a situation in which you might encounter a one-to-one relationship. In this case, it describes the relationship between an automobile and an engine. As you know, an automobile can only have one engine and vice versa. In the model, both the Automobile and Engine tables have composite primary keys: Auto_ID and Engine_ID, respectively. The Automobile table stores data about the vehicle, such as color and model. The Engine table stores engine specific information. The two tables share a one-to-one relationship based on their primary keys. In this example the one-to-one relationship is obvious.

Figure 6: Example of a one−to−one relationship

You can also use the one-to-one relationship to split a table into two parts. For example, suppose you have a table with a large number of columns. To make administration and documentation easier, or circumvent system limitations, you may want to split the table into two smaller entities. When splitting a table you keep the same primary key for each table and create the one-to-one link based on it.

Splitting a table also makes sense if it contains both confidential and non- confidential data. You can keep the confidential data in a separate table and allow only authorized personnel to access it. Again, identical primary keys in each table create the link.

One-to-many relationships The most common entity relationship is the one-to-many relationship. It occurs when a record in one table has zero, one, or many matching records in another table. You may also hear this relationship appropriately called a parent-child or master-detail relationship.

The relationship frequently occurs when a relational table includes information from a lookup table. Figure 7 shows an example of a one-to-many relationship used in the data model for an online store that sells computer systems. All the lookup tables — Mouse, Monitor, and Keyboard — contain unique inventory items for the different components.

Figure 7: Example of a one−to−many relationship

The lookup tables have a one-to-many relationship with the Computer_System table. That is, every record in a lookup table can have zero, one, or many records in the Computer_System table. For example, a certain style of mouse is used in many different systems. The same holds true for keyboards and monitors.

Many-to-many relationships Two tables share a many-to-many relationship when a record in Table A has many matching records in Table B, and a record in Table B has many matching records in Table A.

Unfortunately, the relational model does not enable you to directly implement this relationship between two tables. To implement the relationship you need to use a third table that indirectly join the two other tables to each other. The third table, often called a joining, linking, or association table, has a one-to-many relationship with each table.

Figure 8 illustrates this relationship between an Order and a Products table. A single product can exist on many different orders, and a single order can contain many different products. To uniquely identify a specific product-order combination you need an association table, Order_Details. It has two one-to-many relationships, one with the Products table and another with the Order table. Each entry in the Order_Details table represents a specific order-product combination.

As you can see, many-to-many relationships can become complex and confusing. This underscores the importance of understanding the data model before developing an application.

Figure 8: Example of a many−to−many relationship

Reviewing SQL Basics

The Structured Query Language (SQL) is the foundation for interacting with an RDBMS. The language enables you to build a database and the structures that hold data, as well as to insert, delete, and retrieve data from it. All RDBMSs use SQL. In fact, the universal acceptance of SQL has helped make the relational database the most widely used database system. As a result, an understanding of SQL will enable you to work with many different databases with relative ease.

SQL is a non-procedural language. You cannot create a sequence of statements and execute them as a whole. When using SQL you submit one statement, or command, at a time and wait for the result. A query provides a good example. You submit a SELECT statement and the database server returns the results. It is not possible to chain two statements together and execute them simultaneously.

However, some database vendors have developed procedural languages around SQL to create a programming environment. With a vendor’s language you can create stored procedures, which are mini-programs equivalent to Java methods. Stored procedures enable you to sequentially execute multiple statements as well as to control the flow of execution.

ANSI published the first SQL standard in 1986. The specification defined the basic functions and syntax for database interaction. The next major version, SQL92, published in 1992, provided the major language components we use today. In a nutshell, it gives us language constructs for manipulating schemas and database administration. SQL3, adopted in 1999, provides support for custom data types and supports certain object-oriented programming concepts, such as polymorphism and inheritance. Using SQL3 you can create your own object hierarchy within the database using custom data types.

Although an SQL standard exists, different database vendors have modified it slightly to exploit their different proprietary features. Think of the national language of a country. It’s understood everywhere, but different dialects exist in different regions. It’s the same with the modern RDBMS. Each vendor uses SQL, but each has its own dialect.

For example, Oracle developed an SQL version called PL/SQL with extensions that apply to Oracle-specific features. SQLServer uses T-SQL, or Transact-SQL, in the same manner. Nonetheless, if you understand SQL concepts, you should have few problems adopting an individual database’s SQL dialect.

SQL has two major components with which you can work: the Data Manipulation Language (DML) and the Data Definition Language (DDL). DML is the language for interacting with the data. With it you can add, change, or remove table data. DDL is the language for building the data-storage structures within the database. It enables you to create or destroy structures such as tables, indexes, or even a database itself.

The following two sections describe DML and DDL. I’ve also provided some examples to illustrate the concepts.

Using Data Manipulation Language (DML)

DML is the name given to a subset of SQL commands used for manipulating records in a relational database.

It defines commands for retrieving, inserting, updating, and deleting table data. As a developer you will likely use DML statements more often than any other SQL statement type. I list the four most commonly used DML commands and a brief description of each in Table 1.

Table 1: Data Manipulation Language (DML) Commands

SQL DML Command Description

SELECT Retrieves records from one or more tables in a relational database.

UPDATE Modifies existing columns for one or more records in a single table.

INSERT Adds a new record into a single database table.

DELETE Removes one or more records from a single table.

You use the SELECT command to retrieve data from one or more tables. It does not affect the underlying data, it only returns it. The other statements — UPDATE, INSERT, and DELETE — alter the underlying table data. Improperly using these commands can cause data loss or corruption.

However, most RDBMSs protect against misuse or mistakes by using transactions. Generally, a database system buffers your changes until you issue the COMMIT statement, which tells the database to make the changes permanent. If you decide not to apply changes, and have not yet issued the COMMIT statement, you can use the ROLLBACK statement to undo them.

Caution JDBC automatically performs commits after each DML statement. You can change this behavior by using the Connection.setAutoCommit() method to change the auto-commit state to false. This will require you to explicitly commit changes with the Connection.commit() method. Your driver may function differently, so you should check the documentation provided with it.

Not only do the UPDATE, DELETE, and INSERT statements affect table data, they also affect indexes. Generally, a database system will update an index whenever a record changes or you add new data. For example, when you add a record to a table, the database adds an entry to the appropriate index. The same holds true for updating and deleting records; the index entry is modified appropriately.