- By Itzik Ben-Gan
Creating tables and defining data integrity
This section describes the fundamentals of creating tables and defining data integrity using T-SQL. Feel free to run the included code samples in your environment.
As mentioned earlier, DML rather than DDL is the focus of this book. Still, you need to understand how to create tables and define data integrity. I won’t go into the explicit details here, but I’ll provide a brief description of the essentials.
Before you look at the code for creating a table, remember that tables reside within schemas, and schemas reside within databases. The examples use the book’s sample database, TSQLV4, and a schema called dbo.
The examples here use a schema named dbo that is created automatically in every database and is also used as the default schema for users who are not explicitly associated with a different schema.
The following code creates a table named Employees in the dbo schema in the TSQLV4 database:
USE TSQLV4; DROP TABLE IF EXISTS dbo.Employees; CREATE TABLE dbo.Employees ( empid INT NOT NULL, firstname VARCHAR(30) NOT NULL, lastname VARCHAR(30) NOT NULL, hiredate DATE NOT NULL, mgrid INT NULL, ssn VARCHAR(20) NOT NULL, salary MONEY NOT NULL );
The USE statement sets the current database context to that of TSQLV4. It is important to incorporate the USE statement in scripts that create objects to ensure that SQL Server creates the objects in the specified database. In an on-premises SQL Server implementation, the USE statement can actually change the database context from one to another. In SQL Database, you cannot switch between different databases, but the USE statement will not fail as long as you are already connected to the target database. So even in SQL Database, I recommend having the USE statement to ensure that you are connected to the right database when creating your objects.
The DROP IF EXISTS command drops the table if it already exists. Note that this command was introduced in SQL Server 2016. If you’re using earlier versions of SQL Server, use the following statement instead:
IF OBJECT_ID(N'dbo.Employees', N'U') IS NOT NULL DROP TABLE dbo.Employees;
The IF statement invokes the OBJECT_ID function to check whether the Employees table already exists in the current database. The OBJECT_ID function accepts an object name and type as inputs. The type U represents a user table. This function returns the internal object ID if an object with the specified input name and type exists, and NULL otherwise. If the function returns a NULL, you know that the object doesn’t exist. In our case, the code drops the table if it already exists and then creates a new one. Of course, you can choose a different treatment, such as simply not creating the object if it already exists.
The CREATE TABLE statement is in charge of defining what I referred to earlier as the heading of the relation. Here you specify the name of the table and, in parentheses, the definition of its attributes (columns).
Notice the use of the two-part name dbo.Employees for the table name, as recommended earlier. If you omit the schema name, for ad-hoc queries SQL Server will assume the default schema associated with the database user running the code. For queries in stored procedures, SQL Server will assume the schema associated with the procedure’s owner.
For each attribute, you specify the attribute name, data type, and whether the value can be NULL (which is called nullability).
In the Employees table, the attributes empid (employee ID) and mgrid (manager ID) are each defined with the INT (four-byte integer) data type; the firstname, lastname, and ssn (US Social Security number) are defined as VARCHAR (variable-length character string with the specified maximum supported number of characters); and hiredate is defined as DATE and salary is defined as MONEY.
If you don’t explicitly specify whether a column allows or disallows NULLs, SQL Server will have to rely on defaults. Standard SQL dictates that when a column’s nullability is not specified, the assumption should be NULL (allowing NULLs), but SQL Server has settings that can change that behavior. I recommend that you be explicit and not rely on defaults. Also, I recommend defining a column as NOT NULL unless you have a compelling reason to support NULLs. If a column is not supposed to allow NULLs and you don’t enforce this with a NOT NULL constraint, you can rest assured that NULLs will occur. In the Employees table, all columns are defined as NOT NULL except for the mgrid column. A NULL in the mgrid column would represent the fact that the employee has no manager, as in the case of the CEO of the organization.
Defining data integrity
As mentioned earlier, one of the great benefits of the relational model is that data integrity is an integral part of it. Data integrity enforced as part of the model—namely, as part of the table definitions—is considered declarative data integrity. Data integrity enforced with code—such as with stored procedures or triggers—is considered procedural data integrity.
Data type and nullability choices for attributes and even the data model itself are examples of declarative data integrity constraints. In this section, I will describe other examples of declarative constraints: primary key, unique, foreign key, check, and default constraints. You can define such constraints when creating a table as part of the CREATE TABLE statement, or you can define them for already-created tables by using an ALTER TABLE statement. All types of constraints except for default constraints can be defined as composite constraints—that is, based on more than one attribute.
A primary-key constraint enforces the uniqueness of rows and also disallows NULLs in the constraint attributes. Each unique set of values in the constraint attributes can appear only once in the table—in other words, only in one row. An attempt to define a primary-key constraint on a column that allows NULLs will be rejected by the RDBMS. Each table can have only one primary key.
Here’s an example of defining a primary-key constraint on the empid attribute in the Employees table that you created earlier:
ALTER TABLE dbo.Employees ADD CONSTRAINT PK_Employees PRIMARY KEY(empid);
With this primary key in place, you can be assured that all empid values will be unique and known. An attempt to insert or update a row such that the constraint would be violated will be rejected by the RDBMS and result in an error.
To enforce the uniqueness of the logical primary-key constraint, SQL Server will create a unique index behind the scenes. A unique index is a physical mechanism used by SQL Server to enforce uniqueness. Indexes (not necessarily unique ones) are also used to speed up queries by avoiding unnecessary full table scans (similar to indexes in books).
A unique constraint enforces the uniqueness of rows, allowing you to implement the concept of alternate keys from the relational model in your database. Unlike with primary keys, you can define multiple unique constraints within the same table. Also, a unique constraint is not restricted to columns defined as NOT NULL.
The following code defines a unique constraint on the ssn column in the Employees table:
ALTER TABLE dbo.Employees ADD CONSTRAINT UNQ_Employees_ssn UNIQUE(ssn);
As with a primary-key constraint, SQL Server will create a unique index behind the scenes as the physical mechanism to enforce the logical unique constraint.
According to standard SQL, a column with a unique constraint is supposed to allow multiple NULLs (as if two NULLs were different from each other). However, SQL Server’s implementation rejects duplicate NULLs (as if two NULLs were equal to each other). To emulate the standard unique constraint in SQL Server you can use a unique filtered index that filters only non-NULL values. For example, suppose that the column ssn allowed NULLs, and you wanted to create such an index instead of a unique constraint. You would have used the following code:
CREATE UNIQUE INDEX idx_ssn_notnull ON dbo.Employees(ssn) WHERE ssn IS NOT NULL;
The index is defined as a unique one, and the filter excludes NULLs from the index, so duplicate NULLs will be allowed, whereas duplicate non-NULL values won’t be allowed.
A foreign-key enforces referential integrity. This constraint is defined on one or more attributes in what’s called the referencing table and points to candidate-key (primary-key or unique-constraint) attributes in what’s called the referenced table. Note that the referencing and referenced tables can be one and the same. The foreign key’s purpose is to restrict the values allowed in the foreign-key columns to those that exist in the referenced columns.
The following code creates a table called Orders with a primary key defined on the orderid column:
DROP TABLE IF EXISTS dbo.Orders; CREATE TABLE dbo.Orders ( orderid INT NOT NULL, empid INT NOT NULL, custid VARCHAR(10) NOT NULL, orderts DATETIME2 NOT NULL, qty INT NOT NULL, CONSTRAINT PK_Orders PRIMARY KEY(orderid) );
Suppose you want to enforce an integrity rule that restricts the values supported by the empid column in the Orders table to the values that exist in the empid column in the Employees table. You can achieve this by defining a foreign-key constraint on the empid column in the Orders table pointing to the empid column in the Employees table, like the following:
ALTER TABLE dbo.Orders ADD CONSTRAINT FK_Orders_Employees FOREIGN KEY(empid) REFERENCES dbo.Employees(empid);
Similarly, if you want to restrict the values supported by the mgrid column in the Employees table to the values that exist in the empid column of the same table, you can do so by adding the following foreign key:
ALTER TABLE dbo.Employees ADD CONSTRAINT FK_Employees_Employees FOREIGN KEY(mgrid) REFERENCES dbo.Employees(empid);
Note that NULLs are allowed in the foreign-key columns (mgrid in the last example) even if there are no NULLs in the referenced candidate-key columns.
The preceding two examples are basic definitions of foreign keys that enforce a referential action called no action. No action means that attempts to delete rows from the referenced table or update the referenced candidate-key attributes will be rejected if related rows exist in the referencing table. For example, if you try to delete an employee row from the Employees table when there are related orders in the Orders table, the RDBMS will reject such an attempt and produce an error.
You can define the foreign key with actions that will compensate for such attempts (to delete rows from the referenced table or update the referenced candidate-key attributes when related rows exist in the referencing table). You can define the options ON DELETE and ON UPDATE with actions such as CASCADE, SET DEFAULT, and SET NULL as part of the foreign-key definition. CASCADE means that the operation (delete or update) will be cascaded to related rows. For example, ON DELETE CASCADE means that when you delete a row from the referenced table, the RDBMS will delete the related rows from the referencing table. SET DEFAULT and SET NULL mean that the compensating action will set the foreign-key attributes of the related rows to the column’s default value or NULL, respectively. Note that regardless of which action you choose, the referencing table will have only orphaned rows in the case of the exception with NULLs that I mentioned earlier. Parents with no children are always allowed.
You can use a check constraint to define a predicate that a row must meet to be entered into the table or to be modified. For example, the following check constraint ensures that the salary column in the Employees table will support only positive values:
ALTER TABLE dbo.Employees ADD CONSTRAINT CHK_Employees_salary CHECK(salary > 0.00);
An attempt to insert or update a row with a nonpositive salary value will be rejected by the RDBMS. Note that a check constraint rejects an attempt to insert or update a row when the predicate evaluates to FALSE. The modification will be accepted when the predicate evaluates to either TRUE or UNKNOWN. For example, salary –1000 will be rejected, whereas salaries 50000 and NULL will both be accepted (if the column allowed NULLs). As mentioned earlier, SQL is based on three-valued logic, which results in two actual actions. With a check constraint, the row is either accepted or rejected.
When adding check and foreign-key constraints, you can specify an option called WITH NOCHECK that tells the RDBMS you want it to bypass constraint checking for existing data. This is considered a bad practice because you cannot be sure your data is consistent. You can also disable or enable existing check and foreign-key constraints.
A default constraint is associated with a particular attribute. It’s an expression that is used as the default value when an explicit value is not specified for the attribute when you insert a row. For example, the following code defines a default constraint for the orderts attribute (representing the order’s time stamp):
ALTER TABLE dbo.Orders ADD CONSTRAINT DFT_Orders_orderts DEFAULT(SYSDATETIME()) FOR orderts;
The default expression invokes the SYSDATETIME function, which returns the current date and time value. After this default expression is defined, whenever you insert a row in the Orders table and do not explicitly specify a value in the orderts attribute, SQL Server will set the attribute value to SYSDATETIME.
When you’re done, run the following code for cleanup:
DROP TABLE IF EXISTS dbo.Orders, dbo.Employees;