Single-Table Queries in Microsoft SQL Server 2012

  • 7/15/2012

NULL Marks

As explained in Chapter 1, “Background to T-SQL Querying and Programming,” SQL supports the NULL mark to represent missing values and uses three-valued logic, meaning that predicates can evaluate to TRUE, FALSE, or UNKNOWN. T-SQL follows the standard in this respect. Treatment of NULL marks and UNKNOWN in SQL can be very confusing because intuitively people are more accustomed to thinking in terms of two-valued logic (TRUE and FALSE). To add to the confusion, different language elements in SQL treat NULL marks and UNKNOWN differently.

Let’s start with three-valued predicate logic. A logical expression involving only existing or present values evaluates to either TRUE or FALSE, but when the logical expression involves a missing value, it evaluates to UNKNOWN. For example, consider the predicate salary > 0. When salary is equal to 1,000, the expression evaluates to TRUE. When salary is equal to –1,000, the expression evaluates to FALSE. When salary is NULL, the expression evaluates to UNKNOWN.

SQL treats TRUE and FALSE in an intuitive and probably expected manner. For example, if the predicate salary > 0 appears in a query filter (such as in a WHERE or HAVING clause), rows or groups for which the expression evaluates to TRUE are returned, whereas those for which the expression evaluates to FALSE are filtered out. Similarly, if the predicate salary > 0 appears in a CHECK constraint in a table, INSERT or UPDATE statements for which the expression evaluates to TRUE for all rows are accepted, whereas those for which the expression evaluates to FALSE for any row are rejected.

SQL has different treatments for UNKNOWN in different language elements (and for some people, not necessarily the expected treatments). The correct definition of the treatment SQL has for query filters is “accept TRUE,” meaning that both FALSE and UNKNOWN are filtered out. Conversely, the definition of the treatment SQL has for CHECK constraints is “reject FALSE,” meaning that both TRUE and UNKNOWN are accepted. If SQL used two-valued predicate logic, there wouldn’t be a difference between the definitions “accept TRUE” and “reject FALSE.” But with three-valued predicate logic, “accept TRUE” rejects UNKNOWN (it accepts only TRUE, hence it rejects both FALSE and UNKNOWN), whereas “reject FALSE” accepts it (it rejects only FALSE, hence it accepts both TRUE and UNKNOWN). With the predicate salary > 0 from the previous example, a NULL salary would cause the expression to evaluate to UNKNOWN. If this predicate appears in a query’s WHERE clause, a row with a NULL salary will be filtered out. If this predicate appears in a CHECK constraint in a table, a row with a NULL salary will be accepted.

One of the tricky aspects of UNKNOWN is that when you negate it, you still get UNKNOWN. For example, given the predicate NOT (salary > 0), when salary is NULL, salary > 0 evaluates to UNKNOWN, and NOT UNKNOWN remains UNKNOWN.

What some people find surprising is that an expression comparing two NULL marks (NULL = NULL) evaluates to UNKNOWN. The reasoning for this from SQL’s perspective is that a NULL represents a missing or unknown value, and you can’t really tell whether one unknown value is equal to another. Therefore, SQL provides you with the predicates IS NULL and IS NOT NULL, which you should use instead of = NULL and <> NULL.

To make things a bit more tangible, I’ll demonstrate the aforementioned aspects of the three-valued predicate logic. The Sales.Customers table has three attributes called country, region, and city, where the customer’s location information is stored. All locations have existing countries and cities. Some have existing regions (such as country: USA, region: WA, city: Seattle), yet for some the region element is missing and inapplicable (such as country: UK, region: NULL, city: London). Consider the following query, which attempts to return all customers where the region is equal to WA.

SELECT custid, country, region, city
FROM Sales.Customers
WHERE region = N'WA';

This query generates the following output.

custid      country         region          city
----------- --------------- --------------- ---------------
43          USA             WA              Walla Walla
82          USA             WA              Kirkland
89          USA             WA              Seattle

Out of the 91 rows in the Customers table, the query returns the three rows where the region attribute is equal to WA. The query returns neither rows in which the value in the region attribute is present and different than WA (the predicate evaluates to FALSE) nor those where the region attribute is NULL (the predicate evaluates to UNKNOWN).

The following query attempts to return all customers for whom the region is different than WA.

SELECT custid, country, region, city
FROM Sales.Customers
WHERE region <> N'WA';

This query generates the following output:

custid      country         region          city
----------- --------------- --------------- ---------------
10          Canada          BC              Tsawassen
15          Brazil          SP              Sao Paulo
21          Brazil          SP              Sao Paulo
31          Brazil          SP              Campinas
32          USA             OR              Eugene
33          Venezuela       DF              Caracas
34          Brazil          RJ              Rio de Janeiro
35          Venezuela       Táchira         San Cristóbal
36          USA             OR              Elgin
37          Ireland         Co. Cork        Cork
38          UK              Isle of Wight   Cowes
42          Canada          BC              Vancouver
45          USA             CA              San Francisco
46          Venezuela       Lara            Barquisimeto
47          Venezuela       Nueva Esparta   I. de Margarita
48          USA             OR              Portland
51          Canada          Québec          Montréal
55          USA             AK              Anchorage
61          Brazil          RJ              Rio de Janeiro
62          Brazil          SP              Sao Paulo
65          USA             NM              Albuquerque
67          Brazil          RJ              Rio de Janeiro
71          USA             ID              Boise
75          USA             WY              Lander
77          USA             OR              Portland
78          USA             MT              Butte
81          Brazil          SP              Sao Paulo
88          Brazil          SP              Resende

(28 row(s) affected)

If you expected to get 88 rows back (91 rows in the table minus 3 returned by the previous query), you might find the fact that this query returned only 28 rows surprising. But remember, a query filter “accepts TRUE,” meaning that it rejects both rows for which the logical expression evaluates to FALSE and those for which it evaluates to UNKNOWN. So this query returned rows in which a value was present in the region attribute and that value was different than WA. It returned neither rows in which the region attribute was equal to WA nor rows in which region was NULL. You will get the same output if you use the predicate NOT (region = N’WA’) because in the rows where region is NULL and the expression region = N’WA’ evaluates to UNKNOWN, NOT (region = N’WA’) evaluates to UNKNOWN also.

If you want to return all rows for which region is NULL, do not use the predicate region = NULL, because the expression evaluates to UNKNOWN in all rows—both those in which the value is present and those in which the value is missing (is NULL). The following query returns an empty set.

SELECT custid, country, region, city
FROM Sales.Customers
WHERE region = NULL;

custid      country         region          city
----------- --------------- --------------- ---------------

(0 row(s) affected)

Instead, you should use the IS NULL predicate.

SELECT custid, country, region, city
FROM Sales.Customers
WHERE region IS NULL;

This query generates the following output, shown in abbreviated form.

custid      country         region          city
----------- --------------- --------------- ---------------
1           Germany         NULL            Berlin
2           Mexico          NULL            México D.F.
3           Mexico          NULL            México D.F.
4           UK              NULL            London
5           Sweden          NULL            Luleå
6           Germany         NULL            Mannheim
7           France          NULL            Strasbourg
8           Spain           NULL            Madrid
9           France          NULL            Marseille
11          UK              NULL            London
...

(60 row(s) affected)

If you want to return all rows for which the region attribute is not WA, including those in which the value is present and different than WA, along with those in which the value is missing, you need to include an explicit test for NULL marks, like this.

SELECT custid, country, region, city
FROM Sales.Customers
WHERE region <> N'WA'
   OR region IS NULL;

This query generates the following output, shown in abbreviated form.

custid      country         region          city
----------- --------------- --------------- ---------------
1           Germany         NULL            Berlin
2           Mexico          NULL            México D.F.
3           Mexico          NULL            México D.F.
4           UK              NULL            London
5           Sweden          NULL            Luleå
6           Germany         NULL            Mannheim
7           France          NULL            Strasbourg
8           Spain           NULL            Madrid
9           France          NULL            Marseille
10          Canada          BC              Tsawassen
...

(88 row(s) affected)

SQL also treats NULL marks inconsistently in different language elements for comparison and sorting purposes. Some elements treat two NULL marks as equal to each other and others treat them as different.

For example, for grouping and sorting purposes, two NULL marks are considered equal. That is, the GROUP BY clause arranges all NULL marks in one group just like present values, and the ORDER BY clause sorts all NULL marks together. Standard SQL leaves it to the product implementation as to whether NULL marks sort before present values or after. T-SQL sorts NULL marks before present values.

As mentioned earlier, query filters “accept TRUE.” An expression comparing two NULL marks yields UNKNOWN; therefore, such a row is filtered out.

For the purposes of enforcing a UNIQUE constraint, standard SQL treats NULL marks as different from each other (allowing multiple NULL marks). Conversely, in T-SQL, a UNIQUE constraint considers two NULL marks as equal (allowing only one NULL if the constraint is defined on a single column).

Keeping in mind the inconsistent treatment SQL has for UNKNOWN and NULL marks and the potential for logical errors, you should explicitly think of NULL marks and three-valued logic in every query that you write. If the default treatment is not what you want, you must intervene explicitly; otherwise, just ensure that the default behavior is in fact what you want.