Table Expressions in Microsoft SQL Server 2008 T-SQL

  • 10/22/2008
The focus of this chapter from Microsoft SQL Server 2008 T-SQL Fundamentals is SELECT queries against table expressions. This chapter also introduces the APPLY table operator used in conjunction with a table expression.

Table expressions are named query expressions that represent a valid relational table. You can use them in data manipulation statements similar to other tables. Microsoft SQL Server supports four types of table expressions: derived tables, common table expressions (CTEs), views, and inline table-valued functions (inline TVFs), each of which I will describe in detail in this chapter. The focus of this chapter is SELECT queries against table expressions; Chapter 8, covers modifications against table expressions.

Table expressions are not physically materialized anywhere—they are virtual. A query against a table expression is internally translated to a query against the underlying objects. The benefits of using table expressions are typically related to logical aspects of your code and not to performance. For example, table expressions help you simplify your solutions by using a modular approach. Table expressions also help you circumvent certain restrictions in the language, such as the inability to refer to column aliases assigned in the SELECT clause in query clauses that are logically processed prior to the SELECT clause.

This chapter also introduces the APPLY table operator used in conjunction with a table expression. I will explain how to use this operator to apply a table expression to each row of another table.

Derived Tables

Derived tables (also known as table subqueries) are defined in the FROM clause of an outer query. Their scope of existence is the outer query. As soon as the outer query is finished, the derived table is gone.

You specify the query defining the derived table within parentheses, followed by the AS clause and the derived table name. For example, the following code defines a derived table called USACusts based on a query that returns all customers from the United States, and the outer query selects all rows from the derived table:

USE TSQLFundamentals2008;

SELECT *
FROM (SELECT custid, companyname
      FROM Sales.Customers
      WHERE country = N'USA') AS USACusts;

In this particular case, which is a simple example of the basic syntax, a derived table is not needed because the outer query doesn’t apply any manipulation.

The code in this basic example returns the following output:

custid      companyname
----------- ---------------
32          Customer YSIQX
36          Customer LVJSO
43          Customer UISOJ
45          Customer QXPPT
48          Customer DVFMB
55          Customer KZQZT
65          Customer NYUHS
71          Customer LCOUJ
75          Customer XOJYP
77          Customer LCYBZ
78          Customer NLTYP
82          Customer EYHKM
89          Customer YBQTI

A query must meet three requirements to be valid to define a table expression of any kind:

  1. Order is not guaranteed. A table expression is supposed to represent a relational table, and the rows in a relational table have no guaranteed order. Recall that this aspect of a relation stems from set theory. For this reason, ANSI SQL disallows an ORDER BY clause in queries that are used to define table expressions. T-SQL follows this restriction for the most part, with one exception—when TOP is also specified. In the context of a query with the TOP option, the ORDER BY clause serves a logical purpose: defining for the TOP option which rows to filter. If you use a query with TOP and ORDER BY to define a table expression, ORDER BY is only guaranteed to serve the logical filtering purpose for the TOP option and not the usual presentation purpose. If the outer query against the table expression does not have a presentation ORDER BY, the output is not guaranteed to be returned in any particular order. The section “Views and the ORDER BY Clause,” later in this chapter, provides more detail on this item.

  2. All columns must have names. All columns in a table must have names; therefore, you must assign column aliases to all expressions in the SELECT list of the query that is used to define a table expression.

  3. All column names must be unique. All column names in a table must be unique; therefore, a table expression that has multiple columns with the same name is invalid. This might happen when the query defining the table expression joins two tables, and both tables have a column with the same name. If you need to incorporate both columns in your table expression, they must have different column names. You can resolve this by assigning the two columns with different column aliases.

Assigning Column Aliases

One of the benefits of using table expressions is that in any clause of the outer query you can refer to column aliases that were assigned in the SELECT clause of the inner query. This helps you get around the fact that you can’t refer to column aliases assigned in the SELECT clause in query clauses that are logically processed prior to the SELECT clause (for example, WHERE or GROUP BY).

For example, suppose that you need to write a query against the Sales.Orders table and return the number of distinct customers handled in each order year. The following attempt is invalid because the GROUP BY clause refers to a column alias that was assigned in the SELECT clause, and the GROUP BY clause is logically processed prior to the SELECT clause:

SELECT
  YEAR(orderdate) AS orderyear,
  COUNT(DISTINCT custid) AS numcusts
FROM Sales.Orders
GROUP BY orderyear;

You could solve the problem by referring to the expression YEAR(orderdate) in both the GROUP BY and the SELECT clauses, but this is an example with a short expression. What if the expression were much longer? Maintaining two copies of the same expression might hurt code readability and maintainability and is more prone to errors. To solve the problem in a way that requires only one copy of the expression, you can use a table expression like so:

Example 5-1. Query with a Derived Table Using Inline Aliasing Form

SELECT orderyear, COUNT(DISTINCT custid) AS numcusts
FROM (SELECT YEAR(orderdate) AS orderyear, custid
      FROM Sales.Orders) AS D
GROUP BY orderyear;

This query returns the following output:

orderyear   numcusts
----------- -----------
2006        67
2007        86
2008        81

This code defines a derived table called D based on a query against the Orders table that returns the order year and customer ID from all rows. The SELECT list of the inner query uses inline aliasing format to assign the alias orderyear to the expression YEAR(orderdate). The outer query can refer to the orderyear column alias in both the GROUP BY and SELECT clauses, because as far as the outer query is concerned, it queries a table called D with columns called orderyear and custid.

As I mentioned earlier, SQL Server expands the definition of the table expression and accesses the underlying objects directly. After expansion, the query in Example 5-1 looks like this:

SELECT YEAR(orderdate) AS orderyear, COUNT(DISTINCT custid) AS numcusts
FROM Sales.Orders
GROUP BY YEAR(orderdate);

This is just to emphasize that you use table expressions for logical (not performance-related) reasons. Generally speaking, table expressions have neither positive nor negative performance impact.

The code in Example 5-1 uses the inline aliasing format to assign column aliases to expressions. The syntax for inline aliasing is <expression> [AS] <alias>. Note that the word AS is optional in the syntax for inline aliasing; however, I find that it helps the readability of the code and recommend using it.

In some cases, you might prefer to use a second supported form for assigning column aliases, which you can think of as an external form. With this form you do not assign column aliases following the expressions in the SELECT list—you specify all target column names in parentheses following the table expression’s name like so:

SELECT orderyear, COUNT(DISTINCT custid) AS numcusts
FROM (SELECT YEAR(orderdate), custid
      FROM Sales.Orders) AS D(orderyear, custid)
GROUP BY orderyear;

It is generally recommended that you use the inline form for a couple of reasons. If you need to debug the code when using the inline form, when you highlight the query defining the table expression and run it, the columns in the result appear with the aliases you assigned. With the external form, you cannot include the target column names when you highlight the table expression query, so the result appears with no column names in the case of the unnamed expressions. Also, when the table expression query is lengthy, using the external form it can be quite difficult to figure out which column alias belongs to which expression.

Even though it’s a best practice to use the inline aliasing form, in some cases you may find the external form more convenient to work with. For example, when the query defining the table expression isn’t going to undergo any further revisions and you want to treat it like a “black box”—you want to focus your attention on the table expression name followed by the target column list when you look at the outer query.

Using Arguments

In the query defining a derived table, you can refer to arguments. The arguments can be local variables and input parameters to a routine such as a stored procedure or function. For example, the following code declares and initializes a local variable called @empid, and the query in the code that is used to define the derived table D refers to the local variable in the WHERE clause:

DECLARE @empid AS INT = 3;

/*
-- Prior to SQL Server 2008 use separate DECLARE and SET statements:
DECLARE @empid AS INT;
SET @empid = 3;
*/
SELECT orderyear, COUNT(DISTINCT custid) AS numcusts
FROM (SELECT YEAR(orderdate) AS orderyear, custid
      FROM Sales.Orders
      WHERE empid = @empid) AS D
GROUP BY orderyear;

This query returns the number of distinct customers per year that handled the orders of the input employee (the employee whose ID is stored in the variable @empid). Here’s the output of this query:

orderyear   numcusts
----------- -----------
2006        16
2007        46
2008        30

Nesting

If you need to define a derived table using a query that by itself refers to a derived table, you end up nesting derived tables. Nesting of derived tables is a result of the fact that a derived table is defined in the FROM clause of the outer query and not separately. Nesting is a problematic aspect of programming in general as it tends to complicate the code and reduce its readability.

For example, the code in Example 5-2 returns order years and the number of customers handled in each year only for years in which more than 70 customers were handled:

Example 5-2. Query with Nested Derived Tables

SELECT orderyear, numcusts
FROM (SELECT orderyear, COUNT(DISTINCT custid) AS numcusts
      FROM (SELECT YEAR(orderdate) AS orderyear, custid
            FROM Sales.Orders) AS D1
      GROUP BY orderyear) AS D2
WHERE numcusts > 70;

This code returns the following output:

orderyear   numcusts
----------- -----------
2007        86
2008        81

The purpose of the innermost derived table, D1, is to assign the column alias orderyear to the expression YEAR(orderdate). The query against D1 refers to orderyear in both the GROUP BY and SELECT clauses, and assigns the column alias numcusts to the expression COUNT(DISTINCT custid). The query against D1 is used to define the derived table D2. The query against D2 refers to numcusts in the WHERE clause to filter order years in which more than 70 customers were handled.

The whole purpose of using table expressions in this example was to simplify the solution by reusing column aliases instead of repeating expressions. However, with the complexity added by the nesting aspect of derived tables, I’m not sure that the solution is simpler than the alternative, which does not make any use of derived tables but instead repeats expressions:

SELECT YEAR(orderdate) AS orderyear, COUNT(DISTINCT custid) AS numcusts
FROM Sales.Orders
GROUP BY YEAR(orderdate)
HAVING COUNT(DISTINCT custid) > 70;

In short, nesting is a problematic aspect of derived tables.

Multiple References

Another problematic aspect of derived tables stems from the fact that derived tables are defined in the FROM clause of the outer query and not prior to the outer query. As far as the FROM clause of the outer query is concerned, the derived table doesn’t exist yet; therefore, if you need to refer to multiple instances of the derived table, you can’t. Instead, you have to define multiple derived tables based on the same query. The query in Example 5-3 provides an example:

Example 5-3. Multiple Derived Tables Based on the Same Query

SELECT Cur.orderyear,
  Cur.numcusts AS curnumcusts, Prv.numcusts AS prvnumcusts,
  Cur.numcusts - Prv.numcusts AS growth
FROM (SELECT YEAR(orderdate) AS orderyear,
        COUNT(DISTINCT custid) AS numcusts
      FROM Sales.Orders
      GROUP BY YEAR(orderdate)) AS Cur
  LEFT OUTER JOIN
     (SELECT YEAR(orderdate) AS orderyear,
        COUNT(DISTINCT custid) AS numcusts
      FROM Sales.Orders
      GROUP BY YEAR(orderdate)) AS Prv
    ON Cur.orderyear = Prv.orderyear + 1;

This query joins two instances of a table expression to create two derived tables: the first derived table, Cur, represents current years, and the second derived table, Prv, represents previous years. The join condition Cur.orderyear = Prv.orderyear + 1 ensures that each row from the first derived table matches with the previous year of the second. By making it a LEFT outer join, the first year that has no previous year is also returned from the Cur table. The SELECT clause of the outer query calculates the difference between the number of customers handled in the current and previous years.

The code in Example 5-3 produces the following output:

orderyear   curnumcusts prvnumcusts growth
----------- ----------- ----------- -----------
2006        67          NULL        NULL
2007        86          67          19
2008        81          86          –5

The fact that you cannot refer to multiple instances of the same derived table forces you to maintain multiple copies of the same query definition. This leads to lengthy code that is hard to maintain and is prone to errors.