Table Expressions in Microsoft SQL Server 2008 T-SQL

  • 10/22/2008

Common Table Expressions

Common table expressions (CTEs) are another form of table expression very similar to derived tables, yet with a couple of important advantages. CTEs were introduced in SQL Server 2005 and are part of ANSI SQL:1999 and later standards.

CTEs are defined using a WITH statement and have the following general form:

WITH <CTE_Name>[(<target_column_list>)]
AS
(
  <inner_query_defining_CTE>
)
<outer_query_against_CTE>;

The inner query defining the CTE must follow all requirements mentioned earlier to be valid to define a table expression. As a simple example, the following code defines a CTE called USACusts based on a query that returns all customers from the United States, and the outer query selects all rows from the CTE:

WITH USACusts AS
(
  SELECT custid, companyname
  FROM Sales.Customers
  WHERE country = N'USA'
)
SELECT * FROM USACusts;

As with derived tables, as soon as the outer query finishes, the CTE gets out of scope.

Assigning Column Aliases

CTEs also support two forms of column aliasing—inline and external. For the inline form, specify <expression> AS <column_alias>; for the external form, specify the target column list in parentheses immediately after the CTE name.

Here’s an example of the inline form:

WITH C AS
(
  SELECT YEAR(orderdate) AS orderyear, custid
  FROM Sales.Orders
)
SELECT orderyear, COUNT(DISTINCT custid) AS numcusts
FROM C
GROUP BY orderyear;

And here’s an example of the external form:

WITH C(orderyear, custid) AS
(
  SELECT YEAR(orderdate), custid
  FROM Sales.Orders
)
SELECT orderyear, COUNT(DISTINCT custid) AS numcusts
FROM C
GROUP BY orderyear;

The motivations for using one form or the other are similar to those described in the context of derived tables.

Using Arguments

As with derived tables, you can also use arguments in the query used to define a CTE. Here’s an example:

DECLARE @empid AS INT = 3;

/*
-- Prior to SQL Server 2008 use separate DECLARE and SET statements:
DECLARE @empid AS INT;
SET @empid = 3;
*/
WITH C AS
(
  SELECT YEAR(orderdate) AS orderyear, custid
  FROM Sales.Orders
  WHERE empid = @empid
)
SELECT orderyear, COUNT(DISTINCT custid) AS numcusts
FROM C
GROUP BY orderyear;

Defining Multiple CTEs

On the surface, the difference between derived tables and CTEs might seem to be merely semantic. However, the fact that you first define a CTE and then use it gives it several important advantages over derived tables. One of those advantages is that if you need to refer to one CTE from another, you don’t end up nesting them like derived tables. Instead, you simply define multiple CTEs separated by commas under the same WITH statement. Each CTE can refer to all previously defined CTEs, and the outer query can refer to all CTEs. For example, the following code is the CTE alternative to the nested derived tables approach in Example 5-2:

WITH C1 AS
(
  SELECT YEAR(orderdate) AS orderyear, custid
  FROM Sales.Orders
),
C2 AS
(
  SELECT orderyear, COUNT(DISTINCT custid) AS numcusts
  FROM C1
  GROUP BY orderyear
)
SELECT orderyear, numcusts
FROM C2
WHERE numcusts > 70;

Because you define a CTE before you use it, you don’t end up nesting CTEs. Each CTE appears separately in the code in a modular manner. This modular approach substantially improves the readability and maintainability of the code compared to the nested derived table approach.

Technically you cannot nest CTEs, nor can you define a CTE within the parentheses of a derived table. However, nesting is a problematic practice; therefore, think of these restrictions as aids to code clarity rather than obstacles.

Multiple References

The fact that a CTE is defined first and then queried has another advantage: As far as the FROM clause of the outer query is concerned, the CTE already exists; therefore, you can refer to multiple instances of the same CTE. For example, the following code is the logical equivalent of the code shown earlier in Example 5-3, using CTEs instead of derived tables:

WITH YearlyCount AS
(
  SELECT YEAR(orderdate) AS orderyear,
    COUNT(DISTINCT custid) AS numcusts
  FROM Sales.Orders
  GROUP BY YEAR(orderdate)
)
SELECT Cur.orderyear,
  Cur.numcusts AS curnumcusts, Prv.numcusts AS prvnumcusts,
  Cur.numcusts - Prv.numcusts AS growth
FROM YearlyCount AS Cur
  LEFT OUTER JOIN YearlyCount AS Prv
    ON Cur.orderyear = Prv.orderyear + 1; 

As you can see, the CTE YearlyCount is defined once and accessed twice in the FROM clause of the outer query—once as Cur and once as Prv. You need to maintain only one copy of the CTE query and not multiple copies as you would with derived tables.

If you’re curious about performance, recall that earlier I mentioned that typically table expressions have no performance impact because they are not physically materialized anywhere. Both references to the CTE here are going to be expanded. Internally, this query has a self join between two instances of the Orders table, each of which involves scanning the table data and aggregating it before the join—the same physical processing that takes place with the derived table approach.

Recursive CTEs

This section is optional because it covers subjects that are beyond the fundamentals.

CTEs are unique among table expressions because they have recursive capabilities. A recursive CTE is defined by at least two queries (more are possible)—at least one query known as the anchor member and at least one query known as the recursive member. The general form of a basic recursive CTE looks like this:

WITH <CTE_Name>[(<target_column_list>)]
AS
(
  <anchor_member>
  UNION ALL
  <recursive_member>
)
<outer_query_against_CTE>;

The anchor member is a query that returns a valid relational result table—like a query that is used to define a nonrecursive table expression. The anchor member query is invoked only once.

The recursive member is a query that has a reference to the CTE name. The reference to the CTE name represents what is logically the previous result set in a sequence of executions. The first time that the recursive member is invoked, the previous result set represents whatever the anchor member returned. In each subsequent invocation of the recursive member, the reference to the CTE name represents the result set returned by the previous invocation of the recursive member. The recursive member has no explicit recursion termination check—the termination check is implicit. The recursive member is invoked repeatedly until it returns an empty set, or exceeds some limit.

Both queries must be compatible in terms of the number of columns they return and the data types of the corresponding columns.

The reference to the CTE name in the outer query represents the unified result sets of the invocation of the anchor member and all invocations of the recursive member.

If this is your first encounter with recursive CTEs, you might find this explanation hard to understand. They are best explained with an example. The following code demonstrates how to use a recursive CTE to return information about an employee (Don Funk, employee ID 2) and all of the employee’s subordinates in all levels (direct or indirect):

WITH EmpsCTE AS
(
  SELECT empid, mgrid, firstname, lastname
  FROM HR.Employees
  WHERE empid = 2

  UNION ALL

  SELECT C.empid, C.mgrid, C.firstname, C.lastname
  FROM EmpsCTE AS P
    JOIN HR.Employees AS C
      ON C.mgrid = P.empid
)
SELECT empid, mgrid, firstname, lastname
FROM EmpsCTE;

The anchor member queries the HR.Employees table and simply returns the row for employee 2:

SELECT empid, mgrid, firstname, lastname
FROM HR.Employees
WHERE empid = 2

The recursive member joins the CTE—representing the previous result set—with the Employees table to return the direct subordinates of the employees returned in the previous result set:

SELECT C.empid, C.mgrid, C.firstname, C.lastname
FROM EmpsCTE AS P
  JOIN HR.Employees AS C
    ON C.mgrid = P.empid

In other words, the recursive member is invoked repeatedly, and in each invocation it returns the next level of subordinates. The first time the recursive member is invoked it returns the direct subordinates of employee 2—employees 3 and 5. The second time the recursive member is invoked, it returns the direct subordinates of employees 3 and 5—employees 4, 6, 7, 8, and 9. The third time the recursive member is invoked, there are no more subordinates; the recursive member returns an empty set and therefore recursion stops.

The reference to the CTE name in the outer query represents the unified result sets; in other words, employee 2 and all of the employee’s subordinates.

Here’s the output of this code:

empid       mgrid       firstname  lastname
----------- ----------- ---------- --------------------
2           1           Don        Funk
3           2           Judy       Lew
5           2           Sven       Buck
6           5           Paul       Suurs
7           5           Russell    King
9           5           Zoya       Dolgopyatova
4           3           Yael       Peled
8           3           Maria      Cameron

In the event of a logical error in the join predicate in the recursive member, or problems with the data resulting in cycles, the recursive member can potentially be invoked an infinite number of times. As a safety measure, by default SQL Server restricts the number of times that the recursive member can be invoked to 100. The code will fail upon the 101st invocation of the recursive member. You can change the default maximum recursion limit by specifying the hint OPTION(MAXRECURSION n) at the end of the outer query, where n is an integer in the range 0 through 32,767 representing the maximum recursion limit you want to set. If you want to remove the restriction altogether, specify MAXRECURSION 0. Note that SQL Server stores the intermediate result sets returned by the anchor and recursive members in a work table in tempdb; if you remove the restriction and have a runaway query, the work table will quickly get very large. If tempdb can’t grow anymore—for example, when you run out of disk space—the query will fail.