SQL Windowing

Query Elements Supporting Window Functions

Window functions aren’t supported in all query clauses; rather, they’re supported only in the SELECT and ORDER BY clauses. To help you understand the reason for this restriction, I first need to explain a concept called logical query processing. Then I’ll get to the clauses that support window functions, and finally I’ll explain how to circumvent the restriction with the other clauses.

Logical Query Processing

Logical query processing describes the conceptual way in which a SELECT query is evaluated according to the logical language design. It describes a process made of a series of steps, or phases, that proceed from the query’s input tables to the query’s final result set. Note that by “logical query processing,” I mean the conceptual way in which the query is evaluated—not necessarily the physical way SQL Server processes the query. As part of the optimization, SQL Server can make shortcuts, rearrange the order of some steps, and pretty much do whatever it likes. But that’s as long as it guarantees that it will produce the same output as the one defined by logical query processing applied to the declarative query request.

Each step in logical query processing operates on one or more virtual tables (sets of rows) that serve as its input and return a virtual table as its output. The output virtual table of one step then becomes the input virtual table for the next step.

Figure 1-7 is a flow diagram illustrating the logical query processing flow in SQL Server.

FIGURE 1-7

FIGURE 1-7 Logical query processing.

Note that when you write a query, the SELECT clause appears first in terms of the keyed-in, or typed, order, but observe that in terms of the logical query processing order, it appears almost last—just before the ORDER BY clause is handled.

There’s much more to say about logical query processing, but the details are a topic for another book. For the purposes of our discussion, what’s important to note is the order in which the various clauses are evaluated. The following list shows the order (with the phases in which window functions are allowed shown in bold):

1. FROM

2. WHERE

3. GROUP BY

4. HAVING

5. SELECT

5-1. Evaluate Expressions

5-2. Remove Duplicates

6. ORDER BY

7. OFFSET-FETCH/TOP

Understanding logical query processing and the logical query processing order enables you to understand the necessity behind restricting window functions to only specific clauses.

Clauses Supporting Window Functions

As illustrated in Figure 1-7, only the query clauses SELECT and ORDER BY support window functions directly. The reason for the limitation is to avoid ambiguity by operating on (almost) the final result set of the query as the starting point for the windowed calculation. If window functions are allowed in phases previous to the SELECT phase, their initial window could be different from that in the SELECT phase, and therefore, with some query forms, it could be very difficult to figure out the right result. I’ll try to demonstrate the ambiguity problem through an example. First run the following code to create the table T1 and populate it with sample data:

SET NOCOUNT ON;
USE TSQLV5;

DROP TABLE IF EXISTS dbo.T1;
GO

CREATE TABLE dbo.T1
(
  col1 VARCHAR(10) NOT NULL
    CONSTRAINT PK_T1 PRIMARY KEY
);

INSERT INTO dbo.T1(col1) 
  VALUES('A'),('B'),('C'),('D'),('E'),('F');

Suppose that window functions were allowed in phases prior to the SELECT—for example, in the WHERE phase. Consider then the following query, and try to figure out which col1 values should appear in the result:

SELECT col1
FROM dbo.T1
WHERE col1 > 'B'
  AND ROW_NUMBER() OVER(ORDER BY col1) <= 3;

Before you assume that the answer should obviously be the values C, D, and E, consider the all-at-once concept in SQL. The concept of all-at-once means that all expressions that appear in the same logical phase are conceptually evaluated at the same point in time. This means that the order in which the expressions are evaluated shouldn’t matter. With this in mind, the following query should be semantically equivalent to the previous one:

SELECT col1
FROM dbo.T1
WHERE ROW_NUMBER() OVER(ORDER BY col1) <= 3
  AND col1 > 'B';

Now can you figure out what the right answer is? Is it C, D, and E, or is it just C?

That’s an example of the ambiguity I was talking about. By restricting window functions to only the SELECT and ORDER BY clauses of a query, this ambiguity is eliminated.

Looking at Figure 1-7, you might have noticed that within the SELECT phase, it’s step 5-1 (Evaluate Expressions) that supports window functions, and this step is evaluated before step 5-2 (Remove Duplicates). If you wonder why it is important to know such subtleties, I’ll demonstrate why.

Following is a query returning the empid and country attributes of all employees from the Employees table, followed by its output:

SELECT empid, country
FROM HR.Employees;

empid       country
----------- ---------------
1           USA
2           USA
3           USA
4           USA
5           UK
6           UK
7           UK
8           USA
9           UK

Next, examine the following query and see if you can guess what its output is before executing it:

SELECT DISTINCT country, ROW_NUMBER() OVER(ORDER BY country) AS rownum
FROM HR.Employees;

Some expect to get the following output:

country          rownum
--------------- --------------------
UK              1
USA             2

But in reality, you get this:

country  rownum
-------- -------
UK       1
UK       2
UK       3
UK       4
USA      5
USA      6
USA      7
USA      8
USA      9

Now consider that the ROW_NUMBER function in this query is evaluated in step 5-1 where the SELECT list expressions are evaluated—prior to the removal of the duplicates in step 5-2. The ROW_NUMBER function assigns nine unique row numbers to the nine employee rows, and then the DISTINCT clause has no duplicates left to remove.

When you realize this and understand that it has to do with the logical query processing order of the different elements, you can think of a solution. For example, you can have a table expression defined based on a query that just returns distinct countries and have the outer query assign the row numbers after duplicates are removed, like so:

WITH EmpCountries AS
(
  SELECT DISTINCT country FROM HR.Employees
)
SELECT country, ROW_NUMBER() OVER(ORDER BY country) AS rownum
FROM EmpCountries;

country  rownum
-------- -------
UK       1
USA      2

Can you think of other ways to solve the problem, perhaps even simpler ways than this one?

The fact that window functions are evaluated in the SELECT or ORDER BY phase means that the window defined for the calculation—before applying further restrictions—is the intermediate form of rows of the query after all previous phases. This means that window functions are evaluated after applying the FROM with all of its table operators (for example, joins), and after the WHERE filtering, the grouping, and the filtering of the groups. Consider the following query as an example:

SELECT O.empid,
  SUM(OD.qty) AS qty,
  RANK() OVER(ORDER BY SUM(OD.qty) DESC) AS rnk
FROM Sales.Orders AS O
  INNER JOIN Sales.OrderDetails AS OD
    ON O.orderid = OD.orderid
WHERE O.orderdate >= '20180101'
  AND O.orderdate < '20190101'
GROUP BY O.empid;

empid  qty   rnk
------ ----- ----
4      5273  1
3      4436  2
1      3877  3
8      2843  4
2      2604  5
7      2292  6
6      1738  7
5      1471  8
9      955   9

First the FROM clause is evaluated and the join is performed. Then, because of the filter, only the rows where the order year is 2018 remain. Then the remaining rows are grouped by employee ID. Only then are the expressions in the SELECT list evaluated, including the RANK function, which is calculated based on ordering by the total quantity descending. If there were other window functions in the SELECT list, they would all use the same result set as their starting point. Recall from earlier discussions about alternative options to window functions (for example, subqueries) that they start their view of the data from scratch—meaning that you have to repeat all the logic you have in the outer query in each of your subqueries, leading to much more verbose code.

Now back to my question of whether you can think of a simpler way to assign row numbers to distinct employee countries compared to the solution with the CTE; here’s a simpler solution:

SELECT country, ROW_NUMBER() OVER(ORDER BY country) AS rownum
FROM HR.Employees
GROUP BY country;

There are nine employees initially, but within them, there are two distinct countries; therefore, the grouping phase defines two groups. The expressions in the SELECT clause, including the ROW_NUMBER calculation, are then applied to the grouped data, producing the two expected result rows:

country  rownum
-------- -------
UK       1
USA      2

Circumventing the Limitations

I explained the reasoning behind disallowing the use of window functions in logical query processing phases that are evaluated prior to the SELECT clause. But what if you need to filter by or group by a calculation based on window functions? The solution is to use a table expression such as a CTE or a derived table. Have a query invoke the window function in its SELECT list, assigning the calculation an alias. Define a table expression based on that query, and then have the outer query refer to that alias where you need it.

Here’s an example showing how you can filter by the result of a window function using a CTE:

WITH C AS
(
  SELECT orderid, orderdate, val,
    RANK() OVER(ORDER BY val DESC) AS rnk
  FROM Sales.OrderValues
)
SELECT *
FROM C
WHERE rnk <= 5;

orderid  orderdate  val       rnk
-------- ---------- --------- ----
10865    2019-02-02 16387.50  1
10981    2019-03-27 15810.00  2
11030    2019-04-17 12615.05  3
10889    2019-02-16 11380.00  4
10417    2018-01-16 11188.40  5

With modification statements, window functions are disallowed altogether because those don’t support SELECT and ORDER BY clauses. But there are cases where involving window functions in modification statements is needed. Table expressions can be used to address this need as well because T-SQL supports modifying data through table expressions. I’ll demonstrate this capability with an UPDATE example. First, run the following code to create a table called T1 with columns col1 and col2 and populate it with sample data:

SET NOCOUNT ON;
USE TSQLV5;

DROP TABLE IF EXISTS dbo.T1;
GO

CREATE TABLE dbo.T1
(
  col1 INT NULL,
  col2 VARCHAR(10) NOT NULL
);

INSERT INTO dbo.T1(col2) 
  VALUES('C'),('A'),('B'),('A'),('C'),('B');

Explicit values were provided in col2, and NULLs were used as defaults in col1.

Suppose this table represents a situation with data-quality problems. A key wasn’t enforced in this table; therefore, it is not possible to uniquely identify rows. You want to assign unique col1 values in all rows. You’re thinking of using the ROW_NUMBER function in an UPDATE statement, like so:

UPDATE dbo.T1
  SET col1 = ROW_NUMBER() OVER(ORDER BY col2);

However, remember that this is not allowed. The workaround is to write a query against T1 returning col1 and an expression based on the ROW_NUMBER function (call it rownum); define a table expression based on this query; finally, have an outer UPDATE statement against the CTE assign rownum to col1, like so:

WITH C AS
(
  SELECT col1, col2,
     ROW_NUMBER() OVER(ORDER BY col2) AS rownum
  FROM dbo.T1
)
UPDATE C
  SET col1 = rownum;

Query T1, and observe that all rows got unique col1 values:

SELECT col1, col2
FROM dbo.T1;

col1  col2
----- -----
5     C
1     A
3     B
2     A
6     C
4     B