- By Itzik Ben-Gan
A Glimpse of Solutions Using Window Functions
The first five chapters of the book describe window functions, their optimization, and related analytical features. The material is very technical, and even though I find it fascinating, I can see how some might find it a bit dry. What’s usually much more interesting for people to read about is the use of the functions to solve practical problems, which is what this book gets to in the final chapter. When you see how window functions are used in problem solving, you truly realize their value. So, how can I convince you it’s worth your while to go through the more technical parts and not give up reading before you get to the more interesting part later? What if I give you a glimpse of a solution using window functions right now?
The querying task I will address here involves querying a table holding a sequence of values in some column and identifying the consecutive ranges of existing values. This problem is also known as the islands problem. The sequence can be a numeric one, a temporal one (which is more common), or any data type that supports total ordering. The sequence can have unique values or allow duplicates. The interval can be any fixed interval that complies with the column’s type (for example, the integer 1, the integer 7, the temporal interval 1 day, the temporal interval 2 weeks, and so on). In Chapter 6, I will get to the different variations of the problem. Here, I’ll just use a simple case to give you a sense of how it works—using a numeric sequence with the integer 1 as the interval. Use the following code to generate the sample data for this task:
SET NOCOUNT ON; USE TSQLV5; DROP TABLE IF EXISTS dbo.T1; GO CREATE TABLE dbo.T1 ( col1 INT NOT NULL CONSTRAINT PK_T1 PRIMARY KEY ); INSERT INTO dbo.T1(col1) VALUES(2),(3),(11),(12),(13),(27),(33),(34),(35),(42);
As you can see, there are some gaps in the col1 sequence in T1. Your task is to identify the consecutive ranges of existing values (also known as islands) and return the start and end of each island. Here’s what the desired result should look like:
startrange endrange ----------- ----------- 2 3 11 13 27 27 33 35 42 42
If you’re curious as to the practicality of this problem, there are numerous production examples. Examples include producing availability reports, identifying periods of activity (for example, sales), identifying consecutive periods in which a certain criterion is met (for example, periods where a stock value was above or below a certain threshold), identifying ranges of license plates in use, and so on. The current example is very simplistic on purpose so that we can focus on the techniques used to solve it. The technique you will use to solve a more complicated case requires minor adjustments to the one you use to address the simple case. So, consider it a challenge to come up with an efficient, set-based solution to this task. Try to first come up with a solution that works against the small set of sample data. Then repopulate the table with a decent number of rows—say, 10,000,000—and try your technique again. See how it performs. Only then take a look at my solutions.
Before showing the solution using window functions, I’ll show one of the many possible solutions that use more traditional language constructs. In particular, I’ll show one that uses subqueries. To explain the strategy of the first solution, examine the values in the T1.col1 sequence, where I added a conceptual attribute that doesn’t exist at the moment and that I think of as a group identifier:
col1 grp ----- --- 2 a 3 a 11 b 12 b 13 b 27 c 33 d 34 d 35 d 42 e
The grp attribute doesn’t exist yet. Conceptually, it is a value that uniquely identifies an island. This means that it must be the same for all members of the same island and different from the values generated for other islands. If you manage to calculate such a group identifier, you can then group the result by this grp attribute and return the minimum and maximum col1 values in each group (island). One way to produce this group identifier using traditional language constructs is to calculate, for each current col1 value, the minimum col1 value that is greater than or equal to the current one, and that has no following value.
As an example, following this logic, try to identify with respect to the value 2 what the minimum col1 value is that is greater than or equal to 2 and that appears before a missing value. It’s 3. Now try to do the same with respect to 3. You also get 3. So, 3 is the group identifier of the island that starts with 2 and ends with 3. For the island that starts with 11 and ends with 13, the group identifier for all members is 13. As you can see, the group identifier for all members of a given island is actually the last member of that island.
Here’s the T-SQL code required to implement this concept:
SELECT col1, (SELECT MIN(B.col1) FROM dbo.T1 AS B WHERE B.col1 >= A.col1 -- is this row the last in its group? AND NOT EXISTS (SELECT * FROM dbo.T1 AS C WHERE C.col1 = B.col1 + 1)) AS grp FROM dbo.T1 AS A;
This query generates the following output:
col1 grp ----------- ----------- 2 3 3 3 11 13 12 13 13 13 27 27 33 35 34 35 35 35 42 42
The next part is pretty straightforward—define a table expression based on the last query, and in the outer query, group by the group identifier and return the minimum and maximum col1 values for each group, like so:
SELECT MIN(col1) AS startrange, MAX(col1) AS endrange FROM (SELECT col1, (SELECT MIN(B.col1) FROM dbo.T1 AS B WHERE B.col1 >= A.col1 AND NOT EXISTS (SELECT * FROM dbo.T1 AS C WHERE C.col1 = B.col1 + 1)) AS grp FROM dbo.T1 AS A) AS D GROUP BY grp;
There are two main problems with this solution. First, it’s a bit complicated to follow the logic here. Second, it’s horribly slow. I don’t want to start going over query execution plans yet—there will be plenty of this later in the book—but I can tell you that for each row in the table, SQL Server will perform almost two complete scans of the data. Now think of a sequence of 10,000,000 rows and try to translate it to the amount of work involved. The total number of rows that will need to be processed is simply enormous, and the scaling of the solution is quadratic.
The next solution is also one that calculates a group identifier but using window functions. The first step in the solution is to use the ROW_NUMBER function to calculate row numbers based on col1 ordering. I will provide the gory details about the ROW_NUMBER function later in the book; for now, it suffices to say that it computes unique integers within the partition starting with 1 and incrementing by 1 based on the given ordering.
With this in mind, the following query returns the col1 values and row numbers based on col1 ordering, followed by its output:
SELECT col1, ROW_NUMBER() OVER(ORDER BY col1) AS rownum FROM dbo.T1; col1 rownum ----------- -------------------- 2 1 3 2 11 3 12 4 13 5 27 6 33 7 34 8 35 9 42 10
Now focus your attention on the two sequences. One (col1) is a sequence with gaps, and the other (rownum) is a sequence without gaps. With this in mind, try to figure out what’s unique to the relationship between the two sequences in the context of an island. Within an island, both sequences keep incrementing by a fixed interval. Therefore, the difference between the two is constant. For instance, consider the island that starts with 11 and ends with 13. The col1 values in this island are 11, 12, and 13, and the respective row numbers are 3, 4, and 5. The difference between the col1 values and the respective row numbers is the constant 8 throughout this island. For the next island, col1 increases by more than 1, whereas rownum increases just by 1, so the difference keeps growing. In other words, the difference between the two is constant and unique for each island. Run the following query to calculate this difference:
SELECT col1, col1 - ROW_NUMBER() OVER(ORDER BY col1) AS diff FROM dbo.T1; col1 diff ----------- -------------------- 2 1 3 1 11 8 12 8 13 8 27 21 33 26 34 26 35 26 42 32
You can see that this difference satisfies the two requirements of our group identifier (being constant and unique per island); therefore, you can use it as such. The rest is the same as in the previous solution; namely, you group the rows by the group identifier and return the minimum and maximum col1 values in each group, like so:
WITH C AS ( SELECT col1, -- the difference is constant and unique per island col1 - ROW_NUMBER() OVER(ORDER BY col1) AS grp FROM dbo.T1 ) SELECT MIN(col1) AS startrange, MAX(col1) AS endrange FROM C GROUP BY grp;
Observe how concise and simple the solution is. Of course, it’s always a good idea to add comments to help those who see the solution for the first time better understand it.
The solution is also highly efficient, and with the right indexing, scales linearly. The work involved in assigning the row numbers is negligible compared to the previous solution. It’s just a single ordered scan of the index on col1 and an operator that keeps incrementing a counter. In a performance test I ran with a sequence with 10,000,000 rows, this query finished in 3 seconds. Other solutions ran for a much longer time.
I hope that this glimpse to solutions using window functions was enough to intrigue you and help you see that they contain immense power. Now we’ll get back to studying the technicalities of window functions. Later in the book, you will have a chance to see many more examples.