Aggregating and Pivoting Data in Microsoft SQL Server 2008 T-SQL

  • 3/25/2009

Grouping Sets

A grouping set is simply a set of attributes that you group by, such as in a query that has the following GROUP BY clause:

GROUP BY custid, empid, YEAR(orderdate)

You define a single grouping set—(custid, empid, YEAR(orderdate)). Traditionally, aggregate queries define a single grouping set, as demonstrated in the previous example. SQL Server supports features that allow you to define multiple grouping sets in the same query and return a single result set with aggregates calculated for the different grouping sets. The ability to define multiple grouping sets in the same query was available prior to SQL Server 2008 in the form of options called WITH CUBE and WITH ROLLUP and a helper function called GROUPING. However, those options were neither standard nor flexible enough. SQL Server 2008 introduces several new features that allow you to define multiple grouping sets in the same query. The new features include the GROUPING SETS, CUBE, and ROLLUP subclauses of the GROUP BY clause (not to be confused with the older WITH CUBE and WITH ROLLUP options) and the helper function GROUPING_ID. These new features are ISO compliant and substantially more flexible than the older, nonstandard ones.

Before I provide the technicalities of the grouping sets–related features, I’d like to explain the motivation for using those and the kind of problems that they solve. If you’re interested only in the technicalities, feel free to skip this section.

Consider a data warehouse with a large volume of sales data. Users of this data warehouse frequently need to analyze aggregated views of the data by various dimensions, such as customer, employee, product, time, and so on. When a user such as a sales manager starts the analysis process, the user asks for some initial aggregated view of the data—for example, the total quantities for each customer and year. This request translates in more technical terms to a request to aggregate data for the grouping set (custid, YEAR(orderdate)). The user then analyzes the data, and based on the findings the user makes the next request—say, to return total quantities for each year and month. This is a request to aggregate data for a new grouping set—(YEAR(orderdate), MONTH(orderdate)). In this manner the user keeps asking for different aggregated views of the data—in other words, to aggregate data for different grouping sets.

To address such analysis needs of your system’s users, you could develop an application that generates a different GROUP BY query for each user request. Each query would need to scan all applicable base data and process the aggregates. With large volumes of data, this approach is very inefficient, and the response time will probably be unreasonable.

To provide fast response time, you need to preprocess aggregates for all grouping sets that users might ask for and store those in the data warehouse. For example, you could do this every night. When the user requests aggregates for a certain grouping set, the aggregates will be readily available. The problem is that given n dimensions, 2n possible grouping sets can be constructed from those dimensions. For example, with 10 dimensions you get 1,024 grouping sets. If you actually run a separate GROUP BY query for each, it will take a very long time to process all aggregates, and you might not have a sufficient processing window for this.

This is where the new grouping features come into the picture. They allow you to calculate aggregates for multiple grouping sets without rescanning the base data separately for each. Instead, SQL Server scans the data the minimum number of times that the optimizer figures is optimal, calculates the base aggregates, and on top of the base aggregates calculates the super aggregates (aggregates of aggregates).

Note that the product Microsoft SQL Server Analysis Services (SSAS, or just AS) specializes in preprocessing aggregates for multiple grouping sets and storing them in a specialized multidimensional database. It provides very fast response time to user requests, which are made with a language called Multidimensional Expressions (MDX). The recommended approach to handling needs for dynamic analysis of aggregated data is to implement an Analysis Services solution. However, some organizations don’t need the scale and sophistication levels provided by Analysis Services and would rather get the most they can from their relational data warehouse with T-SQL. For those organizations, the new grouping features provided by SQL Server can come in very handy.

The following sections describe the technicalities of the grouping sets–related features supported by SQL Server 2008.

Sample Data

In my examples I will use the Orders table that you create and populate in tempdb by running the code provided earlier in Example 8-1. This code is provided here again for your convenience:

SET NOCOUNT ON;
USE tempdb;

IF OBJECT_ID('dbo.Orders', 'U') IS NOT NULL DROP TABLE dbo.Orders;
GO

CREATE TABLE dbo.Orders
(
  orderid   INT        NOT NULL,
  orderdate DATETIME   NOT NULL,
  empid     INT        NOT NULL,
  custid    VARCHAR(5) NOT NULL,
  qty       INT        NOT NULL,
  CONSTRAINT PK_Orders PRIMARY KEY(orderid)
);
GO

INSERT INTO dbo.Orders
  (orderid, orderdate, empid, custid, qty)
VALUES
  (30001, '20060802', 3, 'A', 10),
  (10001, '20061224', 1, 'A', 12),
  (10005, '20061224', 1, 'B', 20),
  (40001, '20070109', 4, 'A', 40),
  (10006, '20070118', 1, 'C', 14),
  (20001, '20070212', 2, 'B', 12),
  (40005, '20080212', 4, 'A', 10),
  (20002, '20080216', 2, 'C', 20),
  (30003, '20080418', 3, 'B', 15),
  (30004, '20060418', 3, 'C', 22),
  (30007, '20060907', 3, 'D', 30);

The GROUPING SETS Subclause

SQL Server 2008 allows you to define multiple grouping sets in the same query by using the new GROUPING SETS subclause of the GROUP BY clause. Within the outermost pair of parentheses, you specify a list of grouping sets separated by commas. Each grouping set is expressed by a pair of parentheses containing the set’s elements separated by commas. For example, the following query defines four grouping sets:

SELECT custid, empid, YEAR(orderdate) AS orderyear, SUM(qty) AS qty
FROM dbo.Orders
GROUP BY GROUPING SETS
(
  ( custid, empid, YEAR(orderdate) ),
  ( custid, YEAR(orderdate)        ),
  ( empid, YEAR(orderdate)         ),
  ()
);

The first grouping set is (custid, empid, YEAR(orderdate)), the second is (custid, YEAR(orderdate)), the third is (empid, YEAR(orderdate)), and the fourth is the empty grouping set (), which is used to calculate grand totals. This query generates the following output:

custid empid       orderyear   qty
------ ----------- ----------- -----------
A      1           2006        12
B      1           2006        20
NULL   1           2006        32
C      1           2007        14
NULL   1           2007        14
B      2           2007        12
NULL   2           2007        12
C      2           2008        20
NULL   2           2008        20
A      3           2006        10
C      3           2006        22
D      3           2006        30
NULL   3           2006        62
B      3           2008        15
NULL   3           2008        15
A      4           2007        40
NULL   4           2007        40
A      4           2008        10
NULL   4           2008        10
NULL   NULL        NULL        205
A      NULL        2006        22
B      NULL        2006        20
C      NULL        2006        22
D      NULL        2006        30
A      NULL        2007        40
B      NULL        2007        12
C      NULL        2007        14
A      NULL        2008        10
B      NULL        2008        15
C      NULL        2008        20

As you can see in the output of the query, NULLs are used as placeholders in inapplicable attributes. You could also think of these NULLs as indicating that the row represents an aggregate over all values of that column. This way, SQL Server can combine rows associated with different grouping sets to one result set. So, for example, in rows associated with the grouping set (custid, YEAR(orderdate)), the empid column is NULL. In rows associated with the empty grouping set, the columns empid, custid, and orderyear are NULLs and so on.

Compared to a query that unifies the result sets of four GROUP BY queries, our query that uses the GROUPING SETS subclause requires much less code. It has a performance advantage as well. Examine the execution plan of this query shown in Figure 8-3.

Figure 8-3

Figure 8-3 Execution plan of query with GROUPING SETS subclause

Observe that even though the query defines four grouping sets, the execution plan shows only two scans of the data. In particular, observe that the first branch of the plan shows two Stream Aggregate operators. The Sort operator sorts the data by empid, YEAR(orderdate), custid. Based on this sorting, the first Stream Aggregate operator calculates the aggregates for the grouping set (custid, empid, YEAR(orderdate)); the second Stream Aggregate operates on the results of the first and calculates the aggregates for the grouping set (empid, YEAR(orderdate)) and the empty grouping set. The second branch of the plan sorts the data by YEAR(orderdate), custid to allow the Stream Aggregate operator that follows to calculate aggregates for the grouping set (custid, YEAR(orderdate)).

Following is a query that is logically equivalent to the previous one, except that this one actually invokes four GROUP BY queries—one for each grouping set—and unifies their result sets:

SELECT custid, empid, YEAR(orderdate) AS orderyear, SUM(qty) AS qty
FROM dbo.Orders
GROUP BY custid, empid, YEAR(orderdate)

UNION ALL

SELECT custid, NULL AS empid, YEAR(orderdate) AS orderyear, SUM(qty) AS qty
FROM dbo.Orders
GROUP BY custid, YEAR(orderdate)

UNION ALL

SELECT NULL AS custid, empid, YEAR(orderdate) AS orderyear, SUM(qty) AS qty
FROM dbo.Orders
GROUP BY empid, YEAR(orderdate)

UNION ALL

SELECT NULL AS custid, NULL AS empid, NULL AS orderyear, SUM(qty) AS qty
FROM dbo.Orders;

The execution plan for this query is shown in Figure 8-4. You can see that the data is scanned four times.

Figure 8-4

Figure 8-4 Execution plan of code unifying four GROUP BY queries

SQL Server 2008 allows you to define up to 4,096 grouping sets in a single query.

The CUBE Subclause

SQL Server 2008 also introduces the CUBE subclause of the GROUP BY clause (not to be confused with the older WITH CUBE option). The CUBE subclause is merely an abbreviated way to express a large number of grouping sets without actually listing them in a GROUPING SETS subclause. CUBE accepts a list of elements as input and defines all possible grouping sets out of those, including the empty grouping set. In set theory, this is called the power set of a set. The power set of a set V is the set of all subsets of V. Given n elements, CUBE produces 2n grouping sets. For example, CUBE(a, b, c) is equivalent to GROUPING SETS( (a, b, c), (a, b), (a, c), (b, c), (a), (b), (c), () ).

The following query uses the CUBE option to define all four grouping sets that can be made of the elements custid and empid:

SELECT custid, empid, SUM(qty) AS qty
FROM dbo.Orders
GROUP BY CUBE(custid, empid);

This query generates the following output:

custid empid       qty
------ ----------- -----------
A      1           12
B      1           20
C      1           14
NULL   1           46
B      2           12
C      2           20
NULL   2           32
A      3           10
B      3           15
C      3           22
D      3           30
NULL   3           77
A      4           50
NULL   4           50
NULL   NULL        205
A      NULL        72
B      NULL        47
C      NULL        56
D      NULL        30

The following query using the GROUPING SETS subclause is equivalent to the previous query:

SELECT custid, empid, SUM(qty) AS qty
FROM dbo.Orders
GROUP BY GROUPING SETS
  (
    ( custid, empid ),
    ( custid        ),
    ( empid         ),
    ()
  );

Note that each of the elements in the list you provide to CUBE as input can be made of either a single attribute or multiple attributes. The previous CUBE expression used two single-attribute elements. To define a multi-attribute element, simply list the element’s attributes in parentheses. As an example, the expression CUBE( x, y, z ) has three single-attribute elements and defines eight grouping sets: (x, y, z), (x, y), (x, z), (y, z), (x), (y), (z), (). The expression CUBE( (x, y), z ) has one two-attribute element and one single-attribute element and defines four grouping sets: (x, y, z), (x, y), (z), ().

Prior to SQL Server 2008, you could achieve something similar to what the CUBE subclause gives you by using a WITH CUBE option that you specified after the GROUP BY clause, like so:

SELECT custid, empid, SUM(qty) AS qty
FROM dbo.Orders
GROUP BY custid, empid
WITH CUBE;

This is an equivalent to our previous CUBE query, but it has two drawbacks. First, it’s not standard, while the new CUBE subclause is. Second, when you specify the WITH CUBE option, you cannot define additional grouping sets beyond the ones defined by CUBE, while you can with the new CUBE subclause.

The ROLLUP Subclause

The new ROLLUP subclause of the GROUP BY clause is similar to the CUBE subclause. It also allows defining multiple grouping sets in an abbreviated way. However, while CUBE defines all possible grouping sets that can be made of the input elements (the power set), ROLLUP defines only a subset of those. ROLLUP assumes a hierarchy between the input elements. For example, ROLLUP(a, b, c) assumes a hierarchy between the elements a, b, and c. When there is a hierarchy, not all possible grouping sets that can be made of the input elements make sense in terms of having business value. Consider, for example, the hierarchy country, region, city. You can see the business value in the grouping sets (country, region, city), (country, region), (country), and (). But as grouping sets, (city), (region), (region, city) and (country, city) have no business value. For example, the grouping set (city) has no business value because different cities can have the same name, and a business typically needs totals by city, not by city name. When the input elements represent a hierarchy, ROLLUP produces only the grouping sets that make business sense for the hierarchy. Given n elements, ROLLUP will produce n + 1 grouping sets.

The following query shows an example of using the ROLLUP subclause:

SELECT
  YEAR(orderdate) AS orderyear,
  MONTH(orderdate) AS ordermonth,
  DAY(orderdate) AS orderday,
  SUM(qty) AS qty
FROM dbo.Orders
GROUP BY
  ROLLUP(YEAR(orderdate), MONTH(orderdate), DAY(orderdate));

Out of the three input elements, ROLLUP defines four (3 + 1) grouping sets—(YEAR(orderdate), MONTH(orderdate), DAY(orderdate)), (YEAR(orderdate), MONTH(orderdate)), (YEAR(orderdate)), and (). This query generates the following output:

orderyear   ordermonth  orderday    qty
----------- ----------- ----------- -----------
2006        4           18          22
2006        4           NULL        22
2006        8           2           10
2006        8           NULL        10
2006        9           7           30
2006        9           NULL        30
2006        12          24          32
2006        12          NULL        32
2006        NULL        NULL        94
2007        1           9           40
2007        1           18          14
2007        1           NULL        54
2007        2           12          12
2007        2           NULL        12
2007        NULL        NULL        66
2008        2           12          10
2008        2           16          20
2008        2           NULL        30
2008        4           18          15
2008        4           NULL        15
2008        NULL        NULL        45
NULL        NULL        NULL        205

This query is equivalent to the following query that uses the GROUPING SETS subclause to define the aforementioned grouping sets explicitly:

SELECT
  YEAR(orderdate) AS orderyear,
  MONTH(orderdate) AS ordermonth,
  DAY(orderdate) AS orderday,
  SUM(qty) AS qty
FROM dbo.Orders
GROUP BY
  GROUPING SETS
  (
    ( YEAR(orderdate), MONTH(orderdate), DAY(orderdate) ),
    ( YEAR(orderdate), MONTH(orderdate)                 ),
    ( YEAR(orderdate)                                   ),
    ()
  );

Like with CUBE, each of the elements in the list you provide to ROLLUP as input can be made of either a single attribute or multiple attributes. As an example, the expression ROLLUP( x, y, z ) defines four grouping sets: (x, y, z), (x, y), (x), (). The expression ROLLUP( (x, y), z ) defines three grouping sets: (x, y, z), (x, y), ().

Similar to the WITH CUBE option that I described earlier, previous versions of SQL Server prior to SQL Server 2008 supported a WITH ROLLUP option. Following is a query that is equivalent to the previous ROLLUP query, except that it uses the older WITH ROLLUP option:

SELECT
  YEAR(orderdate) AS orderyear,
  MONTH(orderdate) AS ordermonth,
  DAY(orderdate) AS orderday,
  SUM(qty) AS qty
FROM dbo.Orders
GROUP BY YEAR(orderdate), MONTH(orderdate), DAY(orderdate)
WITH ROLLUP;

Like the WITH CUBE option, the WITH ROLLUP option is nonstandard and doesn’t allow you to define further grouping sets in the same query.

Grouping Sets Algebra

One beautiful thing about the design of the grouping sets–related features implemented in SQL Server 2008 is that they support a whole algebra of operations that can help you define a large number of grouping sets using minimal coding. You have support for operations that you can think of as multiplication, division, and addition.

Multiplication

Multiplication means producing a Cartesian product of grouping sets. You perform multiplication by separating GROUPING SETS subclauses (or the abbreviated CUBE and ROLLUP subclauses) by commas. For example, if A represents a set of attributes a1, a2, . . ., an, and B represents a set of attributes b1, b2, . . ., bn, and so on, the product GROUPING SETS( (A), (B), (C) ), GROUPING SETS( (D), (E) ) is equal to GROUPING SETS ( (A, D), (A, E), (B, D), (B, E), (C, D), (C, E) ).

Consider the following query and try to figure out which grouping sets it defines:

SELECT custid, empid,
  YEAR(orderdate) AS orderyear,
  MONTH(orderdate) AS ordermonth,
  DAY(orderdate) AS orderday,
  SUM(qty) AS qty
FROM dbo.Orders
GROUP BY
  CUBE(custid, empid),
  ROLLUP(YEAR(orderdate), MONTH(orderdate), DAY(orderdate));

First, expand the CUBE and ROLLUP subclauses to the corresponding GROUPING SETS subclauses, and you get the following query:

SELECT custid, empid,
  YEAR(orderdate) AS orderyear,
  MONTH(orderdate) AS ordermonth,
  SUM(qty) AS qty
FROM dbo.Orders
GROUP BY
  GROUPING SETS
  (
    ( custid, empid ),
    ( custid        ),
    ( empid         ),
    ()
  ),
  GROUPING SETS
  (
    ( YEAR(orderdate), MONTH(orderdate), DAY(orderdate) ),
    ( YEAR(orderdate), MONTH(orderdate)                 ),
    ( YEAR(orderdate)                                   ),
    ()
  );

Now apply the multiplication between the GROUPING SETS subclauses, and you get the following query:

SELECT custid, empid,
  YEAR(orderdate) AS orderyear,
  MONTH(orderdate) AS ordermonth,
  SUM(qty) AS qty
FROM dbo.Orders
GROUP BY
  GROUPING SETS
  (
    ( custid, empid, YEAR(orderdate), MONTH(orderdate), DAY(orderdate) ),
    ( custid, empid, YEAR(orderdate), MONTH(orderdate)                 ),
    ( custid, empid, YEAR(orderdate)                                   ),
    ( custid, empid                                                    ),
    ( custid, YEAR(orderdate), MONTH(orderdate), DAY(orderdate)        ),
    ( custid, YEAR(orderdate), MONTH(orderdate)                        ),
    ( custid, YEAR(orderdate)                                          ),
    ( custid                                                           ),
    ( empid, YEAR(orderdate), MONTH(orderdate), DAY(orderdate)         ),
    ( empid, YEAR(orderdate), MONTH(orderdate)                         ),
    ( empid, YEAR(orderdate)                                           ),
    ( empid                                                            ),
    ( YEAR(orderdate), MONTH(orderdate), DAY(orderdate)                ),
    ( YEAR(orderdate), MONTH(orderdate)                                ),
    ( YEAR(orderdate)                                                  ),
    ()
  );

Division

When multiple grouping sets in an existing GROUPING SETS subclause share common elements, you can separate the common elements to another GROUPING SETS subclause and multiply the two. The concept is similar to arithmetic division, where you divide operands of an expression by a common element and pull it outside the parentheses. For example, (5×3 + 5×7) can be expressed as (5)×(3 + 7). Based on this logic, you can sometimes reduce the amount of code needed to define multiple grouping sets. For example, see if you can reduce the code in the following query while preserving the same grouping sets:

SELECT
  custid,
  empid,
  YEAR(orderdate) AS orderyear,
  MONTH(orderdate) AS ordermonth,
  SUM(qty) AS qty
FROM dbo.Orders
GROUP BY
  GROUPING SETS
  (
    ( custid, empid, YEAR(orderdate), MONTH(orderdate) ),
    ( custid, empid, YEAR(orderdate)                   ),
    ( custid,        YEAR(orderdate), MONTH(orderdate) ),
    ( custid,        YEAR(orderdate)                   ),
    ( empid,         YEAR(orderdate), MONTH(orderdate) ),
    ( empid,         YEAR(orderdate)                   )
  );

Because YEAR(orderdate) is a common element to all grouping sets, you can move it to another GROUPING SETS subclause and multiply the two, like so:

SELECT
  custid,
  empid,
  YEAR(orderdate) AS orderyear,
  MONTH(orderdate) AS ordermonth,
  SUM(qty) AS qty
FROM dbo.Orders
GROUP BY
  GROUPING SETS
  (
    ( YEAR(orderdate)                 )
  ),
  GROUPING SETS
  (
    ( custid, empid, MONTH(orderdate) ),
    ( custid, empid                   ),
    ( custid,        MONTH(orderdate) ),
    ( custid                          ),
    ( empid,         MONTH(orderdate) ),
    ( empid                           )
  );

Note that when a GROUPING SETS subclause contains only one grouping set, it is equivalent to listing the grouping set’s elements directly in the GROUP BY clause. Hence, the previous query is logically equivalent to the following:

SELECT
  custid,
  empid,
  YEAR(orderdate) AS orderyear,
  MONTH(orderdate) AS ordermonth,
  SUM(qty) AS qty
FROM dbo.Orders
GROUP BY
  YEAR(orderdate),
  GROUPING SETS
  (
    ( custid, empid, MONTH(orderdate) ),
    ( custid, empid                   ),
    ( custid,        MONTH(orderdate) ),
    ( custid                          ),
    ( empid,         MONTH(orderdate) ),
    ( empid                           )
  );

You can reduce this form even further. Notice in the remaining GROUPING SETS subclause that three subsets of elements appear once with MONTH(orderdate) and once without. Hence, you can reduce this form to a multiplication between a GROUPING SETS subclause containing those three and another containing two grouping sets, (MONTH(orderdate)) and the empty grouping set, like so:

SELECT
  custid,
  empid,
  YEAR(orderdate) AS orderyear,
  MONTH(orderdate) AS ordermonth,
  SUM(qty) AS qty
FROM dbo.Orders
GROUP BY
  YEAR(orderdate),
  GROUPING SETS
  (
    ( custid, empid ),
    ( custid        ),
    ( empid         )
  ),
  GROUPING SETS
  (
    ( MONTH(orderdate) ),
    ()
  );

Addition

Recall that when you separate GROUPING SETS, CUBE, and ROLLUP subclauses by commas, you get a Cartesian product between the sets of grouping sets that each represents. But what if you have an existing GROUPING SETS subclause and you just want to add—not multiply—the grouping sets that are defined by a CUBE or ROLLUP subclause? This can be achieved by specifying the CUBE or ROLLUP subclause (or multiple ones) within the parentheses of the GROUPING SETS subclause.

For example, the following query demonstrates adding the grouping sets defined by a ROLLUP subclause to the grouping sets defined by the hosting GROUPING SETS subclause:

SELECT
  custid,
  empid,
  YEAR(orderdate) AS orderyear,
  MONTH(orderdate) AS ordermonth,
  SUM(qty) AS qty
FROM dbo.Orders
GROUP BY
  GROUPING SETS
  (
    ( custid, empid ),
    ( custid        ),
    ( empid         ),
    ROLLUP(YEAR(orderdate), MONTH(orderdate), DAY(orderdate))
  );

This query is a logical equivalent of the following query:

SELECT
  custid,
  empid,
  YEAR(orderdate) AS orderyear,
  MONTH(orderdate) AS ordermonth,
  SUM(qty) AS qty
FROM dbo.Orders
GROUP BY
  GROUPING SETS
  (
    ( custid, empid ),
    ( custid        ),
    ( empid         ),
    ( YEAR(orderdate), MONTH(orderdate), DAY(orderdate) ),
    ( YEAR(orderdate), MONTH(orderdate)                 ),
    ( YEAR(orderdate)                                   ),
    ()
  );

Unfortunately, there is no built-in option to do subtraction. For example, you can’t somehow express the idea of CUBE( a, b, c, d ) minus GROUPING SETS ( (a, c), (b, d), () ). Of course, you can achieve this with the EXCEPT set operation and other techniques but not as a direct algebraic operation on grouping sets–related subclauses.

The GROUPING_ID Function

In your applications you may need to be able to identify the grouping set with which each result row of your query is associated. Relying on the NULL placeholders may lead to convoluted code, not to mention the fact that if a column is defined in the table as allowing NULLs, a NULL in the result will be ambiguous. SQL Server 2008 introduces a very convenient tool for this purpose in the form of a function called GROUPING_ID. This function accepts a list of attributes as input and constructs an integer bitmap where each bit represents the corresponding attribute (the rightmost bit represents the rightmost input attribute). The bit is 0 when the corresponding attribute is a member of the grouping set and 1 otherwise.

You provide the function with all attributes that participate in any grouping set as input, and you will get a unique integer representing each grouping set. So, for example, the expression GROUPING_ID( a, b, c, d ) would return 0 ( 0×8 + 0×4 + 0×2 + 0×1 ) for rows associated with the grouping set ( a, b, c, d ), 1 ( 0×8 + 0×4 + 0×2 + 1×1 ) for the grouping set ( a, b, c ), 2 ( 0×8 + 0×4 + 1×2 + 0×1 ) for the grouping set ( a, b, d ), 3 ( 0×8 + 0×4 + 1×2 + 1×1 ) for the grouping set ( a, b ), and so on.

The following query demonstrate the use of the GROUPING_ID function:

SELECT
  GROUPING_ID(
    custid, empid,
    YEAR(orderdate), MONTH(orderdate), DAY(orderdate) ) AS grp_id,
  custid, empid,
  YEAR(orderdate) AS orderyear,
  MONTH(orderdate) AS ordermonth,
  DAY(orderdate) AS orderday,
  SUM(qty) AS qty
FROM dbo.Orders
GROUP BY
  CUBE(custid, empid),
  ROLLUP(YEAR(orderdate), MONTH(orderdate), DAY(orderdate));

This query generates the following output:

grp_id  custid empid  orderyear  ordermonth  orderday  qty
------- ------ ------ ---------- ----------- --------- ----
0       C      3      2006       4           18        22
16      NULL   3      2006       4           18        22
0       A      3      2006       8           2         10
24      NULL   NULL   2006       4           18        22
25      NULL   NULL   2006       4           NULL      22
16      NULL   3      2006       8           2         10
24      NULL   NULL   2006       8           2         10
25      NULL   NULL   2006       8           NULL      10
0       D      3      2006       9           7         30
16      NULL   3      2006       9           7         30
...

For example, the grp_id value 25 represents the grouping set ( YEAR(orderdate), MONTH(orderdate) ). These attributes are represented by the second (value 2) and third (value 4) bits. However, remember that the bits representing members that participate in the grouping set are turned off. The bits representing the members that do not participate in the grouping set are turned on. In our case, those are the first (1), fourth (8), and fifth (16) bits representing the attributes DAY(orderdate), empid and custid, respectively. The sum of the values of the bits that are turned on is 1 + 8 + 16 = 25.

The following query helps you see which bits are turned on or off in each integer bitmap generated by the GROUPING_ID function with five input elements:

SELECT
  GROUPING_ID(e, d, c, b, a) as n,
  COALESCE(e, 1) as [16],
  COALESCE(d, 1) as [8],
  COALESCE(c, 1) as [4],
  COALESCE(b, 1) as [2],
  COALESCE(a, 1) as [1]
FROM (VALUES(0, 0, 0, 0, 0)) AS D(a, b, c, d, e)
GROUP BY CUBE (a, b, c, d, e)
ORDER BY n;

This query generates the following output:

n           16          8           4           2           1
----------- ----------- ----------- ----------- ----------- -----------
0           0           0           0           0           0
1           0           0           0           0           1
2           0           0           0           1           0
3           0           0           0           1           1
4           0           0           1           0           0
5           0           0           1           0           1
6           0           0           1           1           0
7           0           0           1           1           1
8           0           1           0           0           0
9           0           1           0           0           1
10          0           1           0           1           0
11          0           1           0           1           1
12          0           1           1           0           0
13          0           1           1           0           1
14          0           1           1           1           0
15          0           1           1           1           1
16          1           0           0           0           0
17          1           0           0           0           1
18          1           0           0           1           0
19          1           0           0           1           1
20          1           0           1           0           0
21          1           0           1           0           1
22          1           0           1           1           0
23          1           0           1           1           1
24          1           1           0           0           0
25          1           1           0           0           1
26          1           1           0           1           0
27          1           1           0           1           1
28          1           1           1           0           0
29          1           1           1           0           1
30          1           1           1           1           0
31          1           1           1           1           1

Remember—when the bit is off, the corresponding member is part of the grouping set.

As mentioned, the GROUPING_ID function was introduced in SQL Server 2008. You could produce a similar integer bitmap prior to SQL Server 2008, but it involved more work. You could use a function called GROUPING that accepts a single attribute as input and returns 0 if the attribute is a member of the grouping set and 1 otherwise. You could construct the integer bitmap by multiplying the GROUPING value of each attribute by a different power of 2 and summing all values. Here’s an example of implementing this logic in a query that uses the older WITH CUBE option:

SELECT
  GROUPING(custid)          * 4 +
  GROUPING(empid)           * 2 +
  GROUPING(YEAR(orderdate)) * 1 AS grp_id,
  custid, empid, YEAR(orderdate) AS orderyear,
  SUM(qty) AS totalqty
FROM dbo.Orders
GROUP BY custid, empid, YEAR(orderdate)
WITH CUBE;

This query generates the following output:

grp_id      custid empid       orderyear   totalqty
----------- ------ ----------- ----------- -----------
0           A      1           2006        12
0           B      1           2006        20
4           NULL   1           2006        32
0           A      3           2006        10
0           C      3           2006        22
0           D      3           2006        30
4           NULL   3           2006        62
6           NULL   NULL        2006        94
0           C      1           2007        14
4           NULL   1           2007        14
...

Materialize Grouping Sets

Recall that before I started describing the technicalities of the grouping sets–related features, I explained that one of their uses is to preprocess aggregates for multiple grouping sets and store those in the data warehouse for fast retrieval. The following code demonstrates materializing aggregates for multiple grouping sets, including an integer identifier of the grouping set calculated with the GROUPING_ID function in a table called MyGroupingSets:

USE tempdb;
IF OBJECT_ID('dbo.MyGroupingSets', 'U') IS NOT NULL  DROP TABLE dbo.MyGroupingSets;
GO

SELECT
  GROUPING_ID(
    custid, empid,
    YEAR(orderdate), MONTH(orderdate), DAY(orderdate) ) AS grp_id,
  custid, empid,
  YEAR(orderdate) AS orderyear,
  MONTH(orderdate) AS ordermonth,
  DAY(orderdate) AS orderday,
  SUM(qty) AS qty
INTO dbo.MyGroupingSets
FROM dbo.Orders
GROUP BY
  CUBE(custid, empid),
  ROLLUP(YEAR(orderdate), MONTH(orderdate), DAY(orderdate));

CREATE UNIQUE CLUSTERED INDEX idx_cl_grp_id_grp_attributes
  ON dbo.MyGroupingSets(grp_id, custid, empid, orderyear, ordermonth, orderday);

The index created on the table MyGroupingSets is defined on the grp_id column as the first key to allow efficient retrieval of all rows associated with a single grouping set. For example, consider the following query, which asks for all rows associated with the grouping set ( custid, YEAR(orderdate), MONTH(orderdate) ):

SELECT *
FROM dbo.MyGroupingSets
WHERE grp_id = 9;

This query generates the following output:

grp_id      custid empid       orderyear   ordermonth  orderday    qty
----------- ------ ----------- ----------- ----------- ----------- -----------
9           A      NULL        2006        8           NULL        10
9           A      NULL        2006        12          NULL        12
9           A      NULL        2007        1           NULL        40
9           A      NULL        2008        2           NULL        10
9           B      NULL        2006        12          NULL        20
9           B      NULL        2007        2           NULL        12
9           B      NULL        2008        4           NULL        15
9           C      NULL        2006        4           NULL        22
9           C      NULL        2007        1           NULL        14
9           C      NULL        2008        2           NULL        20
9           D      NULL        2006        9           NULL        30

Figure 8-5 shows the plan for this query.

Figure 8-5

Figure 8-5 Execution plan of query that filters a single grouping set

This plan is very efficient. It scans only the rows that are associated with the requested grouping set because they reside in a consecutive section in the leaf of the clustered index.

Provided that you are using aggregates that are additive measures, like SUM, COUNT, and AVG, you can apply incremental updates to the stored aggregates with only the delta of additions since you last processed those aggregates. You can achieve this by using the new MERGE statement that was introduced in SQL Server 2008. Here I’m just going to show the code to demonstrate how this is done. For details about the MERGE statement, please refer to Chapter 10.

Run the following code to simulate another day’s worth of order activity (April 19, 2008):

INSERT INTO dbo.Orders
  (orderid, orderdate, empid, custid, qty)
VALUES
  (50001, '20080419', 1, 'A', 10),
  (50002, '20080419', 1, 'B', 30),
  (50003, '20080419', 2, 'A', 20),
  (50004, '20080419', 2, 'B',  5),
  (50005, '20080419', 3, 'A', 15)

Then run the following code to incrementally update the stored aggregates with the new day’s worth of data:

WITH LastDay AS
(
  SELECT
    GROUPING_ID(
      custid, empid,
      YEAR(orderdate), MONTH(orderdate), DAY(orderdate) ) AS grp_id,
    custid, empid,
    YEAR(orderdate) AS orderyear,
    MONTH(orderdate) AS ordermonth,
    DAY(orderdate) AS orderday,
    SUM(qty) AS qty
  FROM dbo.Orders
  WHERE orderdate = '20080419'
  GROUP BY
    CUBE(custid, empid),
    ROLLUP(YEAR(orderdate), MONTH(orderdate), DAY(orderdate))
)
MERGE INTO dbo.MyGroupingSets AS TGT
USING LastDay AS SRC
  ON     (TGT.grp_id    = SRC.grp_id)
     AND (TGT.orderyear  = SRC.orderyear
          OR (TGT.orderyear IS NULL AND SRC.orderyear IS NULL))
     AND (TGT.ordermonth = SRC.ordermonth
          OR (TGT.ordermonth IS NULL AND SRC.ordermonth IS NULL))
     AND (TGT.orderday   = SRC.orderday
          OR (TGT.orderday IS NULL AND SRC.orderday IS NULL))
     AND (TGT.custid   = SRC.custid
          OR (TGT.custid IS NULL AND SRC.custid IS NULL))
     AND (TGT.empid    = SRC.empid
          OR (TGT.empid IS NULL AND SRC.empid IS NULL))
WHEN MATCHED THEN
  UPDATE SET
    TGT.qty += SRC.qty
WHEN NOT MATCHED THEN
  INSERT (grp_id, custid, empid, orderyear, ordermonth, orderday)
  VALUES (SRC.grp_id, SRC.custid, SRC.empid, SRC.orderyear, SRC.ordermonth, SRC.orderday);

The code in the CTE LastDay calculates aggregates for the same grouping sets as in the original query but filters only the last day’s worth of data. The MERGE statement then increments the quantities of groups that already exist in the target by adding the new quantities and inserts the groups that don’t exist in the target.

Sorting

Consider a request to calculate the total quantity aggregate for all grouping sets in the hierarchy order year > order month > order day. You can achieve this, of course, by simply using the ROLLUP subclause. However, a tricky part of the request is that you need to sort the rows in the output in a hierarchical manner, that is, days of a month, followed by the month total, months of a year followed by the yearly total, and finally the grand total. This can be achieved with the help of the GROUPING function as follows:

SELECT
  YEAR(orderdate)  AS orderyear,
  MONTH(orderdate) AS ordermonth,
  DAY(orderdate)   AS orderday,
  SUM(qty)         AS totalqty
FROM dbo.Orders
GROUP BY
  ROLLUP(YEAR(orderdate), MONTH(orderdate), DAY(orderdate))
ORDER BY
  GROUPING(YEAR(orderdate)) , YEAR(orderdate),
  GROUPING(MONTH(orderdate)), MONTH(orderdate),
  GROUPING(DAY(orderdate))  , DAY(orderdate);

Remember that the GROUPING function returns 0 when the element is a member of a grouping set (representing detail) and 1 when the element isn’t (representing an aggregate). Because we want to present detail before aggregates, the GROUPING function is very convenient. We want to first see the detail of years and at the end the grand total. Within the detail of years, we want to sort by year. Within each year, we want to first see the detail of months and then the year total. Within the detail of months, we want to sort by month. Within the month we want to sort by the detail of days and then month total. Within the detail of days, we want to sort by day.

This query generates the following output:

orderyear   ordermonth  orderday    totalqty
----------- ----------- ----------- -----------
2006        4           18          22
2006        4           NULL        22
2006        8           2           10
2006        8           NULL        10
2006        9           7           30
2006        9           NULL        30
2006        12          24          32
2006        12          NULL        32
2006        NULL        NULL        94
2007        1           9           40
2007        1           18          14
2007        1           NULL        54
2007        2           12          12
2007        2           NULL        12
2007        NULL        NULL        66
2008        2           12          10
2008        2           16          20
2008        2           NULL        30
2008        4           18          15
2008        4           19          80
2008        4           NULL        95
2008        NULL        NULL        125
NULL        NULL        NULL        285