Aggregating and Pivoting Data in Microsoft SQL Server 2008 T-SQL

  • 3/25/2009
This chapter from Inside Microsoft SQL Server 2008 T-SQL Querying covers various data-aggregation techniques, including using the OVER clause with aggregate functions, tiebreakers, running aggregates, pivoting, unpivoting, custom aggregations, histograms, grouping factors, and grouping sets.

In my solutions in this chapter, I’ll reuse techniques that I introduced earlier. I’ll also introduce new techniques for you to familiarize yourself with.

Logic will naturally be an integral element in the solutions. Remember that at the heart of every querying problem lies a logical puzzle.

OVER Clause

The OVER clause allows you to request window-based calculations—that is, calculations performed over a whole window of rows. In Chapter 6, I described in detail how you use the OVER clause with analytical ranking functions. Microsoft SQL Server also supports the OVER clause with scalar aggregate functions; however, currently you can provide only the PARTITION BY clause. Future versions of SQL Server will most likely also support the other ANSI elements of aggregate window functions, including the ORDER BY and ROWS clauses.

The purpose of using the OVER clause with scalar aggregates is to calculate, for each row, an aggregate based on a window of values that extends beyond that row—and to do all this without using a GROUP BY clause in the query. In other words, the OVER clause allows you to add aggregate calculations to the results of an ungrouped query. This capability provides an alternative to requesting aggregates with subqueries in case you need to include both base row attributes and aggregates in your results.

Remember that in Chapter 7, I presented a problem in which you were required to calculate two aggregates for each order row: the percentage the row contributed to the total value of all orders and the difference between the row’s order value and the average value over all orders. In my examples I used a table called MyOrderValues that you create and populate by running the following code:

SET NOCOUNT ON;
USE InsideTSQL2008;

IF OBJECT_ID('dbo.MyOrderValues', 'U') IS NOT NULL
  DROP TABLE dbo.MyOrderValues;
GO
SELECT *
INTO dbo.MyOrderValues
FROM Sales.OrderValues;

ALTER TABLE dbo.MyOrderValues
  ADD CONSTRAINT PK_MyOrderValues PRIMARY KEY(orderid);

CREATE INDEX idx_val ON dbo.MyOrderValues(val);

I showed the following optimized query in which I used a cross join between the base table and a derived table of aggregates instead of using multiple subqueries:

SELECT orderid, custid, val,
  CAST(val / sumval * 100. AS NUMERIC(5, 2)) AS pct,
  CAST(val - avgval AS NUMERIC(12, 2)) AS diff
FROM dbo.MyOrderValues
  CROSS JOIN (SELECT SUM(val) AS sumval, AVG(val) AS avgval
              FROM dbo.MyOrderValues) AS Aggs;

This query produces the following output:

orderid  custid  val       pct   diff
-------- ------- --------- ----- -------------
10248    85      440.00    0.03  -1085.05
10249    79      1863.40   0.15  338.35
10250    34      1552.60   0.12  27.55
10251    84      654.06    0.05  -870.99
10252    76      3597.90   0.28  2072.85
10253    34      1444.80   0.11  -80.25
10254    14      556.62    0.04  -968.43
10255    68      2490.50   0.20  965.45
10256    88      517.80    0.04  -1007.25
...

The motivation for calculating the two aggregates in a single derived table instead of as two separate subqueries stemmed from the fact that each subquery accessed the base table separately, while the derived table calculated the aggregates using a single scan of the data. SQL Server’s query optimizer didn’t use the fact that the two subqueries aggregated the same data into the same groups.

When you specify multiple aggregates with identical OVER clauses in the same SELECT list, however, the aggregates refer to the same window, as with a derived table, and SQL Server’s query optimizer evaluates them all with one scan of the source data. Here’s how you use the OVER clause to answer the same request:

SELECT orderid, custid, val,
  CAST(val / SUM(val) OVER() * 100. AS NUMERIC(5, 2)) AS pct,
  CAST(val - AVG(val) OVER() AS NUMERIC(12, 2)) AS diff
FROM dbo.MyOrderValues;

Here, because I didn’t specify a PARTITION BY clause, the aggregates were calculated based on the whole input. Logically, SUM(val) OVER() is equivalent here to the subquery (SELECT SUM(val) FROM dbo.MyOrderValues). Physically, it’s a different story. As an exercise, you can compare the execution plans of the following two queries, each requesting a different number of aggregates using the same OVER clause:

SELECT orderid, custid, val,
  SUM(val) OVER() AS sumval
FROM dbo.MyOrderValues;

SELECT orderid, custid, val,
  SUM(val)   OVER() AS sumval,
  COUNT(val) OVER() AS cntval,
  AVG(val)   OVER() AS avgval,
  MIN(val)   OVER() AS minval,
  MAX(val)   OVER() AS maxval
FROM dbo.MyOrderValues;

You’ll find the two plans nearly identical, with the only difference being that the single Stream Aggregate operator calculates a different number of aggregates. The query costs are identical. On the other hand, compare the execution plans of the following two queries, each requesting a different number of aggregates using subqueries:

SELECT orderid, custid, val,
  (SELECT SUM(val) FROM dbo.MyOrderValues) AS sumval
FROM dbo.MyOrderValues;

SELECT orderid, custid, val,
  (SELECT SUM(val)   FROM dbo.MyOrderValues) AS sumval,
  (SELECT COUNT(val) FROM dbo.MyOrderValues) AS cntval,
  (SELECT AVG(val)   FROM dbo.MyOrderValues) AS avgval,
  (SELECT MIN(val)   FROM dbo.MyOrderValues) AS minval,
  (SELECT MAX(val)   FROM dbo.MyOrderValues) AS maxval
FROM dbo.MyOrderValues;

You’ll find that they have different plans, with the latter being more expensive because it rescans the source data for each aggregate.

Another benefit of the OVER clause is that it allows for shorter and simpler code. This is especially apparent when you need to calculate partitioned aggregates. Using OVER, you simply specify a PARTITION BY clause. Using subqueries, you have to correlate the inner query to the outer, making the query longer and more complex.

As an example of using the PARTITION BY clause, the following query calculates the percentage of the order value out of the customer total and the difference from the customer average:

SELECT orderid, custid, val,
  CAST(val / SUM(val) OVER(PARTITION BY custid) * 100.
    AS NUMERIC(5, 2)) AS pct,
  CAST(val - AVG(val) OVER(PARTITION BY custid) AS NUMERIC(12, 2)) AS diff
FROM dbo.MyOrderValues
ORDER BY custid;

This query generates the following output:

orderid  custid  val     pct    diff
-------- ------- ------- ------ ------------
10643    1       814.50  19.06  102.33
10692    1       878.00  20.55  165.83
10702    1       330.00  7.72   -382.17
10835    1       845.80  19.79  133.63
10952    1       471.20  11.03  -240.97
11011    1       933.50  21.85  221.33
10926    2       514.40  36.67  163.66
10759    2       320.00  22.81  -30.74
10625    2       479.75  34.20  129.01
10308    2       88.80   6.33   -261.94
...

In short, the OVER clause allows for more concise and faster-running queries.

When you’re done, run the following code for cleanup:

IF OBJECT_ID('dbo.MyOrderValues', 'U') IS NOT NULL
  DROP TABLE dbo.MyOrderValues;