Aggregating and Pivoting Data in Microsoft SQL Server 2008 T-SQL

3/25/2009

Contents

Running Aggregations

Running aggregations are aggregations of data over a sequence (typically temporal). Running aggregate problems have many variations, and I’ll describe several important ones here.

In my examples, I’ll use a summary table called EmpOrders that contains one row for each employee and month, with the total quantity of orders made by that employee in that month. Run the following code to create the EmpOrders table and populate it with sample data:

USE tempdb;

IF OBJECT_ID('dbo.EmpOrders') IS NOT NULL DROP TABLE dbo.EmpOrders;

CREATE TABLE dbo.EmpOrders
(
  empid    INT  NOT NULL,
  ordmonth DATE NOT NULL,
  qty      INT  NOT NULL,
  PRIMARY KEY(empid, ordmonth)
);
GO

INSERT INTO dbo.EmpOrders(empid, ordmonth, qty)
  SELECT O.empid,
    DATEADD(month, DATEDIFF(month, 0, O.orderdate), 0) AS ordmonth,
    SUM(qty) AS qty
  FROM InsideTSQL2008.Sales.Orders AS O
    JOIN InsideTSQL2008.Sales.OrderDetails AS OD
      ON O.orderid = OD.orderid
  GROUP BY empid,
    DATEADD(month, DATEDIFF(month, 0, O.orderdate), 0);

Run the following query to get the contents of the EmpOrders table:

SELECT empid, CONVERT(VARCHAR(7), ordmonth, 121) AS ordmonth, qty
FROM dbo.EmpOrders
ORDER BY empid, ordmonth;

This query generates the following output, shown here in abbreviated form:

empid       ordmonth qty
----------- -------- -----------
1           2006-07  121
1           2006-08  247
1           2006-09  255
1           2006-10  143
1           2006-11  318
1           2006-12  536
1           2007-01  304
1           2007-02  168
1           2007-03  275
1           2007-04  20
...
2           2006-07  50
2           2006-08  94
2           2006-09  137
2           2006-10  248
2           2006-11  237
2           2006-12  319
2           2007-01  230
2           2007-02  36
2           2007-03  151
2           2007-04  468
...

I’ll discuss three types of running aggregation problems: cumulative, sliding, and year-to-date (YTD).

Cumulative Aggregations

Cumulative aggregations accumulate data from the first element within the sequence up to the current point. For example, imagine the following request: for each employee and month, return the total quantity and average monthly quantity from the beginning of the employee’s activity through the month in question.

Recall the techniques for calculating row numbers without using the built-in ROW_NUMBER function; using these techniques, you scan the same rows we need here to calculate the total quantities. The difference is that for row numbers you used the aggregate COUNT, and here you’ll use the aggregates SUM and AVG. I demonstrated two set-based solutions to calculate row numbers without the ROW_NUMBER function—one using subqueries and one using joins. In the solution using joins, I applied what I called an expand-collapse technique. To me, the subquery solution is much more intuitive than the join solution, with its artificial expand-collapse technique. So, when there’s no performance difference, I’d rather use subqueries. Typically, you won’t see a performance difference when only one aggregate is involved because the plans would be similar. However, when you request multiple aggregates, the subquery solution might result in a plan that scans the data separately for each aggregate. Compare this to the plan for the join solution, which typically calculates all aggregates during a single scan of the source data.

So my choice is usually simple—use a subquery for one aggregate and use a join for multiple aggregates. The following query applies the expand-collapse approach to produce the desired result:

SELECT O1.empid, CONVERT(VARCHAR(7), O1.ordmonth, 121) AS ordmonth,
  O1.qty AS qtythismonth, SUM(O2.qty) AS totalqty,
  CAST(AVG(1.*O2.qty) AS NUMERIC(12, 2)) AS avgqty
FROM dbo.EmpOrders AS O1
  JOIN dbo.EmpOrders AS O2
    ON O2.empid = O1.empid
    AND O2.ordmonth <= O1.ordmonth
GROUP BY O1.empid, O1.ordmonth, O1.qty
ORDER BY O1.empid, O1.ordmonth;

This query generates the following output, shown here in abbreviated form:

empid       ordmonth qtythismonth totalqty    avgqty
----------- -------- ------------ ----------- ----------
1           2006-07  121          121         121.00
1           2006-08  247          368         184.00
1           2006-09  255          623         207.67
1           2006-10  143          766         191.50
1           2006-11  318          1084        216.80
1           2006-12  536          1620        270.00
1           2007-01  304          1924        274.86
1           2007-02  168          2092        261.50
1           2007-03  275          2367        263.00
1           2007-04  20           2387        238.70
...
2           2006-07  50           50          50.00
2           2006-08  94           144         72.00
2           2006-09  137          281         93.67
2           2006-10  248          529         132.25
2           2006-11  237          766         153.20
2           2006-12  319          1085        180.83
2           2007-01  230          1315        187.86
2           2007-02  36           1351        168.88
2           2007-03  151          1502        166.89
2           2007-04  468          1970        197.00
...

Now let’s say that you are asked to return only one aggregate (say, total quantity). You can safely use the subquery approach:

SELECT O1.empid, CONVERT(VARCHAR(7), O1.ordmonth, 121) AS ordmonth,
  O1.qty AS qtythismonth,
  (SELECT SUM(O2.qty)
   FROM dbo.EmpOrders AS O2
   WHERE O2.empid = O1.empid
     AND O2.ordmonth <= O1.ordmonth) AS totalqty
FROM dbo.EmpOrders AS O1
GROUP BY O1.empid, O1.ordmonth, O1.qty;

As was the case for calculating row numbers based on subqueries or joins, when calculating running aggregates based on similar techniques, the N² performance issues I discussed before apply once again. Because running aggregates typically are calculated on a fairly small number of rows per group, you won’t be adversely affected by performance issues, assuming you have appropriate indexes (keyed on grouping columns, then sort columns, and including covering columns).

Let p be the number of partitions involved (employees in our case), let n be the average number of rows per partition (months in our case), and let a be the number of aggregates involved. The total number of rows scanned using the join approach can be expressed as pn + p(n+n²)/2 and as pn + ap(n+n²)/2 using the subquery approach because with subqueries the optimizer uses a separate scan per subquery. It’s important to note that the N² complexity is relevant to the partition size and not the table size. If the number of rows in the table grows by a factor of f but the partition size doesn’t change, the run time increases by a factor of f as well. If, on the other hand, the average partition size grows by a factor of f, the run time increases by a factor of f². With small partitions (say, up to several dozen rows), this set-based solution provides reasonable performance. With large partitions, a cursor solution would be faster despite the overhead associated with row-by-row manipulation because a cursor scans the rows only once, and the per-row overhead is constant.

NOTE

ANSI SQL provides support for running aggregates by means of aggregate window functions. SQL Server 2005 introduced the OVER clause for aggregate functions only with the PARTITION BY clause, and unfortunately SQL Server 2008 didn’t enhance the OVER clause further. Further enhancements are currently planned for the next major release of SQL Server—SQL Server 11. Per ANSI SQL—and I hope in future versions of SQL Server—you could provide a solution relying exclusively on window functions, like so:

SELECT empid, CONVERT(VARCHAR(7), ordmonth, 121) AS ordmonth, qty,
  SUM(O2.qty) OVER(PARTITION BY empid
                   ORDER BY ordmonth
                   ROWS BETWEEN UNBOUNDED PRECEDING
                            AND CURRENT ROW) AS totalqty,
  CAST(AVG(1.*O2.qty) OVER(PARTITION BY empid
                           ORDER BY ordmonth
                           ROWS BETWEEN UNBOUNDED PRECEDING
                                    AND CURRENT ROW)
  AS NUMERIC(12, 2)) AS avgqty
FROM dbo.EmpOrders;

When this code is finally supported in SQL Server, you can expect dramatic performance improvements and obviously much simpler queries. Being familiar with the way ranking calculations based on the OVER clause are currently optimized, you should expect running aggregates based on the OVER clause to be optimized similarly. That is, given a good index to support the request, you should expect the plan to involve a single ordered scan of the data. Then the total number of rows scanned would simply be the number of rows in the table (pn).

You might also be requested to filter the data—for example, return monthly aggregates for each employee only for months before the employee reached a certain target. Typically, you’ll have a target for each employee stored in a Targets table that you’ll need to join to. To make this example simple, I’ll assume that all employees have the same target total quantity—1,000. In practice, you’ll use the target attribute from the Targets table. Because you need to filter an aggregate, not an attribute, you must specify the filter expression (in this case, SUM(O2.qty) < 1000) in the HAVING clause, not the WHERE clause. The solution is as follows:

SELECT O1.empid, CONVERT(VARCHAR(7), O1.ordmonth, 121) AS ordmonth,
  O1.qty AS qtythismonth, SUM(O2.qty) AS totalqty,
  CAST(AVG(1.*O2.qty) AS NUMERIC(12, 2)) AS avgqty
FROM dbo.EmpOrders AS O1
  JOIN dbo.EmpOrders AS O2
    ON O2.empid = O1.empid
    AND O2.ordmonth <= O1.ordmonth
GROUP BY O1.empid, O1.ordmonth, O1.qty
HAVING SUM(O2.qty) < 1000
ORDER BY O1.empid, O1.ordmonth;

This query generates the following output, shown here in abbreviated form:

empid       ordmonth qtythismonth totalqty    avgqty
----------- -------- ------------ ----------- ----------
1           2006-07  121          121         121.00
1           2006-08  247          368         184.00
1           2006-09  255          623         207.67
1           2006-10  143          766         191.50
2           2006-07  50           50          50.00
2           2006-08  94           144         72.00
2           2006-09  137          281         93.67
2           2006-10  248          529         132.25
2           2006-11  237          766         153.20
3           2006-07  182          182         182.00
3           2006-08  228          410         205.00
3           2006-09  75           485         161.67
3           2006-10  151          636         159.00
3           2006-11  204          840         168.00
3           2006-12  100          940         156.67
...

Things get a bit tricky if you also need to include the rows for those months in which the employees reached their target. If you specify SUM(O2.qty) <= 1000 (that is, write <= instead of <), you still won’t get the row in which the employee reached the target unless the total through that month is exactly 1,000. But remember that you have access to both the cumulative total and the current month’s quantity, and using these two values together, you can solve this problem. If you change the HAVING filter to SUM(O2.qty) – O1.qty < 1000, you get the months in which the employee’s total quantity, excluding the current month’s orders, had not reached the target. In particular, the first month in which an employee reached or exceeded the target satisfies this new criterion, and that month will appear in the results. The complete solution follows:

SELECT O1.empid, CONVERT(VARCHAR(7), O1.ordmonth, 121) AS ordmonth,
  O1.qty AS qtythismonth, SUM(O2.qty) AS totalqty,
  CAST(AVG(1.*O2.qty) AS NUMERIC(12, 2)) AS avgqty
FROM dbo.EmpOrders AS O1
  JOIN dbo.EmpOrders AS O2
    ON O2.empid = O1.empid
    AND O2.ordmonth <= O1.ordmonth
GROUP BY O1.empid, O1.ordmonth, O1.qty
HAVING SUM(O2.qty) - O1.qty < 1000
ORDER BY O1.empid, O1.ordmonth;

This query generates the following output, shown here in abbreviated form:

empid       ordmonth qtythismonth totalqty    avgqty
----------- -------- ------------ ----------- ----------
1           2006-07  121          121         121.00
1           2006-08  247          368         184.00
1           2006-09  255          623         207.67
1           2006-10  143          766         191.50
1           2006-11  318          1084        216.80
2           2006-07  50           50          50.00
2           2006-08  94           144         72.00
2           2006-09  137          281         93.67
2           2006-10  248          529         132.25
2           2006-11  237          766         153.20
2           2006-12  319          1085        180.83
3           2006-07  182          182         182.00
3           2006-08  228          410         205.00
3           2006-09  75           485         161.67
3           2006-10  151          636         159.00
3           2006-11  204          840         168.00
3           2006-12  100          940         156.67
3           2007-01  364          1304        186.29
...

NOTE

You might have another solution in mind that seems like a plausible and simpler alternative—to leave the SUM condition alone but change the join condition to O2.ordmonth < O1.ordmonth. This way, the query would select rows where the total through the previous month did not meet the target. However, in the end, this solution is not any easier (the AVG is hard to generate, for example); what’s worse is that you might come up with a solution that does not work for employees who reach the target in their first month.

Suppose you’re interested in seeing results only for the specific month in which the employee reached the target of 1,000, without seeing results for preceding months. What’s true for only those rows in the output of the last query? You’re looking for rows where the total quantity is greater than or equal to 1,000. Simply add this criterion to the HAVING filter. Here’s the query followed by its output:

SELECT O1.empid, CONVERT(VARCHAR(7), O1.ordmonth, 121) AS ordmonth,
  O1.qty AS qtythismonth, SUM(O2.qty) AS totalqty,
  CAST(AVG(1.*O2.qty) AS NUMERIC(12, 2)) AS avgqty
FROM dbo.EmpOrders AS O1
  JOIN dbo.EmpOrders AS O2
    ON O2.empid = O1.empid
    AND O2.ordmonth <= O1.ordmonth
GROUP BY O1.empid, O1.ordmonth, O1.qty
HAVING SUM(O2.qty) - O1.qty < 1000
  AND SUM(O2.qty) >= 1000
ORDER BY O1.empid, O1.ordmonth;

empid       ordmonth qtythismonth totalqty    avgqty
----------- -------- ------------ ----------- ----------
1           2006-11  318          1084        216.80
2           2006-12  319          1085        180.83
3           2007-01  364          1304        186.29
4           2006-10  613          1439        359.75
5           2007-05  247          1213        173.29
6           2007-01  64           1027        171.17
7           2007-03  191          1069        152.71
8           2007-01  305          1228        175.43
9           2007-06  161          1007        125.88

Sliding Aggregations

Sliding aggregates are calculated over a sliding window in a sequence (again, typically temporal), as opposed to being calculated from the beginning of the sequence until the current point. A moving average—such as the employee’s average quantity over the last three months—is one example of a sliding aggregate.

NOTE

Without clarification, expressions such as “last three months” are ambiguous. The last three months could mean the previous three months (not including this month), or it could mean the previous two months along with this month. When you get a problem like this, be sure you know precisely what window of time you are using for aggregation—for a particular row, exactly when does the window begin and end?

In our example, the window of time is this: greater than the point in time starting three months ago and smaller than or equal to the current point in time. Note that this definition works well even in cases where you track finer time granularities than a month (including day, hour, minute, second, millisecond, microsecond, and nanosecond). This definition also addresses implicit conversion issues resulting from the accuracy level supported by SQL Server for the DATETIME data type—3.33 milliseconds. To avoid implicit conversion issues, it’s wiser to use > and <= predicates than the BETWEEN predicate.

The main difference between the solution for cumulative aggregates and the solution for sliding aggregates is in the join condition (or in the subquery’s filter in the case of the alternate solution using subqueries). Instead of using O2.ordmonth <= O1.current_month, you use O2.ordmonth > three_months_before_current AND O2.ordmonth <= O1.current_month. In T-SQL, this translates to the following query:

SELECT O1.empid,
  CONVERT(VARCHAR(7), O1.ordmonth, 121) AS tomonth,
  O1.qty AS qtythismonth,
  SUM(O2.qty) AS totalqty,
  CAST(AVG(1.*O2.qty) AS NUMERIC(12, 2)) AS avgqty
FROM dbo.EmpOrders AS O1
  JOIN dbo.EmpOrders AS O2
    ON O2.empid = O1.empid
    AND (O2.ordmonth > DATEADD(month, -3, O1.ordmonth)
         AND O2.ordmonth <=  O1.ordmonth)
GROUP BY O1.empid, O1.ordmonth, O1.qty
ORDER BY O1.empid, O1.ordmonth;

This query generates the following output, shown here in abbreviated form:

empid       tomonth qtythismonth totalqty    avgqty
----------- ------- ------------ ----------- ----------
1           2006-07 121          121         121.00
1           2006-08 247          368         184.00
1           2006-09 255          623         207.67
1           2006-10 143          645         215.00
1           2006-11 318          716         238.67
1           2006-12 536          997         332.33
1           2007-01 304          1158        386.00
1           2007-02 168          1008        336.00
1           2007-03 275          747         249.00
1           2007-04 20           463         154.33
...
2           2006-07 50           50          50.00
2           2006-08 94           144         72.00
2           2006-09 137          281         93.67
2           2006-10 248          479         159.67
2           2006-11 237          622         207.33
2           2006-12 319          804         268.00
2           2007-01 230          786         262.00
2           2007-02 36           585         195.00
2           2007-03 151          417         139.00
2           2007-04 468          655         218.33
...

Note that this solution includes aggregates for three-month periods that don’t include three months of actual data. If you want to return only periods with three full months accumulated, without the first two periods that do not cover three months, you can add the criterion MIN(O2.ordmonth) = DATEADD(month, –2, O1.ordmonth) to the HAVING filter.

NOTE

Per ANSI SQL, you can use the ORDER BY and ROWS subclauses of the OVER clause—which are currently missing in SQL Server—to address sliding aggregates. You would use the following query to return the desired result for the last sliding aggregates request (assuming the data has exactly one row per month):

SELECT empid, CONVERT(VARCHAR(7), ordmonth, 121) AS ordmonth,
  qty AS qtythismonth,
  SUM(O2.qty) OVER(PARTITION BY empid
                   ORDER BY ordmonth
                   ROWS BETWEEN 2 PRECEDING
                            AND CURRENT ROW) AS totalqty,
  CAST(AVG(1.*O2.qty) OVER(PARTITION BY empid
                           ORDER BY ordmonth
                           ROWS BETWEEN 2 PRECEDING
                                    AND CURRENT ROW)
    AS NUMERIC(12, 2)) AS avgqty
FROM dbo.EmpOrders;

Year-to-Date (YTD)

YTD aggregates accumulate values from the beginning of a period based on some date and time unit (say, a year) until the current point. The calculation is very similar to the sliding aggregates solution. The only difference is the lower bound provided in the query’s filter, which is the calculation of the beginning of the year. For example, the following query returns YTD aggregates for each employee and month:

SELECT O1.empid,
  CONVERT(VARCHAR(7), O1.ordmonth, 121) AS ordmonth,
  O1.qty AS qtythismonth,
  SUM(O2.qty) AS totalqty,
  CAST(AVG(1.*O2.qty) AS NUMERIC(12, 2)) AS avgqty
FROM dbo.EmpOrders AS O1
  JOIN dbo.EmpOrders AS O2
    ON O2.empid = O1.empid
    AND (O2.ordmonth >= CAST(CAST(YEAR(O1.ordmonth) AS CHAR(4))
                               + '0101' AS DATETIME)
         AND O2.ordmonth <= O1.ordmonth)
GROUP BY O1.empid, O1.ordmonth, O1.qty
ORDER BY O1.empid, O1.ordmonth;

This query generates the following output, shown here in abbreviated form:

empid       ordmonth qtythismonth totalqty    avgqty
----------- -------- ------------ ----------- ----------
1           2006-07  121          121         121.00
1           2006-08  247          368         184.00
1           2006-09  255          623         207.67
1           2006-10  143          766         191.50
1           2006-11  318          1084        216.80
1           2006-12  536          1620        270.00
1           2007-01  304          304         304.00
1           2007-02  168          472         236.00
1           2007-03  275          747         249.00
1           2007-04  20           767         191.75
...
2           2006-07  50           50          50.00
2           2006-08  94           144         72.00
2           2006-09  137          281         93.67
2           2006-10  248          529         132.25
2           2006-11  237          766         153.20
2           2006-12  319          1085        180.83
2           2007-01  230          230         230.00
2           2007-02  36           266         133.00
2           2007-03  151          417         139.00
2           2007-04  468          885         221.25
...

Save to your account