Aggregating and Pivoting Data in Microsoft SQL Server 2008 T-SQL

3/25/2009

Contents

Pivoting

Pivoting is a technique that allows you to rotate rows to columns, possibly performing aggregations along the way. The number of applications for pivoting is simply astounding. In this section, I’ll present a few, including pivoting attributes in an open schema environment, solving relational division problems, and formatting aggregated data. Later in the chapter and also in later chapters in the book, I’ll show additional applications.

Pivoting Attributes

I’ll use open schema as the scenario for pivoting attributes. Open schema is a design problem describing an environment that needs to deal with frequent schema changes. The relational model and SQL were conceived to handle frequent changes and requests for data via SQL’s data manipulation language (DML). However, SQL’s data definition language (DDL) was not conceived to support frequent schema changes. Whenever you need to add new entities, you must create new tables; whenever existing entities change their structures, you must add, alter, or drop columns. Such changes usually require downtime of the affected objects, and they also bring about substantial revisions to the application.

You can choose from several ways to model an open schema environment, each of which has advantages and disadvantages. One of those models is known as Entity Attribute Value (EAV) and also as the narrow representation of data. In this model, you store all data in a single table, where each attribute value resides in its own row along with the entity or object ID and the attribute name or ID. You represent the attribute values using the data type SQL_VARIANT to accommodate multiple attribute types in a single column.

In my examples, I’ll use the OpenSchema table, which you can create and populate by running the following code:

USE tempdb;

IF OBJECT_ID('dbo.OpenSchema') IS NOT NULL DROP TABLE dbo.OpenSchema;

CREATE TABLE dbo.OpenSchema
(
  objectid  INT          NOT NULL,
  attribute NVARCHAR(30) NOT NULL,
  value     SQL_VARIANT  NOT NULL,
  PRIMARY KEY (objectid, attribute)
);
GO

INSERT INTO dbo.OpenSchema(objectid, attribute, value) VALUES
  (1, N'attr1', CAST(CAST('ABC'      AS VARCHAR(10))   AS SQL_VARIANT)),
  (1, N'attr2', CAST(CAST(10         AS INT)           AS SQL_VARIANT)),
  (1, N'attr3', CAST(CAST('20070101' AS SMALLDATETIME) AS SQL_VARIANT)),
  (2, N'attr2', CAST(CAST(12         AS INT)           AS SQL_VARIANT)),
  (2, N'attr3', CAST(CAST('20090101' AS SMALLDATETIME) AS SQL_VARIANT)),
  (2, N'attr4', CAST(CAST('Y'        AS CHAR(1))       AS SQL_VARIANT)),
  (2, N'attr5', CAST(CAST(13.7       AS NUMERIC(9,3))  AS SQL_VARIANT)),
  (3, N'attr1', CAST(CAST('XYZ'      AS VARCHAR(10))   AS SQL_VARIANT)),
  (3, N'attr2', CAST(CAST(20         AS INT)           AS SQL_VARIANT)),
  (3, N'attr3', CAST(CAST('20080101' AS SMALLDATETIME) AS SQL_VARIANT));

-- show the contents of the table
SELECT * FROM dbo.OpenSchema;

This generates the following output:

objectid    attribute  value
----------- ---------- ------------------------
1           attr1      ABC
1           attr2      10
1           attr3      2007-01-01 00:00:00.000
2           attr2      12
2           attr3      2009-01-01 00:00:00.000
2           attr4      Y
2           attr5      13.700
3           attr1      XYZ
3           attr2      20
3           attr3      2008-01-01 00:00:00.000

Representing data this way allows logical schema changes to be implemented without adding, altering, or dropping tables and columns—you use DML INSERTs, UPDATEs, and DELETEs instead.

Of course, other aspects of working with the data (such as enforcing integrity, tuning, and querying) become more complex and expensive with such a representation. As mentioned, there are other approaches to dealing with open schema environments—for example, storing the data in XML format, using a wide representation of data, using CLR types, and others. However, when you weigh the advantages and disadvantages of each representation, you might find the EAV approach demonstrated here more favorable in some scenarios.

Keep in mind that this representation of the data requires very complex queries even for simple requests because different attributes of the same entity instance are spread over multiple rows. Before you query such data, you might want to rotate it to a traditional form with one column for each attribute—perhaps store the result in a temporary table, index it, query it, and then get rid of the temporary table. To rotate the data from its open schema form into a traditional form, you need to use a pivoting technique.

In the following section, I’ll describe the steps involved in solving pivoting problems. I’d like to point out that to understand the steps of the solution, it can be very helpful if you think about query logical processing phases, which I described in detail in Chapter 1. I discussed the query processing phases involved with the native PIVOT table operator, but those phases apply just as well to the standard solution that does not use this proprietary operator. Moreover, in the standard solution the phases are more apparent in the code, while using the PIVOT operator they are implicit.

The first step you might want to try when solving pivoting problems is to figure out how the number of rows in the result correlates to the number of rows in the source data. Here, you need to create a single result row out of the multiple base rows for each object. In SQL, this translates to grouping rows. So our first logical processing phase in pivoting is a grouping phase, and the associated element (the element you need to group by) is the objectid column.

As the next step in a pivoting problem, you can think in terms of the result columns. You need a result column for each unique attribute. Because the data contains five unique attributes (attr1, attr2, attr3, attr4, and attr5), you need five expressions in the SELECT list. Each expression is supposed to extract, out of the rows belonging to the grouped object, the value corresponding to a specific attribute. You can think of this logical phase as a spreading phase—you need to spread the values, or shift them, from the source column (value in our case) to the corresponding target column. As for the element that dictates where to spread the values, or the spread by element, in our case it is the attribute column. This spreading activity can be done with the following CASE expression, which in this example is applied to the attribute attr2:

CASE WHEN attribute = 'attr2' THEN value END

Remember that with no ELSE clause, CASE assumes an implicit ELSE NULL. The CASE expression just shown yields NULL for rows where attribute does not equal attr2 and yields value when attribute does equal attr2. This means that among the rows with a given value of objectid (say, 1), the CASE expression would yield several NULLs and, at most, one known value (10 in our example), which represents the value of the target attribute (attr2 in our example) for the given objectid.

The third phase in pivoting attributes is to extract the known value (if it exists) out of the set of NULLs and the known value. You have to use an aggregate for this purpose because, as you’ll recall, the query involves grouping. The trick to extracting the one known value is to use MAX or MIN. Both ignore NULLs and will return the one non-NULL value present because both the minimum and the maximum of a set containing one value is that value. So our third logical processing phase in pivoting is an aggregation phase. The aggregation element is the value column, and the aggregate function is MAX. Using the previous expression implementing the second phase with attr2, here’s the revised expression including the aggregation as well:

MAX(CASE WHEN attribute = 'attr2' THEN value END) AS attr2

Here’s the complete query that pivots the attributes from OpenSchema:

SELECT objectid,
  MAX(CASE WHEN attribute = 'attr1' THEN value END) AS attr1,
  MAX(CASE WHEN attribute = 'attr2' THEN value END) AS attr2,
  MAX(CASE WHEN attribute = 'attr3' THEN value END) AS attr3,
  MAX(CASE WHEN attribute = 'attr4' THEN value END) AS attr4,
  MAX(CASE WHEN attribute = 'attr5' THEN value END) AS attr5
FROM dbo.OpenSchema
GROUP BY objectid;

This query generates the following output:

objectid    attr1      attr2      attr3                    attr4      attr5
----------- ---------- ---------- ------------------------ ---------- ----------
1           ABC        10         2007-01-01 00:00:00.000  NULL       NULL
2           NULL       12         2009-01-01 00:00:00.000  Y          13.700
3           XYZ        20         2008-01-01 00:00:00.000  NULL       NULL

This technique for pivoting data is very efficient because it scans the base table only once.

SQL Server supports a native specialized table operator for pivoting called PIVOT. This operator does not provide any special advantages over the technique I just showed, except that it allows for shorter code. It doesn’t support dynamic pivoting, and underneath the covers, it applies very similar logic to the one I presented in the last solution. So you probably won’t even find noticeable performance differences. At any rate, here’s how you would pivot the OpenSchema data using the PIVOT operator:

SELECT objectid, attr1, attr2, attr3, attr4, attr5
FROM dbo.OpenSchema
  PIVOT(MAX(value) FOR attribute
    IN([attr1],[attr2],[attr3],[attr4],[attr5])) AS P;

Within this solution, you can identify all the elements I used in the previous solution. The inputs to the PIVOT operator are as follows:

The aggregate function applied to the aggregation element. In our case, it’s MAX(value), which extracts the single non-NULL value corresponding to the target attribute. In other cases, you might have more than one non-NULL value per group and want a different aggregate (for example, SUM or AVG).
Following the FOR keyword, the name of the spread by element (attribute, in our case). This is the source column holding the values that become the target column names.
The list of actual target column names in parentheses following the keyword IN.

As you can see, in the parentheses of the PIVOT operator, you specify the aggregate function and aggregation element and the spread by element and spreading values but not the group by elements. This is a problematic aspect of the syntax of the PIVOT operator—the grouping elements are implicitly derived from what was not specified. The grouping elements are the list of all columns from the input table to the PIVOT operator that were not mentioned as either the aggregation or the spreading elements. In our case, objectid is the only column left. If you unintentionally query the base table directly, you might end up with undesired grouping. If new columns will be added to the table in the future, those columns will be implicitly added to PIVOT’s grouping list. Therefore, it is strongly recommended that you apply the PIVOT operator not to the base table directly but rather to a table expression (derived table or CTE) that includes only the elements relevant to the pivoting activity. This way, you can control exactly which columns remain besides the aggregation and spreading elements. Future column additions to the table won’t have any impact on what PIVOT ends up operating on. The following query demonstrates applying this approach to our previous query, using a derived table:

SELECT objectid, attr1, attr2, attr3, attr4, attr5
FROM (SELECT objectid, attribute, value FROM dbo.OpenSchema) AS D
  PIVOT(MAX(value) FOR attribute
    IN([attr1],[attr2],[attr3],[attr4],[attr5])) AS P;

TIP

The input to the aggregate function must be a base column from the PIVOT operator’s input table with no manipulation—it cannot be an expression (for example: SUM(qty * price)). If you want to provide the aggregate function with an expression as input, have the PIVOT operator operate on a derived table or CTE (as suggested for other reasons as well), and in the derived table query assign the expression with a column alias (qty * price AS value). Then, as far as the PIVOT operator is concerned, that alias is the name of a base column in its input table, so it is valid to use that column name as input to PIVOT’s aggregate function (SUM(value)).

Also, you cannot spread attributes from more than one column (the column that appears after the FOR keyword). If you need to pivot more than one column’s attributes (say, empid and YEAR(orderdate)), you can use a similar approach to the previous suggestion: in the derived table or CTE used as the input to the PIVOT operator, concatenate the values from all columns you want to use as the spreading elements and assign the expression with a column alias (CAST(empid AS VARCHAR(10)) + ‘_’ + CAST(YEAR(orderdate) AS CHAR(4)) AS emp_year). Then, in the outer query, specify that column after PIVOT’s FOR keyword (FOR emp_year IN([1_2007], [1_2008], [1_2009], [2_2007], . . .)).

Relational Division

You can also use pivoting to solve relational division problems when the number of elements in the divisor set is fairly small. In my examples, I’ll use the OrderDetails table, which you create and populate by running the following code:

USE tempdb;

IF OBJECT_ID('dbo.OrderDetails') IS NOT NULL
  DROP TABLE dbo.OrderDetails;

CREATE TABLE dbo.OrderDetails
(
  orderid   VARCHAR(10) NOT NULL,
  productid INT         NOT NULL,
  PRIMARY KEY(orderid, productid)
  /* other colums */
);
GO

INSERT INTO dbo.OrderDetails(orderid, productid) VALUES
  ('A', 1),
  ('A', 2),
  ('A', 3),
  ('A', 4),
  ('B', 2),
  ('B', 3),
  ('B', 4),
  ('C', 3),
  ('C', 4),
  ('D', 4);

A classic relational division problem is to return orders that contain a certain basket of products—say, products 2, 3, and 4. You use a pivoting technique to rotate only the relevant products into separate columns for each order. Instead of returning an actual attribute value, you produce a 1 if the product exists in the order and a 0 otherwise. Create a derived table out of the pivot query, and in the outer query filter only orders that contain a 1 in all product columns. Here’s the full query, which correctly returns orders A and B:

SELECT orderid
FROM (SELECT
        orderid,
        MAX(CASE WHEN productid = 2 THEN 1 END) AS P2,
        MAX(CASE WHEN productid = 3 THEN 1 END) AS P3,
        MAX(CASE WHEN productid = 4 THEN 1 END) AS P4
      FROM dbo.OrderDetails
      GROUP BY orderid) AS P
WHERE P2 = 1 AND P3 = 1 AND P4 = 1;

If you run only the derived table query, you get the following output with the pivoted products for each order:

orderid    P2          P3          P4
---------- ----------- ----------- -----------
A          1           1           1
B          1           1           1
C          NULL        1           1
D          NULL        NULL        1

To answer the request at hand using the new PIVOT operator, use the following query:

SELECT orderid
FROM (SELECT orderid, productid FROM dbo.OrderDetails) AS D
  PIVOT(MAX(productid) FOR productid IN([2],[3],[4])) AS P
WHERE [2] = 2 AND [3] = 3 AND [4] = 4;

The aggregate function must accept a column as input, so I provided the productid itself. This means that if the product exists within an order, the corresponding value will contain the actual productid and not 1. That’s why the filter looks a bit different here.

Note that you can make both queries more intuitive and similar to each other in their logic by using the COUNT aggregate instead of MAX. This way, both queries would produce a 1 where the product exists and a 0 where it doesn’t (instead of NULL). Here’s what the query that does not use the PIVOT operator looks like:

SELECT orderid
FROM (SELECT
        orderid,
        COUNT(CASE WHEN productid = 2 THEN productid END) AS P2,
        COUNT(CASE WHEN productid = 3 THEN productid END) AS P3,
        COUNT(CASE WHEN productid = 4 THEN productid END) AS P4
      FROM dbo.OrderDetails
      GROUP BY orderid) AS P
WHERE P2 = 1 AND P3 = 1 AND P4 = 1;

And here’s the query you would use based on the PIVOT operator:

SELECT orderid
FROM (SELECT orderid, productid FROM dbo.OrderDetails) AS D
  PIVOT(COUNT(productid) FOR productid IN([2],[3],[4])) AS P
WHERE [2] = 1 AND [3] = 1 AND [4] = 1;

Aggregating Data

You can also use a pivoting technique to format aggregated data, typically for reporting purposes. In my examples, I’ll use the Orders table, which you create and populate by running the code in Example 8-1.

Example 8-1. Creating and populating the Orders table

SET NOCOUNT ON;
USE tempdb;

IF OBJECT_ID('dbo.Orders', 'U') IS NOT NULL DROP TABLE dbo.Orders;

CREATE TABLE dbo.Orders
(
  orderid   INT        NOT NULL,
  orderdate DATETIME   NOT NULL,
  empid     INT        NOT NULL,
  custid    VARCHAR(5) NOT NULL,
  qty       INT        NOT NULL,
  CONSTRAINT PK_Orders PRIMARY KEY(orderid)
);
GO

INSERT INTO dbo.Orders
  (orderid, orderdate, empid, custid, qty)
VALUES
  (30001, '20060802', 3, 'A', 10),
  (10001, '20061224', 1, 'A', 12),
  (10005, '20061224', 1, 'B', 20),
  (40001, '20070109', 4, 'A', 40),
  (10006, '20070118', 1, 'C', 14),
  (20001, '20070212', 2, 'B', 12),
  (40005, '20080212', 4, 'A', 10),
  (20002, '20080216', 2, 'C', 20),
  (30003, '20080418', 3, 'B', 15),
  (30004, '20060418', 3, 'C', 22),
  (30007, '20060907', 3, 'D', 30);

-- show the contents of the table
SELECT * FROM dbo.Orders;

This generates the following output:

orderid     orderdate               empid       custid qty
----------- ----------------------- ----------- ------ -----------
10001       2006-12-24 00:00:00.000 1           A      12
10005       2006-12-24 00:00:00.000 1           B      20
10006       2007-01-18 00:00:00.000 1           C      14
20001       2007-02-12 00:00:00.000 2           B      12
20002       2008-02-16 00:00:00.000 2           C      20
30001       2006-08-02 00:00:00.000 3           A      10
30003       2008-04-18 00:00:00.000 3           B      15
30004       2006-04-18 00:00:00.000 3           C      22
30007       2006-09-07 00:00:00.000 3           D      30
40001       2007-01-09 00:00:00.000 4           A      40
40005       2008-02-12 00:00:00.000 4           A      10

Suppose you want to return a row for each customer, with the total yearly quantities in a different column for each year. As with all pivoting problems, it boils down to identifying the grouping, spreading, and aggregation elements. In this case, the grouping element is the custid column, the spreading element is the expression YEAR(orderdate), and the aggregate function and element is SUM(qty). What remains is simply to use the solution templates I provided previously. Here’s the solution that does not use the PIVOT operator, followed by its output:

SELECT custid,
  SUM(CASE WHEN orderyear = 2006 THEN qty END) AS [2006],
  SUM(CASE WHEN orderyear = 2007 THEN qty END) AS [2007],
  SUM(CASE WHEN orderyear = 2008 THEN qty END) AS [2008]
FROM (SELECT custid, YEAR(orderdate) AS orderyear, qty
      FROM dbo.Orders) AS D
GROUP BY custid;

custid 2006        2007        2008
------ ----------- ----------- -----------
A      22          40          10
B      20          12          15
C      22          14          20
D      30          NULL        NULL

Here you can see the use of a derived table to isolate only the relevant elements for the pivoting activity (custid, orderyear, qty).

One of the main issues with this pivoting solution is that you might end up with lengthy query strings when the number of elements you need to rotate is large. It’s not a problem in this case because we are dealing with order years, and there usually aren’t that many, but it could be a problem in other cases when the spreading column has a large number of values. In an effort to shorten the query string, you can use a matrix table that contains a column and a row for each attribute that you need to rotate (orderyear, in this case). Only column values in the intersections of corresponding rows and columns contain the value 1, and the other column values are populated with a NULL or a 0, depending on your needs. Run the following code to create and populate the Matrix table:

USE tempdb;
GO

IF OBJECTPROPERTY(OBJECT_ID('dbo.Matrix'), 'IsUserTable') = 1
  DROP TABLE dbo.Matrix;
GO

CREATE TABLE dbo.Matrix
(
  orderyear INT NOT NULL PRIMARY KEY,
  y2006 INT NULL,
  y2007 INT NULL,
  y2008 INT NULL
);

INSERT INTO dbo.Matrix(orderyear, y2006) VALUES(2006, 1);
INSERT INTO dbo.Matrix(orderyear, y2007) VALUES(2007, 1);
INSERT INTO dbo.Matrix(orderyear, y2008) VALUES(2008, 1);
-- show the contents of the table
SELECT * FROM dbo.Matrix;

This generates the following output:

orderyear   y2006       y2007       y2008
----------- ----------- ----------- -----------
2006        1           NULL        NULL
2007        NULL        1           NULL
2008        NULL        NULL        1

You join the base table (or table expression) with the Matrix table based on a match in orderyear. This means that each row from the base table will be matched with one row from Matrix—the one with the same orderyear. In that row, only the corresponding orderyear’s column value will contain a 1. So you can substitute the expression

SUM(CASE WHEN orderyear = <some_year> THEN qty END) AS [<some_year>]

with the logically equivalent expression

SUM(qty*y<some_year>) AS [<some_year>]

Here’s what the full query looks like:

SELECT custid,
  SUM(qty*y2006) AS [2006],
  SUM(qty*y2007) AS [2007],
  SUM(qty*y2008) AS [2008]
FROM (SELECT custid, YEAR(orderdate) AS orderyear, qty
      FROM dbo.Orders) AS D
  JOIN dbo.Matrix AS M ON D.orderyear = M.orderyear
GROUP BY custid;

If you need the number of orders instead of the sum of qty, in the original solution you produce a 1 instead of the qty column for each order and use the COUNT aggregate function, like so:

SELECT custid,
  COUNT(CASE WHEN orderyear = 2006 THEN 1 END) AS [2006],
  COUNT(CASE WHEN orderyear = 2007 THEN 1 END) AS [2007],
  COUNT(CASE WHEN orderyear = 2008 THEN 1 END) AS [2008]
FROM (SELECT custid, YEAR(orderdate) AS orderyear
      FROM dbo.Orders) AS D
GROUP BY custid;

This code generates the following output:

custid 2006        2007        2008
------ ----------- ----------- -----------
A      2           1           1
B      1           1           1
C      1           1           1
D      1           0           0

With the Matrix table, simply specify the column corresponding to the target year:

SELECT custid,
  COUNT(y2006) AS [2006],
  COUNT(y2007) AS [2007],
  COUNT(y2008) AS [2008]
FROM (SELECT custid, YEAR(orderdate) AS orderyear
      FROM dbo.Orders) AS D
  JOIN dbo.Matrix AS M ON D.orderyear = M.orderyear
GROUP BY custid;

Of course, using the PIVOT operator, the query strings are pretty much as short as they can get. You don’t explicitly specify the CASE expressions: those are constructed behind the scenes for you (you can actually see them by looking at the properties of the aggregate operator in the plan). In short, you don’t need to use the Matrix table approach with the PIVOT operator. Here’s the query using the PIVOT operator to calculate total yearly quantities per customer:

SELECT *
FROM (SELECT custid, YEAR(orderdate) AS orderyear, qty
      FROM dbo.Orders) AS D
  PIVOT(SUM(qty) FOR orderyear IN([2006],[2007],[2008])) AS P;

And here’s a query that counts the orders:

SELECT *
FROM (SELECT custid, YEAR(orderdate) AS orderyear
      FROM dbo.Orders) AS D
  PIVOT(COUNT(orderyear) FOR orderyear IN([2006],[2007],[2008])) AS P;

Remember that static queries performing pivoting require you to know ahead of time the list of attributes you’re going to rotate. For dynamic pivoting, you need to construct the query string dynamically.

Save to your account