Aggregating and Pivoting Data in Microsoft SQL Server 2008 T-SQL

  • 3/25/2009

Histograms

Histograms are powerful analytical tools that express the distribution of items. For example, suppose you need to figure out from the order information in the Sales.OrderValues view how many small, medium, and large orders you have, based on the order values. In other words, you need a histogram with three steps. The extreme values (the minimum and maximum values) are what defines values as small, medium, or large. Suppose for the sake of simplicity that the minimum order value is 10 and the maximum is 40. Take the difference between the two extremes (40 – 10 = 30) and divide it by the number of steps (3) to get the step size. In this case, it’s 30 divided by 3, which is 10. So the boundaries of step 1 (small) would be 10 and 20; for step 2 (medium), they would be 20 and 30; and for step 3 (large), they would be 30 and 40.

To generalize this, let mn = MIN(val) and mx = MAX(val) and let stepsize = (mx – mn) / @numsteps. Given a step number n, the lower bound of the step (lb) is mn + (n – 1) * stepsize and the higher bound (hb) is mn + n * stepsize. Something is tricky here. What predicate do you use to bracket the elements that belong in a specific step? You can’t use val BETWEEN lb and hb because a value that is equal to hb appears in this step and also in the next step, where it equals the lower bound. Remember that the same calculation yielded the higher bound of one step and the lower bound of the next step. One approach to deal with this problem is to increase each of the lower bounds besides the first by one so that they exceed the previous step’s higher bounds. With integers, this is a fine solution, but with another data type (such as NUMERIC in our case) it doesn’t work because there are potential values between adjacent steps but not within either one—between the cracks, so to speak.

What I like to do to solve the problem is keep the same value in both bounds, and instead of using BETWEEN, I use val >= lb and val < hb. This technique has its own issues, but I find it easier to deal with than the previous technique. The issue here is that the item with the highest quantity (40, in our simplified example) is left out of the histogram. To solve this, I add a very small number to the maximum value before calculating the step size: stepsize = ((1E0*mx + 0.0000000001) – mn) / @numsteps. This technique allows the item with the highest value to be included, and the effect on the histogram is otherwise negligible. I multiplied mx by the float value 1E0 to protect against the loss of the upper data point when val is typed as MONEY or SMALLMONEY.

So you need the following ingredients to generate the lower and higher bounds of the histogram’s steps: @numsteps (given as input), step number (the n column from the Nums auxiliary table), mn, and stepsize, which I described earlier.

Here’s the T-SQL code required to produce the step number, lower bound, and higher bound for each step of the histogram:

USE InsideTSQL2008;

DECLARE @numsteps AS INT;
SET @numsteps = 3;

SELECT n AS step,
  mn + (n - 1) * stepsize AS lb,
  mn + n * stepsize AS hb
FROM dbo.Nums
  CROSS JOIN
    (SELECT MIN(val) AS mn,
       ((1E0*MAX(val) + 0.0000000001) - MIN(val))
       / @numsteps AS stepsize
     FROM Sales.OrderValues) AS D
WHERE n < = @numsteps;

This code generates the following output:

step        lb                     hb
----------- ---------------------- ----------------------
1           12.5                   5470.83333333337
2           5470.83333333337       10929.1666666667
3           10929.1666666667       16387.5000000001

You might want to encapsulate this code in a user-defined function to simplify the queries that return the actual histograms, like so:

IF OBJECT_ID('dbo.HistSteps') IS NOT NULL
  DROP FUNCTION dbo.HistSteps;
GO
CREATE FUNCTION dbo.HistSteps(@numsteps AS INT) RETURNS TABLE
AS
RETURN
  SELECT n AS step,
    mn + (n - 1) * stepsize AS lb,
    mn + n * stepsize AS hb
  FROM dbo.Nums
    CROSS JOIN
      (SELECT MIN(val) AS mn,
         ((1E0*MAX(val) + 0.0000000001) - MIN(val))
         / @numsteps AS stepsize
       FROM Sales.OrderValues) AS D
  WHERE n < = @numsteps;
GO

To test the function, run the following query, which will give you a three-row histogram steps table:

SELECT * FROM dbo.HistSteps(3) AS S;

To return the actual histogram, simply join the steps table and the OrderValues view on the predicate I described earlier (val >= lb AND val < hb), group the data by step number, and return the step number and row count:

SELECT step, COUNT(*) AS numorders
FROM dbo.HistSteps(3) AS S
  JOIN Sales.OrderValues AS O
    ON val >= lb AND val < hb
GROUP BY step;

This query generates the following histogram:

step        numorders
----------- -----------
1           803
2           21
3           6

You can see that there are 803 small orders, 21 medium orders, and 6 large order. To return a histogram with 10 steps, simply provide 10 as the input to the HistSteps function:

SELECT step, COUNT(*) AS numorders
FROM dbo.HistSteps(10) AS S
  JOIN Sales.OrderValues AS O
    ON val >= lb AND val < hb
GROUP BY step;

This query generates the following output:

step        numorders
----------- -----------
1           578
2           172
3           46
4           14
5           3
6           6
7           8
8           1
10          2

Note that because you’re using an inner join, empty steps are not returned like in the case of step 9. To return empty steps also, you can use the following outer join query:

SELECT step, COUNT(val) AS numorders
FROM dbo.HistSteps(10) AS S
  LEFT OUTER JOIN Sales.OrderValues AS O
    ON val >= lb AND val < hb
GROUP BY step;

As you can see in the output of this query, empty steps are included this time:

step        numorders
----------- -----------
1           578
2           172
3           46
4           14
5           3
6           6
7           8
8           1
9           0
10          2

There’s another alternative to taking care of the issue with the step boundaries and the predicate used to identify a match. You can simply check whether the step number is 1, in which case you subtract 1 from the lower bound. Then, in the query generating the actual histogram, you use the predicate val > lb AND val <= hb.

Another approach is to check whether the step is the last, and if it is, add 1 to the higher bound. Then use the predicate val >= lb AND val < hb.

Here’s the revised function implementing the latter approach:

ALTER FUNCTION dbo.HistSteps(@numsteps AS INT) RETURNS TABLE
AS
RETURN
  SELECT n AS step,
    mn + (n - 1) * stepsize AS lb,
    mn + n * stepsize + CASE WHEN n = @numsteps THEN 1 ELSE 0 END AS hb
  FROM dbo.Nums
    CROSS JOIN
      (SELECT MIN(val) AS mn,
         (1E0*MAX(val) - MIN(val)) / @numsteps AS stepsize
    FROM Sales.OrderValues) AS D
  WHERE n < = @numsteps;
GO

And the following query generates the actual histogram:

SELECT step, COUNT(val) AS numorders
FROM dbo.HistSteps(3) AS S
  LEFT OUTER JOIN Sales.OrderValues AS O
    ON val >= lb AND val < hb
GROUP BY step;