# Aggregating and Pivoting Data in Microsoft SQL Server 2008 T-SQL

- 3/25/2009

## Histograms

Histograms are powerful analytical tools that express the distribution of items. For example, suppose you need to figure out from the order information in the Sales.OrderValues view how many small, medium, and large orders you have, based on the order values. In other words, you need a histogram with three steps. The extreme values (the minimum and maximum values) are what defines values as small, medium, or large. Suppose for the sake of simplicity that the minimum order value is 10 and the maximum is 40. Take the difference between the two extremes (*40 – 10 = 30*) and divide it by the number of steps (3) to get the step size. In this case, it’s 30 divided by 3, which is 10. So the boundaries of step 1 (small) would be 10 and 20; for step 2 (medium), they would be 20 and 30; and for step 3 (large), they would be 30 and 40.

To generalize this, let *mn = MIN(val)* and *mx = MAX(val)* and let *stepsize = (mx – mn) / @numsteps*. Given a step number *n*, the lower bound of the step (*lb*) is *mn + (n – 1) * stepsize* and the higher bound (*hb*) is *mn + n * stepsize*. Something is tricky here. What predicate do you use to bracket the elements that belong in a specific step? You can’t use *val BETWEEN lb and hb* because a value that is equal to *hb* appears in this step and also in the next step, where it equals the lower bound. Remember that the same calculation yielded the higher bound of one step and the lower bound of the next step. One approach to deal with this problem is to increase each of the lower bounds besides the first by one so that they exceed the previous step’s higher bounds. With integers, this is a fine solution, but with another data type (such as NUMERIC in our case) it doesn’t work because there are potential values between adjacent steps but not within either one—between the cracks, so to speak.

What I like to do to solve the problem is keep the same value in both bounds, and instead of using BETWEEN, I use *val >= lb* and *val < hb*. This technique has its own issues, but I find it easier to deal with than the previous technique. The issue here is that the item with the highest quantity (40, in our simplified example) is left out of the histogram. To solve this, I add a very small number to the maximum value before calculating the step size: *stepsize = ((1E0*mx + 0.0000000001) – mn) / @numsteps*. This technique allows the item with the highest value to be included, and the effect on the histogram is otherwise negligible. I multiplied *mx* by the float value *1E0* to protect against the loss of the upper data point when *val* is typed as MONEY or SMALLMONEY.

So you need the following ingredients to generate the lower and higher bounds of the histogram’s steps: *@numsteps* (given as input), step number (the *n* column from the Nums auxiliary table), *mn*, and *stepsize*, which I described earlier.

Here’s the T-SQL code required to produce the step number, lower bound, and higher bound for each step of the histogram:

USE InsideTSQL2008; DECLARE @numsteps AS INT; SET @numsteps = 3; SELECT n AS step, mn + (n - 1) * stepsize AS lb, mn + n * stepsize AS hb FROM dbo.Nums CROSS JOIN (SELECT MIN(val) AS mn, ((1E0*MAX(val) + 0.0000000001) - MIN(val)) / @numsteps AS stepsize FROM Sales.OrderValues) AS D WHERE n < = @numsteps;

This code generates the following output:

step lb hb ----------- ---------------------- ---------------------- 1 12.5 5470.83333333337 2 5470.83333333337 10929.1666666667 3 10929.1666666667 16387.5000000001

You might want to encapsulate this code in a user-defined function to simplify the queries that return the actual histograms, like so:

IF OBJECT_ID('dbo.HistSteps') IS NOT NULL DROP FUNCTION dbo.HistSteps; GO CREATE FUNCTION dbo.HistSteps(@numsteps AS INT) RETURNS TABLE AS RETURN SELECT n AS step, mn + (n - 1) * stepsize AS lb, mn + n * stepsize AS hb FROM dbo.Nums CROSS JOIN (SELECT MIN(val) AS mn, ((1E0*MAX(val) + 0.0000000001) - MIN(val)) / @numsteps AS stepsize FROM Sales.OrderValues) AS D WHERE n < = @numsteps; GO

To test the function, run the following query, which will give you a three-row histogram steps table:

SELECT * FROM dbo.HistSteps(3) AS S;

To return the actual histogram, simply join the steps table and the OrderValues view on the predicate I described earlier (*val >= lb AND val < hb*), group the data by step number, and return the step number and row count:

SELECT step, COUNT(*) AS numorders FROM dbo.HistSteps(3) AS S JOIN Sales.OrderValues AS O ON val >= lb AND val < hb GROUP BY step;

This query generates the following histogram:

step numorders ----------- ----------- 1 803 2 21 3 6

You can see that there are 803 small orders, 21 medium orders, and 6 large order. To return a histogram with 10 steps, simply provide 10 as the input to the *HistSteps* function:

SELECT step, COUNT(*) AS numorders FROM dbo.HistSteps(10) AS S JOIN Sales.OrderValues AS O ON val >= lb AND val < hb GROUP BY step;

This query generates the following output:

step numorders ----------- ----------- 1 578 2 172 3 46 4 14 5 3 6 6 7 8 8 1 10 2

Note that because you’re using an inner join, empty steps are not returned like in the case of step 9. To return empty steps also, you can use the following outer join query:

SELECT step, COUNT(val) AS numorders FROM dbo.HistSteps(10) AS S LEFT OUTER JOIN Sales.OrderValues AS O ON val >= lb AND val < hb GROUP BY step;

As you can see in the output of this query, empty steps are included this time:

step numorders ----------- ----------- 1 578 2 172 3 46 4 14 5 3 6 6 7 8 8 1 9 0 10 2

There’s another alternative to taking care of the issue with the step boundaries and the predicate used to identify a match. You can simply check whether the step number is 1, in which case you subtract 1 from the lower bound. Then, in the query generating the actual histogram, you use the predicate *val > lb AND val <= hb*.

Another approach is to check whether the step is the last, and if it is, add 1 to the higher bound. Then use the predicate *val >= lb AND val < hb*.

Here’s the revised function implementing the latter approach:

ALTER FUNCTION dbo.HistSteps(@numsteps AS INT) RETURNS TABLE AS RETURN SELECT n AS step, mn + (n - 1) * stepsize AS lb, mn + n * stepsize + CASE WHEN n = @numsteps THEN 1 ELSE 0 END AS hb FROM dbo.Nums CROSS JOIN (SELECT MIN(val) AS mn, (1E0*MAX(val) - MIN(val)) / @numsteps AS stepsize FROM Sales.OrderValues) AS D WHERE n < = @numsteps; GO

And the following query generates the actual histogram:

SELECT step, COUNT(val) AS numorders FROM dbo.HistSteps(3) AS S LEFT OUTER JOIN Sales.OrderValues AS O ON val >= lb AND val < hb GROUP BY step;