Microsoft SQL Server 2012 Internals: Special Storage

  • 11/15/2013

Sparse columns

This section looks at another special storage format, added in SQL Server 2008. Sparse columns are ordinary columns that have an optimized storage format for NULL values. Sparse columns reduce the space requirements for NULL values, allowing you to have many more columns in your table definition, as long as most of them are NULL. Using sparse columns requires more overhead to store and retrieve non-NULL values.

Sparse columns are intended to be used for tables storing data describing entities with many possible attributes, in which most of the attributes will be NULL for most rows. For example, a content management system such as Microsoft Windows SharePoint Services might need to keep track of many different types of data in a single table. Because different properties apply to different subsets of rows in the table, only a small subset of the columns is populated with values for each row. Another way of looking at this is that for any particular property, only a subset of rows has a value for that property. Sparse columns allow you to store a very large number of possible columns for a single row. For this reason, the Sparse Columns feature is sometimes also referred to as the wide-table feature.

Management of sparse columns

You shouldn’t consider defining a column as SPARSE unless at least 90 percent of the rows in the table are expected to have NULL values for that column. This limit isn’t enforced, however, and you can define almost any column as SPARSE. Sparse columns save space on NULL values.

The Sparse Columns feature allows you to have far more columns that you ever could before. The limit is now 30,000 columns in a table, with no more than 1,024 of them being non-sparse. (Computed columns are considered non-sparse.) Obviously, not all 30,000 columns could have values in them. The number of populated columns you can have depends on the bytes of data in the row. Sparse columns optimize the storage size for NULL values, which take no space at all for sparse columns, unlike non-sparse columns, which do need space even for NULLs. (As you saw in Chapter 6, a fixed-length NULL column always uses the whole column width, and a variable-length NULL column uses at least two bytes in the column offset array.)

Although the sparse columns themselves take no space, some fixed overhead is needed to allow for sparse columns in a row. As soon as you define even one column with the SPARSE attribute, SQL Server adds a sparse vector to the end of the row. We’ll see the actual structure of this sparse vector in the section “Physical storage,” later in this chapter, but to start, you should be aware that even with sparse columns, the maximum size of a data row (excluding LOB and row-overflow) remains at 8,060, including overhead bytes. Because the sparse vector includes additional overhead, the maximum number of bytes for the rest of the rows decreases. Also, the size of all fixed-length non-NULL sparse columns in a row is limited to 8,023 bytes.

Creating a table

Creating a table with sparse columns is very straightforward, as you can just add the attribute SPARSE to any column of any data type except text, ntext, image, geography, geometry, timestamp, or any user-defined data type. Also, sparse columns can’t include the IDENTITY, ROWGUIDCOL, or FILESTREAM attributes. A sparse column can’t be part of a clustered index or part of the primary key. Tables containing sparse columns can’t be compressed, either at the row level or the page level. (The next section discusses compression in detail.) A few other restrictions are enforced, particularly if you are partitioning a table with sparse columns, so you should check the documentation for full details.

The examples in this section are necessarily very simple because it would be impractical to print code examples with enough columns to make sparse columns really useful. The following example shows the creation of two very similar tables: one that doesn’t allow sparse columns and another that does. I attempt to insert the same rows into each table. Because a row allowing sparse columns has a smaller maximum length, it fails when trying to insert a row that the table with no sparse columns has no problem with:

USE testdb;
GO
IF OBJECT_ID('test_nosparse') IS NOT NULL
    DROP TABLE test_nosparse;
GO
CREATE TABLE test_nosparse
(
  col1 int,
  col2 char(8000),
  col3 varchar(8000)
);
GO
INSERT INTO test_nosparse
        SELECT null, null, null;
INSERT INTO test_nosparse
        SELECT 1, 'a', 'b';
GO

These two rows can be inserted with no error. Now, build the second table:

IF OBJECT_ID('test_sparse') IS NOT NULL
    DROP TABLE test_sparse;
GO
CREATE TABLE test_sparse
(
 col1 int SPARSE,
 col2 char(8000) SPARSE,
 col3 varchar(8000) SPARSE
);
GO
INSERT INTO test_sparse
        SELECT NULL, NULL, NULL;
INSERT INTO test_sparse
        SELECT 1, 'a', 'b';
GO

The second INSERT statement generates the following error:

Msg 576, Level 16, State 5, Line 2
Cannot create a row that has sparse data of size 8046 which is greater than the
allowable maximum sparse data size of 8023.

Although the second row inserted into the test_sparse table looks just like a row that was inserted successfully into the test_nosparse table, internally it’s not. The total of the sparse columns is 4 bytes for the int, plus 8,000 bytes for the char and 24 bytes for the row-overflow pointer, which is greater than the 8,023-byte limit.

Altering a table

You can alter tables to convert a non-sparse column into a sparse column, or vice versa. Be careful, however, because if you are altering a very large row in a table with no sparse columns, changing one column to be sparse reduces the number of bytes of data that are allowed on a page. This can result in an error being thrown in cases where an existing column is converted into a sparse column. For example, the following code creates a table with large rows, but the INSERT statements, with or without NULLs, are accepted. However, when you try to make one of the columns SPARSE—even a relatively small column like the 8-byte datetime column—the extra overhead makes the existing rows too large and the ALTER fails:

IF OBJECT_ID('test_nosparse_alter') IS NOT NULL
    DROP TABLE test_nosparse_alter;
GO
GO
CREATE TABLE test_nosparse_alter
(
c1 int,
c2 char(4020) ,
c3 char(4020) ,
c4 datetime
);
GO
INSERT INTO test_nosparse_alter SELECT NULL, NULL, NULL, NULL;
INSERT INTO test_nosparse_alter SELECT 1, 1, 'b', GETDATE();
GO
ALTER TABLE test_nosparse_alter
  ALTER COLUMN c4 datetime SPARSE;

This error is received:

Msg 1701, Level 16, State 1, Line 2
Creating or altering table 'test_nosparse_alter' failed because the minimum row size
would be 8075, including 23 bytes of internal overhead. This exceeds the maximum
allowable table row size of 8060 bytes.

In general, you can treat sparse columns just like any other column, with only a few restrictions. In addition to the restrictions mentioned earlier on the data types that can’t be defined as SPARSE, you need to keep in mind the following limitations.

  • A sparse column can’t have a default value.

  • A sparse column can’t be bound to a rule.

  • Although a computed column can refer to a sparse column, a computed column can’t be marked as SPARSE.

  • A sparse column can’t be part of a clustered index or a unique primary key index. However, both persisted and non-persisted computed columns that refer to sparse columns can be part of a clustered key.

  • A sparse column can’t be used as a partition key of a clustered index or heap. However, a sparse column can be used as the partition key of a nonclustered index.

Except for the requirement that sparse columns can’t be part of the clustered index or primary key, building indexes on sparse columns has no other restrictions. However, if you’re using sparse columns the way they are intended to be used and the vast majority of your rows have NULL for the sparse columns, any regular index on a sparse column is very inefficient and might have limited usefulness. Sparse columns are really intended to be used with filtered indexes, which are discussed in Chapter 7.

Column sets and sparse column manipulation

If sparse columns are used as intended, only a few columns in each row have values, and your INSERT and UPDATE statements are relatively straightforward. For INSERT statements, you can specify a column list and then specify values only for those few columns in the column list. For UPDATE statements, values can be specified for just a few columns in each row. The only time you need to be concerned about how to deal with a potentially very large list of columns is if you are selecting data without listing individual columns—that is, using a SELECT *. Good developers know that using SELECT * is never a good idea, but SQL Server needs a way of dealing with a result set with potentially thousands (or tens of thousands) of columns. The mechanism to help deal with SELECT * is a construct called COLUMN_SET, which is an untyped XML representation that combines multiple columns of a table into a structured output. You can think of a COLUMN_SET as a nonpersisted computed column because the COLUMN_SET isn’t physically stored in the table. In this release of SQL Server, the only possible COLUMN_SET contains all the sparse columns in the table. Future versions might allow us to define other COLUMN_SET variations.

A table can only have one COLUMN_SET defined, and when a table has a COLUMN_SET defined, SELECT * no longer returns individual sparse columns. Instead, it returns an XML fragment containing all the non-NULL values for the sparse columns. For example, the code in Listing 8-4 builds a table containing an identity column, 25 sparse columns, and a column set.

Listing 8-4 Building a table with an identity column, sparse columns, and a column set

USE testdb;
GO
IF EXISTS (SELECT * FROM sys.tables WHERE name = 'lots_of_sparse_columns')
                 DROP TABLE lots_of_sparse_columns;
GO
CREATE TABLE lots_of_sparse_columns
(ID int IDENTITY,
 col1 int SPARSE,
 col2 int SPARSE,
 col3 int SPARSE,
 col4 int SPARSE,
 col5 int SPARSE,
 col6 int SPARSE,
 col7 int SPARSE,
 col8 int SPARSE,
 col9 int SPARSE,
 col10 int SPARSE,
 col11 int SPARSE,
 col12 int SPARSE,
 col13 int SPARSE,
 col14 int SPARSE,
 col15 int SPARSE,
 col16 int SPARSE,
 col17 int SPARSE,
 col18 int SPARSE,
 col19 int SPARSE,
 col20 int SPARSE,
 col21 int SPARSE,
 col22 int SPARSE,
 col23 int SPARSE,
 col24 int SPARSE,
 col25 int SPARSE,
 sparse_column_set XML COLUMN_SET FOR ALL_SPARSE_COLUMNS);
 GO

Next, values are inserted into 3 of the 25 columns, specifying individual column names:

INSERT INTO lots_of_sparse_columns (col4, col7, col12)  SELECT 4,6,11;

You can also insert directly into the COLUMN_SET, specifying values for columns in an XML fragment. The capability to update the COLUMN_SET is another feature that differentiates COLUMN_SETs from computed columns:

INSERT INTO lots_of_sparse_columns (sparse_column_set)
                SELECT '<col8>42</col8><col17>0</col17><col22>30000</col22>';

Here are my results when I run SELECT * from this table:

SELECT * FROM lots_of_sparse_columns;
Results:
ID      sparse_column_set
------- ---------------------------------------------------
1       <col4>4</col4><col7>6</col7><col12>11</col12>
2       <col8>42</col8><col17>0</col17><col22>30000</col22>

You can still select from individual columns, either instead of or in addition to selecting the entire COLUMN_SET. So the following SELECT statements are both valid:

SELECT ID, col10, col15, col20
    FROM lots_of_sparse_columns;
SELECT *, col11
    FROM lots_of_sparse_columns;

Keep the following points in mind if you decide to use sparse columns in your tables.

  • When defined, the COLUMN_SET can’t be altered. To change a COLUMN_SET, you must drop and re-create the COLUMN_SET column.

  • A COLUMN_SET can be added to a table that doesn’t include any sparse columns. If sparse columns are later added to the table, they appear in the column set.

  • A COLUMN_SET is optional and isn’t required to use sparse columns.

  • Constraints or default values can’t be defined on a COLUMN_SET.

  • Distributed queries aren’t supported on tables that contain COLUMN_SETs.

  • Replication doesn’t support COLUMN_SETs.

  • The Change Data Capture feature doesn’t support COLUMN_SETs.

  • A COLUMN_SET can’t be part of any kind of index. This includes XML indexes, full-text indexes, and indexed views. A COLUMN_SET also can’t be added as an included column in any index.

  • A COLUMN_SET can’t be used in the filter expression of a filtered index or filtered statistics.

  • When a view includes a COLUMN_SET, the COLUMN_SET appears in the view as an XML column.

  • XML data has a size limit of 2 GB. If the combined data of all the non-NULL sparse columns in a row exceeds this limit, the operation produces an error.

  • Copying all columns from a table with a COLUMN_SET (using either SELECT * INTO or INSERT INTO SELECT *) doesn’t copy the individual sparse columns. Only the COLUMN_SET, as data type XML, is copied.

Physical storage

At a high level, you can think of sparse columns as being stored much as they are displayed using the COLUMN_SET—that is, as a set of (column-name, value) pairs. So if a particular column has no value, it’s not listed and no space at all is required. If a column has a value, not only does SQL Server need to store that value but it also needs to store information about which column has that value. As a result, non-NULL sparse columns take more space than their NULL counterparts. To see the difference graphically, you can compare Tables 8-5 and Table 8-6.

Table 8-5 represents a table with non-sparse columns. You can see a lot of wasted space when most of the columns are NULL. Table 8-6 shows what the same table looks like if all the columns except the ID are defined as SPARSE. All that is stored are the names of all the non-NULL columns and their values.

Table 8-5 Representation of a table defined with non-sparse columns, with many NULL values

ID

sc1

sc2

sc3

sc4

sc5

sc6

sc7

sc8

sc9

1

1

9

2

2

4

3

6

7

4

1

5

5

4

8

6

3

9

7

5

7

8

2

8

9

3

6

Table 8-6 Representation of a table defined with sparse columns, with many NULL values

ID

<sparse columns>

1

(sc1,sc9)(1,9)

2

(sc2,sc4)(2,4)

3

(sc6,sc7)(6,7)

4

(sc1,sc5)(1,5)

5

(sc4,sc8)(4,8)

6

(sc3,sc9)(3,9)

7

(sc5,sc7)(5,7)

8

(sc2,sc8)(2,8)

9

(sc3,sc6)(3,6)

SQL Server keeps track of the physical storage of sparse columns with a structure within a row called a sparse vector. Sparse vectors are present only in the data records of a base table that has at least one sparse column declared, and each data record of these tables contains a sparse vector. A sparse vector is stored as a special variable-length column at the end of a data record. It’s a special system column, and no metadata about this column appears in sys.columns or any other view. The sparse vector is stored as the last variable-length column in the row. The only thing after the sparse vector would be versioning information, used primarily with Snapshot isolation, as is discussed in Chapter 13. The NULL bitmap has no bit for the sparse vector column (if a sparse vector exists, it’s never NULL), but the count in the row of the number of variable-length columns includes the sparse vector. You might want to revisit Figure 6-5 in Chapter 6 at this time to familiarize yourself with the general structure of data rows.

Table 8-7 lists the meanings of the bytes in the sparse vector.

Table 8-7 Bytes in a sparse vector

Name

Number of bytes

Meaning

Complex Column Header

2

Value of 05 indicates that the complex column is a sparse vector.

Sparse Column Count

2

Number of sparse columns.

Column ID Set

2 * the number of sparse columns

Two bytes for the column ID of each column in the table with a value stored in the sparse vector.

Column Offset Table

2 * the number of sparse columns

Two bytes for the offset of the ending position of each sparse column.

Sparse Data

Depends on actual values

Data

Now look at the bytes of a row containing sparse columns. First, build a table containing two sparse columns, and populate it with three rows:

USE testdb;
GO
IF OBJECT_ID ('sparse_bits') IS NOT NULL
                 DROP TABLE sparse_bits;
GO
CREATE TABLE sparse_bits
(
c1 int IDENTITY,
c2 varchar(4),
c3 char(4) SPARSE,
c4 varchar(4) SPARSE
);
GO
INSERT INTO sparse_bits SELECT 'aaaa', 'bbbb', 'cccc';
INSERT INTO sparse_bits SELECT 'dddd', null, 'eeee';
INSERT INTO sparse_bits SELECT 'ffff', null, 'gg';
GO

Now you can use sys.dm_db_database_page_allocations to find the page number for the data page storing these three rows and then use DBCC PAGE to look at the bytes on the page:

SELECT allocated_page_file_id as PageFID, allocated_page_page_id as PagePID,
       object_id as ObjectID, partition_id AS PartitionID,
       allocation_unit_type_desc as AU_type, page_type as PageType
FROM sys.dm_db_database_page_allocations
       (db_id('testdb'), object_id('sparse_bits'), null, null, 'DETAILED');
-- The output indicated that the data page for my table was on page 289;
DBCC TRACEON(3604);
DBCC PAGE(testdb, 1, 289, 1);

Only the output for the first data row, which is spread over three lines of DBCC PAGE output, is shown here:

00000000:   30000800 01000000 02000002 00150029 80616161 0..............).aaa
00000014:   61050002 00030004 00100014 00626262 62636363 a            bbbbccc
00000020:   63                                           c

The boldfaced bytes are the sparse vector. You can find it easily because it starts right after the last non-sparse variable-length column, which contained aaaa, or 61616161, and continues to the end of the row. Figure 8-8 translates the sparse vector according to the meanings from Table 8-7. Don’t forget that you need to byte-swap numeric fields before translating. For example, the first two bytes are 05 00, which need to be swapped to get the hex value 0x0005. Then you can convert it to decimal.

Figure 8-8

Figure 8-8 Interpretation of the actual bytes in a sparse vector.

You can apply the same analysis to the bytes in the other two rows on the page. Here are some things to note:

  • No information about columns with NULL values appears in the sparse vector.

  • No difference exists in storage between fixed-length and variable-length strings within the sparse vector. However, that doesn’t mean you should use the two interchangeably. A sparse varchar column that doesn’t fit in the 8,060 bytes can be stored as row-overflow data; a sparse char column can’t be.

  • Because only 2 bytes are used to store the number of sparse columns, this sets the limit on the maximum number of sparse columns.

  • The 2 bytes for the complex column header indicate that there might be other possibilities for complex columns. At this time, the only other type of complex column that can be stored is one storing a back-pointer, as SQL Server does when it creates a forwarded record. (Chapter 6 briefly discussed forwarded records when discussing updates to heaps.)

Metadata

Very little extra metadata is needed to support sparse columns. The catalog view sys.columns contains two columns to keep track of sparse columns in your tables: is_sparse and is_column_set. Each column has only two possible values, 0 or 1.

Corresponding to these column properties in sys.columns, the property function COLUMNPROPERTY() also has the following properties related to sparse columns: IsSparse and IsColumnSet.

To inspect all tables with “sparse” in their name and determine which of their columns are SPARSE, which are column sets, and which are neither, you can run the following query:

SELECT OBJECT_NAME(object_id) as 'Table', name as 'Column', is_sparse, is_column_set
FROM sys.columns
WHERE OBJECT_NAME(object_id) like '%sparse%';

To see just the table and column names for all COLUMN_SET columns, you can run the following query:

SELECT OBJECT_NAME(object_id) as 'Table', name as 'Column'
FROM sys.columns
WHERE COLUMNPROPERTY(object_id, name, 'IsColumnSet') = 1;

Storage savings with sparse columns

The sparse column feature is designed to save you considerable space when most of your values are NULL. In fact, as mentioned earlier, columns that aren’t NULL but are defined as SPARSE take up more space than if they weren’t defined as SPARSE because the sparse vector has to store a couple of extra bytes to keep track of them. To start to see the space differences, you can run the script in Listing 8-5, which creates four tables with relatively short, fixed-length columns. Two have sparse columns and two don’t. Rows are inserted into each table in a loop, which inserts 100,000 rows. One table with sparse columns is populated with rows with NULL values, and the other is populated with rows that aren’t NULL. One table with no sparse columns is populated with rows with NULL values; the other is populated with rows that aren’t NULL.

Listing 8-5 Saving space with sparse columns

USE testdb;
GO
SET NOCOUNT ON;
GO
IF OBJECT_ID('sparse_nonulls_size') IS NOT NULL
               DROP TABLE sparse_nonulls_size;
GO
CREATE TABLE sparse_nonulls_size
(col1 int IDENTITY,
 col2 datetime SPARSE,
 col3 char(10) SPARSE
 );
GO
IF OBJECT_ID('nonsparse_nonulls_size') IS NOT NULL
               DROP TABLE nonsparse_nonulls_size;
GO
GO
CREATE TABLE nonsparse_nonulls_size
(col1 int IDENTITY,
 col2 datetime,
 col3 char(10)
 );
GO
IF OBJECT_ID('sparse_nulls_size') IS NOT NULL
               DROP TABLE sparse_nulls_size;
GO
GO
CREATE TABLE sparse_nulls_size
(col1 int IDENTITY,
 col2 datetime SPARSE,
 col3 char(10) SPARSE
 );
GO
IF OBJECT_ID('nonsparse_nulls_size') IS NOT NULL
               DROP TABLE nonsparse_nulls_size;
GO
GO
CREATE TABLE nonsparse_nulls_size
(col1 int IDENTITY,
 col2 datetime,
 col3 char(10)
 );
GO
DECLARE @num int
SET @num = 1
WHILE @num < 100000
BEGIN
  INSERT INTO sparse_nonulls_size
        SELECT GETDATE(), 'my message';
  INSERT INTO nonsparse_nonulls_size
        SELECT GETDATE(), 'my message';
  INSERT INTO sparse_nulls_size
        SELECT NULL, NULL;
  INSERT INTO nonsparse_nulls_size
        SELECT NULL, NULL;
  SET @num = @num + 1;
END;
GO

Now look at the number of pages in each table. The following metadata query looks at the number of data pages in the sys.allocation_units view for each of the four tables:

SELECT object_name(object_id) as 'table with 100K rows', data_pages
FROM sys.allocation_units au
    JOIN sys.partitions p
       ON p.partition_id = au.container_id
WHERE object_name(object_id) LIKE '%sparse%size';

And here are my results:

table with 100K rows     data_pages
------------------------ ----------
sparse_nonulls_size      610
nonsparse_nonulls_size   402
sparse_nulls_size        169
nonsparse_nulls_size     402

Note that the smallest number of pages is required when the table has NULL sparse columns. If the table has no sparse columns, the space usage is the same whether or not the columns have NULLs because the data was defined as fixed length. This space requirement is more than twice as much as needed for the sparse columns with NULL. The worst case is if the columns have been defined as SPARSE but have no NULL values.

Of course, the previous examples are edge cases, where all the data is either NULL or non-NULL, and is of all fixed-length data types. So although you can say that sparse columns require more storage space for non-NULL values than is required for identical data that’s not declared as SPARSE, the actual space savings depends on the data types and the percentage of rows that are NULL. Table 8-8—reprinted from SQL Server Books Online—shows the space usage for each data type. The NULL Percentage column indicates what percent of the data must be NULL to achieve a net space savings of 40 percent.

Table 8-8 Storage requirements for sparse columns

Data type

Storage bytes when not SPARSE

Storage bytes when SPARSE and bot NULL

NULL percentage

Fixed-length data types

Bit

0.125

5

98 percent

Tinyint

1

5

86 percent

Smallint

2

6

76 percent

Int

4

8

64 percent

Bigint

8

12

52 percent

Real

4

8

64 percent

Float

8

12

52 percent

smallmoney

4

8

64 percent

Money

8

12

52 percent

smalldatetime

4

8

64 percent

Datetime

8

12

52 percent

uniqueidentifier

16

20

43 percent

Date

3

7

69 percent

Precision-dependent–length data types

datetime2(0)

6

10

57 percent

datetime2(7)

8

12

52 percent

time(0)

3

7

69 percent

time(7)

5

9

60 percent

datetimetoffset(0)

8

12

52 percent

datetimetoffset (7)

10

14

49 percent

decimal/numeric(1,s)

5

9

60 percent

decimal/numeric(38,s)

17

21

42 percent

Data-dependent–length data types

sql_variant

Varies

varchar or char

2+avg. data

4+avg. data

60 percent

nvarchar or nchar

2+avg. data

4+avg. data

60 percent

varbinary or binary

2+avg. data

4+avg. data

60 percent

Xml

2+avg. data

4+avg. data

60 percent

hierarchyId

2+avg. data

4+avg. data

60 percent

The general recommendation is that you should consider using sparse columns when you anticipate that they provide a space savings of at least 20 to 40 percent.