Microsoft SQL Server 2012 Internals: Special Storage

  • 11/15/2013

Table and index partitioning

As you’ve already seen when looking at the metadata for table and index storage, partitioning is an integral feature of SQL Server space organization. Figure 6-3 in Chapter 6 illustrated the relationship between tables and indexes, partitions, and allocation units. Tables and indexes that are built without any reference to partitions are considered to be stored on a single partition. One of the more useful metadata objects for retrieving information about data storage is the sys.dm_db_partition_stats dynamic management view, which combines information found in sys.partitions, sys.allocation_units and sys.indexes.

A partitioned object is split internally into separate physical units that can be stored in different locations. Partitioning is invisible to the users and programmers, who can use T-SQL code to select from a partitioned table exactly the same way they select from a nonpartitioned table. Creating large objects on multiple partitions improves the manageability and maintainability of your database system and can greatly enhance the performance of activities such as purging historic data and loading large amounts of data. In SQL Server 2000, partitioning was available only by manually creating a view that combines multiple tables—a functionality that’s referred to as partitioned views. SQL Server 2005 introduced built-in partitioning of tables and indexes, which has many advantages over partitioned views, including improved execution plans and fewer prerequisites for implementation.

This section focuses primarily on physical storage of partitioned objects and the partitioning metadata. Chapter 11, “The Query Optimizer,” examines query plans involving partitioned tables and partitioned indexes.

Partition functions and partition schemes

To understand the partitioning metadata, you need a little background into how partitions are defined, using an example based on the SQL Server samples. You can find my Partition.sql script on the companion website. This script defines two tables, TransactionHistory and TransactionHistoryArchive, along with a clustered index and two nonclustered indexes on each. Both tables are partitioned on the TransactionDate column, with each month of data in a separate partition. Initially, TransactionHistory has 12 partitions and TransactionHistoryArchive has two.

Before you create a partitioned table or index, you must define a partition function, which is used to define the partition boundaries logically. When a partition function is created, you must specify whether the partition should use a LEFT-based or RIGHT-based boundary point. Simply put, this defines whether the boundary value itself is part of the left-hand or right-hand partition. Another way to consider this is to ask this question: Is it an upper boundary of one partition (in which case it goes to the LEFT), or a lower boundary point of the next partition (in which case it goes to the RIGHT)? The number of partitions created by a partition function with n boundaries will be n+1. Here is the partition function being used for this example:

CREATE PARTITION FUNCTION [TransactionRangePF1] (datetime)
AS RANGE RIGHT FOR VALUES ('20111001', '20111101', '20111201',
               '20120101', '20120201', '20120301', '20120401',
               '20120501', '20120601', '20120701', '20120801');

Notice that the table name isn’t mentioned in the function definition because the partition function isn’t tied to any particular table. The TransactionRangePF1 function divides the data into 12 partitions because 11 datetime boundaries exist. The keyword RIGHT specifies that any value that equals one of the boundary points goes into the partition to the right of the endpoint. So for this function, all values less than October 1, 2011 go in the first partition and values greater than or equal to October 1, 2011 and less than November 1, 2011 go in the second partition. LEFT (the default) could also have been specified, in which case the value equal to the endpoint goes in the partition to the left. After you define the partition function, you define a partition scheme, which lists a set of filegroups onto which each range of data is placed. Here is the partition schema for my example:

CREATE PARTITION SCHEME [TransactionsPS1]
AS PARTITION [TransactionRangePF1]
TO ([PRIMARY], [PRIMARY], [PRIMARY]
, [PRIMARY], [PRIMARY], [PRIMARY]
, [PRIMARY], [PRIMARY], [PRIMARY]
, [PRIMARY], [PRIMARY], [PRIMARY]);
GO

To avoid having to create 12 files and filegroups, I have put all the partitions on the PRIMARY filegroup, but for the full benefit of partitioning, you should probably have each partition on its own filegroup. The CREATE PARTITION SCHEME command must list at least as many filegroups as partitions, but it can list one more filegroup, which is considered the “next used” filegroup. If the partition function splits, the new boundary point is added in the filegroup used next. If you don’t specify an extra filegroup at the time you create the partition scheme, you can alter the partition scheme to set the next-used filegroup before modifying the function.

As you’ve seen, the listed filegroups don’t have to be unique. In fact, if you want to have all the partitions on the same filegroup, as I have here, you can use a shortcut syntax:

CREATE PARTITION SCHEME [TransactionsPS1]
AS PARTITION [TransactionRangePF1]
ALL TO ([PRIMARY]);
GO

Note that putting all the partitions on the same filegroup is usually done just for the purpose of testing your code.

Additional filegroups are used in order as more partitions are added, which can happen when a partition function is altered to split an existing range into two. If you don’t specify extra filegroups at the time you create the partition scheme, you can alter the partition scheme to add another filegroup.

The partition function and partition scheme for a second table are shown here:

CREATE PARTITION FUNCTION [TransactionArchivePF2] (datetime)
AS RANGE RIGHT FOR VALUES ('20110901');
GO
CREATE PARTITION SCHEME [TransactionArchivePS2]
AS PARTITION [TransactionArchivePF2]
TO ([PRIMARY], [PRIMARY]);
GO

The script then creates two tables and loads data into them. I will not include all the details here. To partition a table, you must specify a partition scheme in the CREATE TABLE statement. I create a table called TransactionHIstory that includes this line as the last part of the CREATE TABLE statement as follows:

ON [TransactionsPS1] (TransactionDate)

The second table, TransactionHistoryArchive, is created using the TransactionsPS2 partitioning scheme. The script then loads data into the two tables, and because the partition scheme has already been defined, each row is placed in the appropriate partition as the data is loaded. After the tables are loaded, you can examine the metadata.

Metadata for partitioning

Figure 8-18 shows most of the catalog views for retrieving information about partitions. Along the left and bottom edges, you can see the sys.tables, sys.indexes, sys.partitions, and sys.allocation_units catalog views that were discussed earlier in this chapter.

Figure 8-18

Figure 8-18 Catalog views containing metadata for partitioning and data storage.

Some of the queries use the undocumented sys.system_internals_allocation_units view instead of sys.allocation_units to retrieve page address information. The following are the most relevant columns of each of these views.

  • sys.data_spaces has a primary key called data_space_id, which is either a partition ID or a filegroup ID. Each filegroup has one row, and each partition scheme has one row. One column in sys.data_spaces specifies to which type of data space the row refers. If the row refers to a partition scheme, data_space_id can be joined with sys.partition_schemes.data_space_id. If the row refers to a filegroup, data_space_id can be joined with sys.filegroups.data_space_id. The sys.indexes view also has a data_space_id column to indicate how each heap or B-tree stored in sys.indexes is stored. So, if you know that a table is partitioned, you can join it directly with sys.partition_schemes without going through sys.data_spaces. Alternatively, you can use the following query to determine whether a table is partitioned by replacing dboTransactionHistoryArchive with the name of the table in which you’re interested:

    SELECT DISTINCT object_name(object_id) as TableName,
                ISNULL(ps.name, 'Not partitioned') as PartitionScheme
        FROM (sys.indexes i LEFT  JOIN sys.partition_schemes ps
                       ON (i.data_space_id = ps.data_space_id))
        WHERE (i.object_id = object_id(dbo.TransactionHistoryArchive'))
                  AND   (i.index_id IN (0,1));
  • sys.partition_schemes has one row for each partition scheme. In addition to the data_space_id and the name of the partition scheme, it has a function_id column to join with sys.partition_functions.

  • sys.destination_data_spaces is a linking table because sys.partition_schemes and sys.filegroups are in a many-to-many relationship with each other. For each partition scheme, there is one row for each partition. The partition number is in the destination_id column, and the filegroup ID is stored in the data_space_id column.

  • sys.partition_functions contains one row for each partition function, and its primary key function_id is a foreign key in sys.partition_schemes.

  • sys.partition_range_values (not shown) has one row for each endpoint of each partition function. Its function_id column can be joined with sys.partition_functions, and its boundary_id column can join with either partition_id in sys.partitions or with destination_id in sys.destination_data_spaces.

These views have other columns not mentioned here, and additional views provide additional information, such as the columns and their data types that the partitioning is based on. However, the preceding information should be sufficient to understand Figure 8-18 and the view shown in Listing 8-8. This view returns information about each partition of each partitioned table. The WHERE clause filters out partitioned indexes (other than the clustered index), but you can change that condition if you desire. I first create a function to return an index name, with an object ID and an index ID given, so that the view can easily return any index names. When selecting from the view, you can add your own WHERE clause to find information about just the table you’re interested in.

Listing 8-8 View returning data about each partition of each partitioned table

CREATE FUNCTION dbo.index_name (@object_id int, @index_id tinyint)
RETURNS sysname
AS
BEGIN
  DECLARE @index_name sysname
  SELECT @index_name = name FROM sys.indexes
     WHERE object_id = @object_id and index_id = @index_id
  RETURN(@index_name)
END;
GO
CREATE VIEW Partition_Info AS
  SELECT OBJECT_NAME(i.object_id) as ObjectName,
    dbo.INDEX_NAME(i.object_id,i.index_id) AS IndexName,
    object_schema_name(i.object_id) as SchemaName,
    p.partition_number as PartitionNumber, fg.name AS FilegroupName, rows as Rows,
    au.total_pages as TotalPages,
    CASE boundary_value_on_right
        WHEN 1 THEN 'less than'
        ELSE 'less than or equal to'
    END as 'Comparison'
    , rv.value as BoundaryValue,
    CASE WHEN ISNULL(rv.value, rv2.value) IS NULL THEN 'N/A'
    ELSE
      CASE
        WHEN boundary_value_on_right = 0 AND rv2.value IS NULL
           THEN 'Greater than or equal to'
        WHEN boundary_value_on_right = 0
           THEN 'Greater than'
        ELSE 'Greater than or equal to' END + ' ' +
           ISNULL(CONVERT(varchar(15), rv2.value), 'Min Value')
                + ' ' +
                +
           CASE boundary_value_on_right
             WHEN 1 THEN 'and less than'
               ELSE 'and less than or equal to'
               END + ' ' +
                + ISNULL(CONVERT(varchar(15), rv.value),
                           'Max Value')
        END as 'TextComparison'
  FROM sys.partitions p
    JOIN sys.indexes i
      ON p.object_id = i.object_id and p.index_id = i.index_id
    LEFT JOIN sys.partition_schemes ps
      ON ps.data_space_id = i.data_space_id
    LEFT JOIN sys.partition_functions f
      ON f.function_id = ps.function_id
    LEFT JOIN sys.partition_range_values rv
      ON f.function_id = rv.function_id
          AND p.partition_number = rv.boundary_id
    LEFT JOIN sys.partition_range_values rv2
      ON f.function_id = rv2.function_id
          AND p.partition_number - 1= rv2.boundary_id
    LEFT JOIN sys.destination_data_spaces dds
      ON dds.partition_scheme_id = ps.data_space_id
          AND dds.destination_id = p.partition_number
    LEFT JOIN sys.filegroups fg
      ON dds.data_space_id = fg.data_space_id
    JOIN sys.allocation_units au
      ON au.container_id = p.partition_id
WHERE i.index_id <2 AND au.type =1;

The LEFT JOIN operator is needed to get all the partitions because the sys.partition_range_values view has a row only for each boundary value, not for each partition. LEFT JOIN gives the last partition with a boundary value of NULL, which means that the value of the last partition has no upper limit. A derived table groups together all the rows in sys.allocation_units for a partition, so the space used for all the types of storage (in-row, row-overflow, and LOB) is aggregated into a single value. This query uses the preceding view to get information about my TransactionHistory table’s partitions:

SELECT ObjectName, PartitionNumber, Rows, TotalPages, Comparison, BoundaryValue
FROM Partition_Info
WHERE ObjectName = 'TransactionHistory' AND SchemaName = 'dbo'
ORDER BY ObjectName, PartitionNumber ;

Here are my results for the TransactionHistory object:

Object_Name

Partitionnumber

Rows

Totalpages

Comparison

BoundaryValue

TransactionHistory

1

11155

89

Less than

2011-10-01

TransactionHistory

2

9339

74

Less than

2011-11-01

TransactionHistory

3

10169

81

Less than

2011-12-01

TransactionHistory

4

12181

97

Less than

2012-01-01

TransactionHistory

5

9558

74

Less than

2012-02-01

TransactionHistory

6

10217

81

Less than

2012-03-01

TransactionHistory

7

10703

89

Less than

2012-04-01

TransactionHistory

8

10640

89

Less than

2012-05-01

TransactionHistory

9

12508

90

Less than

2012-06-01

TransactionHistory

10

12585

97

Less than

2012-07-01

TransactionHistory

11

3380

33

Less than

2012-08-01

TransactionHistory

12

1008

17

Less than

NULL

This view contains details about the boundary point of each partition, as well as the filegroup that each partition is stored on, the number of rows in each partition, and the amount of space used. It also contains a few additional columns that aren’t shown here, just to keep the output from being too wide. In particular, I didn’t return the FilegroupName value, because in my example, all the partitions are on the same filegroup. Anytime your partitions are on different filegroups, you most likely will want to see that value for each partition. Note that although the comparison indicates that the values in the partitioning column for the rows in a particular partition are less than the specified value, you should assume that it also means that the values are greater than or equal to the specified value in the preceding partition. However, this view doesn’t provide information about where in the particular filegroup the data is located. The next section looks at a metadata query that provides location information.

The sliding window benefits of partitioning

One of the main benefits of partitioning your data is that you can move data from one partition to another as a metadata-only operation; the data itself doesn’t have to move. As mentioned earlier, this isn’t intended to be a complete how-to guide to SQL Server 2012 partitioning; instead, it’s a description of the internal storage of partitioning information.

To understand the internals of rearranging partitions, you need to look at additional partitioning operations.

The main operation you use when working with partitions is the SWITCH option to the ALTER TABLE command. This option allows you to

  • Assign a table as a partition of an already-existing partitioned table

  • Switch a partition from one partitioned table to another

  • Reassign a partition to form a single table

In all these operations, no data is moved. Instead, the metadata is updated in the sys.partitions and sys.system_internals_allocation_units views to indicate that a particular allocation unit now is part of a partition in a different object. For example, the following query returns information about each allocation unit in the first two partitions of the TransactionHistory and TransactionHistoryArchive tables, including the number of rows, the number of pages, the type of data in the allocation unit, and the page where the allocation unit starts:

SELECT convert(char(25),object_name(object_id)) AS name,
    rows, convert(char(15),type_desc) as page_type_desc,
    total_pages AS pages, first_page, index_id, partition_number
FROM sys.partitions p JOIN sys.system_internals_allocation_units a
     ON p.partition_id = a.container_id
WHERE (object_id=object_id('[dbo].[TransactionHistory]')
   OR object_id=object_id('[dbo].[TransactionHistoryArchive]'))
  AND index_id = 1 AND partition_number <= 2;

Here is the data I get back. (I left out the page_type_desc because all the rows are of type IN_ROW_DATA.)

name                       rows    pages      first_page      index_id    partition_number
-------------------------  ------- ---------- --------------  ----------- -----------------
TransactionHistory         11155   89        0xD81B00000100  1           1
TransactionHistory         9339    74        0xA82200000100  1           2
TransactionHistoryArchive  89253   633       0x981B00000100  1           1
TransactionHistoryArchive  0       0         0x000000000000  1           2

Now you can move one of the partitions. The ultimate goal is to add a new partition to TransactionHistory to store a new month’s worth of data and to move the oldest month’s data into TransactionHistoryArchive. The partition function used by my TransactionHistory table divides the data into 12 partitions, and the last one contains all dates greater than or equal to August 1, 2012. You can alter the partition function to put a new boundary point in for September 1, 2012, so the last partition is split. Before doing that, you must ensure that the partition scheme using this function knows what filegroup to use for the newly created partition. With this command, some data movement occurs and all data from the last partition of any tables using this partition scheme is moved to a new allocation unit. Refer to SQL Server Books Online for complete details about each of the following commands:

ALTER PARTITION SCHEME TransactionsPS1
NEXT USED [PRIMARY];
GO
ALTER PARTITION FUNCTION TransactionRangePF1()
SPLIT RANGE ('20120901');
GO

Next, you can do something similar for the function and partition scheme used by TransactionHistoryArchive. In this case, add a new boundary point for October 1, 2011:

ALTER PARTITION SCHEME TransactionArchivePS2
NEXT USED [PRIMARY];
GO
ALTER PARTITION FUNCTION TransactionArchivePF2()
SPLIT RANGE ('20111001');
GO

Now move all data from TransactionHistory with dates earlier than October 1, 2011, to the second partition of TransactionHistoryArchive. However, the first partition of TransactionHistory technically has no lower limit; it includes everything earlier than October 1, 2011. The second partition of TransactionHistoryArchive does have a lower limit, which is the first boundary point, or September 1, 2011. To SWITCH a partition from one table to another, you must guarantee that all the data to be moved meets the requirements for the new location, so you need to add a CHECK constraint that guarantees that no data in TransactionHistory is earlier than September 1, 2011. After adding the CHECK constraint, I run the ALTER TABLE command with the SWITCH option to move the data in partition 1 of TransactionHistory to partition 2 of TransactionHistoryArchive. (For testing purposes, you could try leaving out the next step that adds the constraint and try just executing the ALTER TABLE/SWITCH command. You get an error message. After that, you can add the constraint and run the ALTER TABLE/SWITCH command again.)

ALTER TABLE [dbo].[TransactionHistory]
ADD CONSTRAINT [CK_TransactionHistory_DateRange]
CHECK ([TransactionDate] >= '20110901');
GO
ALTER TABLE [dbo].[TransactionHistory]
SWITCH PARTITION 1
TO [dbo].[TransactionHistoryArchive] PARTITION 2;
GO

Now run the metadata query that examines the size and location of the first two partitions of each table:

SELECT convert(char(25),object_name(object_id)) AS name,
    rows, convert(char(15),type_desc) as page_type_desc,
    total_pages AS pages, first_page, index_id, partition_number
FROM sys.partitions p JOIN sys.system_internals_allocation_units a
     ON p.partition_id = a.container_id
WHERE (object_id=object_id('[dbo].[TransactionHistory]')
   OR object_id=object_id('[dbo].[TransactionHistoryArchive]'))
  AND index_id = 1 AND partition_number <= 2;
RESULTS:
name                  rows    pages      first_page      index_id    partition_number
--------------------  ------- ---------- --------------  ----------- ----------------
TransactionHistory    0       0          0x000000000000  1           1
TransactionHistory    9339    74         0xA82200000100  1           2
TransactionHistoryAr  89253   633        0x981B00000100  1           1
TransactionHistoryAr  11155   89         0xD81B00000100  1           2

Notice that the second partition of TransactionHistoryArchive now has exactly the same information that the first partition of TransactionHistory had in the first result set. It has the same number of rows (11,155), the same number of pages (89), and the same starting page (0xD81B00000100, or file 1, page 7,128). No data was moved; the only change was that the allocation unit starting at file 1, page 7,128 isn’t recorded as belonging to the second partition of the TransactionHistoryArchive table.

Although my partitioning script created the indexes for the partitioned tables by using the same partition scheme used for the tables themselves, this isn’t always necessary. An index for a partitioned table can be partitioned using the same partition scheme or a different one. If you don’t specify a partition scheme or filegroup when you build an index on a partitioned table, the index is placed in the same partition scheme as the underlying table, using the same partitioning column. Indexes built on the same partition scheme as the base table are called aligned indexes.

However, an internal storage component is associated with automatically aligned indexes. As mentioned earlier, if you build an index on a partitioned table and don’t specify a filegroup or partitioning scheme on which to place the index, SQL Server creates the index using the same partitioning scheme that the table uses. However, if the partitioning column isn’t part of the index definition, SQL Server adds the partitioning column as an extra included column in the index. If the index is clustered, adding an included column isn’t necessary because the clustered index already contains all the columns. Another case in which SQL Server doesn’t add an included column automatically is when you create a unique index, either clustered or nonclustered. Because unique partitioned indexes require that the partitioning column is contained in the unique key, a unique index for which you haven’t explicitly included the partitioning key isn’t partitioned automatically.

Partitioning a columnstore index

To end this section, look at an example that combines partitioning with columnstore indexes, which Chapter 7 described. If you still have the dbo.FactInternetSalesBig table, you can follow the examples here.

First, create a very simple partition function and partition scheme. The partition function splits the data into five partitions, which eventually are mapped to the SalesTerritoryKey column of the big table. Then define a partition scheme that puts all the partitions on the PRIMARY filegroup:

USE AdventureWorksDW2012
GO
CREATE PARTITION FUNCTION [PF_TerritoryKey](int) AS RANGE LEFT FOR VALUES (2, 4, 6, 8)
GO
CREATE PARTITION SCHEME [PS_TerritoryKey] AS PARTITION [PF_TerritoryKey] ALL TO ([PRIMARY]);
GO

Now you can rebuild the clustered index to use this partitioning scheme, but you should get an error message initially, because if a table has a columnstore index, it must be partitioned aligned with the table. So you have to drop the columnstore index before you can rebuild the clustered index, and then you can rebuild the columnstore index using the same partitioning scheme.

DROP INDEX dbo.FactInternetSalesBig.csi_FactInternetSalesBig;
GO
CREATE CLUSTERED INDEX clus_FactInternetSalesBig ON  dbo.FactInternetSalesBig
(SalesTerritoryKey)
ON  PS_TerritoryKey (SalesTerritoryKey)
GO

Now you can rebuild the columnstore index on the same partitioning scheme, as shown in Listing 8-9.

Listing 8-9 Rebuilding the columnstore index on the same partitioning scheme

CREATE NONCLUSTERED COLUMNSTORE INDEX csi_FactInternetSalesBig
ON dbo.FactInternetSalesBig (
   ProductKey,
   OrderDateKey,
   DueDateKey,
   ShipDateKey,
   CustomerKey,
   PromotionKey,
   CurrencyKey,
   SalesTerritoryKey,
   SalesOrderNumber,
   SalesOrderLineNumber,
   RevisionNumber,
   OrderQuantity,
   UnitPrice,
   ExtendedAmount,
   UnitPriceDiscountPct,
   DiscountAmount,
   ProductStandardCost,
   TotalProductCost,
   SalesAmount,
   TaxAmt,
   Freight,
   CarrierTrackingNumber,
   CustomerPONumber
) ON PS_TerritoryKey (SalesTerritoryKey)
GO

To explore my partitions, you can re-create the Partition_Info view from Listing 8-8 in the AdventureWorksDW2012 database. Then you can determine how many rows are in each partition by looking at just a couple of columns from that view:

select PartitionNumber, Rows from Partition_Info
where ObjectName = 'FactInternetSalesBig';
GO

Here are my results:

PartitionNumber Rows
--------------- --------------------
1               4618240
2               6289920
3               3921408
4               5725696
5               10368512

Now that you have a columnstore index, you can also use the metadata view that Chapter 7 explored—namely, sys.column_store_segments. The following query groups by column to show you the total number of segments in the table. If you run this query, you’ll see 24 rows indicating 24 columns. The index had only 23 columns defined, but because the clustered index wasn’t unique, the uniquifier is added as a column. The result of this query also shows 41 total segments:

-- GROUP BY COLUMN
SELECT s.column_id,  col_name(ic.object_id, ic.column_id) as column_name,   count(*) as segment_
count
FROM sys.column_store_segments s join sys.partitions p on s.partition_id = p.partition_id
  LEFT JOIN sys.index_columns ic
       ON p.object_id = ic.object_id AND p.index_id = ic.index_id
      AND s.column_id = ic.index_column_id
WHERE object_name(p.object_id) = 'FactInternetSalesBig'
GROUP BY s.column_id,  col_name(ic.object_id, ic.column_id), object_name(p.object_id)
ORDER by 1;
GO

Because of the boundary values used, not every partition has exactly the same number of rows, as you saw in the data from the Partition_Info view. Each partition could have a different number of segments, and the following query shows how many segments are created for each partition:

SELECT   partition_number, count( segment_id) as NumSegments, sum(row_count) as NumRows
FROM sys.column_store_segments s join sys.partitions p on s.partition_id = p.partition_id
   JOIN sys.index_columns ic
       ON p.object_id = ic.object_id AND p.index_id = ic.index_id
      AND s.column_id = ic.index_column_id
WHERE object_name(p.object_id) = 'FactInternetSalesBig' and index_column_id = 2
GROUP BY partition_number WITH ROLLUP;
GO

The results show that partition 5 has more than 10 million rows and 12 segments, whereas partitions 1 and 3 each have only six segments. The grand totals produced by the ROLLUP clause, show the 41 total segments, and that the total number of rows in the table is 30923776.

partition_number NumSegments NumRows
---------------- ----------- -----------
1                6           4618240
2                8           6289920
3                6           3921408
4                9           5725696
5                12          10368512
NULL             41          30923776