Data Warehousing in Microsoft SQL Server 2008
SQL Server 2008 and Data Warehousing
Earlier versions of SQL Server had new features related to data warehousing, most notably Analysis Services, Reporting Services, and in SQL Server 2005, certain features of SQL Server Integration Services such as the Slowly Changing Dimensions task. But these earlier versions had very little at the level of the relational engine specifically targeting the particular needs of data warehousing. SQL Server 2008 delivers new features that squarely target data warehousing, particularly in relation to making very large databases more manageable and cost effective. This section will review the most important of the data warehousing–oriented enhancements in SQL Server 2008, starting with the Transact-SQL (T-SQL) enhancements aimed at working with data warehousing.
T-SQL MERGE Statement
The MERGE statement is covered in more depth in Chapter 2 and is applicable to many more scenarios than data warehousing. We cover it here too because it is also very relevant to data warehousing, specifically in the ETL context.
The MERGE statement provides what’s commonly referred to as upsert—meaning update the row if it already exists; otherwise, insert it. But there is more as well. MERGE requires a target table, which is joined in some relationship to a source table. The source table contains the data to be merged or synchronized with the target table. The MERGE statement supports up to three types of clauses defining the row-by-row action to be taken on the target table based on how it compares with the source table:
When Matched. The row exists in both merge and target tables (performs an inner join and allows UPDATE or DELETE).
When not Matched by Target. The row exists in the source table but not the target table (performs a left outer join and allows INSERT).
When not Matched by Source. The row exists in the target table but not the source table (performs a right outer join and allows UPDATE or DELETE).
In the data warehousing context, the MERGE statement is particularly suited to the maintenance of the dimension tables of star schemas. It is also very helpful in maintaining Type 1 slowly changing dimensions (SCDs), where changes simply overlay existing values, and Type 2 SCDs, where MERGE can do part of the job (a separate INSERT operation is still needed when an existing row is updated, to create the new version of it.) See the section entitled “Data Warehouse Design” earlier in this chapter for more details. (A full treatment of SCDs is beyond the scope of this chapter.)
In SQL Server 2008 Integration Services, MERGE can streamline and simplify the insert/update pattern that would be required under SQL Server 2005 Integration Services. Previously, the decision to insert or update in SQL Server 2005 Integration Services had to be based on a lookup of the source row using a Lookup task that was loaded with the target rows and two output data flows based on the failure or success of the lookup: one doing inserts and one doing updates against the target. With MERGE, the Lookup task is no longer needed, which simplifies the Integration Services package and avoids the performance, memory, and deadlock issues that can arise with the Lookup task if the target table is large.
Syntactically, MERGE requires two joinable tables or table-equivalents. (The target must be either a table or an updatable view; the source can be any table-equivalent.) For Integration Services, this means that the source table must exist or must be created in the package (as a temporary table, common table expression [CTE], or other equivalent).
The code in Example 14-2 shows a series of representative T-SQL expressions using MERGE against the AdventureWorksDW2008 database. Run each statement by hand as directed by the comments, followed by running the MERGE statement at the end. Note that GeographyKey is an identity column in DimGeography, so the column list must be explicit in the INSERT statement in the MERGE statement’s WHEN NOT MATCHED BY TARGET clause. Also note that the ending semicolon is required to terminate the MERGE statement.
Example 14-2. Using MERGE for a data warehousing update
USE AdventureWorksDW2008 GO -- Make a copy of the table. SELECT * INTO DimGeographyTest FROM DimGeography -- Create "Changes" table as another copy of same data. SELECT * INTO Changes FROM DimGeography -- If you now run the MERGE statement below, no changes will be reported. Note -- the condition on the UPDATE clause, which prevents unnecessary updates. -- Now force some UPDATES (53): UPDATE Changes SET SalesTerritoryKey = 11 WHERE SalesTerritoryKey = 10 -- Now running MERGE reports 53 updates. -- Now force DELETES (empty table will effectively delete every row in -- DimGeographyTest): DELETE Changes -- Now running MERGE will delete all 653 rows in DimGeographyTest. -- Testing INSERT is left as an exercise for the reader. -- MERGE statement: MERGE DimGeographyTest AS dg USING (SELECT * FROM Changes) AS c ON dg.GeographyKey = c.GeographyKey WHEN MATCHED and dg.SalesTerritoryKey <> c.SalesTerritoryKey THEN UPDATE SET dg.SalesTerritoryKey = c.SalesTerritoryKey WHEN NOT MATCHED BY TARGET THEN INSERT (City, StateProvinceCode, StateProvinceName, CountryRegionCode, EnglishCountryRegionName, SpanishCountryRegionName, FrenchCountryRegionName, PostalCode, SalesTerritoryKey) VALUES (c.City, c.StateProvinceCode, c.StateProvinceName, c.CountryRegionCode, c.EnglishCountryRegionName, c.SpanishCountryRegionName, c.FrenchCountryRegionName, c.PostalCode, c.SalesTerritoryKey) WHEN NOT MATCHED BY SOURCE THEN DELETE OUTPUT $action, INSERTED.*, DELETED.*;
The deletion possibilities of MERGE would be rare in a data warehousing scenario except in single-instance fixes of erroneous data, but it is worth knowing about for that purpose alone. In general, beware of using DELETE with MERGE. If your source table is inadvertently empty (as it is eventually in our example), MERGE with a WHEN NOT MATCHED BY SOURCE clause specifying DELETE could unintentionally delete every row in the target (depending on what other conditions were in the WHEN NOT MATCHED BY SOURCE clause).
Change Data Capture
Like one use of MERGE, the new Change Data Capture (CDC) feature in SQL Server 2008 targets the ETL component of data warehousing. CDC is available only in the Enterprise edition of SQL Server 2008 (and of course, the functionally equivalent Developer and Evaluation editions).
CDC is designed to efficiently capture and record relevant changes in the context of a data warehouse. Traditionally, detecting changes in an OpApp table that need to be applied to a data warehouse has required relatively brute force methods such as the following:
For updates, using the CHECKSUM function as a shortcut to detecting inequality of columns between source and target rows (SQL Server only), or comparing time stamps.
For inserts, outer-joining source and target rows and testing for NULL on the target.
For inserts and updates, implementing triggers on the source table to detect changes and take appropriate action against the target, or performing a lookup (perhaps using an Integration Services Lookup task) to compare source against target and then driving the update or insert by the success or failure of the lookup.
For inserts and updates, using the OUTPUT clause (SQL Server 2005 and 2008) or INSERT OVER DML (SQL Server 2008 only), which we cover in Chapter 2.
The CDC feature introduced in SQL Server 2008 provides a valuable new way of laying the groundwork for maintaining changing data in a data warehouse. Without resorting to triggers or other custom code, it allows capturing changes that occur to a table into a separate SQL Server Change Tracking table (the change table). This table can then be queried by an ETL process to incrementally update the data warehouse as appropriate. Querying the change table rather than the tracked table itself means that the ETL process does not affect the performance of applications that work with the transactional tables of your database in any way. CDC is driven by a SQL Server Agent job that recognizes changes by monitoring the SQL Server transaction log. This provides much better performance than using triggers, especially in bulk load scenarios typical in a data warehouse—and there’s no code to write or maintain with CDC. The tradeoff is somewhat more latency, which in a data warehouse is often perfectly acceptable. Figure 14-9 depicts a high-level view of CDC architecture using an illustration taken from SQL Server Books Online.
Figure 14-9. High-level architecture of CDC
Several new system stored procedures and table-valued functions (TVFs) are provided to enable, monitor, and consume SQL Server Change Tracking output. To begin, you execute the sp_cdc_enable_db procedure to enable CDC on the current database. (You must be in the sysadmin role to do this.) When you run this procedure, a new cdc user, cdc schema, and CDC_admin role are created. These names are hard-coded, so in the event that you already have a user or schema named cdc, you will need to rename it before using CDC.
Once the database is CDC-enabled, you enable CDC on a given table by executing sp_cdc_enable_table. (You must be in the db_owner role to do this.) When you do that, several objects are created in the cdc schema: a change table and at least one (but possibly two) TVFs. Let’s look at each of these objects in turn.
When CDC is enabled on a table, SQL Server creates a change table in the cdc schema corresponding to the table on which CDC is being enabled. The change table will be populated with change data automatically by CDC and is assigned a name based on both the schema and the table being tracked. For example, when you enable CDC on the Employee table in the dbo schema (as we’ll do shortly), SQL Server creates a corresponding change table named cdc.dbo_Employee_CT that will record all changes to the dbo.Employee table. The schema of the tracked table (dbo in this case) is part of the change table name so that same-named tables from different schemas can all be unambiguously tracked in the cdc schema. It is also possible to explicitly name the change table, as long as it’s unique in the database.
The ETL process will query this change table for change data in order to populate your data warehouse, but it will not normally do so by selecting directly from it. Instead, the ETL process will call a special TVF to query the change table for you. This TVF is also created for you by SQL Server automatically when the change table is created, and—like the change table—the TVF is also created in the cdc schema with a name based on the schema and table name of the tracked table. So again, if we’re tracking the dbo.Employee table, SQL Server creates a TVF named cdc.fn_cdc_get_all_changes_dbo_Employee that accepts parameters to select all changes that occur to dbo.Employee between any two desired points in time.
If you specify @supports_net_changes=1 when calling sp_cdc_enable_table, a second TVF is created for the change table as well. Like the first TVF, this one allows you to select changes between any two points in time, except that this TVF returns just the net (final) changes that occurred during that time frame. This means, for example, that if a row was added and then deleted within the time frame being queried using this second TVF, data for that row would not be returned—whereas the first TVF would return data that reflects both the insert and the delete. This second TVF is named in a similar fashion as the first, except using the word net instead of all. For dbo.Employee, this TVF is named cdc.fn_cdc_get_net_changes_dbo_Employee. Note that querying for net changes requires the tracked table to have a primary key or unique index.
Neither of these TVFs accept start and end times directly but instead require the range to be expressed as log sequence numbers (LSNs) by first calling sys.fn_cdc_map_time_to_lsn. So to query between two points in time, you call sys.fn_cdc_map_time_to_lsn twice—once for the start time and once for the end time—and then use the LSN values returned by this function as input values to the TVFs for querying change data. This might seem unnecessarily cumbersome, but in fact has good reason related to supporting two change tables on the same table, one feeding the production systems and another supporting ongoing development.
The sp_cdc_enable_table stored procedure has several optional parameters that give you a lot of flexibility. You can, among other options, specify your own name for the change table, a role that a user must belong to in order to query changes (if not in sysadmin or db_owner), which columns of the table should be tracked (you don’t need to track all of them), the file-group on which to create the change table, and whether the SWITCH_PARTITION option of ALTER TABLE can be executed against the tracked table (which has very important implications). Consult SQL Server Books Online for more details of sp_cdc_enable_table parameters.
When you no longer require CDC on a particular table, you can call the sp_cdc_disable_table stored procedure on the table. This procedure drops the change table and the TVFs and updates the system metadata to reflect that the table is no longer tracked. When you no longer require CDC on the database, call the sp_cdc_disable_db stored procedure to completely disable CDC for the entire database.
The change table records all changes to the requested columns, including intermediate states (per DML statement) between two points in time. Note that CDC supports sparse columns (covered later in this section) but not sparse column sets. Each change table row also includes five metadata columns of great value for change-consuming processes to determine what type of change (insert, update, or delete) each row represents and to group and order all changes belonging to the same transaction. One item it cannot capture is who made the change, which is why it is not ideal for maintaining audit trails. For that, you can use SQL Server Audit, which will track and record which users are performing data modifications as well as any other activity of interest. (We cover SQL Server Audit in Chapter 5.)
As we mentioned earlier, CDC relies on SQL Server Agent for automating the capture process. The first time sp_cdc_enable_table is executed on any table in a database, SQL Server also creates two SQL Server Agent jobs for that database. The first is a change-capture job, which performs the actual transaction log monitoring to apply changes on the tracked table to the corresponding change table. The second is a cleanup job, which deletes rows from change tables after a configurable interval (three days, by default) and removes all CDC artifacts if the tracked table is dropped. Therefore, SQL Server Agent must be running the first time this procedure is run to CDC-enable a table on any database in the server instance. Subsequently, if SQL Server Agent stops running, changes to tracked tables will accumulate in the transaction log but not be applied to the change tables until SQL Server Agent is restarted.
CDC can at first appear rather cumbersome to use, but it is well thought out in terms of its configuration flexibility and support for various scenarios. Some of these might not be immediately obvious—for example, what happens if a tracked table is dropped, or its structure changed, after CDC is enabled on it? We lack the space to delve into these essential aspects, but you’ll find comprehensive details in SQL Server Books Online. The code in Example 14-3 shows a complete example of using CDC.
Example 14-3. Using Change Data Capture
-- Create test database CREATE DATABASE CDCDemo GO USE CDCDemo GO -- Enable CDC on the database EXEC sp_cdc_enable_db -- Show CDC-enabled databases SELECT name, is_cdc_enabled FROM sys.databases -- View the new "cdc" user and schema SELECT * FROM sys.schemas WHERE name = 'cdc' SELECT * FROM sys.database_principals WHERE name = 'cdc' -- Create Employee table CREATE TABLE Employee( EmployeeId int NOT NULL PRIMARY KEY, EmployeeName varchar(100) NOT NULL, EmailAddress varchar(200) NOT NULL) -- Enable CDC on the table (SQL Server Agent *should* be running when you run this) EXEC sp_cdc_enable_table @source_schema = N'dbo', @source_name = N'Employee', @role_name = N'CDC_admin', @capture_instance = N'dbo_Employee', @supports_net_changes = 1 -- Show CDC-enabled tables SELECT name, is_tracked_by_cdc FROM sys.tables -- Insert some employees... INSERT INTO Employee VALUES(1, 'John Smith', 'email@example.com') INSERT INTO Employee VALUES(2, 'Dan Park', 'firstname.lastname@example.org') INSERT INTO Employee VALUES(3, 'Jay Hamlin', 'email@example.com') INSERT INTO Employee VALUES(4, 'Jeff Hay', 'firstname.lastname@example.org') -- Select them from the table and the change capture table SELECT * FROM Employee SELECT * FROM cdc.dbo_employee_ct -- Delete Jeff DELETE Employee WHERE EmployeeId = 4 -- Results from Delete SELECT * FROM Employee SELECT * FROM cdc.dbo_employee_ct -- (Note: result of DELETE may take several seconds to show up in CT table) -- Update Dan and Jay UPDATE Employee SET EmployeeName = 'Dan P. Park' WHERE EmployeeId = 2 UPDATE Employee SET EmployeeName = 'Jay K. Hamlin' WHERE EmployeeId = 3 -- Results from update SELECT * FROM Employee SELECT * FROM cdc.dbo_employee_ct -- See note above -- Give the CDC job a chance to initialize before accessing the TVFs WAITFOR DELAY '00:00:20' -- To access change data, use the CDC TVFs, not the change tables directly DECLARE @begin_time datetime DECLARE @end_time datetime DECLARE @from_lsn binary(10) DECLARE @to_lsn binary(10) SET @begin_time = GETDATE() - 1 SET @end_time = GETDATE() -- Map the time interval to a CDC LSN range SELECT @from_lsn = sys.fn_cdc_map_time_to_lsn('smallest greater than or equal', @begin_time) SELECT @to_lsn = sys.fn_cdc_map_time_to_lsn('largest less than or equal', @end_time) SELECT @begin_time AS BeginTime, @end_time AS EndTime SELECT @from_lsn AS FromLSN, @to_lsn AS ToLSN -- Return the changes occurring within the query window. -- First, all changes that occurred... SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_employee(@from_lsn, @to_lsn, N'all') -- Then, net changes, that is, final state... SELECT * FROM cdc.fn_cdc_get_net_changes_dbo_employee(@from_lsn, @to_lsn, N'all')
Let’s examine this code closely. After creating our sample database CDCDemo, we enable CDC on that database by calling EXEC sp_cdc_enable_db. The next several SELECT queries demonstrate how to retrieve various kinds of CDC-related information. The first SELECT query shows how the is_cdc_enabled column in sys.databases returns true (1) or false (0), making it easy to find out which databases are CDC-enabled and which aren’t. The next two SELECT queries show how the new cdc schema and user can be found in sys.schemas and sys.database_principals.
The code then proceeds to create the Employee table, which has only three columns to keep our example simple. CDC is then enabled on the Employee table by calling EXEC sp_cdc_enable_table and passing parameters that identify the Employee table in the dbo schema for change capture. (Remember that SQL Server Agent must be running at this point.) The next SELECT statement shows how to query the is_tracked_by_cdc column in sys.tables to find out which tables are CDC-enabled and which aren’t.
Recall that enabling CDC on the Employee table creates a TVF for retrieving all changes made to the table between any two points in time. Recall too that by specifying @supports_net_changes = 1, this also creates a second TVF for retrieving only the net changes made between any two points in time. The difference between all changes and net changes will be very clear in a moment, when we call both of these TVFs and compare their results. But first the code performs a mix of INSERT, UPDATE, and DELETE operations against the Employee table to simulate database activity and engage the capture process. In Example 14-3, these operations are accompanied by SELECT statements that query the change table cdc.dbo_employee_ct. This is done purely to demonstrate that change data for the Employee table is being captured to the change table. However, you should normally not query the change tables directly in this manner and should instead use the generated TVFs to extract change information about the Employee table, as demonstrated by the rest of the code.
Our code then executes a WAITFOR statement to pause for 20 seconds before calling the TVFs, in order to give the SQL Server Agent change capture job a chance to initialize. This is a one-time latency only; it does not represent the normal latency for CDC-tracked changes to be recorded, which is on the order of 2 to 3 seconds. Without this delay, or if SQL Server Agent is not running when you call the TVFs, you will receive a rather misleading error message that unfortunately does not describe the actual problem.
To call either of the generated TVFs, you need to provide a value range that defines the window of time during which you want change data returned. As already explained, this range is expressed using LSN values, which you can obtain by calling sys.fn_cdc_map_time_to_lsn and passing in the desired start and end points in time. So first we establish a time range for the past 24 hours, which we obtain by assigning GETDATE() – 1 and GETDATE() to the start and end time variables. Then we call sys.fn_cdc_map_time_to_lsn on the start and end time variables to obtain the LSN values corresponding to the last 24 hours. (Note that the starting LSN gets adjusted automatically to compensate for the fact that there are no LSNs from 24 hours ago, as does the ending LSN, since there might not be any from a moment ago either.) We then issue two SELECT statements so that we can view the time and LSN range values, an example of which is shown here:
BeginTime EndTime ----------------------- ----------------------- 2008-07-08 23:42:55.567 2008-07-09 23:42:55.567 (1 row(s) affected) FromLSN ToLSN ---------------------- ---------------------- 0x0000001A0000001E0039 0x00000020000000A50001 (1 row(s) affected)
Equipped with the LSN range values, we issue two more SELECT statements. (These are the last two statements in Example 14-3.) The first statement queries the range against the all changes TVF, and the second statement queries the range against the net changes TVF. Comparing the results of these two queries clearly illustrates the difference between the TVFs, as shown here:
__$start_lsn __$seqval __$operation __$update_mask EmployeeId EmployeeName EmailAddress ---------------------- ---------------------- ------------ -------------- ---------- ------- ---------- -------------------------- 0x0000001E0000007C0013 0x0000001E0000007C0012 2 0x07 1 John Smith email@example.com 0x0000001E000000800003 0x0000001E000000800002 2 0x07 2 Dan Park firstname.lastname@example.org 0x0000001E000000810003 0x0000001E000000810002 2 0x07 3 Jay Hamlin email@example.com 0x0000001E000000820003 0x0000001E000000820002 2 0x07 4 Jeff Hay firstname.lastname@example.org 0x0000001E000000850004 0x0000001E000000850002 1 0x07 4 Jeff Hay email@example.com 0x0000001E000001AC0004 0x0000001E000001AC0002 4 0x02 2 Dan P. Park firstname.lastname@example.org 0x0000001E000001AE0004 0x0000001E000001AE0002 4 0x02 3 Jay K. Hamlin email@example.com (7 row(s) affected) __$start_lsn __$operation __$update_mask EmployeeId EmployeeName EmailAddress ---------------------- ------------ -------------- ---------- ----------------- ------------ -------------- 0x0000001E0000007C0013 2 NULL 1 John Smith john.smith@ ourcorp.com 0x0000001E000001AC0004 2 NULL 2 Dan P. Park steven. firstname.lastname@example.org 0x0000001E000001AE0004 2 NULL 3 Jay K. Hamlin jay.hamlin@ ourcorp.com (3 row(s) affected)
The first result set includes all the information about all changes made during the specified LSN range, including all interim changes. Thus, the information returned from the first all changes TVF shows every stage of change, or seven changes in total. In our scenario, John was inserted once and then never changed. So only his insert (__$operation value 2) is shown. Dan and Jay were inserted (__$operation value 2) and updated (__$operation value 4), so both changes (insert and update) are returned for each of them. Jeff, on the other hand, was deleted (__$operation value 1) after being inserted, so both changes (insert and delete) are returned for Jeff.
The second result set includes only the final changes made during the specified LSN range. So for the same LSN range, we receive only three change records from the second net changes TVF, each of which provides the final column values in the specified LSN range. John appears only once as in the previous query, since he was inserted only once and never modified or deleted within the LSN range. However, although Dan and Jay were inserted and updated, they each appear only once (with their final values for the LSN range), and not twice as in the previous query. And since Jeff was inserted and deleted within the window of time specified by the LSN range, no change data for Jeff is returned at all by the net changes TVF.
Partitioned Table Parallelism
In SQL Server, a partitioned table is a table whose physical storage is divided horizontally (that is, as subsets of rows) into multiple filegroups (invisibly to queries and DML) for the purpose of improved manageability and isolation of various kinds of otherwise potentially conflicting access. For example, different partitions of the same table can have different backup and compression strategies and indexes, each optimized to the use of the partition. Given the large size of many data warehouses, this flexibility can be invaluable.
The typical (although by no means required) partition key is Time, since that is so often the natural horizontal dividing line. Partitioning by Time allows, for example, “old” data to be indexed more lightly than current, more frequently accessed data. Old data can also be backed up and deleted without affecting simultaneous queries against more recent data. Partitioning is an important tool of physical implementation, particularly when building a very large data warehouse.
Another potential benefit of well-designed partitioning is more efficient query plans. Queries specifying the partitioning key that involve only a single partition benefit from having less data to traverse (and potentially more optimized indexes if the partition is for newer data). In addition, when SQL Server is running on multiple-core or multiple-CPU hardware and configured appropriately, multiple worker threads are available and can achieve parallelism in processing a query by assigning multiple threads to it.
SQL Server 2005 optimized parallelism for queries involving only a single partition, by allocating all available threads to the one partition. However, on a multipartition query, performance could suffer badly because then only one thread is allocated per partition—leading to some parallelism for the query as a whole but none per partition. The result was that queries varying only slightly in their partitioning key constraint could exhibit vastly different degrees of performance.
The new Partitioned Table Parallelism feature in SQL Server 2008 directly addresses this shortcoming by allocating all available threads to a multipartition query in round-robin fashion. The result is that each partition, as well as the query as a whole, achieves some degree of parallelism. This is automatic when applicable. The best gains will be achieved when the number of threads (that is, cores or CPUs) is significantly larger than the number of partitions on the table. The difference between SQL Server 2005 and 2008 in thread allocation for multipartition queries is illustrated in Figure 14-10. Under the latter in this example, three times as many threads per partition operate on the Feb YTD query, and with all else being equal, this should translate to a 200 percent performance improvement.
Figure 14-10. The difference between SQL Server 2005 and 2008 in how threads are allocated to multipartition queries
Another important feature of Partitioned Table Parallelism relates to table locking behavior. Previously, when deciding whether to elevate to a table-level lock on a partitioned table, the database engine did not take into account whether concurrent statements against the same table were each accessing a different partition. When they were, each was logically independent and there would be no reason for one to block another. But by not recognizing this and escalating one of the statements to a table lock, the database engine could unnecessarily block the remaining statements, in the process also enhancing the possibility of deadlocks among them. In SQL Server 2008, the default behavior on a partitioned table behaves as before, but Partitioned Table Parallelism enables a new ALTER TABLE option, which directs the database engine to use partition-level lock escalation, instead of table-level, on a partitioned table. The syntax is shown here:
ALTER TABLE MyTable SET (LOCK_ESCALATION = <option>)
The LOCK_ESCALATION option can be specified as TABLE, AUTO, or DISABLE. The default is TABLE, which means that only table-level lock escalation will occur. If you specify AUTO, you get partition-level locking on partitioned tables, table-level otherwise. With DISABLE, no lock escalation will occur (in most cases).
Star-Join Query Optimization
Star-Join Query Optimization is an important new feature in SQL Server 2008 (again, available in Enterprise edition only) in the context of data warehouse–oriented performance enhancements, but it does not lend itself to deep explanation in a book like this because it does not offer any user-adjustable properties and its operation is largely buried within the database engine. The good news is that you need not do anything to get the benefit of it when applicable.
As noted earlier, the star schema is a common physical data model in Kimball-style data warehousing architectures. Queries against such a physical model are typically characterized by a central fact table joined to multiple dimension tables, each on single-column equijoins (joins based on equality), where the fact table has much higher cardinality than the dimension tables (more rows in the fact table as compared with the dimension table), and the constraints of the query are all on the dimension tables—a pattern now known as a star-join. Since this pattern is common across a large range of data warehousing scenarios, it became apparent that a query optimizer that could recognize such a pattern could potentially produce more efficient query plans than otherwise.
Here’s the basic idea. Eliminate as many candidate rows from the fact table as early as possible in the query-resolution pipeline, since the fact table typically has by far the highest cardinality of the tables involved. In practice, this means determining the candidate join keys from the dimension tables first (taking advantage of the winnowing effect of the constraints typically on them) and then using this information to eliminate candidate rows from the fact table ahead of, and more efficiently than, the actual join process further down the pipeline would. The heuristics—or in other words the rules by which the optimizer recognizes a star-join query—are important to the effectiveness of this strategy.
Such mechanisms are complex and, for our purposes, largely opaque. SQL Server 2005 introduced some star-join optimization based on these principles, but SQL Server 2008 extends the degree to which it can recognize and optimize this pattern. Microsoft benchmarks assert that the degree of performance improvement on typical data warehouse queries at which this feature is targeted can range from 10% to 30%. The SQL Server 2008 enhancements in this area also include more detailed information in query plans, which help the designer to understand when or if this feature is being applied to particular queries.
Space considerations preclude us from discussing this feature in more detail here. To learn more, we recommend that you visit the links provided at the end of this section.
Not all practitioners are happy with NULL values in a relational database schema, but for better or worse, they are widely used in practice. Without engaging that debate, some will rationalize allowing nullable columns when physically modeling a type (for example, Product) that has many subtypes that have few attributes in common and many attributes unique to each subtype. It can be convenient, despite going against the grain of normalization, to physically model this situation as a single table with a column for every attribute across all subtypes. In such a case, each attribute column must be nullable and will be sparse—that is, containing NULL in a high percentage of cases. It would be beneficial if the storage for such sparsely populated nullable columns were optimized, particularly in the data warehousing context, given the often large database sizes involved.
In versions earlier than SQL Server 2008, storing NULL values was not optimized—it required storage for every NULL occurrence. SQL Server 2008 introduces the notion of the SPARSE column, a nullable column whose storage is optimized for NULL values—at the cost of increased storage overhead for non-NULL values. With this option enabled, occurrences of NULL use no storage. (Note that this is also true when SQL Server Data Compression, detailed in the next section, is used—although the two are not equivalent.) The density of a column’s NULL values required to achieve a 40 percent space saving using the SPARSE attribute, the nominal space saving value as reported by SQL Server Books Online, depends on the column’s data type and ranges from 42 percent for 38-digit high-precision numeric types to 98 percent for bit. The SPARSE attribute in particular benefits Microsoft Office SharePoint Server, which by its generic and end-user-driven nature is a particular case of the preceding scenario—needing to store many user-defined attributes that are sparse by nature.
A few data types cannot be SPARSE, and there are other, potentially significant, restrictions on using SPARSE columns—for example, they cannot have default values or rules or be part of a clustered index or unique primary key index. SQL Server Books Online provides full details.
This feature is enabled by decorating column definitions in your CREATE TABLE and ALTER TABLE statements with the new SPARSE attribute. Obviously, the column must also be declared NULL. Example 14-4 shows an example of usage.
Example 14-4. Declaring SPARSE columns
CREATE TABLE SparseTest (ID int IDENTITY(1,1), LastName varchar(50) SPARSE NULL, Salary decimal(9,2) NULL) GO ALTER TABLE SparseTest ALTER COLUMN Salary decimal(9,2) SPARSE GO
SQL Server 2008 introduces two other new features that have a relationship to the SPARSE feature but do not depend on it. The first is the column set, an optionally declared set of specified columns on a table that, once declared, associates an xml column with the table as metadata (that is, no additional storage is used). This column represents the specified columns as an XML document and allows querying and updating of the columns as a group using XQuery and XML DML (which we cover in depth in Chapter 6). The individual columns can still be referenced in the usual way, but the column set representation can be a more convenient method when a table has a large number of columns and might provide performance improvements in some cases. SPARSE columns relate to column sets in that a column set cannot be added to an existing table already containing any SPARSE columns, and if SPARSE columns are later added to a table with a column set, they automatically become part of the column set.
The second new feature is the filtered index. A filtered index is an optimized nonclustered index whose declaration includes a WHERE clause that restricts the values included in the index to those specified. This can have wide-ranging implications for index maintenance, index storage, and query plan optimization. This feature is most useful when the query patterns against the table are well understood and they naturally relate to distinct subsets of rows. SPARSE columns are good candidates to participate in a filtered index because they represent distinct, well-defined subsets (rows with NULLs in the columns and rows with non-NULLs). For more details of both these features, which involve considerable complexity in their own right, see SQL Server Books Online.
A final benefit of SPARSE columns is that, by their nature, they can reduce the size of large backups, potentially more so than any of the new compression features we cover in the next section.
Data Compression and Backup Compression
Data compression and backup compression are long-awaited enhancements to SQL Server—not surprisingly, also available only in the Enterprise edition (with one exception, as we’ll see when we discuss backup compression). They are of benefit in all scenarios, but especially for large data warehouses. Many factors cause a data warehouse to grow at least linearly with time: the desire to facilitate trend analyses, personalization, and data mining; the fact that most data warehouses increase the number of data sources included over time; and last that multiple copies of the data warehouse often exist for redundancy and development and QA purposes. SQL Server 2008 provides both data compression, targeting the database itself, and backup compression, targeting the backup/restore process.
As the size of the data warehouse increases, it affects the cost and complexity of maintaining the online version and of taking backups of it. SQL Server 2008 Data Compression provides many benefits. It aids online query performance by increasing the number of rows stored per page, lessening disk I/O and saving costs in disk space. It improves performance for a given amount of memory, as more rows can be held in memory at the same time. It can benefit the backup/restore process by minimizing the I/O and therefore time and media required, since less physical data needs to be transferred. Last, replication and mirroring scenarios can also benefit for all the same reasons.
SQL Server 2005 made a start at targeting data compression concerns with both its table-level vardecimal storage format (in Service Pack 2 for the Enterprise edition) and its ability to use NTFS file system file compression on SQL Server read-only secondary data files (or all files, including log files, if the database is read-only).
These enhancements remain supported in SQL Server 2008, although use of the vardecimal option is deprecated and use of NTFS compression for SQL Server data files is mostly not recommended. Instead, SQL Server 2008 goes considerably beyond these earlier enhancements in the features it provides for data compression.
The most basic form of data compression uses a storage format that eliminates unneeded precision in fixed-length data types—that is, representing each value in a column with the minimal number of bits necessary. For example, any value of 255 or less stored in an integer data type could be stored in one byte instead of four (neglecting some slight overhead). SQL Server 2005 provided such compression or variable-length storage only for the decimal and numeric data types, but SQL Server 2008 provides it for all formerly fixed-length data types (including decimal and numeric). Note that what is changing is storage format, not data type, so the semantics of each data type remain the same to T-SQL queries as well as applications.
Data compression comes in two forms: row compression (RC) and page compression (PC). RC is another name for the variable-length storage approach just detailed. With RC, all occurrences of 0 (zero) and NULL consume no space. RC is not effective for variable-length data types (they are already effectively compressed), for some shorter data types (where the overhead of compression outweighs the benefit), and for some other data types for technical reasons.
PC is a superset of RC and provides potentially greater overall compression than RC alone, at the cost of greater CPU overhead. Where RC is concerned with compressing scale and precision on each individual row-column value, PC is concerned with compressing redundancy across all the rows and their columns on a particular page. PC can be used with all the same database objects as RC. It applies three steps to the enabled object, in the order indicated:
RC to the leaf level of a table and to all levels of an index.
PC—on each page, for each column of each row on that the page, any common prefixes among all values stored in that column (if any) are identified and tokenized. Each such prefix value is stored once in the new Compression Information (CI) section of the page (by column), and values in each column are replaced with short encoded values that identify the prefix and how much of it applies (as a prefix to the remainder of the value).
Dictionary compression—on each page, repeating values from any column in any row on the page are identified and stored in the CI area, and the values are replaced with a pointer to the repeated value. This can further compress the results of the first two steps.
As data is added to a PC-enabled object, these operations are initiated only when a page becomes full. If PC is enabled on an existing object containing data, that object must be rebuilt, a potentially expensive operation.
The code in Example 14-5 shows an example of creating a table enabled for PC.
Example 14-5. Enabling PC on a table
CREATE TABLE RowCompressionDemo (FirstName char(10), LastName char(30), Salary decimal(8,2)) WITH (DATA_COMPRESSION = PAGE)
SQL Server 2008 provides a system stored procedure associated with both forms of compression aptly named sp_estimate_data_compression_savings, which can be used to evaluate whether compression is worth applying to a given object. It can be run for a given uncompressed table, index, or partition to estimate the size it would be, using both RC and PC. It can also do the reverse; reporting the size a compressed object would be if uncompressed. This procedure works by sampling the data of the indicated object into a temporary store and running the indicated compression or decompression on it. It is possible for it to report a larger size for compressed than uncompressed data, which indicates clearly that the nature of the data is such that the storage overhead associated with compression outweighs any benefit.
Of course, these forms of compression require more CPU cycles to use than would otherwise be required, both when writing (compressing) and reading (decompressing) data. Each represents a tradeoff between saving space (disk and memory) and increasing CPU use. In addition, the effectiveness of any compression scheme is sensitive to the data type and statistical distribution of the values being compressed. For example, compression of an int column (4 bytes) in which most values do not exceed 255 (which fit in 1 byte) would exhibit much more benefit from RC than if the values were evenly distributed or if the column were already declared as a tinyint (1 byte). For these reasons, as well as the fine grain of data types that this feature allows to be individually tuned for compression, it is advisable to experiment with the various compression options to determine the optimal combination of settings.
Data compression must be enabled—it is disabled by default. It can be enabled on an entire table (which applies to all of its partitions), on individual partitions of a table, on individual indexes of a table, on individual index partitions, and on the clustered index of an indexed view. These features, together with the separately selectable options of row or page compression, give the database administrator great flexibility in tuning the use of compression to achieve the best tradeoffs.
Data compression is enabled by CREATE TABLE (as shown earlier) and CREATE INDEX statements, and also by ALTER TABLE and ALTER INDEX. Note that SQL Server Data Compression is not automatically enabled on existing or subsequently created nonclustered indexes of a table on which data compression is enabled—each such index must be separately and explicitly enabled. The one exception to this is that a clustered index does inherit the compression setting of its table.
Last but not least, an uncompressed table can be rebuilt with either form of compression via the new ALTER TABLE...REBUILD WITH (DATA_COMPRESSION=xxx) statement, where xxx is either ROW or PAGE. As the compression process is CPU intensive, it lends itself to parallelism, and SQL Server 2008 can take advantage of the availability of multiple CPUs. The REBUILD clause therefore supports a MAXDOP option to control how many CPUs are allocated to the process.
SQL Server Backup Compression is a new option with the BACKUP statement. Although only the Enterprise edition can create a compressed backup, any edition can restore one.
Compared with data compression, backup compression is extremely coarse grained. It is either enabled or it isn’t for the entire backup—there are no options to tune the compression, and the compression methods are opaque. Nevertheless, it is a welcome enhancement since no earlier version of SQL Server provided any form of backup compression, forcing practitioners to compress backups in a separate step with other, non–SQL Server, utilities.
The option is disabled by default, but the default can be changed via server-level configuration or overridden in the BACKUP statement. It should be noted that an uncompressed 2008 backup operation (both create and restore) can benefit when SQL Server Data Compression has been used on a significant scale in the database being backed up, as a direct result of reduced I/O. If data compression has been used, backup compression will likely provide a smaller (possibly much smaller) space-saving benefit, and because of the additional CPU overhead, backup/restore time might perform worse than without backup compression. This feature is therefore most valuable when the database being backed up has not had significant data compression applied—your own experimentation is warranted.
As a simple example of the potential efficiency of backup compression, compare the size and time required to back up and restore the AdventureWorksDW2008 database, as shown in Example 14-6. The CHECKPOINT and DBCC DROPCLEANBUFFERS statements are used to ensure that all cache buffers are empty so that one test does not misleadingly improve the performance of the next. Create the directory C:\Backups prior to running the following code.
Example 14-6. Comparing the time and size between compressed and uncompressed backups
CHECKPOINT DBCC DROPCLEANBUFFERS BACKUP DATABASE AdventureWorksDW2008 TO DISK='C:\Backups\AWDWUncompressed.bak' -- 10.661 sec, 71 Mb CHECKPOINT DBCC DROPCLEANBUFFERS BACKUP DATABASE AdventureWorksDW2008 TO DISK='C:\Backups\AWDWCompressed.bak' WITH COMPRESSION -- 6.408 sec, 13 Mb CHECKPOINT DBCC DROPCLEANBUFFERS RESTORE DATABASE AWDWUncompressed FROM DISK = 'C:\Backups\AWDWUncompressed.bak' WITH MOVE 'AdventureWorksDW2008_Data' TO 'C:\Backups\AWDWUncompressed.mdf', MOVE 'AdventureWorksDW2008_Log' TO 'C:\Backups\AWDWUncompressed.ldf' -- 9.363 sec CHECKPOINT DBCC DROPCLEANBUFFERS RESTORE DATABASE AWDWCompressed FROM DISK = 'C:\Backups\AWDWCompressed.bak' WITH MOVE 'AdventureWorksDW2008_Data' TO 'C:\Backups\AWDWCompressed.mdf', MOVE 'AdventureWorksDW2008_Log' TO 'C:\Backups\AWDWCompressed.ldf'; -- 6.101 sec
In this example, you can see that there is much more improvement in the backup (compression) stage than the restore stage, but in both cases, performance for the compressed backup is superior to the uncompressed backup. This is due to the reduction in I/O required for processing the smaller (compressed) backup file. Of course, experiments are warranted in your particular scenario to determine exactly what improvements you will see for yourself.
We’ve made several references to SQL Server Books Online for more detailed information about many of the new data warehousing features in SQL Server 2008. In addition, you can learn more about all of these SQL Server 2008 data warehousing–oriented features by visiting the following links:
These links were valid as of press time, but if they don’t work, you can perform a Web search on “SQL Server 2008 data warehouse enhancements.”
We can also recommend these additional resources to learn more about the recommended practices of data warehousing:
Building the Data Warehouse, 4th ed., W. H. Inmon (Wiley, 2005)
The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling, 2nd ed., Ralph Kimball and Margy Ross (Wiley, 2002), and The Data Warehouse Lifecycle Toolkit, Ralph Kimball et al. (Wiley, 2008)
The Data Warehousing Institute, http://www.tdwi.org/TDWI