Chapter 6: Scalable Data Warehousing

Microsoft Press
14y
9.9k
0
0

Article

This chapter is taken from book "Introducing Microsoft SQL Server 2008 R2" by Ross Mistry and Stacia Misner published for Microsoft Press.

Microsoft SQL Server 2008 R2 Parallel Data Warehouse is an enterprise data warehouse appliance based on technology originally created by DATAllegro and acquired by Microsoft in 2008. In the months following the acquisition, Microsoft revamped the product by changing it from a product that used the Linux operating system and Ingres database technologies to a product based on SQL Server 2008 R2 and the Windows Server 2008 operating system. SQL Server 2008 Enterprise has many features supporting scalability and data warehouse performance that Parallel Data Warehouse uses to its advantage. The combination of SQL Server scalability and performance with a massively parallel processing (MPP) architecture in Parallel Data Warehouse creates a powerful new option for hosting a very large data warehouse.

Parallel Data Warehouse Architecture

Parallel Data Warehouse does not install like other editions of SQL Server. Instead, it is a data warehouse appliance that bundles multiple software and hardware technologies, including SQL Server, into a platform well suited for a very large data warehouse. A key characteristic of this platform is the MPP architecture, which enables fast data loads and high-performance queries. This architecture consists of a multi-rack system, which parallelizes queries across an array of dedicated servers connected by a high-speed network to deliver results at speeds that are typically faster than possible with a traditional symmetric multiprocessing (SMP) architecture.

Data Warehouse Appliances

You purchase a data warehouse appliance as preassembled and preconfigured integrated components with all software preinstalled. When you place an order for an appliance with an authorized vendor, you specify the number of appliance racks that you want to purchase. The vendor works with you to add options, such as an optional backup node, and to optimize the system to meet your requirements for faster query performance and for storage of high data volumes. The vendor then assembles industry-standard hardware components and loads the operating system, SQL Server, and Parallel Data Warehouse software. When the assembly process is complete, the vendor ships the appliance to you using shockproof pallets. When it arrives, you remove the appliance from the pallets, plug it into a power source, and connect it to your network.

Parallel Data Warehouse is a data warehouse appliance that includes all server, networking, and storage components required to host a data warehouse. In addition, your purchase of Parallel Data Warehouse includes cables, power distribution units, and racks. Furthermore, the components have redundancy to prevent downtime caused by a failure. The vendor installs all software at the factory and configures Parallel Data Warehouse to balance CPU, memory, and disk space. After you receive the Parallel Data Warehouse at your location, you use a configuration tool that Parallel Data Warehouse includes to complete the network setup and configure appliance settings for your environment. You can also install Microsoft or third-party software to use when copying data between your corporate network and the appliance.

Processing Architecture

A traditional data warehouse deployment of SQL Server is an SMP architecture, in which identical processors share memory on a single server. One physical instance of a database processes all queries. You can improve performance by partitioning the data, thereby achieving multi-threaded parallelization. You can add higher powered servers with more CPU, memory, storage, and networking capacity to scale up, but the cost to scale up is high. By contrast, Parallel Data Warehouse is an MPP architecture that uses multiple database servers that operate together to process queries. Behind the scenes, each database server runs one SQL Server instance with its own dedicated CPU, RAM, storage, and network bandwidth. Each database managed by Parallel Data Warehouse is distributed across multiple database servers that execute Parallel Data Warehouse queries in parallel. Parallel Data Warehouse's architecture includes a controlling server to coordinate these parallel queries and all other database activity across the multiple database servers. This controlling server also presents the distributed database as a single logical database to users. If you need to scale out the MPP hardware, you can simply add inexpensive commodity servers and storage rather than expensive high-end servers and storage.

The Multi-Rack System

Parallel Data Warehouse is configured as a multi-rack system in which there is a control rack and one or more data racks, as shown in Figure 6-1. Each rack is a collection of nodes, each of which has a dedicated role within the appliance. These nodes transfer data among themselves using an InfiniBand network that ships with the appliance. Only the nodes in the control rack communicate with the corporate Ethernet network. The nodes in the data rack can export tables to a corporate SMP SQL Server database by using the InfiniBand network.

Figure-6.1.gif

FIGURE 6-1 The multi-rack system

The Data Rack

All activity related to parallel query processing occurs in the data rack, which is a collection of compute nodes. Each compute node consists of a server with dedicated storage, a SQL Server instance, and additional Parallel Data Warehouse software that provides communication and data transfer functions. Although the compute nodes run separate SQL Server instances in parallel to manage each distributed appliance database, you query the database as if it were a single database.

The number of compute nodes in a data rack varies among the vendors, although each vendor follows a standard architecture specification. For example, each data rack includes a spare server for high availability. If a compute node server fails or needs to be taken offline for maintenance, the compute node server automatically fails over to the spare server. The current connections to the appliance stay intact while the appliance reconfigures itself. Just as with SQL Server failover, queries that were in progress before the failover need to be restarted.

The Control Rack

The control rack is a separate rack that houses the servers, storage, and networking components for the nodes that provide control, management, or interface functions. It contains several types of nodes that Parallel Data Warehouse uses to process user queries, to load and back up data, and to manage the appliance. Some of the nodes serve as intermediaries between the corporate network and the private network that connects the nodes in both the control rack and data rack. You never interact directly with the data rack; you submit a data load or a query to the control rack, which then coordinates the processes between nodes to complete your request.

Most Parallel Data Warehouse activity involves coordination with the control node. To support high availability, the control node is a two-node active/passive cluster. If the active node fails for any reason, the passive node takes over. The redundancy between the two nodes ensures the appliance can recover quickly from a failure.

Parallel Data Warehouse uses multiple networking technologies. The control rack servers connect to the corporate network by using the corporate Ethernet. The compute node servers connect to their dedicated database storage by using a Fibre Channel network. A highspeed InfiniBand network internally connects all the servers in the appliance to one another. Because InfiniBand is much faster than a Gigabit Ethernet network, it is better suited for the Parallel Data Warehouse nodes, which must transfer high volumes of data and be as fast as possible. For high availability, the switching fabric of each network includes redundancy.

The Control Node

The control node is in the control rack and manages client authentication; accepts client connections to Parallel Data Warehouse; manages the query execution process, which it distributes across the compute nodes; and serves as the central point for all hardware monitoring. To support high availability, the control node is a two-node active/passive cluster in which the passive node instantly takes over if the active node fails for any reason. The control node also contains a SQL Server instance.

To support the distributed architecture of Parallel Data Warehouse, the control node contains the MPP Engine, the Data Movement Service (DMS), and Windows Internet Information Services (IIS), as shown in Figure 6-2. The MPP Engine coordinates parallel query processing, storage of appliance-wide metadata and configuration data, and authentication and authorization for the appliance and databases. The DMS, which runs on most appliance nodes, is the communication interface for copying data between appliance nodes. IIS hosts a Web application, called the Admin Console, that you access by using Windows Internet Explorer and use to manage and monitor the appliance status and query performance.

You can connect to the Parallel Data Warehouse control node by using a variety of client access tools. Parallel Data Warehouse integrates with SQL Server 2008 R2 Business Intelligence

Development Studio, SQL Server Integration Services, SQL Server Analysis Services, and SQL Server Reporting Services. The Nexus client is the query editor that you can use to submit queries by using SQL statements to Parallel Data Warehouse. Parallel Data Warehouse also includes DWSQL, a command-line tool for submitting SQL statements to the control node. These client tools use Data Direct's SequeLink client drivers that support the following data access driver types:

ODBC
OLE DB
ADO.NET

Figure-6.2.gif

FIGURE 6-2 Appliance software

The Landing Zone Node

The Landing Zone is a high-capacity data storage node in the control rack that contains terabytes of disk space for temporary storage of user data before loading it into the appliance. Using your ETL processes to move data to the Landing Zone, you can either copy data to the Landing Zone and then load it into the appliance, or you can load data directly without first storing it on the Landing Zone. With either approach, the Landing Zone uses the appliance's high-speed fabric to copy that data in parallel into the data rack. To perform parallel data loading, you can use SQL Server Integration Services or a command-line tool.

The Backup Node

Another node in the control rack is the Backup node that, as the name implies, is dedicated to the backup process, which it can perform at very high speed. The backup node uses SQL Server's native database-level backup and restore functionality and coordinates the backup across nodes. You can create full backups or differential backups of user databases, or backups of the system database that contains information about user accounts, passwords, and permissions. The initial backup takes the longest time because it contains all data in a database, but subsequent differential backups run much faster because they contain only the changes in the data that were made since the last full backup. Furthermore, the backup process runs in parallel across nodes to help performance.

TIP To restore the backup, the destination appliance must have at least as many of compute nodes as the appliance where the backup was created.

The Management Node

The final node in the control rack is the management node, which operates as the hub for software deployment, servicing, and system health and performance monitoring. This node also runs a Windows domain controller to manage authentication within the appliance. It performs functions related to the management of hardware and software in the appliance and is not visible to users. Like the control node, the management node is a two-node active/ passive cluster.

NOTE Parallel Data Warehouse does not use the domain controller on the management node for user authentication.

The Compute Node

Each compute node is the host for a single SQL Server instance and runs the DMS to communicate with and transfer data to other appliance nodes. Each compute node stores a subset of each user database. Before parallel query processing begins, Parallel Data Warehouse copies any necessary data to each compute node so that it can process the query in parallel with other compute nodes without requiring data from other locations during processing. This feature, called data colocation, ensures that each compute node can execute its portion of the parallel query with no effect on the query performance of the other compute nodes.

Hub-and-Spoke Architecture

Rather than using Parallel Data Warehouse exclusively for a data warehouse, you can use a hub-and-spoke architecture to support both a corporate data warehouse and special purpose data marts. These data marts reside on servers outside of the appliance. The data warehouse at the hub is the primary data source for the spokes. A spoke can be a data mart, a host for Analysis Services, or even a development or test environment. You can enforce business rules and data quality standards for all data at the hub, and then you can quickly copy data as needed from the Parallel Data Warehouse to the spokes residing outside the appliance.

Data Management

Loading, processing, and backing up terabytes of data with balanced hardware resources is vitally important in a very large data warehouse. Parallel Data Warehouse uses carefully balanced hardware to maximize the efficiency of each hardware component and avoid the need to over-purchase hardware. Parallel Data Warehouse accomplishes this goal of balancing speed and hardware by using a shared nothing (SN) architecture.

In addition to the shared nothing architecture, there are other differences from other editions of SQL Server to notice. For example, SQL commands to create a database and tables are slightly different from their standard Transact-SQL counterparts. In addition, although Parallel Data Warehouse supports most of the SQL Server 2008 data types, there are a few exceptions. Last, the architecture requires a new approach to query processing and data load processing.

Shared Nothing Architecture

An SN architecture is a type of architecture in which each node of a system uses its own CPU,memory, and storage to avoid performance bottlenecks caused by resource contention with other nodes. In Parallel Data Warehouse, each compute node contains its own data, CPU, and storage to function as a self-sufficient and independent unit. Although the SN architecture is gaining popularity as a data warehousing architecture, performance can still be slow when a parallel query must first move data among the nodes before execution. When a SQL join operation requires data that is not already on the requisite compute nodes, Parallel Data Warehouse copies data to these nodes temporarily for use during query execution.

You design the data layout on the appliance to avoid or minimize data movement for parallel queries by using either a replicated or a distributed strategy for storage. When planning which strategy to implement, you consider the types of joins that the parallel queries require. Some tables require a replicated strategy, whereas others require a distributed strategy.

Replicated Strategy

For best performance, you can add small tables-such as dimension tables in a star schema- to Parallel Data Warehouse by using a replicated strategy. Parallel Data Warehouse makes a copy of the table on each compute node, as shown in Figure 6-3. You then perform the initial load of the table, followed by any subsequent inserts, updates, or deletes, as if you were working with a single table, without the need to manage each copy of the table. Parallel Data Warehouse handles all changes to the table for you. When a query performs a join on a replicated dimension, Parallel Data Warehouse joins the dimension to the portion of the fact table that exists on the same compute node. All compute nodes run the query in parallel and can find data very quickly because the complete dimension table is on each compute node.

Figure-6.3.gif

FIGURE 6-3 Replicated strategy

Distributed Strategy

One of the keys to performance in an MPP architecture is the distribution of large tables across multiple nodes, as shown in Figure 6-4. To distribute a fact table, you simply select a column from the table to use as the distribution column, and when data is loaded into the table, Parallel Data Warehouse automatically spreads the rows across all of the compute nodes in the appliance. There are performance considerations for the selection of a distribution column, such as distinctness, data skew, and the types of queries executed on the system. For a detailed discussion of the choice of distributed tables, refer to the product documentation. To distribute the rows in the fact table, a hash function assigns each row to one of many storage locations based on the distribution column. Each compute node has 8 storage locations, called distributions, for the hashed rows. If a data rack has 8 compute nodes, the data rack has 64 distributions, which are queried in parallel.

Figure-6.4.gif

FIGURE 6-4 Distributed strategy

It is not essential that equal numbers of table rows are assigned to each distribution. There will almost always be some data skew among the distributions. If the amount of data skew becomes too large, the parallel system continues to run, but query times might be affected. You might have to experiment with several approaches before finding the best distributed strategy. A distributed strategy does not affect other table options that you might want to implement. For example, you can still define partitions and clustered indexes as needed.

DDL Extensions

To support the MPP architecture, Parallel Data Warehouse includes a SQL language that works with appliance databases. This SQL language includes data definition language (DDL) statements to create and alter databases, tables, views, and other entities on the appliance. You use these statements to operate on these objects as if they were on a single database instance. Behind the scenes, Parallel Data Warehouse allocates space for the objects and instantiates them across nodes.

CREATE DATABASE

The CREATE DATABASE statement has a set of options for supporting distributed and replicated tables. You determine how much space you need in total for the database for replicated tables, distributed tables, and logs. Parallel Data Warehouse manages the database according to your specifications.

Here is an example of the statement you use in Parallel Data Warehouse to create a new database:

CREATE DATABASE DW
WITH (
AUTOGROW = ON,
REPLICATED_SIZE = 50,
DISTRIBUTED_SIZE = 10000,
LOG_SIZE = 25
);

This statement uses the following options:

AUTOGROW This option specifies whether to enable or disable the automatic growth feature. This feature allows Parallel Data Warehouse to manage the growth of data and log files as needed over time.
REPLICATED_SIZE This specifies the total space in gigabytes allocated to replicated tables (and associated data) on each compute node. Parallel Data Warehouse stores replicated tables in a SQL Server filegroup on each compute node.
DISTRIBUTED_SIZE This specifies the total space in gigabytes allocated to distributed tables on the appliance. Parallel Data Warehouse divides the space among all distributions on the compute nodes and stores each distribution in a separate SQL Server filegroup. In the SN architecture of Parallel Data Warehouse, each distribution has its own set of disks for storage. This set of disks is configured as a logical unit number (LUN).
LOG_SIZE This option specifies the total space in gigabytes allocated to the transaction log on the appliance. You should plan for the log file size to be large enough to accommodate the largest data load that you expect. The automatic growth feature adjusts the log size as needed if you underestimate the required log file size.

CREATE TABLE

The CREATE TABLE statement syntax varies slightly from its syntax in standard Transact-SQL. For Parallel Data Warehouse, the statement includes options for specifying whether the table uses a replicated or a distributed strategy and whether to store the table with a clustered index or with a heap. You can also use this syntax to create partitions by specifying the partition boundary values.

NOTE Parallel Data Warehouse does not use the Transact-SQL partition schema or partition function. Also, you can create a clustered index only when you use CREATE TABLE. To create a nonclustered index, you use CREATE INDEX.

Here is an example of the syntax to create a replicated table:

CREATE TABLE DimProduct
(
ProductId BIGINT NOT NULL,
Description VARCHAR(50),
CategoryId INT NOT NULL,
ListPrice DECIMAL(12,2)
) WITH ( DISTRIBUTION = REPLICATE );

This syntax instructs Parallel Data Warehouse to create a table on all compute nodes. Subsequent commands to insert or delete data affect data in each copy of the table.

Here is an example of the syntax to create a distributed table:

CREATE TABLE FactSales
( CustomerId BIGINT,
SalesId BIGINT,
ProductId BIGINT,
SaleDate DATE,
Quantity INT,
Amount DECIMAL(15,2)
) WITH (
DISTRIBUTE = HASH (CustomerId),
CLUSTERED INDEX (SaleDate),
PARTITION ( SaleDate
RANGE RIGHT FOR VALUES
( '2009-01-01','2009-02-01','2009-03-01','2009-04-01','2009-05-01','2009-06-01'
,'2009-07-01','2009-08-01','2009-09-01','2009-10-01','2009-11-01','2009-12-01')
));

The CREATE TABLE statement for Parallel Data Warehouse includes the following items:

DISTRIBUTION Specifies the column to hash for distributing rows across all compute nodes in Parallel Data Warehouse.
CLUSTERED INDEX Specifies the column for a clustered index-if you omit this item from the statement, Parallel Data Warehouse stores the table as a heap.
PARTITION Specifies the boundary values of the partition and the column to use for partitioning the rows

In addition, you can use a CREATE TABLE AS SELECT statement to create a table from the results of a SELECT statement. You might use this technique when you are redistributing or defragmenting a table.

Here is an example of the syntax for a CREATE TABLE AS SELECT statement:

CREATE TABLE DimCustomer
WITH
( CLUSTERED INDEX (CustomerID) )
AS
SELECT * FROM DimCustomer;

Another option for creating tables is the CREATE REMOTE TABLE statement, which you can use to export a table to a non-appliance SQL Server database in an SMP architecture. To use this statement, you must ensure that the target database is available on the appliance's InfiniBand network.

Data Types

Many SQL Server data types supported by SQL Server 2008 are also supported by Parallel Data Warehouse. Character and binary strings are supported, but you must limit the string length to 8,000 characters. Another point to note is that Parallel Data Warehouse uses only Latin1_General_BIN2 collation.

The following data types are supported:

Binary and varbinary
Bit
Char and varchar
Date
Datetime and datetime2
Datetimeoffset
Decimal
Float and real
Int, bigint, smallint, and tinyint
Money and smallmoney
Nchar and nvarchar
Smalldatetime
Time

Query Processing

Query processing in Parallel Data Warehouse is more complex than in an SMP data warehouse because processing must manage high availability, parallelization, and data movement between nodes. In general, Parallel Data Warehouse's control node follows these steps to process a query (shown in Figure 6-5).

Parse the SQL statement.
Validate and authorize the objects.
Build a distributed execution plan.
Run the execution plan.
Aggregate query results.
Send results to the client application.

Figure-6.5.gif

FIGURE 6-5 Query processing steps

A query with a simple join on columns of replicated tables or distribution columns of distributed tables does not require the transfer of data between compute nodes before executing the query. By contrast, a more complex join that includes a nondistribution column of a distributed table does require Parallel Data Warehouse to copy data among the distributions before executing the query.

Data Load Processing

The design of data load processing in Parallel Data Warehouse takes full advantage of the parallel architecture to move data to the compute nodes. You have several options for loading data into your data warehouse. You can use your ETL process to copy files to the Parallel

Data Warehouse's Landing Zone. You then invoke a command-line tool, DWLoader, and specify options to load the data into the appliance. Or you can use Integration Services to move data to the Landing Zone and call the loading functionality directly. To load small amounts of data, you can connect to the control node and use the SQL INSERT statement. Queries can run concurrently with load processing, so your data warehouse is always available during ETL processing. DWLoader loads table rows in bulk into an existing table in the appliance. You have several options for loading rows into a table. You can add all rows to the end of the table by using append mode. Another option is to append new rows and update existing rows by using upsert mode. A third option is to delete all existing rows first and then to insert all rows into an empty table by using reload mode.

Monitoring and Management

Parallel Data Warehouse includes the Admin Console, a Web-based application with which you can monitor the health of the appliance, query execution status, and view other information useful for tuning user queries. This application runs on IIS on the control node and is accessible by using Internet Explorer.

The Admin Console allows you to view these options:

Appliance Dashboard Displays status details, such as utilization metrics for CPUs, disks, and the network, and displays activity on the nodes.
Queries Activity Displays a list of running queries and queries recently completed, with related errors, if any, and provides the ability to drill down to details to view the query execution plan and node execution information.
Load Activity Displays load plans, the current state of loads, and related errors, if any.
Backup and Restore Displays a log of backup operations.
Active Locks Displays a list of locks across all nodes and their current status.
Active Sessions Displays active user sessions to aid monitoring of resource contention
Application Errors Displays error event information.
Node Health Displays hardware and software alerts and allows an administrator to view the health of specific nodes.

To manage database objects, you might need to query the tables or view the objects. The version of SQL Server Management Studio included with SQL Server 2008 R2 is not currently compatible with Parallel Data Warehouse, but you can still use other tools. For example, you can use a command-line utility, Dwsql, to query a table. Using Dwsql is similar to using Sqlcmd. An alternative with a graphical user interface is the Nexus query tool from Coffing Data Warehousing (Coffing DW), which is distributed with each appliance installation. This tool operates much like SQL Server Management Studio (SSMS) by allowing you to navigate through an object explorer to find tables and views and to run queries interactively.

Business Intelligence Integration

Parallel Data Warehouse integrates with the SQL Server business intelligence (BI) components- Integration Services, Reporting Services, and SQL Server Analysis Services.

Integration Services

Integration Services is the ETL component of SQL Server. You use Integration Services packages to extract and merge data from multiple data sources and to filter and cleanse your data before loading it into the data warehouse. In SQL Server 2008 R2, Integration Services includes the SQL Server Parallel Data Warehouse connection manager and the SQL Server Parallel Data Warehouse Destination as new components that you use in Integration Services packages to load data into Parallel Data Warehouse. This new data destination provides optimized throughput and very fast performance because it loads data directly and quickly into the target database. You also have the option to deploy packages to the Landing Zone.

Reporting Services

You can use Parallel Data Warehouse as a data source for reports that you develop for Reporting Services using the Report Designer in Business Intelligence Development Studio or SQL Server 2008 R2 Report Builder 3.0. The Parallel Data Warehouse data source extension provides support for the graphical query designer, parameterized queries, and basic transactions, but it does not support Windows integrated security or advanced transactions. To use the Parallel Data Warehouse data source extension, you must install the ADO.NET data provider for Parallel Data Warehouse on the report server and each computer on which you create reports.

You can also use Parallel Data Warehouse as a source for report models. By using Report Manager or the report server API, you can generate a model from a Parallel Data Warehouse database. For more precise control of the model, you can use the Model Designer in Business Intelligence Development Studio.

Analysis Services and PowerPivot

Parallel Data Warehouse is also a valid data source for Analysis Services databases and Excel PowerPivot models. Using the OLE DB provider, you can configure an Analysis Services cube to use either multidimensional online analytical processing (MOLAP) or relational online analytical processing (ROLAP) storage. When using MOLAP storage, Analysis Services extracts data from Parallel Data Warehouse and stores it in a separate structure for reporting and analysis. By contrast, when using ROLAP storage, Analysis Services leaves the data in Parallel Data Warehouse. At query time, Analysis Services translates the multidimensional expression (MDX) query into a SQL query, which it sends to the Parallel Data Warehouse control node for query processing.

Up Next

Ebook Download

View all

Practical SQL for Data Analysts: Unlock the Power of Your Data

Read by 0 people

Download Now!

Learn

View all