This chapter is taken from book "Introducing
Microsoft SQL Server 2008 R2" by Ross Mistry and Stacia Misner published for
Microsoft Press.
Microsoft SQL Server 2008 R2 Parallel Data Warehouse is an enterprise data
warehouse appliance based on technology originally created by DATAllegro and
acquired by Microsoft in 2008. In the months following the acquisition,
Microsoft revamped the product by changing it from a product that used the Linux
operating system and Ingres database technologies to a product based on SQL
Server 2008 R2 and the Windows Server 2008 operating system. SQL Server 2008
Enterprise has many features supporting scalability and data warehouse
performance that Parallel Data Warehouse uses to its advantage. The combination
of SQL Server scalability and performance with a massively parallel processing (MPP)
architecture in Parallel Data Warehouse creates a powerful new option for
hosting a very large data warehouse.
Parallel Data Warehouse Architecture
Parallel Data Warehouse does not install like other editions of SQL Server.
Instead, it is a data warehouse appliance that bundles multiple software and
hardware technologies, including SQL Server, into a platform well suited for a
very large data warehouse. A key characteristic of this platform is the MPP
architecture, which enables fast data loads and high-performance queries. This
architecture consists of a multi-rack system, which parallelizes queries across
an array of dedicated servers connected by a high-speed network to deliver
results at speeds that are typically faster than possible with a traditional
symmetric multiprocessing (SMP) architecture.
Data Warehouse Appliances
You purchase a data warehouse appliance as preassembled and preconfigured
integrated components with all software preinstalled. When you place an order
for an appliance with an authorized vendor, you specify the number of appliance
racks that you want to purchase. The vendor works with you to add options, such
as an optional backup node, and to optimize the system to meet your requirements
for faster query performance and for storage of high data volumes. The vendor
then assembles industry-standard hardware components and loads the operating
system, SQL Server, and Parallel Data Warehouse software. When the assembly
process is complete, the vendor ships the appliance to you using shockproof
pallets. When it arrives, you remove the appliance from the pallets, plug it
into a power source, and connect it to your network.
Parallel Data Warehouse is a data warehouse appliance that includes all server,
networking, and storage components required to host a data warehouse. In
addition, your purchase of Parallel Data Warehouse includes cables, power
distribution units, and racks. Furthermore, the components have redundancy to
prevent downtime caused by a failure. The vendor installs all software at the
factory and configures Parallel Data Warehouse to balance CPU, memory, and disk
space. After you receive the Parallel Data Warehouse at your location, you use a
configuration tool that Parallel Data Warehouse includes to complete the network
setup and configure appliance settings for your environment. You can also
install Microsoft or third-party software to use when copying data between your
corporate network and the appliance.
Processing Architecture
A traditional data warehouse deployment of SQL Server is an SMP architecture, in
which identical processors share memory on a single server. One physical
instance of a database processes all queries. You can improve performance by
partitioning the data, thereby achieving multi-threaded parallelization. You can
add higher powered servers with more CPU, memory, storage, and networking
capacity to scale up, but the cost to scale up is high. By contrast, Parallel
Data Warehouse is an MPP architecture that uses multiple database servers that
operate together to process queries. Behind the scenes, each database server
runs one SQL Server instance with its own dedicated CPU, RAM, storage, and
network bandwidth. Each database managed by Parallel Data Warehouse is
distributed across multiple database servers that execute Parallel Data
Warehouse queries in parallel. Parallel Data Warehouse's architecture includes a
controlling server to coordinate these parallel queries and all other database
activity across the multiple database servers. This controlling server also
presents the distributed database as a single logical database to users. If you
need to scale out the MPP hardware, you can simply add inexpensive commodity
servers and storage rather than expensive high-end servers and storage.
The Multi-Rack System
Parallel Data Warehouse is configured as a multi-rack system in which there is a
control rack and one or more data racks, as shown in Figure 6-1. Each rack is a
collection of nodes, each of which has a dedicated role within the appliance.
These nodes transfer data among themselves using an InfiniBand network that
ships with the appliance. Only the nodes in the control rack communicate with
the corporate Ethernet network. The nodes in the data rack can export tables to
a corporate SMP SQL Server database by using the InfiniBand network.
FIGURE 6-1 The multi-rack system
The Data Rack
All activity related to parallel query processing occurs in the data rack, which
is a collection of compute nodes. Each compute node consists of a server with
dedicated storage, a SQL Server instance, and additional Parallel Data Warehouse
software that provides communication and data transfer functions. Although the
compute nodes run separate SQL Server instances in parallel to manage each
distributed appliance database, you query the database as if it were a single
database.
The number of compute nodes in a data rack varies among the vendors, although
each vendor follows a standard architecture specification. For example, each
data rack includes a spare server for high availability. If a compute node
server fails or needs to be taken offline for maintenance, the compute node
server automatically fails over to the spare server. The current connections to
the appliance stay intact while the appliance reconfigures itself. Just as with
SQL Server failover, queries that were in progress before the failover need to
be restarted.
The Control Rack
The control rack is a separate rack that houses the servers, storage, and
networking components for the nodes that provide control, management, or
interface functions. It contains several types of nodes that Parallel Data
Warehouse uses to process user queries, to load and back up data, and to manage
the appliance. Some of the nodes serve as intermediaries between the corporate
network and the private network that connects the nodes in both the control rack
and data rack. You never interact directly with the data rack; you submit a data
load or a query to the control rack, which then coordinates the processes
between nodes to complete your request.
Most Parallel Data Warehouse activity involves coordination with the control
node. To support high availability, the control node is a two-node
active/passive cluster. If the active node fails for any reason, the passive
node takes over. The redundancy between the two nodes ensures the appliance can
recover quickly from a failure.
Parallel Data Warehouse uses multiple networking technologies. The control rack
servers connect to the corporate network by using the corporate Ethernet. The
compute node servers connect to their dedicated database storage by using a
Fibre Channel network. A highspeed InfiniBand network internally connects all
the servers in the appliance to one another. Because InfiniBand is much faster
than a Gigabit Ethernet network, it is better suited for the Parallel Data
Warehouse nodes, which must transfer high volumes of data and be as fast as
possible. For high availability, the switching fabric of each network includes
redundancy.
The Control Node
The control node is in the control rack and manages client authentication;
accepts client connections to Parallel Data Warehouse; manages the query
execution process, which it distributes across the compute nodes; and serves as
the central point for all hardware monitoring. To support high availability, the
control node is a two-node active/passive cluster in which the passive node
instantly takes over if the active node fails for any reason. The control node
also contains a SQL Server instance.
To support the distributed architecture of Parallel Data Warehouse, the control
node contains the MPP Engine, the Data Movement Service (DMS), and Windows
Internet Information Services (IIS), as shown in Figure 6-2. The MPP Engine
coordinates parallel query processing, storage of appliance-wide metadata and
configuration data, and authentication and authorization for the appliance and
databases. The DMS, which runs on most appliance nodes, is the communication
interface for copying data between appliance nodes. IIS hosts a Web application,
called the Admin Console, that you access by using Windows Internet Explorer and
use to manage and monitor the appliance status and query performance.
You can connect to the Parallel Data Warehouse control node by using a variety
of client access tools. Parallel Data Warehouse integrates with SQL Server 2008
R2 Business Intelligence
Development Studio, SQL Server Integration Services, SQL Server Analysis
Services, and SQL Server Reporting Services. The Nexus client is the query
editor that you can use to submit queries by using SQL statements to Parallel
Data Warehouse. Parallel Data Warehouse also includes DWSQL, a command-line tool
for submitting SQL statements to the control node. These client tools use Data
Direct's SequeLink client drivers that support the following data access driver
types:
FIGURE 6-2 Appliance software
The Landing Zone Node
The Landing Zone is a high-capacity data storage node in the control rack that
contains terabytes of disk space for temporary storage of user data before
loading it into the appliance. Using your ETL processes to move data to the
Landing Zone, you can either copy data to the Landing Zone and then load it into
the appliance, or you can load data directly without first storing it on the
Landing Zone. With either approach, the Landing Zone uses the appliance's
high-speed fabric to copy that data in parallel into the data rack. To perform
parallel data loading, you can use SQL Server Integration Services or a
command-line tool.
The Backup Node
Another node in the control rack is the Backup node that, as the name implies,
is dedicated to the backup process, which it can perform at very high speed. The
backup node uses SQL Server's native database-level backup and restore
functionality and coordinates the backup across nodes. You can create full
backups or differential backups of user databases, or backups of the system
database that contains information about user accounts, passwords, and
permissions. The initial backup takes the longest time because it contains all
data in a database, but subsequent differential backups run much faster because
they contain only the changes in the data that were made since the last full
backup. Furthermore, the backup process runs in parallel across nodes to help
performance.
TIP To restore the backup, the destination appliance must have at least as many
of compute nodes as the appliance where the backup was created.
The Management Node
The final node in the control rack is the management node, which operates as the
hub for software deployment, servicing, and system health and performance
monitoring. This node also runs a Windows domain controller to manage
authentication within the appliance. It performs functions related to the
management of hardware and software in the appliance and is not visible to
users. Like the control node, the management node is a two-node active/ passive
cluster.
NOTE Parallel Data Warehouse does not use the domain controller on the
management node for user authentication.
The Compute Node
Each compute node is the host for a single SQL Server instance and runs the DMS
to communicate with and transfer data to other appliance nodes. Each compute
node stores a subset of each user database. Before parallel query processing
begins, Parallel Data Warehouse copies any necessary data to each compute node
so that it can process the query in parallel with other compute nodes without
requiring data from other locations during processing. This feature, called data
colocation, ensures that each compute node can execute its portion of the
parallel query with no effect on the query performance of the other compute
nodes.
Hub-and-Spoke Architecture
Rather than using Parallel Data Warehouse exclusively for a data warehouse, you
can use a hub-and-spoke architecture to support both a corporate data warehouse
and special purpose data marts. These data marts reside on servers outside of
the appliance. The data warehouse at the hub is the primary data source for the
spokes. A spoke can be a data mart, a host for Analysis Services, or even a
development or test environment. You can enforce business rules and data quality
standards for all data at the hub, and then you can quickly copy data as needed
from the Parallel Data Warehouse to the spokes residing outside the appliance.
Data Management
Loading, processing, and backing up terabytes of data with balanced hardware
resources is vitally important in a very large data warehouse. Parallel Data
Warehouse uses carefully balanced hardware to maximize the efficiency of each
hardware component and avoid the need to over-purchase hardware. Parallel Data
Warehouse accomplishes this goal of balancing speed and hardware by using a
shared nothing (SN) architecture.
In addition to the shared nothing architecture, there are other differences from
other editions of SQL Server to notice. For example, SQL commands to create a
database and tables are slightly different from their standard Transact-SQL
counterparts. In addition, although Parallel Data Warehouse supports most of the
SQL Server 2008 data types, there are a few exceptions. Last, the architecture
requires a new approach to query processing and data load processing.
Shared Nothing Architecture
An SN architecture is a type of architecture in which each node of a system uses
its own CPU,memory, and storage to avoid performance bottlenecks caused by
resource contention with other nodes. In Parallel Data Warehouse, each compute
node contains its own data, CPU, and storage to function as a self-sufficient
and independent unit. Although the SN architecture is gaining popularity as a
data warehousing architecture, performance can still be slow when a parallel
query must first move data among the nodes before execution. When a SQL join
operation requires data that is not already on the requisite compute nodes,
Parallel Data Warehouse copies data to these nodes temporarily for use during
query execution.
You design the data layout on the appliance to avoid or minimize data movement
for parallel queries by using either a replicated or a distributed strategy for
storage. When planning which strategy to implement, you consider the types of
joins that the parallel queries require. Some tables require a replicated
strategy, whereas others require a distributed strategy.
Replicated Strategy
For best performance, you can add small tables-such as dimension tables in a
star schema- to Parallel Data Warehouse by using a replicated strategy. Parallel
Data Warehouse makes a copy of the table on each compute node, as shown in
Figure 6-3. You then perform the initial load of the table, followed by any
subsequent inserts, updates, or deletes, as if you were working with a single
table, without the need to manage each copy of the table. Parallel Data
Warehouse handles all changes to the table for you. When a query performs a join
on a replicated dimension, Parallel Data Warehouse joins the dimension to the
portion of the fact table that exists on the same compute node. All compute
nodes run the query in parallel and can find data very quickly because the
complete dimension table is on each compute node.
FIGURE 6-3 Replicated strategy
Distributed Strategy
One of the keys to performance in an MPP architecture is the distribution of
large tables across multiple nodes, as shown in Figure 6-4. To distribute a fact
table, you simply select a column from the table to use as the distribution
column, and when data is loaded into the table, Parallel Data Warehouse
automatically spreads the rows across all of the compute nodes in the appliance.
There are performance considerations for the selection of a distribution column,
such as distinctness, data skew, and the types of queries executed on the
system. For a detailed discussion of the choice of distributed tables, refer to
the product documentation. To distribute the rows in the fact table, a hash
function assigns each row to one of many storage locations based on the
distribution column. Each compute node has 8 storage locations, called
distributions, for the hashed rows. If a data rack has 8 compute nodes, the data
rack has 64 distributions, which are queried in parallel.
FIGURE 6-4 Distributed strategy
It is not essential that equal numbers of table rows are assigned to each
distribution. There will almost always be some data skew among the
distributions. If the amount of data skew becomes too large, the parallel system
continues to run, but query times might be affected. You might have to
experiment with several approaches before finding the best distributed strategy.
A distributed strategy does not affect other table options that you might want
to implement. For example, you can still define partitions and clustered indexes
as needed.
DDL Extensions
To support the MPP architecture, Parallel Data Warehouse includes a SQL language
that works with appliance databases. This SQL language includes data definition
language (DDL) statements to create and alter databases, tables, views, and
other entities on the appliance. You use these statements to operate on these
objects as if they were on a single database instance. Behind the scenes,
Parallel Data Warehouse allocates space for the objects and instantiates them
across nodes.
CREATE DATABASE
The CREATE DATABASE statement has a set of options for supporting distributed
and replicated tables. You determine how much space you need in total for the
database for replicated tables, distributed tables, and logs. Parallel Data
Warehouse manages the database according to your specifications.
Here is an example of the statement you use in Parallel Data Warehouse to create
a new database:
CREATE DATABASE DW
WITH (
AUTOGROW = ON,
REPLICATED_SIZE = 50,
DISTRIBUTED_SIZE = 10000,
LOG_SIZE = 25
);
This statement uses the following options:
- AUTOGROW This option specifies
whether to enable or disable the automatic growth feature. This feature
allows Parallel Data Warehouse to manage the growth of data and log files as
needed over time.
- REPLICATED_SIZE This specifies the
total space in gigabytes allocated to replicated tables (and associated
data) on each compute node. Parallel Data Warehouse stores replicated tables
in a SQL Server filegroup on each compute node.
- DISTRIBUTED_SIZE This specifies the
total space in gigabytes allocated to distributed tables on the appliance.
Parallel Data Warehouse divides the space among all distributions on the
compute nodes and stores each distribution in a separate SQL Server
filegroup. In the SN architecture of Parallel Data Warehouse, each
distribution has its own set of disks for storage. This set of disks is
configured as a logical unit number (LUN).
- LOG_SIZE This option specifies the
total space in gigabytes allocated to the transaction log on the appliance.
You should plan for the log file size to be large enough to accommodate the
largest data load that you expect. The automatic growth feature adjusts the
log size as needed if you underestimate the required log file size.
CREATE TABLE
The CREATE TABLE statement syntax varies slightly from its syntax in standard
Transact-SQL. For Parallel Data Warehouse, the statement includes options for
specifying whether the table uses a replicated or a distributed strategy and
whether to store the table with a clustered index or with a heap. You can also
use this syntax to create partitions by specifying the partition boundary
values.
NOTE Parallel Data Warehouse does not use the Transact-SQL partition
schema or partition function. Also, you can create a clustered index only when
you use CREATE TABLE. To create a nonclustered index, you use CREATE INDEX.
Here is an example of the syntax to create a replicated table:
CREATE TABLE DimProduct
(
ProductId BIGINT NOT NULL,
Description VARCHAR(50),
CategoryId INT NOT NULL,
ListPrice DECIMAL(12,2)
) WITH ( DISTRIBUTION = REPLICATE );
This syntax instructs Parallel Data Warehouse to create a table on all compute
nodes. Subsequent commands to insert or delete data affect data in each copy of
the table.
Here is an example of the syntax to create a distributed table:
CREATE TABLE FactSales
( CustomerId BIGINT,
SalesId BIGINT,
ProductId BIGINT,
SaleDate DATE,
Quantity INT,
Amount DECIMAL(15,2)
) WITH (
DISTRIBUTE = HASH (CustomerId),
CLUSTERED INDEX (SaleDate),
PARTITION ( SaleDate
RANGE RIGHT FOR VALUES
( '2009-01-01','2009-02-01','2009-03-01','2009-04-01','2009-05-01','2009-06-01'
,'2009-07-01','2009-08-01','2009-09-01','2009-10-01','2009-11-01','2009-12-01')
));
The CREATE TABLE statement for Parallel Data Warehouse includes the following
items:
- DISTRIBUTION Specifies the column
to hash for distributing rows across all compute nodes in Parallel Data
Warehouse.
- CLUSTERED INDEX Specifies the
column for a clustered index-if you omit this item from the statement,
Parallel Data Warehouse stores the table as a heap.
- PARTITION Specifies the boundary
values of the partition and the column to use for partitioning the rows
In addition, you can use a CREATE TABLE AS
SELECT statement to create a table from the results of a SELECT statement. You
might use this technique when you are redistributing or defragmenting a table.
Here is an example of the syntax for a CREATE TABLE AS SELECT statement:
CREATE TABLE DimCustomer
WITH
( CLUSTERED INDEX (CustomerID) )
AS
SELECT * FROM DimCustomer;
Another option for creating tables is the CREATE REMOTE TABLE statement, which
you can use to export a table to a non-appliance SQL Server database in an SMP
architecture. To use this statement, you must ensure that the target database is
available on the appliance's InfiniBand network.
Data Types
Many SQL Server data types supported by SQL Server 2008 are also supported by
Parallel Data Warehouse. Character and binary strings are supported, but you
must limit the string length to 8,000 characters. Another point to note is that
Parallel Data Warehouse uses only Latin1_General_BIN2 collation.
The following data types are supported:
- Binary and varbinary
- Bit
- Char and varchar
- Date
- Datetime and datetime2
- Datetimeoffset
- Decimal
- Float and real
- Int, bigint, smallint, and tinyint
- Money and smallmoney
- Nchar and nvarchar
- Smalldatetime
- Time
Query Processing
Query processing in Parallel Data Warehouse is more complex than in an SMP data
warehouse because processing must manage high availability, parallelization, and
data movement between nodes. In general, Parallel Data Warehouse's control node
follows these steps to process a query (shown in Figure 6-5).
- Parse the SQL statement.
- Validate and authorize the objects.
- Build a distributed execution plan.
- Run the execution plan.
- Aggregate query results.
- Send results to the client application.
FIGURE 6-5 Query processing steps
A query with a simple join on columns of replicated tables or distribution
columns of distributed tables does not require the transfer of data between
compute nodes before executing the query. By contrast, a more complex join that
includes a nondistribution column of a distributed table does require Parallel
Data Warehouse to copy data among the distributions before executing the query.
Data Load Processing
The design of data load processing in Parallel Data Warehouse takes full
advantage of the parallel architecture to move data to the compute nodes. You
have several options for loading data into your data warehouse. You can use your
ETL process to copy files to the Parallel
Data Warehouse's Landing Zone. You then invoke a command-line tool, DWLoader,
and specify options to load the data into the appliance. Or you can use
Integration Services to move data to the Landing Zone and call the loading
functionality directly. To load small amounts of data, you can connect to the
control node and use the SQL INSERT statement. Queries can run concurrently with
load processing, so your data warehouse is always available during ETL
processing. DWLoader loads table rows in bulk into an existing table in the
appliance. You have several options for loading rows into a table. You can add
all rows to the end of the table by using append mode. Another option is to
append new rows and update existing rows by using upsert mode. A third option is
to delete all existing rows first and then to insert all rows into an empty
table by using reload mode.
Monitoring and Management
Parallel Data Warehouse includes the Admin Console, a Web-based application with
which you can monitor the health of the appliance, query execution status, and
view other information useful for tuning user queries. This application runs on
IIS on the control node and is accessible by using Internet Explorer.
The Admin Console allows you to view these options:
- Appliance Dashboard Displays status
details, such as utilization metrics for CPUs, disks, and the network, and
displays activity on the nodes.
- Queries Activity Displays a list of
running queries and queries recently completed, with related errors, if any,
and provides the ability to drill down to details to view the query
execution plan and node execution information.
- Load Activity Displays load plans, the
current state of loads, and related errors, if any.
- Backup and Restore Displays a log of
backup operations.
- Active Locks Displays a list of locks
across all nodes and their current status.
- Active Sessions Displays active user
sessions to aid monitoring of resource contention
- Application Errors Displays error event
information.
- Node Health Displays hardware and software
alerts and allows an administrator to view the health of specific nodes.
To manage database objects, you might need to
query the tables or view the objects. The version of SQL Server Management
Studio included with SQL Server 2008 R2 is not currently compatible with
Parallel Data Warehouse, but you can still use other tools. For example, you can
use a command-line utility, Dwsql, to query a table. Using Dwsql is similar to
using Sqlcmd. An alternative with a graphical user interface is the Nexus query
tool from Coffing Data Warehousing (Coffing DW), which is distributed with each
appliance installation. This tool operates much like SQL Server Management
Studio (SSMS) by allowing you to navigate through an object explorer to find
tables and views and to run queries interactively.
Business Intelligence Integration
Parallel Data Warehouse integrates with the SQL Server business intelligence
(BI) components- Integration Services, Reporting Services, and SQL Server
Analysis Services.
Integration Services
Integration Services is the ETL component of SQL Server. You use Integration
Services packages to extract and merge data from multiple data sources and to
filter and cleanse your data before loading it into the data warehouse. In SQL
Server 2008 R2, Integration Services includes the SQL Server Parallel Data
Warehouse connection manager and the SQL Server Parallel Data Warehouse
Destination as new components that you use in Integration Services packages to
load data into Parallel Data Warehouse. This new data destination provides
optimized throughput and very fast performance because it loads data directly
and quickly into the target database. You also have the option to deploy
packages to the Landing Zone.
Reporting Services
You can use Parallel Data Warehouse as a data source for reports that you
develop for Reporting Services using the Report Designer in Business
Intelligence Development Studio or SQL Server 2008 R2 Report Builder 3.0. The
Parallel Data Warehouse data source extension provides support for the graphical
query designer, parameterized queries, and basic transactions, but it does not
support Windows integrated security or advanced transactions. To use the
Parallel Data Warehouse data source extension, you must install the ADO.NET data
provider for Parallel Data Warehouse on the report server and each computer on
which you create reports.
You can also use Parallel Data Warehouse as a source for report models. By using
Report Manager or the report server API, you can generate a model from a
Parallel Data Warehouse database. For more precise control of the model, you can
use the Model Designer in Business Intelligence Development Studio.
Analysis Services and PowerPivot
Parallel Data Warehouse is also a valid data source for Analysis Services
databases and Excel PowerPivot models. Using the OLE DB provider, you can
configure an Analysis Services cube to use either multidimensional online
analytical processing (MOLAP) or relational online analytical processing (ROLAP)
storage. When using MOLAP storage, Analysis Services extracts data from Parallel
Data Warehouse and stores it in a separate structure for reporting and analysis.
By contrast, when using ROLAP storage, Analysis Services leaves the data in
Parallel Data Warehouse. At query time, Analysis Services translates the
multidimensional expression (MDX) query into a SQL query, which it sends to the
Parallel Data Warehouse control node for query processing.