Introduction
Business Intelligence has become a buzzword in recent years as a support to
decision making. Today we can find several database systems that include data
warehousing, online analytical processing (OLAP), and data mining technologies.
Data warehousing provides efficient storage, maintenance, and retrieval of
data. OLAP is a service that provides a way to create ad hoc queries against the
data warehouse in order to answer important business questions. Data mining is a
discipline comprising several algorithms for discovering knowledge in large
volumes of data.
In order to build a data warehouse solution, we need to model a consistent
architecture in which the operational data fits well into an integrated,
enterprise-wide view, and we also need to take into consideration a handful of
implementation strategies to deliver a high-quality application. The design and
implementation of a data warehouse solution can be a very complex challenge in
theory and practice. In this article, I will cover the main principles and
techniques for designing and implementing a data warehouse, drawing on my own
experience with this overwhelming challenge.
This is the second article in a series that I am writing in order to share my
knowledge and experience in this subject matter.
Getting started with the data warehouse solution
A data warehouse has many functions to perform such as: data extraction, data
loading, data transformation, data storage, and provision of an interface.
Figure 1 illustrates a high-level architecture of the data warehouse.
Figure 1
There are several technologies and components to support the functions of a data
warehouse. Figure 2 shows a detail view of the architecture of a data warehouse.
Figure 2
On the left side, you can see the Data Sources, which enable extracting data from
operational systems and importing additional data from external sources. All the
data from the Data Sources is then integrated in the Data Staging component in
order to remove inconsistencies, detect missing values, and transform the data
into a format suitable for easy access and analysis, in this case structurally
consistent with the star schema dimensional model. This step is very important
for maintaining data quality. Once the data is prepared in the Data Staging
component of the solution, it is loaded into the data warehouse storage medium,
and the Data Storage component manages the storage of the data as well as
keeping track of information about the data, known as metadata. It's worth
noting that the data in the Data Storage is stable and de-normalized (following
the star schema), representing snapshots at specified periods, although the data
may change periodically. Choosing between a relational database and a
multidimensional database for the implementation of the data warehouse is also a
critical decision. Finally, the Information Delivery component provides an
interface for accessing and displaying the data.
Regarding the Data Storage component, we have three physical mechanisms to
store data in a dimensional database: ROLAP, MOLAP, and HOLAP. ROLAP stands for
Relational OLAP; in this case the dimensional model is mapped to a relational
database engine such as Oracle or SQL Server. MOLAP stands for Multidimensional
OLAP; in this case the dimensional model is mapped to cubes, for example in SQL
Server Analysis Services. And finally, we have HOLAP, which stands for Hybrid
OLAP, a combination of ROLAP and MOLAP.
Building the data warehouse solution
This section is intended to take everything we have discussed so far and begin
putting it into practice by describing and demonstrating the full process of
analyzing, designing, and building a dimensional model and implementing the
underlying data warehouse. A data warehouse is typically developed one subject
area at a time, and in this section we're going to develop a department-specific
data warehouse, or data mart, for sales analysis. The resulting model will be in
star schema format to enable multidimensional analysis.
Designing the dimensional data model
Let's suppose that our client is a manufacturing company whose transactional
database is the AdventureWorks database shipped with Microsoft SQL Server 2005.
This company is finding it difficult to give the sales manager meaningful
information to answer several business questions in a changing and challenging
environment, and report execution is adversely affecting the performance of
the transaction processing systems during peak reporting periods. The main
business questions to answer concern sales volumes and shipped units by product,
by customer, by region, and by time period.
The first step is the development of the staging database that receives data
from the underlying data sources (the data is mostly stored in separate
transactional systems), transforms the data, and loads it through an integration
process in order to build the multidimensional database.
Once the requirements are gathered, all we need to do is establish what the
facts are (the transactional data) and what the dimensions are (reference data
or master data).
The next diagram (Figure 3) shows a simple star schema containing a central
fact table surrounded by several dimension tables. You can see that every
dimension has a surrogate key, generated within the data warehouse, as the
primary key of the dimension table, as well as a business key that represents
the different primary keys (for example, a product may have a product code in
one data source that differs semantically and in data type from the other data
sources) of the underlying tables consolidated into the dimension. All the
surrogate keys are declared as integer fields with the IDENTITY property turned
on. In the case of the fact table, the primary key is usually the combination of
all the dimension keys. However, depending on the semantics of the problem and
the level of summarization, you can declare another attribute of the fact table
as the primary key, so that it does not matter whether multiple records share
the same combination of dimension keys.
Figure 3
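As an illustration of these points, the following DDL is a minimal sketch of how
a dimension table with a surrogate IDENTITY key and a business key, and a fact
table keyed by the combination of dimension keys, might be declared in SQL
Server. The exact column names and data types are assumptions based on Figure 3,
not the definitive implementation; the remaining dimension tables would follow
the same pattern as ProductDim.
-- Sketch only: names and data types are assumptions based on Figure 3.
create table ProductDim
(
    ProductDim_ID     int identity(1,1) not null primary key, -- surrogate key
    ProductBusinessID int not null,                           -- business key from the source system
    ProductName       nvarchar(50) not null,
    Color             nvarchar(15) null,
    SubcategoryName   nvarchar(50) null
);

create table SalesFact
(
    ProductDim_ID    int not null references ProductDim(ProductDim_ID),
    CustomerDim_ID   int not null,  -- would reference CustomerDim
    RegionDim_ID     int not null,  -- would reference RegionDim
    TimePeriodDim_ID int not null,  -- would reference TimePeriodDim
    SalesVolumes     money not null,
    ShippedUnits     int not null,
    -- the combination of dimension keys acts as the primary key
    constraint PK_SalesFact primary key
        (ProductDim_ID, CustomerDim_ID, RegionDim_ID, TimePeriodDim_ID)
);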
The dimension tables such as ProductDim, CustomerDim, RegionDim, and
TimePeriodDim describe the SalesFact fact table. One interesting area in the
model is Customers; the Sales Manager is only interested in the CustomerType
attribute because this manufacturer does not sell directly to consumers but only
to retailers. The simple SQL statement to load the data for this dimension from
the relational schema is shown in Listing 1.
select CustomerID as CustomerBusinessID,
       CustomerType
from Sales.Customer;
Listing 1
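To actually populate the dimension, the SELECT in Listing 1 can feed an INSERT
into the dimension table. The following is a minimal sketch, assuming a
CustomerDim table with the columns shown in Figure 3 and an IDENTITY surrogate
key:
-- Sketch: assumes CustomerDim exists with an IDENTITY surrogate key,
-- so only the business key and the descriptive attribute are supplied.
insert into CustomerDim (CustomerBusinessID, CustomerType)
select CustomerID,
       CustomerType
from Sales.Customer;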
In order for the Sales Manager to get more insight about the customers, the
RegionDim table was introduced. This table reflects the fact that most companies
divide geographical areas into "sales territories", and these territories
might not translate directly to a political division such as a state or
province. The data for the RegionDim table may be extracted directly from a
table in the relational schema using the SQL statement shown in Listing 2.
select TerritoryID as RegionBusinessID,
       [Name] + '-' + CountryRegionCode as RegionName
from Sales.SalesTerritory;
Listing 2
The ProductDim table represents the dimension related to products. Products can
be grouped into subcategories and categories, and each product can carry
descriptive attributes useful for the final reports. This dimension is
characterized by the product name, color, and subcategory name. The data for
this dimension may be extracted directly from the Product and ProductSubcategory
tables in the relational schema as shown in Listing 3.
select p.ProductID as ProductBusinessID,
       p.[Name] as ProductName,
       p.Color,
       s.[Name] as SubcategoryName
from Production.Product p
inner join Production.ProductSubcategory s
    on p.ProductSubcategoryID = s.ProductSubcategoryID;
Listing 3
And finally, we have to deal with the Time dimension, represented by the
TimePeriodDim table. It's worth noting that most star schema designs have a time
dimension, because we often need to analyze the information over different time
periods. For example, the Sales Manager wants to see the sales volumes for the
current month, or compare the value of the current period with the same period
last year, and so on. This dimension includes information about the year,
quarter, month, week, and day. In this way, we can access and summarize data at
any of these levels, not just a single period such as the year. In order to
define the time dimension, we need to understand the level of detail required by
the fact table. Sometimes we think that we don't need a date-time table because
modern database systems include many functions to process date and time data
types. But if we include a date-time table containing a record for each day, we
can semantically extend the logic associated with the date-time data type by
flagging days as holidays or other special days such as Christmas or the 4th of
July. You can also adapt the calendar to reflect fiscal years.
As you can see, the time dimension is a special dimension, and most of the time
we need to populate it with calendar dates and date parts such as year, quarter,
month, and week. In order to generate this dimension, you can come up with SQL
business logic (see Listing 4).
declare @dtStartDate datetime;
declare @dtEndDate datetime;

set @dtStartDate = '1/1/2008';
set @dtEndDate = '1/1/2010';

while (@dtStartDate <= @dtEndDate)
begin
    select @dtStartDate,
           DATEPART(YYYY, @dtStartDate) as Calendar_Year,
           DATEPART(M, @dtStartDate) as Calendar_Month,
           DATEPART(Q, @dtStartDate) as Calendar_Quarter,
           DATEPART(WK, @dtStartDate) as Calendar_Week;

    set @dtStartDate = @dtStartDate + 1;
end;
Listing 4
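Instead of merely selecting the generated rows, the same loop can write them
into the TimePeriodDim table and flag special days as described above. The
following is a sketch under assumptions: the TimePeriodDim layout and the
IsHoliday column are illustrative, not part of the original schema.
-- Sketch: assumes TimePeriodDim(TimePeriodDim_ID identity, CalendarDate,
-- Calendar_Year, Calendar_Month, Calendar_Quarter, Calendar_Week, IsHoliday).
declare @dtStartDate datetime;
declare @dtEndDate datetime;

set @dtStartDate = '1/1/2008';
set @dtEndDate = '1/1/2010';

while (@dtStartDate <= @dtEndDate)
begin
    insert into TimePeriodDim
        (CalendarDate, Calendar_Year, Calendar_Month, Calendar_Quarter, Calendar_Week, IsHoliday)
    select @dtStartDate,
           DATEPART(YYYY, @dtStartDate),
           DATEPART(M, @dtStartDate),
           DATEPART(Q, @dtStartDate),
           DATEPART(WK, @dtStartDate),
           -- flag the 4th of July as an example holiday
           case when DATEPART(M, @dtStartDate) = 7 and DATEPART(D, @dtStartDate) = 4
                then 1 else 0 end;

    set @dtStartDate = @dtStartDate + 1;
end;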
The fact table SalesFact contains all the keys to the dimensions, such as
ProductDim_ID, CustomerDim_ID, RegionDim_ID, and TimePeriodDim_ID, as well as
the columns that hold the measures to be analyzed, such as SalesVolumes and
ShippedUnits. The data in this table is the result of summarizing data from the
SalesOrderHeader and SalesOrderDetail tables in the relational schema through
the following SQL statement (see Listing 5).
select sd.ProductID as ProductDim_ID,
       sh.CustomerID as CustomerDim_ID,
       sh.TerritoryID as RegionDim_ID,
       OrderDate as TimeDim,
       TotalDue as SalesVolumes,
       sum(OrderQty) as ShippedUnits
from Sales.SalesOrderHeader sh
inner join Sales.SalesOrderDetail sd
    on sh.SalesOrderID = sd.SalesOrderID
group by sd.ProductID,
         sh.CustomerID,
         sh.TerritoryID,
         OrderDate,
         TotalDue;
Listing 5
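Note that Listing 5 still carries the business keys (ProductID, CustomerID,
TerritoryID) from the source tables. When loading the fact table, these are
normally translated into the dimensions' surrogate keys by joining the staged
rows to the dimension tables on their business keys. A minimal sketch of that
lookup follows; the StagedSales table is a hypothetical holder for the output of
Listing 5, and the dimension columns are those assumed earlier.
-- Sketch: StagedSales is assumed to hold the output of Listing 5;
-- the joins translate business keys into the dimensions' surrogate keys.
insert into SalesFact
    (ProductDim_ID, CustomerDim_ID, RegionDim_ID, TimePeriodDim_ID, SalesVolumes, ShippedUnits)
select p.ProductDim_ID,
       c.CustomerDim_ID,
       r.RegionDim_ID,
       t.TimePeriodDim_ID,
       s.SalesVolumes,
       s.ShippedUnits
from StagedSales s
inner join ProductDim    p on p.ProductBusinessID  = s.ProductDim_ID
inner join CustomerDim   c on c.CustomerBusinessID = s.CustomerDim_ID
inner join RegionDim     r on r.RegionBusinessID   = s.RegionDim_ID
inner join TimePeriodDim t on t.CalendarDate       = s.TimeDim;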
In this case, the measures SalesVolumes and ShippedUnits are fully additive,
meaning that the values of these attributes may be summed up by simple addition.
When you cannot add up a certain measure attribute from the underlying instances
and come up with the total, that kind of measure is semi-additive. It's worth
noting that cubes are great at handling additive measures.
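For example, because the measures are fully additive, a total per product and
calendar month can be produced by a plain SUM over the fact table. This query is
a sketch against the tables assumed in the earlier DDL example:
-- Sketch: fully additive measures can be rolled up with simple summation.
select p.ProductName,
       t.Calendar_Year,
       t.Calendar_Month,
       sum(f.SalesVolumes) as TotalSalesVolumes,
       sum(f.ShippedUnits) as TotalShippedUnits
from SalesFact f
inner join ProductDim    p on p.ProductDim_ID    = f.ProductDim_ID
inner join TimePeriodDim t on t.TimePeriodDim_ID = f.TimePeriodDim_ID
group by p.ProductName, t.Calendar_Year, t.Calendar_Month;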
Sometimes you need to add columns to fact tables that are neither dimension keys
nor numeric measures, to help users drill down to a level of detail that
includes some business entity in full detail. This is commonly known as a
degenerate dimension, because it's basically a dimension with only one column,
which is the business key.
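In this scenario, the sales order number would be a typical candidate for a
degenerate dimension; as a sketch (the SalesOrderNumber column is an assumed
example, not part of the model above), it would simply be carried on the fact
table rather than in a separate dimension table:
-- Sketch: a degenerate dimension is just a business key kept on the fact table.
alter table SalesFact
    add SalesOrderNumber nvarchar(25) null;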
Conclusion
In this second article, I've explained how to get started with a data warehouse
solution.