Introduction
Business Intelligence has become a buzzword in recent years as a support to
decision making. Today we can find several database systems that include data
warehousing, online analytical processing (OLAP), and data mining technologies.
Data warehousing provides efficient storage, maintenance, and retrieval of
data. OLAP is a service that provides a way to create ad hoc queries against the
data warehouse in order to answer important business questions. Data mining is a
discipline comprising several algorithms for discovering knowledge in large
volumes of data.
In order to build a data warehouse solution, we need to model a consistent
architecture in which the operational data fits well into an integrated,
enterprise-wide view, and we also need to take into consideration a handful of
implementation strategies to deliver a high-quality application. The design and
implementation of a data warehouse solution can be a very complex challenge in
theory and practice. In this article, I will cover the main principles and
techniques for designing and implementing a data warehouse, drawing on my own
experience with this overwhelming challenge.
This is the second article in a series that I am writing in order to share my
knowledge and experience in this subject matter.
Getting started with the data warehouse solution
A data warehouse has many functions to perform such as: data extraction, data
loading, data transformation, data storage, and provision of an interface.
Figure 1 illustrates a high-level architecture of the data warehouse.
Figure 1
There are several technologies and components to support the functions of a data
warehouse. Figure 2 shows a detail view of the architecture of a data warehouse.
Figure 2
On the left side, you can see the Data Sources, which enable extracting data from
operational systems and importing additional data from external sources. All the
data from the Data Sources is then integrated in the Data Staging component in
order to remove inconsistencies, detect missing values, and transform the data
into a format suitable for easy access and analysis, in this case structurally
consistent with the star schema dimensional model. This step is very important
for maintaining data quality. Once the data is prepared in the Data Staging
component of the solution, it is loaded into the data warehouse storage medium,
and the Data Storage component manages the storage of the data as well as
keeping track of information about the data, known as metadata. It's worth
noting that the data in the Data Storage is stable and de-normalized (following
the star schema), representing snapshots at specified periods, although the data
may change periodically. Choosing between a relational database and a
multidimensional database for the implementation of the data warehouse is also a
critical decision. Finally, the Information Delivery component provides an
interface for accessing and displaying the data.
Regarding the Data Storage component, we have three physical mechanisms to
store data in a dimensional database: ROLAP, MOLAP, and HOLAP. ROLAP stands for
Relational OLAP; in this case the dimensional model is mapped to a relational
database engine such as Oracle or SQL Server. MOLAP stands for Multidimensional
OLAP; in this case the dimensional model is mapped to cubes, for example in SQL
Server Analysis Services. And finally, we have HOLAP, which stands for Hybrid
OLAP, a combination of ROLAP and MOLAP.
Building the data warehouse solution
This section is intended to take everything we have discussed so far and begin
putting it into practice by describing and demonstrating the full process of
analyzing, designing, and building a dimensional model and implementing the
underlying data warehouse. A data warehouse is typically developed one subject
area at a time, and in this section we're going to develop a department-specific
data warehouse, or data mart, for sales analysis. The resulting model will be in
star schema format to enable multidimensional analysis.
Designing the dimensional data model
Let's suppose that our client is a manufacturing company whose transactional
database is the AdventureWorks database shipped with Microsoft SQL Server 2005.
This company is finding it difficult to give the sales manager meaningful
information to answer several business questions in a changing and challenging
environment, and report execution is adversely affecting the performance of
the transaction processing systems during peak reporting periods. The main
business questions to answer concern sales volumes and shipped units by product,
by customer, by region, and by time period.
The first step is the development of the staging database that receives data
from the underlying data sources (the data is mostly stored in separate
transactional systems), transforms the data, and loads it through an integration
process in order to build the multidimensional database.
Once the requirements are gathered, all we need to do is establish what the
facts are (the transactional data) and what the dimensions are (reference data
or master data).
The next diagram (Figure 3) shows a simple star schema containing a central
fact table surrounded by several dimension tables. You can see that every
dimension has a surrogate key, generated within the data warehouse, as the
primary key of the dimension table, as well as a business key that represents
the different primary keys (for example, a product may have a product code in
one data source that differs semantically and in data type from the other data
sources) of the underlying tables consolidated into the dimension. All the
surrogate keys are declared as integer fields with the IDENTITY property turned
on. In the case of the fact table, the primary key is usually the combination of
all the dimension keys. However, depending on the semantics of the problem and
the level of summarization, you can declare another attribute of the fact table
as the primary key, so that it does not matter whether multiple records share
the same combination of dimension keys.
Figure 3
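As an illustration of these points, the following DDL is a minimal sketch of how
a dimension table with a surrogate IDENTITY key and a business key, and a fact
table keyed by the combination of dimension keys, might be declared in SQL
Server. The exact column names and data types are assumptions based on Figure 3,
not the definitive implementation; the remaining dimension tables would follow
the same pattern as ProductDim.
-- Sketch only: names and data types are assumptions based on Figure 3.
create table ProductDim
(
    ProductDim_ID     int identity(1,1) not null primary key, -- surrogate key
    ProductBusinessID int not null,                           -- business key from the source system
    ProductName       nvarchar(50) not null,
    Color             nvarchar(15) null,
    SubcategoryName   nvarchar(50) null
);

create table SalesFact
(
    ProductDim_ID    int not null references ProductDim(ProductDim_ID),
    CustomerDim_ID   int not null,  -- would reference CustomerDim
    RegionDim_ID     int not null,  -- would reference RegionDim
    TimePeriodDim_ID int not null,  -- would reference TimePeriodDim
    SalesVolumes     money not null,
    ShippedUnits     int not null,
    -- the combination of dimension keys acts as the primary key
    constraint PK_SalesFact primary key
        (ProductDim_ID, CustomerDim_ID, RegionDim_ID, TimePeriodDim_ID)
);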
The dimension tables such as ProductDim, CustomerDim, RegionDim, and
TimePeriodDim describe the SalesFact fact table. One interesting area in the
model is Customers; the Sales Manager is only interested in the CustomerType
attribute because this manufacturer does not sell directly to consumers but only
to retailers. The simple SQL statement to load the data for this dimension from
the relational schema is shown in Listing 1.
select CustomerID as CustomerBusinessID,
       CustomerType
from Sales.Customer;
Listing 1
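To actually populate the dimension, the SELECT in Listing 1 can feed an INSERT
into the dimension table. The following is a minimal sketch, assuming a
CustomerDim table with the columns shown in Figure 3 and an IDENTITY surrogate
key:
-- Sketch: assumes CustomerDim exists with an IDENTITY surrogate key,
-- so only the business key and the descriptive attribute are supplied.
insert into CustomerDim (CustomerBusinessID, CustomerType)
select CustomerID,
       CustomerType
from Sales.Customer;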
In order for the Sales Manager to get more insight about the customers, the
RegionDim table was introduced. This table reflects the fact that most companies
divide geographical areas into "sales territories", and these territories
might not translate directly to a political division such as a state or
province. The data for the RegionDim table may be extracted directly from a
table in the relational schema using the SQL statement shown in Listing 2.
select TerritoryID as RegionBusinessID,
       [Name] + '-' + CountryRegionCode as RegionName
from Sales.SalesTerritory;
Listing 2
The ProductDim table represents the dimension related to products. Products can
be grouped into subcategories and categories, and each product can carry
descriptive attributes useful for the final reports. This dimension is
characterized by the product name, color, and subcategory name. The data for
this dimension may be extracted directly from the Product and ProductSubcategory
tables in the relational schema as shown in Listing 3.
select p.ProductID as ProductBusinessID,
       p.[Name] as ProductName,
       p.Color,
       s.[Name] as SubcategoryName
from Production.Product p
inner join Production.ProductSubcategory s
    on p.ProductSubcategoryID = s.ProductSubcategoryID;
Listing 3
And finally, we have to deal with the Time dimension, represented by the
TimePeriodDim table. It's worth noting that most star schema designs have a time
dimension, because we often need to analyze the information over different time
periods. For example, the Sales Manager wants to see the sales volumes for the
current month, or compare the value of the current period with the same period
last year, and so on. This dimension includes information about the year,
quarter, month, week, and day. In this way, we can access and summarize data at
any of these levels, not just a single period such as the year. In order to
define the time dimension, we need to understand the level of detail required by
the fact table. Sometimes we think that we don't need a date-time table because
modern database systems include many functions to process date and time data
types. But if we include a date-time table containing a record for each day, we
can semantically extend the logic associated with the date-time data type by
flagging days as holidays or other special days such as Christmas or the 4th of
July. You can also adapt the calendar to reflect fiscal years.
As you can see, the time dimension is a special dimension, and most of the time
we need to populate it with calendar dates and date parts such as year, quarter,
month, and week. In order to generate this dimension, you can come up with SQL
business logic (see Listing 4).
declare @dtStartDate datetime;
declare @dtEndDate datetime;

set @dtStartDate = '1/1/2008';
set @dtEndDate = '1/1/2010';

while (@dtStartDate <= @dtEndDate)
begin
    select @dtStartDate,
           DATEPART(YYYY, @dtStartDate) as Calendar_Year,
           DATEPART(M, @dtStartDate) as Calendar_Month,
           DATEPART(Q, @dtStartDate) as Calendar_Quarter,
           DATEPART(WK, @dtStartDate) as Calendar_Week;

    set @dtStartDate = @dtStartDate + 1;
end;
Listing 4
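Instead of merely selecting the generated rows, the same loop can write them
into the TimePeriodDim table and flag special days as described above. The
following is a sketch under assumptions: the TimePeriodDim layout and the
IsHoliday column are illustrative, not part of the original schema.
-- Sketch: assumes TimePeriodDim(TimePeriodDim_ID identity, CalendarDate,
-- Calendar_Year, Calendar_Month, Calendar_Quarter, Calendar_Week, IsHoliday).
declare @dtStartDate datetime;
declare @dtEndDate datetime;

set @dtStartDate = '1/1/2008';
set @dtEndDate = '1/1/2010';

while (@dtStartDate <= @dtEndDate)
begin
    insert into TimePeriodDim
        (CalendarDate, Calendar_Year, Calendar_Month, Calendar_Quarter, Calendar_Week, IsHoliday)
    select @dtStartDate,
           DATEPART(YYYY, @dtStartDate),
           DATEPART(M, @dtStartDate),
           DATEPART(Q, @dtStartDate),
           DATEPART(WK, @dtStartDate),
           -- flag the 4th of July as an example holiday
           case when DATEPART(M, @dtStartDate) = 7 and DATEPART(D, @dtStartDate) = 4
                then 1 else 0 end;

    set @dtStartDate = @dtStartDate + 1;
end;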
The fact table SalesFact contains all the keys to the dimensions, such as
ProductDim_ID, CustomerDim_ID, RegionDim_ID, and TimePeriodDim_ID, as well as
the columns that hold the measures to be analyzed, such as SalesVolumes and
ShippedUnits. The data in this table is the result of summarizing data from the
SalesOrderHeader and SalesOrderDetail tables in the relational schema through
the following SQL statement (see Listing 5).
select sd.ProductID as ProductDim_ID,
       sh.CustomerID as CustomerDim_ID,
       sh.TerritoryID as RegionDim_ID,
       OrderDate as TimeDim,
       TotalDue as SalesVolumes,
       sum(OrderQty) as ShippedUnits
from Sales.SalesOrderHeader sh
inner join Sales.SalesOrderDetail sd
    on sh.SalesOrderID = sd.SalesOrderID
group by sd.ProductID,
         sh.CustomerID,
         sh.TerritoryID,
         OrderDate,
         TotalDue;
Listing 5
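Note that Listing 5 still carries the business keys (ProductID, CustomerID,
TerritoryID) from the source tables. When loading the fact table, these are
normally translated into the dimensions' surrogate keys by joining the staged
rows to the dimension tables on their business keys. A minimal sketch of that
lookup follows; the StagedSales table is a hypothetical holder for the output of
Listing 5, and the dimension columns are those assumed earlier.
-- Sketch: StagedSales is assumed to hold the output of Listing 5;
-- the joins translate business keys into the dimensions' surrogate keys.
insert into SalesFact
    (ProductDim_ID, CustomerDim_ID, RegionDim_ID, TimePeriodDim_ID, SalesVolumes, ShippedUnits)
select p.ProductDim_ID,
       c.CustomerDim_ID,
       r.RegionDim_ID,
       t.TimePeriodDim_ID,
       s.SalesVolumes,
       s.ShippedUnits
from StagedSales s
inner join ProductDim    p on p.ProductBusinessID  = s.ProductDim_ID
inner join CustomerDim   c on c.CustomerBusinessID = s.CustomerDim_ID
inner join RegionDim     r on r.RegionBusinessID   = s.RegionDim_ID
inner join TimePeriodDim t on t.CalendarDate       = s.TimeDim;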
In this case, the measures SalesVolumes and ShippedUnits are fully additive,
meaning that the values of these attributes may be summed up by simple addition.
When you cannot add up a certain measure attribute from the underlying instances
and come up with the total, that kind of measure is semi-additive. It's worth
noting that cubes are great at handling additive measures.
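For example, because the measures are fully additive, a total per product and
calendar month can be produced by a plain SUM over the fact table. This query is
a sketch against the tables assumed in the earlier DDL example:
-- Sketch: fully additive measures can be rolled up with simple summation.
select p.ProductName,
       t.Calendar_Year,
       t.Calendar_Month,
       sum(f.SalesVolumes) as TotalSalesVolumes,
       sum(f.ShippedUnits) as TotalShippedUnits
from SalesFact f
inner join ProductDim    p on p.ProductDim_ID    = f.ProductDim_ID
inner join TimePeriodDim t on t.TimePeriodDim_ID = f.TimePeriodDim_ID
group by p.ProductName, t.Calendar_Year, t.Calendar_Month;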
Sometimes you need to add columns to fact tables that are neither dimension keys
nor numeric measures, to help users drill down to a level of detail that
includes some business entity in full detail. This is commonly known as a
degenerate dimension, because it's basically a dimension with only one column,
which is the business key.
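In this scenario, the sales order number would be a typical candidate for a
degenerate dimension; as a sketch (the SalesOrderNumber column is an assumed
example, not part of the model above), it would simply be carried on the fact
table rather than in a separate dimension table:
-- Sketch: a degenerate dimension is just a business key kept on the fact table.
alter table SalesFact
    add SalesOrderNumber nvarchar(25) null;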
Conclusion
In this second article, I've explained how to get started with a data warehouse
solution.