Overview
Extract, Transform, and Load (ETL) processes have become a necessity in IT today. Being able to move data from one place to another has become a cornerstone in just about every company worldwide. SQL Server provides Integration Services (SSIS), which allows database administrators (DBAs) and developers alike to create and deploy advanced ETL packages to leverage the data within their organizations.
Whether a company moves data once, in a single large chunk, or across multiple copies, it wants that process to complete as quickly as possible.
An IT organization can have any number of developers building and running applications on SQL Server. Sooner or later, all of this data will need to be consolidated, either to analyze it or to build a central SQL database populated from the various source databases.
Introduction to ETL
The data in a Data Warehouse system is loaded with an ETL (Extract, Transform, Load) tool. As the name suggests, it performs the following three operations:
- Extracts the data from your transactional system, which can be Oracle, Microsoft SQL Server, or any other relational database,
- Transforms the data by performing data cleansing operations, and then
- Loads the data into the OLAP Data Warehouse.
Example
- Let us assume there is a manufacturing company with multiple departments such as Sales, HR, Material Management, EWM, etc. Each department maintains a separate database for its own work, and each database differs in technology, landscape, table names, columns, and so on. Now, if the company wants to analyze historical data and generate reports, all the data from these data sources should be extracted and loaded into a Data Warehouse, where it is kept for analytical work.
- An ETL tool extracts the data from all these heterogeneous data sources, transforms the data (like applying calculations, joining fields, keys, removing incorrect data fields, etc.), and loads it into a Data Warehouse. Later, you can use various Business Intelligence (BI) tools to generate meaningful reports, dashboards, and visualizations using this data.
ETL Process
Let us now discuss, in a little more detail, the key steps involved in an ETL procedure:
Extracting the Data
It involves extracting the data from different heterogeneous data sources. How data is extracted from a transactional system varies with the requirement and the ETL tool in use. Extraction is normally done by running scheduled jobs during off-business hours, such as overnight or over the weekend.
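The following T-SQL sketch illustrates a very simple extraction step. The database, schema, and table names (SalesDB, Staging.Orders) are hypothetical; a real ETL tool such as SSIS would typically wrap a query like this in a data flow and schedule it as a job.

```sql
-- Minimal extraction sketch (hypothetical names): copy yesterday's orders
-- from the transactional database into a staging table. In practice this
-- statement would run inside a scheduled job (e.g., SQL Server Agent)
-- during off-business hours.
INSERT INTO Staging.Orders (OrderID, CustomerID, OrderDate, Amount)
SELECT  o.OrderID,
        o.CustomerID,
        o.OrderDate,
        o.Amount
FROM    SalesDB.dbo.Orders AS o
WHERE   o.OrderDate >= DATEADD(DAY, -1, CAST(GETDATE() AS date))
  AND   o.OrderDate <  CAST(GETDATE() AS date);
```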
Transforming the Data
It involves transforming the data into a suitable format that can be easily loaded into a DW system. Data transformation involves applying calculations, joins, and defining primary and foreign keys on the data. For example, if you want each record's percentage of total revenue and that figure is not stored in the database, you apply the percentage formula during transformation and then load the result. Similarly, if the first name and the last name of users are stored in different columns, you can apply a concatenate operation before loading the data. Some data doesn't require any transformation; such data is known as direct move or pass-through data.
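As a rough illustration of these two transformations, the T-SQL below computes a percentage of total revenue and concatenates first and last names. The staging tables and columns are hypothetical.

```sql
-- Transformation sketch (hypothetical staging tables): derive values that
-- are not stored in the source system before loading them.
SELECT  s.OrderID,
        -- Concatenate first and last name into a single column.
        c.FirstName + ' ' + c.LastName           AS CustomerName,
        s.Amount,
        -- Each order's percentage of total revenue, computed on the fly.
        s.Amount * 100.0 / SUM(s.Amount) OVER () AS PctOfTotalRevenue
FROM    Staging.Orders    AS s
JOIN    Staging.Customers AS c
        ON c.CustomerID = s.CustomerID;
```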
Data transformation also involves data correction and cleansing: removing incorrect or incomplete data and fixing data errors. It also includes enforcing data integrity and reformatting incompatible data before loading it into a DW system.
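A minimal cleansing sketch, again with hypothetical staging tables, might discard incomplete or obviously wrong rows and normalize formats:

```sql
-- Cleansing sketch (hypothetical staging tables): remove incomplete or
-- incorrect rows and fix inconsistent formatting before loading.
DELETE FROM Staging.Orders
WHERE  CustomerID IS NULL     -- incomplete rows
   OR  Amount < 0;            -- obviously incorrect values

UPDATE Staging.Customers
SET    Email = LOWER(LTRIM(RTRIM(Email)));  -- normalize e-mail formatting
```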
Loading the Data into a DW System
It involves loading the data into a DW system for analytical reporting and information. The target system can be a simple delimited flat file or a data warehouse.
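When the target is a data warehouse, the load step often amounts to inserting the transformed rows into a fact table while resolving dimension keys. The warehouse schema below (dw.FactSales, dw.DimCustomer) is hypothetical.

```sql
-- Loading sketch (hypothetical warehouse schema): move transformed rows
-- from staging into a fact table, looking up the customer dimension key.
INSERT INTO dw.FactSales (DateKey, CustomerKey, Amount)
SELECT  CONVERT(int, CONVERT(char(8), s.OrderDate, 112)) AS DateKey,  -- yyyymmdd
        d.CustomerKey,
        s.Amount
FROM    Staging.Orders AS s
JOIN    dw.DimCustomer AS d
        ON d.CustomerID = s.CustomerID;
```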
ETL Tool Function
A typical ETL tool-based data warehouse uses staging, data integration, and access layers to perform its functions. It is normally a three-layer architecture; a minimal sketch of all three layers follows the list below.
- Staging Layer:
The staging layer or staging database is used to store the data extracted from different source data systems.
- Data Integration Layer:
The integration layer transforms the data from the staging layer and moves the data to a database, where the data is arranged into hierarchical groups, often called dimensions, and into facts and aggregate facts. The combination of facts and dimensions tables in a DW system is called a schema.
- Access Layer:
The access layer is used by end-users to retrieve the data for analytical reporting and information.
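To make the three layers concrete, here is a minimal sketch with hypothetical objects: a staging table, a dimension and a fact table for the integration layer, and a view that serves as the access layer. It assumes the Staging and dw schemas already exist.

```sql
-- Staging layer: raw extract from the source systems.
CREATE TABLE Staging.Orders
(
    OrderID    int,
    CustomerID int,
    OrderDate  datetime,
    Amount     money
);

-- Data integration layer: data arranged into a dimension and a fact table.
CREATE TABLE dw.DimCustomer
(
    CustomerKey  int IDENTITY(1, 1) PRIMARY KEY,
    CustomerID   int,
    CustomerName nvarchar(200)
);

CREATE TABLE dw.FactSales
(
    DateKey     int,
    CustomerKey int,
    Amount      money
);
GO

-- Access layer: a view that end users query for analytical reporting.
CREATE VIEW dw.vSalesByCustomer
AS
SELECT  d.CustomerName,
        SUM(f.Amount) AS TotalAmount
FROM    dw.FactSales   AS f
JOIN    dw.DimCustomer AS d
        ON d.CustomerKey = f.CustomerKey
GROUP BY d.CustomerName;
GO
```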
DTS (Data Transformation Services):
In SQL Server versions 6.5 and earlier, database administrators (DBAs) used SQL Server Transfer Manager and Bulk Copy Program, included with SQL Server, to transfer data. These tools had significant shortcomings, and many DBAs used third-party tools. When SQL Server 7 was released, "Data Transformation Services" was packaged with it to replace all these tools.
SQL Server 2000 expanded DTS functionality in several ways. Many new task types were added, including the ability to FTP files, move databases or database components, and send messages to Microsoft Message Queue (MSMQ).
SSIS (SQL Server Integration Services):
One disadvantage of DTS was that you had to be connected to SQL Server to develop a package. SSIS overcomes this limitation.
SSIS is the replacement for DTS in SQL Server 2005. SSIS packages are developed in the Business Intelligence Development Studio (BIDS), a development IDE installed automatically by SQL Server 2005 on top of Visual Studio 2005. BIDS thus brings the advantages of the Visual Studio development tools to package development.
Advantages of SSIS over DTS:
SSIS gives you many new ways to control the flow of your package that previously could only be achieved by writing code. One of the coolest control features is the ability to set up looping within the package. Two containers, the For Loop Container and the Foreach Loop Container, are available for this purpose.