Data Cleansing Using Cloud Based Reference Data
Need for Data Quality Services:
For any business and organization high-quality data is critical for efficiency. An enterprise of any size can use Data Quality Services (DQS) in SQL Server 2012 to improve the value of its data and make it more suitable for use. Correct and accurate data can form the basis for Business Intelligence, Data Analytics and Data Warehouses. Decisions taken based on Quality Data will yield good results for the business.
Data Quality deals with the accuracy of data, completeness of data, timeliness of data, consistency of data and trust on data.
SQL Server Data Quality Services
SQL Server 2012 Data Quality Services (DQS) is the data quality product from Microsoft SQL Server 2012. DQS enables you to perform a variety of critical data quality tasks, including correction, enrichment, standardization and de-duplication of your data.
DQS provides the following features to resolve data quality issues.
- Data Cleansing
- Matching
- Reference Data Services
- Profiling and Monitoring
- Knowledge Base
DQS enables you to perform data cleansing using cloud-based reference data services provided by reference data providers. DQS also provides profiling that is integrated into its data-quality tasks, enabling to analyze the integrity of the data.
Data Quality Services consists of the following:
Data Quality Server
Data Quality Server is implemented as three SQL Server catalogs DQS_MAIN, DQS_PROJECTS, and DQS_STAGING_DATA.
DQS_MAIN includes DQS Stored Procedures, DQS engine, and published Knowledge Bases.
DQS_PROJECTS includes data that is required for Knowledge Base management and DQS project activities.
DQS_STAGING_DATA provides an intermediate staging database where you can copy your source data to perform DQS operations, and then export your processed data.
Data Quality Client
Data Quality Client is a standalone application that enables you to perform knowledge management, data quality projects and administration in one user interface. A Data Quality Client can be installed and run on the same computer as a Data Quality Server or remotely on a separate computer. Many operations in the Data Quality Client are wizard-driven.
Both are installed from within the SQL Server setup program.
The following details how Data Quality Services can be installed and used for Data Cleansing.
Step 1: Installing the Data Quality Services
Data Quality Services comes with SQL Server 2012 Setup, install it from the setup as in the following:
Once the Setup is completed, you can see the Data Quality Client and Server Installer under SQL Server in Programs and Features
Step 2: Post Installation Tasks
After the completion of SQL Server installation wizard, you must perform additional steps to complete the Data Quality Server installation and configuration.
The DQSInstaller will be part of the Microsoft SQL Server in the Binn Folder, browse to this folder and run the DQSInstaller.exe as Administrator; this creates the required databases for DQS, log files and also installs the required .NET assemblies.
This completes the Data Quality Server installation.
Next, grant DQS Roles to users and make your data available for DQS operations.
Step 3: Launch the Data Quality Client from "Start" – "SQL Server 2012" - "Data Quality Services" - "Data Quality Client"
Step 4: Connect to the instance on which Data Quality Services is installed. (Click on the down arrow to View and select the server name.)
Step 5: Data Quality Services Client
After selecting the server and clicking on "Connect", the Data Quality Services Client opens as shown in the following figure.
The Client has sections for Knowledge Base Management, Data Quality Projects and Administration.
Knowledge Base Management: This has an option to create a new Knowledge Base, open an existing Knowledge Base and list the recent Knowledge Bases.
Data Quality Projects: This has an option to create a new Data Quality Project, open an existing project and list the recent Data Quality Projects.
Administration: This is for Configuration and Activity Monitoring (data quality projects) of the Data Quality Services.
Step 6: Configuration and Data Market Account Id Setup using Azure Cloud Services
Open the Administration Section to configure the Data Market account Id. The Online Data Market reference Data Service Account Id must be specified here. The Data Market Account Id can be retrieved from Azure Data Cloud services.
Options to create a Data Market Account id are present in the same screen.
Use the following procedure to create a Data Market Account Id:
- Logon to "http://dqs.datamarket.azure.com" and get the Market Id, against which you need the Data validations.
- Sign in with the Microsoft Live id.
- Select the language and Region, complete the registration and get the Account id details.
The Account Key and Customer Id can be retrieved from the "My Account" - "Account Information" screen.
- Fill in the generated account Id in the Data Quality Services Client.
Step 7: After completion of the configuration click on Close that will return you to the Home Screen.
Step 8: Data Quality Project Creation
Click on "Create New Data Quality Project" then specify a name of your choice
Use the DQS Data as the Knowledge Base. There are multiple domains that are provided by DQS Data, like First name, Last Name and Country/Region Validations. Select the Domain and Activity that you want to Validate. (Here US- Last Name is selected for name validations and “Cleansing” Activity is selected.)
Step 9: Data Quality Project - Mapping, Cleansing, Managing and Results Export
Once the project is created, a tabbed interface is presented for Mapping, Cleansing, Managing and Results Export.
Tab1 - Map
We need to specify the Data Source, Database name and Table that we want to validate and the column that we need to validate. When the table is selected, the column names can be chosen from the Mappings Section.
Map the column to the Domain Validation. (Here the Last Name column of the Person table is mapped to US- Last Name.)
Click on "Next" to go to Cleanse.
Tab2: CleanseThis is the heart of the Data Quality services. This is the Profiling of Data Activity.
Click on "Start" to start the cleansing. This moves the table data into the DQS Staging area and completes the profiling for the Last Name Column.
The Profiling helps in giving the Correct Values, Suggested Values, Completeness and Accuracy of data.
The profiling also displays the statistics of the Record Count, Correct Records, Corrected Records, and Suggested Records.
Once the Profiling of records is completed, the status of the Processing/Cleansing is displayed in the same tab.
Click "Next" to view the results.
Tab3: Manage and View ResultsThis Tab helps in viewing the detailed results (record level) of the profiling. It lists the Suggested Values, New Values, Invalid, Corrected and Correct Records.
Click "Next" to export the results.
Tab4 – Export
Export has options to export to a table or a file. Select the option and corresponding attributes to export the data. (Here export to database is considered.)
Once the data is exported to the database we can view it in the database (Query Viewer of SQL Server Management Studio).
By completing the preceding steps we can do Data Quality Profiling and get reports of the data cleansing (Valid, Invalid, Correct, Corrected and suggested).