In this article we will explain about DQS (Data Quality Services)
Links to other Labs of Learn MSBI Step by Step.
Learn MSBI Part 1:- Creating SQL tables, Practicals of SSIS-ETL(Extraction, Transformation & Loading)
Learn MSBI Part 2:- Performing Validation on Records
Learn MSBI Part 3:- SCD (Slow Changing Data)
Learn MSBI Part 4:- Fact and Dimension.
Learn MSBI Part 5:- CUBE Creation
Learn MSBI Part 6:- Displaying reports using SSRS
Learn MSBI Part 8:- Data Profiling Task
Learn MSBI Part 9:- Pivot, UnPivot and Aggregation
Learn MSBI Part 10:- Execute SQL Task component (SSIS)
Learn MSBI Part 11:- Send Mail Task & Script Task Component
Learn MSBI Part 12:- CDC Concept using SSIS
Learn MSBI Part 13:- Asynchronous, Synchronous, Full-blocking, Semi-blocking & No-blocking
Learn MSBI Part 14:- Understanding Page Split and maintaining SSIS Performance
Learn MSBI Part 15:- Debugging in deployment – Instrumentation using Data Taps
Learn MSBI Part 16:- Using Multi-threading and its performance in SSIS program
DQS (Data Quality Services) is all about giving end user such a tool from where data coming is proper i.e.
bad data is itself stopped by end user to get into the system and only good or
quality data is in.
To understand the exact meaning for the above text mentioned let make clearer of an idea how that can be done. Consider below example where data has been randomly entered into the sheet where for same product under the list there are different names for each case like
Cello Gripper, Lexi and Pen are -> Pen so they should be clubbed together
Nataraj, Staedler and Pencil -> Pencil
Kores and Rubber -> Rubber(eraser)
And such data is best judged by end user so such stationary or the particulars rather than putting it differently it can listed as one entity so there are chances to reduce bad data content entering the system. Below is the image which depicts the same: -
- With DQS such authority can be given to end user by rectifying bad data and entering only correct data into the system
- End user will correct bad data or remove bad data based on his/her knowledge
base and this knowledge base should be improved over a period of time.
- And last some approval mechanism should be in place for the good knowledge
base which is implanted to keep good data and rectify bad data.
Once DQS tool is enabled into our SQL server then using updated knowledge
base system end user can distinguish between good data and bad data.
Next this article will demonstrate complete practical steps for DQS
Very first step to do installation of DQS from SQL Server version available
in the system.
For that go to Start -> Menu -> All Programs then expand SQL Server folder
under that expand Data Quality Services and under to that click on Data Quality
Server Installer.
Once clicked on the "Data Quality Server" installer it will open installer in command prompt as shown in the image below
It will prompt to set password for Database Master Key. Set here password
with mixed combination of keys with uppercase letters, alphanumeric, numeric and
special characters.
Please Note : For installation of DQS will take
maximum 5 to 10 mins depending on hardware of the system on which it is
installed.
Once password is set then it will proceed for installation and once the installation is done it will show successfully executed screen of DQS
as shown in the image below.
After the installation is done go to SQL Server Management Studio and under Databases it will show DQS installed with following databases under it :-
DQS_MAIN: It is the most important databases out of all three.
Knowledge base actually gets saved into it.
DQS_PROJECTS: So whatever knowledge base is saved under DQS_MAIN is used and applied on the data also it gets saved under DQS_PROJECTS.
DQS_STAGING_DATA: Data uploaded from various sources like excel, CSV
comes to DQS_STAGING_DATA on temporarily basis over here data is reformatted
again and goes into database.
Please Note :- With these DQS database table a user is not directly use here introduction is done so that in future if server migration is needed to be done then at that time required migration database should be known.
All these three DQS database should be migrated from server to server.
It is highly recommended to not to directly use these three DQS databases
instead make use of Data Quality Client which is used to work on DQS.
In order to open DQS Client go start menu of the system -> All Programs ->
expand Microsoft SQL Server 2012 -> expand Data Quality Services -> Data Quality
Client.
It will open the DQS client start up screen as shown in the image below :-
After being asked for Server name, provide the server name where the database of DQS is available. If not sure about the server name then go to open SQL Server Management Studio click on File a Connect Object Explorer copy server name from here and paste it on the SQL Server DQS
screen and then click on "Connect" as shown in the image below.
Once proper server name is given SQL Server DQS it will get connected and then open following screen. Here under DQS there are three sections :-
Knowledge Base Management - DQS first will create Knowledge base then
Data Quality Projects - this Knowledge base can be utilized by creating a project and
Administration - once project is created finally it will be
administered/monitored the progress of the project.
As this is a new project, first a New Knowledge Base will be created
by clicking first option under Knowledge Base Management.
It will open screen to create new Knowledge Base Management give it a
suitable name here name given to it is "CustomerClear" and then select Activity
as "Domain Management" followed by clicking "Next" to move to following screen.
Following screen is to create new Domain Management, for that click on icon of creating new domain which will open new screen called
"Create Domain". Give a suitable name to domain, here given as "StationaryName"
uncheck "Enable Speller" as spell check will be done manually in this case and
then click on OK.
Once domain is created for that next is to define rules and establish relations. So starting with first tab named
"Term Based Relation" under which it will correct input values to a common synonym like Cello Gripper and Lexi are pens so their common synonym is pen so it will be put under pen and similarly it is for Nataraj and Staedler
are pencils and for it common synonym is pencils.
For that click on the Add icon and manually enter the values as shown in the
image below.
Next tab is of "Domain Rules", in other words this tab will help to set
validation on the fields by using domain rule which will be applied on input
field.
For that click on Add Icon to add new rule then give suitable name and add description to it. Thereafter click on Add icon below given to add expression. Here set expression to
Values does not match a regular expression = [a-z]{1,20}
Above expression states that it will expect only input values having only
alphanumeric values (special characters and numeric characters not allowed) with
minimum length of 1 and maximum 20 characters will be accepted.
Next tab is of "Domain Values" for time being it is going to be kept unchanged as this is required when there are multiple projects.
And in this example there is only one project.
And last tab is of "Reference data" this is also not going to be used as it
will be required when other data provider are referred too.
Thereafter click on Finish and then finally click on Publish.
Once it is published it can be seen on the main section at the bottom of Knowledge Base.
Thereafter it can be edited any time by clicking Domain Management.
Next after Knowledge base it created then create Date Quality Projects by
clicking on "New Data Quality Project".
After Data Quality Projects screen is open the give suitable name to it and add description to it.
Then select the Knowledge base created from the dropdown and finally before
clicking on Finish select activity as Cleansing.
Next Data Quality Projects is four steps process and for each step there are
tabs: -
1. Starting with first tab called "Map" under which selection of the database
(in this case excel file will be selected) is going to be done to cleanse
against mapped domains.
- Under Map tab select "Excel fiie" as data source from the dropdown.
- Select Excel from location of local drive just by clicking on Browse.
- Under Mappings set Source Column –> StationaryName(String) and Domain –> StationaryName
from dropdown option available under Mappings.
- Then click on Next to move on to following tab.
2. Following tab is of Cleanse, here click on Start to perform cleanse the file selected as Data Source once it has done the process of cleansing it will show the Cleansing records in percentage how much it is successful and also it will show Source Statistics.
Once found everything is correct then click on Next.
3. In the third tab called Manage and View results is to set Approval for all the record selected from datasource excel file starting with first correct records from
"Invalid" tab as shown in the image below. Set record "909890" as correct to "Rubber"
& set null record "DQS_NULL" correct to "NotKnown".
Since now these invalid record has been corrected click on Approve available
button at the right side.
And other record available under "New" tab mark as Approve and then click on
Next.
Finally is the time for setting Export tab, before exporting do a final review of Output data which is available on the left side. Once review is done and found output data is correct then on right side select Destination Type as
"CSV File" for the output and browse the output to save it on local system.
Also select the Output Format as "Data Only" and click on Export.
Once the output is produced in CSV format go to the location on local hard drive and open CSV file and get the output file compared with old excel file named DQS file.
Notice the changes which has been done on old fields & resulting the change as
it should be which is clearly notice on new output file.
And the third section is of Administration which will work on earlier created Knowledge Base and Data Quality Project.
Here under Administration monitoring will be done on Data Quality activities and
also system settings can be done.
Once clicked on Activity monitoring it will open Administration screen from where activity can be track like is it active or inactive, elapsed time and start or end time.
Activity of both knowledge base and Data Quality Project can be checked.
Also get to see following video on MSBI project series which start from the basic to start SSIS : -