In this article we will explain how to use Multi threading and its performance in SSIS program.
Links to other Labs of Learn MSBI Step by Step.
Learn MSBI Part 1:- Creating SQL tables, Practicals of SSIS-ETL(Extraction, Transformation & Loading)
Learn MSBI Part 2:- Performing Validation on Records
Learn MSBI Part 3:- SCD (Slow Changing Data)
Learn MSBI Part 4:- Fact and Dimension.
Learn MSBI Part 5:- CUBE Creation
Learn MSBI Part 6:- Displaying reports using SSRS
Learn MSBI Part 7:- DQS (Data Quality Services)
Learn MSBI Part 8:- Data Profiling Task
Learn MSBI Part 9:- Pivot, UnPivot and Aggregation
Learn MSBI Part 10:- Execute SQL Task component (SSIS)
Learn MSBI Part 11:- Send Mail Task & Script Task Component
Learn MSBI Part 12:- CDC Concept using SSIS
Learn MSBI Part 13:- Asynchronous, Synchronous, Full-blocking, Semi-blocking & No-blocking
Learn MSBI Part 14:- Understanding Page Split and maintaining SSIS Performance
Learn MSBI Part 15:- Debugging in deployment – Instrumentation using Data Taps
In SSIS, whenever data moves from source to destination is in sequential manner it can also be termed as traditional way. Data is a passed from SQL Data component chunk wise in sequential manner to Derived Column and then it is processed also in sequential one by one into Derived Column. Suppose that there are millions of records coming from SQL table and as it reaches Derived Column component in sequential manner one by one which is going to be time consuming process and entire process will slow down and degrade the performance of the overall SSIS program.
To monitor and improve performance of the package we have written altogether a separate article, incase if you not read then here is the link for the same <<>>
Coming back to point of our existing article which is currently planned, how about making of sequential process into parallel process. Can it be done? …”Yes”, if yes… then how? So on……
One more thing would like to tell that there are various ways to achieve multithreading done. In this article we will discuss those different approaches and compare their time taken to accomplish the process (performance). Readers of this article can then decide approach which suits them the best for project by applying their knowledge skills and creativity to get best performance.
Let’s discuss it out in more detail, in SSIS there is also a provision to change sequential manner moving data to parallel with the help of multithreading.
SSIS has capability and can run up to 10 parallel threads or it can accomplish 10 parallel task at a time. It can be seen under the Data Flow pane properties by doing a right click on Data Flow pane as shown in the image down below: –
Multithreading is a concept with which multiple process can be made to run with parallelism within SSIS package so that task runs simultaneously to get work done. In order to make things running parallel and make optimum use of engine threads SSIS created logic should be in that way to support parallelism. For example above shown image of SSIS package has sequential logic written so now even SSIS engine threads supports 10 parallel threading it won’t be able to use that. So in order to effectively use power of multithreading design SSIS logic correctly.
Next will correct SSIS design logic to make it work as parallel rather than sequential for that remove current data path and introduce one more similar Derived Column which will process 50% of the data to make task parallel also along with that introduce Conditional Split component from Data Flow Transformation option available under SSIS Toolbox.
Conditional Split component will divide data into 50-50 percent and distribute data equally into both Derived Columns components. This arrangement done is termed as partitioning
Please Note: In order to achieve multithreading in SSIS first thing is to do partition of the data.
Once the partition of the data is done then SSIS will run each thread on each partitioned data. So depending on the partitioning logic threads will executed accordingly.
Next will configure the Conditional Split component in order to transfer 50-50% data to each Derived Column.
1) Drag the output of SQL data component as input to Conditional Split component.
2) Configure the component by doing right click on it which will open its Editor window. Here drag and drop column ‘id’ on Condition and write conditional logic as below for first Derived Column
Above written logic states that it will have first output half of the total send from SQL Data component. Its output is named as “Part1”.
Similarly for second output write the conditional logic as below: –
3) Drag one output of the Conditional Split component and drop it as input to first Derived Column which will open output selection window.
4) Here once the window is open select output as “Part1”.
5) Click OK to close the window.
6) Drag second output of the Conditional Split component and drop it as input to second Derived Column which will open output selection window.
7) On this window select output as “Part2”.
8) Click OK to close the window.
Now run the package and see how the package is getting executed, parallel or sequential. From the below image it clearly indicates that still executing process is parallel. Conditional split will execute first 50% of Part1 data path output and then it will execute remaining 50% of Part1 data path.
In short this approach after doing partition also this is not serving purpose of parallel processing that is multithreading.
And in order to check the SSIS performance which means how much time it has taken to process the package will go and click on “Progress” tab to check on time elapsed. To see the details clearly just copy the line on which time is mentioned on notepad.
From here on will note time elapsed for each approach on notepad and name this notepad as ‘comparison’. With this notepad it will help to compare each approach time elapsed. Below is the image of the notepad after copying value from progress tab of SSIS, it has consumed 46 seconds.
As our first approach of achieving multithreading was not successful next here will see second approach to apply multithreading and also notice its performance.
For that design structure of the SSIS will keep the same and just make a small change in the conditional logic of the Conditional Split component. Here will split the EVEN and ODD data with the help of conditional logic.
With this changes made in the conditional logic parallel processing will start immediately. For that edit and change ‘Part1’ data path Output Name conditional value as (id%3)==0. This set conditional logic means that data passing will be divided by 3 and if it has a remainder 0 then it will treated as ODD data and then it will be send from data path ‘Part1’.
Similarly to send EVEN data from data path output ‘Part2’ set conditional value as (id%2)==0 where passing data will be divided by 2 and if it has remainder 0 then it will be considered as EVEN data.
Once the conditional split condition logic are declared then on Data Flow pane introduce one more Derived Column which will handle the data whose remainder does not became 0 after dividing it by 2 and 3. So in order to handle those non divisible data third Derived Column component will be used.
Now as all the components are configured next is to run and execute the package. After the package is executed from the below image it is clear that process task is processing parallel as data is simultaneously going in each Derived Column component.
After the package is executed successfully in order to see time elapsed go to progress tab and copy the time elapsed on the comparison which was created earlier during ‘Approach 1’.
Once time elapsed is copied on the comparison notepad as ‘Approach 2’ which is 28 seconds almost half of the time which was done for ‘Approach 1’. So conclusion can be done ‘Approach 2’ is better than ‘Approach 1’.
It is good to see that task executing in parallel and also performance improvement in ‘Approach 2’ with compare to ‘Approach 1’. Over here if condition logic is running on Conditional Split component in memory within SSIS.
But still there are some issues in this approach where Conditional Split will be loaded heavily if millions of records are send from SQL data and can degrade the performance of the SSIS program.
How about thinking an approach of doing the partition at SQL Server so that partition can be avoided in SSIS program and component within do not get overloaded. In the next will see effect on performance.
To demonstrate this first thing is to remove existing partitioning logic from SSIS program and create new partition on SQL Server OLEDB command in order to achieve parallelism.
For that first do the calculation on excel sheet divide it into 5 partitions so we have all in total 433735 records so to distribute into 5 partitions it needs to be divided into 5 parts. Below image shows the calculations of all 5 parts and range of records which will inserted into each partition.
By removing Conditional Split component and other additional Derived Column component also rename source component “SQL Source” to “Part 1” thereafter just edit the OLEDB Source Editor as shown in the image down below. Once the source editor is open then under connection manager edit existing written SQL command
where (id>0 and id<86747)
Similarly copy from existing OLEDB Source component along with Derived Column and paste it on Data Flow pane in such a way that 5 OLEDB Source components with 5 Derived Columns are seen.
Now rename OLEDB Source components as Part 2, Part 3, Part 4 and Part 5 and then edit each OLEDB Source components then configure it by setting data range according to each partition. Following table depicts data range for each partition.
||Data range for partition
||0 to 86747
||86747 to 173494
||173494 to 260241
||260241 to 346988
||346988 to 433735
Once all partition are configured according to set data range next is to go and execute the package check its time elapsed for performance comparison with previous two approaches.
After executing the package it will immediately start process all the task simultaneously in parallel.
After successfully execution of the package will then see all components marked with green color. Now go on the Progress tab in order to note the elapsed time taken process out approach 3.
If noticed load earlier which was happening in Conditional Split component of SSIS program now it has shifted to SQL Server in the form of queries getting fired every now and then to execute partition logic.
Here SQL select statement is fired and 5 trips is going between SQL Server and SSIS program which again may lead to degradation of performance.
Time elapsed for this approach is copied from progress tab on comparison notepad which is 29 seconds which is almost similar to Approach 2 seen earlier.
Till now under all of approaches seen above had multithreading operation were carried out on Data Flow task using its properties of Engine Threads whose default property is 10 which states that 10 threads can run at a time. In this approach will use the Control Flow task and do multithreading using it properties called MaxConcurrentExecutables and then compare its performance.
For that go to Control Flow and paste the existing Control Flow twice where will divide 5 running partition of Data Flow in 3 parts which will be 2 Control Flows will have 2 each partitions and third Control Flow will have 1 partition within. And do not forget to change Control Flow properties MaxConcurrentExecutables –> 3 then three Control Flow task will execute simultaneously.
So below is the image of the multithreading arrangement of Control Flow: –
Execute the package now which will process all 3 control flows simultaneously i.e. parallel and then observe time elapsed under progress tab of the SSIS program also copy the details on comparison sheet to notice the difference.
It took almost 30 seconds to accomplish the task when multithreading held at Control Flow.
Conclusion: Practically speaking with this approach there is no much difference seen at multithreading carried between Data Flow and Control Flow task.
Please Note: Whenever doing a check for performance always perform task running 2 to 3 times in order to conclude best result. Single reading sometimes may go wrong for safer and strong conclusion in search for accurate result always go for multiple readings options.
In this approach will see another way of doing the partition and this time partition will be actually done on SQL Server. In order to create partition on SQL server there are three important steps which need to be followed as shown in the below image.
1) Create Partition: Write a logic code which for creating 5 partitions with name “MyFivePartition” for our data range.
|Data range for partition
|0 to 86747
|86747 to 173494
|173494 to 260241
|260241 to 346988
|346988 to 433735
2) Creation Scheme for partition: Give as suitable name to partition scheme, here name given is ‘psSample’ and it will be saved on the primary location of the SQL Server.
3) Apply Scheme on Table: Once the Scheme is saved it will be then be applied on specific column of the table in order to create partition. In this scenario will apply it on ‘id’ column and on the table named ‘tbo.tblSource’.
Once above mentioned steps are followed internally 5 partitions will be created on the id column of the table “dbo.tblSource”.
Now in order to see each value of the partition created just execute the below written query.
SELECT count(*) FROM dbo.tblSource
WHERE $PARTITION.MyFivePartiton(id) = 1;
When id value is changed manually it will show each partition data record. Value will be going to be same as records are equally divided into 5 parts.
Next inside SSIS program go and update the OLEDB command in editor wherever data range written with id of each partition as shown down in the image below. Go and change manually for each partition and put the partitioning number for all 5 partitions.
After finishing editing all OLEDB source command component by removing range of the data records with id of the actual data partitioning next is to go and execute the package and see its elapse time for performance gain benefits.
Execute the package and after all components are seen in green color which states that package execution is successful go to progress tab to note its elapse time and get it on comparison notepad.
Elapse time comes upto approximate 26 seconds when partitioning was on SQL table which is near to all our approaches done earlier. Though the value is very near but less as compare to others.
This approach is good as partitioning has moved from the program to SQL at the backend. With this approach there won’t be direct impact on the SSIS program directly but again it has drawback that
1) 5 times SQL query will be fired by SSIS program.
2) As partitioning is in SQL and if SQL server is in remote location then there will be network trip and latency facing issue.
Still this thought approach is better from current SSIS program perspective but again can vary from program and end requirement. So do think over and then apply appropriate approach to the project.
Hope that with this article you have got exposure to various approaches of multithreading find the best which suits to your need apply it and if require custom it.
Do not forget to take readings atleast 3 to 5 times of which ever approach you use. Multiple readings will help you to get performance results get accurate.
Also see our below Learn MSBI Step By Step project series videos: -