In this article we will know about Page Split and SSIS Performance.
Links to other Labs of Learn MSBI Step by Step.
Learn MSBI Part 1:- Creating SQL tables, Practicals of SSIS-ETL(Extraction, Transformation & Loading)
Learn MSBI Part 2:- Performing Validation on Records
Learn MSBI Part 3:- SCD (Slow Changing Data)
Learn MSBI Part 4:- Fact and Dimension.
Learn MSBI Part 5:- CUBE Creation
Learn MSBI Part 6:- Displaying reports using SSRS
Learn MSBI Part 7:- DQS (Data Quality Services)
Learn MSBI Part 8:- Data Profiling Task
Learn MSBI Part 9:- Pivot, UnPivot and Aggregation
Learn MSBI Part 10:- Execute SQL Task component (SSIS)
Learn MSBI Part 11:- Send Mail Task & Script Task Component
Learn MSBI Part 12:- CDC Concept using SSIS
Learn MSBI Part 13:- Asynchronous, Synchronous, Full-blocking, Semi-blocking & No-blocking
Learn MSBI Part 15:- Debugging in deployment – Instrumentation using Data Taps
Learn MSBI Part 16:- Using Multi-threading and its performance in SSIS program
Page Split and SSIS Performance
This article will deal with much more in SQL Server where discussion will be done on concept of page split and prime focus will be on SSIS performance which should not degrade. A good SSIS developer is the one who also has in-depth SQL Server knowledge like how data is stored in SQL Server, how clustered index works internally, how indexes helps to increase performance and concept of page split etc.
Understanding working of Indexes
Normally whenever a search operation is executed on SQL table it goes and search for the record in sequential manner. For example as shown in the image down below where search is done for the record 9. In general form, in order to reach till record no. 9 its scan will go in sequential form like 1, 2, 3 till 8 and it will find 9. Now consider a case where there are thousands of records under the SQL table and if user searches for a particular record then it has to do thousands of scan till it reaches the user’s desired search record in sequential manner which will make entire search process extremely slow and things would be quiet inefficient.
In such scenarios, where especially comes to search option if indexes are created then process becomes more faster. For that internally indexes uses “B-Tree structure” which means balance tree structure where it create non-leaf and leaf node. Non-leaf does the grouping and leaf node has values of that group as shown in the following. So with indexes which uses concept of “B-Tree structure” search becomes easier and simpler.
Again now consider the same scenario where the end user searches for 9th record of the SQL table so first it will check whether 9 is greater than 5 and after finding the answer if “Yes” then it will skip first non-leaf group which holds ‘1-5’ group and then it will search in next non-leaf group which is of ‘6-10’. Here in ‘6-10’ non-leaf group will now do a 3 sequential records scan which is 6, 7, 8 and then it will reach 9 which is the user’s search, as now search is found scanning of record will stop. So with this approach search becomes simpler, easier and faster.
All data in the SQL server is stored and retrieved in the form of pages. B-Structure and leaf nodes of the index have the physical address of the data which is stored on index pages while actual data is stored on the data pages.
These SQL Server pages are built of size 8 KB internally where data is actually stored, to know more about how does data gets stored and looks like refer to our below video: –
Coming back to the article, each page of 8 KB means 8 x 1024 bytes = 8192 bytes (approx. 8000 bytes will be saved on each page) and looking at the SQL table structure where our field data type requires 3000 bytes to store each record as shown in the image down below.
So maximum two records of 3000 KB can be saved on each page rest space is required to save other essential information like header and offset info of that page. And then third record would be saved on second page. This phenomenon of managing pages to save data is called page splitting.
Now consider a scenario where there are thousands of records coming from input source file under the ETL process of SSIS then page split will happen more and in turn the whole input process will become slow.
Detect Page Split
Page split happens only on clustered index because here leaf-node points towards data page and not on non-clustered index.
Detect table with clustered index or not
First check whether existing table is with clustered index or not for that go to SQL table under management studio do a right click and do click on Design which will open the table and table do right click and click on Indexes/Keys which will open new window of Indexes/Keys. Under it check the field name “Create As Clustered” it is marked as “Yes” which means that existing table is created as clustered index.
Please Note: Default value of primary key field is marked as “Yes” only which state that the table is clustered.
After identifying that table is non-clustered or clustered next is to see how to make input process faster.
For that first mark existing primary key of the table to set to field “Create As Clustered” as “No” and then close the window of Indexes/Keys and click to save option in order to save changes done to the table.
Then on SQL table do right click on the “Name” field and click on Indexes/Keys which will open its window and here now click on “Add” and then mark field “Create As Clustered” as “Yes” as shown in the image down below, finally click on Close.
This is done because it is a best practice to keep primary key and clustered index as separate.
One more thing would like to bring in notice that normally page split are checked on production server. And this production server is governed under Administrator who are very much protective about the environment and under that installation of third party software are not allowed or restricted things have to done with limited resource each and everything requires permission of higher authority
Under such case operating system inbuilt tool can be used called “perfmon” which is also known as performance monitor in order to detect performance of the application and also to detect number of page split.
To open the “perfmon” tool open run command window for that on keyboard while pressing window icon key then press ‘R’ and then write “perfmon” on run command window and press enter. It will open performance monitor window on which processor activity can be seen which is default activity.
Next task is to add page split performance activity monitoring for the SQL Server. For that first click on Add Counter icon available on the top and then select SQL Server version which is currently being used under it select “Access Methods” and under that select “Page Splits/sec” and then click on the Add button available at the bottom. Once counter is added under Added counters on the right side screen it will show Page Splits/sec and then click on OK.
Now once the counter of page split is added it will show performance screen. Now go to SQL table and then delete all existing record values and make fresh entry of the records by keeping performance monitor also open in parallel to it after adding 2 new records on performance monitor window a spike generated can be seen then again keep on adding new more records and then again new spike will be seen as shown in the image down below. These spikes generated on performance monitor screen is nothing but an indication of page split is being done. More the spikes generated means more page split is being occurred & process will slow down while performance will degrade.
Running Under SSIS
Now under SSIS package think for a while where there is bulk insert or update operation is being carried out as shown in the image below
Once the file with huge records is imported under Flat File Source and configured with OLE DB Destination as shown above. Next execute the package and check imported huge records in parallel with performance monitor as shown in the image down below and check effect on graph.
After the package is successfully executed on performance monitor there is a huge spike being displayed as there are large number of page split happening. For almost 465 rows there has been more than 200 page split being done which reduces the performance of the application and turning it down to slow pace.
Include Index and make Input process fast
Next will see by including indexes but at the same time will also make sure that the performance won’t be degraded even after having page splits.
In order to achieve the same will introduce two more new components of Execute SQL Task on Control Flow one before Data Flow Task and second one after Data Flow Task. With first Execute SQL Task will configure it such a way that it will drop the indexes before loading import files and then with second Execute SQL Task will again create indexes with this approach index will also remain and performance of the input process won’t be degraded as well.
Following is the image snapshot which has created on SSDT tool with above written thought process.
Now configure Execute SQL Task component for that follow below steps: –
1) Double click on it or do right click and click in Edit to open its Editor.
2) Click on the General tab.
3) Select the connection type, if connection type is not available after clicking on dropdown then click on New Connection to establish new connection. Please note steps to create New Connection would be same as demonstrated in the earlier labs.
4) Write following SQL statement “DROP INDEX test.IX_test” to drop index.
5) Connect the output of component “Drop Indexes” as an input to “Data Flow Task” and similarly connect output of “Data Flow Task” as an input to “Create Indexes”.
6) Double click on “Create Indexes” component or do right click and click on Edit to open its Editor.
7) Click on the General tab.
8) Under General pane, select the connection type from the dropdown option.
9) Write following SQL statement “CREATE CLUSTERED INDEX IX_test ON tbltest(Id)” to create index.
10) Set ByPassPrepare field as False.
Once the components are configured next is to execute the package along with performance counter to view the results.
After executing the package successfully with parallel running of performance monitor a smail spike generated can be seen which means page split is done but with less numbers as we have dropped index before loading of file and then again created index for it.
This states that performance is not much affected as page split is less.
For testing purpose now just disable both Execute SQL Task component and just execute the package and see the results.
After the package is executed a clear huge spike can be seen which is of creation of large number of page splits and ultimately leading to degradation of the input process performance.
This conclude when it comes to the performance as a best practice of the ETL process where large bulk file i.e. where there are thousands of records needs to be imported under such circumstances first drop the index to avoid page splits then load the file and finally re-create the index again in order to sustain the performance. So with this lab we are successfully able to demonstrate creation of index for bulk file with less hampering of performance.
Also see our below Learn MSBI Step By Step project series videos: -