Fragmentation in SQL Server| Internal and External Fragmentation

Neerajprasadsharma
Posted by in Sql Server category on for Advance level | Points: 250 | Views : 8643 red flag

In this article you will have a clearer understanding about fragmentation in SQL Server, how it occurs, what are the causes and some problem related to it.

Introduction

In this article you will have a clearer understanding about fragmentation in SQL Server, how it occurs, what are the causes and some problem related to it.
This article is dedicated to SQL Server fragmentation only, if you are following the current long running series, then you know, we wrote in the index internals article about basics of fragmentation, but as the series grows, we need to understand everything about fragmentation as it will help understand the complete series.

Background

Like all my articles, this article will be filled with practical based knowledge only. I will start with the same article I linked when I was trying to explain about SQL Server fragmentation so here as well, please read below.
 "There are two different types of fragmentation in SQL Server: Internal and External.  Internal fragmentation is the result of index pages taking up more space than needed.
 It is like having a book where some of the pages are left blank; we do not know what pages are blank until we read the entire book and the same applies to SQL Server, which has to read all the pages in the index wasting extra-time and server resources in the empty pages. External fragmentation occurs when the pages are not contiguous on the index. Following the book analogy,it is like having a book where pages are not ordered in a logical way (page 1, then page 2, then page 3 and so on) causing you to go back and forward to compound the information and make sense of the reading. Heavily used tables, that contains fragmented indexes will impact your database performance. "
Ok, let us  try to understand it with an example, suppose in SQL Server a page can hold only 6 pages, for every 7th row Storage Engine has to assign a new page in the storage devices. 
The storage engines always try to insert the SQL server index data page in a contiguous sequential manner, for example:
Suppose a user has a unique index on the ID column and insert data like 2,4,6,8,10,12,16,18,20,22,24,26,28,30,32 then  most probably you won't see any external fragmentation (for efficient, faster insert Some time storage engine can create a little external fragmentation,  but that is negligible) because Storage Engine takes care of that, but still you will find little internal fragmentation because when the some pages  are Not full by the data, technically it is internally fragmented right?
Look into the image below:



Look at how the data in the index are nicely continuing physically, but what if user insert 3 in the index?
There would be two options first inserts the row into the index at the right place and rearrange all the index accordingly.
That would be certainly very-very expensive approach, so the better approach would  be to allocate a new page and link it to Page 1 and 3, move the data accordingly.
That is what Storage Engine does it allocate a new page for index, move approx half of rows in the new page and link it back.

Look at the image below:

 

User inserted row 3 in the index, since id 3 should be in the page 1 and there is no space left on the page  so storage engine has allocated page id 4 and move half of the data in the page id 4. Now technically the index has fragmented internally and externally.

How internally? Because first page which has a capacity of containing 5 rows now only contains 3 rows in page id 1 and 4, and externally because now the pages are not physically contiguous, so if a user requested rows in the key order, storage engine has to move forward and then come backward to get IDs in order.

Thus, when the index is externally fragmented you will have random IO and random IOs are pretty slower than sequential IOs specially in the conventional rotating disks. I am quoting BOL on that and want readers to visit the link on that

"Accessing data sequentially is much faster than accessing it randomly because of the way in which the disk hardware works. The seek operation, which occurs when the disk head positions itself at the right disk cylinder to access data requested, takes more time than any other part of the I/O process. Because reading randomly involves a higher number of seek operations than does sequential reading, random reads deliver a lower rate of throughput"


Set up the table to test fragmentation in SQL Server

We will try to use the almost same example, what we tried to explain above, but we will create a large table because Storage Engine doesn`t take care of fragmentation in the small indexes for obvious reasons.

As we used  the number table to populate the table in this series, again we will use the number table to populate the table, but this time the table will have even rows to match the above example.

Script below:

 

CREATE TABLE [DBO].Fragmented (
Primarykey int NOT NULL,
Keycol VARCHAR(50) NOT NULL
,SearchCol INT
,SomeData char(20),
 
Keycol2 VARCHAR(50) NOT NULL
,SearchCol2 INT
,SomeData2 char(20),   
 
 
Keycol3 VARCHAR(50) NOT NULL
,SearchCol3 INT
,SomeData3 char(20))
 
GO
 
 
INSERT INTO [DBO].Fragmented
SELECT
n,
 
n,
n
,'Some text..'
,
n,
n
,'Some text..'
 
,
 
n,
n
,'Some text..'
 
FROM Numbers WHERE N/2.0=N/2
GO
 
 
CREATE UNIQUE CLUSTERED  INDEX  CI_Fragmented ON  [DBO].Fragmented (Primarykey)
GO
 

Ok, the table has created with even clustered index IDs so first look at the external fragmentation and internal fragmentation  using sys.dm_db_index_physical_stats  DMV, in this DMV external fragmentation shows as “avg_fragmentation_in_percent “column and internal fragmentation show as “avg_page_space_used_in_percent”.

Let us run the query against our newly created table and look at the fragmentation.

SELECT OBJECT_NAME(OBJECT_ID) Table_Name, index_id,index_type_desc,index_level,
avg_fragmentation_in_percent fragmentation, avg_page_space_used_in_percent Page_Density, page_count,Record_count
FROM sys.dm_db_index_physical_stats
(db_id('tutorialsqlserver'), object_id ('Fragmented'), null, null,'DETAILED')
Where
Index_Level =0 and Index_Id=1


There is no external fragmentation in the index and even all pages all almost fully occupied nicely.

So, we can start. Earlier we use to paste the DBCC IND command output into the excel find and then we use to filter the results accordingly to the requirement, but  since I as passionate in SQL server not in Excel so I decided to create a table with a couple of stored procedures to extract the result of DBCC IND command and insert it in the INDOUTPUT  table.

You can get the script from here. So, let us  get the DBCC IND command result in the INDOUTPUT table using our stored procedure "REFRESH_INDOUTPUT".

EXEC REFRESH_INDOUTPUT  'TutorialSQLServer', 'Fragmented', 1


 

Ok, great the table has refreshed, now let us look into the first three data pages in the index.

Select   Table_Name,  PrevPagePID, PagePID, NextPagePID from INDOUTPUT
Where   IndexLevel = 0 and PrevPagePID = 0
Select   Table_Name,  PrevPagePID, PagePID, NextPagePID from INDOUTPUT
Where   PagePID=264129
Select   Table_Name,  PrevPagePID, PagePID, NextPagePID from INDOUTPUT
Where   PagePID=264130

 

 Look into the above result all the first three page id and next to each other this is the same condition as we described in the introduction section.Let us go into one step deeper and look into the index pages using DBCC IND COMMAND.


DBCC PAGE ('TutorialSQLServer',1,264128,3) WITH TABLERESULTS
GO
DBCC PAGE ('TutorialSQLServer',1,264129,3) WITH TABLERESULTS
GO
DBCC PAGE ('TutorialSQLServer',1,264130,3) WITH TABLERESULTS
GO







We can clearly see in the output of DBCC IND command, the first data page/leaf level "264128" contains primary key  2 to 160 with its respective data from, next data page "264129" contains primarykey 162 to 316 and data page "264130" contains primarykey 318 to 472 all the pages are nicely externally and internally defragmented.

So let us fragment it by inserting the primarykey 3  in the table and see what happen.

INSERT INTO [DBO].Fragmented
SELECT
n,
 
n,
n,
'Some text..',
n,
n,
'Some text..',
n,
n
,'Some text..'
FROM Numbers WHERE N=3
GO


 

Row with primarykey (key value) 3 has inserted into the table so now as described in introduction section the index should be fragmented,

so let us again look into the output of sys.dm_db_index_physical_stats.

 

SELECT OBJECT_NAME(OBJECT_ID) Table_Name, index_id,index_type_desc,index_level,
avg_fragmentation_in_percent fragmentation, avg_page_space_used_in_percent Page_Density, page_count,Record_count
FROM sys.dm_db_index_physical_stats
(db_id('tutorialsqlserver'), object_id ('Fragmented'), null, null,'DETAILED')
Where
Index_Level =0 and Index_Id=1


 

Look at the result of DMV, now the index is  slightly fragmented, before inserting the above row DMV output  was not showing any fragmentation  and now its showing 0.0579206487112656 percent, and there is a change in page density and page count and offcorse the record count as well.

So as per out theory the second page should be out of order and linked to the first page with approx half of rows should be moved to the new page and this newly created page`s next page should be linked to the second page of the last result.

So let us again used our "REFRESH_INDOUTPUT" stored procedure to get the result of IND command.


EXEC REFRESH_INDOUTPUT  'TutorialSQLServer', 'Fragmented', 1
GO
 SELECT Table_Name,  PrevPagePID, PagePID, NextPagePID from INDOUTPUT
WHERE IndexLevel = 0 and PrevPagePID = 0
--274014
 SELECT Table_Name,  PrevPagePID, PagePID, NextPagePID from INDOUTPUT
WHERE PagePID=274014
--264129
SELECT    Table_Name,  PrevPagePID, PagePID, NextPagePID from INDOUTPUT
WHERE PagePID=264129

 

Great, as expected the second page is out of order, then linked back to the ordered next page.

Let us again look inside the all the first three pages using DBCC PAGE command.

 

DBCC PAGE ('TutorialSQLServer',1,264128,3) WITH TABLERESULTS
GO
DBCC PAGE ('TutorialSQLServer',1,274014,3) WITH TABLERESULTS
GO
DBCC PAGE ('TutorialSQLServer',1,264130,3) WITH TABLERESULTS
GO



 

Look at the results, now the first page contains the newly inserted row, but because there was no space left to insert a new row in the page,

The Storage Engine has created a new page moved half of rows into this newly created page and then linked it to the first and second page, half of the rows moved to a new page with id "274014" you can see there is no change in data of  page id "264129" as it wasn`t required.

 

 

Write performance problem with fragmentation (Page Split)

In this section we will look into a fragmentation related performance problem which occurs during the page split.
For that we will  insert a row and look into the Statistics IO and transaction log activity using sys.dm_tran_database_transactions.

In last test we have externally fragmented the table by inserting the 3 in the Unique Clustered Key value so we are cleaning the table and index by deleting key key value  and alter that we will Rebuild the table, So the Index can be de-Fragmented again and we can play further:)


Delete DBO.Fragmented
Where Primarykey = 3
GO
Alter Table DBO.Fragmented Rebuild
GO


So let us first insert a new row in sequence.

 

SET STATISTICS IO ON
 -- transaction log activity
Begin Tran
INSERT INTO [DBO].Fragmented
SELECT
500001,
 
500001,
500001,
'Some text..',
500001,
500001,
'Some text..',
500001,
500001,
'Some text..'
 
SELECT [database_transaction_log_bytes_used] FROM sys.dm_tran_database_transactions
WHERE [database_id] = DB_ID ('tutorialsqlserver');
Commit





 

Look at the output of statistics IO results, to complete this insert requested SQL Server required 3 logical reads and this transaction has generated 344 bytes  for the transaction log.

Now let us again insert the row, but this time we will insert a row in between the clustered key  so page split can occur.

In the below test we are inserting 3 in the Clustered Index key column.

 

-- Transaction log activity
Begin Tran
INSERT INTO [DBO].Fragmented
SELECT
3 ,
 
3,
3
,'Some text..'
,
3,
3
,'Some text..'
 
,
 
3,
3
,'Some text..'
 
SELECT [database_transaction_log_bytes_used] FROM sys.dm_tran_database_transactions
WHERE [database_id] = DB_ID ('tutorialsqlserver');
 
Commit

 





Look into the result of Statistics IO and  sys.dm_tran_database_transactions DMV , in this transaction page split occurs and it required 12 pages logical read to complete the transaction and 12148 bytes generated for the transaction log, more transaction logs means more sized transaction log database, and all the transaction log database should be backed up/log shipped/mirrored etc to save the database for disaster recovery. Thus more transaction log means more overhead.

In this particular scenario, this transaction is almost 4 times expensive compare to the last transaction. So where ever page split occurs, it takes more time, more IO and generate more logs, hence become more expensive in all terms, so if the page split can be avoided, then it should be avoided in all teams.

 


 
Summary

In todays article we learnt what is fragmentation, how fragmentation occurs why it is considered expensive, how SQL Server deals with it. But this topic hasn`t finished yet, we are coming up with the follow up article which will give you more insights about the fragmentation till then I have a question for readers.

Do you know fragmentation doesn`t matter when...

Complete the sentence and win…
What you think about this subject, write in the comment section.

 

 Quote

 "That’s been one of my mantras—focus and simplicity. Simple can be harder than complex; you have to work hard to get your thinking clean to make it simple."—Steve Jobs



Page copy protected against web site content infringement by Copyscape

About the Author

Neerajprasadsharma
Full Name: Neeraj Prasad Sharma
Member Level: Bronze
Member Status: Member
Member Since: 5/13/2016 8:42:37 AM
Country: India
Contact for Free SQL Server Performance Consulting and Training for you or your Organization.

Neeraj Prasad Sharma is a SQL Server developer who started his work as a dot net programmer. He loves SQL Server query optimizer`s capability to process the queries optimally. For the last six years he has been experimenting and testing Query Optimizer default behaviour and if something goes wrong his goal is to identify the reason behind it and fix it. I write technical article here: https://www.sqlshack.com/author/neeraj/ https://www.codeproject.com/script/Articles/MemberArticles.aspx?amid=12524731 https://www.mssqltips.com/sqlserverauthor/243/neeraj-prasad-sharma-/

Login to vote for this post.

Comments or Responses

Posted by: NEERAJPRASADSHARMA on: 2/4/2018 | Points: 25
So you are saying more CPU usage means better and faster query Really ?

Login to post response

Comment using Facebook(Author doesn't get notification)