This article is a small part of the big series i started with my very first article on create primary key in SQL Server followed by question answer series on the primary key and then and explaining all the point of the answer in detail with internal structure.
Introduction
By default SQL Server creates Clustered Index on the primary key, so it is important to understand the Clustered Index in depth and its general recommendation, because I am a firm, non believer in only theory because every theory is the result of some Practical experience, at least in SQL Server, but most of the way to find out what is happening inside the SQL Server is hidden and undocumented and unsupported, so it's not easily accessible, but some Microsoft engineer or ex Microsoft employees talk about internal and use them. But if something is undocumented then it should not be run on the production environment because if anything breaks in your database, Microsoft won`t take any responsibility, so undocumented commands should be run into the testing or developer system.
In this whole series we are using DBCC IND and DBCC PAGE command which is undocumented, hence not recommended to run on the production Servers.
This article is a demonstration why we recommend that Clustered Index should be unique, and the reason behind why SQL Server by default creates Clustered Index on the primary key because the primary key constraint is unique and suitable for Clustered Index.
In this article we will create a table which will have an equal number of rows and equal number of indexes and index definition with compare of our ideal table and then we will try to identify the difference and why it is said that Clustered Index should be Unique.
Below is the script to create a table with not unique clustered index:
CREATE TABLE [DBO].[NotUniqueCI] (
Primarykey int NOT NULL ,
Keycol VARCHAR(50) NOT NULL
,SearchCol INT
,SomeData char(20),
Keycol2 VARCHAR(50) NOT NULL
,SearchCol2 INT
,SomeData2 char(20),
Keycol3 VARCHAR(50) NOT NULL
,SearchCol3 INT
,SomeData3 char(20) )
INSERT INTO [DBO].[NotUniqueCI]
SELECT
n ,
n,
n
,'Some text..'
,
n,
n
,'Some text..'
,
n,
n
,'Some text..'
FROM Numbers
GO
--ALTER TABLE DBO.[NotUniqueCI]
--ADD CONSTRAINT PK_NotUniqueCI PRIMARY KEY (Primarykey)
CREATE CLUSTERED INDEX CI_NotUniqueCI ON [DBO].[NotUniqueCI] (Primarykey)
Create nonclustered index NC_NotUniqueCI_Keycol on [NotUniqueCI](Keycol,SearchCol,SearchCol2)
Create nonclustered index NC_NotUniqueCI_SearchCol on [NotUniqueCI](SearchCol,Keycol,SomeData)
Create nonclustered index NC_NotUniqueCI_SomeData on [NotUniqueCI](SomeData,Keycol,SearchCol)
Create nonclustered index NC_NotUniqueCI_Keycol2 on [NotUniqueCI](Keycol2,SearchCol,SearchCol2)
Create nonclustered index NC_NotUniqueCI_SearchCol2 on [NotUniqueCI](SearchCol2,Keycol,SomeData)
Create nonclustered index NC_NotUniqueCI_SomeData2 on [NotUniqueCI](SomeData2,Keycol,SearchCol)
----------This index is creted in the previous part of the series
CREATE INDEX NC_NotUniqueCI_SearchCol3 on NotUniqueCI(SearchCol3 )
Now let us see all the indexes on the both tables using SP_HELP command:
SP_HELP 'IdealCI'
GO
SP_HELP 'NotUniqueCI'
GO
Both tables have identical table structure, indexes and number of records, but there is just one slight difference in the both tables which is ideal table has a unique clustered index and it is not unique
in the Nonunique table, so let us check the total number of pages in the both tables using the
sys.dm_db_index_physical_stats DMV.
SELECT SUM (Page_Count *8 ) Space_Used_KB
FROM sys.dm_db_index_physical_stats
(db_id('TutorialSQLServer'), object_id ('IdealCI' ), null ,null,'DETAILED')
GO
SELECT SUM (Page_Count *8 ) Space_Used_KB
FROM sys.dm_db_index_physical_stats
(db_id('TutorialSQLServer'), object_id ('NotUniqueCI' ), null ,null,'DETAILED')
First penalty just not for creating a unique clustered index is costing an additional 928 KB of wasted space in a small table which contain only 500000 rows, so we need to look inside what is Extra in the NotUniqueCI table, doing this we will first use DBCC IND Command for the clustered index.
DBCC IND ('TutorialSQLServer', 'NotUniqueCI', 1);
Please note this is not the output we got from the above command I rearranged the columns and removed some so the output of the above command become more meaningful, as we already know how to identify the root level page id from the above output by looking the maximum index level, which would have previous and next page id is zero. If you are new to these things we strongly recommended to visit this article to get a better understanding on SQL Server internal of indexes.
So now we know the page id of the clustered index let`s look inside the page.
DBCC PAGE ('TutorialSQLServer',1,158322,3) WITH TABLERESULTS
Look at the second result set, there is an additional column referred as UNIQUIFIER (key) which we did not specify anywhere in the definition of index neither at the time of index creation. This is added by storage engine just to make our index unique, but wait Why with zero value? We will look into this further in the article. Now let us look into its first the data page look into the data page below using DBCC PAGE COMMAND.
How I know this is first page?
Remember, I wrote in a previous article, in the index level 0 which data page does not have previous page is first and which page does not have next page is the last page of index.
DBCC PAGE ('TutorialSQLServer',1, 158256,3) WITH TABLERESULTS
Here is the first page look these is again the additional column with column name Uniquifier which is actually a part of the Clustered Index to make the Clustered Index unique, so if it is part Clustered Index, then it has to be in the all non clustered index right?
Let us quickly look into the 8 the index id leaf level.
DBCC IND ('TutorialSQLServer', 'NotUniqueCI', 8);
DBCC PAGE ('TutorialSQLServer',1,242904,3) WITH TABLERESULTS – data page
Yes, there is an additional Uniquifier column here apart from the primary key column.
All the clustered index has to be unique somehow if you won`t make it, storage engine will make it for you in the above example, we have not specified the clustered index unique and yes but still we have unique data in the table.
But storage engine believes that you did not specify unique in the clustered index definition than you will insert some duplicate rows in the clustered index and since the clustered index has to be unique, so storage engine has added one extra column with the name of Uniquifier to make the clustered index unique, the way Storage Engine implemented to make the clustered index unique is quite smart, it has added a numeric value which start with zero if there is no duplicate value for the non unique clustered index and increased it by +1 whenever a duplicate value inserted/updated in the table, so it insert/updated in all the respective Clustered and non clustered index as well. I guess the Uniquifier is an integer column that is why it is costing us
4 extra bytes.
Now the more important question is why a clustered index has to be unique?
The motive behind this is all the non clustered index rows has a link to the clustered index rows in one to one manner relationship so it could seek the matching Row in the clustered index, this one to one relationship will make sure every non clustered index will link to only one row in the table.
Let us see the practical example using the simple query against the SearchCol3 column and look at the execution plan:
SELECT * From NotUniqueCI where SearchCol3=1
Look at the above execution plan, to fulfill the above query optimally the Query Optimizer has chosen the search3 non clustered index because it is indexed in the SearchCol3 column, but the user has requested all the column and this index does not contain all the requested columns so the Query Optimizer has to look into the Clustered Index for the rest of the columns using the clustered index key and Uniquifier in the clustered index to confirm that it will maintain one to one relationship, look in the output section of the outer part of the Nested Loop Join (by the way I am writing a series of articles on the
Nested Loop Join iterator if you are intereseted in knowing about it you should look into this), there are three column in the output searchcolumn , primarykey and Unq1002.
This unq1002 column is actually Uniquifier, which required by the Query Optimzer to join the Clustered Index along with a primarykey column.
So in the summary what the Query Optimizer is doing:
STEP 1: Seek for the SearchCol3 column 3 in the non clustered index.
STEP 2: Get a unique identifier column and primary key along with SearchCol3 column.
STEP 3: Then seek for the Uniquifier and primary key column in the clustered index using the Nested Loop Join iterator (why they called iterator) and fetched all the remaining column from the clustered index right.
Just to add some evidence of the 4 bytes theory of the Uniquifier column, we will query the NotUniqueCI and ideal table and look the difference in data flow estimation in the execution plan.
SELECT * FROM NotUniqueCI WHERE SearchCol3=1
GO
SELECT * FROM IdealCI WHERE SearchCol3=1
Look how 4 bytes and Uniquifier is extra in the comparison to the ideal clustered index.
Duplicate Values in the Clustered Index
The clustered index is not unique, so let us add some duplicate values in the key column of the clustered index and see the effect on the both clustered and non clustered index.
In the below query we will add 100 duplicate rows in the table:
DBCC PAGE ('TutorialSQLServer',1,158256,3) WITH TABLERESULTS
Look the image above, we are looking at the last rows of the Clustered Index first data page, to make the clustered index unique how storage engine has used the Uniquifier +1 with every row and this would have updated to all the non clustered index as well, so let us look into it.
DBCC PAGE ('TutorialSQLServer',1,242904,3) WITH TABLERESULTS
Here as well look at new rows inserted in this page and Uniquifier added as "+1" and this would happen in all non clustered index as well.
Whenever a duplicate value inserted in the non clustered index, storage engine will add +1 to the Uniquifier to make the row unique.
Ok, so let us delete all the data, duplicate rows and make all the index in the previous state by rebuilding the table.
DELETE from [DBO].[NotUniqueCI] where primarykey =1
GO
INSERT INTO [DBO].[NotUniqueCI]
SELECT
n ,
n,
n
,'Some text..'
,
n,
n
,'Some text..'
,
n,
n
,'Some text..'
FROM Numbers where n=1
GO
ALTER TABLE [DBO].[NotUniqueCI] REBUILD
GO
Let's look at the page count in the index.
SELECT SUM (Page_Count *8 ) Space_Used_KB
FROM sys.dm_db_index_physical_stats
(db_id('TutorialSQLServer'), object_id ('NotUniqueCI' ), null ,null,'DETAILED')
Wait, what happened as before inserting the rows the page count in the NotUniqueCI was 178080, now even we have deleted all the data and rebuild the index the claim the space?
Actually storage engine believes that the duplicate data might come back.
So let us make the primary key column, Unique Clustered Index and claim all the wasted space and even it will remove the Uniquifier column.
To do that, we will first drop the existing Clustered Index then create a new one with a unique clustered index then we will check the data pages in the index.
DROP INDEX CI_NotUniqueCI ON [DBO].[NotUniqueCI]
GO
CREATE UNIQUE CLUSTERED INDEX CI_Uniquepk ON [DBO].[NotUniqueCI] (Primarykey)
GO
ALTER TABLE [DBO].[NotUniqueCI] REBUILD
GO
Let us again count the data pages, but this time we will compare the size of data pages to IDEALCI table
as well.
SE
LECT SUM (Page_Count *8 ) Space_Used_KB
FROM sys.dm_db_index_physical_stats
(db_id('TutorialSQLServer'), object_id ('NotUniqueCI' ), null ,null,'DETAILED')
GO
SELECT SUM (Page_Count *8 ) Space_Used_KB
FROM sys.dm_db_index_physical_stats
(db_id('TutorialSQLServer'), object_id ('IDEALCI' ), null ,null,'DETAILED')
Great!! We have not just only claimed all the wasted space from the index, but now the index is much thinner and size is even equivalent to the our IDEALCI table :)
So look for potential unique index and convert it into unique, don't forget make an index unique won`t allow duplicated data on the Unique key column.
Closing lines and summary
So a journal question arises, Why Microsoft allows non unique clustered index if it creates the problem?
This is a valid question and should be addressed.
Anywhere you look, all over the internet filled with advice that your clustered index should be unique.
It's recommended to read very carefully and never accept everything, whosoever is saying not even white paper, always pay attention to the words and try to test it.
So as above, we bashed the non unique index so much that might push the readers into thinking that "All the clustered index must be unique" but it was never said that "All the clustered index should be unique".
There are some cases when you might consider a clustered index as non unique, these apples for almost everything where the general recommendation is something else but your database structure and requirement wants something else.
Right now on top of my head, I can think only 2 cases of non unique clustered index.
First would be where you're all read operations required against with one or more than one combination of columns and you required all the columns in the read operation and it is not required any unique constraint on the table, so having a unique clustered index with a support of non clustered index for read operation would be just a waste of space.
And the second case where you can consider a non unique clustered index is when you a very wide unique clustered index but your read operation required only few leading columns.
The bottom line is test yourself always.
Quote
"No one wants to die. Even people who want to go to heaven don't want to die to get there. And yet death is the destination we all share. No one has ever escaped it. And that is as it should be, because Death is very likely the single best invention of Life. It is Life's change agent. It clears out the old to make way for the new."- Steve Jobs