Systems Engineering and RDBMS

Column Order in a Composite Index

Posted by decipherinfosys on May 13, 2008

One of our readers asked 3 questions about composite indexes which we thought we would share with all of you:

Q1) For a composite index, is the order of the columns important and if so, can you please explain it with examples?

Q2) If I have all the columns of the composite index in the filter condition (WHERE clause), does the order in which they are specified matters?

Q3) When designing a composite index, assuming that all the columns of that index are going to be used in the filter condition, does it matter whether the most selective column i.e. the column with most distinct values is the first column in that index?

There are some key differences in behavior between SQL Server and Oracle in some cases and it is important to know those when you are dealing with both the RDBMS since how you design your indexes will have an impact on the execution plans and thus the performance of your queries and your application.

Let’s take an example in both the RDBMS and see it for ourselves:

SQL Server Syntax:

We will use the big_test_tbl that we have used in one of our previous blog post. Suppose that we have a nonclustered unique index on this table on the combination of these columns (in that order):

FirstName, LastName, ContactID

Let’s look at the data distribution:

select count(*) as total_records,
count(distinct FirstName) as Distinct_FirstName,
count(distinct LastName) as Distinct_LastName,
count(distinct ContactID) as Distinct_ContactID
from dbo.big_test_tbl

total_records Distinct_FirstName Distinct_LastName Distinct_ContactID
------------- ------------------ ----------------- ------------------
2336724       1018               1206              2336724

So, as you can see, the most selective column is CONTACTID for which a given value will qualify an exact 1 record. Next most selective is LastName
and the least selective is FirstName.

Now, let’s look at Question #1: “For a composite index, is the order of the columns important and if so, can you please explain it with examples?

We have already established that CONTACTID is the most selective column in this index meaning that if we give a value for the filter condition on that column, it should ideally use that index and do a seek operation on it…let’s see:

select * from big_test_tbl where ContactID = 2

As you can see from the execution plan above, it does an index scan operation instead of the index seek operation and then uses a RID (Row ID) to look up the data. The overall query cost is also 11.32. This goes to show that when the leading column of the index is not used in the filter condition, in SQL Server the index seek operation cannot be done since the data is not organized by ContactID, it is organized by FirstName, LastName and ContactID so it was never able to find the entry point into the index to do a seek operation.

Now, let us also include the leading column of the index:

select * from big_test_tbl where ContactID = 2 and FirstName = ‘Catherine’

Look at the change in the execution plan now – it does an index seek operation and the the parallelism operator goes away and the total query cost comes down to 0.013.

This tells us that not only is the column order important, it is very important for the leading column of the index to be part of the filter condition in order to make good use of the index and to have a good execution plan. This immediately brings another question to mind – now, we know that we need the leading column of the index to be present in the filter condition (WHERE clause) but does it make a difference if we have the least selective vs the most selective column as the leading column of that index?

In our example, we have established that the ContactID column is the most selective – every value qualifies for a single record and the FirstName column is the least selective. The index that we currently have on the table is: test and it is formed on the combination of the columns in this order:

FirstName, LastName and then ContactID

Now, let’s go ahead and create another index and this time, we will make ContactID as the first column:

create index test_2 on big_test_tbl (ContactID, LastName, FirstName)

Now, let us execute the same query but force it to use different indexes and see whether the execution plan and the query cost changes at all. We will use “SET SHOWPLAN_TEXT ON” and see the difference in the plan:

set showplan_text on
go

select * from big_test_tbl with (Index = test) where ContactID = 2 and FirstName = ‘Catherine’
go

StmtText
———————————————————————————————————————————————————————————————————————————–
|–Nested Loops(Inner Join, OUTER REFERENCES:([Bmk1000]))
|–Index Seek(OBJECT:([AdventureWorks].[dbo].[big_test_tbl].[test]), SEEK:([AdventureWorks].[dbo].[big_test_tbl].[FirstName]=N’Catherine’), WHERE:([AdventureWorks].[dbo].[big_test_tbl].[ContactID]=(2)) ORDERED FORWARD)
|–RID Lookup(OBJECT:([AdventureWorks].[dbo].[big_test_tbl]), SEEK:([Bmk1000]=[Bmk1000]) LOOKUP ORDERED FORWARD)

select * from big_test_tbl with (Index = test_2) where ContactID = 2 and FirstName = ‘Catherine’

StmtText
————————————————————————————————————————————————————————————————————————————-
|–Nested Loops(Inner Join, OUTER REFERENCES:([Bmk1000]))
|–Index Seek(OBJECT:([AdventureWorks].[dbo].[big_test_tbl].[test_2]), SEEK:([AdventureWorks].[dbo].[big_test_tbl].[ContactID]=(2)), WHERE:([AdventureWorks].[dbo].[big_test_tbl].[FirstName]=N’Catherine’) ORDERED FORWARD)
|–RID Lookup(OBJECT:([AdventureWorks].[dbo].[big_test_tbl]), SEEK:([Bmk1000]=[Bmk1000]) LOOKUP ORDERED FORWARD)

If you see the execution plans above, you will see that when we used the test index, the SEEK operation was done on FirstName and then the ContactID filter criteria was used to locate the entry. In the case of the test_2 index, it was the other way around since the first column in the index is ContactID.

The query cost was also different in the two cases – the test index usage yielded a 0.013 cost vs a cost of 0.0066 in the case of test_2. SQL Server maintains histograms on only the leading column of a composite index so the column order as well as the selectivity of the columns in an index does matter a lot depending upon how you are going to be accessing the data in the system. If the leading column is more selective, it can reduce the query cost by reducing the seek time as you have seen from the example above. So, in the case of SQL Server, having the most selective column as the leading column of the index is a good design assuming that that column is going to be used in your filter criteria.

Bottom line is that the order of the columns in a composite index is very important and it also depends upon how your queries are written. Let’s take one more example:

Say, I have a table TAB1 which has columns A, B, C and suppose I have two index:

Index1 on (a, b) and Index2 on (b,a).

If I have these queries:

Select * from TAB1 where A = @a and B = @b;

Select * from TAB1 where A = @a

then, Index1 is useful for these queries. If I have:

Select * from TAB1 where A = @a and B = @b;

Select * from TAB1 where B = @b

then, Index2 is more useful since it covers both the queries. We had mentioned before that there are some differences in behavior between Oracle and SQL Server and it is important to mention one of those differences here – beginning Oracle 9i, a new feature called “Index Skip Scan” was introduced using which both Index1 or Index2 can be useful in the above case but it depends upon the selectivity of those columns – we will cover index skip scan in a future post but you can read more over here if you are interested.

Key things to take away from this post are:

1) Order of the columns in an index is important but the most important thing is to first understand your queries and see how the index is going to be used. Say, I have three columns in a table with million records with this data distribution for the columns:

C: 2 distinct values

B: 5 distinct values

A: Million distinct values

As you can see from above, A is the most selective and C is the least so is A, B, C the right order? Answer is it depends. If you are using only equality operators in your filter condition, then definitely the answer is Yes. If however, say, the filter criteria is: WHERE C = ‘X’ and B = 2 and A > getdate(), then we would scan a lot of C=’Y’ and B != 2 values while we are scanning the A > getdate() values (this criteria of A > the value that we used in the example below qualified nearly 200,000 records). A better thing would be to use B,C,A in that case to reach the place where B and C equality matches are done and scan the rest of the records using the A > getdate() criteria. This is pretty simple to prove as well:

create table test (A datetime, B int, C varchar(5))
go
declare @i int
set @i = 1
while (@i <= 1000000)
begin
insert into test (A, B, C) select getutcdate(), case when @i < 200000 then 1
when @i >= 200000 and @i <400000 then 2
when @i >= 400000 and @i < 600000 then 3
when @i >= 600000 and @i < 800000 then 4
else 5
end,
case when @i <= 500000 then ‘X’
else ‘Y’
end
set @i = @i + 1
end
go

create index test_ind_1 on test (A,B,C)
create index test_ind_2 on test (B,C,A)

And now, let’s check the execution plan:

set showplan_text on
go
select * from test with (index = test_ind_1) where C = ‘X’ and B = 2 and A > ’2008-05-13 14:25:22.770′
go

StmtText
—————————————————————————————————————————————————————————————————————————————————–
|–Index Seek(OBJECT:([AdventureWorks].[dbo].[test].[test_ind_1]), SEEK:([AdventureWorks].[dbo].[test].[A] > ’2008-05-13 14:25:22.770′), WHERE:([AdventureWorks].[dbo].[test].[B]=(2) AND [AdventureWorks].[dbo].[test].[C]=’X’) ORDERED FORWARD)

select * from test with (index = test_ind_2) where C = ‘X’ and B = 2 and A > ’2008-05-13 14:25:22.770′
go
StmtText
———————————————————————————————————————————————————————————————————————————————–
|–Index Seek(OBJECT:([AdventureWorks].[dbo].[test].[test_ind_2]), SEEK:([AdventureWorks].[dbo].[test].[B]=(2) AND [AdventureWorks].[dbo].[test].[C]=’X’ AND [AdventureWorks].[dbo].[test].[A] > ’2008-05-13 14:25:22.770′) ORDERED FORWARD)

In both the queries above, the index seek operation takes place, however the seek operation is reversed due to the leading column in the index. The query cost when we make use of test_ind_1 in this case is 3.94 vs 0.66 when we make use of test_ind_2.

So, understand your queries and how you are going to be accessing the data and then decide how the composite index needs to be formed. No simple rule of thumb applies as you have seen from this post.

2) If I have all the columns of the composite index in the filter condition (WHERE clause), then the order in which they are specified in the filter criteria does not matter.

3) Understand your queries and depending upon the operators that are being used, the selectivity of the columns in a covered index even in scenarios where all the columns are being used in the filter condition can play an important role.

About these ads

2 Responses to “Column Order in a Composite Index”

  1. [...] Column Order in a Composite Index [...]

  2. [...] the factors that should be considered when designing indexes. In one of such posts, we had also covered indexes over multiple columns. In all of those posts, you will see one common thing that we try to [...]

Sorry, the comment form is closed at this time.

 
Follow

Get every new post delivered to your Inbox.

Join 74 other followers

%d bloggers like this: