Systems Engineering and RDBMS

Understanding the difference between Indexes and Statistics

Posted by decipherinfosys on March 19, 2008

This was one of the questions that a reader had asked us recently: “What is the difference between an index and a statistic and how do they get stored?”

An index is a physically implemented structure in the database (you can read up more in BOL on clustered and non-clustered indexes) whereas statistics are a set of values that help the optimizer during the execution plan formation stages to decide whether to use an index or not. And it is not a 1-1 relationship between indexes and statistics i.e. all indexes have statistics but one can have statistics without an index. And these statistics that do not associate to an index can also help in the formation of the right execution plan. We will cover that in another post as to how that can help. In this post, let’s cover the basics of these statistics.

Basic thing to take away from the comments from above is that an index helps the optimizer to find the data during the execution of the statements and statistics help the optimizer to determine which indexes to use.

So, what exactly do statistics contain that make them so useful? Statistics essentially contain two pieces of information:

1) A histogram which contains a sampling of the data values from the index and the distribution of the values in the ranges of data, and
2) Density groups (collections) over the column (or number of columns) of a table or an indexed view. Density essentially reflects the uniqueness of the values in a particular column.

Let’s take an example: I am picking Northwind database since it is also available in SQL Server 2000. In that database schema, let’s pick the Employees table. It has an index called: LastName which is created on the column “LastName” of that table. If I run DBCC SHOW_STATISTICS (‘employees’, ‘lastname’) on it, I get:

Name        Updated              Rows                 Rows Sampled         Steps  Density       Average key length String Index
 ----------- -------------------- -------------------- -------------------- ------ ------------- ------------------ ------------
 LastName    Feb  5 2008  8:12AM  9                    9                    8      1             18.22222           YES

(1 row(s) affected)

All density   Average Length Columns
 ------------- -------------- --------------
 0.1111111     14.22222       LastName
 0.1111111     18.22222       LastName, EmployeeID

(2 row(s) affected)

RANGE_HI_KEY         RANGE_ROWS    EQ_ROWS       DISTINCT_RANGE_ROWS  AVG_RANGE_ROWS
 -------------------- ------------- ------------- -------------------- --------------
 Buchanan             0             1             0                    1
 Callahan             0             1             0                    1
 Davolio              0             1             0                    1
 Dodsworth            0             1             0                    1
 Fuller               0             1             0                    1
 Leverling            1             1             1                    1
 Peacock              0             1             0                    1
 Suyama               0             1             0                    1

(8 row(s) affected)

As you can see from above, the density is just a fraction which represents how many records will be returned back when a query is made for a given value. So, for the primary key column index, the density will be 1. It however does not give us any information about the ranges i.e. the values that may appear more or less than the average N number of times. That information is provided by the histograms which are represented above.

Another thing of importance to note here is that in the cases of multiple columns in an index, the histogram is available only on the leading column of the index i.e. the first column. The density value will be available for each first column based subset, for example: If I create an index on say (LastName, FirstName), then I would have a density value for each of these:

a) LastName => the value for this would mean that for every LastName value, how many records get qualified?
b) LastName, FirstName => the value for this would mean that for every combination of the LastName and FirstName, how many records get qualified?
c) LastName, FirstName, EmployeeID => This includes the last column as the Primary Key column (a clustered index) since what we created was a non-clustered index.

Here is the output of the DBCC SHOW_STATISTICS command for this index:

DBCC SHOW_STATISTICS (‘employees’, ‘TEST’)

All density   Average Length Columns
 ------------- -------------- --------------------------------
 0.1111111     14.22222       LastName
 0.1111111     25.77778       LastName, FirstName
 0.1111111     29.77778       LastName, FirstName, EmployeeID

Another key thing to note is that if the database option “auto create statistics” is ON, and if there are queries such that the WHERE clause is referencing columns that are not already a part of an index (as a leading first column), then SQL Server automatically creates statistics on it. These automatically created statistics start with the name of “_WA_sys” and are followed by two hex strings. If you convert those 2 hex strings to decimal values, then the first one represents the table’s objectID value and the second one represents the index’s objectID value. One can of course choose to create statistics by using the “create statistics” command as well and name them in a particular fashion.

For more reading on this topic, we would suggest that you read this wonderful whitepaper from Microsoft on this topic:

http://www.microsoft.com/technet/prodtechnol/sql/2005/qrystats.mspx

About these ads

4 Responses to “Understanding the difference between Indexes and Statistics”

  1. [...] yielded a 0.013 cost vs a cost of 0.0066 in the case of test_2. SQL Server maintains histograms on only the leading column of a composite index so the column order as well as the selectivity of the columns in an index does matter a lot [...]

  2. [...] by decipherinfosys on December 2, 2008 In one of previous posts, we had discussed the differences between statistics and indexes and had also covered in another post, how one can make the optimizer think that the data [...]

  3. [...] Decipherinfosys along with a good description on the difference of indexes and statistics if you need some [...]

  4. [...] Understanding the difference between Indexes and Statistics … These automatically created statistics start with the name of “_WA_sys” and are followed by two hex strings. If you convert those 2 hex strings … [...]

Sorry, the comment form is closed at this time.

 
Follow

Get every new post delivered to your Inbox.

Join 79 other followers

%d bloggers like this: