Systems Engineering and RDBMS

Update Statistics in SQL Server 2005 – New Undocumented options

Posted by decipherinfosys on August 1, 2007

Yesterday, we covered the DBMS_STATS.SET_TABLE_STATS functionality in Oracle and saw how one can make the optimizer believe that a different data distribution exists for the tables and indexes. Using that, one can test code in the development environment to see how its execution plan changes, which is useful for checking whether data volume causes any issues with the query execution plan. If the execution plan goes bad and the cost of the query increases, one can then tune the query or adjust the indexes so that those issues are nipped in the bud during the development stage itself. Ideally, the cost of a query should not grow significantly as the data volume increases. The same technique also helps when you are a vendor, your application code is getting a bad execution plan at a client site, and bringing that 500GB database back in-office is not an option.

So, in SQL Server, is there a way to simulate a high data volume without "actually" creating the volume of data? Yes, there is a way, but it is undocumented. In addition, in SQL Server 2000 the method is not straightforward: it involves direct updates to the system tables, which, needless to say, should never be done in a production environment. In a development environment, under controlled conditions, one can do it if one is brave enough to venture out. There is an excellent post by Joe Chang that walks through the different steps – you can read more on that here.

In this post, we are going to cover how one can achieve this in SQL Server 2005. The options are still undocumented (BOL mentions them only as reserved for internal use), but there is no messing around with the system tables. If you look up the UPDATE STATISTICS command in BOL, at the bottom of the syntax you will see:

<update_stats_stream_option> ::=
[ STATS_STREAM = stats_stream ]
[ ROWCOUNT = numeric_constant ]
[ PAGECOUNT = numeric_constant ]

and in the description section, it states:

<update_stats_stream_option>
Reserved for SQL Server internal use only. Future compatibility is not guaranteed.

In this post, we will go over the ROWCOUNT and PAGECOUNT options. As their names suggest, these can be used to change the internal metadata of the tables and their indexes without any need to manually update the system tables. Since SQL Server uses a cost-based optimizer (CBO), the query optimizer looks at these statistics in order to form the execution plans, and setting these values fools the optimizer into thinking that the data volume is very different from what it actually is. This is along the same lines as the Oracle command that we showed yesterday.

Here is the code snippet for this post:

/********************************************************************************************************************************/

use tempdb
go

/****************************************************************
-- Create the dummy tables in the tempdb database
*****************************************************************/
create table test_tbl1(col1 int not null primary key, col2 int)
go

create table test_tbl2(col3 int not null, col4 int, col5 int not null primary key)
go

alter table test_tbl2 add constraint fk_test_tbl2_to_test_tbl1 foreign key (col3) references test_tbl1 (col1)
go

create unique index test_tbl1_ind_1 on test_tbl1 (col2)
go

/****************************************************************
-- Set the option for statistics profiling
*****************************************************************/
set statistics profile on
go

/****************************************************************
-- Query #1: without the where clause
*****************************************************************/
select *
from dbo.test_tbl1
inner join dbo.test_tbl2
on test_tbl1.col1 = test_tbl2.col3
order by test_tbl2.col4, test_tbl1.col2
go

/****************************************************************
-- Query #2: with the where clause
*****************************************************************/
select *
from dbo.test_tbl1
inner join dbo.test_tbl2
on test_tbl1.col1 = test_tbl2.col3
where col2 > 10 and col2 < 20
order by test_tbl2.col4
go

/****************************************************************
-- Update the statistics now – this affects all statistics,
-- including the index statistics
*****************************************************************/
update statistics test_tbl1
with rowcount = 100000, pagecount = 1000;

update statistics test_tbl2
with rowcount = 1000000, pagecount = 10000;
go

/****************************************************************
-- Query #1 – using the recompile option to get the new plan
*****************************************************************/
select *
from dbo.test_tbl1
inner join dbo.test_tbl2
on test_tbl1.col1 = test_tbl2.col3
order by test_tbl2.col4, test_tbl1.col2
option (recompile)
go

/****************************************************************
-- Query #2 – using the recompile option to get the new plan
*****************************************************************/
select *
from dbo.test_tbl1
inner join dbo.test_tbl2
on test_tbl1.col1 = test_tbl2.col3
where col2 > 10 and col2 < 20
order by test_tbl2.col4
option (recompile)
go

/****************************************************************
-- Drop the tables now (child table first, because of the
-- foreign key constraint)
*****************************************************************/
drop table test_tbl2
go
drop table test_tbl1
go

/********************************************************************************************************************************/

In this code snippet, as you can see, we first created two tables and then created a unique index on one of the columns in the first table. After that, we turned on the statistics profile option for the session and fired off two queries, to show what the execution plans look like with and without the filter criteria (the where clause) before making any changes to the statistics for the tables. Once that was done, we used the ROWCOUNT and PAGECOUNT options to update the statistics for the two tables, assuming that for both tables we can fit 100 records per page (100,000 rows over 1,000 pages, and 1,000,000 rows over 10,000 pages – though one could clearly store more per page; this is for illustration only). We then ran those same queries again, this time with the RECOMPILE option, to force a recompile and a new plan generation.
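As a quick sanity check, the faked counts should be visible in the rowcnt and dpages columns of the sys.sysindexes compatibility view (this check is illustrative – sys.sysindexes is itself on the deprecation path in 2005, but it is the easiest place to see these counters):

```sql
-- Illustrative check: after the UPDATE STATISTICS ... WITH ROWCOUNT/PAGECOUNT
-- commands above, the faked values should show up here.
-- indid 0 or 1 corresponds to the heap or the clustered index.
select o.name, i.rowcnt, i.dpages
from sys.sysindexes i
inner join sys.objects o
on o.object_id = i.id
where o.name in ('test_tbl1', 'test_tbl2')
and i.indid in (0, 1);
```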

The results can be seen by running the scripts from above – I am not uploading the Word document with the output since the formatting gets messed up. You can run the above code snippet in your environment to see the execution plan changes; you will get the same repeatable result. You can see how the plan has changed for both of the queries after these updates.

Query #1:

The plan changed drastically from a clustered index seek operation and an index scan operation to a clustered index scan and an index scan operator, with parallelism and a hash match kicking in. Take a look at the "EstimateRows" column and you will see the drastic changes in the values. The EstimateIO and EstimateCPU numbers have increased, and the TotalSubtreeCost is substantially higher after the changes to the stats. All of that is expected if you look at the query: it has no filter criteria and does a sort operation on an un-indexed column. In addition, the foreign key is not indexed. See this post for why un-indexed foreign keys can lead to issues.
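Addressing that last point is a one-liner (the index name here is just illustrative):

```sql
-- Index the foreign key column on the child table so that the join
-- back to test_tbl1 (and FK checks on deletes/updates of the parent)
-- can use a seek instead of scanning test_tbl2.
create index test_tbl2_ind_fk on test_tbl2 (col3);
```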

These issues can easily be caught by doing this type of analysis before the code even hits the production environment. One can argue that anyone who knows good SQL programming can glean these two inferences just by looking at the code – however, there are also junior members on the team, and this can prove useful to them as well. In addition, as Joe Chang pointed out in his SQL Server 2000 post (see the link above), this can prove very useful for simulating production issues without actually having to create that much data or bringing the huge backup in-house and restoring it. All one needs to know is the data distribution.

Query #2:

In the second query, the filter criteria (the where clause) is included – and it is on an indexed column that has unique values. You can see the change in the execution plan behavior with the different stats now available, specifically the change in the join operator (from nested loops to a merge join). One can then look into creating a covering index (or changing the clustered index, etc.) to resolve such issues, or look into tuning the query itself – is all the data really required (a select *?), or are only certain columns needed?
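As a sketch of that idea (the index name and the narrowed column list are illustrative assumptions, not part of the original script), SQL Server 2005's INCLUDE clause lets a non-clustered index cover a narrowed version of Query #2:

```sql
-- Illustrative covering index: col3 supports the join, and the
-- included col4 column covers the order-by without a lookup,
-- provided the query selects only the columns it needs.
create index test_tbl2_ind_cov on test_tbl2 (col3) include (col4);

-- The narrowed query that this index can cover on the test_tbl2 side:
select t1.col1, t1.col2, t2.col4
from dbo.test_tbl1 as t1
inner join dbo.test_tbl2 as t2
on t1.col1 = t2.col3
where t1.col2 > 10 and t1.col2 < 20
order by t2.col4;
```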

NOTE: Please do remember that, just like Oracle's SET_TABLE_STATS (which we showed yesterday), these options only make the optimizer believe that the table's rowcounts and pagecounts are different from what they actually are. The physical records in the tables and the indexes remain as is. The usage of these commands lets you see what kind of query plan changes will occur if the data volume changes – however, when you actually run those queries against empty or small tables, they will still complete very fast. One can use these execution plans to analyze how the estimated query plan and its cost change as the data volume changes. If the cost grows significantly with the data volume, that is an indication that the queries need tuning or that indexes need to be added or modified.
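When you are done experimenting, the counters can be brought back in line with reality. One way to do this (shown here as a sketch) is DBCC UPDATEUSAGE with the COUNT_ROWS option, which corrects the row and page counts in the catalog, followed by a normal statistics update:

```sql
-- Correct the row and page counters back to their true values
-- in the current database (0 = current database).
dbcc updateusage (0, 'test_tbl1') with count_rows;
dbcc updateusage (0, 'test_tbl2') with count_rows;
go

-- Then rebuild the distribution statistics normally.
update statistics test_tbl1;
update statistics test_tbl2;
go
```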
