Systems Engineering and RDBMS

TABLESAMPLE clause in SQL Server 2005

Posted by decipherinfosys on June 18, 2007

SQL Server 2005 introduced a new clause – TABLESAMPLE. This clause restricts the number of rows returned from the “FROM” clause in a query to a sample number or a percentage of rows. Let’s look at an example first to understand this clause:

set nocount on
go
create table demo_ts (id_val int identity(1,1) , col1 int)
declare @i int
set @i = 1
while (@i <= 10000)
begin
insert into demo_ts (col1) values (@i)
set @i = @i + 1
end
go

We have populated the table with sample data. Now, let’s use the tablesample clause and retrieve 500 rows and then we will use the percent clause to get a certain percentage back:

select *
from demo_ts
tablesample (500 rows);

select *
from demo_ts
tablesample (10 percent);

If you keep running these queries over and over again, you will see the non-deterministic nature of this clause – it will keep giving back different results based on the sampling that it does on the data pages for that table. This sampling is done on the pages and not on the rows – once the random value is assigned to a page, all of the rows from those pages are returned (so it’s either all the rows from the page(s) that got sampled or none at all regardless of the rows on the page). Since the pages could be filled with different number of rows, the count of the records returned per execution could be different as well.

So, how will this behave if one or more of the tables in the from clause have this clause applied to them, example:

select *
from demo_ts a
tablesample (500 rows)
inner join demo_ts_2 b
on a.id_val = b.id_val

Assume that demo_ts_2 table has only a sub-set of the rows as compared to the demo_ts table. In this scenario, some rows returned from the alias b might not have a matching record in the sampled set from the alias “a” and would thus return different records. If you always want the same data set to be retrieved, you can use the REPEATABLE option with the same repeat_seed value – this will force SQL Server to return the same sub-set of the rows each time, example:

select *
from demo_ts
tablesample (10 percent)
repeatable (200);

So, what is a practical use of this clause? In some statistical or data-mining applications where one requires a sampling mechanism over a sub-set of the data, one can make use of this – however, if you are looking for a true random sampling, then read the post on random number generation and random sorts on our blog.

Sorry, the comment form is closed at this time.

 
%d bloggers like this: