Systems Engineering and RDBMS

Splitting a number of rows into equal groups

Posted by decipherinfosys on October 16, 2009

At the client site yesterday, one of the developers asked this question: “I want to take the large set of the data that I have in my gigantic table and split it up into balanced non-overlapping sets.  I want the starting value in the set, the ending value and the number of records in that group.  I need this to do my data processing logic for starting parallel concurrent processing of those large data sets.  What is the best way to achieve this?

This client site uses both Oracle and SQL Server so the solution had to work with both the RDBMS.  Luckily enough, there is an analytic function in both which does this very easily for us.  The function that we are talking about is NTILE().  Let’s take this up with an example:

SQL Server:

Let’s say that I want to split all the objects in sys.objects into say 10 different non-overlapping sets.  And then as per the requirement, I need to list out the starting value in the set, the ending value and the total count of the records in the set.  Here is a simple code sample that will do that:

select
min(object_id) as Starting_Value
,    max(object_id) as Ending_Value
,    count(*)       as Total_Records
,    grp_nbr           as Group_Nbr
from
(
select object_id,
ntile(10) over (order by object_id) grp_nbr
from SYS.OBJECTS
) AS IV
group by grp_nbr;

What I get as the output in my test database is this:

Starting_Value Ending_Value Total_Records Group_Nbr
-------------- ------------ ------------- --------------------
4              125243501    123           1
128719511      366624349    123           2
367340373      571865104    123           3
574625090      767341798    122           4
768721791      971150505    122           5
971866529      1163151189   122           6
1165247206     1403152044   122           7
1406628054     1707153127   122           8
1709249144     1941581955   122           9
1943677972     2144726693   122           10

Now, you will notice that not all sets have equal records but they are pretty close.  That is by definition of the NTILE() function.  Here is what BOL states:

If the number of rows in a partition is not divisible by expression, this will cause groups of two sizes that differ by one member. Larger groups come before smaller groups in the order specified by the OVER clause. For example if the total number of rows is 53 and the number of buckets is five, the first three buckets will have 11 rows and the two remaining buckets will have 10 rows each. If on the other hand the total number of rows is divisible by the number of buckets, the rows will be distributed evenly among the buckets. For example, if the total number of rows is 50, and there are five buckets, each bucket will contain 10 rows.

Oracle:

And the same function is available in Oracle as well. And the same code shown above will work in Oracle as well.  Just replace SYS.OBJECTS with ALL_OBJECTS.

Pretty simple way of splitting a large data set into non-overlapping sets, isn’t it!

Resources:

  • SQL Server MSDN BOL entry for the NTILE() function – here.
  • Oracle documentation of the NTILE() function – here.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
%d bloggers like this: