Splitting a number of rows into equal groups
Posted by decipherinfosys on October 16, 2009
At the client site yesterday, one of the developers asked this question: “I want to take the large set of the data that I have in my gigantic table and split it up into balanced non-overlapping sets. I want the starting value in the set, the ending value and the number of records in that group. I need this to do my data processing logic for starting parallel concurrent processing of those large data sets. What is the best way to achieve this?”
This client site uses both Oracle and SQL Server so the solution had to work with both the RDBMS. Luckily enough, there is an analytic function in both which does this very easily for us. The function that we are talking about is NTILE(). Let’s take this up with an example:
Let’s say that I want to split all the objects in sys.objects into say 10 different non-overlapping sets. And then as per the requirement, I need to list out the starting value in the set, the ending value and the total count of the records in the set. Here is a simple code sample that will do that:
min(object_id) as Starting_Value
, max(object_id) as Ending_Value
, count(*) as Total_Records
, grp_nbr as Group_Nbr
ntile(10) over (order by object_id) grp_nbr
) AS IV
group by grp_nbr;
What I get as the output in my test database is this:
Starting_Value Ending_Value Total_Records Group_Nbr -------------- ------------ ------------- -------------------- 4 125243501 123 1 128719511 366624349 123 2 367340373 571865104 123 3 574625090 767341798 122 4 768721791 971150505 122 5 971866529 1163151189 122 6 1165247206 1403152044 122 7 1406628054 1707153127 122 8 1709249144 1941581955 122 9 1943677972 2144726693 122 10
Now, you will notice that not all sets have equal records but they are pretty close. That is by definition of the NTILE() function. Here is what BOL states:
If the number of rows in a partition is not divisible by expression, this will cause groups of two sizes that differ by one member. Larger groups come before smaller groups in the order specified by the OVER clause. For example if the total number of rows is 53 and the number of buckets is five, the first three buckets will have 11 rows and the two remaining buckets will have 10 rows each. If on the other hand the total number of rows is divisible by the number of buckets, the rows will be distributed evenly among the buckets. For example, if the total number of rows is 50, and there are five buckets, each bucket will contain 10 rows.
And the same function is available in Oracle as well. And the same code shown above will work in Oracle as well. Just replace SYS.OBJECTS with ALL_OBJECTS.
Pretty simple way of splitting a large data set into non-overlapping sets, isn’t it!