The IN-List Iterator Issue

Posted by decipherinfosys on February 21, 2008

We have seen this issue time and again in many client applications. Many applications are designed to simply plug different values into the IN clause of the filter criteria for a given column (or, if the RDBMS supports row value constructors, then a set of values). That is not a scalable design. Instead, you should specify the criteria that qualified those, say, 1000 values (the region ID or company ID values) in the first place, i.e. a status code, a date range, or whatever it was that made you select those 1000 values/records to begin with.

If you do that, then the SQL:

X IN (1, 2, 3, 4, 5, 20, 30, 40, 50, …)

will just be

X.Stat_code = :1
AND X.date_field >= :2 AND X.date_field <= :3

where :1, :2 and :3 are bind variables. The idea is that there must have been some reason (some criteria) why those 1000 values were selected to begin with; instead of preparing an IN list, parsing it and dividing it up into multiple sets, you should use that criteria itself. What if tomorrow, instead of say 1000 qualifying values, a client has a requirement where there are over a million of them? Your overhead will only increase. Internally, COL1 IN (x, y, z) gets translated to COL1 = x OR COL1 = y OR COL1 = z…you get the picture now.

Let's talk about the IN-list iterator for a second. As you know, queries that use the IN clause (with values) are internally converted by the optimizer into a series of predicates connected with OR. The optimizer estimates the cardinality and looks for a long contiguous sequence where the expressions are all column = constant, such as (a=1 OR a=10 OR a=11, …). Those tests are run against every row in the table that matches the other filter criteria and join conditions (if those are present). RDBMSs short-circuit the OR comparisons (meaning that if a row matches the 3rd item in the list, the engine stops testing instead of checking the remaining 96 items), but it is still messy even if most of the matching rows are caught by the first 20% of the list: non-matching rows must still be tested against every item in the IN clause. If you use an IN clause with many values to filter a large table, the overhead can be brutal compared to simply lining two ordered lists up next to each other and plucking out the matches based on the filter criteria. Performance dips even more when multiple users fire off similar queries at the same time.
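
To make the OR expansion concrete, here is a minimal illustration (the table and column names are made up for this sketch): a query such as

SELECT order_id
FROM Orders
WHERE region_id IN (1, 10, 11)

is processed as if it had been written as

SELECT order_id
FROM Orders
WHERE region_id = 1
   OR region_id = 10
   OR region_id = 11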

IMHO (and this can be proven by comparing logical/physical reads, CPU usage and, of course, execution times), the best approach on any platform with a large IN list is two-fold:

The first option below is the best option…

1) The criteria that was used to qualify those N IN-list values should be used directly when forming the SQL; that way the joins and the meaningful filter conditions (and thus their indexes) can be used for faster retrieval of the data.

If the above is not feasible for whatever reason, then the next best approach is:

2) To put the IN-list data into a work table (by using a DB function) and use a join. The DB function sits in the FROM clause and hence acts as a table: it pivots the delimited string into rows, which are then joined like a derived table/inline view. A sketch of such a split function follows the example below.

Example:

SELECT e.EmpName
FROM Emp e
INNER JOIN Split('100,101,102,103,104,105') d
    ON e.EmpID = d.Value
WHERE e.salary > 50000
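
For reference, here is a minimal sketch of such a split function in SQL Server 2005 T-SQL; the name Split, the single Value column and the assumption that the list contains comma-separated integers all mirror the example above (a production version would also handle empty elements, whitespace and bad input). Note that in SQL Server the call in the example would be schema-qualified as dbo.Split:

CREATE FUNCTION dbo.Split (@List VARCHAR(8000))
RETURNS @Result TABLE (Value INT)
AS
BEGIN
    DECLARE @Pos INT, @Next INT;
    SET @Pos = 1;
    SET @Next = CHARINDEX(',', @List, @Pos);
    WHILE @Next > 0
    BEGIN
        -- grab the element between the current position and the next comma
        INSERT INTO @Result (Value)
        VALUES (CAST(SUBSTRING(@List, @Pos, @Next - @Pos) AS INT));
        SET @Pos = @Next + 1;
        SET @Next = CHARINDEX(',', @List, @Pos);
    END;
    -- the last (or only) element has no trailing comma
    INSERT INTO @Result (Value)
    VALUES (CAST(SUBSTRING(@List, @Pos, LEN(@List) - @Pos + 1) AS INT));
    RETURN;
END

Once the list is exposed as a rowset, the optimizer can join it to Emp (and use an index on EmpID) instead of expanding the list into a chain of OR comparisons.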

One of the optimizer improvements in DB2 version 8.2 and SQL Server 2005 is that large IN lists can be flattened and at times turned into worktables (simulating what is described above in option #2), but this is not always guaranteed. In Oracle as well, the optimizer can at times opt for USE_CONCAT or NO_EXPAND (depending upon a couple of other parameter settings) if it sees the cost of "OR expansion" for the IN-list to be high, but again that is not always the case, since the optimizer has to consider the permutations before giving up on the OR expansion, and that evaluation itself can lengthen the parse time.
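
In Oracle, those two behaviors can also be requested explicitly through the corresponding hints. A minimal sketch (the emp table and its columns here are illustrative, not from a real schema):

-- Ask the optimizer to perform OR expansion of the IN-list:
SELECT /*+ USE_CONCAT */ emp_name
FROM emp
WHERE emp_id IN (100, 101, 102)

-- Ask the optimizer to skip OR expansion and process the IN-list as-is:
SELECT /*+ NO_EXPAND */ emp_name
FROM emp
WHERE emp_id IN (100, 101, 102)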
