Systems Engineering and RDBMS

Join Internals

Posted by decipherinfosys on May 11, 2007

We have talked about different type of joins before in a couple of our blog posts. Let’s spend some time talking about the three common join operators that are used to process joins:

  • Nested Loop Join
  • Hash Join, and
  • Merge Join

A Nested Loops Join takes in smaller inputs and requires less memory usage. The inputs are two tables – an outer table and an inner table. The join takes a row from the outer table and scans the rows in the inner table for a match. If there are additional filter criterias that it can use to narrow down that set, it uses that to search among that list. Then, the join proceeds to the next row of the outer table and repeats the process. If the inner table is small, there is a better chance of the data pages remaining in memory. The optimizer might have to perform a sort operation on the table to order the joined column. Nested loops are most optimal for smaller sets of data.

Hash joins take in larger inputs and require high memory. It has two inputs to it – the build input and the probe input. The build input is used for hash buckets which are essentially a grouping of data elements. This grouping then creates a hash key from this data set which is used for group into buckets. A linked list is then created from these buckets. This is followed by taking one row of the data from the probe input and scanning all the hash values using the linked list for a match. Since these buckets are built up-front, the optimizer will try to use a smaller set for making the build input. Hash joins hog a lot of memory and cause high I/O. If you are joining data sets that are large and do not have proper index usages defined for them, then hash joins are the fastest mechanism for those scenarios.

Oracle Guru Tom Kyte has a good presentation on the hash join – this was one of the many questions he has answered on his site – you can view it here.

Merge joins take in larger inputs and require less memory as compared to the hash joins. A merge join would scan all rows of the inner table and all the rows of the outer table and will then merge the two results into a single set. If the input sets are large, then optimizer might perform a many to many operation which means that it will have to visit the same data many times. It will create a temporary area for both the inputs (SQL Server lingo : Table spool operator above the two sets) and will then create a join between the temporary tables.

One Response to “Join Internals”

  1. […] when the optimizer decides to do a Hash Join or a Merge Join (refer to the different join methods here). Bitmap filters are very useful in optimization of data warehouse queries. The star join query […]

Sorry, the comment form is closed at this time.

 
%d bloggers like this: