The DB2 optimizer can choose from a variety of different techniques as it creates optimal access paths for each SQL statement. In Part 1 of this series (http://www.toadworld.com/platforms/ibmdb2/b/weblog/archive/2015/10/29/db2-optimization-and-access-paths-understanding-the-basics-part-1-scans-versus-indexed-access) we discussed simple access paths when a single table is involved. But using an index versus scanning data is only the tip of the iceberg. What about joins and more complicated SQL statements? In part 2 of this 3 part series, we examine the different techniques DB2 can use to join table data.

Join Methods

The DB2 optimizer has at its disposal a series of techniques that can be used to join table data. When more than one DB2 table is referenced in the FROM clause (or a JOIN clause is specified), the SQL is requesting DB2 to perform a join. Based on the join criteria, a series of instructions must be carried out to combine the data from the tables.

How does DB2 do this? Each multi-table query is broken down into separate access paths. The DB2 optimizer selects two of the tables and creates an optimized access path for accomplishing that join. It does not do this randomly, but chooses based on what it deems will be the most optimal manner for joining. The optimizer will then continue joining to the other tables until the entire query has been optimized.

When joining tables, the optimizer will have to determine the best join algorithm to use. The join algorithm, or join method, defines the basic procedure for combining the tables. There are three basic types of joins that DB2 can deploy: nested loop, merge scan, and hash joins. Each join method operates differently from the others but achieves the same results. However, the choice of join method has an important effect on the performance of the join. Each join method used by DB2 is engineered such that, given a set of statistics, optimum performance can be achieved. Therefore, you should understand the different join methods and the factors that cause them to be chosen.

Certain basic steps are common to each join method. In general, the first decision to be made is which table should be processed first. This table is referred to as the outer table. After this decision is made, a series of operations are performed on the outer table to prepare it for joining. Rows from that table are then combined to the second table, called the inner table. A series of operations are also performed on the inner table either before the join occurs, as the join occurs, or both. Although all joins are composed of similar steps, each of the three join methods is quite different when you get beyond the generalities. The optimizer understands the advantages and disadvantages of each method and how the use of that method can affect performance. Based on the current statistics in the system catalog, the optimizer understands also which tables are best for the inner table and the outer table. Considerations used by the optimizer include:

  • The smaller the table is the more likely it will be chosen as the outer table. This helps to reduce the number of times the inner table must be re-accessed.
  • A table is more likely to be chosen as the outer table if selective predicates can be applied to it because the inner table is only accessed for rows satisfying the predicates applied to the outer table.
  • If it is possible to do an index lookup on one of the tables, then that table is a good candidate to use as the inner table. If a table does not have an index, it would not be a good candidate for the inner table because the entire inner table would have to be scanned for every row of the outer table.
  • In most cases the table with fewest duplicates will be chosen as the outer table in a join.

Of course, none of these are hard-and-fast rules. The above list is a broad generalization only. In the end, the optimizer will choose the outer and inner tables based on detailed cost estimates. Let’s now examine the types of joins available to DB2 and how they differ from one another.

Probably the most common type of join method is the nested loop join (NLJ). To perform a NLJ, a qualifying row is identified in the outer table, and then the inner table is scanned searching for a match. A qualifying row is one in which the predicates for columns in the table match. When the inner table scan is complete, another qualifying row in the outer table is identified. The inner table is scanned for a match again, and so on. The repeated scanning of the inner table is usually accomplished with an index to minimize I/O cost.

The second type of join method used by DB2 is the merge join (MJ). With the MJ, the tables to be joined need to be ordered by the join predicates. That means that each table must be accessed in order by the columns that specify the join criteria. This ordering can be the result of either a sort or indexed access. After ensuring that both the outer and inner tables are properly sequenced, each table is read sequentially, and the join columns are matched up. Neither table is read more than once during a merge scan join.

The third type of join depends on the platform on which you are running DB2. For DB2 for z/OS there is the hybrid join. The hybrid join combines data and pointers to access and combine the rows from the tables being joined. You can think of it as a combination (or hybrid) of the nested-loop and merge join techniques. The high-level steps employed by the hybrid join are:

  1. Outer table rows are either retrieved in sequence using an index or the qualifying values are retrieved and sorted. Using an index with a high cluster ratio is beneficial in this step.
  2. For each unique value in the outer table, a matching index scan is used to find RIDs for rows of the inner table. 
  3. Partial rows of qualifying RIDs are built from the inner table along with the qualifying columns from the outer table using work table spaces (DSNDB07). 
  4. The partial rows are sorted in RID sequence to eliminate duplicate RIDs. The work file and sort is not required if the inner table join column has a cluster ratio of 80 percent or better. Also, list prefetch can start before all partial rows are built. 
  5. Process the inner table with list prefetch.

For DB2 for Linux, Unix, and Windows, the third type of join is the hash join. Hash join requires one or more predicates of the form table1.ColX = table2.ColY, and for which the column types are the same. The inner table is scanned and the rows copied into memory buffers drawn from the sort heap allocation. The memory buffers are divided into partitions based on a “hash code” computed from the column(s) of the join predicate(s). If the size of the first table exceeds the available sort heap space, buffers from selected partitions are written to temporary tables. After processing the inner table, the outer table is scanned and its rows are matched to the inner table rows by comparing the “hash code.” Hash joins can require a significant amount of memory. So, for the hash join to produce realistic performance benefits, you may need to change the value of the sortheap database configuration parameter, and the sheapthres database manager configuration parameter.

Now then, which of these join methods should be preferred when? In general, the nested loop join is preferred in terms of execution cost when a small number of rows qualify for the join. As the number of qualifying rows increases, the merge join becomes a better choice. Finally, in the case of a hash join, the inner table is kept in memory buffers. If there are too few memory buffers, then the hash join is obliged to spill. The optimizer attempts to avoid this and so will pick the smaller of the two tables as the inner table, and the larger one as the outer table.

These performance generalizations are vague and your results will depend on the exact number of qualifying rows as well as other factors such as your database design, database organization, accuracy of statistics, type of hardware, and the setup of your DB2 environment.

For DB2 for LUW, there is the additional consideration of the optimization class, which we will discuss next.

Summary

In this second part of our 3-part series on DB2 optimization and access paths we reviewed the methods that DB2 can use to combine data from multiple tables using joins. Next month, in the third and final part of this series, we will examine the DB2 for LUW optimization class and its impact on access path divination.