In part one of our overview of SQL performance we examined relational optimization; part two took a look at query analysis and introduced single-table access path formulation. In today’s blog post, we move forward to look at the various methods that can be used to combine data from more than one table.
Joins
When multiple tables are accessed, the optimizer must determine how to combine the tables in the most efficient manner. When determining the access path for a join, the optimizer must determine the order in which the tables will be joined, compute the overall cost estimate of each access path, and choose a join method for the particular query. The DBMS can utilize several different methods for joining tables.
Regardless of the join method, the DBMS must make several decisions and perform certain operations. The first decision is to choose the table to process first—this table is referred to as the outer table. Next, a series of operations are performed on the outer table to prepare it for joining. Rows from that table are then combined with rows from the second table, called the inner table. A series of operations are also performed on the inner table before the join occurs, as the join occurs, or both.
Although all joins are similar in functionality, each join method works differently behind the scenes. Let’s investigate two common join methods: the nested-loop join and the merge-scan join.
The nested-loop join works by comparing qualifying rows of the outer table to the inner table. A qualifying row is identified in the outer table, and then the inner table is scanned for a match. A qualifying row is one in which the predicates for columns in the table match. When the inner table scan is complete, another qualifying row in the outer table is identified. The inner table is scanned for a match again, and so on. The repeated scanning of the inner table is usually accomplished with an index to avoid undue I/O costs. The smaller the size of the inner table, the better a nested-loop join performs, because fewer rows need to be scanned for each qualifying row of the outer table.
A second type of join method is the merge-scan join. In a merge-scan join, the tables to be joined are ordered by the keys. This ordering can be accomplished by a sort or by access via an index. After ensuring that both the outer and inner tables are properly sequenced, each table is read sequentially, and the join columns are matched. During a merge-scan join, no row from either table is read more than once. Merge-scan joins can be beneficial when an appropriate index is not available on one (or both) of the tables.
Depending on the DBMS, other join methods may be available. For example, star joins to support data warehousing and analytical queries are typical in many DBMSes. Other hybrid methods and methods using hashes also may be supported.
Join Order and Table Selection
The optimizer reviews each join in a query and analyzes the appropriate statistics to determine the optimal order in which the tables should be accessed to accomplish the join. To find the optimal join access path, the optimizer uses built-in algorithms containing knowledge about joins and data volume. It matches this intelligence against the join predicates, databases statistics, and available indexes to estimate which order is more efficient. In general, the optimizer will deploy an algorithm that minimizes the number of times the inner table must be accessed for qualifying outer table rows.
The optimizer understands the advantages and disadvantages of each method and how the use of that method can impact performance. Based on the current statistics in the system catalog, the optimizer chooses which table is best for the inner table and the outer table. Here is a high-level summary of some of the issues that must be considered by the optimizer:
- The smaller the table is the more likely it will be chosen as the outer table. This helps to reduce the number of times the inner table must be re-accessed.
- A table is more likely to be chosen as the outer table if selective predicates can be applied to it because the inner table is only accessed for rows satisfying the predicates applied to the outer table.
- If it is possible to do an index lookup on one of the tables, then that table is a good candidate to use as the inner table. If a table does not have an index, it would not be a good candidate for the inner table because the entire inner table would have to be scanned for every row of the outer table (for a nested-loop join).
- The table with fewest duplicates may have a propensity to be chosen as the outer table in a join.
Of course, none of these are hard and fast rules. In the end, the optimizer will choose the outer and inner tables based on detailed cost estimates.
However, none of today’s relational optimizers are perfect. As such, DBAs and performance analysts must review the choices made by the optimizer by evaluating the access paths generated by the optimizer. This can be an arduous process because it involves understanding the data, statistics, and access path details. Organizations that are serious about achieving optimal performance use automated tools to help evaluate the performance of their SQL.
Summary
Today’s blog post provides a high-level view into SQL optimization and the multi-table access paths. Of course, there are many intricate details that differ from DBMS to DBMS and you will need to understand the way your systems work to achieve the performance you desire.
Through the first three parts of the four-part series we have examined the core processes of SQL optimization, but there is still more to learn. In our next installment we will take a look at some of the important additional SQL optimization considerations you should be aware of.