In part one of our overview of SQL performance we examined relational optimization and the things that impact it. In today’s blog post, we turn our attention to query analysis and the ways that SQL gets turned into executable code.
The optimization process, at a high level, consists of four steps:
- Receive and verify the SQL statement.
- Analyze the environment and optimize the method of satisfying the SQL statement.
- Create machine-readable instructions to execute the optimized SQL.
- Execute the instructions or store them for future execution.
The first thing that needs to be done is to verify that the SQL is written correctly. This does not mean that it will do what you want it to do, just that it conforms to the required syntax. The SQL is parsed and examined; if the parser encounters any errors, the process stops, and you will have to revise the SQL until it is correct. After verifying the SQL syntax, the next step is to check semantics, such as data types, referential constraints, check constraints, views, and triggers.
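These two checks are easy to observe in practice. The sketch below uses Python's built-in sqlite3 module as a convenient stand-in for a DBMS (any DBMS behaves similarly, though the exact error messages differ): a malformed statement fails the syntax check, while a well-formed statement referencing a nonexistent table passes parsing but fails the semantic check.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Malformed SQL fails the syntax check before anything else happens.
try:
    conn.execute("SELECT FROM WHERE")
    syntax_error = None
except sqlite3.OperationalError as exc:
    syntax_error = str(exc)

# Well-formed SQL can still fail the semantic check; here the
# referenced table does not exist in the database.
try:
    conn.execute("SELECT col FROM missing_table")
    semantic_error = None
except sqlite3.OperationalError as exc:
    semantic_error = str(exc)

print(syntax_error)
print(semantic_error)
```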
The second step of this process is the most intriguing. How does the optimizer decide how to execute the vast array of SQL statements that can be sent its way? This query analysis step scans the SQL to determine its overall complexity. The formulation of the SQL statement is a significant factor in determining the access paths chosen by the optimizer. The complexity of the query, the number and type of predicates, the presence of functions, and the presence of ordering clauses enter into the estimated cost that is calculated by the optimizer.
The more complex the SQL statement, the more work the query analysis must do to make sense of it. During query analysis, the optimizer analyzes aspects of the SQL statement and the database system, such as
- Which tables in which database are required
- Whether any views are required to be broken down into underlying tables
- Whether table joins or subselects are required
- Whether UNION, EXCEPT, or INTERSECT are required
- Which indexes, if any, can be used
- How many predicates (conditions in the WHERE clause) must be satisfied
- Which functions must be executed
- Whether the SQL uses OR or AND
- How the DBMS processes each component of the SQL statement
- How much memory has been assigned to the data cache(s) used by the tables in the SQL statement
- How much memory is available for sorting, if the query requires a sort
In other words, the query analysis breaks down the SQL statement into discrete tasks that must be performed to return the query results.
Modern relational optimizers are cost-based, which means that the optimization process will always attempt to formulate an access path for each query that minimizes overall cost. To accomplish this, the optimizer applies query cost formulas that evaluate and weigh multiple factors for each potential access path, including CPU cost, I/O operations, statistical information in the system catalog, and the actual SQL statement code.
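The statistical inputs to cost-based optimization are visible in some DBMSs. In SQLite, for example, the ANALYZE command gathers statistics into the sqlite_stat1 catalog table, which the optimizer consults when costing access paths; other DBMSs use their own commands (such as RUNSTATS or UPDATE STATISTICS) and catalog tables. A minimal sketch, with invented table and index names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER)")
conn.execute("CREATE INDEX ix_a ON t (a)")
conn.executemany("INSERT INTO t VALUES (?)", [(i % 3,) for i in range(300)])

# ANALYZE gathers statistics into the system catalog (sqlite_stat1);
# the optimizer consults them when weighing candidate access paths.
conn.execute("ANALYZE")
stat = conn.execute(
    "SELECT stat FROM sqlite_stat1 WHERE tbl = 't' AND idx = 'ix_a'"
).fetchone()[0]

# The first number is the table's row count; the remaining numbers
# estimate how selective the index is.
row_count = int(stat.split()[0])
print(stat)
```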
The optimizer may rewrite the query, transforming it into an equivalent but more easily compiled and optimized version. Predicate pushdown and transformation can occur at this point. The SQL is then optimized. Multiple access paths will be reviewed and analyzed to choose the least expensive option. The last step is to create the actual executable code.
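Predicate pushdown is easy to observe with an EXPLAIN facility. In the SQLite-based sketch below (the table, view, and index names are invented for illustration), a predicate written against a view is pushed down into the underlying table, where the optimizer can match it against an index:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, cust INTEGER, amt REAL)")
conn.execute("CREATE INDEX ix_orders_cust ON orders (cust)")
conn.execute("CREATE VIEW big_orders AS SELECT * FROM orders WHERE amt > 100")

# The view is broken down into its underlying table, and the predicate
# on cust is pushed down, where it matches the index on orders(cust).
plan = [row[3] for row in conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM big_orders WHERE cust = 42")]
print(plan)
```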
The relational optimizer has numerous options for creating SQL access paths. At a high level, there are methods of accessing data in a single table and methods of combining data in two tables. These can be combined into a series of access methods that create an overall access path for a SQL statement.
For single-table access, data can be retrieved using a scan or an index. After the optimizer determines the indexes available to be used for each predicate, it will decide whether to use a single index, multiple indexes, or no index at all.
It is tempting to say that indexed access will outperform scanned access, but this is not always the case. The optimizer must evaluate the amount of data that must be accessed as well as the nature of the query. For example, if you are creating a report containing every row in a table, then using an index might be slower than just reading all the data using a scan.
Table scans are the simplest form of data access. A table scan is performed simply by reading every row of the table. Depending on the DBMS, an alternate type of scan may exist, called a tablespace scan. The tablespace scan reads every page in the tablespace, which may contain more than one table. A tablespace scan will generally run slower than a table scan because additional I/O is incurred reading data that does not apply to the query.
Another form of scanning is the partition scan. If the DBMS can determine that the data to be accessed exists in certain partitions of a multi-partition table (or tablespace), it can limit the data that is scanned to the appropriate partitions. A partition scan should outperform a table scan or tablespace scan because the amount of I/O required is reduced.
Typically, the optimizer will choose to scan data for one of the following reasons:
- The query cannot be satisfied using an index, possibly because no index is available, no predicate matches the index, or the predicate precludes the use of an index.
- A high percentage of the rows in the table qualify. In this case, using an index is likely to be less efficient because most of the data rows need to be read anyway.
- The indexes that have matching predicates have low cluster ratios and are efficient for only small amounts of data.
- The table is so small that use of an index would actually be detrimental. For small tables, adding index access to the table access can result in additional I/O, instead of less I/O.
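The scan-versus-index decision can be made visible with SQLite's EXPLAIN QUERY PLAN (most DBMSs offer an equivalent EXPLAIN command; the exact plan text varies by product and version). In this sketch, a predicate that matches an index produces an index search, while a predicate on an unindexed column leaves the optimizer no choice but a scan:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b TEXT)")
conn.execute("CREATE INDEX ix_a ON t (a)")

# A predicate on the indexed column a can use the index...
indexed_detail = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM t WHERE a = 1").fetchall()[0][3]
# ...but a predicate on the unindexed column b forces a scan.
scanned_detail = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM t WHERE b = 'x'").fetchall()[0][3]
print(indexed_detail)
print(scanned_detail)
```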
To assist the performance of a scan, the optimizer can invoke data prefetch. Data prefetch causes the DBMS to read data pages sequentially into the data cache even before they are requested. Essentially, data prefetch is a read-ahead mechanism—when data scans get around to requesting the data, it will already exist in memory. Prefetch is particularly useful for scans but can be practical for any type of sequential data access. You should learn how and why your particular DBMS prefetches data.
The majority of your accesses should use indexes, which brings us to the alternative to scanning: indexed access. The optimizer must first discover whether any indexes exist. An index does not have to exist for a column before SQL can be written against it; you can query any column of any table known to the database.
Additionally, at least one indexed column must be referenced within an indexable predicate in the SQL statement. The DBMS is not capable of using an index for every WHERE clause. You must learn what types of predicates can use indexes to ensure that the appropriate indexes are created for the queries in your database applications. Every DBMS has a different list of what is, and what is not, indexable. Furthermore, what is indexable tends to change from version to version of each DBMS.
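One rule that holds across most DBMSs: wrapping an indexed column in a function or expression usually makes the predicate non-indexable, because the stored key values no longer match what is being compared. The SQLite sketch below (table and index names invented) shows the same column producing an index search with a plain predicate, and a scan once a function is applied to it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, dept TEXT)")
conn.execute("CREATE INDEX ix_name ON emp (name)")

# A plain equality predicate on the indexed column is indexable...
sargable = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM emp WHERE name = 'SMITH'"
).fetchall()[0][3]
# ...but applying a function to the column defeats the index, because
# the index stores name values, not UPPER(name) values.
non_sargable = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM emp WHERE UPPER(name) = 'SMITH'"
).fetchall()[0][3]
print(sargable)
print(non_sargable)
```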
The optimizer can choose to use an index in many different ways. The first, and simplest, type of indexed access is the direct index lookup. For the DBMS to be able to perform a direct index lookup, values must be provided for each column in the index. To perform a direct index lookup, the DBMS compares the value requested in the predicate to the values stored in the root page of the index. Based on this comparison, the DBMS will traverse the index to the next-lower set of pages. If intermediate nonleaf pages exist, the appropriate nonleaf page is read, and the value is compared to determine which leaf page to access. The appropriate leaf page is read; the index leaf page contains pointer(s) to the actual data for the qualifying rows. Based on the pointer(s) in the leaf page index entries, the DBMS reads the appropriate table data pages.
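The traversal just described can be sketched with a toy tree. The node layout, page names, and row pointers below are purely illustrative (real index formats are far more elaborate), but the descent from root, through a nonleaf level, to a leaf entry pointing at the data row is the same:

```python
import bisect

# Illustrative index tree: interior nodes hold separator keys and child
# references; leaf nodes map full key values to row pointers (page, slot).
root = {"keys": [50], "children": [
    {"keys": [20, 40], "children": [
        {"leaf": {10: ("page1", 0), 15: ("page1", 3)}},
        {"leaf": {20: ("page2", 1), 30: ("page2", 4)}},
        {"leaf": {40: ("page3", 0), 45: ("page3", 2)}},
    ]},
    {"keys": [70], "children": [
        {"leaf": {50: ("page4", 0), 60: ("page4", 1)}},
        {"leaf": {70: ("page5", 0), 90: ("page5", 2)}},
    ]},
]}

def direct_lookup(node, key):
    # Descend from the root, choosing the child whose key range covers
    # the requested value, until a leaf page is reached; then return
    # that leaf's row pointer for the key, if one exists.
    while "leaf" not in node:
        node = node["children"][bisect.bisect_right(node["keys"], key)]
    return node["leaf"].get(key)

print(direct_lookup(root, 30))  # pointer to the data page for key 30
```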
However, assume that not all of the columns of the index are supplied in the SQL statement. A direct index lookup cannot be chosen because the DBMS cannot match the full index key. Instead, an index scan could be chosen. When an index scan is invoked, the leaf pages of the index are read sequentially, one after the other.
There are two basic types of index scans: matching index scans and nonmatching index scans. A matching index scan is sometimes called absolute positioning. A matching index scan begins at the root page of an index and works down to a leaf page in much the same manner as a direct index lookup does. However, because the complete key of the index is not available, the DBMS must scan the leaf pages of the index looking for the values that are available, until all matching values have been retrieved.
For a matching index scan to be used, you must specify the high-order column in the index key; that is, the first column specified in the index DDL. The high-order column provides the starting point for the DBMS to traverse the index structure from the root page to the appropriate leaf page.
Consider the consequences of not specifying the high-order column in the query. The DBMS can deploy a nonmatching index scan, sometimes referred to as relative positioning. When a starting point cannot be determined because the first column in the index key is not specified, the DBMS cannot use the index tree structure. However, it can scan the index leaf pages. A nonmatching index scan begins with the first leaf page in the index and scans subsequent leaf pages sequentially, applying the available predicates.
A nonmatching index scan can be more efficient than a table or tablespace scan, especially if the data pages that must be accessed are in clustered order. Additionally, keep in mind that index pages (or blocks) contain more entries than table pages because the index “row” is shorter than the table row, thereby making index page I/O more efficient than scanning table pages.
Today we’ve looked at query analysis and access path formulation at a high level, covering the components of query analysis and the single-table access methods. But there is still more to learn. In our next installment we will examine the multi-table access methods that can be used during relational optimization.