Today’s blog post concludes our four-part look into SQL performance and optimization. Part one introduced and explained relational optimization; part two reviewed query analysis and introduced single-table access path formulation; part three took a look at multi-table SQL optimization… and today we conclude with a quick overview of several additional aspects of SQL optimization that we have yet to discuss.
Using Indexes to Avoid Sorts
To satisfy certain types of SQL requests it may be necessary for the DBMS to sort data. But sorting is cost prohibitive and should be avoided if at all possible. The relational optimization process understands the overhead of sorting and bakes this into its optimization decisions.
That said, there are things that can be done to help the optimizer out. DBAs can design indexes to avoid sorts by creating them on columns that need to be sorted. When indexes exist the relational optimizer will try to use those indexes to avoid sorts as much as possible. Sorting might occur when the following clauses are specified:
- DISTINCT: When this clause is specified, the DBMS requires every column of the resulting data to be in order so that duplicate rows can be removed from the results set.
- UNION: This operation requires the columns in each SELECT list to be ordered because the results set can have no duplicate rows. If the DBMS supports INTERSECT and EXCEPT, the same consideration applies to those operations.
- GROUP BY: When this clause is specified, the DBMS requires data to be sorted by the specified columns in order to aggregate data.
- ORDER BY: When this clause is specified, the DBMS will ensure that the results set is sorted by the specified columns.
Consider the following SQL statement:
SELECT last_name, first_name, middle_initial, empno, position
WHERE position in ('MANAGER', 'DIRECTOR', 'VICE PRESIDENT')
ORDER BY last_name;
If an index exists on the last_name column, the query can use this index and avoid sorting. Using an index to avoid a sort trades off the additional CPU cost required to sort for the additional I/O cost required for indexed access. Of course, if the index is going to be used anyway, the choice is a no-brainer. Whether or not using an index is actually faster than scanning the data and sorting depends on several factors, including the number of qualifying rows, the speed of the sort and characteristics of the index (such as clustering).
Additionally, when coding queries that specify the UNION operation, take care to examine the requirements of your application. If there could be no duplicate rows, or if you do not care if duplicates exist in the results set, you can specify UNION ALL to avoid the sort for duplicate removal.
Why Wasn’t an Index Chosen?
Situations sometimes arise where you think the optimizer should have chosen an index, but it didn’t. Any number of reasons can cause the optimizer to avoid using an index.
Consult the following checklist for ways to encourage index selection:
Does the query specify a search argument? If no predicate uses a search argument, the optimizer cannot use an index to satisfy the query.
Are you joining a large number of tables? The optimizer within some DBMSs may produce unpredictable query plan results when joining a large number of tables.
Are statistics current? If large amounts of data have been inserted, updated, and/or deleted, database statistics should be recaptured to ensure that the optimizer has up-to-date information upon which to base its query plans.
Are you accessing a small table? For very small tables simply accessing all of the data and sorting it can be more efficient than using an index.
Are you using stored procedures? Sometimes the DBMS provides options whereby a stored procedure, once compiled, will not reformulate a query plan for subsequent executions. You may need to recompile or reoptimize the stored procedure to take advantage of up-to-date statistics, new indexes, or any other pertinent database changes.
Are additional predicates needed? A different WHERE clause might possibly enable the optimizer to consider a different index.
And there are some tables where there may not even be any indexes defined at all. It is a myth that every table must have at least one index. Here are some situations when it can make sense to have no indexes defined on a table:
When all accesses retrieve every row of the table. Because every row will be retrieved every time you want to use the table, an index (if used) would just add extra I/O and would degrade, not improve performance. Though not extremely common, you may come across such tables in your organization.
A very small table with only a few pages of data and no primary key or uniqueness requirements does not need to be indexed. Small tables (perhaps 20 to 30 or so pages) might not need an index because simply reading all of the pages is very efficient already.
When performance doesn’t matter and the table is only accessed very infrequently. But, hey, when do you ever have that type of requirement in the real world?
Other than for these circumstances, you will most likely want to build one or more indexes on each table, and try to get the DBMS to use those indexes.
Some database systems also support hashing. A hash is an algorithm that converts a key (one or more columns) into a small number, usually a storage location on disk. The values returned by a hash function are called hash values, hash sums, or simply hashes. Key values are processed by the hash algorithm (also known as a hash function, hash routine, or randomizer) and translated into a storage location. When data is inserted, the hash algorithm tells the DBMS where to physically store the data; when data is accessed by the key, the algorithm tells the DBMS exactly where to find the data.
A hash can be more efficient than even a direct index lookup because it generally incurs less I/O. Instead of having to traverse through multiple index pages (from the root through nonleaf pages out to the leaf page and then to the data), a hash converts the key to a specific location on disk. This incurs a single I/O in the best-case scenario. If the hashing algorithm produces the same location for multiple keys, hash collisions occur, which require an additional I/O.
Hashing is used primarily to optimize random I/O for small amounts of data, such as for looking up a code table value or accessing a single row based on the value of its primary key. Nevertheless, hashing is not commonly implemented in relational systems.
If and when hashed exist, the optimizer will consider using any hashing structures when formulating access paths.
The relational optimizer may choose to run queries in parallel. When query parallelism is invoked by the DBMS, multiple simultaneous tasks are invoked to access the data. Three basic types of parallelism can be supported by the DBMS:
- I/O parallelism enables concurrent I/O streams to be initiated for a single query. Running parallel I/O tasks can significantly enhance the performance of I/O bound queries. Breaking the data access for the query into concurrent I/O streams executed in parallel can reduce the overall elapsed time for the query.
- CPU parallelism enables multitasking of CPU processing within a query. Invoking CPU parallelism frequently also invokes I/O parallelism to enable each CPU engine with its own I/O stream. CPU parallelism decomposes a query into multiple smaller queries that can be executed concurrently on multiple processors. CPU parallelism can further reduce the elapsed time for a query.
- Finally, the DBMS can deploy system parallelism to further enhance parallel query operations. System parallelism enables a single query to be broken up and run across multiple DBMS instances. Allowing a single query to take advantage of the processing power of multiple DBMS instances can decrease the overall elapsed time for a complex query even further.
Parallelism is not a panacea. Typically, the optimizer will have to formulate multiple access paths (one for parallelism and one for single-streamed) to have a fallback method if the appropriate system resources are not available for parallel processing. That said, it can make sense to enable the optimizer to build parallel access paths for SQL that is I/O and/or CPU bound.
This blog series was designed to provide a DBMS-agnostic introduction to SQL performance and optimization. There are, of course, a myriad of additional details that must be understood and digested to do the topic justice. Foremost among these details is the exact features and support for the DBMS that you are working with.
Ensuring that proper query plans are formulated using the correct indexes and access methods is a time-consuming process, but one that can pay huge dividends in the form of enhanced performance. DBAs should train their application development staff to understand relational optimization and to create optimal SQL. The onus falls on the application developer to code efficient SQL and program logic.
However, the DBA is the sentry of relational database performance. When performance problems occur, the DBA is the one who has to search for the cause of the problem and suggest remedies to resolve it. Furthermore, the DBA should conduct design reviews to seek out and tune inefficient SQL before suboptimal access paths and programs are migrated to production status.