Application developers must code efficient SQL and understand how to optimize SQL in order to develop efficient applications. But it is not just programmers who must understand SQL optimization; DBAs must as well. It is a joint responsibility, and it frequently requires robust tools to help code, modify, and optimize SQL effectively.
The relational optimizer is at the heart of the DBMS. It is an inference engine responsible for determining the best possible database navigation strategy for any given SQL request. The developer specifies what data is needed by coding the SQL statements, the DBMS accesses metadata about where the data is located, and the relational optimizer decides how to efficiently navigate the database. The end user needs no knowledge of where and how the actual data is stored. The optimizer knows this information.
To optimize SQL, the relational optimizer must analyze each SQL statement by parsing it to determine the tables and columns that must be accessed. The optimizer will also access statistics stored in either the system catalog or the database objects themselves. The statistics are used to determine the best method of accomplishing the tasks that need to be performed to satisfy the SQL request. This process is called relational optimization.
Relational optimization is very powerful because it allows queries to adapt to a changing database environment. The optimizer can react to changes by formulating new access paths without requiring application coding changes to be implemented. The application can therefore be flexible as tables expand or contract in size, as indexes are added or removed, and as the database becomes disorganized or reorganized.
Regardless of how the data is physically stored and manipulated, SQL can be used to access data, and the DBMS will take the current state of the database into account to optimize that data access. This separation of access criteria from physical storage characteristics is called physical data independence.
Every relational database system relies on an optimizer to render SQL statements into executable access paths. Furthermore, each vendor’s relational optimizer works a little differently, with different steps and using different information. Nevertheless, the core of the process is the same from DBMS to DBMS. The optimizer parses the SQL statement and performs various phases of optimization, typically involving verification of syntactic and semantic correctness, followed by query analysis and formulation of the access paths to satisfy the query.
The relational optimizer can deploy many types of strategies for optimizing SQL statements. The internal operations and instructions used by each DBMS's optimizer are closely guarded secrets. Modern relational optimizers are cost based, meaning that the optimizer attempts to formulate the access path for each query that minimizes its overall cost. To function in this manner, the optimizer must evaluate and analyze multiple factors, including estimated CPU and I/O costs, database statistics, and the actual SQL statement.
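Most relational products let you examine the access path the optimizer has chosen. As a rough sketch, the statements below use DB2's EXPLAIN facility and SQL Server's SHOWPLAN option; the query, table names, and QUERYNO are illustrative, and the exact syntax and explain tables vary by DBMS and version.

```sql
-- DB2 (z/OS) sketch: ask the optimizer to record its chosen access path
-- in the PLAN_TABLE, then inspect it (QUERYNO and object names are illustrative).
EXPLAIN PLAN SET QUERYNO = 100 FOR
  SELECT LASTNAME, FIRSTNAME
  FROM   EMPLOYEE
  WHERE  DEPTNO = 'D01';

SELECT QUERYNO, ACCESSTYPE, ACCESSNAME, MATCHCOLS
FROM   PLAN_TABLE
WHERE  QUERYNO = 100;

-- SQL Server sketch: show the estimated plan instead of running the query.
SET SHOWPLAN_ALL ON;
SELECT LASTNAME, FIRSTNAME
FROM   dbo.EMPLOYEE
WHERE  DEPTNO = 'D01';
SET SHOWPLAN_ALL OFF;
```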
CPU and I/O Costs
The optimizer uses formulas and models to estimate the machine cost to run each potential access path for the query being optimized. Using information about the processing each access path requires, the optimizer can arrive at a rough estimate of the CPU time needed to run the query with each candidate access path it analyzes.
Furthermore, a relational optimizer must estimate the cost of physically reading and writing the data. The optimizer estimates the I/O cost of the query using a series of formulas based on the database statistics, the efficiency of the data cache, and the cost of I/O to intermediate work files. These formulas incorporate a filter factor, an estimate of the proportion of rows that will satisfy the query's predicates, which helps determine the relative I/O cost of the query.
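To make the idea concrete, here is a hedged sketch of how a filter factor might be estimated for a simple equality predicate. The table, column, and statistics values are purely illustrative, and each DBMS applies its own, more elaborate formulas.

```sql
-- Illustrative only: suppose the statistics report 1,000,000 rows in
-- CUSTOMER and 50 distinct values in the STATE column.
SELECT CUSTNO, LASTNAME
FROM   CUSTOMER
WHERE  STATE = 'TX';

-- A common rule of thumb for an equality predicate:
--   filter factor = 1 / (number of distinct values) = 1 / 50 = 0.02
-- Estimated qualifying rows = 0.02 * 1,000,000 = 20,000 rows,
-- an estimate the optimizer can feed into its I/O cost formulas when it
-- compares, say, a full table scan against an index on STATE.
```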
Database Statistics
A relational optimizer is of little use without accurate statistics about the data stored in the database. A relational DBMS provides a utility program or command to gather statistics about database objects and to store them for use by the optimizer (or by the DBA for performance monitoring). For example, to collect statistics in DB2, the DBA must execute the RUNSTATS utility; to collect statistics in SQL Server the UPDATE STATISTICS command is issued.
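As a sketch, the statements below show what those requests might look like; the database, tablespace, and table names are illustrative, and the available options differ by DBMS and version.

```sql
-- DB2 for z/OS utility control statement (run through the utility, not as SQL):
-- gather statistics for a tablespace, its tables, and its indexes.
RUNSTATS TABLESPACE MYDB.MYTS TABLE(ALL) INDEX(ALL)

-- DB2 for Linux/UNIX/Windows command equivalent:
RUNSTATS ON TABLE MYSCHEMA.CUSTOMER WITH DISTRIBUTION AND INDEXES ALL

-- SQL Server: refresh statistics for one table, or for all tables in the database.
UPDATE STATISTICS dbo.Customer;
EXEC sp_updatestats;
```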
It is necessary to refresh statistics whenever a significant volume of data has been added or modified. Failure to do so causes the optimizer to base its cost estimates on inaccurate statistics, which can be detrimental to query performance.
Database statistics provide the optimizer with information about the state of the tablespaces, tables, columns, and indexes. The DBMS collects statistical information such as
- Number of rows in the tablespace, table, or index
- Number of unique values stored in the column
- Most frequently occurring values for columns
- Index key density, or the average percentage of duplicate values stored in the index key column(s)
- Details on the ratio of clustering for clustered tables
- Correlation of columns to other columns
- Structural state of the index or tablespace
- Amount of storage used by the database object
Not all statistics are gathered every time you request new statistics; you can specify which types of database statistics to collect. Of course, the exact statistics collected vary from DBMS to DBMS; your database system may offer more or fewer statistics than those listed here. The key, though, is to keep the statistics as accurate as possible to ensure efficient and useful relational optimization.
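If you want to see the statistics the optimizer will rely on, most products expose them through the system catalog or a command. The queries below are a rough sketch; the catalog tables, columns, and object names shown are illustrative and vary by DBMS and version.

```sql
-- DB2 (z/OS) sketch: row counts, page counts, and collection time for tables in a schema.
SELECT NAME, CARDF, NPAGESF, STATSTIME
FROM   SYSIBM.SYSTABLES
WHERE  CREATOR = 'MYSCHEMA';

-- SQL Server sketch: histogram and density details for one statistics object.
DBCC SHOW_STATISTICS ('dbo.Customer', 'IX_Customer_State');
```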
When developing an application against test databases, the statistics for the test data will not accurately reflect the statistics for the production database. Whenever possible, the DBA should work with the application development team to create a script to populate production statistics into the test system. Depending on the DBMS, this may be accomplished with SQL statements or a data testing tool. Without production statistics, the DBMS will likely choose different access paths in the test environment from the ones it would choose in production—potentially causing performance problems when the application goes into production status.
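As a hedged sketch of the SQL-statement approach: DB2, for example, permits certain catalog statistics columns to be updated directly, so a script can plug production values into the test catalog. The schema, table, and values below are illustrative, and not every DBMS permits direct catalog updates.

```sql
-- Illustrative only: make the test catalog describe a production-sized table
-- so the optimizer chooses production-like access paths in test.
UPDATE SYSIBM.SYSTABLES
SET    CARDF   = 1000000,
       NPAGESF = 25000
WHERE  CREATOR = 'TESTSCHEMA'
  AND  NAME    = 'CUSTOMER';
```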
Summary
In today’s blog post we introduced the topic of relational optimization, including its requirements and important issues to consider. In the next part, we will take a look at query analysis and some of the methods that the optimizer can deploy to formulate SQL access paths.