SQL Performance, Part 1: Relational Optimization

Nov 13, 2018


Application developers must write efficient SQL and understand how to optimize it in order to develop efficient applications. But SQL optimization is not just the programmer's concern; the DBA must understand it, too. It is a joint responsibility, and it frequently requires robust tools to help code, modify, and optimize SQL effectively.

The relational optimizer is at the heart of the DBMS. It is an inference engine responsible for determining the best possible database navigation strategy for any given SQL request. The developer specifies what data is needed by coding the SQL statements, the DBMS accesses metadata about where the data is located, and the relational optimizer decides how to efficiently navigate the database. The end user needs no knowledge of where and how the actual data is stored. The optimizer knows this information.

To optimize SQL, the relational optimizer must analyze each SQL statement by parsing it to determine the tables and columns that must be accessed. The optimizer will also access statistics stored in either the system catalog or the database objects themselves. The statistics are used to determine the best method of accomplishing the tasks that need to be performed to satisfy the SQL request. This process is called relational optimization.
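This process can be observed with the EXPLAIN facility that most DBMSs provide. The sketch below uses SQLite (via Python's sqlite3 module) purely as a convenient stand-in for any relational DBMS; the table, column, and index names are invented for the example:

```python
import sqlite3

# Illustrative only: SQLite stands in for any relational DBMS here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (id INTEGER PRIMARY KEY, dept TEXT, salary REAL)")
conn.executemany("INSERT INTO emp (dept, salary) VALUES (?, ?)",
                 [("sales", 50000 + i) for i in range(100)])

# With no supporting index, the only available access path is a table scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM emp WHERE dept = 'sales'").fetchall()
print(plan)  # plan detail reads something like: SCAN emp

# After an index is added, the same SQL is re-optimized to use it --
# no application coding change is required.
conn.execute("CREATE INDEX emp_dept_ix ON emp (dept)")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM emp WHERE dept = 'sales'").fetchall()
print(plan)  # plan detail now mentions emp_dept_ix
```

The same statement yields two different access paths depending on the state of the database, which is exactly the flexibility described above.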

Relational optimization is very powerful because it allows queries to adapt to a changing database environment. The optimizer can react to changes by formulating new access paths without requiring application coding changes to be implemented. The application can therefore be flexible as tables expand or contract in size, as indexes are added or removed, and as the database becomes disorganized or reorganized.

Regardless of how the data is physically stored and manipulated, SQL can be used to access data, and the DBMS will take the current state of the database into account to optimize that data access. This separation of access criteria from physical storage characteristics is called physical data independence.

Every relational database system relies on an optimizer to render SQL statements into executable access paths. Furthermore, each vendor’s relational optimizer works a little differently, with different steps and using different information. Nevertheless, the core of the process is the same from DBMS to DBMS. The optimizer parses the SQL statement and performs various phases of optimization, typically involving verification of syntactic and semantic correctness, followed by query analysis and formulation of the access paths to satisfy the query.

The relational optimizer can deploy many types of strategies for optimizing SQL statements. The internal operations and instructions that are used by each DBMS’s optimizer are closely guarded secrets. Modern relational optimizers are cost based, meaning that the optimizer will attempt to formulate an access path for each query that reduces overall cost. To function in this manner, the optimizer must evaluate and analyze multiple factors, including estimated CPU and I/O costs, database statistics, and the actual SQL statement.
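A drastically simplified sketch of cost-based access path selection follows. The cost formulas, constants, and figures here are invented for illustration; real optimizers use proprietary, far more elaborate models:

```python
# Invented, drastically simplified cost model -- real optimizers use
# proprietary formulas with many more inputs.

def scan_cost(table_pages: int, rows: int, cpu_per_row: float = 0.01) -> float:
    """Price a full table scan: read every page, examine every row."""
    return table_pages + rows * cpu_per_row

def index_cost(index_levels: int, qualifying_rows: int,
               cpu_per_row: float = 0.01) -> float:
    """Price an index access: traverse the index, then fetch each qualifying row."""
    return index_levels + qualifying_rows * (1 + cpu_per_row)

# Hypothetical figures: a 1,000,000-row table on 10,000 pages, where the
# predicate qualifies only 200 rows reachable via a 3-level index.
candidates = {
    "table scan": scan_cost(table_pages=10_000, rows=1_000_000),
    "index access": index_cost(index_levels=3, qualifying_rows=200),
}
print(min(candidates, key=candidates.get))  # index access
```

The essential idea survives the simplification: price every candidate access path and keep the cheapest.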

CPU and I/O Costs

The optimizer uses formulas and models to estimate the machine cost to run each potential access path for the query being optimized. Based on CPU information, the optimizer can arrive at a rough estimate of the CPU time required to run the query using each optimized access path it analyzes.

Furthermore, a relational optimizer must estimate the cost of actually writing and retrieving the data. The optimizer estimates the I/O cost of a query using a series of formulas based on the database statistics, the efficiency of the data cache, and the cost of I/O to intermediate work files. These formulas yield a filter factor, which determines the relative I/O cost of the query.
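As a hedged illustration, the classic first-order filter factor for an equality predicate is one divided by the number of distinct values in the column; multiplying it by the table's row count estimates how many rows qualify. The function names and figures below are invented for the example:

```python
# Hedged sketch: the textbook first-order filter factor for "col = literal".
# Real optimizers refine this with value distributions, histograms, and more.

def filter_factor_equality(distinct_values: int) -> float:
    """Estimated fraction of rows that qualify for an equality predicate."""
    return 1.0 / distinct_values

def estimated_qualifying_rows(table_rows: int, distinct_values: int) -> float:
    """Scale the filter factor by the table's row count."""
    return table_rows * filter_factor_equality(distinct_values)

# A 1,000,000-row table whose DEPT column holds 50 distinct values:
print(estimated_qualifying_rows(1_000_000, 50))  # 20000.0
```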

Database Statistics

A relational optimizer is of little use without accurate statistics about the data stored in the database. A relational DBMS provides a utility program or command to gather statistics about database objects and to store them for use by the optimizer (or by the DBA for performance monitoring). For example, to collect statistics in DB2, the DBA executes the RUNSTATS utility; to collect statistics in SQL Server, the DBA issues the UPDATE STATISTICS command.
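The sketch below shows the same idea in miniature using SQLite, whose ANALYZE command plays the role of RUNSTATS or UPDATE STATISTICS; the table and index names are invented for the example:

```python
import sqlite3

# SQLite's ANALYZE stands in for RUNSTATS / UPDATE STATISTICS here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT)")
conn.execute("CREATE INDEX orders_region_ix ON orders (region)")
conn.executemany("INSERT INTO orders (region) VALUES (?)",
                 [(r,) for r in ["east", "west"] * 500])

conn.execute("ANALYZE")  # gather statistics for the optimizer

# SQLite stores the gathered statistics in the sqlite_stat1 catalog table:
# for each index, the row count and the average rows per distinct key value.
for tbl, idx, stat in conn.execute("SELECT tbl, idx, stat FROM sqlite_stat1"):
    print(tbl, idx, stat)  # e.g. orders orders_region_ix 1000 500
```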

It is necessary to refresh statistics whenever a significant volume of data has been added or changed. Failure to do so causes the optimizer to base its cost estimates on stale statistics, which can be detrimental to query performance.

Database statistics provide the optimizer with information about the state of the tablespaces, tables, columns, and indexes. The DBMS collects statistical information such as

  • Number of rows in the tablespace, table, or index
  • Number of unique values stored in the column
  • Most frequently occurring values for columns
  • Index key density, or the average percentage of duplicate values stored in the index key column(s)  
  • Details on the ratio of clustering for clustered tables
  • Correlation of columns to other columns
  • Structural state of the index or tablespace
  • Amount of storage used by the database object

Not all statistics are gathered every time you request new statistics; you can specify which type of database statistics to collect. Of course, the exact statistics collected vary from DBMS to DBMS; additional or fewer statistics may be available within your database system. The key, though, is to keep the statistics as accurate as possible to ensure efficient and useful relational optimization.

When developing an application against test databases, the statistics for the test data will not accurately reflect the statistics for the production database. Whenever possible, the DBA should work with the application development team to create a script to populate production statistics into the test system. Depending on the DBMS, this may be accomplished with SQL statements or a data testing tool. Without production statistics, the DBMS will likely choose different access paths in the test environment from the ones it would choose in production—potentially causing performance problems when the application goes into production.
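As a hedged sketch of the idea, SQLite keeps its optimizer statistics in an ordinary catalog table (sqlite_stat1) and documents that edited statistics can be reloaded with ANALYZE sqlite_master, so production numbers can simply be copied into a test database. Commercial DBMSs offer their own export/import mechanisms; consult your vendor's documentation. The helper function below is hypothetical:

```python
import sqlite3

# Hypothetical helper: copy optimizer statistics from a production SQLite
# database into a test database that has the same schema.

def copy_stats(prod_path: str, test_path: str) -> None:
    prod = sqlite3.connect(prod_path)
    test = sqlite3.connect(test_path)
    test.execute("ANALYZE")  # ensure sqlite_stat1 exists in the test database
    test.execute("DELETE FROM sqlite_stat1")  # discard the test-data statistics
    test.executemany(
        "INSERT INTO sqlite_stat1 (tbl, idx, stat) VALUES (?, ?, ?)",
        prod.execute("SELECT tbl, idx, stat FROM sqlite_stat1").fetchall(),
    )
    test.commit()
    test.execute("ANALYZE sqlite_master")  # reload the edited statistics
    test.close()
    prod.close()
```

After this runs, the test system's optimizer prices access paths against production-sized row counts even though the test tables hold little data.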

Summary

In today’s blog post we introduced the topic of relational optimization, including its requirements and important issues to consider. In the next part, we will take a look at query analysis and some of the methods that the optimizer can deploy to formulate SQL access paths.