Imagine that you’re the lead DBA managing a high-profile, mission-critical production database whose application processes $20 million per hour. The database schema is complex, with approximately 1,000 tables and 1,500 indexes. Moreover, the application code and its SQL statements were written by many different developers over the years, with varying SQL programming skills. In short, it’s your typical DBA job description and the fundamental situation you find yourself in.
Now imagine that the database application experiences a noticeable slowdown, and key business management is asking (or possibly even screaming) for you to fix it, and fix it now. So what’s the currently recommended performance resolution approach for the DBA to follow?
There’s a common best practice, often adhered to silently without even realizing it’s being followed: do not throw hardware at a performance problem. In fact, it’s sometimes true that throwing more hardware at a problem can make things worse, but it’s not a given. Twenty or even ten years ago this was sound advice, because hardware was very expensive, especially compared to the consulting person-hours needed to find and fix the issue. Furthermore, the application often could not be taken down to permit hardware upgrades. So historically, DBAs have tried to research and locate the underlying problem and then fix it. However, this approach could sometimes yield little to no improvement at a high person-hour cost and an even higher cost in business opportunities missed due to a slow system. But it was basically our only choice.
Today, however, things are quite different: hardware is cheap, amazing new hardware technologies exist, robust hypervisors provide impressive virtualization options, and we now have the cloud, where dynamic scale-up or scale-out is simply a check box and an increased hourly cost. Therefore today’s DBAs have many viable options to consider. Remember that the business generally does not demand an academically ideal solution, but an expedient one that minimizes downtime and lost business. So in some cases, trying options that were historically bad advice, such as throwing hardware at the problem or a simple dynamic scale-up or scale-out, might be a reasonable alternative. But DBAs will need to throw away the baggage of past best practices that may no longer apply.
Let’s look at a couple of simple examples. While these may not be the best possible advice in every case, the idea is to demonstrate some contemporary alternatives that might have seemed unreasonable, or to be avoided, based on past experience.
- Pre-canned reports and ad hoc queries that perform lots of sort and order-by operations overwhelm the I/O bandwidth of the TEMP space residing on traditional magnetic disk. Moving TEMP to a simple SSD (solid-state disk) or even a PCIe NVMe (Non-Volatile Memory Express) flash disk might remedy the fundamental problem for a predictable and very low cost of roughly $500 to $5,000 (see the first sketch after this list).
- Transaction log files and transaction rollback mechanisms overwhelm the I/O bandwidth of the LOG and/or ROLLBACK space residing on traditional magnetic disk. Once again, a simple move to an SSD or PCIe NVMe flash disk might solve the underlying problem for a predictable and very low cost of roughly $500 to $5,000 (see the second sketch after this list).
- The database is I/O starved, where a low database memory data cache hit ratio forces a higher proportion of logical reads to become physical reads. This issue actually has several viable alternatives:
  - Some databases permit using an SSD or PCIe NVMe flash disk as an extended, albeit slower, tier of the data cache (e.g. Oracle’s Database Smart Flash Cache and SQL Server’s Buffer Pool Extension; see the third sketch after this list).
  - If virtualized, then simply allocate more RAM to the VM; most hypervisors permit this without downtime. Then increase the database data cache to use the additional RAM, which also often does not require downtime.
  - If in the cloud, then simply step up to the next capability/price tier; most clouds permit this without downtime. Then increase the database data cache to use the additional RAM, which also often does not require downtime (see the fourth sketch after this list).
- The database is CPU starved, where a consistently high CPU utilization percentage results in a higher degree of process wait time (i.e. run queue). This issue actually has several viable alternatives:
  - If virtualized, then simply allocate more CPU to the VM; most hypervisors permit this without downtime. Then increase the database configuration to use the additional CPU, which also often does not require downtime and is sometimes automatically detected and applied.
  - If in the cloud, then simply step up to the next capability/price tier; most clouds permit this without downtime. Then increase the database configuration to use the additional CPU, which also often does not require downtime and is sometimes automatically detected and applied (see the final sketch after this list).
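
To make the first example concrete, here is a minimal sketch of relocating TEMP to flash, written in Oracle syntax. The mount point /u02/nvme, the file names, and the 32 GB size are illustrative assumptions, not recommendations, and other databases offer equivalent mechanisms.

```sql
-- Create a new temporary tablespace on the NVMe-backed mount point
-- (path and size are hypothetical placeholders).
CREATE TEMPORARY TABLESPACE temp_nvme
  TEMPFILE '/u02/nvme/temp_nvme01.dbf' SIZE 32G AUTOEXTEND ON NEXT 1G;

-- Make it the database default so new sort and hash spills land on flash.
ALTER DATABASE DEFAULT TEMPORARY TABLESPACE temp_nvme;

-- Once no active sessions still reference the old TEMP tablespace, drop it.
DROP TABLESPACE temp INCLUDING CONTENTS AND DATAFILES;
```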
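The second example follows the same pattern. This sketch adds Oracle online redo log groups on flash and retires the old ones; again, the paths, group numbers, and 1 GB size are assumptions for illustration.

```sql
-- Add new redo log groups on the NVMe-backed storage.
ALTER DATABASE ADD LOGFILE GROUP 4 ('/u02/nvme/redo04.log') SIZE 1G;
ALTER DATABASE ADD LOGFILE GROUP 5 ('/u02/nvme/redo05.log') SIZE 1G;
ALTER DATABASE ADD LOGFILE GROUP 6 ('/u02/nvme/redo06.log') SIZE 1G;

-- Cycle the logs so the old groups go INACTIVE, then drop them.
ALTER SYSTEM SWITCH LOGFILE;
ALTER SYSTEM CHECKPOINT;
ALTER DATABASE DROP LOGFILE GROUP 1;  -- repeat for groups 2 and 3 once INACTIVE
```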
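For the extended data cache alternative, both Oracle (Database Smart Flash Cache) and SQL Server (Buffer Pool Extension) expose it as a configuration change. The file paths and 64 GB sizes below are placeholder assumptions; check platform support and restart requirements for your version.

```sql
-- Oracle Database Smart Flash Cache: stage the settings in the SPFILE
-- (takes effect at the next instance restart).
ALTER SYSTEM SET db_flash_cache_file = '/u02/nvme/flash_cache.dat' SCOPE = SPFILE;
ALTER SYSTEM SET db_flash_cache_size = 64G SCOPE = SPFILE;

-- SQL Server Buffer Pool Extension: enable against an SSD/NVMe-backed file.
ALTER SERVER CONFIGURATION
  SET BUFFER POOL EXTENSION ON
  (FILENAME = 'F:\NVME\buffer_pool_extension.bpe', SIZE = 64 GB);
```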
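Once extra RAM has been granted to the VM or cloud instance, growing the data cache is typically a single online parameter change. The 96 GB targets below are illustrative assumptions.

```sql
-- Oracle: grow the SGA target online (must fit within sga_max_size).
ALTER SYSTEM SET sga_target = 96G SCOPE = BOTH;

-- SQL Server: raise the buffer pool ceiling online (value is in MB).
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 98304;
RECONFIGURE;
```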
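Finally, for the CPU-starved case, most database engines detect newly added vCPUs automatically; where manual control is preferred, Oracle exposes it as a dynamic parameter. The value of 16 below is an assumption.

```sql
-- Oracle: tell the instance to use the newly added vCPUs (dynamic parameter).
ALTER SYSTEM SET cpu_count = 16 SCOPE = BOTH;
```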
Do you notice a pattern? Do these options make you uncomfortable? Don’t feel bad. DBAs simply need to evolve their performance diagnostic and optimization tool belt to include some new options that either did not exist before or were historically considered bad practice. From a business perspective, where all anyone generally cares about is “fix it fast,” these new options might be a reasonable, or even the best, solution.