I know that many database professionals will cringe at the title of this blog post, but it is absolutely the truth. DBAs are charged with many duties including backing up and recovering data, assuring optimal performance of database systems and applications, designing and creating databases, integrating changes into database systems over time, securing and protecting data, and much more. Of all the things the DBA does to keep the systems and applications running, assuring recoverability is the most important.
I hear some of you protesting out there… many DBAs will claim that managing performance is the most important thing they do. Unfortunately, they are confusing frequency with importance. Yes, many DBAs probably manage performance more often than they build backup plans, and they had better be managing performance more frequently than they are actually recovering their databases, or their company has big problems! But frequency is not the same as importance.
Why do I assert that recoverability should be at the very top of the DBA task list? Well, if you cannot recover your databases after a problem then it won’t matter how fast you can access them, will it? Anybody can deliver fast access to the wrong information. It is the job of the DBA to keep the information in our company’s databases accurate, secure, and accessible.
So what do we need to do to assure the accessibility and accuracy of our database data? The first thing is to understand the availability needs of our data in terms of the business. In the event of a failure, how rapidly must we be able to recover? Keep in mind that the failure could be either physical, such as a failed disk drive, or logical, such as applying the wrong input to a process and corrupting the database.
Recovery SLAs… or RTOs
Only after we know the impact to the business can we develop an appropriate backup and recovery plan. We need service level agreements (SLAs) for recovery just like we have SLAs for performance. An SLA for recovery is known as a Recovery Time Objective, or RTO. Without RTOs, it is difficult (if not impossible) to gauge the state of recoverability and the efficacy of the backups being taken.
It is imperative that the DBA team creates an appropriate recovery strategy for each database object. This requires mapping database objects to applications so we can adopt the proper strategy in accordance with the application recovery SLA. Some database objects will participate in multiple applications, and their recovery strategy will therefore be more complex.
Each database object should have an RTO assigned to it. The RTO needs to take into account the same type of things that an SLA considers. In other words, the business must prioritize its applications, DBAs must map database objects to the applications, and together they must identify the amount of time, effort, and capital that can be expended to assure the minimization of downtime for those applications.
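As a rough illustration of mapping database objects to applications, here is a minimal Python sketch. The application names, table names, and RTO values are hypothetical, and the rule that a shared object inherits the strictest (shortest) RTO of the applications using it is an assumption made for this example, not a rule every shop will follow.

```python
from datetime import timedelta

# Hypothetical application RTOs agreed on with the business.
app_rto = {
    "order_entry": timedelta(minutes=30),
    "reporting": timedelta(hours=8),
}

# Hypothetical mapping of database objects to the applications that use them.
object_apps = {
    "CUSTOMER": ["order_entry", "reporting"],
    "ORDER_HISTORY": ["reporting"],
}

def object_rto(obj):
    """Return the shortest RTO among the applications that use obj.

    Assumption: an object shared by several applications must be
    recoverable within the strictest of their RTOs.
    """
    return min(app_rto[app] for app in object_apps[obj])

for name in object_apps:
    print(name, object_rto(name))
```

In this sketch the CUSTOMER table participates in both applications, so it takes on the 30-minute objective of order entry even though reporting alone could tolerate a much longer outage.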
The RTO ensures that, when problems occur requiring database recovery, the application outage is limited to what has been defined as tolerable for the business (in terms of uptime and cost to provide that uptime). All parties involved must agree on stated objectives for downtime and time to recovery. The end users must be satisfied with the potential duration of their application’s downtime, and the DBAs and technicians must be content with their ability to recover the system to the objectives. And again, cost is a contributing factor. The RTO cannot simply be “I need my application up in 5 minutes and I can’t spend any more money to do that,” because that is not reasonable (or possible).
Without written RTOs, DBAs can exercise due diligence to make sure that database objects are backed up and recoverable, but cannot really provide any guarantee in terms of how quickly the data can be recovered (or perhaps, to what point in time) when an outage occurs. Of course, the DBA can create and review backup policies and procedures to encourage a recoverable environment. But there won’t be any way to ensure with any consistency that the backup plan can deliver the time-to-recovery needed by the business.
Tactics and Approaches to Backup and Recovery
Although this post will not get into the nitty-gritty technical details of database backup and recovery, it is important to recognize that there are different strategies to assure data availability.
Organizations can reduce their exposure to outages with redundant hardware. RAID devices, for example, combine physical hard disks into a single logical unit that stores data redundantly so a single disk failure does not cause an outage. And mirroring can be used to copy data synchronously to more than one disk. Automatic error correction can be used to detect and correct problems using redundantly stored data. But such hardware can be expensive, so it will not likely be used for every database under our control. And, of course, even with expensive failover hardware, backups will still be needed.
Another approach for high availability in the event of an error is a hot site set up with a copy of the database that is updated in concert with the primary site. If the primary site fails, the administrator can switch to the secondary site with little or no downtime. But even this type of setup does not obviate the need for backup and recovery. It can for hardware failures, sure, but what about logical failures where the wrong data is used? Such an error will be propagated from the primary to the secondary site, as would any data modification. And without a backup to recover from, how will you get rid of the bad data?
Sometimes organizations think that adopting a cloud approach means that they don’t need to worry about backup and recovery. And to some extent, this is true. In most cases, your cloud service provider (CSP) will perform the backups. But you really need to work with the cloud provider to make sure that your needs are covered. If you do not engage with the CSP with RTOs for your cloud data, then it is not likely that your recoverability will match your business needs.
Have You Evaluated Your Recoverability Lately?
Establishing a reasonable backup schedule requires you to balance two competing demands: the need to take image copy backups frequently to assure reasonable recovery time, and the need to take image copies infrequently so as not to interrupt daily business. Keep in mind that if you make fewer image copies, you will need to apply more log records during the recovery, and the recovery will take longer. The DBA must balance these competing objectives based on RTOs, usage criteria, and the capabilities of the DBMS.
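To make that trade-off concrete, here is a back-of-the-envelope sketch in Python. It assumes a deliberately simple model: recovery time is the time to restore the last image copy plus the time to apply all the log written since that copy. The function names and every number below are hypothetical; real recovery times depend on your DBMS, hardware, and workload, and should be measured rather than calculated.

```python
def estimated_recovery_minutes(backup_interval_hours, restore_minutes,
                               log_gb_per_hour, log_apply_gb_per_minute):
    """Worst-case estimate under a simple linear model (an assumption).

    Worst case: the failure happens just before the next image copy,
    so a full interval's worth of log must be applied after the restore.
    """
    log_gb = backup_interval_hours * log_gb_per_hour
    return restore_minutes + log_gb / log_apply_gb_per_minute

def meets_rto(rto_minutes, **schedule):
    """Check whether the estimated recovery time fits within the RTO."""
    return estimated_recovery_minutes(**schedule) <= rto_minutes

# Hypothetical figures: 20-minute restore, 2 GB of log per hour,
# log applied at 0.5 GB per minute.
daily = dict(backup_interval_hours=24, restore_minutes=20,
             log_gb_per_hour=2, log_apply_gb_per_minute=0.5)
six_hourly = dict(daily, backup_interval_hours=6)

print(estimated_recovery_minutes(**daily))       # 116.0
print(estimated_recovery_minutes(**six_hourly))  # 44.0
print(meets_rto(60, **daily))                    # False
print(meets_rto(60, **six_hourly))               # True
```

With these made-up figures, a daily image copy cannot satisfy a 60-minute RTO, while a copy every six hours can. The price is four backups a day instead of one, which is exactly the tension described above.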
When was the last time you re-evaluated and tested your backup and recovery plans? Oh, you may have looked at disaster plans, but have you examined your ability to recover locally? Do you know how long it would take to recover your most important primary customer tables, for example, if you took a hit in the middle of the day?
Regular recoverability health checking should be a standard, documented responsibility for every DBA team, and if you can acquire software to automate the health-check process, all the better.
The Bottom Line
A wise sage once said that there are two types of DBAs:
- DBAs that do backups
- DBAs that will do backups
And there is a truism there. Assuring access to your data requires strong backup and recovery planning where technicians and business align to create achievable RTOs. If this process is approached with diligence and in a spirit of cooperation, you can build a proper backup and recovery environment for your systems… and your DBAs can go back to spending most of their time firefighting performance issues!