I often run into DBAs whose Storage Administrator has told them not to compress database backups so that storage devices like EMC Data Domain can achieve better deduplication rates. I will tell you why this can be a bad idea.
What Is Deduplication?
Appliances like EMC Data Domain sit between the server and the storage medium, connected through the network. When data is saved, it is sent from the server to the appliance, which deduplicates it and stores the reduced file. Deduplication works, in most cases, by identifying duplicate blocks within a file. The difference between compression and deduplication is that the deduplication process is essentially transparent: it doesn't matter what type of file is being stored.
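Here is a minimal sketch of the idea, assuming simple fixed-size 4 KB blocks (real appliances use more sophisticated, often variable-size chunking):

```python
import hashlib

def dedupe_stats(data: bytes, block_size: int = 4096) -> tuple[int, int]:
    """Split data into fixed-size blocks and count how many are unique."""
    seen = set()
    total = 0
    for offset in range(0, len(data), block_size):
        seen.add(hashlib.sha256(data[offset:offset + block_size]).hexdigest())
        total += 1
    return total, len(seen)

# A file with lots of repeated content dedupes very well:
data = b"A" * 4096 * 100 + b"B" * 4096 * 100    # 200 blocks, only 2 distinct
total, unique = dedupe_stats(data)
print(f"{total} blocks, {unique} unique -> appliance stores {unique / total:.0%}")
```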
The key thing to note is that the file has to be transferred over the network to the appliance before deduplication occurs. So, for example, an 800 megabyte database backup file would be sent over the network in full and then deduped down to maybe 50 megabytes. You save storage space, but you still pay the full transfer time for the original file.
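Some back-of-the-envelope arithmetic makes that cost concrete. Assuming an idealized 1 Gbps link with no protocol overhead and a 10:1 compression ratio (both numbers are assumptions for illustration):

```python
def transfer_seconds(size_mb: float, link_gbps: float = 1.0) -> float:
    """Seconds to move size_mb megabytes over a link_gbps link (idealized)."""
    return size_mb * 8 / (link_gbps * 1000)

# Dedup happens on the appliance, so the full 800 MB still crosses the wire:
print(f"uncompressed: {transfer_seconds(800):.1f} s")   # ~6.4 s
# A 10:1 compressed backup sends only 80 MB:
print(f"compressed:   {transfer_seconds(80):.1f} s")    # ~0.6 s
```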
How Deduplication Slows Backup Operations
Most of the time the bottleneck in a database backup is either the network file transfer or writing the file to storage. IDERA's SQL Safe Backup product uses a proprietary technology, Intellicompress, which uses idle CPU cycles to compress database backups as they are written. This considerably reduces the size of the backup file while it is being saved, and the end result is faster transfer over the network as well as faster disk writes.
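Intellicompress itself is proprietary, but the principle — shrink the stream as it is produced so fewer bytes ever touch the network or the disk — can be sketched with ordinary gzip (the paths here are hypothetical):

```python
import gzip
import shutil

def write_compressed(src_path: str, dst_path: str) -> None:
    """Compress the stream as it is written, so fewer bytes hit the wire and disk.
    (Illustration only -- Intellicompress itself is proprietary.)"""
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)   # streams in chunks; never loads the whole file
```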
With deduplication, the entire uncompressed file is written over the network to the appliance and then deduplicated. You get a reduction in storage, but only after paying for the full file transfer.
Deduplication Messes with Restores Also
Think about it: suppose you need to restore a database from a backup on EMC Data Domain. Let's use the 800 megabyte backup file from the earlier example. The entire 800 megabytes is transferred across the network and restored. If the file had been compressed using IDERA SQL Safe, it could be as small as 80 megabytes, which means a much faster network transfer. And if you verify your backups, you essentially go across the network twice.
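Using the same idealized 1 Gbps assumption as before, restore plus verification doubles the trip:

```python
def transfer_seconds(size_mb: float, link_gbps: float = 1.0) -> float:
    """Seconds to move size_mb megabytes over a link_gbps link (idealized)."""
    return size_mb * 8 / (link_gbps * 1000)

# Restore plus verification pulls the file across the network twice:
for label, size_mb in [("uncompressed (800 MB)", 800), ("compressed (80 MB)", 80)]:
    print(f"{label}: {2 * transfer_seconds(size_mb):.1f} s")   # ~12.8 s vs ~1.3 s
```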
Log shipping takes a unique performance hit here because it continuously restores transaction log backups, often across multiple servers and using many files. That means a constant back-and-forth of data over the network.
The Storage Administrator is Wrong
Deduplication works really well with uncompressed data, which is what you typically find on file servers, and for that kind of workload it saves quite a bit of SAN space. Storage Administrators look at database backups as nothing more than data files being written to the SAN and get frustrated when compressed backups dedupe poorly. With uncompressed backups you might see 80% deduplication rates, but there is a price to pay, as mentioned earlier: you could be moving 10x more data over the network and extending restore times, which can be a critical side effect.
The easy way to prove your point is to write a compressed backup to regular file storage and an uncompressed backup to the deduplication storage. Measure the transfer times as well as the time it takes to restore from each backup. The numbers will make the case for you.
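A rough way to time the transfer half of that test might look like the following; the file names and UNC paths are hypothetical stand-ins for your own backup files and storage targets:

```python
import shutil
import time

# Hypothetical paths: substitute your own backup files and storage targets.
tests = [
    ("compressed backup -> plain storage",
     r"C:\backups\db_compressed.safe", r"\\fileserver\backups\db_compressed.safe"),
    ("uncompressed backup -> dedup appliance",
     r"C:\backups\db_uncompressed.bak", r"\\datadomain\backups\db_uncompressed.bak"),
]

for label, src, dst in tests:
    start = time.perf_counter()
    shutil.copyfile(src, dst)                      # the network transfer we want to time
    print(f"{label}: {time.perf_counter() - start:.1f} s")
```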
Lagniappe (something given as a bonus or extra gift)
People often ask why deduplicating compressed data does not yield the desired decrease in file size on the storage appliance. Think of it as the equivalent of compressing a compressed file: the duplicate data was already removed during compression, so searching for duplicate blocks with deduplication is not going to find much.
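You can see both effects with a quick experiment (gzip standing in for any compressor, and fixed 4 KB blocks standing in for the appliance's chunking):

```python
import gzip
import hashlib

payload = b"backup pages contain lots of repetition " * 100_000   # ~4 MB, compresses well
once = gzip.compress(payload)
twice = gzip.compress(once)
print(len(payload), len(once), len(twice))   # the second pass saves essentially nothing

# The compressed stream also has no duplicate 4 KB blocks left for dedup to find:
chunks = [once[i:i + 4096] for i in range(0, len(once), 4096)]
unique = {hashlib.sha256(c).hexdigest() for c in chunks}
print(f"{len(chunks)} blocks, {len(unique)} unique")
```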
Another question arises around using encrypted files with deduplication. You do not get very good deduplication rates with encrypted data because encryption deliberately randomizes the output bytes using a key, typically mixing in a unique nonce or initialization vector for every encryption. Guess what: that means deduplication is not going to find many, if any, duplicate data blocks.
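A toy illustration (this is not a real cipher, just a randomized XOR keystream that mimics how real ciphers use a fresh nonce per encryption): identical plaintext blocks come out as completely different ciphertext, so there is nothing for dedup to match.

```python
import hashlib
import os

KEY = b"not-a-real-key"

def toy_encrypt(block: bytes) -> bytes:
    """XOR the block with a keystream derived from the key and a random nonce.
    Toy code -- but real ciphers randomize their output the same way."""
    nonce = os.urandom(16)
    keystream = b""
    counter = 0
    while len(keystream) < len(block):
        keystream += hashlib.sha256(KEY + nonce + counter.to_bytes(4, "big")).digest()
        counter += 1
    return nonce + bytes(a ^ b for a, b in zip(block, keystream))

# The same plaintext block encrypts to completely different bytes every time:
block = b"X" * 4096
c1, c2 = toy_encrypt(block), toy_encrypt(block)
print(c1[:8].hex(), c2[:8].hex(), c1 == c2)   # two different prefixes, then False
```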