Why Database Backup Compression Is Better Than Deduplication

Jan 12, 2017


I often run into situations where DBAs are debating with their Storage Administrator about whether to compress database backups, because uncompressed backups deduplicate better on storage appliances like EMC Data Domain. The Storage Administrator has been telling them not to compress. I will tell you why this can be a bad idea.

What is deduplication?

Appliances like EMC Data Domain sit between the server and the storage medium, connected over the network. When data is saved, the server sends it to the appliance, which deduplicates it and then stores the reduced result. Deduplication works, in most cases, by identifying duplicate blocks within a file. The difference between compression and deduplication is that deduplication is essentially transparent: it does not matter what type of file is being stored.
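To make the block-level idea concrete, here is a minimal sketch of fixed-block deduplication in Python. Real appliances use more sophisticated techniques (variable-length chunking, global block stores, and so on); the block size and the dictionary-based store here are simplifying assumptions for illustration only.

```python
import hashlib

BLOCK_SIZE = 4096  # appliances vary; fixed 4 KB blocks keep the sketch simple

def dedupe(data: bytes):
    """Split data into fixed-size blocks and store each unique block once.

    Returns the unique-block store plus the ordered list of block hashes
    needed to reconstruct the original byte stream.
    """
    store = {}   # hash -> block contents (each unique block stored once)
    recipe = []  # ordered hashes, used to rebuild the file
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)
        recipe.append(digest)
    return store, recipe

# A file full of repeated blocks dedupes extremely well:
repetitive = b"A" * BLOCK_SIZE * 100
store, recipe = dedupe(repetitive)
print(len(recipe), "blocks referenced,", len(store), "unique block(s) stored")  # 100 vs 1
```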

The key thing to note is that the file has to be transferred over the network to the appliance before deduplication occurs. So, for example, an 800 megabyte database backup file would be sent over the network in full and then deduped down to maybe 50 megabytes. You save storage space, but you pay the full cost of transferring the file.
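A rough back-of-the-envelope calculation shows why that matters. The 1 Gbit/s link speed and the 80 megabyte compressed size (roughly 10:1 compression) below are illustrative assumptions, not measurements; real throughput and ratios vary.

```python
NETWORK_MBPS = 125.0  # ~1 Gbit/s link, expressed in megabytes per second (assumed)

def transfer_seconds(size_mb: float) -> float:
    """Time for a file of size_mb megabytes to cross the wire."""
    return size_mb / NETWORK_MBPS

# Dedupe at the appliance: the full 800 MB backup crosses the wire first.
print(f"uncompressed: {transfer_seconds(800):.1f} s on the network")
# Compress at the source: only the reduced file crosses the wire.
print(f"compressed:   {transfer_seconds(80):.1f} s on the network")
```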

How Deduplication Slows Backup Operations

Most of the time, the bottleneck in a database backup is either the network file transfer or writing the file to storage. IDERA's SQL Safe Backup product uses a proprietary technology, Intellicompress, which uses idle CPU cycles to compress database backups. This considerably reduces the size of the backup as the file is being saved. The end result is faster transfer rates over the network as well as faster disk writes.
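The general idea of compress-as-you-write can be sketched with gzip as a stand-in; Intellicompress itself is proprietary and additionally adapts its effort to available CPU, which this sketch does not attempt.

```python
import gzip
import shutil

def compressed_backup(source_path: str, dest_path: str, chunk_size: int = 1 << 20) -> None:
    """Compress the backup stream chunk by chunk as it is written, so the
    smaller file is what travels to storage. gzip here is only a stand-in
    for a backup tool's native compression."""
    with open(source_path, "rb") as src, gzip.open(dest_path, "wb", compresslevel=6) as dst:
        shutil.copyfileobj(src, dst, chunk_size)
```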

With deduplication, by contrast, the entire file is written over the network before any reduction happens. You get the storage savings, but at the full cost of the file transfer.

Deduplication Also Slows Restores

Think about it: suppose you need to restore a database from a backup stored on EMC Data Domain. Let's use our 800 megabyte backup file from the example. The entire 800 megabytes must be transferred across the network before the restore completes. If the file had been compressed using IDERA SQL Safe Backup, it could be as small as 80 megabytes, which means a much faster network transfer. And if you verify your backups, you essentially cross the network twice.
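Here is that restore-plus-verify math spelled out, reusing the illustrative 800 MB / 80 MB figures from above; the doubling simply models reading the backup back over the network once more for verification.

```python
def bytes_moved_mb(backup_mb: float, verify: bool) -> float:
    """Total megabytes crossing the network for a restore,
    doubled when the backup is also read back for verification."""
    return backup_mb * (2 if verify else 1)

for label, size_mb in (("uncompressed", 800), ("compressed", 80)):
    print(label, bytes_moved_mb(size_mb, verify=True), "MB over the wire with verify")
```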

Log shipping takes a unique performance hit because it routinely performs restores across multiple servers using multiple files, which means continuous back-and-forth network transfer of data.

The Storage Administrator is Wrong

Deduplication works really well with uncompressed data, which is typical of file servers, and it can save quite a bit of SAN space for that type of application. Storage Administrators see database backups as nothing more than data files being written to the SAN and often get frustrated when compressed backups deduplicate poorly. With uncompressed backups you might get 80% deduplication rates, but there is a price to pay, as mentioned earlier: you move roughly 10x more data over the network and extend restore times, which can be a critical side effect.

The easy way to prove your point is to write a compressed backup to regular file storage and an uncompressed backup to deduplication storage. Measure the transfer times as well as the time it takes to restore from each backup. You will quickly make your case.
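If you want to script that comparison, a minimal timing harness might look like the following. The paths are hypothetical placeholders; point them at your actual backup files, file share, and dedupe appliance share, and time the restore step the same way.

```python
import shutil
import time

def timed_copy(src: str, dst: str) -> float:
    """Copy a backup file to a storage target and return elapsed seconds."""
    start = time.perf_counter()
    shutil.copyfile(src, dst)
    return time.perf_counter() - start

# Hypothetical paths: compressed backup to plain file storage,
# uncompressed backup to the dedupe appliance share.
print("compressed -> plain storage:",
      timed_copy(r"D:\backups\db_compressed.safe", r"\\filer\backups\db_compressed.safe"))
print("uncompressed -> dedupe     :",
      timed_copy(r"D:\backups\db_full.bak", r"\\datadomain\backups\db_full.bak"))
```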

Lagniappe (something given as a bonus or extra gift)

People often ask why deduplicating compressed data does not yield the desired decrease in file size on the storage appliance. Think of it as the equivalent of compressing an already-compressed file: the duplicate data was removed during compression, so scanning for duplicate blocks afterward yields very little additional savings.
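You can see this for yourself with a quick experiment: compress some redundant data, then compress the result again. The synthetic 1 MB payload below is an assumption chosen to be highly compressible on the first pass.

```python
import os
import zlib

data = os.urandom(1024) * 1024  # 1 MB built from a repeated 1 KB pattern
once = zlib.compress(data)
twice = zlib.compress(once)

# The first pass removes the redundancy; the second pass finds almost
# nothing left to remove and may even grow the file slightly.
print(len(data), "->", len(once), "->", len(twice))
```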

Another question arises around deduplicating encrypted files. You do not get good deduplication rates with encrypted data because encryption effectively randomizes the bytes using a key. Guess what: that means deduplication is not going to find many, if any, duplicate data blocks.
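A toy demonstration of why: below, a simulated stream cipher (illustration only, not real cryptography) encrypts the same plaintext block under two different nonces. The ciphertexts, and therefore their block hashes, come out completely different, so a dedupe appliance sees no duplicates.

```python
import hashlib

def keystream_encrypt(key: bytes, nonce: bytes, plaintext: bytes) -> bytes:
    """Toy CTR-style cipher for illustration only -- not real cryptography.
    Each byte is XORed with a keystream derived from key, nonce, and counter."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(plaintext):
        stream.extend(hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(p ^ k for p, k in zip(plaintext, stream))

block = b"identical database page" * 100
c1 = keystream_encrypt(b"key", b"nonce-1", block)
c2 = keystream_encrypt(b"key", b"nonce-2", block)

# Same plaintext, different ciphertext: block hashes no longer match.
print(hashlib.sha256(c1).hexdigest()[:16], hashlib.sha256(c2).hexdigest()[:16])
```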