1. Backup Fundamentals
RPO & RTO: The Business Continuity Metrics
Before designing a backup solution, establish what data loss and downtime are acceptable. These two metrics define the requirements that technology must meet.
- RPO (Recovery Point Objective): Maximum acceptable data loss measured in time. "We can lose up to 4 hours of data" means RPO = 4 hours. Backups must run at least every 4 hours.
- RTO (Recovery Time Objective): Maximum acceptable downtime before the system must be restored. "We must be back online within 2 hours" means RTO = 2 hours. The recovery process must complete within that window.
- RPO drives backup frequency. RTO drives recovery infrastructure investment. Lower values cost exponentially more.
- Different systems have different RPO/RTO requirements — financial transaction systems might need RPO of 0 (synchronous replication) while archive systems might accept 24-hour RPO.
- RPO and RTO must be tested, not assumed. If you've never restored from backup under time pressure, your assumed RTO is fiction.
- CDP (Continuous Data Protection) uses journaling to enable restore to any point in time — near-zero RPO for supported systems.
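These two metrics reduce to simple arithmetic. A minimal sketch (the schedule and failure time below are illustrative, not from the source):

```python
from datetime import datetime, timedelta

def worst_case_data_loss(backup_times, failure_time):
    # Everything written since the most recent backup before the failure is lost
    prior = [t for t in backup_times if t <= failure_time]
    return failure_time - max(prior)

def meets_rpo(backup_interval, rpo):
    # A schedule can only satisfy the RPO if backups run at least that often
    return backup_interval <= rpo

# Backups at 00:00, 04:00, 08:00, 12:00; a failure at 15:00 loses 3 hours of data
backups = [datetime(2024, 1, 1, h) for h in (0, 4, 8, 12)]
loss = worst_case_data_loss(backups, datetime(2024, 1, 1, 15))
```

The worst case is always the full interval between backups, which is why RPO directly dictates backup frequency.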
Backup Types
Each backup type involves tradeoffs between storage cost, backup speed, and restore time. Most enterprise strategies combine multiple types.
- Full Backup: Complete copy of all selected data. Slowest to create, fastest to restore (single backup set). Storage-intensive — each full backup is a complete copy.
- Incremental Backup: Only changes since the last backup of any type. Fastest to create, smallest storage per backup. Slowest to restore — must chain all incrementals since last full.
- Differential Backup: All changes since the last full backup. Moderate size (grows each day until next full). Faster restore than incremental — only need last full + last differential.
- Synthetic Full: Backup software constructs a "virtual" full backup by merging the last real full with subsequent incrementals — without re-reading source data. Combines incremental-speed backups with single-set restore speed.
- CDP (Continuous Data Protection): Every write operation is journaled in real time. Enables restore to any point in time, not just scheduled backup points. High storage overhead.
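The restore-time tradeoff between incremental and differential backups comes down to how long the restore chain is. A simplified model (real backup catalogs track this automatically; the chain logic assumes unmixed incremental or differential runs):

```python
def restore_chain(backups):
    """Which backup sets must be applied, in order, to restore to the
    latest point. `backups` is a chronological list of (name, type) with
    type in {"full", "incremental", "differential"}."""
    # Restore always starts from the most recent full backup.
    last_full = max(i for i, (_, t) in enumerate(backups) if t == "full")
    chain = [backups[last_full][0]]
    for name, t in backups[last_full + 1:]:
        if t == "incremental":
            # Incrementals chain one after another: every link is required
            chain.append(name)
        elif t == "differential":
            # A differential contains all changes since the full,
            # so it supersedes everything between itself and the full
            chain = [backups[last_full][0], name]
    return chain
```

With daily incrementals after a Sunday full, a Wednesday restore needs four sets; with differentials it always needs exactly two.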
The 3-2-1-1-0 Rule
The 3-2-1-1-0 rule is the modern evolution of the classic 3-2-1 backup rule, updated to address the ransomware threat specifically.
- 3 copies: One production copy plus two backups. If any single copy is lost or corrupted, two others remain.
- 2 different media types: Don't store both backups on the same type of storage (e.g., disk + tape, or disk + cloud). Different media fail in different ways.
- 1 offsite copy: Geographic separation protects against physical disasters (fire, flood, datacenter loss). Cloud storage or a secondary site satisfies this.
- 1 offline or immutable copy: This is the ransomware-specific addition. One copy must be either physically disconnected (air-gapped) or immutably written (WORM). Ransomware cannot encrypt what it cannot reach or modify.
- 0 errors (verified restores): A backup that has not been successfully restored is unverified. Automated restore testing — the backup software spins up the system and verifies it boots — must confirm every backup set is actually usable.
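The rule can be expressed as a checklist over an inventory of backup copies. A minimal sketch (the copy-description format is an assumption for illustration):

```python
def check_3_2_1_1_0(copies, verified_restores):
    """copies: list of dicts like
       {"media": "disk", "offsite": False, "immutable_or_offline": False}
    verified_restores: True if the latest restore test of every set passed."""
    return {
        "3_copies": len(copies) >= 3,
        "2_media": len({c["media"] for c in copies}) >= 2,
        "1_offsite": any(c["offsite"] for c in copies),
        "1_immutable_or_offline": any(c["immutable_or_offline"] for c in copies),
        "0_errors": verified_restores,
    }
```

Note that the "0 errors" element cannot be derived from the inventory at all: it is only satisfiable by actually running restore tests.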
Ransomware as the Primary Backup Threat
Modern ransomware is purpose-built to defeat conventional backup strategies. Operators spend days or weeks inside networks specifically targeting backup infrastructure before deploying encryption.
- Ransomware operators identify and delete or encrypt backup catalogs first — Veeam repositories, Windows shadow copies (vssadmin delete shadows), backup agent credentials
- Common targets: Veeam backup servers, Windows VSS (Volume Shadow Copy), network-attached backup drives, backup agents on NAS devices
- Attackers map backup infrastructure using legitimate admin tools during the dwell period — they know where your backups are before you know they're inside
- The entire business case for immutable backup: ransomware can reach everything connected to the network, but it cannot modify an immutable object store or overwrite air-gapped tape
- Backup infrastructure must be treated as a Tier 0 asset — same security standards as Domain Controllers
| Backup Type | Storage Multiplier | Backup Speed | Restore Speed | RPO | Best For |
|---|---|---|---|---|---|
| Full | 100% per run | Slowest | Fastest (single set) | Backup interval | Weekly anchor, archives |
| Incremental | Low (daily delta only) | Fastest | Slowest (chain all incrementals) | Backup interval | Daily backups, storage-sensitive |
| Differential | Medium (grows daily) | Medium | Fast (full + last diff) | Backup interval | Balanced speed/storage |
| Synthetic Full | Low (no re-read of source) | Fast (no source read) | Fastest (single set) | Backup interval | Enterprise production systems |
| CDP | High (all writes journaled) | Continuous | Fast (any point) | Near-zero (seconds) | Critical financial systems, DBs |
2. Immutable Backups
S3 Object Lock (WORM)
Amazon S3 Object Lock implements WORM (Write Once Read Many) storage for backups. Locked objects cannot be overwritten or deleted during their retention period; in Compliance mode, not even the account root user can remove them.
- Governance Mode: Objects can be deleted or retention period shortened by users with specific IAM permissions (s3:BypassGovernanceRetention). Useful for testing and operational management.
- Compliance Mode: Retention period is absolute — no one, including the root account, can reduce it or delete objects until the retention period expires. Used for regulatory compliance (SEC, FINRA, HIPAA).
- Enable Object Lock at bucket creation (cannot be added later). Set default retention policies at the bucket level.
- Legal holds can supplement retention — lock specific objects regardless of retention period for litigation or investigation purposes.
- Cross-region replication to a second AWS region with separate Object Lock settings provides geographic redundancy for immutable backups.
- Use a dedicated backup AWS account with no shared IAM roles with production — compromise of production account cannot reach backup account.
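Per-object retention is set at write time. A hedged boto3 sketch (bucket and key names are hypothetical; the upload call is shown but not executed here, and requires a bucket created with Object Lock enabled):

```python
from datetime import datetime, timedelta, timezone

def object_lock_put_args(bucket, key, retention_days, mode="COMPLIANCE"):
    """Per-object retention parameters for s3.put_object(). In COMPLIANCE
    mode no principal, including the root user, can shorten RetainUntilDate
    or delete the object before it passes."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ObjectLockMode": mode,  # "COMPLIANCE" or "GOVERNANCE"
        "ObjectLockRetainUntilDate":
            datetime.now(timezone.utc) + timedelta(days=retention_days),
    }

# With boto3 (illustrative names; not executed here):
# import boto3
# s3 = boto3.client("s3")
# s3.put_object(Body=backup_bytes,
#               **object_lock_put_args("backup-vault", "db/2024-01-01.bak", 30))
```

Governance mode accepts the same parameters but remains bypassable via s3:BypassGovernanceRetention, which is why Compliance mode is the one used against ransomware.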
Veeam Hardened Linux Repository
Veeam Backup & Replication supports storing backups on a hardened Linux repository configured to prevent modification or deletion — providing immutability for on-premises backups.
- Configured as an append-only repository — the Veeam service account can write new backups but cannot delete or overwrite existing ones
- Immutability period: A configurable number of days (e.g., 30). During this period, no process — not even root on the Linux server — can delete the backup files, because the immutable file attribute (chattr +i) is set on them
- Security hardening: disable root SSH login, enable SSH key-only auth, no RDP, remove unnecessary services, SELinux or AppArmor enforcing, firewall to Veeam backup server only
- The Linux server should have NO integration with Active Directory — domain compromise cannot reach it
- Out-of-band management: iDRAC/iLO for physical access independent of the OS network stack
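The repository's append-only behavior reduces to a time-window rule. A simplified model for reasoning about it (the real enforcement is the chattr +i attribute managed by the Veeam service, not application code):

```python
from datetime import datetime, timedelta

def delete_allowed(backup_written, immutability_days, now):
    """Model of the hardened repository's rule: a backup file becomes
    deletable only after its immutability window elapses. Before that,
    the immutable attribute blocks deletion regardless of permissions."""
    return now >= backup_written + timedelta(days=immutability_days)
```

The practical consequence: the immutability window must be at least as long as the expected ransomware dwell time, or attackers can simply wait it out.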
Air-Gapped & Offline Backups
An air-gapped backup is physically disconnected from all networks. It is the strongest protection against ransomware — an attacker who cannot reach the backup cannot encrypt or delete it.
- Tape: LTO tape is the classic air-gap medium. Tapes are ejected after backup and stored in a secure vault. Offline until physically loaded. Capacity is high and cost-per-TB is low.
- Rotational External Disk: Multiple external drives rotated — one offsite, one connected for backup, one previous cycle in storage. The disconnected drives are air-gapped during rotation.
- Physical Media Vault: Iron Mountain or equivalent offsite storage for tape media — geographic separation plus physical security in a purpose-built facility
- Air-gap access control: tape loading should require two-person authentication; physical access logs mandatory
- Backup Encryption: Always encrypt backup data at rest and in transit. If backup media is stolen, encrypted data is useless. Use AES-256; store encryption keys separately from backup media.
Your Backup is Also a Target
If ransomware can reach your backup infrastructure — your backup server is on the same network as production, shares domain credentials, uses the same admin accounts, or stores data in an S3 bucket accessible from production systems — you don't have a backup. You have a second victim waiting for encryption. Immutability is not optional for ransomware protection; it is the only property that breaks the attacker's ability to destroy your recovery capability. Design your backup architecture assuming the attacker already has domain admin credentials before you start.
3. Backup Solutions
Enterprise Backup Platforms
- Veeam Backup & Replication: Market leader for VMware/Hyper-V/cloud workloads. SureBackup automated restore testing, hardened Linux repository support, immutability features, cloud tier to S3. Most widely deployed enterprise backup solution.
- Commvault: Comprehensive data management platform covering backup, archive, discovery, and compliance. Strong for complex heterogeneous environments. High-capability, high-complexity.
- Rubrik: Cloud-native immutable backup platform. All backups inherently immutable — no ability to delete or modify. Strong API, threat hunting on backup data, ransomware recovery SLA guarantees. Higher cost tier.
- Cohesity: Hyperconverged secondary data platform. Backup, DR, file services, and analytics on one platform. Strong deduplication and compression ratios.
Cloud-Native Backup Services
- AWS Backup: Centralized backup service for EC2, EBS, RDS, Aurora, DynamoDB, EFS, S3, and more. Cross-region and cross-account backup policies. Integrates with Organizations for enterprise-wide backup governance. Supports Backup Vault Lock (WORM).
- Azure Backup: Covers Azure VMs, SQL Server, Azure Files, SAP HANA, Azure Blobs. Enhanced soft delete and immutability for backup vaults. Geo-redundant storage option.
- GCP Backup and DR: Managed backup service for compute, databases, and VMware workloads in Google Cloud. Application-consistent backup with instant recovery.
Open Source & Self-Hosted
- restic: Modern backup tool written in Go. Encrypted by default (AES-256-CTR), deduplicated, multiple backend support (S3, B2, SFTP, local). Client-side encryption — cloud provider cannot read your data. Excellent for Linux servers and cloud-native environments.
- BorgBackup: Deduplication, encryption, compression, and authenticated encryption. Excellent for Linux-to-Linux backups. Borgmatic wraps Borg with YAML configuration and health monitoring integration.
- Bacula / Bareos: Enterprise-class open source backup with client-server architecture. Supports tape, disk, and cloud. More complex to configure but highly capable for large environments.
- Backblaze B2: S3-compatible cloud storage at significantly lower cost than AWS S3. Supports Object Lock for immutability. Widely used with restic and BorgBackup for offsite backups.
| Solution | Type | Immutability | Ransomware Protection | Open Source | Cost Tier |
|---|---|---|---|---|---|
| Veeam | Enterprise on-prem/cloud | Yes (hardened repo, S3 lock) | Strong (immutable, isolated repo) | No | $$ |
| Rubrik | Cloud-native enterprise | Yes (inherent) | Very strong (no delete ever) | No | $$$ |
| AWS Backup | Cloud-native (AWS) | Yes (Vault Lock) | Strong (cross-account) | No | Pay-per-use |
| Azure Backup | Cloud-native (Azure) | Yes (immutable vault) | Strong | No | Pay-per-use |
| restic | Open source | Via S3 Object Lock | Good (encrypted, cloud-backed) | Yes | Free + storage costs |
| BorgBackup | Open source | No native; filesystem-level | Moderate (encryption) | Yes | Free + storage costs |
4. Disaster Recovery Planning
DR Site Approaches
Disaster recovery requires infrastructure to run workloads when the primary site is unavailable. The approach chosen determines both cost and achievable RTO.
- Hot Site: Fully operational duplicate of production, running at all times. Failover measured in seconds to minutes. Most expensive — you're paying for double infrastructure continuously. Required for Tier 1 systems with sub-hour RTO.
- Warm Site: Infrastructure pre-provisioned and ready to receive workloads but not actively running production. Failover in minutes to hours. Systems are updated periodically. Good balance of cost and recovery speed.
- Cold Site: Physical space and power available; hardware may or may not be present. Workloads must be deployed from backups. Failover measured in days. Lowest cost, highest RTO.
- Cloud-Based DR: Cloud provider becomes the DR site. No capital expenditure — pay only during DR tests and actual disasters. AWS Elastic Disaster Recovery (DRS), Azure Site Recovery, or Zerto for cloud-target replication.
- DRaaS (DR as a Service): Managed DR service handling replication, failover, and runbooks. Zerto, Carbonite, Datto. Good for organizations without DR expertise or infrastructure.
System Criticality Tiers
Not all systems require the same DR investment. Tiering systems by criticality allows rational allocation of DR budget to where it matters most.
- Tier 1 (Mission Critical): Systems where failure directly stops revenue or endangers life. ERP, payment processing, 911/emergency dispatch, hospital clinical systems. Target: RTO 1 hour, RPO 15 minutes.
- Tier 2 (Business Critical): Important operational systems — productivity suite, CRM, key internal applications. Business can operate short-term without them but impact is significant. Target: RTO 8 hours, RPO 4 hours.
- Tier 3 (Business Operational): Useful but not critical — reporting, analytics, non-customer-facing tools. Can operate manually or without for 1-2 days. Target: RTO 48 hours, RPO 24 hours.
- Tier 4 (Non-Critical): Development environments, test systems, documentation portals. Can be rebuilt from source control. Target: RTO 5 business days, RPO best-effort.
- Tier assignments require business stakeholder input — IT alone cannot make these determinations
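The tier targets above can be encoded so that DR test results are checked mechanically rather than by eyeball. A sketch (timedelta(days=5) approximates the 5-business-day Tier 4 target):

```python
from datetime import timedelta

# Target recovery objectives per criticality tier, from the list above
TIER_TARGETS = {
    1: {"rto": timedelta(hours=1),  "rpo": timedelta(minutes=15)},
    2: {"rto": timedelta(hours=8),  "rpo": timedelta(hours=4)},
    3: {"rto": timedelta(hours=48), "rpo": timedelta(hours=24)},
    4: {"rto": timedelta(days=5),   "rpo": None},  # best-effort RPO
}

def meets_tier(tier, measured_rto, measured_rpo):
    """Compare measured recovery numbers (e.g. from an annual DR game day)
    against the tier's targets; a None RPO target means best-effort."""
    t = TIER_TARGETS[tier]
    rpo_ok = t["rpo"] is None or measured_rpo <= t["rpo"]
    return measured_rto <= t["rto"] and rpo_ok
```

Feeding real DR test timings through a check like this turns "we think we meet Tier 1" into a pass/fail fact per system.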
DR Runbooks & Communication Plan
Technical DR capability is useless without documented procedures and communication plans that work under the stress of an actual disaster.
- Runbooks: Step-by-step procedures for each DR scenario. Role-specific (Network Engineer's runbook vs. DBA runbook). Should be executable during an actual disaster by a skilled person who has never performed the procedure before.
- Decision Tree: Who decides to invoke DR? What criteria trigger failover? What business impact threshold? Document this clearly — avoid "committee decision" that delays action during an incident.
- Communication Plan: Escalation tree (who calls whom, backup contacts if primary unreachable), customer notification procedures, regulatory notification procedures, public statement templates.
- Vendor Dependencies: Map which DR steps depend on external vendors (ISP, cloud provider, hardware vendor). Get SLAs and emergency contact numbers documented in the runbook.
- Store runbooks offline — if the disaster took out your documentation system, you need printed or off-network copies
| DR Approach | RTO | RPO | Relative Cost | Complexity | Best Use Case |
|---|---|---|---|---|---|
| Hot Site (active-active) | Seconds | Seconds (sync replication) | 2x production | Very High | Tier 1 financial/critical systems |
| Hot Site (active-passive) | Minutes | Minutes | 1.5-2x production | High | Tier 1 with budget constraints |
| Warm Site / Cloud DR | 1-4 hours | 1 hour | 0.3-0.7x production | Medium | Tier 1-2 enterprise systems |
| Cold Site | 1-3 days | 24 hours | Low (space + power only) | Low | Tier 3-4, compliance-only DR |
| Cloud Backup Restore | Hours-days | Backup interval | Very Low (storage only) | Low | Small org, Tier 3-4 |
5. Testing & Validation
Automated Restore Testing
The only backup that matters is one that has been successfully restored. Automated restore testing eliminates the assumption that backups work.
- Veeam SureBackup: Automated test of every backup — boots the VM from the backup in an isolated network bubble, runs health checks (DNS resolution, application heartbeat, custom scripts), flags any backup that doesn't produce a functional VM.
- Rubrik Live Mount: Instantly mount any backup as a live, fully functional VM for testing or immediate use. Zero-copy technology — no data movement required until the VM actually writes data.
- AWS Backup Restore Testing: Automated restore jobs that restore resources to a sandbox account and run validation scripts. Generates compliance evidence of successful restore attempts.
- Custom scripts: Even without enterprise tools, write scripts that restore a sample of backup data monthly and verify checksums or service startup
- Document every test: timestamp, backup set, restore target, validation method, pass/fail, time to restore. This evidence is required for compliance audits.
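A minimal version of such a custom script: restore a sample file, then compare its SHA-256 against the checksum recorded at backup time (file paths and the checksum store are up to your environment):

```python
import hashlib
from pathlib import Path

def sha256_of(path):
    # Stream in 1 MiB chunks so large backup files don't load into memory
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_restore(restored_path, recorded_checksum):
    """Pass only if the restored sample is byte-identical to what was
    backed up, per the checksum recorded at backup time."""
    return sha256_of(restored_path) == recorded_checksum
```

Run it monthly against a rotating sample of backup sets, and record the result as part of the test documentation described above.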
DR Test Cadence
DR testing must be scheduled, structured, and escalating in scope. Tabletop exercises validate plans; actual failover tests validate infrastructure and people.
- Monthly: Backup restore verification (ideally automated; at minimum a sampled manual restore)
- Quarterly: Tabletop exercise — walk through a specific disaster scenario. Who does what? Where are the gaps? No systems involved — just people and process.
- Semi-Annual: Partial failover test — fail over a non-critical Tier 3 or Tier 4 system to DR environment. Validate the process end-to-end with real infrastructure.
- Annual: Full DR game day — fail over Tier 1 systems to DR during a planned maintenance window. Run production from DR for 2-4 hours before failing back. This is the only way to discover real RTO.
- Post-test retrospective: what worked, what failed, what took longer than expected. Track improvement across tests. Update runbooks immediately after each test.
Backup Monitoring & Reporting
Backup programs that aren't monitored degrade silently. A backup job that has been failing for 30 days is discovered only when recovery is needed.
- Alert on every backup job failure — immediately, to an on-call queue, not just an email inbox no one watches
- Alert on backup job success when it suddenly stops — a silently stopped job (no success, no failure) is as dangerous as an explicit failure
- Track retention policy compliance: are old backups being expired on schedule? Are immutability periods set correctly?
- Dashboard metrics: backup success rate (%), storage capacity trend, backup window duration, number of backups expiring soon, last successful restore test date
- Monthly backup report to security and IT leadership: coverage gaps, restore test results, capacity forecast, RPO/RTO compliance by system tier
- NIST SP 800-34 Contingency Planning Guide provides the framework for formal CP documentation and testing requirements
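The silent-stop case above can be caught with a simple classification rule. A sketch (the two-interval grace period is an assumption; tune it to your schedule and jitter):

```python
from datetime import datetime, timedelta

def backup_job_status(last_success, last_failure, expected_interval, now):
    """Classify a backup job for alerting. A job with no recent success
    AND no recent failure has stopped silently, which is as dangerous as
    an explicit failure: nothing is being written at all."""
    if last_failure and (last_success is None or last_failure > last_success):
        return "FAILED"            # most recent run failed: page on-call
    if last_success is None or now - last_success > 2 * expected_interval:
        return "SILENTLY_STOPPED"  # no success, no failure: job isn't running
    return "OK"
```

Alerting on SILENTLY_STOPPED, not just FAILED, is what closes the "failing for 30 days unnoticed" gap.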
An Untested Backup is a False Sense of Security
The only backup that actually matters is the one you successfully restored from. Organizations regularly discover during ransomware incidents that their backups are incomplete (backup agent never ran on critical servers), unrestorable (backup catalog corrupted), outdated (the last clean backup predates the compromise by weeks), or too slow (RTO of 3 days versus assumed RTO of 4 hours). Test restores monthly at minimum. Test full DR failover annually. Document every test result. Your backup program's effectiveness is defined by recovery success rate, not backup job success rate.