Disaster Recovery in AWS: “When IT Hits the Fan”

When dealing with disaster recovery, two key concepts define how well a system can recover: Recovery Point Objective (RPO) and Recovery Time Objective (RTO). Understanding these helps businesses plan for data loss and downtime effectively.

RPO and RTO Explained

RPO (Recovery Point Objective): The maximum amount of data loss, measured in time, that the business can tolerate. It determines how often backups must be taken, because everything written between the last backup and a disaster could be lost. The lower the RPO, the less data loss your business will experience.
Example: If backups are taken every 12 hours and a disaster occurs, you could lose up to 12 hours of data.
RTO (Recovery Time Objective): The maximum acceptable time to restore service after a disaster. The lower the RTO, the faster you must be able to resume operations.
Example: If your RTO is 2 hours, your system should be fully operational within 2 hours after a disaster.
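
The arithmetic behind these two examples can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this post, and it assumes the worst case, where the disaster hits just before the next scheduled backup.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """Worst case: the disaster strikes just before the next backup runs,
    so everything written since the previous backup is lost."""
    return backup_interval


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """True if the backup schedule keeps worst-case data loss within the RPO."""
    return worst_case_data_loss(backup_interval) <= rpo


def meets_rto(measured_recovery_time: timedelta, rto: timedelta) -> bool:
    """True if a measured (for example, rehearsed) recovery fits within the RTO."""
    return measured_recovery_time <= rto


if __name__ == "__main__":
    # Backups every 12 hours checked against a 4-hour RPO.
    print(meets_rpo(timedelta(hours=12), timedelta(hours=4)))
    # A 90-minute recovery drill checked against a 2-hour RTO.
    print(meets_rto(timedelta(minutes=90), timedelta(hours=2)))
```

If the first check fails, the usual fix is to back up more often (or replicate continuously); if the second fails, you need a faster recovery strategy such as the ones described next.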

Backup and Restore (High RPO and RTO)

This is the simplest disaster recovery method.
On-premises: Large backups may require shipping data physically using tools like AWS Snowball.
Cloud: Scheduled backups ensure data is available but recovery can be slow.
Example: A company that takes nightly backups may lose up to a full day’s data if disaster strikes just before the next backup.

Pilot Light Approach

Only the critical core of your application (typically the database, kept in sync through replication) is always running in the cloud.
Useful for critical systems that need a faster recovery than full backup and restore.
In case of disaster, the remaining components are started and scaled up around that core.
Example: A bank keeps the core of its transaction processing system active in pilot light mode so it can recover far faster than restoring from backups.

Warm Standby

The entire system runs in the cloud, but at minimum scale.
During a disaster, it scales up to full production.
Example: A retail company runs a minimal version of its website and scales up only when needed.
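
As a rough sketch of the "scale up" step, the snippet below raises an Auto Scaling group from standby size to production size. It assumes the standby fleet is managed by an EC2 Auto Scaling group, which the post does not state. The client is passed in so the example runs without AWS access: RecordingClient is a hypothetical stand-in that mimics the update_auto_scaling_group call of boto3's Auto Scaling client, and the group name and the 2x headroom on MaxSize are made-up values.

```python
from typing import Any, Dict, List


class RecordingClient:
    """Hypothetical stand-in for boto3.client("autoscaling").
    It only records the calls it receives; nothing is sent to AWS."""

    def __init__(self) -> None:
        self.calls: List[Dict[str, Any]] = []

    def update_auto_scaling_group(self, **kwargs: Any) -> Dict[str, Any]:
        self.calls.append(kwargs)
        return {}


def scale_to_production(client: Any, group_name: str, production_size: int) -> None:
    """Grow a warm-standby Auto Scaling group to full production capacity."""
    client.update_auto_scaling_group(
        AutoScalingGroupName=group_name,
        MinSize=production_size,
        MaxSize=production_size * 2,  # assumption: leave headroom for load spikes
        DesiredCapacity=production_size,
    )


if __name__ == "__main__":
    client = RecordingClient()
    scale_to_production(client, "web-standby-asg", 10)  # hypothetical group name
    print(client.calls)
```

With real credentials you would pass boto3.client("autoscaling") instead of the recorder. A complete failover would also redirect traffic (for example through DNS), which is left out here.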

Multi-Site / Hot Site Approach

This approach offers the lowest RTO but is very expensive.
The full production environment is always running both on-premises and in AWS.
If you are fully cloud-based, running production in multiple AWS Regions provides the same redundancy.
Example: A global e-commerce site runs production servers in multiple AWS Regions, so losing one Region causes little or no downtime.

DMS (Database Migration Service)

Fast, secure, and resilient database migration.
Keeps the source database active during migration.
Supports homogeneous (e.g., Oracle to Oracle) and heterogeneous (e.g., SQL Server to Aurora) migrations.
Uses CDC (Change Data Capture) for continuous replication.
Runs replication tasks on a replication instance, an EC2-based instance that you provision through DMS.
For different database engines, AWS Schema Conversion Tool helps convert schemas.
Multi-AZ ensures high availability by maintaining a standby replica of the replication instance in a different Availability Zone.
Example: A company migrating from MySQL to PostgreSQL uses DMS to migrate with minimal downtime, because the source stays live while changes are replicated.
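
To make the CDC point concrete, here is a small sketch that builds the parameters for a DMS task which does a full load and then keeps replicating changes. The parameter names follow boto3's create_replication_task call for DMS; the task name and all ARNs are placeholders invented for this example, and the single selection rule (include every table) is an assumption about what you want to migrate.

```python
import json
from typing import Dict


def build_dms_task_params(
    task_id: str,
    source_endpoint_arn: str,
    target_endpoint_arn: str,
    replication_instance_arn: str,
    schema: str = "%",
) -> Dict[str, str]:
    """Build keyword arguments for a full-load-plus-CDC DMS replication task."""
    # One selection rule: include every table in the matching schema(s).
    table_mappings = {
        "rules": [
            {
                "rule-type": "selection",
                "rule-id": "1",
                "rule-name": "include-all-tables",
                "object-locator": {"schema-name": schema, "table-name": "%"},
                "rule-action": "include",
            }
        ]
    }
    return {
        "ReplicationTaskIdentifier": task_id,
        "SourceEndpointArn": source_endpoint_arn,
        "TargetEndpointArn": target_endpoint_arn,
        "ReplicationInstanceArn": replication_instance_arn,
        # Copy existing data first, then apply ongoing changes (CDC).
        "MigrationType": "full-load-and-cdc",
        "TableMappings": json.dumps(table_mappings),
    }


if __name__ == "__main__":
    # All identifiers below are placeholders, not real resources.
    params = build_dms_task_params(
        "mysql-to-postgres",
        "arn:aws:dms:us-east-1:123456789012:endpoint:SOURCE",
        "arn:aws:dms:us-east-1:123456789012:endpoint:TARGET",
        "arn:aws:dms:us-east-1:123456789012:rep:INSTANCE",
    )
    print(json.dumps(params, indent=2))
```

With boto3 installed and credentials configured, these parameters could be passed as boto3.client("dms").create_replication_task(**params). For a heterogeneous migration like MySQL to PostgreSQL, the schema would normally be converted first with the AWS Schema Conversion Tool.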

TL;DR

RPO = How much data loss you can afford.
RTO = How quickly you need to recover.
Backup & Restore = Simple but slow recovery.
Pilot Light = Keeps critical systems running for quicker recovery.
Warm Standby = Full system running at minimal capacity, scaled up when needed.
Multi-Site/Hot Site = Full-scale production running in more than one place for near-zero downtime.
DMS = Secure database migration with minimal disruption.

Planning your disaster recovery depends on how much downtime and data loss your business can handle!
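
One way to picture that trade-off is a tiny lookup from your targets to a strategy. The thresholds below are illustrative guesses made up for this post, not AWS guidance; real cut-offs depend on your workload and budget.

```python
from datetime import timedelta

# Illustrative thresholds only (assumptions, not official AWS numbers).
HOT_SITE_LIMIT = timedelta(minutes=1)
WARM_STANDBY_LIMIT = timedelta(minutes=10)
PILOT_LIGHT_LIMIT = timedelta(hours=1)


def suggest_strategy(rpo: timedelta, rto: timedelta) -> str:
    """Suggest a DR strategy from the stricter of the two objectives."""
    tightest = min(rpo, rto)
    if tightest < HOT_SITE_LIMIT:
        return "Multi-Site / Hot Site"
    if tightest < WARM_STANDBY_LIMIT:
        return "Warm Standby"
    if tightest < PILOT_LIGHT_LIMIT:
        return "Pilot Light"
    return "Backup and Restore"


if __name__ == "__main__":
    print(suggest_strategy(timedelta(hours=24), timedelta(hours=24)))
    print(suggest_strategy(timedelta(minutes=15), timedelta(minutes=30)))
```

The stricter objective drives the choice because a strategy has to satisfy both targets at once, and tighter targets cost more.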
