Disaster recovery (DR) is about planning for uncertainty. Like taking out insurance, a disaster recovery plan hopes for the best but prepares for the worst. Things may run smoothly for a long time, yet in the twinkling of an eye they can go badly wrong. When that happens, well-planned and well-tested disaster recovery procedures will save the day. As businesses increasingly depend on a well-functioning IT infrastructure, from hospitals to airlines to government, disruption of services can cost millions of dollars, and the company's reputation can be at stake too. As a result, disaster recovery should be a top priority for any organization.
RTO/RPO of Disaster Recovery
Recovery Time Objective (RTO) is how long a service outage can be tolerated. It answers the question: how long can your business afford to be down? For example, if the RTO is 1 hour and a service goes down at 8am, it must be back up by 9am. Recovery Point Objective (RPO) is how much data loss is permitted. It answers the question: how much data can you afford to lose? For example, if the RPO is 1 hour and a service goes down at 8am, data must be recoverable up until 7am.
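The two objectives can be checked mechanically after a recovery event. Below is a minimal illustrative sketch (function name and units are my own, not from any AWS API): it compares actual downtime against the RTO and the age of the newest restorable backup against the RPO.

```python
def meets_objectives(outage_minutes: int, minutes_since_last_backup: int,
                     rto_minutes: int, rpo_minutes: int) -> dict:
    """Check a recovery event against RTO/RPO targets.

    outage_minutes: how long the service was actually down
    minutes_since_last_backup: age of the newest restorable copy
    """
    return {
        "rto_met": outage_minutes <= rto_minutes,              # downtime within tolerance?
        "rpo_met": minutes_since_last_backup <= rpo_minutes,   # data loss within tolerance?
    }

# Service down at 8:00, restored 8:45; last backup taken at 7:30.
# With RTO = 60 min and RPO = 60 min, both objectives are met.
print(meets_objectives(outage_minutes=45, minutes_since_last_backup=30,
                       rto_minutes=60, rpo_minutes=60))
# → {'rto_met': True, 'rpo_met': True}
```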
The lower the RTO/RPO the better, but as much as everyone would like a low RTO/RPO, it comes at a high cost. The curve below illustrates this point: minimizing downtime when disaster strikes means putting more investment into backup and recovery.
Advantages of Cloud Disaster Recovery
Compared to conventional DR, cloud DR has several advantages: lower cost, faster recovery, better scalability, and better suitability for automation. The table below discusses these in more detail.
| Traditional Disaster Recovery | Cloud Disaster Recovery |
| --- | --- |
| Traditional disaster recovery usually involves maintaining a physically redundant data center. This means doing everything twice: purchasing and maintaining identical equipment, substantial connectivity between systems, space and cooling for all the hardware, advanced mirroring software, and specialized technical staff. This is prohibitively expensive for many companies. | Snapshots of physical or virtual servers are taken at the primary data center and stored in the cloud. The organization pays only for the resources it actually uses, such as snapshot storage and compute during tests or an actual disaster. |
| May prove burdensome to scale. | Scaling a cloud disaster recovery setup is fast and easy. |
| The disaster recovery site could take minutes (if not hours) to come online. Booting a physical machine takes at least a minute, longer than a virtual machine, so bringing a traditional DR site live takes more time than a cloud DR site. Data loss is directly related to this downtime. | The disaster recovery site can be brought online within seconds or minutes; a virtual machine instance can be up and running within seconds. A cloud DR site that boots within a few seconds limits data loss to just that time frame. |
| If connectivity to the physical disaster recovery site is unavailable, manual operations may be required to start the site's operations. | A cloud-based disaster recovery service can be triggered from anywhere, even over a wireless Internet connection. |
| **Backup location vs. latency** | |
| Backup locations are usually within the same physical region. This is good for low latency and data compliance, an advantage over cloud disaster recovery. | If the cloud resources are in a different geographic region than the physical data centers, this could affect latency and data compliance, depending on the organization. |
How to perform Cloud Disaster Recovery in AWS
There are four major ways that disaster recovery is performed on AWS. The discussions below are drawn from the AWS whitepaper on disaster recovery; you may consult the whitepaper for detailed discussions.
1. Backup and Restore
Traditional backups usually involve daily tape backups sent off to a remote site. The problem with this approach is the time it takes to restore. AWS provides services that are particularly suitable for backup and restore.
Amazon S3 is a suitable destination for data backups that require quick restores. Because it is network-based storage, data can be stored from any location in the world. For very large data sets, AWS Import/Export can be used to transfer data directly to AWS. Amazon Glacier is a cheaper alternative (starting from $0.01/GB per month) to S3 for longer-term data storage, where retrieval times of several hours are adequate.
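The S3-then-Glacier split above is typically expressed as an S3 lifecycle rule. As a rough sketch, the helper below builds a rule dict in the shape accepted by boto3's `put_bucket_lifecycle_configuration`; the prefix, day thresholds, and bucket name are illustrative assumptions, not values from the source.

```python
def backup_lifecycle_rule(prefix: str, glacier_after_days: int,
                          expire_after_days: int) -> dict:
    """Build an S3 lifecycle rule that tiers backups to Glacier.

    Objects under `prefix` transition to the cheaper GLACIER storage
    class after `glacier_after_days`, then expire entirely after
    `expire_after_days`.
    """
    return {
        "ID": f"backup-tiering-{prefix.strip('/')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [{"Days": glacier_after_days, "StorageClass": "GLACIER"}],
        "Expiration": {"Days": expire_after_days},
    }

rule = backup_lifecycle_rule("nightly/", glacier_after_days=30, expire_after_days=365)

# In a real account (hypothetical bucket name, requires credentials):
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-backup-bucket",
#       LifecycleConfiguration={"Rules": [rule]})
```

With this rule, recent backups stay in S3 for fast restores while older ones drift to Glacier's lower price tier automatically.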
AWS Storage Gateway creates snapshots of your on-premises data volumes and copies them directly into Amazon S3 for backup. You can subsequently create local volumes or Amazon EBS volumes from these snapshots.
Gateway-cached volumes allow you to store your primary data in Amazon S3 while keeping your frequently accessed data local for low-latency access. As with other AWS Storage Gateway configurations, you can snapshot the data volumes for highly durable backup. In the event of DR, you can restore the cached volumes either to a second site running a storage gateway or to Amazon EC2.
AWS integrates with many third-party solutions. You can use the gateway-VTL (Virtual Tape Library) configuration of AWS Storage Gateway as a backup target for your existing backup management software, replacing traditional magnetic tape backup.
2. Pilot Light
The term pilot light is often used to describe a DR scenario in which a minimal version of an environment is always running in the cloud.
This scenario is similar to a backup-and-restore scenario. For example, with AWS you can maintain a pilot light by configuring and running the most critical core elements of your system in AWS. When the time comes for recovery, you can rapidly provision a full-scale production environment around the critical core.
Infrastructure elements for the pilot light itself typically include your database servers, which would replicate data to Amazon EC2 or Amazon RDS. Depending on the system, there might be other critical data outside of the database that needs to be replicated to AWS. This is the critical core of the system (the pilot light) around which all other infrastructure pieces in AWS (the rest of the furnace) can quickly be provisioned to restore the complete system.
The pilot light method gives you a quicker recovery time than the backup-and-restore method because the core pieces of the system are already running and are continually kept up to date. AWS enables you to automate the provisioning and configuration of the infrastructure resources, which can be a significant benefit to save time and help protect against human errors. However, you will still need to perform some installation and configuration tasks to recover the applications fully.
3. Warm Standby
The term warm standby is used to describe a DR scenario in which a scaled-down version of a fully functional environment is always running in the cloud. A warm standby solution extends the pilot light elements and preparation. It further decreases the recovery time because some services are always running. By identifying your business-critical systems, you can fully duplicate these systems on AWS and have them always on.
These servers can run as a minimum-sized fleet of the smallest Amazon EC2 instance types possible. This solution is not scaled to take a full production load, but it is fully functional. It can be used for non-production work, such as testing, quality assurance, and internal use.
In a disaster, the system is scaled up quickly to handle the production load. In AWS, this can be done by adding more instances to the load balancer and by resizing the small capacity servers to run on larger Amazon EC2 instance types.
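Sizing the scale-up is simple arithmetic. The sketch below (illustrative capacity units of my own choosing, not AWS terminology) computes how many extra instances of the same type a warm-standby fleet needs to absorb the full production load.

```python
import math

def scale_out_count(current_instances: int, per_instance_capacity: int,
                    required_capacity: int) -> int:
    """How many additional instances of the same type are needed so the
    fleet can absorb `required_capacity` (illustrative units, e.g. req/s)."""
    needed = math.ceil(required_capacity / per_instance_capacity)
    return max(0, needed - current_instances)

# Warm standby: 2 small instances, each handling ~100 req/s, must
# grow to absorb 1,000 req/s of production traffic → add 8 more.
print(scale_out_count(2, 100, 1000))  # → 8
```

The same arithmetic applies when resizing to larger instance types instead: raise `per_instance_capacity` rather than the instance count.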
4. Multi-Site Solution Deployed on AWS and On-Site
A multi-site solution runs in AWS as well as on your existing on-site infrastructure, in an active-active configuration. The data replication method that you employ will be determined by the recovery point that you choose. In addition to recovery point options, there are various replication methods, such as synchronous and asynchronous methods.
You can use a DNS service that supports weighted routing, such as Amazon Route 53, to route production traffic to different sites that deliver the same application or service. A proportion of traffic will go to your infrastructure in AWS, and the remainder will go to your on-site infrastructure.
In an on-site disaster situation, you can adjust the DNS weighting and send all traffic to the AWS servers. The capacity of the AWS service can be rapidly increased to handle the full production load. You can use Amazon EC2 Auto Scaling to automate this process. You might need some application logic to detect the failure of the primary database services and cut over to the parallel database services running in AWS.
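The DNS-weight adjustment above can be sketched in a few lines. In Route 53 weighted routing, traffic is split in proportion to each record's weight, so zeroing the failed site's weight shifts everything to the survivors. This is a pure-logic illustration; in a real deployment the new weights would be pushed with boto3's `change_resource_record_sets`, and the site names and weights here are assumptions.

```python
def failover_weights(records: dict, failed_site: str) -> dict:
    """Shift all weighted-routing traffic away from a failed site.

    `records` maps site name -> routing weight. Returns a new weight
    map with the failed site zeroed, leaving survivors untouched.
    """
    return {site: (0 if site == failed_site else weight)
            for site, weight in records.items()}

# Normal operation: 30% of traffic to AWS, 70% to the on-site data center.
weights = {"aws": 30, "onsite": 70}

# On-site disaster: zero the on-site weight; all traffic now goes to AWS.
print(failover_weights(weights, failed_site="onsite"))
# → {'aws': 30, 'onsite': 0}
```

Because weights are relative, the surviving AWS record's absolute value does not matter; once the on-site weight is zero, AWS receives 100% of the traffic.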
The cost of this scenario is determined by how much production traffic is handled by AWS during normal operation. In the recovery phase, you pay only for what you use for the duration that the DR environment is required at full scale. You can further reduce cost by purchasing Amazon EC2 Reserved Instances for your “always on” AWS servers.
Saving both time and money is difficult to achieve, but that is exactly what cloud disaster recovery provides.
Using the cloud for disaster recovery is a major cloud use case that is growing in popularity, and this can only increase with time. Compared to traditional disaster recovery, cloud disaster recovery is cheap, fast, and scalable for both backup and recovery purposes.
AWS provides various services to assist with backup and recovery. The disaster recovery methods in AWS are backup and restore, pilot light, warm standby, and a multi-site solution deployed on AWS and on-site.