7 Cloud Disaster Recovery Tips for DevOps
Cloud service disruptions happen. For example, last December AWS had an ELB problem in us-east-1.
That disruption was not a huge concern for us, since the system was still in its staging phase and there was no traffic. However, we took the opportunity to learn and prepare ourselves. I’m sparing you the learning shenanigans so you can follow along and implement the solution without waiting for the next outage to happen.
Back in December, I got a call from a customer: an AWS Elastic Beanstalk environment had turned red, and an SNS notification had gone out.
First Rule: Prior to taking any action, look at all the indicators.
At first, it seemed unexpected: the underlying EC2 instances were idle, and there was no visible sign of failure. If there had been, Papertrail would have let us know, as would CloudWatch and CopperEgg (we run both to help us manage everything centrally).
So it looked like something out of the ordinary. How do you find it? Well, the canonical way is to look at the AWS Status Dashboard, but if you’re one of the lucky few who spot the error before anyone else does, here are some hints:
- Look at the PaaS providers’ status dashboards. I would look particularly closely at what is happening at Heroku, for instance. Understand that not all PaaS providers are equal – some simply didn’t report anything, for the sole reason that they don’t depend on ELB (Elastic Load Balancer) at all
- Create a Twitter search for AWS and subscribe to it – you can combine it with IFTTT to get up-to-date alerts
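Beyond dashboards and Twitter, the AWS status page has historically exposed per-service RSS feeds you can poll yourself. Here is a minimal sketch of parsing such a feed – the sample payload below is made up for illustration, and in production you would fetch the real feed with `urllib` instead of using an inline string:

```python
import xml.etree.ElementTree as ET

def parse_status_feed(xml_text):
    """Return (title, pubDate) tuples for every item in an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        pub_date = item.findtext("pubDate", default="")
        items.append((title, pub_date))
    return items

# Hypothetical sample of what a status feed item might look like:
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Amazon EC2 (N. Virginia) Service Status</title>
  <item>
    <title>Informational message: Elevated API error rates</title>
    <pubDate>Mon, 24 Dec 2012 13:05:00 PST</pubDate>
  </item>
</channel></rss>"""

for title, when in parse_status_feed(SAMPLE_FEED):
    print(f"{when}: {title}")
```

Pipe the output into whatever alerting you already have, and you get notified without refreshing the dashboard.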
At that time, since Heroku was also having issues, we clearly had a problem, but we still couldn’t pinpoint the cause.
Refrain from Taking Any Action - Until you understand everything
What we understood at first: the load balancer was red, yet none of the EC2 instances had failed or been rebooted. The application kept running as usual, so the problem was clearly on AWS’s side.
So we decided to do nothing and wait for AWS to fix it – and, meanwhile, to get ready for the next major outage.
Always keep a Disaster Recovery Plan - At least in another Region
In general, AWS outages affect a single Region at a time (I actually don’t remember any outage ever affecting multiple regions, except in cases of global services like Route 53). So what can you do?
That was our first realization: it was time to get a foothold in us-west. The us-east region is great – it is the most popular one, offers the most services, and is the cheapest option – but consider us-west at the very least for storing copies of your data. Why? It is close in price to us-east, and it has been a bit more stable.
So the first thing we did was make sure the continuous integration scripts could deploy to us-west as well as us-east. This involved some upgrading (and dropping Elastic Beanstalk’s Custom AMIs feature), but it went smoothly. We also launched the environment in us-west and wrote docs outlining how to fire up the backup environment in case something happens in us-east again.
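As a sketch of what that CI change amounts to – all names here (`pick_region`, `PREFERRED_REGIONS`, the `deploy` wrapper) are hypothetical – the script just needs a region preference list and a fallback rule:

```python
# Primary region first, disaster-recovery region second.
PREFERRED_REGIONS = ["us-east-1", "us-west-2"]

def pick_region(unhealthy, preferred=PREFERRED_REGIONS):
    """Return the first region not currently marked unhealthy."""
    for region in preferred:
        if region not in unhealthy:
            return region
    raise RuntimeError("no healthy region available")

def deploy(version, unhealthy=frozenset()):
    region = pick_region(unhealthy)
    # In a real CI script, this is where you would invoke your deployment
    # tooling (e.g. the Elastic Beanstalk API) against the chosen region.
    return f"deployed {version} to {region}"

print(deploy("v1.2.3"))                           # normal day: us-east-1
print(deploy("v1.2.3", unhealthy={"us-east-1"}))  # outage: falls back to us-west-2
```

The point is that the region becomes a parameter of the pipeline rather than an assumption baked into it, so the documented failover procedure is just “run the deploy with the other region”.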
Keep a “Disaster Recovery”/“Read-Only” Flag Handy in Your Application
If your replication works only within a single Region, it is probably a good idea to keep a “Read Only” mode, so users can still access the application with reduced functionality.
The idea actually comes from SEDA – the Staged Event-Driven Architecture – and I think it is great.
“The staged event-driven architecture (SEDA) refers to an approach to software architecture that decomposes a complex, event-driven application into a set of stages connected by queues. It avoids the high overhead associated with thread-based concurrency models, and decouples event and thread scheduling from application logic. By performing admission control on each event queue, the service can be well-conditioned to load, preventing resources from being overcommitted when demand exceeds service capacity.” Source: Wikipedia
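A minimal sketch of such a flag – class and method names here are hypothetical – is a switch that keeps reads working while rejecting writes during a DR event:

```python
class ReadOnlyModeError(Exception):
    """Raised when a write is attempted while the app is degraded."""

class App:
    def __init__(self):
        self.read_only = False  # flip to True during a disaster-recovery event
        self.data = {}

    def get(self, key):
        return self.data.get(key)  # reads are always served

    def put(self, key, value):
        if self.read_only:
            # In a web application you would return HTTP 503 here,
            # along with a friendly "temporarily read-only" notice.
            raise ReadOnlyModeError("service is in disaster-recovery mode")
        self.data[key] = value

app = App()
app.put("greeting", "hello")
app.read_only = True          # outage detected: degrade gracefully
print(app.get("greeting"))    # reads still work; writes now raise
```

Users keep browsing what they already had instead of staring at an error page, which is exactly the admission-control spirit of SEDA applied at the coarsest possible grain.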
Your Topology is not for Decorating Walls
Look at your system topology. Now picture each service and do a what-if analysis for each failure. Think about:
- Which kinds of failures could happen?
- How do we detect them?
- How do we react?
Keep all of this in a document. Print it, and keep a local copy on your computer.
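One way to keep that document honest is to store it as data next to the code and render it on demand. The components and procedures below are made-up examples, and the structure is just one possible sketch:

```python
# Hypothetical what-if runbook kept alongside the topology diagram.
FAILURE_MODES = [
    {
        "component": "Elastic Load Balancer",
        "failure": "Region-wide ELB API disruption",
        "detect": "Beanstalk environment turns red; external probe alerts",
        "react": "Fail over to the standby environment in us-west",
    },
    {
        "component": "Database primary",
        "failure": "Instance hardware failure",
        "detect": "CloudWatch alarm on connectivity / replication lag",
        "react": "Promote the replica; flip the application to read-only first",
    },
]

def render_runbook(modes):
    """Render the failure-mode list as the printable document."""
    lines = []
    for mode in modes:
        lines.append(f"== {mode['component']}: {mode['failure']}")
        lines.append(f"   Detect: {mode['detect']}")
        lines.append(f"   React:  {mode['react']}")
    return "\n".join(lines)

print(render_runbook(FAILURE_MODES))
```

Because the entries live in version control, adding a service to the topology without adding its failure modes shows up in code review.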
Who watches the watchers?
I always forget this one – as obvious as it might seem, you can have all kinds of alerts, but the most important ones involve DNS and machine reachability, and we often forget them. Tools like Pingdom are great for those cases. Also, make sure your monitoring itself (e.g., the Nagios you run on your own infrastructure) is testable from external parties.
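To make the point concrete, here is a minimal sketch of an external reachability probe. The `check_site` name and the `/health` path are assumptions, and the HTTP fetcher is injected so the logic runs anywhere (in production you would pass `urllib.request.urlopen`-style code, or better, let a service like Pingdom run the probe from outside your network):

```python
import socket

def check_site(hostname, fetch):
    """Return a small health report: DNS resolution plus an HTTP probe.

    `fetch(url)` should return an HTTP status code; it is injected so the
    probe can be exercised without network access.
    """
    report = {"dns": False, "http": False}
    try:
        socket.gethostbyname(hostname)   # can the name be resolved at all?
        report["dns"] = True
    except socket.gaierror:
        return report                    # no DNS, no point probing HTTP
    try:
        report["http"] = fetch(f"http://{hostname}/health") == 200
    except OSError:
        pass                             # unreachable counts as unhealthy
    return report

# Stubbed fetcher so the example is self-contained:
print(check_site("localhost", lambda url: 200))
```

The key design choice is that DNS and reachability are checked separately: a red HTTP probe with green DNS points at your machines, while red DNS points at your registrar or nameservers, and you want to know which.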
Keep an Eye on your Team’s Cognitive Capabilities
This one we’ve learnt from Heroku: keep an “Incident Commander” on rotation and make sure your people get enough sleep (or go polyphasic). Outage post-mortems are always a great read. Take your time, read them, and think about how you could prepare your organization for an outage.
Hope this helps.
About the Author
Aldrin Leal, Cloud Architect and Partner at ingenieux. Aldrin works as an architect and QA consultant, especially on cloud and Big Data cases. Besides his years in the trenches on projects spanning the telecom, aerospace, government, and mining segments, he has a passion for meeting new paradigms and figuring out how to bring them into new and existing endeavours.