Disaster Recovery on AWS Cloud by Emind Systems

In case of a disaster, we want to make sure that our applications are still up and running by taking advantage of our failover hosting or cloud provider. The Disaster Recovery (DR) architecture is driven by the criticality of applications and data. The decision about what to back up and deploy eventually translates into ongoing costs that can be very significant. Every IT organization has its own high-level policy guidelines, which are eventually translated into the policy deployed for each of the applications the enterprise runs. The CIO and their team need to define both the high-level policies and the actual budget that can be spent on DR.

There are three main approaches to handling DR in the AWS cloud:

1. Server Mirroring – Automatic provisioning of the required instances to support business continuity.

2. Data Replication – Database and storage replication between the on-premises resources and the AWS cloud. All data transfers must be secured.

3. Configuration Replication – Replication of the configuration data only. This practice might not fit enterprise applications; however, it can be a very good fit for DR of multitenant (SaaS) deployments. In a multitenant environment that consists of multiple instances, you can hold your AWS DR environment in stopped mode and start a cloud instance only if an internal on-premises resource fails.

Our best-practice deployment is to divide the IT resources and applications according to these three approaches. Data mirroring comes first, maintaining a relevant policy and backup cycles according to data criticality and update frequency. DR in the cloud can be much more cost-effective because the instance “comes to life” only when data replication is needed, and then the backup process takes place. When it is no longer needed, the DR environment can enter a “stopped state”, so we pay only for what we need. If backing up only the configuration data is an option (because the main application files sit on storage that serves all our webservers), we can use efficient processes that consume resources only for the fast replication of this small amount of data. The result is an always up-to-date DR environment maintained at very low cost.

Infrastructure-as-a-Service (IaaS) vendors invest a lot in their global presence. The global Amazon public cloud enables you to quickly and efficiently build multiple offshore DR sites. Doing the same in-house is almost impossible for a medium-size enterprise, and hard even for the largest ones, because it requires a huge upfront investment.

Data replication with security in mind

Replicating data means moving it over the web from on-premises (or co-location) resources to the public cloud, which involves security and compliance risks. Amazon cloud supports the hybrid option on several layers. The first is the option to build your DR site inside your own Virtual Private Cloud (VPC), a private network in the AWS cloud whose resources are connected to each other but not exposed to the internet. A more advanced option is Direct Connect, a dedicated physical connection from the on-premises site to the cloud. The advantages of the latter include not only enhanced security but also better performance. To benefit from all the compliance certifications supported by Amazon cloud, the integration should also keep in mind the principle of cloud shared responsibility, making sure that applications are deployed in keeping with the relevant security considerations.

The DR policy and cost are based on the required SLA; however, a good DR deployment on the Amazon public cloud keeps in mind the important pay-per-use cloud principle. We manage to deploy 80-90% of the environment as on-demand DR. The relatively low ongoing costs stem from the required storage and the few active instances. The right DR deployment and cloud usage can save you hundreds of thousands of dollars in comparison to a traditional DR investment.
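The pay-per-use argument above can be illustrated with a rough estimate. This is only a sketch under stated assumptions: the instance count, hourly rate, storage size and storage rate are hypothetical placeholders, not AWS prices, and the model ignores data transfer and any per-volume charges for stopped instances.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # average number of hours in a month


@dataclass
class DrCostInputs:
    instance_count: int     # instances in the full DR environment
    hourly_rate: float      # hypothetical price per instance-hour (USD)
    storage_gb: float       # replicated data kept in the cloud
    storage_rate: float     # hypothetical price per GB-month (USD)
    active_fraction: float  # share of instances kept running (0.0 to 1.0)


def monthly_cost(inputs: DrCostInputs) -> float:
    """Monthly cost when only `active_fraction` of the instances run."""
    running = inputs.instance_count * inputs.active_fraction
    compute = running * inputs.hourly_rate * HOURS_PER_MONTH
    storage = inputs.storage_gb * inputs.storage_rate
    return compute + storage


# Same environment, once fully running and once with 85% kept on demand.
always_on = DrCostInputs(20, 0.10, 2000, 0.03, active_fraction=1.0)
on_demand = DrCostInputs(20, 0.10, 2000, 0.03, active_fraction=0.15)

print(f"always-on: ${monthly_cost(always_on):,.2f}")
print(f"on-demand: ${monthly_cost(on_demand):,.2f}")
```

Replacing the placeholder figures with your own instance and storage prices gives a first-order view of how much of the bill comes from the few active instances versus the stored data.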

Backup and Restoration Techniques

Level 1: Data Backup on Amazon S3 Storage

Level 2: DR site in Stopped Mode

With real-time database synchronization, the data stays current while the webservers are in stopped mode. Once a disaster happens, the webserver instances start automatically and the DNS records are adjusted accordingly. The online service will be back in a matter of minutes. The RTO (Recovery Time Objective) includes the time that it takes to detect the failure and the time it takes to get the DR site running.
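The detect-then-recover flow described above could be sketched roughly as follows. The three callables (`is_healthy`, `start_dr_instances`, `point_dns_to_dr`) are hypothetical hooks that you would implement against your own monitoring, cloud API and DNS provider; the threshold of three failed checks is likewise an assumption. The returned value approximates the two RTO components named above, measured from the first failed check.

```python
import time
from typing import Callable


def failover_when_down(
    is_healthy: Callable[[], bool],
    start_dr_instances: Callable[[], None],
    point_dns_to_dr: Callable[[], None],
    failures_needed: int = 3,
    interval_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> float:
    """Poll the primary site; on sustained failure, bring up the DR site.

    Returns the seconds elapsed from the first failed check until the DNS
    switch completed (detection time plus DR start-up time).
    """
    consecutive_failures = 0
    first_failure_at = 0.0
    while True:
        if is_healthy():
            consecutive_failures = 0
        else:
            if consecutive_failures == 0:
                first_failure_at = clock()
            consecutive_failures += 1
            if consecutive_failures >= failures_needed:
                start_dr_instances()  # e.g. start the stopped webservers
                point_dns_to_dr()     # e.g. adjust the DNS records
                return clock() - first_failure_at
        sleep(interval_seconds)


# Demo with fakes so the sketch runs instantly and touches no real system.
fake_now = [0.0]
health_results = iter([True, False, False, False])
actions = []


def fake_sleep(seconds: float) -> None:
    fake_now[0] += seconds


elapsed = failover_when_down(
    is_healthy=lambda: next(health_results),
    start_dr_instances=lambda: actions.append("start"),
    point_dns_to_dr=lambda: actions.append("dns"),
    sleep=fake_sleep,
    clock=lambda: fake_now[0],
)
print(actions, elapsed)
```

Requiring several consecutive failures trades a longer detection time for fewer false failovers, so the threshold and polling interval feed directly into the RTO you can promise.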

Level 3: Active (Spare Wheel) DR Site

In cases where RTO is crucial and data updates are frequent, it is highly recommended to maintain an active DR site. The DR site runs with low standby capacity, which in some cases can even help out with workloads and bursts. In case of a disaster, the DNS records are adjusted. Horizontal or vertical on-demand scaling must be leveraged to fit the production loads and gain the cost benefits of the cloud. For example, vertical scaling from 2 cores to 8 cores can be done at the click of a button in the Amazon public cloud.
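For the horizontal case, the sizing decision at failover time comes down to simple arithmetic. The sketch below assumes you know the production request rate and a per-instance capacity figure; both parameters and the function name are illustrative, not taken from any AWS API.

```python
import math


def instances_for_load(requests_per_second: float,
                       capacity_per_instance: float,
                       standby_instances: int) -> int:
    """Return how many extra instances to add on top of the standby ones."""
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    required = math.ceil(requests_per_second / capacity_per_instance)
    return max(0, required - standby_instances)


# A standby site of 2 instances that must absorb 900 requests per second,
# with each instance assumed to handle about 200 requests per second.
print(instances_for_load(900, 200, standby_instances=2))
```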

An Active-Active DR architecture means that the service is hosted on two different offshore sites. Workloads are balanced between the sites in order to reduce latency and give end users a better experience. This type of web service deployment can also cover a disaster scenario, with the surviving active site scaling up its resources.

A Few Tips

Plan – Sounds trivial; however, we have found IT teams and individuals who use the public cloud’s appealing provisioning capabilities to “just” quickly back up a specific database or instance. This unplanned approach generates great inefficiencies and puts cloud adoption at risk due to high unexpected costs. You should define your own high-level policies, taking into consideration application criticality and data update frequency.

Adapt – Match your DR architecture to the organization’s high-level policies. Define the specific service RTO and RPO (Recovery Point Objective), taking into consideration the monthly DR budget.
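One way to make this matching concrete is to pick the cheapest of the three levels described earlier that still meets the service's RTO within the monthly budget. The RTO and cost figures below are placeholders for illustration only; measure your own. An RPO constraint could be added in the same way.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DrLevel:
    name: str
    rto_minutes: float   # hypothetical recovery time this level can achieve
    monthly_cost: float  # hypothetical monthly cost (USD)


# Placeholder figures, not measurements.
LEVELS = [
    DrLevel("Level 1: data backup on S3", rto_minutes=480, monthly_cost=100),
    DrLevel("Level 2: DR site in stopped mode", rto_minutes=15, monthly_cost=400),
    DrLevel("Level 3: active DR site", rto_minutes=2, monthly_cost=1500),
]


def choose_level(target_rto_minutes: float,
                 monthly_budget: float) -> Optional[DrLevel]:
    """Return the cheapest level that meets the RTO within budget, or None."""
    candidates = [
        level for level in LEVELS
        if level.rto_minutes <= target_rto_minutes
        and level.monthly_cost <= monthly_budget
    ]
    return min(candidates, key=lambda level: level.monthly_cost, default=None)


print(choose_level(target_rto_minutes=30, monthly_budget=500))
```

A `None` result signals that the RTO target and the budget conflict, which is exactly the conversation to have before deployment rather than after a disaster.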

POC – The public cloud enables innovation and experiments. The proof of concept is the seed for the initial project phases. Start by experimenting with the DR on a very small amount of resources for a very thin layer of the application. Measure and draw conclusions about the amount of effort and costs involved.

Tests – Chaos Monkey (https://github.com/Netflix/SimianArmy) is a very interesting testing tool made by Netflix for the AWS cloud. Netflix developers created a service (open source on GitHub) that tests system robustness by shutting down instances randomly. Make sure to continuously test your deployment while striving to improve it. Over time, these improvement cycles will require fewer and fewer fixes, and the cloud operation will mature while making the service more available.
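The core idea of random instance shutdown is small enough to sketch. This is not Netflix's implementation, only an illustration of the technique; the `terminate` callable is a hypothetical hook for whatever actually stops an instance in your environment, and the instance ids are made up.

```python
import random
from typing import Callable, Optional, Sequence


def terminate_random_instance(
    instance_ids: Sequence[str],
    terminate: Callable[[str], None],
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Pick one instance at random, terminate it, and return its id."""
    if not instance_ids:
        return None
    rng = rng or random.Random()
    victim = rng.choice(instance_ids)
    terminate(victim)
    return victim


# Dry run: record the chosen instance instead of shutting anything down.
terminated = []
victim = terminate_random_instance(
    ["i-web-1", "i-web-2", "i-web-3"],  # made-up instance ids
    terminate=terminated.append,
)
print("would terminate:", victim)
```

Starting with a dry run like this one, and only later wiring in a real shutdown call, keeps the first experiments cheap and safe.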

Monitor – Above all, measure and monitor to reveal anything you don’t know. The system evolves both from the development side and from changes in user demand. The unknown is probably the scariest enemy of IT.


About the Author

Lahav Savir

Architect & CEO of Emind Systems Ltd. A System Architect with over 15 years of experience, specializing in back-end systems and in the design and deployment of high-end online services. Lahav’s main focus is on high-performance Messaging and Voice Systems, including value-added services and data-centric systems, with considerable attention given to Management and Monitoring issues. As both system specialist and CEO, Lahav leads Emind Systems, a boutique company providing high-end IT services through design, implementation and ongoing data center management.

Emind Systems – a certified Amazon Solution Provider and Consulting Partner – proudly offers goCloud, the field-proven and assured route to effective hosting of your system in the cloud. Emind have developed goCloud to provide us with the power to smoothly design, deploy, manage, and maintain secure high-availability production environments in the cloud.


Keywords: Amazon Web Services, Amazon AWS console, AWS S3, Amazon Cloud Services, SLA, AWS Management Console, Disaster Recovery, Public Cloud, Hybrid Cloud, Backup, Deployment, Database replication, RTO, RPO, VPN, VPC, Recovery Time Objective, Recovery Point Objective, Horizontal Scale, Vertical Scale, DNS, Compliance, Cost Effectiveness, HA, High Availability
