Do you even need a DR plan for the cloud? Let’s talk about it.
“The cloud will handle all the disaster recovery for me!” – NO!
I had a meeting with a customer once, a large enterprise. When we discussed security operations, they were surprised to learn that if they run VMs in the cloud, they still need to patch them. Understanding the cloud operations model in this case helps.
It is the same with high availability and disaster recovery for cloud workloads. Public cloud services can help you with them, but you are responsible for thinking about how to prepare for recovery and making it happen.
Why?
Because the cloud provider does not know your workload or requirements for it. What they give you is the infrastructure to support it and protect your workloads from disasters at the local, datacenter, and region levels.
You still need to understand the disaster recovery, high availability, and SLA models of your services, and put your thoughts and planning efforts into making sure they support your requirements.
A practical example of DR requirements driving architecture choices
One of our customers hosted their corporate website on SharePoint, on local servers. Their profile suggested that they might be targeted by cyberattacks. They also required their website to be operational at a high SLA, even if the hosting infrastructure in the entire region went dark.
The result – the entire hosting infrastructure was spread across two continents, Europe and North America (United States). Initially, they used traditional server hosting, but we soon realized that all the requirements could be satisfied only with an Azure cloud solution spread across the EU and US regions. And that is exactly what we architected for them.
There is a drawback to it. If you want to protect your solution in all disaster scenarios, it will cost you more! DR has always been a costly thing.
From my pre-cloud era, I remember a solution offered by one of the vendors. They offered a service with five 9s (99.999%) availability, but it came with a hefty price tag and an entire room of equipment that you could not touch. It covered all the redundancy, power supplies, infrastructure to switch between redundant elements, and software to manage it.
Now it is gone. I mean it is still there, but it is exactly what public cloud operators build in their data centers. You operate at a higher layer, but even so, making it disaster-proof will cost you. The cost (but not all elements) might be related to:
You need to take it into account when architecting your solution. When you get a requirement like “I need 5 9s”, do the cloud cost math and be ready to say “OK, let’s rethink it, because here is how much it will cost us.”
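To put the “5 9s” in perspective, here is a minimal Python sketch that converts an availability target into a yearly downtime budget. The function name and the 365-day year are my assumptions for illustration; check how your provider actually measures its SLA period.

```python
# Yearly downtime budget for a given availability target.
# Assumption: a 365-day year; real SLAs are often measured per month.
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return how many minutes per year the service may be down."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:.1f} minutes of downtime per year")
```

Five 9s leaves you a little over five minutes a year, which is why the price tag grows so quickly with every extra 9.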
The sum of it all
Remember that your solution will consist of different elements. You will use infrastructure services, you will use the network, and maybe some PaaS services. Each of them has its own architecture and set of DR capabilities and SLA parameters. There is no single SLA for all the services from your cloud vendor.
Bad news: you (or your cloud architects) need to get to know them and understand them when architecting your solution.
SLA is a whole other topic. Remember that the SLA of your cloud solution will be the result of all its elements and how they are connected. Depending on the architecture of the solution and the services you are using, your overall SLA might end up higher or lower than the SLA of any individual part: components chained in series lower it, while redundant components raise it.
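A rough way to see how the SLA of a solution follows from its elements is the classic series/parallel availability math. The sketch below assumes that components fail independently, and the numbers are made up for illustration, not taken from any real SLA document. It also ignores the availability of whatever switches traffic between regions.

```python
from math import prod


def serial_availability(*components: float) -> float:
    """Every component must be up, so the availabilities multiply."""
    return prod(components)


def redundant_availability(*replicas: float) -> float:
    """At least one replica must be up: 1 minus the chance that all are down."""
    return 1 - prod(1 - a for a in replicas)


# Illustrative values only (assumptions, not real SLA figures).
web, db = 0.9995, 0.9999

# A web tier that depends on a database: lower than either part alone.
single_region = serial_availability(web, db)

# The same stack deployed in two regions: higher than a single region.
two_regions = redundant_availability(single_region, single_region)

if __name__ == "__main__":
    print(f"single region: {single_region:.6f}")
    print(f"two regions:   {two_regions:.8f}")
```

Chaining services drags the number down and redundancy pushes it up, so the result depends on how the parts are wired together, not only on which parts you pick.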
Let’s look at a practical example!
Here is a good reference example of an Azure cloud solution designed with availability in mind. A website with SQL backend sounds simple, but what if you throw in two additional requirements:
And that’s where the picture changes. Take a look at the reference architecture: “Multi-region web app with private connectivity to database”.
The higher up the stack of cloud services you are, the less you need to worry about DR at the infrastructure level. Higher-level cloud services can also help you prepare for disaster even if your solution is based on lower-level services, e.g. running on VMs in a traditional infrastructure.
Let’s look at the earlier example of a web application solution.
Traffic Manager, a PaaS service hosted in Azure, helps you redirect the traffic across the different regions with high availability of response and load distribution. You don’t need to think about it.
If you want to move it a bit higher you can use a service like Front Door to act as an entry point for your application, to hide its multi-region complexity from the users.
When using Azure AD B2C to handle your users’ security, you get both combined. This service supports HA at the data layer across multiple data centers (but not across different geographies), and it uses global entry points (similar to Front Door) to make sure that your customers can reach it from anywhere in the world over the fastest link.
Bottom line – the higher up the stack in the cloud, the more of this heavy lifting is handled by the vendor. It comes with drawbacks, mainly a possibly higher cost and less control over the process. This means that you need to understand your cloud model better and design accordingly.
Consider your metric – is it DR capability or time to recovery?
Last but not least – make sure that you design for the correct metric. Is it a full disaster-proof solution you aim for, or is what you care about the ability to quickly recover when a disaster happens?
The cloud in its essence is compute and storage, with services built on top. What it changed is not the underlying components but the way we use them – not as physical things but as APIs. This gave us a way to treat infrastructure and services as code and delivered a new practice – DevOps.
What DevOps with the cloud gave us is the possibility to lower one important metric – Mean Time To Recovery (MTTR). This is crucial here.
You can architect against all possible disasters and failures, or you can take into account your operations and architect with the ability to quickly recover from disaster with the right DevOps process in place.
Automation does magic. Maybe instead of paying additional $$$ all the time for standby cloud resources, you are fine with 4 hours of downtime to recover from a failure by redeploying the solution.
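To get a feeling for what a given recovery time means, you can use the standard steady-state formula: availability = MTBF / (MTBF + MTTR). The sketch below is only an illustration; the one-failure-per-year figure is an assumption of mine, so plug in your own failure history.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and mean time to recovery."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Assumption for illustration: roughly one serious failure per year.
MTBF_HOURS = 365 * 24

if __name__ == "__main__":
    for mttr in (0.25, 1, 4, 24):
        print(f"MTTR {mttr:>5} h -> availability {availability(MTBF_HOURS, mttr):.5%}")
```

With these assumed numbers, a 4-hour automated redeployment still keeps you well above 99.9%. If that meets the requirement, the always-on standby might be money you do not need to spend.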
As a bonus (and to make this update less biased) I also want to include some key insights on DR from Patryk Wolski, Cloud Architecture and Ops Lead at Accenture.
If you are deep into DR and HA topics, this article only scratches the surface, and with many shortcuts. If you are new to it, I hope it gave you a starting point. In either case, I really hope that if a disaster strikes, you will be able to say “Move on people, nothing to see here. We are fine.” because of the combination of your architecture and operations in place.
If you don’t have the time to deal with all this SLA model evaluation, we can help you out. Visit this page, tell us what you need, and we’ll take it from there.
By the way, if you’re wondering where the title comes from – I’m a huge fan of Sci-Fi, and it was inspired by one of the classic books by Philip K. Dick: “Do Androids Dream of Electric Sheep?” (you may know it better as the movie “Blade Runner”). Check them out and enjoy!