
Do clouds dream about disaster recovery?


Disasters happen. And they can happen to anyone. The question is – do you have a plan for them?

Do you even need a DR plan for the cloud? Let’s talk about it.

Key points

  • Will clouds handle the disaster recovery themselves?
  • How important are SLA and DR requirements in disaster recovery strategy?
  • What are Azure best practices in disaster recovery?

Can the cloud handle disaster recovery?

“The cloud will handle all the disaster recovery for me!” – NO!

I had a meeting with a customer once, a large enterprise. When we discussed security operations, they were surprised to learn that if they run VMs in the cloud, they still need to patch them. Understanding the cloud operations model in this case helps.

It is the same with high availability and disaster recovery for cloud workloads. Public cloud services can help you with them, but you are responsible for thinking about how to prepare for recovery and making it happen.

Why?

Because the cloud provider does not know your workload or requirements for it. What they give you is the infrastructure to support it and protect your workloads from disasters at the local, datacenter, and region levels.

You still need to understand the disaster recovery, high availability, and SLA models of your services, and put your thoughts and planning efforts into making sure they support your requirements.

Key points to note here

  • Define your requirements for DR and recovery time
  • Define your SLA requirements
  • Know what you want to prepare for (failure of a single data center, failure of an entire region, or staying available even if an entire continent’s infrastructure is down)
  • Understand your cloud DR model and services, and design your architecture bearing them in mind.

A practical example of DR requirements driving architecture choices

One of our customers hosted their corporate website on SharePoint, on local servers. Their profile suggested that they might be targeted by cyberattacks. They also required their website to be operational at a high SLA, even if the hosting infrastructure in the entire region went dark.

The result – the entire hosting infrastructure was spread across two continents, Europe and North America (United States). Initially, they used some server hosting, but soon we realized that all the requirements could be satisfied only with an Azure cloud solution spread across the EU and US regions. And that is exactly what we architected for them.


Disaster recovery is expensive

There is a drawback to it. If you want to protect your solution in all disaster scenarios, it will cost you more! DR was always a costly thing.

From my pre-cloud era, I remember a solution offered by one of the vendors. They offered a service with five 9s (99.999%) availability, but it came with a hefty price tag and an entire room of equipment that you could not touch. It covered all the redundancy, power supplies, infrastructure to switch between redundant elements, and software to manage it.

Now it is gone. I mean it is still there, but it is exactly what public cloud operators build in their data centers. You operate at a higher layer, but even so, making it disaster-proof will cost you. The cost might come from (among other things):

  • Higher tier of services you need to use to support your SLA and DR requirements
  • Cost of additional instances and storage to support them (or higher tier of storage)
  • Cost of network traffic between the regions and data centers.

You need to take it into account when architecting your solution. When you get a requirement of “I need 5 9s”, do the cloud cost math and be ready to say “OK, let’s rethink it, because here is how much it will cost us.”
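Before doing the cost math, it helps to see what an availability target actually allows. The snippet below is a minimal sketch that converts a percentage into a yearly downtime budget. The function name is made up for illustration, and it assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Compare a few common targets side by side.
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:.1f} minutes of downtime per year")
```

Five 9s leaves you a little over 5 minutes of downtime per year, while three 9s leaves almost 9 hours. That gap is what the extra spend buys, so it is worth asking whether the business really needs it.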

The sum of it all

Remember that your solution will consist of different elements. You will use infrastructure services, you will use the network, and maybe some PaaS services. Each of them has its own architecture and set of DR capabilities and SLA parameters. There is no single SLA for all the services from your cloud vendor.

Bad news: you (or your cloud architects) need to get to know them and understand them when architecting your solution.

Questions you need to ask yourself, and know the answers to, in order to design your solution

  1. What is the DR model for a specific service?
  2. Is it offered across different regions or not?
  3. Is it available in regions where we can use it (as there might be limitations)?
  4. What is the model to initiate DR capabilities? Will it happen automatically? Do we have to trigger it and then the vendor will do the rest? Do we have to take care of it on our own?
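One way to keep the answers to these four questions in front of you is to record them per service. The sketch below is only an illustration: the class names, fields, and the example service are hypothetical and do not come from any Azure API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class FailoverTrigger(Enum):
    AUTOMATIC = "the vendor fails over on its own"
    CUSTOMER_INITIATED = "we trigger it, the vendor does the rest"
    MANUAL = "we run the whole failover ourselves"


@dataclass
class ServiceDrProfile:
    name: str
    dr_model: str                   # question 1: what is the DR model?
    cross_region: bool              # question 2: offered across regions?
    available_in_our_regions: bool  # question 3: usable where we operate?
    trigger: FailoverTrigger        # question 4: who initiates failover?


def open_risks(profile: ServiceDrProfile) -> List[str]:
    """List the points that need a decision before the design is final."""
    risks = []
    if not profile.cross_region:
        risks.append(f"{profile.name}: no cross-region DR")
    if not profile.available_in_our_regions:
        risks.append(f"{profile.name}: not available in the regions we need")
    if profile.trigger is not FailoverTrigger.AUTOMATIC:
        risks.append(f"{profile.name}: failover needs us ({profile.trigger.value})")
    return risks


if __name__ == "__main__":
    # Hypothetical service used purely as an example.
    db = ServiceDrProfile(
        name="example-db",
        dr_model="geo-replication",
        cross_region=True,
        available_in_our_regions=True,
        trigger=FailoverTrigger.CUSTOMER_INITIATED,
    )
    for risk in open_risks(db):
        print(risk)
```

Filling in one profile per service makes the gaps visible early, especially the services where failover will not happen unless someone on your team starts it.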

SLA is a whole other topic. Remember that the SLA of your cloud solution will be the result of all its elements and how they are connected. Depending on the architecture of the solution and the services you are using, your overall SLA might be higher or lower than the SLA of any single part: services chained one after another lower it, while redundant copies in different regions can raise it.
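To see how the way elements are connected changes the result, here is a minimal sketch of the usual composite availability arithmetic. It assumes failures are independent, which real systems only approximate, and the figures are illustrative rather than actual Azure SLA values.

```python
from math import prod
from typing import Iterable


def serial_availability(availabilities: Iterable[float]) -> float:
    """Every component is required, so the availabilities multiply."""
    return prod(availabilities)


def redundant_availability(availabilities: Iterable[float]) -> float:
    """Any one copy is enough, so the system fails only if all copies fail."""
    return 1 - prod(1 - a for a in availabilities)


if __name__ == "__main__":
    # Illustrative numbers only, not real SLA figures.
    web, database, router = 0.9995, 0.9999, 0.9999

    one_region = serial_availability([web, database])
    two_regions = redundant_availability([one_region, one_region])
    whole_solution = serial_availability([router, two_regions])

    print(f"single region:       {one_region:.6f}")
    print(f"two regions:         {two_regions:.6f}")
    print(f"with traffic router: {whole_solution:.6f}")
```

A single region comes out lower than either of its components, the two-region pair comes out higher than both, and the shared entry point then caps what the whole solution can reach.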

Let’s look at a practical example!

Here is a good reference example of an Azure cloud solution designed with availability in mind. A website with SQL backend sounds simple, but what if you throw in two additional requirements:

  • It has to be spread across regions
  • It has to maintain the privacy of connections as much as possible

And that’s where the picture changes. Take a look: Multi-region web app with private connectivity to database

Disaster recovery – Azure best practices

The higher up the stack of cloud services you are, the less you need to worry about DR at the infrastructure level. Also, higher-level cloud services can help you prepare for disaster even if your solution is based on lower-level services, e.g. running on VMs in a traditional infrastructure.

Let’s look at the earlier example of a web application solution.

Traffic Manager, a PaaS service hosted in Azure, helps you redirect the traffic across the different regions with high availability of response and load distribution. You don’t need to think about it.
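To make the idea of redirecting traffic between regions more concrete, here is a small sketch of priority-based failover: send traffic to the most preferred endpoint that is currently healthy. This is not Traffic Manager's API or implementation. The endpoint names, priorities, and health flags are hypothetical, and a real service determines health through its own probes.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Endpoint:
    name: str
    priority: int  # lower number means more preferred
    healthy: bool  # in a real setup this comes from health probes


def pick_endpoint(endpoints: Iterable[Endpoint]) -> Optional[Endpoint]:
    """Return the healthy endpoint with the lowest priority number, if any."""
    healthy = [e for e in endpoints if e.healthy]
    return min(healthy, key=lambda e: e.priority, default=None)


if __name__ == "__main__":
    # Hypothetical two-region deployment with the primary region down.
    regions = [
        Endpoint("eu-primary", priority=1, healthy=False),
        Endpoint("us-secondary", priority=2, healthy=True),
    ]
    target = pick_endpoint(regions)
    print(target.name if target else "no healthy endpoint")
```

The managed service takes this logic, plus the probing and the global distribution behind it, off your hands. The sketch is only meant to show what is being decided on your behalf.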

If you want to move it a bit higher you can use a service like Front Door to act as an entry point for your application, to hide its multi-region complexity from the users.

When using Azure AD B2C to handle your users’ security, you get both combined: this service supports HA at the data layer across multiple data centers (but not across different geographies) and uses global entry points (similar to Front Door) to make sure that your customers can reach it from anywhere in the world over the fastest link.

Bottom line – the higher up the stack in the cloud, the more of this heavy lifting is handled by the vendor. It comes with drawbacks, mainly a possibly higher cost and less control over the process. This means that you need to understand your cloud model better and design accordingly.

Consider your metric – is it DR capability or time to recovery?

Last but not least – make sure that you design for the correct metric. Is it a full disaster-proof solution you aim for, or is what you care about the ability to quickly recover when a disaster happens?

The cloud in its essence is compute and storage, with services built on top. What it changed is not the underlying components but the way we use them – not as physical things but as APIs. This gave us a way to treat infrastructure and services as code and delivered a new practice – DevOps.

What DevOps with the cloud gave us is the possibility to lower one important metric: Mean Time To Recovery (MTTR). This is crucial here.

You can architect against all possible disasters and failures, or you can take into account your operations and architect with the ability to quickly recover from disaster with the right DevOps process in place.

Automation does magic. Maybe instead of paying additional $$$ all the time for cloud resources, you are fine with 4 hours of downtime to recover from a failure by redeploying the solution.
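The link between recovery time and availability can be put into numbers with the standard formula availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures. The sketch below assumes roughly one failure per year, which is an illustrative figure and not a measurement.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and recovery time."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be positive and mttr_hours non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Assumption: about one failure per year (8,756 hours up before each failure).
    mtbf = 8756.0
    for label, mttr in (("manual redeploy, 4 h", 4.0), ("automated redeploy, 15 min", 0.25)):
        print(f"{label}: {availability(mtbf, mttr) * 100:.3f}%")
```

Under that assumption, a 4-hour recovery still gives you about 99.95% availability without paying for a second region to stand by all year. Cutting recovery time through automation pushes the number up further without changing the architecture.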

As a bonus (and to make this update less biased) I also want to include some key insights on DR from Patryk Wolski, Cloud Architecture and Ops Lead at Accenture.

Expert tips

  • Establish what a “disaster” means to you. Is it only when a literal disaster happens, or would a prolonged outage of your application also trigger your recovery plan?
  • Think about your application landscape more holistically and ensure your recovery plan takes into account various interconnections with other components and not just the single application.
  • Determine “what’s next?” after your “failover”. Is it business as usual, or do you need to re-establish your DR? How will you handle communication with your customers and partners?

What is your choice when it comes to a disaster recovery strategy?

If you are deep into DR and HA topics, this article only scratches the surface, and with many shortcuts. If you are new to it, I hope it gave you a starting point. In either case, I really hope that if a disaster strikes, you will be able to say “Move on people, nothing to see here. We are fine.” because of the combination of your architecture and operations in place.

If you don’t have the time to deal with all this SLA model evaluation, we can help you out. Visit this page, tell us what you need, and we’ll take it from there.

By the way, if you’re wondering where the title comes from – I’m a huge fan of sci-fi, and it was one of the classic books by Philip K. Dick, “Do Androids Dream of Electric Sheep?” (you may know it better as the movie “Blade Runner”), that inspired this one. Check them out and enjoy!

Key takeaways

  1. To make good choices while architecting your solution, define your DR and recovery time requirements and your SLA requirements, understand your cloud’s DR model and services, and last but not least, decide what failures you want to prepare for.
  2. Don’t overlook the cost, which may come from higher tiers of services, additional instances and storage to support them, and network traffic between regions and data centers.
  3. There is no single SLA for all the services from your cloud vendor. Your solution will consist of different elements (infrastructure services, the network, and maybe some PaaS services), each with its own architecture, DR capabilities, and SLA parameters.
  4. Take your time to think about whether you aim for a full disaster-proof solution or the ability to quickly recover in case of a disaster.
