Building an Effective Disaster Recovery Plan for SaaS Data

For decades, a reliable disaster recovery plan (DR plan) has been crucial for companies that base their business processes on software. The shift from on-premises to the cloud hasn't changed this. If you rely on SaaS applications such as Atlassian Jira, Jira Service Management, or Confluence for your business processes, your DR plan should include specific steps for recovering critical data stored in these services. In this post, we'll walk you through the key building blocks of a disaster recovery plan for your SaaS applications.

Why is a Disaster Recovery Plan necessary for SaaS data?

We get this question at least once a day from peers. “Why would I need a disaster recovery plan for the cloud, specifically SaaS?”

Here’s why: when businesses develop their own software, they have control over data privacy, security, and reliability. Cloud applications take away much of the responsibility of hosting your own solutions, but not all of it.

This split is called the Shared Responsibility Model. It’s often associated with AWS, but the concept of shared responsibility governs all cloud computing, including Atlassian. Essentially, you and the cloud provider share the responsibility of protecting data. SaaS providers generally take care of the platform itself (infrastructure, application, and availability), but not user access and user data. You are on the hook for safeguarding those things.

While some SaaS products offer backup and restore capabilities, they may lack fine-grained recovery controls, comprehensive data visibility through detailed audit logs, or guarantees of data safety. Therefore, having a disaster recovery plan for each SaaS product you depend on is essential.

Step One: Define your RTO & RPO

Two key metrics should drive your recovery strategy: the Recovery Point Objective (RPO), which defines your tolerance for data loss, and the Recovery Time Objective (RTO), which sets the target for how quickly you must be back up and running.

The RPO metric revolves around one fundamental question: "How much data can you afford to lose?" Here is how we often frame things: if you back up data once every 24 hours at midnight, and a disaster strikes at 11:59 pm, you lose nearly an entire day's worth of data.

That level of risk will be tolerable for some businesses but unacceptable for others. Defining your risk tolerance for data loss will guide your technical decisions in meeting these requirements effectively.

To determine your RPO, you’ll want to consider the criticality of your data and the impact of potential data loss on your organization. Different services or applications may have varying RPO thresholds. For instance, mission-critical systems may demand real-time replication with zero or near-zero data loss, while non-critical systems might tolerate a longer interval between backups.
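To make the RPO reasoning concrete, here is a minimal Python sketch. The function names, the dates, and the 4-hour RPO are our own illustrative assumptions and not part of any Atlassian or AWS tooling. It computes the data loss window for the midnight backup scenario above and checks whether a backup schedule can satisfy a given RPO.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, disaster: datetime) -> timedelta:
    """Everything written after the last good backup is lost."""
    return disaster - last_backup


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """In the worst case the disaster hits just before the next backup runs,
    so the maximum possible loss equals the full backup interval."""
    return backup_interval <= rpo


# The scenario from the text: backup at midnight, disaster at 11:59 pm.
loss = data_loss_window(datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 1, 23, 59))
print(loss)  # 23:59:00

# Hypothetical 4-hour RPO checked against two backup schedules.
rpo = timedelta(hours=4)
print(meets_rpo(timedelta(hours=24), rpo))  # False
print(meets_rpo(timedelta(hours=1), rpo))   # True
```

The point of the check is that the backup interval, not the average case, sets the ceiling on data loss: a daily backup can never meet an RPO shorter than 24 hours.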

The RTO metric focuses on how quickly you can recover from a disaster and resume normal operations. Imagine a scenario where a meteorite strikes your data center. How long would it take to get your systems back up and running? This involves factors such as procuring alternate infrastructure or restoring from backups. The time required for recovery varies depending on the service in question.

Some services can achieve rapid recovery times, potentially within minutes, while others may take considerably longer, perhaps an entire day or more. Understanding the unique recovery timelines for each service or application is essential for effective disaster recovery planning.
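One way to sanity-check an RTO is to add up the individual recovery steps and compare the total with the target. The sketch below assumes the steps run one after another; the step names, durations, and the 4-hour RTO are hypothetical placeholders that you would replace with timings from your own recovery drills.

```python
from datetime import timedelta

# Hypothetical recovery steps and durations, e.g. as timed during a drill.
recovery_steps = {
    "detect and declare the incident": timedelta(minutes=30),
    "provision replacement environment": timedelta(hours=2),
    "restore data from backup": timedelta(hours=3),
    "verify and reopen to users": timedelta(minutes=45),
}


def total_recovery_time(steps: dict[str, timedelta]) -> timedelta:
    """Sum the step durations, assuming they run sequentially."""
    return sum(steps.values(), timedelta())


def rto_overrun(steps: dict[str, timedelta], rto: timedelta) -> timedelta:
    """How far the estimated recovery exceeds the RTO (zero if it fits)."""
    return max(total_recovery_time(steps) - rto, timedelta())


rto = timedelta(hours=4)  # hypothetical target
print(total_recovery_time(recovery_steps))  # 6:15:00
print(rto_overrun(recovery_steps, rto))     # 2:15:00
```

A non-zero overrun means either the target needs renegotiating with stakeholders or the slowest steps need a faster recovery approach.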

A key way to get started is by engaging with stakeholders across your organization to ensure their buy-in and agreement on the defined RPO and RTO targets. This collaboration will foster a shared understanding of the potential impacts of a disaster and the required recovery timelines.

Step Two: Choosing a Recovery Strategy

Choosing the right strategy boils down to a trade-off between robustness and cost. Here’s a graphic from AWS that also applies to products provided by Atlassian and helps illustrate the spectrum of choices.

Figure: Recovery strategy options
Source: Disaster recovery options in the cloud - Disaster Recovery of Workloads on AWS: Recovery in the Cloud

Let's start on the left side with the Backup & Restore solution. It's the simplest and most affordable option. However, the recovery time here can take hours, or even longer. Essentially, you're restoring your latest backup(s) to your disaster recovery location.

Moving along, we have the Pilot Light option. With this approach, you keep only the essential core services running at reduced capacity in the DR location. Most other services are deployed there but scaled to zero until they are needed. Code or application updates are pushed to the DR location just as you would update your primary location.

Next up is the Standby strategy. Here, everything is up and running, albeit at a smaller capacity compared to your primary environment. It's similar to the Pilot Light option, but all services are operational with at least some capacity — nothing scaled to zero.

Finally, we have the Active/Active solution. This is the most comprehensive approach, where you run full services in two parallel streams, allowing you to switch between them in near real-time. However, it's worth noting that this option can roughly double your costs, making it less feasible for many companies.

Your choice of recovery strategy depends on your risk tolerance and how much you're willing to invest to mitigate that risk. Depending on your industry, you'll have core systems that form the foundation of your operations, as well as peripheral systems. While a full day of downtime for a core system can be painful for most businesses, the impact may be less significant if it's a tool for running marketing programs, gathering statistics, or another secondary service.

This means you'll need distinct disaster recovery plans for various systems within your organization. While there may be overlapping elements between service disaster recovery plans, it's important to consider each service's plan due to potential variations in RPOs and RTOs.
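As a rough illustration of how per-service targets can drive the choice of strategy, the sketch below maps the tighter of a service's RTO and RPO to one of the four options described above. The thresholds and the service names are assumptions we made up for the example; they are not official AWS or Atlassian guidance and should be tuned to your own cost and risk profile.

```python
from datetime import timedelta
from enum import Enum


class Strategy(Enum):
    BACKUP_RESTORE = "Backup & Restore"
    PILOT_LIGHT = "Pilot Light"
    STANDBY = "Standby"
    ACTIVE_ACTIVE = "Active/Active"


def suggest_strategy(rto: timedelta, rpo: timedelta) -> Strategy:
    """Pick the cheapest strategy that plausibly meets the tighter target.

    The cut-off values below are illustrative assumptions only.
    """
    tightest = min(rto, rpo)
    if tightest >= timedelta(hours=4):
        return Strategy.BACKUP_RESTORE
    if tightest >= timedelta(hours=1):
        return Strategy.PILOT_LIGHT
    if tightest >= timedelta(minutes=5):
        return Strategy.STANDBY
    return Strategy.ACTIVE_ACTIVE


# Hypothetical services with different targets (RTO, RPO).
targets = {
    "marketing analytics": (timedelta(hours=24), timedelta(hours=24)),
    "service desk": (timedelta(hours=2), timedelta(hours=8)),
    "order processing": (timedelta(minutes=30), timedelta(minutes=30)),
}

for service, (rto, rpo) in targets.items():
    print(f"{service}: {suggest_strategy(rto, rpo).value}")
```

Treat the output as a starting point for the conversation with stakeholders, since the real decision also depends on budget and on what each SaaS vendor actually supports.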

Step Three: Testing a Disaster Recovery Plan

To test your disaster recovery plan effectively, use this checklist to focus your efforts:

  1. Organize a tabletop test of your DR plan:
    A tabletop test is a great way to start because it puts all the ideas and methods in front of everyone. It allows all stakeholders to have a say in how you should proceed with your disaster recovery efforts.
  2. Walk through dependencies:
    Walk through any internal and external dependencies that could hinder or even prevent your disaster recovery plan from being fully effective. It's vital to address and resolve these dependencies before implementing the plan.
  3. Document everything:
    The tabletop meeting forms the foundation of your disaster recovery plan, so document everything thoroughly. It's a good idea to get participants to sign off on the documentation to avoid any confusion later on.
  4. Create accountability lists:
    These lists identify who is on call if a disaster disrupts your business, that is, the people responsible for executing the different phases of the DR plan. Keep each list clearly laid out and up to date so that newer team members know who is doing what in an emergency.
  5. Plan for different types of disasters:
    While you can't plan for every possible disaster, it's important to assess and prioritize the risks you face. Whether it’s malware attacks, data center outages or third-party provider outages, pick which ones you want to prepare for.
  6. Understand the cost of downtime and data loss:
    When SaaS tools are vital to processes and workflows, outages can drain productivity and cash. Atlassian’s fourteen-day outage in April 2022 is one such example; it ended up affecting over 50,000 users. The downtime cut off access to critical SaaS products like Jira, Confluence, and Opsgenie. It also resulted in the loss of data. By Atlassian’s own calculation, the average cost of downtime to customers is $5,600 per minute, but the actual cost varies from business to business.
    Of course, preparing for such an event will incur costs on multiple levels, so be sure to carefully consider what failsafes you want to invest in and to what extent. A good place to start is backup and recovery software to keep critical operations up, even when part of the app isn’t functional.
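To get a first estimate of what an outage would cost your own business, a back-of-the-envelope calculation is often enough. The sketch below is one simple model, not a standard formula: it adds lost productivity to lost revenue. Every input value (headcount, hourly cost, productivity loss, revenue per hour) is a hypothetical placeholder.

```python
def downtime_cost(
    hours_down: float,
    affected_users: int,
    hourly_cost_per_user: float,
    productivity_loss: float,
    lost_revenue_per_hour: float = 0.0,
) -> float:
    """Estimate outage cost as lost productivity plus lost revenue.

    productivity_loss is the fraction (0.0 to 1.0) of each affected user's
    work that is blocked while the tool is unavailable.
    """
    if not 0.0 <= productivity_loss <= 1.0:
        raise ValueError("productivity_loss must be between 0.0 and 1.0")
    productivity = hours_down * affected_users * hourly_cost_per_user * productivity_loss
    revenue = hours_down * lost_revenue_per_hour
    return productivity + revenue


# Hypothetical example: 200 users half-blocked for one 8-hour working day.
estimate = downtime_cost(
    hours_down=8,
    affected_users=200,
    hourly_cost_per_user=60.0,
    productivity_loss=0.5,
    lost_revenue_per_hour=1000.0,
)
print(f"Estimated cost of the outage: ${estimate:,.0f}")  # $56,000
```

Comparing this figure with the yearly cost of a backup and recovery solution gives you a simple basis for deciding how much protection is worth paying for.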

A Recap

At a glance, your checklist for testing your disaster recovery plan should look something like this:

  • Understand why you need a SaaS disaster recovery plan.
  • Set your RPO & RTO.
  • Confirm with relevant stakeholders what business functions you want to protect and secure.
  • Decide on the internal and external tooling required to carry out a disaster recovery plan, taking into consideration the security and privacy of your SaaS data.
  • Create clear and relevant accountability lists that illustrate who’s on call for what and in what situations.
  • Scope the types of disasters you’re planning for, because you can’t plan for everything.
  • Document the plan, make it easily accessible, and have the proper stakeholders sign off on it.

Do you have any questions or would you like to learn more about Disaster Recovery Plans? Just contact us — we're happy to help you.
