Over the years, the Cloud has gained popularity among application developers. It helps us build robust applications that scale with customer demand. However, the Cloud isn't always fail-safe; we still hear news about Cloud outages from time to time.
Different Scale of Outages
If we look closely, these outages range from simple server failures to outages of entire data centers, or even of multiple data centers. One such example is the Amazon S3 service disruption that occurred on 28th February 2017. The impact was huge: many popular applications suffered downtime. Though such an event is very rare, since Amazon S3 is a regional service (redundant across multiple data centers and availability zones), outages at the lower levels occur more frequently.
Therefore, it's important to design our application to serve our customers amidst Cloud outages.
Designing Applications for High Availability
Before we start designing applications for high availability, we need to understand the availability criteria defined for them. The required uptime could depend on the usage patterns or the ability to meet service level agreements (SLAs) signed with your customers.
Reference: (Wikipedia) https://en.wikipedia.org/wiki/High_availability
However, it's essential to understand that higher availability comes at a cost. Therefore, before settling on an availability percentage, it's a good start to refer to the availability chart, which directly shows the downtime we can expect. For instance, if we go for 99.99% ("four nines"), we can expect only about 52.6 minutes of downtime for an entire year. Meeting such criteria requires zero-downtime releases and automatic failover.
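The downtime figures in such a chart follow directly from the availability percentage. As a quick sketch of the arithmetic (the function name here is my own, not a library API):

```python
def max_downtime_minutes_per_year(availability_pct: float) -> float:
    """Maximum yearly downtime allowed by a given availability percentage."""
    minutes_per_year = 365.25 * 24 * 60  # average year, accounting for leap years
    return (1 - availability_pct / 100) * minutes_per_year

# "Four nines" allows roughly 52.6 minutes of downtime per year.
print(round(max_downtime_minutes_per_year(99.99), 2))  # 52.6
# "Three nines" already allows close to nine hours per year.
print(round(max_downtime_minutes_per_year(99.9), 2))   # 525.96
```

This makes the cost trade-off concrete: each extra "nine" shrinks the allowed downtime by a factor of ten.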
Besides, you need to design your application with redundancy to meet the availability requirements. Most of the time, you can utilize fully managed services in AWS Cloud, such as Amazon S3 and Amazon DynamoDB, which already have built-in redundancy in place.
- Amazon S3: The S3 Standard storage class is designed for 99.99% availability and 99.999999999% durability.
- Amazon DynamoDB: Uptime Percentage for each AWS region during any monthly billing cycle is at least 99.999% if the Global Tables SLA applies, or 99.99% if the Standard SLA applies.
Once you design the solution, the weakest link in the architecture (typically in the critical path) determines the availability of the overall system.
So, let's examine the critical path through Amazon API Gateway -> Lambda -> Amazon DynamoDB in the above example. Since Amazon API Gateway provides 99.95% availability, it determines the availability of the entire path.
Besides, suppose we need to increase the availability from 99.95% to 99.99%. In that case, we need redundant API Gateway endpoints and Lambda functions in a second region, both accessing Amazon DynamoDB Global Tables.
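The arithmetic behind these two statements can be sketched as follows. Services in a critical path multiply their availabilities (all must be up), while redundant copies fail only when every copy fails. The function names are my own, and I assume Lambda's SLA figure of 99.95% alongside the API Gateway and DynamoDB numbers quoted above:

```python
def serial(*availabilities):
    """Availability of services chained in a critical path: all must be up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def parallel(availability, copies=2):
    """Availability of redundant copies: down only if every copy fails."""
    return 1 - (1 - availability) ** copies

# API Gateway -> Lambda -> DynamoDB, using the published SLA figures.
path = serial(0.9995, 0.9995, 0.9999)
print(round(path, 6))  # 0.9989: the chain is less available than any single link

# A second, independent region roughly squares the failure probability.
print(parallel(0.9995))  # 0.99999975
```

Note that the serial result is slightly *below* the weakest link, which is why redundancy across regions is needed to push a 99.95% path past 99.99%.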
In most practical cases, we don't need to go to such extremes. But what if you do need to go beyond these limits? What patterns are available?
Different Patterns to Design Highly Available Applications
There are well-established patterns for designing highly available applications. First, let's look at how we can build an application that can survive a server outage. I will use examples from AWS Cloud, but the patterns apply to other Cloud providers as well.
Using Auto Scaling Servers
One of the main advantages of moving to the Cloud is elasticity. Since we can provision servers on demand, we can design our application to survive a server outage.
If you look at the above example, you can find servers (multiple EC2 instances) behind a load balancer. Furthermore, the servers are placed in two availability zones. In AWS, these zones are distinct data centers, physically separated from each other. Therefore, even if one data center goes down (taking a server with it), the other can still serve users.
One of the design considerations here is to configure your application to start automatically and to serve requests in a stateless manner. If state is needed, store it externally, outside the server.
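A minimal sketch of that idea (the class and store below are stand-ins of my own, not an AWS API): each server instance keeps no session state locally, so anything it must remember goes to an external store such as Redis or DynamoDB, represented here by a plain dict.

```python
class StatelessServer:
    """A server instance that keeps no local state; sessions live externally."""

    def __init__(self, session_store):
        self.sessions = session_store  # e.g. Redis or DynamoDB in practice

    def handle_login(self, session_id, user):
        self.sessions[session_id] = {"user": user}

    def handle_request(self, session_id):
        # Any instance behind the load balancer can serve any session.
        return self.sessions.get(session_id, {}).get("user")


shared_store = {}  # stand-in for an external session store
server_a = StatelessServer(shared_store)
server_b = StatelessServer(shared_store)

server_a.handle_login("s1", "alice")
# server_a can now disappear; server_b still serves the session.
print(server_b.handle_request("s1"))  # alice
```

Because the session survives any single instance, the load balancer is free to route each request to whichever server happens to be healthy.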
In addition, these Auto Scaling groups can provision new EC2 instances on demand to meet increased load.
Utilizing Serverless Services
If you are developing new applications in the Cloud, one of the main advantages is the availability of Serverless services (or fully managed services). Since most of these services come with built-in redundancy, it becomes quite straightforward to meet availability criteria.
For example, all the services listed below support redundancy at the regional level. Therefore, they can survive a data center outage with zero downtime.
In some cases, you can find services such as CDNs (AWS CloudFront) and DNS (AWS Route53) that go even beyond the region. They further prepare our applications to withstand Cloud outages.
If we need an application to survive an entire region outage, we need to go beyond Cloud regions. There, the main challenge is to ensure that the data is synchronized across these regions for seamless failover.
This can be achieved in two different ways. One approach uses failover routing at the DNS level (here, AWS Route53): if one endpoint fails its health check, traffic is forwarded to the other. In an active-active setup, traffic can also be distributed across the two environments during normal operation.
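As a sketch, DNS failover in Route 53 is configured as a pair of record sets marked PRIMARY and SECONDARY, each tied to a health check. The helper below only builds the change-batch payload; in practice you would pass it to boto3's `route53` client, and the domain, IPs, and health check IDs here are placeholders of my own:

```python
def failover_record(name, ip, role, set_id, health_check_id):
    """Build one failover record set for a Route 53 change batch."""
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "A",
            "TTL": 60,
            "SetIdentifier": set_id,
            "Failover": role,  # "PRIMARY" or "SECONDARY"
            "HealthCheckId": health_check_id,
            "ResourceRecords": [{"Value": ip}],
        },
    }


change_batch = {"Changes": [
    failover_record("app.example.com", "203.0.113.10", "PRIMARY",
                    "primary-region", "hc-primary-placeholder"),
    failover_record("app.example.com", "198.51.100.10", "SECONDARY",
                    "secondary-region", "hc-secondary-placeholder"),
]}

# Applied (with a real hosted zone ID) via:
# boto3.client("route53").change_resource_record_sets(
#     HostedZoneId="Z-PLACEHOLDER", ChangeBatch=change_batch)
```

When the primary health check fails, Route 53 starts answering queries with the secondary record, which is what gives the seamless failover described above.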
The other approach is to use a CDN service (here, AWS CloudFront), which can fail over to a different environment depending on the responses from each endpoint.
We have so far discussed how our application can withstand Cloud outages at different levels. But what happens to the services that go down due to an outage? Do we have to intervene and restore them manually?
In the Cloud, the best approach is to design our solution to self-heal when an outage happens. For example, we can start by configuring our servers to start the application automatically, or use Serverless services that recover on their own.
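In practice the "start automatically" part is handled by the platform (systemd, container orchestrators, or Auto Scaling health checks), but the underlying idea can be sketched as a small watchdog loop that restarts a process whenever it exits abnormally. The command used below is a stand-in that simply exits with a failure code:

```python
import subprocess
import sys


def supervise(cmd, max_restarts=3):
    """Restart cmd each time it exits with an error, up to max_restarts times."""
    restarts = 0
    while restarts < max_restarts:
        result = subprocess.run(cmd)
        if result.returncode == 0:
            break  # clean exit: nothing to heal
        restarts += 1  # crashed: start it again
    return restarts


# Stand-in for a crashing application process: always exits with code 1.
crashes = supervise([sys.executable, "-c", "raise SystemExit(1)"])
print(crashes)  # 3
```

Real process managers add backoff and alerting on top of this loop, but the self-healing principle is the same: detect the failure and restart without human intervention.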
Likewise, we can treat all compute services as disposable and fault-tolerant, while keeping enough redundancy for persistent services such as databases and file storage.
But what happens if an unavoidable disaster creates an outage in the Cloud? How can we ensure business continuity and recover from failure? Let's cover that in the next section.
Preparing for Disasters
A Cloud outage caused by a disaster is rare, since the Cloud is designed to avoid such situations. For example, the availability zones in an AWS region are distant enough from one another to withstand an earthquake, power outage, or landslide. However, since we cannot rule out the risk entirely, we need to prepare our applications for such disasters.
A few strategies for doing that include going for hybrid Cloud environments or deploying applications with redundancy across availability zones or regions. Still, the cost factor plays a dominant role here. Therefore, depending on your reliability requirements, you can go for one of the following strategies.
Backup and Restore
This is the simplest and most cost-effective strategy. The idea is to keep a continuous backup of our data in a safe place to use if a disaster happens. In the Cloud, we typically keep the backup in a different region. Sometimes it is even replicated across multiple regions for additional safety.
The challenge comes when restoring the application to normal: it takes more time. However, if you have your infrastructure as code and automation in place, the entire process can take anywhere from a few minutes to a few hours, depending on the complexity.
Pilot Light
In some cases, you can't afford the downtime of Backup and Restore, but you also want to spend as little as possible on keeping another environment up and running.
For those use cases, you can keep only the database synchronized to another region. Then, if a disaster happens, you only have to spin up the servers. Typically, you can do that in minutes instead of hours. Still, keep in mind that your application will suffer downtime until the new environment takes over operations entirely.
Warm Standby
This approach makes more sense for those who can't afford a complete downtime. The idea here is to set up a new environment in a different region and keep it on standby with minimal resources, scaling it up to a normal environment once it takes over.
However, if the application goes down at peak load, it might take a few seconds to minutes to scale out, during which some of your users might experience downtime.
Multi-Site Active-Active
This setup is the most reliable way of preparing for a disaster. Here, you set up multiple full environments in different regions and use all of them to serve your customers.
The critical component is a proper load-balancing mechanism that distributes traffic across these environments. Besides, we need a data synchronization mechanism across environments so that the application always accesses the most up-to-date data.
Overall, since each site can handle a decent share of the traffic, the others can take over and continue serving customers even if a disaster takes one site down.
I hope you are now fully aware of the different strategies for preparing your application for Cloud outages. The choice ultimately comes down to your recovery time objective (RTO), your recovery point objective (RPO), and the cost you can allocate to manage it.
Besides, it's necessary to have different perspectives such as high availability, fault tolerance, and disaster recovery when designing reliable applications.
And there we have it. I hope you have found this helpful. Thank you for reading.