As the world migrates to DevOps, the primary goal becomes clear: how can we increase velocity while maintaining our system’s stability? By going faster, we also increase the chances of failure. The risk is even higher with practices such as GitOps, where push-based pipelines give way to pull-based ones with the objective of getting changes to our Kubernetes systems into production faster.
To tame the risk of failure as we step on the accelerator, the industry has adopted several strategies over the years. These range from incident management to leveraging observability in our production environments. However, we can also look at how we actually design our systems so that they are more resilient to outages and customer-experienced incidents. The idea should be to embed the principles of DevOps in the core product itself, and not only treat them as an ancillary concept propped up by DevOps-enabling tools and DevOps-practicing individuals.
One such core concept, which has also been a topic of academic research in computing for several years now, is the active/active architecture. The need for such a concept was most famously illustrated by the Netflix Christmas outage of 2012. Therefore, this blog aims to explore the concepts behind such an architecture and how to materialize it in a cloud environment. It also argues that approaching such an architecture with a serverless mindset unlocks further potential.
The Basics of Active/Active Multi-Region
Things fail. Yes, this is established, but it is also a fact that one need not submit to on the journey to building great products. The main cause of failure, when looking at it from a broad perspective, is the need to scale services combined with the increasing velocity of their development. The probability of failure increases when both scale and velocity increase, and this is precisely the phenomenon observed in the field and validated in the academic literature.
When thinking about why one would consider a multi-region system, we see three major benefits from the get-go: reduced latency for customers in remote regions, a reduced possibility of outages, and the ability to meet product requirements such as data residency for legal compliance. The benefit of fewer customer-facing failures is especially enticing, as distributed resources form a redundant network within which traffic can be routed to the most suitable available resources.
The active/active architectural concept, which companies such as Netflix are looking towards, provides measures to mitigate the consequences of the inevitable failure. It must be noted that it does not reduce the probability of failure but rather its impact, and this is the defining point on top of which the practice is built. In other words, the concept starts from the premise that failures are inevitable, and so aims to tackle downtime instead.
The goal here is for the Mean Time To Resolve (MTTR) to be low enough that it has no significant effect on the consumer’s perception of the service’s availability. Hence, resolution should not be measured from the point of view of the impacted service but rather from the consumer’s point of view. Low MTTR values translate into higher availability of the service as perceived by the user.
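To make the link between MTTR and perceived availability concrete, here is a minimal sketch using the common steady-state formula availability = MTBF / (MTBF + MTTR). The failure interval and the MTTR values are purely hypothetical figures chosen for illustration, not measurements from any real system.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical assumption: one failure every 30 days on average.
MTBF_HOURS = 30 * 24

# Hypothetical MTTR values, from a manual recovery down to an automatic reroute.
scenarios = [
    ("4 hours", 4.0),
    ("5 minutes", 5 / 60),
    ("5 seconds", 5 / 3600),
]

for label, mttr_hours in scenarios:
    print(f"MTTR {label}: {availability(MTBF_HOURS, mttr_hours):.5%} available")
```

The failure rate is identical in all three scenarios; only the time to resolve changes, which is exactly the lever that active/active aims to pull.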
In a nutshell, active/active architectures achieve this by being continuously aware of the available service resources and routing user traffic accordingly. When one of the resources or services experiences an incident, the overall architecture should be built such that customer requests get serviced by the other available resources or services. This is, of course, a very high-level description of what the active/active concept entails. When diving deeper into how to actually execute this idea, we come across concepts such as redundancy, replication, statelessness, and eventual consistency. Nevertheless, the problem that the creators of an active/active architecture have to wrestle with can be divided into two points:
- How to route customer traffic to available services or resources while staying aware of disrupted services or resources within the redundant network.
- How to ensure that each available service or resource is consistent with the others, so that customers do not face discrepancies when suddenly being served by another service or resource.
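The two points above can be sketched in a few lines. This is only an illustration under stated assumptions: `Region`, `RegionRouter`, `Versioned`, and `merge` are hypothetical names, the latency figures are made up, and last-writer-wins is just one simple reconciliation strategy among several that an eventually consistent store might use.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Region:
    name: str
    latency_ms: float      # hypothetical latency as seen from the client
    healthy: bool = True   # would be updated by a health checker


class RegionRouter:
    """Point 1: route each request to the closest region that is still healthy."""

    def __init__(self, regions: List[Region]) -> None:
        self._regions = {region.name: region for region in regions}

    def mark(self, name: str, healthy: bool) -> None:
        self._regions[name].healthy = healthy

    def pick(self) -> Region:
        candidates = [r for r in self._regions.values() if r.healthy]
        if not candidates:
            raise RuntimeError("no healthy region available")
        return min(candidates, key=lambda r: r.latency_ms)


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: float  # time of the write, used to order conflicting updates


def merge(local: Dict[str, Versioned], remote: Dict[str, Versioned]) -> Dict[str, Versioned]:
    """Point 2: reconcile two replicas with a last-writer-wins rule."""
    merged = dict(local)
    for key, item in remote.items():
        if key not in merged or item.timestamp > merged[key].timestamp:
            merged[key] = item
    return merged


router = RegionRouter([Region("us-east-1", 20.0), Region("eu-west-1", 90.0)])
print(router.pick().name)          # closest healthy region
router.mark("us-east-1", False)    # simulate an incident in that region
print(router.pick().name)          # traffic moves to the remaining region
```

In a real deployment the routing decision would typically live in a managed DNS or global load-balancing layer, and replication would be handled by the data store itself; the sketch only shows the logic those components have to implement.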
The Serverless Cherry on Top
Now that we are familiar with the basics of the active/active multi-region concept, let us consider what benefits serverless can bring to it. Before we continue, let us specify the three major advantageous characteristics of serverless:
- Server management is abstracted to the vendor.
- Pay-as-you-go model where you only pay for what you use.
- Automatically scalable and highly available.
Actually, before we even get into how these three characteristics can boost the benefits of our active/active multi-region system, we should consider the fact that serverless promotes statelessness.
Because the execution environment is torn down after use, serverless pushes us towards stateless architectures, and this is usually one of the limitations people complain about when adopting it. This aspect of the technology can actually lead to anti-patterns and act as a barrier to adoption. However, even though cloud vendors provide solutions to get around this restriction, when thinking of active/active architectures the restriction actually becomes a benefit.
When considering active/active architectures, it is imperative that each request can be handled independently, regardless of previous user actions. For a given input, the application should return the same response no matter which node serves the request. If requests depend on previous user actions held in a node, the active/active concept breaks down, because the decoupled nodes that are meant to take over when the primary node experiences an outage become less effective.
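A minimal sketch of what "independently treatable" can look like in practice is shown below. The handler keeps nothing in its own memory between calls: everything it needs arrives in the request or is read from a shared store. `InMemoryStore` is a hypothetical stand-in for a replicated data store, and `handle_add_item` and the event fields are illustrative names, not any vendor's API.

```python
from typing import Dict, List


class InMemoryStore:
    """Hypothetical stand-in for a data store that every region can reach."""

    def __init__(self) -> None:
        self._carts: Dict[str, List[str]] = {}

    def get(self, user_id: str) -> List[str]:
        return list(self._carts.get(user_id, []))

    def put(self, user_id: str, items: List[str]) -> None:
        self._carts[user_id] = list(items)


def handle_add_item(event: dict, store: InMemoryStore) -> dict:
    """Stateless handler: no module-level or per-node session state is used."""
    user_id = event["user_id"]
    items = store.get(user_id) + [event["item"]]
    store.put(user_id, items)
    return {"status": 200, "items": items}


store = InMemoryStore()

# The first request is served by one node, the second by another.
# Because the handler holds no state, the outcome is the same either way.
first = handle_add_item({"user_id": "u1", "item": "book"}, store)
second = handle_add_item({"user_id": "u1", "item": "pen"}, store)
print(second["items"])
```

Had the cart been kept in a variable inside one node, the second request would have lost the first item as soon as it landed on a different node.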
Serverless applications force us to think in terms of stateless architectures, which is exactly what an active/active architecture requires. Additionally, serverless applications can scale horizontally, as any user request can be handled by any available computing resource.
Considering this aspect, we can see how the pay-as-you-go characteristic becomes beneficial. When a resource is not being used, you do not need to worry about the costs of running these redundant resources.
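A rough back-of-the-envelope comparison illustrates the point. All prices and traffic figures below are made-up placeholders, not any vendor's actual pricing; the sketch only shows how the two billing models respond to adding redundant regions.

```python
def always_on_cost(regions: int, hourly_rate: float, hours: float) -> float:
    """Provisioned servers are billed per region whether or not they serve traffic."""
    return regions * hourly_rate * hours


def pay_per_use_cost(total_requests: int, price_per_million: float) -> float:
    """Pay-as-you-go compute is billed on requests served, wherever they land."""
    return total_requests / 1_000_000 * price_per_million


# Hypothetical monthly figures, for illustration only.
HOURS_PER_MONTH = 730
HOURLY_RATE = 0.10           # assumed price of one always-on instance
PRICE_PER_MILLION = 0.20     # assumed price per million requests
TOTAL_REQUESTS = 50_000_000  # total traffic, independent of the region count

for regions in (1, 2, 3):
    provisioned = always_on_cost(regions, HOURLY_RATE, HOURS_PER_MONTH)
    on_demand = pay_per_use_cost(TOTAL_REQUESTS, PRICE_PER_MILLION)
    print(f"{regions} region(s): always-on {provisioned:.2f}, pay-per-use {on_demand:.2f}")
```

Under these assumptions the always-on bill grows with every region added, while the pay-per-use compute bill follows the total traffic and stays flat. Data replication and transfer costs are deliberately left out of this sketch.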
Similarly, when it comes to scaling, serverless is prized for its auto-scalability. If a resource experiences a sudden spike in requests because several of the other nodes have failed, a serverless compute resource will, in theory, scale to meet this spike in traffic. It must be noted that this benefit is only truly realized in an end-to-end serverless application, including serverless data stores such as DynamoDB. After all, what is the point of having serverless compute services or serverless routing components if your datastore itself cannot scale accordingly?
Finally, having appreciated the pay-as-you-go model and auto-scalability, we soon realize that all of this is attained with minimal operation of the underlying infrastructure, thanks to the fully managed nature of serverless. This aspect adheres to the core DevOps strategy of reducing Ops work and promoting a shift left. With fewer operational responsibilities, teams can focus primarily on the business code and still have the luxury of attending to other areas of the development pipeline.
As a result, it becomes apparent how serverless can greatly benefit us when developing an active/active multi-region architecture. With its three characteristics and intrinsic property of statelessness, serverless applications boost the DevOps goal of adopting the active/active concept.
Conclusion
As we push the boundaries of software development further, fueled by inexorable competition, we must rethink our development methods and practices from start to finish. Even more importantly, we need to rethink how we build the core application itself. The need to go faster while maintaining the stability of the product, under the umbrella of DevOps, calls for architectures such as active/active. Fortunately, the rise of serverless technologies has lit a path forward, lowering the barriers to implementing such architectures. When anything goes wrong with your application, Thundra APM offers ways to figure out the reasons. You can get started right away by providing access to your logs and then instrumenting the application for further insights.