The date was 24th December 2012, Christmas Eve. The world’s largest video streaming service, Netflix, experienced one of its worst incidents in company history: an outage of video playback on TV devices for customers in Canada, the United States, and the LATAM region. Fortunately, the enduring efforts of responders at Netflix, along with those at AWS, where disruptions to the Amazon Elastic Load Balancer service had caused the incident, managed to restore services just in time for Christmas. Thinking back on the events that unfolded at Netflix and AWS that day, they are comparable to all those save-Christmas movies we love to watch around that time of year.
This idea of incident management stems from the ubiquitous fact that incidents will happen, a fact best immortalized by Amazon VP and CTO Werner Vogels when he said, “Everything fails all the time”. It is, therefore, understood that things will break, but the question that persists is: can we do anything to mitigate the impact of these inevitable incidents? The answer is, of course, yes.
The incident that Netflix incurred was not a sudden wake-up call to build more resilient systems. In fact, Netflix was already aware of the risks to its system and was experimenting with a more resilient architecture termed Isthmus, albeit for a different purpose. The incident did, however, highlight the importance of such work and brought forward the need for an active/active multi-region architecture.
The concept of active/active architectures is not a new one and can in fact be traced back to the 1970s, when digital database systems were newly being introduced to the public sphere. Now, as cloud vendors roll out new services, one of the factors they are abstracting away for users is the set-up of such a system. After all, one of the major promises of moving to the cloud is the abstraction of these types of complexities, along with the promise of reliability. Today, an effective active/active multi-region architecture can be built on almost any cloud vendor out there.
Considering the ability and maturity of cloud services in the market today, this article will not act as a tutorial on how to build the intended architecture. There are already various workshop guides and talks on the matter. In fact, one of the champions of resilient and highly available cloud architectures, Adrian Hornsby, Principal Technical Evangelist at AWS, has a great series of blogs guiding the reader through active/active multi-region architectures on AWS.
However, what is missing, or at least what has been lost, is the theory and a clear understanding of the motive behind implementing such an architecture. With cloud services abstracting most of the complexity away, it is easy to dismiss the ‘how’ and ‘why’ of such architectures as magic under the hood. Therefore, this article aims to present the distributed systems knowledge behind active/active multi-region systems and showcase how a basic understanding of the concept can empower us to build on any cloud vendor, taking into consideration that vendor’s services.
Back to the Basics
Things fail. Yes, this is established, but this is also a fact that one need not submit to in the journey to building great products. The main cause of this failure, when looking at it from a broad perspective, is the need for scaling services and the increasing velocity of their development. The probability of failure increases when both scale and velocity increase, and this is the phenomenon observed in the field and validated through academic literature.
The active/active architectural concept, which, as mentioned, Netflix and others are looking towards, provides measures to mitigate the consequences of inevitable failure. It must be noted that it does not reduce the probability of failure but rather its impact, and this in itself is the defining point of the concept on top of which the practice is built. Therefore, the concept is built on the premise that failures are inevitable, and so aims to tackle downtime instead.
The goal here is that the Mean Time To Resolve (MTTR) is low enough that it is insignificant in how it affects the consumer’s perception of the service’s availability. Hence, resolution should not be measured from the point of view of the impacted service but rather from the consumer’s point of view. In short: low MTTR values make for increased availability of the service as perceived by the user.
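To make the relationship concrete, here is a minimal sketch using the standard availability formula, availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures. The failure-rate and recovery-time figures below are made up purely for illustration:

```python
# Illustrative: how MTTR feeds into the availability a consumer perceives.
# Availability = MTBF / (MTBF + MTTR); all figures here are hypothetical.

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the service is up, from the consumer's point of view."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A service failing roughly once a month (~720 h) and recovering in 4 hours:
slow_recovery = availability(720, 4)
# The same failure rate, but with a 30-second failover to a healthy node:
fast_failover = availability(720, 30 / 3600)

print(f"{slow_recovery:.5f} vs {fast_failover:.5f}")
```

The failure rate is identical in both cases; only the recovery time changes, yet the perceived availability jumps from roughly two nines to nearly five. This is exactly the lever active/active pulls.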
The way in which active/active architectures achieve this, in a nutshell, is by continuously being aware of the available service resources and routing user traffic accordingly. Therefore, when one of the resources or services experiences an incident, the overall architecture should be built such that customer requests get serviced by other available resources or services. This is, of course, a very high-level description of what the active/active concept entails. When diving deeper into how to actually execute this idea, we come across concepts such as redundancy, replication, statelessness, and eventual consistency. Nevertheless, the problem that the designers of an active/active architecture have to wrestle with can be divided into two points:
- How to route customer traffic to available services or resources while being aware of disrupted services or resources.
- How to ensure that each available service or resource is consistent with the others so that customers do not face discrepancies when suddenly being serviced by another service or resource.
From here on out, these various services or resources will be termed nodes. That is because, theoretically, everything can fail. Not only compute services but also resources such as datastores, event buses, routing services, and the like can fail, and each should be replaceable with either a copy or another like service, lightening the blow of the failure. Hence, for simplicity, we shall refer to all of these components as nodes.
Redundancy and Request Routing
Before addressing the traffic routing problem, let us revisit the idea of redundancy. In the world of distributed systems, redundancy can be defined as the existence of services or resources (nodes, as we shall term them) that are not strictly necessary for achieving the business logic and functioning of the system. This may seem counterintuitive and against intrinsic software principles such as DRY, but it is critical to the idea of active/active systems. With these redundant nodes, the overall system becomes more resilient to outages, from the customer perspective: whenever one node is down, another is brought in for service. So yes, the overall system is littered with redundant resources, but what is achieved is the much sought-after resiliency of the system, with increased fault tolerance.
As a result, the theory is that customer traffic comes into a redundant network of nodes, where each node has access to a datastore. Any user can theoretically be connected to any processing node. Of course, in the practical world, there are issues such as GDPR compliance and application localization, so clusters of processing nodes for cloud applications are found in specific regions. Nevertheless, the concept is the same: two or more processing nodes are available, where many of these nodes are redundant, acting as standbys for when the primary node experiences disruptions.
Now the question is how to be aware of node outages and how to route traffic. There are several methods to solve the issue, where some of the more popular methods work by rerouting the request automatically when a disrupted node has been hit. The fundamental question of how to reroute to another available node is an intriguing area of research in computer science.
Both industry-backed R&D and academic institutions are continuously exploring more advanced and optimal rerouting algorithms. The field is actually becoming even more interesting as machine learning has been added to the mix in recent years. For example, the telecommunications giant, Ericsson, has been exploring graph machine learning in a distributed network. This innovation is something that also has usage in routing traffic to the next healthy node optimally when a disrupted node is hit.
In the industry, cloud vendors such as AWS offer their own set of services to perform the routing, such as Amazon Route 53. Netflix employs these services but has also built its own ancillary technology, one example being Zuul, which, even though it is primarily an edge service, Netflix has extended with further capabilities to aid with how it routes traffic in its active/active architecture.
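Stripped of vendor specifics, the core mechanism these routing services implement boils down to a priority-ordered failover over health-checked nodes. The sketch below is purely illustrative (the node names and structure are invented for the example, not any vendor’s API):

```python
# A minimal sketch of health-aware failover routing. Region names and
# the Node structure are illustrative only; real routers (e.g. DNS-based
# ones like Route 53) probe nodes with periodic health checks.

from dataclasses import dataclass

@dataclass
class Node:
    name: str
    healthy: bool = True  # would be updated by a health-check probe

def route(nodes: list[Node]) -> Node:
    """Return the first healthy node in priority order."""
    for node in nodes:
        if node.healthy:
            return node
    raise RuntimeError("no healthy nodes available")

nodes = [Node("us-east-1"), Node("us-west-2"), Node("eu-west-1")]
assert route(nodes).name == "us-east-1"   # primary serves traffic

nodes[0].healthy = False                  # primary suffers an outage
assert route(nodes).name == "us-west-2"   # traffic fails over automatically
```

Real implementations add latency-based or weighted selection, probe intervals, and flap damping, but the failover principle is the same.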
Overall, the idea of routing traffic is well understood, and advancements in how best to route traffic continue to be made. However, this is only half of the work. The second part is to ensure that all nodes in this redundant network are practically the same. After all, from the customer’s perspective they must be, as the experience of using the platform must go unhindered when the primary serving node fails. Hence the importance of stateless compute services and data replication. Both of these notions must be enforced effectively to ensure synchronization across the network.
Synchronizing in Active/Active
As mentioned, when achieving synchrony across the active/active architecture, we must be mindful of stateless nodes and data replication across the available data stores. The former is a much easier notion to tackle, while the latter runs into a well-defined conceptual barrier.
Stateless can be defined as a service’s ability to handle an incoming request without being aware of previous requests. It is evident why this is a crucial building block for an active/active architecture: any node, apart from the primary servicing node, can begin receiving requests upon the failure of the primary node. The idea of statelessness can be applied to both the datastores and the compute services.
When building stateless compute services, serverless offerings can be leveraged. Serverless does not always mean stateless, but it does promote the notion: serverless compute services such as AWS Lambda do not retain state when torn down, and thus, when utilizing them, we must always be mindful of their stateless nature. This idea is not a new one, and there have been many explorations of how serverless can be used for stateless architectures.
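What statelessness looks like in practice is pushing anything that must survive between requests out of the compute node and into a shared store. The sketch below uses an in-memory dict purely as a stand-in for an external, replicated datastore; the session and handler names are invented for illustration:

```python
# Sketch: keeping the compute node stateless by externalizing session
# state. SESSION_STORE is a placeholder for an external replicated
# datastore; in a real deployment this would be a network call.

SESSION_STORE: dict[str, dict] = {}  # stand-in for an external store

def handle_request(session_id: str, item: str) -> list[str]:
    """Any node can run this handler: all state lives in the shared store."""
    session = SESSION_STORE.setdefault(session_id, {"cart": []})
    session["cart"].append(item)
    return session["cart"]

# Node A services the first request...
assert handle_request("user-42", "toaster") == ["toaster"]
# ...and node B, running the same stateless code, services the next one
# without ever having seen the first request.
assert handle_request("user-42", "kettle") == ["toaster", "kettle"]
```

Because the handler itself carries no memory, a failed-over node picks up mid-session without the customer noticing, which is exactly the property active/active relies on.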
Now, when thinking of statelessness in data stores, this is something that cannot always be achieved. Statelessness can only go so far; eventually, when considering the actual data within the system, there needs to be some form of synchronization among the various data stores in the active/active architecture. This is where the notion of data replication kicks in, but there is an inherent barrier we need to address: the CAP theorem.
In an ideal world, what we would like to see in our architecture is high consistency, partition tolerance, and high availability. Consistency means that all data stores hold the same exact data and hence can service all nodes in the system. Availability entails that every client gets a response regardless of the state of each individual node and the data store servicing it. Partition tolerance means the system continues to process requests even when a network failure partitions it, leaving parts of the system unable to communicate with one another.
As per the CAP theorem, only two of the three guarantees can be satisfied at any given time. The theorem was first put forward by Eric Brewer, then a systems professor at the University of California, Berkeley, in the late 90s and 2000. It was later proven in a research paper published in ACM SIGACT News, co-authored by Seth Gilbert and Nancy Lynch, in 2002.
As a result, given the inability to meet the ideal state, the models of eventual consistency and strict consistency arise, with active/active distributed architectures leaning towards eventual consistency. An explanation of eventual consistency versus strict consistency is best captured in this Stack Overflow answer by Chris Shain.
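A toy simulation makes the trade-off tangible: writes are acknowledged by one replica immediately, while a replication log propagates them to the other replica later. Reads from the lagging replica are briefly stale, then converge. The structure here is invented for illustration, not a real replication protocol:

```python
# Toy illustration of eventual consistency: writes land on one replica
# and are propagated asynchronously, so a read from another replica may
# briefly return stale data before the replicas converge.

replica_a: dict[str, str] = {}
replica_b: dict[str, str] = {}
replication_log: list[tuple[str, str]] = []

def write(key: str, value: str) -> None:
    replica_a[key] = value                 # acknowledged immediately
    replication_log.append((key, value))   # shipped to peers later

def replicate() -> None:
    """Drain the log, bringing the lagging replica up to date."""
    while replication_log:
        key, value = replication_log.pop(0)
        replica_b[key] = value

write("profile:42", "v2")
assert replica_a.get("profile:42") == "v2"   # fresh read on the write node
assert replica_b.get("profile:42") is None   # stale read elsewhere, for now

replicate()                                  # async replication catches up
assert replica_b.get("profile:42") == "v2"   # replicas have converged
```

The window between the write and the call to `replicate()` is the inconsistency that eventual consistency tolerates; in exchange, the write node never blocks on its peers, preserving availability and partition tolerance.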
The reason active/active architectures lean towards eventual consistency is the overhead of replicating data across the data stores present without disrupting the availability and partition tolerance of the system. Both availability and partition tolerance take higher priority when considering the motivation behind active/active. Alas, as tech has demonstrated before, there is no such thing as a silver bullet.
As we further push the boundaries of software development, fueled by inexorable competition, we must rethink our development methods and practices from start to finish. The need to go faster while maintaining the stability of the product, under the umbrella of DevOps, warrants architectures such as active/active, despite the difficulties that arise in their implementation. Fortunately, cloud vendors have taken up the task of abstracting away the strenuous parts behind their plug-and-play model services. However, even though these responsibilities have been delegated to the cloud vendors, as Netflix and others have learned, knowing the theory behind the concept can go a long way.