Serverless and Chaos Engineering: Is It Possible?

Feb 4, 2020

Chaos Engineering Through a Serverless Lens

The continual evolution of infrastructure technology has kept pace with the programming languages that infrastructure is built to support. We’ve moved from custom Bash scripts written to interface with physical servers to auto-scaling hardware tied to continuous integration and deployment systems. This expansion of capability and automation has given application developers a powerful toolkit to leverage when building applications. Just ten years ago, shared smoke-testing and integration servers were the norm in most web development organizations. Today, developers have containerized deployment systems like Docker that allow for near-perfect replication of production environments.

While containerization can give your team valuable confidence in releases, it can also expose your organization to potential pitfalls when the system is subject to extreme load. In serverless applications, this problem is exacerbated by the fact that the underlying execution environment is destroyed as soon as it is no longer needed. In this type of environment, it is critical to exercise your production environment against as many failure scenarios as possible to enhance the reliability of your software.

In this article, we’ll take a look at Chaos Engineering and how it can help serverless application developers foster a deeper sense of trust in the behavior of their application, especially around the edge cases of their reliability requirements.

What Is Chaos Engineering?

Modern development and deployment practices allow software engineers to develop and deliver functionality in an environment that closely mirrors production from a hardware and software perspective. These changes are frequently made in the context of a complex system architecture that can include dozens of services working together to provide a singular user experience. Finding out how those systems intermix and work together is often a process of seeing what problems arise when a service fails in production, then fixing the cause of the failure in the crashing service while addressing the behavior of the dependent services that failed as a result.

Many failures in microservice architectures arise when the system is placed under load. This can consist of an overload of concurrent requests, repeated calls to a process that relies on processor-intensive behavior, or simply the normal behavior of the application degrading by design as the system scales. Addressing these types of issues in a development environment is challenging, as companies at any significant level of scale will likely see processing and bandwidth needs in production that are significantly higher than the capabilities of a standard development laptop. 

When trying to improve their system’s reliability and uptime metrics, Netflix quickly discovered that achieving 100% availability for the user meant being able to recover from less-than-100% reliability in the application architecture. Thus, Chaos Engineering was born.

A Brief History of Chaos

Chaos Engineering is a concept that originated at Netflix in 2011. While working on improving the reliability of the customer-facing Netflix application, engineers realized that the key aspect of resilient systems was not how they behaved in ideal development conditions, but rather how they behaved when conditions were far from ideal. This approach to maintenance made the base assumption that failure is a facet of normal system operation, rather than an exceptional situation. 

By focusing on how to inject failure into production systems in a resilient and well-monitored way, Netflix engineers were able to introduce controlled chaos into the application’s production environment. This gave the development team valuable data about where the edge cases were in their system as well as clues about how to address them.

Benefits of Chaos Engineering

Chaos Engineering is intended to help you address situations in which production behavior is erratic or undependable. This could be due to a misfiring server, a dedicated attack against a content delivery network, a third-party outage, or any combination of factors that can lead to a failure in the application’s availability. By focusing on these cases during the normal development process, instead of waiting for failure, developers can build recovery logic and failure handling into the application for when things go wrong. This enhances the observability of the failure while simultaneously addressing any degradation in user experience. The end result is stronger software that can survive nearly catastrophic infrastructure events, whether it’s a single service not responding or an entire AWS Availability Zone going down.

Common Terminology and Tools

As Chaos Engineering has grown from its roots as an internal Netflix tool into a common and established best practice for ensuring resilient delivery of functionality, the specialized language surrounding chaos has expanded to cover changes in the standard approach. Below are a few common Chaos Engineering terms to know and keep in mind as you perform your own research:

  • Chaos Engineering is the process of examining and enhancing the resilience of a software application by injecting deliberate failure into the application’s production environment.
  • Chaos is any potential source of failure, from an engineering perspective, that cannot be anticipated and addressed by standard development efforts.
  • Chaos Monkey was the initial Chaos Engineering tool developed by the Netflix team when trying to solve the challenge of achieving continually reliable content delivery. The term derives from “...the idea of unleashing a wild monkey with a weapon in your data center ... to randomly shoot down instances and chew through cables — all the while we continue serving our customers without interruption.”
  • Fallback functions are functions intended to serve as a replacement for a primary control flow mechanism in an application that is experiencing a failure (a short sketch follows this list).
  • Logging is the practice of recording characteristics of an application’s normal operation in a secondary location.
  • Traceability (tracing), similar to logging, is focused on recording the path a request takes through your application’s components so that a failure can be pinned to a specific service or function.
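
To make the fallback idea from the list above concrete, here is a minimal Python sketch; the helper names (`fetch_recommendations`, `cached_recommendations`) are hypothetical and used only for illustration:

```python
# Minimal sketch of a fallback function: if the primary call fails,
# return a degraded-but-useful response instead of surfacing the error.
# fetch_recommendations and cached_recommendations are hypothetical helpers.

def fetch_recommendations(user_id):
    # Primary control flow: call the recommendation service (may fail).
    raise TimeoutError("recommendation service did not respond")

def cached_recommendations(user_id):
    # Fallback: serve a stale, pre-computed list from a local cache.
    return ["popular-title-1", "popular-title-2"]

def get_recommendations(user_id):
    try:
        return fetch_recommendations(user_id)
    except Exception:
        # Treat failure as normal operation: degrade gracefully.
        return cached_recommendations(user_id)
```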

Can Chaos Engineering Apply to Serverless Applications?

Chaos Engineering was originally developed as a tool for testing microservice architectures. One common characteristic of these architectures was dedicating hardware to an individual collection of services. Teams often chose between virtual machines running in a shared execution environment and stand-alone servers running microservices, each with a specific task in the overall architecture. Introducing failures in this environment was a fairly straightforward process of breaking the connection between these services in some way and observing the behavior. While this made recovering from errors relatively quick, true resilience was directly tied to the money you were able to commit to infrastructure.

By building on prior containerization efforts to remove the question of hardware availability from your infrastructure, serverless technology removes the need to apply Chaos Engineering methodologies to your application’s hardware directly. However, this only heightens the need to test non-hardware chaos factors against your production environment. While serverless systems are designed to be highly available by nature, the other failure vectors present in a traditional web application remain in a serverless context: latency injection, traffic spikes, third-party system failures, and other kinds of non-hardware chaos can still throw your serverless application into confusion.
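
As a rough illustration of one such non-hardware factor, the sketch below wraps a Lambda-style handler with artificial latency; the `CHAOS_LATENCY_MS` environment variable and the handler itself are hypothetical, shown only to make the idea tangible:

```python
import os
import random
import time

# Hypothetical latency-injection wrapper: when CHAOS_LATENCY_MS is set,
# a random delay up to that value is added before the real handler runs,
# emulating a slow downstream dependency or a cold-start spike.
def inject_latency(handler):
    def wrapper(event, context):
        max_delay_ms = int(os.environ.get("CHAOS_LATENCY_MS", "0"))
        if max_delay_ms > 0:
            time.sleep(random.uniform(0, max_delay_ms) / 1000.0)
        return handler(event, context)
    return wrapper

@inject_latency
def handler(event, context):
    # The real business logic would live here.
    return {"statusCode": 200, "body": "ok"}
```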

Best Practices

There are three best practices that together cover most types of chaos failure likely to be seen in serverless systems. The first is having a fallback strategy for when your serverless functions fail to respond: simply being able to swap out a buggy function for a previously deployed version with known good behavior solves many of the deployment concerns serverless developers are likely to encounter. Coupled with this is the second best practice: having a retry strategy for when things go wrong. Finally, incorporating parameter validation and failure recovery into your serverless function architecture lets you quickly respond to, and address, changing APIs that can impact your system as the underlying code is modified.
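
A minimal sketch of how these three practices might fit together in a single Python Lambda handler is shown below; `call_downstream` is a hypothetical dependency, and the retry counts and delays are arbitrary:

```python
import time

# Sketch combining the three practices: validate parameters up front,
# retry a flaky call with backoff, and fall back to a safe default if
# every attempt fails. call_downstream is a hypothetical dependency.

def call_downstream(order_id):
    raise ConnectionError("downstream service unavailable")

def handler(event, context):
    # 1. Parameter validation: reject malformed input before doing work.
    order_id = event.get("order_id")
    if not isinstance(order_id, str) or not order_id:
        return {"statusCode": 400, "body": "missing or invalid order_id"}

    # 2. Retry strategy: a few attempts with simple exponential backoff.
    for attempt in range(3):
        try:
            return {"statusCode": 200, "body": call_downstream(order_id)}
        except Exception:
            time.sleep(2 ** attempt * 0.1)

    # 3. Fallback: return a degraded response rather than an unhandled error.
    return {"statusCode": 503, "body": "order status temporarily unavailable"}
```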

Tools for Serverless Chaos Engineering

The serverless field is still young, and there are plenty of opportunities for enterprising open-source developers looking to address Chaos Engineering from a serverless perspective. That said, there are two tools that can already help you gauge and improve the behavior of your serverless software in a chaotic environment: AWS Lambda aliases and Thundra.

AWS Lambda Aliases

AWS Lambda function aliases are named pointers to published versions of the serverless functions in your AWS Lambda application. As you deploy new production code, each deployment can be published as a numbered version and referenced through an alias. By referring to these aliases as you work with your application, you can easily fall back to a previous good release when a new code release introduces chaos into your production environment. This also gives you the tools you need to quickly emulate “misbehaving” Lambda functions by swapping between the known good code in your production environment and code with specific failure characteristics, allowing you to test garbled responses, timeouts, failures, and any other scenario you can think of.
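
As a rough sketch of how such a rollback might look with the AWS SDK for Python (boto3), the snippet below points a production alias back at a previously published, known-good version; the function name, alias name, and version number are placeholders:

```python
import boto3

# Sketch: roll a production alias back to a known-good published version.
# "checkout-service", "live", and "42" are placeholder values.
lambda_client = boto3.client("lambda")

response = lambda_client.update_alias(
    FunctionName="checkout-service",  # the Lambda function to adjust
    Name="live",                      # the alias that production traffic uses
    FunctionVersion="42",             # a previously published, known-good version
)
print("live alias now points at version", response["FunctionVersion"])
```

The same call can be run in the other direction, pointing the alias at a deliberately faulty version for the duration of a chaos experiment and then restoring it afterwards.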

Thundra

Thundra is an application observability and security platform dedicated to improving serverless observability, and it offers a number of powerful tools that can be used to inject chaos into your serverless application. Through Thundra’s capability to emulate hardware failures (which, despite the “serverless” name, can still occur), your developers will be able to respond to these types of issues as they happen. Additional tools from the company allow you to simulate request issues, from garbled responses to invalid parameters. On top of these (and other) capabilities, Thundra provides a transactional view of failures that ties tracing, logging, and actionable alerts into a single interface, providing you with unmatched detail when things inevitably go wrong in production.

Conclusion

To achieve true resilience in functionality delivery, you need to be able to plan for the unknown. Chaos Engineering helps you build this resilience by focusing on the potential errors that occur rarely in the average software application. In this way, you treat failure as a foregone conclusion and change the question of “What happens if it fails?” into the more useful “What do we do when things go wrong?” 

While Serverless development introduces some efficiency gains when applying Chaos Engineering principles, implementing standard failure modes, like increased latency or invalid parameter values, requires additional effort. Through tools such as Thundra and AWS Lambda Aliases, you can introduce Chaos Engineering into your serverless production environment, allowing you to build up the resilience needed in your application in order to bring your users the value they deserve.