
Chaos Engineering with Serverless

May 18, 2020

 


What if My Third-Party APIs Respond Slower?

The distributed nature of serverless technology is both a strength and a weakness. On-demand hardware’s natural resilience and responsiveness make it a powerful tool for rapidly driving your application’s development and growth, solving many early headaches in scaling and security.

However, to implement this in AWS Lambda, you must make extensive use of remote calls to external servers via HTTP, triggers from event-producing AWS services, and other external communications to maintain your application state in the absence of a core server. As a result of this constant remote chatter, your serverless function's calls are less certain to receive a well-formed response than equivalent calls in a traditional architecture.

Chaos engineering helps us counter this by testing breakdowns in communication before they happen. In this article, we'll explore testing third-party reliability with chaos engineering. After covering basic techniques, we'll walk through real-world examples that demonstrate how to apply this technique, both on your own and via a third-party tool like Thundra. Once the chaos settles, you'll be ready to handle the most common communication failures in your serverless application.

Defining the Problem

Serverless applications, by their very nature, will always need to plan around lags in requests. The application's core functions are distributed to a sometimes extreme degree. What would be a standard function call in a more traditional application now carries the additional complexity of remote procedure calls, often implemented over lossy protocols like HTTP. Coupled with the spin-up times of the functions themselves, these delays can result in sudden errors in your serverless application that are challenging to track down.

Transaction monitoring is a key component of identifying and resolving these failures. Without a cohesive record of the steps that led to the failure, the only direct evidence of the failure you will have is an error in a single function call. While this is useful, finding out whether the issue is systemic will require, at a minimum, deep analysis of application logs and other inexact indicators. Furthermore, without cohesive tracking of data changes throughout your call stack, your application will accumulate data integrity issues as things fail. This leads to phantom failures far in the future, often disconnected from their root cause by months.

While many of these problems can be solved with a bit of design and focus work on your application's architecture, third parties will continue to be a pain point from all of these perspectives. Setting aside logical errors such as malformed requests, internet communication is simply not 100% reliable. Request latency can be the source of numerous issues. Some are as small as increased load times on every page. Others are as complex as an HTTP request to a serverless function that times out due to a third-party failure and results in loss of data.

Latency Monkey to the Rescue

These delays, in what are essentially function calls in a traditional application, complicate an already complex problem. Instances of response-time lag that originate from a set of third-party API calls add up, reducing the quality of the user experience as the total delay climbs. 

Beyond simple loading-time issues, sudden failures due to a server hiccup or a dropped packet result in unpredictable data-integrity issues. As we approach our system from a chaos engineering perspective, we will need a tool to emulate these failures in our production environment in order to build defenses against them.

Enter the Latency Monkey. Chaos engineering, as originally implemented, relied on a series of “monkeys” focused on introducing different kinds of chaos into the system. The Latency Monkey, as the name implies, introduces latency issues into an application architecture at random, exposing any issues in your system based on the specific timing of the responses you receive. Developers who create these failures deliberately, in a controlled setting, will be able to more effectively analyze and respond to the failure modes that latency introduces.

Implementing Latency Monkey in Serverless

Implementing a Latency Monkey in your system is, at its core, a straightforward addition of an artificial delay to your running code. The very center of any Latency Monkey will be some variation on a sleep command. The key to getting an effective measurement from your system's failure state is to implement "controllable chaos." This means that while you can turn the chaotic latency activity on or off in your application, you cannot control the specific amount—or the specific location—of latency added to the system. The latency must be introduced at random: anything too predictable will skew your results and leave some of the application's possible failure modes unexplored.
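
As a minimal sketch of this idea (the environment variable name and helper function are hypothetical, not part of any framework), a toggle can gate the chaos while the delay itself stays random:

```javascript
// Sketch: "controllable chaos" -- operators can switch the Latency Monkey
// on or off via an environment variable (name hypothetical), but when it
// is on, the amount of latency injected per invocation is random.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function maybeInjectLatency() {
    if (process.env.CHAOS_LATENCY_ENABLED !== "true") {
        return 0; // chaos switched off: no delay added
    }
    const delayMs = Math.floor(Math.random() * 2000); // 0-1999 ms, unpredictable
    await sleep(delayMs);
    return delayMs;
}

module.exports = { maybeInjectLatency };
```

Calling `maybeInjectLatency()` at the top of a handler gives you the on/off switch without giving you control over how much latency any individual invocation sees.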

Below we’ll look at three ways to implement a randomized Latency Monkey in our system. The first uses Lambda function aliases, switching between two versions of a function or two default parameters for the same function. The second uses flags in the call itself to trigger the latency, such as a set of data that matches a specific test pattern or a simple flag that says “cause trouble.” In the final example we’ll leverage the power of Thundra to implement chaotic latency with a minimum of effort.

In recognition of the innate value of all forms of life, we’ll name our Latency Monkey Larry. Larry likes bananas, prefers to climb to the top of the office building in the morning to catch the “golden hour” for his filming business, and in his spare time enjoys injecting latency into serverless applications. Let’s let Larry loose!

Using Lambda Function Aliases

Let’s begin with a very simple function, something we can use to illustrate the chaos that Larry brings. Below is a simple function, about as basic as we can get:

/*
 * Builds a basic response to be sent back to the caller
 */
const buildResponse = () => {
    return {
        'statusCode': 200,
        'body': JSON.stringify({
            message: 'Normal operation succeeded!'
        })
    };
};

Figure 1: Basic Lambda response builder

/**
 * Basic function to return a standard response without chaos
 */
exports.lambdaHandler = async (event, context) => {
    let response;
    try {
        response = buildResponse();
    } catch (err) {
        console.log(err);
        return err;
    }

    return response;
};

Figure 2: Basic Lambda handler

While this function doesn't do much that is useful, it will give us a solid view of the pattern, particularly when integrated into a larger set of function calls with multiple interdependencies. There is no truly general set of guidelines we can provide for the best places to introduce this latency. This will be highly dependent on how control flows in your serverless application and how your services interact. We'll merely observe that our goal is to show you the path to Larry, but only you can open the door and receive his chaotic wisdom.

To introduce latency-based chaos into our application, we’ll first create an alternative version of the function that implements the latency we desire. This can be represented in Node as follows:

function sleep(ms) {
    return new Promise((resolve) => {
        setTimeout(resolve, ms);
    });
}   

Figure 3: A shared sleep function

exports.chaoticLambdaHandler = async (event, context) => {
    let response;
    try {
        // Larry strikes!
        await sleep(1000);
        response = buildResponse();
        // The body is a JSON string, so parse it before adding Larry's mark
        const body = JSON.parse(response.body);
        body.larry = "Meets with Larry's Approval";
        response.body = JSON.stringify(body);
        console.log("Meets with Larry's Approval");
    } catch (err) {
        console.log(err);
        return err;
    }

    return response;
};

Figure 4: Larry gets his hands on our Node.js handler

Below is a sample AWS SAM template.yaml file that implements these two functions:

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: >
  Exploring Chaos Engineering
 
  AWS SAM project for exploring Latency Monkeys
Globals:
  Function:
    Timeout: 3
 
Resources:
  ThundraTestFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: thundra-test/
      Handler: app.lambdaHandler
      Runtime: nodejs12.x
      Events:
        ThundraTest:
          Type: Api
          Properties:
            Path: /thundra
            Method: get
  ChaoticThundraTestFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: thundra-test/
      Handler: app.chaoticLambdaHandler
      Runtime: nodejs12.x
      Events:
        ThundraTest:
          Type: Api
          Properties:
            Path: /thundra-chaotic
            Method: get
 
Outputs:
  ThundraTestApi:
    Description: "API Gateway endpoint URL for Prod stage for Thundra Test function"
    Value: !Sub "https://${ServerlessRestApi}.execute-api.${AWS::Region}.amazonaws.com/Prod/thundra/"
  ThundraTestFunction:
    Description: "Thundra Test Lambda Function ARN"
    Value: !GetAtt ThundraTestFunction.Arn
  ThundraTestFunctionIamRole:
    Description: "Implicit IAM Role created for Thundra Test function"
    Value: !GetAtt ThundraTestFunctionRole.Arn
  ChaoticThundraTestFunction:
    Description: "Larry Presents the Thundra Test Lambda Function ARN"
    Value: !GetAtt ChaoticThundraTestFunction.Arn
  ChaoticThundraTestFunctionIamRole:
    Description: "Implicit IAM Role created for Larry's Thundra Test function"
    Value: !GetAtt ChaoticThundraTestFunctionRole.Arn

Figure 5: AWS SAM template for Larry

The last step here is to put both functions behind a Lambda alias and have it switch between them some percentage of the time using Lambda traffic shifting. It's important to note that, per the AWS documentation, traffic shifting is implemented between two different versions of the same function.
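
Assuming both handlers have been published as versions of a single function, the weighted routing can be sketched as follows; the alias name and version numbers below are hypothetical:

```javascript
// Sketch: route 10% of the "live" alias's traffic to a chaotic version
// using Lambda traffic shifting. Alias name and version numbers are
// hypothetical placeholders.
const routingParams = {
    FunctionName: "ThundraTestFunction",
    Name: "live",             // the alias your clients invoke
    FunctionVersion: "1",     // the well-behaved version: 90% of traffic
    RoutingConfig: {
        // Version "2" (the chaotic handler) receives 10% of invocations
        AdditionalVersionWeights: { "2": 0.1 }
    }
};

// Applying it with the AWS SDK for JavaScript (v2) would look like:
// const lambda = new (require("aws-sdk").Lambda)();
// await lambda.updateAlias(routingParams).promise();

module.exports = { routingParams };
```

Adjusting the weight lets you dial Larry's share of production traffic up or down without redeploying either version.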

This approach gives us a great mechanism for introducing latency into our application. With simple tweaks to our chaos function we can adjust the latency to meet our needs, and the traffic shifting handles all the sampling needed. The biggest downside of this approach is that it is, essentially, completely random, giving you no effective influence over the control and sample groups in your experiment. To build that functionality, we’ll need to improve our chaos function a bit.


Using Debug Flags for Latency Generation

With simple tweaks to the code presented in the figures above, we can move our traffic shifting and sampling algorithms into the Lambda functions themselves. This approach might seem more complex, but it gives you greater flexibility to refine the target audience for your test. Whereas the first approach gives Larry a paint roller and says, “Have fun!” this approach takes the time to teach him brush strokes, complementary colors, and an artful vision worthy of the finest galleries. This allows Larry to focus the chaos he introduces more specifically, targeting subsections of the user base that would benefit from more chaos in their lives.

While implementing debug flag-driven latency often just means including some extra code in your Lambda function guarded by a Boolean flag, with a bit of extra effort you can refactor the test sampling code into a reusable software package, or leverage third-party libraries such as the chaos engineering tools from Netflix.

Leveraging Thundra to Introduce Latency

While implementing your own chaotic latency tool can provide you with great flexibility, it adds additional complexity to your code that must be maintained as rigorously as any customer-facing feature. With compressed development schedules, third-party tools like Thundra can be essential in delivering a high-quality management experience for your serverless applications.

Thundra makes use of span objects to trace the control flow of code in your serverless application. The flexibility of spans in distributed tracing means that you can subject them to any number of manipulations and updates as the application runs. For Larry's latency, Thundra offers SpanListeners: objects registered with Thundra and invoked as your code enters and exits spans. You generally create a listener object, then register it with your application's tracer. The sample code below demonstrates a simple latency injector that adds a five-second delay to matching spans as they finish:

// Larry gets clever!
const thundra = require("@thundra/core")(); // Thundra Node.js agent

exports.spanListenerLambdaHandler = thundra(async (event, context) => {
    const FilteringSpanListener = thundra.listeners.FilteringSpanListener;
    const LatencyInjectorSpanListener = thundra.listeners.LatencyInjectorSpanListener;
    const StandardSpanFilterer = thundra.listeners.StandardSpanFilterer;
    const SpanFilter = thundra.listeners.SpanFilter;

    const filteringListener = new FilteringSpanListener();
    const filter = new SpanFilter();
    filter.className = 'AWS-Lambdas';
    filter.tags = {
        'aws.lambda.name': 'upstream-lambdas'
    };

    const filterer = new StandardSpanFilterer([filter]);

    const latencyInjectorSpanListenerConfig = {
        delay: 5000,
        injectOnFinish: true
    };

    const latencyInjectorSpanListener = new LatencyInjectorSpanListener(latencyInjectorSpanListenerConfig);
    filteringListener.listener = latencyInjectorSpanListener;
    filteringListener.spanFilterer = filterer;

    thundra.tracer().addSpanListener(filteringListener);
    console.log("The Long Paw Of Larry Was Here");

    let response;
    try {
        response = buildResponse();
    } catch (err) {
        console.log(err);
        return err;
    }

    return response;
});

Figure 6: Thundra-driven latency, powered by Larry

With this simple modification, you have a quick and easy means of testing your application in a chaotic manner. This is integrated deeply into the Thundra platform, letting you trace both initial execution and failures in a contextual manner resembling the true control flow of your serverless application. By solving many of the most common serverless DevOps challenges out of the box, Thundra can help you quickly demonstrate ROI to your stakeholders and provide value to your customers.

Larry and Thundra Help You Cope

Request latency will be a continual challenge when you’re building a truly “always available” system. To maintain logical data consistency, you’ll need to carefully consider how your system responds when request responses are delayed. By accepting that request latency is a fact of life, we can plan for the eventual failures. With Thundra, you can easily add latency-based chaos testing to your serverless applications, allowing you to quickly improve their stability.