What you gonna do when failures come for you?

Aug 20, 2019



Failures – same story over and over again

We’re all aware that everything fails from time to time. Whatever type of failure we have to deal with, its aftermath is generally a pain in the ass. This is especially true for a distributed microservice ecosystem, where a single request has to cross multiple bounded contexts (microservices with their own independent databases), and doubly so with a serverless approach to application development. In a vast ocean of tiny Lambda functions, it’s pretty easy to come across a failure. Moreover, problems also appear in the connections between Lambda functions and other AWS services.

For example, DynamoDB sets limits on read and write throughput. When a limit is exceeded, you pay a penalty, either as a higher monthly AWS bill or as rejected requests. The latter requires handling and a counter-reaction if other functions have changed the data in the meantime.
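A rejected request is usually retried with a growing pause between attempts. Here is a minimal sketch in Python, assuming a hypothetical `ThrottledError` exception that stands in for whatever throttling error your client raises (with boto3 that would typically be a `ClientError` whose error code is `ProvisionedThroughputExceededException`). The function name and its parameters are illustrative and not part of any AWS SDK:

```python
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling error raised by a client."""


def call_with_backoff(operation, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call operation(); on ThrottledError wait, double the delay, and retry.

    Re-raises the last ThrottledError once max_attempts is exhausted.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts:
                raise
            sleep(delay)
            delay *= 2
```

Passing `sleep` as a parameter keeps the helper easy to test. Keep in mind that the AWS SDKs already retry throttled calls on their own, so a wrapper like this only matters once those built-in retries are exhausted.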

Build automated workflows – AWS Step Functions

Step Functions lets you define a state machine in Amazon States Language (ASL): a JSON document that describes the available states of the state machine and the connections between them. Reading JSON can be painful, so AWS generates a nice-looking flowchart from the ASL code, which makes the machine easier to visualize, as seen here. We are going to use it as an example in this article:
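To make the shape of ASL concrete, here is a minimal sketch in Python that builds a two-state definition as a dictionary, serializes it to JSON, and checks that every top-level `Next` points at a state that exists. The state names and the Lambda ARN are made up for illustration, and `dangling_transitions` is a hypothetical helper, not an AWS API:

```python
import json

definition = {
    "Comment": "Illustrative two-state machine",
    "StartAt": "SaveMetadata",
    "States": {
        "SaveMetadata": {
            "Type": "Task",
            # Placeholder ARN, not a real function.
            "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:save-metadata",
            "Next": "Done",
        },
        "Done": {"Type": "Succeed"},
    },
}


def dangling_transitions(machine):
    """Return (state, target) pairs whose StartAt/Next target is not a defined state."""
    states = machine["States"]
    problems = []
    if machine["StartAt"] not in states:
        problems.append(("StartAt", machine["StartAt"]))
    for name, state in states.items():
        target = state.get("Next")
        if target is not None and target not in states:
            problems.append((name, target))
    return problems


# The JSON text is what Step Functions would receive as the definition.
asl_json = json.dumps(definition, indent=2)
```

The check is deliberately shallow: it ignores `Catch` targets and the nested states inside `Parallel` branches, so treat it as a quick sanity test rather than a validator.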

chaos-gears-1

Step Functions – example

The Step Functions service gives us an opportunity to build activity flows (not to be confused with state machines) that are really helpful when a transaction-like request has to be implemented. What I have in mind can be explained by Wikipedia’s definition of an “atomic transaction”:

“An atomic transaction is an indivisible and irreducible series of database operations such that either all occur, or nothing occurs. A guarantee of atomicity prevents updates to the database occurring only partially, which can cause greater problems than rejecting the whole series outright.”

These activity flows consist of small steps (Lambda functions, Wait states, Choice states, etc.) which together form one common logic flow. If such a flow puts something into S3 and simultaneously saves some metadata (as in our example), it must not end up with only half of the operations completed successfully. Loss of consistency is completely unacceptable. That’s the gap that saga patterns fill.

Saga Patterns – dealing with long-lived transactions

I found this definition of a saga: “A saga is a sequence of local transactions where each transaction updates data within a single service. An external request corresponding to the system operation initiates the first transaction, and then each subsequent step is triggered by the completion of the previous one.”

A note from me: a saga is a failure-handling pattern, so when a failure occurs during one of those long-lived flows, we apply the corresponding compensating actions to return to the state we were in when the saga (activity flow) started.
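The idea of compensating actions can be sketched in a few lines of Python. This is only an illustration of the pattern, not the code we run: `run_saga` and the four step functions are hypothetical, and a plain dictionary stands in for S3 and the database:

```python
def run_saga(steps, context):
    """Run (action, compensation) pairs in order.

    If an action raises, run the compensations of the steps that already
    completed, in reverse order, then re-raise the original error.
    """
    completed = []
    for action, compensation in steps:
        try:
            action(context)
        except Exception:
            for undo in reversed(completed):
                undo(context)
            raise
        completed.append(compensation)
    return context


def create_path(ctx):
    ctx["paths"].add(ctx["deployment_id"])


def remove_path(ctx):
    ctx["paths"].discard(ctx["deployment_id"])


def save_info(ctx):
    raise RuntimeError("simulated write failure")


def delete_info(ctx):
    ctx["items"].pop(ctx["deployment_id"], None)
```

Calling `run_saga([(create_path, remove_path), (save_info, delete_info)], ctx)` with `ctx = {"deployment_id": "d-1", "paths": set(), "items": {}}` raises the simulated error and leaves `ctx["paths"]` empty again. Note that the sketch does not compensate the step that failed, on the assumption that a failed action had no effect; if that assumption does not hold for your steps, run its compensation too and make it idempotent.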

At Chaos Gears we tend to use two scenarios, a sequential one and a parallel one, depending on our needs.

chaos-gears-2

Saga Patterns – sequential scenario

chaos-gears-3

Saga Patterns – parallel step scenario

The diagram at the beginning of this article covers the parallel scenario. It contains a step with two Lambda functions (DynamodbFallback, BucketPathFallback), which are the compensating mirror images of CreateDeploymentS3Path and SaveDeploymentInfoDynamoDb. Never mind the names, which have been changed for the sake of this article; I hope you’ve already got the point. Whenever you code a Lambda function that is going to change state as part of a workflow, always think about how to keep things consistent in case of failure. I don’t have to remind you that idempotence is a must-have in such a scenario.
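Here is what idempotence means in practice, as a small Python sketch. `FakeTable` is an in-memory stand-in and not a real DynamoDB client, and both functions are hypothetical; they only borrow the spirit of SaveDeploymentInfoDynamoDb and DynamodbFallback:

```python
class FakeTable:
    """In-memory stand-in for a key-value table; not a real DynamoDB client."""

    def __init__(self):
        self.items = {}


def save_deployment_info(table, deployment_id, info):
    """Idempotent write: repeating the call with the same input leaves the same state."""
    table.items[deployment_id] = dict(info)
    return {"deployment_id": deployment_id, "saved": True}


def dynamodb_fallback(table, deployment_id):
    """Idempotent compensation: deleting a missing item is a no-op, not an error."""
    existed = table.items.pop(deployment_id, None) is not None
    return {"deployment_id": deployment_id, "removed": existed}
```

Because a retried or duplicated invocation ends in the same state as a single one, the workflow can safely retry both the forward step and its compensation, even when it cannot tell how far the previous attempt got.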

chaos-gears-4

In case of failure

 

chaos-gears-5

In case of success

 

Those of you who follow our blog should already know that at Chaos Gears we rely on the Serverless Framework when launching serverless environments. For us, the massive benefit of this framework lies in its plugins.

Basically, you don’t have to code everything from scratch. Just remember that the devil is in the details, so be cautious.

This link will take you to a list of plugins: https://serverless.com/plugins/. One of them is for AWS Step Functions configuration: https://serverless.com/plugins/serverless-step-functions/. Believe me, it eases the pain of building complex flows. It also lets you write the definition in YAML, which I find more human-readable than JSON.

Implementation – save your priceless time

Below, you’ll see an example of a flow that implements the diagram shown at the beginning of the article. I want to draw your attention to the states of type “Parallel”, which allow you to invoke several Lambda functions simultaneously. Whenever one of them fails, the whole step is treated as failed, and the fallback procedures (compensating transactions) are launched.

stepFunctions:
  stateMachines:
   Add:
      name: ${self:custom.app}-${self:custom.service_acronym}-add-${self:custom.stage}
      metrics:
        - executionsTimeOut
        - executionsFailed
        - executionsAborted
        - executionThrottled
      events:
        - http:
            path: ${self:custom.api_ver}/path
            method: post
            private: true
            cors:
              origin: "*"
              headers: ${self:custom.allowed-headers}
            origins:
              - "*"
            response:
              statusCodes:
                400:
                  pattern: '.*"statusCode":400,.*' # JSON response
                  template:
                    application/json: $input.path("$.errorMessage")
                200:
                  pattern: '' # Default response method
                  template:
                      application/json: |
                        {
                        "request_id": '"$input.json('$.executionArn').split(':')[7].replace('"', "")"',
                        "output": "$input.json('$.output').replace('"', "")",
                        "status": "$input.json('$.status').replace('"', "")"
                        }
            request:
              template:
                  application/json: |
                    {
                    "input" : "{ \"body\": $util.escapeJavaScript($input.json('$')), \"contextid\": \"$context.requestId\", \"contextTime\": \"$context.requestTime\"}",
                    "stateMachineArn": "arn:aws:states:#{AWS::Region}:#{AWS::AccountId}:stateMachine:${self:custom.app}-${self:custom.service_acronym}-add-${self:custom.stage}"
                    }
      definition:
        StartAt: CreateCustomerDeployment
        States:
          CreateCustomerDeployment:
            Type: Parallel
            Next: Final
            OutputPath: '$'
            Catch:
              - ErrorEquals:
                  - States.ALL
                Next: Fallback
                ResultPath: '$.error'
            Branches:
              - StartAt: CreateDeploymentS3Path
                States:
                  CreateDeploymentS3Path:
                    Type: Task
                    Resource: arn:aws:lambda:#{AWS::Region}:#{AWS::AccountId}:function:${self:custom.app}-${self:custom.service_acronym}-function-a
                    TimeoutSeconds: 5
                    End: True
                    Retry:
                      - ErrorEquals:
                          - HandledError
                        IntervalSeconds: 1
                        MaxAttempts: 2
                        BackoffRate: 1
                      - ErrorEquals:
                          - States.TaskFailed
                        IntervalSeconds: 2
                        MaxAttempts: 2
                        BackoffRate: 1
                      - ErrorEquals:
                          - States.ALL
                        IntervalSeconds: 2
                        MaxAttempts: 1
              - StartAt: SaveDeploymentInfoDynamoDb
                States:
                  SaveDeploymentInfoDynamoDb:
                    Type: Task
                    Resource: arn:aws:lambda:#{AWS::Region}:#{AWS::AccountId}:function:${self:custom.app}-${self:custom.service_acronym}-function-b
                    TimeoutSeconds: 5
                    End: True
                    Retry:
                      - ErrorEquals:
                          - HandledError
                        IntervalSeconds: 1
                        MaxAttempts: 2
                        BackoffRate: 1
                      - ErrorEquals:
                          - States.TaskFailed
                        IntervalSeconds: 2
                        MaxAttempts: 2
                        BackoffRate: 1
                      - ErrorEquals:
                          - States.ALL
                        IntervalSeconds: 1
                        MaxAttempts: 2
                        BackoffRate: 1
          Fallback:
            Type: Pass
            InputPath: '$'
            Next: CancelData
          CancelData:
            Type: Parallel
            InputPath: '$'
            Next: CancelledResults
            Branches:
              - StartAt: DynamodbFallback
                States:
                  DynamodbFallback:
                    Type: Task
                    Resource: arn:aws:lambda:#{AWS::Region}:#{AWS::AccountId}:function:${self:custom.app}-${self:custom.service_acronym}-function-b-compensating
                    TimeoutSeconds: 5
                    End: True
                    Retry:
                      - ErrorEquals:
                          - HandledError
                        IntervalSeconds: 1
                        MaxAttempts: 2
                        BackoffRate: 1
                      - ErrorEquals:
                          - States.TaskFailed
                        IntervalSeconds: 2
                        MaxAttempts: 2
                        BackoffRate: 1
                      - ErrorEquals:
                          - States.ALL
                        IntervalSeconds: 1
                        MaxAttempts: 2
                        BackoffRate: 1
              - StartAt: BucketPathFallback
                States:
                  BucketPathFallback:
                    Type: Task
                    Resource: arn:aws:lambda:#{AWS::Region}:#{AWS::AccountId}:function:${self:custom.app}-${self:custom.service_acronym}-function-a-compensating
                    TimeoutSeconds: 5
                    End: True
                    Retry:
                      - ErrorEquals:
                          - HandledError
                        IntervalSeconds: 1
                        MaxAttempts: 2
                        BackoffRate: 1
                      - ErrorEquals:
                          - States.TaskFailed
                        IntervalSeconds: 2
                        MaxAttempts: 2
                        BackoffRate: 1
                      - ErrorEquals:
                          - States.ALL
                        IntervalSeconds: 1
                        MaxAttempts: 2
                        BackoffRate: 1
          CancelledResults:
            Type: Succeed
          Final:
            Type: Pass
            End: True
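If you want to get a feel for the control flow of the definition above without deploying anything, it can be mimicked locally. The following Python sketch is a rough simulation under stated assumptions: `run_parallel` and `execute` are hypothetical helpers, threads stand in for Lambda invocations, and the `Retry` blocks are left out entirely:

```python
from concurrent.futures import ThreadPoolExecutor


def run_parallel(branches, payload):
    """Run all branches concurrently and return their results in branch order.

    If any branch raised, its exception is re-raised here once all
    branches have finished.
    """
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        futures = [pool.submit(branch, payload) for branch in branches]
        # result() re-raises the branch's exception in the caller.
        return [future.result() for future in futures]


def execute(branches, fallbacks, payload):
    """Mimic Parallel -> Final on success and Catch -> CancelData on failure."""
    try:
        return {"ended_in": "Final", "output": run_parallel(branches, payload)}
    except Exception as err:
        # Roughly what ResultPath: '$.error' does: attach the error to the input.
        failed_payload = dict(payload, error=str(err))
        return {
            "ended_in": "CancelledResults",
            "output": run_parallel(fallbacks, failed_payload),
        }
```

Two differences from the real service are worth keeping in mind. Step Functions stops the remaining branches of a Parallel state as soon as one of them fails, while this sketch lets them run to completion. And because CancelledResults is a Succeed state, an execution that went through the fallback path still finishes with a successful status, so callers have to inspect the output to tell the two outcomes apart.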


Where to go next

Establishing consistency and maintaining it across services and their databases is the main challenge you face when you design and develop serverless architectures. It’s almost impossible to handle that task without saga patterns. But let’s make something clear: AWS Step Functions won’t solve all of your problems and won’t fit every serverless scenario. However, this service offers a pleasant way to simplify the complexities of dealing with a long-lived transaction across distributed components.

This is a guest post by Karol Junde. It was originally published at https://chaosgears.com/