Failures – same story over and over again
We’re all aware that everything fails from time to time. No matter what type of failure we have to deal with, its aftermath is generally a pain in the ass. This is especially true for a distributed microservice ecosystem, where a single request has to cross multiple bounded contexts (microservices with their own independent databases). It is doubly so with a serverless approach to application development. In a vast ocean filled with tiny Lambda functions, it’s pretty easy to come across a failure. Moreover, problems also appear in the connections between Lambda functions and other AWS services.
For example, DynamoDB limits read and write throughput. When that limit is exceeded, you pay for it either with a higher monthly AWS bill (for example, when capacity is scaled up to absorb the load) or with rejected requests. The latter requires handling, and a counter-reaction if other functions have changed the data in the meantime.
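To make the "rejected requests" case more concrete, here is a minimal sketch of retrying a throttled write with exponential backoff and jitter. Everything in it is an assumption made for illustration: `ThrottledError` stands in for whatever throttling error your client raises, and `make_flaky_write` is a fake write, not a real DynamoDB call. Keep in mind that the AWS SDKs already retry throttled calls to some extent, so treat this as an illustration of the idea rather than something you necessarily need to write yourself.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling error (e.g. a rejected write)."""


def call_with_backoff(operation, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call `operation`, retrying on ThrottledError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts:
                raise  # out of attempts; the caller has to compensate
            # Full jitter: wait a random time up to base_delay * 2^(attempt - 1).
            sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))


def make_flaky_write(failures):
    """Return a fake write that is throttled `failures` times, then succeeds."""
    state = {"calls": 0}

    def write():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise ThrottledError("request rejected")
        return f"written after {state['calls']} calls"

    return write


if __name__ == "__main__":
    # The sleep function is replaced with a no-op so the demo runs instantly.
    print(call_with_backoff(make_flaky_write(2), sleep=lambda seconds: None))
```

When the attempts run out, the error is re-raised on purpose: at that point retrying no longer helps and the rest of the flow has to react.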
Build automated workflows – AWS Step Functions
Step Functions let you define a state machine in the Amazon States Language (ASL), a JSON-based language that describes the states of the state machine and the transitions between them. Reading JSON may be painful for some of us, so AWS generates a nice-looking flowchart from the ASL code, which makes the machine easier to visualize, as seen here. We are going to use it as an example in this article:
Step Functions – example
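If you have never seen ASL, the sketch below builds a tiny, hypothetical two-state definition as a Python dictionary and prints it as JSON. The state names (`ValidateInput`, `Done`) are made up and have nothing to do with the flow in this article. The small `dangling_transitions` helper is also my own illustration; it only checks that every `Next` points at a state that exists.

```python
import json

# Hypothetical definition: a Pass state that hands over to a Succeed state.
state_machine = {
    "Comment": "Hypothetical two-step workflow",
    "StartAt": "ValidateInput",
    "States": {
        "ValidateInput": {"Type": "Pass", "Next": "Done"},
        "Done": {"Type": "Succeed"},
    },
}


def dangling_transitions(definition):
    """Return (state, target) pairs whose Next points at an undefined state."""
    states = definition["States"]
    return [
        (name, state["Next"])
        for name, state in states.items()
        if "Next" in state and state["Next"] not in states
    ]


if __name__ == "__main__":
    assert state_machine["StartAt"] in state_machine["States"]
    assert dangling_transitions(state_machine) == []
    print(json.dumps(state_machine, indent=2))
```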
The Step Functions service lets us build activity flows that are really helpful when a transaction-like request has to be implemented. What I have in mind can be explained by Wikipedia’s definition of an “atomic transaction”:
“An atomic transaction is an indivisible and irreducible series of database operations such that either all occur, or nothing occurs. A guarantee of atomicity prevents updates to the database occurring only partially, which can cause greater problems than rejecting the whole series outright.”
As I’ve mentioned before, these activity flows consist of small steps (Lambda functions, Wait states, Choice states, etc.) that together form one common logic flow. If such a flow puts something into S3 and simultaneously saves some metadata (as in our example), then completing only half of the operations is not an option. Loss of consistency is completely unacceptable. That’s the gap that saga patterns fill.
Saga Patterns – dealing with long-lived transactions
I found this definition of a saga: “A saga is a sequence of local transactions where each transaction updates data within a single service. An external request corresponding to the system operation initiates the first transaction, and then each subsequent step is triggered by the completion of the previous one.”
A few additional notes from me: a saga is a failure-handling pattern, so when any failure occurs during one of those long-lived flows, we apply the corresponding compensating actions to return to the state the system was in when the saga (activity flow) started.
At Chaos Gears we tend to use two scenarios, a sequential one and a parallel one, depending on our needs.
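The idea of compensating actions can be sketched in a few lines of plain Python. This is not how Step Functions implements anything; it is a simplified, sequential illustration under my own assumptions. `run_saga`, `demo` and the in-memory `bucket` and `table` are hypothetical names that stand in for real services. Each step is a pair of an action and the action that undoes it, and only steps that completed are compensated.

```python
def run_saga(steps, context):
    """Run (action, compensation) pairs in order; undo completed steps on failure."""
    completed = []
    for action, compensation in steps:
        try:
            action(context)
        except Exception as error:
            # Undo in reverse order so the most recent change is reverted first.
            for undo in reversed(completed):
                undo(context)
            return {"status": "compensated", "error": str(error)}
        completed.append(compensation)
    return {"status": "completed"}


def demo(fail_on_save):
    """Run a two-step saga against in-memory stand-ins for S3 and DynamoDB."""
    bucket, table = set(), {}

    def create_path(ctx):
        bucket.add(ctx["path"])

    def remove_path(ctx):
        bucket.discard(ctx["path"])  # no-op when the path is already gone

    def save_info(ctx):
        if fail_on_save:
            raise RuntimeError("write rejected")
        table[ctx["id"]] = ctx["path"]

    def delete_info(ctx):
        table.pop(ctx["id"], None)

    steps = [(create_path, remove_path), (save_info, delete_info)]
    result = run_saga(steps, {"id": "42", "path": "deployments/42/"})
    return result, bucket, table


if __name__ == "__main__":
    print(demo(fail_on_save=True))
    print(demo(fail_on_save=False))
```

In the failing run the path created by the first step is removed again, so both stores end up exactly as they were before the saga started.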
Saga Patterns – sequential scenario
Saga Patterns – parallel step scenario
The diagram at the beginning of this article covers the parallel scenario. It contains a step with two Lambda functions (DynamodbFallback, BucketPathFallback), which are the compensating mirror images of CreateDeploymentS3Path and SaveDeploymentInfoDynamoDb. Never mind the names, which have been changed for the sake of this article; I hope you, dear readers, have already got the point. Whenever you code a Lambda function that is going to change state as part of a workflow, always think about how to keep things consistent in case of failure. I don’t have to remind you that idempotence is a must-have in such a scenario.
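Why idempotence matters is easiest to see with a compensating function that may be invoked more than once, for example because of a retry. The sketch below is purely illustrative: the handler names mirror the fallback states from the diagram, but the bodies, the `deployment_id` event field and the in-memory `FAKE_BUCKET` and `FAKE_TABLE` stores are my assumptions, not the real functions. The point is that "nothing left to remove" is treated as success, so a second call changes nothing and raises nothing.

```python
# In-memory stand-ins for the S3 bucket and the DynamoDB table (hypothetical data).
FAKE_BUCKET = {"deployments/42/": b""}
FAKE_TABLE = {"42": {"path": "deployments/42/"}}


def bucket_path_fallback(event, context=None):
    """Remove the deployment path; a missing path counts as already compensated."""
    key = f"deployments/{event['deployment_id']}/"
    existed = FAKE_BUCKET.pop(key, None) is not None
    return {"deployment_id": event["deployment_id"], "removed": existed}


def dynamodb_fallback(event, context=None):
    """Delete the deployment record; deleting a missing record is not an error."""
    existed = FAKE_TABLE.pop(event["deployment_id"], None) is not None
    return {"deployment_id": event["deployment_id"], "removed": existed}


if __name__ == "__main__":
    event = {"deployment_id": "42"}
    print(bucket_path_fallback(event))  # first call removes the path
    print(bucket_path_fallback(event))  # second call is a harmless no-op
    print(dynamodb_fallback(event))
    print(dynamodb_fallback(event))
```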
In case of failure
In case of success
Those of you who follow our blog should already know that at Chaos Gears we rely on the Serverless Framework when launching serverless environments. For us, the massive benefit of this framework lies in its plugins.
Basically, you don’t have to code everything from scratch. Just remember that the devil is in the details, so be cautious.
This link will take you to a list of plugins: https://serverless.com/plugins/. One of them is for AWS Step Functions configuration: https://serverless.com/plugins/serverless-step-functions/. Believe me, it eases the pain caused by building complex flows. I prefer reading YAML, which is more human-readable than JSON.
Implementation – save your priceless time
Below, you’ll see an example of a flow that implements the diagram shown at the beginning of the article. I want to draw your attention to the states of type “Parallel”, which allow you to invoke several Lambda functions simultaneously. Whenever one of the branches fails, the whole state is treated as failed, and the Fallback procedures (compensating transactions) are launched.
stepFunctions:
  stateMachines:
    Add:
      name: ${self:custom.app}-${self:custom.service_acronym}-add-${self:custom.stage}
      metrics:
        - executionsTimeOut
        - executionsFailed
        - executionsAborted
        - executionThrottled
      events:
        - http:
            path: ${self:custom.api_ver}/path
            method: post
            private: true
            cors:
              origin: "*"
              headers: ${self:custom.allowed-headers}
              origins:
                - "*"
            response:
              statusCodes:
                400:
                  pattern: '.*"statusCode":400,.*' # JSON response
                  template:
                    application/json: $input.path("$.errorMessage")
                200:
                  pattern: '' # Default response method
                  template:
                    application/json: |
                      {
                        "request_id": '"$input.json('$.executionArn').split(':')[7].replace('"', "")"',
                        "output": "$input.json('$.output').replace('"', "")",
                        "status": "$input.json('$.status').replace('"', "")"
                      }
            request:
              template:
                application/json: |
                  {
                    "input" : "{ \"body\": $util.escapeJavaScript($input.json('$')), \"contextid\": \"$context.requestId\", \"contextTime\": \"$context.requestTime\"}",
                    "stateMachineArn": "arn:aws:states:#{AWS::Region}:#{AWS::AccountId}:stateMachine:${self:custom.app}-${self:custom.service_acronym}-add-${self:custom.stage}"
                  }
      definition:
        StartAt: CreateCustomerDeployment
        States:
          CreateCustomerDeployment:
            Type: Parallel
            Next: Final
            OutputPath: '$'
            Catch:
              - ErrorEquals:
                  - States.ALL
                Next: Fallback
                ResultPath: '$.error'
            Branches:
              - StartAt: CreateDeploymentS3Path
                States:
                  CreateDeploymentS3Path:
                    Type: Task
                    Resource: arn:aws:lambda:#{AWS::Region}:#{AWS::AccountId}:function:${self:custom.app}-${self:custom.service_acronym}-function-a
                    TimeoutSeconds: 5
                    End: True
                    Retry:
                      - ErrorEquals:
                          - HandledError
                        IntervalSeconds: 1
                        MaxAttempts: 2
                        BackoffRate: 1
                      - ErrorEquals:
                          - States.TaskFailed
                        IntervalSeconds: 2
                        MaxAttempts: 2
                        BackoffRate: 1
                      - ErrorEquals:
                          - States.ALL
                        IntervalSeconds: 2
                        MaxAttempts: 1
              - StartAt: SaveDeploymentInfoDynamoDb
                States:
                  SaveDeploymentInfoDynamoDb:
                    Type: Task
                    Resource: arn:aws:lambda:#{AWS::Region}:#{AWS::AccountId}:function:${self:custom.app}-${self:custom.service_acronym}-function-b
                    TimeoutSeconds: 5
                    End: True
                    Retry:
                      - ErrorEquals:
                          - HandledError
                        IntervalSeconds: 1
                        MaxAttempts: 2
                        BackoffRate: 1
                      - ErrorEquals:
                          - States.TaskFailed
                        IntervalSeconds: 2
                        MaxAttempts: 2
                        BackoffRate: 1
                      - ErrorEquals:
                          - States.ALL
                        IntervalSeconds: 1
                        MaxAttempts: 2
                        BackoffRate: 1
          Fallback:
            Type: Pass
            InputPath: '$'
            Next: CancelData
          CancelData:
            Type: Parallel
            InputPath: '$'
            Next: CancelledResults
            Branches:
              - StartAt: DynamodbFallback
                States:
                  DynamodbFallback:
                    Type: Task
                    Resource: arn:aws:lambda:#{AWS::Region}:#{AWS::AccountId}:function:${self:custom.app}-${self:custom.service_acronym}-function-b-compensating
                    TimeoutSeconds: 5
                    End: True
                    Retry:
                      - ErrorEquals:
                          - HandledError
                        IntervalSeconds: 1
                        MaxAttempts: 2
                        BackoffRate: 1
                      - ErrorEquals:
                          - States.TaskFailed
                        IntervalSeconds: 2
                        MaxAttempts: 2
                        BackoffRate: 1
                      - ErrorEquals:
                          - States.ALL
                        IntervalSeconds: 1
                        MaxAttempts: 2
                        BackoffRate: 1
              - StartAt: BucketPathFallback
                States:
                  BucketPathFallback:
                    Type: Task
                    Resource: arn:aws:lambda:#{AWS::Region}:#{AWS::AccountId}:function:${self:custom.app}-${self:custom.service_acronym}-function-a-compensating
                    TimeoutSeconds: 5
                    End: True
                    Retry:
                      - ErrorEquals:
                          - HandledError
                        IntervalSeconds: 1
                        MaxAttempts: 2
                        BackoffRate: 1
                      - ErrorEquals:
                          - States.TaskFailed
                        IntervalSeconds: 2
                        MaxAttempts: 2
                        BackoffRate: 1
                      - ErrorEquals:
                          - States.ALL
                        IntervalSeconds: 1
                        MaxAttempts: 2
                        BackoffRate: 1
          CancelledResults:
            Type: Succeed
          Final:
            Type: Pass
            End: True
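A quick note on the Retry blocks, since the three numbers are easy to misread. As far as I understand the Amazon States Language, IntervalSeconds is the wait before the first retry, MaxAttempts is the maximum number of retries, and BackoffRate is the multiplier applied to the wait after each retry. The helper below is my own back-of-the-envelope sketch of that arithmetic (`retry_delays` is not part of any AWS library). Its defaults mirror the ASL defaults of 1 second, 3 attempts and a rate of 2.0.

```python
def retry_delays(interval_seconds=1, max_attempts=3, backoff_rate=2.0):
    """Return the wait (in seconds) before each retry of a single retrier."""
    return [interval_seconds * backoff_rate ** attempt for attempt in range(max_attempts)]


if __name__ == "__main__":
    # States.TaskFailed retrier from the definition: 2 s interval, 2 attempts, rate 1.
    task_failed = retry_delays(interval_seconds=2, max_attempts=2, backoff_rate=1)
    assert task_failed == [2, 2]

    # With a BackoffRate of 1 the wait never grows; with the defaults it doubles.
    assert retry_delays() == [1, 2, 4]

    print("States.TaskFailed waits:", task_failed, "total:", sum(task_failed), "s")
```

With a BackoffRate of 1, as used in most of the retriers here, the retries are evenly spaced. A rate above 1 is worth considering when the failures are caused by throttling, because it gives the downstream service time to recover.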
Where to go next
Establishing consistency and maintaining it across services and their databases is the main challenge you face when you design and develop serverless architectures. It’s almost impossible to handle that task without saga patterns. But let’s make something clear: AWS Step Functions won’t solve all of your problems and won’t fit every serverless scenario. However, the service offers a pleasant way to simplify the complexities of dealing with a long-lived transaction across distributed components.
This is a guest post by Karol Junde, originally published at https://chaosgears.com/