This is part II of a three-part series about using the Well-Architected Framework (WAF) and the Serverless Application Lens (SAL) with Thundra to design better serverless systems.
In the first article, we learned what the WAF and SAL offer us and created a basic serverless architecture based on some simple use cases.
In this article we will answer five of the nine SAL questions for the architecture we created in the first article. These five questions belong to two SAL pillars, security and reliability.
The Starting Architecture
After thinking about some use cases for a recipe-sharing app, we came up with a serverless architecture for it. Figure 1 is a simple diagram of its components.
Figure 1: Application architecture
This application doesn’t have any monitoring capabilities besides the ones that AWS offers by default, and it doesn’t try to answer the SAL-related questions yet. It just tries to solve the use cases.
Answering the Security Pillar Questions
The security pillar questions ask about who can access your system, what other services your system can access, and how you check your system inputs.
SEC 1: How do you control access to your serverless API?
The first question is about authorization for system access.
In our recipe application, we have three entry points from the internet: the Cognito service, which handles user management; the API Gateway, which handles the CRUD and search actions; and the S3 bucket, which handles direct image uploads.
Cognito is a managed service that handles all its input validation and sanitization by itself.
S3 direct uploads are done with presigned URLs, which we have to create in one of our Lambda functions behind the API Gateway. This means that only clients that have access to the API Gateway can upload to S3.
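As a rough sketch of how such a URL could be created, the function below asks S3 for a short-lived upload URL. It assumes boto3, with the client created as boto3.client("s3") inside the Lambda function and passed in; the key layout, the content type, and the five-minute expiry are illustrative assumptions, not part of the original architecture.

```python
import uuid


def create_upload_url(s3_client, bucket, user_id, expires_in=300):
    """Return a short-lived URL that a client can PUT a single image to."""
    # Hypothetical key layout: one prefix per user, random object name.
    key = f"uploads/{user_id}/{uuid.uuid4()}.jpg"
    url = s3_client.generate_presigned_url(
        "put_object",
        Params={"Bucket": bucket, "Key": key, "ContentType": "image/jpeg"},
        ExpiresIn=expires_in,  # seconds until the URL stops working
    )
    return {"uploadUrl": url, "key": key}
```

Because the URL is only handed out by a function behind the API Gateway, the bucket itself never has to be publicly writable.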
API Gateway offers us different options to manage access.
For internal APIs we could use IAM users/roles, which is the easiest option, but since our API isn’t internal, that’s not possible.
We could implement a custom Lambda authorizer, which is a good way to integrate with a legacy authentication scheme, but since we are writing a new application from scratch, there is a better option.
The best option for a new public API is to use Cognito user pools. We’ve already included Cognito in our architecture for user management, but we can also use it to restrict access to our API Gateway, so we don’t have to change a thing in our architecture.
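To illustrate what this looks like from inside a Lambda function, here is a minimal sketch. It assumes a REST API with a Cognito user pool authorizer and Lambda proxy integration, in which case API Gateway rejects unauthenticated requests on its own and forwards the token claims in the request context. The response shape is only an example.

```python
import json


def handler(event, context):
    # Requests without a valid token never reach this code; API Gateway
    # passes the verified claims along in the request context.
    claims = event.get("requestContext", {}).get("authorizer", {}).get("claims")
    if not claims:
        # Defensive fallback in case a route was deployed without the authorizer.
        return {"statusCode": 401, "body": json.dumps({"message": "Unauthorized"})}
    # "sub" is the unique ID of the user in the Cognito user pool.
    return {"statusCode": 200, "body": json.dumps({"userId": claims["sub"]})}
```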
SEC 2: How are you managing your serverless application’s security boundaries?
The second security question focuses on the other direction: What can our system access outside?
The active parts of our system are the Lambda functions. They execute arbitrary code that can access other services inside or outside of the application the Lambda function belongs to.
Single Responsibility Lambdas and IAM Roles
Every Lambda function needs an IAM role to allow it to access other resources, and the best practice is to give every Lambda just enough access that it can do its work. The more restrictive the IAM role, the better.
This leads to an important design principle that sometimes goes against the intuitive idea of building one Lambda per feature. For example, in our architecture, we have one Lambda for all CRUD actions to the DynamoDB table. This means we have to give it an IAM role that allows it to read and write to that table. If we split that Lambda into a read Lambda and a write Lambda, we could give the read Lambda a more restrictive role.
This would also allow us to delete all the create, update, and delete code from the read Lambda, making it less error-prone and more secure.
Figure 2: Architecture with split Lambda functions
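The difference between the two roles can be sketched as two IAM policy documents, built here as plain Python dictionaries. The split of actions and the table ARN are assumptions for illustration; the actual set depends on what each function really calls.

```python
import json

# Assumed action sets: the read Lambda only fetches and queries items.
READ_ACTIONS = ["dynamodb:GetItem", "dynamodb:Query"]
WRITE_ACTIONS = ["dynamodb:PutItem", "dynamodb:UpdateItem", "dynamodb:DeleteItem"]


def table_policy(table_arn, actions):
    """Build an IAM policy document that allows only the given actions on one table."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": list(actions), "Resource": table_arn}
        ],
    }


if __name__ == "__main__":
    # Hypothetical ARN, for illustration only.
    arn = "arn:aws:dynamodb:us-east-1:123456789012:table/Recipes"
    print(json.dumps(table_policy(arn, READ_ACTIONS), indent=2))
```

If the read Lambda is ever compromised, a policy like this one gives an attacker no way to modify or delete recipes.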
Thundra Denylist and Allowlist
Many serverless frameworks bundle multiple AWS resources into one and implicitly generate IAM roles for them. This can leave us with an IAM role for a Lambda that can do much more than it needs to. Also, most of the time we don’t want to write IAM roles manually.
For such predefined IAM roles, Thundra offers us denylists and allowlists.
If we instrument all our Lambda functions with a Thundra Layer, we can filter all network access by one of these lists. We can restrict their access with this feature even if an IAM role is too permissive.
Figure 3: Application architecture with Thundra layer
SEC 3: How do you implement application security in your workload?
This question is about what kind of data our system accepts as input and how it protects its secrets.
We can add basic request validation for JSON, headers, and query strings to our API Gateway to keep malicious data outside.
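For JSON bodies, API Gateway validates requests against a model written as a JSON Schema (draft 4). The sketch below defines such a model as a Python dictionary; the recipe fields and their limits are hypothetical, since the original article doesn't define a data model.

```python
import json

# Hypothetical model for creating a recipe; field names are assumptions.
RECIPE_MODEL = {
    "$schema": "http://json-schema.org/draft-04/schema#",
    "title": "Recipe",
    "type": "object",
    "required": ["title", "ingredients"],
    "properties": {
        "title": {"type": "string", "minLength": 1, "maxLength": 200},
        "ingredients": {
            "type": "array",
            "minItems": 1,
            "items": {"type": "string"},
        },
    },
    # Reject bodies that carry fields we don't know about.
    "additionalProperties": False,
}

if __name__ == "__main__":
    print(json.dumps(RECIPE_MODEL, indent=2))
```

Requests that don't match the model are rejected by API Gateway before a Lambda function is invoked.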
Since we access an external API for nutrition, we also need to store a secret. For this we can use AWS Secrets Manager, which keeps our API secret encrypted at rest and lets the Lambda function retrieve it at runtime, only when it actually needs to call the API.
Figure 4: Application architecture with Secrets Manager
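A minimal sketch of how a Lambda function could read that secret is shown below. It assumes boto3, with the client created as boto3.client("secretsmanager") and passed in; the secret name in the usage note is hypothetical. The value is cached so that warm invocations don't call Secrets Manager again.

```python
# Module-level cache: survives between invocations of a warm Lambda container.
_cache = {}


def get_api_key(secrets_client, secret_id):
    """Fetch a secret once per execution environment and reuse it afterwards."""
    if secret_id not in _cache:
        response = secrets_client.get_secret_value(SecretId=secret_id)
        _cache[secret_id] = response["SecretString"]
    return _cache[secret_id]
```

A call such as get_api_key(client, "nutrition-api-key") would then be made right before the request to the nutrition API. The Lambda function's IAM role needs permission to read exactly this one secret and nothing more.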
Answering the Reliability Pillar Questions
The reliability pillar questions ask about how your system keeps itself working when things go wrong.
REL 1: How are you regulating inbound request rates?
The first reliability question is about how we ensure that the system is used with the right access patterns. Even if we take care with authentication and authorization, a client could still overload our APIs.
To prevent this, we have to communicate clear SLAs for our APIs and enforce throttling at the API level. This can be done globally for all clients, or via API keys to give every client specific limits.
Another problem can arise when the services our Lambda functions access can't scale as fast as Lambda itself. To protect them, we have to keep an eye on Lambda function concurrency.
Throttled requests should receive a 429 HTTP status so the clients know that they're sending too many requests. API Gateway returns this status on its own when a throttling limit is exceeded.
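API Gateway applies throttling with a token bucket, configured through a steady rate and a burst limit. We don't have to implement this ourselves, but a small sketch of the algorithm can help when choosing those two numbers. The class and function names below are illustrative, and the one-second Retry-After value is an assumption.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.burst = burst          # maximum number of tokens in the bucket
        self.tokens = float(burst)  # start with a full bucket
        self.clock = clock
        self.last = clock()

    def allow(self):
        # Refill according to the time that passed, capped at the burst size.
        now = self.clock()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def respond(bucket):
    """Return a normal response, or a 429 when the bucket is empty."""
    if bucket.allow():
        return {"statusCode": 200, "body": "ok"}
    return {
        "statusCode": 429,
        "headers": {"Retry-After": "1"},
        "body": "Too Many Requests",
    }
```

With a rate of 1 and a burst of 2, two requests in quick succession pass, a third is answered with 429, and one more request is allowed a second later.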
REL 2: How are you building resiliency into your serverless application?
The second reliability question wants us to think about how our system reacts when it can’t satisfy requests right away. Sometimes the time it takes to execute the actual work is too long for a client to wait.
Asynchronous API Calls
Asynchronous APIs are a good way to make systems more resilient. The backend can respond to the client right away so the frontend won’t be blocked.
Depending on the use case, the frontend can mark the action as pending or simply display the expected result and revert it if an error is found later. On the backend, this allows us to put the tasks into a queue, which can then prioritize them and retry them if needed.
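One possible shape of such an asynchronous endpoint is sketched below: the function puts the task into an SQS queue and answers with 202 Accepted right away. It assumes boto3, with the client created as boto3.client("sqs") and passed in; the queue URL, the task ID, and the message format are illustrative choices.

```python
import json
import uuid


def accept_task(event, sqs_client, queue_url):
    """Queue the request body for later processing and respond immediately."""
    task_id = str(uuid.uuid4())
    sqs_client.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps(
            {"taskId": task_id, "payload": json.loads(event["body"])}
        ),
    )
    # 202 tells the client the work was accepted but has not finished yet.
    return {
        "statusCode": 202,
        "body": json.dumps({"taskId": task_id, "status": "pending"}),
    }
```

The frontend can keep the returned task ID to mark the action as pending and look up its result later.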
If synchronous requests are absolutely necessary, we have to be sure to keep them as short as possible so we don’t hit the limits of API Gateway or AppSync.
Chaos Engineering with Thundra
A prominent way to test the resiliency of distributed systems is chaos engineering. This testing method randomly injects anomalies into your production system, giving you the chance to find problems before they become serious.
Thundra offers a way to insert random latency into your system with the help of the Thundra Lambda Layer. This way, the chaos engineering code is encapsulated inside the Thundra Lambda Layer code and doesn’t clutter your application code.
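To make the idea of latency injection concrete, here is a generic sketch written as a Python decorator. This is not Thundra's API; the decorator name and its parameters are made up for illustration, and it only shows what injecting random latency means in principle.

```python
import functools
import random
import time


def inject_latency(probability, max_delay_seconds, sleep=time.sleep, rng=random):
    """Delay a fraction of calls by a random amount of time before running them."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Only some calls are slowed down, chosen at random.
            if rng.random() < probability:
                sleep(rng.uniform(0, max_delay_seconds))
            return func(*args, **kwargs)

        return wrapper

    return decorator
```

Decorating a function with inject_latency(0.1, 2.0) would slow down roughly one in ten calls by up to two seconds. Doing this by hand clutters the application code, which is exactly what moving it into a Lambda Layer avoids.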
Answering the First Five SAL Questions
In this article we answered the first five SAL questions relating to the security and reliability pillars. To get good answers, we needed a slight modification of our first attempt at architecting our application.
We split our CRUD Lambda function into smaller read and write functions and added the AWS Secrets Manager to keep our API keys secure.
We also instrumented the Lambda functions with Thundra, which gives us tracing, denylists and allowlists, and encapsulated chaos engineering features.
In the third and last part of this series, we’ll discuss the last four SAL questions, relating to operational excellence, performance efficiency, and cost optimization. Follow us on Twitter and don’t miss the next article!