Many monitoring solutions on the market were not designed with the AWS Lambda environment in mind. They publish monitoring data synchronously, as part of the request, which is an anti-pattern for serverless monitoring for the following reasons:
Longer request duration:
Publishing monitoring data synchronously within the request increases the execution duration. In the best case, it adds a few tens of milliseconds of overhead to your Lambda function’s execution. In the worst case, you cannot send the monitoring data at all because the network fails or the data receiver suffers an outage. Even if you manage to send the data after a few retries, you lose hundreds of milliseconds or even seconds doing so. A longer duration means a higher bill, because you pay for every 100 ms you use, and a degraded user experience, which can cost even more in the long run.
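To make the cost argument concrete, here is a small sketch of how the billed duration grows when synchronous publishing adds latency, assuming the 100 ms billing granularity mentioned above. The class name, method name, and sample durations are illustrative assumptions, not measurements.

```java
// Illustrative sketch: billed duration under a 100 ms billing granularity.
// The durations used in main() are made-up examples, not measurements.
final class BilledDurationSketch {

    // Billing granularity taken from the text above (pay per 100 ms).
    static final long BILLING_UNIT_MS = 100;

    // Rounds an execution duration up to the next full billing unit.
    static long billedMillis(long durationMs) {
        if (durationMs <= 0) {
            return 0;
        }
        return ((durationMs + BILLING_UNIT_MS - 1) / BILLING_UNIT_MS) * BILLING_UNIT_MS;
    }

    public static void main(String[] args) {
        long handlerOnlyMs = 80;      // hypothetical business logic time
        long publishOverheadMs = 30;  // hypothetical synchronous publish time

        System.out.println("Without publish: " + billedMillis(handlerOnlyMs) + " ms billed");
        System.out.println("With publish:    "
                + billedMillis(handlerOnlyMs + publishOverheadMs) + " ms billed");
    }
}
```

In this example, a 30 ms publish pushes an 80 ms invocation over the 100 ms boundary, so the billed duration doubles from 100 ms to 200 ms.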
Lambda functions must be stateless, because the container that runs your function can be destroyed at any time while it is not handling a request. Saving monitoring data in local storage to publish later in batches is therefore not a good idea.
Using background threads to publish monitoring data is not a good idea either. While the container is not handling a request, it is frozen, and AWS Lambda does not allocate CPU to a frozen container, so background threads cannot publish anything. If publishing instead takes place during request execution, it increases the duration, as discussed above.
In other words, you cannot send monitoring data in batches unless you can tolerate data loss. You need to send it before the invocation ends, because your code only runs while the container is actively handling a request.
Data publish failures:
The monitoring data receiver might not be available all the time. When it is not, you need to either retry sending the monitoring data or silently skip it.
In case of data publish failures:
- You may retry until the publish succeeds. However, the request can still fail because of the maximum execution time limit of 5 minutes, in which case you lose the monitoring data. Even if a retry succeeds, the function execution is delayed by the retries.
- You may skip sending the data if you can tolerate data loss. If the failures end quickly, the losses might be acceptable, depending on your monitoring needs and expectations. But the failures might continue for hours or even days. During that time you would have no idea whether something is wrong with your Lambda functions, which is not acceptable for most systems.
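The trade-off between these two options can be sketched as a retry loop that gives up once the remaining execution time falls below a safety margin. This is a minimal sketch rather than a recommendation: `publishWithBudget`, its parameters, and the margin are hypothetical names introduced here. In a real function, the remaining time would typically come from `Context.getRemainingTimeInMillis()`.

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongSupplier;

// Illustrative sketch: retry a synchronous publish, but stop before the
// invocation runs out of time. All names here are hypothetical.
final class BoundedRetrySketch {

    /**
     * @param publish         attempts to send the data; returns true on success
     * @param remainingMillis time left in the invocation (a stand-in for
     *                        Context.getRemainingTimeInMillis())
     * @param safetyMarginMs  time to keep in reserve for the handler itself
     * @param maxAttempts     upper bound on publish attempts
     * @return true if the data was sent, false if it was dropped
     */
    static boolean publishWithBudget(BooleanSupplier publish,
                                     LongSupplier remainingMillis,
                                     long safetyMarginMs,
                                     int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (remainingMillis.getAsLong() <= safetyMarginMs) {
                return false; // out of time budget: the data is lost
            }
            if (publish.getAsBoolean()) {
                return true;  // sent, but the retries delayed the request
            }
        }
        return false; // receiver still unavailable: the data is lost
    }
}
```

Either way the loop exits, the synchronous approach costs something: a successful retry lengthens the request, and a `false` return means the monitoring data is gone.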
Access within VPC:
If you put your functions in a VPC and need internet access to publish monitoring data, you need to set up a NAT gateway and route it to an internet gateway. This means additional operational work on your side, and the NAT gateway can become a bottleneck because it handles all of the monitoring data.
Publishing monitoring data asynchronously
For the reasons above, Thundra publishes monitoring data through CloudWatch. AWS also considers this a best practice, as explained in the “Serverless Architectures with AWS Lambda” whitepaper:
Capture the metric within your Lambda function code and log it using the provided logging mechanisms in Lambda. Then, create a CloudWatch Logs metric filter on the function streams to extract the metric and make it available in CloudWatch. Alternatively, create another Lambda function as a subscription filter on the CloudWatch Logs stream to push filtered log statements to another metrics solution. This path introduces more complexity and is not as near real-time as the previous solution for capturing metrics. However, it allows your function to more quickly create metrics through logging rather than making an external service request.
In our approach, trace, metric, and log data are written to CloudWatch in a structured JSON format through the `com.amazonaws.services.lambda.runtime.LambdaLogger` provided by `com.amazonaws.services.lambda.runtime.Context`. AWS Lambda then sends these logs to CloudWatch asynchronously, without affecting request performance: under the hood, logs printed through `LambdaLogger` are written to shared memory in the container and shipped to CloudWatch later.
We also have another Lambda function, let’s call it the “monitor lambda”, which subscribes to the log groups of the monitored Lambda functions with a subscription filter so that it is triggered only by monitoring data. The “monitor lambda” can then send the received data to Elasticsearch (directly, or indirectly through a Kinesis or Firehose stream) to be queried and analyzed later.
Since CloudWatch invokes the “monitor lambda” with the Event invocation type (the other types are RequestResponse and DryRun), AWS retries failed invocations a few times and, if a DLQ is specified for the “monitor lambda”, then puts the monitoring data of invocations that still fail into the specified SQS queue automatically. Thus, we don’t lose any monitoring data. Finally, another Lambda function, triggered by a scheduled CloudWatch event, polls the DLQ and invokes the “monitor lambda” with the polled data.
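The logging step described above can be sketched roughly as follows. This is not Thundra’s actual implementation: the `MONITOR_DATA` marker, the field names, and the `MonitorLogSketch` class are assumptions made for illustration, and all values are treated as strings to keep the example short. The log sink is a plain `Consumer<String>` so that the sketch runs without the AWS libraries.

```java
import java.util.Map;
import java.util.function.Consumer;

// Illustrative sketch: write monitoring data as one structured JSON log line.
// The marker and field layout are hypothetical, not Thundra's real format.
final class MonitorLogSketch {

    // Hypothetical marker that a subscription filter pattern could match.
    static final String MARKER = "MONITOR_DATA";

    // Builds a single log line: the marker followed by a flat JSON object.
    static String toLogLine(String type, Map<String, String> fields) {
        StringBuilder json = new StringBuilder("{\"type\":\"")
                .append(escape(type)).append('"');
        for (Map.Entry<String, String> e : fields.entrySet()) {
            json.append(",\"").append(escape(e.getKey())).append("\":\"")
                .append(escape(e.getValue())).append('"');
        }
        json.append('}');
        return MARKER + " " + json;
    }

    // Escapes the characters that JSON does not allow inside a string.
    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    // Hands the line to the log sink; no network call happens in the request.
    static void emit(Consumer<String> logSink, String type, Map<String, String> fields) {
        logSink.accept(toLogLine(type, fields));
    }
}
```

Inside a handler, the sink would presumably be the Lambda logger, for example `MonitorLogSketch.emit(context.getLogger()::log, "metric", fields)`. Because the function only writes a log line, a receiver outage cannot lengthen or fail the request.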
The following diagram shows our async monitoring architecture: