
POSTED Oct 2020

Three Types of Serverless Monitoring Metrics on AWS

Written by Ismail Egilmez

Business Development Manager @Thundra


When a serverless application performs poorly or errors out, you need to troubleshoot it, and that is not as easy as it was with legacy systems. Because of Lambda's black-box nature, your code runs on hardware only for the duration of each request, and you never know which hardware it runs on. This makes it difficult to keep serverless applications healthy. Knowing and utilizing the metrics that matter most therefore plays an important role in maintaining the health of your applications. If you know which metrics to track and how to analyze them in aggregate, you can easily lift the veil of mystery from your applications when production problems appear. In this article, we'll explore the serverless metrics that are crucial to your application's health.

What are those three essential metrics?

Let's get started by defining what we will discuss in this article. We'll focus on three general categories of metrics: operational, load-related, and business-specific. Each represents a part of your application's execution process, and you'll need to consider all of them to build a holistic view of your application's performance in production.

Operational Metrics

Operational metrics are the most straightforward category: they track the operational performance of your serverless functions by comparing the results of calls. These metrics let you make sure your application runs without problems by allowing you to set alert thresholds. Aggregate error count and aggregate execution count are the two most important operational metrics in AWS Lambda, and the ones we focus on here.

The aggregate error count is, as its name implies, the number of errors in your AWS Lambda function over a specific time period. You might think this alone is enough to judge the success of your functions in production, but you do not get the holistic view without the aggregate execution count: the number of executions within the same time period. By comparing the aggregate execution count to the aggregate error count, you can build a generalized picture of the reliability of your serverless application's component functions.
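AWS publishes both counts for every function as the `Invocations` and `Errors` metrics in the `AWS/Lambda` CloudWatch namespace, so the comparison above can be sketched with boto3. This is a minimal sketch, not a production monitor; it assumes you have AWS credentials configured, and the 24-hour window is an illustrative choice.

```python
from datetime import datetime, timedelta, timezone

def error_rate(errors, invocations):
    """Aggregate error count divided by aggregate execution count."""
    return errors / invocations if invocations else 0.0

def fetch_lambda_error_rate(function_name, hours=24):
    """Pull both aggregate counts from CloudWatch for one function.

    Requires AWS credentials; boto3 is imported lazily so the pure
    helper above works anywhere.
    """
    import boto3
    cloudwatch = boto3.client("cloudwatch")
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)
    sums = {}
    for metric in ("Invocations", "Errors"):
        resp = cloudwatch.get_metric_statistics(
            Namespace="AWS/Lambda",
            MetricName=metric,
            Dimensions=[{"Name": "FunctionName", "Value": function_name}],
            StartTime=start,
            EndTime=end,
            Period=hours * 3600,  # one datapoint covering the whole window
            Statistics=["Sum"],
        )
        sums[metric] = sum(p["Sum"] for p in resp["Datapoints"])
    return error_rate(sums["Errors"], sums["Invocations"])
```

An error rate creeping upward while the execution count stays flat is exactly the kind of signal this comparison is meant to surface.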

Load-related Metrics

Load-related metrics represent the load on the hardware that your application runs on. For a legacy application, this would obviously be something like RAM and CPU usage or network bandwidth. These are largely meaningless for serverless applications, because you do not maintain the hardware, but there are still load-related metrics we can look at.

From a serverless point of view, we look at metrics related to the duration of execution, which can be measured in several ways. The first is the average duration of execution: the average execution time, in milliseconds, for each of your application's functions. It can be calculated using the timing information the AWS Lambda ecosystem provides, such as the context object available within each invocation. This tells us how long, on average, our serverless functions run when called, and it helps us gauge resource usage as it relates to the overall timeout to which Lambda functions must adhere.
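As a sketch of how a handler can observe its own duration and its headroom against the timeout, the snippet below times the business logic itself and reads `context.get_remaining_time_in_millis()`, which the Lambda Python runtime provides on the context object. The handler body and return value are illustrative assumptions.

```python
import time

def handler(event, context):
    """Minimal sketch: log execution duration and headroom vs. the timeout."""
    start = time.monotonic()

    # ... your business logic would go here ...

    elapsed_ms = (time.monotonic() - start) * 1000
    # get_remaining_time_in_millis() is part of the Lambda runtime's
    # context object: milliseconds left before the function times out.
    remaining_ms = context.get_remaining_time_in_millis()
    print(f"duration={elapsed_ms:.1f}ms remaining={remaining_ms}ms")
    return {"statusCode": 200}
```

Logging both numbers on every invocation gives you the raw data behind the average-duration metric without waiting for CloudWatch aggregation.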

If you want to understand the aggregate performance of your functions, a single average execution time can be misleading, because function durations can vary widely. You should instead look at execution times statistically, in terms of their distribution. We do this using percentile thresholds such as p90 and p99: p90 gives you the time within which 90 percent of your function calls complete, and p99 extends this to 99 percent. These values tell you how close you are getting to critical limits, such as the timeout value of your Lambda functions. If, for example, the p99 time for your serverless application is very close to your Lambda timeout, you are likely to see timeout errors from your serverless functions as they execute, giving you critical information on where to start your investigation.
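To make the difference between an average and a tail percentile concrete, here is a small sketch that computes p90 and p99 from a list of durations using only the standard library. The sample data is invented to show a workload that looks fine on average but has a slow tail.

```python
import statistics

def latency_percentiles(durations_ms):
    """Return (p90, p99) for a list of execution times in milliseconds."""
    # n=100 yields 99 cut points: index 89 is the 90th percentile,
    # index 98 is the 99th.
    cuts = statistics.quantiles(durations_ms, n=100, method="inclusive")
    return cuts[89], cuts[98]

# Example: mostly-fast calls with a slow tail.
durations = [100] * 950 + [900] * 40 + [2900] * 10
p90, p99 = latency_percentiles(durations)
# For this sample: p90 = 100.0, p99 = 920.0 -- the average would hide
# the 2.9-second outliers that p99 starts to expose.
```

If your function's timeout were 3 seconds here, the raw maximum (2900 ms) is already brushing against it even though p90 looks perfectly healthy.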

The number of throttles is another metric to be aware of. When the execution frequency or concurrency of your AWS Lambda functions increases too much, the rate at which they are called is throttled. This metric simply counts the throttles seen by your Lambda functions as they execute. An increase indicates a growing risk that individual calls will be throttled due to excessive concurrency, system load, or the other factors covered by Lambda's throttling behavior.
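Because throttles often appear suddenly under load spikes, it is worth alarming on them rather than eyeballing dashboards. The sketch below builds a CloudWatch alarm on the `Throttles` metric; the alarm name, threshold, and SNS topic are illustrative assumptions, and creating the alarm requires AWS credentials.

```python
def throttle_alarm_params(function_name, sns_topic_arn, threshold=10):
    """Build parameters for a CloudWatch alarm on Lambda throttles.

    The alarm name and default threshold are illustrative assumptions.
    """
    return {
        "AlarmName": f"{function_name}-throttles",
        "Namespace": "AWS/Lambda",
        "MetricName": "Throttles",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 300,  # evaluate over five-minute windows
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
        "TreatMissingData": "notBreaching",  # no throttles logged = healthy
    }

def create_throttle_alarm(function_name, sns_topic_arn, threshold=10):
    """Create the alarm in CloudWatch (requires AWS credentials)."""
    import boto3
    boto3.client("cloudwatch").put_metric_alarm(
        **throttle_alarm_params(function_name, sns_topic_arn, threshold)
    )
```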

Finally, we need to look at measures of infrastructure needs, like a function's provisioned concurrency, and capital-driven metrics like execution cost. These metrics are driven by the load your serverless application puts on the AWS Lambda infrastructure. Provisioned concurrency controls how many function instances can execute at once, with executions above the provisioned concurrency threshold being subject to throttling by AWS. Throttling increases the response time of your functions, negatively impacting customer experience. On the financial side, capital-driven metrics like execution cost, derived from each execution's billed duration and memory allocation, help you control your spending. This becomes very important as your application scales up, as sudden peaks in Lambda function execution requests lead directly to increases in your AWS bill.
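Since Lambda bills by GB-seconds plus a per-request fee, a rough cost estimate can be derived from exactly the metrics discussed above: execution count, average duration, and memory size. The prices below are illustrative us-east-1 on-demand figures at the time of writing; always check current AWS pricing, and note that the monthly free tier is ignored here.

```python
# Illustrative on-demand prices (assumed us-east-1 figures) -- verify
# against current AWS pricing before relying on the output.
PRICE_PER_GB_SECOND = 0.0000166667
PRICE_PER_REQUEST = 0.0000002  # $0.20 per million requests

def estimate_lambda_cost(invocations, avg_duration_ms, memory_mb):
    """Rough cost estimate from execution count, duration, and memory."""
    gb_seconds = (
        invocations * (avg_duration_ms / 1000.0) * (memory_mb / 1024.0)
    )
    return gb_seconds * PRICE_PER_GB_SECOND + invocations * PRICE_PER_REQUEST
```

For example, a million 100 ms invocations at 1024 MB work out to roughly $1.87 under these assumed prices, which makes it easy to see how a traffic spike translates directly into the bill.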

Business-Specific Metrics

Business metrics are specific to every application, because every business is unique. Application developers generally build these business-specific metrics as custom metrics and add them to their instrumentation logic. They are limited only by the creativity of the developers and the needs of the operation. These metrics use custom information built on top of your application's feature set, providing meaningful measures of your application's performance as it relates to your critical user-facing functionality.

The objective is to pinpoint and analyze the metrics most critical to your application's health, in terms that can be trusted, analyzed, and monitored. For example, if you were writing a serverless payment processor, you might want to track the number of transactions recorded in your application per day so that you can get a feel for the volume of your business. These metrics will be tied to the business goals of your application, and when coupled with automated infrastructure metrics they can help you identify problem areas in your application, stress points in your architecture, or business validation failures. Depending on your needs, the sky's the limit in terms of metrics that matter to your business.

Utilizing the Native Tools

Once you decide on a set of metrics for your serverless application, you need to define your toolset. When we talk about native tools on AWS, CloudWatch of course comes to mind first because of its capabilities for monitoring the Lambda functions in serverless applications. It provides key metrics, viewable both at a per-service level and at a cross-service level spanning multiple AWS resources, and it defines a number of highly valuable metrics for your serverless application natively.

CloudWatch provides easy access to hard statistics on your functions, including invocation count, error count, average duration, and throttle count. CloudWatch also gives you the ability to build custom metrics, with simple API calls populating CloudWatch with the metrics that are important to your application. CloudWatch Logs Insights lets you go further, using analysis of structured log data to generate metrics from your application logs without any extra code. You can view these statistics for a single AWS region or across multiple regions. CloudWatch provides many of the crucial operational and load metrics that will drive your application's health.
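One concrete way to get metrics out of structured logs is CloudWatch's Embedded Metric Format (EMF): a JSON log line with an `_aws` block that CloudWatch extracts into metrics automatically, with no API calls from your function. The sketch below prints such a record; the namespace, dimension, and metric names are illustrative assumptions.

```python
import json
import time

def emit_metric(name, value, service, namespace="MyApp/Business"):
    """Print a CloudWatch Embedded Metric Format (EMF) record.

    When this line lands in a Lambda function's logs, CloudWatch
    extracts the metric automatically. The namespace and the
    'Service' dimension are assumed names for illustration.
    """
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [["Service"]],
                "Metrics": [{"Name": name, "Unit": "Count"}],
            }],
        },
        "Service": service,  # dimension value, referenced above
        name: value,         # the metric value itself
    }
    print(json.dumps(record))
    return record
```

Calling `emit_metric("Transactions", 3, "checkout")` from inside a handler is all it takes; CloudWatch turns each printed record into a datapoint.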

Native Tools aren’t always enough!

Still, native tools like CloudWatch are limited in scope. Although they provide foundational monitoring power, they are not comprehensive enough to capture all of the metrics needed to keep application health at the desired level. It is always challenging to get an aggregated overview of your application with native tools, because they mainly operate at the function level. CloudWatch metrics in particular can be limiting if you want to monitor behavior, as functions are presented irrespective of the control flow that invoked them. The result is an incomplete view of the metrics that are important to your application.

Custom Business Metrics Make a Difference

Native tools provide a good amount of visibility into the current execution. With the available operational and load metrics, you can generate a general view of your application's current execution characteristics. But gathering metrics beyond these is usually vital to the business's success. Think of an e-commerce payment processor serving many countries: you'll need to track the transactions taking place in each country individually when provisioning capacity. This is specific to the business performance and success of the application, so it cannot be covered by the metrics native tools provide out of the box.
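For the per-country scenario above, one option is to publish a custom metric with a country dimension via CloudWatch's `put_metric_data` API. This is a sketch under assumed names: the namespace, metric names, and dimension are all illustrative, and the publish step requires AWS credentials.

```python
def transaction_metrics(country, amount):
    """Build per-country MetricData entries (all names are assumptions)."""
    dims = [{"Name": "Country", "Value": country}]
    return [
        {
            "MetricName": "TransactionCount",
            "Dimensions": dims,
            "Value": 1,
            "Unit": "Count",
        },
        {
            "MetricName": "TransactionAmount",
            "Dimensions": dims,
            "Value": amount,
            "Unit": "None",
        },
    ]

def record_transaction(country, amount):
    """Publish one transaction's metrics (requires AWS credentials)."""
    import boto3
    boto3.client("cloudwatch").put_metric_data(
        Namespace="PaymentProcessor/Business",  # assumed namespace
        MetricData=transaction_metrics(country, amount),
    )
```

With the `Country` dimension in place, CloudWatch can graph and alarm on each country's transaction volume independently.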

Once you have defined the business metrics that matter most, you still need to stream them into your dashboard. In a native-only approach, your metrics will be limited to the tools available within AWS, which speak more to the general characteristics of Lambda execution than to any particular custom measurement. Thundra lets you consolidate all of your resources into its Smart Dashboard, helping you get the metrics you care about from production, complete with monitoring and alerting.

Lock, Stock, and Barrel with Thundra

When the AWS native monitoring tools are not enough, third-party tools come to the rescue. Thundra provides additional functionality on top of the metrics your serverless application generates in CloudWatch, and serves the metrics that are important for maintaining application health on its live dashboard. Its querying ability gives easy access to the metrics you need.

The AI-driven anomaly detection functionality in Thundra's Smart Dashboard provides analytical insights into your application. Without any configuration, you can get granular visibility through Thundra's fully automated alerts and insights on the metrics critical to your application. Thundra also lets you define the business metrics you need, combining all of the metrics reported by your functions into a single, easy-to-use dashboard.

Monitoring applications like a pro

Nobody wants things to go wrong, but everything fails all the time. Monitoring is vital for figuring out incidents, especially with serverless applications. Native tools fall short on business-specific metrics, which are the most important of the three essential metric types (business-specific, operational, and load-related). Going with native monitoring solutions like CloudWatch is always the easiest path, but many of the metrics they provide fall short when it comes to findability, visibility, and deep, detailed views. Thundra's Smart Dashboard helps you add monitoring and alerting on top of AWS's robust support for operational and load metrics, so you can stay posted on the metrics that are important to your applications. With Thundra you can monitor your applications like a pro and maintain a high level of health, with all the information you need to troubleshoot aggregated across operational, load-related, and business-specific metrics.