Since its inception, Thundra has been helping the serverless community monitor and troubleshoot serverless applications. Starting from our own need for Java tracing, we are now happy to help hundreds of users of Node.js, Python, Go, and .NET. There have been many improvements along the way, but the following four are the ones we built together with our customers and that attract new customers the most:
- Architecture Overview: Providing visibility over your serverless architecture, enabling you to discover problems at a glance.
- Full Tracing: Tracking serverless transactions from entry-point APIs down to specific lines of code in local methods.
- Outlier Analysis: Understanding the root causes of outliers easily.
- Advanced Searching: Investigating your system with flexible queries that let you ask tough questions.
Today, we are very excited to announce a feature we can add to the list above: Alerting! After deep analysis with our customers, we designed and implemented our alerting feature, and we expect it to become the best in town for troubleshooting issues and reducing MTTR.
What’s in Thundra Alerting?
As you may know, we provide very detailed querying capabilities over your functions and invocations. For example, you can filter the invocations of a function as follows:
`tags.user.id="1" AND Duration>200 AND ColdStart=false`
This query filters the invocations in which your function processes the information of the user with id 1, the invocation duration is greater than 200 ms, and there is no cold start. You might use it to check whether something is wrong with your function while it processes the data of an important customer or handles a specific type of event. What if we told you that you can now save this query as an alert policy and let Thundra check it for you every minute?
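As a sketch of a variant you might save as a policy for the same VIP user, the query below would flag erroneous invocations instead of slow ones (the `Erroneous` field name is an assumption for illustration; check the query reference for the exact field your account exposes):

`tags.user.id="1" AND Erroneous=true`

Saved as an alert policy, this would notify you whenever that customer's requests start failing, without you having to re-run the search by hand.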
The use case above actually explains why we didn't build alerting earlier and what our alerting is capable of. We wanted to first develop a querying mechanism, and it received very positive reactions. Those reactions encouraged us even more to implement alerting on top of these queries.
With Thundra's new alerting, you can easily save such a query for a single function, for several functions, for all the functions in a project, or for all the functions in your Thundra account. This gives you the flexibility to target functions in the most convenient way.
Additionally, as you may know, Thundra has query support not only for invocations but also for functions. For example, you can write a query as follows:
`tags.serviceName=team AND EstimatedCost>0.01`
This query filters the functions whose serviceName tag is set to `team` and whose EstimatedCost in the given period exceeds a penny. With such a query, you can keep track of the cost of your functions over any time interval you wish: a minute, an hour, or a day.
From now on, you will be able to save such a query as an alert and keep an eye on the overall metrics of your system. You can say, for example: "I'd like to be alerted when my functions in the payment service get more than 10 cold starts in the last minute and this pushes their estimated cost above a cent." You can set up alerts on a periodic schedule ranging from "Every 1 minute" to "Every month," so you can check whatever you need with the flexibility your time-based SLAs demand.
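As a sketch, the payment-service example above could be expressed as a function query like the following (the `payment` service name and the `ColdStartCount` field are assumptions for illustration; the exact field names may differ in your account):

`tags.serviceName=payment AND ColdStartCount>10 AND EstimatedCost>0.01`

Saved with an "Every 1 minute" schedule, this policy would fire whenever both conditions hold within the last minute.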
We also provide the option to set severities for alert policies on your Lambda functions. You can define an alert policy with the severity `INFO` just to make sure that everyone hears about it. On the other hand, you can define it as `CRITICAL` if you want to emphasize the urgency of the situation in your organization when the policy is violated.
Battling Alert Fatigue
As the Thundra team, we know very well what it is like to be on call and handle multiple alerts at the same time. We know it is really annoying to be caught in an alert storm when you are already trying to resolve the issue. For such use cases, we have two solutions:
- You can proactively throttle the alert as soon as you receive the first notification. This option is especially useful when you know that your system will continue to violate the policy once it has been violated. You can throttle the alert for x minutes, hours, or days, according to your use case.
- You can temporarily disable an alert that is creating an alert storm when it starts bugging you. You can do this with one click on the Alert Policies page.
Making Alerts Actionable with Thundra!
Your hard job starts after you receive the alert event. You need to check what might have gone wrong and fix it as soon as possible. Thundra comes to your aid at that point as well. When you see a Thundra alert in your mailbox, in your Slack channel, or wherever you receive it, you can just click on it and see the events that violated the alert policy.
When you jump into an alert, you will see the sample results that caused it, and you can see all results by jumping to the function list or invocation list page. More importantly, you are armed with critical information that may help you resolve the issue even when you see it for the first time. In the following example, you can see an event that was triggered because there is one invocation for a specific user with id 1 whose duration is greater than 300 ms while it was not cold started.
In the bottom part, you can see the Statistics and Resource Usages sections. In Statistics, you can see that this function is also erroneous, because health is 0% for the events causing the alert. Normally, your function is cold started in 6% of all invocations, but this one was warm. When we check the resource usages, we see an interesting story: the time spent in AWS SQS is much higher for the invocations causing the alert. It dominated the invocation so heavily that PostgreSQL barely registered. Without checking anything else, you can tell there is a problem with AWS SQS that makes your function take longer than expected. With such options, we aim to reduce MTTR even more drastically, because you will have an initial idea of what actually went wrong.
We're happy to release Alerting Support v1 today. We have set up one default alert policy for all of our active customers, and from now on, every new signup will start with this default alert. In the coming days, we'll ship updates such as new notification channels like Opsgenie, PagerDuty, etc. We'll also make it possible to filter distributed traces and logs with our queries. So imagine: you'll be able to get alerted when there is an exception in the resources used in a trace and the whole duration of the chain of invocations exceeds a threshold. You won't have to wait long for any of this. We'll be happy to hear your ideas and needs to facilitate peaceful success in your serverless stack with Thundra. Please submit your requests to firstname.lastname@example.org or come to our Slack and let's chat. You can sign up for Thundra or watch the live demo to see Thundra in action.