Posted Nov 2019 in Observability

Introducing Smart Dashboard with AI-Driven Anomaly Detection

Written by Emrah Samdan

Former VP of Product @Thundra

The serverless community is living a dream week with all the pre-re:Invent announcements, such as FIFO support with SQS and runtime support for Java 11, Python 3.8, and Node.js 12. These are surely just teasers for the big updates to be rolled out in two weeks. In the meantime, Thundra developers are also very busy with updates that will radically change our definition and value proposition. Today, I’m going to walk through a kind of pre-re:Invent feature for Thundra that will tease you for the bigger update.

Since its inception, Thundra has been helping its customers with flexible querying on its rich observability data. Until now, we deliberately avoided providing a static dashboard that would show some things but hide many others at the same time. This is why we are “late” with a dashboard: we were building something to make it more useful. Let’s reveal it: we are very happy to introduce a dynamic dashboard with active anomaly detection on the most used metrics.


What’s in it?

Our dashboard aims to make the macro metrics clear while making the tiny details accessible. Let’s go over each module in our dashboard and the opportunities it gives to Thundra customers. 

We let you look at issues project by project thanks to our project selection drop-down list. You can see macro metrics like error count, cold starts, health, and cost, along with how each one changed with respect to the previous period.

Right next to it, there’s the Alerts & Insights area, which lists the violated alert policies and insights. Insights are the next cool thing: they let you know about anomalies even if you didn’t set any alert for them. For example, we’ll let you know when invocation duration doubles or more. Similarly, we’ll warn you about an unusual jump in invocation count, error count, or cost. Serverless is built for unexpected jumps, but sometimes it’s worth checking whether there is an unusual issue. We’ve got you covered!
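To make the “doubles or more” example concrete, here is a minimal sketch in Python. It is not Thundra’s implementation: the function name `jumped` and the `factor` parameter are hypothetical, and the rule simply compares the current period’s value with the previous period’s.

```python
def jumped(previous: float, current: float, factor: float = 2.0) -> bool:
    """Return True when `current` is at least `factor` times `previous`.

    Hypothetical helper for illustration; the same check could apply to
    average duration, invocation count, error count, or cost.
    """
    if previous <= 0:
        return False  # no meaningful baseline to compare against
    return current >= factor * previous


# Average duration went from 120 ms to 250 ms: more than double.
print(jumped(120.0, 250.0))
```

A real insight engine would likely need more than a fixed factor, for example a minimum traffic volume so that a jump from 1 to 2 invocations doesn’t raise a flag.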

I interviewed many users about what they want to check first when they have a static dashboard. The two most prominent answers were unhealthy functions and costly functions. For this reason, we let you switch between the most unhealthy and the most costly functions. As seen in the image below, you can see the functions that cost you the most, with invocation count and duration giving a clue about the root cause.


Let’s save the most exciting part for last. We are proud to introduce the anomaly detection feature, which shows the discrepancies in your data trend for invocation count, error count, and average duration, either for all functions in a project or for a particular function. In serverless, it’s normal to see ups and downs because it’s mostly used for unexpected load, right? But even the most unusual graphs have a trendline if you look closely enough or from far enough away. We provide a trendline analysis for the metrics of your functions to detect the points that fall outside the general trendline. On a good day, you won’t see any data going out of its boundaries, as you can see in the image below.


However, you may see some anomalies if your function goes beyond its normal behavior, as in the image below. As you can see, invocation count went slightly outside the trendline at a particular time.


It’s fair to ask how we calculate the anomalies. We basically run a trend analysis on the data using two variables: “period” and “roll-up”. “Period” is the recurring pattern of your data. For example, the invocation count of your function can go very low on weekdays but crazy high on weekends as a normal recurring pattern. In this case, we advise setting the period to “week”. “Roll-up” is the interval over which we aggregate the data. When you select 5 minutes as the roll-up interval, we aggregate the metrics in each 5-minute window into a single data point and then decide whether that point is an anomaly. Using benchmarks that we run on our anonymized data, we provide default values for period and roll-up for each selection in our global date-time picker. However, we don’t do any learning to discover your period and roll-up intervals, so we advise switching to other values when your trend differs from the benchmark, which is perfectly normal.
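The interplay of roll-up and period can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Thundra’s actual algorithm: the function names, the `tolerance` parameter, and the “mean plus or minus a few standard deviations” band are all hypothetical stand-ins for whatever trend analysis runs behind the dashboard. Samples are rolled up into fixed windows, and each window is then compared with earlier windows at the same position within the period.

```python
from statistics import mean, pstdev


def roll_up(samples, rollup_seconds):
    """Sum (timestamp, value) samples into fixed windows keyed by window start."""
    buckets = {}
    for ts, value in samples:
        start = ts - ts % rollup_seconds
        buckets[start] = buckets.get(start, 0) + value
    return dict(sorted(buckets.items()))


def find_anomalies(buckets, period_seconds, tolerance=3.0):
    """Flag windows that deviate from earlier windows at the same phase of the period."""
    history = {}  # phase within the period -> values seen earlier at that phase
    anomalies = []
    for start, value in buckets.items():
        past = history.setdefault(start % period_seconds, [])
        if len(past) >= 2:  # need a little history before judging
            center = mean(past)
            # Floor the spread at 1.0 so flat history doesn't give a zero-width band.
            band = tolerance * max(pstdev(past), 1.0)
            if abs(value - center) > band:
                anomalies.append(start)
        past.append(value)
    return anomalies


# Four one-hour periods of steady traffic with one spike in the last hour.
ROLLUP, PERIOD = 300, 3600  # 5-minute roll-up, 1-hour period
spike_at = 3 * PERIOD + 600
samples = [(t + 1, 100 if t == spike_at else 10) for t in range(0, 4 * PERIOD, ROLLUP)]
print(find_anomalies(roll_up(samples, ROLLUP), PERIOD))
```

With a period of “week”, the same comparison would line up Saturday windows with earlier Saturdays, so a busy weekend is judged against other weekends rather than against quiet weekdays.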

What’s coming next!

We’re very happy to provide our smart dashboard, which draws your attention to what’s important as easily as possible and detects anomalies in your data trend. However, this is not our final destination, either for dashboards or for anomaly detection. In the coming months, we’ll work on letting our customers create custom dashboards using Thundra-generated data or custom metrics that they send to us. This means that you will be able to draw a line or pie chart for a custom metric that you send us. For example, you will be able to chart memory usage or invocation duration while your function processes the data of an important customer. Plus, you’ll be able to detect anomalies on this chart by exporting it to a custom dashboard, and share it with your colleagues. We, as Thundra, believe that a tool is only as useful as it is configurable for developers. Today’s announcement can be seen as a starting point for putting this vision into practice. Lots of crazy announcements from Thundra, as a tool to run fast and safely with serverless, are coming next week. Stay tuned for our updates and swing by our booth #627 at AWS re:Invent.

You can either sign up to Thundra directly or subscribe to our service from AWS Marketplace. Please hit us with new questions over Twitter (@thundraio) or reach out to us over email.