In one of our recent talks with Serkan, our newly announced CEO, he told me how Thundra actually got started. He wrote the first Java library of Thundra because there was no monitoring solution for AWS Lambda for Java back in the summer of 2017. We are both amazed that Thundra is now a well-established company and that the product can monitor not only Java but also Node.js, Python, Golang, and .NET functions. This makes Thundra the only serverless monitoring tool that supports .NET, and the monitoring tool with agents for the most AWS runtimes. We invested so much in monitoring that we can now instrument and monitor AWS Lambda functions with a broad set of capabilities. Today, we can proudly and comfortably say that we are the most capable tool for local tracing of AWS Lambda functions. In the meantime, we didn’t update our web console much. Instead, we developed integrations with Splunk and Honeycomb to let the serverless community make use of the great visualization capabilities of these tools. During those partnerships, we came across even more use cases and broadened our vision.
Now, I am super excited to announce that we are introducing our new UI, which will replace our existing UI. Say goodbye to the old UI (left) and embrace the new one (right) from now on.
DISCOVERING "JOBS TO BE DONE" BY THUNDRA
Over the past months, especially since re:Invent 2018, I concentrated heavily on customer research. I spent quite some time understanding customer use cases and needs for serverless monitoring. During my talks with potential customers, I was also looking for a way to build a disruptive product, and I came across the “Jobs to be done” framework.
According to this approach, who your customers are matters less than which jobs they want to get done with your product. Sometimes you build a feature that you think is useful, and it turns out to mean something completely different once it meets your customers’ actual needs. This example shows how Intercom adapted their maps into something that actually gets the job done.
During my talks with customers, I kept this framework in mind. I spoke with engineering managers, developers, operations people, and non-engineer C-level people, and noted their interesting use cases. My first impression was that the serverless market is still young, but the challenges with monitoring are undeniable, as people expect the same experience they have in non-serverless environments. Further, I noticed that they are all asking for some fundamental jobs to be done by any tool like Thundra. I believe we have gotten some of the jobs I discovered done. The jobs that are not done yet feed into our roadmap for this year. So, let me start with the jobs Thundra will get done with this release (note that “I” refers to the Thundra user in the following jobs):
After I develop my function locally with local or mocked components, when I test my Lambda function with its connected components in a pre-production environment, I want to evaluate it as quickly as I could with CloudWatch and check whether there is a performance issue or an error with the actual components on the AWS platform, so that I can work on the issues with the function or component and test again on AWS.
After I get alerted about one or more errors while observing the health of my serverless architecture, I want to understand the root cause and/or the problematic downstream services as quickly as possible, so that I can decide whether to turn my attention to the Lambda function or to the system resource.
After I see spikes in the invocation duration of one or more Lambda functions while observing the health of my serverless architecture, I want to understand whether this is caused by a series of cold starts or by a bottleneck that has just occurred in one of the downstream services.
While making a post-mortem analysis of a previous incident in my serverless architecture, I want to check the detailed metrics that changed during the incident, in order to have more indicators and prevent the next incident by keeping an eye on abnormal metrics.
GETTING JOBS DONE! BUT HOW?
We developed our new look with the above jobs in mind. First of all, I noticed that even though people have different expectations of Thundra, they generally prefer to look at their serverless stack from one or two angles that match those expectations. The bad news for us was that those angles were different for everyone. Some want to check a particular service and keep an eye on the duration in that service, while others are only interested in the cost over a period and the cost they will see on their AWS bill. And there were others who want to use Thundra during development to test the integrity and robustness of their system before going to production. For them, neither cost nor duration was important; they were there to quickly check that everything works together and get back to coding to fix any bottleneck. Thus, dashboards that present information from an angle that we choose would not get all the jobs done.
Considering all of this, we decided to come up with a query language to explore serverless functions. This gives everyone the flexibility to build their own queries and look at their system with Thundra from their own angle. We provide some predefined queries to give you useful ideas based on our experience with serverless architectures. Our query language is a variation of SQL and it is easy to get used to. In case you need some assistance, you can use the query helper right next to the save button in the query bar. When you think you have created the right angle to look at your serverless stack, you can save this query and even make it your default, so that you see this view when you first enter Thundra. In the following picture, a Thundra user is listing the functions in the “user” service sorted by cost.
In development, the most important thing is to see the last invocation and to understand the impact of your latest change on your function.
For this, we are providing the new invocations list, which includes a “Latency Breakdown”. Assume that you just changed something in your code or in your configuration related to the DynamoDB table that your function interacts with. You can see the effect by comparing the duration and latency breakdown with those of the previous invocation. As the following fictional view shows, the effect of optimizing the DynamoDB interaction is visible even without diving into an invocation.
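To make the comparison concrete, here is a minimal Python sketch of what comparing the latency breakdowns of two consecutive invocations amounts to. The dictionary layout, the resource names, and the numbers are all hypothetical, in the same spirit as the fictional view; this is an illustration, not Thundra's data model or API.

```python
def breakdown_delta(previous, current):
    """Return the per-resource change in milliseconds between two invocations."""
    resources = sorted(set(previous) | set(current))
    return {name: current.get(name, 0) - previous.get(name, 0) for name in resources}


# Hypothetical latency breakdowns: milliseconds spent per resource.
before_change = {"DynamoDB": 180, "SNS": 40, "function code": 60}
after_change = {"DynamoDB": 95, "SNS": 42, "function code": 58}

delta = breakdown_delta(before_change, after_change)
total_delta = sum(after_change.values()) - sum(before_change.values())

for name, change in delta.items():
    print(f"{name}: {change:+d} ms")
print(f"total duration: {total_delta:+d} ms")
```

In this made-up example, almost all of the improvement in total duration comes from the DynamoDB share, which is the kind of conclusion the latency breakdown is meant to let you draw at a glance.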
In two of the jobs mentioned above, people simply wanted to understand quickly what the root cause of an error or bottleneck is. So, we experimented with many different views to discover which one shows this best. In the end, we concluded that a heatmap is the most general and comprehensive view. In our heatmap, you will be able to see the distribution of your invocations over time. A darker cell, for example, indicates that there were more invocations at that time in that duration interval.
This view is particularly useful when you want to learn about outliers. Just select the area that you think is interesting. The resource usage, count, and duration charts and the invocation table below the heatmap will be updated with your selection. Thanks to this, you can understand why the outlier happened; it might be because of a cold start, an error, a throttle on the resource, or a timeout. When you want to see a single problematic invocation, just click on it in the invocation table and examine it in the greatest detail with Thundra’s unique trace chart. The following picture shows the resource usage for the selected outlier region and reveals that it contains one single erroneous invocation.
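For readers who want to see the idea behind the heatmap in code, the following Python sketch groups invocations into (time, duration) cells and then filters the invocations that fall into a selected region. The `Invocation` fields, the bucket sizes, and the sample data are assumptions made for illustration only; they do not describe how Thundra implements its heatmap.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Invocation:
    start_s: float      # seconds since the start of the observed window
    duration_ms: float  # how long the invocation took
    cold_start: bool = False
    error: bool = False


def heatmap_cells(invocations, time_bucket_s=60, duration_bucket_ms=100):
    """Count invocations per (time bucket, duration bucket) cell."""
    cells = Counter()
    for inv in invocations:
        cell = (int(inv.start_s // time_bucket_s),
                int(inv.duration_ms // duration_bucket_ms))
        cells[cell] += 1
    return cells


def select_region(invocations, start_s, end_s, min_ms, max_ms):
    """Return the invocations that fall inside a selected rectangle."""
    return [inv for inv in invocations
            if start_s <= inv.start_s < end_s and min_ms <= inv.duration_ms < max_ms]


# Hypothetical sample data: three fast invocations and one slow, failing one.
sample = [
    Invocation(5, 120),
    Invocation(20, 130),
    Invocation(70, 110, cold_start=True),
    Invocation(75, 950, error=True),
]

cells = heatmap_cells(sample)
outliers = select_region(sample, 60, 120, 500, 1000)
```

A cell with a higher count would be drawn darker, and the list returned by `select_region` corresponds to the rows you would see in the invocation table after selecting an area.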
WHAT’S COMING NEXT?
We built this version of Thundra based on the jobs that the serverless community wants to get done. Similarly, the other jobs that the serverless community wants to get done help us create our roadmap for this year.
My research revealed that people want to hire a serverless monitoring solution to track their requests through asynchronous event-driven architectures, on top of the local tracing capabilities that we currently provide. For this reason, we are now working really hard to provide the serverless community with an architectural view of their serverless stack and the ability to track a transaction that spans several Lambda functions and resources.
As you may have noticed, one of the job definitions starts with being alerted. For now, people are okay with CloudWatch alerts, but the consensus is that they will want to set up alerts for more specific cases in the future. We now have customizable query support that lets you query, for example, the invocations of a particular function that processed a particular user’s data, were cold starts, and took more than a second. Alerts built on top of this flexibility will serve you much better than alerting on every single error, which eventually creates alert fatigue. You can expect this advanced alerting from us in the coming months.
In closing, we are happy to present our new look and functionality, which we think will help serverless people understand their serverless stack better. You can sign up for our new environment and start experiencing the new UI. If you want to play with Thundra before signing up, you can explore our demo environment. We are very curious about your comments and feedback. Join our Slack channel, send us a tweet, or contact us through our website!
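As a rough illustration of why a specific condition produces fewer alerts than alerting on every error, here is a small Python sketch. Everything in it is hypothetical: the `Invocation` fields, the function and user names, and the `should_alert` helper are invented for this example and do not reflect the upcoming alerting feature or Thundra's query language.

```python
from dataclasses import dataclass


@dataclass
class Invocation:
    function_name: str
    user_id: str
    cold_start: bool
    duration_ms: float
    error: bool = False


def matches_rule(inv, function_name, user_id, min_duration_ms=1000):
    """True for slow cold starts of one function handling one user's data."""
    return (inv.function_name == function_name
            and inv.user_id == user_id
            and inv.cold_start
            and inv.duration_ms > min_duration_ms)


def should_alert(invocations, function_name, user_id, threshold=1):
    """Alert only when at least `threshold` invocations match the rule."""
    hits = [inv for inv in invocations if matches_rule(inv, function_name, user_id)]
    return len(hits) >= threshold, hits


# Hypothetical invocations of made-up functions.
recent = [
    Invocation("get-user", "user-42", cold_start=True, duration_ms=1400),
    Invocation("get-user", "user-42", cold_start=False, duration_ms=1600),
    Invocation("get-user", "user-7", cold_start=True, duration_ms=2000),
    Invocation("save-user", "user-42", cold_start=False, duration_ms=90, error=True),
]

alert, hits = should_alert(recent, "get-user", "user-42")
errors = [inv for inv in recent if inv.error]
```

Here only the first invocation matches the specific rule, while an alert on every error would fire for an unrelated function.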