
POSTED Oct 2018 · IN Observability

Thundra in action: Spotting a problem in AWS Lambda

Written by Emrah Samdan

Former VP of Product @Thundra


When you monitor an ordinary application with an ordinary APM tool, you have plenty of options for tracing your code. With serverless, you may feel less comfortable, because you don’t have the same tools for gathering information about the performance of your functions. Thundra addresses exactly this gap: serverless functions can be instrumented and traced just like ordinary architectures. In this post, you will see how Thundra can help you spot a source of latency in AWS Lambda caused by JDBC behavior under the hood.

The Experiment

In our experiment with Thundra, we monitor a function called “user-getAll”, which retrieves users with the given IDs from a PostgreSQL table. Here is the code of that function:


[Image: user-getAll function]
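The actual handler code appears only as an image in the post, so here is a minimal, hypothetical sketch of its shape: a handler that takes a list of user IDs and returns the matching users. The class and method names are illustrative, not the demo’s real code, and the body is simplified so the sketch runs standalone:

```java
import java.util.List;

// Hypothetical sketch of the user-getAll handler (the real code is shown
// only as an image in the post; all names here are illustrative).
public class GetAllUsersHandler {
    public List<String> handleRequest(List<Long> ids) {
        // In the demo this delegates to a service layer backed by PostgreSQL;
        // here we just echo placeholder names to keep the sketch runnable.
        return ids.stream().map(id -> "user-" + id).toList();
    }
}
```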


We run the function with Thundra’s low-detail monitoring and asynchronous reporting, so we know Thundra itself adds no latency to the function. First, we request only one user, with user ID “1”. The execution of this function took 500 ms on average.


[Image: Execution asking for 1 user]

Since the query retrieves users by their IDs in a single statement, the average execution time should not grow with the number of users. This time we pass user IDs 1 through 9 to see whether that holds.


[Image: Execution asking for 9 users]

Oops, there is a problem: the invocation took 1300 ms on average. This seems weird. We clearly need some insight, and Thundra can provide it.

The “user-getAll” function delegates the request to the service layer to retrieve the users by their IDs. Let’s start tracing the service layer with Thundra. To make this happen, we switch on service-layer tracing with an environment variable, without modifying the application code and without redeploying the function. We make all public methods of all classes under “io.thundra.lambda.demo.service” traceable in the following way:


[Image: Enabling tracing of all methods of all classes under “io.thundra.lambda.demo.service”]
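The configuration itself appears only as a screenshot in the post. As a rough illustration, Thundra’s Java agent accepts instrumentation targets through an environment variable whose value is a class/method pattern; treat the variable name and syntax below as an assumption based on the post, not verbatim configuration:

```shell
# Illustrative only (variable name and syntax assumed): make all public
# methods of all classes under io.thundra.lambda.demo.service traceable.
thundra_agent_lambda_trace_instrument_traceableConfig="io.thundra.lambda.demo.service.*.*"
```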


The resulting trace chart shows that all of the invocation time is allocated to this service call. We also see in the logs that the users with the given IDs are retrieved. So, we are on the right path, but where do we look next?


[Image: Resulting trace chart after enabling tracing of service methods]


We need to look even closer at the “userService.getAll” method. In it, the user IDs are first validated, then any users already in the cache are returned from there, and the rest are retrieved from the database. There is a lot to check in this method.


[Image: UserService.getAll function]
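Since the method is shown only as an image, here is a minimal, hypothetical sketch of the cache-then-database pattern described above. All names and types are illustrative (the real service uses JPA entities, not strings), and the database is faked with a map so the sketch runs standalone:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of UserService.getAll: validate the IDs, serve what
// we can from a cache, and fetch the rest from the database.
public class UserService {
    private final Map<Long, String> cache = new HashMap<>();    // id -> user (simplified)
    private final Map<Long, String> database = new HashMap<>(); // stand-in for the repository

    public UserService(Map<Long, String> database) {
        this.database.putAll(database);
    }

    public List<String> getAll(List<Long> ids) {
        List<String> users = new ArrayList<>();
        for (Long id : ids) {
            if (id == null || id <= 0) continue;       // validate the ID
            String user = cache.get(id);               // try the cache first
            if (user == null) {
                user = database.get(id);               // fall back to the database
                if (user != null) cache.put(id, user); // populate the cache
            }
            if (user != null) users.add(user);
        }
        return users;
    }
}
```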


We need to trace this method line by line to see which part causes the latency. To achieve that, we add audit configuration for the target classes. Again, we are not changing any application code. With the new parameters, we enable line-by-line tracing and attach the source code of each line to the trace by setting the “auditLinesWithSource” parameter to true. See the following image.


[Image: Enabling line-by-line tracing with an environment variable]
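Again the screenshot is not reproduced here. As a rough illustration, the traceable pattern can carry extra parameters in brackets; “auditLinesWithSource” is named in the post, while the variable name, the bracket syntax, and the “traceLineByLine” parameter are assumptions:

```shell
# Illustrative only (variable name and bracket syntax assumed): trace the
# service classes line by line and attach each line's source to the trace.
thundra_agent_lambda_trace_instrument_traceableConfig="io.thundra.lambda.demo.service.*.*[traceLineByLine=true,auditLinesWithSource=true]"
```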


At this point, we should note that there is more you can do with these parameters, such as tracing local variables, return values, and method arguments. Let’s see what line-by-line tracing tells us:


[Image: Resulting trace chart of line-by-line tracing]


We discovered the guilty line! It is line 110 of UserService, which retrieves the users from the database. Let’s dive into the “userRepository.getAll” method now.


[Image: “getAll” function of the “UserRepository” class]
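The repository code is also shown only as an image; a hypothetical sketch of the single query it sends might look as follows. The JPQL string is an assumption consistent with the post’s description, and the EntityManager call is indicated in a comment since it needs a real persistence context to run:

```java
// Hypothetical sketch of UserRepository.getAll (the original is shown only
// as an image): one JPQL query with an IN clause over the given IDs.
public class UserRepository {
    // The single query we intend to send, regardless of how many IDs we pass:
    static final String GET_ALL_JPQL = "SELECT u FROM User u WHERE u.id IN (:ids)";

    // With a real EntityManager this would be roughly:
    //   em.createQuery(GET_ALL_JPQL, User.class)
    //     .setParameter("ids", ids)
    //     .getResultList();
}
```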


As expected, the method contains just one JPA query executed over the Hibernate EntityManager. The query retrieves the users with an “IN (:ids)” clause, so it should be independent of the number of users. Why, then, does the invocation time grow with the number of users? We need to look under the hood of JDBC, and Thundra can do this, too. Currently, you can track JDBC queries to PostgreSQL, MySQL, MariaDB, and Microsoft SQL Server. Enable the option to track JDBC queries as follows:


[Image: Enabling JDBC tracking with an environment variable]


After invoking the function with these parameters, we observe the following trace charts. Wait, we sent only one query to the database, right? The first query turns out to be the one we intended. The others, however, retrieve the cards of each returned user. We never asked for them, but Hibernate lazily loads each user’s cards with a separate query, adding latency to our function. This is Hibernate’s well-known “N+1 problem”.


[Image: Time span of our intended query]
[Image: Time span of one of the queries due to the N+1 problem]
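The query pattern can be simulated without a database: loading N users whose “cards” association is lazy issues 1 query for the users plus N more, one per user. The toy class below (all names illustrative, no Hibernate involved) just counts the queries:

```java
import java.util.List;

// Toy simulation of Hibernate's N+1 problem: one query fetches the users,
// then touching each user's lazy "cards" association fires one more query.
public class NPlusOneDemo {
    static int queryCount = 0;

    static List<Long> fetchUsers(List<Long> ids) {
        queryCount++;             // the single intended SELECT ... IN (:ids)
        return ids;
    }

    static void fetchCardsOf(Long userId) {
        queryCount++;             // one extra SELECT per user (lazy load)
    }

    static int queriesFor(List<Long> ids) {
        queryCount = 0;
        for (Long id : fetchUsers(ids)) {
            fetchCardsOf(id);     // accessing user.cards triggers a query
        }
        return queryCount;        // = 1 + N
    }
}
```

With 9 users this means 10 round trips to the database instead of 1, which matches the jump from roughly 500 ms to 1300 ms.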


As a solution, we need to fetch the users together with their cards in a single query, instead of letting the cards be loaded one user at a time. After implementing the fix, our trace chart looks as follows:


[Image: Trace chart after implementing the solution]
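One common way to implement this fix in JPA, assuming the User entity has a “cards” association (the entity and field names here are assumptions, not the demo’s actual code), is a fetch join so the cards arrive in the same SELECT:

```java
// Hypothetical fixed query: fetch each user's cards eagerly in the same
// SELECT, so no per-user lazy loads are issued afterwards. DISTINCT avoids
// duplicate User rows produced by the join.
public class UserRepositoryFixed {
    static final String GET_ALL_JPQL =
        "SELECT DISTINCT u FROM User u JOIN FETCH u.cards WHERE u.id IN (:ids)";
}
```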

Function invocation time drops from 1300 ms back to 500 ms. Note that, with the N+1 problem in place, retrieving more users would have made the invocation even longer than 1300 ms. Now we have a healthy function that will not take longer than expected and will cost less.


We spotted and solved Hibernate’s N+1 problem with Thundra. You can discover many more issues using Thundra’s integrations with Redis, HTTP, AWS SNS, AWS SQS, and more. There is no room for inefficiency in AWS Lambda with Thundra! Check our documentation to learn how you can experience Thundra in minutes.