
Thundra in action: Spotting a problem in AWS Lambda

Oct 23, 2018



When you monitor a traditional application with a traditional APM tool, you have plenty of options for tracing your code. With serverless, however, you may feel less comfortable because you lack the tools to gather information about the performance of your functions. Thundra addresses exactly that gap: serverless functions can be instrumented and traced just like ordinary architectures. In this post, you will see how Thundra helps you spot the cause of latency in AWS Lambda when the culprit is JDBC behavior under the hood.

The Experiment

In our experiment with Thundra, we monitor a function called “user-getAll”, which retrieves the users with the given IDs from a PostgreSQL table. Here is the code of that function:

 

user-getAll function
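The handler itself is shown as an image in the original post; in rough form it looks like the sketch below. All class, method, and payload names here are assumptions for illustration, not the actual demo code. The point is simply that the handler parses the requested IDs and delegates to a service layer.

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative sketch of a "user-getAll" style handler (names are hypothetical).
public class UserGetAllHandler implements RequestHandler<Map<String, String>, List<User>> {

    private final UserService userService = new UserService();

    @Override
    public List<User> handleRequest(Map<String, String> request, Context context) {
        // Expect a comma-separated list of user IDs in the request, e.g. "1,2,3".
        List<Long> ids = Arrays.stream(request.get("ids").split(","))
                .map(String::trim)
                .map(Long::valueOf)
                .collect(Collectors.toList());

        // Delegate to the service layer, which checks the cache and then the database.
        return userService.getAll(ids);
    }
}
```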

 

We run the function with Thundra's low-detail, asynchronous monitoring, so we know that Thundra itself adds no latency to the function. First, we request only one user, with user ID “1”. The invocation takes 500 ms on average.

 

Execution asking for 1 user

Since our query is expected to retrieve users by their IDs, the average execution time should not change as the number of requested users increases. This time we pass user IDs 1 through 9 to see whether that is true.

 

Execution asking for 9 users

Oops, there is a problem: the invocation now takes 1300 ms on average. This seems weird. Clearly, we need some insight, and Thundra can provide it.

The “user-getAll” function delegates the request to the service layer to retrieve the users with the given IDs. Let's start tracing the service layer with Thundra. To do that, we switch on service-layer tracing with an environment variable, without modifying the application code and without redeploying the function. We make all public methods of all classes under “io.thundra.lambda.demo.service” traceable as follows:

 

Enabling tracing of all methods of all classes under “io.thundra.lambda.demo.service”
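The screenshot shows only an environment-variable setting. As a rough illustration, the configuration looks something like the snippet below; the variable name and wildcard syntax follow Thundra's Java agent documentation as we recall it, so treat them as an assumption and verify against the current docs rather than copying them verbatim.

```
thundra_agent_lambda_trace_instrument_traceableConfig: io.thundra.lambda.demo.service.*.*
```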

 

The resulting trace chart shows that all of the invocation time is allocated to this service call. We also see in the logs that the users with the given IDs are retrieved. So, we are on the right path, but where do we look next?

 

Resulting trace chart after enabling tracing of service methods

 

We need to look even closer at the “userService.getAll” function. In this function, the user IDs are first validated, and the users are returned from the cache if they are available there. Otherwise, it tries to retrieve them from the database. There is a lot to check in this function.

 

UserService.getAll function
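Again, the code is shown as an image in the original post; a minimal sketch of what such a method might look like is below. The cache wrapper, validation, and repository names are assumptions, but the flow follows the description above: validate, try the cache, then fall back to the database.

```java
import java.util.List;

// Illustrative sketch of UserService.getAll (cache and repository types are hypothetical).
public class UserService {

    private final UserCache userCache = new UserCache();
    private final UserRepository userRepository = new UserRepository();

    public List<User> getAll(List<Long> ids) {
        // 1. Validate the requested IDs.
        if (ids == null || ids.isEmpty()) {
            throw new IllegalArgumentException("At least one user ID is required");
        }

        // 2. Serve from the cache when every requested user is already cached.
        List<User> cached = userCache.getAll(ids);
        if (cached.size() == ids.size()) {
            return cached;
        }

        // 3. Otherwise go to the database -- this is the call that turns out to be slow.
        List<User> users = userRepository.getAll(ids);
        userCache.putAll(users);
        return users;
    }
}
```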

 

We need to trace this function line by line to see which part causes the latency. To achieve that, we append auditing configuration to the target classes. Again, we are not changing any of the function's code. With the new parameters, we enable line-by-line tracing and attach the source code of each line to the trace by setting the “auditLinesWithSource” parameter to true. See the following image.

 

Enabling line by line tracing with environment variable
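As above, the screenshot shows an environment-variable setting. Illustratively, it looks roughly like the snippet below, with the line-by-line switch configured the same way; only the “auditLinesWithSource” parameter named above is shown here, and the variable name and bracketed parameter syntax are assumptions to be checked against Thundra's documentation.

```
thundra_agent_lambda_trace_instrument_traceableConfig: io.thundra.lambda.demo.service.UserService.getAll[auditLinesWithSource=true]
```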

 

Note that these parameters can do even more, such as tracing local variables, return values, and method arguments. Let's see what line-by-line tracing tells us:

 

Resulting trace chart of line by line tracing

 

We have found the guilty line! It is line 110 of UserService, which retrieves the users from the database. Let's dive into the “userRepository.getAll” function now.

 

“getAll” function of “UserRepository” class
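For reference, a repository method of this shape, sketched with hypothetical entity and helper names, would look roughly like this: a single JPQL query with an IN clause over the Hibernate EntityManager.

```java
import javax.persistence.EntityManager;
import java.util.List;

// Illustrative sketch of UserRepository.getAll (JpaUtil and the User entity are hypothetical).
public class UserRepository {

    private final EntityManager entityManager = JpaUtil.entityManager();

    public List<User> getAll(List<Long> ids) {
        // One JPQL statement with an IN clause: a single round trip,
        // regardless of how many IDs are requested... or so it seems.
        return entityManager
                .createQuery("SELECT u FROM User u WHERE u.id IN (:ids)", User.class)
                .setParameter("ids", ids)
                .getResultList();
    }
}
```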

 

As expected, the function contains just one JPA query executed over the Hibernate EntityManager. The query retrieves the users with an “IN (:ids)” clause, so a single statement should be issued regardless of the number of users. Why, then, does the invocation time grow as the number of users grows? We need to look under the hood, at JDBC, and Thundra can do this too! Currently, you can track JDBC queries to PostgreSQL, MySQL, MariaDB, and Microsoft SQL Server. Enable JDBC query tracking as follows:

 

Enabling JDBC tracking with environment variable

 

After invoking the function with these parameters, we observe the following trace charts. Wait, we were sure we sent only one query to the database, right? The first query turns out to be the one we intended to make, but the others retrieve the cards of the fetched users. We never asked for them, yet Hibernate issues an extra query per user over JDBC to load its cards, and this is what adds the latency to our function. This is the well-known “N+1 problem” of Hibernate.

 

Time span of our intended query
Time span of one of the queries due to N+1 problem
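The post does not show the entity mappings, but a mapping along the following lines is the classic way to end up here: the IN query fetches the users in one statement, and an eagerly fetched card collection then makes Hibernate issue one additional SELECT per user. The entity and field names below are illustrative assumptions, not the actual demo model.

```java
import javax.persistence.Entity;
import javax.persistence.FetchType;
import javax.persistence.Id;
import javax.persistence.ManyToOne;
import javax.persistence.OneToMany;
import java.util.List;

@Entity
public class User {

    @Id
    private Long id;

    private String name;

    // Eagerly fetched collection: the IN(:ids) query returns N users in one SELECT,
    // and Hibernate then runs one extra SELECT per user to load its cards -> N+1 queries.
    @OneToMany(mappedBy = "user", fetch = FetchType.EAGER)
    private List<Card> cards;
}

@Entity
class Card {

    @Id
    private Long id;

    @ManyToOne
    private User user;
}
```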

 

As a solution, we need to join users and cards and retrieve them together in a single query. After implementing the fix, our trace chart looks as follows:

 

Trace chart after the implementation of solution

The invocation time drops from 1300 ms back to 500 ms. Note that without the fix, invocations would have taken even longer than 1300 ms as the number of retrieved users grew. Now we have a healthy function that does not take longer than expected and costs less.
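The post does not show the code change itself; one common way to express “join users and cards and retrieve them together” in JPQL is a fetch join, sketched below against the hypothetical mapping and repository from the earlier sketches.

```java
// Sketch of the repository method rewritten with a fetch join (names as in the sketches above).
public List<User> getAll(List<Long> ids) {
    // DISTINCT avoids duplicate User rows produced by the join;
    // users and their cards now come back in a single SQL statement.
    return entityManager
            .createQuery(
                    "SELECT DISTINCT u FROM User u LEFT JOIN FETCH u.cards WHERE u.id IN (:ids)",
                    User.class)
            .setParameter("ids", ids)
            .getResultList();
}
```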

Conclusion

We spotted and solved Hibernate's N+1 problem with Thundra. You can uncover many more issues using Thundra's integrations with Redis, HTTP, AWS SNS, AWS SQS, and more. With Thundra, there is no room for inefficiency in AWS Lambda! Check our documentation to learn how you can get started with Thundra in minutes!