Enhancing Distributed Tracing with Business Context

Feb 11, 2020

 

The microservices paradigm lets application teams build production-ready applications from independent modules that communicate with each other through asynchronous events. By taking advantage of this modularity, teams have started delivering software faster, in small slices. Serverless took that modularity one step further by breaking the application into even smaller pieces (functions), each running a nearly atomic piece of business logic. Because serverless needs less operational work and encourages reusable modules, application teams can now deliver faster and scale with far less effort.

Thanks to the amazing folks in the serverless community, serverless developers have many great tips for building serverless apps at production scale. However, best practices for serverless application development have not been prescribed in a holistic way until now. To fill this gap, AWS just launched the Serverless Lens for the AWS Well-Architected Tool in the console. The Serverless Lens guides application teams toward well-architected serverless applications from the operational excellence, reliability, security, performance efficiency, and cost optimization perspectives.

In this post, we will demonstrate how Thundra can help you apply one of the best practices recommended in the Operational Excellence pillar. We’ll explore how distributed tracing, enhanced with additional context, helps you understand application behavior and maintain operational excellence.

We will use a blog application to demonstrate how distributed tracing can improve the visibility of your serverless application. This application has two personas: blog authors and editors. Authors can submit their drafts, get feedback, and edit or delete their blog posts. Editors can review drafts, submit feedback, and publish posts when they are ready.

How do event-driven business flows work?

The blog application is designed with a serverless paradigm and decomposed into many small Lambda functions, each with a nearly atomic responsibility and least-privilege permissions. When a request first comes into the system, an asynchronous chain of invocations begins through events flowing between Lambda functions. We call this a “business flow”: a meaningful chain of invocations that gets a real job done for the users of the software. For example, when an author submits a blog post, a business flow that makes the post ready for review gets started. Let’s have a closer look.

image4


You can see the whole picture of the business flow drawn automatically by Thundra above. Let’s walk through those operations one by one:

  1.  The blog application should immediately display a message explaining that the post is saved and will be reviewed by editors. 

  2.  The new blog post is ingested into an SQS queue. 

  3.  A worker Lambda consumes from SQS and publishes a message to SNS to notify the editors. 

  4.  The same Lambda saves the blog post to DynamoDB so that editors can use it for review.

  5.  The new DynamoDB record triggers another Lambda, which writes the content to Elasticsearch and indexes it so that the post is searchable among millions of blog posts.
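To make the five steps concrete, here is a minimal sketch of the same flow in plain Python. It is only an illustration under stated assumptions: the queue, notification list, table, and search index are hypothetical in-memory stand-ins for SQS, SNS, DynamoDB, and Elasticsearch, and the function names and sample post are made up for this example.

```python
from collections import deque

# Hypothetical in-memory stand-ins for SQS, SNS, DynamoDB and Elasticsearch.
queue = deque()
notifications = []
posts_table = {}
search_index = {}


def submit_post(post):
    """Steps 1-2: enqueue the post and acknowledge the author right away."""
    queue.append(post)
    return "Your post is saved and will be reviewed by editors."


def index_post(record):
    """Step 5: runs for each new table record and makes the post searchable."""
    for word in record["body"].lower().split():
        search_index.setdefault(word, set()).add(record["id"])


def review_worker():
    """Steps 3-4: consume from the queue, notify editors, save the post."""
    while queue:
        post = queue.popleft()
        notifications.append(f"New draft from {post['author']}: {post['title']}")
        posts_table[post["id"]] = post
        index_post(post)  # stands in for the trigger fired by the new record


ack = submit_post(
    {"id": "p1", "author": "alice", "title": "Hello", "body": "tracing serverless flows"}
)
review_worker()
```

Note that the author gets the acknowledgement before any of the later steps run, which is exactly why a failure further down the chain is hard to notice without tracing.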

How to evaluate the health of Serverless business flows?

We have gone through a happy-path scenario in which everything worked smoothly. However, every system will fail at some point, and when that happens we should be able to answer the question: how do we know our system is working as expected?

Determining the health of each function individually can help us spot an error in one part of the system, but it is cumbersome to work out how that error affects the business flow it belongs to.

For this reason, we need a way to track the request coming from the customer all the way to Elasticsearch. We can do exactly that with distributed tracing. With distributed tracing, engineering teams can quickly visualize how their system behaves, see its integrations and potential performance bottlenecks, and pinpoint issues in an asynchronous architecture much faster than they otherwise could.

Distributed tracing can be achieved either manually by the development teams or in an automated fashion with observability solutions like Thundra.
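For a feel of what the manual route involves, here is a minimal sketch of trace context propagation in plain Python. It assumes a hypothetical in-memory span store and a made-up `trace_context` message field; it illustrates the general technique of carrying a trace ID across hops, not how Thundra works internally.

```python
import time
import uuid

spans = []  # hypothetical in-memory span store; a real setup ships spans to a backend


def start_span(name, context=None):
    """Open a span, reusing the caller's trace ID when a context is passed in."""
    span = {
        "trace_id": context["trace_id"] if context else uuid.uuid4().hex,
        "parent_id": context["span_id"] if context else None,
        "span_id": uuid.uuid4().hex[:16],
        "name": name,
        "start": time.time(),
    }
    spans.append(span)
    return span


def make_message(body, span):
    """Attach the trace context to an outgoing message (hypothetical field name)."""
    context = {"trace_id": span["trace_id"], "span_id": span["span_id"]}
    return {"body": body, "trace_context": context}


def api_handler(post):
    span = start_span("submit-post")  # first hop: starts a new trace
    return make_message(post, span)


def worker_handler(message):
    span = start_span("review-worker", context=message["trace_context"])
    return make_message(message["body"], span)


def indexer_handler(message):
    return start_span("indexer", context=message["trace_context"])


last_span = indexer_handler(worker_handler(api_handler({"title": "Hello"})))
```

Every handler has to read and forward the context by hand, and every message format has to carry it. That bookkeeping is what the automated approach takes off your plate.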

While Thundra makes it incredibly easy to enable tracing for your Serverless Applications with no code change, you can harness the full power of Thundra with the Tags feature to add additional context to your transactions as suggested by the Serverless Lens.


image5

How does Thundra help with distributed tracing enhanced with additional context?

When you plug Thundra into your serverless environment, it automatically starts generating distributed traces revealing the asynchronous business flows end-to-end.

End-to-end here means understanding and managing the whole set of distributed services an application consumes, down to the line level of the runtime code in every Lambda function.

In our example, when we submit a new blog post, Thundra automatically generates a trace map without modifying any message written to any resource. Thundra’s distributed tracing functionality makes it possible to build detailed visual trace maps, track message exchanges between Lambda functions and cloud services, and analyze overall app behavior.

image3

In the Trace Map view, Thundra automatically lays out the whole transaction, where you can see what happened at each line of the invocation, including values of local variables.

Having that visibility already helps to analyze the behavior and maintain the health of your application. However, there’s a lot more we can do by introducing tagging.

With Thundra’s tagging mechanism, you can attach any business transaction or identifier to your traces, which helps your engineering teams correlate performance incidents with their direct or indirect impact on customer experience.

For our example, it’s wise to tag the name of the author so that we can filter for the traces related to that person. To do this, you simply use our Thundra library and set a tag as a key-value pair as shown below.

image2
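In case the screenshot above is hard to read, the idea of a key-value tag can be sketched in a few lines of Python. Note the assumptions: `set_tag` and `current_invocation` here are hypothetical stand-ins written for this example, not Thundra’s actual interface, so please check the Thundra documentation for the real call in your runtime.

```python
# Hypothetical stand-in for a tracing library's tag API; not Thundra's actual interface.
current_invocation = {"function": "submit-post", "tags": {}}


def set_tag(key, value):
    """Attach a business-level key-value pair to the invocation being traced."""
    current_invocation["tags"][key] = value


def handler(event, context=None):
    # Tag the author so traces can later be filtered by who submitted the post.
    set_tag("username", event["author"])
    return {"statusCode": 200}


response = handler({"author": "Emrah Samdan", "title": "Hello"})
```

The tag travels with the invocation data, so it can later be used for filtering and alerting without touching the business logic again.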


How to make the best use of distributed tracing with additional context on Thundra?

In the Thundra console, you can quickly use tags at the top of the screen to filter transactions accordingly, and even set up compound alerts based on them.

image7


For this trace, you can see an invocation in which the username is set to `Emrah Samdan`. From here, we can use Thundra’s SQL-like query expressions to find all transactions for this author that took longer than 2 seconds.

image1
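The logic behind such a query is simple to express. The sketch below applies the same two conditions (author tag and duration above 2 seconds) to a handful of hypothetical trace records in Python. The record layout, field names, and sample values are assumptions made for illustration; they are not Thundra’s data model or query syntax.

```python
# Hypothetical trace records; field names and values are illustrative only.
traces = [
    {"id": "t1", "duration_ms": 2450, "tags": {"username": "Emrah Samdan"}},
    {"id": "t2", "duration_ms": 310, "tags": {"username": "Emrah Samdan"}},
    {"id": "t3", "duration_ms": 5100, "tags": {"username": "another-author"}},
]


def find_slow_traces(traces, username, min_duration_ms=2000):
    """Return traces tagged with the given username that ran longer than the limit."""
    return [
        trace
        for trace in traces
        if trace["tags"].get("username") == username
        and trace["duration_ms"] > min_duration_ms
    ]


slow = find_slow_traces(traces, "Emrah Samdan")
```

Only the trace that satisfies both conditions is returned; the fast trace from the same author and the slow trace from a different author are left out.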

Taking it a step further, you can set up an alert on top of that query and be notified about the traces that match your criteria. Thundra lets you define your own alerting conditions and follow your business flows very closely in this way. You can forward alerts to your email, Slack, and incident management platforms like Opsgenie and PagerDuty.

image6
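Conceptually, an alert is a query plus a threshold plus a list of places to send the notification. Here is a minimal sketch of that idea in Python. The `AlertRule` class, its fields, and the sample traces are hypothetical and written only for this example; the notifier is a plain list standing in for a channel such as email or Slack.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AlertRule:
    """Hypothetical alert rule: a condition, a threshold and notification targets."""
    name: str
    condition: Callable[[dict], bool]
    threshold: int = 1
    notifiers: list = field(default_factory=list)

    def evaluate(self, traces):
        """Notify every target when enough traces match the condition."""
        matches = [trace for trace in traces if self.condition(trace)]
        if len(matches) >= self.threshold:
            message = f"{self.name}: {len(matches)} matching trace(s)"
            for notify in self.notifiers:
                notify(message)
        return matches


sent = []  # stands in for a real channel such as email or Slack
rule = AlertRule(
    name="slow-author-traces",
    condition=lambda t: t["tags"].get("username") == "Emrah Samdan"
    and t["duration_ms"] > 2000,
    notifiers=[sent.append],
)
matched = rule.evaluate(
    [
        {"id": "t1", "duration_ms": 2450, "tags": {"username": "Emrah Samdan"}},
        {"id": "t2", "duration_ms": 310, "tags": {"username": "Emrah Samdan"}},
    ]
)
```

Keeping the condition separate from the notifiers means the same rule can fan out to several channels without being rewritten.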

Summary

In summary, I believe that the Serverless Lens introduced by the AWS Well-Architected team will help software teams adopt serverless faster and more safely for applications of any size. We are proud of being an integral part of achieving the operational excellence defined by AWS Well-Architected.

For any software team, distributed tracing enhanced with additional context plays a key role in the operational excellence of a serverless application. Additional context is most meaningful when developers can make it an integral part of their monitoring and alerting system.

You can start with Thundra through AWS Marketplace or directly from our console for free, and upgrade to a paid plan as your needs grow. Getting started with Thundra is pretty easy if you follow the steps below:

  • Step 1: Connect Thundra to your AWS Account (Optional): By installing the AWS CloudFormation stack provided by Thundra, developers can instantly list all of their serverless functions and their invocations with the associated logs. Although the Thundra stack requests only the minimum permissions needed for its operation, this step is optional for teams that don’t want to install a third-party CloudFormation stack.

  • Step 2: Auto-instrument the Serverless Applications with Thundra Libraries: In order to generate distributed and local traces (local traces show what happened in a Lambda line by line), you need to add the Thundra libraries to your application by adding the Thundra Layer. This makes Thundra automatically wrap your Lambda and instrument the AWS SDK calls, HTTP calls, and requests to non-AWS resources such as Redis. Instrumentation can be done with a single click from the Thundra console if you plugged Thundra into your AWS account as explained in Step 1. If you didn’t, you can still set it up through AWS SAM.
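
To give an intuition for what “wrapping your Lambda” means, here is a minimal sketch of a handler wrapper in Python. It is an assumption-laden illustration: the `traced` decorator, the `invocations` list, and the `publish_post` handler are hypothetical names invented for this example, and the real Thundra layer does far more than time a call and record its error type.

```python
import functools
import time

invocations = []  # hypothetical in-memory store for recorded invocations


def traced(handler):
    """Wrap a Lambda-style handler to record its duration and any error it raises."""
    @functools.wraps(handler)
    def wrapper(event, context=None):
        record = {"function": handler.__name__, "error": None}
        start = time.perf_counter()
        try:
            return handler(event, context)
        except Exception as exc:
            record["error"] = type(exc).__name__
            raise  # the error still reaches the caller unchanged
        finally:
            record["duration_ms"] = (time.perf_counter() - start) * 1000
            invocations.append(record)
    return wrapper


@traced
def publish_post(event, context=None):
    if "title" not in event:
        raise ValueError("title is required")
    return {"published": event["title"]}


result = publish_post({"title": "Hello"})
try:
    publish_post({})
except ValueError:
    pass  # the failed invocation is still recorded by the wrapper
```

The handler body stays free of tracing code, which is the point of instrumenting through a layer rather than by hand.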