Tracking the KPIs of Unique Traces

Mar 24, 2020

 


Thanks to continuous innovation in event-driven distributed architectures, application-level services that used to be a single monolithic application are now composed of many serverless applications on AWS Lambda and/or AWS Fargate. Checking the health of a single application gives only limited visibility and says little about the health and behavior of the distributed application as a whole. Developers and DevOps teams need a more holistic way of understanding system behavior. Thundra already helps here with its architecture view of microservices architectures.

With the architecture view, you can see at a glance which part of your system is experiencing problems. However, this view can be too general when it comes to understanding the health of a single “service.” If we consider function visibility and architecture visibility as sitting at opposite ends of the visibility spectrum, developers need a middle point to fully understand the behavior of their services. For this purpose, we are proud to introduce the unique traces feature, which helps Thundra users understand the health and performance of their service flows.

In this post, we will demonstrate how Thundra identifies unique traces (a.k.a. service flows), and how you can use this information to improve the performance of your modern applications. We will use a fully serverless blog site application to demonstrate the importance of unique traces.

This application has two different personas: blog authors and editors. Using the blog application, blog authors can submit their drafts, get feedback, and edit/delete their blog posts. It also allows editors to review drafts, submit feedback, and publish the blogs when they’re ready.

What’s a trace and what makes it unique?

From its inception, Thundra has followed the OpenTracing specification for tracing interactions between distributed services. Below, we see a “trace” that was triggered by a request coming from API Gateway to post a blog. In this trace, there are several interservice spans that form a business flow to achieve the objective of posting a blog to our blog site application. Let’s walk through the spans that are automatically drawn by Thundra’s distributed tracing:

  1. The blog application should immediately display a message explaining that the post is saved and will be reviewed by editors. 
  2. The new blog post is ingested into an SQS queue. 
  3. A consumer Lambda consumes the submitted blog post from SQS and publishes a message to SNS to notify the editors. 
  4. The same Lambda saves the blog post to DynamoDB so that it can be reviewed by editors. 
  5. The new DynamoDB record triggers another Lambda, which writes the content to Elasticsearch with the necessary indexes so that the blog post is searchable among millions of other posts.

image4

All of these operations together represent a service flow that helps authors post their blogs for review by editors. This flow is repeated every time a new blog post is submitted, so this “unique” trace, composed of the same operations, is produced over and over again. We can keep track of each individual service, but understanding the performance of the services as a whole requires a different lens. That’s why we came up with unique traces to understand the behavior and health of service flows.

We consider this a middle layer between invocations and the all-encompassing architecture view. Moving from the smallest part to the biggest, we can list the observability components as follows: 

  1. Spans: A span represents a single operation, for example, a read operation from a DynamoDB table with a specific statement. A span can also be a custom span created with Thundra’s manual instrumentation, which is compatible with the OpenTracing specification. 
  2. Invocations: An invocation can have many spans in it, and shows the interactions of a single function during one call. 
  3. Traces: A single chain of invocations that represents a transaction that achieves a job. 
  4. Unique Traces: The latest addition to the game, which groups traces that achieve the same job over and over again and lets developers measure performance at the transaction and/or service level. 
  5. Architecture: Shows the aggregated architectural topology of the functions using all the spans, invocations, and traces generated by the system. 
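
The hierarchy above can be pictured as nested data structures. The following is a minimal Python sketch for illustration only; the class and field names (`Span`, `Invocation`, `Trace`, `UniqueTrace`, `resource_type`, and so on) and the resource names are hypothetical and are not Thundra's actual data model or API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    """A single operation, such as one write to a DynamoDB table."""
    resource_type: str  # illustrative label, e.g. "AWS-DynamoDB"
    resource_name: str
    operation: str
    error: bool = False


@dataclass
class Invocation:
    """One call of a single function, holding the spans it produced."""
    function_name: str
    spans: List[Span] = field(default_factory=list)


@dataclass
class Trace:
    """A chain of invocations that together complete one transaction."""
    trace_id: str
    invocations: List[Invocation] = field(default_factory=list)

    def all_spans(self) -> List[Span]:
        return [s for inv in self.invocations for s in inv.spans]

    def has_error(self) -> bool:
        return any(s.error for s in self.all_spans())


@dataclass
class UniqueTrace:
    """Groups traces that perform the same job over and over again."""
    alias: str
    occurrences: List[Trace] = field(default_factory=list)


# Hypothetical trace for the blog-post flow described earlier.
consumer = Invocation("post-consumer", [
    Span("AWS-SNS", "editor-notifications", "PUBLISH"),
    Span("AWS-DynamoDB", "blog-posts", "PUT"),
])
indexer = Invocation("post-indexer", [
    Span("Elasticsearch", "blog-posts-index", "INDEX"),
])
trace = Trace("trace-1", [consumer, indexer])
post_blog = UniqueTrace("post-blog", [trace])

print(len(trace.all_spans()), trace.has_error())  # 3 False
```

The architecture level is left out of the sketch because it is an aggregate computed over all of these objects rather than a container for them.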

How does Thundra help with understanding unique traces?

When you open the “unique traces” item in the left menu, you can see the unique traces that Thundra has discovered in your system. To keep the definition simple, traces are counted as unique with respect to the resources and operations in them. For example, let’s say that when your Lambda function fails to save a user to a DynamoDB table for any reason, it writes to an SQS queue so that the operation can be retried manually. The “happy” path of writing to the DynamoDB table and the “unhappy” path that adds SQS to the equation are considered two different unique traces. After we identify the unique traces, we give each one an alias built from the initial trigger resource name and type and the entry point application. You can change this alias at any time.

image6
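
One way to picture this grouping is to reduce every trace to a signature of the resources and operations it touches, and to count traces with the same signature as occurrences of one unique trace. The sketch below is an assumption about how such grouping could work, not Thundra's actual algorithm; the dictionary layout, the function names, the resource names, and the alias format are all hypothetical.

```python
from collections import defaultdict


def signature(trace):
    """Order-independent set of (resource type, resource name, operation)."""
    return frozenset((h["type"], h["name"], h["op"]) for h in trace["hops"])


def group_unique_traces(traces):
    """Map each signature to the ids of the traces that share it."""
    groups = defaultdict(list)
    for trace in traces:
        groups[signature(trace)].append(trace["id"])
    return dict(groups)


def default_alias(trace):
    """Assumed alias format: trigger name and type, then entry application."""
    trigger = trace["trigger"]
    return f'{trigger["name"]} ({trigger["type"]}) -> {trace["entry_app"]}'


def make_trace(trace_id, hops):
    """Build a hypothetical trace record triggered through API Gateway."""
    return {
        "id": trace_id,
        "trigger": {"type": "AWS-APIGateway", "name": "/users"},
        "entry_app": "save-user",
        "hops": hops,
    }


WRITE_USER = {"type": "AWS-DynamoDB", "name": "users", "op": "PUT"}
RETRY_LATER = {"type": "AWS-SQS", "name": "user-retry", "op": "SEND"}

traces = [
    make_trace("t1", [WRITE_USER]),               # happy path
    make_trace("t2", [WRITE_USER]),               # happy path again
    make_trace("t3", [WRITE_USER, RETRY_LATER]),  # unhappy path adds SQS
]

groups = group_unique_traces(traces)
print(len(groups))               # 2
print(default_alias(traces[0]))  # /users (AWS-APIGateway) -> save-user
```

The two happy-path traces collapse into one group, while the trace that also touches SQS forms a second one, mirroring the happy/unhappy example above.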

As on every Thundra screen, rich querying capabilities help you explore unique traces. You can filter by the number of occurrences or by the names of the resources in them. For example, the following query returns the unique traces that contain an S3 bucket named “thundra-demo-team-lambda-java-prod” and sorts them by the number of occurrences of the trace: 

resource.AWS-S3.name=thundra-demo-team-lambda-java-prod ORDER BY COUNT(Trace) DESC
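
To make the intent of the query concrete, here is a rough Python equivalent of the filter and the sort. It is only a sketch: the in-memory records, the field names, and the other resource names are made up for illustration, and it does not reflect how Thundra evaluates queries.

```python
# Hypothetical unique-trace records: an alias, the resources seen in the
# trace as (type, name) pairs, and the number of occurrences.
unique_traces = [
    {"alias": "upload-report", "count": 40,
     "resources": [("AWS-S3", "thundra-demo-team-lambda-java-prod"),
                   ("AWS-Lambda", "report-writer")]},
    {"alias": "save-team", "count": 310,
     "resources": [("AWS-DynamoDB", "teams")]},
    {"alias": "archive-team", "count": 95,
     "resources": [("AWS-S3", "thundra-demo-team-lambda-java-prod"),
                   ("AWS-SQS", "archive-queue")]},
]


def having_resource(records, resource_type, name):
    """Keep records that contain the resource, most frequent first."""
    matches = [r for r in records if (resource_type, name) in r["resources"]]
    return sorted(matches, key=lambda r: r["count"], reverse=True)


result = having_resource(
    unique_traces, "AWS-S3", "thundra-demo-team-lambda-java-prod")
print([r["alias"] for r in result])  # ['archive-team', 'upload-report']
```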

When you want to take a closer look at a trace that you identified, you can simply click on the trace to see more details. 

Diving deep into unique traces

When you click on a unique trace, you get several views that help you understand it better. 

  1. Architecture view shows how the flow works in this unique trace. 
  2. Traces view lists the individual traces (also known as occurrences) of the unique trace. You can filter and sort this list according to errors, duration, cold starts, and timeouts. 
  3. Metrics show the traffic and latency on this service as a trendline.

As you can see in the following picture, we are filtering the traces that have at least one error in them. We can then pick one and take a closer look at it.

image5
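
The same kind of filtering can be pictured in code. The sketch below keeps only the occurrences that match a condition, such as having at least one error, and lists the slowest first; the record fields (`errors`, `duration_ms`, `cold_start`, `timeout`) are hypothetical names chosen for illustration, not Thundra's schema.

```python
# Hypothetical occurrences of one unique trace.
occurrences = [
    {"id": "t1", "duration_ms": 420, "errors": 0,
     "cold_start": True, "timeout": False},
    {"id": "t2", "duration_ms": 1850, "errors": 2,
     "cold_start": False, "timeout": False},
    {"id": "t3", "duration_ms": 3000, "errors": 1,
     "cold_start": False, "timeout": True},
    {"id": "t4", "duration_ms": 390, "errors": 0,
     "cold_start": False, "timeout": False},
]


def select(records, predicate, sort_key="duration_ms"):
    """Keep records matching the predicate, largest sort_key first."""
    matches = [r for r in records if predicate(r)]
    return sorted(matches, key=lambda r: r[sort_key], reverse=True)


erroneous = select(occurrences, lambda r: r["errors"] >= 1)
cold = select(occurrences, lambda r: r["cold_start"])

print([r["id"] for r in erroneous])  # ['t3', 't2']
print([r["id"] for r in cold])       # ['t1']
```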

A closer look at one trace shows a single transaction of this unique flow, as below.

image2

Clicking on the nodes in this directed acyclic graph provides more context on which messages are being processed and what happened in each application invocation. 

image1

Here we see the metrics for the unique trace that has been used to save a team.

image3
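
For readers who want to reproduce this kind of chart from their own data, the sketch below shows one way traffic and latency trendlines could be derived from the occurrences of a unique trace. The input format, a list of hypothetical (start time in seconds, duration in milliseconds) pairs, and the one-minute bucket size are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def trendline(occurrences, bucket_seconds=60):
    """Return (bucket start, traffic, average latency in ms) per bucket."""
    buckets = defaultdict(list)
    for start_s, duration_ms in occurrences:
        # Round the start time down to the beginning of its bucket.
        bucket_start = start_s - start_s % bucket_seconds
        buckets[bucket_start].append(duration_ms)
    return [(start, len(durations), mean(durations))
            for start, durations in sorted(buckets.items())]


# Hypothetical occurrences: (start time in seconds, duration in ms).
occurrences = [(0, 100), (30, 300), (65, 200), (110, 400), (125, 250)]

for start, traffic, avg_ms in trendline(occurrences):
    print(start, traffic, avg_ms)
```

Each output row is one point on the trendline: the number of occurrences that started in that minute and their average duration.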

Conclusion

With the addition of unique traces, we are filling an important gap in our observability offering. In the coming days, we will continue to improve this capability by adding a tagging strategy to unique traces. This will allow you to filter the instances of a unique trace (e.g., “saving a user”) according to a business variable (e.g., username).

As you may have seen, we also released our online debugging feature very recently. If you want to try online debugging and discover the unique traces that your system generates, you can start your Thundra journey here.

One last note: We are passing through extraordinary times. Take care, and try to stay at home with your family. We wish you safe and happy serverless-ing!