Microservice applications typically consist of interconnected systems with compute, storage, and messaging components that work together to serve the requests coming into the software. For example, a modern e-commerce application includes an order service, a cart service, a payment service, and more. Each service is isolated and separated by a network boundary, and the services could be hosted on different platforms.
As Werner Vogels, Amazon CTO, says, "everything fails all the time," so it is no surprise that failures happen in such an application. When they do, we need to know "what" and "why" in order to recover quickly. In a distributed system, sifting through logs can be daunting. Likewise, metrics have limitations: they can show that something is wrong, but good luck finding out "what" and "where." This is where tracing comes in.
Tracing is a way of following requests as they transit multiple services. It helps you observe distributed systems and pinpoint the causes and impacts of failures and performance issues. A trace consolidates all three pillars of observability into one data model by integrating metrics and logs with the corresponding trace. In this way, developers can tell whether a failure comes from their own code or from a service or API that they use. Tracing also helps in understanding the impact of cascading failures on other services and end users.
What’s a trace?
Fundamentally, a trace begins with a single request and represents the request’s entire journey as it moves through all the services of a distributed system. A trace consists of spans, each of which represents a time interval with metadata attached to it. From that perspective, spans are the atoms that combine to form an element, which in our case is a trace. A span typically has a unique identifier, start and end timestamps (usually in milliseconds), and some metadata about the service it belongs to, the region it runs in, and so on. It may also include log and metric data in key-value format, which is useful for capturing span-specific informational or debugging output.
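To make the description above concrete, here is a minimal sketch of a span and a trace as plain Python data classes. The class and field names (Span, Trace, tags, logs, and so on) are assumptions chosen for illustration; they are not the data model of any particular tracing tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional
import uuid


@dataclass
class Span:
    """One timed operation inside a single service (illustrative model)."""
    trace_id: str                      # shared by every span of the same request
    name: str                          # operation name, e.g. "charge-card"
    service: str                       # service that produced the span
    start_ms: int                      # start timestamp in milliseconds
    end_ms: int                        # end timestamp in milliseconds
    parent_id: Optional[str] = None    # None marks the root span
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    tags: Dict[str, str] = field(default_factory=dict)  # key-value metadata
    logs: List[str] = field(default_factory=list)       # span-specific messages

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


@dataclass
class Trace:
    """All spans that belong to one request."""
    trace_id: str
    spans: List[Span] = field(default_factory=list)

    def add(self, span: Span) -> None:
        if span.trace_id != self.trace_id:
            raise ValueError("span belongs to a different trace")
        self.spans.append(span)

    @property
    def duration_ms(self) -> int:
        # Wall-clock time from the earliest start to the latest end.
        starts = [s.start_ms for s in self.spans]
        ends = [s.end_ms for s in self.spans]
        return max(ends) - min(starts)


if __name__ == "__main__":
    trace = Trace(trace_id="t-1")
    root = Span("t-1", "checkout", "order-service", 0, 120,
                tags={"region": "eu-west-1"})
    trace.add(root)
    trace.add(Span("t-1", "charge-card", "payment-service", 30, 100,
                   parent_id=root.span_id))
    print(trace.duration_ms)
```

The parent_id field is what links spans from different services into one journey: every span except the root points at the span that caused it.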
What Is Platform-agnostic Tracing and Why Is It Important?
When your software is platform-agnostic, it is not tied to the platform it runs on. It can run on AWS, GCP, or Azure. Similarly, when tracing is platform-agnostic, the tracing system traces the application’s requests independently of the platform the software runs on. When you are troubleshooting a failure as a developer, you can mostly set aside the question of “where” and concentrate on the “why” and “how.” Platform-agnostic tracing lets you effectively profile and monitor a single request from start to finish by focusing on the application rather than the infrastructure.
Why is this important at all? Below are a few reasons you should have a tracing tool that's blind to your architecture.
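One way to picture why tracing can be independent of the platform is that the trace identity travels with the request itself, not with the host that serves it. The sketch below passes a trace ID and a span ID between services in a request header. The header layout is modeled on the W3C Trace Context "traceparent" format (version, trace ID, parent span ID, flags), but the validation is deliberately simplified, and the function names inject, extract, and handle_request are hypothetical helpers, not the API of any specific library.

```python
import secrets
from typing import Dict, Optional, Tuple

TRACE_HEADER = "traceparent"  # header name used by W3C Trace Context


def inject(headers: Dict[str, str], trace_id: str, span_id: str) -> None:
    """Write the current context into outgoing request headers."""
    # Layout: version - trace id (32 hex) - parent span id (16 hex) - flags
    headers[TRACE_HEADER] = f"00-{trace_id}-{span_id}-01"


def extract(headers: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Read (trace_id, parent_span_id) from incoming headers, if present."""
    value = headers.get(TRACE_HEADER)
    if value is None:
        return None
    parts = value.split("-")
    if len(parts) != 4 or len(parts[1]) != 32 or len(parts[2]) != 16:
        return None  # malformed header: treat the request as untraced
    return parts[1], parts[2]


def handle_request(headers: Dict[str, str]) -> Dict[str, str]:
    """Hypothetical service handler: continue the caller's trace or start one."""
    incoming = extract(headers)
    trace_id = incoming[0] if incoming else secrets.token_hex(16)
    span_id = secrets.token_hex(8)        # this service's own span
    outgoing: Dict[str, str] = {}
    inject(outgoing, trace_id, span_id)   # handed to the next service
    return outgoing


if __name__ == "__main__":
    first_hop = handle_request({})          # no header: a new trace starts
    second_hop = handle_request(first_hop)  # same trace, new span
    print(first_hop[TRACE_HEADER])
    print(second_hop[TRACE_HEADER])
```

Nothing in this sketch depends on where either service runs, which is the point: as long as each hop reads and forwards the header, the trace stays intact across platforms.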
Monolith to Microservices Transformation
If you have ever transformed a monolithic architecture into a microservices application, you will agree that decomposition creates isolated components written in different programming languages, with different dependencies, and so on. Using a tracing solution that works only on a specific set of runtimes and configurations will limit your freedom to innovate with new tools and technologies.
As your architecture evolves and becomes more diverse, your tracing tool should be blind to your internal architectural changes and should enhance your agility without boxing you in.
The Increasing Granularity of Modern Applications
Cloud platforms revolutionized how applications are developed, delivered, and operated, but they also created new problems. The granularity of small sub-modules and the mostly asynchronous communication between them make modern applications very complex. Pinpointing the root cause of failures in such systems by checking logs and dashboards has become even harder. Platform-agnostic tracing allows teams to integrate and focus on what matters, irrespective of the platform or the nature of the service.
TCO for Troubleshooting
When a microservice application is separated into ever smaller pieces, it is very hard to locate a problem by using different tools and techniques for each independent service. Platform-agnostic tracing gives you a single pane of glass for understanding issues by collecting traces from several sub-modules and putting them together in the correct order depending on the context. In this way, developers can figure out the underlying reason for failures and performance degradations without switching between different tools. That can save organizations a lot of time and money.
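The idea of collecting spans from several sub-modules and putting them together in the correct order can be sketched as follows. This is a simplified illustration, assuming each span arrives as a plain dictionary with the hypothetical keys trace_id, span_id, parent_id, service, name, start_ms, and end_ms; real tracing backends handle clock skew, late-arriving spans, and much more.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def assemble(spans: List[dict]) -> Dict[str, List[Tuple[int, dict]]]:
    """Group spans by trace and order each trace parent-first, by start time.

    Returns, per trace ID, a list of (depth, span) pairs.
    """
    by_trace = defaultdict(list)
    for span in spans:
        by_trace[span["trace_id"]].append(span)

    result: Dict[str, List[Tuple[int, dict]]] = {}
    for trace_id, members in by_trace.items():
        known_ids = {s["span_id"] for s in members}
        children = defaultdict(list)
        for s in members:
            parent = s.get("parent_id")
            # A span whose parent was never received is treated as a root.
            children[parent if parent in known_ids else None].append(s)
        for group in children.values():
            group.sort(key=lambda s: s["start_ms"])

        ordered: List[Tuple[int, dict]] = []

        def walk(parent_id, depth):
            for s in children.get(parent_id, []):
                ordered.append((depth, s))
                walk(s["span_id"], depth + 1)

        walk(None, 0)
        result[trace_id] = ordered
    return result


def render(ordered: List[Tuple[int, dict]]) -> List[str]:
    """Format one assembled trace as indented text lines."""
    return [
        f"{'  ' * depth}{s['service']}:{s['name']} "
        f"({s['end_ms'] - s['start_ms']} ms)"
        for depth, s in ordered
    ]


if __name__ == "__main__":
    collected = [  # spans arrive out of order from different services
        {"trace_id": "t1", "span_id": "c", "parent_id": "a",
         "service": "payment", "name": "charge", "start_ms": 40, "end_ms": 90},
        {"trace_id": "t1", "span_id": "a", "parent_id": None,
         "service": "order", "name": "checkout", "start_ms": 0, "end_ms": 120},
        {"trace_id": "t1", "span_id": "b", "parent_id": "a",
         "service": "cart", "name": "load", "start_ms": 5, "end_ms": 30},
    ]
    for line in render(assemble(collected)["t1"]):
        print(line)
```

Because the grouping relies only on the identifiers carried by the spans, the same assembly step works no matter which platform each service reported from.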
Platform-agnostic Tracing Tools
There are open-source tracing tools that provide libraries and standards for defining, generating, and transmitting span and trace information across different microservices. There are also several solutions, like Thundra, that provide out-of-the-box platform-agnostic tracing. Let’s have a closer look at those tools and platforms.
Jaeger allows you to troubleshoot and monitor complex microservices environments. It enables you to quickly examine the entire chain of actions or events happening within microservices.
Jaeger was created by Uber Technologies as an open-source project in 2015 and then donated to CNCF in 2017. Jaeger is written in Go, and it comes with features that let you optimize latency and performance, monitor distributed transactions, and perform root cause analysis, service dependency analysis, and distributed context propagation.
OpenTracing allows you to profile and monitor applications across different services and components.
The OpenTracing open-source project became a Cloud Native Computing Foundation (CNCF) project. At its core, OpenTracing aims to standardize the approach to distributed tracing and instrumentation.
Thundra’s tracing solution auto-generates traces by providing runtime instrumentation libraries. However, no automated tracing tool can know the internal details of your distributed application. That’s why Thundra also provides a manual instrumentation SDK, compatible with OpenTracing standards, to let users inject their own spans alongside the automated ones. Thundra is well known for tracing serverless and containerized applications, but it’s going to be completely platform-agnostic in the upcoming months.
Zipkin provides mechanisms for storing, sending, receiving, and visualizing traces.
Zipkin has an open source community, where you can always find publications on new data formats, libraries, and APIs. Zipkin also has a client-server architecture, uses Thrift as its communication protocol, and supports Elasticsearch and Cassandra as backends for storing trace data.
Instrumenting the application according to standards like OpenTelemetry and OpenTracing is the key to building a platform-agnostic tool. Tools like Jaeger and Zipkin give you ways of structuring and organizing the data. However, building your own platform-agnostic tracing solution is very time-consuming, and maintaining it while your applications grow is an even harder task.
If you’re worried about dealing with the adoption and management of open source solutions, you can start your tracing journey with a managed tool like Thundra. This will familiarize you with the core concepts (such as span, trace, and execution context) through auto-generated traces on the Thundra console. Thundra is flexible and leaves room for manual tracing compatible with OpenTracing standards.