Observability is defined as “...a measure of how well the internal states of a system can be inferred from knowledge of its external outputs...” In monolithic applications, observability is virtually synonymous with monitoring. These applications externalize their performance through logs, which are often enriched and given further context by third-party monitoring and Application Performance Management tools.
But in the era of modern distributed applications, such legacy monitoring paradigms break down. In today’s microservice-based architectures, an application is an interconnected mesh of services, many of which are third-party and/or open-source, and understanding system performance from external outputs becomes far more challenging. Getting to the root cause of a performance slowdown or disruption suddenly feels like making your way through a maze, blindfolded.
Cloud-native telemetry has emerged as the observability solution of choice, aggregating the diverse streams of data that are output at scale from modern applications. Using instrumentation APIs or structured data formats, an application’s distributed traces, time series metrics, and logs are collected by a cloud-native telemetry platform. Here, system performance can be made observable (and actionable) by a backend system such as Jaeger, Prometheus, or Zipkin.
To date, two open-source projects have dominated the cloud-native telemetry landscape: OpenTracing and OpenCensus—each with its own telemetric standard and its own substantial community. In a commendable spirit of cooperation, however, the two projects have decided to converge in order to achieve an important mission: a single standard for built-in, high-quality cloud-native telemetry.
In this blog article, we first explain why OpenTracing and OpenCensus forced developers to make a choice between the two solutions; we then describe how OpenTelemetry converges the best of both projects into a single standard around which the entire cloud-native development community can rally.
OpenTracing vs. OpenCensus
Distributed Tracing 101
Before describing OpenTracing’s and OpenCensus’ unique approaches to distributed tracing, let’s review some key terms and concepts.
A distributed trace tracks a single request from beginning to end, across all the processes, APIs, and services it interacts with, often across multiple systems and domains. As shown in Figure 1A, each trace is a directed acyclic graph (DAG) of spans that share a unique trace context ID; passing that ID from one process or service to the next is known as context propagation. Each span is a named and timed operation representing a contiguous segment of work within the trace. The root span (A in Figure 1A) is the first span in a trace, while the other spans are related through parent-child or follows-from relationships. In the example below, Spans B and C are the children of Span A. Span D is the child of Span B, and Spans E and F are the children of Span C (and the grandchildren of Span A). Span G follows from Span F, and Span H follows from Span G. Figure 1B shows the same trace represented temporally, along a timeline.
Figure 1: A distributed request trace and its spans. (Source: OpenTracing Specification)
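To make the relationships in Figure 1 concrete, here is a minimal sketch in plain Python. It is an illustrative model only, not the OpenTracing API: the Span class, its field names, the trace ID value, and the timestamps are all assumptions made for this example. Only the span names and the child-of / follows-from relationships come from the description above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Span:
    """A named, timed operation in one trace (illustrative model only)."""
    name: str
    trace_id: str                    # shared by every span in the trace
    start: float                     # made-up timestamps, arbitrary units
    end: float
    reference: Optional[str] = None  # name of the span this one refers to
    relation: Optional[str] = None   # "child_of" or "follows_from"


def build_figure1_trace(trace_id: str = "trace-1") -> Dict[str, Span]:
    """Build spans A to H with the relationships described for Figure 1."""
    rows = [
        # name, reference, relation, start, end
        ("A", None, None, 0, 10),
        ("B", "A", "child_of", 1, 4),
        ("C", "A", "child_of", 4, 9),
        ("D", "B", "child_of", 2, 3),
        ("E", "C", "child_of", 5, 6),
        ("F", "C", "child_of", 6, 7),
        ("G", "F", "follows_from", 7, 8),
        ("H", "G", "follows_from", 8, 9),
    ]
    return {
        name: Span(name, trace_id, start, end, ref, rel)
        for name, ref, rel, start, end in rows
    }


def lineage(trace: Dict[str, Span], name: str) -> List[str]:
    """Walk references from a span back to the root span."""
    chain = []
    ref = trace[name].reference
    while ref is not None:
        chain.append(ref)
        ref = trace[ref].reference
    return chain


trace = build_figure1_trace()
print(lineage(trace, "H"))  # ['G', 'F', 'C', 'A']
```

Because every span carries the same trace ID and at most one reference to an earlier span, a backend can rebuild both the graph view (Figure 1A) and the timeline view (Figure 1B) from a flat list of spans.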
Both OpenTracing and OpenCensus provide vendor-agnostic tracer interfaces for creating and interacting with spans. However, they use different—and incompatible—approaches, as described below.
OpenTracing is a Cloud Native Computing Foundation (CNCF) project that was first released in December 2016. It is a semantic specification of a vendor-neutral API for instrumenting code for distributed tracing. Libraries are available in nine languages, each with a simple interface for instrumenting code so that it can be traced by any vendor or project that provides an OpenTracing-compatible tracer in that language, such as CNCF Jaeger, LightStep, Apache SkyWalking, and Datadog.
The tracer outputs a time-based and hierarchical breakdown of the spans that comprise the request trace. Because OpenTracing abstracts the tracing layer from the application layer, developers can change tracers without having to touch the codebase; they can also write their own tracers.
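The idea of separating the tracing layer from the application layer can be sketched as follows. This is a simplified stand-in, not the OpenTracing API itself: the Tracer protocol, the two tracer classes, and the handle_checkout function are hypothetical names invented for the example.

```python
import time
from contextlib import contextmanager
from typing import ContextManager, Iterator, List, Protocol, Tuple


class Tracer(Protocol):
    """Hypothetical minimal tracer interface the application codes against."""

    def span(self, name: str) -> ContextManager[None]: ...


class NoopTracer:
    """A tracer that records nothing."""

    @contextmanager
    def span(self, name: str) -> Iterator[None]:
        yield


class InMemoryTracer:
    """A tracer that keeps (name, duration) pairs for finished spans."""

    def __init__(self) -> None:
        self.finished: List[Tuple[str, float]] = []

    @contextmanager
    def span(self, name: str) -> Iterator[None]:
        started = time.perf_counter()
        try:
            yield
        finally:
            # Inner spans finish (and are recorded) before their parents.
            self.finished.append((name, time.perf_counter() - started))


def handle_checkout(tracer: Tracer) -> str:
    """Application code: depends only on the interface, not on a vendor."""
    with tracer.span("checkout"):
        with tracer.span("charge-card"):
            pass  # business logic would go here
    return "ok"


handle_checkout(NoopTracer())      # tracing switched off
recording = InMemoryTracer()
handle_checkout(recording)         # same code, different tracer
print([name for name, _ in recording.finished])
```

Swapping NoopTracer for InMemoryTracer changes what is recorded without any change to handle_checkout, which is the property the abstraction is meant to provide.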
OpenCensus started at Google as part of its internal Census libraries and was released as an open-source tool in January 2018. Unlike OpenTracing, which specifies distributed traces only, OpenCensus is a collection of libraries for gathering both distributed tracing and time series metrics from an application. Because the metrics and traces collected by OpenCensus share the same context propagation tags and metadata, the observability backend chosen by the developer can provide a more integrated and comprehensive view of application performance.
Applications must import and initialize the desired metrics exporter. Different application microservices, however, may require different exporters. In order to address this issue, an OpenCensus Service can be deployed to collect metrics and tracing data from separate processes in a variety of formats and output them to one or more supported backends.
The Bottom Line
Despite their common objective, the differences between OpenTracing and OpenCensus are fundamental, and developers have had to choose between deploying one or the other.
OpenTracing is a thin and easily deployed specification for tracing APIs, while OpenCensus is both a specification and a set of libraries. With OpenTracing, developers instrument their code by choosing an available OpenTracing tracer or by writing their own tracer. With OpenCensus, developers deploy an agent to instrument their code. Both tools export trace data to compatible backends, while OpenCensus also collects and exports time series metrics.
OpenTelemetry = OpenTracing + OpenCensus
As described by leading members of the OpenTelemetry project, the convergence of OpenTracing and OpenCensus is guided by the following key objectives and principles:
- Create a unified set of libraries and specifications for cloud-native observability telemetry, compatible with the leading OSS and backends.
- Provide a straightforward migration path from each current root project, primarily through backwards compatibility bridges that will support existing OpenTracing and OpenCensus instrumentation for two years.
- As each OpenTelemetry language SDK reaches parity with OpenTracing and OpenCensus, sunset (that is, freeze) the corresponding part of the root projects. In other words, on a language-by-language basis, the OpenTelemetry project leaders want developers to see OpenTelemetry “as the next major version of both OpenTracing and OpenCensus.”
In order to allow work on the different language SDKs to proceed in parallel yet with a high level of consistency, two important specifications have already been finalized:
- Cross-Language Specification: This sets the groundwork for a system that will be coherent regardless of the programming language. It includes a glossary of common terminology definitions, as well as a model for describing distributed transactions, statistics, and metrics.
- Data Specification: This defines a common format for tracing and metrics data regardless of how the data was generated, so that all exported data can be handled by the same telemetry infrastructure. The specification includes, for example, a data schema for the tracing model described in the Cross-Language Specification.
How OpenTelemetry Works
Figure 2 shows the basic telemetry pipeline when an application calls into OpenTelemetry. The telemetry backend can be Jaeger, Prometheus, Zipkin, or any other supported backend.
Figure 2: Basic OpenTelemetry Delivery Pipeline. (Source: Based on OpenTelemetry: beyond getting started)
Figure 3, however, shows that OpenTelemetry has many extensibility points:
- Collectors can communicate with various backends via multiple out-of-process Exporters.
- The aggregation, batching, and processing in the Collector is configurable.
- The default in-process Exporter sends telemetry to the Collector, but that Exporter is easily replaced.
- Processing extensions, such as sampling, filtering, and enrichments, can be added via the SDK.
- The entire OpenTelemetry SDK can be replaced with an alternative implementation.
Figure 3: OpenTelemetry Layers and Extensibility Points. (Source: OpenTelemetry: beyond getting started)
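The layering in Figure 3 can be approximated with a few lines of plain Python. This is a conceptual sketch under stated assumptions, not the OpenTelemetry SDK or Collector: the Sdk and Collector classes, the processor functions, and the span dictionaries are hypothetical, and the "backends" are just in-memory lists.

```python
from typing import Callable, Dict, List, Optional

Span = Dict[str, object]
Processor = Callable[[Span], Optional[Span]]  # return None to drop a span


class Collector:
    """Out-of-process stage: batches spans and fans out to every backend."""

    def __init__(
        self, exporters: List[Callable[[List[Span]], None]], batch_size: int = 2
    ) -> None:
        self.exporters = exporters
        self.batch_size = batch_size  # configurable batching
        self._buffer: List[Span] = []

    def receive(self, span: Span) -> None:
        self._buffer.append(span)
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if not self._buffer:
            return
        batch, self._buffer = self._buffer, []
        for export in self.exporters:
            export(list(batch))  # each backend gets its own copy


class Sdk:
    """In-process stage: runs processors, then hands off to one exporter."""

    def __init__(
        self, processors: List[Processor], exporter: Callable[[Span], None]
    ) -> None:
        self.processors = processors
        self.exporter = exporter  # replaceable in-process exporter

    def end_span(self, span: Span) -> None:
        current: Optional[Span] = span
        for process in self.processors:
            current = process(current)
            if current is None:
                return  # filtered out
        self.exporter(current)


def drop_health_checks(span: Span) -> Optional[Span]:
    return None if span["name"] == "GET /health" else span


def add_environment(span: Span) -> Optional[Span]:
    return dict(span, env="staging")  # enrichment


backend_a: List[List[Span]] = []
backend_b: List[List[Span]] = []
collector = Collector([backend_a.append, backend_b.append], batch_size=2)
sdk = Sdk([drop_health_checks, add_environment], collector.receive)

for name in ("GET /health", "GET /cart", "POST /pay"):
    sdk.end_span({"name": name})
print(backend_a)
```

Each constructor argument corresponds to one extensibility point: the processor list, the in-process exporter, the batching configuration, and the list of out-of-process exporters can each be swapped independently.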
We highly recommend that you read “OpenTelemetry: beyond getting started” in order to understand in more depth how OpenTelemetry and its extension points address real-life application monitoring scenarios such as:
- Monitoring long-running tasks by starting a new Span using Tracer APIs
- Filtering out synthetic traffic using a custom Sampler
- Adding custom attributes such as productID or login status to automatically track Spans, making it easier to query telemetry data
- Defining global attributes, such as environment name or app name and version, in order to segment all telemetry data using resource APIs
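Three of the scenarios above (a custom sampler, custom span attributes, and global resource attributes) can be sketched together in plain Python. This does not call the OpenTelemetry APIs: the header name x-synthetic-test, the RESOURCE values, and the start_span helper are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Mapping, Optional

# Global attributes attached to all telemetry (values are hypothetical).
RESOURCE = {
    "service.name": "shop",
    "service.version": "1.4.2",
    "environment": "staging",
}


def is_synthetic(headers: Mapping[str, str]) -> bool:
    """Sampler rule: treat requests flagged by a test header as synthetic."""
    return headers.get("x-synthetic-test", "").lower() == "true"


def start_span(
    name: str, headers: Mapping[str, str], **attributes: object
) -> Optional[Dict[str, object]]:
    """Return a span record, or None when the sampler drops the request."""
    if is_synthetic(headers):
        return None
    return {
        "name": name,
        "resource": dict(RESOURCE),   # same for every span in this process
        "attributes": attributes,     # per-span custom attributes
    }


dropped = start_span("GET /cart", {"x-synthetic-test": "true"})
kept = start_span("GET /cart", {}, product_id="sku-123", logged_in=True)
print(dropped, kept)
```

With attributes such as product_id on each span and the resource attributes on all of them, telemetry can be queried per product and segmented per environment or application version.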
Jump On the OpenTelemetry Bandwagon
According to the OpenTelemetry website, the project’s current goal is to provide a generally available, production-quality release in the second half of this year. Figure 4 shows the progress of the various SIGs, all of which are still in the alpha stage as of the publication of this article. Note that the Collector project refers to the OpenCensus Service described above, which is being modified to run with OpenTelemetry.
Figure 4: Current Status of OpenTelemetry Language SDKs. (Source: OpenTelemetry website)
In order to stay up to date on the progress of OpenTelemetry towards beta and general-availability releases, you should follow the OpenTelemetry website as well as the monthly blog posts for updates. In the January 2020 update, for example, there are up-to-the-minute SIG updates and an invitation to learn more about OpenTelemetry at the upcoming KubeCon EU conference in Amsterdam.
We at Thundra are closely watching the development of the OpenTelemetry project. Our serverless-native APM platform provides development, operations/SRE, security, and management teams with the end-to-end yet granular observability required to effectively troubleshoot and securely manage distributed applications across serverless architectures, containers, and virtual machines. Thundra’s novel distributed-application management approach dramatically simplifies the complex stack typically deployed to handle instrumentation, telemetry, observability, and remediation.
Currently, Thundra uses the OpenTracing API to implement instrumentation, and all Thundra tracing agents are OpenTracing-compatible. We are excited about the extended telemetry and observability integrations that OpenTelemetry promises to bring to us and our customers.