The abundance of cloud services available today makes building web-scale products very accessible. Any company with sufficient resources can leverage infrastructure and platform services to build products that can serve thousands or even millions of customers. They can achieve economies of scale while providing a reliable service, without having to build everything from scratch.
However, hardware has its limits, and building a product at this scale inevitably requires a distributed architecture. Nowadays, such an architecture usually consists of many microservices. This approach brings challenges of its own, from the realities of the CAP theorem (e.g., having to choose between availability and consistency when the network partitions) to the complexity of deploying and monitoring the multitude of software components involved.
In this article, we’ll look at the challenges involved in adopting microservices and examine some approaches to troubleshooting issues in this kind of architecture. We’ll also discuss how to achieve synergy by combining approaches.
The Difference Between Monoliths and Microservices
Many business applications are initially developed as monoliths. This makes sense. Monolithic applications allow developers to rapidly develop core functionality, and keeping the application together makes it straightforward to debug and resolve issues.
As applications grow, however, the focus shifts from innovation to scalability and reliability. These objectives are achieved by decomposing the monolith into a number of disparate services, allowing them to scale independently. This is usually done by means of microservices, which are very granular and have a very specific purpose, thus allowing maximum flexibility.
However, this flexibility comes at a cost. It’s common to underestimate the operational complexity incurred by microservices, especially in terms of deployment and monitoring. Microservices also make things more difficult for developers, who need to transition from debugging a single application in their local environment to tracking issues across hundreds of microservices in a cloud environment.
Challenges of Change
Managing, monitoring, debugging, and troubleshooting distributed applications built on a microservices architecture is challenging in and of itself. But part of the difficulty comes from the transition between monolithic and microservices architectures and having to adapt to a completely new operating environment. In this section, we’ll discuss a few aspects of this transition.
Alienation from the Whole System
Employing microservices at the scale of companies like Netflix, Twitter, or Pinterest results in a situation where most developers can’t understand the entire system at a high level. This is especially true when microservices are owned by particular teams (often as per Conway’s Law), making teams just as interdependent as the microservices they maintain. As a result, developers have a tougher job debugging microservices when a problem involves a downstream service that they depend on but don’t understand.
Although microservices provide flexibility and loose coupling, the tradeoff is a loss of control. Each microservice is just one component sitting between other upstream and downstream systems, some of which might belong to a third party and be outside of the microservice developer’s control. The system can break in unpredictable ways whenever a service maintained by another team is updated.
Distributed microservices systems are much harder to troubleshoot than their monolithic counterparts. Aside from the difficulty of configuring and running a large number of microservices in a local environment, it is challenging just to trawl through logs looking for clues. There might be hundreds or thousands of microservices writing logs in different places, and huge amounts of log data make it difficult to locate helpful diagnostic information.
The best way to address this is to ship logs to a central location, structure them so that the right data can be meaningfully extracted, and leverage a search engine to quickly locate relevant information no matter how much log data is ingested.
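To make "structured" concrete, here is a minimal sketch that emits one JSON object per log line using only Python's standard library, so that a central log store could index each field. The field names (service, order_id, status_code) and the build_logger helper are assumptions made up for this illustration, not part of any particular logging product.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    # Hypothetical extra fields that callers may attach via `extra=`.
    EXTRA_FIELDS = ("order_id", "status_code")

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": getattr(record, "service", "unknown"),
            "message": record.getMessage(),
        }
        # Copy only the extra fields that were actually supplied.
        for key in self.EXTRA_FIELDS:
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)


def build_logger(name: str, stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("checkout", sys.stdout)
    log.info(
        "payment declined",
        extra={"service": "checkout", "order_id": "A-1001", "status_code": 402},
    )
```

Because every entry is a flat JSON object, a query such as "all entries where service is checkout and status_code is 402" becomes a field lookup rather than a free-text search.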
Centralizing logs helps to make sense of failures in a complex distributed environment, but it’s not enough on its own. Because a single request can traverse a number of different services, it might not be immediately clear whether log entries from different services are actually related to the same request.
The answer to this is distributed tracing, which helps to understand application behavior holistically. By correlating logs across different services, it becomes possible to follow the lifecycle of a request from the moment it begins, through every step in between, to the point where it either succeeds or fails. Assuming the logs contain the right level of detail, it’s sometimes possible to identify the cause of a problem from them alone.
Distributed tracing can be implemented in a number of ways, including:
- Rolling out your own solution by adding a unique correlation ID to all requests, and maintaining that same ID when talking to other dependent systems.
- Using open-source solutions based on OpenTelemetry.
- Purchasing third-party distributed tracing solutions.
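As a rough illustration of the first option, the sketch below keeps a correlation ID in a context variable, stamps it on every log record, and copies it into the headers of downstream calls. The X-Correlation-ID header name is a common convention assumed here, and begin_request, outgoing_headers, and CorrelationFilter are hypothetical names chosen for this example.

```python
import contextvars
import logging
import uuid
from typing import Optional

# Assumed header name; any name works as long as every service agrees on it.
CORRELATION_HEADER = "X-Correlation-ID"

_correlation_id = contextvars.ContextVar("correlation_id", default="-")


def begin_request(incoming_headers: dict) -> str:
    """Reuse the caller's ID if one was sent; otherwise start a new one."""
    cid = incoming_headers.get(CORRELATION_HEADER) or uuid.uuid4().hex
    _correlation_id.set(cid)
    return cid


def outgoing_headers(headers: Optional[dict] = None) -> dict:
    """Return headers for a downstream call, carrying the current ID along."""
    merged = dict(headers or {})
    merged[CORRELATION_HEADER] = _correlation_id.get()
    return merged


class CorrelationFilter(logging.Filter):
    """Stamp every log record with the current correlation ID."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = _correlation_id.get()
        return True


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.addFilter(CorrelationFilter())
    handler.setFormatter(
        logging.Formatter("%(correlation_id)s %(name)s %(message)s")
    )
    log = logging.getLogger("orders")
    log.addHandler(handler)
    log.setLevel(logging.INFO)

    begin_request({})  # edge service: no ID yet, so one is generated
    log.info("order received")
    downstream = outgoing_headers({"Accept": "application/json"})
    begin_request(downstream)  # simulates the next service receiving the call
    log.info("inventory reserved")
```

Both log lines in the usage example carry the same ID, which is what lets a central log search group them into a single request. A context variable is used rather than a global so that concurrent requests handled by the same process do not overwrite each other's IDs.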
A solid logging system is essential for providing visibility into the health of production systems, as well as for troubleshooting when things go wrong. However, logs might not always have all the information necessary to understand the cause of a problem, let alone fix it. Since adding more logs and redeploying several times can get tedious, the ideal would be to debug the application directly.
While debugging is easy to do in a local environment, it’s a different story in cloud environments where the infrastructure is no longer under your control. To debug a service, you would somehow need to connect to the machine where your code is hosted and debug it, which can be anything from arduous to impossible.
Even in situations where this is possible, traditional debugging does not lend itself to such environments. This is because hitting a breakpoint is a non-collaborative operation that pauses the application, blocking further progress until the application is resumed. This essentially results in the system being unavailable and unable to service requests. At best, this results in wasting your colleagues’ time in lower environments, and at worst it interrupts service for customers in a production environment.
The right way to do remote debugging in a cloud environment is to add non-intrusive breakpoints that capture the application’s context (such as variables, stack trace, etc.) when they’re hit, without interrupting the application at all. This provides all the relevant information necessary to understand and fix an issue. In an ideal world, you could even immediately apply a hotfix to verify the impact of the change, without having to go through the whole CI/CD process each time.
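The idea of a breakpoint that records context instead of pausing can be sketched in a few lines of Python. This is only a toy illustration of the concept, not how any particular remote debugging product works; SNAPSHOTS, capture_snapshot, and apply_discount are hypothetical names, and a real tool would inject such capture points at runtime and ship the data elsewhere rather than require a code change.

```python
import inspect
import time
import traceback

# Hypothetical in-memory sink; a real tool would send snapshots to a backend.
SNAPSHOTS = []


def capture_snapshot(label: str) -> None:
    """Record the caller's local variables and stack without pausing it."""
    frame = inspect.currentframe().f_back  # the frame that called us
    try:
        SNAPSHOTS.append({
            "label": label,
            "time": time.time(),
            "function": frame.f_code.co_name,
            "line": frame.f_lineno,
            # repr() keeps a point-in-time copy even if values change later.
            "locals": {k: repr(v) for k, v in frame.f_locals.items()},
            "stack": traceback.format_stack(frame),
        })
    finally:
        del frame  # avoid a reference cycle, as the inspect docs recommend


def apply_discount(price: float, percent: float) -> float:
    discounted = price * (1 - percent / 100)
    capture_snapshot("after-discount")  # behaves like a non-blocking breakpoint
    return round(discounted, 2)


if __name__ == "__main__":
    apply_discount(80.0, 25)
    snap = SNAPSHOTS[0]
    print(snap["function"], snap["line"], snap["locals"])
```

The function being inspected returns normally and other requests are never blocked; the variables and stack trace are simply available afterwards for whoever is investigating.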
Combining Remote Debugging with Distributed Tracing
Modern microservice applications are mostly served on the cloud, where they can leverage battle-tested infrastructure and platform services and scale as needed. However, the fact that they run remotely, sometimes without the developers having access to a server at all, makes it difficult to solve problems once code is deployed to the cloud.
Log management and aggregation, distributed tracing, monitoring, and debugging tools all help, but none of them is enough on its own to track down and solve many of the most difficult problems. Thundra Sidekick combines auto instrumentation, auto distributed tracing, remote debugging, and hotfix resolution, giving software teams the ability to collaborate while developing and debugging remote microservice applications.