It has been almost five years since AWS first announced its FaaS service, Lambda, back at re:Invent 2014. At the time, the reaction suggested that Lambda would be useful for handy functions on AWS, replacing the scripts triggered periodically to perform ephemeral jobs - e.g. provisioning a new EC2 server. We thought we wouldn't need to track the performance of these functions because they weren't performing critical jobs. It was enough to know that they worked.
Thanks to its vibrant community and AWS's strong support for other "serverless" services, serverless has since become the new way of developing modern microservices and applications. One can now set up an architecture composed of managed services that communicate with each other via asynchronous events. As a result, software companies are able to build robust, fast, reliable applications that cost almost nothing when there is no traffic. The following chart from Google Trends illustrates the rising interest in serverless over the past three years.
What's serverless again?
As technology geeks who love to argue over definitions, we in the serverless community have come up with many different definitions of serverless so far. From an operations point of view, serverless describes services that just work, without us having to manage their scalability and reliability. From that perspective, not only Lambda but also API Gateway and DynamoDB are serverless. One could even claim that Auth0 is serverless, because you don't need to manage the authentication process when you use it. On the other hand, such services violate an important serverless principle: pay-per-use. So the controversies never end, and serverless folks like it that way! Serverless hero Ben Kehoe takes a completely different approach: he defines serverless as a state of mind, similar to Paul Johnston, who defines it as a doctrine. According to this perspective, with which I completely agree, serverless is an enabler that lets companies finally focus on the value they can deliver to their customers instead of infrastructure management tasks. In this way, companies can outsource all of those concerns to the cloud vendor and ship better features faster.
From my point of view, as a product guy at a serverless observability company, serverless means complex, event-based, highly distributed systems in which tracking errors or performance bottlenecks is very hard. The good news is that it doesn't need to be that hard if you have complete observability over your system.
Observability: Yet another buzzword or a new kind of sorcery?
Speaking of observability, we should define it once again here. Many think that observability is yet another buzzword, but it is actually a superset of the traditional methodologies, such as monitoring and testing, that keep our systems operating. With monitoring, software professionals track metrics and notice issues by catching an abnormal peak or drop in a metric. Then they can check other systems and logs in order to understand the problem. In the first place, though, the aim of monitoring is to protect our system against the failures we already understand, by keeping us aware of such issues. After we successfully resolve an issue, it becomes something we are aware of, and we can write new tests to cover it in the next iterations of our software lifecycle. However, the night is dark and full of failures, especially for distributed systems. There are problems that we neither understand nor are aware of, and it is not certain that we can catch such issues with the metrics we have been tracking. You need a discipline that goes beyond monitoring and testing to understand the "unknown unknowns". This is where observability comes into play.
Observability enthusiasts like me claim that you need all three pillars aggregated in the same context in order to achieve proper operability and an optimized cost of running software. These pillars are traces, metrics, and logs. Metrics and logs are old friends that we are already familiar with. Traces, and especially distributed traces, are comparatively new; they give us a picture of the lifecycle of distributed transactions across different resources. If you lack one of these pillars, or if you don't aggregate them by their context, you will most probably fail to respond to an issue in your distributed architecture. All dashboards, visualizations, and machine learning algorithms are built on top of these pillars.
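As a minimal sketch of what "aggregated in the same context" means, the snippet below tags a log, a metric, and a trace span with one shared transaction id, so they can all be pulled together during an investigation. The `emit`/`handle_request` helpers and the in-memory `records` list are hypothetical stand-ins for a real telemetry backend, not any particular vendor's API.

```python
import time
import uuid

# Stand-in for the observability backend that would receive the telemetry.
records = []

def emit(kind, trace_id, payload):
    """Append a telemetry record tagged with the shared trace context."""
    records.append({"kind": kind, "trace_id": trace_id, **payload})

def handle_request(order_id):
    trace_id = str(uuid.uuid4())  # one context shared by all three pillars
    start = time.time()
    emit("log", trace_id, {"message": f"processing order {order_id}"})
    # ... business logic would run here ...
    duration_ms = (time.time() - start) * 1000
    emit("metric", trace_id, {"name": "order.duration_ms", "value": duration_ms})
    emit("trace", trace_id, {"span": "handle_request", "duration_ms": duration_ms})
    return trace_id

# Later, an investigation can pull every record for one transaction:
tid = handle_request("order-42")
related = [r for r in records if r["trace_id"] == tid]
```

Because every record carries the same `trace_id`, the log line, the latency metric, and the span for one transaction never have to be stitched together by guesswork afterwards.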
Challenges of Serverless Observability
Setting up the right observability for serverless is unfortunately not straightforward. To achieve observability with the solutions provided by the cloud vendor, in our case AWS, you can take advantage of two great services: X-Ray and CloudWatch. X-Ray provides the traces pillar, showing the timeline of serverless transactions, while CloudWatch provides the logs and metrics. However, it is still the developer's responsibility to aggregate the information provided by these services.
To deal with this issue, you might consider using the agents of existing monitoring providers, or implementing your own. However, this is not possible because of the nature of serverless: there is no place to install agents in the containers, because AWS kills the containers after some time. You should either develop a tool yourself or use an existing solution, such as Thundra, that works agentless. Your monitoring solution must perform all of its operations, such as gathering information and sending it to storage for processing, during the invocation. As you can guess, this increases the invocation time. Gathering the information causes a negligible overhead of a few milliseconds, but shipping it out requires setting up an HTTP connection and can take hundreds of milliseconds depending on the proximity of the destination. Yan Cui explains this in his article with the following image.
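The agentless pattern described above can be sketched as a handler wrapper: telemetry is collected while the function runs and flushed before the handler returns, because nothing can execute once the container is frozen. The `monitored` decorator and the `flush` function are illustrative assumptions; in a real setup, `flush` would be the HTTP round trip to the observability backend that adds the latency mentioned above.

```python
import functools
import time

# Stand-in for the HTTP POST to the observability backend; in a real
# Lambda, this network round trip is what adds the visible latency.
sent_reports = []

def flush(report):
    sent_reports.append(report)

def monitored(handler):
    """Wrap a Lambda-style handler: gather telemetry during the
    invocation and flush it before returning."""
    @functools.wraps(handler)
    def wrapper(event, context=None):
        start = time.time()
        error = None
        try:
            return handler(event, context)
        except Exception as exc:
            error = repr(exc)
            raise
        finally:
            flush({
                "handler": handler.__name__,
                "duration_ms": (time.time() - start) * 1000,
                "error": error,
            })
    return wrapper

@monitored
def hello(event, context=None):
    return {"statusCode": 200, "body": event["name"]}

result = hello({"name": "serverless"})
```

Note that the flush happens in a `finally` block, so the report is sent whether the invocation succeeds or raises; that synchronous send is exactly the overhead trade-off discussed above.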
The biggest challenge is, spoiler alert, the toughest one. Even if you manage to extract this intelligence from a single function, it doesn't give you enough information to fully understand what actually happened in a complex serverless architecture. You can start with a function triggered by API Gateway that saves records to DynamoDB, but as your needs evolve, your architecture will inevitably evolve like in the image below. Understanding the behavior of one function is not enough in such a situation; you need to understand the full lifecycle of serverless transactions. You'll need a proper way to understand and train Charizard! To achieve this, you'll need an automated tool that can provide distributed traces between various resources, no matter whether those resources are AWS-native or external resources such as Redis, MongoDB, or third-party APIs.
In order to achieve the required level of observability, you need an observability tool, such as Thundra, that automatically detects and displays new resources in your serverless architecture. You should also be able to track the events exchanged between resources during a transaction and discover outliers before they actually cause an issue.
During our interviews with companies adopting serverless, we saw that they still hadn't broken the habit of thinking of monitoring as a post-production concern. This is a trap anyone can easily fall into. Observability is a new paradigm for maintaining healthy products, but you should pay attention to it in the earlier phases of the software development lifecycle as well.
Observability Driven Development
While you are developing a serverless architecture, you should consider how you'll respond to an issue when you face one. You should cultivate a sense of responsibility for observability. To this end, Charity Majors introduced a new methodology called observability-driven development. It basically says that no matter how capable your observability tool is, it cannot know the risks and dangers in your architecture. You also can't guess the unknown unknowns in your system, but you can be prepared to ask wise questions when you face them. To do so, you need to create and send custom traces, metrics, and logs to your observability tool, and use a tool like Thundra to display this custom information.
Manual instrumentation enables software developers to send custom traces, metrics, and structured logs. Thanks to that, you can extract the context of a failure and start asking wise questions. All the information you send should enable you to ask new questions when you face a problem. If you already know the questions you'll ask but can't find the answers in your observability tool, you should ask your vendor to automate this; observability-driven development is only meaningful where automation doesn't help you. Here you can see how it can be done in Thundra:
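For readers who want the shape of the idea in code, here is a minimal sketch of manual instrumentation (not Thundra's actual API): a hypothetical `span` context manager records how long each named block takes, wrapped around a Fibonacci calculation and a DynamoDB write. The `save_to_dynamodb` function is stubbed out; in a real handler it would be a boto3 `put_item` call.

```python
import time
from contextlib import contextmanager

spans = []  # collected spans, standing in for the observability backend

@contextmanager
def span(name):
    """Hypothetical manual-instrumentation helper: records the
    duration of the wrapped block as a named span."""
    start = time.time()
    try:
        yield
    finally:
        spans.append({"name": name, "duration_ms": (time.time() - start) * 1000})

def fibonacci(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def save_to_dynamodb(item):
    # In a real handler: boto3 Table.put_item(Item=item);
    # stubbed out here to keep the sketch self-contained.
    pass

def handler(event, context=None):
    with span("calculate-fibonacci"):
        result = fibonacci(event["n"])
    with span("save-to-dynamodb"):
        save_to_dynamodb({"n": event["n"], "result": str(result)})
    return {"statusCode": 200}

response = handler({"n": 20})
```

With the two blocks instrumented separately, the recorded spans tell you whether the slow part of the invocation is your own computation or the downstream database call.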
As you can see, a new trace span is started before calculating the 2,000,000th Fibonacci number and finished right after the write to DynamoDB. Although this is a dummy example, it still shows how you can take advantage of manual instrumentation: you can see whether an inefficient piece of code is slowing down your function or creating an error. Here, we see that it takes 40ms to calculate the 2,000,000th Fibonacci number before we actually write the result to DynamoDB.
This blog post is the first part of a series explaining how serverless observability is useful in different phases of the software development lifecycle. In this part, we covered how it can be useful during development; in the next part, I will cover how it helps while testing serverless applications. If you want to chat about the various stages of serverless with us, ping us on Twitter (@thundraio) or join our Slack and let's chat. You can sign up for Thundra or watch the live demo to see Thundra in action.