AWS Lambda Canary Deployments with API Gateway

Jan 14, 2020

 


One of the great benefits of cloud architectures is the ability to easily replicate and deploy copies of application infrastructure. With some additional spending on redundant deployments, you can use this ability to improve the stability of your web application. A common first approach is the simple resource-swap deployment: new production resources are tested on a separate server and swapped into production once the new code is verified, a process known as blue/green deployment. This process depends on solid deployment management and, once in place, gives your development team a lot of flexibility.

In this article, we’ll explore the concept of canary deployments–an evolution of blue/green deployments–with a focus on a constant, uninterrupted user experience regardless of the deployment state of the application.

Some Background: Blue/Green Deployments

To introduce canary deployments, we’ll first need to delve a bit deeper into the concept of blue/green deployment practices. Blue/green deployments are intended to reduce downtime in production. In a blue/green system, your organization maintains two identical copies of your application’s production architecture: one designated “blue” and another designated “green.” One of these copies is exposed to the world at large, serving the needs of your users, while the other stays unavailable to the public.

This second server provides a simple smoke test environment on top of your production data stack. By using this inactive server as your deployment target, you can test your application’s behavior with the new code base prior to deploying the functionality to live users. This allows you to reduce downtime in deployments while also improving the overall resiliency of your application; the latter is achieved via easy fallback pathways in the event of sudden increases in scale or botched deployments.

How Canary Deployments Help

The blue/green approach can be cost-prohibitive, primarily because it requires duplicating all of your application’s production resources. While advancements in cloud-resource availability have made this kind of environment much easier to set up, it costs nearly twice as much to run as a single, non-redundant deployment.

Canary deployments aim to address this issue by reducing the need for duplicate application infrastructures. With some tweaks to your application architecture and your organization’s development practices, you can reduce the need for redundancy and deliver the same level of stability to your users.

Canary deployments shift the focus from releasing entire applications to releasing individual features within an application. So, instead of releasing all of your new features at once, you release the code for a new feature while slowly scaling the number of users that have access to the feature as time progresses. This is a longer release process in the absolute sense, since you need to take additional time to fully release a feature to your full user set, but the gains in scale and stability obtained as you grow this subset offset the delay in delivery. This also enables you to continuously release new features for your application without needing a specific deployment or release window. And your changes are deployed with no downtime whatsoever.

Challenges of Canary Deployments

While canary deployments have a lot of benefits that can reduce your overall application downtime, they are not without their own challenges. Below, we’ll look at the main problem areas presented by canary deployments. The first two, managing API contracts and managing database migrations, arise from the need to handle concurrent versions of code during feature rollouts. The final two, selecting a user subset and defining success, are specific to the canary deployment pattern; in a traditional web application they typically come up only during testing.

Managing API Contracts

The first challenge faced by a system converting to canary deployments is how to release new features progressively while staying backwards-compatible. A traditional deployment system, or even a blue/green system, can change API endpoint responses at will. With canary deployments, each new release must accept new request formats while still supporting the previous way of invoking the API endpoint. API versioning solves part of this, but significant refactoring may still be needed to retire previous versions gracefully before a rollout can truly be called complete.
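As a minimal sketch of what supporting both contracts can look like, the handler below accepts an old and a new request shape and normalizes them into one internal form. The payload formats, field names, and function names are hypothetical and chosen only for illustration; they do not come from any particular API.

```python
import json


def parse_customer(body):
    """Normalize old- and new-format payloads into one internal shape.

    Hypothetical formats, for illustration only:
      old: {"customer": "Ada Lovelace"}
      new: {"customer": {"first_name": "Ada", "last_name": "Lovelace"}}
    """
    if not isinstance(body, dict):
        raise ValueError("request body must be a JSON object")
    customer = body.get("customer")
    if isinstance(customer, dict):
        # New contract: structured name fields.
        first = customer.get("first_name", "")
        last = customer.get("last_name", "")
    elif isinstance(customer, str):
        # Old contract: a single free-form name string.
        first, _, last = customer.partition(" ")
    else:
        raise ValueError("missing or malformed 'customer' field")
    return {"first_name": first, "last_name": last}


def handler(event, context=None):
    """Lambda-style entry point that serves both contract versions."""
    try:
        body = json.loads(event.get("body") or "{}")
        customer = parse_customer(body)
    except ValueError as exc:  # also covers json.JSONDecodeError
        return {"statusCode": 400, "body": json.dumps({"error": str(exc)})}
    return {"statusCode": 200, "body": json.dumps(customer)}
```

Keeping the compatibility logic in one small parsing function makes it easier to delete the old branch once no callers depend on it.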

Managing Database Migrations

Database migrations can pose a significant challenge to canary deployments because, as with the API contracts above, the database must stay consistent with both the old schema and the new schema simultaneously. Generally, this isn’t a problem for queries that expand table dimensions (such as adding a column or index) or that insert records. But it does restrict queries that reduce table dimensions, as well as any queries that introduce new columns with a non-null requirement, since code running against the other schema can break.

As your database grows over time, the extra columns and duplicate data kept for compatibility can introduce performance issues and technical debt if they accumulate. Be sure to have a post-release migration strategy in place to ensure the data cleanup is completed.
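One way to picture this is an expand-then-clean-up sequence. The sketch below uses Python’s built-in sqlite3 module with an in-memory database; the `users` table, the `display_name` column, and the function names are assumptions made up for the example, not part of any real schema.

```python
import sqlite3


def expand(conn):
    """Expand step: add the new column as nullable, before new code ships."""
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


def old_version_insert(conn, name):
    """Old code path: unaware of display_name, so the column stays NULL."""
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))


def new_version_insert(conn, name, display_name):
    """New code path: writes the new column as well."""
    conn.execute(
        "INSERT INTO users (name, display_name) VALUES (?, ?)",
        (name, display_name),
    )


def backfill(conn):
    """Post-release cleanup: fill rows the old version wrote; returns row count."""
    cur = conn.execute(
        "UPDATE users SET display_name = name WHERE display_name IS NULL"
    )
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
old_version_insert(conn, "ada")                 # row that predates the migration
expand(conn)
old_version_insert(conn, "grace")               # old version is still serving traffic
new_version_insert(conn, "linus", "Linus T.")   # canary version
updated = backfill(conn)                        # run once the old version is retired
remaining = conn.execute(
    "SELECT COUNT(*) FROM users WHERE display_name IS NULL"
).fetchone()[0]
```

Only after the backfill, and once no old code is running, would it be safe to tighten the column to non-null or drop anything the old version relied on.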

Selecting User Subsets

A successful approach to canary deployments lets you both fully test the feature in production and improve its scalability as more users are added to the feature group. Focus on users that are fairly active in your system; a beta program can help find users who are the most likely to test the limits of your new features.

In addition, you’ll want to pay careful attention to the ways in which you measure the results of your experiments and rollouts. Create a control group, consisting of additional active users in your system, against which you can compare behavior. This will give you a clear point of comparison for determining the overall impact of the newly built components.
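If you select users inside your own application, one possible approach is to hash each user ID into a stable bucket and carve the canary and control groups out of adjacent bucket ranges. This is only a sketch: the salt, the group names, and the percentages are hypothetical, and it would typically be applied to a pool of eligible users such as beta participants.

```python
import hashlib


def bucket(user_id, salt="feature-x"):
    """Map a user ID to a stable bucket in the range [0, 100)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def assign_group(user_id, canary_pct, control_pct):
    """Return "canary", "control", or "default" for a user.

    The canary group takes the lowest buckets, so raising canary_pct
    only ever adds users to it and never removes existing ones. The
    control range starts where the canary range ends, so it shifts as
    canary_pct grows.
    """
    b = bucket(user_id)
    if b < canary_pct:
        return "canary"
    if b < canary_pct + control_pct:
        return "control"
    return "default"
```

Because the assignment depends only on the user ID and the salt, a given user sees consistent behavior across requests for as long as the percentages stay the same.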

Defining Success

The final challenge faced by canary deployment systems is determining what success means for each new release. Given the variability of potential feature releases, picking a consistent set of metrics against which you can measure the performance of the new code will be crucial in building confidence in the release. Conversely, it’s also critical to understand what a failed deployment looks like; is it only when the website stops responding, or are there response time thresholds that need to be met? Automating these metrics can provide you with crucial methods of detecting a failed deployment as quickly as possible.
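A definition of failure can be written down as explicit thresholds. The following sketch assumes two example metrics, error rate and 95th-percentile latency; the limits and the minimum sample size are placeholder values, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    max_error_rate: float = 0.01        # fraction of failed requests (assumed limit)
    max_p95_latency_ms: float = 500.0   # 95th percentile response time (assumed limit)


def evaluate(requests, errors, p95_latency_ms,
             limits=Thresholds(), min_requests=100):
    """Classify a canary as "pass", "fail", or "insufficient_data"."""
    if requests < min_requests:
        # Too few samples to judge either way.
        return "insufficient_data"
    if errors / requests > limits.max_error_rate:
        return "fail"
    if p95_latency_ms > limits.max_p95_latency_ms:
        return "fail"
    return "pass"
```

Treating a low request count as its own outcome avoids declaring success or failure on a handful of requests.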

AWS Lambda Deployments

AWS Lambda functions are deployed as individual units, with code changes taking place as a single deployment event. Because each deployment replaces a unit of code all at once, your serverless functions will need to be built to accommodate a gradual shift of traffic between the old and new versions.

Without a strategy in place, canary deployments will be challenging to implement. One approach is to implement a dispatch layer, essentially a load balancer for Lambda functions. With some simple mapping and data manipulation, you can control the percentage of traffic routed to each Lambda function, letting you scale your deployments, albeit at the cost of additional development complexity. This is a process that begs for automation, and luckily AWS offers Lambda Aliases to do just that.
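To make the idea of a hand-rolled dispatch layer concrete, here is a minimal sketch of a weighted router. The handlers are plain Python callables standing in for the old and new function versions; `make_dispatcher` and its parameters are hypothetical names used only for this example.

```python
import random


def make_dispatcher(stable, canary, canary_weight, rng=None):
    """Return a function that routes each event to `canary` with
    probability `canary_weight` and to `stable` otherwise."""
    if not 0.0 <= canary_weight <= 1.0:
        raise ValueError("canary_weight must be between 0.0 and 1.0")
    rng = rng or random.Random()

    def dispatch(event):
        # rng.random() is in [0.0, 1.0), so a weight of 0.0 never picks
        # the canary and a weight of 1.0 always does.
        target = canary if rng.random() < canary_weight else stable
        return target(event)

    return dispatch
```

Even this small version hints at the extra complexity involved: the weight has to be stored, updated, and monitored somewhere, which is the work that a managed routing feature takes over.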

AWS Lambda Aliases and Canary Deployments

A Lambda Alias is like a pointer to a specific Lambda function version. Callers invoke the alias rather than a version-specific ARN, and the alias forwards each request to the version it points to. When a new version is created, you can simply update the alias to point to the new version without needing to update the calling application. In addition, an alias can split its requests between two different versions of the same function, with each version receiving a configured percentage of the traffic. This allows you to scale new function deployments as needed and identify issues as early as possible, without significantly degrading user experience.
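The sketch below shows roughly how such a traffic split could be set from Python. It assumes you pass in a Lambda client such as the one returned by `boto3.client("lambda")` and relies on that client’s `update_alias` call with its `RoutingConfig` parameter; the helper names, function name, alias name, and version numbers are placeholders for illustration.

```python
def shift_alias_traffic(lambda_client, function_name, alias_name,
                        stable_version, canary_version, canary_weight):
    """Send a fraction of the alias traffic to the canary version.

    `lambda_client` is assumed to be a boto3 Lambda client, or any
    object exposing a compatible `update_alias` method.
    """
    if not 0.0 <= canary_weight < 1.0:
        raise ValueError("canary_weight must be in [0.0, 1.0)")
    # An empty weights map clears any existing split.
    weights = {canary_version: canary_weight} if canary_weight > 0 else {}
    return lambda_client.update_alias(
        FunctionName=function_name,
        Name=alias_name,
        FunctionVersion=stable_version,  # primary version gets the remainder
        RoutingConfig={"AdditionalVersionWeights": weights},
    )


def promote_canary(lambda_client, function_name, alias_name, canary_version):
    """Point the alias fully at the new version and clear the split."""
    return lambda_client.update_alias(
        FunctionName=function_name,
        Name=alias_name,
        FunctionVersion=canary_version,
        RoutingConfig={"AdditionalVersionWeights": {}},
    )
```

For example, `shift_alias_traffic(client, "my-function", "live", "1", "2", 0.1)` would ask the alias to send about 10% of requests to version 2 while version 1 keeps the rest. Rolling back is the same call with a weight of 0.0.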

Choosing Success Metrics

When implementing a canary deployment system for a serverless application, choosing the appropriate success metric for your deployment will be an important step. As Lambda Aliases only give us the capability to shift traffic based on a percentage of overall requests, you’ll want to focus on code-independent metrics as much as possible when automating your canary deployments. For identifying failures, this can consist of increased error rates, 500 responses from the server, increased request response times, or any other common application failure pattern. 
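An automated rollout built on code-independent metrics might be structured like the loop below. It is a sketch under stated assumptions: `set_canary_weight`, `is_healthy`, and `promote` are hypothetical callbacks you would supply (for instance, wrappers around alias updates and a metrics query), and the step sizes are arbitrary.

```python
def run_rollout(set_canary_weight, is_healthy, promote,
                steps=(0.05, 0.25, 0.5)):
    """Increase canary traffic step by step; roll back on the first failure.

    `is_healthy` is assumed to block until enough traffic has been
    observed at the current weight, then return True or False.
    Returns True if the canary was promoted, False if it was rolled back.
    """
    for weight in steps:
        set_canary_weight(weight)
        if not is_healthy():
            set_canary_weight(0.0)  # roll back: all traffic to the old version
            return False
    promote()  # every step passed, so switch fully to the new version
    return True
```

The health check is deliberately opaque here; it could compare error rates and response times against whatever thresholds your team has agreed on.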

For determining the success of your deployments, you can make use of log statements, user events, or simple analysis of the data flow in your system to seek out the desired improvements. You can also follow metrics such as error rates and response durations via Thundra by actively checking your console or setting up alerts specific to the deployment process. Actionable alerts by Thundra send you notifications with details of failures as they occur, letting you take immediate rollback actions as needed.

While this information can all be gathered within your application, tools like Thundra can help you respond to the flow of data in real time. Such a third-party tool allows you to easily aggregate events from multiple sources, giving you real-time observability that would be costly to implement otherwise.

Conclusion

A good deployment system should minimize application downtime while also giving your engineers confidence that the deployed code will function properly. Canary deployments allow you to shift your application’s production deployment patterns to a progressively scaling approach. This allows you to improve delivery quality by minimizing the side effects of botched deployments and also improve user experience via performance measurements of the new code as it scales. 

With Lambda Alias Routing and traffic shifting, you can easily implement canary deployments that scale based on your overall Lambda function usage. And by coupling this with a third-party tool such as Thundra, you can build an automated canary deployment system that ensures minimal application downtime while also giving you confidence in the stability of your application.