Moving quickly while keeping quality and availability high is a difficult balancing act, but it is essential to the success of any software development lifecycle. Many teams today are using DevOps techniques to keep their development process fast without losing the quality and availability their users expect.
CI/CD is at the center of this DevOps practice, serving as the essential bridge between your development and operations teams. Any failure in CI/CD immediately limits the benefits of adopting DevOps, making it one of the most important stages to build out effectively. Both the CI pipeline and the CD pipeline have to meet the development team’s current needs and stay robust at scale.
In this post, we’ll be discussing the CI pipeline specifically, including common problems that arise when scaling up and how to address them.
Anticipating CI Pipeline Scaling Issues
Several common issues can adversely impact CI and reduce DevOps success as measured by the DORA (DevOps Research and Assessment) metrics. These are the metrics to keep in mind when it comes to your CI pipeline:
- Deployment Frequency (DF)
- Lead Time (LT)
- Mean Time to Resolve (MTTR)
- Change Failure Rate (CFR)
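To make these four metrics concrete, here is a minimal Python sketch of how they might be computed from a list of deployment records. The `Deployment` record, its field names, and the `dora_metrics` helper are assumptions made for illustration; they do not come from any DORA tooling, and teams differ on details such as what counts as a failed change.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a failure in production?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def _mean_hours(deltas: List[timedelta]) -> float:
    """Average a non-empty list of timedeltas, in hours."""
    return sum(deltas, timedelta()).total_seconds() / 3600 / len(deltas)


def dora_metrics(deployments: List[Deployment], period_days: float) -> dict:
    """Compute DF, LT, MTTR, and CFR over a reporting period."""
    if not deployments or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")

    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    restore_times = [
        d.restored_at - d.deployed_at for d in failures if d.restored_at is not None
    ]

    return {
        "deployment_frequency_per_day": len(deployments) / period_days,
        "lead_time_hours": _mean_hours(lead_times),
        # MTTR is undefined when no failure has been restored yet.
        "mttr_hours": _mean_hours(restore_times) if restore_times else None,
        "change_failure_rate": len(failures) / len(deployments),
    }
```

Feeding this from your CI server's build history and your incident tracker gives you a baseline to compare against as the pipeline scales.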
When Bottlenecks Occur
As you might expect, when more developers are working on the codebase and development scales up, releases tend to become more frequent. The release process is therefore one of the first places to look for bottlenecks in the CI pipeline.
In a 2020 talk, Engineering Manager Mayank Sawney and I showed that Atlassian’s frontend codebase was a monolith in several areas. Breaking it down into microservices was the most obvious correction, but there remained a hidden threat: All of the releases depended on a single CI pipeline.
If this is not addressed, your team will hit a critical barrier when building for release at the same time. No matter how you piece out the monolith or segment your teams, there will be developers waiting for others to finish using the CI pipeline and testing environments.
Moreover, lead time and deployment frequency metrics will take a hit, especially for larger organizations with several teams held up at the same bottleneck. Smaller companies have less to worry about here, but it is a concern for anyone trying to scale up.
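A back-of-the-envelope model shows how quickly a shared pipeline turns into waiting time. The sketch below assumes that every build is submitted at the same moment and that builds are served first come, first served; the `average_wait` helper and the build durations are hypothetical, not measurements from any real system.

```python
import heapq
from typing import List


def average_wait(build_minutes: List[float], pipelines: int) -> float:
    """Average time a build spends queued before it starts.

    Assumes all builds are submitted at time zero and each goes to
    whichever pipeline becomes free first.
    """
    if pipelines < 1 or not build_minutes:
        raise ValueError("need at least one pipeline and one build")

    # Min-heap of the times at which each pipeline next becomes free.
    free_at = [0.0] * pipelines
    heapq.heapify(free_at)

    total_wait = 0.0
    for duration in build_minutes:
        start = heapq.heappop(free_at)   # earliest available pipeline
        total_wait += start              # the build waited from 0 until start
        heapq.heappush(free_at, start + duration)

    return total_wait / len(build_minutes)
```

Under these assumptions, six 10-minute builds wait 25 minutes on average when they share one pipeline and 5 minutes when they are spread over three. Real arrival patterns are messier, but the direction of the effect is the same.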
There are release-scheduling practices that can help avoid such bottlenecks, especially when used in tandem with effective change management. But this becomes messier as the codebase scales, with various parts of the system reflecting requirements from different release types. The build process ends up with so many parameters to account for that a single failed pipeline can stall everyone, which makes the CI pipeline a single point of failure.
This would negatively impact the change failure rate, as well as LT and DF metrics. It’s less likely to cripple the CI pipeline for smaller teams, but again, scaling up the codebase and application complexity increases the likelihood of this issue emerging.
Confidence and Flaky Tests
Tests that seem to have random results, succeeding or failing without any changes to the code, are called flaky tests. They often leave CI teams frustrated, because it is hard to tell whether a given failure points to a problem in the code or a problem in the test itself. This can also reduce the team’s confidence and lead some to dismiss the usefulness of testing altogether.
Unfortunately, scaling up the codebase often means more tests, which means the possibility of more flaky tests in the pipeline. But dismissing results as merely flaky and simply moving forward can cause issues and disruptions in production; on the other hand, trying to proactively look deeper into flaky test results adds to lead times or can even create a bottleneck. So, you end up having to weigh the importance of CFR versus LT metrics.
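One inexpensive way to separate flaky tests from real failures follows from the definition: if the same test both passes and fails on the same commit, the code did not change, so the test is a flakiness suspect. Here is a minimal sketch, assuming you can export test results as `(test_name, commit_sha, passed)` tuples. That tuple format and the `find_flaky_tests` helper are assumptions for illustration, not the output of any particular CI tool.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def find_flaky_tests(runs: Iterable[Tuple[str, str, bool]]) -> List[str]:
    """Return the names of tests that both passed and failed on one commit."""
    # Map (test, commit) to the set of outcomes observed for that pair.
    outcomes = defaultdict(set)
    for test_name, commit_sha, passed in runs:
        outcomes[(test_name, commit_sha)].add(passed)

    # Mixed outcomes with no code change are the signature of a flaky test.
    flaky = {name for (name, _), seen in outcomes.items() if seen == {True, False}}
    return sorted(flaky)
```

A test that fails consistently on a commit is not reported here, which is the point: those failures deserve a closer look at the code, while the reported tests can be quarantined or fixed without blocking everyone else's release.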
Modular CI Pipelines
When scaling up, it should be clear that CI pipelines ought to be modularized just like other components. Consider separating your CI servers, with isolated environments for each pipeline and server.
This keeps pipelines from becoming too complex and also solves the CI bottleneck problem. Separate servers for different teams, along with an effective approach to flaky tests, can keep multiple teams moving as you scale up.
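As a rough illustration of what modular pipelines can look like in practice, the sketch below picks which pipelines need to run based on the files a change touches, so that unrelated teams never queue behind each other. The pipeline names, directory layout, and the `pipelines_for_change` helper are all hypothetical; your repository structure will differ.

```python
from pathlib import PurePosixPath
from typing import Dict, Iterable, List

# Hypothetical mapping from pipeline name to the directory it owns.
PIPELINE_ROOTS: Dict[str, str] = {
    "checkout": "services/checkout",
    "search": "services/search",
    "web": "frontend",
}


def pipelines_for_change(
    changed_files: Iterable[str],
    roots: Dict[str, str] = PIPELINE_ROOTS,
) -> List[str]:
    """Return the pipelines whose directories contain a changed file."""
    selected = set()
    for changed in changed_files:
        path = PurePosixPath(changed)
        for name, root in roots.items():
            # A file belongs to a pipeline if the root is one of its ancestors.
            if PurePosixPath(root) in path.parents:
                selected.add(name)
    return sorted(selected)
```

With this mapping, a change that touches only `services/checkout` triggers only the `checkout` pipeline. Files outside every root, such as a top-level README, trigger nothing in this sketch; a real setup would need a deliberate policy for shared code.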
However, it is not easy to keep applications in sync while replicating servers. Plus, there are operational costs to keep in mind as you scale up with multiple pipelines and servers. This is where infrastructure as code (IaC) tooling helps: use it to define the separate CI servers and automate their spin-up.
On top of that, I suggest synchronizing the CI servers with the test environment, especially when a monolith is involved. There is a divide-and-conquer approach to resolve this problem, as Mateusz Trojak, Lead Infrastructure Engineer at Brainly, explains in this article.
There are also new concepts and technologies that can facilitate scaling CI pipelines. Perhaps the most important is observability, which provides critical insights into issues like flaky tests.
Thundra is a SaaS solution that provides the most important CI observability metrics. Thundra Foresight answers crucial questions about slow builds and failed tests, helping solve the biggest limitations to scaling CI.
A CI pipeline is at the core of a DevOps approach and central to maintaining quality, availability, and velocity. Right now, CI modularization is the best way to scale up effectively. There are a lot of pitfalls and pain points that come along with that process, but the key to avoiding them is observability and making sure you have the right tools to achieve this.