Software development has come a long way. We’ve transitioned from dedicated teams of software testers to automated tests that execute as features are integrated in CI pipelines. Organizations have realized that manual testing is time-consuming and expensive, and automated tests provide an elegant solution to these pain points.
But automated testing isn’t without issues of its own. A lack of visibility into the CI environment, where tests run, can be costly in both money and human capital. While automated tests are critical for assuring software quality, their value to teams diminishes rapidly as an application grows more complex and its pipeline starts to suffer from flaky builds, intermittent failures, delayed builds, and other problems.
Eventually, every software company gets to a point where their CI pipelines become slow, suffer from intermittent failure, and are unreliable. In this article, we’ll look at the missing piece in the agile development workflow: observability in CI. We’ll answer some of the most common questions engineers ask when they start automating their tests:
- Why do tests fail in the CI pipeline?
- What makes previously green, fast CI pipelines become unreliable and slow?
- Why do some tests fail intermittently?
Finally, we’ll show you some key solutions to these problems.
Why Tests Fail in CI Pipelines
CI/CD pipelines help steer changes through automated testing cycles, out to test environments, and finally to production. A green CI pipeline assures that integrated changes won’t introduce unforeseen side effects into the production environment.
But the CI build turns red, or fails, more often than we’d like. Though frustrating, failed CI pipelines are helpful warnings that something is broken in the software. But failed tests don’t always relate to a specific testing issue. There are many reasons why a pipeline may fail, and we’ve highlighted a few of these below:
- Environment disparity: Some tests fail because the CI environment differs from the one they were written in. For example, a test may pass locally but fail in CI because the two environments are configured differently.
- Buggy code: A bug in test or application code can cause a test failure in CI pipelines.
- Code changes: Tests need to evolve with business logic. When business logic changes, tests written to support the logic need to change as well. If you forget to evolve the tests with the business logic, they’ll fail.
- Caching: Caches, time manipulation, and stale data can cause unpredictable test failures in CI pipelines. Frequent test failures may be due to data inconsistency, for example when old content is loaded or new content is not loaded correctly.
- Test flakiness: Test failure in CI pipelines can occur as a result of flaky tests. Flaky tests are difficult to debug and are a drain on a developer’s time and resources. They may pass or fail when executed, and these inconsistent results in the CI environment make it harder to find the actual bug in the code.
- Infrastructural issues: Test failures can result from infrastructure-related problems, like external service failures, network problems, misconfigured modules, or a changed browser version.
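One way to tell a flaky test apart from a genuine regression is to look for tests that both passed and failed against the same commit. The sketch below illustrates that idea only; the `(test_name, commit_sha, passed)` record shape and the `find_flaky_tests` helper are hypothetical, and a real CI system would export its history in its own format.

```python
from collections import defaultdict


def find_flaky_tests(runs):
    """Return names of tests that both passed and failed on the same commit.

    `runs` is an iterable of (test_name, commit_sha, passed) tuples.
    This record shape is an assumption made for the example.
    """
    outcomes = defaultdict(set)
    for test_name, commit_sha, passed in runs:
        outcomes[(test_name, commit_sha)].add(passed)
    # Same code, different results: the signature of a flaky test.
    flaky = {name for (name, _sha), seen in outcomes.items() if seen == {True, False}}
    return sorted(flaky)


history = [
    ("test_login", "abc123", True),
    ("test_login", "abc123", False),    # same commit, different outcome
    ("test_checkout", "abc123", False),
    ("test_checkout", "def456", True),  # changed by a later commit, not flaky
    ("test_search", "abc123", True),
]
print(find_flaky_tests(history))  # prints ['test_login']
```

A test that fails on one commit and passes on the next is not flagged, because the code under test changed between the two runs.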
There are many more reasons why a test may fail in a CI pipeline than we’ve listed here. While these failures are often dismissed as annoyances, they can indicate serious issues in the software. In the next section, we’ll see why a broken pipeline is so costly.
The Hidden Cost of a Broken CI
As organizations evolve, their CI/CD pipelines become complex, slow, and unreliable. New issues arise every day, such as mocks not behaving like the real dependencies in production, shallow unit tests that are written for the sake of improving test coverage, tests that behave differently depending on the time of the day, complex end-to-end tests that take weeks to debug, and more.
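To make the "time of the day" problem concrete, here is a minimal sketch of a test that depends on the wall clock, next to a deterministic version that passes the time in explicitly. The `greeting` function is a hypothetical stand-in for any time-sensitive business logic.

```python
from datetime import datetime


def greeting(now=None):
    """Return a greeting based on the hour of `now` (defaults to the current time)."""
    if now is None:
        now = datetime.now()
    return "Good morning" if now.hour < 12 else "Good afternoon"


def test_greeting_depends_on_wall_clock():
    # Fragile: passes only when the CI job happens to run before noon.
    assert greeting() == "Good morning"


def test_greeting_with_injected_time():
    # Deterministic: the test supplies the time, so the result never
    # depends on when or where the pipeline runs.
    assert greeting(datetime(2024, 1, 1, 9, 0)) == "Good morning"
    assert greeting(datetime(2024, 1, 1, 15, 0)) == "Good afternoon"


test_greeting_with_injected_time()
```

The first test is defined only for contrast and is deliberately not called here. Passing the clock in as a parameter is one option among several; replacing the time source with a mock is another.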
The problems can seem endless. What’s worse is that they get little attention as development teams are tasked with more features to release to customers. The attention shifts to the end goal (features in production) and away from the process of getting features there.
While automated tests are a means to an end, and not the goal itself, a slow test suite can be very expensive and it’s worth dedicating time to fixing it. If your CI runs 50 times a day, 30 seconds of additional build time can add up to 152 hours of developer downtime over the course of the year.
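The arithmetic behind that figure is easy to reproduce. The sketch below assumes the pipeline runs every calendar day (365 days a year); the `extra_hours_per_year` helper is ours, written only for illustration.

```python
def extra_hours_per_year(runs_per_day, extra_seconds_per_run, days_per_year=365):
    """Total extra build time per year, in hours."""
    extra_seconds = runs_per_day * extra_seconds_per_run * days_per_year
    return extra_seconds / 3600


hours = extra_hours_per_year(runs_per_day=50, extra_seconds_per_run=30)
print(f"{hours:.0f} hours per year")  # prints "152 hours per year"
```

Counting only working days gives a smaller number (about 104 hours for 250 days), but it is still a significant cost.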
Without a proper CI pipeline, you’ll be stuck with a complex, time-consuming process that will affect planned deployments and feature releases. In short: you could be getting to market faster.
It can be hard to account for all of the lost hours and productivity that result from a slow or broken CI. Below is a short outline of some of these hidden costs.
Increased Infrastructure Spend
Most CI platforms bill based on build time. The longer your build, the more you pay. A slow CI increases your infrastructure spend and causes delays in delivery. And the stakes are high; a bad release can stop deployment trains.
Reduced Software Quality
A bad or slow pipeline is often behind deployment failures, which affect the agility and quality of the releases delivered to customers. And if something breaks during the deployment process, developers have to roll back the deployment, resulting in more delays.
The more frequently this happens, the more tempting it is to bypass failed tests in order to deploy fixes faster. With time, the quality of the software degrades and negatively impacts customers.
You’re not likely to see a happy developer with a slow or flaky CI. Finding the root cause of a flaky CI requires a considerable time investment. Without proper visibility, this could take hours or even weeks. When a developer can't readily troubleshoot a broken test, they're less likely to write new tests or care about current ones.
So far, we’ve looked at why a broken CI is expensive. Now let’s focus on how to maintain your CI pipelines and keep them in optimal condition.
How to Maintain Healthy CI Pipelines
There are several rules to follow if you want to get the best out of your CI pipelines. Below, you’ll find the three most important actions you can take.
- Don’t ignore failed tests: Too often, developers hit the “re-run test” button if a test fails. Even worse, some failures are ignored completely. This goes against the tenets of testing. If you deploy software with unresolved failed tests, your technical debt will increase, and you’ll pay for it in the long run.
- Don’t ignore slow-running tests: Slow tests reduce productivity. A slow test does not get better with time and ignoring it will inevitably affect your development process.
- Don’t fly blind: Monitor your CI environment just as you monitor production. This will help you spot test failures immediately, saving your organization time and money.
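As a rough illustration of what monitoring tests can mean in practice, the sketch below aggregates a failure rate and a mean duration per test and flags the slow ones. The `(test_name, passed, duration_s)` record shape, the `summarize` helper, and the 10-second threshold are all assumptions made for this example, not the output of any particular CI tool.

```python
from collections import defaultdict
from statistics import mean


def summarize(records, slow_threshold_s=10.0):
    """Aggregate per-test failure rate and mean duration.

    `records` is an iterable of (test_name, passed, duration_s) tuples,
    a hypothetical shape chosen for this sketch.
    """
    by_test = defaultdict(list)
    for name, passed, duration_s in records:
        by_test[name].append((passed, duration_s))

    summary = {}
    for name, results in by_test.items():
        failures = sum(1 for passed, _ in results if not passed)
        avg_duration = mean(duration for _, duration in results)
        summary[name] = {
            "failure_rate": failures / len(results),
            "mean_duration_s": avg_duration,
            "slow": avg_duration > slow_threshold_s,
        }
    return summary


records = [
    ("test_login", True, 1.0),
    ("test_login", False, 2.0),
    ("test_export", True, 14.0),
    ("test_export", True, 16.0),
]
for name, stats in sorted(summarize(records).items()):
    print(name, stats)
```

Even a simple report like this makes it harder to ignore a test that fails half the time or one that dominates the build duration.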
Monitoring Tests with Thundra Foresight
As noted above, a lack of visibility into your CI pipelines comes with a price. Luckily, Thundra Foresight has made it easy for developers to see what’s happening in the CI environment. Here’s an example of how Foresight can help.
Say you receive an alert an hour after pushing your last commit for the day, indicating that your CI pipeline has failed. Without observability, you have two options: re-run the pipeline and hope it passes, or dig through the test logs to figure out which test is failing.
It may take a minute longer, but you decide to do your due diligence and look at the logs. They show that a test unrelated to your commit failed. You check out your commit branch and rerun the failed test locally to debug it, only for it to pass. It quickly becomes clear that test logs are not enough to troubleshoot this failed test, and that’s doubly true for a complex project. You need a tool that captures every error and puts it in context in a central place.
Thundra Foresight does exactly that by providing insights like traces, logs, and metrics that make debugging tests easier. Foresight gives software engineers more visibility into end-to-end tests with distributed traces. It also helps monitor and optimize build duration, resulting in higher development productivity, lower CI costs, and better build visibility.
In summary, visibility is key. In the local environment, engineers have profilers and debuggers. In production environments, APM and observability tools are there to help troubleshoot issues. Sign up for Thundra Foresight and bring visibility into your CI pipelines as well.