Posted Nov 2021 · In CI/CD

Best Practices to Strengthen Your CI

Written by Sarjeel Yusuf

Product Manager @Atlassian


CI/CD is often considered the backbone of any DevOps process, as it is the stage in which software is thrown over the wall from Dev to Ops, moving from development to production, where Ops functionality takes over. Hence it is imperative that we get this right for the success of our overall development pipeline.

A failed or weak CI/CD process means that getting code to production could be arduously slow, and it could even raise the likelihood of incidents and disruptions when the CI/CD stage fails to catch buggy code. Both velocity and stability are directly impacted by weak CI/CD practices in the DevOps pipeline.

Unfortunately, this stage is also fragile. When going through the CI and CD process, issues may arise and things may fail, either in the form of test failures or pipeline execution failures. One case that is increasingly becoming an issue among developers is flaky tests: tests that run in the CI phase and randomly succeed or fail without any actual change in the code.

Moreover, with the rise of cloud microservices, it has become increasingly difficult to debug CI tests in a distributed environment. Many of the issues arise from the erosion of observability, a result of the cloud abstracting away the underlying layers of infrastructure.

Hence this piece goes over some best practices for strengthening this backbone of DevOps. We will use a demo example to walk through some of the pain points that exist and then work on mitigating them.

Setting Up the Demo Environment:

For the purpose of this piece, we will be using an open-source project made available by Thundra software engineer Tolga Takir. We will also be using Thundra to help us with observability for CI/CD, as Thundra Foresight provides the capabilities required for this piece.

Instructions on how to set up the demo environment can be found in the README of the project's repository. The application that we will be working on is a simple microservice. In fact, through Thundra APM we can see a system map as shown in the image below:

Now that we have the demo environment set up, let us go over some best practices that can be adopted to ensure a better CI process to boost DevOps practices.

Track Crucial Metrics

Observability and monitoring are not new concepts and are in fact a crucial part of the DevOps pipeline. However, they have traditionally been thought of as Ops domains. As you may have noticed, throughout this piece we have continuously made the distinction between Dev domains and Ops domains across the development story and DevOps pipeline. This of course does not sit well with the philosophy of DevOps, which aims to break down all silos and, in more fanatical terms, even the division of domains.

As a result, it can be expected that observability and monitoring, initially applied to understand the state of the application in production, will now also be applied to understand the state of versioning and tests. By tracking quality and time-based metrics, while leveraging metrics, traces, and logs in testing and debugging scenarios, we can effectively do away with the woes of traditional CI.

Using Thundra Foresight in our demo environment we can procure the following metrics:

On the Thundra dashboard, we can begin to detect failed and successful tests along with time-based metrics. For example, the slowest test, schedulerTest, can be identified and isolated from the slowest test suite, ‘KinesisSchedulerTest’. From there we can dive deeper into which parts of the test, reflecting our actual system, are causing performance bottlenecks. This can be done by leveraging the other pillars of observability, such as trace maps and logs.

The image above maps out the distributed trace data in a trace map. From the information displayed, it can be seen that the slowest calls are READ calls to DynamoDB. Similarly, we can dive into failed tests and look at the performance history of a test to identify flaky tests. This test, for example, is not a flaky test, as it is always successful, albeit slow.
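To make the idea of quality and time-based metrics concrete, here is a minimal sketch in Python. It is not how Thundra Foresight computes its dashboard; it simply assumes we have exported test run records ourselves. The `RunRecord` type, the `summarize` and `slowest_test` helpers, the `OtherSuite`/`fastTest` names, and all durations are hypothetical and made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RunRecord:
    """One execution of one test (hypothetical export format)."""
    suite: str
    test: str
    passed: bool
    duration_s: float


def summarize(runs):
    """Return pass rate and mean duration per (suite, test) pair."""
    grouped = defaultdict(list)
    for run in runs:
        grouped[(run.suite, run.test)].append(run)
    return {
        key: {
            # Quality metric: share of executions that passed.
            "pass_rate": sum(r.passed for r in group) / len(group),
            # Time-based metric: average wall-clock duration.
            "mean_duration_s": mean(r.duration_s for r in group),
        }
        for key, group in grouped.items()
    }


def slowest_test(summary):
    """Return the (suite, test) key with the highest mean duration."""
    return max(summary, key=lambda key: summary[key]["mean_duration_s"])


# Made-up sample data; only the first suite and test name come from the demo.
runs = [
    RunRecord("KinesisSchedulerTest", "schedulerTest", True, 12.0),
    RunRecord("KinesisSchedulerTest", "schedulerTest", True, 14.0),
    RunRecord("OtherSuite", "fastTest", True, 0.5),
    RunRecord("OtherSuite", "fastTest", False, 0.7),
]
summary = summarize(runs)
print(slowest_test(summary))
```

With data shaped like this, a test that is slow but always passes stands out in the time-based metric while keeping a pass rate of 1.0, which is the distinction drawn above between a slow test and a flaky one.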

Hence, with these metrics in place, we can list the major benefits:

  • Building trust in the CI/CD stage across teams with metrics that provide a ground reality status and understanding.
  • Providing insights crucial to the resolution of failed and flaky tests.
  • Reducing the risk of incidents and disruptions in production by providing an added layer of “debugging”.
  • Building resilience in the CI/CD stage and overall DevOps pipeline.

Chaos Engineering in CI

Since its introduction by Netflix in 2011, Chaos engineering has become an intrinsic part of ensuring the resilience and stability of software systems. Chaos engineering is the process whereby bugs and errors are injected into working code to assess the impact a disruption has on the running of the system.

This may seem counterintuitive to the goal of getting software out, but then again the goal should never be simply to push code to production. It should be to push reliable systems to customers in an effective manner. Considering this north star, it should become common practice to build and execute a Chaos engineering plan before going to production, especially since there is a lot of difference between a development environment and the production environment. Production is the final environment, in which Dev and Ops have little control and can only hope that the world interacts with the final product as intended.

Hence, the purpose of Chaos engineering is to simulate the uncertainty in production. When acknowledging the very real possibility of this uncertainty, the benefits of this practice during the CI stage become clear.

The feedback and assessment of these plans must be acted upon in an iterative manner, and as many cases as possible should be covered. At the end of the day, Chaos engineering is all about running experiments and validating or disproving hypotheses. Hence the steps to a successful Chaos engineering practice are as follows:

  1. Build the hypothesis - This is probably the most crucial step of the entire Chaos engineering practice, as the effectiveness of all the effort depends on how well the plan covers the ‘what ifs’. It is of course impossible to cover all of them, but by allocating ample time we can make sure that a significant amount of ground is covered.
  2. Inject failures and bugs - This is the execution of the experiment. It involves injecting real bugs and errors to simulate potential real-life disruptions, and thereby how the system would actually react to them.
  3. Measure the impact - Here we should aim to leverage monitoring tools that provide observability into the running of the system. This includes the availability of metrics, logs, and traces. Again, for this purpose we can use Thundra’s APM monitoring capabilities, where the CI tests executed can be dissected through the Thundra Foresight panel.
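The three steps above can be sketched as a tiny experiment in Python. This is only an illustration under stated assumptions, not Thundra's mechanism or any particular chaos tool: `InjectedFault`, `with_faults`, `read_item`, `fetch_with_retry`, and `run_experiment` are all hypothetical names, and `read_item` merely stands in for a downstream call such as a database read. The hypothesis being tested is that a client with three attempts still succeeds most of the time when 30% of reads fail.

```python
import random


class InjectedFault(Exception):
    """Raised by the wrapper to simulate a failing dependency."""


def with_faults(func, failure_rate, rng):
    """Step 2: wrap func so that a fraction of calls raise InjectedFault."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("injected failure")
        return func(*args, **kwargs)
    return wrapper


def read_item(key):
    """Stand-in for a downstream call (hypothetical)."""
    return {"key": key}


def fetch_with_retry(read, key, attempts=3):
    """System under test: tries the read up to `attempts` times."""
    for _ in range(attempts):
        try:
            return read(key)
        except InjectedFault:
            continue
    return None


def run_experiment(failure_rate, calls=1000, seed=42):
    """Step 3: measure the share of calls that still succeed."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    faulty_read = with_faults(read_item, failure_rate, rng)
    successes = sum(
        fetch_with_retry(faulty_read, i) is not None for i in range(calls)
    )
    return successes / calls


# Step 1, the hypothesis: with 30% injected failures, retries keep the
# success rate above 90%.
success_rate = run_experiment(failure_rate=0.3)
print(f"success rate under injected faults: {success_rate:.3f}")
```

In a CI job, the hypothesis would become an assertion on `success_rate`, so that a regression in the retry logic fails the build rather than surfacing in production.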


For demo purposes, I did not simulate a real-life scenario and instead simply failed the build manually. However, depending on the test plan decided upon, we can leverage Thundra for this purpose. To learn more about how to do so, check out AWS Serverless Hero Yan Cui’s blog on the subject.

Tackling Flaky Tests

This practice stems from the first practice described. As noted earlier, flaky tests are increasingly becoming an issue among developers: these are tests that run in the CI phase and randomly succeed or fail without any actual change in the code.

The reason behind a flaky test is usually obscure. Many times we do not know why the test is failing, and it is common practice to ignore the test and override the warning signs to continue with deployment. This is a slippery slope, as it begins to degrade the trust that we have in the CI process. We are really getting ourselves into a “boy who cried wolf” situation.

As described earlier, measuring the performance of each test across different builds allows us to identify flaky tests using Thundra Foresight. It is crucial to single out the tackling of flaky tests in the CI process because of the uncertainty they cause. Uncertainty is definitely not welcome, especially considering that CI/CD is the backbone of the DevOps pipeline.
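As a rough illustration of how flaky tests can be singled out from build history, the Python sketch below flags any test that both passed and failed on the same commit, that is, without any change in the code. This is one simple heuristic and an assumption on my part, not a description of how Thundra Foresight detects flakiness. The `find_flaky_tests` helper, the commit identifiers, and the `checkpointTest` and `brokenTest` names are hypothetical.

```python
from collections import defaultdict


def find_flaky_tests(results):
    """Return names of tests with mixed outcomes on the same commit.

    `results` is an iterable of (commit, test_name, passed) tuples.
    """
    outcomes = defaultdict(set)
    for commit, test, passed in results:
        outcomes[(commit, test)].add(passed)
    # A test is flagged when one commit has seen both True and False.
    return sorted(
        {test for (_, test), seen in outcomes.items() if len(seen) == 2}
    )


# Made-up build history for illustration.
history = [
    ("abc123", "schedulerTest", True),
    ("abc123", "schedulerTest", True),    # slow perhaps, but consistent
    ("abc123", "checkpointTest", True),
    ("abc123", "checkpointTest", False),  # same commit, different outcome
    ("abc123", "brokenTest", False),
    ("def456", "brokenTest", False),      # consistently failing, not flaky
]
print(find_flaky_tests(history))
```

Keying on the commit matters: a test that fails consistently, or one that starts failing after a code change, is a genuine failure to be fixed, whereas mixed results on identical code are what erode trust in the pipeline.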


In conclusion, a good CI/CD practice directly impacts the success of a team or organization’s DevOps practice. Hence it is imperative that best practices are sought out to ensure that the goals of DevOps, building faster with greater reliability, are achieved.

This piece introduced three major practices that every organization should consider adopting: enabling CI observability through metrics, logs, and traces, and leveraging it for both chaos engineering and flaky test resolution. It must be noted, though, that these are not the only practices, but they are definitely some of the most impactful actions that can be taken.

DevOps and cloud adoption will continue to evolve and grow. It is now time we consider such practices to facilitate the adoption of these concepts, so as to reap the benefits of our development pipelines.