Posted Sep 2021 in Testing

Optimizing Long-Running Tests

Written by Serkan Özal


Founder and CTO of Thundra


Thundra’s engineering team has been building cloud observability solutions for years. In this time, we’ve discovered how crucial analytics and visibility are for the CI/CD process in general, and for tests specifically. After all, these parts of the development pipeline also run in the cloud, so if those tests fail, it’s often not clear why. This makes monitoring these remote environments all the more important.

That’s why Thundra created Foresight, a new observability solution geared to the needs of the development process. There are already tools that monitor your production environment, but Thundra Foresight is an observability tool that embraces the shift-left philosophy by providing insights into your CI/CD pipeline.

Long-running tests in the CI/CD pipeline are especially fascinating because, by their very nature, they can block the CI/CD process for longer periods of time. There are many factors that determine how long a test takes to run, and many of them are unrelated to each other—such as unoptimized algorithms, slow networks, or the simple fact that many different services are part of a single test. This makes locating and fixing tests a complicated and time-consuming task.

In this article, we’ll look into the different types of long-running tests. You’ll learn how Thundra Foresight can help you find and fix them—or at least mitigate their impact on your pipeline runtime.

Types of Long-Running Tests

There are two main types of long-running tests: intentional and accidental.

With intentionally long-running tests, the duration is either what they’re trying to test or intrinsic to the tested code.

For example, if you want to test how your system behaves after running for three hours, the test will always take three hours. You can’t take any shortcuts. It’s possible that after running the test you’ll conclude that the system didn’t change its behavior significantly after ten minutes, and you didn’t gain any new insights from the rest of the test. But that can only be determined after you’ve run the test for the full duration.

An example of a long-running test where an intrinsic behavior is tied to the code being tested could be a computer vision system that checks cars on a highway. If you want to test what happens after a thousand cars have been checked, you have to wait until all these cars have driven by.

In these instances, you may not like that these tests take so long to execute, but at least you know why. When you end up with a test that takes many times longer than you expected it to—that’s where things get interesting.

An example of accidentally long-running tests could be a checkout process. Obviously, the test will take some time, but how much isn’t clear. It could take a few hundred milliseconds, a few seconds, or it may take minutes and you won’t know how it came to this. In the best-case scenario, you just typed too many zeros in a timeout function, but in the worst case, your algorithms aren’t optimized for the workload, or your testing servers aren’t powerful enough for the task at hand.

These long-running tests can be optimized in multiple ways, but you have to know where to look. Some problems aren’t solved by throwing more hardware at them but by refactoring the code itself.

Regardless of the cause, long-running tests will slow down your CI/CD pipelines and increase development costs.

Problems with Slow CI/CD Pipelines

When you start a new project, a fresh CI/CD pipeline might take just a few seconds to run. You’ll push your changes multiple times a day and deploy a new release in the time it takes to open a browser and navigate to your application. But the more tests you write, the slower the process becomes, and eventually your team is waiting hours for their deployment.

This downtime can lead your team to look for other things to do. While you might appreciate their industriousness, multitasking can also mean a loss of focus and an increase in errors. If there’s nothing else for your team to work on, that’s not good either. Time still costs money, even if there’s nothing to show for it.

Long-running tests also have implications for the efficiency of your CI/CD pipeline as a whole. A process that takes longer due to high-performance requirements will be more expensive and could block other processes running on the same server. These long-running tasks also slow down tasks that come later in the CI/CD pipeline.

If the overhead of running the CI/CD pipeline becomes too great, people start building larger commits, with more changes bundled. And the people responsible for the pipeline might bundle multiple commits automatically to save time. These solutions are more error-prone, and fixing them is harder because you don’t know which changes caused an issue.

Solutions for Intentional Long-Running Tests

Intentional long-running tests can’t be accelerated on a per-test basis, but there are a few tricks you can try to mitigate their impact. Thundra Foresight makes these tricks easier to apply because you get all the metrics you need in a central dashboard.

Run Tests in Parallel

If you have multiple long-running tests and can’t improve their runtime, you can at least try to run them in parallel. This might not help if you only have one test server and the tests are particularly performance-intensive. But if your tests are idling through most of their runtime, parallelization could have a big impact on the time your whole CI/CD pipeline takes to execute.
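To get a feeling for what parallelization can buy you, here is a minimal sketch that estimates the wall-clock time of a run for a given number of workers. The function name `estimateWallTime`, the greedy scheduling strategy, and the durations are assumptions made for illustration; real test runners may distribute tests differently.

```javascript
// Estimate the wall-clock time of a test run when tests are spread across workers.
// Assumption: each test goes to whichever worker becomes free first
// (a simple greedy strategy; real test runners may schedule differently).
function estimateWallTime(durationsMs, workers) {
  const busyUntil = new Array(workers).fill(0)
  // Start the longest tests first so they don't end up last in the queue.
  const sorted = [...durationsMs].sort((a, b) => b - a)
  for (const duration of sorted) {
    const next = busyUntil.indexOf(Math.min(...busyUntil))
    busyUntil[next] += duration
  }
  return Math.max(...busyUntil)
}

// Hypothetical durations in milliseconds for four long-running tests.
const durations = [60000, 45000, 30000, 15000]
console.log(estimateWallTime(durations, 1)) // 150000, one test after another
console.log(estimateWallTime(durations, 2)) // 75000, two tests at a time
```

If you use Mocha 8 or later, the `--parallel` flag runs test files in separate worker processes, which is one way to put this into practice.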

Some vendors of hosted testing services built around tools like Selenium even let you run every test on a dedicated server in their cloud. This stops tests from slowing each other down and better isolates them from each other than if they were running in the same process.

Fail Quickly

Tests often fail because of bad configuration, which is usually detected right at the start. If you waited an hour to be told that the test had failed after ten seconds, that would be frustrating. Make sure your tests fail as soon as possible so you can fix the problem without needless wait times.
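One way to fail quickly is to check the configuration before any long-running work starts. The following is a minimal sketch; the helper names `findMissingSettings` and `assertSettings` and the `API_URL` setting are hypothetical and only serve as an illustration.

```javascript
// Collect the names of required settings that are missing or empty.
function findMissingSettings(env, requiredNames) {
  return requiredNames.filter((name) => !env[name])
}

// Throw right away instead of letting a long test discover the problem later.
function assertSettings(env, requiredNames) {
  const missing = findMissingSettings(env, requiredNames)
  if (missing.length > 0) {
    throw new Error(`Missing configuration: ${missing.join(", ")}`)
  }
}

// Hypothetical usage: call this at the top of a before() hook, e.g.
// assertSettings(process.env, ["API_KEY", "API_URL"])
```

Mocha’s `--bail` flag complements this by stopping the run after the first failing test.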

Figure 1 shows how Thundra Foresight illustrates the actual runtime of your tests. This knowledge helps you make more educated guesses about choosing the right timeouts and saves you from waiting for a few minutes “just to be safe.”

Figure 1: Thundra Foresight test duration history

Run Iterations that Get Longer

If you want to test something for one hour, test it for ten seconds first, then ten minutes, and so on. Often the shorter versions will show you issues that can be solved right away. You may even solve all of your problems with the shorter tests before ever getting to the longest one.

Instead of running a test like this:

describe("Usage analytics", () => {
  describe("Saved notifications", () => {
    it("Should equal 10000", async () => {
      const result = await client.get("/notifications", { limit: 10000 })
      expect(result.notifications.length).to.equal(10000)
    })
  })
})

It could be better to define the test as multiple versions that run longer each time.

describe("Usage analytics", () => {
  describe("Saved notifications", () => {
    it("Should equal 100", async () => {
      const result = await client.get("/notifications", { limit: 100 })
      expect(result.notifications.length).to.equal(100)
    })
    it("Should equal 1000", async () => {
      const result = await client.get("/notifications", { limit: 1000 })
      expect(result.notifications.length).to.equal(1000)
    })
    it("Should equal 10000", async () => {
      const result = await client.get("/notifications", { limit: 10000 })
      expect(result.notifications.length).to.equal(10000)
    })
  })
})
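If the list of sizes grows, writing each version by hand gets repetitive. As a sketch, the sizes could be generated and then used to define the tests in a loop. The helper `escalatingSizes` is a hypothetical name, and the commented Mocha usage assumes the same `client` and `expect` as in the examples above.

```javascript
// Build a list of growing test sizes, e.g. 100, 1000, 10000.
function escalatingSizes(start, max, factor) {
  if (start <= 0 || factor <= 1) {
    throw new RangeError("start must be positive and factor greater than 1")
  }
  const sizes = []
  for (let size = start; size <= max; size *= factor) {
    sizes.push(size)
  }
  return sizes
}

const sizes = escalatingSizes(100, 10000, 10)
console.log(sizes) // [ 100, 1000, 10000 ]

// In a Mocha suite, the list could drive the test definitions:
// sizes.forEach((limit) => {
//   it(`Should equal ${limit}`, async () => {
//     const result = await client.get("/notifications", { limit })
//     expect(result.notifications.length).to.equal(limit)
//   })
// })
```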

Sort Tests by Historical Runtime

It’s not always obvious which tests run longer than others, so it’s a good idea to record how long a test takes over time. Run your shorter tests before your longer ones. If your long-running tests are fine, but your quick tests fail, you’ll get your feedback sooner.

In the following example, the duration of every test is written to a local JSON file and used to sort the tests in the next run.

const fs = require("fs")

describe("User analytics", () => {
  describe("Saved notifications", () => {
    // Durations measured in this run; saved for the next run in after().
    const durations = []
    const recordDuration = (testName, testDuration) =>
      durations.push({ testName, testDuration })

    const tests = {
      "Should equal 1000": async () => {
        const startTime = Date.now()
        const result = await client.get("/notifications", { limit: 1000 })
        expect(result.notifications.length).to.equal(1000)
        recordDuration("Should equal 1000", Date.now() - startTime)
      },
      "Should equal 10000": async () => {
        const startTime = Date.now()
        const result = await client.get("/notifications", { limit: 10000 })
        expect(result.notifications.length).to.equal(10000)
        recordDuration("Should equal 10000", Date.now() - startTime)
      },
      "Should equal 100": async () => {
        const startTime = Date.now()
        const result = await client.get("/notifications", { limit: 100 })
        expect(result.notifications.length).to.equal(100)
        recordDuration("Should equal 100", Date.now() - startTime)
      },
    }

    // Durations from the previous run; on the first run there are none.
    const previous = fs.existsSync("./durations.json")
      ? JSON.parse(fs.readFileSync("./durations.json", "utf8"))
      : []
    const lastDuration = (testName) => {
      const entry = previous.find((d) => d.testName === testName)
      return entry ? entry.testDuration : 0
    }

    // Register the tests so that the historically quickest ones run first.
    Object.keys(tests)
      .sort((a, b) => lastDuration(a) - lastDuration(b))
      .forEach((testName) => {
        it(testName, tests[testName])
      })

    after(() => {
      fs.writeFileSync("./durations.json", JSON.stringify(durations))
    })
  })
})

As you can see in Figure 1, metrics are provided by Foresight right out of the box. Just click on a test and check how long it took in the past—no need to waste time browsing logs.

Keep Timeouts in Mind

Look at session timeouts for API tokens and request submissions. If they are misconfigured, they can slow everything down without giving you any gains. All it takes is a typo, and your ten-second timeout becomes a hundred- or thousand-second timeout.
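A small sketch of how such typos could be made harder to write and easier to catch: express timeouts in named units and flag values above an agreed ceiling. The helpers `seconds`, `minutes`, and `findSuspiciousTimeouts`, as well as the settings shown, are assumptions for illustration.

```javascript
// Express durations in named units so a stray zero is easier to spot.
const seconds = (n) => n * 1000
const minutes = (n) => seconds(n * 60)

// Return the names of all timeouts that exceed an agreed ceiling.
function findSuspiciousTimeouts(timeouts, ceilingMs) {
  return Object.entries(timeouts)
    .filter(([, ms]) => ms > ceilingMs)
    .map(([name]) => name)
}

// Hypothetical settings; the second value has one zero too many.
const timeouts = { request: seconds(10), session: 100000 }
console.log(findSuspiciousTimeouts(timeouts, minutes(1))) // [ 'session' ]
```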

Solutions for Accidental Long-Running Tests

Getting your accidentally long-running tests to behave comes with more options—and with Thundra Foresight, you can make informed decisions faster than ever before.

Delete Tests You Don’t Need Anymore

As your test suite grows, you’ll find that some tests supersede others, or that they test long-resolved issues. Review your test suite regularly so you don’t wait for tests that aren’t required anymore. After all, the fastest test is the one that’s not running.

Let’s look at the following example:

describe("Usage analytics", () => {
  describe("Saved notifications", () => {
    it("Should equal 100", async () => {
      const result = await client.get("/notifications")
      expect(result.notifications.length).to.equal(100)
    })

    it("Regression (fixed): Should include severity", async () => {
      const result = await client.get("/notifications")

      result.notifications.forEach(
        (n) => expect(n["severity"]).to.not.be.undefined
      )
    })
  })
})

Here we have two tests in a suite, one of which checks for a regression. In the past, a bug caused the severity field to go missing, so a developer wrote a test for it and fixed it.

Such tests are usually kept around after the initial fix to ensure the problem won’t crop up again, but when you have many tests that are this particular, or when they are long-running by definition, they can slow down your CI pipeline. This is why it’s a good practice to remove them after some time. A year, or even just a few months, is often enough.

That said, don’t just delete the tests. First, move them to their own test suite to make sure that they can be tracked. Then skip those tests during the next test run before actually deleting them. This way, you can go through your potentially irrelevant tests in one place, disable them at will, and enable them later if it turns out they’re still needed.

describe("Regressions", () => {
  it.skip("(fixed): Should include severity", async () => {
    const result = await client.get("/notifications")

    result.notifications.forEach(
      (n) => expect(n["severity"]).to.not.be.undefined
    )
  })
})

Provide More Powerful Resources

The nuclear, and often only, option to speed things up is to throw more hardware at it. And in the age of cloud computing, that’s usually not a bad idea. Back in the day, you had to buy a server for thousands of dollars. Today, you can rent one for a few minutes or even seconds, or rent multiple servers at once to run your tests in parallel.

Most people equate more powerful resources with higher costs, but that doesn’t have to be the case. Sometimes a better machine finishes the job so quickly that the absolute costs go down, even when the machine costs increase per minute. Now that you aren’t cornered into buying them outright, it might be worth using one of these mighty machines if it improves runtime by hours.
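The arithmetic behind this is simple enough to sketch. The prices and runtimes below are made-up numbers, not real cloud prices, and `runCost` is a hypothetical helper.

```javascript
// Total cost of a run: price per minute times runtime in minutes.
// Prices are in cents to keep the arithmetic in whole numbers.
const runCost = (centsPerMinute, runtimeMinutes) => centsPerMinute * runtimeMinutes

// Hypothetical numbers, not real cloud prices.
const smallMachine = runCost(2, 60) // 2 cents/min, 60 min runtime
const largeMachine = runCost(8, 10) // 8 cents/min, 10 min runtime

console.log(smallMachine) // 120
console.log(largeMachine) // 80
```

In this made-up case, the machine that costs four times as much per minute is still cheaper overall because it finishes six times faster.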

Overall, adequate resources make life easier for your engineers and free them up to deliver more value by focusing on their work. That extra investment in hardware can save time and thus reduce the cost of human resources.

Set Reasonable Timeouts

If a test is only long-running when it fails, this usually means you have a timing problem. Experiment with shorter timeouts so your test process doesn’t sit idle needlessly.

This example uses a timeout of 10 seconds when accessing an API, just to be safe.

describe("Usage analytics", () => {
  let client
  before(() => {
    client = new ApiClient({
      secret: process.env.API_KEY,
      timeout: 10000,
    })
  })
  describe("Saved notifications", () => {
    it("Should equal 100", async () => {
      const result = await client.get("/notifications")
      expect(result.notifications.length).to.equal(100)
    })
    it("Should be smaller than 1MB", async () => {
      const result = await client.get("/notifications")
      // Size of the serialized response in UTF-8 bytes.
      const bytes = Buffer.byteLength(JSON.stringify(result), "utf8")

      expect(bytes).to.be.below(1000000)
    })
  })
})

Here, two tests are accessing an external API. Now each one of them has the possibility of running for ten seconds. They might return early, but we can’t be sure. Rather than running the tests for a standard 10 seconds, it would be better to check how long they run on average and add a reasonable buffer on top.
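As a rough sketch, a timeout could be derived from recorded durations like this. The helper `suggestTimeout`, the default buffer factor of 2, and the sample durations are assumptions; pick values that match your own measurements.

```javascript
// Suggest a timeout from past durations: the average plus a safety buffer.
// bufferFactor = 2 means "twice the average"; this choice is an assumption.
function suggestTimeout(pastDurationsMs, bufferFactor = 2) {
  if (pastDurationsMs.length === 0) {
    throw new Error("Need at least one recorded duration")
  }
  const total = pastDurationsMs.reduce((sum, ms) => sum + ms, 0)
  const average = total / pastDurationsMs.length
  // Never suggest less than the slowest run seen so far.
  const slowest = Math.max(...pastDurationsMs)
  return Math.ceil(Math.max(average * bufferFactor, slowest))
}

// Hypothetical durations in milliseconds from three earlier runs.
console.log(suggestTimeout([400, 500, 600])) // 1000
```

With these numbers, the suggested timeout is one second instead of the ten seconds chosen “just to be safe.”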

Profile and Optimize Tests

Tools like Thundra Foresight will make your test process observable and help you find out which parts of a test are causing it to slow down.

Figure 2 shows a Thundra trace chart for a long-running test. It splits the runtime into different parts, including setup, test, and teardown. This makes it easy to find out what is taking the test so long. In that specific example, each operation takes almost twenty seconds, while the actual test finishes sooner.

We also see that the test used Amazon SQS and how long it took to talk to this service.

Figure 2: Thundra trace chart

Figure 3 shows the architectural graph of that test and provides even more detail about it. This includes all the resources utilized in executing the test, how often they were called, and how long it took to call them.

Figure 3: Thundra architectural graph

If you were to rewrite the code used in the test or the component you’re testing, could you use a better algorithm? Are you calculating unnecessary results that could be cut out? Is caching or memoization an option?

You can also mock your resources, such as third-party services or databases. If a Foresight trace tells you the database is the issue, it might be worth investigating different solutions. Sometimes an SQL server can be replaced with SQLite for testing purposes.

Think explicitly about caching test data. If tests depend on each other, reusing their data can lead to all kinds of problems. But if your data is immutable or only read by tests and never modified, you might not have to go through the expensive process of recreating it before every test.
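For read-only test data, a minimal caching sketch could look like the following. The helper `once` and the fixture are hypothetical; the counter is only there to show how often the data is really built.

```javascript
// Wrap an expensive, side-effect-free factory so it runs at most once.
function once(createFixture) {
  let cached
  let created = false
  return () => {
    if (!created) {
      cached = createFixture()
      created = true
    }
    return cached
  }
}

// Hypothetical fixture; the counter shows how often it is really built.
let buildCount = 0
const getNotifications = once(() => {
  buildCount += 1
  // Freeze the list so tests can't add or remove entries (the freeze is shallow).
  return Object.freeze([{ id: 1, severity: 1 }, { id: 2, severity: 2 }])
})

getNotifications()
getNotifications()
console.log(buildCount) // 1
```

This only pays off if no test modifies the shared data. As soon as tests write to it, recreating the data per test is the safer choice.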

Sort Tests by Historical Runtime

Just like you would with intentional long-running tests, run the quick ones first and the slower ones later. If your slow tests are fine and only the quick ones fail, you don’t have to wait.

Use Different Test Stages

Give every test type its own stage. Unit tests and smoke tests are usually much quicker to execute than integration and E2E tests. If a smoke test fails because your application server doesn’t even start, you save the time you would otherwise spend running E2E tests that are bound to fail.

In Figure 4, you can see that Hazelcast put their slowest test in its own suite to be executed in total isolation from other, quicker tests.

Figure 4: Thundra Foresight’s test suite view

Let’s look at an example where a long-running test in the middle of your test suite can slow down a number of other tests.


describe("All tests", () => {
  it("Server starts", () => {})
  it("Endpoints are configured correctly", () => {})
  it("Check all notifications for severity", () => {})
  it("Add and remove notifications", () => {})
  it("Client connects", () => {})
  it("Count all notifications", () => {})
})

After running the tests a few times with a tool like Foresight, you should have a feeling for their durations and split them up accordingly.

describe("Quick Tests", () => {
  it("Server starts", () => {})
  it("Endpoints are configured correctly", () => {})
  it("Client connects", () => {})
})

describe("Long-running tests", () => {
  it("Add and remove notifications", () => {})
  it("Count all notifications", () => {})
  it("Check all notifications for severity", () => {})
})
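Once the suites are split, the stages can run in order, and later stages can be skipped when an earlier one fails. Here is a minimal sketch of that idea; `runStages` and the stage objects are hypothetical, and in a real pipeline each `run` would start the corresponding test suite.

```javascript
// Run stages in order and stop at the first one that fails.
// Each stage is { name, run }, where run() returns true on success.
function runStages(stages) {
  const executed = []
  for (const stage of stages) {
    executed.push(stage.name)
    if (!stage.run()) {
      return { passed: false, executed }
    }
  }
  return { passed: true, executed }
}

// Hypothetical stages; the smoke stage fails, so the E2E stage never starts.
const outcome = runStages([
  { name: "unit", run: () => true },
  { name: "smoke", run: () => false },
  { name: "e2e", run: () => true },
])
console.log(outcome) // { passed: false, executed: [ 'unit', 'smoke' ] }
```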

Filter Tests by Code Coverage

If you figure out what code was covered by a test, you can filter your test runs by code coverage. This saves you from executing tests that don’t check the things that actually changed. Again, a test not run will always be the fastest.

Code coverage tools exist for different programming languages—for example, Istanbul for JavaScript and JaCoCo for Java.
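How the filtering itself could work is sketched below. It assumes you already have a mapping from each test to the source files it covers; the function `selectTests`, the shape of that mapping, and the file names are assumptions for illustration, not the output format of any particular coverage tool.

```javascript
// Select the tests whose covered files overlap with the changed files.
// coverageByTest maps a test name to the source files it touched.
function selectTests(coverageByTest, changedFiles) {
  const changed = new Set(changedFiles)
  return Object.entries(coverageByTest)
    .filter(([, files]) => files.some((file) => changed.has(file)))
    .map(([testName]) => testName)
}

// Hypothetical coverage data and file names.
const coverageByTest = {
  "Should equal 100": ["src/notifications.js"],
  "Client connects": ["src/client.js", "src/auth.js"],
}
console.log(selectTests(coverageByTest, ["src/auth.js"])) // [ 'Client connects' ]
```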

Only Run Critical Tests Synchronously

If you have too many tests with different levels of importance, it can be a good idea to make a subset of them mandatory. Some of them might be run in the background and trigger a rollback to a previous version if they fail.

For example, take a regression test that hasn’t failed for a long time. Before deleting it, you could move it into an optional test suite.
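One possible way to mark the mandatory subset is a tag in the test title. The sketch below assumes an `@critical` tag, which is a naming convention chosen for this example, and a hypothetical helper `splitByImportance`.

```javascript
// Split tests into a blocking set and a background set.
// Assumption: critical tests carry an "@critical" tag in their title.
function splitByImportance(testTitles, tag = "@critical") {
  const blocking = testTitles.filter((title) => title.includes(tag))
  const background = testTitles.filter((title) => !title.includes(tag))
  return { blocking, background }
}

const { blocking, background } = splitByImportance([
  "Server starts @critical",
  "Count all notifications",
  "Client connects @critical",
])
console.log(blocking) // [ 'Server starts @critical', 'Client connects @critical' ]
console.log(background) // [ 'Count all notifications' ]
```

With Mocha, the same split can be made on the command line: `--grep "@critical"` selects the tagged tests, and adding `--invert` selects the rest.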

Log Test Information In-between

If your test is long-running, it’s a good idea to log what happens while it executes, including assertions on intermediate results. This doesn’t accelerate your long-running tests and might even slow them down, but you’re not running a test for its own sake. You’re running it for the value it provides to you. Long-running tests are more expensive than short-running ones, so make sure they are worth your time.

Instead of writing big tests with assertions at the end, like this:

describe("User analytics", () => {
  it("Update notifications", async () => {
    const client = new Client({ secret: process.env.API_KEY })
    let result = await client.get("/notifications")
    // Copy every notification with its severity raised by one.
    let updated = result.notifications.map((n) => ({
      ...n,
      severity: n.severity + 1,
    }))
    result = await client.post("/notifications", updated)

    result = await client.get("/notifications")
    // Copy every notification and mark it as read.
    updated = result.notifications.map((n) => ({ ...n, read: true }))
    result = await client.post("/notifications", updated)

    result = await client.get("/notifications")

    result.notifications.forEach((n) => {
      expect(n.read).to.be.true
      expect(n.severity).to.equal(2)
    })
  })
})

Sprinkle assertions after every step, so you’re not greeted with a confusing error at the end of the test.

describe("User analytics", () => {
  it("Update notifications", async () => {
    const client = new Client({ secret: process.env.API_KEY })
    let result = await client.get("/notifications")
    // Copy every notification with its severity raised by one.
    let updated = result.notifications.map((n) => ({
      ...n,
      severity: n.severity + 1,
    }))

    updated.forEach((n) => {
      expect(n.severity).to.equal(2)
    })

    result = await client.post("/notifications", updated)

    result = await client.get("/notifications")
    // Copy every notification and mark it as read.
    updated = result.notifications.map((n) => ({ ...n, read: true }))

    updated.forEach((n) => {
      expect(n.read).to.be.true
    })

    result = await client.post("/notifications", updated)

    result = await client.get("/notifications")

    result.notifications.forEach((n) => {
      expect(n.read).to.be.true
      expect(n.severity).to.equal(2)
    })
  })
})
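Besides assertions, it can help to log each step together with the time that has passed since the test started. The following is a minimal sketch; `createStepLogger` is a hypothetical helper, and the clock and output function are injectable so the helper itself stays easy to test.

```javascript
// Create a logger that records each step together with the time since the start.
// The clock and sink are injectable so the helper can be tested or redirected.
function createStepLogger(now = Date.now, sink = console.log) {
  const startedAt = now()
  return (step, details = {}) => {
    const entry = { step, elapsedMs: now() - startedAt, ...details }
    sink(JSON.stringify(entry))
    return entry
  }
}

// Hypothetical usage inside a long-running test:
const logStep = createStepLogger()
logStep("fetched notifications", { count: 100 })
logStep("updated severity")
```

Each call prints one JSON line, which makes it easy to see in the logs how far a test got and how long each step took.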

Thundra Foresight gives you access to traces that show every service that was involved in your test. If your test only accessed half of the services it actually needed, that can be valuable information as you try to fix it.

Summary

Long-running tests can inhibit software delivery because they make it easy to slide into sloppy development practices. Commits get bigger and have to be grouped to avoid multiple test runs. Developers may lose focus as they switch between tasks or sit idle, neither of which leads to releases or revenue.

Luckily, there are multiple ways to ease the pain of long-running tests in your CI/CD pipeline, especially for accidentally long-running tests. Whether you’re dealing with code refactoring or simply provisioning more resources for the testing environments, Thundra Foresight makes it easy to optimize your test runs.

Put test analytics and visibility at your fingertips with Thundra Foresight.