Flakiness in tests
A flaky test is one that sometimes passes and sometimes fails without any code changes.
It’s interesting to discuss the reasons behind writing tests. We write tests to be sure that our code does what we expect, but why not do this manually?
The answer is simple: automated tests of any type (unit, integration, e2e, and others) are several times faster than humans testing manually. Furthermore, humans are bad at repeating the same test multiple times; we lose focus easily.
These two characteristics allow us to run our automated tests all the time, for every small change: we have created a short feedback loop that tells us our changes are safe.
Tests run by humans are usually very flaky; we are not good at repeating tasks multiple times, we get bored, and we start making mistakes without noticing.
In software, automated means fast and reliable (otherwise there is no point in automating). It’s true that we cannot automate everything, but it’s also true that we can automate much more than we are doing right now.
This is why, in software, flaky automated tests should be an oxymoron. There is no value in a test that I cannot trust.
What’s the effect of a flaky test?
Certain patterns emerge in a team when we have flaky tests:
Since we don’t trust the tests, we rerun them again and again until they are green. This is a problem because the tests are no longer fast, and we don’t want slowness in automated tests.
We move slow tests to the last steps of the pipeline, making the pipeline fail at the end and forcing us to repeat the process again and again. Long feedback loops.
Even worse, we start thinking about removing flaky tests from the pipeline and executing them manually, again working against the idea of automated feedback.
These are not problems with the flaky tests themselves, but with the team’s engineering practices. Let’s apply this principle, then:
“If it hurts, do it more frequently, and bring the pain forward.”
― Jez Humble, Continuous Delivery: Reliable Software Releases Through Build, Test, and Deployment Automation
So let’s make the pain happen earlier: let’s move those flaky e2e tests to lower levels. We know that e2e tests are flakier than integration tests, and integration tests are usually flakier than unit tests. So we can use flakiness as a prompt to ask ourselves whether a test should be rewritten as one or more unit or integration tests.
Flakiness in unit tests
If you have a flaky unit test, your design sucks. Redesign that part of the code to remove the flakiness of the test.
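As a hypothetical sketch of such a redesign: a common source of unit-test flakiness is code that reads the real clock, so the test result depends on when it runs. Injecting the time as a parameter removes the non-determinism (all names here are illustrative):

```python
from datetime import datetime, timezone

# Flaky design: the function reads the real clock, so any test
# asserting on the result passes or fails depending on when it runs.
def greeting_flaky() -> str:
    hour = datetime.now().hour
    return "Good morning" if hour < 12 else "Good afternoon"

# Redesigned: the clock is a parameter, so the test controls it.
def greeting(now: datetime) -> str:
    return "Good morning" if now.hour < 12 else "Good afternoon"

# Deterministic test: we pin the time instead of depending on it.
def test_greeting_in_the_morning():
    fixed = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
    assert greeting(fixed) == "Good morning"
```

The same move works for other hidden dependencies (randomness, environment variables, network): make the dependency explicit and the test becomes deterministic.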
Or, even worse, your unit tests don’t follow the “I” in the FIRST principles:
Independent: Tests should not depend on the state of the previous test, whether that is the state of an object or mocked methods. This allows you to run each test individually when needed, which is likely to happen if a test breaks, as you will want to debug into the method and not have to run the other tests before you can see what is going wrong. If you have a test that is looking for the existence of some data, then that data should be created in the setup of the test and preferably removed afterward so as not to affect the later tests.
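A minimal sketch of that idea, assuming a simple in-memory repository (the names are hypothetical): each test creates the data it needs in setup and removes it afterwards, so every test can run alone or in any order:

```python
import unittest

# Hypothetical in-memory store shared between tests.
USERS = {}

class UserRepositoryTest(unittest.TestCase):
    def setUp(self):
        # Each test creates the data it depends on...
        USERS[1] = "alice"

    def tearDown(self):
        # ...and removes it, so later tests start from a clean state.
        USERS.clear()

    def test_finds_existing_user(self):
        self.assertEqual(USERS.get(1), "alice")

    def test_missing_user_returns_none(self):
        self.assertIsNone(USERS.get(42))
```

Because neither test relies on another test having run first, you can execute any one of them individually while debugging a failure.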
Flakiness in integration tests
If an integration test is flaky, it’s probably because it tries to cover too many aspects that you don’t control in the test. Perhaps this is not the best test to write: try writing smaller unit tests that cover only the business logic, and use integration tests only for your side effects.
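As a hypothetical sketch of that split: instead of verifying discount rules through an integration test against a real payment gateway, extract the pure business logic so it can be unit tested deterministically, and keep only a thin side effect for the integration test (all names here are illustrative):

```python
# Pure business logic: deterministic, fast, trivially unit-testable.
def discounted_total(amount: float, is_premium: bool) -> float:
    rate = 0.10 if is_premium else 0.0
    return round(amount * (1 - rate), 2)

# Thin side effect: the only thing left for an integration test is
# that we actually call the gateway with the computed amount.
def charge(gateway, amount: float, is_premium: bool) -> None:
    gateway.charge(discounted_total(amount, is_premium))

# Fast, deterministic unit tests for the rules:
assert discounted_total(100.0, is_premium=True) == 90.0
assert discounted_total(100.0, is_premium=False) == 100.0
```

The integration test that remains only needs to prove the wiring works once, instead of re-proving every business rule through an uncontrolled dependency.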
E2E or Journey tests
Flaky e2e tests should be downgraded (let’s use that expression) to multiple unit, integration or contract tests.
Automated tests should be development tools: they allow refactoring, they let us rest easy, they tell us what cannot fail, they reduce uncertainty. But tests have no value on their own; they have a cost, because tests are code and we have to maintain them. So it’s important to have enough tests to cover all our functionality and scenarios, while minimizing overlapping tests that cover the same things several times. When tests overlap, remove the higher-level one.
This is also part of the testing pyramid:
There are more unit tests because they are the fastest and it’s simple to understand what is failing, but we cannot cover everything with them. We cannot cover integration with libraries, frameworks, or third-party systems. The harder something is to cover, the higher we have to go in the pyramid, but our effort should always be to test things at the lowest level possible.
E2E tests are useful for those parts that we cannot test at other levels.
The higher you go in the test pyramid, the more difficult it is to create short-lived environments to run your tests.
This is why we usually run our e2e tests in environments that try to resemble production. Flakiness is closely related to the degrees of freedom the system has: if our system is complicated and we run our automated tests against it, we will not be sure why things are failing; it will not be clear.
So let’s minimize e2e tests and use them only for the minimum required. For this to work, every role in the team needs to understand how the functionality is covered.
Trust
Another problem with higher-level tests is that different people use them for different things:
Devs use e2e tests to check that all the parts of the system are well integrated; they trust the other test levels because they wrote them.
QAs use them to verify that the feature works as expected, following all the acceptance criteria, so they want a lot of detail in the e2e tests.
POs, stakeholders, and people further from the technology understand them better, so they prefer them.
In fact, these discrepancies only reveal a lack of trust between the different roles in the team. In my experience this happens because QAs and Devs see each other as different steps: first we deploy, then we test. I think that’s a very bad idea; testing is part of development and is required for quality.
Another sign of this lack of trust is the number of long-lived environments in our path to production.
This, for me, is what a QA should champion: shifting quality left, pushing devs to write better tests and better code.
QA as a multiplier, not a goal-keeper
I don’t like the idea of QAs being the ones in charge of the quality of the system, responsible for demonstrating that things work; that way of working tends to create bureaucracy.
For me, a QA should be someone who pushes the team to improve quality: not a manual tester, not the only person who writes automated tests, and not someone who signs off stories.
A QA should be a multiplier, not a goal-keeper. The main focus of a QA should be to shift automated testing left and manual testing right. We are bad at repetitive tasks, but we are much better than machines at exploratory testing. Exploratory testing should be the only testing done by humans, and ideally it should be done in production.
For me, the QA is the one who pushes the team towards this: towards an architecture that allows manual testing in production, and a system reliable enough to be deployed to production at any time, because we have a fast, automated way to know that things are working.
So if we use flakiness as a signal to shift those tests left, we push quality into the moment of development. That usually means Devs and QAs working more closely together. Why do we need to wait until the story or task is done to show results? Why not do it continuously?
What do we need to have continuous feedback?
One way of doing this is through CI/CD: if we deploy our changes continuously to a controlled environment, we can show partial results continuously and receive feedback earlier, making exploratory testing easier for anyone in the team.
I know there is a lot of discussion about what “continuous” means; for me, it means at least once per day.
Flaky automated tests work against the very concept of “automated”. Use that feedback to solve the problem you have in your design, your team, or your organization. We should not live with flaky tests forever.