Long intermezzo

We got ourselves ‘another-sudden-flaky-test’ on Friday morning while running the regression part of our integration tests. The issue was nothing complicated: our internal API service timed out during an operation, and we found this error message when we investigated our CI logs:

.venv/lib/python3.12/site-packages/httpx/_transports/default.py:118: ReadTimeout
=========================== short test summary info ============================
FAILED tests/authentication/test_login_api.py::test_login_success - httpx.ReadTimeout: The read operation timed out

It turned out that our HTTP client had no retry logic to cover reliability hiccups like this one. So after a few iterations, I decided to implement a straightforward retry wrapper that gracefully handles connection timeouts, read timeouts, and network errors (all of them are httpx exception classes). Here’s the retry logic:

import time
from typing import Callable

import httpx

# existing code ...
def _request_with_retry(self, fn: Callable[[], httpx.Response]) -> httpx.Response:
    last_error = None
    for attempt in range(1, self.retries + 1):
        try:
            return fn()
        except (httpx.ReadTimeout, httpx.ConnectTimeout, httpx.NetworkError) as e:
            last_error = e
            if attempt < self.retries:
                # linear back-off: retry_delay * 1, * 2, * 3, etc
                time.sleep(self.retry_delay * attempt)
    # every attempt failed: re-raise the last error (assumes self.retries >= 1)
    raise last_error

The integration tests remain unchanged, since they should not know about the retry process at all. With this in place, the client:

  • Retries up to 3 times when one of those exceptions is raised
  • Waits from 1 second up to 3 seconds or longer between attempts (linear backoff, no need for fancy things for now)
  • Re-raises the original error once all retries are exhausted, so the test still fails with a clear error message if our API service is really down or having a disruption
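
To make the behaviour in the list above easier to check, here is a minimal standalone sketch of the same linear backoff. It is not our client code: `call_with_retry`, the `retry_on` parameter, and the injectable `sleep` are hypothetical names, and it uses the built-in `TimeoutError` and `ConnectionError` as stand-ins so it runs without httpx. As in the loop above, `retries` counts total attempts.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    retries: int = 3,
    retry_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn up to `retries` times, sleeping retry_delay * attempt in between."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fn()
        except retry_on as e:
            last_error = e
            if attempt < retries:
                # linear back-off, same shape as the client helper
                sleep(retry_delay * attempt)
    # all attempts failed: surface the last transient error unchanged
    raise last_error


# Demo: a fake call that times out twice, then succeeds.
calls = {"count": 0}
delays = []


def flaky_call() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated read timeout")
    return "ok"


# Record the waits instead of actually sleeping.
result = call_with_retry(flaky_call, sleep=delays.append)
print(result, calls["count"], delays)  # prints: ok 3 [1.0, 2.0]
```

Passing `delays.append` as `sleep` keeps the demo instant while still showing which waits would have happened. In the real client, the httpx exception classes would take the place of the built-in ones in `retry_on`.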

That worked well: when we re-ran the tests, they passed without any errors [1]. In the midst of this, one of my peers suddenly asked me, “Should I mark those particular tests as skipped/flaky when we have issues like this? Because if I don’t, I’m afraid my KPI won’t be safe.” And I was like, SERIOUSLY? I’m glad you asked that question.

That question led to one thing after another, and eventually to this post.

Core principles of retry during integration tests

Before I jump into why this leads to business/personal metrics, let me start with the core principle behind handling retries in the HTTP client: separation of concerns. Each component should care about and manage only its own responsibility and its own logic. For example:

  • HTTP/API client: handles network concerns such as retries, HTTP transport, and the like
  • Test specification: validates business logic, service contracts, response assertions, and the like

If we put try/except in the test specification, we mix two different responsibilities in one place. Am I saying this is a bad thing? It depends. For a small set of integration tests, it can make sense to handle it right there in the tests, since it’s quite practical and straightforward.

However, once your integration/contract tests grow bigger and more spread out, with a lot of components piling up together, those convoluted tests end up managing two things: whether the service can be reached and whether the response is correct. The first one is not their job.

As another practical consideration, based on my experience so far, there are at least 2 other reasons why I tend not to put them in one place:

  • First, tests become harder to read. The intention gets buried under other logic. Let’s take a look at this example:
# what is this test about?
def test_login_success(api_client, config):
    try:
        response = api_client.post("/auth/login", json=payload)
    # catching the timeout?
    except httpx.ReadTimeout:
        pass
    # or asserting the response?
    assert response.status_code == 200

The code above is a simplification of one of many test cases in place. Imagine having more assertions or logic to validate other dependencies, with chained method calls. All of that test logic becomes much more complicated to read, understand, and maintain.

  • Second, retry logic gets duplicated easily, which makes maintenance much harder. If you have 100 different test cases and put a try/except statement in each one, you have to maintain all of them (unless you fully utilize AI, then it’s fine). Personally, I think tests should be deterministic (a clear pass or fail). The result must reflect the actual API’s behavior, not network flakiness.

Should you mark tests as flaky or skipped?

When my peer asked me that question, I deliberately answered him with:

“No. We shouldn’t do that unless you have a legitimate reason for using flaky marking.”

In my experience, marking a test as flaky or skipped is essentially saying: we know this test fails sometimes, we pretend that being aware of it is enough, and we move on to the next thing.

Marking a test as flaky or skipped can cover up the actual issues happening behind our test specification. It’s kinda like:

  • We start to normalise real failures. Either the team starts treating them as not worth investigating, or we just move on since we have a lot of stuff to do
  • We re-run the CI jobs instead of asking “Why did this happen?”. Thus, the flaky test potentially gets ignored over time
  • It hides actual degradation in service reliability. When you mark a particular test as flaky/skipped, there’s a chance that those services were unreliable in the first place. Left alone, it will gradually erode trust in the entire system or test suite, and cast doubt on all failures

That being said, the only valid reason to use flaky marking is as a short-term, temporary truce while investigating and fixing the actual root cause in the particular services.

Without that follow-up, I’m convinced it becomes another piece of technical debt that nobody is going to clean up or investigate. You know, it will have a snowball effect afterwards…

Business metrics tension against engineering practices

I 100% relate to and understand my peer’s concern when he said that his KPI for automation test pass coverage wouldn’t reach the minimum target. I’d like to think that this is a really common tension, probably a universal experience (am I wrong or what? LOL). When it comes to this, we sometimes face an uncomfortable choice between pushing engineering ideals and sticking strictly to the target, doing whatever it takes to make the numbers conform.

Speaking from experience, a suitable approach is separating reliability issues from quality gates. Instead of masking failures, we can categorize them using tag annotations so they can be excluded from the test pass coverage. For example:

@pytest.mark.known_issues(reason="unstable third-party from auth services")
def test_login_with_google_api():
    ...

This way, the broken service is still tracked and visible to everyone, and we don’t lie about the test pass metrics. The downside is that this can be a hassle, with many things needing attention. The core principle is that we want to be able to make a logical argument like “90% of the tests passed, with the expected failures accounted for”, while acknowledging that our distributed systems have dependencies that we can’t fully control.
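
One practical detail: `known_issues` is a custom marker, so pytest needs to be told about it, otherwise it warns about an unknown mark (or errors out when run with `--strict-markers`). A minimal sketch, assuming a `conftest.py` at the root of the test suite; the description text is my own wording, not something from our repository:

```python
# conftest.py (assumed location: root of the test suite)


def pytest_configure(config):
    # Register the custom marker so pytest recognises it.
    config.addinivalue_line(
        "markers",
        "known_issues(reason): test depends on a service with a tracked, known problem",
    )
```

With the marker registered, `pytest -m "not known_issues"` runs only the tests that count toward the quality gate, and `pytest -m known_issues` runs the tracked ones separately so they stay visible.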

However, there’s a big gap between “right practice” and “what actually happens in a team” in real-world scenarios, especially when I (or even you) am not the one setting the rules. Realistically speaking, it’s often hard to change the targets once they’ve been decided by higher-ups. On top of that, we can’t force the related teams to ship bugfixes just so we get a good test pass rate, and at the same time we also can’t “explicitly lie and cover up” our work.

The best things we can do for the long term:

  • Consistently apply accurate annotations to our test suites when a test is truly failing or flaky
  • When a test becomes flaky at some point, make sure it’s tracked and logged properly, so there’s a convincing argument that it’s worth investigating
  • Exclude the flaky/skipped tests from the test pass rate metrics, but still give a clear and honest explanation
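
The last point can be sketched as a small calculation: the pass rate is computed only over the tests that count toward the gate, and the excluded ones are listed by name instead of disappearing. This is an illustration under my own assumptions; `Outcome`, `pass_rate_report`, and the sample data are made up and not taken from our CI.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Outcome:
    name: str
    passed: bool
    known_issue: bool = False  # True for tests tagged as a tracked known issue


def pass_rate_report(outcomes: List[Outcome]) -> Dict[str, object]:
    """Compute the pass rate over gated tests and list the excluded ones."""
    gated = [o for o in outcomes if not o.known_issue]
    known = [o for o in outcomes if o.known_issue]
    passed = sum(1 for o in gated if o.passed)
    rate = passed / len(gated) if gated else 0.0
    return {
        "gated_total": len(gated),
        "gated_passed": passed,
        "pass_rate": rate,
        # excluded from the rate, but still reported by name
        "known_issues": [o.name for o in known],
    }


# Made-up outcomes, purely for illustration.
outcomes = [
    Outcome("test_login_success", passed=True),
    Outcome("test_logout", passed=True),
    Outcome("test_refresh_token", passed=False),
    Outcome("test_login_with_google_api", passed=False, known_issue=True),
]
report = pass_rate_report(outcomes)
print(report["gated_passed"], report["gated_total"], report["known_issues"])
# prints: 2 3 ['test_login_with_google_api']
```

The genuine failure still drags the rate down, while the known issue is reported next to the number rather than hidden inside it.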

That being said, sometimes you just have to pick your battles: not every improper practice is worth the political cost of pushing back. Knowing which ones are worth it is honestly as important as knowing what the pragmatic approach is.

Footnotes

  1. To be clear though, we hadn’t fixed the underlying problem; we just added enough tolerance to our integration tests to survive transient errors.