I’ve been testing distributed systems for the past two years, and I can say with confidence that it’s quite a challenge compared to when I was working in a product or feature scope. There are a lot of things and nuances that we need to consider.

In general, testing in a product/project team is usually focused on trying to answer:

“Does this project/product/feature work as expected? Does it meet the requirements? Does it provide a good user experience?”

However, when we talk about testing in distributed systems, the scope and focus shift toward answering:

“Does this system remain reliable and available under various conditions? Does it handle failures gracefully? Does it maintain data consistency across distributed components?”

They overlap, but the perspective, tooling, approach, and definition of quality can be quite different. One thing I’ve learned and noticed is that it comes down to feature correctness versus system correctness. QA/SDET in a general product/feature scope tends to focus on product or feature behavior, such as:

  • UI/UX aspect and rationale
  • API contract and response, request/response validation
  • Business, functional and non-functional requirements
  • Regression testing, edge cases, performance, security
  • Compatibility, accessibility, and so on

Distributed systems testing expands that scope into system behavior under uncertainty and complexity, such as:

  • Eventual consistency correctness
  • Message delivery guarantees (at least once, at most once, exactly once)
  • Idempotency and retry logic
  • Race conditions and concurrency issues
  • Data replication integrity, data backup and recovery
  • Fault tolerance and resilience testing and so on
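Idempotency and delivery guarantees from the list above go hand in hand: with at-least-once delivery, the consumer must tolerate duplicates. A minimal sketch, assuming a hypothetical `handle` function and an in-memory `seen_ids` set (in production this would be a durable dedup store):

```python
# Sketch of an idempotent consumer under at-least-once delivery.
# All names here (handle, seen_ids) are hypothetical.

seen_ids = set()  # in production: a durable store, e.g. a DB table or cache

def handle(message_id: str, payload: dict) -> bool:
    """Apply a message's side effect; return False if it was a duplicate."""
    if message_id in seen_ids:
        return False  # redelivery: skip, keeping the operation idempotent
    seen_ids.add(message_id)
    # ... apply the side effect here (e.g. record a payment) ...
    return True

# At-least-once delivery means the broker may redeliver the same message:
assert handle("msg-1", {"amount": 10}) is True
assert handle("msg-1", {"amount": 10}) is False  # duplicate is a no-op
```

A test for this property is simple: replay the same message twice and assert the side effect happened exactly once.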

Most of the time, the bugs or issues we encounter are subtle, hidden, and not immediately visible. The infuriating part is that many valid scenarios and edge cases are not easily reproducible; we need to simulate or create specific conditions that depend on timing and system state. For me, that’s quite time-consuming and often requires a lot of trial and error, but I can say for sure that it’s rewarding at the end of the day.
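One way to make those timing-dependent bugs reproducible is to control the interleaving yourself instead of hoping the scheduler produces it. A toy sketch of a classic lost update, using a `threading.Barrier` to force both threads to read before either writes (the shared `counter` and `unsafe_increment` are illustrative, not from any real system):

```python
import threading

counter = 0  # shared state with a non-atomic read-modify-write

def unsafe_increment(barrier: threading.Barrier) -> None:
    global counter
    local = counter      # read
    barrier.wait()       # force both threads to read before either writes
    counter = local + 1  # write: one of the two updates is lost

barrier = threading.Barrier(2)
threads = [threading.Thread(target=unsafe_increment, args=(barrier,))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Both threads read 0 before writing, so the result is 1 instead of 2.
assert counter == 1
```

Without the barrier this race might surface once in thousands of runs; with it, the failing interleaving happens every time, which is exactly what you want in a regression test.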

The tools and approaches we use for testing are also quite different. Personally, I think there’s no silver bullet for this; it always depends on the context and the system we’re testing. Sometimes you’ll use a combination of traffic replay and message broker inspection; other times you’ll go with a more deterministic approach, like synthetic transactions with data reconciliation: inject known data, initiate failures, observe the system, and wait for the results.
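The synthetic-transactions-plus-reconciliation approach can be sketched very simply: compare what you injected upstream against what the downstream system recorded. The two dict "stores" and the `reconcile` helper below are hypothetical stand-ins for real databases:

```python
# Minimal sketch of reconciling synthetic transactions between two stores.
# In practice, both sides would be queried from real systems after the
# injected failures have played out.

source = {"tx-1": 100, "tx-2": 250, "tx-3": 75}  # what we injected
replica = {"tx-1": 100, "tx-3": 70}              # what the downstream recorded

def reconcile(source: dict, replica: dict) -> dict:
    """Classify every transaction as missing, mismatched, or unexpected."""
    return {
        "missing": sorted(k for k in source if k not in replica),
        "mismatched": sorted(k for k in source
                             if k in replica and source[k] != replica[k]),
        "unexpected": sorted(k for k in replica if k not in source),
    }

report = reconcile(source, replica)
assert report == {"missing": ["tx-2"], "mismatched": ["tx-3"],
                  "unexpected": []}
```

The useful part of this pattern is the report itself: a missing record points at dropped messages, a mismatch at a stale or partial write, and an unexpected record at duplication or leakage from another flow.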

But one thing I can say for sure: it’s honestly exhilarating, and it gives me a lot of ‘joy’ and a sense of being dumber, in a good way.