Short summary: we were faced with a problem where several users predominantly got 429 TOO MANY REQUESTS errors when trying to update their profile information in our so-called “Profile Widget” feature on the homepage, with each attempt causing a burst of requests to the rate limiter within our services. We’ve been using Redis as a rate limiter for our services, but we weren’t able to test it properly since, at first, we didn’t have any test harness around it.
We began mocking and setting up a test harness for our rate limiter, trying to validate whether our system works as expected under concurrent conditions. However, the majority of our test scenarios passed without any issues, which led to our presumption that the rate limiter was working as expected without violating any of the rate limits. And yet the issue wasn’t resolved: it kept recurring over the past 30 days, specifically for users on iOS devices.
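For context, a harness like that can be spun up without a real Redis server. The sketch below is illustrative rather than our actual fixture: it assumes the fakeredis package, a hypothetical limiter module path, and guesses at RedisRateLimiter’s constructor parameters (capacity, refill_rate), none of which are shown in this post.

import fakeredis
import pytest

from limiter import RedisRateLimiter  # hypothetical module path

@pytest.fixture
def redis_rate_limiter():
    # fakeredis provides an in-memory, Redis-compatible client, so the
    # harness can exercise concurrency without a live Redis server.
    fake_client = fakeredis.FakeRedis()
    # capacity / refill_rate are assumed constructor parameters; the real
    # class may name or configure these differently.
    return RedisRateLimiter(client=fake_client, capacity=5, refill_rate=1.0)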

The first problem: why all the tests passed
As I mentioned earlier, one of the underlying problems was that we couldn’t test our rate limiter properly, which caused all of our tests to pass, which is not what we’d expect given the errors in production. Several takeaways from this problem that I want to share with you:
- We should understand the shape of the traffic behind all those errors. Our goal is not to immediately replicate the ~400 errors at once just so they’re reproducible, but rather to mimic the traffic pattern that reveals (or conceals) those 429 errors, and to understand why they happen.
- Tuning based on production numbers should be our first step toward the root cause, and the harness must be designed and configured accordingly. A well-parameterized concurrent test harness gives us a better understanding and gets very close to exposing the weaknesses in our system deterministically while mimicking real-world scenarios.
- Always be strict with your test cases, no matter what. This is perhaps the most important takeaway from this problem. In the test harness we had been using, we set the limiter settings far too loose, which meant a lot of false positives: tests that pass when they shouldn’t (a sketch of a stricter setup follows this list).
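To make that last point concrete, here is a minimal sketch of what “strict” could look like, reusing the fixture sketched above. The numbers are illustrative, not our production values; the point is that when the configured burst exceeds the assumed capacity, the test demands that throttling actually happened.

import pytest

@pytest.mark.parametrize("burst_size, min_throttled", [
    (3, 0),    # within the assumed capacity of 5: nothing should be throttled
    (10, 5),   # well above capacity: at least 5 requests MUST be rejected
])
def test_limiter_actually_throttles(redis_rate_limiter, burst_size, min_throttled):
    # No sleeps: fire the whole burst back-to-back, as production traffic would.
    allowed = sum(
        1 for _ in range(burst_size)
        if redis_rate_limiter.allow_request("user_id:strict-check", token=1)
    )
    throttled = burst_size - allowed
    # Strict expectation: zero throttling on an over-capacity burst means
    # the harness (not the limiter) is broken.
    assert throttled >= min_throttled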
One example where our test cases passed but shouldn’t have can be seen in the following snippet:
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

import pytest

from limiter import RedisRateLimiter  # module path assumed; not shown in the original snippet

@pytest.mark.parametrize("num_users, burst_size, test_duration", [(400, 3, 2.0)])
def test_profile_widget_rate_limiter(redis_rate_limiter: RedisRateLimiter, num_users, burst_size, test_duration):
    """
    Simulates ~400 users opening and updating the profile widget at the same time,
    each triggering a burst of requests to the rate limiter during the profile update.
    Mimics the behavior of the profile widget in the production environment, which has
    been producing 429 errors for the affected users (iOS only) over the past 30 days.
    """
    def user_simulation(user_id):
        allowed = 0
        for _ in range(burst_size):
            if redis_rate_limiter.allow_request(f"user_id:{user_id}", token=1):
                allowed += 1
            time.sleep(1)  # a fixed 1-second gap between the requests of a "burst"
        return allowed

    start_time = time.time()
    total_allowed = 0
    # One worker per simulated user; each worker fires burst_size requests.
    with ThreadPoolExecutor(max_workers=num_users) as executor:
        futures = [executor.submit(user_simulation, user_id) for user_id in range(num_users)]
        for f in as_completed(futures):
            total_allowed += f.result()
    elapsed_time = time.time() - start_time
    expected_min_request = num_users * burst_size
    # get_throttled = expected_min_request - total_allowed
    assert total_allowed >= expected_min_request, "Limiter throttled legitimate profile widget calls"

There’s nothing special about this test case; in fact, we felt “manipulated,” since every test case passed without issue. But there is one thing in the snippet above that deserves attention, and it became a major game changer for us: the reason this test should have failed instead of passed.
The concurrency level this harness achieves may not truly match the real-world concurrency our system is undergoing, because of the fixed “wait” injected into each burst of requests. As you can see, the time.sleep(1) is the main culprit here.
In reality, a 1-second wait between requests is very unlikely in production; we might have hundreds of users hitting the endpoint within a few milliseconds. In essence, with this harness the rate limiter gets a full 1-second window to refill its bucket between requests. So even with the hundreds of users we configured earlier, the limiter never hits its own capacity and no throttling ever happens. Once we removed the time.sleep(1) from the test case, the test finally failed, which is exactly the behavior we expected.
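To see why the 1-second gap masks everything, consider a plain-Python stand-in for the limiter. Our RedisRateLimiter’s internals aren’t shown in this post, so this sketch assumes classic token-bucket semantics with illustrative numbers (a capacity of 3 tokens, refilled at 3 tokens per second):

import time

class TokenBucket:
    """Plain-Python stand-in for the per-user bucket; numbers are illustrative."""
    def __init__(self, capacity=3, refill_rate=3.0):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens regained per second
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow_request(self, tokens=1):
        now = time.monotonic()
        # Refill proportionally to the elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False

# With a 1-second sleep between calls, 1s * 3 tokens/s fully refills the
# bucket every time, so nothing is ever throttled and the test can't fail:
bucket = TokenBucket()
for _ in range(5):
    assert bucket.allow_request()  # always True: the sleep refilled the bucket
    time.sleep(1)

# Back-to-back, the bucket drains after 3 requests and starts rejecting:
bucket = TokenBucket()
results = [bucket.allow_request() for _ in range(5)]
assert results == [True, True, True, False, False]

The exact numbers don’t matter much: as long as the sleep is long enough to regain one token (sleep >= 1 / refill_rate seconds), the bucket never drains and the limiter never engages.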

Wrap-up
While discovering and fixing our test harness was valuable (we now have a set of tests that actually catch rate-limiting violations), the original iOS issue remains unresolved 😂. I think the real problem might be elsewhere: iOS-specific retry behavior, network conditions, or a different bottleneck in our system. The investigation still continues as of this writing…
I think that’s all from me. Thank you for reading, and hopefully this helps you see why we should pay close attention to our test cases and to the configuration of the test harness.