Load Testing with Shadow Traffic


Some time ago, I worked on a feature that would increase the load on my team’s backend service from 5 thousand to 50 million requests per day, a 10,000-fold increase in traffic. We needed to make sure the service could handle that huge growth, which was a bit daunting.

In this article, I explain how we tackled this problem in a clean and low-risk way. If you ever face a similar challenge, this approach might help you too.

Our Problem

Oversimplifying things, there was a Service A that made requests to another Service B, which was already handling 50 million requests per day. Service B would simply queue each request and return a 200 OK to indicate that the request would be processed.

Due to compelling business reasons, we needed to stop sending these requests directly to Service B, and instead make my team’s service (let’s call it Service C) receive those requests and forward them to Service B.

I know that at first glance using an intermediary service doesn’t seem to make sense, and I ask you to trust me that there were good business reasons to do this.

Back to our problem, the first alternative that came to mind was load testing. However, we’d have had to prepare and send about 2 million requests per hour, which would have been extremely challenging. So, instead, we decided to try something different: Shadow Traffic.

Shadow Traffic

Shadow Traffic, also known as Shadow Testing, or Traffic Mirroring, is a testing strategy used to evaluate a new system. The main idea is to send real production traffic to the service being tested without impacting real users. It helps teams answer whether services can handle real load and identify bugs using real data.

As mentioned before, we needed to change Service A so it sent messages to Service C (my team’s service) instead of Service B. Since we had to change Service A anyway, we could make it send messages to both Services B and C.

By doing so, we killed two birds with one stone: we could ensure that Service C could handle the expected load, and we could also exercise the service with millions of real-life requests (and fix any undetected bugs before going live).

In order to avoid duplicated results, Service B would ignore any requests coming from Service C, and just return a 200 OK to confirm that the request had been received.
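To make that idea concrete, here is a minimal sketch of how Service B’s handler might acknowledge shadow requests without processing them. The header name `X-Shadow-Traffic` and the function shape are my assumptions, not the actual implementation:

```python
# Hypothetical sketch: Service B acknowledges shadow requests forwarded
# by Service C but only queues real traffic, avoiding duplicated results.

SHADOW_HEADER = "X-Shadow-Traffic"  # assumed marker set by Service C

def handle_request(headers: dict, body: bytes, queue: list) -> int:
    """Queue real requests; acknowledge shadow ones without processing."""
    if headers.get(SHADOW_HEADER) == "true":
        # Shadow copy: confirm receipt with 200 OK, but do not enqueue,
        # so the message is not processed twice downstream.
        return 200
    queue.append(body)  # real traffic is queued for asynchronous processing
    return 200
```

Any explicit marker works here (a header, a field in the payload, or the caller’s identity); the key point is that the shadow path must be indistinguishable from real traffic in terms of load, while producing no side effects.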

This, however, posed a big risk. If our service was not able to handle the shadow traffic load, we would certainly bring it down. That was unacceptable, as our service was critical to delivering other features. That’s when Feature Flags came in handy.

Feature Flags

As you may know, feature flags are useful not just for enabling or disabling features, but also for doing progressive rollouts.

Feature flag systems usually let us specify a positive integer percentage of the cohort that will be exposed to the variant, which makes 1% the smallest possible rollout.

In our case, though, one percent of those 50 million messages meant 500,000 messages per day, still 100 times more than the current capacity of our service. To account for this, I proposed cascading two feature flags, each with a 1% rollout, using the user_id as the cohort identifier.

Chance of FF1 = true: 1%
Chance of FF2 = true: 1%
Chance of both = true: 1% × 1% = 0.01%

If we assume an active user base of 25 million users, using the feature flags in cascade would expose about 2,500 users. Since every user generated two messages per day on average, that gave us around five thousand messages per day, very similar to the original load on our system. That was not too bad, considering we had started with 50 million messages.

It’s important to note that these numbers are just probabilities, and they rest on the assumption that messages are uniformly distributed across users, which is probably not true. This, however, gave us a good starting point to see that cascading flags was a viable way to control the new load while running our tests.
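The cascade above can be sketched with deterministic hashing on the user_id. The function names and the salting scheme are my assumptions; one subtlety worth noting is that the two flags must bucket users independently (here, by salting the hash with the flag name), otherwise both flags would select the exact same 1% cohort and the cascade would still expose 1%, not 0.01%:

```python
import hashlib

def in_rollout(flag_name: str, user_id: str, percent: float) -> bool:
    """Deterministically bucket a user into a flag's rollout cohort.

    Salting the hash with the flag name keeps the two flags' cohorts
    independent, so cascading them multiplies the probabilities.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000  # buckets 0..9999 (0.01% granularity)
    return bucket < percent * 100      # percent=1.0 -> buckets 0..99

def send_shadow_traffic(user_id: str) -> bool:
    # Cascade: both 1% flags must pass, exposing roughly 0.01% of users.
    return in_rollout("ff1", user_id, 1.0) and in_rollout("ff2", user_id, 1.0)
```

Because the bucketing is a pure function of flag name and user_id, a given user is consistently in or out of the cohort across requests, which keeps the shadow traffic sample stable while the test runs.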

Final thoughts

Depending on the outcome, there were only two possibilities: either the system was fine as it was, or it needed changes to handle the extra load.

Sadly, I never found out which one it was, as I decided to leave the company before the testing took place. However, I am convinced that this plan served as a strong foundation to give the team the answers that they needed to move forward.

If you ever face a situation like this, I hope that this can serve as inspiration for ways to ensure your service can handle high load volumes with minimal effort.

Cheers!
José Miguel

Share if you find this content useful.
Follow me on LinkedIn to be notified of new articles.

