17 November 2013

Scaling Featureswitches

by Mike

I was recently helping a friend’s company to scale out. This post details some of the issues that came up when scaling the code for the feature switches.

Once your applications get big, being able to toggle feature switches on and off, to enable and disable features or change things in real time, is a great advantage.

If you are getting 10 page requests per second on each of 2 servers and you have 10 feature-blocks on the page, then the resulting 200 lookups per second can easily be handled by Redis or Memcache directly. The per-page penalty shouldn’t be massive if your key/value server isn’t too far removed from your application servers: typically 10ms for the 10 lookups, assuming my benchmarks are in the right ballpark (60,000 GETs per second and around 1ms round-trip time).

This becomes a bigger issue when you are looking at many more servers, feature blocks and many more pages per second. The Redis server becomes a bottleneck and the round-trip time means that you need many more threads to do the work.
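To make the arithmetic above concrete, here is a rough back-of-the-envelope sketch. It reads the small case as 10 page requests per second on each of the 2 servers (which is what makes the 200 figure work) and assumes the lookups on a page are done one after another. The "large" numbers are purely hypothetical, picked only to show the load crossing the 60,000 GETs per second benchmark.

```python
def lookup_load(requests_per_sec_per_server, servers, blocks_per_page, round_trip_ms):
    """Return (lookups per second against the store, added latency per page in ms)."""
    lookups_per_sec = requests_per_sec_per_server * servers * blocks_per_page
    # Assumes one sequential round trip per feature block on the page.
    per_page_penalty_ms = blocks_per_page * round_trip_ms
    return lookups_per_sec, per_page_penalty_ms


BENCHMARK_GETS_PER_SEC = 60_000  # figure quoted in the post

# The small case from the post.
small = lookup_load(10, 2, 10, 1.0)

# Hypothetical larger deployment (numbers invented for illustration).
large = lookup_load(100, 50, 20, 1.0)

print("small:", small, "over benchmark:", small[0] > BENCHMARK_GETS_PER_SEC)
print("large:", large, "over benchmark:", large[0] > BENCHMARK_GETS_PER_SEC)
```

In the hypothetical large case a single store would be asked for more lookups than the benchmark says it can serve, and every page would spend twice as long waiting on round trips.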

To this end, I redesigned the standard ‘gatekeeper’ code they were using so that it would scale more effectively.

They used to use Xcache. Both support storing variables in memory for blazingly fast access. This is great for speed, but it is a pain to set these variables from outside of the web server, i.e. from the CLI.

So, what we ended up doing was using these in-memory caches to hold the feature flags for a short time after grabbing them from the key/value store (Redis was used). With a 10 second cache time and 100 requests per second, each cached flag is fetched from Redis once instead of roughly 1,000 times, which dropped the number of requests against Redis by 3 orders of magnitude.
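The short-lived local cache described above can be sketched roughly as follows. This is not the original gatekeeper code (which isn't shown here); it is a minimal Python illustration in which a plain dict stands in for the in-process cache, and `CountingStore`, `FakeClock`, `CachedFlags` and the flag name `new_checkout` are all hypothetical names made up for the example.

```python
import time


class CountingStore:
    """Stand-in for the remote key/value store; counts how often it is hit."""

    def __init__(self, flags):
        self.flags = dict(flags)
        self.calls = 0

    def get(self, name):
        self.calls += 1
        return self.flags.get(name, False)


class FakeClock:
    """Controllable clock so the example does not need to sleep."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


class CachedFlags:
    """Process-local cache that keeps each flag for ttl_seconds."""

    def __init__(self, fetch, ttl_seconds=10.0, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # flag name -> (expires_at, value)

    def is_enabled(self, name):
        now = self._clock()
        entry = self._entries.get(name)
        if entry is not None and now < entry[0]:
            return entry[1]  # still fresh: no trip to the store
        value = bool(self._fetch(name))
        self._entries[name] = (now + self._ttl, value)
        return value


store = CountingStore({"new_checkout": True})
clock = FakeClock()
flags = CachedFlags(store.get, ttl_seconds=10.0, clock=clock)

# Simulate 100 requests per second for 10 seconds (1,000 requests).
for i in range(1000):
    clock.now = i * 0.01
    flags.is_enabled("new_checkout")

print("store hits for 1000 requests:", store.calls)
```

The trade-off is that a flag change can take up to the cache time to be seen by a given web server process, which is the price paid for taking the store out of the hot path.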

A second issue was that as the application was scaled out across multiple data centres, we had multiple Redis servers for the feature flags, one for each cluster. Distributing the changes to the flags was a bit of a pain. We settled on using a RabbitMQ fanout exchange and a worker sitting on the Redis server itself to listen for global updates and then push the changes into Redis. This was the most fragile part of the setup, but worked well for us.
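The shape of that distribution step can be sketched without a broker. The code below is an in-memory stand-in, not RabbitMQ client code: `FanoutExchange` only mimics the fanout behaviour (every bound queue gets a copy of every message), a dict stands in for each cluster's Redis, and the JSON message layout with `flag` and `enabled` keys is an assumption made for the example.

```python
import json
import queue


class FanoutExchange:
    """In-memory stand-in for a fanout exchange: copies each message to every bound queue."""

    def __init__(self):
        self._queues = []

    def bind(self):
        q = queue.Queue()
        self._queues.append(q)
        return q

    def publish(self, body):
        for q in self._queues:
            q.put(body)


class ClusterWorker:
    """Per-cluster worker: reads global updates and writes them to the local store."""

    def __init__(self, inbox, local_store):
        self.inbox = inbox
        self.local_store = local_store  # a dict standing in for the cluster's Redis

    def drain(self):
        applied = 0
        while True:
            try:
                body = self.inbox.get_nowait()
            except queue.Empty:
                return applied
            update = json.loads(body)
            self.local_store[update["flag"]] = update["enabled"]
            applied += 1


exchange = FanoutExchange()
store_a, store_b = {}, {}
worker_a = ClusterWorker(exchange.bind(), store_a)
worker_b = ClusterWorker(exchange.bind(), store_b)

# One global update reaches every cluster's local store.
exchange.publish(json.dumps({"flag": "new_checkout", "enabled": True}))
applied_a = worker_a.drain()
applied_b = worker_b.drain()

print(store_a, store_b)
```

With this layout the web servers only ever talk to their own cluster's store; the only cross-data-centre traffic is the flag updates themselves.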

RabbitMQ supports server-side keep-alives, but our worker code was a pain to change to accommodate client-side keep-alives. This meant that a network partition could see RabbitMQ close the connection without the client being aware of it.

A second worker was used to check the first worker (and other processes) for updates, and restarted it if it hadn’t responded for a while to either a heartbeat or a real message.
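A watchdog of this kind boils down to "remember when the worker was last heard from, and restart it if that was too long ago". The sketch below shows only that logic; the 30 second timeout, the `restart` callback and the `FakeClock` helper are assumptions for illustration, and how the original second worker actually observed and restarted processes is not described here.

```python
import time


class FakeClock:
    """Controllable clock so the example runs instantly."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


class Watchdog:
    """Restarts a worker that has been silent for longer than timeout_seconds."""

    def __init__(self, restart, timeout_seconds=30.0, clock=time.monotonic):
        self._restart = restart
        self._timeout = timeout_seconds
        self._clock = clock
        self._last_seen = clock()

    def record_activity(self):
        """Call on every heartbeat or real message from the worker."""
        self._last_seen = self._clock()

    def check(self):
        """Return True if the worker was considered dead and restarted."""
        if self._clock() - self._last_seen > self._timeout:
            self._restart()
            # Give the fresh worker a full timeout window before judging it.
            self._last_seen = self._clock()
            return True
        return False


restarts = []
clock = FakeClock()
watchdog = Watchdog(lambda: restarts.append(clock.now), timeout_seconds=30.0, clock=clock)

clock.now = 20.0
watchdog.record_activity()      # heartbeat arrives at t=20
clock.now = 45.0
first_check = watchdog.check()  # 25s of silence: still fine
clock.now = 60.0
second_check = watchdog.check() # 40s of silence: restart

print(first_check, second_check, restarts)
```

Treating a real message the same as a heartbeat means a busy worker is never restarted just because its heartbeats were delayed.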

This gave a very scalable solution without a massive amount of overhead.

I hope to have a reimplementation of the architecture ready very soon and will make it available on GitHub once it is working.


darkflib.github.io

Site Reliability Team Lead at News UK
