farmdev

Dark-Launching or Dark-Testing New Software Features

If you're building software that will be used by hundreds of millions of people at once, it's pretty tricky to simulate that kind of load in a testing environment. And without realistic load tests, you can't be all that sure your infrastructure will stand up to the pressure. MySpace tried to use 800 EC2 instances to simulate one million concurrent users on their new video features, but before they could reach the limit of their own app they hit the physical limits of their Akamai datastore due to the geographic location of the EC2 nodes. D'oh!

Instead of simulating load, why not just deploy the feature and see what happens, without disrupting usability? Facebook calls this a dark launch of the feature. (Grig Gheorghiu also has a succinct write-up on dark launching.) Let's say you want to turn a static search field used by 500 million people into an autocomplete field so your users don't have to wait as long for search results. You've built a web service for it, but there's no way to simulate all those people typing words at once and generating multiple requests to the web service. With the dark launch strategy you would, for example, augment the existing form with a hidden background process that sends the entered search keyword to the new autocomplete service multiple times. If the web service explodes unexpectedly then no harm is done; the server errors are simply ignored on the web page. And if it does explode then, great, you can tune and refine the service until it holds up.
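To make the idea concrete, here is a minimal sketch in Python. It is not how Facebook does it, and it mirrors the traffic on the server side rather than from a hidden process in the web page, purely to keep the example short. The names DarkLauncher, existing_search and flaky_autocomplete are all made up for illustration. Each real search also fires a few background requests at the new service, and any error from those requests is counted but never shown to the user.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class DarkLauncher:
    """Mirrors live search keywords to a new, unreleased service (hypothetical helper)."""

    def __init__(self, shadow_call, repeats=3, max_workers=4):
        self.shadow_call = shadow_call  # assumed client function for the new service
        self.repeats = repeats          # shadow requests sent per real search
        self.pool = ThreadPoolExecutor(max_workers=max_workers)
        self.lock = threading.Lock()
        self.stats = {"ok": 0, "failed": 0}

    def _shadow(self, keyword):
        # Any failure in the new service is swallowed and only counted.
        try:
            self.shadow_call(keyword)
            outcome = "ok"
        except Exception:
            outcome = "failed"
        with self.lock:
            self.stats[outcome] += 1

    def search(self, keyword, real_search):
        # Queue the dark requests in the background, then serve the user as usual.
        for _ in range(self.repeats):
            self.pool.submit(self._shadow, keyword)
        return real_search(keyword)

    def close(self):
        # Wait for outstanding shadow requests; handy for demos and tests.
        self.pool.shutdown(wait=True)


def existing_search(keyword):
    # Stand-in for the search users already have.
    return [f"result for {keyword}"]


def flaky_autocomplete(keyword):
    # Stand-in for the new service exploding under load.
    raise RuntimeError("new service fell over")


launcher = DarkLauncher(flaky_autocomplete, repeats=3)
results = launcher.search("tractor", existing_search)
launcher.close()
print(results, launcher.stats)
```

Even though every shadow request fails here, the user still gets the normal results, and the failure counts tell you the new service needs tuning. In a real deployment you would probably also want a switch to turn the dark traffic off quickly.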

There you have it, a real-world load test. The downside is that such a test is not entirely realistic. Once the user sees the new autocomplete feature she might end up searching with more keywords than normal, since the results appear faster. Dare Obasanjo has further examples of the pros and cons of dark launches. This kind of dark test is not perfect but will help boost confidence. Of course, all this only matters if you actually have millions of users!