Surviving the Amazocalypse
So, I got the call every techie dreads: a 4.30am “OMG, we’re down!”
It was from Canada and I’m in Oz, so you know it was bad. As most people know by now, Amazon’s US East region was out and had taken down Heroku, where we host the main LeadNow site. We’re in the middle of spruiking democracy for the Canadian federal election and have been a central hub for vote mobs and voter socials, so getting back up ahead of the weekend was really important.
Overall, we were fine and got back up pretty sharp considering the magnitude of the problem, so I’m being tongue-in-cheek with the “Amazocalypse”. Really, I’m just curious what leaner startups and cash-strapped orgs do for their escalation and recovery procedures that we may not be doing.
One of the nice things about LeadNow is that, since we designed it from scratch and were cash-constrained, we built more like a lean startup than a drunken-sailors-on-shore-leave type organization. Cloud services were a no-brainer, in a sense, because we were trusting those guys to handle DR/BC (Disaster Recovery/Business Continuity) better than we ever could ourselves, either infrastructurally or on the kind of budget “proper” DR/BC usually takes.
But since we’d pushed most code to GitHub and were relying on Advocacy Online and Survey Gizmo as APIs in most cases, rather than running our own DB stuff, it ended up being pretty easy to recover. We had a running server in Amazon’s US West region, and on top of that we also had suspended Amazon servers waiting in the wings, so we could spin up a box if something really bad happened and there was a total meltdown somewhere else. So really we repointed the DNS we manage via Zerigo, put up a “Holy crap, we’re down due to the Amazon outage” page, installed Ruby via RVM plus Passenger on the running fallback machine, pushed from GitHub to the server and spun up the Sinatra app to get back in the swing of things. Wasn’t pretty, but any outage you can walk away from is a good one.
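For the curious, the holding page really was nothing fancy. Here’s a rough sketch of that sort of thing as a classic-style Sinatra app (the copy and names are illustrative, not our exact code): it answers every path with a 503 and an apology, so stray deep links still get a sensible answer while the real app is being restored.

    # Minimal outage/holding page as a classic-style Sinatra app.
    require 'sinatra'

    # Catch every path so old links and bookmarks don't 404 while we're down.
    get '/*' do
      status 503   # "temporarily unavailable", so crawlers know to come back
      "Holy crap, we're down due to the Amazon outage. Back shortly."
    end

Run it straight off the fallback box with ruby app.rb (port 4567 by default, or -p to pick another), or drop it behind Passenger once that’s installed, repoint the DNS, and at least you’ve got something telling people what’s going on while you restore the real site.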
So, what else could we have been doing?
Anyone seamlessly fail over on the cheap and not even notice? I’d be interested in other people’s mega fallback plans for staying online despite catastrophic outages, especially when a key piece of infrastructure goes down.
Think cheap, cheerful and clever DR/BC. We’re a not-for-profit, after all.