Google API infrastructure outage incident report

WestCoastJustin · on May 4, 2013

Very professional -- this should be a model for how to handle and report such outages. There is a lot of procedural machinery happening to even get to this state, i.e. monitoring, a change management system (who did what, and when), troubleshooting and escalation, deploying a fix and verifying that it worked. They do not gloss over the fact that this jumped QA!

cheez · on May 4, 2013

Yep, someone's performance review is not going to go very well.

packetslave · on May 4, 2013

There's nothing in the incident report that says this was due to human action. An automated process could just as easily have pushed configs to the wrong environment (either through a bug or a misconfiguration of the process itself). Not saying that's what happened, but the IR doesn't say.

In my experience (having personally been the trigger for a widespread outage that got us in the news), Google takes a fairly sophisticated view of outages: SREs as a rule care less about crucifying the engineer who did it, and much more about questions like:

what was the REAL root cause that caused this to happen?
why didn't our processes/tools STOP it from happening?
did our monitoring detect it? could we have detected it faster?
did we fix it or mitigate it fast enough?
did things like communication, escalations, handoffs between teams, etc. work effectively?
what can we do better next time?

When I first started, my director took a bunch of us out to a Noogler lunch. We sat down with our plates, and he said "Ok, let's talk about how to get fired." Basically, mistakes happen. Bugs happen. If you cause a huge outage, that isn't necessarily a negative reflection on you, and shouldn't hurt your perf. If you cause a huge outage because you willfully ignored procedure, went around established safety controls, didn't monitor to make sure your changes didn't turn all the pretty dashboards red, THEN you're going to have a problem.

ConceitedCode · on May 4, 2013

Great job, Google API Infrastructure team! Communication is key. Now if only everyone did this...

tlogan · on May 4, 2013

This is great. Now if only they actually had a status page saying that the API is not working, so we would not need to scramble through forums (which are closed) and StackOverflow forums (which are open, but all questions regarding the outage are immediately closed). For previous outages (the last one happened on Mar 18), there was actually an email sent to people subscribed to google-apps-apis-downtime-notify.

thezilch · on May 4, 2013

FTFA...

Develop better mechanism for quickly delivering status notifications during incidents.

I can't be sure how the numerous references to monitoring failures relate to status updates, but at some point you have to assume your "status page" will have "bugs" too.

tlogan · on May 4, 2013

Shouldn't there be somebody with the title "Google Developer Relations" (or something like that) to send an email to the mailing list?

staunch · on May 4, 2013

Hopefully this is the first time that person screwed up like this so they weren't summarily executed^Wfired.

ushi · on May 4, 2013

a configuration change was inadvertently released to our production environment without first being released to the testing environment.

Google - Just humans, too.
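
For what it's worth, the kind of guard that would catch the quoted failure (a change reaching production without passing through testing first) can be sketched in a few lines. This is only an illustration, not how Google's release tooling works; the report doesn't describe that. The stage names, `ConfigChange`, `PromotionError` and `release` are all made up for the example.

```python
from dataclasses import dataclass, field

# Hypothetical stage order; the incident report does not describe the real pipeline.
STAGES = ["testing", "production"]


class PromotionError(Exception):
    """Raised when a change tries to skip an earlier stage."""


@dataclass
class ConfigChange:
    change_id: str
    # Environments this change has already been released to.
    released_to: set = field(default_factory=set)


def release(change: ConfigChange, target: str) -> None:
    """Release a change to `target`, refusing if an earlier stage was skipped."""
    if target not in STAGES:
        raise ValueError(f"unknown environment: {target}")
    # Every stage that comes before the target must already have this change.
    earlier = STAGES[:STAGES.index(target)]
    missing = [stage for stage in earlier if stage not in change.released_to]
    if missing:
        raise PromotionError(
            f"{change.change_id}: not yet released to {', '.join(missing)}"
        )
    change.released_to.add(target)
```

With this, calling `release(change, "production")` on a fresh change raises `PromotionError`, while releasing to "testing" first and then to "production" succeeds. The point is that the check lives in the release path itself, so it applies whether a human or an automated process triggers the push.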