Learning how to respond to downtime

If you run a web service, I want you to take a moment to learn from the recent response by 37signals to the two hours of downtime they had the other day. Here is what I said about it on my linklog.

"37signals responds to downtime, perfectly. They start with an explanation of what happened, then apologize with the promise to compensate where warranted, and assure it won’t happen again, all with human feeling. Learn." — (view bookmark view their post)

Pulling this off is no easy task – though for a remarkably customer-service-conscious group like 37signals perhaps this comes pretty naturally. I wanted to take a second to show some bad examples of this type of response, so that you can see the contrast (and I’m sure I could be one of these examples if I were harder on myself).

Recently Flickr had some planned downtime, and they gave fair warning about it. This is a good thing. However, their maintenance took longer than they thought it would, and I think they might have stepped over the "snarky remark" edge just slightly. Just so we’re all clear, I love Flickr. I’ve met some of their staff members and all of them are good people. Here is a snippet from their downtime notice post.

"Do you remember when we said we were almost back online? Well, that time we were joking, but this time is for real!"

Personally I think they could have skipped the "every few hours" approach to updating and just waited until the service was back up to bring the community up to speed (more on this below). Snarky remarks like the above don’t help too much. How can this be avoided though? You don’t want to be completely inhuman. Let’s look at how 37signals brought the human feeling into their post, with this line.

"Again, weÃ¢â‚¬â„¢re truly sorry for this interruption. This is not how Fridays are supposed to be."

During their downtime they also updated their users as best they could (this particular situation was largely out of their hands), and while they injected some heartfelt messages into those updates, I think they could have saved that for this post.

Another bad example would be to remain silent and have your service degrade, well, not so gracefully. Blogger recently had an outage, and their users just saw a weird message and there were no updates from the Blogger staff. Silence isn’t a good tactic at all.

Points to remember

Based on the good example of 37signals and the bad examples above, I think that we should all strive to do the following when web services go down – and I’ve ordered these by importance (in my opinion).

Degrade gracefully. When downtime occurs, forward to some sort of friendly message that is easily updatable by staff members to let the community know what is going on.
Keep explanations short and simple. Don’t update every 5 seconds (especially if you have nothing to report), and don’t be long-winded. Sometimes "we’re working on it" is sufficient. Oh, and each update should have a timestamp.
Don’t give false expectations. I’ve learned this the hard way. Even if your engineers tell you that it will take an hour, there is no need to say that publicly. Keep the "we’re close" messages to a minimum too.
Be human. Try your best to explain the situation in human terms and be warm.
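
To make the first two points a little more concrete, here is a rough sketch in Python of what a staff-updatable downtime page could look like. Everything in it (the StatusPage and StatusUpdate names, the wording of the default message, the HTML layout) is made up for illustration; it isn’t how 37signals, Flickr, or anyone else actually does it. The idea is simply that staff post short messages, each one gets a timestamp, and the newest one shows up first.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from html import escape


@dataclass
class StatusUpdate:
    """One short message from staff, with the time it was posted."""
    posted_at: datetime
    message: str


@dataclass
class StatusPage:
    """Hypothetical fallback page shown while the service is down."""
    service_name: str
    updates: list = field(default_factory=list)

    def post(self, message, now=None):
        # Every update gets a timestamp; `now` can be passed in for testing.
        posted_at = now or datetime.now(timezone.utc)
        update = StatusUpdate(posted_at, message.strip())
        self.updates.append(update)
        return update

    def render(self):
        # Newest update first, so visitors see the latest news at the top.
        if self.updates:
            items = "\n".join(
                f"<li>{u.posted_at:%Y-%m-%d %H:%M} UTC: {escape(u.message)}</li>"
                for u in reversed(self.updates)
            )
        else:
            # Nothing to report yet: a short default beats silence.
            items = "<li>We're working on it.</li>"
        return (
            f"<h1>{escape(self.service_name)} is temporarily unavailable</h1>\n"
            f"<ul>\n{items}\n</ul>"
        )
```

You would point your "we’re down" traffic at whatever render() returns and let a staff member call post() whenever there is real news. Note that the sketch doesn’t promise a return time anywhere, which is on purpose.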

Once the service is back up and running, and a longer explanation is warranted, you need look no further than 37signals’ post for inspiration.

One thing we can’t see is whether or not 37signals contacted any of their users behind the scenes. Since their product is a paid service, they could very well have personally contacted some of their larger accounts to let them know what was going on. Or, after they were back up, they could have reimbursed them beyond the offer they made publicly. Things like this go a very long way.

Please note that I believe this task to be extremely hard to pull off well and that I think both Flickr and Blogger are great services.

I’m hoping that I can take all of these points and learn from them the next time we have any troubles at Viddler. In the past we’ve handled these situations fairly well, but I know we can improve a lot by learning from others’ good and bad examples.

Thoughts?

Addendum: It appears that I am not the only one who thinks 37signals did a great job. Not only do they have numerous comments on the post, but Dan Benjamin also thought so.