Learning how to respond to downtime
If you run a web service, I want you to take a moment to learn from 37signals’ recent response to the two hours of downtime they had the other day. Here is what I said about it on my linklog.
“37signals responds to downtime, perfectly. They start with an explanation of what happened, then apologize with the promise to compensate where warranted, and assure it won’t happen again, all with human feeling. Learn.” — (view bookmark | view their post)
Pulling this off is no easy task - though for a remarkably customer-service-conscious group like 37signals, perhaps it comes pretty naturally. I wanted to take a second to show some bad examples of this type of response, so that you can see the contrast (and I’m sure I could be one of these examples if I were harder on myself).
Recently Flickr had some downtime that they knew was coming, so they gave fair warning about it. This is a good thing. However, their maintenance took longer than they thought it would, and I think they might have stepped over the “snarky remark” edge just slightly. Just so we’re all clear, I love Flickr. I’ve met some of their staff members and they are all good people. Here is a snippet from their downtime notice post.
“Do you remember when we said we were almost back online? Well, that time we were joking, but this time is for real!”
Personally, I think they could have skipped the “every few hours” approach to updating and simply waited until the service was back to bring the community up to speed (more on this below). Snarky remarks like the one above don’t help much. How can this be avoided, though? You don’t want to come across as completely inhuman. Let’s look at how 37signals brought human feeling into their post with this line.
“Again, we’re truly sorry for this interruption. This is not how Fridays are supposed to be.”
During their downtime they also updated their users as best they could (this particular situation was largely out of their hands), and while they injected some heartfelt messages into those updates, I think they could have saved those for this post.
Another bad example would be to remain silent and have your service degrade, well, not so gracefully. Blogger recently had an outage; their users just saw a weird message, and there were no updates from the Blogger staff. Silence isn’t a good tactic at all.
Points to remember
Based on the good example of 37signals and the bad examples above, I think that we should all strive to do the following when web services go down - and I’ve ordered these by importance (in my opinion).
- Degrade gracefully. When downtime occurs, forward to some sort of friendly message that is easily updatable by staff members to let the community know what is going on.
- Keep explanations short and simple. Don’t update every five seconds (especially if you have nothing to report), and don’t be long-winded. Sometimes “we’re working on it” is sufficient. Oh, and each update should have a timestamp.
- Don’t give false expectations. I’ve learned this the hard way. Even if your engineers tell you that it will take an hour, there is no need to say that publicly. Keep the “we’re close” messages to a minimum too.
- Be human. Try your best to explain the situation in human terms and be warm.
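To make the first two points a little more concrete, here is a rough sketch of what a staff-updatable downtime message could look like. This is only an illustration under my own assumptions, not how 37signals, Flickr, or anyone else actually does it: the names `StatusBoard` and `downtime_response` are made up, and the updates live in memory rather than wherever your staff would really edit them.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class StatusBoard:
    """Hypothetical in-memory list of staff-written downtime updates."""
    updates: list = field(default_factory=list)

    def post(self, message, when=None):
        # Every update carries a timestamp so readers can tell how fresh it is.
        when = when or datetime.now(timezone.utc)
        self.updates.append((when, message))

    def render(self):
        # With nothing posted yet, fall back to a short, honest default.
        if not self.updates:
            return "We're having trouble right now and are working on it."
        # Newest update first.
        newest_first = sorted(self.updates, key=lambda u: u[0], reverse=True)
        return "\n".join(
            f"[{when:%Y-%m-%d %H:%M} UTC] {message}"
            for when, message in newest_first
        )


def downtime_response(board):
    """Return (status, headers, body) for any request made during downtime."""
    headers = {
        "Content-Type": "text/plain; charset=utf-8",
        "Cache-Control": "no-store",  # avoid serving a stale update
    }
    # 503 Service Unavailable signals that the outage is temporary.
    return 503, headers, board.render()
```

A staff member would call `board.post("We're working on it.")` whenever there is something worth saying, and every visitor would get the latest timestamped message instead of a weird error page.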
Once the service is back up and running, and a longer explanation is warranted, you need look no further than 37signals’ post for inspiration.
One thing we can’t see is whether or not 37signals contacted any of their users behind the scenes. Since their product is a paid service, they could very well have personally contacted some of their larger accounts to let them know what was going on. Or, after they were back up, they could have reimbursed them beyond the offer they made publicly. Things like this go a very long way.
Please note that I believe this task to be extremely hard to pull off well and that I think both Flickr and Blogger are great services.
I’m hoping that I can take all of these points and learn from them the next time we have any troubles at Viddler. In the past we’ve handled these situations fairly well, but I know we can improve a lot by learning from others’ good and bad examples.
Thoughts?
Addendum: It appears that I am not the only one that thinks 37signals did a great job. Not only do they have numerous comments on the post, but Dan Benjamin also thought so.

January 21st, 2008 at 9:56 am
It’s not exactly “downtime”, but certainly related to the idea of how to tell your customers about something bad that happened. Can’t leave out the recent DreamHost debacle. http://blog.dreamhost.com/2008/01/15/um-whoops/ People reacted very strongly to the “joking” tone of that post compared to the severity of a $7.5 million billing error.
January 21st, 2008 at 10:02 am
Joyent’s Strongspace and Bingodisk services have been down outside of expectations for a full week until today. I did not even realize that I was having a problem with backups to their system until Thursday when I tried to access it, couldn’t, and then checked their status log.
Moral: Sending email to affected customers during downtime is absolutely essential because not everyone cares to read your site’s status log.
I have yet to hear of compensation to users for this downtime. I expect that for me it’ll be nothing, since I have one of their “lifetime” accounts and have the luxury of being screwed in every respect in regard to service.
January 21st, 2008 at 11:02 am
Owen: I agree with you in some regard. I think some services do not need to notify the users until they try to access it, while others definitely should.
For example: when a service is used to “service others”, they definitely should. Take credit card authentication services. Normally the customers, or clients, of these companies don’t use their services online, but their customers’ customers do. In this case, the companies should definitely contact them directly to give them a heads up.
Perhaps in Joyent’s case they should have contacted you. In 37signals’ case - maybe they should only do so as a courtesy to their largest accounts (perhaps all pay accounts even). Flickr could contact all of their Pro users, or - all of their Pro users that have more than 100,000 views on their photos. Maybe Viddler, now that we offer revenue sharing, should contact all users who have revenue sharing turned on.
There are many ways each service could use to determine how and when they should notify their users of downtime - but I think that the first priority should be to degrade gracefully which is used to notify those trying to access the service what is going on.
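As a rough illustration of the kind of rule I mean, here is a small sketch. Everything in it is hypothetical: the `Account` fields are invented for the example, and the rule simply combines the ideas above (pay accounts, revenue sharing, and a view count over 100,000) rather than describing what any of these services actually do.

```python
from dataclasses import dataclass


@dataclass
class Account:
    # All fields are hypothetical; a real service would map its own data here.
    email: str
    is_paying: bool = False
    photo_views: int = 0
    revenue_sharing: bool = False


def should_notify_directly(account, view_threshold=100_000):
    """True if this account warrants a direct message during downtime."""
    return (
        account.is_paying
        or account.revenue_sharing
        or account.photo_views > view_threshold
    )


def direct_notification_list(accounts):
    """Email addresses of the accounts to contact directly."""
    return [a.email for a in accounts if should_notify_directly(a)]
```

Everyone else would still see the graceful downtime message when they try to use the service.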
January 21st, 2008 at 11:04 am
Chris Masto: Yeah, I definitely overlooked using that as a recent example, and a great example it is. When you are dealing with people’s livelihood, or money, there is no room to joke - but there is room for being human and compassionate.
January 21st, 2008 at 6:19 pm
the most annoying part of the recent flickr outage (and it seems to be running slow now too which is what happened right before the outage) was that they didn’t announce it beforehand. i believe it was due to some scheduled upgrade/maintenance. whammo, no notice. it crawled to a halt and then the “flickr is having a massage” message.
January 22nd, 2008 at 10:55 am
I think 37signals did a great job with damage control and apologies for the outage on their system. My employer just recently started using Basecamp, and I didn’t hear any freaking out from management offices; they all seemed to calmly wait for the system to come back up.
I feel like sites, when they’re either going to have scheduled downtime or are experiencing an unexpected outage, should have some sort of message on their website and, if it’s unexpected, give regular updates.
I’m much more understanding and willing to roll with something when there is an explanation. Even if it’s the fact that the company is dealing with the fact that they didn’t have the resources to support the traffic they received. Ok, it can be remedied in the future.
Twitter in particular, I think, is walking the fine line of upside-down birdies a few too often, without a whole lot of reasons given, and then following that with scheduled downtime.
April 21st, 2008 at 8:50 am
[...] not sure why they’ve chosen to go silent about this issue, but it is the direct opposite of a good example. Dave Winer is also very surprised at the silence. Ev, Biz, Alex, [...]