Hotmail outage explained, preventative steps taken (we suggest some more of our own)

Today on the Inside Windows Live blog, the Windows Live team, to its credit, explained through Mike Schackwitz what happened in the Hotmail outage that occurred over the New Year’s Day weekend. Basically, while cleaning up the dummy accounts used for automated tests, a script inadvertently removed the directory records of some real user accounts as well.

In Hotmail, one way we monitor the health of the email service is through automated tests. We set up a number of accounts with different configurations, and then use automated tests to log into these accounts, simulate normal user activity and behavior, and report when errors are found. We use scripts to create and delete these test accounts in bulk. The way we delete a test account is to remove its record from a group of directory servers that route users and incoming mail to the correct mailbox. 

On December 30th, we had an error in a script that inadvertently removed the directory records of a small number of real user accounts along with a set of test accounts. Please note that the email messages and folders of impacted users were not deleted; only their inbox location in the directory servers was removed.  Therefore when they logged in, a new mailbox was automatically created for them on a new storage server that didn’t contain their old messages and folders.   This is why the accounts received the “Welcome to Hotmail” message. 
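
To picture the failure described above, here is a toy sketch of our own in Python. Everything in it is an assumption made for illustration: the `Directory` class, the server names, and the account names are hypothetical, and none of it reflects Hotmail's actual code. It only shows how dropping a directory record can leave old mail intact on one server while the user lands in a fresh, empty mailbox on another.

```python
# Toy model only: all names and structure here are hypothetical,
# invented for illustration. This is not Hotmail's code.

class Directory:
    """Maps an account name to the storage server holding its mailbox."""

    def __init__(self):
        self.location = {}   # account -> storage server name
        self.storage = {}    # storage server name -> {account: [messages]}

    def provision(self, account, server, messages=None):
        # Create a mailbox on the given server and record where it lives.
        self.storage.setdefault(server, {})[account] = list(messages or [])
        self.location[account] = server

    def remove_record(self, account):
        # "Deleting" an account only drops the directory entry;
        # the mailbox on the storage server is left untouched.
        self.location.pop(account, None)

    def login(self, account, new_server="new-server"):
        # No directory record: treat the user as new and create
        # an empty mailbox on a different server.
        if account not in self.location:
            self.provision(account, new_server, ["Welcome to Hotmail"])
        server = self.location[account]
        return self.storage[server][account]


def demo():
    d = Directory()
    d.provision("real_user", "server-a", ["old message 1", "old message 2"])
    d.provision("test_001", "server-a")

    # Buggy cleanup: the deletion list wrongly includes a real user.
    for account in ["test_001", "real_user"]:
        d.remove_record(account)

    inbox_seen = d.login("real_user")                   # what the user sees
    still_on_disk = d.storage["server-a"]["real_user"]  # old mail, unreachable
    return inbox_seen, still_on_disk
```

In this sketch the old messages never leave `server-a`; the directory simply no longer points to them, which would match the explanation that restoring the records was enough to bring the mail back into view.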

After describing the details of the outage (and we appreciate the openness here), Schackwitz goes on to describe a set of actions to make sure “this never happens again”:

  • We are updating our infrastructure to use a separate code path for provisioning and removing test accounts, so that our testing no longer risks affecting real user accounts. 
  • We are changing our issue alert process so that when multiple users report missing data, these issues get a higher priority and immediate action.
  • We are updating our feedback process so that we can more clearly communicate status to affected customers through the support forums.

We commend the Windows Live team for changing their procedures and learning from their mistakes, so to speak, but we’d like to add a few suggestions of our own:

  • Monitor your forums (in person).  We had numerous reports from LiveSide readers on the rash of new complaints in the forums, well before the issue became well known.  If our readers were on it, why wasn’t Windows Live?
  • Get involved in your forums.  While we were interested (and ok, a little bit amused) to see Corp VP Chris Jones join the Facebook group, he and other Windows Live executives and managers shouldn’t be strangers in the forums.  They should not only monitor and listen, but participate.
  • Get rid of the 270-day limit.  While the vast majority of unattended accounts may have been abandoned, the owners of the minority that weren’t are vocal and upset to have lost their emails.  It simply isn’t worth the bad buzz.

Again, we appreciate the openness and honesty that Schackwitz and Windows Live displayed by posting an explanation. It’s the kind of thing that we need to see more of across Windows Live.