More on the Skype outage

Here's Skype's official word on what caused their outage:
In an update to users on Skype's Heartbeat blog, employee Villu Arak said the disruption was not because of hackers or any other malicious activity.

Instead, the disruption "was triggered by a massive restart of our users' computers across the globe within a very short timeframe as they re-booted after receiving a routine set of patches through Windows Update," Arak wrote.

Microsoft Corp. released its monthly patches last Tuesday, and many computers are set to automatically download and install them. Installation requires a computer restart.

"The high number of restarts affected Skype's network resources. This caused a flood of log-in requests, which, combined with the lack of peer-to-peer network resources, prompted a chain reaction that had a critical impact," Arak wrote.

Arak did not blame Microsoft for the troubles and said responsibility for the outage ultimately rested with Skype. Arak said Skype's network normally has the ability to heal itself in such cases, but a previously unknown glitch in Skype's software prevented that from occurring quickly enough.

Some thoughts:

  • The phrasing "lack of peer-to-peer network resources" is quite interesting. One design goal for P2P systems is that their ability to handle load scales smoothly (or at least semi-smoothly) with the number of clients (peers) trying to use the system. It would be interesting to know what happened here. (One standard client-side mitigation for this kind of synchronized login flood is sketched after this list.)
  • This is probably not a behavior you'd see in a truly decentralized system. If, for instance, everyone in the world rebooted their SIP client, this would probably not cause all the SIP phones in the world to stop working for two days, though it might cause transient outages as people independently rebooted their machines.
  • How hard would it be for an attacker to trigger this sort of behavior intentionally by bouncing a large number of Skype clients which they have taken over (i.e., zombies in a botnet)?
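On the first point: a standard way to keep a globally synchronized restart from turning into a login flood is for each client to desynchronize itself, retrying with exponential backoff plus random jitter instead of reconnecting immediately. The sketch below is purely illustrative; the function names, retry limits, and delays are invented for this example, and nothing here reflects how Skype's client actually behaves.

import random
import time

# Illustrative sketch (not Skype's actual code): spread re-login attempts
# out with exponential backoff plus random jitter, so that millions of
# clients rebooting at roughly the same time don't all hit the login
# infrastructure in the same few seconds.

MAX_ATTEMPTS = 8
BASE_DELAY = 2.0     # seconds used to size the first retry window
MAX_DELAY = 300.0    # cap on any single wait

def try_login():
    """Placeholder for a real login attempt; returns True on success."""
    return random.random() < 0.3  # pretend the service is overloaded

def login_with_backoff():
    for attempt in range(MAX_ATTEMPTS):
        if try_login():
            return True
        # "Full jitter": wait a random amount between 0 and the current
        # exponential ceiling, so clients desynchronize over time.
        ceiling = min(MAX_DELAY, BASE_DELAY * (2 ** attempt))
        time.sleep(random.uniform(0, ceiling))
    return False

if __name__ == "__main__":
    print("logged in" if login_with_backoff() else "gave up")

The design point is that each client randomizes independently, so the aggregate login rate smooths out even when the triggering event (a Patch Tuesday reboot) hits everyone at once; a fixed, synchronized retry schedule would just recreate the spike.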

3 Comments

This really sounds like cow doots. Apart from what you've already said:

First, the clients around the world surely do the update downloads and restarts at varying times, not all at once.

Second, these updates get pushed out every month; why hasn't this happened every month? Maybe Vista has made it worse (Vista's default setting is to automatically install and restart; XP's default was to download and then ask permission to install), but Vista has been around for a while, and I don't think that millions of computers have been switched to Vista in the last month.

Third, this problem's existed since the dawn of timesharing, and is actually at its worst not when the clients restart, but when the server does, and all the clients try to reconnect once the server is back up (which often used to result in re-crashing the server). What do they do when they restart their servers? If they can handle that, they can handle this.

I think they're still covering up.

I suspect Barry's right; this is either not the full story, or it's an outright fabrication. I bet they upgraded their servers to Vista, and had them all configured to auto-install/reboot, and their servers all rebooting at once led to the problem. Then MS paid them an assload of money to not mention that this is a huge potential flaw when using Vista as a server platform. I wonder how many other Vista-served outages we'll see in the next little while before sysadmins start to realize the check-for-update/install/reboot times on all their boxes need to be staggered.

Shouldn't past experience tell us that even truly distributed systems can be subject to congestion collapse? (However, I wasn't around in the 80s when this happened to the Internet.)

Craig, I don't think anyone is running Vista on servers. Longhorn perhaps, but not Vista.
