Networking: August 2007 Archives

 

August 20, 2007

Here's Skype's official word on what caused their outage:
In an update to users on Skype's Heartbeat blog, employee Villu Arak said the disruption was not because of hackers or any other malicious activity.

Instead, he said that the disruption "was triggered by a massive restart of our users' computers across the globe within a very short timeframe as they re-booted after receiving a routine set of patches through Windows Update," Arak wrote.

Microsoft Corp. released its monthly patches last Tuesday, and many computers are set to automatically download and install them. Installation requires a computer restart.

"The high number of restarts affected Skype's network resources. This caused a flood of log-in requests, which, combined with the lack of peer-to-peer network resources, prompted a chain reaction that had a critical impact," Arak wrote.

Arak did not blame Microsoft for the troubles and said the outage ultimately rested with Skype. Arak said Skype's network normally has an ability to heal itself in such cases, but a previously unknown glitch in Skype's software prevented that from occurring quickly enough.

Some thoughts:

  • The phrasing "lack of peer-to-peer network resources" is quite interesting. One design goal for P2P systems is that their ability to handle load scales smoothly (or at least semi-smoothly) with the number of clients (peers) trying to use the system. It would be interesting to know what happened here; a toy model of how a mass restart can swamp sign-on capacity is sketched after this list.
  • This is probably not a behavior you'd see in a truly decentralized system. If, for instance, everyone in the world rebooted their SIP clients, this would probably not cause all the SIP phones in the world to stop working for two days, though it might cause transient outages as people independently rebooted their machines.
  • How hard would it be for an attacker to trigger this sort of behavior intentionally by bouncing a large number of Skype clients which they have taken over (i.e., zombies in a botnet)?
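
To make the scaling point a bit more concrete, here's a toy model (not Skype's actual design, and every number is invented) in which sign-on capacity is roughly proportional to the number of peers already online. A mass restart then cuts capacity at exactly the moment it creates a sign-on surge, and the backlog clears very slowly:

```python
# Toy model only -- this is not Skype's actual architecture, and all of
# the numbers below are made up. Assume sign-on capacity is proportional
# to the number of peers currently online.

def simulate(total_peers=1_000_000, reboot_fraction=0.9,
             logins_per_online_peer=0.01, steps=100):
    online = int(total_peers * (1 - reboot_fraction))  # peers that didn't reboot
    waiting = total_peers - online                     # peers trying to sign back on
    for t in range(steps):
        capacity = int(online * logins_per_online_peer)  # sign-ons served this step
        served = min(waiting, capacity)
        online += served
        waiting -= served
        if t % 10 == 0 or waiting == 0:
            print(f"t={t:3d}  online={online:8d}  waiting={waiting:8d}")
        if waiting == 0:
            break

if __name__ == "__main__":
    simulate()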
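```

With these (made-up) parameters most of the backlog is still there after 100 steps, while with a reboot_fraction of 0.2 it clears in a couple dozen steps. That's presumably the difference between the self-healing case Skype describes and the one that actually happened; the interesting question is where the tipping point is in the real system.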
 

August 19, 2007

Skype suffered an extended service outage last week. There were a lot of rumors about how this was the result of some sort of attack, though Skype denies it. Here's what they say:
Apologies for the delay, but we can now update you on the Skype sign-on issue. As we continue to work hard at resolving the problem, we wanted to dispel some of the concerns that you may have. The Skype system has not crashed or been victim of a cyber attack. We love our customers too much to let that happen. This problem occurred because of a deficiency in an algorithm within Skype networking software. This controls the interaction between the user's own Skype client and the rest of the Skype network.

I don't have any more information about this than anyone else, so it could be either an attack or just a simple error. In either case, even if you believe Skype's story, it suggests that the Skype system is fairly brittle. Basically, any problem with Skype's central servers, whether through attack or error, has the potential to bring down Skype as a whole. By contrast, in a more distributed/decentralized system, global outages tend to be a lot less common. For instance, if I have an account with SIP server Atlanta and you have an account with SIP server Biloxi, an outage at server Chicago doesn't affect us at all. Of course, a large-scale Internet outage could affect us, but those are relatively rare, and Skype is just as vulnerable to them anyway.
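
To belabor the point, here's a trivial sketch of the federated case, using the RFC 3261 example domains (the users and servers are purely illustrative, not a real deployment): a call depends only on the two endpoints' home servers, so an outage at an unrelated server doesn't enter into it.

```python
# Trivial sketch of the federated case. The domains are the RFC 3261
# example domains and the users are made up.

HOME_SERVER = {
    "alice": "atlanta.example.com",
    "bob":   "biloxi.example.com",
    "carol": "chicago.example.com",
}

def call_possible(caller, callee, down_servers):
    """A call works as long as neither party's home server is down."""
    return (HOME_SERVER[caller] not in down_servers and
            HOME_SERVER[callee] not in down_servers)

# chicago.example.com is down: alice and bob don't care, but carol is out.
print(call_possible("alice", "bob", {"chicago.example.com"}))    # True
print(call_possible("alice", "carol", {"chicago.example.com"}))  # False

# In the centralized case there is effectively one entry in the table,
# so any server problem takes out every call.
```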

I'm not arguing that SIP is somehow inherently superior to Skype. It's quite possible to build a SIP-based system which is just as centralized and fragile—if Vonage's servers go down, then no Vonage customer will be able to make phone calls. On the other hand, there are other SIP providers and they aren't affected by Vonage outages. The difference here is that Skype is inherently centralized, and you basically can't use Skype without talking to their servers somehow.[1] By contrast, SIP was specifically designed to be used in a decentralized environment, much like e-mail is now, and clients and servers from separate vendors more or less interoperate—though of course some network operators won't allow direct SIP connections so you sometimes (often?) need to go through the PSTN for SIP UA A to talk to SIP UA B.

This isn't to say that decentralized systems are inherently better, of course, but they are generally more resistant to this particular failure mode.

[1] Yes, the Skype protocol has been reverse engineered, but as far as I know there aren't any compatible clients or servers, and Skype's implementation is deliberately designed to be closed—you shouldn't expect to be able to use Skype's clients with such a service, which significantly decreases the value of using the Skype protocol.

 

August 17, 2007

Dreamhost was down. Hopefully we're back on the air now.
 

August 8, 2007

Infrant (makers of ReadyNAS, now owned by Netgear) just released a security advisory for remote root SSH access to their box:
NETGEAR has released an add-on to toggle SSH support for the ReadyNAS systems based on a potential exploit to obtain root user access to the ReadyNAS RAIDiator OS. Each ReadyNAS system incorporates a different root password that can be used by NETGEAR Support to understand and/or fix a ReadyNAS system remotely using the ReadyNAS serial number as a key. An attacker that has obtained the algorithm (and your serial number) to generate the root password would be able to remotely access the ReadyNAS and view, change, or delete data on the ReadyNAS.

ReadyNAS installation most vulnerable to this attack is in an unsecure LAN and where the ReadyNAS SSH port (22) is accessible by untrusting clients. Typical home environments are safe if a firewall is utilized and port 22 is not forwarded to the ReadyNAS from the router. We do advise that all ReadyNAS users perform this add-on installation regardless.

Installation of the ToggleSSH add-on will disable remote SSH access and thus close the vulnerability. At the same time, if you need remote access assistance from NETGEAR Support, you can install the ToggleSSH add-on again to re-enable SSH access during the time when the remote access is needed.

In other words, NETGEAR support can remotely log into any ReadyNAS box as root and manage it. A few notes:

  • I'm having trouble imagining any conditions under which I'd want NETGEAR support to have remote access to my fileserver (and no, I don't own one of these). I wonder if there's some way to change the root password or if you're stuck with this backdoor. Is this really something that they need a lot, or was it just a cunning plan that didn't get filtered out at some higher level?
  • They don't disclose the algorithm they use to produce the password. Some such algorithms are good and some are bad. It would be interesting to know which type this is.
  • There are three major ways to build a system like this on the verifying side:
    1. Have the box simply know its own password.
    2. Have the password-generation algorithm built into the box.
    3. Use public key cryptography. E.g., the password is a digital signature over the serial number.
    If I had to bet, it would be on (1) or (2). (2) is obviously pretty bad since it means that anyone who has a single box can reverse engineer the algorithm and generate as many passwords as they want. Anyone take one of these apart and know? (A sketch contrasting options (2) and (3) appears at the end of this post.)
  • What kind of auditing is available to find out if your box has already been taken over by some attacker who knows the key—or just someone from NETGEAR tech support?
Oh, and what were they thinking having this on by default? Outstanding!
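
For what it's worth, here's a sketch of what options (2) and (3) from the list above might look like. Everything here is hypothetical (the secret, the key handling, the password format), and the signature part uses the third-party cryptography package; the point is just why (2) is so much weaker than (3).

```python
# Hypothetical sketch of options (2) and (3); not NETGEAR's actual scheme.
import hashlib
import hmac
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# Option (2): every box derives the support password from its serial number
# and a secret baked into the firmware. Anyone who pulls the secret out of
# one box can generate the password for every box.
FIRMWARE_SECRET = b"shared-by-all-boxes"  # hypothetical

def option2_password(serial: str) -> str:
    return hmac.new(FIRMWARE_SECRET, serial.encode(), hashlib.sha256).hexdigest()[:16]

# Option (3): the vendor holds a private key; each box ships only the public
# key. Tearing a box apart gets you the public key, which doesn't let you
# mint passwords for other boxes, because you still can't sign.
support_key = Ed25519PrivateKey.generate()  # lives only at the vendor
box_public_key = support_key.public_key()   # burned into every box

def option3_password(serial: str) -> bytes:
    return support_key.sign(serial.encode())  # computed vendor-side on request

def box_accepts(serial: str, password: bytes) -> bool:
    try:
        box_public_key.verify(password, serial.encode())
        return True
    except Exception:
        return False

print(option2_password("RN-12345"))
print(box_accepts("RN-12345", option3_password("RN-12345")))  # True
print(box_accepts("RN-12345", b"guess"))                      # False
```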
 

August 6, 2007

In my previous post about SWORDS robots, I referred to "fail-safe" and "fail-unsafe" strategies. Now, clearly, if you're a civilian in the line of fire of a killer robot, you'd consider a strategy in which the robot shut itself down when it couldn't communicate with base to be "safe", but you might feel a little differently if you were a soldier who had to go out into enemy fire because a minor communication glitch caused your robot to shut down.

As another example, take a system like Wireless Access in Vehicular Environments (WAVE), which provides for communications between vehicles and between vehicles and road-side units. WAVE can be used for safety messages, such as the Curve Speed Warning message, which allows a station at the side of the road to broadcast the maximum safe speed for a given curve. Obviously, you'd like there to be some message integrity here to prevent an attacker from broadcasting a fake speed. Now, what happens when the integrity check fails? Do you ignore the message?

A decent argument could be made that either ignoring or trusting such messages is "fail-safe". Obviously, ignoring them appears safe in the sense that your vehicle simply reverts to behaving as it would without the WAVE functionality, so you haven't been made any worse off. On the other hand, the curve speed warning is designed to improve safety (that's why it's being broadcast), so ignoring it is arguably failing unsafe! I don't really have a position on what's right or wrong here, but it should be clear that the terminology is confusing.
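
Here's a small sketch of the two policies, just to make the choice concrete. The real WAVE security work (IEEE 1609.2) uses certificate-based signatures rather than a shared key; the HMAC below is only a stand-in to keep the example self-contained, and the key and message format are invented.

```python
# Illustrative only: a shared-key HMAC stands in for the real signature scheme.
import hashlib
import hmac

ROADSIDE_KEY = b"demo-key"  # hypothetical; real units would carry certificates

def make_warning(curve_speed_kph: int):
    msg = str(curve_speed_kph).encode()
    tag = hmac.new(ROADSIDE_KEY, msg, hashlib.sha256).digest()
    return msg, tag

def advised_speed(msg: bytes, tag: bytes, default_kph: int, ignore_on_failure: bool) -> int:
    """Speed the vehicle advises the driver after checking the message.

    ignore_on_failure=True:  drop unverified messages, i.e., revert to
                             behaving as if WAVE weren't there.
    ignore_on_failure=False: use the message anyway, which keeps the safety
                             function but lets an attacker inject fake speeds.
    """
    expected = hmac.new(ROADSIDE_KEY, msg, hashlib.sha256).digest()
    if hmac.compare_digest(tag, expected):
        return int(msg)
    return default_kph if ignore_on_failure else int(msg)

msg, tag = make_warning(40)
print(advised_speed(msg, tag, default_kph=80, ignore_on_failure=True))        # 40: verified warning

forged = b"120"  # attacker-injected speed with no valid tag
print(advised_speed(forged, b"x", default_kph=80, ignore_on_failure=True))    # 80: dropped
print(advised_speed(forged, b"x", default_kph=80, ignore_on_failure=False))   # 120: attacker wins
```

Which branch deserves to be called "fail-safe" is exactly the ambiguity described above.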

I've heard people substitute the terms "fail-open" or "fail-closed", but those are even worse. If you're an electrical engineer, a closed circuit means current flows and an open circuit means current doesn't. On the other hand, an open firewall means that data flows but a closed one means it doesn't.

I don't know of any really good terms, unfortunately.