August 7, 2012

Yesterday, Microsoft published their CU-RTC-Web WebRTC API proposal as an alternative to the existing W3C WebRTC API being implemented in Chrome and Firefox. Microsoft's proposal is a "low-level API" which exposes a set of media- and transport-level primitives to the JavaScript Web application, which is expected to stitch them together into a complete calling system. By contrast to the current "mid-level" API, the Microsoft API moves a lot of complexity from the browser to the JavaScript, but the authors argue that this makes it more powerful and flexible. I don't find these arguments that convincing, however: many of them are fairly abstract and rhetorical, and when we get down to concrete use cases, the examples Microsoft gives seem like things that could easily be done within the existing framework. So, while it's clear that the Microsoft proposal is a lot more work for the application developer, it's a lot less clear that it's sufficiently more powerful to justify that additional complexity.

Microsoft's arguments for the superiority of this API fall into three major categories:

  • JSEP doesn't match with "key Web tenets"; i.e., it doesn't match the Web/HTML5 style.
  • It allows the development of applications that would otherwise be difficult to develop with the existing W3C API.
  • It will be easier to make it interoperate with existing VoIP endpoints.

Like any all-new design, this API has the significant advantage (which the authors don't mention) of architectural cleanliness. The existing API is a compromise between a number of different architectural notions and, like any hybrid proposal, has points of ugliness where those notions come into contact with each other (especially in the area of SDP). However, when we actually look at functionality rather than elegance, the advantages of an all-new design---not only one which is largely not based on preexisting technologies but one which involves discarding most of the existing work on WebRTC itself---start to look fairly thin.

Looking at the three claims listed above: the first seems more rhetorical than factual. It's certainly true that in the early years of the Web designers strove to keep state out of the Web browser, but that hasn't been the case with rich Web applications for quite some time. To the contrary, many modern HTML5 technologies (localstore, WebSockets, HSTS, WebGL) are about pushing state onto the browser from the server.

The interoperability argument is similarly weakly supported. Given that JSEP is based on existing VoIP technologies, it seems likely that it is easier to make it interoperate with existing endpoints, since it's not first necessary to implement those technologies (principally SDP) in JavaScript before you can even try to interoperate. The idea here seems to be that it will be easier to accommodate existing noncompliant endpoints if you can adapt your Web application on the fly, but given the significant entry barrier to interoperating at all, this seems like an argument that needs rather more support than MS has currently offered.

Finally, with regard to the question of the flexibility/JavaScript complexity tradeoff, it's somewhat distressing that the specific applications that Microsoft cites (baby monitoring, security cameras, etc.) are so pedestrian and easily handled by JSEP. This isn't of course to say that there aren't applications which we can't currently envision which JSEP would handle badly, but it rather undercuts this argument if the only examples you cite in support of a new design are those which are easily handled by the old one.

None of this is to say that CU-RTC-Web wouldn't be better in some respects than JSEP. Obviously, any design has tradeoffs and as I said above, it's always appealing to throw all that annoying legacy stuff away and start fresh. However, that also comes with a lot of costs and before we consider that we really need to have a far better picture of what benefits other than elegance starting over would bring to the table.

More or less everyone agrees about the basic objectives of the WebRTC effort: to bring real-time communications (i.e., audio, video, and direct data) to browsers. Specifically, the idea is that Web applications should be able to use these capabilities directly. This sort of functionality was of course already available either via generic plugins such as Flash or via specific plugins such as Google Talk, but the idea here was to have a standardized API that was built into browsers.

In spite of this agreement about objectives, from the beginning there was debate about the style of API that was appropriate, and in particular how much of the complexity should be in the browser and how much in the JavaScript. The initial proposals broke down into two main flavors:

  • High-level APIs — essentially a softphone in the browser. The Web application would request the creation of a call (perhaps with some settings as to what kinds of media it wanted) and then each browser would emit standardized signaling messages which the Web application would arrange to transmit to the other browser. The original WHATWG HTML5/PeerConnection spec was of this type.
  • Low-level APIs — an API which exposed a bunch of primitive media and transport capabilities to the JavaScript. A browser that implemented this sort of API couldn't really do much by itself. Instead, you would need to write something like a softphone in JavaScript, including implementing the media negotiation, all the signaling state machinery, etc. Matthew Kaufman from Microsoft was one of the primary proponents of this design.

After a lot of debate, the WG ultimately rejected both of these and settled on a protocol called JavaScript Session Establishment Protocol (JSEP), which is probably best described as a mid-level API. That design, embodied in the current specifications [], keeps the transport establishment and media negotiation in the browser but moves a fair amount of the session establishment state machine into the JavaScript. While it doesn't standardize signaling, it also has a natural mapping to a simple signaling protocol as well as to SIP and Jingle, the two dominant standardized calling protocols. The idea is supposed to be that it's simple to write a basic application (indeed, a large number of such simple demonstration apps have been written) but that it's also possible to exercise advanced features by manipulating the various data structures emitted by the browser. This is obviously something of a compromise between the first two classes of proposals.

The decision to follow this trajectory was made somewhere around six months ago and at this point Google has a fairly mature JSEP implementation available in Chrome Canary while Mozilla has a less mature implementation which you could compile yourself but hasn't been released in any public build.

Yesterday, Microsoft made a new proposal, called CU-RTC-Web. See the blog post and the specification.

Below is an initial, high-level analysis of this proposal.

Disclaimer: I have been heavily involved with both the IETF and W3C working groups in this area and have contributed significant chunks of code to both the Chrome and Firefox implementations. I am also currently consulting for Mozilla on their implementation. However, the comments here are my own and don't necessarily represent those of any other organization.

What Microsoft is proposing is effectively a straight low-level API.

There are a lot of different API points, and I don't plan to discuss the API in much detail, but it's helpful to talk about it a bit to get a flavor of what's required to use it.

  • RealTimeMediaStream -- each RealTimeMediaStream represents a single flow of media (i.e., audio or video).
  • RealTimeMediaDescription -- a set of parameters for the RealTimeMediaStream.
  • RealTimeTransport -- a transport channel which a RealTimeMediaStream can run over.
  • RealTimePort -- a transport endpoint which can be paired with a RealTimePort on the other side to form a RealTimeTransport.

In order to set up an audio, video, or audio-video session, then, the JS has to do something like the following:

  1. Acquire local media streams on each browser via the getUserMedia() API, thus getting some set of MediaStreamTracks.
  2. Create RealTimePorts on each browser for all the local network addresses as well as for whatever media relays are available/required.
  3. Communicate the coordinates for the RealTimePorts from each browser to the other.
  4. On each browser, run ICE connectivity checks for all combinations of remote and local RealTimePorts.
  5. Select a subset of the working remote/local RealTimePort pairs and establish RealTimeTransports based on those pairs. (This might be one or might be more than one depending on the number of media flows, level of multiplexing, and the level of redundancy required).
  6. Determine a common set of media capabilities and codecs between each browser, select a specific set of media parameters, and create matching RealTimeMediaDescriptions on each browser based on those parameters.
  7. Create RealTimeMediaStreams by combining RealTimeTransports, RealTimeMediaDescriptions, and MediaStreamTracks.
  8. Attach the remote RealTimeMediaStreams to some local display method (such as an audio or video tag).

For comparison, in JSEP you would do something like:

  1. Acquire local media streams on each browser via the getUserMedia() API, thus getting some set of MediaStreamTracks.
  2. Create a PeerConnection() and call AddStream() for each of the local streams.
  3. Create an offer on one browser, send it to the other side; create an answer on the other side and send it back to the offering browser. In the simplest case, this just involves making some API calls with no arguments and passing the results to the other side.
  4. The PeerConnection fires callbacks announcing remote media streams which you attach to some local display method.

As should be clear, the CU-RTC-Web proposal requires significantly more complex JavaScript, and in particular requires the JavaScript to be a lot smarter about what it's doing. In a JSEP-style API, the Web programmer can be pretty ignorant about things like codecs and transport protocols unless he wants to do something fancy, but with CU-RTC-Web, he needs to understand a lot of stuff to make things work at all. In some ways, the JSEP approach is a much better fit for the traditional Web style of having simple default behaviors which fit a lot of cases but which can then be customized, albeit in ways that are sometimes a bit clunky.

Note that it's not like this complexity doesn't exist in JSEP, it's just been pushed into the browser so that the user doesn't have to see it. As discussed below, Microsoft's argument is that this simplicity in the JavaScript comes at a price in terms of flexibility and robustness, and that libraries will be developed (think jQuery) to give the average Web programmer a simple experience, so that they won't have to accept a lot of complexity themselves. However, since those libraries don't exist, it seems kind of unclear how well that's going to work.

Microsoft's proposal and the associated blog post make a number of major arguments for why it is a superior choice (the proposal just came out yesterday, so there haven't really been any public arguments for why it's worse). Combining them, you get something like this:

  • That the current specification violates "fit with key web tenets", specifically that it's not stateless and that you can only make changes when in specific states. Also, that it depends on the SDP offer/answer model.
  • That it doesn't allow a "customizable response to changing network quality".
  • That it doesn't support "real-world interoperability" with existing equipment.
  • That it's too tied to specific media formats and codecs.
  • That JSEP requires a Web application to do some frankly inconvenient stuff if it wants to do something that the API doesn't have explicit support for.
  • That it's inflexible and/or brittle with respect to new applications and in particular that it's difficult to implement some specific "innovative" applications with JSEP.
Below we examine each of these arguments in turn.

MS writes:

Honoring key Web tenets-The Web favors stateless interactions which do not saddle either party of a data exchange with the responsibility to remember what the other did or expects. Doing otherwise is a recipe for extreme brittleness in implementations; it also raises considerably the development cost which reduces the reach of the standard itself.

This sounds rhetorically good, but I'm not sure how accurate it is. First, the idea that the Web is "stateless" feels fairly anachronistic in an era where more and more state is migrating from the server. To pick two examples, WebSockets involves forming a fairly long-term stateful two-way channel between the browser and the server, and localstore/localdb allow the server to persist data semi-permanently on the browser. Indeed, CU-RTC-Web requires forming a nontrivial amount of state on the browser in the form of the RealTimePorts, which represent actual resource reservations that cannot be reliably reconstructed if (for instance) the page reloads. I think the idea here is supposed to be that this is "soft state", in that it can be kept on the server and just reimposed on the browser at refresh time, but as the RealTimePorts example shows, it's not clear that this is the case. Similar comments apply to the state of the audio and video devices which are inherently controlled by the browser.

Moreover, it's never been true that neither party in the data exchange was "saddled" with remembering what the other did; rather, it used to be the case that most state sat on the server, and indeed, that's where the CU-RTC-Web proposal keeps it. This is the first time we have really built a Web-based peer-to-peer app. Pretty much all previous applications have been client-server applications, so it's hard to know what idioms are appropriate in a peer-to-peer case.

I'm a little puzzled by the argument about "development cost"; there are two kinds of development cost here: that to browser implementors and that to Web application programmers. The MS proposal puts more of that cost on Web programmers whereas JSEP puts more of the cost on browser implementors. One would ordinarily think that as long as the standard wasn't too difficult for browser implementors to develop at all, then pushing complexity away from Web programmers would tend to increase the reach of the standard. One could of course argue that this standard is too complicated for browser implementors to implement at all, but the existing state of Google and Mozilla's implementations would seem to belie that claim.

Finally, given that the original WHATWG draft had even more state in the browser (as noted above, it was basically a high-level API), it's a little odd to hear that Ian Hickson is out of touch with the "key Web tenets".

The CU-RTC-Web proposal writes:

Real time media applications have to run on networks with a wide range of capabilities varying in terms of bandwidth, latency, and noise. Likewise these characteristics can change while an application is running. Developers should be able to control how the user experience adapts to fluctuations in communication quality. For example, when communication quality degrades, the developer may prefer to favor the video channel, favor the audio channel, or suspend the app until acceptable quality is restored. An effective protocol and API will have to arm developers with the tools to tailor such answers to the exact needs of the moment, while minimizing the complexity of the resulting API surface.

It's certainly true that it's desirable to be able to respond to changing network conditions, but it's a lot less clear that the CU-RTC-Web API actually offers a useful response to such changes. In general, the browser is going to know a lot more about what the bandwidth/quality tradeoff of a given codec is going to be than most JavaScript applications will, and so it seems at least plausible that you're going to do better with a small number of policies (audio is more important than video, video is more important than audio, etc.) than you would by having the JS try to make fine-grained decisions about what it wants to do. It's worth noting that the actual "customizable" policies that are proposed here seem pretty simple. The idea seems to be not that you would impose policy on the browser but rather that since you need to implement all the negotiation logic anyway, you get to implement whatever policy you want.

Moreover, there's a real concern that this sort of adaptation will have to happen in two places: as MS points out, this kind of network variability is really common and so applications have to handle it. Unless you want to force every JS calling application in the universe to include adaptation logic, the browser will need some (potentially configurable and/or disableable) logic. It's worth asking whether whatever logic you would write in JS is really going to be enough better to justify this design.

In their blog post, MS writes about JSEP:

it shows no signs of offering real world interoperability with existing VoIP phones, and mobile phones, from behind firewalls and across routers and instead focuses on video communication between web browsers under ideal conditions. It does not allow an application to control how media is transmitted on the network.

I wish this argument had been elaborated more, since it seems like CU-RTC-Web is less focused on interoperability, not more. In particular, since JSEP is based on existing technologies such as SDP and ICE, it's relatively easy to build Web applications which gateway JSEP to SIP or Jingle signaling (indeed, relatively simple prototypes of these already exist). By contrast, gatewaying CU-RTC-Web signaling to either of these protocols would require developing an entire SDP stack, which is precisely the piece that the MS guys are implicitly arguing is expensive.

Based on Matthew Kaufman's mailing list postings, his concern seems to be that there are existing endpoints which don't implement some of the specifications required by WebRTC (principally ICE, which is used to set up the network transport channels) correctly, and that it will be easier to interoperate with them if your ICE implementation is written in JavaScript and downloaded by the application rather than in C++ and baked into the browser. This isn't a crazy theory, but I think there are serious open questions about whether it is correct. The basic problem is that it's actually quite hard to write a good ICE stack (though easy to write a bad one). The browser vendors have the resources to do a good job here, but it's less clear that random JS toolkits that people download will actually do that good a job (especially if they are simultaneously trying to compensate for broken legacy equipment). The result of having everyone write their own ICE stack might be good but it might also lead to a landscape where cross-Web application interop is basically impossible (or where there are islands of noninteroperable de facto standards based on popular toolkits or even popular toolkit versions).
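To give a flavor of the kind of detail an ICE stack has to get right, here is just one small corner of it: the candidate-pair priority computation from RFC 5245 (section 5.7.2), sketched in Python. The function itself is straight from the spec; everything around it is where the real work lives.

```python
# Sketch of the ICE candidate-pair priority formula from RFC 5245, sec. 5.7.2.
# g is the controlling agent's candidate priority, d is the controlled
# agent's; connectivity checks are scheduled in descending pair priority.

def pair_priority(g: int, d: int) -> int:
    """Combine the two sides' candidate priorities into a pair priority."""
    return (2 ** 32) * min(g, d) + 2 * max(g, d) + (1 if g > d else 0)
```

This formula is the easy part; a production ICE stack also needs check pacing, retransmission timers, role-conflict repair, and workarounds for broken peers, and that surrounding machinery is exactly what is hard to get right in a downloaded JS library.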

A lot of people's instincts here seem to be based on an environment where updating the software on people's machines was hard but updating one's Web site was easy. But about half of the browser population (Chrome and Firefox) now does rapid auto-updates, so browsers actually are generally fairly modern. By contrast, Web applications often use downrev versions of their JS libraries (I wish I had survey data here, but it's easy to see just by opening up a JS debugger on your favorite sites). It's not at all clear that the "JS is easy to upgrade/native is hard" dynamic holds up any more.

The proposal says:

A successful standard cannot be tied to individual codecs, data formats or scenarios. They may soon be supplanted by newer versions, which would make such a tightly coupled standard obsolete just as quickly. The right approach is instead to to support multiple media formats and to bring the bulk of the logic to the application layer, enabling developers to innovate.

I can't make much sense of this at all. JSEP, like the standards that it is based on, is agnostic about the media formats and codecs that are used. There's certainly nothing in JSEP that requires you to use VP8 for your video codec, Opus for your audio codec, or anything else. Rather, two conformant JSEP implementations will converge on a common subset of interoperable formats. This should happen automatically without Web application intervention.
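The convergence amounts to a preference-ordered intersection. Here is a toy Python sketch of the idea (the codec names and the explicit function are illustrative only; real browsers converge via SDP offer/answer rather than any such call):

```python
def negotiate_codecs(offered, supported):
    """Return the offered codecs the answerer also supports, keeping the
    offerer's preference order -- roughly what offer/answer convergence
    produces without any Web application intervention."""
    supported_set = set(supported)
    return [codec for codec in offered if codec in supported_set]

# e.g. negotiate_codecs(["opus", "G722", "PCMU"], ["PCMU", "opus"])
# -> ["opus", "PCMU"]
```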

Arguably, in fact, CU-RTC-Web is *more* tied to a given codec because the codec negotiation logic is implemented either on the server or in the JavaScript. If a browser adds support for a new codec, the Web application needs to detect that and somehow know how to prioritize it against existing known codecs. By contrast, when the browser manufacturer adds a new codec, he knows how it performs compared to existing codecs and can adjust his negotiation algorithms accordingly. Moreover, as discussed below, JSEP provides (somewhat clumsy) mechanisms for the user to override the browser's default choices. These mechanisms could probably be made better within the JSEP architecture.

Based on Matthew Kaufman's interview with Janko Roettgers [], it seems like this may actually be about the proposal to have a mandatory-to-implement video codec (the leading candidates seem to be H.264 or VP8). Obviously, there have been a lot of arguments about whether such a mandatory codec is required (the standard argument in favor of it is that then you know that any two implementations have at least one codec in common), but this isn't really a matter of "tightly coupling" the codec to the standard. To the contrary, if we mandated VP8 today and then next week decided to mandate H.264 it would be a one-line change in the specification. In any case, this doesn't seem like a structural argument about JSEP versus CU-RTC-Web. Indeed, if IETF and W3C decided to ditch JSEP and go with CU-RTC-Web, it seems likely that this wouldn't affect the question of mandatory codecs at all.

Probably the strongest point that the MS authors make is that if the API doesn't explicitly support doing something, the situation is kind of gross:

In particular, the negotiation model of the API relies on the SDP offer/answer model, which forces applications to parse and generate SDP in order to effect a change in browser behavior. An application is forced to only perform certain changes when the browser is in specific states, which further constrains options and increases complexity. Furthermore, the set of permitted transformations to SDP are constrained in non-obvious and undiscoverable ways, forcing applications to resort to trial-and-error and/or browser-specific code. All of this added complexity is an unnecessary burden on applications with little or no benefit in return.

What this is about is that in JSEP you call CreateOffer() on a PeerConnection in order to get an SDP offer. This doesn't actually change the PeerConnection state to accommodate the new offer; instead, you call SetLocalDescription() to install the offer. This gives the Web application the opportunity to apply its own preferences by editing the offer. For instance, it might delete a line containing a codec that it didn't want to use. Obviously, this requires a lot of knowledge of SDP in the application, which is irritating to say the least, for the reasons in the quote above.
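To make the kind of string surgery involved concrete, here is a deliberately simplified Python sketch of stripping one codec out of an offer before installing it. (This is illustrative only: real SDP also has a=fmtp lines, bundling, and a grammar that makes naive line editing fragile, which is exactly the complaint.)

```python
def strip_codec(sdp: str, payload_type: str) -> str:
    """Remove one payload type from a (simplified) SDP blob: drop its
    a=rtpmap line and delete it from the m= line's format list."""
    out = []
    for line in sdp.split("\r\n"):
        if line.startswith("a=rtpmap:" + payload_type + " "):
            continue  # drop the codec's rtpmap attribute entirely
        if line.startswith("m="):
            parts = line.split(" ")
            # m=<media> <port> <proto> <fmt> ... : filter the format list
            line = " ".join(parts[:3] +
                            [p for p in parts[3:] if p != payload_type])
        out.append(line)
    return "\r\n".join(out)
```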

The major mitigating factor is that the W3C/IETF WG members intend to allow most common manipulations to be made through explicit settings parameters, so that only really advanced applications need to know anything about SDP at all. Obviously opinions vary about how good a job they have done, and of course it's possible to write libraries that would make this sort of manipulation easier. It's worth noting that there has been some discussion of extending the W3C APIs to have an explicit API for manipulating SDP objects rather than just editing the string versions (perhaps by borrowing some of the primitives in CU-RTC-Web). Such a change would make some things easier while not really representing a fundamental change to the JSEP model. However, it's not clear if there are enough SDP-editing tasks to make this project worthwhile.

With that said, in order to have CU-RTC-Web interoperate with existing SIP endpoints at all, you would need to know far more about SDP than would be required to do most anticipated transformations in a JSEP environment, so it's not like CU-RTC-Web frees you from SDP if you care about interoperability with existing equipment.

Finally, the MSFT authors argue that CU-RTC-Web is more flexible and/or less brittle than JSEP:

On the other hand, implementing innovative, real-world applications like security consoles, audio streaming services or baby monitoring through this API would be unwieldy, assuming it could be made to work at all. A Web RTC standard must equip developers with the ability to implement all scenarios, even those we haven't thought of.

Obviously the last sentence is true, but the first sentence provides scant support for the claim that CU-RTC-Web fulfills this requirement better than JSEP. The particular applications cited here, namely audio streaming, security consoles, and baby monitoring, seem not only doable with JSEP, but straightforward. In particular, security consoles and baby monitoring just look like one way audio and/or video calls from some camera somewhere. This seems like a trivial subset of the most basic JSEP functionality. Audio streaming is, if anything, even easier. Audio streaming from servers already exists without any WebRTC functionality at all, in the form of the audio tag, and audio streaming from client to server can be achieved with the combination of getUserMedia and WebSockets. Even if you decided that you wanted to use UDP rather than WebSockets, audio streaming is just a one-way audio call, so it's hard to see that this is a problem.

In e-mail to the W3C WebRTC mailing list, Matthew Kaufman mentions the use case of handling page reload:

An example would be recovery from call setup in the face of a browser page reload... a case where the state of the browser must be reinitialized, leading to edge cases where it becomes impossible with JSEP for a developer to write Javascript that behaves properly in all cases (because without an offer one cannot generate an answer, and once an offer has been generated one must not generate another offer until the first offer has been answered, but in either case there is no longer sufficient information as to how to proceed).

This use case, often called "rehydration", has been studied a fair bit and it's not entirely clear that there is a convenient solution with JSEP. However, the problem isn't the offer/answer state, which is actually easily handled, but rather the ICE and cryptographic state, which are just as troublesome with CU-RTC-Web as they are with JSEP [for a variety of technical reasons, you can't just reuse the previous settings here]. So, while rehydration is an issue, it's not clear that CU-RTC-Web makes matters any easier.

This argument, which should be the strongest of MS's arguments, feels rather like the weakest. Given how much effort has already gone into JSEP, both in terms of standards and implementation, if we're going to replace it with something else that something else should do something that JSEP can't, not just have a more attractive API. If MS can't come up with any use cases that JSEP can't accomplish, and if in fact the use cases they list are arguably more convenient with JSEP than with CU-RTC-Web, then that seems like a fairly strong argument that we should stick with JSEP, not one that we should replace it.

What I'd like to see Microsoft do here is describe some applications that are really a lot easier with CU-RTC-Web than they are with JSEP. Depending on the details, this might be a more or less convincing argument, but without some examples, it's pretty hard to see what considerations other than aesthetic would drive us towards CU-RTC-Web.

Thanks to Cullen Jennings, Randell Jesup, Maire Reavy, and Tim Terriberry for early comments on this draft.


July 19, 2012

The other day I went to Home Depot to buy some party supplies (incidentally, check out the party invitation here and the bonus Web site here. It's some of my better work.). One of the things I wanted was a set of rope lights. I eventually picked up three sets of 48' lights at $36.48 each. However, when I went to ring them up (you know Home Depot is almost all self-check, right?) two rang up at $62.48.

Looking closely, what happened is that the lights were packaged in clear plastic clamshell packaging with two paper labels, one in the front and one in the back. The paper label in the front showed the 48' lights listed above. The back label (the one with the bar code) showed 27' LED lights (LEDs are cooler and cool == expensive). It took a while for Home Depot to sort the problem out. Customer service's initial reaction was that someone had returned a set of the cheap lights but swapped the back labels so that they could get a larger refund. But then they had some more lights pulled off the shelf and they were mismatched as well, so things started to look a bit confused. Eventually, they just pulled the back pages out of the package (I guess to make it hard for me to do a return) and sent me on my way.

Here's the screwed up thing: nobody in this entire transaction was sure which set of actual lights I had in my hand. The matching package (the one which had rung up as expected) looked a lot like the other two packages, but really these things look pretty similar and after all we didn't know that any of the packages was right. I offered to take them out and measure them for length, but nobody seemed interested. So, at the time I walked out the door it seemed quite possible that Home Depot had sold me $188 worth of lights for $109. Of course, I assured them that I would bring them back if they turned out to be the LED lights, but they had no way of knowing I actually would (or of verifying if I did or not). I actually tried to explain this several times, but nobody seemed to care and eventually I gave up and left.

Turns out that they were the right lights after all, though.


July 16, 2012

One of the most common responses to the Rizzo/Duong "BEAST" attack was "why not just deploy TLS 1.1?" See, for instance, this incredibly long Bugzilla bug about TLS 1.1 in Network Security Services (NSS), the SSL/TLS stack used by both Chrome and Firefox. Unfortunately, while TLS 1.1 deployment is a good idea in and of itself, it turns out not to be a very useful defense against this particular attack. The problem isn't that servers don't support TLS 1.1 (though most still don't) but rather that the attacker can force a client and server which both implement TLS 1.1 to negotiate TLS 1.0 (which is vulnerable).

Background: Protocol Negotiation and Downgrade Attacks
Say we are designing a new protocol to remotely control toasters, the Toaster Control Protocol (TCP). TCP has a client controller, the Toaster Control Equipment (TCE), and a device responsible for toasting the bread, the Toaster Heating Equipment (THE). We'll start by developing TCP 1.0, but we expect that as time goes on we'll want to add new features and eventually we'll want to deploy TCP 2.0. So, for instance, maybe TCP 1.0 will only support toasters with up to two slots, but TCP 2.0 will add toaster ovens (as has been widely observed, TCP 3.0 will allow you to send and receive e-mail). We may also change the protocol encoding between versions, so TCP 1.0 could have an ASCII representation whereas TCP 2.0 adds a binary encoding to save bits on the wire. For obvious reasons, each version doesn't roll out all at once, so I might want my TCP 2.0 TCE to talk to my TCP 1.0 THE. Obviously, that communication will be TCP 1.0, but if I later add a TCP 2.0 toaster oven, I want that to communicate with my TCE using TCP 2.0.

One traditional way to address this problem is to have some sort of initial handshake in which each side advertises its capabilities and they converge on a common version (typically the most recent common version). So, for instance, my TCE would say "I speak 2.0", but if the THE says "I only speak 1.0" then you end up with 1.0. On the other hand, if the TCE advertises 2.0 and the THE speaks 2.0, then you end up with 2.0. As in:

TCE -> THE: Hello, I speak versions 1.0, 2.0
THE -> TCE: Let's do 2.0
TCE <-> THE: Version 2.0 traffic...
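This kind of highest-common-version negotiation can be sketched in a few lines (a toy model, with versions as (major, minor) tuples so they compare numerically):

```python
# Minimal sketch of version negotiation: each side advertises the versions it
# supports, and they settle on the most recent version in common.
def negotiate(client_versions, server_versions):
    common = set(client_versions) & set(server_versions)
    if not common:
        raise ValueError("no version in common")
    return max(common)

print(negotiate({(1, 0), (2, 0)}, {(1, 0)}))          # 2.0 TCE, 1.0 THE -> (1, 0)
print(negotiate({(1, 0), (2, 0)}, {(1, 0), (2, 0)}))  # both speak 2.0 -> (2, 0)
```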

Another common approach is to have individual feature negotiation rather than version numbers. For instance, the TCE might say "do you know how to make grilled cheese" and the THE would say "yes" or "no". In that case, you can roll out individual features rather than have a big version number jump. Sometimes, systems will have both types of negotiation, with the version number indicating a pile of features that go together and also being able to negotiate individual features. TLS is actually one such protocol, though the features are called "extensions" (not an uncommon name for this). So you get something like:

TCE -> THE: Hello, I do "toaster oven", "grilled cheese", "bagels"
THE -> TCE: I can do "bagels"
TCE -> THE: OK, let's toast some bagels

For non-security protocols, or rather ones where you don't need to worry about attackers, or rather those where you don't think you need to worry about attackers, this kind of approach mostly works pretty well, though there's always the risk that someone will screw up their side of the negotiation. With protocols that are security-relevant, however, things are a little different. Let's say that in TCP 2.0 we decide to add encryption. The negotiation looks pretty much the same as before:

TCE -> THE: Hello, I speak versions 1.0, 2.0
THE -> TCE: Let's do 2.0
TCE <-> THE: Encrypted traffic

But since we're talking security we need to assume someone might be attacking us, and in particular they might be tampering with the traffic, like so:

TCE -> Attacker: Hello, I speak versions 1.0, 2.0
Attacker -> THE: Hello, I speak version 1.0
THE -> TCE: Let's do 1.0
TCE <-> THE: Unencrypted traffic

This is what's called a downgrade attack or a bid-down attack. Even though in principle both sides could do version 2.0 (and an encrypted channel), the attacker has forced them down to 1.0 (and a clear channel). Similar attacks can be mounted against negotiation of cryptographic features. Consider, for instance, the case where we are negotiating cryptographic algorithms and each side supports both AES (a strong algorithm) and DES (a weak algorithm), and the attacker forces both sides down to DES:

TCE -> Attacker: I can do AES, DES
Attacker -> THE: I can do DES
THE -> TCE: OK, let's do DES
TCE <-> THE: Traffic encrypted with DES

There are two basic defenses against this kind of downgrade attack. The first is for each side to remember the other side's capabilities and complain if those expectations are violated. So, for instance, the first time that the TCE and THE communicate, the TCE notices that the THE can do TCP 2.0 and from then on it refuses to do TCP 1.0. Obviously, an attacker can downgrade you on the first communication, but if you ever get a communication without the attacker in the way, then you are immune from attack thereafter (at least until both sides upgrade again). This isn't a fantastic defense for a number of reasons, but it's more or less the best you can do in the non-cryptographic setting. In the setting where you are building a security protocol, however, there's a better solution. Most association-oriented security protocols (SSL/TLS, IPsec, etc.) have a handshake phase where they do version/feature negotiation and key establishment, followed by a data transfer phase where the actual communications happen. In most such protocols, the handshake phase includes an integrity check over the handshake messages. So, for instance, in SSL/TLS, the Finished messages include a Message Authentication Code (MAC) computed over the handshake and keyed with the exchanged master secret:

Client -> Server: ClientHello
Server -> Client: ServerHello, Certificate, ServerHelloDone
Client -> Server: ClientKeyExchange, ChangeCipherSpec, Finished
Server -> Client: ChangeCipherSpec, Finished

Any tampering with any of the handshake values causes the handshake to fail. This makes downgrade attacks more difficult: as long as the weakest shared key exchange protocol and the weakest shared MAC are sufficiently strong (both of these things are true for TLS), then pretty much everything else can be negotiated safely, including features and version numbers. [Technical note: SSL version 2 didn't have anti-downgrade defenses, so there are some other anti-downgrade mechanisms in SSL/TLS as well.] This is why it's so important to establish a baseline level of cryptographic security in the first version of the protocol: it lets you prevent downgrade attacks to the nonsecure version.
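The Finished-message idea can be sketched as follows. Both sides MAC the entire handshake transcript under the exchanged secret, so any message an attacker altered in flight makes the two MACs disagree. The message contents below are illustrative stand-ins, not real TLS encodings:

```python
# Both sides compute a MAC over everything they believe was said in the
# handshake; a man-in-the-middle who stripped the 2.0 offer makes them differ.
import hashlib
import hmac

def finished_mac(master_secret: bytes, transcript: list) -> bytes:
    return hmac.new(master_secret, b"|".join(transcript), hashlib.sha256).digest()

secret = b"negotiated-master-secret"
sent     = [b"ClientHello: versions 1.0, 2.0", b"ServerHello: 2.0"]
tampered = [b"ClientHello: versions 1.0",      b"ServerHello: 1.0"]

# The client MACs what it sent; the server MACs what it saw. A downgrade
# in the middle makes the MACs disagree, and the handshake fails.
assert finished_mac(secret, sent) != finished_mac(secret, tampered)
```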

Attacks on TLS 1.1 Negotiation
Based on what I said above, it would seem that rolling out TLS 1.1 securely would be no problem. And if everything was perfect, then that would indeed be true. Unfortunately, everything is not perfect. In order for version negotiation to work properly, a version X implementation needs to accept offers of version Y > X (although of course it will negotiate version X). However, some nontrivial number of TLS servers and/or intermediaries (on the order of 1%) will not complete the TLS handshake if TLS 1.1 is offered (I don't mean that they negotiate 1.0; instead an error occurs). There are similar problems (though less extensive) with TLS extensions and with offering TLS 1.0 as opposed to SSLv3.

No browser wants to break on 1% of the sites in the world, so instead when some browser clients (at least Chrome and Firefox) encounter a server which throws some error with a modern ClientHello, they seamlessly fall back to older versions. I.e., something like this (the exact details of the fallback order depend on the browser):

Client -> Server: ClientHello (TLS 1.0)
Server -> Client: TCP FIN
Client -> Server: ClientHello (SSLv3)
Client -> Server: ClientKeyExchange, ChangeCipherSpec, Finished
Server -> Client: ChangeCipherSpec, Finished

It seems very likely that browsers will continue this behavior for negotiating TLS 1.1 and/or 1.2. Here's the problem: this fallback happens outside of the ordinary TLS version negotiation machinery, so it's not protected by any of the cryptographic checks designed to prevent downgrade attack. Any attacker can forge a TCP FIN or RST, thus forcing clients back to SSLv3, TLS 1.0, or whatever the lowest version they support is. The attack looks like this:

Client -> Server: ClientHello (TLS 1.0)
Attacker -> Client: TCP FIN
Client -> Server: ClientHello (SSLv3)
Client -> Server: ClientKeyExchange, ChangeCipherSpec, Finished
Server -> Client: ChangeCipherSpec, Finished

The underlying problem here is that the various extension mechanisms for TLS weren't completely tested (or in some cases, specified; extensions in particular weren't part of SSLv3), and so the browsers have to fall back on ad hoc feature/version negotiation mechanisms. Unfortunately, those mechanisms, unlike the official mechanisms, aren't secure against downgrade attack.1
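The insecure fallback loop can be sketched like this: on any connection error the client simply retries with the next-lower version, so an attacker who can forge one TCP FIN/RST per attempt walks both sides down to SSLv3 even though client and server both speak something newer. This is a toy model, not any browser's actual code:

```python
# The fallback happens outside the TLS handshake, so nothing cryptographically
# distinguishes a broken server from a forged FIN/RST.
def connect_with_fallback(attempt, versions=("TLS 1.2", "TLS 1.1", "TLS 1.0", "SSLv3")):
    for version in versions:
        try:
            return attempt(version)       # returns the negotiated version
        except ConnectionError:
            continue                      # forged FIN looks like an intolerant server
    raise ConnectionError("all versions failed")

def attacked_server(version):
    """Server speaks everything, but the attacker kills handshakes above TLS 1.0."""
    if version in ("TLS 1.2", "TLS 1.1"):
        raise ConnectionError("forged TCP FIN")
    return version

print(connect_with_fallback(attacked_server))  # -> TLS 1.0
```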

There is, however, one SSL/TLS negotiation mechanism that is extremely reliable: cipher suite negotiation. In TLS, each cipher suite is rendered as a 16-bit number: the client offers a pile of cipher suites and the server selects the one it likes. Because new cipher suites are introduced fairly regularly, and ignoring unknown suites is so easy, this mechanism has gotten a lot of testing, and it works pretty well, even through nearly all intermediaries. The result is that if you really need to have downgrade attack resistance, you need to put something in the cipher suites field. This is the idea behind the Signaling Cipher Suite Value used by the TLS Renegotiation Indication Extension [RFC 5746]. Recently, there have been several proposals that are intended to indicate TLS 1.1 and/or extension support in the cipher suite field. The idea here is to allow detection of version rollback attacks. Once you can detect version rollback, then you can use the ordinary handshake anti-tampering mechanisms to detect removal of extensions.2
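One way such a signaling cipher suite value can flag a downgrade is sketched below. The constant and version encodings are made up for illustration, not taken from any specific proposal: the client includes the marker only when it is retrying at a lower version, so an upgraded server that sees the marker alongside a downgraded version offer knows something forced the fallback.

```python
# Hypothetical fallback-signaling value carried in the cipher suite list.
FALLBACK_MARKER = 0x5600  # illustrative 16-bit code point

def server_check(server_max_version, offered_version, cipher_suites):
    """Server-side check: marker present + downgraded offer = likely attack."""
    if FALLBACK_MARKER in cipher_suites and offered_version < server_max_version:
        raise ConnectionError("fallback signal with downgraded version: likely attack")
    return offered_version

# A client falling back from TLS 1.1 (3, 2) to TLS 1.0 (3, 1) includes the
# marker; a server that itself speaks TLS 1.1 rejects the handshake.
try:
    server_check((3, 2), (3, 1), [0x002F, 0x0035, FALLBACK_MARKER])
except ConnectionError as e:
    print("rejected:", e)
```

An old server that only speaks TLS 1.0 ignores the unknown suite and accepts the connection, which is exactly why cipher suite values survive intermediaries so well.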

The bad news about these mechanisms is that they require upgrading the server to detect the new cipher suite. On the other hand, they can be incrementally deployed. (Yngve Pettersen has a client-side only proposal which leverages the RI SCSV to a similar end, but relies on the assumption that any server which does RI is modern enough to handle extensions properly).

What's the lesson here? Minimally, this kind of negotiation facility needs to be clearly specified from the start and then extensively tested (and hopefully exercised as soon as possible). Once you've got a significant installed base of noncompliant implementations, it gets very difficult to distinguish a noncompliant peer and a downgrade attack and thus problematic to refuse to connect to apparently noncompliant peers.

1 Note that this isn't always a big deal. Consider, for instance, the TLS Server Name Indication extension, which allows a server to host multiple HTTPS sites on the same IP. The attacker could force an SNI downgrade, but this will generally just cause a connection failure, which they could easily have done by forging an RST for every connection. Downgrade attacks are mostly an issue when the attacker is forcing you to a weaker security posture, rather than just breaking stuff.


April 9, 2012

The IETF RTCWEB WG has been operating on a fast track with an interim meeting between each IETF meeting. Since we needed to schedule a lot of meetings, I thought it might be instructive to try to analyze a bunch of different locations to figure out the best strategy. Here's a lightly edited version of my post to the RTCWEB WG trying to address this issue.

Note that I'm not trying to make any claims about what the best set of venues is. It's obviously easy to figure out any statistic we want about each proposed venue, but how you map that data to "best" is a much more difficult problem. The space is full of Pareto optima, and even if we ignore the troubling philosophical question of interpersonal utility comparisons, there's some tradeoff between minimal total travel time and a "fair" distribution of travel times (or at least an even distribution).

The data below is derived by treating both people and venues as airport locations and using travel time as our primary instrument.

  1. For each responder for the current Doodle poll, assign a home airport based on their draft publication history. We're missing a few people but basically it should be pretty complete. Since these people responded before the venues were known, it's at least somewhat unbiased.
  2. Compute the shortest advertised flight between each home airport and the locations for each venue by looking at the shortest advertised Kayak flights around one of the proposed interim dates (6/10 - 6/13), ignoring price, but excluding "Hacker fares". [Thanks to Martin Thomson for helping me gather these.]

This lets us compute statistics for any venue and/or combination of venues, based on the candidate attendee list.

The three proposed venues:

  • San Francisco (SFO)
  • Boston (BOS)
  • Stockholm (ARN)

Three hubs not too distant from the proposed venues:

  • London (LHR)
  • Frankfurt (FRA)
  • New York (NYC) (treating all NYC airports as the same location)

Also, Calgary (YYC), since the other two chair locations (BOS and SFO) were already proposed as venues, and I didn't want Cullen to feel left out.

Here are the results for each of the above venues, measured in total hours of travel (i.e., round trip).

Venue         Mean         Median           SD
SFO           13.5             11         12.2
BOS           12.3             11          7.5
ARN           17.0             21         10.7
FRA           14.8             17          7.3
LHR           13.3             14          7.5
NYC           11.5             11          5.8
YYC           14.9             13         10.2
SFO/BOS/ARN   14.3             13          3.6
SFO/NYC/LHR   12.7             11.3        3.7
XXX/YYY/ZZZ is a three-way rotation of XXX, YYY, and ZZZ. Obviously, mean and median are intended to be some sort of aggregate measure of travel time. I don't have any way to measure "fairness", but SD is intended as some metric of the variation in travel time between attendees.
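The aggregates in the table above can be recomputed from per-attendee round-trip hours along these lines. The sample numbers below are invented for illustration; for a rotation, each attendee's cost is their travel time averaged across the three venues, which is what drives the lower SDs:

```python
# Mean/median/SD for a single venue, and for a three-way rotation of venues.
import statistics

def venue_stats(hours):
    return (statistics.mean(hours), statistics.median(hours), statistics.stdev(hours))

def rotation_stats(*per_venue_hours):
    # Each attendee's cost over the rotation is the mean of their per-venue times.
    per_attendee = [statistics.mean(times) for times in zip(*per_venue_hours)]
    return venue_stats(per_attendee)

# Hypothetical round-trip hours for three attendees (West Coast, East Coast, Europe):
sfo = [4, 11, 22]
bos = [12, 2, 15]
arn = [23, 16, 4]
print(venue_stats(sfo))
print(rotation_stats(sfo, bos, arn))  # much lower SD: everyone travels sometimes
```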

The raw data and software are attached. The files are:

home-airports: the list of people's home airports
durations.txt: the list of airport-airport durations
doodle.txt: the attendees list
pairings: the software to compute travel times
doodle-out.txt: the computed travel times for each attendee

This was a quick hack, so there may be errors here, but nobody has pointed out any yet.

Obviously, it's hard to know what the optimal solution is without some model for optimality, but we can still make some observations based on this data:

  • If we're just concerned with minimizing total travel time, then we would always meet in New York, since it has both the shortest mean travel time and the shortest median travel time, but as I said above, this arguably isn't fair to people who live either in Europe or California, since they always have to travel.
  • Combining West Coast, East Coast, and European venues has comparable (or at least not too much worse) mean/median values than NYC with much lower SDs. So, arguably that kind of mix is more fair.
  • There's a pretty substantial difference between hub and non-hub venues. In particular, LHR has a median travel time 7 hours less than ARN, and the SFO/NYC/LHR combination has a median/mean travel time about 2 hours less than SFO/BOS/ARN (primarily accounted for by the LHR/ARN difference). [Full disclosure, I've favored Star Alliance hubs here, but you'd probably get similar results if, for instance, you used AMS instead of LHR.]
  • Obviously, your mileage may vary based on your location and feelings about what's fair, but based on this data, it looks to me like a three-way rotation between West Coast, East Coast, and European hubs offers a good compromise between minimum cost and a flat distribution of travel times.


    March 3, 2012

    Something annoying but also instructive happened during my build of Chromium today. Everything started when I checked out a clean version and went to do a build, only to be greeted with the following exciting error:
    ar: input.a is a fat file (use libtool(1) or lipo(1) and ar(1) on it)
    ar: input.a: Inappropriate file type or format
    rm: /Users/ekr/dev/chromium/src/out/Debug/\
      webkit_system_interface/geni/adjust_visibility/self/cuDbUtils.o: No such file or directory
    make: *** [out/Debug/libWebKitSystemInterfaceLeopardPrivateExtern.a] Error 1
    make: *** Waiting for unfinished jobs....

    Luckily, I've run into this problem before so I know what the problem is. The script third_party/WebKit/Source/WebCore/WebCore.gyp/mac/, which does some library mangling, uses file to determine what kind of library it's dealing with. Unfortunately, it invokes file with an unqualified name, and since MacPorts wants to put itself at the beginning of PATH, this means that you get the file implementation from MacPorts, which has slightly different output than the system file. The result is that the script decides that you have a thin version of libWebKit...a and tries to run ar on it. When ar fails, so does the build.

    The fix here is to move MacPorts below /usr/bin in your path. I'd already done this—or so I thought—but it turned out that MacPorts had inserted itself twice in .cshrc, so I had to edit .cshrc and then run source .cshrc. I did this, and after correcting a typo things looked good, and I went to rerun the build, only to be greeted with:

      CXX(target) out/Debug/
    In file included from base/
    ./base/file_util.h:416:56: error: no type named 'set' in namespace 'std'
                                                const std::set& group_gids);
    ./base/file_util.h:416:59: error: expected ')'
                                                const std::set& group_gids);
    ./base/file_util.h:413:44: note: to match this '('
    BASE_EXPORT bool VerifyPathControlledByUser(const FilePath& base,
    2 errors generated.
    make: *** [out/Debug/]

    I know what you're thinking here—or at least what I thought—someone forgot to #include <set> and for some reason the automated builds didn't catch it, perhaps due to some conditional compilation problem getting triggered on Lion. But checking the source quite clearly showed that set was being included. Moreover, other STL containers like vector worked fine. Changing from clang to GCC didn't help either, so eventually I resorted to gcc -E. For those of you who don't know, this runs the preprocessor but not the compiler and so is really useful for diagnosing this kind of include error. Here's the relevant portion of the result:

    # 18 "./base/file_util.h" 2
    # 1 "./set" 1
    # 24 "./base/file_util.h" 2

    It's a little hard to read, but if you know what to look for, it's telling you that instead of including set from /Developer, where the system include files live, the compiler is getting it from the local directory. Now, you might ask what the heck a file named set is doing in the local directory, especially as when I looked it was totally empty. Naturally, it was my fault, but it took a minute to realize what. Remember I said that I had to correct a typo in .cshrc but not what the typo was. Well, the problem was that I had written:

    >set OSVER=`uname -r`

    instead of

    set OSVER=`uname -r`

    Of course, when I ran this it created a file called set in the current directory, and since the compile flags included the current directory in the include path, the compiler duly included it instead of the system include file. And since the file was empty, there wasn't any definition of std::set and we got a compile error. Time wasted by this error: 11 minutes (not including writing this up).
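    For the curious, the typo is easy to reproduce. A leading ">" is parsed as an output redirection, so the shell creates an empty file named "set" in the current directory instead of setting a variable (the original was csh, but POSIX shells behave the same way):

```shell
# Reproduce the typo in a scratch directory.
cd "$(mktemp -d)"
>set OSVER=`uname -r`
ls -l set        # an empty file "set" now exists, ready to shadow <set>
```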


    February 25, 2012

    Disclaimer: I am not a car guy. Read the following with that in mind.

    As long-time EG readers will know, I've complained in the past that my Prius has a feeble starter/electronics battery which is easy to run down even by leaving the interior lights on. This despite the fact that the Prius has a huge battery running the hybrid system to draw on. But I certainly wouldn't want this: Michael DeGusta reports that if you leave your Tesla parked for a long time (like months), then the car bleeds enough power off of the battery running the auxiliary vehicle systems [parasitic load] to drain it down into deep discharge (and hence battery damage) territory:

    A Tesla Roadster that is simply parked without being plugged in will eventually become a "brick". The parasitic load from the car's always-on subsystems continually drains the battery and if the battery's charge is ever totally depleted, it is essentially destroyed. Complete discharge can happen even when the car is plugged in if it isn't receiving sufficient current to charge, which can be caused by something as simple as using an extension cord. After battery death, the car is completely inoperable. At least in the case of the Tesla Roadster, it's not even possible to enable tow mode, meaning the wheels will not turn and the vehicle cannot be pushed nor transported to a repair facility by traditional means.

    The amount of time it takes an unplugged Tesla to die varies. Tesla's Roadster Owners Manual [Full Zipped PDF] states that the battery should take approximately 11 weeks of inactivity to completely discharge [Page 5-2, Column 3: PDF]. However, that is from a full 100% charge. If the car has been driven first, say to be parked at an airport for a long trip, that time can be substantially reduced. If the car is driven to nearly its maximum range and then left unplugged, it could potentially "brick" in about one week. Many other scenarios are possible: for example, the car becomes unplugged by accident, or is unwittingly plugged into an extension cord that is defective or too long.

    When a Tesla battery does reach total discharge, it cannot be recovered and must be entirely replaced. Unlike a normal car battery, the best-case replacement cost of the Tesla battery is currently at least $32,000, not including labor and taxes that can add thousands more to the cost.

    There's been a lot of controversy about this report (see, for instance, this defense), but Tesla's response seems to be consistent with DeGusta's basic argument, as does the letter that Jalopnik reproduces above:

    All automobiles require some level of owner care. For example, combustion vehicles require regular oil changes or the engine will be destroyed. Electric vehicles should be plugged in and charging when not in use for maximum performance. All batteries are subject to damage if the charge is kept at zero for long periods of time. However, Tesla avoids this problem in virtually all instances with numerous counter-measures. Tesla batteries can remain unplugged for weeks (even months), without reaching zero state of charge. Owners of Roadster 2.0 and all subsequent Tesla products can request that their vehicle alert Tesla if SOC falls to a low level. All Tesla vehicles emit various visual and audible warnings if the battery pack falls below 5 percent SOC. Tesla provides extensive maintenance recommendations as part of the customer experience.

    At present, then, the agreed upon facts seem to be that:

    1. If you leave the Tesla's batteries at zero charge, battery damage occurs.
    2. If you leave a Tesla unplugged for long enough, even with a charged battery, parasitic load from the vehicle systems will eventually consume the battery's charge, leaving you in state (1) above. [Note that this appears to exceed the Lithium-Ion self-discharge rate, so it likely is parasitic load.]

    The controversy really seems to be about whose fault this is, namely whether the customer should have known better, whether Tesla notified them correctly, etc. I don't have a Tesla so I don't care about that. I'm much more interested in the engineering question of what's going on and what, if anything, can be done about it.

    The parasitic load thing isn't totally unfamiliar territory, of course. Any modern vehicle has electronics and those need power, which they get from the battery. Some do a better job than others. My BMW R1200GS motorcycle, for instance, has this problem and the manual explicitly tells you to connect it to a trickle charger (an expensive BMW model, of course, though you can use a standard one if you're willing to do a tiny bit of work) if you're not going to drive it for a while, and I duly plug it into the wall whenever I get home. If you don't do that, however, the worst you're going to be out is a new lead-acid battery, which depending on what vehicle you have, leaves you out something like $50-$200, not $40,000.

    However, the level of load we're talking about here seems awfully high. Remember that we're talking about a battery capable of powering your car for 200 miles or so on a single charge (53 kWh). In order to deplete the battery in 11 weeks (~2000 hrs) you would need continuous battery consumption of around 30 W. For comparison, a Macbook Air has a 50Wh battery and gets something like 5 hours on a charge, so it's like the Tesla is running three Airs at once 24x7. It's natural to ask where all that power is going, since you don't need anywhere near that much to keep a vehicle on standby. One likely source seems to be the battery cooling system, of which Wikipedia says "Coolant is pumped continuously through the ESS both when the car is running and when the car is turned off if the pack retains more than a 90% charge. The coolant pump draws 146 watts." [Original reference and long discussion here. Note that this post is due to Martin Eberhard, one of the Tesla founders but apparently no longer with the company at the time he wrote it. Thanks Wayback Machine for preserving this!].
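    The back-of-the-envelope arithmetic is simple enough to check:

```python
# A ~53 kWh pack fully depleted over 11 weeks implies roughly 30 W of
# continuous parasitic load.
pack_wh = 53_000              # Roadster pack capacity, in watt-hours
hours = 11 * 7 * 24           # 11 weeks ~= 1848 hours
drain_w = pack_wh / hours
print(f"{drain_w:.1f} W")     # ~= 28.7 W
```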

    Obviously, if you have a load this high, then you're going to deplete the battery. The question then becomes whether there is some way of avoiding permanent battery damage as the depletion gets to dangerous levels. The natural thing to do is install some sort of cutoff that turns off all power drain once you get close to that level. This may end up blowing away a bunch of the car's configuration (though really, it's not that hard to store that stuff in flash memory, even though historically manufacturers have tended not to), but surely it's cheaper to reboot your car than replace the entire battery pack. However, if the power is going to the cooling system and the cooling system is doing something important, like keeping the battery from being damaged by excessive heat, then this may not help.

    Oh, one more thing. DeGusta claims that Tesla has the capability to remotely monitor the battery and locate the car, and has sent people out to fix it:

    In at least one case, Tesla went even further. The Tesla service manager admitted that, unable to contact an owner by phone, Tesla remotely activated a dying vehicle's GPS to determine its location and then dispatched Tesla staff to go there. It is not clear if Tesla had obtained this owner's consent to allow this tracking, or if the owner is even aware that his vehicle had been tracked. Further, the service manager acknowledged that this use of tracking was not something they generally tell customers about.

    If true, that would be... interesting.


    February 11, 2012

    Cryptography is great, but it's not so great if you get arrested and forced to give up your cryptographic keys. Obviously, you could claim that you've forgotten it (remember that you need a really long key to thwart exhaustive search attacks, so this isn't entirely implausible.) However, since you also need to regularly be able to decrypt your data, this means you need to be able to remember your password, so it's not entirely plausible either, which means that you might end up sitting in jail for a long time due to a contempt citation. This general problem has been floating around the cryptographic community for a long time, where it's usually referred to as "rubber hose cryptanalysis", with the idea being that the attacker will torture you (i.e., beat you with a rubber hose) until you give up the key. This xkcd comic sums up the problem. Being technical people, there's been a lot of work on technical solutions, none of which are really fantastic. (see the Wikipedia deniable encryption page for one summary).

    Threat model
    As usual, it's important to think about the threat model, which in this case is more complicated than it initially seems. We assume that you have some encrypted data and that the attacker has a copy of that data and of the encryption software you have used. All they lack is the key. The attacker insists you hand over the key and has some mechanism for punishing you if you don't comply. Moreover, we need to assume that the attacker isn't a sadist, so as long as there's no point in punishing you further they won't. It's this last point that is the key to all the technical approaches I know of, namely convincing the attacker that they are unlikely to learn anything more by punishing you further, so they might as well stop. Of course, how true that assumption is probably depends on the precise nature of the proceedings and how much it costs the attacker to keep inflicting punishment on you. If you're being waterboarded in Guantanamo, the cost is probably pretty low, so you probably need to be pretty convincing.

    Technical Approaches
    Roughly speaking, there seem to be two strategies for dealing with the threat of being legally obliged to give up your cryptographic keys:

    • Apparent Compliance/Deniable Encryption
    • Verifiable Destruction

    Apparent Compliance/Deniable Encryption
    The idea behind an apparent compliance strategy is that you pretend to give up your encryption key, but instead you give up another key that decrypts the message to an innocuous plaintext. More generally, you want a cryptographic scheme which produces a given ciphertext C which maps onto a series of plaintexts M_1, M_2, ... M_n via a set of keys K_1, K_2, ... K_n. Assume for the moment that only M_n is sensitive and that M_1, ... M_(n-1) are either fake or real (but convincing) non-sensitive data. So, when you are captured, you reveal K_1 and claim that you've decrypted the data. If really pressed, you reveal K_2, and so on.

    The reason that this is supposed to work is that the attacker is assumed to not know n. However, since they have a copy of your software, they presumably know that it's multilevel capable, so they know that there may be more than one key. They just don't know if you've given them the last key. All the difficult cryptographic problems are about avoiding revealing n. There are fancy cryptographic ways to do this (the original paper on this is by Canetti, Dwork, Naor, and Ostrovsky), but consider one simple construction. Take each message M_i and encrypt it with K_i to form C_i and then concatenate all the results to form C. The decryption procedure given a single key is to decrypt each of the sub-ciphertexts in turn and discard any which don't decrypt correctly (assume there is some simple integrity check.) Obviously, if you have a scheme this trivial, then it's easy for an attacker to see how many keys there are just by insisting you provide keys for all the data, so you also pad C with a bunch of random-appearing data which you really can't decrypt at all, which in theory creates plausible deniability. This is approximately what TrueCrypt does:

    Until decrypted, a TrueCrypt partition/device appears to consist of nothing more than random data (it does not contain any kind of "signature"). Therefore, it should be impossible to prove that a partition or a device is a TrueCrypt volume or that it has been encrypted (provided that the security requirements and precautions listed in the chapter Security Requirements and Precautions are followed). A possible plausible explanation for the existence of a partition/device containing solely random data is that you have wiped (securely erased) the content of the partition/device using one of the tools that erase data by overwriting it with random data (in fact, TrueCrypt can be used to securely erase a partition/device too, by creating an empty encrypted partition/device-hosted volume within it).

    How well this works goes back to your threat model. The attacker knows there is some chance that you haven't revealed all the keys and maybe if they punish you further you will give them up. So, whether you continue to get punished depends on their cost/benefit calculations, which may be fairly unfavorable to you. The problem is worse yet if the attacker has any way of determining what correct data looks like. For instance, in one of the early US court cases on this, In re Boucher, customs agents had seen (or at least claimed to have seen) child pornography on the defendant's hard drive and so would presumably have known a valid decryption from an invalid one. Basically, in any setting where the attacker has a good idea of what they are looking for and/or can check the correctness of what you give them, a deniable encryption scheme doesn't work very well, since the whole scheme relies on uncertainty about when you have actually given up the last key.
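    The simple concatenation construction described above can be sketched in code. This is an illustrative toy using a hash-based stream cipher and HMAC, NOT production cryptography: each message is sealed under its own key, the sealed records are concatenated, and random padding records hide how many real keys exist.

```python
# Toy multi-key "deniable" container: decryption with a key returns whichever
# record's integrity check passes under that key, or None.
import hashlib
import hmac
import os
import struct

def _stream(key: bytes, nonce: bytes, n: int) -> bytes:
    """Deterministic keystream derived from key + nonce (toy stream cipher)."""
    out, ctr = b"", 0
    while len(out) < n:
        out += hashlib.sha256(key + nonce + struct.pack(">I", ctr)).digest()
        ctr += 1
    return out[:n]

def _seal(key: bytes, msg: bytes) -> bytes:
    nonce = os.urandom(16)
    ct = bytes(a ^ b for a, b in zip(msg, _stream(key, nonce, len(msg))))
    mac = hmac.new(key, nonce + ct, hashlib.sha256).digest()
    return nonce + ct + mac

def deniable_encrypt(pairs, padding_records=2):
    """pairs: list of (key, message). Appends undecryptable random records."""
    records = [_seal(k, m) for k, m in pairs]
    records += [os.urandom(16 + 24 + 32) for _ in range(padding_records)]
    return b"".join(struct.pack(">I", len(r)) + r for r in records)

def deniable_decrypt(blob: bytes, key: bytes):
    """Return whichever message this key unlocks, or None."""
    off = 0
    while off < len(blob):
        (n,) = struct.unpack_from(">I", blob, off)
        rec = blob[off + 4 : off + 4 + n]
        off += 4 + n
        nonce, ct, mac = rec[:16], rec[16:-32], rec[-32:]
        if hmac.compare_digest(mac, hmac.new(key, nonce + ct, hashlib.sha256).digest()):
            return bytes(a ^ b for a, b in zip(ct, _stream(key, nonce, len(ct))))
    return None
```

    Note that even in this toy, the attacker who holds the software knows the container format supports multiple records plus padding, which is exactly the residual uncertainty the scheme depends on.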

    Verifiable Destruction
    An alternative approach that doesn't rely on this kind of ambiguity is to be genuinely unable to decrypt the data and to have some way of demonstrating this to the attacker. Hopefully, a rational attacker won't continue to punish you once you've demonstrated that you cannot comply. It's the demonstrating part that's the real problem here. Kahn and Schelling famously sum up the problem of how to win at "chicken":

    Some teenagers utilize interesting tactics in playing "chicken." The "skillful" player may get into the car quite drunk, throwing whiskey bottles out the window to make it clear to everybody just how drunk he is. He wears dark glasses so that it is obvious that he cannot see much, if anything. As soon as the car reaches high speed, he takes the steering wheel and throws it out the window. If his opponent is watching, he has won. If his opponent is not watching, he has a problem;

    Of course, as Allan Schiffman once pointed out to me, the really skillful player keeps a spare steering wheel in his car and throws that out the window. And our problem is similar: demonstrating that you have thrown out the data and/or key and that you don't have a spare lying around somewhere.

    The technical problem then becomes constructing a system that actually works. There are a huge variety of potential technical options here, but at a high level, solutions seem to fall into two broad classes, active and passive. In an active scheme, you actively destroy the key and/or the data. For instance, you could have the key written on a piece of paper which you eat, or a thermite charge on your computer which melts it to slag when you press a button. In a passive system, by contrast, no explicit action is required by you, but you have some sort of deadman switch which causes the key/data to be destroyed if you're captured. So, you might store the data in a system like Vanish (although there are real questions about the security of Vanish per se), or you have the key stored offsite with some provider who promises to delete the key if you are arrested or if you don't check in every so often.

    I'm skeptical of how well active schemes can be made to work: once it becomes widely known how any given commercial scheme works, attackers will take steps to circumvent it. For instance, if there is some button you press to destroy your data, they might taser you and ask questions later to avoid you pressing it. Maybe someone can convince me otherwise, but this leaves us mostly with passive schemes (or semi-passive schemes as discussed in a bit.) Consider the following strawman scheme:

    Your data is encrypted in the usual way, but part of the encryption key is stored offsite in some location inaccessible to the attacker (potentially outside their legal jurisdiction if we're talking about a nation-state type attacker). The encryption key is stored in a hardware security module, and if the key storage provider doesn't hear from you (and you have to prove possession of some key) every week (or two weeks or whatever), they zeroize the HSM, thus destroying your key. It's obviously easy to build a system like this where the encryption software automatically contacts the key storage provider, proves possession, and thus resets their deadman timer, so as long as you use your files every week or so, you're fine.
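    A toy sketch of that strawman in Python (the class name, the HMAC-based check-in proof, and the explicit clock parameter are all my own illustrative assumptions, not a description of any real product):

```python
import hmac
import hashlib

class DeadmanKeyStore:
    """Toy sketch of an offsite key-escrow provider with a deadman timer."""

    def __init__(self, key_share, auth_secret, interval, now):
        self._key_share = key_share      # the client's partial encryption key
        self._auth_secret = auth_secret  # shared secret used to prove possession
        self._interval = interval        # deadman window, in seconds
        self._last_checkin = now

    def _valid_proof(self, proof):
        expected = hmac.new(self._auth_secret, b"checkin", hashlib.sha256).digest()
        return hmac.compare_digest(proof, expected)

    def _expire_if_due(self, now):
        # "Zeroize the HSM" if the client has gone quiet for too long.
        if now - self._last_checkin > self._interval:
            self._key_share = None

    def checkin(self, proof, now):
        """Reset the timer; the encryption software calls this automatically."""
        self._expire_if_due(now)
        if self._key_share is not None and self._valid_proof(proof):
            self._last_checkin = now
            return True
        return False

    def retrieve(self, proof, now):
        """Return the key share, or None if it has been destroyed."""
        self._expire_if_due(now)
        if self._key_share is not None and self._valid_proof(proof):
            return self._key_share
        return None
```

    Once the window lapses without a valid check-in, the share is gone for good, which is exactly the property the scheme is after (and exactly its brittleness).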

    So, if you're captured, you just need to hold out until the deadman timer expires and then the data really isn't recoverable by you or anyone else. Of course, "not recoverable" isn't the same as "provably not recoverable", since you could have kept a backup copy of the keys somewhere—though the software could be designed in a way that this was inconvenient, thus giving some credibility to the argument that you did not. Moreover, this design is premised on the assumption that there is actually somewhere that you could store your secret data that the attacker couldn't get it from. This may be reasonable if the attacker is the local police, but perhaps less so if the attacker is the US government. And of course any deadman system is hugely brittle: if you forget your key or just don't refresh for a while, your data is gone, which might be somewhat inconvenient.

    One thing that people often suggest is to have some sort of limited-try scheme. The idea here is that the encryption system automatically erases the data (and/or a master key) if the wrong password/key is entered enough times. So, if you can just convincingly lie N times and get the attacker to try those keys, then the data is gone. Alternately, you could have a "coercion" key which deletes all the data. It's clear that you can't build anything like this in a software-only system: the attacker will just image the underlying encrypted data and write their own decryption software which doesn't have the destructive feature. You can, however, build such a system using hardware security modules (assume for now that the HSM can't be broken directly.) This is sort of a semi-passive scheme in that you are intentionally destroying the data, but the destruction is produced by the attacker keying in the alleged encryption key.
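    Granting the (strong) assumption of an unbreakable HSM, the limited-try and coercion-key behaviors might be modeled like this; all the names and limits here are illustrative, not taken from any real device:

```python
class LimitedTryHSM:
    """Toy model of an HSM that wipes its wrapped key after too many wrong
    passwords, or immediately when a designated coercion password is entered."""

    def __init__(self, password, coercion_password, key, max_tries=3):
        self._password = password
        self._coercion = coercion_password
        self._key = key
        self._max_tries = max_tries
        self._tries_left = max_tries

    def unwrap(self, password):
        """Return the key on the right password; None otherwise."""
        if self._key is None:
            return None                  # already destroyed
        if password == self._coercion:
            self._key = None             # coercion key: destroy immediately
            return None
        if password == self._password:
            self._tries_left = self._max_tries
            return self._key
        self._tries_left -= 1
        if self._tries_left <= 0:
            self._key = None             # limit reached: destroy the key
        return None
```

    The point of pushing this into hardware is that an attacker who images the ciphertext can't just reimplement decryption without the destructive feature; the try counter lives inside the tamper-resistant boundary.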

    The big drawback with any verifiable destruction system is that it leaves evidence that you could have complied but didn't; in fact, that's the whole point of the system. But this means that the attacker's countermove is to credibly commit to punishing you for noncompliance after the fact. I don't think this question has ever been faced for crypto, but it has been faced in other evidence-gathering contexts. Consider, for instance, the case of driving under the influence: California requires you to take a breathalyzer or blood test as a condition of driving [*], and refusal carries penalties comparable to those for being convicted of DUI. One could imagine a more general legal regime in which actively or passively allowing your encrypted data to be destroyed once you have been arrested was itself illegal, and with a penalty that was large enough that it would almost never be worth refusing to comply (obviously the situation would be different in extra-legal settings, but the general idea seems transferable.) I'll defer to any lawyers reading this about how practical such a law would actually be.

    Bottom Line
    Obviously, neither of these classes of solution seems entirely satisfactory from the perspective of someone who is trying to keep their data secret. On the other hand, it's not clear that this is really a problem that admits of a good technical solution.


    January 23, 2012

    You have to have used git to really understand this one, but...
    [16] git checkout f4a56
    Note: checking out 'f4a56'.
    You are in 'detached HEAD' state. You can look around, make experimental
    changes and commit them, and you can discard any commits you make in this
    state without impacting any branches by performing another checkout.
    If you want to create a new branch to retain commits you create, you may
    do so (now or later) by using -b with the checkout command again. Example:
      git checkout -b new_branch_name
    HEAD is now at f4a560b... Foo
    As you may have gathered from this long warning, you most likely don't want to be in a detached HEAD state; you probably just meant to create a branch or wanted to roll back a commit but typed the wrong thing. That's why there are lots of pages about what this means and how to get yourself out. My contribution to this literature can be found below the fold.
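    For the record, here's a minimal self-contained session (scratch repository, made-up commit messages) that reproduces the trap and then applies the escape hatch the warning itself suggests:

```shell
set -e
scratch=$(mktemp -d) && cd "$scratch"
git init -q repo && cd repo
git -c user.name=me -c user.email=me@example.com commit -q --allow-empty -m "first"
git -c user.name=me -c user.email=me@example.com commit -q --allow-empty -m "second"

git checkout -q HEAD~1   # oops: detached HEAD
git -c user.name=me -c user.email=me@example.com commit -q --allow-empty -m "experiment"

# The fix: put the detached commit on a real branch before it gets lost
git checkout -q -b rescue
git branch --show-current   # prints "rescue"
```

    If you made no commits while detached, a plain `git checkout <branch>` gets you home with nothing to rescue.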


    January 22, 2012

    On my way to Red Rock today to do some work, I looked in my wallet to see if I had enough money to afford my hot chocolate (paying for a $3.50 drink with a credit card is a pretty lame move). Here's what I found:

    After some sorting, it comes out as follows...

    Currency   Count   Value (nominal)   Value (USD)
    USD            3                 3          3.00
    CAD            7               100         98.55
    CZK            2              2100        106.40
    GBP            1                10         15.55
    EUR            1                20         25.79
    INR            1               100          1.99
    RUB            9              1570         49.97
    Total         24                 -        301.25

    In other words, out of 24 total pieces of paper valued at over $300, I had three spendable pieces of paper valued at $3. Oh, and a couple of United beverage vouchers which expire in 9 days. I ended up going to the ATM.


    January 21, 2012

    You've of course heard by now that much of the Internet community thinks that SOPA and PIPA are bad, which is why on January 16, Wikipedia shut itself down, Google had a black bar over their logo, etc. This opinion is shared by much of the Internet technical community, and in particular much has been made of the argument made by Crocker et al. that DNSSEC and PIPA are incompatible. A number of the authors of the statement linked above are friends of mine, and I agree with much of what they write in it, but I don't find this particular line of argument that convincing.

    As background, the DNS has two kinds of servers:

    • Authoritative servers, which host the records for a given domain.
    • Recursive resolvers, which are used by end-users for name mapping. Typically they also serve as a cache.

    A typical configuration is for end-user machines to use DHCP to get their network configuration data, including IP address and the DNS recursive resolvers to use. Whenever your machine joins a new network, it gets whatever resolver that network is configured for, which is frequently whatever resolver is provided by your ISP. One of the requirements of some iterations of PIPA and SOPA has been that recursive resolvers would have to block resolution of domains designated as bad. Here's the relevant text from PIPA:

    (i) IN GENERAL- An operator of a nonauthoritative domain name system server shall take the least burdensome technically feasible and reasonable measures designed to prevent the domain name described in the order from resolving to that domain name's Internet protocol address, except that--
    (I) such operator shall not be required--
    (aa) other than as directed under this subparagraph, to modify its network, software, systems, or facilities;
    (bb) to take any measures with respect to domain name lookups not performed by its own domain name server or domain name system servers located outside the United States; or
    (cc) to continue to prevent access to a domain name to which access has been effectively disabled by other means; and ...
    (ii) TEXT OF NOTICE.-The Attorney General shall prescribe the text of the notice displayed to users or customers of an operator taking an action pursuant to this subparagraph. Such text shall specify that the action is being taken pursuant to a court order obtained by the Attorney General.

    This text has been widely interpreted as requiring operators of recursive resolvers to do one of two things:

    • Simply cause the name resolution operation to fail.
    • Redirect the name resolution to the notice specified in (ii).

    The question then becomes how one might implement these.

    Technical Implementation Mechanisms
    Obviously if you can redirect the name, you can cause the resolution to fail by returning a bogus address, so let's look at the redirection case first. Crocker et al. argue that DNSSEC is designed to secure DNS data end-to-end to the user's computer. Thus, any element in the middle which modifies the DNS records to redirect traffic to a specific location will break the signature. Technically, this is absolutely correct. However, it is mitigated by two considerations.

    First, the vast majority of client software doesn't do DNSSEC resolution. Instead, if you're resolving some DNSSEC-signed name and the signature is being validated at all, it's most likely being validated by some DNSSEC-aware recursive resolver, like the ones Comcast has recently deployed. Such a resolver can easily modify whatever results it is returning and that change will be undetectable to the vast majority of client software (i.e., to any non-DNSSEC software).[1] So, at present, a rewriting requirement looks pretty plausible.

    Crocker et al. would no doubt tell you that this is a transitional stage and that eventually we'll have end-to-end DNSSEC, so it's a mistake to legislate new requirements that are incompatible with that. If a lot of endpoints start doing DNSSEC validation, then ISPs can't rewrite undetectably. They can still make names fail to resolve, though, via a variety of mechanisms. About this, Crocker et al. write:

    Even DNS filtering that did not contemplate redirection would pose security challenges. The only possible DNSSEC-compliant response to a query for a domain that has been ordered to be filtered is for the lookup to fail. It cannot provide a false response pointing to another resource or indicate that the domain does not exist. From an operational standpoint, a resolution failure from a nameserver subject to a court order and from a hacked nameserver would be indistinguishable. Users running secure applications have a need to distinguish between policy-based failures and failures caused, for example, by the presence of an attack or a hostile network, or else downgrade attacks would likely be prolific.[12]


    12. If two or more levels of security exist in a system, an attacker will have the ability to force a "downgrade" move from a more secure system function or capability to a less secure function by making it appear as though some party in the transaction doesn't support the higher level of security. Forcing failure of DNSSEC requests is one way to effect this exploit, if the attacked system will then accept forged insecure DNS responses. To prevent downgrade attempts, systems must be able to distinguish between legitimate failure and malicious failure.

    I sort of agree with the first part of this, but I don't really agree with the footnote. Much of the problem is that it's generally easy for network-based attackers to generate situations that simulate legitimate errors and/or misconfiguration. Cryptographic authentication actually makes this worse, since there are so many ways to screw up cryptographic protocols. Consider the case where the attacker overwrites the response with a random signature. Naturally the signature is unverifiable, in which case the resolver's only response is to reject the records, as prescribed by the DNSSEC standards. At this point you have effectively blocked resolution of the name. It's true that the resolver knows that something is wrong (though it can't distinguish between attack and misconfiguration), but so what? DNSSEC isn't designed to allow name resolution in the face of DoS attack by in-band active attackers. Recursive resolvers aren't precisely in-band, of course, but the ISP as a whole is in-band, which is one reason people have talked about ISP-level DNS filtering for all traffic, not just filtering at recursive resolvers.

    Note that I'm not trying to say here that I think that SOPA and PIPA are good ideas, or that there aren't plenty of techniques for people to use to evade them. I just don't think that it's really the case that you can't simultaneously have DNSSEC and network-based DNS filtering.


    1. Technical note: As I understand it, a client resolver that wants to validate signatures itself needs to send the DO flag (to get the recursive resolver to return the DNSSEC records) and the CD flag (to suppress validation by the recursive resolver). This means that the recursive resolver can tell when it's safe to rewrite the response without being detected. If DO isn't set, then the client won't be checking signatures. If CD isn't set, then the recursive resolver can claim that the name was unvalidatable and generate whatever error it would have generated in that case (Comcast's deployment seems to generate SERVFAIL for at least some types of misconfiguration.)
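    To make the flag mechanics concrete, here's a sketch that builds such a query by hand from the wire format in RFC 1035 and RFC 4035 (the helper name and the UDP payload-size default are my own choices):

```python
import struct

DO = 0x8000  # DNSSEC OK: top bit of the EDNS0 flags field (RFC 4035)
CD = 0x0010  # Checking Disabled: bit in the DNS header flags (RFC 4035)

def build_query(name, qtype=1, query_id=0x1234):
    """Build a DNS query packet with RD and CD set, plus an EDNS0 OPT
    record carrying the DO bit."""
    header = struct.pack("!HHHHHH",
                         query_id,
                         0x0100 | CD,  # RD (recursion desired) + CD
                         1, 0, 0, 1)  # one question, one additional (the OPT RR)
    qname = b"".join(
        struct.pack("B", len(label)) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", qtype, 1)  # QTYPE, QCLASS=IN
    # EDNS0 OPT pseudo-RR: root name, TYPE=41, CLASS=UDP payload size,
    # TTL field = extended-rcode/version/flags with DO set, empty RDATA.
    opt = b"\x00" + struct.pack("!HHIH", 41, 4096, DO, 0)
    return header + question + opt
```

    A recursive resolver that sees both bits knows the client intends to check signatures itself; absent either one, it can rewrite or strip records with little risk of detection.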