Initial notes on Microsoft's CU-RTC-Web Proposal

| Comments (2) | Networking
EXECUTIVE SUMMARY
Yesterday, Microsoft published their CU-RTC-Web WebRTC API proposal as an alternative to the existing W3C WebRTC API being implemented in Chrome and Firefox. Microsoft's proposal is a "low-level API" proposal which basically exposes a bunch of media- and transport-level primitives to the JavaScript Web application, which is expected to stitch them together into a complete calling system. By contrast to the current "mid-level" API, the Microsoft API moves a lot of complexity from the browser to the JavaScript but the authors argue that this makes it more powerful and flexible. I don't find these arguments that convincing, however: a lot of them seem fairly abstract and rhetorical and when we get down to concrete use cases, the examples Microsoft gives seem like things that could easily be done within the existing framework. So, while it's clear that the Microsoft proposal is a lot more work for the application developer; it's a lot less clear that it's sufficiently more powerful to justify that additional complexity.

Microsoft's arguments for the superiority of this API fall into three major categories:

  • JSEP doesn't match with "key Web tenets"; i.e., it doesn't match the Web/HTML5 style.
  • It allows the development of applications that would otherwise be difficult to develop with the existing W3C API.
  • It will be easier to make it interoperate with existing VoIP endpoints.

Like any all-new design, this API has the significant advantage (which the authors don't mention) of architectural cleanliness. The existing API is a compromise between a number of different architectural notions and like any hybrid proposals has points of ugliness where those proposals come into contact with each other (especially in the area of SDP.) However, when we actually look at functionality rather than elegance, the advantages of an all-new design---not only one which is largely not based on preexisting technologies but one which involves discarding most of the existing work on WebRTC itself---start to look fairly thin.

Looking at the three claims listed above: the first seems more rhetorical than factual. It's certainly true that in the early years of the Web designers strove to keep state out of the Web browser, but that hasn't been the case with rich Web applications for quite some time. To the contrary, many modern HTML5 technologies (localstore, WebSockets, HSTS, WebGL) are about pushing state onto the browser from the server.

The interoperability argument is similarly weakly supported. Given that JSEP is based on existing VoIP technologies, it seems likely that it is easier to make it interoperate with existing endpoints since it's not first necessary to implement those technologies (principally SDP) in JavaScript before you can even try to interoperate. The idea here seems to be that it will be easier to accomodate existing noncompliant endpoints if you can adapt your Web application on the fly, but given the significant entry barrier to interoperating at all, this seems like an argument that needs rather more support than MS has currently offered.

Finally, with regard to the question of the flexibility/JavaScript complexity tradeoff, it's somewhat distressing that the specific applications that Microsoft cites (baby monitoring, security cameras, etc.) are so pedestrian and easily handled by JSEP. This isn't of course to say that there aren't applications which we can't currently envision which JSEP would handle badly, but it rather undercuts this argument if the only examples you cite in support of a new design are those which are easily handled by the old one.

None of this is to say that CU-RTC-Web wouldn't be better in some respects than JSEP. Obviously, any design has tradeoffs and as I said above, it's always appealing to throw all that annoying legacy stuff away and start fresh. However, that also comes with a lot of costs and before we consider that we really need to have a far better picture of what benefits other than elegance starting over would bring to the table.

BACKGROUND
More or less everyone agrees about the basic objectives of the WebRTC effort: to bring real-time communications (i.e., audio, video, and direct data) to browsers. Specifically, the idea is that Web applications should be able to use these capabilities directly. This sort of functionality was of course already available either via generic plugins such as Flash or via specific plugins such as Google Talk, but the idea here was to have a standardized API that was built into browsers.

In spite of this agreement about objectives, from the beginning there was debate about the style of API that was appropriate, and in particular how much of the complexity should be in the browser and how much in the JavaScript The initial proposals broke down into two main flavors:

  • High-level APIs — essentially a softphone in the browser. The Web application would request the creation of a call (perhaps with some settings as to what kinds of media it wanted) and then each browser would emit standardized signaling messages which the Web application would arrange to transit to the other browser. The original WHATWG HTML5/PeerConnection spec was of this type.
  • Low-level APIs — an API which exposed a bunch of primitive media and transport capabilities to the JavaScript. A browser that implemented this sort of API couldn't really do much by itself. Instead, you would need to write something like a softphone in JavaScript, including implementing the media negotiation, all the signaling state machinery, etc. Matthew Kaufman from Microsoft was one of the primary proponents of this design.

After a lot of debate, the WG ultimately rejected both of these and settled on a protocol called JavaScript Session Establishment Protocol (JSEP), which is probably best described as a mid-level API. That design, embodied in the current specifications [ http://tools.ietf.org/html/draft-ietf-rtcweb-jsep-01 http://dev.w3.org/2011/webrtc/editor/webrtc.html], keeps the transport establishment and media negotiation in the browser but moves a fair amount of the session establishment state machine into the JavaScript. While it doesn't standardize signaling, it also has a natural mapping to a simple signaling protocol as well as to SIP and Jingle, the two dominant standardized calling protocols. The idea is supposed to be that it's simple to write a basic application (indeed, a large number of such simple demonstration apps have been written) but that it's also possible to exercise advanced features by manipulating the various data structures emitted by the browser. This is obviously something of a compromise between the first two classes of proposals.

The decision to follow this trajectory was made somewhere around six months ago and at this point Google has a fairly mature JSEP implementation available in Chrome Canary while Mozilla has a less mature implementation which you could compile yourself but hasn't been released in any public build.

Yesterday, Microsoft made a new proposal, called CU-RTC-Web. See the blog post and the specification.

Below is an initial, high-level analysis of this proposal.

Disclaimer: I have been heavily involved with both the IETF and W3C working groups in this area and have contributed significant chunks of code to both the Chrome and Firefox implementations. I am also currently consulting for Mozilla on their implementation. However, the comments here are my own and don't necessarily represent those of any other organization.

WHAT IS MICROSOFT PROPOSING?
What Microsoft is proposing is effectively a straight low level API.

There are a lot of different API points, and I don't plan to discuss the API in much detail, but it's helpful to talk about the API some to get a flavor of what's required to use it.

  • RealTimeMediaStream -- each RealTimeMediaStream represents a single flow of media (i.e., audio or video).
  • RealTimeMediaDescription -- a set of parameters for the RealTimeMediaStream.
  • RealTimeTransport -- a transport channel which a RealTimeMediaStream can run over.
  • RealTimePort -- a transport endpoint which can be paired with a RealTimePort on the other side to form a RealTimeTransport.

In order to set up an audio, video, or audio-video session, then, the JS has to do something like the following:

  1. Acquire local media streams on each browser via the getUserMedia() API, thus getting some set of MediaStreamTracks.
  2. Create RealTimePorts on each browser for all the local network addresses as well as for whatever media relays are available/ required.
  3. Communicate the coordinates for the RealTimePorts from each browser to the other.
  4. On each browser, run ICE connectivity checks for all combinations of remote and local RealTimePorts.
  5. Select a subset of the working remote/local RealTimePort pairs and establish RealTimeTransports based on those pairs. (This might be one or might be more than one depending on the number of media flows, level of multiplexing, and the level of redundancy required).
  6. Determine a common set of media capabilities and codecs between each browser, select a specific set of media parameters, and create matching RealTimeMediaDescriptions on each browser based on those parameters.
  7. Create RealTimeMediaStreams by combining RealTimeTransports, RealTimeMediaDescriptions, and MediaStreamTracks.
  8. Attach the remote RealTimeMediaStreams to some local display method (such as an audio or video tag).

For comparison, in JSEP you would do something like:

  1. Acquire local media streams on each browser via the getUserMedia() API, thus getting some set of MediaStreamTracks.
  2. Create a PeerConnection() and call AddStream() for each of the local streams.
  3. Create an offer on one brower send it to the other side, create an answer on the other side and send it back to the offering browser. In the simplest case, this just involves making some API calls with no arguments and passing the results to the other side.
  4. The PeerConnection fires callbacks announcing remote media streams which you attach to some local display method.

As should be clear, the CU-RTC-Web proposal requires significantly more complex JavaScript, and in particular requires that JavaScript to be a lot smarter about what it's doing. In a JSEP-style API, the Web programmer can be pretty ignorant about things like codecs and transport protocols, unless he wants to do something fancy, but with CU-RTC-Web, he needs to understand a lot of stuff to make things work at all. In some ways, this is a much better fit for the traditional Web approach of having simple default behaviors which fit a lot of cases but which can then be customized, albeit in ways that are somewtimes a bit clunky.

Note that it's not like this complexity doesn't exist in JSEP, it's just been pushed into the browser so that the user doesn't have to see it. As discussed below, Microsoft's argument is that this simplicity in the JavaScript comes at a price in terms of flexibility and robustness, and that libraries will be developed (think jQuery) to give the average Web programmer a simple experience, so that they won't have to accept a lot of complexity themselves. However, since those libraries don't exist, it seems kind of unclear how well that's going to work.

ARGUMENTS FOR MICROSOFT'S PROPOSAL
Microsoft's proposal and the associated blog post makes a number of major arguments for why it is a superior choice (the proposal just came out today so there haven't really been any public arguments for why it's worse). Combining the blog posts, you would get something like this:

  • That the current specification violates "fit with key web tenets", specifically that it's not stateless and that you can only make changes when in specific states. Also, that it depends on the SDP offer/answer model.
  • That it doesn't allow a "customizable response to changing network quality".
  • That it doesn't support "real-world interoperability" with existing equipment.
  • That it's too tied to specific media formats and codecs.
  • That JSEP requires a Web application to do some frankly inconvenient stuff if it wants to do something that the API doesn't have explicit support for.
  • That it's inflexible and/or brittle with respect to new applications and in particular that it's difficult to implement some specific "innovative" applications with JSEP.
Below we examine each of these arguments in turn.

FITTING WITH "WEB TENETS"
MS writes:

Honoring key Web tenets-The Web favors stateless interactions which do not saddle either party of a data exchange with the responsibility to remember what the other did or expects. Doing otherwise is a recipe for extreme brittleness in implementations; it also raises considerably the development cost which reduces the reach of the standard itself.

This sounds rhetorically good, but I'm not sure how accurate it is. First, the idea that the Web is "stateless" feels fairly anachronistic in an era where more and more state is migrating from the server. To pick two examples, WebSockets involves forming a fairly long-term stateful two-way channel between the browser and the server, and localstore/localdb allow the server to persist data semi-permanently on the browser. Indeed, CU-RTC-Web requires forming a nontrivial amount of state on the browser in the form of the RealTimePorts, which represent actual resource reservations that cannot be reliably reconstructed if (for instance) the page reloads. I think the idea here is supposed to be that this is "soft state", in that it can be kept on the server and just reimposed on the browser at refresh time, but as the RealTimePorts example shows, it's not clear that this is the case. Similar comments apply to the state of the audio and video devices which are inherently controlled by the browser.

Moreover, it's never been true that neither party in the data exchange was "saddled" with remembering what the other did; rather, it used to be the case that most state sat on the server, and indeed, that's where the CU-RTC-Web proposal keeps it. This is the first time we have really built a Web-based peer-to-peer app. Pretty much all previous applications have been client-server applications, so it's hard to know what idioms are appropriate in a peer-to-peer case.

I'm a little puzzled by the argument about "development cost"; there are two kinds of development cost here: that to browser implementors and that to Web application programmers. The MS proposal puts more of that cost on Web programmers whereas JSEP puts more of the cost on browser implementors. One would ordinarily think that as long as the standard wasn't too difficult for browser implementors to develop at all, then pushing complexity away from Web programmers would tend to increase the reach of the standard. One could of course argue that this standard is too complicated for browser implementors to implement at all, but the existing state of Google and Mozilla's implementations would seem to belie that claim.

Finally, given that the original WHATWG draft had even more state in the browser (as noted above, it was basically a high-level API), it's a little odd to hear that Ian Hickson is out of touch with the "key Web tenets".

CUSTOMIZABLE RESPONSE TO CHANGING NETWORK QUALITY
The CU-RTC-Web proposal writes:

Real time media applications have to run on networks with a wide range of capabilities varying in terms of bandwidth, latency, and noise. Likewise these characteristics can change while an application is running. Developers should be able to control how the user experience adapts to fluctuations in communication quality. For example, when communication quality degrades, the developer may prefer to favor the video channel, favor the audio channel, or suspend the app until acceptable quality is restored. An effective protocol and API will have to arm developers with the tools to tailor such answers to the exact needs of the moment, while minimizing the complexity of the resulting API surface.

It's certainly true that it's desirable to be able to respond to changing network conditions, but it's a lot less clear that the CU-RTC-Web API actually offers a useful response to such changes. In general, the browser is going to know a lot more about the bandwidth/quality tradeoff of a given codec is going to be than most JavaScript applications will, and so it seems at least plausible that you're going to do better with a small number of policies (audio is more important than video, video is more important than audio, etc.) than you would by having the JS try to make fine-grained decisions about what it wants to do. It's worth noting that the actual "customizable" policies that are proposed here seem pretty simple. The idea seems to be not that you would impose policy on the browser but rather that since you need to implement all the negotiation logic anyway, you get to implement whatever policy you want.

Moroever, there's a real concern that this sort of adaptation will have to happen in two places: as MS points out, this kind of network variability is really common and so applications have to handle it. Unless you want to force every JS calling application in the universe to include adaptation logic, the browser will need some (potentially configurable and/or disableable) logic. It's worth asking whether whatever logic you would write in JS is really going to be enough better to justify this design.

REAL-WORLD INTEROPERABILITY
In their blog post today, MS writes about JSEP:

it shows no signs of offering real world interoperability with existing VoIP phones, and mobile phones, from behind firewalls and across routers and instead focuses on video communication between web browsers under ideal conditions. It does not allow an application to control how media is transmitted on the network.

I wish this argument had been elaborated more, since it seems like CU-RTC-Web is less focused on interoperability, not more. In particular, since JSEP is based on existing technologies such as SDP and ICE, it's relatively easy to build Web applications which gateway JSEP to SIP or Jingle signaling (indeed, relatively simple prototypes of these already exist). By contrast, gatewaying CU-RTC-Web signaling to either of these protocols would require developing an entire SDP stack, which is precisely the piece that the MS guys are implicitly arguing is expensive.

Based on Matthew Kaufman's mailing list postings, his concern seems to be that there are existing endpoints which don't implement some of the specifications required by WebRTC (principally ICE, which is used to set up the network transport channels) correctly, and that it will be easier to interoperate with them if your ICE implementation is written in JavaScript and downloaded by the application rather than in C++ and baked into the browser. This isn't a crazy theory, but I think there are serious open questions about whether it is correct. The basic problem is that it's actually quite hard to write a good ICE stack (though easy to write a bad one). The browser vendors have the resources to do a good job here, but it's less clear that random JS toolkits that people download will actually do that good a job (especially if they are simultaneously trying to compensate for broken legacy equipment). The result of having everyone write their own ICE stack might be good but it might also lead to a landscape where cross-Web application interop is basically impossible (or where there are islands of noninteroperable de facto standards based on popular toolkits or even popular toolkit versions).

A lot of people's instincts here seem to be based on an environment where updating the software on people's machines was hard but updating one's Web site was easy. But for about half of the population of browsers (Chrome and Firefox) do rapid auto-updates, so they actually are generally fairly modern. By contrast, Web applications often use downrev version of their JS libraries (I wish I had survey data here but it's easy to see just by opening up a JS debugger on you favorite sites). It's not at all clear that the JS is easy to upgrade/native is hard dynamic holds up any more.

TOO TIED TO SPECIFIC MEDIA FORMATS AND CODECS
The proposal says:

A successful standard cannot be tied to individual codecs, data formats or scenarios. They may soon be supplanted by newer versions, which would make such a tightly coupled standard obsolete just as quickly. The right approach is instead to to support multiple media formats and to bring the bulk of the logic to the application layer, enabling developers to innovate.

I can't make much sense of this at all. JSEP, like the standards that it is based on, is agnostic about the media formats and codecs that are used. There's certainly nothing in JSEP that requires you to use VP8 for your video codec, Opus for your audio codec, or anything else. Rather, two conformant JSEP implementations will converge on a common subset of interoperable formats. This should happen automatically without Web application intervention.

Arguably, in fact, CU-RTC-Web is *more* tied to a given codec because the codec negotiation logic is implemented either on the server or in the JavaScript. If a browser adds support for a new codec, the Web application needs to detect that and somehow know how to prioritize it against existing known codecs. By contrast, when the browser manufacturer adds a new codec, he knows how it performs compared to existing codecs and can adjust his negotiation algorithms accordingly. Moreover, as discussed below, JSEP provides (somewhat clumsy) mechanisms for the user to override the browser's default choices. These mechanisms could probably be made better within the JSEP architecture.

Based on Matthew Kaufman's interview with Janko Rogers [http://gigaom.com/2012/08/06/microsoft-webrtc-w3c/], it seems like this may actually be about the proposal to have a mandatory to implement video codec (the leading candidates seem to be H.264 or VP8). Obviously, there have been a lot of arguments about whether such a mandatory codec is required (the standard argument in favor of it is that then you know that any two implementations have at least one codec in common), but this isn't really a matter of "tightly coupling" the codec to the standard. To the contrary, if we mandated VP8 today and then next week decided to mandate H.264 it would be a one-line change in the specification. In any case, this doesn't seem like a structural argument about JSEP versus CU-RTC-Web. Indeed, if IETF and W3C decided to ditch JSEP and go with CU-RTC-Web, it seems likely that this wouldn't affect the question of mandatory codecs at all.

THE INCONVENIENCE OF SDP EDITING
Probably the strongest point that the MS authors make is that if the API doesn't explicitly support doing something, the situation is kind of gross:

In particular, the negotiation model of the API relies on the SDP offer/answer model, which forces applications to parse and generate SDP in order to effect a change in browser behavior. An application is forced to only perform certain changes when the browser is in specific states, which further constrains options and increases complexity. Furthermore, the set of permitted transformations to SDP are constrained in non-obvious and undiscoverable ways, forcing applications to resort to trial-and-error and/or browser-specific code. All of this added complexity is an unnecessary burden on applications with little or no benefit in return.

What this is about is that in JSEP you call CreateOffer() on a PeerConnection in order to get an SDP offer. This doesn't actually change the PeerConnection state to accomodate the new offer; instead, you call SetLocalDescription() to install the offer. This gives the Web application the opportunity to apply its own preferences by editing the offer. For instance, it might delete a line containing a codec that it didn't want to use. Obviously, this requires a lot of knowledge of SDP in the application, which is irritating to say the least, for the reasons in the quote above.

The major mitigating factor is that the W3C/IETF WG members intend to allow most common manipulations to made through explicit settings parameters, so that only really advanced applications need to know anything about SDP at all. Obviously opinions vary about how good a job they have done, and of course it's possible to write libraries that would make this sort of manipulation easier. It's worth noting that there has been some discussion of extending the W3C APIs to have an explicit API for manipulating SDP objects rather than just editing the string versions (perhaps by borrowing some of the primitives in CU-RTC-Web). Such a change would make some things easier while not really representing a fundamental change to the JSEP model. However, it's not clear if there are enough SDP-editing tasks to make this project worthwhile.

With that said, that in order to have CU-RTC-Web interoperate with existing SIP endpoints at all you would need to know far more about SDP than would be required to do most anticipated transformations in a JSEP environment, so it's not like CU-RTC-Web frees you from SDP if you care about interoperability with existing equipment.

SUPPORT FOR NEW/INNOVATIVE APPLICATIONS
Finally, the MSFT authors argue that CU-RTC-Web is more flexible and/or less brittle than JSEP:

On the other hand, implementing innovative, real-world applications like security consoles, audio streaming services or baby monitoring through this API would be unwieldy, assuming it could be made to work at all. A Web RTC standard must equip developers with the ability to implement all scenarios, even those we haven't thought of.

Obviously the last sentence is true, but the first sentence provides scant support for the claim that CU-RTC-Web fulfills this requirement better than JSEP. The particular applications cited here, namely audio streaming, security consoles, and baby monitoring, seem not only doable with JSEP, but straightforward. In particular, security consoles and baby monitoring just look like one way audio and/or video calls from some camera somewhere. This seems like a trivial subset of the most basic JSEP functionality. Audio streaming is, if anything, even easier. Audio streaming from servers already exists without any WebRTC functionality at all, in the form of the audio tag, and audio streaming from client to server can be achieved with the combination of getUserMedia and WebSockets. Even if you decided that you wanted to use UDP rather than WebSockets, audio streaming is just a one-way audio call, so it's hard to see that this is a problem.

In e-mail to the W3C WebRTC mailing list, Matthew Kaufman mentions the use case of handling page reload:

An example would be recovery from call setup in the face of a browser page reload... a case where the state of the browser must be reinitialized, leading to edge cases where it becomes impossible with JSEP for a developer to write Javascript that behaves properly in all cases (because without an offer one cannot generate an answer, and once an offer has been generated one must not generate another offer until the first offer has been answered, but in either case there is no longer sufficient information as to how to proceed).

This use case, often called "rehydration" has been studied a fair bit and it's not entirely clear that there is a convenient solution with JSEP. However, the problem isn't the offer/answer state, which is actually easily handled, but rather the ICE and cryptographic state, which are just as troublesome with CU-RTC-Web as they are with JSEP [for a variety of technical reasons, you can't just reuse the previous settings here.] So, while rehydration is an issue, it's not clear that CU-RTC-Web makes matters any easier.

This argument, which should be the strongest of MS's arguments, feels rather like the weakest. Given how much effort has already gone into JSEP, both in terms of standards and implementation, if we're going to replace it with something else that something else should do something that JSEP can't, not just have a more attractive API. If MS can't come up with any use cases that JSEP can't accomplish, and if in fact the use cases they list are arguably more convenient with JSEP than with CU-RTC-Web, then that seems like a fairly strong argument that we should stick with JSEP, not one that we should replace it.

What I'd like to see Microsoft do here is describe some applications that are really a lot easier with CU-RTC-Web than they are with JSEP. Depending on the details, this might be a more or less convincing argument, but without some examples, it's pretty hard to see what considerations other than aesthetic would drive us towards CU-RTC-Web.

Acknowledgement
Thanks to Cullen Jennings, Randell Jesup, Maire Reavy, and Tim Terriberry for early comments on this draft.

2 Comments

I don't know anything about this, but reading your write-up it strikes me that when Microsoft says "VoIP inter-operability" they don't mean SIP, they mean Skype. Does what their saying make more sense if they're talking about trying to implement apps using Skype instead of SIP?

From past experience with standardization work, this can come from a different angle - yes, Microsoft wants to make their lives a bit easier, but Skype isn't their only client - they have Lync as well, which is SIP based to some extent, so SDP isn't far from them.
But the main story might be the need to slow down Google. By making changes to the standard, and having the need to use a different set of APIs that require changing what works on the current WebRTC spec, they are giving themselves time to ramp up their own development efforts as well as getting a grip on the technology while making sure their competitors in the browser space don't leave them far behind.
That would be my bet in this.
Another interesting battle is going to be on of the video codec, which is coming soon: http://bloggeek.me/webrtc-codec-wars/

Leave a comment