Microsoft's arguments for the superiority of this API fall into three major categories:
- JSEP doesn't fit with "key Web tenets"; i.e., it doesn't match the Web/HTML5 style.
- It allows the development of applications that would otherwise be difficult to develop with the existing W3C API.
- It will be easier to make it interoperate with existing VoIP endpoints.
Like any all-new design, this API has the significant advantage (which the authors don't mention) of architectural cleanliness. The existing API is a compromise between a number of different architectural notions, and like any hybrid proposal it has points of ugliness where those notions come into contact with each other (especially in the area of SDP). However, when we actually look at functionality rather than elegance, the advantages of an all-new design (not only one which is largely not based on preexisting technologies, but one which involves discarding most of the existing work on WebRTC itself) start to look fairly thin.
Looking at the three claims listed above: the first seems more rhetorical than factual. It's certainly true that in the early years of the Web designers strove to keep state out of the Web browser, but that hasn't been the case with rich Web applications for quite some time. To the contrary, many modern HTML5 technologies (localstore, WebSockets, HSTS, WebGL) are about pushing state onto the browser from the server.
None of this is to say that CU-RTC-Web wouldn't be better in some respects than JSEP. Obviously, any design has tradeoffs and, as I said above, it's always appealing to throw all that annoying legacy stuff away and start fresh. However, that also comes with a lot of costs, and before we consider it we really need a far better picture of what benefits, other than elegance, starting over would bring to the table.
More or less everyone agrees about the basic objectives of the WebRTC effort: to bring real-time communications (i.e., audio, video, and direct data) to browsers. Specifically, the idea is that Web applications should be able to use these capabilities directly. This sort of functionality was of course already available either via generic plugins such as Flash or via specific plugins such as Google Talk, but the idea here was to have a standardized API that was built into browsers.
Broadly speaking, there have been two schools of thought about what style of API to expose:
- High-level APIs — essentially a softphone in the browser. The Web application would request the creation of a call (perhaps with some settings as to what kinds of media it wanted) and then each browser would emit standardized signaling messages which the Web application would arrange to transit to the other browser. The original WHATWG HTML5/PeerConnection spec was of this type.
- Low-level APIs — the browser exposes the individual media and transport primitives and the Web application is responsible for assembling them into a call and for defining its own signaling.
JSEP, the design the IETF and W3C working groups eventually settled on, sits between these poles: the browser generates and consumes SDP-based offers and answers, but the Web application is responsible for relaying them (and any other signaling) to the other side. The decision to follow this trajectory was made somewhere around six months ago, and at this point Google has a fairly mature JSEP implementation available in Chrome Canary, while Mozilla has a less mature implementation which you can compile yourself but which hasn't been released in any public build.
Below is an initial, high-level analysis of this proposal.
Disclaimer: I have been heavily involved with both the IETF and W3C working groups in this area and have contributed significant chunks of code to both the Chrome and Firefox implementations. I am also currently consulting for Mozilla on their implementation. However, the comments here are my own and don't necessarily represent those of any other organization.
WHAT IS MICROSOFT PROPOSING?
What Microsoft is proposing is effectively a straight low-level API. There are a lot of different API points, and I don't plan to discuss them all in detail, but it's helpful to look at the major abstractions to get a flavor of what's required to use the API:
- RealTimeMediaStream -- each RealTimeMediaStream represents a single flow of media (i.e., audio or video).
- RealTimeMediaDescription -- a set of parameters for the RealTimeMediaStream.
- RealTimeTransport -- a transport channel which a RealTimeMediaStream can run over.
- RealTimePort -- a transport endpoint which can be paired with a RealTimePort on the other side to form a RealTimeTransport.
In order to set up an audio, video, or audio-video session, then, the JS has to do something like the following (a rough code sketch appears after the list):
- Acquire local media streams on each browser via the getUserMedia() API, thus getting some set of MediaStreamTracks.
- Create RealTimePorts on each browser for all the local network addresses as well as for whatever media relays are available/required.
- Communicate the coordinates for the RealTimePorts from each browser to the other.
- On each browser, run ICE connectivity checks for all combinations of remote and local RealTimePorts.
- Select a subset of the working remote/local RealTimePort pairs and establish RealTimeTransports based on those pairs. (This might be one or might be more than one depending on the number of media flows, level of multiplexing, and the level of redundancy required).
- Determine a common set of media capabilities and codecs between each browser, select a specific set of media parameters, and create matching RealTimeMediaDescriptions on each browser based on those parameters.
- Create RealTimeMediaStreams by combining RealTimeTransports, RealTimeMediaDescriptions, and MediaStreamTracks.
- Attach the remote RealTimeMediaStreams to some local display method (such as an audio or video tag).
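To give a flavor of what this looks like in code, here is a very rough sketch of what one browser's JS might do. To be clear, the constructor arguments and every helper used below (obtainLocalRealTimePorts, describePorts, checkConnectivity, attachToAudioElement, signalingChannel) are illustrative assumptions of mine, not the actual IDL in Microsoft's draft; the point is only to show how much assembly the application is responsible for.

    // Hypothetical sketch only: names, arguments, and helpers are assumptions,
    // not the CU-RTC-Web draft's actual interface definitions.
    navigator.getUserMedia({ audio: true, video: true }, function (localMedia) {
      // Gather transport endpoints (local interfaces plus any relays)...
      var localPorts = obtainLocalRealTimePorts();               // assumed helper
      // ...and send their coordinates to the peer over the app's own signaling.
      signalingChannel.send({ ports: describePorts(localPorts) });

      signalingChannel.onmessage = function (msg) {
        // Run ICE-style connectivity checks and pick a working port pair.
        checkConnectivity(localPorts, msg.ports, function (localPort, remotePort) {
          var transport = new RealTimeTransport(localPort, remotePort);

          // Agree on media parameters out of band and describe them explicitly.
          var description = new RealTimeMediaDescription({ codec: 'opus' });

          // Bind a track, a description, and a transport into a media stream,
          // then attach it to whatever local playback mechanism the app uses.
          var rtStream = new RealTimeMediaStream(localMedia.audioTracks[0],
                                                 description, transport);
          attachToAudioElement(rtStream);                        // assumed helper
        });
      };
    }, function (err) { console.log('getUserMedia failed: ' + err); });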
For comparison, in JSEP you would do something like the following (again, a rough sketch appears after the list):
- Acquire local media streams on each browser via the getUserMedia() API, thus getting some set of MediaStreamTracks.
- Create a PeerConnection() and call AddStream() for each of the local streams.
- Create an offer on one browser and send it to the other side; create an answer on the other side and send it back to the offering browser. In the simplest case, this just involves making some API calls with no arguments and passing the results to the other side.
- The PeerConnection fires callbacks announcing remote media streams which you attach to some local display method.
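Here is a rough sketch of the offering side, modulo vendor prefixes and the exact callback signatures, which have shifted between drafts; signalingChannel again stands in for whatever mechanism (e.g., a WebSocket to the Web server) the application uses to relay messages to the peer.

    // Rough JSEP sketch for the offering browser. The STUN server URL and
    // signalingChannel are placeholders, not part of the API itself.
    var pc = new RTCPeerConnection({ iceServers: [{ url: 'stun:stun.example.org' }] });

    navigator.getUserMedia({ audio: true, video: true }, function (stream) {
      pc.addStream(stream);

      // ICE candidates are gathered by the browser; the app just relays them.
      pc.onicecandidate = function (evt) {
        if (evt.candidate) signalingChannel.send({ candidate: evt.candidate });
      };

      // Remote media shows up via a callback; attach it to a <video> element.
      pc.onaddstream = function (evt) {
        document.getElementById('remoteView').src = URL.createObjectURL(evt.stream);
      };

      // The browser generates the offer; ordinarily the app just installs and relays it.
      pc.createOffer(function (offer) {
        pc.setLocalDescription(offer);
        signalingChannel.send({ sdp: offer });
      }, function (err) { console.log('createOffer failed: ' + err); });
    }, function (err) { console.log('getUserMedia failed: ' + err); });

The answering side is symmetrical: it calls setRemoteDescription() with the received offer, then createAnswer(), installs the result with setLocalDescription(), and relays the answer back.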
ARGUMENTS FOR MICROSOFT'S PROPOSAL
Microsoft's proposal and the associated blog post make a number of major arguments for why it is a superior choice (the proposal just came out today, so there haven't really been any public arguments for why it's worse). Combining them, you get something like this:
- That the current specification doesn't "fit with key web tenets"; specifically, that it's not stateless and that you can only make changes when in specific states. Also, that it depends on the SDP offer/answer model.
- That it doesn't allow a "customizable response to changing network quality".
- That it doesn't support "real-world interoperability" with existing equipment.
- That it's too tied to specific media formats and codecs.
- That JSEP requires a Web application to do some frankly inconvenient stuff if it wants to do something that the API doesn't have explicit support for.
- That it's inflexible and/or brittle with respect to new applications and in particular that it's difficult to implement some specific "innovative" applications with JSEP.
FITTING WITH "WEB TENETS"
Honoring key Web tenets -- The Web favors stateless interactions which do not saddle either party of a data exchange with the responsibility to remember what the other did or expects. Doing otherwise is a recipe for extreme brittleness in implementations; it also raises considerably the development cost which reduces the reach of the standard itself.
This sounds rhetorically good, but I'm not sure how accurate it is. First, the idea that the Web is "stateless" feels fairly anachronistic in an era where more and more state is migrating from the server to the browser. To pick two examples, WebSockets involves forming a fairly long-term stateful two-way channel between the browser and the server, and localstore/localdb allow the server to persist data semi-permanently on the browser. Indeed, CU-RTC-Web requires forming a nontrivial amount of state on the browser in the form of the RealTimePorts, which represent actual resource reservations that cannot be reliably reconstructed if (for instance) the page reloads. I think the idea here is supposed to be that this is "soft state", in that it can be kept on the server and just reimposed on the browser at refresh time, but as the RealTimePorts example shows, it's not clear that this is the case. Similar comments apply to the state of the audio and video devices, which are inherently controlled by the browser.
Moreover, it's never been true that neither party in the data exchange was "saddled" with remembering what the other did; rather, it used to be the case that most state sat on the server, and indeed, that's where the CU-RTC-Web proposal keeps it. This is the first time we have really built a Web-based peer-to-peer app. Pretty much all previous applications have been client-server applications, so it's hard to know what idioms are appropriate in a peer-to-peer case.
I'm a little puzzled by the argument about "development cost"; there are two kinds of development cost here: that to browser implementors and that to Web application programmers. The MS proposal puts more of that cost on Web programmers whereas JSEP puts more of the cost on browser implementors. One would ordinarily think that as long as the standard wasn't too difficult for browser implementors to develop at all, then pushing complexity away from Web programmers would tend to increase the reach of the standard. One could of course argue that this standard is too complicated for browser implementors to implement at all, but the existing state of Google and Mozilla's implementations would seem to belie that claim.
Finally, given that the original WHATWG draft had even more state in the browser (as noted above, it was basically a high-level API), it's a little odd to hear that Ian Hickson is out of touch with the "key Web tenets".
CUSTOMIZABLE RESPONSE TO CHANGING NETWORK QUALITY
The CU-RTC-Web proposal writes:
Real time media applications have to run on networks with a wide range of capabilities varying in terms of bandwidth, latency, and noise. Likewise these characteristics can change while an application is running. Developers should be able to control how the user experience adapts to fluctuations in communication quality. For example, when communication quality degrades, the developer may prefer to favor the video channel, favor the audio channel, or suspend the app until acceptable quality is restored. An effective protocol and API will have to arm developers with the tools to tailor such answers to the exact needs of the moment, while minimizing the complexity of the resulting API surface.
Moreover, there's a real concern that this sort of adaptation will have to happen in two places: as MS points out, this kind of network variability is really common, so applications have to handle it somehow. Unless you want to force every JS calling application in the universe to include adaptation logic, the browser will need some (potentially configurable and/or disableable) logic of its own. It's worth asking whether whatever logic you would write in JS is really going to be enough better to justify this design.
REAL-WORLD INTEROPERABILITY
In their blog post today, MS writes about JSEP:
it shows no signs of offering real world interoperability with existing VoIP phones, and mobile phones, from behind firewalls and across routers and instead focuses on video communication between web browsers under ideal conditions. It does not allow an application to control how media is transmitted on the network.
I wish this argument had been elaborated more, since it seems like CU-RTC-Web is less focused on interoperability, not more. In particular, since JSEP is based on existing technologies such as SDP and ICE, it's relatively easy to build Web applications which gateway JSEP to SIP or Jingle signaling (indeed, relatively simple prototypes of these already exist). By contrast, gatewaying CU-RTC-Web signaling to either of these protocols would require developing an entire SDP stack, which is precisely the piece that the MS guys are implicitly arguing is expensive.
A lot of people's instincts here seem to be based on an environment where updating the software on people's machines was hard but updating one's Web site was easy. But about half of the browser population (Chrome and Firefox) does rapid auto-updates, so deployed browsers are actually generally fairly modern. By contrast, Web applications often use downrev versions of their JS libraries (I wish I had survey data here, but it's easy to see just by opening up a JS debugger on your favorite sites). It's not at all clear that the "JS is easy to upgrade, native is hard" dynamic holds up any more.
TOO TIED TO SPECIFIC MEDIA FORMATS AND CODECS
The proposal says:
A successful standard cannot be tied to individual codecs, data formats or scenarios. They may soon be supplanted by newer versions, which would make such a tightly coupled standard obsolete just as quickly. The right approach is instead to support multiple media formats and to bring the bulk of the logic to the application layer, enabling developers to innovate.
I can't make much sense of this at all. JSEP, like the standards that it is based on, is agnostic about the media formats and codecs that are used. There's certainly nothing in JSEP that requires you to use VP8 for your video codec, Opus for your audio codec, or anything else. Rather, two conformant JSEP implementations will converge on a common subset of interoperable formats. This should happen automatically without Web application intervention.
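To make that concrete, the audio section of an SDP offer simply lists every payload type the offerer is willing to use, in preference order, and the answerer picks from among them; nothing in JSEP cares which codecs appear on that list. A simplified, purely illustrative audio m-section might look like this (the port and payload numbers are arbitrary):

    m=audio 49170 RTP/SAVPF 111 103 0 8
    a=rtpmap:111 opus/48000/2
    a=rtpmap:103 ISAC/16000
    a=rtpmap:0 PCMU/8000
    a=rtpmap:8 PCMA/8000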
Based on Matthew Kaufman's interview with Janko Roettgers [http://gigaom.com/2012/08/06/microsoft-webrtc-w3c/], it seems like this may actually be about the proposal to have a mandatory-to-implement video codec (the leading candidates seem to be H.264 or VP8). Obviously, there have been a lot of arguments about whether such a mandatory codec is required (the standard argument in favor of it is that then you know that any two implementations have at least one codec in common), but this isn't really a matter of "tightly coupling" the codec to the standard. To the contrary, if we mandated VP8 today and then next week decided to mandate H.264, it would be a one-line change in the specification. In any case, this doesn't seem like a structural argument about JSEP versus CU-RTC-Web. Indeed, if IETF and W3C decided to ditch JSEP and go with CU-RTC-Web, it seems likely that this wouldn't affect the question of mandatory codecs at all.
THE INCONVENIENCE OF SDP EDITING
Probably the strongest point that the MS authors make is that if the API doesn't explicitly support doing something, the situation is kind of gross:
In particular, the negotiation model of the API relies on the SDP offer/answer model, which forces applications to parse and generate SDP in order to effect a change in browser behavior. An application is forced to only perform certain changes when the browser is in specific states, which further constrains options and increases complexity. Furthermore, the set of permitted transformations to SDP are constrained in non-obvious and undiscoverable ways, forcing applications to resort to trial-and-error and/or browser-specific code. All of this added complexity is an unnecessary burden on applications with little or no benefit in return.
What this is about is that in JSEP you call CreateOffer() on a PeerConnection in order to get an SDP offer. This doesn't actually change the PeerConnection state to accommodate the new offer; instead, you call SetLocalDescription() to install the offer. This gives the Web application the opportunity to apply its own preferences by editing the offer. For instance, it might delete a line containing a codec that it didn't want to use. Obviously, this requires a lot of knowledge of SDP in the application, which is irritating to say the least, for the reasons in the quote above.
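As a concrete illustration, an application that wanted to keep ISAC out of the negotiation might do something like the sketch below. removeCodec() is a hypothetical helper (not part of any API) that the application would have to write itself, with enough SDP knowledge to strip the payload type from the m= line and delete the matching a=rtpmap/a=fmtp lines, which is exactly the burden being complained about.

    // Illustrative only: removeCodec() is a hypothetical helper the app would
    // have to write itself, with enough SDP knowledge to edit the offer safely.
    pc.createOffer(function (offer) {
      offer.sdp = removeCodec(offer.sdp, 'ISAC/16000');
      pc.setLocalDescription(offer);          // install the edited offer
      signalingChannel.send({ sdp: offer });  // relay it to the peer as usual
    }, function (err) { console.log('createOffer failed: ' + err); });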
The major mitigating factor is that the W3C/IETF WG members intend to allow most common manipulations to be made through explicit settings parameters, so that only really advanced applications need to know anything about SDP at all. Obviously, opinions vary about how good a job they have done, and of course it's possible to write libraries that would make this sort of manipulation easier. It's worth noting that there has been some discussion of extending the W3C APIs to have an explicit API for manipulating SDP objects rather than just editing the string versions (perhaps by borrowing some of the primitives in CU-RTC-Web). Such a change would make some things easier while not really representing a fundamental change to the JSEP model. However, it's not clear if there are enough SDP-editing tasks to make this project worthwhile.
With that said, in order to have CU-RTC-Web interoperate with existing SIP endpoints at all, you would need to know far more about SDP than would be required to do most anticipated transformations in a JSEP environment, so it's not like CU-RTC-Web frees you from SDP if you care about interoperability with existing equipment.
SUPPORT FOR NEW/INNOVATIVE APPLICATIONS
Finally, the MSFT authors argue that CU-RTC-Web is more flexible and/or less brittle than JSEP:
On the other hand, implementing innovative, real-world applications like security consoles, audio streaming services or baby monitoring through this API would be unwieldy, assuming it could be made to work at all. A Web RTC standard must equip developers with the ability to implement all scenarios, even those we haven't thought of.
Obviously the last sentence is true, but the first sentence provides scant support for the claim that CU-RTC-Web fulfills this requirement better than JSEP. The particular applications cited here, namely audio streaming, security consoles, and baby monitoring, seem not only doable with JSEP, but straightforward. In particular, security consoles and baby monitoring just look like one way audio and/or video calls from some camera somewhere. This seems like a trivial subset of the most basic JSEP functionality. Audio streaming is, if anything, even easier. Audio streaming from servers already exists without any WebRTC functionality at all, in the form of the audio tag, and audio streaming from client to server can be achieved with the combination of getUserMedia and WebSockets. Even if you decided that you wanted to use UDP rather than WebSockets, audio streaming is just a one-way audio call, so it's hard to see that this is a problem.
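For what it's worth, here is a rough sketch of the client-to-server audio streaming case using only getUserMedia, the Web Audio API, and a WebSocket (modulo vendor prefixes). The server URL and buffer size are placeholders, and a real application would want to encode the audio rather than ship raw samples.

    // Rough sketch: push raw microphone audio to a server over a WebSocket.
    var ws = new WebSocket('wss://media.example.org/ingest');    // placeholder URL
    ws.binaryType = 'arraybuffer';

    navigator.getUserMedia({ audio: true }, function (stream) {
      var ctx = new AudioContext();
      var source = ctx.createMediaStreamSource(stream);
      var proc = ctx.createScriptProcessor(4096, 1, 1);          // placeholder size

      proc.onaudioprocess = function (evt) {
        var samples = evt.inputBuffer.getChannelData(0);         // Float32Array
        // Copy before sending, since the underlying buffer gets reused.
        if (ws.readyState === WebSocket.OPEN) ws.send(new Float32Array(samples).buffer);
      };

      source.connect(proc);
      proc.connect(ctx.destination);   // keep the processing node alive
    }, function (err) { console.log('getUserMedia failed: ' + err); });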
In e-mail to the W3C WebRTC mailing list, Matthew Kaufman also raises the use case of handling a page reload in the middle of a call.
This use case, often called "rehydration", has been studied a fair bit, and it's not entirely clear that there is a convenient solution with JSEP. However, the problem isn't the offer/answer state, which is actually easily handled, but rather the ICE and cryptographic state, which are just as troublesome with CU-RTC-Web as they are with JSEP (for a variety of technical reasons, you can't just reuse the previous settings here). So, while rehydration is an issue, it's not clear that CU-RTC-Web makes matters any easier.
This argument, which should be the strongest of MS's arguments, feels rather like the weakest. Given how much effort has already gone into JSEP, both in terms of standards and implementation, if we're going to replace it with something else that something else should do something that JSEP can't, not just have a more attractive API. If MS can't come up with any use cases that JSEP can't accomplish, and if in fact the use cases they list are arguably more convenient with JSEP than with CU-RTC-Web, then that seems like a fairly strong argument that we should stick with JSEP, not one that we should replace it.
What I'd like to see Microsoft do here is describe some applications that are really a lot easier with CU-RTC-Web than they are with JSEP. Depending on the details, this might be a more or less convincing argument, but without some examples, it's pretty hard to see what considerations other than aesthetic would drive us towards CU-RTC-Web.
Thanks to Cullen Jennings, Randell Jesup, Maire Reavy, and Tim Terriberry for early comments on this draft.