Why VoIP over TCP and/or SSL sounds like crap (I)

The difficulty of using IPsec VPNs has made SSL-based VPNs an increasingly popular networking technology. If your enterprise is using VoIP (isn't everyone?), then it's natural to want to carry that traffic over the SSL VPN. Unfortunately, this doesn't work very well.

The source of the problem is that SSL/TLS runs over TCP. TCP is designed to provide a channel with a number of properties:

  • Stream-oriented
  • Reliable
  • In-order
  • Flow controlled

What this implies is that TCP views data as a single long stream of data. It's convenient to think of the data as being a series of bytes numbered 1-N. In order to transmit it, the data is broken up into a sequence of packets, each with its own sequence number. Those packets are independently transmitted over the network. On the receiving side, these packets are reassembled into a stream and delivered to the application (and hence to your ear) as soon as they're available.

Here's the simplest example:

Time  Received  Delivered to application
2     1-5       1-5
4     6-10      6-10

At time 1, the sender transmits a single packet containing bytes 1-5. At time 2, it's received by the receiver, who passes it on to the application. At time 3, the sender transmits another packet containing bytes 6-10. At time 4, the receiver receives that packet and delivers it to the application. Data is delivered to the application as soon as it's available and (here's the key point) in order. Consider the next case:

Time  Received  Delivered to application
3     6-10      (nothing yet)
4     1-5       1-10

In this case, the sender sends two separate packets, one containing bytes 1-5 and the other bytes 6-10. They're sent in order but received out of order. At time 3, the receiver receives the packet containing 6-10. However, since it hasn't received 1-5, this packet is out of order, so the receiver doesn't deliver it. Rather, it waits until it receives bytes 1-5 at time 4 and then delivers all the bytes together. This is the "in-order" feature. Note that TCP doesn't preserve packet boundaries: the application can't tell whether the data was transferred as one packet or ten, or what order things were received in. This is the stream-oriented feature.
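Here's a rough sketch of that receive-side behavior in Python. It's a toy, not how a real TCP stack is written: the `Reassembler` class and its method names are made up for illustration, and it ignores partially overlapping segments. It just shows a receiver holding back bytes 6-10 until bytes 1-5 show up.

```python
class Reassembler:
    """Toy in-order receiver: buffers segments, releases only contiguous bytes."""

    def __init__(self, first_byte=1):
        self.next_byte = first_byte  # lowest byte number not yet delivered
        self.pending = {}            # start byte -> payload being held back

    def receive(self, start, payload):
        """Accept one segment; return whatever can now be delivered in order."""
        if start >= self.next_byte:
            self.pending[start] = payload
        # Anything starting below next_byte is treated as a duplicate and dropped.
        delivered = b""
        while self.next_byte in self.pending:
            chunk = self.pending.pop(self.next_byte)
            delivered += chunk
            self.next_byte += len(chunk)
        return delivered


rx = Reassembler()
early = rx.receive(6, b"FGHIJ")  # bytes 6-10 arrive first: nothing is deliverable
late = rx.receive(1, b"ABCDE")   # bytes 1-5 arrive: all ten bytes come out together
```

The application only ever sees the concatenated bytes, so it has no way to tell how many packets they arrived in.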

Remember that I said that TCP was reliable. Packet networks are fairly unreliable; packets can get damaged, lost, or rerouted. TCP imposes a reliable abstraction on top. The way that this works is that the receiver sends acknowledgements (ACKs) indicating which packets it has received. An example is shown in the figure below:

In this figure, the sender sends two packets in sequence, one containing bytes 1-5 and one containing bytes 6-10. The receiver responds with an acknowledgement that it's received bytes up to byte 10. One important thing to notice is that the sender doesn't send bytes 11-15 until he gets the ACK. This illustrates another important feature of TCP: flow control. TCP uses the ACKs from the recipient to control the flow of data from the sender. When the network gets congested, packets start getting dropped; the sender stops getting ACKs as fast and responds by reducing its sending rate. This responsiveness to network conditions is a key part of TCP.

If the recipient doesn't acknowledge a packet, the sender retransmits it. This looks something like this:

In this example, the sender sends the same packets as in the previous figure but the first one gets lost. What the receiver sees is just the second packet containing bytes 6-10. It can't deliver these since the first packet is missing, so it waits for the sender to retransmit. (Note to nerds: I'm assuming that selective ACK isn't in use here.) After a while (typically a second or so) the sender notices that it hasn't received an ACK and retransmits both packets. When the receiver sees the retransmitted packets, it acknowledges them. This retransmission and acknowledgement function is what makes TCP reliable: the sender keeps trying to send the data until it gets an ACK or it concludes that the network is fatally broken and terminates the connection. Note that the receiver has now seen two copies of bytes 6-10, but that's not a problem: it simply discards the duplicate. At the same time as the receiver sends the ACK, it delivers the completed bytes 1-10 to the application.
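To make the acknowledgement rule concrete, here's a minimal Python sketch of a cumulative ACK with no selective ACK, as assumed above. The function name `cumulative_ack` and the (start, end) range representation are my own illustration, not anything from a real stack.

```python
def cumulative_ack(received_ranges, first_byte=1):
    """Return the highest byte N such that first_byte..N have all arrived.

    received_ranges is a list of inclusive (start, end) byte ranges, in any
    order and possibly duplicated. Returns first_byte - 1 if even the first
    byte is still missing.
    """
    acked = first_byte - 1
    for start, end in sorted(received_ranges):
        if start > acked + 1:
            break  # there's a hole, so nothing past it can be acknowledged
        acked = max(acked, end)
    return acked


only_second = cumulative_ack([(6, 10)])                     # 1-5 lost: nothing to ACK
after_resend = cumulative_ack([(6, 10), (1, 5), (6, 10)])   # both resent, 6-10 seen twice
```

With only bytes 6-10 in hand the receiver can't acknowledge anything, which is why the sender eventually times out; once the retransmission lands, the duplicate copy of 6-10 changes nothing.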

We're now ready to see how these features interact with VoIP. Voice traffic consists of a series of samples taken at regular intervals, for instance every 20 milliseconds. If each sample is 20 bytes, this gives you a sequence of 20-byte packets at times 0 ms, 20 ms, 40 ms, 60 ms, etc. In order for the voice to sound the same on the receiving end as it did on the sending end, these samples need to be played at the same intervals. There's some propagation delay here but you still need to play at the same rate. So, if the propagation delay was 50 ms you'd get something like this:

Sample #  Time Sampled (ms)  Time Played (ms)
1         0                  50
2         20                 70
3         40                 90
4         60                 110
5         80                 130

Now, consider what happens if sample 3 is lost in transmission. Ordinary VoIP systems use UDP, in which the packets are independent and are delivered as soon as they are received, no matter what order they are in. So, what happens is that the receiving application sees packets 1 and 2, a 20 millisecond blank spot, and then packets 4 and 5. (I'm oversimplifying here, since the timing isn't that precise, but this is the general idea.) Now, the receiver doesn't have sample 3, but it's still got sample 4 scheduled for 110 ms. There are three basic strategies for dealing with this:

  1. Play 20 ms of silence in place of the dropped sample 3.
  2. Try to guess what would have been there by some form of interpolation/extrapolation.
  3. Repeat the last sample.
None of these options sounds perfect, but they're basically OK as long as not too many samples are lost. The standard procedure appears to be (3), replaying the last sample. It's easy and has about the right spectral properties to not sound too awful.
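As a rough illustration of strategies (1) and (3), here's a small Python sketch. Everything in it is hypothetical: the `conceal` function, the 20-byte frame size taken from the example above, and especially the assumption that an all-zero payload decodes as silence, which depends on the codec. Strategy (2) is left out because any real interpolation is codec-specific.

```python
FRAME_BYTES = 20
SILENCE = bytes(FRAME_BYTES)  # assumption: all-zero payload plays as silence


def conceal(frames, strategy="repeat"):
    """Fill in lost frames (given as None) so there is something to play.

    strategy="silence" plays a silent frame in each gap.
    strategy="repeat" replays the last good frame, falling back to silence
    if nothing has been received yet.
    """
    played = []
    last_good = None
    for frame in frames:
        if frame is not None:
            played.append(frame)
            last_good = frame
        elif strategy == "repeat" and last_good is not None:
            played.append(last_good)
        else:
            played.append(SILENCE)
    return played


# Samples 1, 2, 4, 5 arrive; sample 3 is lost.
received = [b"\x01" * 20, b"\x02" * 20, None, b"\x04" * 20, b"\x05" * 20]
repeated = conceal(received, "repeat")
silenced = conceal(received, "silence")
```

Either way the playout schedule is untouched: sample 4 still goes out at its scheduled time, and only the 20 ms slot for sample 3 gets patched.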

The problem is that this doesn't work with TCP. Instead, what happens is that when sample 3 is lost, the TCP implementation sits on samples 4 and 5 until it receives sample 3. This means that it's waiting for the sender to retransmit that sample. As we discussed before, this takes on the order of a second. During that time period, the receiver has no real choice but to play silence, so this is perceived as a dropout.

Once the retransmission happens, the receiver needs to try to recover. If all has gone well, the sender has sent not only sample 3, but most of the samples that would have fit in the next second or so. At this point the sender and receiver are synchronized from a network perspective, but the speaker on the receiver's computer is hopelessly behind. The usual procedure is just to start playing the sound where you would have been if the loss and retransmit had never occurred, so it just sounds like a 1 second dropout. This gives something like this:

Time (ms)  Sample played
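To put rough numbers on the stall, here's a toy Python model comparing the two delivery rules. The 20 ms interval and 50 ms delay come from the example above; the fixed 1000 ms retransmission timeout is my assumption standing in for "a second or so" (real TCP adapts its timer), and the function names are invented for the sketch.

```python
INTERVAL_MS = 20  # one sample every 20 ms
DELAY_MS = 50     # one-way propagation delay
RTO_MS = 1000     # assumed fixed retransmission timeout


def arrival_times(num_samples, lost, in_order):
    """Map sample number -> time the application can first read it (or None).

    lost is the set of samples whose first transmission is dropped.
    in_order=False models UDP: lost samples never arrive, others are unaffected.
    in_order=True models TCP: a lost sample is resent after RTO_MS, and nothing
    behind it is released until it arrives.
    """
    times = {}
    blocked_until = 0
    for n in range(1, num_samples + 1):
        sent = (n - 1) * INTERVAL_MS
        if not in_order:
            times[n] = None if n in lost else sent + DELAY_MS
            continue
        arrived = sent + DELAY_MS + (RTO_MS if n in lost else 0)
        blocked_until = max(blocked_until, arrived)
        times[n] = blocked_until
    return times


def on_time(times):
    """Samples readable by their scheduled playout time (sent time + delay)."""
    return [n for n, t in times.items()
            if t is not None and t <= (n - 1) * INTERVAL_MS + DELAY_MS]


udp = arrival_times(100, {3}, in_order=False)
tcp = arrival_times(100, {3}, in_order=True)
udp_ok = len(on_time(udp))  # only sample 3 is missing
tcp_ok = len(on_time(tcp))  # samples 3 through 52 all miss their slot
```

Under these assumptions a single lost packet costs UDP one 20 ms sample, while over TCP fifty consecutive samples (a full second of audio) arrive too late to be played on schedule.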

Obviously, if there's any reasonable rate of packet loss, this starts to sound pretty terrible. But things can get even worse. Remember the flow control feature of TCP? If enough packets get lost, then when the sender retransmits, he'll have a big backlog of untransmitted samples. This takes a while to work through and either the listener gets delayed audio (which sounds really weird) or has to endure a multi-second dropout, which is basically intolerable.

At this point it's worth asking why streaming audio doesn't sound terrible even though it runs over TCP. The reason is that the recipients buffer seconds to minutes worth of audio before they start playing. That way, if there's a packet loss, they just keep right on playing out of the buffer with no interruption. If there's a big enough network problem, you can empty the buffer and that's why you'll sometimes see streaming audio or video pause, but in general this strategy works fine if you have a big enough buffer. Unfortunately, you can't use this strategy for voice because it's interactive. It would be fairly intolerable to have to wait 10 seconds after you've said something before you started hearing the other person's reply.
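Here's a small Python sketch of why a big buffer hides the problem. The arrival pattern is made up: it assumes the same 20 ms samples and 50 ms delay as before, with a stall that holds every sample from the third onward until 1090 ms. The `underruns` function is hypothetical, and it simply counts samples that aren't there when the player reaches for them.

```python
def underruns(arrival_ms, interval_ms, prebuffer_ms):
    """Count samples that have not arrived when their playout slot comes up.

    arrival_ms[i] is when sample i becomes readable. Playback starts
    prebuffer_ms after the first sample arrives and then consumes one
    sample every interval_ms.
    """
    start = arrival_ms[0] + prebuffer_ms
    return sum(1 for i, t in enumerate(arrival_ms)
               if t > start + i * interval_ms)


# Samples 0 and 1 arrive normally; everything after is stuck until 1090 ms.
arrivals = [50 + 20 * i if i < 2 else max(50 + 20 * i, 1090)
            for i in range(100)]

no_buffer = underruns(arrivals, 20, prebuffer_ms=0)
half_second = underruns(arrivals, 20, prebuffer_ms=500)
one_second = underruns(arrivals, 20, prebuffer_ms=1000)
```

A one-second buffer rides out the one-second stall with no gap at all, but it does so by adding a full second of delay to everything, which is exactly what an interactive call can't afford.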

Planned future posts in this series:

  1. Why congestion control makes the problem worse
  2. Why ACK spoofing is a bad idea
  3. Why you shouldn't use multiple TCP connections to reduce delay for VoIP

Acknowledgement: Thanks to Cullen Jennings for review and helpful suggestions.



Um, why doesn't the list start "0. Why DTLS solves this problem"?

You have the theory spot on.

In practice, packet loss and re-ordering in most networks is rare enough that these problems don't arise often enough to bother people. I have two datapoints for you to consider.

First, my personal experience: when at a previous employer, I would often set up a VPN tunnel back to our internal network exclusively for the purpose of being able to access our SIP/PSTN gateway. With rare exception, these calls would be set up and just work -- no excessive delay, no large gaps -- just normal voice. This worked just fine in Korea and Japan, as well as a host of European and North American destinations. Generally, if I didn't tell people I was out of the office, they wouldn't have ever known.

Second, consider Skype: its firewall traversal mechanisms include using TCP to get through symmetric NATs. When people talk about Skype, the overwhelming reaction is, "wow, this sounds good," not "wow, this sounds like crap."

Now, I'll admit that, when this goes wrong, it does go very wrong, and you generally end up disconnecting and trying the call over again. But with modern networks, I think you'll discover that packet loss and reordering are sufficiently rare that sending VoIP over TCP connections just works.

In terms of the inconvenience of having to restart the connection: the ubiquity of cell phones has trained people to expect calls to randomly go wonky, and need re-establishing -- and my experience is that cell phones are likely to go bonkers about 100 times more often than VoIP over TCP connections.

The view that TCP is a bad transport for IP packets is rather well established, see here for what I take to be the standard summary.

However, for the usual case of a single-user broadband connection, packet loss and reordering are rare enough to be outweighed by the sheer unpleasantness of setting up IPsec and, I guess, the lack of a well-deployed alternative.

I think that Adam is mostly right and completely wrong. The problem with this approach is that it often works. The problem with this approach is that when it does not work, it really, really sucks. Everyone tells me this all works fine. I disagree: take a system that is running this and configure a network with 5% packet loss on one of the low-bandwidth edge links. VoIP calls to Asia often have loss rates exceeding this. It sounds really bad. Now consider this from a voice provider's point of view. Some user calls the help desk (this is a guy who pays $5 to $20 a month), says his voice quality is awful, and describes the problem. What do you do? It is hard to reproduce. It is hard to diagnose. You now have a $100 support call that you are never going to solve. People who have actually deployed VoIP over TCP go through the following stages (yes, I experienced this firsthand)...

0) WooHoo it works through firewalls, our customers love it, they think that working through firewalls is better than nothing

1) we are having intermittent really irritable customers that say sometimes it does not work, we tell them it is better than nothing but you know customers they just don't get it

2) We have to do something

3) Brilliant engineer rescues the day by opening 5 TCP streams and alternating

4) turns out this does not work in many cases

5) Engineering changes TCP stack to ACK things it did not receive

6) everyone happy, start working on patent of this brilliant idea

7) oops some customers are even more pissed now - turns out your stuff broke other apps and when 3 people run voice on the same DSL line it all breaks along with every other app

8) the bad press that started around step 4 is getting worse and worse

9) customers are canceling accounts because you can not solve this voice quality problem - help desk hates engineering because they can't solve it

10) Company turns off lame system that they can't debug, can't support, and can't make work. The internet is a better place.

PS - If you think 911 over TCP is a good idea, I have a web site for you.

I yield the point to Cullen, with the provision that a more accurate title for the overall topic might be "Why VoIP over TCP seems to work almost all the time, but occasionally goes very, very wrong."

PS - I'll take "911 over TCP" over "Yelling help and hoping someone hears me, because my firewall won't let me out" any day.

Push-to-talk seems to solve these problems, though.

I'm eagerly awaiting the "ACK-spoofing is bad" part. The main interest in TCP comes from its firewall traversal properties, so running some custom protocol which looks like TCP to middleboxes is very seductive.
