The difficulty of using IPsec VPNs has made SSL-based VPNs an
increasingly popular networking technology. If your enterprise
is using VoIP (isn't everyone?), then it's natural to want
to carry that traffic over the SSL VPN. Unfortunately,
this doesn't work very well.
The source of the problem is that SSL/TLS runs over TCP.
TCP is designed to provide a channel with a number of
properties:
- Stream-oriented
- Reliable
- In-order
- Flow controlled
What this implies is that TCP views data as a single
long stream. It's convenient to think of the data as being a series
of bytes numbered 1-N. In order to transmit it, the data
is broken up into a sequence of packets, each with
its own sequence number. Those packets are independently
transmitted over the network. On the receiving side,
these packets are reassembled into a stream and delivered
to the application (and hence to your ear) as soon as they're
available.
Here's the simplest example:
| Time | Sent | Received | Delivered |
| 1 | 1-5 | | |
| 2 | | 1-5 | 1-5 |
| 3 | 6-10 | | |
| 4 | | 6-10 | 6-10 |
At time 1, the sender transmits a single packet containing bytes 1-5.
At time 2, it's received by the receiver, who passes it on to the
application. At time 3, the sender transmits another packet containing
6-10. At time 4, the receiver receives that packet, and delivers it
to the application. Data is delivered to the application as soon as
it's available and (here's the key point) in order.
Consider the next case:
| Time | Sent | Received | Delivered |
| 1 | 1-5 | | |
| 2 | 6-10 | | |
| 3 | | 6-10 | |
| 4 | | 1-5 | 1-10 |
In this case, the sender sends two separate packets, one containing
1-5 and the other 6-10. They're sent in order but received out of
order. At time 3, the receiver receives the packet containing
6-10. However, since it hasn't yet received 1-5, this packet is out
of order, so the receiver holds it rather than delivering it. It waits until it receives
bytes 1-5 at time 4 and then delivers all the bytes together. This is
the "in-order" feature.
Note that TCP doesn't preserve packet boundaries: the application can't
tell whether the data was transferred as one packet or ten, or what
order the packets were received in. This is the stream-oriented feature.
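To make the in-order and stream-oriented behavior concrete, here's a minimal Python sketch of a receiver that holds back out-of-order data. It is not how any real TCP stack is written: the `make_receiver` helper is hypothetical, and it assumes segments never overlap and always start exactly where an earlier one ended.

```python
def make_receiver(first_byte=1):
    """Return a function that accepts (start, data) segments and returns
    whatever bytes can now be delivered to the application in order."""
    pending = {}            # start byte number -> segment payload
    next_byte = first_byte  # next byte number the application is waiting for

    def receive(start, data):
        nonlocal next_byte
        pending[start] = data
        delivered = b""
        # Release every buffered segment that continues the stream.
        while next_byte in pending:
            chunk = pending.pop(next_byte)
            delivered += chunk
            next_byte += len(chunk)
        return delivered

    return receive


receive = make_receiver()
out_of_order = receive(6, b"FGHIJ")  # bytes 6-10 arrive first and are held back
in_order = receive(1, b"ABCDE")      # bytes 1-5 arrive and everything is released
print(out_of_order, in_order)
```

The application only ever sees one concatenated run of bytes, so it can't tell how many packets carried them or which arrived first.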
Remember that I said that TCP was reliable. Packet networks are
fairly unreliable; packets can get damaged, lost, or rerouted.
TCP imposes a reliable abstraction on top. The way that this works
is that the receiver sends acknowledgements (ACKs) indicating which
packets it has received. An example is shown in the figure
below:
In this figure, the sender sends two packets in sequence, one
containing bytes 1-5 and one containing bytes 6-10. The receiver
responds with an acknowledgement that it's received bytes
up to byte 10. One important thing to notice is that the
sender doesn't send bytes 11-15 until he gets the ACK. This
illustrates another important feature of TCP: flow control.
TCP uses the ACKs from the recipient to control the flow
of data from the sender. When the network gets congested,
packets start getting dropped. The sender stops getting
ACKs as quickly and responds by reducing its
sending rate. This responsiveness to network conditions
is a key part of TCP.
If the recipient doesn't acknowledge a
packet, the sender retransmits it. This looks something like
this:
In this example, the sender sends the same packets as in the
previous figure but the first one gets lost. What the receiver
sees is just the second packet containing bytes 6-10. It
can't deliver these since the first packet is missing, so
it waits for the sender to retransmit (Note to nerds: I'm assuming that
selective ACK isn't in use here). After a while (typically
a second or so) the sender notices that it hasn't received
an ACK and retransmits both packets. When the receiver
sees the retransmitted packets, it acknowledges them. This
retransmission and acknowledgement function is what makes
TCP reliable--the sender keeps trying to send the data until
it gets an ACK or it concludes that the network is fatally
broken and terminates the connection. Note that
the receiver has now seen two copies of bytes 6-10, but that's not a problem:
the sequence numbers identify the duplicate, which it discards. At the same time as the receiver sends the ACK,
it delivers the completed bytes 1-10 to the application.
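The lost-packet exchange can be sketched in a few lines of Python. This is a toy model under the same assumption as the text (no selective ACK): the `cumulative_ack` helper is a hypothetical name, timers are not modeled, and the sender simply resends everything that hasn't been acknowledged.

```python
def cumulative_ack(received, first_byte=1):
    """Highest byte number N such that bytes first_byte..N have all arrived
    (first_byte - 1 if nothing contiguous has arrived yet)."""
    ack = first_byte - 1
    for start, end in sorted(received):
        if start > ack + 1:
            break  # gap in the stream: nothing past it can be acknowledged
        ack = max(ack, end)
    return ack


segments = [(1, 5), (6, 10)]  # byte ranges the sender transmits
received = set()

# First attempt: the packet carrying bytes 1-5 is lost in the network.
received.add((6, 10))
ack_before = cumulative_ack(received)

# The sender's timer expires with nothing acknowledged, so it resends
# every segment that is not yet covered by an ACK.
to_resend = [seg for seg in segments if seg[1] > ack_before]
received.update(to_resend)    # the second copy of 6-10 is simply absorbed
ack_after = cumulative_ack(received)

print(ack_before, to_resend, ack_after)
```

Using a set for `received` is what makes the duplicate copy of bytes 6-10 harmless in this sketch; a real receiver uses the sequence numbers for the same purpose.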
We're now ready to see how these features interact with VoIP.
Voice traffic consists of a series of samples taken at regular
intervals, for instance every 20 milliseconds. If each sample
is 20 bytes, this gives you a sequence of 20-byte packets at
times 0 ms, 20 ms, 40 ms, 60 ms, etc. In order for the voice
to sound the same on the receiving end as it did on the sending end,
these samples need to be played at the same intervals. There's
some propagation delay here, but you still need to
play at the same rate. So, if the propagation delay were 50 ms, you'd
get something like this:
| Sample # | Time Sampled | Time Played |
| 1 | 0 | 50 |
| 2 | 20 | 70 |
| 3 | 40 | 90 |
| 4 | 60 | 110 |
| 5 | 80 | 130 |
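The playout schedule in this table is just the sampling time shifted by a fixed delay. Here's a small Python sketch that regenerates it; the constant and function names are mine, and the 20 ms interval and 50 ms delay are the values from the example above.

```python
SAMPLE_INTERVAL_MS = 20    # one sample every 20 ms
PROPAGATION_DELAY_MS = 50  # fixed delay between sampling and playing


def sampled_at(n):
    """Time (ms) at which sample number n (starting from 1) is taken."""
    return (n - 1) * SAMPLE_INTERVAL_MS


def play_at(n):
    """Time (ms) at which sample number n is scheduled to be played."""
    return sampled_at(n) + PROPAGATION_DELAY_MS


schedule = [(n, sampled_at(n), play_at(n)) for n in range(1, 6)]
for row in schedule:
    print(row)
```

The point is that each sample has a fixed deadline: if it isn't there when its slot comes up, the receiver has to play something else.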
Now, consider what happens if sample 3 is lost in transmission.
Ordinary VoIP systems use UDP, in which the packets are
independent and are delivered as soon as they are received,
no matter what order they are in.
So, what happens is that
the receiving application sees packets 1 and 2, a 20 millisecond blank
spot, and then packets 4 and 5. (I'm oversimplifying here,
since the timing isn't that precise, but this is the general
idea.) Now, the receiver doesn't have sample 3, but it's still
got sample 4 scheduled for 110 ms. There are three basic strategies for
dealing with this:
- Play 20 ms of silence in place of the dropped sample 3.
- Try to guess what would have been there by some form of
interpolation/extrapolation.
- Repeat the last sample.
None of these options sounds perfect, but they're basically OK as
long as not too many samples are lost. The standard procedure
appears to be the third option, replaying the last sample. It's easy and has
about the right spectral properties to not sound too awful.
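As a rough illustration of the silence and repeat strategies, here's a Python sketch of a receiver filling in lost samples. The `conceal` function and its `strategy` parameter are hypothetical, lost samples are represented as `None`, and I've left out interpolation since it depends on the codec.

```python
def conceal(frames, strategy="repeat", frame_size=20):
    """Fill in lost frames (given as None) so that playback stays on schedule.

    strategy="repeat" replays the previous frame;
    strategy="silence" plays an all-zero frame instead.
    """
    silence = bytes(frame_size)
    played = []
    last = silence  # nothing to repeat before the first frame arrives
    for frame in frames:
        if frame is None:
            frame = last if strategy == "repeat" else silence
        played.append(frame)
        last = frame
    return played


# Samples 1, 2, 4, and 5 arrive; sample 3 is lost.
frames = [b"\x01" * 20, b"\x02" * 20, None, b"\x04" * 20, b"\x05" * 20]
repeated = conceal(frames)
silenced = conceal(frames, strategy="silence")
print(repeated[2] == frames[1], silenced[2] == bytes(20))
```

Either way, samples 4 and 5 still go out in their own slots, which is exactly what UDP's deliver-on-arrival behavior allows.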
The problem is that this doesn't work with TCP. Instead, what
happens is that when sample 3 is lost, the TCP implementation
sits on samples 4 and 5 until it receives sample 3. This
means that it's waiting for the sender to retransmit that
sample. As we discussed before, this takes on the order of a second.
During that time period, the receiver has no real choice but to play
silence, so this is perceived as a dropout.
Once the retransmission happens, the receiver needs to try to
recover. If all has gone well, the sender has sent not only
sample 3, but most of the samples that would have fit in the
next second or so. At this point the sender and receiver
are synchronized from a network perspective, but the
speaker on the receiver's computer is hopelessly behind.
The usual procedure is just to start playing the sound
where you would have been if the loss and retransmit had never
occurred, so it just sounds like a 1 second dropout. This
gives something like this:
| Time (ms) | Sample played |
| 50 | 1 |
| 70 | 2 |
| 90-1090 | Silence |
| 1110 | 54 |
| 1130 | 55 |
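Here's a small Python sketch of the arithmetic behind a dropout like this. It assumes the 20 ms interval and 50 ms delay from the earlier example, a stall of exactly 1000 ms, and that playback resumes at the first playout slot after the retransmitted data arrives; the function names are mine.

```python
SAMPLE_INTERVAL_MS = 20
PROPAGATION_DELAY_MS = 50


def play_time(n):
    """Scheduled playout time (ms) of sample number n, starting from 1."""
    return PROPAGATION_DELAY_MS + (n - 1) * SAMPLE_INTERVAL_MS


def resume_after_stall(lost_sample, stall_ms):
    """Return (silence_start_ms, resume_time_ms, resume_sample).

    The stall begins at the lost sample's playout slot. Every slot up to
    and including the moment the retransmission arrives is played as
    silence, and playback picks up at the next slot with whatever sample
    was originally scheduled there.
    """
    silence_start = play_time(lost_sample)
    missed_slots = stall_ms // SAMPLE_INTERVAL_MS + 1
    resume_sample = lost_sample + missed_slots
    return silence_start, play_time(resume_sample), resume_sample


print(resume_after_stall(lost_sample=3, stall_ms=1000))
```

A single lost 20-byte packet costs about fifty samples of audio here, because everything queued behind it has missed its slot by the time it can be delivered.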
Obviously, if there's any reasonable rate of packet loss,
this starts to sound pretty terrible. But things can get
even worse. Remember the flow control feature of TCP?
If enough packets get lost, then when the sender
retransmits, he'll have a big backlog of untransmitted
samples. This takes a while to work through, and the listener either
gets delayed audio (which sounds really weird)
or has to endure a multi-second dropout, which is basically
intolerable.
At this point it's worth asking why streaming audio doesn't
sound terrible even though it runs over TCP. The reason is
that receivers buffer seconds to minutes' worth of
audio before they start playing. That way, if there's
a packet loss, they just keep right on playing out of the
buffer with no interruption. If there's a big enough
network problem, you can empty the buffer and that's why
you'll sometimes see streaming audio or video pause, but
in general this strategy works fine if you have a big
enough buffer. Unfortunately, you can't use this strategy
for voice because it's interactive. It would be fairly
intolerable to have to wait 10 seconds after you've said
something before you start hearing the other person's reply.
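The tradeoff can be put in numbers with a deliberately crude Python model. It assumes a single network stall, a buffer that is full when the stall begins, and a 50 ms network delay; the `evaluate_buffer` function is hypothetical and ignores refill time.

```python
def evaluate_buffer(buffer_ms, stall_ms, network_delay_ms=50):
    """Estimate what a listener experiences with a given playout buffer.

    audible_gap_ms: how much of the stall the buffer fails to cover.
    mouth_to_ear_delay_ms: how far behind the sender playback always runs.
    """
    return {
        "audible_gap_ms": max(0, stall_ms - buffer_ms),
        "mouth_to_ear_delay_ms": network_delay_ms + buffer_ms,
    }


streaming = evaluate_buffer(buffer_ms=10_000, stall_ms=1000)
voice = evaluate_buffer(buffer_ms=0, stall_ms=1000)
print(streaming)
print(voice)
```

The buffer that hides the stall is the same thing that adds the delay, which is why it suits one-way streaming and not a conversation.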
Planned future posts in this series:
- Why congestion control makes the problem worse
- Why ACK spoofing is a bad idea
- Why you shouldn't use multiple TCP connections to reduce delay for VoIP
Acknowledgement: Thanks to Cullen Jennings for review and helpful
suggestions.