Tap annoyances

Transparent network taps like those made by NetOptics and VSS are a standard method for doing network monitoring. Unfortunately, when combined with an unhelpful operating system/NIC combination, they can produce some problematic artifacts. Here's a TCP trace from our testbed. I've replaced the IP addresses and ports with C: for client and S: for server so it will fit on the page.

1 C: S 4198748444:4198748444(0) win 65535 <mss 1460,nop,nop,sackOK,nop,wscale 1,nop,nop,timestamp 1416746 0>
2 S: S 1678625628:1678625628(0) ack 4198748445 win 5792 <mss 1460,sackOK,timestamp 13870629 1416746,nop,wscale 2>
3 C: . ack 1 win 33304 <nop,nop,timestamp 1416746 13870629>
4 C: P 1:83(82) ack 1 win 33304 <nop,nop,timestamp 1416746 13870629>
5 S: . ack 83 win 1448 <nop,nop,timestamp 13870630 1416746>
6 S: P 1:123(122) ack 83 win 1448 <nop,nop,timestamp 13870630 1416746>
7 C: P 83:126(43) ack 123 win 33304 <nop,nop,timestamp 1416746 13870630>
8 S: . ack 126 win 1448 <nop,nop,timestamp 13870671 1416746>
9 S: . ack 223 win 1448 <nop,nop,timestamp 13870671 1416750>
10 S: . 123:1571(1448) ack 223 win 1448 <nop,nop,timestamp 13870672 1416750>
11 S: . 1571:3019(1448) ack 223 win 1448 <nop,nop,timestamp 13870672 1416750>
12 S: . 3019:4467(1448) ack 223 win 1448 <nop,nop,timestamp 13870672 1416750>
13 C: P 126:223(97) ack 123 win 33304 <nop,nop,timestamp 1416750 13870671>

What you need to note here is that packets 9-12 from the server contain ACKs for byte 223 from the client. But that byte hasn't been seen on the wire yet. It's not seen until packet 13, which, as you can see, contains an ACK for byte 123 from the server, which was delivered back in packet 6. This is a clear causality violation because an ACK for a packet can't precede the packet it's ACKing. The problem here is that the tap (a NetOptics 96298) delivers the data on two different interfaces, one for the client to server direction and one for the server to client direction (this lets you tap full duplex GigE). Either the tap or the host computer (we suspect the host computer and/or NICs) is buffering packet 13 until after it has already delivered packets 9-12, so the application gets them in the wrong order. Note that this problem is especially bad in testbed-type situations because the host computers and the network between them are so fast that actual data packets can easily get reordered.
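To make the violation concrete, here is a minimal Python sketch that flags any packet whose ACK refers to bytes not yet seen from the other side. The names (Packet, find_violations, TRACE) are made up for illustration, the sequence numbers are the relative ones from the trace above, and the seq values on pure ACKs (which tcpdump doesn't print) are filled in as the sender's next sequence number, which is my assumption. It ignores the handshake, sequence wraparound, and retransmits.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    num: int      # packet number in the trace above
    src: str      # "C" (client) or "S" (server)
    seq: int      # relative sequence number of the first payload byte
    length: int   # payload length in bytes
    ack: int      # relative ACK number

# Packets 4-13 from the trace; seq on pure ACKs is an assumed value.
TRACE = [
    Packet(4, "C", 1, 82, 1),
    Packet(5, "S", 1, 0, 83),
    Packet(6, "S", 1, 122, 83),
    Packet(7, "C", 83, 43, 123),
    Packet(8, "S", 123, 0, 126),
    Packet(9, "S", 123, 0, 223),
    Packet(10, "S", 123, 1448, 223),
    Packet(11, "S", 1571, 1448, 223),
    Packet(12, "S", 3019, 1448, 223),
    Packet(13, "C", 126, 97, 123),
]

def find_violations(packets):
    """Return the numbers of packets that ACK bytes not yet seen."""
    # After the handshake both sides are at relative sequence number 1.
    seen_end = {"C": 1, "S": 1}
    bad = []
    for p in packets:
        other = "S" if p.src == "C" else "C"
        if p.ack > seen_end[other]:
            bad.append(p.num)
        seen_end[p.src] = max(seen_end[p.src], p.seq + p.length)
    return bad

print(find_violations(TRACE))
```

On the trace above this flags exactly the four server packets discussed in the text.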

The naive thing to do here is to ignore this problem and just deliver the data whenever you get it. This is good enough for simple processing but doesn't get the job done if you're really trying to process the connection. In this case, it leads to the HTTP response preceding the HTTP request, which isn't really acceptable. The right thing to do is to reorder (or rather de-reorder) the packets. If you assume that A must precede ACK(A) then you can hold any packet with ACK(A) (and all TCP packets contain ACKs) until A arrives. This requires a bit more buffering but does mostly work. The bad news is that the logic for doing this is fairly hairy (especially when you consider that TCP stacks are already quite hairy).
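As a rough illustration of the "hold ACK(A) until A arrives" idea, here is a Python sketch. It is not our production code, and all the names (Packet, reorder) are hypothetical. It assumes each tap interface delivers its own direction in order, uses relative sequence numbers, and skips everything that makes the real thing hairy: loss, retransmits, sequence wraparound, SYN/FIN, and timeouts for packets that never show up.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Packet:
    num: int      # arrival index at the monitor
    src: str      # "C" (client) or "S" (server)
    seq: int      # relative sequence number of the first payload byte
    length: int   # payload length in bytes
    ack: int      # relative ACK number

def reorder(packets):
    """Hold each packet until the bytes it ACKs have been delivered."""
    seen_end = {"C": 1, "S": 1}            # next byte expected per side
    held = {"C": deque(), "S": deque()}    # per-direction FIFO of held packets
    out = []
    for pkt in packets:
        held[pkt.src].append(pkt)
        progress = True
        while progress:                    # releasing one side may unblock the other
            progress = False
            for src in ("C", "S"):
                other = "S" if src == "C" else "C"
                queue = held[src]
                while queue and queue[0].ack <= seen_end[other]:
                    p = queue.popleft()
                    seen_end[src] = max(seen_end[src], p.seq + p.length)
                    out.append(p)
                    progress = True
    # Anything still held ACKs data we never saw; flush it rather than drop it.
    out.extend(held["C"])
    out.extend(held["S"])
    return out

# Packets 7-13 of the trace, starting from the state after packet 6.
arrivals = [
    Packet(7, "C", 83, 43, 123),
    Packet(8, "S", 123, 0, 126),
    Packet(9, "S", 123, 0, 223),
    Packet(10, "S", 123, 1448, 223),
    Packet(13, "C", 126, 97, 123),
]

def demo():
    # Seed with packets 4 and 6 so both sides are where the trace left them.
    seed = [Packet(4, "C", 1, 82, 1), Packet(6, "S", 1, 122, 83)]
    return [p.num for p in reorder(seed + arrivals)]

print(demo())
```

Packets behind a held packet in the same direction stay held too, so the per-interface order is preserved and only the interleaving between the two directions changes.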

The other two alternatives are to use a spanning port (only works well with fairly expensive switches and requires reconfiguration) or to get a tap that only delivers on a single interface and so should be order preserving. We just got one of these. Word from the tech who delivered it is that it is order-preserving, but we don't have it working at all yet, so I can't report on that.

UPDATE: I should mention that you don't actually need things to arrive strictly in order, as long as the timestamps on the packets represent their actual arrival time. You can buffer and sort. Unfortunately, our timestamps aren't lining up either.
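For what it's worth, buffer-and-sort would look something like the Python sketch below if the timestamps were trustworthy. The function name, the (timestamp, payload) tuple layout, and the window parameter are all assumptions of mine: each packet is held until something at least window seconds newer has arrived, on the theory that nothing older can still be stuck in a buffer by then.

```python
import heapq

def sort_with_window(packets, window):
    """Yield (timestamp, payload) pairs in timestamp order.

    packets: iterable of (timestamp, payload) in arrival order.
    window:  how far out of order (in seconds) a packet may arrive.
    """
    heap = []
    newest = float("-inf")
    for index, (ts, payload) in enumerate(packets):
        # index breaks ties so payloads are never compared directly
        heapq.heappush(heap, (ts, index, payload))
        newest = max(newest, ts)
        # Release everything old enough that no earlier packet can still arrive.
        while heap and heap[0][0] <= newest - window:
            ts0, _, p = heapq.heappop(heap)
            yield ts0, p
    while heap:                            # end of capture: flush the rest
        ts0, _, p = heapq.heappop(heap)
        yield ts0, p

arrivals = [(1.0, "a"), (3.0, "c"), (2.0, "b"), (4.0, "d")]
print([p for _, p in sort_with_window(arrivals, window=1.5)])
```

A packet that arrives later than the window allows will still come out in the wrong place, so the window has to be at least as large as the worst-case skew between the two interfaces.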


Stupid question: Is it possible to use this causality of sequence numbers and acks (assuming both sides are proper TCP senders) to restructure the flow? It doesn't work for UDP, however.

But there's no way you can avoid this with a commodity PC as the reader: you've got so much buffering in random places (the NIC, PCI, etc.) that it's going to be impossible to get a commodity PC to order the packets perfectly deterministically. It's not a "stupid host"; it's that the PC architecture isn't designed for the hard deterministic realtime processing needed to get the ordering you want.

So you are going to have to have a multiplexing tap, or make a "tap" which puts some ID/ordering information in an additional packet header or in a subsequent packet burst.

There is one other option, IF you are willing to take a latency (and unfortunately bandwidth) hit:

Make the PC monitor a bridge. In kernel-mode Click on Linux, use FromDevice->Queue->ToDevice to bridge the two Ethernets. Now you get 100% determinism because you CAN'T get the response until you've sent the previous packets. And with a l337 system (PCI-E Ethernets), you could probably get close to line rate this way.

Re: causality, that's exactly what we do. Messy, but it works.

We certainly could operate as a bridge, but the problem is that then you're in the critical path for the customer's system, which makes it a much harder sell.

The timestamps are added late in the process (on the host after things get through a horribly nondeterministic process, including possible nondeterminism in the tap).

You're gonna have to either do a bridge (with failover: monitor the bridge and if it stops, change a VLAN config on a switch), do a merging tap which does things right, or build your OWN merging or nonmerging tap which does things right.

I wouldn't even trust the output of the unsynchronized taps, as that alone may be nondeterministic due to random network effects between the two ports.

How long a timeline do you have? I'm working with a VERY cool board from Nick McKeown and Greg Watson of Stanford (4x GigE and an FPGA, on a PCI card). In a few months they may have a more publicly available version. With something like that (in a couple of months at least) it would be easy to hack up an "add a VLAN tag with the VLAN # being a sequence number" tap, as long as the traffic monitored is not VLAN encapsulated.

Have you considered using Endace DAG cards? They supposedly offload all capture (including time-stamping) from the kernel; this should significantly reduce (though perhaps not entirely eliminate) instances of out-of-order packets.

We haven't tried these ourselves (it's on our list), but the RIPE NCC Test Traffic Measurement (TTM) project has been experimenting with them. I believe they're planning to include them in their next-gen TTM boxes.

The DAG looks suitable IMO: They have a "Cheap" 2 port copper card, which unless they screwed up should be synchronized on the timestamps between the ports.
