After a few years of deployment, our passive capture system is pretty
stable, but there's nothing so stable that it can't be screwed
up by some programmer. I was recently the author of
such a screwup when I decided to make the system multithreaded.
It looked like
a straightforward job, and it worked fine in smoke testing, but
almost as soon as we started running a lot of traffic through
it, it started to crash consistently with memory errors (SEGVs)
in the TIME_WAIT
cleanup code.
TIME_WAIT state
A little background is probably in order here. TCP doesn't have
connection identifiers; the way you distinguish packets from
different connections is by the source and destination IP
addresses and port numbers. However, those aren't globally
unique, so it's possible to confuse packets from two different connections that share the same host/port quartet if they occur sufficiently close together. In
order to avoid this problem, when a TCP connection is
closed it's put into a TIME_WAIT state. Any packets that
potentially correspond to that connection are rejected until
TIME_WAIT expires. The timer is set to be twice the maximum
segment lifetime (2MSL). This makes sure that there aren't
any packets from the closed connection hanging around the
network. Once the 2MSL timer expires, then the host/port
quartet can be reused.
TIME_WAIT implementation
Our system has two relevant data structures. Every connection
has an associated tcp_conn
object. These objects
are stored in a hash table keyed by the host/port quartet.
Whenever a new packet comes in we look up the connection
in the hash table and then process the packet.
The way we implement TIME_WAIT in our code is using a sorted
queue. When a connection is closed we put it at the end of
the 2MSL queue (but it stays in the connection table).
Since the timer is always the same length, this
automatically sorts things. Periodically we traverse the
queue from the front discarding any connection whose timer
has expired. The data structure looks like this:
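(Roughly speaking, in a sketch with made-up type and field names; only the tcp_conn name and the use of BSD sys/queue, which comes up again below, are from the real system.)

#include <sys/queue.h>
#include <netinet/in.h>
#include <stdint.h>
#include <time.h>

#define TCP_CONN_BUCKETS 4096   /* arbitrary bucket count for the sketch */

typedef struct tcp_conn_ {
    /* host/port quartet: the key for the connection hash table */
    struct in_addr src_addr, dst_addr;
    uint16_t       src_port, dst_port;

    int            state;     /* e.g., STATE_ESTABLISHED, STATE_TIME_WAIT, ... */
    time_t         expiry;    /* when this connection's 2MSL timer goes off */

    struct tcp_conn_      *hash_next;  /* chaining within a hash bucket */
    TAILQ_ENTRY(tcp_conn_) msl2_entry; /* linkage on the 2MSL queue */
} tcp_conn;

tcp_conn *conn_table[TCP_CONN_BUCKETS];       /* all live connections, keyed by quartet */
TAILQ_HEAD(msl2_queue, tcp_conn_) msl2_queue; /* closed connections, oldest first; sorted
                                                 by expiry because the timer is constant */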
The crash
All this stuff worked great until we tried to multithread the
system. The implementation I decided on was to have a bunch of
worker threads (one for each processor), each of which works
semi-independently. The global TCP data structures (connection
table, 2MSL queue, etc.) are shared by all the threads. Of
course, this means that any reads and writes have to be
surrounded by thread locks, which I duly added.
As noted previously, this all worked fine in smoke testing,
but when we started to actually run performance tests,
it started to crash pretty reliably. You'd be running
the 2MSL queue flush routine and one of the entries in
the queue would be total garbage and you'd get a segmentation
fault. At this point you're probably thinking we had a
locking problem and that's what we thought too. Sure
enough, when we went looking for it, we found that in
a moment of insanity I'd surrounded the routines that
modified the conn table and the 2MSL queue with read (shared)
locks rather than write (exclusive) locks. This is obviously
bad and I fixed it. So far so good, right? Wrong. A little
more testing revealed that we were still getting this
kind of crash and detailed code inspection didn't turn up
any missing locks.
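For concreteness, the read-vs-write lock mistake and its fix look roughly like this in pthread terms (a sketch with made-up function names; the real code goes through its own lock wrappers and uses the structures sketched earlier):

#include <pthread.h>

pthread_rwlock_t tcp_state_lock = PTHREAD_RWLOCK_INITIALIZER;

/* BUG: a read (shared) lock admits any number of concurrent holders, so two
 * threads can end up mutating the 2MSL queue at the same time. */
void msl2_enqueue_buggy(tcp_conn *conn) {
    pthread_rwlock_rdlock(&tcp_state_lock);
    TAILQ_INSERT_TAIL(&msl2_queue, conn, msl2_entry);
    pthread_rwlock_unlock(&tcp_state_lock);
}

/* Fixed: anything that modifies the connection table or the 2MSL queue takes
 * the write (exclusive) lock. */
void msl2_enqueue(tcp_conn *conn) {
    pthread_rwlock_wrlock(&tcp_state_lock);
    TAILQ_INSERT_TAIL(&msl2_queue, conn, msl2_entry);
    pthread_rwlock_unlock(&tcp_state_lock);
}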
Debugging the Problem (1)
Concurrency problems are notoriously hard to debug because
they generally depend on timing. So, when you bring the program into the debugger and single-step through it, all the timing changes and you can't reproduce the problem. The situation
was even worse in this case because you had to run tens
of thousands of connections through the system before it
would crash.[1] To make matters even more confusing, setting
the number of worker threads to 1 didn't fix the problem,
casting doubt on the whole concurrency theory.
We relatively quickly determined that some connections that
were supposed to have been destroyed were somehow ending
up back on the 2MSL queue. This actually makes the situation
more rather than less confusing, since it raises two questions:
- Why weren't these connections destroyed? The destructor (yeah, this is C, but it's still a destructor) is supposed to zero the connection data structure.
- How did they get back on the queue?
The destructor looked fine, so we had to get primitive. We added some sentinel
values before and after the queue pointers to detect over and underruns.
No joy — they were never triggered.
We then started littering the program with printfs, but nothing obvious appeared.
Finally, we added a secondary state variable
in the connection that was set in the 2MSL queue reaper proper.
Sure enough, when these connections got resurrected, they
already had the variable set. So, why weren't they being
destroyed?
A bunch more printfs later we had the answer. The 2MSL queue reaper locked the global TCP state data structures, as it should. Then the destructor also tried to lock those data structures, which makes sense because it can be called in contexts where they're not already locked. Unfortunately, the pthreads locks we're using aren't recursive, so the second lock attempt fails even though it comes from the same thread. Worse, we don't ignore that error; instead of crashing with a "can't happen" assertion like we probably should have (though ignoring destructor failures is a pretty common practice), we handle the error gracefully and exit the destructor without zeroizing the connection structure or removing it from the connection table.
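In sketch form (made-up names again, building on the structures sketched earlier), the destructor was doing something like this:

#include <pthread.h>
#include <string.h>
#include <stdlib.h>

extern pthread_rwlock_t tcp_state_lock;   /* the global TCP structure lock */
void conn_table_remove(tcp_conn *conn);   /* hypothetical helper */

int tcp_conn_destroy(tcp_conn *conn) {
    int r = pthread_rwlock_wrlock(&tcp_state_lock);
    if (r != 0) {
        /* The 2MSL reaper already holds this lock and it isn't recursive,
         * so when the reaper calls us this attempt fails. Handling that
         * "gracefully" is the bug: we bail out without removing the
         * connection from the connection table or zeroizing it. */
        return r;
    }
    conn_table_remove(conn);
    memset(conn, 0, sizeof(*conn));
    free(conn);
    pthread_rwlock_unlock(&tcp_state_lock);
    return 0;
}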
Obviously, this is a bug and it's somehow related to our real
bug. More importantly, fixing it by avoiding double locks
makes the crash go away. However, it doesn't really explain
what the problem was: before we call the destructor we do
remove the connection from the TIME_WAIT queue, so this should
just look like a memory leak. Why does it cause crashes?
I hate mysteries and I hate leaving bugs partially fixed,
so we kept looking.
Debugging the Problem (2)
Let's review what we know so far. Connections close and get placed
at the end of the 2MSL queue. That queue is reaped from the front and
any connections whose timers have expired are destroyed. However,
due to our faulty thread locking code, those entries are removed
from the 2MSL queue but never destroyed. So, the key question is how this differs from what would happen if things were functioning correctly:
1. The connections are still in the connection table.
2. The connection structures are never zeroized.
3. The connection structures are never freed.
It's hard to see how (3) can be causing our problem. Sure, it
leaks memory but that causes a different kind of crash—running
out of memory—and we're not close to that yet. Similarly,
it's hard to see how failing to zeroize can be causing the
problem since this memory shouldn't be being used anyway. The
only reason we zeroize is to help detect bugs. So, it must
be something about it being in the connection table: more
packets must be coming in and being processed using this connection. Some printfs verified this. But why does that cause
a problem?
The answer to this question isn't intuitive, so what we ended up doing was brute-forcing it. Once we realized the problem was reuse, we littered that code branch with checks for whether the 2MSL queue contained bad data (luckily, the garbage was always ending up at the front of the queue, which was puzzling but turned out to be a clue we were only able to interpret in retrospect).
This let us track down where it was getting corrupted.
The answer turns out to be
TIME_WAIT
assassination. Ordinarily, when packets come in for a connection in TIME_WAIT we simply
discard them.
However, if you receive a SYN (connection open)
packet that matches a connection in TIME_WAIT but whose sequence number is clearly for a new connection, you cancel
the 2MSL timer and start the new connection. (This actually
isn't recommended behavior and not all stacks
do this, but it's common enough that a passive decoder needs
to implement it.) Ordinarily, this isn't something you see
a lot, but because the traffic we're running is from a simulator
with only a few client machines, you see a lot of host/port
quartet reuse.
So, what happens when a connection is TIME_WAIT assassinated?
- The packet comes in and we find the relevant connection in
the connection table.
- We notice we're in TIME_WAIT state.
- If the packet is a SYN and the sequence number is in the right range
we need to assassinate.
- We dequeue the connection from the TIME_WAIT queue.
- We remove it from the TCP connection table.
- We destroy it.
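In code, the assassination path looks roughly like the sketch below; the packet fields, flag and sequence-number tests, and helper names are all illustrative, and tcp_conn_destroy is the (buggy) destructor sketched above.

void tcp_handle_time_wait(tcp_conn *conn, const tcp_packet *pkt) {
    if ((pkt->flags & FLAG_SYN) && SEQ_GT(pkt->seq, conn->rcv_nxt)) {
        /* TIME_WAIT assassination: cancel the 2MSL timer... */
        TAILQ_REMOVE(&msl2_queue, conn, msl2_entry);
        /* ...but if this connection was already reaped (and then only
         * half-destroyed because of the destructor bug), it is no longer
         * on the queue, and this second remove silently corrupts it. */
        conn_table_remove(conn);
        tcp_conn_destroy(conn);
        /* then process pkt as the start of a brand-new connection */
    } else {
        /* ordinary TIME_WAIT behavior: discard the packet */
    }
}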
Ordinarily this works smoothly, but what happens here is that
connections which have already been partially destroyed stick
around and get assassinated. Those connections aren't
actually on the 2MSL queue, having been removed when we reaped
them earlier. However, in the queue implementation
we're using (BSD sys/queue), entries don't know that they
have been removed from a queue, so when you re-remove them,
you get weird side effects, namely a spurious re-insert.
Remove; Remove == Insert
So why do double removes of a queue entry cause spurious reinserts?
The generic removal procedure for an element E in a doubly
linked list L is something like:
1   If E->prev == NULL (E is the first element in L)
2       Set L->first = E->next
3   else
4       Set E->prev->next = E->next
5   If E->next == NULL (E is the last element in L)
6       Set L->last = E->prev
7   else
8       Set E->next->prev = E->prev
Note that we don't need to change E at all, since it's being taken out of the list anyway. So, what happens if E is the first element in the list,[2] we remove it, and then we remove it again? Well, when we do step 1, E->prev == NULL,[3] so we set L->first = E->next. Now, E->next points to whatever was right after E on the list at the time of removal. That element could have been freed, destroyed, even re-allocated by now. Since E probably wasn't also last, we jump to step 8, which might cause a crash, but probably won't.[4] The end result is that L->first ends up pointing to some random memory location. When we finally get around to looking at L, our code freaks out.
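To see the mechanics concretely, here's a toy, self-contained version of the generic removal procedure above (not the actual sys/queue macros) that demonstrates the spurious reinsert:

#include <stdio.h>

typedef struct elem { struct elem *prev, *next; int id; } elem;
typedef struct list { elem *first, *last; } list;

/* Literal translation of steps 1-8 above. */
static void list_remove(list *L, elem *E) {
    if (E->prev == NULL) L->first = E->next;       /* steps 1-2 */
    else                 E->prev->next = E->next;  /* step 4 */
    if (E->next == NULL) L->last = E->prev;        /* steps 5-6 */
    else                 E->next->prev = E->prev;  /* step 8 */
}

int main(void) {
    elem a = { NULL, NULL, 1 }, b = { NULL, NULL, 2 };
    list L = { &a, &b };
    a.next = &b; b.prev = &a;        /* the list is: a <-> b */

    list_remove(&L, &a);             /* legitimate: list is now just b */
    list_remove(&L, &b);             /* legitimate: list is now empty */
    list_remove(&L, &a);             /* double remove of a: a->prev is still NULL
                                        and a->next still points at b, so step 2
                                        "reinserts" b at the head of the list even
                                        though b was removed long ago (and, in the
                                        real system, may already be freed) */
    printf("L.first is element %d\n", L.first ? L.first->id : 0);  /* prints 2 */
    return 0;
}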
Note that in this particular case, what happens
is that whatever entry was immediately after the entry
getting doubly removed gets reinserted at the head of the queue,
even if that entry has been deleted and/or destroyed long
ago. What makes this so hard to track down is that the offending
connection isn't the one that gets reinserted into the 2MSL queue, so a lot of common debugging techniques (e.g., trying to
track the fate of the offending connection) don't work well.
1. Don't tell me to use Valgrind. It crashed
and didn't find anything. I miss Purify (which might not
have found anything but most likely wouldn't have
crashed).
2. Which is what always happens in 2MSL queue
reaping. Because the connections are sorted on the queue, we
never reap a connection unless all previous connections have
also been reaped.
3. Note that there's a dependency on the
way I'm checking whether E
is first. If I compared
to L->first
the behavior would be different and we'd always
end up executing line 4, which would (in this case) induce a crash
because E->prev == NULL
. Of course, BSD sys/queue uses
quite different algorithms.
4. Remember that destruction of E->next probably failed as well, so this isn't even a free memory write. And FMWs don't generally cause immediate crashes, though Purify/Valgrind/etc. do catch them.
UPDATE: Fixed a bug in line 4, found by Kevin Ballard.