« Where to find the Internets in YVR | Main | Automated International checkin »
December 30, 2006
A complicated bug
After a few years of deployment, our passive capture system is pretty stable, but there's nothing so stable that it can't be screwed up by some programmer. I was recently the author of such a screwup when I decided to make the system multithreaded.It looked like a straightforward job, and it worked fine in smoke testing, but almost as soon as we started running a lot of traffic through it, it started to crash consistently with memory errors (SEGVs) in the TIME_WAIT cleanup code.
TIME_WAIT state
A little background is probably in order here. TCP doesn't have
connection identifiers; the way you distinguish packets from
different connections is by the source and destination IP
addresses and port numbers. However, those aren't globally
unique, so it's potentially possible to confuse packets
from two different connections with the same host/port
quartets if they occur sufficiently close together. In
order to avoid this problem, when a TCP connection is
closed it's put into a TIME_WAIT state. Any packets that
potentially correspond to that connection are rejected until
TIME_WAIT expires. The timer is set to be twice the maximum
segment lifetime (2MSL). This makes sure that there aren't
any packets from the closed connection hanging around the
network. Once the 2MSL timer expires, then the host/port
quartet can be reused.
TIME_WAIT implementation
Our system has two relevant data structures. Every connection
has an associated tcp_conn object. These objects
are stored in a hash table keyed by the host/port quartet.
Whenever a new packet comes in we look up the connection
in the hash table and then process the packet.
The way we implement TIME_WAIT in our code is using a sorted queue. When a connection is closed we put it at the end of the 2MSL queue (but it stays in the connection table). Since the timer is always the same length, this automatically sorts things. Periodically we traverse the queue from the front discarding any connection whose timer has expired. The data structure looks like this:
The crash
All this stuff worked great until we tried to multithread the system. The implementation I decided on was to have a bunch of worker threads (one for each processor), each of which works semi-independently. The global TCP data structures (connection table, 2MSL queue, etc.) are shared by all the threads. Of course, this means that any reads and writes have to be surrounded by thread locks, which I duly added.
As noted prevously, this all worked fine in smoke testing, but when we started to actually run performance tests, it started to crash pretty reliably. You'd be running the 2MSL queue flush routine and one of the entries in the queue would be total garbage and you'd get a segmentation fault. At this point you're probably thinking we had a locking problem and that's what we thought too. Sure enough, when we went looking for it, we found that in a moment of insanity I'd surrounded the routines that modified the conn table and the 2MSL queue with read (shared) locks rather than write (exclusive) locks. This is obviously bad and I fixed it. So far so good, right? Wrong. A little more testing revealed that we were still getting this kind of crash and detailed code inspection didn't turn up any missing locks.
Debugging the Problem (1)
Concurrency problems are notoriously hard to debug because
they generally depend on timing. So, when you bring the program into
the debugger and single step through it all the timing
changes and you can't reproduce the problem. The situation
was even worse in this case because you had to run tens
of thousands of connections through the system before it
would crash.1 To make matters even more confusing, setting
the number of worker threads to 1 didn't fix the problem,
casting doubt on the whole concurrency theory.
We relatively quickly determined that some connections that were supposed to have been destroyed were somehow ending up back on the 2MSL queue. This actually makes the situation more rather than less confusing, since there are two questions.
- The destructor (yeah, this is C, but it's still a destructor) is supposed to zero the connection data structure.
- How did it get back on the queue?
The destructor looked fine, so we had to get primitive. We added some sentinel values before and after the queue pointers to detect over and underruns. No joy — they were never triggered. We and started littering the program with printfs but nothing obvious appeared. Finally, we added a secondary state variable in the connection that was set in the 2MSL queue reaper proper. Sure enough, when these connections got resurrected, they already had the variable set. So, why weren't they being destroyed?
A bunch more printfs later we had the answer. The 2MSL queue reaper locked the global TCP state data structures--as it should. Then the destructor also tried to lock those data structures. This makes sense because it can be called in contexts where they're not already locked. Unfortunately, pthreads doesn't support recursive locks, so the second lock fails even though it's in the same thread. Unfortunately, we don't ignore the error but instead of crashing with a "can't happen" like we probably should have (though ignoring destructor failures is a pretty common practice) we handle the error smoothly and exit the destructor function without zeroizing the connection structure or removing it from the connection table.
Obviously, this is a bug and it's somehow related to our real bug. More importantly, fixing it by avoiding double locks makes the crash go away. However, it doesn't really explain what the problem was: before we call the destructor we do remove the connection from the TIME_WAIT queue, so this should just look like a memory leak. Why does it cause crashes? I hate mysteries and I hate leaving bugs partially fixed, so we kept looking.
Debugging the Problem (2)
Let's review what we know so far. Connections close and get placed
at the end of the 2MSL queue. That queue is reaped from the front and
any connections whose timers have expired are destroyed. However,
due to our faulty thread locking code, those entries are removed
from the 2MSL queue but never destroyed. So, the key question
is how this is different from if things were functioning correctly:
- The connections are still in the connection table.
- The connection structures are never zeroized.
- The connection structures are never freed.
It's hard to see how (3) can be causing our problem. Sure, it leaks memory but that causes a different kind of crash—running out of memory—and we're not close to that yet. Similarly, it's hard to see how failing to zeroize can be causing the problem since this memory shouldn't be being used anyway. The only reason we zeroize is to help detect bugs. So, it must be something about it being in the connection table: more packets must be coming in and be being processed using this connection. Some printfs verify this. But why does that cause a problem?
The answer to this question isn't intuitive so what we ended up doing was brute forcing it. Once we realized the problem was reuse we littered that code branch with checks for whether the 2MSL queue contained bad data or not (luckily it was always ending up at the front of the queue, which is puzzling but turns out to have been a clue that we were only able to interpret in retrospect.) This let us track down where it was getting corrupted.
The answer turns out to be TIME_WAIT assassination. Ordinarily, when packets come in for a connection in TIME_WAIT we simply discard them. However, if you receive a SYN (connection open) packet that matches a connection in TIME_WAIT but the sequence number clearly is for a new connection, you cancel the 2MSL timer and start the new connection. (This actually isn't recommended behavior and not all stacks do this, but it's common enough that a passive decoder needs to implement it.) Ordinarily, this isn't something you see a lot, but because the traffic we're running is from a simulator with only a few client machines, you see a lot of host/port quartet reuse.
So, what happens when a connection is TIME_WAIT assassinated?
- The packet comes in and we find the relevant connection in the connection table.
- We notice we're in TIME_WAIT state.
- If the packet is a SYN and the sequence number is in the right range we need to assassinate.
- We dequeue the connection from the TIME_WAIT queue.
- We remove it from the TCP connection table.
- We destroy it.
Ordinarily this works smoothly, but what happens here is that connections which have already been partially destroyed stick around and get assassinated. Those connections aren't actually on the 2MSL queue, having been removed when we reaped them earlier. However, in the queue implementation we're using (BSD sys/queue), entries don't know that they have been removed from a queue, so when you re-remove them, you get weird side effects, namely a spurious re-insert.
Remove; Remove == Insert
So why do double removes of a queue entry cause spurious reinserts?
The generic removal procedure for an element E in a doubly
linked list L is something like:
1 If E->prev==NULL (E is the first element in L) 2 Set L->first = E->next 3 else 4 Set E->prev->next = E->next 5 If E->next == NULL (is the last element in L) 6 Set L->last = E->prev 7 else 8 Set E->next->prev = E->prev
Note that we don't need to change E at all, since it's
being taken out of the list anyway. So, what happens if E
is the first element in the list2, we remove it, then we remove
it again? Well, when we do step 1, E->prev==NULL3
so we set L->first = E->next. Now, E->next
points to whatever was right after E on the list at
the time of removal. That element could have been freed, destroyed,
even re-allocated by now. Since E
probably wasn't also last, we jump to step 8, which
might cause a crash, but probably won't4.
The end result is that L->first
ends up pointing to some random memory location. When we finally
get around to looking at L, our code freaks out.
Note that in this particular case, what happens is that whatever entry was immediately after the entry getting doubly removed gets reinserted at the head of the queue, even if that entry has been deleted and/or destroyed long ago. What makes this so hard to track down is that the offending connection isn't the one that gets reinserted into the connection queue, so a lot of common debugging techniques (e.g., trying to track the fate of the offending connection) don't work well.
1. Don't tell me to use Valgrind. It crashed
and didn't find anything. I miss Purify (which might not
have found anything but most likely wouldn't have
crashed).
2. Which is what always happens in 2MSL queue
reaping. Because the connections are sorted on the queue, we
never reap a connection unless all previous connections have
also been reaped.
3. Note that there's a dependency on the
way I'm checking whether E is first. If I compared
to L->first the behavior would be different and we'd always
end up executing line 4, which would (in this case) induce a crash
because E->prev == NULL. Of course, BSD sys/queue uses
quite different algorithms.
4 Remember that destruction of E->next probably
failed as well, so this isn't even a free memory write. And FMR's don't
generally cause immediate crashes, though Purify/Valgrind/etc. do catch
them..
UPDATE: Fixed a bug in line 4, found by Kevin Ballard.
Posted by ekr at December 30, 2006 6:46 AM | Filed under: Software
Comments
Err, did you report the error to the valgrind folks? The mailing list is usually pretty responsive.
And were you running v3.2.1 or the distro provided version?
Posted by: mmk at December 30, 2006 9:15 AM
Just a note, in line 4 of the remove procedure you probably want to assign E->prev->next = E->next, not E->prev.
Posted by: Kevin Ballard at December 30, 2006 7:15 PM
Unfortunately, pthreads doesn't support recursive locks, ...
Many pthreads implementations do (though I think it's an extension to the posix spec). Does your pthread.h have a PTHREAD_MUTEX_RECURSIVE?
Posted by: Alain Gilbert at December 31, 2006 8:56 AM
Very nice post. Thank you.
Posted by: Cd-MaN at January 1, 2007 11:26 PM