A complicated bug

After a few years of deployment, our passive capture system is pretty stable, but there's nothing so stable that it can't be screwed up by some programmer. I was recently the author of such a screwup when I decided to make the system multithreaded.

It looked like a straightforward job, and it worked fine in smoke testing, but almost as soon as we started running a lot of traffic through it, it started to crash consistently with memory errors (SEGVs) in the TIME_WAIT cleanup code.

A little background is probably in order here. TCP doesn't have connection identifiers; the way you distinguish packets from different connections is by the source and destination IP addresses and port numbers. However, those aren't globally unique, so it's possible to confuse packets from two different connections that use the same host/port quartet if they occur sufficiently close together. To avoid this problem, when a TCP connection is closed it's put into a TIME_WAIT state. Any packets that potentially correspond to that connection are rejected until TIME_WAIT expires. The timer is set to twice the maximum segment lifetime (2MSL), which ensures that there aren't any packets from the closed connection still hanging around the network. Once the 2MSL timer expires, the host/port quartet can be reused.

TIME_WAIT implementation
Our system has two relevant data structures. Every connection has an associated tcp_conn object. These objects are stored in a hash table keyed by the host/port quartet. Whenever a new packet comes in we look up the connection in the hash table and then process the packet.

We implement TIME_WAIT using a sorted queue. When a connection is closed, we put it at the end of the 2MSL queue (but it stays in the connection table). Since the timer is always the same length, the queue is automatically sorted. Periodically we traverse the queue from the front, discarding any connection whose timer has expired.

The crash

All this stuff worked great until we tried to multithread the system. The implementation I decided on was to have a bunch of worker threads (one for each processor), each of which works semi-independently. The global TCP data structures (connection table, 2MSL queue, etc.) are shared by all the threads. Of course, this means that any reads and writes have to be surrounded by thread locks, which I duly added.

As noted previously, this all worked fine in smoke testing, but when we started to actually run performance tests, it started to crash pretty reliably. You'd be running the 2MSL queue flush routine and one of the entries in the queue would be total garbage and you'd get a segmentation fault. At this point you're probably thinking we had a locking problem, and that's what we thought too. Sure enough, when we went looking, we found that in a moment of insanity I'd surrounded the routines that modified the conn table and the 2MSL queue with read (shared) locks rather than write (exclusive) locks. This is obviously bad and I fixed it. So far so good, right? Wrong. A little more testing revealed that we were still getting this kind of crash, and detailed code inspection didn't turn up any missing locks.

Debugging the Problem (1)
Concurrency problems are notoriously hard to debug because they generally depend on timing. So, when you bring the program into the debugger and single-step through it, all the timing changes and you can't reproduce the problem. The situation was even worse in this case because you had to run tens of thousands of connections through the system before it would crash.[1] To make matters even more confusing, setting the number of worker threads to 1 didn't fix the problem, casting doubt on the whole concurrency theory.

We relatively quickly determined that some connections that were supposed to have been destroyed were somehow ending up back on the 2MSL queue. This actually makes the situation more rather than less confusing, since there are two questions.

  1. Why wasn't the connection zeroed? The destructor (yeah, this is C, but it's still a destructor) is supposed to zero the connection data structure.
  2. How did it get back on the queue?

The destructor looked fine, so we had to get primitive. We added sentinel values before and after the queue pointers to detect overruns and underruns. No joy: they were never triggered. We started littering the program with printfs, but nothing obvious appeared. Finally, we added a secondary state variable in the connection that was set in the 2MSL queue reaper itself. Sure enough, when these connections got resurrected, they already had the variable set. So, why weren't they being destroyed?

A bunch more printfs later, we had the answer. The 2MSL queue reaper locked the global TCP state data structures, as it should. Then the destructor also tried to lock those data structures. This makes sense, because it can be called in contexts where they're not already locked. Unfortunately, pthreads doesn't support recursive locks, so the second lock fails even though it's in the same thread. Worse, we don't treat that failure as a "can't happen" and crash, as we probably should have (though ignoring destructor failures is a pretty common practice); instead we handle the error smoothly and exit the destructor without zeroizing the connection structure or removing it from the connection table.

Obviously, this is a bug and it's somehow related to our real bug. More importantly, fixing it by avoiding double locks makes the crash go away. However, it doesn't really explain what the problem was: before we call the destructor we do remove the connection from the TIME_WAIT queue, so this should just look like a memory leak. Why does it cause crashes? I hate mysteries and I hate leaving bugs partially fixed, so we kept looking.

Debugging the Problem (2)
Let's review what we know so far. Connections close and get placed at the end of the 2MSL queue. That queue is reaped from the front, and any connections whose timers have expired are destroyed. However, due to our faulty thread locking code, those entries are removed from the 2MSL queue but never destroyed. So, the key question is how this state differs from the one where things are functioning correctly:

  1. The connections are still in the connection table.
  2. The connection structures are never zeroized.
  3. The connection structures are never freed.

It's hard to see how (3) can be causing our problem. Sure, it leaks memory, but that causes a different kind of crash (running out of memory), and we weren't close to that yet. Similarly, it's hard to see how failing to zeroize can be causing the problem, since this memory shouldn't be used anyway; the only reason we zeroize is to help detect bugs. So, it must be something about the connection still being in the connection table: more packets must be coming in and being processed using this connection. Some printfs verified this. But why does that cause a problem?

The answer to this question isn't intuitive, so what we ended up doing was brute forcing it. Once we realized the problem was reuse, we littered that code branch with checks for whether the 2MSL queue contained bad data. (Luckily, the bad data always ended up at the front of the queue, which was puzzling but turned out to be a clue we were only able to interpret in retrospect.) This let us track down where it was getting corrupted.

The answer turns out to be TIME_WAIT assassination. Ordinarily, when packets come in for a connection in TIME_WAIT we simply discard them. However, if you receive a SYN (connection open) packet that matches a connection in TIME_WAIT but whose sequence number is clearly for a new connection, you cancel the 2MSL timer and start the new connection. (This actually isn't recommended behavior and not all stacks do it, but it's common enough that a passive decoder needs to implement it.) Ordinarily, this isn't something you see a lot of, but because the traffic we're running comes from a simulator with only a few client machines, you see a lot of host/port quartet reuse.

So, what happens when a connection is TIME_WAIT assassinated?

  1. The packet comes in and we find the relevant connection in the connection table.
  2. We notice we're in TIME_WAIT state.
  3. If the packet is a SYN and the sequence number is in the right range we need to assassinate.
  4. We dequeue the connection from the TIME_WAIT queue.
  5. We remove it from the TCP connection table.
  6. We destroy it.

Ordinarily this works smoothly, but what happens here is that connections which have already been partially destroyed stick around and get assassinated. Those connections aren't actually on the 2MSL queue, having been removed when we reaped them earlier. However, in the queue implementation we're using (BSD sys/queue), entries don't know that they have been removed from a queue, so when you re-remove them, you get weird side effects, namely a spurious re-insert.

Remove; Remove == Insert
So why do double removes of a queue entry cause spurious reinserts? The generic removal procedure for an element E in a doubly linked list L is something like:

1 If E->prev==NULL (E is the first element in L)
2   Set L->first = E->next
3 else
4   Set E->prev->next = E->next

5 If E->next == NULL (is the last element in L)
6   Set L->last = E->prev
7 else
8   Set E->next->prev = E->prev

Note that we don't need to change E at all, since it's being taken out of the list anyway. So, what happens if E is the first element in the list[2], we remove it, then we remove it again? Well, when we do step 1, E->prev==NULL[3], so we set L->first = E->next. Now, E->next points to whatever was right after E on the list at the time of removal. That element could have been freed, destroyed, even re-allocated by now. Since E probably wasn't also last, we jump to step 8, which might cause a crash, but probably won't[4]. The end result is that L->first ends up pointing to some random memory location. When we finally get around to looking at L, our code freaks out.

Note that in this particular case, what happens is that whatever entry was immediately after the entry getting doubly removed gets reinserted at the head of the queue, even if that entry has been deleted and/or destroyed long ago. What makes this so hard to track down is that the offending connection isn't the one that gets reinserted into the connection queue, so a lot of common debugging techniques (e.g., trying to track the fate of the offending connection) don't work well.

1. Don't tell me to use Valgrind. It crashed and didn't find anything. I miss Purify (which might not have found anything but most likely wouldn't have crashed).
2. Which is what always happens in 2MSL queue reaping. Because the connections are sorted on the queue, we never reap a connection unless all previous connections have also been reaped.
3. Note that there's a dependency on the way I'm checking whether E is first. If I compared to L->first the behavior would be different and we'd always end up executing line 4, which would (in this case) induce a crash because E->prev == NULL. Of course, BSD sys/queue uses quite different algorithms.
4. Remember that destruction of E->next probably failed as well, so this isn't even a free-memory write. And free-memory writes don't generally cause immediate crashes, though Purify/Valgrind/etc. do catch them.

UPDATE: Fixed a bug in line 4, found by Kevin Ballard.


Comments

Err, did you report the error to the valgrind folks? The mailing list is usually pretty responsive.

And were you running v3.2.1 or the distro provided version?

Just a note, in line 4 of the remove procedure you probably want to assign E->prev->next = E->next, not E->prev.

Unfortunately, pthreads doesn't support recursive locks, ...

Many pthreads implementations do (though I think it's an extension to the posix spec). Does your pthread.h have a PTHREAD_MUTEX_RECURSIVE?

Very nice post. Thank you.
