Curse you, OpenSSL error stack

The passive network capture system I've been working on has two features that have an interesting interaction:
  • It decrypts SSL/TLS transactions that it captures.
  • It delivers the captured data via SSL/TLS.

We had a report that the SSL delivery connection was failing with the following error:

error:0407106B:rsa routines:RSA_padding_check_PKCS1_type_2:block type is not 02
error:04065072:rsa routines:RSA_EAY_PRIVATE_DECRYPT:padding check failed

This doesn't make any sense, though, because on the delivery connection the capture system acts as an SSL client and doesn't do any RSA decryption. Also, it only happens when we're decrypting SSL. If we're just capturing HTTP data, or we don't have the SSL keys, then the system works perfectly.

It should be clear at this point that we're getting some kind of error bleedthrough from the SSL decryption, but how? We need one more piece of information to work it out: it only happens when the delivery socket is in non-blocking mode. If it's in blocking mode, everything works great.

What's happening is a result of the way OpenSSL handles errors. It maintains a per-thread (static in our case) error stack. When you call SSL_get_error(ssl, r), it combines the information from r, ssl, and the error stack to decide what to return. Now, here's the important point: the error stack isn't cleared automatically on the call to SSL_write(), so an error left there by earlier, unrelated code is still present when SSL_get_error() looks.

So, here's the sequence of events:

  1. We call RSA_private_decrypt() to decrypt the connection.
  2. The RSA_private_decrypt() fails, populating the error stack.
  3. Sometime later we call SSL_write() to deliver the data.
  4. SSL_write() encounters a blocking condition. This:
    • sets errno to EAGAIN (35)
    • returns -1
    • leaves the error stack untouched.
  5. When we call SSL_get_error(), we get the error from (2) because that's what's on the error stack.
  6. Since we're getting a totally unexpected error, we do the conservative thing and abort the connection.
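
The sequence above can be reproduced with a toy model in plain C. This is not OpenSSL code: the queue, the `toy_` functions, and the `TOY_` constants are all invented for illustration. The one behavior it borrows is that the classification step looks at the pending error queue before it considers whether the write merely wants a retry, which is our understanding of what SSL_get_error() does and matches what we observed.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for OpenSSL's result codes (names and values are hypothetical). */
enum toy_result { TOY_ERROR_NONE, TOY_ERROR_SSL, TOY_ERROR_WANT_WRITE };

/* Toy per-thread error queue; a small fixed array is enough for the sketch. */
#define TOY_QUEUE_MAX 16
struct toy_queue {
    unsigned long codes[TOY_QUEUE_MAX];
    size_t count;
};

/* Steps 1-2: a failing operation records an error and nobody collects it. */
void toy_push_error(struct toy_queue *q, unsigned long code)
{
    assert(q != NULL);
    if (q->count < TOY_QUEUE_MAX)
        q->codes[q->count++] = code;
}

/* Step 4: a non-blocking write that would block returns -1
 * and leaves the error queue exactly as it found it. */
int toy_write_would_block(const struct toy_queue *q)
{
    (void)q; /* deliberately untouched */
    return -1;
}

/* Step 5: classify the result. Pending queue entries win over "retry later". */
enum toy_result toy_get_error(const struct toy_queue *q, int ret)
{
    assert(q != NULL);
    if (ret > 0)
        return TOY_ERROR_NONE;
    if (q->count > 0)
        return TOY_ERROR_SSL;        /* stale error bleeds through here */
    return TOY_ERROR_WANT_WRITE;     /* what the caller should have seen */
}

/* Runs steps 1-5; decrypt_fails selects whether stale errors are left behind. */
enum toy_result toy_deliver_after_decrypt(int decrypt_fails)
{
    struct toy_queue q = { {0}, 0 };
    if (decrypt_fails) {
        toy_push_error(&q, 0x0407106BUL); /* block type is not 02 */
        toy_push_error(&q, 0x04065072UL); /* padding check failed */
    }
    int ret = toy_write_would_block(&q);
    return toy_get_error(&q, ret);
}
```

With no failed decryption, the model reports a retryable condition; with one, the same blocked write is reported as a hard error, which is the point at which we abort in step 6.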

This doesn't happen in blocking mode because SSL_write() never returns an error in step 4 (unless something went really wrong internally), so we never consult SSL_get_error() at all.

This problem doesn't normally occur, for two reasons. First, when you encounter an OpenSSL error you generally call ERR_get_error() to find out what went wrong. Each call removes the entry it returns from the error stack, so collecting the errors also empties the stack. We didn't bother to call it in the RSA decryption code in step (1) because we know what went wrong (the encryption block is badly formatted somehow) and there's nothing to do about it. Second, when something goes wrong in an SSL connection, you typically just throw the connection away, and when you create a new connection SSL_connect() clears the error stack as a side effect.

There's a simple fix: after the failed decryption in step 1, call ERR_get_error() until it returns 0 to collect the errors and empty the error stack. As a belt-and-suspenders move, we also clear the error stack before SSL_write() by calling ERR_clear_error(), just in case there's some other place we've forgotten to collect the error.
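
Here is a minimal sketch of why the details of the cleanup matter, again as a toy model rather than OpenSSL itself (every `errq_` name is hypothetical). It assumes, going by the OpenSSL documentation as we read it, that each ERR_get_error() call removes a single entry from the queue while ERR_clear_error() empties it. The failure reported above left two entries behind, so one collecting call is not enough.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy error queue; not OpenSSL's data structure. */
#define ERRQ_MAX 16
struct errq {
    unsigned long codes[ERRQ_MAX];
    size_t count;
};

void errq_push(struct errq *q, unsigned long code)
{
    assert(q != NULL);
    if (q->count < ERRQ_MAX)
        q->codes[q->count++] = code;
}

/* Models ERR_get_error(): return the earliest entry and remove it; 0 if empty. */
unsigned long errq_get(struct errq *q)
{
    assert(q != NULL);
    if (q->count == 0)
        return 0;
    unsigned long code = q->codes[0];
    memmove(&q->codes[0], &q->codes[1], (q->count - 1) * sizeof q->codes[0]);
    q->count--;
    return code;
}

/* Models ERR_clear_error(): drop everything without looking at it. */
void errq_clear(struct errq *q)
{
    assert(q != NULL);
    q->count = 0;
}

/* Collect every pending error; returns how many were removed. */
size_t errq_drain(struct errq *q)
{
    size_t removed = 0;
    while (errq_get(q) != 0)
        removed++;
    return removed;
}

enum errq_cleanup { ERRQ_GET_ONCE, ERRQ_DRAIN, ERRQ_CLEAR };

/* Push the two reported RSA errors, apply one cleanup strategy,
 * and report how many entries are still pending afterwards. */
size_t errq_pending_after(enum errq_cleanup how)
{
    struct errq q = { {0}, 0 };
    errq_push(&q, 0x0407106BUL);
    errq_push(&q, 0x04065072UL);
    switch (how) {
    case ERRQ_GET_ONCE: (void)errq_get(&q);   break;
    case ERRQ_DRAIN:    (void)errq_drain(&q); break;
    case ERRQ_CLEAR:    errq_clear(&q);       break;
    }
    return q.count;
}
```

In the real code the RSA path would loop on ERR_get_error() until it returns 0, and the delivery path would call ERR_clear_error() immediately before SSL_write().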

Isn't programming fun?

Note that ERR_get_error() does not clear the entire error stack; it just removes one entry per call. So to use OpenSSL safely, whenever a function fails you must call ERR_clear_error() once you've dealt with the error condition.

There was a nifty bug in mod_ssl caused by the same issue a while back. Global state sucks.

OpenSSL is a very poorly designed API. Eric did it ad hoc, it accreted more cruft, and now it's a shambles. IMHO, of course.
