What the heck is Format Preserving Encryption?

| Comments (0) | COMSEC
Voltage (full disclosure: I have a number of friends there and I'm on their TAB) have released a technology they call Format-Preserving Encryption (FPE). The basic technology here is fairly old and is described in a paper by Black and Rogaway, but as far as I know, this is the first attempt to try to put it together in a single commercial package. Below I attempt to describe some of the relevant technical issues, which are sort of interesting.

Why FPE?
The use case for FPE is simple: say you have a database that contains information with multiple levels of sensitivity. So, for instance, if you're Amazon you might have a customer database that any employee can access but you'd like the credit card numbers to be accessible only to employees that really need it.

The classic approach here would be to use database access controls. This works well as long as you trust the DB server, but if, for instance, you want to send a copy of the DB to someone else, then you may not be able to trust their server, so you need to redact the database, which can be a pain. Another problem here is that sometimes sensitive information like CCNs is used for customer identification, which means you can't just redact the CCN. Rather, you need to replace it with something that's unique but doesn't leak the CCN itself. And of course, if someone compromises your database server, then all bets are off.

The problem with simple encryption
The natural alternative is to use encryption. Encrypting the whole database doesn't help, because you want users to have access to most of the database, just not to the sensitive fields. So, what you need to do is encrypt just the sensitive fields. This turns out to be trickier than it looks.

For example let's say we want to encrypt the social security number 123 45 6789 using AES-ECB. So, we might do:

  • Encode into ASCII to give 31 32 33 34 35 36 37 38 39.
  • Pad with 00 to give 31 32 33 34 35 36 37 38 39 00 00 00 00 00 00 00.
  • Encrypt with AES to give 77 6e 2c a5 02 17 7a 5b 19 e4 28 65 26 f3 7e 14

This kind of sucks. Not only have we managed to start with a 9 digit string and end up with a 128-bit random-appearing value, none of the bytes of the output are ASCII digits. So, if our database or database software is expecting to have values for this field that look like SSNs, we've just broken that invariant.

The source of the problem, of course, is that we're using a block cipher in ECB mode, and most block ciphers come in a small number of sizes (64, 128, and 256 bits are the standard ones). A block cipher just randomly maps the input space onto the output space, so ECB mode encryption effectively selects a random b-bit value (where b is the block size). The smaller the fraction of the possible values that are valid, the higher the probability that the output will be invalid. To take the specific case of SSNs, there are approximately 2^{30} valid values (if we think of the trailing zeros as not counting), so the chance of producing a valid value by random chance is vanishingly small (order 2^{-98}).

One thing you might think would make sense would be to use a different mode than ECB, say counter. The problem with counter mode in this case is that you need to use a different section of keystream (or a different key) to encrypt each value to avoid easy cryptanalytic attacks. So, you need some per-value distinguisher that gets carried along with the ciphertext, which expands the amount of storage you need for the encrypted values, even as it keeps the ciphertext small.

As noted above, our big problem is our block size is too large. As noted above, even though SSNs are 9 digits long, they are sparsely packed (for instance letters aren't allowed), so there are approximately 2^{30} valid SSNs, as long as we use a better mapping than straight 1-1 digit correspondence. For instance, think of the 9 digit SSN as a value from 1 to 999,999,999 (not all 9-digit numbers are valid SSNs, but for simplicity, let's pretend they are.) We can represent that in binary as a 30 bit quantity. If we had a 32 bit block cipher, we could encrypt this value with less than 10% expansion, which might be OK under some circumstances (we'll describe how to do better below).

Ordinary block ciphers have blocks much larger than this, of course, but it turns out that there's a generic technique for making block ciphers of arbitrary size (actually, even values only), called Luby-Rackoff (L-R) . The nice thing about L-R is that it's a general construction based on a pseudorandom function (PRF), which we know how to build with standard cryptographic techniques.

Cycle Walking
We can use L-R to build a block cipher with a block size of any number of bits we want, but this still means that our function produces 2^b possible values where b is the block size, but this generally won't line up perfectly with the set of values we want to encipher. To return to our SSN example, we have 10^9 possible values, which means we need a block size of 30 bits, which implies a set size of 2^{30} = 1073741824. So, for any given input value, there's about a 7% chance that it will encrypt to an invalid SSN (greater than 10^9). If the database (or software) is really aggressive about validity checking, then you'll have an unacceptable rejection rate.

To deal with this issue, Black and Rogaway describe a technique they call "cycle-walking". The idea is that we start with an initially valid value (1-999,999,999) and then encrypt it. If the ciphertext is also valid, we stop and emit it. If it's invalid (greater than 999,999,999), we encrypt again, and repeat until we have a valid value. This gives us an encryption procedure that is guaranteed to produce an in-range output. Decryption is done in the same way.

Bottom Line
So, why can't we just use cycle-walking? Because it only works well if the block size is approximately right—if the size of the valid set is a lot smaller than the block size of the cipher, then you have to do a lot of iterations in order to get an in-range result. So, you can't use a 64-bit block cipher in order to encrypt an SSN because you end up having to do a prohibitive number if iterations; you need to use L-R to construct a block cipher of approximately the right size and then use cycle-walking to shave off the last few values.

UPDATE: Paul Hoffman pointed out to me privately that it's not clear how this all relates to FPE. Basically, FPE means the combination of L-R plus cycle walking. This lets you do one-to-one and onto encryption for most set sizes. If the set size is really small, there's another technique (also due to Black and Rogaway): you encrypt all possible input values and then sort the ciphertexts. You then use the index of the ciphertext in the sorted list as the encrypted value. This is obviously prohibitively expensive unless the number of possible values is small because it requires encrypting all possible values and then keeping a very large mapping table.

Leave a comment