Voltage (full disclosure: I have a number of friends there and
I'm on their TAB) have released a technology they call
Format-Preserving
Encryption (FPE). The basic technology here is fairly old
and is described in a
paper by Black and Rogaway,
but as far as I know, this is the first attempt to try to
put it together in a single commercial package. Below I attempt to
describe some of the relevant technical issues, which are sort of
interesting.
Why FPE?
The use case for FPE is simple: say you have a database that contains
information with multiple levels of sensitivity. So, for instance,
if you're Amazon you might have a customer database that any
employee can access but you'd like the credit card numbers to
be accessible only to employees that really need it.
The classic approach here would be to use database access controls.
This works well as long as you trust the DB
server, but if, for instance, you want to send a copy of the
DB to someone else, then you may not be able to trust their
server, so you need to redact the database, which can be a pain.
Another problem here is that sometimes sensitive information
like CCNs is used for customer identification, which means you
can't just redact the CCN. Rather, you need to replace it with
something that's unique but doesn't leak the CCN itself.
And of course, if someone compromises your database server, then
all bets are off.
The problem with simple encryption
The natural alternative is to use encryption. Encrypting the whole
database doesn't help, because you want users to have access
to most of the database, just not to the sensitive fields.
So, what you need to do is encrypt just the sensitive fields.
This turns out to be trickier than it looks.
For example let's say we want to encrypt the social security number
123 45 6789 using AES-ECB. So, we might do:
- Encode into ASCII to give
31 32 33 34 35 36 37 38 39.
- Pad with
00 to give 31 32 33 34 35 36 37 38 39 00 00 00 00 00 00 00.
- Encrypt with AES to give
77 6e 2c a5 02 17 7a 5b 19 e4 28 65 26 f3 7e 14
This kind of sucks. Not only have we managed to start with a 9 digit string
and end up with a 128-bit random-appearing value, none of the bytes of
the output are ASCII digits. So, if our database or database software
is expecting to have values for this field that look like SSNs,
we've just broken that invariant.
The source of the problem, of course, is that we're using a block
cipher in ECB mode, and most block ciphers come in a small number of sizes (64,
128, and 256 bits are the standard ones). A block cipher just
randomly maps the input space onto the output space, so
ECB mode encryption effectively selects a random b-bit value (where b
is the block size). The smaller the fraction of the possible
values that are valid, the higher the probability that the output
will be invalid. To take the specific case of SSNs, there are
approximately 2^{30} valid values (if we think of the trailing
zeros as not counting), so the chance of producing a valid
value by random chance is vanishingly small (order 2^{-98}).
One thing you might think would make sense would be to use a different
mode than ECB, say counter. The problem with counter mode in this case
is that you need to use a different section of keystream (or a
different key) to encrypt each value to avoid easy cryptanalytic
attacks. So, you need some per-value distinguisher that gets carried
along with the ciphertext, which expands the amount of storage you
need for the encrypted values, even as it keeps the ciphertext small.
Luby-Rackoff
As noted above, our big problem is our block size is too large.
As noted above, even though SSNs are 9 digits long, they are sparsely
packed (for instance letters aren't allowed), so there are
approximately 2^{30} valid SSNs, as long as we use a better mapping
than straight 1-1 digit correspondence. For instance, think of the 9
digit SSN as a value from 1 to 999,999,999 (not all 9-digit numbers are valid
SSNs, but for simplicity, let's pretend they are.)
We can represent that in
binary as a 30 bit quantity. If we had a 32 bit block cipher, we could
encrypt this value with less than 10% expansion, which might be OK
under some circumstances (we'll describe how to do better below).
Ordinary block ciphers have blocks much larger than this, of
course, but it turns out that there's a generic technique for
making block ciphers of arbitrary size (actually, even values
only), called Luby-Rackoff (L-R) . The nice thing about
L-R is that it's a general construction based on a
pseudorandom function (PRF), which we know how to
build with standard cryptographic techniques.
Cycle Walking
We can use L-R to build a block cipher with a block size of any
number of bits we want, but this still means that our function
produces 2^b possible values where b is the block size, but
this generally won't line up perfectly with the set of values
we want to encipher. To return to our SSN example, we have
10^9 possible values, which means we need a block size of
30 bits, which implies a set size of 2^{30} = 1073741824.
So, for any given input value, there's about a 7% chance that
it will encrypt to an invalid SSN (greater than 10^9).
If the database (or software) is really aggressive about
validity checking, then you'll have an unacceptable rejection
rate.
To deal with this issue, Black and Rogaway describe a technique they call "cycle-walking". The idea is that we start with an initially
valid value (1-999,999,999) and then encrypt it. If the ciphertext is
also valid, we stop and emit it. If it's invalid (greater than
999,999,999), we encrypt again, and repeat until we have a valid
value. This gives us an encryption procedure that is guaranteed to
produce an in-range output. Decryption is done in the same way.
Bottom Line
So, why can't we just use cycle-walking? Because
it only works well if the block size is
approximately right—if the size of the valid set is
a lot smaller than the block size of the cipher, then
you have to do a lot of iterations in order to get an in-range
result. So, you can't use a 64-bit block cipher in order
to encrypt an SSN because you end up having to do a
prohibitive number if iterations; you need to use L-R
to construct a block cipher of approximately the right size
and then use cycle-walking to shave off the last few values.
UPDATE: Paul Hoffman pointed out to me privately that it's not clear how
this all relates to FPE. Basically, FPE means the combination of L-R plus cycle
walking. This lets you do one-to-one and onto encryption for most set sizes.
If the set size is really small, there's another technique (also due to Black
and Rogaway): you encrypt all possible input values and then sort the
ciphertexts. You then use the index of the ciphertext in the sorted list
as the encrypted value. This is obviously prohibitively expensive unless
the number of possible values is small because it requires encrypting
all possible values and then keeping a very large mapping table.