Notes on the security properties of Bitcasa

TechCrunch has been writing about a new startup called Bitcasa. Roughly speaking, Bitcasa offloads your storage to a cloud service and uses data deduplication to avoid storing multiple copies of common data. In TechCrunch's original review, they wrote:

When you save a file, Bitcasa writes those 1's and 0's to its server-side infrastructure in the cloud. It doesn't know anything about the file itself, really. It doesn't see the file's title or know its contents. It doesn't know who wrote the file. And because the data is encrypted on the client side, Bitcasa doesn't even know what it's storing.

So if you want to cloud-enable your 80 GB collection of MP3's or a terabyte of movies (acquired mainly through torrenting, naughty you!), go ahead. Even if the RIAA and MPAA came knocking on Bitcasa's doors, subpoenas in hand, all Bitcasa would have is a collection of encrypted bits with no means to decrypt them.

This seems intuitively wrong from an information theory perspective, but it's certainly possible that Bitcasa is using some crypto magic I'm unaware of. Luckily, in this article Bitcasa explains the technology.

OK, so convergent encryption... what happens is when you encrypt data, I have a key and you have a key. And let's say that that these are completely different. Let's say that we both have the exact same file. I encrypt it with my key and you encrypt it with your key. Now the data looks completely different because the encryption keys are different. Well, what happens if you actually derive the key from the data itself? Now we both have the exact same encryption key and we can de-dupe on the server side.
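To make the idea concrete, here is a minimal sketch of deriving the key from the data. Everything in it is an assumption for illustration: the function names are made up, and the SHA-256 based keystream is a toy stand-in for a real cipher, chosen only so the example runs with the standard library. Nothing here is claimed to be what Bitcasa actually does.

```python
import hashlib
from typing import Tuple


def derive_key(plaintext: bytes) -> bytes:
    # The key is a hash of the data itself, so identical files yield identical keys.
    return hashlib.sha256(plaintext).digest()


def keystream(key: bytes, length: int) -> bytes:
    # Toy keystream: SHA-256 over key || counter. Illustrative only, not a vetted cipher.
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def convergent_encrypt(plaintext: bytes) -> Tuple[bytes, bytes]:
    # Returns (key, ciphertext). The client keeps the key; the server sees only the ciphertext.
    key = derive_key(plaintext)
    stream = keystream(key, len(plaintext))
    ciphertext = bytes(p ^ k for p, k in zip(plaintext, stream))
    return key, ciphertext


def convergent_decrypt(key: bytes, ciphertext: bytes) -> bytes:
    # XOR with the same keystream recovers the plaintext.
    stream = keystream(key, len(ciphertext))
    return bytes(c ^ k for c, k in zip(ciphertext, stream))


if __name__ == "__main__":
    # Two users encrypt the same file independently and get the same ciphertext.
    _, mine = convergent_encrypt(b"same file contents")
    _, yours = convergent_encrypt(b"same file contents")
    print(mine == yours)
```

Because the ciphertext depends only on the plaintext, the server can deduplicate by comparing ciphertexts (or their hashes) without ever holding a key.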

They don't provide a cite, but we're probably talking about a system like that described by Storer et al. If so, then there's no reason to think it won't work, but the implied security claim above seems over-strong. In particular, as far as I can tell, Bitcasa does in fact learn which user has stored which chunk. Consider what happens when you and I both try to store file X. The encryption is a deterministic function of the data, so we each generate the same n encrypted chunks X_1, X_2, ... X_n. When I upload them (or query Bitcasa for whether I need to upload them), Bitcasa learns which chunks I have. When you do the same, Bitcasa learns we have the same file. More generally, the pattern of chunks you have serves as a signature for the files you have: anyone who has a copy of a given file and can see which chunks a user has stored (Bitcasa itself, for instance) can easily determine whether that user has stored the file. So, unless I've missed something, I don't think this provides security against attempts by Bitcasa to determine whether a given user is storing one of a given set of contraband files.
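The check described above can be sketched in a few lines. This is a hypothetical model, not Bitcasa's protocol: the `Server` class, the fixed-size chunking, the tiny `CHUNK_SIZE`, and the `chunk_ids` helper are all assumptions. The chunk identifier is a hash standing in for the deterministically encrypted chunk; any deterministic function of the plaintext chunk shows the same effect.

```python
import hashlib
from collections import defaultdict
from typing import Set

CHUNK_SIZE = 4  # unrealistically small, so the example is easy to follow


def chunk_ids(data: bytes, chunk_size: int = CHUNK_SIZE) -> Set[bytes]:
    # Split into fixed-size chunks and compute a deterministic identifier for each.
    ids = set()
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        key = hashlib.sha256(chunk).digest()  # key derived from the chunk itself
        ids.add(hashlib.sha256(key + chunk).digest())  # stand-in for the encrypted chunk
    return ids


class Server:
    """Hypothetical storage server that records which user holds which chunk."""

    def __init__(self) -> None:
        self.chunks_by_user = defaultdict(set)

    def upload(self, user: str, data: bytes) -> None:
        # The server sees only identifiers, never plaintext or keys.
        self.chunks_by_user[user] |= chunk_ids(data)

    def users_holding(self, candidate: bytes) -> Set[str]:
        # Given a copy of a file, recompute its identifiers and find
        # every user whose stored chunks include all of them.
        wanted = chunk_ids(candidate)
        return {u for u, have in self.chunks_by_user.items() if wanted <= have}


if __name__ == "__main__":
    server = Server()
    server.upload("alice", b"pirated-movie-bytes")
    server.upload("alice", b"holiday-photos-2011")
    server.upload("bob", b"holiday-photos-2011")
    print(server.users_holding(b"pirated-movie-bytes"))
```

The server never decrypts anything here. It only needs its own copy of the candidate file to recompute the identifiers, which is exactly the situation with a known set of contraband files.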


The Tahoe least-authority file system had a convergent encryption deduplication scheme but they dropped it because it leaks too much information.

I don't know anything about Bitcasa, but this sounds like Content Addressable Storage. With CAS the files are split up ("chunked") on the client side and the server doesn't have any knowledge of what the client considers a file. I don't imagine that completely solves the issue you brought up, but at this point I've exhausted my knowledge of CAS.
