David Gelernter writes:
Maybe most important, you need a cloud for security. More and more of
people's lives is going online. For security and privacy, I need the
same sort of serious protection my information gets that my money gets
in a bank. If I have money, I'm not going to shove it in a drawer
under my bed and protect it with a shotgun or something like that.
I'm just going to assume that there are institutions that I can trust,
reasonably trustworthy to take care of the money for me. By the same
token, I don't want to worry about the issues particularly with
machines that are always on, that are always connected to the network,
easy to break into. I don't want to manage the security on my
machine. I don't want to worry about encryption; I don't want to
worry about other techniques to frustrate thieves and spies. If my
information is out on the cloud, not only can somebody else worry
about encryption and coding it, not only can somebody else worry about
barriers and logon protections, but going back to Linda and the idea
of parallelism and a network server existing not on one machine, but
being spread out on many, I'd like each line of text that I have to be
spread out over a thousand computers, let's say, or over a million.
So, if I'm a hacker and I break into one computer, I may be able to
read a vertical strip of a document or a photograph, which is
meaningless in itself, and I have to break into another 999,999
computers to get the other strips.
This doesn't make a lot of sense to me.
First, most of what you see described now as "cloud"-type services,
e.g., EC2 or Box.net, are really just big server farms operated
by a single vendor. There's a reasonable debate about whether
these are more or less secure than services you operate yourself.
With something like Box.net, you don't need to do any of the
admin work on the server, so you can't screw it up and leave your
system insecure. On the minus side, you don't really know what the
operator is doing, so maybe they're administering it more
insecurely than you would yourself. Moreover, there's a certain
level of risk from the fact that other people—maybe your
enemies—are accessing the same computers as you and may be
trying to steal your data. What this kind of cloud service mostly
offers is convenience: managing your own systems is
a huge pain in the ass, and so while in the
best case you might manage them more securely than Amazon or
Box would, in practice you probably won't [1].
What this doesn't do, however, is remove a single point of failure. In
fact, there are at least two:
- Your data is stored on a small number of machines at the
service provider site. Compromise of any one of those machines will
lead to compromise of your data,
as will, of course, compromise of any of their management machines.
- If the machine on your desk which you use to access the
data is compromised, then your data will also be compromised.
You can, of course, remove the risk of compromise on the service
provider side by encrypting all your data before storing it. In that
case, you're left only with the risk of compromise of your own machines,
but you now have to, as Gelernter says, "worry about encryption".
There's no real way to completely remove the risk of compromise of your own
machines: after all, you need some way to view the data and that
means that your machines need to be able to access it. At most
you can minimize the risk by appropriate security measures.
It's clear, however, from Gelernter's discussion of having your
data spread out over a million machines that he's talking about
something different: a peer-to-peer system like a
distributed hash table (DHT),
where your data is sharded over a large number of machines operated by
different people. In the limit, you could have a worldwide system
where anyone could add their machine to the overlay network and
just pick up a share of the data being stored by other people.
In principle, this sounds like it removes the risk
of a single point of failure, since you would need to compromise all
the machines in question. In practice, it's not anywhere near so good,
for two reasons. First, you're trusting a whole pile of
people you don't know not to reveal or misuse whatever part of your data
they're storing. That's not very comforting if the data in question is
your social security number, and it's worse if you're unlucky enough to
have part of your data stored by your enemies.
Second, DHTs are designed
to dynamically rebalance their load as machines join and leave the
overlay. This means that it may be possible for an attacker to
arrange that his hosts are the ones which get to store your
data, which would increase the risk of compromise.
Even in DHTs which
don't dynamically rebalance, it's generally not practical to
manage a distributed access control system across such an open network;
instead it's just generally assumed that if you want your data to be
confidential you will encrypt it.
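To make the rebalancing concern concrete, here is a minimal Python sketch of successor-style key placement on a hash ring. It is an illustration under stated assumptions, not any particular DHT's algorithm: the 32-bit ring, the `ring_id` and `owner` helpers, and the node and key names are all hypothetical, and the sketch assumes the attacker is free to choose its own node ID (many real designs try to restrict this).

```python
import hashlib
from bisect import bisect_left

RING_BITS = 32  # assumption: tiny ring for readability; real DHTs use far larger IDs

def ring_id(name):
    """Map a name to a position on the ring (hypothetical ID scheme)."""
    digest = hashlib.sha256(name.encode()).digest()
    return int.from_bytes(digest[:RING_BITS // 8], "big")

def owner(key_id, node_ids):
    """Return the node responsible for key_id: the first node ID >= key_id."""
    ids = sorted(node_ids)
    i = bisect_left(ids, key_id)
    return ids[i % len(ids)]  # wrap around past the highest ID

# Eight honest nodes and one stored object (all names are made up).
honest = [ring_id(f"honest-node-{n}") for n in range(8)]
key = ring_id("alice/tax-return.pdf")
before = owner(key, honest)

# Assumption: the attacker can pick its own ID, so it joins exactly at the key.
attacker = key
after = owner(key, honest + [attacker])

print("owner before attacker joins:", before)
print("owner after attacker joins: ", after, "(attacker)" if after == attacker else "")
```

After the join, the object is reassigned to the attacker's node simply because it is now the closest successor. Systems that derive node IDs from something the attacker doesn't fully control make this harder, but not necessarily impossible.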
This brings us to the suggestion that the data will be sharded in
some way that makes each individual piece useless. This seems kind of
pointless. First, it's not necessarily easy to
have a generic function which breaks a data object into subsets
each of which is useless. Gelernter gives the example of a vertical
strip of a photo, but consider that a horizontal strip of an image
of a document (or a vertical strip of a document in landscape mode)
leaks a huge amount of information. I can imagine security arguments
for other sharding mechanisms (every Nth byte, for instance),
but there are also cases where they're not secure. Second,
if you're encrypting the data anyway, then it doesn't matter
how you break it up, since any subset is as useful (or useless) as
any other.
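A short Python sketch shows how both kinds of sharding can leak. The helper names, the sample document, and the fixed-width table are invented for illustration only; the point is just that a single shard can be meaningful on its own.

```python
def contiguous_shards(data, n):
    """Split data into n consecutive strips."""
    size = -(-len(data) // n)  # ceiling division
    return [data[i * size:(i + 1) * size] for i in range(n)]

def strided_shards(data, n):
    """Shard i gets bytes i, i+n, i+2n, ... (every Nth byte)."""
    return [data[i::n] for i in range(n)]

# Made-up document: the sensitive field is short, so one strip holds all of it.
doc = (b"SSN: 000-12-3456\n"
       b"Name: Alice Example\n"
       b"Notes: nothing else to see here\n")
strips = contiguous_shards(doc, 3)
print(strips[0])  # the first strip contains the entire SSN line

# Made-up fixed-width table: 8-byte records, last byte is a yes/no flag.
table = b"".join([b"alice  Y", b"bob    N", b"carol  Y", b"dave   N"])
columns = strided_shards(table, 8)
print(columns[7])  # the last shard is exactly the flag column
```

With a stride equal to the record width, every-Nth-byte sharding hands one node a complete column of the table, which is the kind of case where that scheme isn't secure.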
The bottom line, then, is that cloud storage doesn't make things
as much more secure, or as much simpler, as you would like: you still
need to deal with encryption
and with protecting your own computer. What cloud storage does is
remove the need for you to operate and protect your own server. This
adds a lot of flexibility (the ability to have your data available on whatever
machine you're using) without too much additional effort, but it's
not much more secure than just carrying the data around on a laptop or
USB stick.
One more thing: you don't really want to just shard the data.
Say that you break each file up into 100 pieces and the node storing
piece #57 crashes and loses your data. What happens?
If your file is plain text, it might be recoverable, but with
lots of file formats (e.g., XML), this kind of damage can
render the entire file unusable without heroic recovery efforts.
There are well-known techniques for addressing this situation
(see forward error correction),
but it's not just a simple matter of splitting the file into multiple parts.
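As a rough illustration of the idea, here is a Python sketch of the simplest such scheme: a single XOR parity piece, which can repair the loss of any one piece. This is a deliberately minimal stand-in (real systems generally use stronger codes that survive multiple losses), and the `encode` and `recover` helpers and the sample data are hypothetical.

```python
from functools import reduce

def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data, k):
    """Split data into k equal-size pieces (zero-padded) plus one XOR parity piece."""
    size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(size * k, b"\0")
    pieces = [padded[i * size:(i + 1) * size] for i in range(k)]
    return pieces + [reduce(xor_bytes, pieces)]

def recover(pieces, length):
    """Rebuild the original data when at most one piece is missing (None)."""
    missing = [i for i, p in enumerate(pieces) if p is None]
    if len(missing) > 1:
        raise ValueError("single parity can repair only one lost piece")
    pieces = list(pieces)
    if missing:
        # XOR of all surviving pieces (data and parity) equals the lost piece.
        present = [p for p in pieces if p is not None]
        pieces[missing[0]] = reduce(xor_bytes, present)
    return b"".join(pieces[:-1])[:length]  # drop parity, strip padding

data = b"<doc><item>57</item></doc>"
stored = encode(data, 4)   # 4 data pieces + 1 parity piece
stored[2] = None           # the node holding this piece crashed
print(recover(stored, len(data)) == data)
```

The cost is one extra piece of storage and the need to remember the original length; tolerating more than one failure requires a more capable code than plain parity.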
[1] Technical note: I'm talking mostly about full services
like Box or Amazon S3. Outsourced virtual machine services like EC2
of course require you to manage them and so you can screw them up just
as badly as you could screw up a machine in your own rack.