A common problem that occurs in computer and networking applications
is to generate a unique identifier. One obvious application is
detecting message replays. If you receive two identical messages
(e-mails, IP datagrams, etc.) how do you know whether the sender sent
them twice or the network just duplicated the message? The standard
approach here is to require the sender to include a "unique"
identifier in each message. Then you compare the identifiers and if
they're the same it's a retransmit and if they're different the sender
sent it twice.
Identifier Scope
At this point, the question immediately arises: unique within what
scope? For some applications (e.g., identifying files within
a global p2p network) you need an identifier that's globally
unique. For others (e.g., IP packet reassembly) you need one
that's unique to a given host over a short period of time
(the lifetime of packets in the network). So, when we talk about
uniqueness we need to know what scope the identifier needs to
be unique within. The two most common cases are probably:
- Globally unique: This identifier will never be used for anything
else by anyone ever.
- Locally unique: This identifier will never be re-used by this
particular system.
Note that these cases are very closely related; if one thinks
of the system identity as part of the identifier, then a locally
unique identifier becomes a globally unique one. Indeed, it's
common to create globally unique identifiers by gluing a fresh
locally unique identifier to a static globally unique name.
Thus, for instance, e-mail message IDs look like:
E1GdBEu-0005LF-Fz@stiedprstage1.ietf.org.
Locally unique identifiers can be generated in two basic
ways: counters and randomly generated.
Counters
There's nothing mysterious about a counter: the first identifier is 1,
the second 2, etc. Another variant of this is to use the current
time--though you have to be careful to use a high enough resolution
timer that successive calls to it produce different results.
In either case, counters are simple and easy to implement.
The big problem with a counter is that you have to keep the counter
state stored somewhere to make sure you avoid re-using the counter.
This isn't necessarily a problem, but there are a bunch of ways it
can go wrong. A simple case is persisting the counter across machine
reboots. In order to do avoid starting again from zero, you need
to save the counter to disk after each update. Consider the following
code:
1 send_message(msg) {
2 msg.id = ctr;
3
4 write(msg);
5
6 ctr++;
7 save(ctr);
8 }
Figure 1
Now, what happens if the program crashes or the machine reboots in
between line 4 and line 7. In that case, you'll reuse ctr
for the next message you send. In order to be safe, you need to save
ctr before sending the message. Then if the machine
crashes you skip a counter but you don't reuse one. The code looks
something like this:
1 send_message(msg) {
2 msg.id = ctr;
3 ctr++;
4 save(ctr)
5
6 write(msg);
7 }
Figure 2
Now at this point you might be thinking that neither you nor anybody
you know would write the code in Figure 1, but the code in Figure 2
isn't guaranteed to work either because modern operating systems
and hard drives have write caches, so just because you've tried
to save the data to disk doesn't mean it actually has ended up on
the hard drive platter. Similarly, if you're using time of day,
you have to worry about the clock being set back.
So, making sure values are never reused is actually fairly tricky
in practice.
Another problem with counters is that they're inherently local.
After all, it's pretty likely that everyone else is going
to start from 0 or 1 too. If you want to have a globally unique identifier you need to
scope the counter by some globally unique value like your domain
name.
This only works if you have such a globally unique value,
which not all systems do.
Randomly Generated
The second approach is to use a random number generator.
The idea here is that you generate a random number of suitable
length and there's a high probability it will be unique. The
appropriate length is governed by the math of the birthday paradox, but in general, if
you want to generate 2n identifiers, they need to be
somewhat greater than 2n bits long in order to have a
high probability of uniqueness. By contrast, because counter
values are assigned sequentially, if you want to generate 2n
counter values, they have to be of length approximately n bits.
Accordingly, counters are more space efficient.
The big advantage of randomly generated IDs is that you don't
need to worry about keeping state around--well, sort of.
Strictly speaking, this is only true if you have a true random number
generator like dice or radioactive decay. In practice, most
systems use pseudorandom number generators (PRNGs). These are
functions that start with some seed data to set the initial state
and then crank out a bunch of random-appearing values.
storing state. If you feed a PRNG the same seed then you'll get
the same set of random data. So, in principle you need to do state
management here too. However, in practice, this isn't
a real problem because you can leverage the existing state of the
machine (the clock time, network traffic, process
timing info, maybe a stored state file...) as seed data. Even though the
entropy of each individual piece of state is low, the probability that
all the system state will ever be exactly replicated is extremely
low and so you can treat this as a random generator.1
By contrast to counters, you don't need to do anything special
to make your randomly generated IDs globally unique. It's just
a matter of generating a long enough value. This is obviously
very convenient since it means no need to have any global naming
scheme.
The issue that typically freaks people out about randomly generated
IDs is that they're not guaranteed to be unique, they're just
statistically unique. There's some (low) chance that two randomly
generated IDs will be the same. By contrast, counters (if properly
implemented) appear to be guaranteed to be unique.
I hear this argument a couple
of times a year and in my experience no amount of working through
the math will convince people that the probability is vanishingly
small. They want a guarantee.
Unfortunately, this is a case where people's intuition leads them
astray; computers seem deterministic so people trust that a properly
implemented counter will always increase. This isn't really so:
memories and hard drives are both subject to random errors, so
even if you're correctly incrementing your counter, there is some
chance that it will (for instance) come back as zero the next
time you read it. Of course, storage devices have error correcting
codes designed to prevent this kind of thing, but they're based
on the same kind of statistics that assure you that randomly
generated IDs are OK, so that shouldn't give you any confidence.
I don't have good estimates for the probability of this kind
of system error but given that my computer acts up randomly reasonably
often, it's pretty likely that the real-world probability of counter
reuse exceeds the expected probability of RNG collisions.
1. Note that this is a much weaker condition than that the
seed data be unpredictable. For our purposes here it just has to
be unique. Of course in practice people typically use the same
cryptographic random number generators for this purpose as well.