On generating unique identifiers

A common problem that occurs in computer and networking applications is to generate a unique identifier. One obvious application is detecting message replays. If you receive two identical messages (e-mails, IP datagrams, etc.) how do you know whether the sender sent them twice or the network just duplicated the message? The standard approach here is to require the sender to include a "unique" identifier in each message. Then you compare the identifiers and if they're the same it's a retransmit and if they're different the sender sent it twice.

Identifier Scope
At this point, the question immediately arises: unique within what scope? For some applications (e.g., identifying files within a global p2p network) you need an identifier that's globally unique. For others (e.g., IP packet reassembly) you need one that's unique to a given host over a short period of time (the lifetime of packets in the network). So, when we talk about uniqueness we need to know what scope the identifier needs to be unique within. The two most common cases are probably:

  • Globally unique: This identifier will never be used for anything else by anyone ever.
  • Locally unique: This identifier will never be re-used by this particular system.

Note that these cases are very closely related; if one thinks of the system identity as part of the identifier, then a locally unique identifier becomes a globally unique one. Indeed, it's common to create globally unique identifiers by gluing a fresh locally unique identifier to a static globally unique name. Thus, for instance, e-mail message IDs look like: E1GdBEu-0005LF-Fz@stiedprstage1.ietf.org.
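The gluing scheme described above can be sketched in a few lines. This is a minimal illustration, not any particular mail library's implementation; the counter, the `make_message_id` name, and the `example.org` domain are all placeholders:

```python
import itertools

# A process-local counter provides the locally unique part.
_counter = itertools.count(1)

def make_message_id(domain="example.org"):
    """Glue a locally unique counter value onto a globally unique name.

    `domain` stands in for a name you actually control; "example.org"
    is just a placeholder here.
    """
    local_part = next(_counter)
    return "%d@%s" % (local_part, domain)

print(make_message_id())  # 1@example.org
print(make_message_id())  # 2@example.org
```

As long as the domain is globally unique and the local part is never reused on that system, the combined identifier is globally unique.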

Locally unique identifiers can be generated in two basic ways: counters and randomly generated.

Counters
There's nothing mysterious about a counter: the first identifier is 1, the second 2, etc. Another variant of this is to use the current time--though you have to be careful to use a high enough resolution timer that successive calls to it produce different results. In either case, counters are simple and easy to implement.
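The timer variant needs exactly the care mentioned above: two back-to-back calls can land in the same clock tick. One way to handle that, sketched here with Python's nanosecond clock, is to spin until the clock actually advances (the `next_time_id` name is made up for this example):

```python
import time

_last = 0

def next_time_id():
    """Use the current time as an identifier, spinning until the
    high-resolution clock advances so that two successive calls can
    never return the same value."""
    global _last
    while True:
        now = time.time_ns()
        if now > _last:
            _last = now
            return now
```

Note that this only guards against duplicate reads within one process; it does nothing about the clock being set backwards, a problem we return to below.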

The big problem with a counter is that you have to keep the counter state stored somewhere to make sure you avoid re-using a value. This isn't necessarily a problem, but there are a bunch of ways it can go wrong. A simple case is persisting the counter across machine reboots. To avoid starting again from zero, you need to save the counter to disk after each update. Consider the following code:

1 send_message(msg) {
2    msg.id = ctr;
3 
4    write(msg);
5
6    ctr++;
7    save(ctr);
8 }

Figure 1

Now, what happens if the program crashes or the machine reboots between line 4 and line 7? In that case, you'll reuse ctr for the next message you send. To be safe, you need to save ctr before sending the message. Then if the machine crashes you skip a counter value but you don't reuse one. The code looks something like this:

1 send_message(msg) {
2    msg.id = ctr;
3    ctr++;
4    save(ctr);
5
6    write(msg);
7 }

Figure 2

Now at this point you might be thinking that neither you nor anybody you know would write the code in Figure 1. Unfortunately, the code in Figure 2 isn't guaranteed to work either: modern operating systems and hard drives have write caches, so just because you've tried to save the data to disk doesn't mean it has actually ended up on the hard drive platter. Similarly, if you're using the time of day, you have to worry about the clock being set back. So, making sure values are never reused is actually fairly tricky in practice.
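A more careful version of Figure 2's save-before-use pattern, sketched here in Python, at least forces the data out of the OS write cache with fsync() and uses an atomic rename so a crash never leaves a half-written file. The function names and file layout are invented for this sketch, and even this only pushes the data as far as the drive, which may have its own cache:

```python
import os

def save_counter(path, value):
    """Persist the counter durably: flush, fsync, then atomically rename."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(value))
        f.flush()
        os.fsync(f.fileno())   # push past the OS write cache
    os.replace(tmp, path)      # atomic rename: no torn file after a crash

def next_id(path):
    """Hand out the stored counter value, saving the successor first
    (the Figure 2 ordering): a crash skips a value but never reuses one."""
    try:
        with open(path) as f:
            ctr = int(f.read())
    except FileNotFoundError:
        ctr = 0
    save_counter(path, ctr + 1)  # save *before* the value is used
    return ctr
```

Even here, the drive's own cache and hardware errors remain outside the program's control, which is the point of the paragraph above.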

Another problem with counters is that they're inherently local. After all, it's pretty likely that everyone else is going to start from 0 or 1 too. If you want to have a globally unique identifier you need to scope the counter by some globally unique value like your domain name. This only works if you have such a globally unique value, which not all systems do.

Randomly Generated
The second approach is to use a random number generator. The idea here is that you generate a random number of suitable length and there's a high probability it will be unique. The appropriate length is governed by the math of the birthday paradox, but in general, if you want to generate 2^n identifiers, they need to be somewhat more than 2n bits long in order to have a high probability of uniqueness. By contrast, because counter values are assigned sequentially, if you want to generate 2^n counter values, they only need to be approximately n bits long. Accordingly, counters are more space efficient.
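The birthday math above is easy to check numerically. For k identifiers drawn uniformly from a space of N values, the collision probability is approximately 1 - e^(-k²/2N); the function name below is made up for this sketch:

```python
import math

def collision_probability(num_ids, id_bits):
    """Birthday-paradox approximation p ≈ 1 - exp(-k^2 / 2N) for
    num_ids identifiers drawn uniformly from N = 2**id_bits values."""
    k = float(num_ids)
    n = 2.0 ** id_bits
    # expm1 keeps precision when the probability is tiny
    return -math.expm1(-k * k / (2.0 * n))

# 2^32 identifiers from a 64-bit space: a collision is quite likely.
print(collision_probability(2**32, 64))    # ≈ 0.39
# The same 2^32 identifiers from a 128-bit space: negligible.
print(collision_probability(2**32, 128))   # ≈ 2.7e-20
```

This is why 2n bits (here, 64) is only marginal for 2^n identifiers, and "somewhat more" than 2n bits makes the collision probability vanish.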

The big advantage of randomly generated IDs is that you don't need to worry about keeping state around--well, sort of. Strictly speaking, this is only true if you have a true random number generator like dice or radioactive decay. In practice, most systems use pseudorandom number generators (PRNGs). These are functions that start with some seed data to set the initial state and then crank out a bunch of random-appearing values. If you feed a PRNG the same seed then you'll get the same set of random data. So, in principle you need to do state management here too. However, in practice, this isn't a real problem because you can leverage the existing state of the machine (the clock time, network traffic, process timing info, maybe a stored state file...) as seed data. Even though the entropy of each individual piece of state is low, the probability that all the system state will ever be exactly replicated is extremely low and so you can treat this as a random generator.1
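The seeding idea can be sketched as follows: mix several low-entropy pieces of machine state through a hash and use the digest as the seed. This is only an illustration of the principle described above; a real system would simply use the operating system's generator (e.g., os.urandom), and the function name here is invented:

```python
import hashlib
import os
import random
import time

def seed_from_system_state():
    """Mix several low-entropy pieces of machine state into a seed.

    Each value is individually guessable, but the chance that the whole
    combination ever repeats exactly is extremely small. Sketch only;
    real systems should use the OS generator instead.
    """
    h = hashlib.sha256()
    h.update(str(time.time_ns()).encode())          # current clock
    h.update(str(os.getpid()).encode())             # process id
    h.update(str(time.perf_counter_ns()).encode())  # fine-grained timing
    return int.from_bytes(h.digest(), "big")

rng = random.Random(seed_from_system_state())
print(hex(rng.getrandbits(64)))
```

Because the hash input includes a nanosecond clock, two runs of the seeding function essentially never produce the same seed, which is the "unique, not necessarily unpredictable" condition the footnote below describes.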

By contrast to counters, you don't need to do anything special to make your randomly generated IDs globally unique. It's just a matter of generating a long enough value. This is obviously very convenient since it means no need to have any global naming scheme.
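Concretely, "long enough" just means comfortably past the birthday bound. A 128-bit value from a cryptographic generator, as in this short sketch, needs no naming scheme at all:

```python
import secrets

def random_id():
    """A 128-bit (16-byte) random identifier as a hex string.

    128 bits is comfortably more than 2*64, so even 2^64 such IDs have
    a negligible collision probability.
    """
    return secrets.token_hex(16)

print(random_id())
```

Python's `secrets` module draws from the operating system's randomness source, so there is no application-level seed state to manage.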

The issue that typically freaks people out about randomly generated IDs is that they're not guaranteed to be unique, they're just statistically unique. There's some (low) chance that two randomly generated IDs will be the same. By contrast, counters (if properly implemented) appear to be guaranteed to be unique. I hear this argument a couple of times a year and in my experience no amount of working through the math will convince people that the probability is vanishingly small. They want a guarantee.

Unfortunately, this is a case where people's intuition leads them astray; computers seem deterministic so people trust that a properly implemented counter will always increase. This isn't really so: memories and hard drives are both subject to random errors, so even if you're correctly incrementing your counter, there is some chance that it will (for instance) come back as zero the next time you read it. Of course, storage devices have error correcting codes designed to prevent this kind of thing, but they're based on the same kind of statistics that assure you that randomly generated IDs are OK, so that shouldn't give you any confidence. I don't have good estimates for the probability of this kind of system error, but given that my computer acts up reasonably often, it's pretty likely that the real-world probability of counter reuse exceeds the expected probability of RNG collisions.

1. Note that this is a much weaker condition than that the seed data be unpredictable. For our purposes here it just has to be unique. Of course in practice people typically use the same cryptographic random number generators for this purpose as well.

4 Comments

And if you really want to be on the safe side, you combine a counter (timestamp), a static globally unique name (MAC address) and a PRN: RFC 4122

So is there a potential market for 128 bit counter chips that increment when read and also when a reset line is pulled low? Increment only chips?

A counter chip would be cheap to make, just some NVR and wee bit of logic. A fancy one would have error detection and correction built in.

A way to grossly reduce the chance of a random number being repeated is to append in the current time to the greatest degree of accuracy that you can, then hash (or leave it alone if you don't mind exposing the time). This reduces the chance of the UID being repeated to that of the random generator repeating in that exact instant.

I disagree with Paul Hoffman's comment. If the clock has been used as some of the data to seed the PRNG (as it should have been) then appending it to the produced random number makes the resulting id no more unique.
