# What's a random number and why would I want one?

I wanted to write something about hardware random number generators, but as I was writing I realized that it was probably worth starting with some background about random number generation in general. That's in this post. A future (hopefully near future) post will talk about hardware random number generation.

There are a lot of contexts in which it's convenient to have a ready source of random numbers and the requirements aren't the same. Consider two example applications: surveys and lotteries. [Thanks to Terence Spies for suggesting the survey example.]

In a typical randomized survey, you start with some set of people (e.g., a list of names or phone numbers) and you randomly select some subset of them. Then you call them and ask them what they think of Barack Obama, Jessica Alba, or ketchup. Then you can use straightforward statistics to estimate the probability that are going to vote for Obama or eat ketchup, or perhaps that like Obama and Banderas. For this application what's principally important is that your sampling function be unbiased, i.e., there be an equal chance of selecting each member of the set. For instance, consider what happens if you select people by height from low to high: you're going to get a lot more women than men in which case you probably don't get that good an estimate of how popular Jessica Alba is in the population as a whole.

In a typical lottery, customers get to pick a set of numbers and then the lottery operator draws their own set of numbers using some random number generation method. If you get the numbers right (or sometimes m out of n), you win. It's important to recognize that the critical property here is unpredictability: if gamblers can predict the numbers that the operator will generate then they can choose numbers that are likely to win. Uniformity actually isn't that important, though it makes things easier—lots of gambling games aren't at all uniform (think craps, for instance)—you just need the odds that the game pays off to match the probabilities (actually the inverses of the probabilities) of each outcome. Let's stick to the uniform case for now, though.

So, we have a need for two different kinds of random number generator: one useful in non-adversarial situations like surveys and one useful in adversarial situations like lotteries. In the latter case, we need the generator to be unpredictable to an attacker, which is much harder to achieve.

Back in the old days, people used to use tables of random numbers that had been generated by hand via dice rolling or somesuch. For obvious reasons, this works OK if you're doing surveys—though it's important not always to use the same numbers or you end up sampling the same people every time which of course skews the sample a bit, if only because people get annoyed with you calling them all the time. [Technical note: if you're going to sample repeatedly you want to have the samples be disconnected, otherwise you don't get any additional data when you take the second sample.] On the other hand, it's clearly useless for any adversarial application, since the attacker just needs to figure out what table you're using and they can predict the next numbers you're going to generate.

Now that we have computers, what's typically used are algorithmic pseudorandom number generators (PRNGs). These are functions that generate a uniform stream of values. One common design is to use a hash function or an encryption algorithm—for instance, you might use AES in counter mode. This is fine as far as it goes, but what do you use for a key? AES is a deterministic function: if the attacker knows the key, they can predict the output of the generator. That's why this function is pseudorandom, not really random. As before, we still want to generate a different stream of numbers each time we run the generator: you need to seed the generator with a different key each time you want to use it. For non-adversarial situations, you just need the seed to be unique—it's conventional to use the time of day. But for adversarial situations you need the seed (key) to be unpredictable to the attacker, which means that it has to be randomly generated.

This starts to look like turtles all the way down, but it's really not: if you can get a small amount of random data (e.g., 128 bits), you can bootstrap that up into a large amount of pseudorandom data that's unpredictable to the attacker by using the random data to seed your PRNG. This is how things are done in cryptographic/communications security applications (and what went wrong with OpenSSL on Debian was that this seeding got commented out of the code.)

In order to have a secure system in the adversarial context, we actually need one more property: given the output of the PRNG, it has to be computationally expensive to work backward to reconstruct the PRNG state/seed or forward from one output to the next output. Otherwise the attacker will just take output N and compute output N+1. Not all PRNGs have this property, though the AES-based one I described above does, as do the cryptographically secure PRNGs that people typically use. For a practical system, there's one more property you want: to be able to inject new seed data at any time, not just at the beginning and the PRNGs that people typically use have this property. It's a little tricky (and beyond the scope of this post) to figure out how best to mix in the new seed data, but it's a reasonably well understood problem nonetheless.

I want to make one more point: the output of a badly seeded cryptographic PRNG looks pretty similar to the output of a well seeded PRNG. While there are statistical tests you can use to measure the quality of random numbers, they only really test the generator, not the seeding. The only real way to determine whether a cryptographic PRNG is well-seeded is to try out candidate seeds until you find one that generates the output you're seeing, which is sort of a non-terminating process if the PRNG was well-seeded. Unfortunately, this means that if you want to be sure your PRNG is strong, you actually have to do so by construction—by making sure that the seeds are correct—you can't get a sufficient degree of confidence by testing the output.