COMSEC: September 2008 Archives

What's a random number and why would I want one?

I wanted to write something about hardware random number generators, but as I was writing I realized that it was probably worth starting with some background about random number generation in general. That's in this post. A future (hopefully near future) post will talk about hardware random number generation.

There are a lot of contexts in which it's convenient to have a ready source of random numbers and the requirements aren't the same. Consider two example applications: surveys and lotteries. [Thanks to Terence Spies for suggesting the survey example.]

In a typical randomized survey, you start with some set of people (e.g., a list of names or phone numbers) and you randomly select some subset of them. Then you call them and ask them what they think of Barack Obama, Jessica Alba, or ketchup. Then you can use straightforward statistics to estimate the probability that are going to vote for Obama or eat ketchup, or perhaps that like Obama and Banderas. For this application what's principally important is that your sampling function be unbiased, i.e., there be an equal chance of selecting each member of the set. For instance, consider what happens if you select people by height from low to high: you're going to get a lot more women than men in which case you probably don't get that good an estimate of how popular Jessica Alba is in the population as a whole.

In a typical lottery, customers get to pick a set of numbers and then the lottery operator draws their own set of numbers using some random number generation method. If you get the numbers right (or sometimes m out of n), you win. It's important to recognize that the critical property here is unpredictability: if gamblers can predict the numbers that the operator will generate then they can choose numbers that are likely to win. Uniformity actually isn't that important, though it makes things easier—lots of gambling games aren't at all uniform (think craps, for instance)—you just need the odds that the game pays off to match the probabilities (actually the inverses of the probabilities) of each outcome. Let's stick to the uniform case for now, though.

So, we have a need for two different kinds of random number generator: one useful in non-adversarial situations like surveys and one useful in adversarial situations like lotteries. In the latter case, we need the generator to be unpredictable to an attacker, which is much harder to achieve.

Back in the old days, people used to use tables of random numbers that had been generated by hand via dice rolling or somesuch. For obvious reasons, this works OK if you're doing surveys—though it's important not always to use the same numbers or you end up sampling the same people every time which of course skews the sample a bit, if only because people get annoyed with you calling them all the time. [Technical note: if you're going to sample repeatedly you want to have the samples be disconnected, otherwise you don't get any additional data when you take the second sample.] On the other hand, it's clearly useless for any adversarial application, since the attacker just needs to figure out what table you're using and they can predict the next numbers you're going to generate.

Now that we have computers, what's typically used are algorithmic pseudorandom number generators (PRNGs). These are functions that generate a uniform stream of values. One common design is to use a hash function or an encryption algorithm—for instance, you might use AES in counter mode. This is fine as far as it goes, but what do you use for a key? AES is a deterministic function: if the attacker knows the key, they can predict the output of the generator. That's why this function is pseudorandom, not really random. As before, we still want to generate a different stream of numbers each time we run the generator: you need to seed the generator with a different key each time you want to use it. For non-adversarial situations, you just need the seed to be unique—it's conventional to use the time of day. But for adversarial situations you need the seed (key) to be unpredictable to the attacker, which means that it has to be randomly generated.

This starts to look like turtles all the way down, but it's really not: if you can get a small amount of random data (e.g., 128 bits), you can bootstrap that up into a large amount of pseudorandom data that's unpredictable to the attacker by using the random data to seed your PRNG. This is how things are done in cryptographic/communications security applications (and what went wrong with OpenSSL on Debian was that this seeding got commented out of the code.)

In order to have a secure system in the adversarial context, we actually need one more property: given the output of the PRNG, it has to be computationally expensive to work backward to reconstruct the PRNG state/seed or forward from one output to the next output. Otherwise the attacker will just take output N and compute output N+1. Not all PRNGs have this property, though the AES-based one I described above does, as do the cryptographically secure PRNGs that people typically use. For a practical system, there's one more property you want: to be able to inject new seed data at any time, not just at the beginning and the PRNGs that people typically use have this property. It's a little tricky (and beyond the scope of this post) to figure out how best to mix in the new seed data, but it's a reasonably well understood problem nonetheless.

I want to make one more point: the output of a badly seeded cryptographic PRNG looks pretty similar to the output of a well seeded PRNG. While there are statistical tests you can use to measure the quality of random numbers, they only really test the generator, not the seeding. The only real way to determine whether a cryptographic PRNG is well-seeded is to try out candidate seeds until you find one that generates the output you're seeing, which is sort of a non-terminating process if the PRNG was well-seeded. Unfortunately, this means that if you want to be sure your PRNG is strong, you actually have to do so by construction—by making sure that the seeds are correct—you can't get a sufficient degree of confidence by testing the output.

Countermeasures for gaming random extensions

Writing in the comments section, Nick Weaver expresses the not unreasonable concern that if you randomly select people who are eligible for extensions, people will try to game the system:
If you do the randomization suggestion, you really don't want to do it for a security conference, because then you will get security researchers playing games (eg, 3-4 different titles/abstracts/author orders, withdraw the lowest extending ones)

This really is the way security people think, but I think it is possible to design technical mechanisms to stop people from doing this. One obvious thing you could do is simply forbid any individual from being an author on two papers that apply for extensions. This sounds a bit restrictive but one could certainly argue that if you're asking for extensions on two papers, you don't have time to work on them both and should just focus on one.

It seems like there is a less restrictive approach, however, namely to use cut-and-choose. Nick's attack relies on people submitting extension requests for papers they don't really intend to submit. This implies a countermeasure where you force authors to prove that their requests are legitimate. You can't really do this for the papers where you reject their extension requests (since they could reasonably claim they're not ready to submit, which is why they asked for an extension), but you can for any request where you grant an extension:

Here's one natural approach:

• Force people to submit all their requests at once before you tell them whether any are granted.
• Randomly (cryptographically securely random) select whatever subset of the papers you're going to select.
• Require the authors to actually submit papers for which requests were granted.

(Obviously you'd need to notify authors ahead of time that these were the terms).

Now, obviously, an author who is trying to cheat you can submit bogus papers (or slight variants), but I'm assuming that if you actually force them to submit, the PC can detect them and appropriately punish the authors.

I'm being deliberately vague around the words "require" and "punish" here, because we're now leaving the realm of cryptography and entering the world of politics. One thing you could do is simply refuse to accept any papers unless the authors submitted all the papers by that author. More severely, you could refuse to accept any papers from them in the future (this sanction is sometimes used in other cases of clear misconduct). In any case, it's clear we could create sanctions that authors wouldn't like.

Let's do the math on how well this would work. For mathematical convenience, assume that you are only submitting a single real paper and that if you're caught cheating is that all your submissions for this conference are summarily rejected. Assume further that if the extension isn't granted, you don't submit. Thus, we have the following outcomes:

 Outcome Payoff No extensions granted 0 One extension granted 1 >1 extension granted 0

If the probability of extensions being granted is p and you submit n times, then the probability of each of these events is:

 Outcome P(outcome) No extensions granted (1-p)n One extension granted p*(1-p)n-1*n >1 extension granted 1-pn

However, because we've assumed that the payoffs for the first and third cases are zero, and the second case it's 1, the payoff is just the second case, i.e., p*(1-p)n-1*n. To find the best strategy for the submitter, we want to find the n that gives the maximum payoff. Since we can only choose discrete values of n it's easiest to find this value experimentally. Here's the relevant R code to find the maximum:

```pay <- function(p,n){
n* p * ((1-p) ^ (n-1))
}

max.pay <- function(p){
n <- seq(1,100)
pays <- sapply(n,pay,p=p)
n[order(pays,decreasing=TRUE)][1]
}
```
(note 1: this code does not work properly for very small values of p where the optimal n value is > 100. That can be fixed by making the maximal n value bigger. note 2: we happen to know that there is a single maximum so we could do a loop that stopped as soon as p_{n+1} < p_n, but that was too much effort.)

It turns out that for all p >= .05, the optimal number of submissions is 1, which is the behavior you want to encourage. For p < .5, it makes sense for the submitters to do multiple submissions (with the number increasing as p increases). Accordingly, if you want to give extensions less than 1/2 the time, you need more draconian punishments (e.g., banning from future submissions) to enforce that.

I should note that this sort of suspicious attitude probably isn't the best way to get the result you want. In general, the academic community depends on people's voluntary compliance with norms (we don't really check that people don't make up their data, for instance), and if you simply tell people that this is the norm for this conference (and perhaps make them explicitly affirm their compliance), I expect they will do so. On the other hand, if you make it a game, they're likely to take it as a challenge to cheat (especially in the security community) (cf. the famous A fine is a price.). On the other hand, that's less fun than designing a new mechanism.

For extra credit, consider the following extensions to the model: (1) the punishment is greater than just having all your papers rejected. (2) authors have more than one paper. (3) submitting without an extension is less good than submitting with an extension but still of nonzero value.

October 2012

S M T W T F S
1 2 3 4 5 6
7 8 9 10 11 12 13
14 15 16 17 18 19 20
21 22 23 24 25 26 27
28 29 30 31