# How to ask sensitive questions on a survey

Let's say that you want to do a survey where you ask people questions they might not want to reveal their answers to, e.g. "Have you ever taken illegal drugs?"

Here's the problem. Suppose we simply ask the question, and a fraction π of the population has the attribute you're interested in measuring (e.g., they've smoked pot or whatever). Unfortunately, only a fraction λ of those people are willing to admit it. So, when you do your survey, a fraction πλ answers "Yes". Call this value F. This doesn't help you much: you now know that at least a fraction F of people have the attribute, but you have no way of measuring the upper bound. All you know is that the true value of π, which equals F/λ, is somewhere between F (if λ=1) and 1 (if λ=F). Obviously, this technique works well if you have a good estimate of λ, but poorly if you don't.
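To make the range concrete, here is a minimal sketch of the direct-question arithmetic. The function name `pi_given_lambda` is made up for illustration; it only encodes the relation F = πλ from the paragraph above.

```python
def pi_given_lambda(f_yes: float, lam: float) -> float:
    """Prevalence pi implied by an observed Yes rate F, if a fraction
    lam of the people with the attribute admit it (F = pi * lam)."""
    if not 0 < lam <= 1:
        raise ValueError("lam must be in (0, 1]")
    if not 0 <= f_yes <= lam:
        raise ValueError("F must be between 0 and lam")
    return f_yes / lam


if __name__ == "__main__":
    # Same observed Yes rate, very different implied prevalence.
    for lam in (1.0, 0.5, 0.25, 0.1):
        print(f"lambda={lam:.2f} -> pi={pi_given_lambda(0.1, lam):.2f}")
```

With an observed F of .1, the implied π runs from .1 (everyone admits it) all the way to 1 (only one in ten admits it), which is why an unknown λ makes the direct question so uninformative.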

The standard methodology for removing this kind of error is called a Randomized Response survey, which uses secret randomness to mask the response. The basic method looks like this:

1. Interviewer asks question. E.g. "Have you ever smoked marijuana?". We'll assume that "Yes" is the sensitive answer here.
2. Subject flips a coin.
3. If the coin comes up heads, the subject answers "Yes".
4. If the coin comes up tails, the subject answers the question truthfully.
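The four steps above can be sketched as a small simulation. This assumes every subject follows the rules honestly and that the coin is fair; the names `randomized_response` and `run_survey` and the seed are illustrative choices, not part of any standard tool.

```python
import random


def randomized_response(is_smoker: bool, rng: random.Random) -> str:
    """One subject's answer under the coin-flip protocol."""
    heads = rng.random() < 0.5          # step 2: the secret coin flip
    if heads:
        return "Yes"                    # step 3: heads always answers Yes
    return "Yes" if is_smoker else "No"  # step 4: tails answers truthfully


def run_survey(n: int, true_pi: float, seed: int = 0) -> list[str]:
    """Simulate n honest subjects drawn from a population with prevalence true_pi."""
    rng = random.Random(seed)
    answers = []
    for _ in range(n):
        is_smoker = rng.random() < true_pi
        answers.append(randomized_response(is_smoker, rng))
    return answers


if __name__ == "__main__":
    answers = run_survey(10_000, true_pi=0.2, seed=1)
    print("No rate:", answers.count("No") / len(answers))
```

The interviewer only ever sees the list of answers, never the coin flips, which is the whole point of the design.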

The results can be summarized in the following contingency table:

| Coin flip result | Heads (.5) | Tails (.5) |
| --- | --- | --- |
| Smoker (π) | Yes | Yes |
| Non-smoker (1-π) | Yes | No |

In this survey, the only people who will answer "No" will be people who both flipped tails and haven't smoked marijuana. Because these two events are independent, the fraction of No answers will be approximately (1-π)/2. This makes it very easy to estimate π. We simply take the No response rate, N, and compute 1-2N, which gives us our estimate of π.
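As a rough sketch, the estimate can be computed from raw counts like this. The standard error is an addition that is not in the text above: it uses the usual binomial approximation for the No rate and scales it by 2 because the estimate is 1-2N. The function name `estimate_pi` is hypothetical.

```python
import math


def estimate_pi(no_count: int, n: int) -> tuple[float, float]:
    """Return (pi_hat, standard_error) from the number of No answers.

    pi_hat = 1 - 2N, where N is the observed No rate.
    The standard error assumes independent, honest respondents.
    """
    if n <= 0 or not 0 <= no_count <= n:
        raise ValueError("need 0 <= no_count <= n and n > 0")
    no_rate = no_count / n
    pi_hat = 1 - 2 * no_rate
    # Var(N) is about N(1-N)/n, and pi_hat is linear in N with slope -2.
    std_err = 2 * math.sqrt(no_rate * (1 - no_rate) / n)
    return pi_hat, std_err


if __name__ == "__main__":
    pi_hat, std_err = estimate_pi(no_count=400, n=1000)
    print(f"pi_hat = {pi_hat:.3f} +/- {std_err:.3f}")
```

Note that a No rate above .5 yields a negative estimate; that can happen through sampling noise when π is near zero, or it can be a sign that subjects are not following the rules.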

Now, obviously the above assumes that people answer truthfully, which we don't know that they'll do. We need to ask whether it's reasonable for people to answer truthfully. Without randomization, the reason that people don't answer truthfully is that it reveals information about them. I.e., if you say "Yes" then the interviewer knows you're a smoker. With a randomized response design, the interviewer gets some information: if you say No you definitely are not a smoker, but if you say Yes you might or might not be.

Remember that the Yes response rate Y is given by 1-(1-π)/2 = .5 + π/2. Out of that set, Y, π will have actually smoked marijuana and .5 - π/2 will have not, but will have answered yes because of the coin flip. (π/2 will have smoked marijuana but also flipped heads). Now, assume that the researcher has done the study and made his estimate of π. This means that his a priori estimate is that an arbitrary person he meets (who he hasn't asked the question of) has a π chance of having smoked marijuana. Now, if you answer "Yes" to the randomized question above, he can adjust his estimate: you now have a π/(.5 + π/2) chance of being a smoker.
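The bookkeeping in this paragraph can be checked with a few lines. This is only a sketch of the arithmetic above for the fair-coin design; `yes_breakdown` is an illustrative name.

```python
def yes_breakdown(pi: float) -> dict[str, float]:
    """Split the Yes answers into smokers and coin-forced non-smokers."""
    if not 0 <= pi <= 1:
        raise ValueError("pi must be in [0, 1]")
    smokers = pi                   # every smoker answers Yes, heads or tails
    non_smokers = (1 - pi) / 2     # non-smokers who flipped heads
    yes_rate = smokers + non_smokers  # equals .5 + pi/2
    return {
        "yes_rate": yes_rate,
        "smokers": smokers,
        "non_smokers": non_smokers,
        "posterior": smokers / yes_rate,  # chance a Yes answerer is a smoker
    }


if __name__ == "__main__":
    for pi in (0.1, 0.5):
        b = yes_breakdown(pi)
        print(f"pi={pi}: Y={b['yes_rate']:.3f}, P(smoker | Yes)={b['posterior']:.3f}")
```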

How much does this improve his information? It depends on the value of π. If π is relatively small (e.g. .1), then .5 + π/2 is approximately .5 and so the new estimate becomes roughly 2*π--the survey question has caused the interviewer to double his estimate of your chance of being a smoker. On the other hand, if π is fairly high (e.g. .5) then .5 + π/2 moves toward 1 (it is .75 at π = .5) and the interviewer gets less information about you in particular. In no case does this technique let the interviewer more than double his confidence of your positive status (see note 1), so the amount of individual information leakage is fairly small.

Of course, this demonstration that not that much information is transmitted, while using only simple probability theory, is still somewhat involved, so it's not entirely clear that interviewees actually answer truthfully when this technique is used (see here for one analysis). Nevertheless, this general kind of survey design is very widely used to elicit answers to embarrassing questions.

1. The limit at a factor of 2 is a result of the 50/50 nature of the coin flip. If we used a die roll so that people answered truthfully (say) 2/3 of the time, the advantage would be larger (up to a factor of 3).
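The note can be illustrated by generalizing the coin to an arbitrary probability of a forced "Yes". In this sketch `p_forced` is a made-up parameter name: .5 corresponds to the coin flip, and 1/3 to a die roll where subjects answer truthfully 2/3 of the time.

```python
def gain_factor(pi: float, p_forced: float = 0.5) -> float:
    """How much a Yes answer multiplies the interviewer's estimate.

    P(Yes) = p_forced + (1 - p_forced) * pi, and the posterior is
    pi / P(Yes), so the multiplier is 1 / P(Yes). As pi goes to 0 this
    approaches its upper limit of 1 / p_forced.
    """
    if not 0 <= pi <= 1 or not 0 < p_forced <= 1:
        raise ValueError("need 0 <= pi <= 1 and 0 < p_forced <= 1")
    return 1 / (p_forced + (1 - p_forced) * pi)


if __name__ == "__main__":
    for p_forced in (0.5, 1 / 3):
        gains = ", ".join(f"{gain_factor(pi, p_forced):.2f}" for pi in (0.01, 0.1, 0.5))
        print(f"p_forced={p_forced:.2f}: gain at pi=.01, .1, .5 -> {gains}")
```

The trade-off runs the other way too: a smaller `p_forced` leaks more about each subject but wastes fewer answers, so the population estimate is less noisy for the same sample size.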

I don't get it--why on earth would you want to reduce the amount of information the subject is giving you? Isn't the goal to make the subject more comfortable about giving you as much information as possible?

Now, this trick may do that, in cases where the subject understands probability theory well enough to figure out that he or she is giving out relatively little information with his or her answer, and can therefore follow the rules safely. But in that case, wouldn't it be better still to enact this same procedure with a secretly biased coin that almost always comes out to "tails"--unbeknownst to the subject--so as to maximize the total information gathered? And if the subject is bad at probability theory, mightn't he or she conclude just about anything--including that this scheme is all a big trick to wheedle information out of him or her, and therefore that it'd be best to lie, or refuse to answer, or run screaming from the study? And mightn't there be much more effective ways of persuading subjects to answer truthfully, without losing any information at all?

Or has this method actually been tested against alternatives and found to be (to my surprise, if it's true) the most comforting for subjects?

Well, the meta-goal is to produce the most accurate estimate of the underlying population statistic. If the subjects lie when asked directly, then you're not getting good information. So, you may get more accurate estimates if the subject thinks they're leaking less individual information--and since you don't actually care about this particular subject's status... Why do you find this concept difficult?

As for the question of whether this works, it's been extensively studied and the results are a bit mixed. In fact, the original post contained a link to a meta-analysis of a variety of studies on this topic that also included references to many of those studies, so you might wish to start there.

The part I have trouble with is the jump from "if the subject thinks they're leaking less individual information" to "if the subject is leaking less individual information". It appears to me that the designers of the protocol made this jump, assuming that if they designed a protocol which reduced the amount of information leaked by each subject, then the subjects would also think they were leaking less information, and therefore be more honest.

I would have thought it more effective to design protocols that make subjects feel more comfortable about being honest--either by giving the impression of less information leakage, or by reassuring the subject that revealing the information is okay--while preserving as much actual information revelation as possible. To me, the protocol you describe sounds like the worst of both worlds--excellent actual information reduction properties, using a probabilistic technique that might well not reassure anyone lacking a sophisticated understanding of probability theory that it's safe to be honest.

Well, whether this works or not doesn't really depend on your opinion--it's an empirical question. Maybe it works, maybe it doesn't, but this kind of armchair analysis doesn't strike me as a very good way to determine the answer.

Yes, of course, it's a purely empirical question. That's why I put the final question in my first comment--to make it clear that while my intuition told me that this would be a lousy method, I was perfectly willing to accept experimental results that contradicted my intuition. When you came back with the answer that the empirical results were in fact inconclusive, though, I started to suspect that I might be on to something.

I should note that the page you linked to, and one of the references it cites, mention several variations on the "randomized response" technique which might be statistically identical or even inferior to the one you described, but still psychologically preferable. (That is, they may yield less information assuming completely honest respondents, but nevertheless work better because they're better at encouraging honesty in respondents.) For example, if you have the responder answer according to a second coin flip, rather than "yes", in the case of "heads", then very little extra information is lost, but the responder no longer has the temptation to rule out an embarrassing revelation entirely simply by answering "no".
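For concreteness, here is a sketch of the second-coin variant as I read it: on heads the subject answers according to a second fair coin, and on tails answers truthfully. The function names are invented for illustration, and the formulas assume fair coins and honest subjects.

```python
def yes_rate_two_coin(pi: float) -> float:
    """P(Yes): half answer truthfully (pi), half follow a second coin (.5)."""
    return 0.5 * pi + 0.25


def estimate_pi_two_coin(yes_rate: float) -> float:
    """Invert the relation above to recover pi from the observed Yes rate."""
    return 2 * yes_rate - 0.5


def posterior_two_coin(pi: float, answer: str) -> float:
    """Chance the subject is a smoker, given their answer."""
    if answer == "Yes":
        # A smoker says Yes on tails (.5) or heads plus second coin Yes (.25).
        return 0.75 * pi / yes_rate_two_coin(pi)
    if answer == "No":
        # A smoker says No only on heads plus second coin No (.25).
        return 0.25 * pi / (1 - yes_rate_two_coin(pi))
    raise ValueError("answer must be 'Yes' or 'No'")


if __name__ == "__main__":
    pi = 0.2
    print("P(smoker | Yes) =", round(posterior_two_coin(pi, "Yes"), 3))
    print("P(smoker | No)  =", round(posterior_two_coin(pi, "No"), 3))
```

Under this variant a "No" no longer proves that the subject is a non-smoker, so there is less to gain by breaking the rules and answering "No".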

Years ago, I read a bit about a large study of family violence (Straus and Gelles' National Family Violence Survey--the one which achieved some notoriety by discovering that women were only slightly less violent than men towards their spouses). Their survey did not use randomized response, but they did use various methods to encourage accurate responses, including gradually ratcheting up the embarrassment level of the questions, treating the embarrassing questions as natural followups to the less embarrassing ones, and carefully phrasing the questions so that the supposedly embarrassing answers would seem as normal as the unembarrassing answers. It seems quite plausible to me (although, again, this is purely an empirical question) that such techniques may be much more effective than randomized response at persuading people to answer such questions truthfully.

"I should note that the page you linked to, and one of the references it cites, mention several variations on the "randomized response" technique which might be statistically identical or even inferior to the one you described, but still psychologically preferable."

You seem to be under the misimpression that my post was intended to provide a complete tutorial on randomized response theory, rather than providing an introduction to a technique I thought was interesting. The basic insight that randomness might help is what's important.

The rest of your comment is just you rehashing your intuition, which, as I previously noted, isn't really that dispositive.