Software: May 2009 Archives


May 13, 2009

USENIX conferences use this tool called HotCRP to manage the conference review process. Like other systems, you rate papers on a numeric scale (1-5). When you ask for a summary of the papers, the system displays a cute little graphic of how many people have chosen each rating (and even a cute little mouseover that displays the mean and SD), but once you have more than a few papers to look at, it's a bit inconvenient to get a sense of the distribution. Maybe the PC chairs have a tool, but PC members don't. Luckily, it's easy to extract it from the HTML source. Here's a Perl script that will suck out the scores and compute the mean:
    #!/usr/bin/perl
    # Reads the saved HotCRP HTML on stdin; prints one mean score per paper.
    while (<STDIN>) {
        next unless /GenChart\?v=([\d,]+)/;
        @scores = split(/,/, $1);    # counts for scores 1, 2, 3, ...
        $sum = 0;
        $ct = 0;
        for ($i = 0; $i <= $#scores; $i++) {
            $sum += ($i + 1) * $scores[$i];
            $ct += $scores[$i];
        }
        next unless $ct;
        $mean = $sum / $ct;
        print "$mean\n";
    }

You just do "save as"1 and shove the file into the script on stdin. Extracting standard deviations and the names of the papers are left as exercises for the reader.
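For what it's worth, the standard deviation (the other number in the mouseover) falls out of the same histogram with only a little more arithmetic. Here's a sketch in Python rather than Perl; the input is the comma-separated count list that the GenChart URL encodes, where index i holds the number of reviewers who chose score i+1.

```python
import math

def score_stats(histogram):
    """Given a list of counts (index i = count of score i+1),
    return (mean, population standard deviation) of the scores."""
    total = sum(histogram)
    mean = sum((i + 1) * n for i, n in enumerate(histogram)) / total
    var = sum(n * ((i + 1) - mean) ** 2
              for i, n in enumerate(histogram)) / total
    return mean, math.sqrt(var)

# e.g. GenChart?v=0,1,3,2,0 means one 2, three 3s, two 4s
mean, sd = score_stats([0, 1, 3, 2, 0])
```

Note that this computes the population SD (divide by n); if HotCRP's mouseover uses the sample SD (divide by n-1), adjust the divisor accordingly.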

1. Note: you need to save as source not a complete Web page/directory. Otherwise the browser helpfully saves the images and rewrites the links to point to your local disk, which breaks everything. Took me a while to figure out what the heck was going on with that one.


May 12, 2009

Ed Felten posts about the Minnesota Breathalyzer case (I've written about it here):
The problem is illustrated nicely by a contradiction in the arguments that CMI and the state are making. On the one hand, they argue that the machine's source code contains valuable trade secrets -- I'll call them the "secret sauce" -- and that CMI's business would be substantially harmed if its competitors learned about the secret sauce. On the other hand, they argue that there is no need to examine the source code because it operates straightforwardly, just reading values from some sensors and doing simple calculations to derive a blood alcohol estimate.

It's hard to see how both arguments can be correct. If the software contains secret sauce, then by definition it has aspects that are neither obvious nor straightforward, and those aspects are important for the software's operation. In other words, the secret sauce -- whatever it is -- must be relevant to the defendants' claims.

I'm not sure this argument is right in the general case. Ignoring the specific case of breathalyzers, if I want to develop a new piece of software, it's pretty helpful to have a worked example to rip off. To take a simple case, if I wanted to build a new NAT (a pretty well-understood technology) I'd rather start with some existing package than build everything myself. It's not that there is anything secret in one of these gizmos, just that it would give you something to imitate/test against, etc. This would be especially true if I could actually copy the source, not just mimic it. Conversely, if I were the vendor of an existing system, I wouldn't necessarily want to assist my competitors.

Three further observations: First, I expect it's a lot less of an advantage to have the source code for a device like a breathalyzer or a voting machine, for two reasons. First, it's not a generic PC wired to a bunch of network ports: there's a bunch of sensors and stuff that can't be sourced from your average OEM network gear manufacturing plant (this is more true for breathalyzers than voting machines). Second, a lot of the business of selling something like this is engaging with law enforcement, voting officials, etc. There's more to it than just getting your boxes on the shelf at Fry's. Consequently, it's probably not as much of a competitive advantage to save on engineering costs as it might be in some other business.

Second, if every breathalyzer vendor is required to disclose their source code, it makes it a fair bit harder for your competitors to just steal your source code, since, at least potentially, you can see their source code and have an opportunity to demonstrate that it's a copy of yours. Of course, this doesn't rule out less blatant copying, using the original system as a template/regression test system, etc.

Third, we're kind of stretching the definition of "trade secret" here, at some abstract level. As Ed observes, if the system is straightforward, what's the secret? On the other hand, it's fairly consistent with the relatively expansive tech industry definition of trade secret.


May 3, 2009

The Minnesota Supreme Court has ruled that defendants in DUI cases can get discovery of breathalyzer source code. (Ruling here). Apparently this puts a pretty serious crimp in Minnesota DUI proceedings because the manufacturer won't provide the source code:
The state's highest court ruled that defendants in drunken-driving cases have the right to make prosecutors turn over the computer "source code" that runs the Intoxilyzer breath-testing device to determine whether the device's results are reliable.

But there's a problem: Prosecutors can't turn over the code because they don't have it.

The Kentucky company that makes the Intoxilyzer says the code is a trade secret and has refused to release it, thus complicating DWI prosecutions.

"There's going to be significant difficulty to prosecutors across the state to getting convictions when we can't utilize evidence to show the levels of the defendant's intoxication," said Dakota County Attorney James Backstrom.

"In the short term, it's going to cause significant problems with holding offenders accountable because of this problem of not being able to obtain this source code."

I can't find the original filings, which include an affidavit from David Wagner, so I'm not sure I'm seeing the best argument for this position. That said, however, I'm not sure that source code analysis is really the best way to determine whether breathalyzers are accurate.

At a high level a breathalyzer is a sensor (apparently either an IR spectrometer or some sort of electrochemical fuel cell gizmo) attached to a microprocessor and a display. The microprocessor reads the output of the sensor, does some processing, and emits a reading. Obviously, there are a lot of things that can go wrong here, and this page describes a bunch of problems in the source code of another machine, mostly that there seems to be a bunch of ad hoccery in the way the measurements are handled. For instance:

3. Results Limited to Small, Discrete Values: The A/D converters measuring the IR readings and the fuel cell readings can produce values between 0 and 4095. However, the software divides the final average(s) by 256, meaning the final result can only have 16 values to represent the five-volt range (or less), or, represent the range of alcohol readings possible. This is a loss of precision in the data; of a possible twelve bits of information, only four bits are used. Further, because of an attribute in the IR calculations, the result value is further divided in half. This means that only 8 values are possible for the IR detection, and this is compared against the 16 values of the fuel cell.

So, maybe this is bad and maybe it isn't. But it's not clear that you can determine the answer by examining the source code. Rather, you want to ask what the probability is that a system constructed this way would produce an inaccurate reading. If, for instance, the A/D converters have an inherent error rate/variance that's large compared to the sensitivity that they read out in, then it's not crazy to divide down to some smaller number of significant digits—though I might be tempted to do it later in the process. More to the point, any piece of software you look at closely is going to be chock full of errors of various kinds, but it's pretty hard to tell whether they are going to actually impact performance without some careful analysis.
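To make the precision-loss point concrete, here's a quick Python sketch of the scaling the quoted report describes: integer-dividing a 12-bit A/D value (0-4095) by 256 collapses it to 16 possible levels, and halving the IR result leaves only 8. The function names here are mine for illustration, not anything from the actual Intoxilyzer code.

```python
def quantize(raw, divisor):
    """Integer-divide a raw A/D reading, as the report describes."""
    return raw // divisor

# A 12-bit converter produces 4096 distinct raw values.
raw_values = range(4096)

# Dividing by 256 leaves 16 distinct final values...
fuel_cell_levels = {quantize(v, 256) for v in raw_values}

# ...and halving the IR result leaves 8.
ir_levels = {quantize(v, 256) // 2 for v in raw_values}

# Two raw readings 255 counts apart (over 6% of full scale)
# can map to the same final value.
assert quantize(0, 256) == quantize(255, 256)
```

Whether this matters depends, as discussed below, on how the quantization step compares to the sensor's own noise floor.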

On the flip side, actually reading the source code is a pretty bad way of finding errors. First, it's not very efficient in terms of finding bugs. I've written and reviewed a lot of source code and it's just really hard to get any but the most egregious bugs out with that kind of technique. Second, even if we find things that could have gone wrong (missed interrupts, etc.) it's very hard to determine whether they caused problems in any particular case. [Note that you could improve your ability to recover from some kinds of computational error by logging the raw data as well as whatever readings the system produces.] Third: there are a lot of non-software things that can go wrong. In particular, you need to establish that what the sensors are reading actually corresponds to the alcohol level in the breath, that that actually corresponds to blood alcohol level, that the sensors are reading accurately, etc.

Stepping up a level, it's not clear what our policy should be about how to treat evidence from software-based systems; all software contains bugs of one kind or another (and we haven't even gotten to security vulnerabilities yet). If that's going to mean that all software-based systems are useless for evidentiary purposes, the world is going to get odd pretty fast.