March 23, 2008
The White House's mail archiving policy
In response to lawsuits over missing emails, the White House claims that they have been following some somewhat unusual IT practices:"When workstations are at the end of their lifecycle and retired ... the hard drives are generally sent offsite to another government entity for physical destruction," the White House said in a sworn declaration filed with U.S. Magistrate Judge John Facciola.It has been the goal of a White House Office of Administration "refresh program" to replace one-third of its workstations every year in the Executive Office of the President, according to the declaration.
Some, but not necessarily all, of the data on old hard drives is moved to new computer hard drives, the declaration added.
In proposing an e-mail recovery plan Tuesday, Facciola expressed concern that a large volume of electronic messages may be missing from White House computer servers, as two private groups that are suing the White House allege.
Facciola proposed the drastic approach of going to individual workstations of White House computer users after the White House disclosed in January that it recycled its computer backup tapes before October 2003. Recycling -- taping over existing data -- raises the possibility that any missing e-mails may not be recoverable.
Some initial observations:
- Every three years is a fairly fast retirement cycle. For comparison, the IRS depreciation schedule for computers is 5 years.
- It's not clear to me that the hard drive destruction issue is that relevant. When you convert from one machine to another, it's by far easiest to simply move your entire mail archive over, rather than picking and choosing. If you do that, the primary difference between the old and new computers in terms of what data is available is going to be remanent data from explicitly deleted messages, which obviously is not on the new macine. First, most mail systems don't store data in large flat files (yes, yes, I know about MH, but I think we an assume Karl Rove does not use that) or databases, so it's reasonably likely that anything that old will already have been reclaimed and written over. Second, I would really hope that if the White House wants to securely delete something, they do better than just hitting the delete key and hoping.
- I wonder what mail server logs are available. Even if the data has been deleted, many mail servers keep extensive logs. This could be used both for traffic analysis and as a guide to what should be found with enough effort. Of course, there's always the chance of remanent data on the server as well.
- What you want is to have confidence that the data you want retained really is retained and that the data you want destroyed really is destroyed, not to rely on the relatively unpredictable properties of your media. It doesn't sound to me like this policy really achieves either. Of course, there is always the possibility that the White House is playing dumb and/or lying, but incompetence wouldn't exactly shock me either.
Posted by ekr at 12:50 PM | Comments (3)
March 20, 2008
bibxml2rfc: a bibliography tool for xml2rfc
After all my complaining about the xml2rfc bibliography system, I decided to do something about it. I thought for a while about hacking xml2rfc itself, but after spending a while reading the crufty tcl code in xml2rfc, I decided it might be easier to do a secondary bibliography management tool in the style of bibtex.There are two tools:
bibxml2rfc:the bibliography manager. It runs through the XML file, finds everything that looks like a reference, and the builds a pair of bibliography files (one for normative and one for informative) for inclusion into your XML file. It automatically fetches reference entries for RFCs and Internet Drafts. You can use ordinary XML inclusion techniques so you don't need to modify the XML file reference section when the bibliography changes.bibxml2rfc-merge:a tool to make merged files for submission. The Internet Drafts submission tool won't accept a bunch of separate XML files, so bibxml2rfc-merge merges them into one for submission. You don't need this for your ordinary work.
The source can be found at: https://svn.resiprocate.org/rep/ietf-drafts/ekr/bibxml2rfc. Documentation for bibxml2rfc can be found at https://svn.resiprocate.org/rep/ietf-drafts/ekr/bibxml2rfc/bibxml2rfc.txt
Posted by ekr at 8:00 AM | Comments (1)
March 19, 2008
Discrepancies in Sequoia Advantage machines
Ed Felten reports on inconsistencies in the vote totals reported by Sequoia Advantage voting machines in New Jersey (Note: these machines are different from the touch screen machines we looked at in the California TTBR, so I don't have any inside information.) Anyway, the anomaly is that the number of votes for Democratic and Republican candidates doesn't match the number of times that the ballots were activated. If the number of votes were less than the number of ballots, you could explain that as an undervote, but in the results tape Ed shows, the Republican ballot was selected 60 times and there were 61 votes!I haven't thought much about potential causes (Ed's commenters theorize) but my money is on simple bugs in the system rather than an attack. If you were an attacker and you had managed to take control of the machine, one of the first things you would want to do is make certain that the results were consistent. Moreover, since this is a primary and not a general election, an attacker wouldn't really benefit from moving votes from one party to another. Much easier (and harder to get caught) to move them from one candidate to another within a party.
Not that this should make you feel any better, since the most basic function of voting machines is to correctly count votes. It shoud also make you wonder about both Sequoia's testing and the testing done by the certification labs. We already know that it's insufficient from a security perspective, but (assuming the problem is in the system), then this seems like it should have been caught by the testing/SQA process.
Sequoia's explanation can be found here. Felten says it's inadequate and that he'll explain why tomorrow. Stay tuned.
Posted by ekr at 9:34 PM | Comments (0)
January 30, 2008
How to control your permalinks in MT4
In the comments, Dave B. asks:Can you share your solution, please? 'Not the world's most intuitive UI' is being _very_ polite! That change has smashed hundreds of my inbound links ... but damned if I can find the place to set the dirify defaults ... :-(
Glad you asked.
Go to "Design | Templates | Archive Templates | Individual Entry Archive".
At the bottom is a pulldown labelled "Archive Mapping". It's probably
set to yyyy/mm/entry-basename.html. Apparently this is
a schematic representation of the names.
If you change it to yyyy/mm/entry_basename.html
(note substitution of underscore for hyphen) you'll get
underscorified permalinks. Totally intuitive, eh?
Posted by ekr at 7:16 AM | Comments (1)
December 28, 2007
Not the world's best IVR system
Just paid a traffic ticket (see here) via Santa Clara county's not very impressive IVR (amazingly, there is no Web system). Among the high-tech features:- Extreme amounts of clipping and distortion to the extent to which you can only understand about 3/4 of the prompts. It's strangely inconsistent—some of the prompts sound fine, but some are almost incomprehensible.
- The interface for entering your citation number and your last name is pretty bad. Say your citation number is H01234. First it asks you if there are any letters. Then it asks you to key in the letter, then prompts you for each possible letter on the number key. Then it asks if there is another letter. If not, it asks you to key in any numbers there are. Then it asks if there are any letters. Repeat. I'm not sure this is actually the IVR's fault, but rather that they have to deal with tickets from a wide variety of jurisdictions. Still, there seem like a bunch of ways to make this better (standardize citation numbers, add a jurisdiction/format code so that the format is predictable, etc.)
- After you've paid your fine, it gives you a (really long) receipt number and then asks you to press one to repeat, two to continue. If you press two, it asks you to key in your citation number. I assume you're done at this point and can hang up, but if not there may soon be a warrant out for my arrest.
Oh, there's also a $12.95 "convenience fee" for using this system to pay your fine by credit card.
Posted by ekr at 8:04 PM | Comments (2)
December 2, 2007
First Notes on Django
I spent yesterday at the IETF coding sprint. The idea here was to rewrite a bunch of the IETF software tools in a more modern system (Django), as well as write a bunch of new tools. I'd never worked with Python or Django before—other than writing test programs—but that didn't stop Cullen Jennings and I from trying to write an IETF charter management tool (still in development). Some initial notes after 15 hours or so of screwing around:
- This kind of framework really does let you get an app up and running quickly. I figure I could have gotten slightly more done working directly in CGI and Perl, but when you factor in that I didn't really figure out how to get Django to do anything useful until about 3:30, Django seems to come out pretty far ahead.
- Django embeds a lot of data in the URL itself rather than
in arguments. The way this works is that there is a map table
from URL patterns (regexes) (Jamie Zawinski, call your office).
So, you get something like this:
(r'^(?P[a-z0-9]+)/$',views.current), (r'^(?P [a-z0-9]+)/all/$',views.list), (r'^(?P [a-z0-9]+)/fake/$',views.fake_wg), (r'^(?P [a-z0-9]+)/add/$',views.add), (r'^(?P [a-z0-9]+)/$',views.current), (r'^(?P [a-z0-9]+)/all/$',views.list), (r'^(?P [a-z0-9]+)/fake/$',views.fake_wg), (r'^(?P [a-z0-9]+)/add/$',views.add), So, looking at the first of these lines, it says that any URL that matches the pattern
^(?P<wgname>[a-z0-9]+)/$gets handled by the functionviews.currentand the first parenthesized match gets passed as an argument via a parameter namedwgname. This is clever, but kind of weird, especially when you realize that these expressions are evaluated in sequence, so there's a chance for collisions. I got bitten by this once already. - It's great to have automated mapping from data types to
database schema, but it would be a lot better if it could
hide the behavior of the relational DB a little better.
To take an example, when you want to have a
many-to-one mapping, (e.g., cards -> deck of cards), you use a "foreign key",
like so:
class Deck(models.Model): brand = models.CharField(max_length=20) class Card(models.Model): suit = models.CharField(max_length=10) value = models.IntegerField() deck = models.ForeignKey(Deck)Thinking about this as a data structure, this does two things, one obvious and one unobvious:
- Create a pointer from any given
Cardobject to the deck it belongs to. This is the forward mapping you'd expect since it's explicitly declared inCard - Create a slot in the
Deckclass calledcard_setwhich contains pointers to all theCardobjects that belong to theDeck. This is fairly unobvious, since it's not explicitly declared, it just happens automatically.
Of course, these aren't just data structures; they are mapped to underlying stuff in the database, so this creates some weirdness. To give you an example, I spent about 30 minutes trying to figure out when I created a
Deckand then inserted aCard(ok, not really, but analogous structures), I ended up with null pointers in both directions. It turns out that you need to do asave()of the container (Deck) before creating the contained object (Card), otherwise it ends up pointing at nothing. I don't really know why—SQL experts should feel free to tell me—and ultimately had to have Fenner tell me how to make it work. - Create a pointer from any given
- It's pretty clear that there are sophisticated and arguably elegant ways to do jobs (e.g., rendering HTML), but I don't know any of them, so I end up just doing things crudely, hardwiring forms into the HTML, etc. Probably if I were going to really work on this kind of stuff regularly, that would be worth learning.
Posted by ekr at 4:30 PM | Comments (7)
November 21, 2007
s2b, a bit diagram tool
When we were writing the most recent version of RELOAD, I found myself doing a lot of bit diagrams, which I hate doing: it takes forever and whenever you change anything (like add one field) you have to totally redo the diagram to get the alignment right. I sometimes wonder whether protocol designer's penchant for 32-bit alignment and padding doesn't have more to do with making the diagrams easier to modify than it does with processor architecture.Anyway, rather than do a lot of hand drawing, I decided to do a tool, which I call s2b (this, by the way, is a classic demonstrations of programmer thinking, since it would have taken me about 5 hrs to do the diagrams and it took me about 12 hrs to do the tool, mostly tuning the output format). The idea here is that you specify your diagrams in a language, which looks a lot like C structures (I actually ripped it off from the specification language, used in SSL/TLS, which was ripped off from C and XDR (which itself was ripped off from C)). You then run your language through s2b and it emits bit diagrams. Here's an example:
struct {
peer_id id;
uint32 uptime;
opaque certificate<65000>;
ip_address_and_port address;
} peer_info_data;
public struct {
string version;
uint8 transport;
peer_info_data peer_info;
} route_log_entry;
And here's the diagram it produces:
STRUCTURE: route_log_entry
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Version Len | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +
| Version |
/ /
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Transport | |
+-+-+-+-+-+-+-+-+ +
| Id |
+ +
| |
+ +
| |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| | Uptime |
+-+-+-+-+-+-+-+-++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| | Certificate Len | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +
| Certificate |
/ /
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
/ /
| Address |
/ /
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Of course, what you really want is to simply embed the language syntax into your document and then have the diagrams automatically embedded into the document at the declaration site.1 s2b also includes a Perl script called s2x.pl that will do this for xml2rfc documents. You just embed your PDU definition in an artwork declaration, like so:
<figure>
<!-- begin-pdu-->
<artwork><![CDATA[
public struct {
route_log_entry entries<65000>;
} route_log;
]]></artwork>
</figure>
One more
point deserves mentioning. Because I was just generating diagrams
and not bothering to automatically generate encoders or decoders,
there is only one built-in primitive type: primitive, which
is just an n-bit long object. You have to define all other types.
Here's the prologue I use:
primitive uint8 8; primitive uint16 16; primitive uint24 24; primitive uint32 32; primitive int32 32; primitive uint64 64; primitive uint128 128; primitive char 8; primitive opaque 8; primitive blob 0; typedef char string<65000>;You tell s2x.pl about this with a begin-prologue comment. You can define types of any size you want. This means that there's nothing special about 8 bits, so you can do flags or whatever.
Not the greatest tool ever, but a lot more fun than drawing ASCII art.
1. You might ask why I didn't simply work directly in a language and eschew the bit diagrams altogether. This is what, for instance, SSL/TLS does. Answer: a lot of people (me not one of them) prefer bit diagrams.
Posted by ekr at 7:36 AM | Comments (3)
November 18, 2007
Curse you, xml2rfc!
As draft season is once again upon us, I am once again spending a lot of time with xml2rfc the unofficial official draft production tool of the IETF. Now, the party line at IETF is that we use ASCII and you can prepare documents in any tool you like, but here on Planet Earth, the combination of nroff bit rot (or at least mind rot) and increasingly stringent formatting requirements has made it a real PITA to do documents in any tool other than xml2rfc. This does not mean that xml2rfc is a joy to use.Before I go on with my litany of complaint, I want to head off at the pass the usual response one hears a this point. Two responses, actually: (1) nobody is making you use it and (2) it's open source software, if you don't like how it behaves, then you can fix it. The first objection is literally true but as a practical matter false. First, everyone else uses it so if you want to collaborate you pretty much have to. Second, as I said earlier, the fact that everyone else uses it means that the IETF has felt free to impose increasingly stringent tests on submitted documents to the point where if you use any other document production system, each time you want to submit a new document you end up spending a lot of time figuring out how to get it through whatever submission filters have been imposed this week. Finally, and most importantly, if you submit your draft to the RFC Editor in XML (you do want your document published as an RFC, right?) they will edit it in XML and so when you want to do a bis version, you have all their copy edits incorporated. On the other hand, if you give them plaintext, then you end up either having to edit their incredibly crufty nroff source or backport all their copy edit changes into your original source format, whatever that was.
The second response, of course, is insane. I just want to write documents and shouldn't have to be an XML hacker, let alone a tcl hacker (I did mention that xml2rfc is written in tcl, right?) to get that task done. "Go fix it yourself" is a fine mantra for tools that are truly optional, but not for those which are increasingly becoming the de facto standard.
OK, back to my theme. As the name suggests, to write something in xml2rfc you start with an XML document in a particular format and the run it through xml2rfc to produce ASCII or HTML or whatever (though ASCII is the normative format). The document looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<!-- edited with XMLSPY v5 rel. 3 U (http://www.xmlspy.com)
by Daniel M Kohn (private) -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!ENTITY rfc2119 PUBLIC ''
'http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml'>
]>
<rfc category="std" ipr="full3978" docName="sample.txt">
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<?rfc toc="yes" ?>
<?rfc symrefs="yes" ?>
<?rfc sortrefs="yes"?>
<?rfc iprnotified="no" ?>
<?rfc strict="yes" ?>
<front>
<title>An Example</title>
<author initials='A.Y' surname="Mous" fullname='Anon Y. Mous'>
<organization/>
</author>
<date/>
<abstract><t>An example.</t></abstract>
</front>
<middle>
<section title="Requirements notation">
<t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL",
"SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY",
and "OPTIONAL" in this document are to be interpreted as
described in <xref target="RFC2119"/>.</t>
</section>
<section title="Security Considerations">
<t>None.</t>
</section>
</middle>
<back>
<references title='Normative References'>&rfc2119;</references>
</back>
</rfc>
Now, there's plenty of stuff to object to here, starting with the (false) notion that I want to be writing my document in XML in the first place. But what I want to talk about right now is how references/bibliographies are done.
Bibliography Locations
xml2rfc has three major reference handling modes:
- Directly inserting the bibliographic information into the file.
- Reading the bibliographic information off files on the disk.
- Reading the bibliographic information off a site on the Internet (the example above).
You can mix and match these with some of the references being in each location.
Now, with RFCs and Internet-Drafts, as opposed to, say, scientific papers, Internet based references are unusually attractive.
- There's an extremely small set of about 10,000 documents that most of your citations come from.
- Those documents have unambiguous naming scheme that everyone agrees on (RFC-XXXX, draft-yyy). This sounds trivial, but it's actually a significant obstacle to reference sharing between collaborators in formats like LaTeX where you need to unambiguously specify the reference key—even in the face of tools like RefTeX to let you search.
- The common documents have a lot of reference volatility drafts get updated regularly and you can feed xml2rfc the draft name without the version number and it will automatically pick up the latest version. This prevents bit-rot.
For all these reasons, you'd think any sane person would use Internet-based references all the time and just use file-based and/or included references (which, btw, are hideous) when they had to reference something that wasn't online. Unfortunately, if you are that sane person, you're about to get screwed: as soon as you go offline (like you want to work on your document on a plane) things go pear-shaped in a really serious kind of way and you get an error that looks like this:
xml2rfc: error: http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml: http package failed
Now, problem one (and a theme we'll come back to in a minute) is
that you pretty much have to be a computer scientist to figure
what this means. HTTP package failed? Maybe I need a new HTTP
package? No, you're not on the Internet. But that's sort of forgivable,
because only a computer scientist would be able to tolerate writing
a document of any length in XML in the first place. And if you think
about it for a minute, you can probably figure out what this
means—though it's worth noting that the web page where
I got this example from is none too clear on the fact that you're
actually getting this reference from the Internet, and though you'd
think the http:// would be a bit of a giveaway, it turns
out that XML people routinely use un-dereferenceable URLs to
identify resources, so there's no guarantee that just because
something starts with http://, you actually
can retrieve it.
Problem two is that in most cases you've built this document before and have just made some trivial change and want to rebuild. Most of the references were present when you rebuilt the document two hours ago before you got on the plane. xml2rfc could have simply cached them at that time and use the local cached copy when disconnected until it has time to check cache validity. Unfortunately, it doesn't, so all your references break as soon as you go offline.
Now, this would all be just annoying except for the fact that that error I showed you above is all xml2rfc gives you when you try to build a document with unresolvable references. Even one unresolvable reference means that it won't process your document at all, so if you change one paragraph, leave the references alone, and want to see what it looks like, too bad! You're SOL! At this point your only choice is to go through and stub out all the unresolvable references so that xml2rfc doesn't freak out, and since they appear all over the document this is a lot of work, and even more work when you have to unstub them when you actually want to build the document. By contrast, in a system like LaTeX/bibtex, you just end up with
[?]at the reference site in the text and empty biblio entries at the end.
The consequence of all this stuff is that people who want to work offline end up using one of the other two reference styles, where there's a local copy. And if you want to collaborate with anyone else, you all either have to have a copy of the entire bibliography strategy gets pretty tedious (did I mention it's scattered across one file for each reference, though there may be some poorly documented or undocumented way to fix that) so you end up just cutting and pasting the bibliography information into the main working file, which, did I mention, is hideous? In the document I'm working on now, over 20% of the lines in the file are devoted to bibliography. But at least it's self-contained.
I can't help myself: here's a typical bibliography entry, cut right out of my document:
<reference anchor="I-D.garcia-p2psip-dns-sd-bootstrapping">
<front>
<title>P2PSIP bootstrapping using DNS-SD</title>
<author fullname="Gustavo Garcia" initials="G" surname="Garcia">
<organization></organization>
</author>
<date day="25" month="October" year="2007" />
<abstract>
<t>This document describes a DNS-based bootstrap mechanism to
discover the initial peer or peers needed to join a P2PSIP
Overlay. The document specifies the use of DNS Service Discovery
(DNS-SD) and the format of the required resource records to
support the discovery of P2PSIP peers. This mechanism can be
applied in scenarios with DNS servers or combined with multicast
DNS to fulfill different proposed P2PSIP use cases.</t>
</abstract>
</front>
<seriesInfo name="Internet-Draft"
value="draft-garcia-p2psip-dns-sd-bootstrapping-00" />
<format target="http://www.ietf.org/internet-drafts/draft-garcia-p2psip-dns-sd-bootstrapping-00.txt"
type="TXT" />
</reference>
And here's the reference entry it actually produces:
[I-D.garcia-p2psip-dns-sd-bootstrapping]
Garcia, G., "P2PSIP bootstrapping using DNS-SD",
draft-garcia-p2psip-dns-sd-bootstrapping-00 (work in
progress), October 2007.
Now, ask yourself the following question: why, exactly, does this biblio entry need to contain the abstract?!?! The URL is also included, though not used here, but that's so xml2rfc can make clickable links in an HTML version. I guess putting the abstract in the reference would let some future JavaScript weenie pop up the abstract if you hover over the reference. That would sure be useful! The real answer, of course, is that that was what was in the file we sucked down from the Internet and we're sure as heck not going to edit it, lest we break the XML.
Bibliography Errors
So, what happens if you screw up stuffing in some reference, which, since
there are three places to do this, happens depressingly often.
Let's see what happens if we screw up one of these.
First, let's delete the reference from the body of the document. This produces the following result:
xml2rfc: warning: no <xref> in <rfc> targets <reference anchor='RFC2119'> around input line 10
Now, this isn't so bad. Once you translate the xmlese, it says that there's a reference anchor (i.e., something you can reference) for RFC 2119 that isn't targeted by an xref (i.e., a reference in the text.) So, this is a superfluous bibliography entry. Also, the good news is that in this case it will still make the document.
Now, let's put that back and try removing the
&rfc2119;marker at the end. That produces this error:
xml2rfc: error: can't read "xref(RFC2119)": no such element in array around input line 35
Uh... yeah.
So, what this means, literally, is that there's some array (xref?) doesn't contain the element "RFC2119". If you think like a computer programmer as opposed to someone who just wants to produce documents, you might guess that you're reference to RFC2119 doesn't point anywhere. But how do I populate that array. Well, if you go back to the example, you can probably figure out how to fix this, which is good, because you have to fix it if you want the document to build past the point of the first undefined reference!
Finally we come to the piece de resistance: what happens if you don't put in the entity declaration at the top? You get this:
xml2rfc: error: not expecting pcdata in <references> element around input line 4
1 in "internally-preprocessed XML"
Syntax:
41:<references title="Normative References">
40:<back>
8:<rfc category="std" ipr="full3978" docName="sample.txt">
"Not expecting pcdata"? What the fuck does that mean?
Luckily, you have me to translate for you. What this means is that the
string &rfc2119; in the references section is an
entity reference, but because you haven't defined the
entity, the parser treats it as character data (PCDATA),
which isn't permitted at this location in the XML document
by the DTD. Hence, "not expecting pcdata". Useful, right?
As if that weren't bad enough, even once you've decoded this error message it doesn't tell you which entity you've forgotten to define. Sure, there's a line number, 41, and here's line 41:
<references title='Normative References'>&rfc2119;</references>So far so good, but unfortunately the line number here is that of the
<references> element,
not of the offending missing entity. Put as many valid
references in there as you want and you still get the
same line number. In order to figure out the offending
entity, you either need to match up the front and the back
of the documents or progressively cut references out of the back till
the error goes away.1
The basic reason you're getting this error instead of something useful
like "Go include a <!ENTITY rfc2119 ... production,
at the top of the file, you dummy" is that this part of the references
system is done purely using XML mechanisms, so you get an
XML failure before some better error handling mechanism
comes into play. This isn't the only time xml2rfc does this to
you either, it's just the most offensive.
And that, children, is how the Internet standards sausage gets made. Outstanding!
1. Apparently you can use other tools to diagnose this too, but xml2rfc won't help you out.
Posted by ekr at 10:10 AM | Comments (3)
April 24, 2007
Searching on movie computers
Picked up The Andromeda Strain at the library. Not as good as the book, though reasonably well done. [The book, btw, is extremely well executed, totally inconsistent with Crichton's subsequent descent into hackery. Crichton has (or had) a good sense for how science works as well as a good ear for how to generate convincing faux technical detail.]One thing struck me, though. As with Asimov 25+ years earlier [*] the extrapolations of what computers can and can't do is strangely off. The movie includes the following gadgets:
- An automatic medical analyzer with voice recognition (in the book)
- A computer capable of fully simulating the growth of the Andromeda organism based purely on X-ray crystallography and electron microscopy (not in the book)
However, when they're studying the conditions under which Andromeda can grow, some scientist has to watch a CRT display the results of each growth plate one after another looking for anomalies. Better yet, the anomalous results display the legend "NO GROWTH". Of course, this sort of pattern matching is childs play for software. Why the computer can't detect these is not explained. Incidentally, this scene does not seem to be in the book. On the contrary, Crichton explicitly has the computer flag the relevant growth conditions, which is exactly what you'd want.
This blind spot about how to search for stuff on a computer would be echoed in Jurassic Park where the girl hacker navigates some 3D VR-style ile manager ("It's a UNIX system! I know this!") that no hacker would be caught dead using. (Though as I recall said file manager was a real demo on SGI systems of the time). Of course, both movies were made in time periods where not that many people had first-hand experience with computers and your average disk drive couldn't really hold enough files to make searching necessary. Now that nearly everyone has direct experience with search engine, I'd be surprised to see this particular mistake made in contemporary fiction. More likely screenwriters will assume that you can just type any random thing into Google and the information will just pop out. (I actually remember someone mentioning this about a recent movie but can't recall which one it was.)
Posted by ekr at 11:32 PM | Comments (2)
April 22, 2007
Note to self: MacOS X single user mode
One of the nice features of OS/X is that since it's BSD if you forget your password you can reboot single user and just change it. You just press Apple-S and boot. Or so I thought. I tried it today on the Mac Mini which I use as an ad hoc DVD player, only to have it come up normally. Natural thought: maybe it's the peripherals.
Step 1: replace the TV with a monitor. No joy.
Step 2: replace the wireless keyboard and mouse with a regular keyboard
(no mouse). Machine comes up in single user but I can't seem to type anything. Or rather, I can type but nothing happens.
Step 3: scrounge an old USB mouse and plug it in. Hooray, I can type. Mission accomplished!
Now I just have to pull all that stuff off and wire it back up the way it was. I guess that's some kind of incentive to remember my password.
Posted by ekr at 11:47 PM | Comments (1)
March 14, 2007
Back doors and Open Source
As you may have heard, someone recently inserted a back door into WordPress 2.1.1. What appears to have happened is that the attacker broke into WordPress's servers and modified the distribution:Longer explanation: This morning we received a note to our security mailing address about unusual and highly exploitable code in WordPress. The issue was investigated, and it appeared that the 2.1.1 download had been modified from its original code. We took the website down immediately to investigate what happened.It was determined that a cracker had gained user-level access to one of the servers that powers wordpress.org, and had used that access to modify the download file. We have locked down that server for further forensics, but at this time it appears that the 2.1.1 download was the only thing touched by the attack. They modified two files in WP to include code that would allow for remote PHP execution.
What's interesting about this incident is how it was detected. In many such incidents, what's detected is that there is something funny on the systems hosting the code (e.g., the CVS repository in the 2003 Linux kernel backdoor attempt. The relevant change then gets tracked down and studied very carefully. According to the WordPress people, however, what happened here was that someone detected the vulnerability and then they tracked it to an intrusion.
This is interesting because there are two major ways for someone to insert a back door into some piece of open source software:
- Compromise one of the development or distribution servers.
- Make a legitimate change that contains a back door.
The first is easier to do because most systems aren't very well secured and so all you need to do is break in and replace the distribution. However, it's also a lot easier to detect. First, breaking in often leaves signs. Second, the source code typically exists in multiple copies and so eventually the inconsistency gets noticed, especially because it's typically the download servers that are easiest to attack but they can be compared with the main source repository.
The second is harder to do but can be much more effective. You just submit a fair-sized patch to the project that happens to contain a security vulnerability. Most patches don't undergo particularly thorough review, especially on non-security projects, and it's quite easy to hide a buffer overflow or other vulnerability in your code once you get above a hundred lines or so. This is a little more work since you actually have to implement a semi-useful feature, but the bar for that is pretty low on most projects as well. I don't think I've ever heard of a backdoor being inserted this way. It's hard to know if that reflects that nobody does this or just that nobody catches it.
Posted by ekr at 8:11 AM | Comments (2)
January 20, 2007
iPhone and lockin
OK, I have to admit that the Not Yet Released Device Potentially To Be Known As The iPhone (NYRDPTBKATi) looks pretty sweet, but the level of lockin significantly detracts from the overall coolness. There are actually two issues here, one bogus and one real.
DRM
The bogus
one, raised by Randall Stross in the
Times is the FairPlay DRM:
Here is how FairPlay works: When you buy songs at the iTunes Music Store, you can play them on one and only one line of portable player, the iPod. And when you buy an iPod, you can play copy-protected songs bought from one and only one online music store, the iTunes Music Store.The only legal way around this built-in limitation is to strip out the copy protection by burning a CD with the tracks, then uploading the music back to the computer. If youre willing to go to that trouble, you can play the music where and how you choose the equivalent to rights that would have been granted automatically at the cash register if you had bought the same music on a CD in the first place.
This is, of course true, but sort of irrelevant. First, you're quite free to buy physical CDs and rip them yourself. iTunes will even rip them for you. At least with the iPod and presumably with the NYRDPTBKATi, there won't be any copy protection on them at all. It's true that the iPod file format obfuscates the locations of the files on the disk, but they're all there and you can get 3rd party programs which know how to read the format. The vast majority of music gets into iPods by being ripped, not downloaded.
Second, the issue isn't the iPod or NYRDPTBKATi, but rather iTMS, which imposes the DRM on the way out the door. Stross says this in the article but some misses the implication:
This claim requires willful blindness to the presence of online music stores that eschew copy protection. For example, one online store, eMusic, offers two million tracks from independent labels that represent about 30 percent of worldwide music sales.Unlike the four major labels Universal, Warner Music Group, EMI and Sony BMG the independents provide eMusic with permission to distribute the music in plain MP3 format. There is no copy protection, no customer lock-in, no restrictions on what kind of music player or media center a customer chooses to use the MP3 standard is accommodated by all players.
In other words, it's quite possible to play non-DRMed files (what else is a podcast, after all?), it's just that (1) the music users want isn't available (2) the users don't know where to get it or (3) the UIs for getting it are too annoying. if it's really true that the major labels are willing to go non-DRM than this sounds like a great marketing opportunity for someone to make a really good non-DRM online music store. In any case, Stross's quarrel isn't with the iPod but with iTMS.
Programmability
This brings us to the real issue: programmability. According to this
article the NYRDPTBKATi isn't going to be an open platform. I.e., you won't be able
to load your own applications onto it. Apple has advanced two major
arguments for why this is OK: protecting the network from rogue
applications and protecting the stability of the device.
Here's Jobs advancing the first reason:
But its not like the walled garden has gone away. You dont want your phone to be an open platform, meaning that anyone can write applications for it and potentially gum up the provider's network, says Jobs. You need it to work when you need it to work. Cingular doesnt want to see their West Coast network go down because some application messed up.
Look, this is mostly nonsense. Yes, it's true that programmable computers can do damage to the Internet (cf. zombies, spam, DDoS, etc.) but this isn't primarily an issue of people installing the wrong third party software but rather of their machine being remotely compromised via vulnerabilities in the existing software—primarily stuff installed by the OS manufacturer. I should mention that Apple isn't giving out SDKs for the iPhone, so it's going to be harder for malware authors to program to it than (say) Windows, but that's only a temporary obstacle if it becomes an attractive attack target. There are of course ways to stop third-party malware from being loaded on at all (e.g., signed code) but the level of defense that Apple employs on the iPod doesn't suggest that they're too likely to have done anything like that here. I'd imagine they're just hiding the specs and the SDK and maybe churning the API/ABI occasionally to make it more inconvenient to write a real product.
More importantly, the danger in rogue applications isn't primarily to the access network but rather to machines other places on the Internet. It's actually very easy for Cingular to detect when a device is doing something dangerous to their network and shut it down. And to the extent to which it's not easy, Cingular has much bigger problems since they're already quite willing to sell you Windows Mobile and Palm smartphnes, which are programmable.
The second argument Apple is advancing is that letting end-users run arbitrary third-party apps will potentially destabilize the handset, contributing to a bad user experience.
We define everything that is on the phone, he said. You dont want your phone to be like a PC. The last thing you want is to have loaded three apps on your phone and then you go to make a call and it doesnt work anymore. These are more like iPods than they are like computers.The iPhone, he insisted, would not look like the rest of the wireless industry.
These are devices that need to work, and you cant do that if you load any software on them, he said. That doesnt mean theres not going to be software to buy that you can load on them coming from us. It doesnt mean we have to write it all, but it means it has to be more of a controlled environment.
So, this is vaguely more reasonable, especially considering Apple's well-known fetish for the providing the optimal UI experience. Still, it's not particularly convincing. I've loaded several 3rd party apps on my Treo and haven't noticed it destabilizing the phone functionality. A modern O/S like OSX should be quite capable of protecting applications from each other—that kind of process isolation is one of the major functions of the OS. I haven't noticed any of the 3rd-party apps I run on my OS/X boxes being a source of massive instability.
Does it matter?
I'm probably unusual in that I'd actually like to be able to
do some development on a handheld device, which would obviously
be a problem if there's no SDK. That would be a big motivator
for getting something based on a real operating system rather
than PalmOS.
But even ordinary users may
find this kind of lockin inconvenient.
I don't know what applications Apple intends to provide on the NYRDPTBKATi,
and the truth is that they provide a pretty reasonable set on
your Mac, but even so I've installed Firefox, MS Office, the
Palm software suite, and Windows Media Player. Apple offers their
own versions of some of this stuff and it will be interesting to
see if they decide you should have to run Safari instead of
Firefox or Keynote instead of PowerPoint.
One of the nice things about having a general purpose computer
is that you get to make these decisions for yourself rather
than having Steve Jobsa make them for you.
Posted by ekr at 10:39 AM | Comments (6)
December 30, 2006
A complicated bug
After a few years of deployment, our passive capture system is pretty stable, but there's nothing so stable that it can't be screwed up by some programmer. I was recently the author of such a screwup when I decided to make the system multithreaded.It looked like a straightforward job, and it worked fine in smoke testing, but almost as soon as we started running a lot of traffic through it, it started to crash consistently with memory errors (SEGVs) in the TIME_WAIT cleanup code.
TIME_WAIT state
A little background is probably in order here. TCP doesn't have
connection identifiers; the way you distinguish packets from
different connections is by the source and destination IP
addresses and port numbers. However, those aren't globally
unique, so it's potentially possible to confuse packets
from two different connections with the same host/port
quartets if they occur sufficiently close together. In
order to avoid this problem, when a TCP connection is
closed it's put into a TIME_WAIT state. Any packets that
potentially correspond to that connection are rejected until
TIME_WAIT expires. The timer is set to be twice the maximum
segment lifetime (2MSL). This makes sure that there aren't
any packets from the closed connection hanging around the
network. Once the 2MSL timer expires, then the host/port
quartet can be reused.
TIME_WAIT implementation
Our system has two relevant data structures. Every connection
has an associated tcp_conn object. These objects
are stored in a hash table keyed by the host/port quartet.
Whenever a new packet comes in we look up the connection
in the hash table and then process the packet.
The way we implement TIME_WAIT in our code is using a sorted queue. When a connection is closed we put it at the end of the 2MSL queue (but it stays in the connection table). Since the timer is always the same length, this automatically sorts things. Periodically we traverse the queue from the front discarding any connection whose timer has expired. The data structure looks like this:
The crash
All this stuff worked great until we tried to multithread the system. The implementation I decided on was to have a bunch of worker threads (one for each processor), each of which works semi-independently. The global TCP data structures (connection table, 2MSL queue, etc.) are shared by all the threads. Of course, this means that any reads and writes have to be surrounded by thread locks, which I duly added.
As noted prevously, this all worked fine in smoke testing, but when we started to actually run performance tests, it started to crash pretty reliably. You'd be running the 2MSL queue flush routine and one of the entries in the queue would be total garbage and you'd get a segmentation fault. At this point you're probably thinking we had a locking problem and that's what we thought too. Sure enough, when we went looking for it, we found that in a moment of insanity I'd surrounded the routines that modified the conn table and the 2MSL queue with read (shared) locks rather than write (exclusive) locks. This is obviously bad and I fixed it. So far so good, right? Wrong. A little more testing revealed that we were still getting this kind of crash and detailed code inspection didn't turn up any missing locks.
Debugging the Problem (1)
Concurrency problems are notoriously hard to debug because
they generally depend on timing. So, when you bring the program into
the debugger and single step through it all the timing
changes and you can't reproduce the problem. The situation
was even worse in this case because you had to run tens
of thousands of connections through the system before it
would crash.1 To make matters even more confusing, setting
the number of worker threads to 1 didn't fix the problem,
casting doubt on the whole concurrency theory.
We relatively quickly determined that some connections that were supposed to have been destroyed were somehow ending up back on the 2MSL queue. This actually makes the situation more rather than less confusing, since there are two questions.
- The destructor (yeah, this is C, but it's still a destructor) is supposed to zero the connection data structure.
- How did it get back on the queue?
The destructor looked fine, so we had to get primitive. We added some sentinel values before and after the queue pointers to detect over and underruns. No joy — they were never triggered. We and started littering the program with printfs but nothing obvious appeared. Finally, we added a secondary state variable in the connection that was set in the 2MSL queue reaper proper. Sure enough, when these connections got resurrected, they already had the variable set. So, why weren't they being destroyed?
A bunch more printfs later we had the answer. The 2MSL queue reaper locked the global TCP state data structures--as it should. Then the destructor also tried to lock those data structures. This makes sense because it can be called in contexts where they're not already locked. Unfortunately, pthreads doesn't support recursive locks, so the second lock fails even though it's in the same thread. Unfortunately, we don't ignore the error but instead of crashing with a "can't happen" like we probably should have (though ignoring destructor failures is a pretty common practice) we handle the error smoothly and exit the destructor function without zeroizing the connection structure or removing it from the connection table.
Obviously, this is a bug and it's somehow related to our real bug. More importantly, fixing it by avoiding double locks makes the crash go away. However, it doesn't really explain what the problem was: before we call the destructor we do remove the connection from the TIME_WAIT queue, so this should just look like a memory leak. Why does it cause crashes? I hate mysteries and I hate leaving bugs partially fixed, so we kept looking.
Debugging the Problem (2)
Let's review what we know so far. Connections close and get placed
at the end of the 2MSL queue. That queue is reaped from the front and
any connections whose timers have expired are destroyed. However,
due to our faulty thread locking code, those entries are removed
from the 2MSL queue but never destroyed. So, the key question
is how this is different from if things were functioning correctly:
- The connections are still in the connection table.
- The connection structures are never zeroized.
- The connection structures are never freed.
It's hard to see how (3) can be causing our problem. Sure, it leaks memory but that causes a different kind of crash—running out of memory—and we're not close to that yet. Similarly, it's hard to see how failing to zeroize can be causing the problem since this memory shouldn't be being used anyway. The only reason we zeroize is to help detect bugs. So, it must be something about it being in the connection table: more packets must be coming in and be being processed using this connection. Some printfs verify this. But why does that cause a problem?
The answer to this question isn't intuitive so what we ended up doing was brute forcing it. Once we realized the problem was reuse we littered that code branch with checks for whether the 2MSL queue contained bad data or not (luckily it was always ending up at the front of the queue, which is puzzling but turns out to have been a clue that we were only able to interpret in retrospect.) This let us track down where it was getting corrupted.
The answer turns out to be TIME_WAIT assassination. Ordinarily, when packets come in for a connection in TIME_WAIT we simply discard them. However, if you receive a SYN (connection open) packet that matches a connection in TIME_WAIT but the sequence number clearly is for a new connection, you cancel the 2MSL timer and start the new connection. (This actually isn't recommended behavior and not all stacks do this, but it's common enough that a passive decoder needs to implement it.) Ordinarily, this isn't something you see a lot, but because the traffic we're running is from a simulator with only a few client machines, you see a lot of host/port quartet reuse.
So, what happens when a connection is TIME_WAIT assassinated?
- The packet comes in and we find the relevant connection in the connection table.
- We notice we're in TIME_WAIT state.
- If the packet is a SYN and the sequence number is in the right range we need to assassinate.
- We dequeue the connection from the TIME_WAIT queue.
- We remove it from the TCP connection table.
- We destroy it.
Ordinarily this works smoothly, but what happens here is that connections which have already been partially destroyed stick around and get assassinated. Those connections aren't actually on the 2MSL queue, having been removed when we reaped them earlier. However, in the queue implementation we're using (BSD sys/queue), entries don't know that they have been removed from a queue, so when you re-remove them, you get weird side effects, namely a spurious re-insert.
Remove; Remove == Insert
So why do double removes of a queue entry cause spurious reinserts?
The generic removal procedure for an element E in a doubly
linked list L is something like:
1 If E->prev==NULL (E is the first element in L) 2 Set L->first = E->next 3 else 4 Set E->prev->next = E->next 5 If E->next == NULL (is the last element in L) 6 Set L->last = E->prev 7 else 8 Set E->next->prev = E->prev
Note that we don't need to change E at all, since it's
being taken out of the list anyway. So, what happens if E
is the first element in the list2, we remove it, then we remove
it again? Well, when we do step 1, E->prev==NULL3
so we set L->first = E->next. Now, E->next
points to whatever was right after E on the list at
the time of removal. That element could have been freed, destroyed,
even re-allocated by now. Since E
probably wasn't also last, we jump to step 8, which
might cause a crash, but probably won't4.
The end result is that L->first
ends up pointing to some random memory location. When we finally
get around to looking at L, our code freaks out.
Note that in this particular case, what happens is that whatever entry was immediately after the entry getting doubly removed gets reinserted at the head of the queue, even if that entry has been deleted and/or destroyed long ago. What makes this so hard to track down is that the offending connection isn't the one that gets reinserted into the connection queue, so a lot of common debugging techniques (e.g., trying to track the fate of the offending connection) don't work well.
1. Don't tell me to use Valgrind. It crashed
and didn't find anything. I miss Purify (which might not
have found anything but most likely wouldn't have
crashed).
2. Which is what always happens in 2MSL queue
reaping. Because the connections are sorted on the queue, we
never reap a connection unless all previous connections have
also been reaped.
3. Note that there's a dependency on the
way I'm checking whether E is first. If I compared
to L->first the behavior would be different and we'd always
end up executing line 4, which would (in this case) induce a crash
because E->prev == NULL. Of course, BSD sys/queue uses
quite different algorithms.
4 Remember that destruction of E->next probably
failed as well, so this isn't even a free memory write. And FMR's don't
generally cause immediate crashes, though Purify/Valgrind/etc. do catch
them..
UPDATE: Fixed a bug in line 4, found by Kevin Ballard.
Posted by ekr at 6:46 AM | Comments (4)
December 9, 2006
A small optimization
A few weeks ago I helped a friend calibrate the color on his photo printer. The basic idea here is that you print out a test sheet containing patches of various colors. You then scan each patch with a handheld sensor. The color correction software compares the intended color to the actual printed color and generates a correction profile for the printer.The process is simple but tedious. The basic test sheet contains a hundred or so patches and you need to scan each one individually. Something like this:
Once you've printed the test sheet, you position the sensor over each patch, press the botton on the top of the sensor, wait for it to scan, and then move onto the next patch. Aside from being boring, this procedure is kind of error prone. It's pretty easy to forget to press the button and get off by one or two targets. And since the whole point of the exercise is to deal with the printed colors being wrong, the software doesn't freak out when the colors don't match. Luckily, it does tell you which patches it thinks you've scanned, so by the time you get to the end of the row, you do figure out it but then you probably have to rescan it.
This isn't awful, but there's an easy hack that would make it better. The reason that the software can't detect that you've made a mistake is that the the test sheet is basically a color ramp, so color patch n and color patch n+1 are quite similar. Simply re-arranging the colors so that contiguous colors are quite different would make it possible to automatically detect mistakes (unless the printer is incredibly screwed up).
Posted by ekr at 8:52 PM | Comments (5)