Curse you, xml2rfc!

As draft season is once again upon us, I am once again spending a lot of time with xml2rfc, the unofficial official draft production tool of the IETF. Now, the party line at IETF is that we use ASCII and you can prepare documents in any tool you like, but here on Planet Earth, the combination of nroff bit rot (or at least mind rot) and increasingly stringent formatting requirements has made it a real PITA to do documents in any tool other than xml2rfc. This does not mean that xml2rfc is a joy to use.

Before I go on with my litany of complaint, I want to head off at the pass the usual response one hears at this point. Two responses, actually: (1) nobody is making you use it and (2) it's open source software, so if you don't like how it behaves, you can fix it. The first objection is literally true but as a practical matter false. First, everyone else uses it, so if you want to collaborate you pretty much have to. Second, as I said earlier, the fact that everyone else uses it means that the IETF has felt free to impose increasingly stringent tests on submitted documents, to the point where if you use any other document production system, each time you want to submit a new document you end up spending a lot of time figuring out how to get it through whatever submission filters have been imposed this week. Finally, and most importantly, if you submit your draft to the RFC Editor in XML (you do want your document published as an RFC, right?) they will edit it in XML, and so when you want to do a bis version, you have all their copy edits incorporated. On the other hand, if you give them plaintext, then you end up either having to edit their incredibly crufty nroff source or backport all their copy edit changes into your original source format, whatever that was.

The second response, of course, is insane. I just want to write documents and shouldn't have to be an XML hacker, let alone a tcl hacker (I did mention that xml2rfc is written in tcl, right?) to get that task done. "Go fix it yourself" is a fine mantra for tools that are truly optional, but not for those which are increasingly becoming the de facto standard.

OK, back to my theme. As the name suggests, to write something in xml2rfc you start with an XML document in a particular format and then run it through xml2rfc to produce ASCII or HTML or whatever (though ASCII is the normative format). The document looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<!-- edited with XMLSPY v5 rel. 3 U (http://www.xmlspy.com)
     by Daniel M Kohn (private) -->

<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
    <!ENTITY rfc2119 PUBLIC '' 
      'http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml'>
]>

<rfc category="std" ipr="full3978" docName="sample.txt">

<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>

<?rfc toc="yes" ?>
<?rfc symrefs="yes" ?>
<?rfc sortrefs="yes"?>
<?rfc iprnotified="no" ?>
<?rfc strict="yes" ?>

    <front>
        <title>An Example</title>
        <author initials='A.Y' surname="Mous" fullname='Anon Y. Mous'>
            <organization/>
        </author>
        <date/>
        <abstract><t>An example.</t></abstract>
    </front>

    <middle>
        <section title="Requirements notation">
            <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL",
            "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY",
            and "OPTIONAL" in this document are to be interpreted as
            described in <xref target="RFC2119"/>.</t>
        </section>

        <section title="Security Considerations">
        <t>None.</t>
        </section>
    </middle>

    <back>
        <references title='Normative References'>&rfc2119;</references>
    </back>

</rfc>

Now, there's plenty of stuff to object to here, starting with the (false) notion that I want to be writing my document in XML in the first place. But what I want to talk about right now is how references/bibliographies are done.

Bibliography Locations
xml2rfc has three major reference handling modes:

  • Directly inserting the bibliographic information into the file.
  • Reading the bibliographic information off files on the disk.
  • Reading the bibliographic information off a site on the Internet (the example above).

You can mix and match these with some of the references being in each location.

Now, with RFCs and Internet-Drafts, as opposed to, say, scientific papers, Internet-based references are unusually attractive.

  • There's an extremely small set of about 10,000 documents that most of your citations come from.
  • Those documents have an unambiguous naming scheme that everyone agrees on (RFC-XXXX, draft-yyy). This sounds trivial, but it's actually a significant obstacle to reference sharing between collaborators in formats like LaTeX, where you need to unambiguously specify the reference key, even in the face of tools like RefTeX to let you search.
  • The common documents have a lot of reference volatility: drafts get updated regularly, and you can feed xml2rfc the draft name without the version number and it will automatically pick up the latest version. This prevents bit-rot.

For all these reasons, you'd think any sane person would use Internet-based references all the time and just use file-based and/or included references (which, btw, are hideous) when they had to reference something that wasn't online. Unfortunately, if you are that sane person, you're about to get screwed: as soon as you go offline (like you want to work on your document on a plane) things go pear-shaped in a really serious kind of way and you get an error that looks like this:

xml2rfc: error: http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml: http package failed

Now, problem one (and a theme we'll come back to in a minute) is that you pretty much have to be a computer scientist to figure out what this means. HTTP package failed? Maybe I need a new HTTP package? No, you're not on the Internet. But that's sort of forgivable, because only a computer scientist would be able to tolerate writing a document of any length in XML in the first place. And if you think about it for a minute, you can probably figure out what this means. It's worth noting, though, that the web page I got this example from is none too clear on the fact that you're actually getting this reference from the Internet. You'd think the http:// would be a bit of a giveaway, but it turns out that XML people routinely use un-dereferenceable URLs to identify resources, so there's no guarantee that just because something starts with http://, you actually can retrieve it.

Problem two is that in most cases you've built this document before and have just made some trivial change and want to rebuild. Most of the references were present when you rebuilt the document two hours ago before you got on the plane. xml2rfc could have simply cached them at that time and used the local cached copy when disconnected until it has time to check cache validity. Unfortunately, it doesn't, so all your references break as soon as you go offline.
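
To be concrete about how little I'm asking for, here's a minimal sketch in Python of the cache-and-fall-back behavior described above. To be clear, this is not xml2rfc code and not how xml2rfc works: the function name fetch_reference, the cache directory argument, and the injectable fetch callable are all my own inventions for illustration, and it skips the cache-validity check entirely.

```python
import hashlib
import urllib.request
from pathlib import Path


def fetch_reference(url, cache_dir, fetch=None):
    """Return the bytes for a reference URL, falling back to a cached copy.

    `fetch` is a hypothetical injectable callable (url -> bytes) so the
    logic can be exercised without a network; by default it uses urllib.
    """
    if fetch is None:
        def fetch(u):
            with urllib.request.urlopen(u, timeout=10) as resp:
                return resp.read()

    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so the cache file name is always filesystem-safe.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    cache_file = cache_dir / (digest + ".xml")

    try:
        data = fetch(url)
    except OSError:
        # Offline (or the server is down): reuse the last good copy, if any.
        # urllib's URLError is a subclass of OSError, so it lands here too.
        if cache_file.exists():
            return cache_file.read_bytes()
        raise

    # Online: refresh the cache so the next offline build still works.
    cache_file.write_bytes(data)
    return data
```

Build once while connected and every reference lands in the cache directory; build again on the plane and the same call quietly hands back the saved copy instead of dying.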

Now, this would all be just annoying except for the fact that that error I showed you above is all xml2rfc gives you when you try to build a document with unresolvable references. Even one unresolvable reference means that it won't process your document at all, so if you change one paragraph, leave the references alone, and want to see what it looks like, too bad! You're SOL! At this point your only choice is to go through and stub out all the unresolvable references so that xml2rfc doesn't freak out, and since they appear all over the document this is a lot of work, and even more work when you have to unstub them when you actually want to build the document. By contrast, in a system like LaTeX/bibtex, you just end up with

[?]
at the reference site in the text and empty biblio entries at the end.

The consequence of all this stuff is that people who want to work offline end up using one of the other two reference styles, where there's a local copy. And if you want to collaborate with anyone else, you all have to have a copy of the entire bibliography, a strategy that gets pretty tedious (did I mention it's scattered across one file for each reference, though there may be some poorly documented or undocumented way to fix that?), so you end up just cutting and pasting the bibliography information into the main working file, which, did I mention, is hideous? In the document I'm working on now, over 20% of the lines in the file are devoted to bibliography. But at least it's self-contained.

I can't help myself: here's a typical bibliography entry, cut right out of my document:

      <reference anchor="I-D.garcia-p2psip-dns-sd-bootstrapping">
        <front>
          <title>P2PSIP bootstrapping using DNS-SD</title>

          <author fullname="Gustavo Garcia" initials="G" surname="Garcia">
            <organization></organization>
          </author>

          <date day="25" month="October" year="2007" />

          <abstract>
            <t>This document describes a DNS-based bootstrap mechanism to
            discover the initial peer or peers needed to join a P2PSIP
            Overlay. The document specifies the use of DNS Service Discovery
            (DNS-SD) and the format of the required resource records to
            support the discovery of P2PSIP peers. This mechanism can be
            applied in scenarios with DNS servers or combined with multicast
            DNS to fulfill different proposed P2PSIP use cases.</t>
          </abstract>
        </front>

        <seriesInfo name="Internet-Draft"
                    value="draft-garcia-p2psip-dns-sd-bootstrapping-00" />

        <format target="http://www.ietf.org/internet-drafts/draft-garcia-p2psip-dns-sd-bootstrapping-00.txt"
                type="TXT" />
      </reference>

And here's the reference entry it actually produces:

   [I-D.garcia-p2psip-dns-sd-bootstrapping]
              Garcia, G., "P2PSIP bootstrapping using DNS-SD",
              draft-garcia-p2psip-dns-sd-bootstrapping-00 (work in
              progress), October 2007.

Now, ask yourself the following question: why, exactly, does this biblio entry need to contain the abstract?!?! The URL is also included, though not used here, but that's so xml2rfc can make clickable links in an HTML version. I guess putting the abstract in the reference would let some future JavaScript weenie pop up the abstract if you hover over the reference. That would sure be useful! The real answer, of course, is that that was what was in the file we sucked down from the Internet and we're sure as heck not going to edit it, lest we break the XML.

Bibliography Errors
So, what happens if you screw up stuffing in some reference? Since there are three places to do this, it happens depressingly often. Let's see what happens if we screw up one of these.

First, let's delete the reference from the body of the document. This produces the following result:

xml2rfc: warning: no <xref> in <rfc> targets
<reference anchor='RFC2119'> around input line 10

Now, this isn't so bad. Once you translate the xmlese, it says that there's a reference anchor (i.e., something you can reference) for RFC 2119 that isn't targeted by an xref (i.e., a reference in the text.) So, this is a superfluous bibliography entry. Also, the good news is that in this case it will still make the document.

Now, let's put that back and try removing the

&rfc2119;
marker at the end. That produces this error:
xml2rfc: error: can't read "xref(RFC2119)": no such element in array around input line 35

Uh... yeah.

So, what this means, literally, is that there's some array (xref?) that doesn't contain the element "RFC2119". If you think like a computer programmer as opposed to someone who just wants to produce documents, you might guess that your reference to RFC2119 doesn't point anywhere. But how do I populate that array? Well, if you go back to the example, you can probably figure out how to fix this, which is good, because you have to fix it if you want the document to build past the point of the first undefined reference!
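
Both of the last two failures are the same bookkeeping problem: the set of things you cite and the set of things you define don't match. Here's a rough Python sketch that reports both mismatches at once, in plain English. It's an illustration, not part of xml2rfc: cross_check is a made-up name, it uses regular expressions rather than a real XML parser, and it assumes your references are pasted inline in the file (the self-contained style), since it can't see anchors that only arrive via an entity like &rfc2119;.

```python
import re

# <xref target="..."> citations in the text.
XREF_RE = re.compile(r"<xref\b[^>]*\btarget\s*=\s*[\"']([^\"']+)[\"']")
# <reference anchor="..."> bibliography entries (not <references>).
REF_ANCHOR_RE = re.compile(r"<reference\b[^>]*\banchor\s*=\s*[\"']([^\"']+)[\"']")
# Any anchor="..." at all, since an xref may also point at a section or figure.
ANY_ANCHOR_RE = re.compile(r"\banchor\s*=\s*[\"']([^\"']+)[\"']")


def cross_check(xml_text):
    """Return (dangling, unused) as two sorted lists of names.

    dangling: xref targets with no matching anchor anywhere in the file.
    unused:   reference anchors that no xref points at.
    """
    targets = set(XREF_RE.findall(xml_text))
    all_anchors = set(ANY_ANCHOR_RE.findall(xml_text))
    ref_anchors = set(REF_ANCHOR_RE.findall(xml_text))
    return sorted(targets - all_anchors), sorted(ref_anchors - targets)
```

The first list corresponds to the "no such element in array" error and the second to the "no <xref> in <rfc> targets" warning, except that you get every offender in one pass instead of one per build.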

Finally we come to the piece de resistance: what happens if you don't put in the entity declaration at the top? You get this:

xml2rfc: error: not expecting pcdata in <references> element around input line 41 in "internally-preprocessed XML"

Syntax:
    41:<references title="Normative References">
    40:<back>
    8:<rfc category="std" ipr="full3978" docName="sample.txt">

"Not expecting pcdata"? What the fuck does that mean?

Luckily, you have me to translate for you. What this means is that the string &rfc2119; in the references section is an entity reference, but because you haven't defined the entity, the parser treats it as character data (PCDATA), which isn't permitted at this location in the XML document by the DTD. Hence, "not expecting pcdata". Useful, right?

As if that weren't bad enough, even once you've decoded this error message it doesn't tell you which entity you've forgotten to define. Sure, there's a line number, 41, and here's line 41:

<references title='Normative References'>&rfc2119;</references>
So far so good, but unfortunately the line number here is that of the <references> element, not of the offending missing entity. Put as many valid references in there as you want and you still get the same line number. In order to figure out the offending entity, you either need to match up the front and the back of the documents or progressively cut references out of the back till the error goes away.1
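
For what it's worth, matching up the front and the back of the document is easy to automate. Here's a minimal Python sketch of that, under loud assumptions: undefined_entities is a hypothetical helper of mine, not anything xml2rfc ships. It only looks at <!ENTITY ...> declarations that appear in the file itself, so any entity defined by the external DTD will be reported as missing, and it doesn't know about comments or CDATA sections.

```python
import re

# General entity declarations, e.g. <!ENTITY rfc2119 PUBLIC '' '...'>
ENTITY_DECL_RE = re.compile(r"<!ENTITY\s+([A-Za-z_][\w.\-]*)\s")
# Entity references, e.g. &rfc2119; (numeric ones like &#160; don't match).
ENTITY_REF_RE = re.compile(r"&([A-Za-z_][\w.\-]*);")
# The five entities every XML parser knows without a declaration.
PREDEFINED = {"lt", "gt", "amp", "apos", "quot"}


def undefined_entities(xml_text):
    """Return (line_number, entity_name) for each undeclared entity reference."""
    declared = set(ENTITY_DECL_RE.findall(xml_text)) | PREDEFINED
    missing = []
    for lineno, line in enumerate(xml_text.splitlines(), start=1):
        for name in ENTITY_REF_RE.findall(line):
            if name not in declared:
                missing.append((lineno, name))
    return missing
```

Unlike the error above, this names the offending entity and gives the line it actually sits on, which is all I wanted in the first place.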

The basic reason you're getting this error instead of something useful like "Go include a <!ENTITY rfc2119 ... production, at the top of the file, you dummy" is that this part of the references system is done purely using XML mechanisms, so you get an XML failure before some better error handling mechanism comes into play. This isn't the only time xml2rfc does this to you, either; it's just the most offensive.

And that, children, is how the Internet standards sausage gets made. Outstanding!

1. Apparently you can use other tools to diagnose this too, but xml2rfc won't help you out.

3 Comments

You:

"Not expecting pcdata"? What the fuck does that mean?

Michael Bolton:

"'PC Load Letter'?!? What the fuck does that mean?!?"

I love it. As a draft author and WG chair, I hereby agree with your points.

Bonus points for your Office Space reference, even if it was unintentional.

It is indeed unfortunate that xml2rfc's primary function is to handle the ever-increasing complexity of the cover sheets required for our TPS reports. This does not seem like a noble goal.

I remember when we'd just edit our RFCs in Word like a normal person. Finding all of the freak typographically-correct glyphs that Word would helpfully insert for us that failed to survive the conversion to good ole ASCII... Ahh. Memories.

ekr,
I don't get why you (and fluffy) insist on wearing this hairshirt.

I have the whole bibxml on my local hard disk. It gets updated whenever I update my local Internet Draft and RFC cache. I set the one magic environment variable in my shell's .rc file [1], and then all I have to do is put a single line in the references section like this:


It works on the plane and it works at home. It is easy to read, and it gets automatically updated to the most recent version of the document.

What's the problem?
thanks,
-rohan

[1] XML_LIBRARY=:~/ref/bibxml/1:~/ref/bibxml/2:~/ref/bibxml/3:~/ref/bibxml/4

(1) Did you miss the part about how you have to work with other people who may not want to do this?

(2) I *do* have a copy of the biblio files on disk, but it's silly to have to do that when they're sitting on some Web site.

Just because there is a workaround to something doesn't mean it's not bad design.
