As draft season is once again upon us, I am once again spending a
lot of time with xml2rfc,
the unofficial official draft production tool of the IETF.
Now, the party line at IETF is that we use ASCII and you
can prepare documents in any tool you like, but here on
Planet Earth, the combination of nroff bit rot (or at least
mind rot) and increasingly stringent formatting requirements
has made it a real PITA to do documents in any tool other than
xml2rfc. This does not mean that xml2rfc is a joy to use.
Before I go on with my litany of complaint, I want to head
off at the pass the usual response one hears at this point.
Two responses, actually: (1) nobody is making you use it and (2)
it's open source software, if you
don't like how it behaves, then you can fix it. The first objection
is literally true but as a practical matter false. First,
everyone else uses it so if you want to collaborate you
pretty much have to. Second, as I said earlier, the fact
that everyone else uses it means that the IETF has felt free
to impose increasingly stringent tests on submitted documents
to the point where if you use any other document production
system, each time you want to submit a new document you
end up spending a lot of time figuring out how to get it
through whatever submission filters have been imposed this
week. Finally, and most importantly, if you submit your
draft to the RFC Editor in XML (you do want your document
published as an RFC, right?) they will edit it in XML and
so when you want to do a bis version, you have all their
copy edits incorporated. On the other hand, if you give
them plaintext, then you end up either having to edit their incredibly
crufty nroff source or backport all their copy edit changes
into your original source format, whatever that was.
The second response, of course, is insane. I just want to write
documents and shouldn't have to be an XML hacker, let alone
a tcl hacker (I did mention that xml2rfc is written in tcl, right?)
to get that task done. "Go fix it yourself" is a fine mantra
for tools that are truly optional, but not for those which are
increasingly becoming the de facto standard.
OK, back to my theme. As the name suggests, to write something in xml2rfc you
start with an XML document in a particular format and then run
it through xml2rfc to produce ASCII or HTML or whatever
(though ASCII is the normative format). The document
looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<!-- edited with XMLSPY v5 rel. 3 U (http://www.xmlspy.com)
by Daniel M Kohn (private) -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!ENTITY rfc2119 PUBLIC ''
'http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml'>
]>
<rfc category="std" ipr="full3978" docName="sample.txt">
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<?rfc toc="yes" ?>
<?rfc symrefs="yes" ?>
<?rfc sortrefs="yes"?>
<?rfc iprnotified="no" ?>
<?rfc strict="yes" ?>
<front>
<title>An Example</title>
<author initials='A.Y' surname="Mous" fullname='Anon Y. Mous'>
<organization/>
</author>
<date/>
<abstract><t>An example.</t></abstract>
</front>
<middle>
<section title="Requirements notation">
<t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL",
"SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY",
and "OPTIONAL" in this document are to be interpreted as
described in <xref target="RFC2119"/>.</t>
</section>
<section title="Security Considerations">
<t>None.</t>
</section>
</middle>
<back>
<references title='Normative References'>&rfc2119;</references>
</back>
</rfc>
Now, there's plenty of stuff to object to here, starting with the (false)
notion that I want to be writing my document in XML in the first place. But what I want
to talk about right now is how references/bibliographies are done.
Bibliography Locations
xml2rfc has three major reference handling modes:
- Directly inserting the bibliographic information into the file.
- Reading the bibliographic information off files on the disk.
- Reading the bibliographic information off a site on the Internet
(the example above).
You can mix and match these with some of the references being
in each location.
Now, with RFCs and Internet-Drafts, as opposed to, say, scientific
papers, Internet based references are unusually attractive.
- There's an extremely small set of about 10,000
documents that most of your citations come from.
- Those documents
have an unambiguous naming scheme that everyone agrees
on (RFC-XXXX, draft-yyy). This sounds trivial, but it's
actually a significant obstacle to reference sharing between
collaborators in formats like LaTeX where you need to
unambiguously specify the reference key—even in the face
of tools like RefTeX to let you search.
- The common documents have a lot of reference volatility:
drafts get updated regularly, and you can feed xml2rfc the
draft name without the version number and it will automatically
pick up the latest version. This prevents bit-rot.
For all these reasons,
you'd think any sane person would use Internet-based references
all the time and just use file-based and/or included references
(which, btw, are hideous) when they had to reference something
that wasn't online. Unfortunately, if you are that sane person,
you're about to get screwed: as soon as you go offline
(like you want to work on your document on a plane) things
go pear-shaped in a really serious kind of way and you
get an error that looks like this:
xml2rfc: error: http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml: http package failed
Now, problem one (and a theme we'll come back to in a minute) is
that you pretty much have to be a computer scientist to figure
out what this means. HTTP package failed? Maybe I need a new HTTP
package? No, you're not on the Internet. But that's sort of forgivable,
because only a computer scientist would be able to tolerate writing
a document of any length in XML in the first place. And if you think
about it for a minute, you can probably figure out what this
means, though it's worth noting that the web page
I got this example from is none too clear on the fact that you're
actually getting this reference from the Internet. And though you'd
think the http:// would be a bit of a giveaway, it turns
out that XML people routinely use un-dereferenceable URLs to
identify resources, so there's no guarantee that just because
something starts with http://, you can actually retrieve it.
Problem two is that in most cases you've built this document before
and have just made some trivial change and want to rebuild. Most
of the references were present when you rebuilt the document two
hours ago before you got on the plane. xml2rfc could have simply
cached them at that time and used the local cached copy when disconnected
until it has a chance to check cache validity. Unfortunately, it doesn't,
so all your references break as soon as you go offline.
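To be concrete about what I'm asking for, here is a minimal sketch of the fetch-then-fall-back-to-cache behavior in Python. This is not how xml2rfc works (it's written in tcl and, as noted, doesn't cache at all); the function names, the cache directory, and the hashed file naming are all my own assumptions for illustration.

```python
import hashlib
import os
import urllib.request


def fetch_url(url, timeout=10):
    """Download a URL and return its body as bytes (raises OSError when offline)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def cached_reference(url, cache_dir, fetch=fetch_url):
    """Return the reference at `url`, falling back to a cached copy if the fetch fails.

    `cache_dir` and the hashed file naming are hypothetical choices for this sketch.
    """
    os.makedirs(cache_dir, exist_ok=True)
    # Hash the URL so that any URL maps to a safe file name.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".xml"
    path = os.path.join(cache_dir, name)
    try:
        data = fetch(url)
    except OSError:
        # Offline (or the server is down): reuse the last good copy if there is one.
        if os.path.exists(path):
            with open(path, "rb") as f:
                return f.read()
        raise
    # Online: refresh the cache so the next offline build still works.
    with open(path, "wb") as f:
        f.write(data)
    return data
```

The `fetch` parameter exists only so the network call can be swapped out; a reference that has never been fetched still fails when you're offline, which seems like the right behavior.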
Now, this would all be just annoying except for the fact that that
error I showed you above is all xml2rfc gives you when you
try to build a document with unresolvable references. Even one
unresolvable reference means that it won't process your document
at all, so if you change one paragraph, leave the references
alone, and want to see what it looks like, too bad! You're SOL!
At this point your only choice is to go through and stub out
all the unresolvable references so that xml2rfc doesn't freak
out, and since they appear all over the document this is
a lot of work, and even more work when you have to unstub them
when you actually want to build the document. By contrast, in
a system like LaTeX/BibTeX, you just end up with [?]
at the reference site in the text and empty biblio entries at the
end.
The consequence of all this stuff is that people who want to work
offline end up using one of the other two reference styles, where
there's a local copy. And if you want to collaborate with anyone else,
you all have to have a copy of the entire bibliography, a strategy that gets pretty
tedious (did I mention it's scattered across one file for each
reference? There may be some poorly documented or undocumented way to fix
that), so you end up just cutting and pasting the bibliography information
into the main working file, which, did I mention, is hideous? In the
document I'm working on now, over 20% of the lines in the file are
devoted to bibliography. But at least it's self-contained.
I can't help myself: here's a typical bibliography entry, cut right
out of my document:
<reference anchor="I-D.garcia-p2psip-dns-sd-bootstrapping">
<front>
<title>P2PSIP bootstrapping using DNS-SD</title>
<author fullname="Gustavo Garcia" initials="G" surname="Garcia">
<organization></organization>
</author>
<date day="25" month="October" year="2007" />
<abstract>
<t>This document describes a DNS-based bootstrap mechanism to
discover the initial peer or peers needed to join a P2PSIP
Overlay. The document specifies the use of DNS Service Discovery
(DNS-SD) and the format of the required resource records to
support the discovery of P2PSIP peers. This mechanism can be
applied in scenarios with DNS servers or combined with multicast
DNS to fulfill different proposed P2PSIP use cases.</t>
</abstract>
</front>
<seriesInfo name="Internet-Draft"
value="draft-garcia-p2psip-dns-sd-bootstrapping-00" />
<format target="http://www.ietf.org/internet-drafts/draft-garcia-p2psip-dns-sd-bootstrapping-00.txt"
type="TXT" />
</reference>
And here's the reference entry it actually produces:
[I-D.garcia-p2psip-dns-sd-bootstrapping]
Garcia, G., "P2PSIP bootstrapping using DNS-SD",
draft-garcia-p2psip-dns-sd-bootstrapping-00 (work in
progress), October 2007.
Now, ask yourself the following question: why, exactly, does this
biblio entry need to contain the abstract?!?! The URL is also included,
though not used here, but that's so xml2rfc can make clickable
links in an HTML version. I guess putting the abstract in the
reference would let some future JavaScript weenie pop up the
abstract if you hover over the reference. That would sure be useful!
The real answer, of course, is that that
was what was in the file we sucked down from the Internet and we're
sure as heck not going to edit it, lest we break the XML.
Bibliography Errors
So, what happens if you screw up stuffing in some reference? Since
there are three places to do this, it happens depressingly often.
Let's see what happens if we screw up each of these in turn.
First, let's delete the reference from the body of the document.
This produces the following result:
xml2rfc: warning: no <xref> in <rfc> targets
<reference anchor='RFC2119'> around input line 10
Now, this isn't so bad. Once you translate the xmlese, it says
that there's a reference anchor (i.e., something you can reference)
for RFC 2119 that isn't targeted by an xref (i.e., a reference
in the text.) So, this is a superfluous bibliography entry.
Also, the good news is that in this case it will still make the
document.
Now, let's put that back and try removing the &rfc2119;
marker at the end. That produces this error:
xml2rfc: error: can't read "xref(RFC2119)": no such element in array around input line 35
Uh... yeah.
So, what this means, literally, is that some array (xref?)
doesn't contain the element "RFC2119". If you think like a computer
programmer as opposed to someone who just wants to produce documents,
you might guess that your reference to RFC2119 doesn't point
anywhere. But how do I populate that array? Well, if you go back
to the example, you can probably figure out how to fix this,
which is good, because you have to fix it if you want the document to
build past the point of the first undefined reference!
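If you've given up and pasted all your references inline, both of the mismatches above (a reference nobody cites, and a citation with no reference) can be found in one pass rather than one build per error. A rough sketch in Python, assuming every <reference> is in the file itself; entity-included references are invisible to it, and the function name is my own invention:

```python
import re

# Matches ="value" or ='value' and captures the value.
ATTR = r"""\s*=\s*["']([^"']+)["']"""


def check_references(xml_text):
    """Return (dangling, unused) for a file whose references are all inline.

    dangling: <xref target="..."> values with no matching anchor anywhere.
    unused:   <reference anchor="..."> values that no <xref> targets.
    """
    targets = set(re.findall(r"<xref\b[^>]*\btarget" + ATTR, xml_text))
    anchors = set(re.findall(r"\banchor" + ATTR, xml_text))
    ref_anchors = set(re.findall(r"<reference\b[^>]*\banchor" + ATTR, xml_text))
    dangling = sorted(targets - anchors)
    unused = sorted(ref_anchors - targets)
    return dangling, unused
```

This is plain regex matching, not XML parsing, so commented-out markup will confuse it. The point is only that reporting every problem at once is not hard.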
Finally we come to the pièce de résistance: what happens if you
don't put in the entity declaration at the top? You get this:
xml2rfc: error: not expecting pcdata in <references> element around input line 41 in "internally-preprocessed XML"
Syntax:
41:<references title="Normative References">
40:<back>
8:<rfc category="std" ipr="full3978" docName="sample.txt">
"Not expecting pcdata"? What the fuck does that mean?
Luckily, you have me to translate for you. What this means is that the
string &rfc2119; in the references section is an
entity reference, but because you haven't defined the
entity, the parser treats it as character data (PCDATA),
which isn't permitted at this location in the XML document
by the DTD. Hence, "not expecting pcdata". Useful, right?
As if that weren't bad enough, even once you've decoded
this error message it doesn't tell you which entity you've
forgotten to define. Sure, there's a line number, 41, and
here's line 41:
<references title='Normative References'>&rfc2119;</references>
So far so good, but unfortunately the line number here is
that of the <references> element,
not of the offending missing entity. Put as many valid
references in there as you want and you still get the
same line number. In order to figure out the offending
entity, you either need to match up the front and the back
of the document or progressively cut references out of the back till
the error goes away. [1]
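Matching up the front and the back of the document is exactly the kind of job a machine should be doing. Here's a minimal Python sketch that lists entity references with no declaration; it's a plain regex scan over the file, not anything xml2rfc provides, and the function name is made up for illustration.

```python
import re

# Entities that XML predefines; these never need an <!ENTITY ...> declaration.
PREDEFINED = {"amp", "lt", "gt", "apos", "quot"}


def undeclared_entities(xml_text):
    """Return sorted names used as &name; but never declared with <!ENTITY name ...>."""
    declared = set(re.findall(r"<!ENTITY\s+([A-Za-z_][\w.-]*)", xml_text))
    # Numeric character references like &#160; are skipped: '#' cannot start a name.
    used = set(re.findall(r"&([A-Za-z_][\w.-]*);", xml_text))
    return sorted(used - declared - PREDEFINED)
```

Run over the sample document with the <!ENTITY rfc2119 ...> declaration deleted, this should report rfc2119 by name, which is more than the error message above manages. It only sees declarations in the file itself, so entities defined in an external DTD would show up as false alarms.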
The basic reason you're getting this error instead of something useful
like "Go include a <!ENTITY rfc2119 ... production
at the top of the file, you dummy" is that this part of the references
system is done purely using XML mechanisms, so you get an
XML failure before some better error handling mechanism
comes into play. This isn't the only time xml2rfc does this to
you either; it's just the most offensive.
And that, children, is how the Internet standards sausage gets
made. Outstanding!
1. Apparently you can use other tools to diagnose
this too, but xml2rfc won't help you out.