June 19, 2005
So you want to capture customer logging information?
News.com reports that the DoJ wants ISPs to retain logs of customer activity.
In Europe, the Council of Justice and Home Affairs ministers say logs must be kept for between one and three years. One U.S. industry representative, who spoke on condition of anonymity, said the Justice Department is interested in at least a two-month requirement.
Justice Department officials endorsed the concept at a private meeting with Internet service providers and the National Center for Missing and Exploited Children, according to interviews with multiple people who were present. The meeting took place on April 27 at the Holiday Inn Select in Alexandria, Va.
"It was raised not once but several times in the meeting, very emphatically," said Dave McClure, president of the U.S. Internet Industry Association, which represents small to midsize companies. "We were told, 'You're going to have to start thinking about data retention if you don't want people to think you're soft on child porn.'"
This is phrased as being about retention, but the ISPs can only retain what they captured in the first place, and in many cases that is surprisingly little.
The first thing you have to realize is that the Internet isn't like the telephone network. In the PSTN, each call setup and termination requires explicit creation of state at each switch along the path. The Internet (and packet switched networks in general) is different: state typically exists only at the endpoints. The intermediate routers just forward packets. For instance, when you make a Web (HTTP) connection to Amazon.com, there's TCP state on your client and on Amazon's server (and maybe on a firewall or two in between) but as far as the intermediate routers are concerned, they just see a bunch of mostly uninterpreted IP datagrams, with traffic from multiple senders and receivers mixed together. The routers don't even attempt to reassemble them into connections, nor do they need to. That's part of the elegance of the Internet design. What logging occurs generally happens at the connection endpoints.
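To make the "routers just forward packets" point concrete, here's a minimal sketch of a stateless forwarding decision. The routing table, prefixes, and next-hop names are made up for illustration, and real routers do this in hardware with far larger tables. The thing to notice is that the decision depends only on the destination address and nothing about the packet or its connection is remembered afterwards.

```python
import ipaddress

# Hypothetical routing table: prefix -> next hop (illustrative values only).
ROUTES = {
    ipaddress.ip_network("0.0.0.0/0"): "upstream",
    ipaddress.ip_network("192.0.2.0/24"): "customer-a",
    ipaddress.ip_network("198.51.100.0/24"): "customer-b",
}


def forward(dst: str) -> str:
    """Pick a next hop by longest-prefix match on the destination address.

    No per-connection state is created or consulted, and nothing is logged.
    """
    addr = ipaddress.ip_address(dst)
    matches = [net for net in ROUTES if addr in net]
    best = max(matches, key=lambda net: net.prefixlen)
    return ROUTES[best]
```

Calling forward("192.0.2.7") twice does exactly the same work both times; the router has no idea whether the two packets belong to the same TCP connection.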
Web (HTTP)
The simplest case is HTTP. If you connect to a Web server, that server will keep
some logs of your activity. Your web browser may keep logs too,
but of course those live on your computer. So, your ISP generally
doesn't keep logs of your Web browsing. On the other hand,
if your ISP runs your Web site for you, they probably keep logs,
but that's in their capacity as your Web server, not as your ISP.
For instance, EG is hosted at Dreamhost,
but they are only my hosting provider. They don't provide my home
Internet service.
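For a sense of what a Web server's log actually records, here's a small sketch that parses one line in the widely used Common Log Format. The sample line, addresses, and function name are invented for illustration, and any given server may well be configured to log a different format.

```python
import re

# Common Log Format: host ident authuser [date] "request" status bytes
CLF = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)


def parse_clf(line: str) -> dict:
    """Split one Common Log Format line into named fields."""
    m = CLF.match(line)
    if m is None:
        raise ValueError("not a Common Log Format line")
    return m.groupdict()


# Made-up sample entry (documentation address range, invented path).
SAMPLE = ('192.0.2.7 - - [19/Jun/2005:19:38:00 -0700] '
          '"GET /index.html HTTP/1.0" 200 2326')
```

Note that the server sees the client's IP address, the time, and the URL requested. All of that sits on the server, not at the client's ISP.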
Mail (SMTP)
There's a lot more opportunity to log e-mail traffic. Most home
users get their e-mail service from their ISP. What this means
in practice is that when mail is sent to them it
gets delivered to the ISP's e-mail server. The user
reads their mail by contacting the ISP's server to
pick it up, using POP, IMAP, or Web mail. Because the
mail server is involved in mail delivery (and typically
has the mail lying around for a while), it's easy for it
to keep logs, and standard mail servers do so by default.
These logs typically contain to/from information
and the disposition of the mail, as well as a timestamp.
Often, the time at which you actually read your mail is logged too.
Because the mail server has access to the content (i.e., the
message body), it can
of course keep a copy, but that isn't standard practice.
When users send mail, they typically deliver it to the ISP's mail server, which then takes care of the ultimate delivery. This has the advantage that it's "fire and forget". If the message can't be delivered right away, the mail server will keep trying even if your machine is disconnected from the Internet. A lot of ISPs actually require their users to use their mail servers on the theory that it helps them suppress spam. As before, these transactions are easy to log, and as far as I know this is standard practice.
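As a rough illustration of the kind of record a mail server keeps, here's a sketch of a log entry holding just the envelope information described above. The class, field names, and disposition strings are hypothetical and don't correspond to any particular mail server's log format.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class MailLogEntry:
    """Hypothetical per-message log record: envelope data only, no body."""
    timestamp: datetime
    sender: str
    recipient: str
    disposition: str  # e.g. "delivered", "deferred", "bounced"


def log_delivery(log: List[MailLogEntry], sender: str, recipient: str,
                 disposition: str,
                 now: Optional[datetime] = None) -> MailLogEntry:
    """Append one record to the log and return it."""
    entry = MailLogEntry(now or datetime.now(timezone.utc),
                         sender, recipient, disposition)
    log.append(entry)
    return entry
```

A message that can't be delivered right away would simply produce a "deferred" entry followed later by a "delivered" or "bounced" one. The message body never appears in the record.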
IM
The situation with IM is fairly complicated. The general rule
is that whoever runs your IM service (e.g., Yahoo, AOL, MSN)
has an opportunity to log your activity, but it's inconvenient for your ISP to do so.
If you run your own server (e.g., Jabber/XMPP) then
whoever runs that server can log traffic, as with HTTP.
Whoever runs the service has the opportunity to
access the actual data traffic, but they typically don't.
Non-server logging
Of course, just because something isn't logged now doesn't
mean it couldn't be. In theory, ISPs could capture every
packet that goes through their routers. They could decode
them and synthesize their own logs or simply record them
to disk for future processing. In practice, however, this
would be a substantial undertaking. The
routers that ISPs use aren't set up to record this kind
of detailed information, so capturing it probably
means putting some sort of tap on the network. This is,
of course, possible, but it is a very different
issue from merely retaining logging information that already exists.
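To give a feel for what "decode them and synthesize their own logs" would involve, here's a minimal sketch that groups already-captured packet headers into per-connection summaries. The Packet type and its fields are assumptions for illustration. A real tap would also have to capture at line rate, parse raw frames, and cope with loss and reordering, none of which is shown here.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class Packet(NamedTuple):
    """Hypothetical decoded packet header (fields chosen for illustration)."""
    ts: float
    src: str
    sport: int
    dst: str
    dport: int
    proto: str
    length: int


def flow_key(p: Packet) -> Tuple:
    """Direction-independent key, so both halves of a connection group together."""
    endpoints = sorted([(p.src, p.sport), (p.dst, p.dport)])
    return (p.proto, endpoints[0], endpoints[1])


def summarize(packets: Iterable[Packet]) -> Dict[Tuple, dict]:
    """Build one log-like record per flow: first/last seen, packet and byte counts."""
    flows: Dict[Tuple, dict] = {}
    for p in packets:
        rec = flows.setdefault(
            flow_key(p),
            {"first": p.ts, "last": p.ts, "packets": 0, "bytes": 0},
        )
        rec["first"] = min(rec["first"], p.ts)
        rec["last"] = max(rec["last"], p.ts)
        rec["packets"] += 1
        rec["bytes"] += p.length
    return flows
```

Unlike the forwarding path, this keeps a table entry for every connection that crosses the link, which is exactly the state that routers are designed not to hold.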
Posted by ekr at June 19, 2005 7:38 PM
Comments
Hmm. Given that child porn seems to be the motivation here, it seems like, at least for HTTP, you want to monitor the servers at least as closely as the clients.
Posted by: Kevin Dick at June 19, 2005 8:04 PM
But of course if people run their own servers, even in a co-lo, then you have no access to that traffic.
Posted by: EKR at June 19, 2005 8:14 PM
Of course. I was thinking about how to reduce the cost of the non-server logging case. I figure it would be easier to detect and save HTTP traffic on the server side than the client side. Strangle the supply rather than the demand.
Posted by: Kevin Dick at June 19, 2005 9:17 PM
Vern Paxson has a paper in progress showing that full logging of major links actually isn't that costly.
Anyway, with a 3 TB disk array running $8k, it's good for logging a 1 Gb stream for 6 days without any tricks at all.
Posted by: Nicholas Weaver at June 19, 2005 9:40 PM
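For a rough check on storage estimates like the one above, here is a back-of-the-envelope sketch. It assumes decimal units (3 TB = 3e12 bytes), reads "1 Gb stream" as a 1 Gb/s link, and ignores capture and filesystem overhead. The function name and the utilization parameter are illustrative, not taken from the comment.

```python
def seconds_until_full(disk_bytes: float, link_bits_per_sec: float,
                       utilization: float = 1.0) -> float:
    """Time to fill the disk when recording every byte that crosses the link."""
    bytes_per_sec = link_bits_per_sec * utilization / 8
    return disk_bytes / bytes_per_sec


DISK_BYTES = 3e12   # 3 TB, decimal (assumption)
LINK_BPS = 1e9      # 1 Gb/s (assumption)

# Hours of capture if the link is fully saturated the whole time.
hours_at_line_rate = seconds_until_full(DISK_BYTES, LINK_BPS) / 3600

# Average utilization at which the same disk lasts six days.
utilization_for_six_days = DISK_BYTES * 8 / (LINK_BPS * 6 * 86400)
```

Under these assumptions a saturated link fills the array in under seven hours, while six days of capture corresponds to an average utilization of a bit under 5%. How long the disk lasts therefore depends heavily on how busy the link really is.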