The Palo Alto Library is a distributed system

A distributed system is one on which I cannot get any work done because some machine I have never heard of has crashed.
   —Leslie Lamport

The first thing I noticed when I went to check out my books at the library today was that there was a long line. It quickly became apparent why: the automatic checkout machines were down. When we got up to the front of the line, the checkout librarian started checking us out manually---writing down our card numbers and books on a piece of paper. Here's how the conversation went (from memory, so only approximately accurate):

EKR: What's the problem?
Librarian: Our computers are down.
EKR: I didn't realize they were so brittle.
Librarian: Well, our Internet connection is down.
EKR: That's not a very good design.
Librarian: Yes it is. The libraries all use the same checkout system and so if the Internet connection is down then the other libraries have no way of knowing if something has been checked out.
EKR: Well, you could just check things out on the computers here and then synch things up when the Internet connection comes back online.
Librarian: No. The Internet connection is either up or it's down.
EKR: No, I mean you can just check people out here and then when the connection comes back online you just upload the changes.
Librarian: But when people return something at Mitchell Park it shows up here right away. We can't do that if the connection is down.

At this point she had finished writing down my books and there were people behind me, so I gave up. But the great thing about blogs is that now I can talk about it here.

Say you want to have a distributed system like this. You've got two branches, which should have a common view of the universe. Call them Alpha and Beta. So, when someone checks out a book at Alpha, Beta knows about it and vice versa. In order to achieve this, you either have Alpha and Beta linked up to each other or to a common central server.

The central server is easier to explain so let's start with that. In the basic design, the computers at Alpha and Beta are dumb and all the information is stored at the central server--call it Central. Whenever anyone at Alpha or Beta wants to know anything, it asks Central. Whenever they want to change anything, they tell Central to change it. And when they want to know the state--even on something they just changed--they ask Central.
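
To make the shape of this concrete, here's a minimal Python sketch of the dumb-terminal model. All the names (CentralServer, Branch, and so on) are hypothetical and purely illustrative; a real integrated library system would sit on a database with a network protocol in between, but the structure is the same: the branch holds no state and every operation is a round trip to Central.

```python
class CentralServer:
    """Holds the only authoritative copy of the checkout records."""
    def __init__(self):
        self.checked_out = {}  # book_id -> patron_id

    def checkout(self, book_id, patron_id):
        if book_id in self.checked_out:
            raise ValueError(f"{book_id} is already checked out")
        self.checked_out[book_id] = patron_id

    def lookup(self, book_id):
        return self.checked_out.get(book_id)  # None means "on the shelf"


class Branch:
    """A branch terminal: no local state, every operation asks Central."""
    def __init__(self, name, central):
        self.name = name
        self.central = central

    def checkout(self, book_id, patron_id):
        self.central.checkout(book_id, patron_id)  # one network round trip

    def lookup(self, book_id):
        return self.central.lookup(book_id)        # another round trip


central = CentralServer()
alpha, beta = Branch("Alpha", central), Branch("Beta", central)
alpha.checkout("moby-dick", "ekr")
print(beta.lookup("moby-dick"))  # "ekr": Beta sees Alpha's change at once
```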

This design has the nice property that Alpha and Beta can never get out of synch, since all the brains are at Central (I'm deliberately ignoring database locking here for the moment). However, it has the annoying property that it's really slow, because neither Alpha nor Beta can ever display anything without talking to Central first. The natural fix for this is for Alpha and Beta to each maintain a replica of the database locally. Then, when they want to display something they just read it out of their replica [1]. Of course, if Alpha changes anything it needs to tell the central server, which notifies Beta, and vice versa. Note that at this point it becomes clear that you don't really need the central server. Alpha and Beta can just maintain replicas and notify each other whenever anything changes. There are advantages to having a central server, but it's easiest to explain the rest of this if we assume that there are just two machines connected together.
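
Here's the same sketch with the replicas pushed out to the branches. Again, the names are hypothetical, and real replication protocols worry about ordering, failures, and locking, all of which this ignores. The point is just that reads become local while writes get broadcast to every peer:

```python
class Replica:
    """A branch that keeps a full local copy and pushes changes to peers."""
    def __init__(self, name):
        self.name = name
        self.checked_out = {}  # full local replica of the checkout records
        self.peers = []

    def connect(self, other):
        self.peers.append(other)
        other.peers.append(self)

    def checkout(self, book_id, patron_id):
        self._apply(book_id, patron_id)
        for peer in self.peers:              # notify everyone else
            peer._apply(book_id, patron_id)

    def _apply(self, book_id, patron_id):
        self.checked_out[book_id] = patron_id

    def lookup(self, book_id):
        return self.checked_out.get(book_id)  # purely local read: fast


alpha, beta = Replica("Alpha"), Replica("Beta")
alpha.connect(beta)
alpha.checkout("moby-dick", "ekr")
print(beta.lookup("moby-dick"))  # "ekr", without a central server in sight
```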

This all works fine as long as all the machines are connected all the time. But what happens if for some reason one becomes disconnected? There are four basic strategies for dealing with disconnected operation:

  1. Forbid it. If Alpha's network connection goes down, no users at Alpha can look up or check out books.
  2. Read-only. If Alpha's network connection goes down, people at Alpha can look up stuff, but not check stuff out.
  3. Partial write. Alpha and Beta each get assigned some subset of the database. They're allowed to read anything in the database but can only write their assigned section. When the network comes back online, they synch up. Note that if Alpha changes something while they're disconnected and Beta tries to read it, Beta gets a stale result until they're reconnected.
  4. Concurrent write. Each of Alpha and Beta is allowed to read and write any record. When the network connection comes back online they synch up. This may mean resolving any records which have been changed by both machines. This isn't so much of a problem in the library context because any patron or any copy of a given book can only be at one location at once (though think about what happens if I put a hold on a book from location Alpha and then go to Beta to pick it up and check it out). However, in other systems it's common to have records changed at two places simultaneously. Re-synchronization in such systems can be a real pain in the ass (cf. CVS); a sketch of one synch-up strategy follows this list.
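
To see why re-synchronization under concurrent write is painful, here's a toy sketch of the synch-up step using last-writer-wins with conflict flagging. The timestamped-log representation is one common choice I've picked purely for illustration, not necessarily what any real library system does:

```python
def sync(log_a, log_b):
    """Merge two offline write logs of (timestamp, book_id, patron_id).

    Returns the merged records plus the set of conflicts: records that
    both sides wrote while disconnected, resolved here by taking the
    later write (a human may still need to sort them out, cf. CVS).
    """
    tagged = ([(ts, b, p, "A") for ts, b, p in log_a]
              + [(ts, b, p, "B") for ts, b, p in log_b])
    merged, conflicts = {}, set()
    for ts, book_id, patron_id, origin in sorted(tagged):
        if book_id in merged and merged[book_id][2] != origin:
            conflicts.add(book_id)           # both sides touched this record
        merged[book_id] = (ts, patron_id, origin)
    return merged, conflicts


# Alpha and Beta both touched "moby-dick" while partitioned:
alpha_log = [(1, "moby-dick", "ekr"), (2, "snow-crash", "alice")]
beta_log  = [(3, "moby-dick", "bob")]
db, conflicts = sync(alpha_log, beta_log)
print(db["moby-dick"][:2])  # (3, 'bob'): the later write wins
print(sorted(conflicts))    # ['moby-dick']: flagged for human review
```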

What's weird about the system at the Palo Alto Library is that they're pretending to run a Read-Only system when they're actually using what's basically a Partial Write scheme. They're letting people check books out of the library, but instead of keying the checkouts into the computer, they're writing them down on paper with the intention of keying them in when the system comes back online. A reasonable database system would let you do all of this in the computer and then synch up automatically when the Internet connection came back up. Apparently the Palo Alto librarians aren't IT savvy enough to demand a reasonable system.
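
Here's what that would look like in miniature: the piece of paper becomes a local queue that gets replayed when the connection returns. Hypothetical names throughout; this is the idea, not any vendor's API:

```python
class Central:
    """Stand-in for the shared checkout database."""
    def __init__(self):
        self.checked_out = {}

    def checkout(self, book_id, patron_id):
        self.checked_out[book_id] = patron_id


class BranchTerminal:
    """A checkout station that degrades to queueing instead of refusing."""
    def __init__(self, central):
        self.central = central
        self.online = True
        self.pending = []  # the librarian's piece of paper, in software

    def checkout(self, book_id, patron_id):
        if self.online:
            self.central.checkout(book_id, patron_id)
        else:
            self.pending.append((book_id, patron_id))  # defer, don't refuse

    def reconnect(self):
        self.online = True
        while self.pending:  # replay the queued transactions in order
            self.central.checkout(*self.pending.pop(0))


central = Central()
term = BranchTerminal(central)
term.online = False                      # the Internet connection goes down
term.checkout("moby-dick", "ekr")        # still works: queued locally
term.reconnect()                         # ...and synched up automatically
print(central.checked_out["moby-dick"])  # "ekr"
```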

[1] In a lot of environments it's not efficient to keep a full replica. In particular, if there's a lot of locality of reference (Alpha mostly works on some subset of the data and Beta mostly works on another), then you can get a more efficient system with caches rather than replicas. But library systems aren't necessarily this way.
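
In sketch form, the difference is just where a read can be satisfied (names hypothetical again, and I'm ignoring cache invalidation, which is where the real work is):

```python
class Central:
    """Stand-in for the authoritative store."""
    def __init__(self, records):
        self.records = records

    def lookup(self, book_id):
        return self.records.get(book_id)


class CachingBranch:
    """Keeps only the records it has actually touched, not a full replica."""
    def __init__(self, central):
        self.central = central
        self.cache = {}

    def lookup(self, book_id):
        if book_id not in self.cache:  # miss: one round trip to Central
            self.cache[book_id] = self.central.lookup(book_id)
        return self.cache[book_id]     # hit: purely local


branch = CachingBranch(Central({"moby-dick": "ekr"}))
print(branch.lookup("moby-dick"))  # first call fetches from Central...
print(branch.lookup("moby-dick"))  # ...second is served from the cache
```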


12 Comments

The same thing happened to me at the San Francisco Public Library.

I had concluded that their multi-branch system was basically an extension of the single branch with (dumb) terminals, and with the catalogue cached locally for searching.

It might be a violation of Federal law to have a discussion of this sort without citing Kistler and Satyanarayanan, "Disconnected operation in the Coda file system," ACM Tr. Computer Sys. 10(1):3-25, 1992.

Re: Coda, I think that's a classic example of how 90% of the world's IT infrastructure is implemented by people who don't read papers and the systems written by the people who wrote the papers aren't widely adopted. It's not entirely clear why this is...

You're wasting your time talking about systems design with the person who's checking out your books. Especially at a large public library, they're just a more genteel class of register biscuit. If you want to talk to someone about it you should find the systems librarian... Though s/he probably won't have time to talk to you, as they are some of the most overworked people I know of. (In the college library I worked at for a while, one person supported several hundred public computers, two or three dozen staff workstations, four or five NT servers, and the big database server. She had one professional assistant, who was hopelessly untechnical, and a student worker: me.)

I think the technical reason they don't do this is that the checkout computers are only recently evolved from dumb terminals. Their software for the job is still more or less a terminal window. Libraries are chronically underfunded, public libraries even more so, and what money they get they tend to invest in books rather than software development.

Hovav,

Thanks for ensuring EG's regulatory compliance.

What's truly amazing is that the Palo Alto library has had computerized checkout terminals forever, and they must have been centrally synchronized. But before high-speed net connections, I bet it was done using the kind of local-cache and periodic-synchronization strategies that I'm talking about here....

I'm not sure this is a fair criticism of the design of the library system. Centralized systems--even high-availability ones--are enormously simpler and cheaper to build and maintain than distributed ones. They also depend more on the availability of the underlying communications infrastructure. That can be a perfectly reasonable tradeoff to make, depending on the reliability of the infrastructure and the cost of working around its failures. For example, if you'd gone to the library and discovered that they were closed because of a local power failure, would you have posted to your blog complaining of their lack of a backup generator?

I'm just speculating, but it's entirely possible that the library's Internet connection these days is not much less reliable than its power supply. If so, then its decision to spend its limited budget on something other than making its checkout system robust in the face of Internet connection failures would make good sense, no?

Dan,

The library already has a distributed system: they have replicas in every location and when the connection goes down they write down the updates on paper and key them in later when the connection comes back up. I'm simply suggesting that they move from a partially manual system to a completely computerized one. The additional cost to build a system that does this is actually quite small and it can often be almost entirely hidden in the communication layers. In any case, the argument "we're too cheap to build the system that way" is a very different argument from the argument that it *has* to be the way it is, which is the one I'm primarily addressing.

FWIW, I heard at least one person at the library complaining that the computers seemed to be down a lot, so I'm not sure your theory about the connection's reliability is actually that accurate.

Eric--are you sure they've got replicas everywhere? The conversation you had sounds consistent with the library adopting the single-central-server model. When the connection to it goes down, the librarians write down checkout transactions on paper, and enter them in when the system comes back up. In theory, or in other systems, as you point out, that could cause inconsistencies, but since a book is only ever in one branch at a time, and can only be checked out of the branch where it exists, the delayed update is pretty safe.

Of course, if they really do already go to the trouble of maintaining replicas in every location, but don't allow for loose synchronization in case the connections are down (or merely too slow), then I concede your point--their system is pretty stunningly messed up. Maybe their computers are down all the time because they all crash every time even one of the replicas is down?

Yes, I believe they have replicas everywhere. I heard they were having problems when I first walked in, and yet I spent a while looking up books and even placed a few holds using the terminals (which is kind of strange). Now, it could be the case that the system had *just* crashed when I walked up to the counter, but as I say, I thought I heard otherwise.

I agree that it would be cheaper to build a 100% centralized system, though I still maintain that the distributed system is a superior product--perhaps not enough so to justify the cost.

1) I have forwarded a link to your posting to our library director.


2) I hope that they respond as to the current status, down times, etc.


3) If there is a response, I do hope that you link to it so that "both" sides of the discussion are easily available.

On the other hand, though I do agree that there are problems with poorly designed systems, it might be that there are so few outages, and that adding the backup software would be so costly, that the system was correctly purchased as-is.

Personally, I suspect that this was a "crack" in the RFQ, and I do hope that it is corrected, and that your article is a wake-up call to the people who specify and write such systems in the future.

The real mistake was writing down the numbers.

They should have photocopied your card and your books' barcodes together... That way they could just scan the pages when the system came back up.

Thanks to Mike Liveright for drawing my attention to this discussion thread. I'm the Director of the Palo Alto City Library, and I can shed at least a little light on what happened on Sunday afternoon. At about 4:30pm, approximately 30 minutes before the Library closed for the day on Sunday 6/5, there was some sort of hiccup in the City's network, just enough to cause a couple of the Library's servers and some PCs, including the express check-out stations, to go down. Staff in most locations simply rebooted machines as needed, and activities resumed smoothly.

Unfortunately, at the Main Library, the staff who should have taken the lead in rebooting, or at least calling for help, quite simply didn't do so as swiftly as they should have. They opted to check out materials manually instead of addressing the source of the problem, or using the standalone PC-based back-up system, which also would have been appropriate in this situation, since there wasn't a power failure. Anyway, by the time they regrouped and went about bringing equipment back up again, the Library was closing.

I really apologize for this inconvenience. The incident has caused me to call for a checklist of what to do when this happens the next time. This is something I didn't realize was lacking (I'm still relatively new to Palo Alto and the network very rarely fails), but clearly it's an important thing to have. Please let me know if you have further questions!
