A few notes on Google Print

The Authors Guild is suing Google over Google Print. Here's the response from Susan Wojcicki, Google VP of product management, and Jonathan Band's analysis of the fair use aspects.

A few points I haven't seen made elsewhere.
There's a lot of attention being paid to the fact that Google is only presenting excerpts of the material. Here's Wojcicki:

At most we show only a brief snippet of text where their search term appears, along with basic bibliographic information and several links to online booksellers and libraries.

I'm sure that's true, but that doesn't mean that Google isn't making the full book available. Repeated queries with overlapping search keys and a little screen scraping might well let you recover the entire book text. I know that Google has mechanisms that prevent large-scale automatic queries, but I don't know how hard those are to defeat in practice.
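To make the overlap argument concrete, here is a minimal sketch of stitching overlapping excerpts back into one continuous passage. It assumes the snippets are already collected, arrive in reading order, and each one overlaps the next; the function names and the sample text are made up for illustration, and this says nothing about what Google's snippets actually look like or how hard its query limits are to defeat.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def stitch(snippets: list[str]) -> str:
    """Merge snippets given in reading order, dropping the shared overlap."""
    if not snippets:
        return ""
    text = snippets[0]
    for s in snippets[1:]:
        # Append only the part of s that is not already at the end of text.
        text += s[overlap(text, s):]
    return text


# Hypothetical excerpts, each overlapping the next.
snippets = [
    "It was the best of times, it was",
    "of times, it was the worst of times,",
    "the worst of times, it was the age of wisdom",
]
stitched = stitch(snippets)
print(stitched)
```

The hard part in practice would be ordering excerpts whose position in the book is unknown, which this sketch sidesteps by assuming the order is given.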

As has been noted, Google's ability to perform full-text searches depends on having a complete copy of the scanned book. What hasn't been much noted is that Google probably isn't just making one copy of the material. Remember that Google operates an enormous server farm. It's quite likely that they have to make a substantial number of copies for parallel searches (performance) and redundancy (high availability).

Acknowledgement: The ideas in this post were developed during discussions between Hovav Shacham, Nagendra Modadugu, Cullen Jennings, and myself.

It also seems highly likely that for performance and redundancy, the data structure which stores this info looks nothing like the book which is being "copied". Am I "copying" a book if I write down a list of numbers which correspond to the ordinals of entries in a dictionary for each word in that book? E.g., the sentence "Aardvark Zzyzygy" becomes 1,9999999 in my notation. Now is "1,9999999" a copy of "Aardvark Zzyzygy"?
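As a rough sketch of the notation described here, the snippet below replaces each word with its ordinal in a word list. The four-word dictionary is a toy made up for illustration, so the ordinals differ from the 1,9999999 above. Whether the numbers can be turned back into text depends on two things: having the same dictionary, and the list keeping the original word order.

```python
def build_dictionary(words):
    """Map each distinct word to its 1-based position in sorted order."""
    return {w: i for i, w in enumerate(sorted(set(words)), start=1)}


def encode(text, dictionary):
    """Replace every whitespace-separated word with its ordinal."""
    return [dictionary[w] for w in text.split()]


def decode(ordinals, dictionary):
    """Look each ordinal up again; only works with the same dictionary."""
    reverse = {i: w for w, i in dictionary.items()}
    return " ".join(reverse[i] for i in ordinals)


# Toy dictionary (an assumption, not a real word list).
dictionary = build_dictionary(["Aardvark", "Zzyzygy", "book", "copy"])
codes = encode("Aardvark Zzyzygy", dictionary)
restored = decode(codes, dictionary)
print(codes, restored)
```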

Well, here's the part where my not being a copyright lawyer gets problematic... But it seems to me that if the replica is (1) derived from the book and (2) allows you to reconstruct the book, then yeah, it's a copy. The problem with your example is that it doesn't allow reconstruction. But to be useful for Google's purposes I suspect their replicas need to allow it.

You might want to check the law on concordances, since this can be used as a concordance. One of the most interesting cases involves the Dead Sea Scrolls; a concordance of the words used in them was published and then used by researchers denied access to the originals to reconstruct the text of some of the scrolls. Here is a brief discussion of the use of the concordance, though my memory says the legal battle was bloodier than this would imply.

Book scanning involves this process:

1) Make an image file of every page.
2) Derive OCR (optical character recognition) text from this image file. This is not proofread, but depending on the quality of the original image, it will usually be about 95 percent accurate, which is good enough for keyword searching. The end user never sees this OCR file.

Further, indexing the book involves this:

3) Index the words in the OCR file. Record the location and coordinates of each word in the original image file, so that the word or phrase can be highlighted.

4) Presumably, maintain some sort of word-frequency or relevancy data on the entire book, for ranking purposes in the search engine.
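Steps 3 and 4 can be sketched roughly as follows. This is only an illustration under assumptions: the `WordBox` fields, the `build_index` function, and the sample coordinates are all hypothetical, and nothing here reflects how Google actually stores its index.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class WordBox:
    """Where one OCR'd word sits in the page image (hypothetical layout)."""
    page: int
    x: int
    y: int
    width: int
    height: int


def build_index(ocr_words):
    """ocr_words: iterable of (word, WordBox) pairs from the OCR step."""
    index = defaultdict(list)  # word -> every place it appears (step 3)
    counts = Counter()         # word -> frequency in the book (step 4)
    for word, box in ocr_words:
        key = word.lower()
        index[key].append(box)
        counts[key] += 1
    return index, counts


# Made-up OCR output for two pages.
ocr_words = [
    ("Call", WordBox(1, 100, 200, 40, 12)),
    ("me", WordBox(1, 145, 200, 22, 12)),
    ("Ishmael", WordBox(1, 172, 200, 70, 12)),
    ("me", WordBox(2, 80, 340, 22, 12)),
]
index, counts = build_index(ocr_words)

# A hit on "me" yields the boxes to highlight in the page images.
for box in index["me"]:
    print(box.page, box.x, box.y)
```

Note that an index like this records where every word occurs, which is what makes highlighting possible.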

The library and Google each get a copy of both the image file and the OCR file that Google generates from the image file. When Google hits on search terms and wants to display the snippet or page, what's displayed is from the image file.

