
Tuesday, October 10, 2017

Google Books and Mein Kampf

I hadn't looked at Google Books in a while, or at least not carefully, so I was surprised to find that Google had added blurbs to most of the books. Even more surprising (although perhaps I should say "troubling") is that no source is given for the book blurbs. Some at least come from publisher sites, which means that they are promotional in nature. For example, here's a mildly promotional text about a literary work, from a literary publisher:



This gives a synopsis of the book, starting with:

"Throughout a single day in 1892, John Shawnessy recalls the great moments of his life..." 

It ends by letting the reader know that this was a bestseller when published in 1948, and calls it a "powerful novel."

The blurb on a 1909 version of Darwin's The Origin of Species is mysterious because the book isn't a recent publication with an online site providing the text. I do not know where this description comes from, but because the entire thrust of this blurb is about the controversy of evolution versus the Bible (even though Darwin did not press this point himself), I'm guessing that the blurb post-dates this particular publication.


"First published in 1859, this landmark book on evolutionary biology was not the first to deal with the subject, but it went on to become a sensation -- and a controversial one for many religious people who could not reconcile Darwin's science with their faith."
That's a reasonable view to take of Darwin's "landmark" book but it isn't what I would consider to be faithful to the full import of this tome.

The blurb on Hitler's Mein Kampf is particularly troubling. If you look at different versions of the book you get both pro- and anti-Nazi sentiments, neither of which really belong on a site that claims to be a catalog of books. Also note that because each book entry has only one blurb, the tone changes considerably depending on which publication you happen to pick from the list.


First on the list:
"Settling Accounts became Mein Kampf, an unparalleled example of muddled economics and history, appalling bigotry, and an intense self-glorification of Adolf Hitler as the true founder and builder of the National Socialist movement. It was written in hate and it contained a blueprint for violent bloodshed."

Second on the list:
"This book has set a path toward a much higher understanding of the self and of our magnificent destiny as living beings part of this Race on our planet. It shows us that we must not look at nature in terms of good or bad, but in an unfiltered manner. It describes what we must do if we want to survive as a people and as a Race."
That's horrifying. Note that both books are self-published, and the blurbs are the ones that I find on those books in Amazon, perhaps indicating that Google is sucking up books from the Amazon site. There is, or at least at one point there was, a difference between Amazon and Google Books. Google, after all, scanned books in libraries and presented itself as a search engine for published texts; Amazon will sell you Trump's tweets on toilet paper. The only text on the Google Books page still claims that Google Books is about search: "Search the world's most comprehensive index of full-text books." Libraries partnered with Google with lofty promises of gains in scholarship:
"Our participation in the Google Books Library Project will add significantly to the extensive digital resources the Libraries already deliver. It will enable the Libraries to make available more significant portions of its extraordinary archival and special collections to scholars and researchers worldwide in ways that will ultimately change the nature of scholarship." Jim Neal, Columbia University
I don't know how these folks now feel about having their texts intermingled with publications they would never buy and described by texts that may come from shady and unreliable sources.

Even leaving aside the grossest aspects of the blurbs and Google's hypocrisy about the commercialization of its books project, adding blurbs to the book entries with no attribution and clearly not vetting the sources is extremely irresponsible. It's also very Google to create sloppy algorithms that illustrate their basic ignorance of the content they are working with -- in this case, the world's books.

Friday, November 23, 2012

Fair Use(-ful)

The beauty and the aggravation of Fair Use in US copyright law is that one cannot pre-define particular uses as "fair." The countries that have, instead, the legal concept of "Fair Dealing" have an enumerated set of uses that are considered fair, although there is obviously still some need for interpretation. The advantage to Fair Use is that it can be re-interpreted with the times without the need for modification of the law. As new technologies come along, such as digitization of previously analog works, courts can make a decision based on the same four factors that have been used for earlier technologies. However, until such a decision is made in a court of law, it isn't possible to be sure whether a use is fair or not.

We have recently seen a court case that decided that HathiTrust's use of digitized books to provide an index to those books is fair. There is another court case that will decide a similar question regarding Google's digitization of books for its Google Book Search. Note, however, that even if both of these are determined to be fair use, each is a particular situation in a particular context. Both organizations have developed their services in an attempt to meet what they judged to be the letter of the law, and yet there is a considerable difference in the services they provide.

HathiTrust stores copies of digitized books from the collections of member libraries. In this case, HT is not itself doing the digitization but is storing files for books mostly digitized by Google. A search in the full text database of OCR'd page images returns, for in-copyright items, the page numbers on which the terms were found, and the number of hits found on each page. There are no snippets and no view of the text unless the text itself is deemed to be out of copyright.

Google has a different approach. To begin with, Google has performed mass digitization of books (estimated at about 20 million) without first obtaining permission from rights holders. So the Google case includes the act of digitization, whereas the HathiTrust case begins with digital files obtained from Google. Therefore the act of digitizing was not a factor in that case. In terms of use of the digitized works, Google also provides keyword searching of the OCR'd digital images, but takes a different approach to the results viewable by the searchers. Google provides short (about 3-5 lines) snippets that show the search terms in context on a page.
Google, however, places specific restrictions to avoid letting users "game" the search to gain access to enough of the text to substitute for actually acquiring access to the book. Here is how Google describes this in its recent legal response:
"The information that appears in Google Books does not substitute for reading the book. Google displays no more than three snippets from a book in response to a search query, even if the search term appears many times in the book. ... Google also prevents users from view a full page, or even several contiguous snippets, by displaying only one snippet per page in response to a given search and by 'blacking' (i.e. making unable for snippet view in response to any search) at least one snippet per page and one out of ten pages in a book." p.8
Google also exempts some types of books, like reference works, cookbooks, and poetry, from snippet display entirely.
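Google has not published how it implements these restrictions, but as a sketch of the logic described in the brief, something like the following would do it; the function names and the hash-based choice of "blacked out" pages are my own invention, not Google's:

import hashlib

def blacked_out(book_id, page_number):
    # Roughly one page in ten is never available for snippet view; deriving
    # the choice from a hash of the book and page keeps it stable no matter
    # what the user searches for.
    digest = hashlib.sha256(f"{book_id}:{page_number}".encode()).digest()
    return digest[0] % 10 == 0

def snippets_for_query(book_id, hit_pages, max_snippets=3):
    # hit_pages: the pages on which the search term appears, in page order.
    shown = []
    for page in hit_pages:
        if blacked_out(book_id, page):
            continue                    # blocked pages are never shown
        if page in shown:
            continue                    # at most one snippet per page
        shown.append(page)
        if len(shown) == max_snippets:  # never more than three snippets
            break
    return shown

The point of deriving the blocked pages from something fixed, rather than choosing them per query, is that no sequence of searches can ever reveal them.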

The differences in the results returned by these two services reflect the differences in their contexts and their goals. HathiTrust has member institutions and their authorized users. The collection within HathiTrust reflects the holdings of the member institutions' libraries which means that the authorized users should have access, either in their library or through inter-library loan, to the physical book that was scanned. The HathiTrust full text is a search on the members' "stuff." The decision to give only page numbers makes some sense in this context, although providing snippets to scholars might have been acceptable to the judge. The return of page numbers and full word counts within pages reflects, IMO, the interest in quantitative analysis of term use. It also gives scholars some idea of the weight the term has within the text.

Google's situation is different. Google has no institutions, no members, no libraries; it provides its service to the general public (at least to the US public). There is no reason to assume that all of the members of that public will have access to the hard copy of any particular digitized book. Google seems to have decided that promoting its service as having primarily a marketing function, with the snippets as "teasers," would mollify the various intellectual property owners. In its brief of November 9, Google reiterates that it does not put advertising on the Google Book Search results pages, nor does Google make any money off of its referrals to book purchasing sites.

So here are two organizations that have bent over backwards to stay within what they deemed to be the boundaries of fair use, and they have done so in significantly different ways. This means that the fair use determination of each of these could have different outcomes, and each will provide different clues as to how fair use is viewed for digitized works.

It of course bears mentioning that both of these solutions provide hurdles for users. The HathiTrust user who is searching on a term that could have more than one meaning ("iron" "dive" "foot") does not have any context to help her understand if the results are relevant. The Google user, on the other hand, gets some context but cannot see all of the results and therefore does not know if there are key retrievals among those that have been blocked algorithmically. A use that is "fair" within copyright law may not seem "fair" to the user who is doing research. It makes you wonder if our idea of "fair use" couldn't be extended to be fair but also "useful."

Related posts
http://kcoyle.blogspot.com/2012/10/copyright-victories-part-ii.html

Sunday, July 15, 2012

Friends of HathiTrust

I have written before about the lawsuit of the Authors Guild (AG) against HathiTrust (HT). The tweet-sized explanation is that the AG claims that the digitized books in the HathiTrust that are not in the public domain are infringements of copyright. HathiTrust claims that the digitized copies are justified under fair use. (It may be relevant that many of the digitized texts stored in HT are the result of the mass digitization done by Google.)

For analysis of the legal issues, please see James Grimmelmann's blog, in particular his post summarizing how the various arguments fit into the copyright law's "four factors."

I want to focus on some issues that I think are of particular interest to librarians and scholars. In particular, I want to bring up some of the points from the amicus brief from the digital humanities and law scholars.

While scientists and others who work with quantifiable data (social scientists using census data, business researchers with huge amounts of stock market data, etc.) have been able to apply large-scale computational methods, those working in the humanities, whose raw material is in printed texts, have not been able to make use of the massive data mining techniques that are moving other areas of research forward. If you want to study how language has changed over time, or when certain concepts entered the vocabulary of mass media, the physical storage of this information makes it impossible to run these analyses as computations, and the size of the corpus makes it very difficult, if not impossible, to do the research in "human time". Thus, the only way for the "Digital Humanities" to engage in modern research is after the digitization of their primary materials.
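To make the kind of research at stake a bit more concrete, here is a minimal sketch of a term-frequency-by-year count over a corpus of OCR'd texts; the directory layout and file-naming convention (a year prefix on each file name) are hypothetical, purely for illustration:

import os
import re
from collections import Counter

def term_counts_by_year(corpus_dir, term):
    # corpus_dir holds one plain-text OCR file per book, named "<year>_<id>.txt";
    # returns {year: number of occurrences of the term}.
    counts = Counter()
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    for name in os.listdir(corpus_dir):
        if not name.endswith(".txt"):
            continue
        year = name.split("_", 1)[0]
        with open(os.path.join(corpus_dir, name), encoding="utf-8") as f:
            counts[year] += len(pattern.findall(f.read()))
    return dict(sorted(counts.items()))

None of this is possible, of course, unless the texts have first been copied into machine-readable form.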

This presumably speaks to the first factor of fair use:

(1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;

As Grimmelmann says, "The Authors Guild focuses on the corpus itself; HathiTrust focuses on its uses." It may make sense that scholars should be allowed to make copies of any material they need to use in their research, but I can imagine objections, some of which the AG has already made: 1) you don't need to systematically copy every book in every library to do your research, and 2) that's fine, but can you guarantee that infringing copies will not be distributed?

It's a hard sell, yet it's also hard not to see the point of view of the humanities scholars who feel that they could make great progress (ok, and some good career moves) if they had access to this material.

The other argument that the digital humanities scholars make is that the data derived from the digitization process is not infringing because it is non-expressive metadata. Here it gets a bit confusing because although they refer to the data derived from digitization as "metadata," the examples that they give vary from the digitized copies themselves, to a database where all of this is stored, and to the output from Google n-grams. If the database consists of metadata, then the Google n-grams are an example of the use of that metadata, but are not an example of the metadata itself. In fact the "metadata" that is produced from digitization is a good graphic copy of each page of the book, plus a reproduction, word for word (with unfortunate but not deliberate imprecision) of the text itself. That this copy is essential for the research uses desired is undeniable, and the brief gives many good examples of quantitative research in the humanities. But I fear that their insistence that digitization produces mere "metadata" may not be convincing.

Here's a short version from the text:

"In ruling on the parties’ motions, the Court should recognize that text mining is a non-expressive use that presents no legally cognizable conflict with the statutory rights or interests of the copyright holders. Where, as here, the output of a database—i.e., the data it produces and displays—is noninfringing, this Court should find that the creation and operation of the database itself is likewise noninfringing. The copying required to convert paper library books into a searchable digital database is properly considered a “nonexpressive use” because the works are copied for reasons unrelated to their protectable expressive qualities; none of the works in question are being read by humans as they would be if sitting on the shelves of a library or bookstore." p. 2

They also talk about transformation of works; the legal issues here are complex, and my impression is that the various past legal decisions may not provide a clear path. They then end a section with this quote:

"By contrast, the many forms of metadata produced by the library digitization at the heart of this litigation do not merely recast copyrightable expression from underlying works; rather, the metadata encompasses numerous uncopyrightable facts about the works, such as author, title, frequency of particular words or phrases, and the like." (p.17)

This, to me, comes completely out of left field. Anyone who has done digitization projects is aware that most projects use human-produced library metadata for the authors and titles of the digitized works. In addition, the result of the OCR step of the digitization process is a large text file that is the text, from first word to last, in that order, and possibly a mapping file that gives the coordinates of the location of each word on each OCR'd page. Any term frequency data is a few steps away from the actual digitization process and its immediate output, and fits in perfectly with the earlier arguments around the use of data mining.

I do sincerely hope that digitization of texts will be permitted by the court for the purposes argued in this paper. An attempt at justification, after the fact, of Google's mass digitization project may, however, suffer weaknesses inherent in that project, in particular that no prior negotiation was attempted with either authors or publishers, and once the amended settlement between Google and the suing parties was rejected by the court, there was no mutual agreement on uses, security, or compensation.

In addition, the economic and emotional impact of Google's role in this process cannot be ignored: this is a company that is so strong and so pervasive in our lives that mere nations struggle to protect their own (and their citizens') interests. When Google or Amazon or Facebook steps into your territory, the earth trembles and fear is not an unreasonable response. I worry that the idea of digitization itself has been tainted, making it harder for scholars to make their case for the potential benefits of post-digitization research.

Wednesday, April 25, 2012

Digital Urtext

As we reach a point where many of the classic books of literature and science published before the magical date of 1923 have been digitized, it is time to consider the quality of those copies and the issue of redundancy.

A serious concern in the times before printing was that copying -- and it was hand-copying in those times -- introduced errors into the text. When you received a copy of a Greek or Latin work you might be reading a text with key words missing or misrepresented. In our digitizing efforts we have reproduced this problem, and are in a situation similar to that of the Aldine Press when it set out to reproduce the classics for the first time in printed form: we need to carry the older texts into the new technology as accurately as possible.

While the digitized images of pages may be relatively accurate, the underlying (and, for the most part, uncorrected) OCR introduces errors into the text. The amount of error is often determined by the quality of the original or the vagaries of older fonts. If your OCR is 99.9% accurate, you still have one error for every 1,000 characters. A modern book has about 1,500 characters on a page, so that means one or two errors on every page. Also, there are particular problems in book scanning, especially where text doesn't flow easily on the page. Tables of contents seem to be full of errors:
IX. Tragedy in the Gra\'eyard 80 

X. Dire Prophecy of the Howling Dog .... 89 
XL Conscience Racks Tom 98 
 
In addition, older books have a tendency to use hyphenated line breaks a great deal:

and declined. At last the enemy's mother ap-
peared, and called Tom a bad, vicious, vulgar child, 

These remain on separate lines in the OCR'd text, which is accurate to the original but which causes problems for searching and any word analysis.
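A post-processing step can rejoin most of these broken words. This is only a sketch of my own, not something the digitization projects are known to do, and it will wrongly merge words that happen to be legitimately hyphenated at a line break:

import re

def rejoin_hyphenated(ocr_text):
    # "ap-\npeared" becomes "appeared"; searching and word counts then see
    # the whole word instead of two fragments.
    return re.sub(r"(\w+)-\n(\w+)", r"\1\2", ocr_text)

# rejoin_hyphenated("the enemy's mother ap-\npeared, and called Tom")
# -> "the enemy's mother appeared, and called Tom"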

The other issue is that for many classic works we have multiple digital copies. Some of these are different editions, some are digitizations (and OCR-ing) of the same edition. Each has different errors.

For the purposes of study, and for the use of these texts for study, it would be useful to have a certified "Urtext" version, a quality digitization with corrected OCR that scholars agree represents the text as closely and accurately as possible. This might be a digital copy of the first edition, or it might be a digital copy of an agreed "definitive" edition.

We have a notion of "best edition" (or "editions") for many ancient texts. Determining one or a small number of best editions for modern texts should not be nearly as difficult. Having a certified version of such texts must be superior to having students and scholars reading from and studying a wide variety of flawed versions. Professors could assign the Urtext version to their classes, knowing that every one of the students was encountering the same high quality text.




(I realize that Project Gutenberg may be an example of a quality control effort -- unfortunately those texts are not coordinated with the digital images, and often do not have page numbers or information about the edition represented. But they are to be praised for thinking about quality.)

Saturday, November 22, 2008

More on Google/AAP

Here are some more bits and thoughts on the agreement between Google and the AAP.

Library Involvement

Some librarians were involved in the settlement talks. The only one I have found so far who has come out about this is Georgia Harper. The librarians were working under a non-disclosure agreement (NDA), and therefore will not be able to reveal any details of the discussions. I have heard statements from others who I believe were privy to the negotiations, and they all seem to feel that the outcome was better for libraries due to the involvement of members of our "class." (Note that Google and AAP had high-end lawyers arguing their side, and we had hard-working librarians. I don't know how many of "our" representatives were also lawyers, but you can just imagine how greatly out-gunned they were.) Unfortunately that doesn't change my mind about the bait and switch move.

Google Books as Library

Some have begun to refer to Google Books as a library. We have to do some serious thinking about what the Google Book database really is. To begin with, it's not a research collection, at least not at this point. It's really a somewhat odd, almost random bunch of book "stuff." As you know, neither Google nor the libraries are selecting particular books for digitization. This is a "mass digitization" project that starts at one end of a library and plows through blindly to the other end. Some libraries have limited Google to public domain works, so in terms of any area of study there is an artificial cut-off of knowledge. Not to mention that some libraries, mainly the University of California, have been working with Google primarily to digitize books in their two storage facilities; that is, they have been digitizing the low use books that were stored remotely.

So the main reason why Google Books is not a library is that it isn't what we would call a "collection." The books have not been chosen to support a particular discipline or research area. Yet it will become a de facto collection because people will begin using it for research. Thus "all human knowledge" becomes something more like the elephant and the blind man: research in online resources and research that uses print materials will get very different views of human knowledge. (This is not a new phenomenon. I wrote about this in terms of some early digital projects I was involved in.) One of the big gaps in Google Books will be current materials, those that are still in print. Google will need to convince the publishers that it can increase their revenue stream for current books in order to get them to participate.

Subscribing to Google Books: Just Say No?


Beyond the (undoubtedly hard-won by library representatives) single terminal access in each public library in the US, libraries will be asked to subscribe to the Google Book service in order to give their users access to the text of the books (not just the search capability). This is one of the more painful aspects of the agreement because it seems to ignore the public costs that went into the purchase, organization, and storage of those works by libraries. (I'm not including privately funded libraries here, but many of the participants are publicly funded.) The parallels with the OCLC mess are ironic: libraries paying for access to their own materials. So, couldn't the libraries just refuse to subscribe? Not really. Publicly funded libraries have a mission to provide access to the world's intellectual output in a way that best serves their users. When something new comes along -- films on DVD, music on CD, the Internet -- libraries must do what they can to make sure that their users are not informationally underprivileged. Google now has the largest body of digitized full text, and there will be a kind of "information arms race" as institutions work to make sure that their users can compete using these new resources.

The (Somewhat Hidden) Carrot

I can't imagine that anyone thought that libraries and Google were digitizing books primarily so that people could read what are essentially photographs of book pages on a computer screen. Google initially stated that they were only interested in searching the full text of books. While interesting in itself, keyword searching of rather poor OCR text is not a killer app. What we gain by having a large number of digitized books is a large corpus on which we can do computational research. We can experiment with ideas like: can we follow the flow of knowledge through these texts? Can we create topic maps of fields of study? Can we identify the seminal works in some area? The ability to do this research is included in the agreement (section 7.2(d), The Research Corpus). There will be two copies of this corpus allowed under the agreement, although I don't see any detail as to what the "corpus" will consist of. Will it just be a huge file of digitized books and OCR? Will it be a set of services?

I have suspected for a while that Google was already doing research on the digital files that it holds. It only makes sense. For academics in areas like statistics, computer science, and linguistics, this corpus opens up a whole range of possibilities for research; and research means grants, and grants mean jobs (or tenure, as the case may be). This will be a strong motivation for institutions to want to participate in the Google Book product. Research will NOT be limited to participants; others can request access. What I haven't yet found is anything relating to pricing for the use of the research collection, nor if being a participating library grants less expensive access for your institution. If the latter is the case, then one motivation for libraries to agree to allow Google to scan their books (at some continuing cost to the library) will be that it favors the institution's researchers in this new and exciting area. Full participant libraries (the ones that get to keep the digital copies of their works) can treat their own corpus as research fodder. The other costs of being a full participant are such that I'll still be surprised if any libraries go that route, but if they do I think that this "hidden carrot" will be a big part of it.

----

There's lots of good blogging going on out there on this topic. It needs a cumulative page to help people find the posts. Please tell me you have time to work on that, so I don't have to take it on! (Or that it exists already and I've missed it.) (The PureInformation Blog has a good list.)

Note: the Internet Archive/OCA may take this on. I'll post if/when they do.

Previous posts:

Tuesday, November 18, 2008

Google Giveth ... and Taketh Away

Some additions, amendments.

The agreement between Google and the AAP is of great significance for libraries. It is also very long, written in "legalese", and contains conclusions of a lengthy negotiation without revealing the nature of the discussion. Given that many lawyers were involved, we may never get the back story of this historic settlement, yet it has the potential to change the landscape on rights, digitization, and libraries.

I am basing much of my analysis on the summary of the agreement produced by ARL. This unfortunately means that some errors may be introduced between their summary and my interpretation. I have gone to the original document to check some particulars, such as definitions, but much of that document goes unread for now.

Key Points

(... or, a summary of the summary)

  • The agreement is primarily about books that are presumed to be in copyright but which are no longer in print. In-print books continue to be managed directly by the rights holders, who can make agreements with Google (or anyone else) for uses of those items.

  • The agreement has some odd limitations that baffle me: it only covers books published in the US that have been registered with the Copyright Office. It does not include any books published after January 5, 2009. The settlement does cover non-US books (e.g. Berne countries); I'm still unclear on the statement about registration for US books, but it was cited in the ARL document.

  • The agreement trades off Google's liability with payment to rights holders. That is, as long as Google requires payment from users for displays and copies, and passes 2/3 of those monies to the rights holders, Google is exempt from copyright infringement claims by rights owners. So users of the digital files will pay to keep Google legal.

  • The agreement does not answer the all-important question of whether scanning for the purposes of searching is an allowed use under copyright law.

  • The agreement flouts the concept of Fair Use by quantifying the amount of an in-copyright book that users can view for free ("20% of the text," "five adjacent pages," but not the final 5% of a fiction book, to keep the endings a surprise). The ARL document has Google saying that it will not interfere with fair use. I can't find that statement in the actual settlement. These quantities are contractual, and I'm assuming that the technology will not allow users to exercise fair use rights beyond what the contract allows.

  • Google will sell digital copies of in-copyright books to users, who will have perpetual access to the book online. Some printing will be allowed, but all printed pages will have a watermark that identifies the user. (I'm calling this "ratware," software that rats you out.) Users will be able to make notes on the book's pages, but they will only be able to share those notes with other purchasers of the book. (Thus buying a Google book is like joining a secret reading club.) The settlement states that the watermark will identify either the user or other information "which could be used to identify the authorized user that printed the material or the access point from which the material was printed." Agreement, p. 47

Key Points Relating to Libraries

This is the hard part for me. Hard in that it really hurts.

  • After digitizing books held in libraries, Google will then turn around and become a library vendor, supplying those same books back to libraries under Google's control. Each public library in the US will get a single "terminal" provided (and presumably controlled) by Google that allows users to view (but not copy and paste from) books in the Google database. Some printing is allowed, but there will be a per-page fee charged.

  • Libraries and institutions can also subscribe to all or part of the database of out of print books. Access is not perpetual, but limited to the life of the subscription.

  • There is verbiage about how users in these institutions can share their "annotations." In other words, if you take notes on your own, obviously those are yours. But if you use the capabilities of the system to make your notes in the system, you cannot share your own notes freely.

Now for the Clincher


... this is the pact with the devil.

  • A library can partner with Google for digitization of its collection and get the same release from liability that Google has. The library can keep copies of these digitized books; however, it must follow security standards set by Google and the AAP and must submit its security plan for review and allow yearly auditing. (The security measures are formidable and quite possibly not affordable for all but the wealthiest institutions. There are huge penalties, up to millions of dollars, for not getting security right.)

  • Libraries that make this pact with the devil are thereby allowed to preserve the files, print replacement copies for deteriorating books, and provide access for people with disabilities. Note that all of these uses by libraries are already allowed by copyright law.

  • The libraries that make this pact with the devil cannot let their users read the digitized books. Well, they can let them read up to five (5!) pages in any digitized book. Presumably if the library wants to provide other uses it must subscribe to Google's service. Libraries are expressly forbidden from using their copies of the books for interlibrary loan, e-reserves, or in course management systems.

... and if you refuse to negotiate with the devil...

  • Current Google library partners who do not choose to become party to this must delete all copies of digitizations of in-copyright works made by the Google project in order to obtain a release from liability. If they choose not to delete the copies, they are on their own in terms of liability for the in-copyright books that Google did digitize (and Google knows exactly which books are involved.)

  • Even if the library was only allowing Google to digitize public domain works, those libraries must destroy all of their copies to get release from liability in case they misjudged the copyright status of one of those books.
In other words, this agreement is making the assumption that if anyone sues Google for copyright infringement, the library will be a party to that suit.

They say that "the devil is in the details." In this case that is not true: the devil is right up front, in the main message. That message is that Google has agreed with the publishers, and is selling out the libraries that is has been working with. The deal that Google and the libraries had was that in exchange for working with Google to digitize books in their collections, the libraries received a copy of the digital file. After that, it was up to the libraries to do the right thing based on their understanding of copyright law. Participating with Google has been an expensive proposition for the libraries in terms of their own staff time and in the development of digital storage facilities. Part of the appeal of working with Google was the assumption that partnering with the search giant gve the entire project clout and provided some protection for the libraries. With Google and the AAP now in cahoots, the libraries must join them or try to stand alone in an unclear legal situation; an unclear situation that Google invited the libraries into in the first place.

This is classic bait and switch. And it is bait and switch with powerful commercial interests against public institutions. There is no question about it...

THIS IS EVIL

Note: I've added more comment and info in the comments area as things pop up. So read on....

Monday, November 03, 2008

Determining Copyright Status

Among the many interesting bits in the Google/AAP agreement is Section E which essentially lays out in detail what steps Google must take to determine if an item is or is not in the public domain. As we know, this is not easy. The agreement states that two people must view the title page of the work (yes, it says "two people") to determine if the item has a copyright notice, and to check the place of publication. To determine if copyright has been renewed, "Google shall search either the United States Copyright Renewal Records or a copy thereof." If a renewal record isn't found, and the work has a copyright date before 1964, then it is presumed to be in the public domain.
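Reduced to its core, the rule (as I read the summary, not as the agreement spells it out in full) is something like this sketch:

def presumed_public_domain(copyright_year, renewal_found):
    # My reading of the rule described above, for US books: a copyright date
    # before 1964 combined with no renewal record means the book is presumed
    # to be in the public domain. (Renewal became automatic for later works,
    # and the real determination has many more branches than this.)
    return copyright_year < 1964 and not renewal_found

All of the complexity, of course, is in reliably establishing those two inputs.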

I decided to try this out, at least the part about checking the renewal. I did my searches in two databases: Stanford's and Rutgers'.

I happen to have a copy of Orwell's 1984 with detailed copyright notices. It lists the first copyright as 1949, by Harcourt, Brace and Jovanovich, Inc. It then says "Copyright renewed 1977 by Sonia Brownell Orwell." It also includes "Copyright 1984 by Virgin Cinema Films Limited" although I must say that I'm not sure why that latter copyright notice is in the book.

A search on '1984' in the Rutgers database yields no hits, but using the author's name I find 37 items, of which one reads:
AUTH: George Orwell, translation: Amelie Audiberti. NM: translation.
TITL: 1984.
ODAT: 1Jul50; DREG: 7Nov77 RREG: R678090. RCLM: AFO-2377. Amelie Audiberti, nee Elisabeth Savane (A)
A search in the Stanford database gets me:
Title    1984 NM: translation
Author George Orwell, translation: Amelie Audiberti
Registration Date 1Jul50
Renewal Date 7Nov77
Registration Number AFO-2377
Renewal Id R678090
Renewing Entity Amelie Audiberti, nee Elisabeth Savane (A)
Both of these seem to be for the same item, and it's a translation of the book 1984. The renewal listed in the book for the English text is not in the databases. The instructions to Google say nothing about taking renewal dates from the book, so this one would appear to be in the public domain by the agreement's criteria.

Picking up another book of the right age, I have Proust's "The Captive" in the Modern Library edition, the "C. K. Scott Moncrieff" translation, with "Copyright, 1929, by Random House, Inc." on the title page.

In Stanford's database I get:

Title    The captive. Translated by C. K. Scott Monorieff
Author PROUST, MARCEL
Registration Date 27Jun29
Renewal Date 7Sep56
Registration Number A9965
Renewal Id R176423
Renewing Entity Random House, Inc. (PWH)

In Rutgers I get:
CLNA: RANDOM HOUSE, INC.
TITL: The captive.
XREF: Proust, Marcel.
Unfortunately, the latter doesn't include a date, so I'm not sure that this record provides sufficient information. Fortunately, the Stanford database gives more information. Unfortunately, the Stanford record gives the title and what we librarians would call the "statement of responsibility" in the same field, and misspells the name of the translator. This may make it more difficult for any automated matching of the records. (I am assuming that Google will be doing automated matching, not hand searching of the database. That may be a mistaken assumption, especially since they have agreed that two humans will view the title page.)
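For what it's worth, here is a sketch of the kind of fuzzy comparison an automated matcher would need in order to cope with extra text and misspellings in the title field; the normalization and thresholds are purely illustrative, not anything Google has described:

from difflib import SequenceMatcher

def _words(s):
    # lowercase and strip punctuation so "PROUST, MARCEL" and "Marcel Proust"
    # reduce to the same words
    return "".join(c if c.isalnum() or c.isspace() else " " for c in s.lower()).split()

def likely_match(book_title, book_author, record_title, record_author, threshold=0.85):
    # The record's title field often carries extra text (translator, statement
    # of responsibility) and typos, so compare the book title against the
    # beginning of the record title, and only require a word of the author's
    # name to appear somewhere in the record's author/claimant field.
    t_book = " ".join(_words(book_title))
    t_rec = " ".join(_words(record_title))[: len(t_book)]
    title_ok = SequenceMatcher(None, t_book, t_rec).ratio() >= threshold
    author_ok = any(w in _words(record_author) for w in _words(book_author) if len(w) > 3)
    return title_ok and author_ok

# likely_match("The Captive", "Marcel Proust",
#              "The captive. Translated by C. K. Scott Monorieff",
#              "PROUST, MARCEL")   -> True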

This next (and last) one is an especially interesting case. I have a copy of Rebecca West's "Black Lamb and Grey Falcon: A Journey through Yugoslavia" printed by Penguin books in 1994. It gives the copyright date as "1940, 1941" and the renewal date as "1968, 1969", both under the name of Rebecca West.

A search on the title in Rutgers' database gets me these three records:

CLNA: WEST, ROBERT.
TITL: Black lamb and grey falcon. (In Atlantic monthly, Feb.-May 1941)
ODAT: 21Jan41 OREG: B482882; 19Feb41 RREG: Rebecca West ; 12Aug68; R441634-441631.

CLNA: WEST, REBECCA.
TITL: Black lamb and grey falcon; a journey through Yugoslavia. Pub. serially in the Atlantic monthly, Dec. 17, 1940-Apr. 17, 1941. NM: additions.
ODAT: 20Oct41; A158501 RREG: Rebecca West ; 10Jan69; R453530.

CLNA: WEST PUB. CO.
TITL: Black lamb and grey falcon. (In The Atlantic monthly, Jan. 1941)
ODAT: 20Dec40; B479489 RREG: Rebecca West ; 2Jan68; R426137.

As you can tell, some part of the book was originally published in the Atlantic Monthly as a serial. From these records it's difficult to tell exactly what issues of the monthly it was included in, and the "Claimants" are all different. In the Stanford database it's a bit more clear. There are five records; four are duplicates for the original articles in the Atlantic Monthly and one more called "Additions." Each of the four duplicate records is like this one:
Title    Black lamb and grey falcon. (In Atlantic monthly, Feb.-May 1941)
Author WEST, REBECCA.
Registration Date 21Jan41, 19Feb41,21Mar41 21Apr41
Renewal Date 12Aug68
Registration Number B482882, B488595, , B492319,, B495868
Renewal Id R441633
Renewing Entity Rebecca West (A)
I suppose that the four renewal records are one for each item in the Atlantic Monthly, but they each have the same information. Only the fifth record, the one for "additions," includes the subtitle that appears on the book. The presence of the article records is puzzling because Stanford claims to have included only records for the renewal of books. In fact, it is easy to find records for articles in the database, so it's probably best to assume that the database covers text in general.

Even for the human searcher, it may be difficult to connect the book and the records because there is nothing in the book itself to indicate that it was previously published in a journal. In fact, the introduction merely mentions that the book itself was first published in two volumes in 1941.

The book was published in two volumes because it is nearly 1200 pages long. The archives of the Atlantic Monthly list the four articles with this same name as containing 24, 24, 26, and 24 pages, respectively. It's rather hard to understand how those articles, as copyrighted, could be the same as a 1200 page book. We are left only with the record that claims to be "Additions" and that has the same subtitle as the book:

Title   Black lamb and grey falcon; a journey through Yugoslavia.
Pub. serially in the Atlantic monthly, Dec. 17, 1940-Apr. 17, 1941.
NM: additions
Author WEST, REBECCA
Registration Date 20Oct41
Renewal Date 10Jan69
Registration Number A158501
Renewal Id R453530
Renewing Entity Rebecca West (A)
Again, the title field contains quite a bit of information beyond the title, and it just isn't crystal clear to me that this record is for the book and not for the articles. If it is for the book, then the idea that 1200 pages were published serially over four journal issues is quite a stretch. Plus, the Monthly archive claims that the dates are Jan, Feb, Apr and May, 1941.

Underlying the statement "To determine if copyright has been renewed, Google shall search either the United States Copyright Renewal Records or a copy thereof" is a great deal more complexity than that one sentence implies. It makes me wonder if the negotiators for the AAP are fully aware of how inaccurate the results might be. (An example: the author field in a record for an article by George Orwell reads: "Author George Orwell. U. S. ed. pub. as Shooting an elephant, 26Oct50, A49135".) If they are aware of it, then I must commend them for taking the practical path and allowing Google to make books available based on this evidence. If a copyright holder notifies Google that a book has been determined to be public domain in error, Google is obliged to change the status of the work from public domain to "in copyright," but is not held liable for infringement if the steps for determining public domain were followed and documented as laid out in the agreement.

It will be hard to catch the opposite error, however: cases where Google errs on the side of copyright and lists as in-copyright works that are actually in the public domain. While copyright holders can be expected to make sure that their works are properly protected, works in the public domain have no rights holder to monitor their status, and no one assigned to protect the public interest.

One other caveat, which appears in Section E, is:
Any determination by Google that a work is a Public Domain Book is solely for the purposes of Section 3.2(d)(v) and is not to be relied on or invoked for any other purposes, including determining whether a work is in fact in the public domain under the Copyright Act.
Basically, this means that just because Google determines that a book is in the public domain doesn't mean that's the legal status of the book. It also means that the rest of us can't use the excuse: "But Google says it's in the public domain." I have not heard whether Google will make the documentation of its copyright search available, and it's that documentation that has the real value. It's kind of like algebra: the answer is important, but what really matters is how you got the answer.

[Note: keep an eye on the Open Library and Creative Commons for some work on copyright determination that will be openly accessible.]

Google/AAP settlement

This Google/AAP settlement has hit my brain like a steel ball in a pinball machine, careening around and setting off bells and lights in all directions. In other words, where do I start?

Reading the FAQ (not the full 140+ page document), it seems to go like this:

Google makes a copy of a book.
Google lets people search on words in the book.
Google lets people pay to see the book, perhaps buy the book, with some money going to the rights holder.
Google manages all of this with a registry of rights.

Now, replace the word "Google" above with "Kinko's."

Next, replace the word "Google" above with "A library."

TILT! If Google is allowed to do this, shouldn't anyone be allowed to do it? Is Jeff Bezos kicking himself right now for playing by the rules? Did Google win by going ahead and doing what no one else dared to do? Can they, like Microsoft, flout the law because they can buy their way out of any legal pickle?


Ping! Next thought: we already have vendors of e-books who provide this service for libraries. They serve up digital, encoded versions of the books, not scans of pages. These digital books often have some very useful features, such as allowing the user to make notes, copy quotes of a certain length, create bookmarks, etc. The current Google Books offering is very feature-poor. Also, because it is based on scans, there is no reflowing of the text to fit the screen. The OCR is too poor to be useful to the sight-impaired. And if they sell books, what will the format be?


TILT! Will it even be legal for a publicly-funded library to provide Google books if they aren't ADA compliant?


Ping! This one I have to quote:

"Public libraries are eligible to receive one free Public Access Service license for a computer located on-site at each of their library buildings in the United States. Public libraries will also be able to purchase a subscription which would allow them to offer access on additional terminals within the library building and would eliminate the requirement of a per page printing fee. Higher education institutions will also be eligible to receive free Public Access Service licenses for on-site computers, the exact number of which will depend on the number of students enrolled."


TILT! Were any public libraries asked about this? Does anyone have an idea of what it will cost them to 1) manage this limited access and pay-per-page printing, and 2) obtain more licenses when demand rises? Remember when public libraries only had one machine hooked up to the Internet? Is this the free taste that leads to the Google Books habit?


Ping! The e-book vendors only provide books where they have an agreement with the publishers, thus no orphan works are included. So, will Google's niche mainly consist of providing access to orphan works? Or will the current e-book vendors be forced out of the market because Google's total base is larger, even though the product may be inferior?


Ping! We already have a licensor of rights, the Copyright Clearance Center, and it was founded with the support of the very folks (the AAP) who have now agreed to create another organization, funded initially by Google and responding only to the licensing of Google-held content.


TILT! Google books gets its own licensing service, its own storefront... can anyone compete with that? And what happens to anything that Google doesn't have?


Ping! It looks like Google will collect fees on all books that are not in the public domain. This means that users will pay to view orphan works, even though a vast number of them are actually in the public domain. Unclaimed fees will go to pay for the licensing service. Thus, users will be paying for the service itself, and will be paying to view books they should be able to access freely and for free.


Ping! We have a copyright office run by the US government. I'm beginning to wonder what the Copyright Office does, however, since we now have two non-profit organizations in the business of managing rights, plus others getting into the game, such as OCLC with its rights assessment registry, and folks like Creative Commons. Shouldn't the Copyright Office be the go-to place to find out who owns the rights to a work? Shouldn't we be scanning the documents held by the Copyright Office that tell us who has rights? (Note: the famed renewal database is actually a scan of the INDEX to the copyright renewal documents, not the full information about renewal.) Even if we had access to every copyright registration document in the Copyright Office, would we know who owns various rights? I think not. And how much of this will change with the Google opt-in system? I get the feeling that we'll maybe resolve some small percentage of rights questions, somewhere on the order of 2-5%. And it will, in the end, all be paid for by readers, or by libraries on behalf of readers.


TILT! Rights holders can opt-out of the Google Books database. If (when) Google has the monopoly on books online, opt-out will be a nifty form of censorship. Actually, censorship aimed directly at Google will be a nifty form of censorship.


GAME OVER. All your book belong to us.

Friday, July 20, 2007

Copies, duplicates, identification

In at least three projects I'm working on now I am seeing problems with the conflict between managing copies (which libraries do) and managing content (which users want). Even before we go chasing after the FRBR concept of the work, we are already dealing with what FRBR-izers would call "different items of the same manifestation." Given that the items we tend to hold were mass produced, and thus there are many copies of them, it seems odd that we have never found a way to identify the published set that those items belong to.

"Ah," you say, "what about the ISBN?" The ISBN is a good manifestation identifier for things published after 1968 (not to mention some teddy bears and fancy chocolates), but it doesn't help us for anything earlier than that.

You probably aren't saying, "What about the BICI?", which was an admirable attempt to create a book identifier similar to the SICI (which covers serials, serial issues, and serial articles). The BICI never got beyond being a draft NISO standard, presumably because no one was interested in using it. The SICI is indeed a full NISO standard, but it seems to be falling out of use. Both of these were identifiers that could be derived either from the piece or from metadata, which is in itself not a bad idea. What was a less than good idea is that the BICI could only be derived for books that have ISBNs, but if you've got an ISBN you haven't a whole lot of use for a BICI, although it would allow you to identify individual chapters or sections of the book. But as a book identifier, it doesn't do much for us.

Now that we're moving into a time of digitization of books, I'm wondering if we can't at least find a way to identify the duplicate digital copies (of which there will be many as the various digitization projects go forward, madly grabbing books off of shelves and rushing them to scanners). Early books were identified using incipits, usually a few characters of beginning and ending text. Today's identifier would have to be more clever, but surely with the ability to run a computation on the digitized book there would be some way to derive an identifier that is accurate enough for the kind of operation where lives aren't usually at stake. There would be the need to connect the derived book identifier to the physical copies of the book, but I'm confident we can do that, even if over a bit of time.
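As a sketch of what such a derived identifier might look like (my own illustration, with an arbitrary choice of how much text to sample), one could normalize the OCR text and hash its opening and closing words, a rough digital analogue of the incipit:

import hashlib
import re

def text_fingerprint(ocr_text, n_words=200):
    # Normalize away what OCR and formatting are most likely to vary on
    # (case, punctuation, line breaks), then hash the opening and closing
    # words of the text.
    words = re.findall(r"[a-z0-9]+", ocr_text.lower())
    sample = " ".join(words[:n_words] + words[-n_words:])
    return hashlib.sha256(sample.encode()).hexdigest()

An exact hash like this would still differ between two scans whose OCR differs even slightly, so a real system would want a fuzzier comparison (for example, similarity over overlapping word shingles); the point is only that an identifier can be computed from the content itself rather than assigned arbitrarily.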

Both Google and the Internet Archive are assigning unique identifiers to digitized books, but we have to presume that these are internal copy-level identifiers, not manifestation-specific. The Archive seems to use some combination of the title and the author. Thus "Venice" by Mortimer Menpes is venicemenpes00menpiala while "Venice" by Beryl De Zoete is venicedeselincou00dezoiala and "Venice" by Daniel Pidgeon is venicepidgeon00pidgiala. The zeroes in there lead me to believe that if they received another copy it would get identified as "01." Google produces an impenetrable identifier for the Mortimer Menpes book: id=4XsKAAAAIAAJ, which may or may not be derivable from the book itself. I suspect not. And we know that Google will have duplicates so we also know that each item will be identified, not each manifestation.

Meanwhile, there is a rumor circulating that there is discussion taking place at Bowker, the ISBN agency, on the feasibility of assigning ISBNs to pre-1968 works, especially as they get digitized. I'm very interested in how (if?) we can attach such an identifier to the many copies of the books that already exist, and to their metadata. (This sounds like a job for WorldCat, doesn't it, since they have probably the biggest and most accurately de-duped database of manifestations.)

I know nothing more about it than that, but will pass along any info if I get it. And I'd love to hear from anyone who does know more.

Tuesday, October 24, 2006

Google Book Search is NOT a Library Backup

I have seen various quotes from library managers that the Google Book Search program, which is digitizing books from about a dozen large research libraries, now provides a backup to the library itself. This is simply not the case. Google is, or at least began as, a keyword search capability for books, not a preservation project. This means that "good enough" is good enough for users to discover a book by the keywords. A few key facts about GBS:

1) it uses uncorrected OCR. This means that there are many errors that remain in the extracted text. A glaring example is that all hyphenated words that break across a line are treated as separate words, e.g. re-create is in the text as "re" and "create". And the OCR has particular trouble with title pages and tables of contents:

Copyright, 18w,

B@ DODD, MEAD AND COMPANY,

411 r@h @umieS

@n(Wr@ft@ @rr@

5 OHN WILSON AND SON, CAMBRIDGE, U. S. A.

Here's the table of contents page:

(@t'

@ 1@ -r: @

@Je@ @3(

CONTENTS

CHAPTER PAGS

I. MATERIAL AND METHOD . . 7
II. TIME AND PLACE 20
III. MEDITATION AND IMAGINATION 34
IV. THE FIRST DELIGHT . . . 51
V. THE FEELING FOR LITERATURE 63
VI. THE BOOKS OF LIFE . . . 74
Vii. FROM THE BOOK TO THE READER 8@
VIII. BY WAY OF ILLUSTRATION . 95
IX. PERSONALITY 109
X. LIBERATION THROUGH IDEAS . 121
XI. THE LOGIC OF FREE LIFE. . 132
XII. THE IMAGINATION 143
XIII. BREADTH OF LIFE 154
XIV. RACIAL EXPRESSION . . . i65
XV. FRESHNESS OF FEELING. . . 174

2) it will not digitize all items from the libraries. Some will be considered too delicate for the scanning process, others will present problems because of size or layout. It isn't clear how they will deal with items that are off the shelf when that shelf is being digitized.

3) quality control is generally low. I have heard that some of the libraries are trying to work with Google on this, but the effort by the library to QC each digitized book would be extremely costly. People have reported blurred or missing pages, but my favorite is:

"Venice in Sweden"
Search isbn:030681286X (Stones of Venice, by Ruskin)
Click on the link and you see a page of Stones of Venice. Click on the Table of Contents and you're at page two or so of a guidebook on Sweden. Click forward and backward and move seamlessly from Venice to Sweden and back again. Two! Two! Two books in one! (I reported this to G months ago.)

4) the downloaded books aren't always identical to the book available online (which in turn may be different from the actual physical book due to scanning abnormalities). Look at this version of "Old Friends" both online and after downloading, and you'll see that most of the plates are missing from the downloaded version. Not necessarily a back-up problem, but it doesn't instill confidence that copies made from their originals will be complete.

Note that these examples may not affect the usefulness of the search function provided by Google, but they do affect the assumption that these books back up the library.

Thursday, September 07, 2006

Google Books and Federal Documents

The Google Books blog today announces with some fanfare that Diane Publishing, a publishing house that specializes in (re)publishing Federal documents, is making all of its documents available for full viewing. The publisher states:
The free flow of government information to a democratic society is utmost in our mind.
So I did a publisher search on Google and found a publication called "Marijuana Use in America" -- which is a reprint of a 104th Congress hearing, and on each page there is a watermark that says:
Copyrighted material

Now you all know that this is wrong, because Federal documents are in the public domain, but nowhere does the Google blog or the publisher mention the "PD" word. This troubles me because it will now require effort to undo this misinformation.

And, of course, just to add salt to the wound, I was easily able to find a book that the Diane Publishing company sells for $30 that you can get from GPO for $2.95. This really hurts.

Friday, August 25, 2006

Do it yourself digital books

This user got tired of waiting for his book to appear on the Internet Archive so he just did it! Perhaps the Million Book Project just needs one million users like this:

Reviewer: papeters - 4 out of 5 stars - April 10, 2004
Subject: Good copy for PG

Tired of waiting for corrections, I got another copy of the book and made good scans. The book is available through PG at:

http://www.gutenberg.net/1/1/9/2/11926/11926-h/11926-h.htm (html)
or
http://www.gutenberg.net/1/1/9/2/11926/11926.txt (plain-text)

See this at http://www.archive.org/details/WashingtonInDomesticLife