Coyle's InFormation: OpenLibrary


Monday, April 27, 2020

Ceci n'est pas une Bibliothèque

On March 24, 2020, the Internet Archive announced that it would "suspend waitlists for the 1.4 million (and growing) books in our lending library," a service they then named The National Emergency Library. These books were previously available for lending on a one-to-one basis with the physical book owned by the Archive, and as with physical books users would have to wait for the book to be returned before they could borrow it. Worded as a suspension of waitlists due to the closure of schools and libraries caused by the COVID-19 pandemic, this announcement essentially eliminated the one-to-one nature of the Archive's Controlled Digital Lending program. Publishers were already making threatening noises about the digital lending when it adhered to lending limitations, and surely will be even more incensed about this unrestricted lending.

I am not going to comment on the legality of the Internet Archive's lending practices. Legal minds, perhaps motivated by future lawsuits, will weigh in on that. I do, however, have much to say on the use of the term "library" for this set of books. It's a topic worthy of a lengthy treatment, but I'll give only a brief account here.

LIBRARY … BIBLIOTHÈQUE … BIBLIOTEK

The roots “LIBR…” and “BIBLIO…” both come down to us from ancient words for trees and tree bark. It is presumed that said bark was the surface for early writings. “LIBR…”, from the Latin word liber meaning “book,” in many languages is a prefix that indicates a bookseller’s shop, while in English it has come to mean a collection of books and from that also the room or building where books are kept. “BIBLIO…” derives instead from the Greek biblion (one book) and biblia (books, plural). We get the word Bible through the Greek root, which leaked into old Latin and meant The Book.

Therefore it is no wonder that in the minds of many people, books = library. In fact, most libraries are large collections of books, but that does not mean that every large collection of books is a library. Amazon has a large number of books, but is not a library; it is a store where books are sold. Google has quite a few books in its "book search" and even allows you to view portions of the books without payment, but it is also not a library, it's a search engine. The Internet Archive, Amazon, and Google all have catalogs of metadata for the books they are offering, some of it taken from actual library catalogs, but a catalog does not make a quantity of books into a library. After all, Home Depot has a catalog, Walmart has a catalog; in essence, any business with an inventory has a catalog.

"...most libraries are large collections of books, but that does not mean that every large collection of books is a library."

The Library Test

First, I want to note that the Internet Archive has met the State of California test to be defined as a library, and this has made it possible for the Archive to apply for library-related grants for some of its projects. That is a Good Thing because it has surely strengthened the Archive and its activities. However, it must be said that the State of California requirements are pretty minimal, and seem to be limited to a non-profit organization making materials available to the general public without discrimination. There doesn't seem to be a distinction between "library" and "archive" in the state legal code, although librarians and archivists would not generally consider them easily lumped together as equivalent services.

The Collection

The Archive's blog post says "the Internet Archive currently lends about as many as a US library that serves a population of about 30,000." As a comparison, I found in the statistics gathered by the California State Library those of the Benicia Public Library in Benicia, California. Benicia is a city with a population of 31,000; the library has about 88,000 books. Well, you might say, that's not as good as over one million books at the Internet Archive. But here's the thing: those are not 88,000 random books, they are books chosen to be, as far as the librarians could know, the best books for that small city. If Benicia residents were, for example, primarily Chinese-speaking, the library would surely have many books in Chinese. If the city had a large number of young families then the children's section would get particular attention. The users of the Internet Archive's books are a self-selected (and currently undefined) set of Internet users. Equally difficult to define is the collection that is available to them:

This library brings together all the books from Phillips Academy Andover and Marygrove College, and much of Trent University’s collections, along with over a million other books donated from other libraries to readers worldwide that are locked out of their libraries.

Each of these is (or was, in the case of Marygrove, which has closed) a collection tailored to the didactic needs of that institution. How one translates that, if one can, to the larger Internet population is unknown. That a collection has served a specific set of users does not mean that it can serve all users equally well. Then there is that other million books, which are a complete black box.

Library science

I've argued before against dumping a large and undistinguished set of books on a populace, regardless of the good intentions of those doing so. Why not give the library users of a small city these one million books? The main reason is the ability of the library to fulfill the 5 Laws of Library Science:

1. Books are for use.
2. Every reader his or her book.
3. Every book its reader.
4. Save the time of the reader.
5. The library is a growing organism. [0]

The online collection of the Internet Archive nicely fulfills laws 1 and 5: the digital books are designed for use, and the library can grow somewhat indefinitely. The other three laws are unfortunately hindered by the somewhat haphazard nature of the set of books, combined with the lack of user services.

Of the goals of librarianship, matching readers to books is the most difficult. Let's start with law 3, "every book its reader." When you follow the URL to the National Emergency Library, you see something like this:

The lack of cover art is not the problem here. Look at what books you find: two meeting reports, one journal publication, and a book about hand surgery, all from 1925. Scroll down for a bit and you will find it hard to locate items that are less obscure than this, although undoubtedly there are some good reads in this collection. These are not the books whose readers will likely be found in our hypothetical small city. These are books that even some higher education institutions would probably choose not to have in their collections. While these make the total number of available books large, they may not make the total number of useful books large. Winnowing this set to one or more (probably more) wheat-filled collections could greatly increase the usability of this set of books.

"While these make the total number of available books large, they may not make the total number of useful books large."

A large "anything goes" set of documents is a real challenge for laws 2 and 4: every reader his or her book, and save the time of the reader. The more chaff you have the harder it is for a library user to find the wheat they are seeking. The larger the collection the more of the burden is placed on the user to formulate a targeted search query and to have the background to know which items to skip over. The larger the retrieved set, the less likely that any user will scroll through the entire display to find the best book for their purposes. This is the case for any large library catalog, but these libraries have built their collection around a particular set of goals. Those goals matter. Goals are developed to address a number of factors, like:

What are the topics of interest to my readers and my institution?
How representative must my collection be in each topic area?
What are the essential works in each topic area?
What depth of coverage is needed for each topic? [1]

If we assume (and we absolutely must assume this) that the user entering the library is seeking information that he or she lacks, then we cannot expect users to approach the library as an expert in the topic being researched. Although anyone can type in a simple query, fewer can assess the validity and the scope of the results. A search on "California history" in the National Emergency Library yields some interesting-looking books, but are these the best books on the topic? Are any key titles missing? These are the questions that librarians answer when developing collections.

The creation of a well-rounded collection is a difficult task. There are actual measurements that can be run against library collections to determine if they have the coverage that can be expected compared to similar libraries. I don't know if any such statistical packages can look beyond quantitative measures to judge the quality of the collection; the ones I'm aware of look at call number ranges, not individual titles.

Library Service

The Archive's own documentation states that "The Internet Archive focuses on preservation and providing access to digital cultural artifacts. For assistance with research or appraisal, you are bound to find the information you seek elsewhere on the internet." After which it advises people to get help through their local public library. Helping users find materials suited to their need is a key service provided by libraries. When I began working in libraries in the dark ages of the 1960s, users generally entered the library and went directly to the reference desk to state the question that brought them to the institution. This changed when catalogs went online and were searchable by keyword, but before then the catalog in a public library was primarily a tool for librarians to use when helping patrons. Still, libraries have real or virtual reference desks because users are not expected to have the knowledge of libraries or of topics that would allow them to function entirely on their own. And while this is true for libraries it is also true, perhaps even more so, for archives whose collections can be difficult to navigate without specialized information. Admitting that you give no help to users seeking materials makes the use of the term "library" ... unfortunate.

What is to be done?

There are undoubtedly a lot of useful materials among the digital books at the Internet Archive. However, someone needing materials has no idea whether they can expect to find what they need in this amalgamation. The burden of determining whether the Archive's collection might suit their needs is left entirely up to the members of this very fuzzy set called "Internet users." That the collection lends at the rate of a public library serving a population of 30,000 shows that it is most likely under-utilized. Because the nature of the collection is unknown one can't approach, say, a teacher of middle-school biology and say: "they've got what you need." Yet the Archive cannot implement a policy to complete areas of the collection unless it knows what it has as compared to known needs.

"... these warehouses of potentially readable text will remain under-utilized until we can discover a way to make them useful in the ways that libraries have proved to be useful."

I wish I could say that a solution would be simple - but it would not. For example, it would be great to extract from this collection works that are commonly held in specific topic areas in small, medium and large libraries. The statistical packages that analyze library holdings all are, AFAIK, proprietary. (If anyone knows of an open source package that does this, please shout it out!) It would also be great to be able to connect library collections of analog books to their digital equivalents. That too is more complex than one would expect, and would have to be much simpler to be offered openly. [2]

While some organizations move forward with digitizing books and other hard copy materials, these warehouses of potentially readable text will remain under-utilized until we can discover a way to make them useful in the ways that libraries have proved to be useful. This will mean taking seriously what modern librarianship has developed over its circa 2 centuries, and in particular those 5 laws that give us a philosophy to guide our vision of service to the users of libraries.

-----

[0] Even if you are familiar with the 5 laws you may not know that Ranganathan was not as succinct as this short list may imply. The book in which he introduces these concepts is over 450 pages long, with extended definitions and many homey anecdotes and stories.

[1] A search on "collection development policy" will yield many pages of policies that you can peruse. To make this a "one click" here are a few *non-representative* policies that you can take a peek at:

Hennepin County (public)
Lansing Community College (community college)
Stanford University, Science Library (research library)

[2] Dan Scott and I did a project of this nature with a Bay Area public library and it took a huge amount of human intervention to determine whether the items matched were really "equivalent". That's a discussion for another time, but, man, books are more complicated than they appear.

Monday, November 28, 2016

All the Books

I just joined the Book of the Month Club. This is a throwback to my childhood, because my parents were members when I was young, and I still have some of the books they received through the club. I joined because my reading habits are narrowing, and I need someone to recommend books to me. And that brings me to "All the Books."

"All the Books" is a writing project I've had on my computer and in notes ever since Google announced that it was digitizing all the books in the world. (It did not do this.) The project was lauded in an article by Kevin Kelly in the New York Times Magazine of May 14, 2006, which he prefaced with:

"What will happen to books? Reader, take heart! Publisher, be very, very afraid. Internet search engines will set them free. A manifesto."

There are a number of things to say about All the Books. First, one would need to define "All" and "Books". (We can probably take "the" as it is.) The Google scanning projects defined this as "all the bound volumes on the shelves of certain libraries, unless they had physical problems that prevented scanning." This of course defines neither "All" nor "Books".

Next, one would need to gather the use cases for this digital corpus. Through the HathiTrust project we know that a small number of scholars are using the digital files for research into language usage over time. Others are using the files to search for specific words or names, discovering new sources of information about possibly obscure topics. As far as I can tell, no one is using these files to read books. The Open Library, on the other hand, is lending digitized books as ebooks for reading. This brings us to the statement that was made by a Questia sales person many years ago, when there were no ebooks and screens were those flickery CRTs: "Our books are for research, not reading." Given that their audience was undergraduate students trying to finish a paper by 9:30 a.m. the next morning, this was an actual use case with actual users. But the fact that one does research in texts one does not read is, of course, not ideal from a knowledge acquisition point of view.

My biggest beef with "All the Books" is that it treats them as an undifferentiated mass, as if all the books are equal. I always come back to the fact that if you read one book every week for 60 years (which is a good pace) you will have read 3,120. Up that to two books a week and you've covered 6,240 of the estimated 200-300 million books represented in WorldCat. The problem isn't that we don't have enough books to read; the problem is finding the 3-6,000 books that will give us the knowledge we need to face life, and be a source of pleasure while we do so. "All the Books" ignores the heights of knowledge, of culture, and of art that can be found in some of the books. Like Sarah Palin's response to the question "Which newspapers form your world view?", "all of them" is inherently an anti-intellectual answer, either by someone who doesn't read any of them, or who isn't able to distinguish the differences.
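The arithmetic behind those figures is easy to check. The following is only a sketch: the function names are made up for illustration, and the 200-300 million corpus size is the WorldCat estimate quoted above, not a measured value.

```python
WEEKS_PER_YEAR = 52

def books_read(per_week: int, years: int = 60) -> int:
    """Total books read at a steady weekly pace over a reading lifetime."""
    return per_week * WEEKS_PER_YEAR * years

def share_of_corpus(read: int, corpus: int) -> float:
    """Fraction of the corpus that a reader gets through."""
    return read / corpus

one_a_week = books_read(1)   # 3120
two_a_week = books_read(2)   # 6240

# Compare the faster pace against both ends of the estimated corpus size.
for corpus in (200_000_000, 300_000_000):
    share = share_of_corpus(two_a_week, corpus)
    print(f"{two_a_week:,} of {corpus:,} books is {share:.4%}")
```

Even at two books a week, the share read is a few thousandths of one percent, which is why selection matters more than volume.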

"All the Books" is a complex concept. It includes religious identity; the effect of printing on book dissemination; the loss of Latin as a universal language for scholars; the rise of non-textual media. I hope to hunker down and write this piece, but meanwhile, this is a taste.

Monday, April 26, 2010

Social aspects of subject headings

You've probably played the "my favorite subject heading" game when geeking out with librarian friends. Here's some additional fuel in case you've run out of zingers.

The Open Library takes the LC subject headings and breaks them apart at the subfield level into subjects, persons, places, genres, and times. It also includes some BISAC headings retrieved from Amazon, so the subject list is not "pure." The separate subject entries obtained are similar to, but not the same as, OCLC's FAST headings, and look much like some facets that appear in library catalogs.
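As a rough illustration of splitting a heading at the subfield level, here is a minimal sketch. It is not the Open Library's actual code: the function name, the facet names, and the mapping tables are assumptions, based only on the usual MARC 21 conventions (600 for personal names, 650 for topics, 651 for places, with $x, $y, $z and $v as general, chronological, geographic and form subdivisions).

```python
from collections import defaultdict

# Facet fed by the $a of each subject field (assumed mapping).
FIELD_FACET = {"600": "persons", "650": "subjects", "651": "places"}
# Facet fed by each subdivision subfield, whatever the field (assumed mapping).
SUBDIVISION_FACET = {"x": "subjects", "y": "times", "z": "places", "v": "genres"}

def split_heading(tag, subfields):
    """Break one subject heading into a dict of facet -> list of terms."""
    facets = defaultdict(list)
    for code, value in subfields:
        if code == "a":
            facet = FIELD_FACET.get(tag)
        else:
            facet = SUBDIVISION_FACET.get(code)
        if facet:  # subfields with no assumed facet are skipped
            facets[facet].append(value.strip())
    return dict(facets)

# A made-up heading, used only to show the split.
heading = ("650", [("a", "World War, 1939-1945"), ("z", "France"),
                   ("x", "Social aspects"), ("v", "Fiction")])
print(split_heading(*heading))
```

Each term then stands on its own as a facet entry, which is what makes the result resemble FAST headings rather than full pre-coordinated LCSH strings.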

The Open Library database currently holds about 24 million records for books (at least partially de-duped). In a recent dump of subjects, the total number of different subjects came out as 1,278,539. Of those, 336,638 were of the "topical" variety, that is either a 650 $a or a 65X $x. The top 25 are as follows:

825168 History
322928 Biography
212822 Politics and government
206519 Congresses
192968 History and criticism
184183 Fiction
123838 Law and legislation
119333 Bibliography
95555 Juvenile literature
93364 Description and travel
90866 Economic conditions
84787 Criticism and interpretation
74878 Claims
71468 Social life and customs
70926 Social conditions
70563 Catalogs
69205 Private Bills
69191 Private bills
66480 Education
63410 Exhibitions
63301 World War, 1939-1945
60235 Foreign relations
60068 Philosophy
56219 Dictionaries
55460 Study and teaching

I find it interesting that with the exception of "World War, 1939-1945" these appear to have the function of qualifiers, and I'm thinking that it would be interesting to contrast the $a and $x terms. My guess is that these are $x, but that not all $x are of this nature.

Of the subfields, 164,342 appear only once in the database. These are a great source of interesting and unusual headings, including "Social aspects of adzes" and "Deer as pets." In fact, the "Social aspects...." tail is so amusing that I have made a file of those with a count of 1.

The full file of topical subjects is 8 megabytes, but can probably yield innumerable hours of library cocktail hour amusement. (text in format "count - tab - subject") I will also look into names, organizations, places and times as subjects.
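For anyone who wants to poke at a file in that "count - tab - subject" format, something like the following would pull out the singletons. This is a sketch under assumptions: the function names are invented, and the sample lines are a tiny in-memory stand-in built from figures quoted in this post, not the real file.

```python
import io

def read_subject_counts(lines):
    """Yield (count, subject) pairs from 'count<TAB>subject' lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        count, subject = line.split("\t", 1)
        yield int(count), subject

def singletons(pairs, containing=""):
    """Subjects that occur exactly once, optionally filtered by a phrase."""
    needle = containing.lower()
    return [s for c, s in pairs if c == 1 and needle in s.lower()]

# Stand-in for the real file, using counts and headings mentioned above.
sample = io.StringIO(
    "825168\tHistory\n"
    "63301\tWorld War, 1939-1945\n"
    "1\tSocial aspects of adzes\n"
    "1\tDeer as pets\n"
)
pairs = list(read_subject_counts(sample))
print(singletons(pairs))
print(singletons(pairs, containing="social aspects"))
```

With the actual download, an open file object (for example `open(path, encoding="utf-8")`, where `path` is wherever you saved it) can be passed in place of the `StringIO` sample.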

Monday, June 23, 2008

The "Mao" problem

I've been assisting the Internet Archive on its Open Library project, my role being primarily to help them understand library data. It's fascinating watching non-librarians encounter library data -- so much that we take for granted isn't obvious at all to others. I'm thinking that it's time for a "Library Data for Dummies." I am seriously considering setting it up as a wiki so we can all contribute to it.

Most recently on the OL project we ran into what I like to call the "Mao" problem. It begins like this: the database uses bibliographic records from libraries and from Amazon. The Amazon data presents author names in natural order ("John Smith"), while the library records use the inverted order with the family name first ("Smith, John"). It's best for users of the service to see the names presented uniformly (the mixture is quite jarring). If you think about it for a moment, you realize that converting the natural order names to inverted order will be problematic, since there is nothing to tell you where the family name begins ("Oscar de la Renta"). So the solution is to un-invert the inverted names, something that is purely mechanical.

Until you encounter Mao, Zedong -- and the thousands of other authors for whom "natural order" is family name followed by a given name. I find that Mao is the example that hits the "Aha!" button for most people. Obviously, presenting the name as "Zedong Mao" pretty much makes it unrecognizable. So what to do?

Well, I suppose it helps to NOT think like a librarian. Edward Betts, the coder on this project, came up with an ingenious idea: he compared the names in the Open Library records with names on Amazon and on Wikipedia, and has made a list of names that generally appear in family name first order with a link to the source where it was found. For famous authors or historical figures, Wikipedia contains many of the names and is good about presenting various name forms. It gives the traditional and simplified Chinese forms, and sometimes both Wade-Giles and Pinyin transliterations. It also often has the note:

This is a Chinese name; the family name is Chen.

Naturally, an automated solution of this kind will produce some false hits, but that's why the Open Library is designed as a wiki -- so errors can be corrected. I'm beginning to think, though, that a link from author names to Wikipedia is not a bad idea in itself. The articles are often quite comprehensive and definitely are more useful than a link to a name authority record. I'd also favor a link to OCLC's Worldcat Identities pages, which are quite rich and link well to library data, since that's what they are based on. Presumably one could launch a search to either from a name heading. Has anyone tried this yet?
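The mechanical un-inversion, and the exception list that rescues names like Mao's, can be sketched in a few lines. This is an illustration only, not Edward's code: the function names are invented, `FAMILY_FIRST` is a hypothetical stand-in for the list compiled from Amazon and Wikipedia, and dates or titles that follow the name in real headings are ignored.

```python
def uninvert(name):
    """Turn 'Family, Given' into 'Given Family' (purely mechanical)."""
    family, sep, given = name.partition(",")
    if not sep:  # no comma: nothing to un-invert
        return name.strip()
    return f"{given.strip()} {family.strip()}"

def display_name(name, family_first):
    """Un-invert, unless the heading is on the family-name-first list."""
    if name in family_first:
        family, _, given = name.partition(",")
        return f"{family.strip()} {given.strip()}"  # drop the comma, keep the order
    return uninvert(name)

# Hypothetical stand-in for the list built from outside sources.
FAMILY_FIRST = {"Mao, Zedong"}

print(uninvert("Smith, John"))                     # John Smith
print(uninvert("Mao, Zedong"))                     # Zedong Mao (the problem)
print(display_name("Mao, Zedong", FAMILY_FIRST))   # Mao Zedong
```

The hard part is not this code but populating the list, which is exactly where the outside sources and the wiki-style corrections come in.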

Sunday, October 28, 2007

Bibliographic ER

No, I'm not sending libraries to the emergency room, although there are days when I feel like we're at that point. The ER in the title refers to Entity-Relationship, a way to look at data that emphasizes the general viewpoint that there are things, and those things exist in relation to each other.

In one sense, this is what we have done for over a century with our library data. The bibliographic records that we create have in them many relationships: Person authored Book; Publishing House published Book; Book is in Series; Book has Topics. Those relationships are implicit in our records, but the data isn't formatted in an entity-relationship model. Our records, instead, talk about the relationships but don't make it easy to give the various entities their own existence. So we create a record that contains:

Author
Book title.
Place, publisher, date
Series
Subject A
Subject B

The record represents all of the information about the book, but there is no record that represents all of the information about the author, or all of the information about the publisher, etc. Instead, those "entities" are buried in bibliographic records scattered throughout the file.

An E-R model would give each of these entities an identity on which you could hang information about the entity.

OK, I can't draw worth beans. But basically the idea is that authors, subjects, publishers, topics, all become entries in their own right. This means that you can add information to the author record or the series record, because they have their own place in the design. It also makes it easy to look at your data from many different points of view, while still retaining all of the richness of the relationships. So from the point of view of the person who is the illustrator in the book above, the bibliographic world may look like this:

This type of model is expressed in FRBR, but the E-R aspect of FRBR does not seem to be incorporated into RDA as it stands today. Instead, RDA appears to be aimed at creating the same flat structure that we have in library data today.

If you take a look at the OpenLibrary you will see that books get a page that is about the book, and authors get a separate page that is about the author. This is very simple, but it is also very important. It means that the catalog is no longer just a list of books with authors but can become a rich source of information about authors. You can add bios for authors, link to web sites about the author, launch a discussion group about a favorite author. Because the author is an entity, not just a data element in a record about the book, it becomes a potentially active part of your information system.
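To make the difference from a flat record concrete, here is a minimal sketch of authors and books as separate entities joined by a relationship. It is not the OpenLibrary's data model; the class names, fields, and the example author and title are all hypothetical.

```python
from dataclasses import dataclass, field

# eq=False: entities are compared by identity, not by field values,
# which also avoids walking the circular author <-> book references.
@dataclass(eq=False)
class Author:
    name: str
    bio: str = ""
    links: list = field(default_factory=list)
    books: list = field(default_factory=list, repr=False)

@dataclass(eq=False)
class Book:
    title: str
    authors: list = field(default_factory=list, repr=False)

def add_authorship(author, book):
    """Record the 'Person authored Book' relationship in both directions."""
    author.books.append(book)
    book.authors.append(author)

# Hypothetical example data.
author = Author("John Smith")
book = Book("An Example Title")
add_authorship(author, book)

# Because the author is its own entity, information can hang on it directly.
author.bio = "A short biography could go here."
print([b.title for b in author.books])
print([a.name for a in book.authors])
```

In a flat record the bio would have nowhere to live except inside some book's record; here it belongs to the author, and every book by that author reaches it through the relationship.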

In the future, I hope that we can give life to many more entities in the OpenLibrary, and also that we can give them meaningful relationships between each other. This would mean taking a semantic web approach to library data. I don't have a clear picture of where we'll end up, but I'm glad that folks there are interested in experimenting. If you've already thought this through or have ideas in this direction, please step forward. I'd love to hear from you.
