
Saturday, November 23, 2019

The Work

The word "work" generally means something brought about by human effort, and at times implies that this effort involves some level of creativity. We talk about "works of art" referring to paintings hanging on walls. The "works" of Beethoven are a large number of musical pieces that we may have heard. The "works" of Shakespeare are plays, in printed form but also performed. In these statements the "work" encompasses the whole of the thing referred to, from the intellectual content to the final presentation.

This is not the same use of the term as is found in the Library Reference Model (LRM). If you are unfamiliar with the LRM, it is the successor to FRBR (which I am assuming you have heard of) and it includes the basic concepts of work, expression, manifestation and item that were first introduced in that previous study. "Work," as used in the LRM is a concept designed for use in library cataloging data. It is narrower than the common use of the term illustrated in the previous paragraph and is defined thus:
Class: Work
Definition: An abstract notion of an artistic or intellectual creation.
In this definition the term only includes the idea of a non-corporeal conceptual entity, not the totality that would be implied in the phrase "the works of Shakespeare." That totality is described when the work is realized through an LRM-defined "expression" which in turn is produced in an LRM-defined "manifestation" with an LRM-defined "item" as its instance.* These four entities are generally referred to as a group with the acronym WEMI.

Because many in the library world are very familiar with the LRM definition of work, we have to use caution when using the word outside the specific LRM environment. In particular, we must not impose the LRM definition on uses of the word that do not intend that meaning. The LRM definition of work is rarely found in any conversation that is not about the library cataloging model for which it was defined. However, it is harder to distinguish uses within the library world, where one might expect usage to adhere to the LRM.

To show this, I want to propose a particular use case. Let's say that a very large bibliographic database has many records of bibliographic description. The use case is that it is deemed to be easier for users to navigate that large database if they could get search results that cluster works rather than getting long lists of similar or nearly identical bibliographic items. Logically the cluster looks like this:


In data design, it will have a form something like this:
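(A sketch of one possible RDF form; the ex: names here are hypothetical placeholders for whatever vocabulary the system would choose.)

ex:cluster123 a ex:Work ;
    rdfs:label "Moby Dick" ;
    ex:hasMember <http://catalog.example.org/bib/111> ,
        <http://catalog.example.org/bib/222> ,
        <http://catalog.example.org/bib/333> .

Each member here is a full bibliographic record in the database.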


This is a great idea, and it does appear to have a similarity to the LRM definition of work: it is gathering those bibliographic entries that are judged to represent the same intellectual content. However, there are reasons why the LRM-defined work could not be used in this instance.

The first is that there is only one WEMI relationship for work, and that is from LRM work to LRM expression. Clearly the bibliographic records in this large library catalog are not LRM expressions; they are full bibliographic descriptions including, potentially, all of the entities defined in the LRM.

To this you might say: but there is expression data in the bibliographic record, so we can think of this work as linking to the expression data in that record. That leads us to the second reason: the entities of WEMI are defined as being disjoint. That means that no single "thing" can be more than one of those entities; nothing can be simultaneously a work and an expression, or any other combination of WEMI entities. So if the only link available in the model is from work to expression, then unless we can somehow convince ourselves that the bibliographic record represents ONLY the expression (which it clearly does not, since it has data elements from at least three of the LRM entities), any such link will violate the rule of disjointness.
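In OWL terms the problem looks something like this (a sketch; lrm:isRealizedThrough is an illustrative name, not the official LRM property IRI):

lrm:Work owl:disjointWith lrm:Expression .

lrm:isRealizedThrough rdfs:domain lrm:Work ;
    rdfs:range lrm:Expression .

# Using lrm:isRealizedThrough to link a work to a bibliographic record
# entails that the record is an lrm:Expression; typing that same record
# as any other WEMI entity then contradicts the disjointness axiom.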

Therefore, the work in our library system can have much in common with the conceptual definition of the LRM work, but it is not the same work entity as is defined in that model.

This brings me back to my earlier blog post with a proposal for a generalized definition of WEMI-like entities for created works.  The WEMI concepts are useful in practice, but the LRM model has some constraints that prevent some desirable uses of those entities. Providing unconstrained entities would expand the utility of the WEMI concepts both within the library community, as evidenced by the use case here, and in the non-library communities that I highlight in that previous blog post and in a slide presentation.

To be clear, "unconstrained" refers not only to the removal of the disjointness between entities, but also to allowing links between the WEMI entities and non-WEMI entities, something that is not anticipated in the LRM. The work cluster of bibliographic records would need a general relationship: perhaps, as in the case of VIAF, the records would be linked through a shared cluster identifier, with an entity type identifying the cluster as representing an unconstrained work.

----
* The other terms are defined in the LRM as:

Class: Expression
Definition: A realization of a single work usually in a physical form.

Class: Manifestation
Definition: The physical embodiment of one or more expressions.

Class: Item
Definition: An exemplar of a single manifestation.

Monday, January 28, 2019

FRBR without FR or BR

(This is something I started working on that turns out to be a "pulled thread" - something that keeps on unwinding the more I work on it. What's below is a summary, while I decide what to do with the longer piece.)

FRBR was developed for the specific purpose of modeling library catalog data. I give the backstory on FRBR in chapter 5 of my book, "FRBR Before and After." The most innovative aspect of FRBR was the development of a multi-entity view of creative works. Referred to as "group 1" of three groups of entities, the entities described there are Work, Expression, Manifestation, and Item (WEMI). They are aligned with specific bibliographic elements used in library catalogs, and are defined with a rigid structure: the entities are linked to each other in a single chain; each data element is defined as valid for one and only one entity; and all WEMI entities are disjoint.

In spite of these specifics, something in that group 1 has struck a chord for metadata designers who do not adhere to the library catalog model as described in FRBR. In fact, some mentions or uses of WEMI are not even bibliographic in nature.* This leads me to conclude that a version of WEMI that is not tied to library catalog concepts could provide an interesting core of classes for metadata that describes creative or created resources.

We already have some efforts that have stepped away from the specifics of FRBR. From 2005 there is the first RDF FRBR ontology, frbrCore, which defines the entities of FRBR and key relationships between them as RDF classes. This ontology breaks away from FRBR in that it creates super-classes that are not defined in FRBR, but it retains the disjointness between the primary entities. We also have FRBRoo, a FRBR-ized version of the CIDOC museum metadata model; it extends the number of classes to include some that represent processes that are not in the static model of the library catalog. In addition we have FaBiO, a bibliographic ontology that uses frbrCore classes but extends the WEMI-based classes with dozens of sub-classes that represent types of works and expressions.

I conclude that there is something in the ability to describe the abstraction of work apart from the concrete item that is useful in many areas. The intermediate entities, defined in FRBR as expression and manifestation, may have a role depending on the material and the application for which the metadata is being developed. Other intermediate entities may be useful at times. But as a way to get started, we can define four entities (which are "classes" in RDF) that parallel the four group 1 entities in FRBR. I would like to give these entities new names to distance them from FRBR, but that may not be possible, as people have already absorbed the FRBR terminology.


FRBR          / option 1  / option 2
work          / idea      / creative work
expression    / creation  / realization
manifestation / object    / product
item          / instance  / individual

My preferred rules for these classes are:
  • any entity can be iterative (e.g. a work of a work)
  • any entity can have relationships/links to any other entity
  • no entity has an inherent dependency on any other entity
  • any entity can be used alone or in concert with other entities
  • no entities are disjoint
  • anyone can define additional entities or subclasses   
  • individual profiles using the model may recommend or limit attributes and relationships, but the model itself will not have restrictions
This implements a theory of ontology development known as "minimum semantic commitment." In this theory, base vocabulary terms are defined with as little semantics as possible, where "semantics" means the axiomatic semantics of RDF. An ontology whose terms have highly specified semantics, such as the original FRBR, provides fewer opportunities for re-use, because uses must adhere to the tightly defined semantics of the original ontology. Less commitment in the base ontology means greater opportunities for re-use; desired semantics can be defined in specific implementations through the creation of application profiles.

Given this freedom, how would people choose to describe creative works? For example, here's one possible way to describe a work of art:

work:
    title: Acrobats
    creator: Paul Klee
    genre: abstract art
    topic: acrobats
    date: 1914
item:
    size: 9 x 9
    base material: paper
    material: watercolor, pastel, ink
    color: mixed
    signed: PKlee
    dated: 1914
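(As RDF, this might come out roughly as follows; the ex: class and property names are hypothetical stand-ins for whatever vocabulary a profile would choose.)

<http://example.com/klee-acrobats> a ex:Work ;
    ex:title "Acrobats" ;
    ex:creator "Paul Klee" ;
    ex:genre "abstract art" ;
    ex:topic "acrobats" ;
    ex:date "1914" .

<http://example.com/klee-acrobats#original> a ex:Item ;
    ex:itemOf <http://example.com/klee-acrobats> ;
    ex:size "9 x 9" ;
    ex:baseMaterial "paper" ;
    ex:material "watercolor, pastel, ink" ;
    ex:color "mixed" ;
    ex:signed "PKlee" ;
    ex:dated "1914" .

Note that under the rules above, ex:itemOf is simply one possible relationship; nothing in the model requires the item to link through a manifestation and an expression first.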
   
And here's a way to describe a museum store's inventory record for a print:

work:
    title: Acrobats
    creator: Paul Klee
    genre: abstract art
    topic: acrobats
    date: 1914
manifestation:
    description: 12-color archival inkjet print
    size: 24 x 36 inches
    price: $16.99
   
There is also no reason why a non-creative product couldn't use the manifestation class (which is one of the reasons I would prefer to call it "product," a name that would resonate better with these potential users):

manifestation/product:
    description: dining chair
    dimensions: 26 x 23 x 21.5 inches
    weight:  21 pounds
    color: gray
    manufacturer: YEEFY
    price: $49.99
   
Here is the sum total of what this core WEMI would look like, still using the FRBR terminology:

<http://example.com/Work> rdf:type owl:Class ;
    rdfs:label "Work"@en ;
    rdfs:comment "The creative work as abstraction."@en .

<http://example.com/Expression> rdf:type owl:Class ;
    rdfs:label "Expression"@en ;
    rdfs:comment "The creative work as it is expressed in a potentially perceivable form."@en .

<http://example.com/Manifestation> rdf:type owl:Class ;
    rdfs:label "Manifestation"@en ;
    rdfs:comment "The physical product that contains the creative work."@en .

<http://example.com/Item> rdf:type owl:Class ;
    rdfs:label "Item"@en ;
    rdfs:comment "An instance or individual copy of the creative work."@en .
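Because these classes carry no disjointness axioms, instance data like the following is perfectly consistent (a sketch; the record identifier is made up). It types a single bibliographic record as both work and manifestation, which the LRM forbids:

<http://catalog.example.org/bib/111>
    rdf:type <http://example.com/Work> ,
        <http://example.com/Manifestation> .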

I can see communities like Dublin Core and schema.org as potential locations for these proposed classes because they represent general metadata communities, not just the GLAM world of IFLA. (I haven't approached them.) I'm open to hearing other ideas for hosting this, as well as comments on the ideas here. For it? Against it? Is there a downside?


* Examples of some "odd" references to FRBR for use in metadata for:

Tuesday, November 27, 2018

It's "academic"

We all know that writing and publishing is of great concern to those whose work is in academia; the "publish or perish" burden haunts pre-tenure educators and grant-seeking researchers. Revelations that data had been falsified in published experimental results bring great condemnation from publishers and colleagues, and yet I have a feeling that underneath it all is more than an ounce of empathy from those who are fully aware of the forces that would lead one to put one's thumb on the scales for the purposes of winning the academic jousting match. It is only a slight exaggeration to compare these souls to the storied gladiators whose defeat meant summary execution. From all evidence, that is how many of them experience the contest to win the ivory tower - you climb until you fall.

Research libraries and others deal in great part with the output of the academe. In many ways their practices reinforce the value judgments made on academic writing, such as having blanket orders for all works published by a list of academic presses. In spite of this, libraries have avoided making an overt statement of what is and what is not "academic." The "deciders" of academic writing are the publishers - primarily the publishers of peer-reviewed journals that decide what information does and does not become part of the record of academic achievement, but also those presses that issue scholarly monographs. Libraries are the consumers of these decisions but stop short of tagging works as "academic" or "scholarly."

The pressure on academics has only increased in recent years, primarily because of the development of "impact factors." In 1955, Eugene Garfield introduced the idea that one could create a map of scientific publishing using an index of the writings cited by other works. (Science, 1955; 122:108–11) Garfield was interested in improving science by linking works so that one could easily find supporting documents. However, over the years the purpose of citation has evolved from a convenient link to precedents into a measure of the worth of scholars themselves, in the form of the "h-index" - the measure of how often a person (not a work) has been cited. The h-index is the "lifetime home runs" statistic of the academic world. One is valued for how many times one is cited, making citations the coin of the realm, not sales of works or even readership. No one in academia could or should be measured on the same scale as a non-academic writer when it comes to print runs, reviews, or movie deals. Imagine comparing the sales figures of "Poetic Autonomy in Ancient Rome" with "The Da Vinci Code". So it matters in academia to carve out a world that is academic, and that isolates academic works such that one can do things like calculate an h-index value.

This interest in all things academic has led to a number of metadata oddities that make me uncomfortable, however. There are metadata schemas that have an academic bent that translates to a need to assert the "scholarliness" of works being given a bibliographic description. There is also an emphasis on science in these bibliographic metadata, with less acknowledgement of the publishing patterns of the humanities. My problem isn't solely with the fact that they are doing this, but in particular with how they go about it.

As an example, the metadata schema BIBO clearly has an emphasis on articles as scholarly writing; notably, it has a publication type "academic article" but does not have a publication type for "academic book." This reflects the bias that new scientific discoveries are published as journal articles, and many scientists do not write book-length works at all. This slights the work of historians like Ann M. Blair, whose book, Too Much to Know, has what I estimate to be about 1,450 "primary sources," ranging from manuscripts in Latin and German from the 1500's to modern works in a number of languages. It doesn't get much more academic than that.

BIBO also has different metadata terms for "journal" and "magazine":
  • bibo:journal "A periodical of scholarly journal Articles."
  • bibo:magazine "A periodical of magazine Articles. A magazine is a publication that is issued periodically, usually bound in a paper cover, and typically contains essays, stories, poems, etc., by many writers, and often photographs and drawings, frequently specializing in a particular subject or area, as hobbies, news, or sports."
Something in that last bit on magazines smacks of "leisure time" while the journal clearly represents "serious work."  It's also interesting that the description of magazine is quite long, describes the physical aspects ("usually bound in a paper cover"), and gives a good idea of the potential content. "Journal" is simply "scholarly journal articles." Aside from the circularity of the definitions (journal has journal articles, magazines have magazine articles), what this says is simply that a journal is a "not magazine."

Apart from the snobbishness of the difference between these terms is the fact that one seeks in vain for a bright line between the two. There is, of course, the "I know it when I see it" test, and there is definitely some academic writing that you can pick out without hesitation. But is an opinion piece in the journal of a scientific society academic? How about a book review? How about a book review in the New York Review of Books (NYRB), where articles run to 2,000-5,000 words, are written by an academic in the field, and make use of the encyclopedic knowledge of the topic on the part of the reviewer? When Marcia Angell, professor at the Harvard Medical School and former Editor in Chief of The New England Journal of Medicine, writes for the NYRB, has she slipped her academic robes for something else? She seems to think so. On her professional web site she lists among her publications a (significantly long) letter to the editor (called a "comment" in academic journal-ese) of a science journal article about women in medicine, but she does not include in her publication list the articles she has written for NYRB, even though these probably make more use of her academic knowledge than the comment did. She is clearly making a decision about what is "academic" (i.e. career-related) and what is not. It seems that the dividing line is not the content of the writing but how her professional world esteems the publishing vehicle.

Not to single out BIBO, I should mention other "culprits" in the tagging of scholarly works, such as Wikidata, which has:
  • academic journal article (Q18918145) article published in an academic journal
  • academic writing (Q4119870) academic writing and publishing is conducted in several sets of forms and genres
  • scholarly article (Q13442814) article in an academic publication, usually peer reviewed
  • scholarly publication (Q591041) scientific publications that report original empirical and theoretical work in the natural sciences
There is so much wrong with each of these, from circular definitions to bias toward science as the only scholarly pursuit (scholarly publication is a "scientific publication" in the "natural sciences"). (I've already commented on this in Wikidata, sarcastically calling it a fine definition if you ignore the various directions that science and scholarship have taken since the mid-19th century.) What this reveals, however, is that the publication and the publisher define whether the work is "scholarly." If any article in an academic publication is a scholarly article, then the comment by Dr. Angell is, by definition, scholarly, and the NYRB articles are not. Academia is, in fact, a circularly-defined world.
Giving one more example, schema.org has this:
  • schema:ScholarlyArticle (sub-class of Article) A scholarly article.
Dig that definition! There are a few other types of article in schema.org, such as "newsArticle" and "techArticle," but it appears that all of those magazine articles would simply be "Article."

Note that in real life publications call themselves whatever they wish. With a hint at how terms may have changed over time: Ladies' Home Journal calls itself a journal, and the periodical published by the American Association for the Advancement of Science, Science, gives itself the domain sciencemag.org. "Science Magazine" just sounds right, doesn't it?

It's not wrong for folks to characterize some publications and some writing as "academic," but any metadata term needs a clear definition, which these do not have. What this means is that people using these schemas are being asked to make a determination with very little guidance that would help them separate the scholarly or academic from... well, from the rest of publishing output. With the inevitable variation in categorization, the separation between scholarly/academic and non-scholarly/academic writing in metadata coded with these schemas is not likely to be useful, because there will be little regularity of assignment among the communities using this metadata.

I admit that I picked on this particular metadata topic because I find the designation of "scholarly" or "academic" to be judgemental. If nothing else, when people judge they need some criteria for that judgement. What I would like to see is a clear definition that would help people decide what is and what is not "academic," and what the use cases are for why this typing of materials should be done. As with most categorizations, we can expect some differences in the decisions that will be made by catalogers and indexers working with these metadata schemas. A definition at least gives you something to discuss and to argue for.  Right now we don't have that for scholarly/academic publications.

And I am glad that libraries don't try to make this distinction.


Tuesday, November 18, 2014

Classes in RDF

RDF allows one to define class relationships for things and concepts. The RDFS1.1 primer describes classes succinctly as:
Resources may be divided into groups called classes. The members of a class are known as instances of the class. Classes are themselves resources. They are often identified by IRIs and may be described using RDF properties. The rdf:type property may be used to state that a resource is an instance of a class.
This seems simple, but it is in fact one of the primary areas of confusion about RDF.

If you are not a programmer, you probably think of classes in terms of taxonomies -- genus, species, sub-species, etc. If you are a librarian you might think of classes in terms of classification, like Library of Congress or the Dewey Decimal System. In these, the class defines certain characteristics of the members of the class. Thus, with two classes, Pets and Veterinary science, you can have:
Pets
- dogs
- cats

Veterinary science
- dogs
- cats
In each of those, dogs and cats have different meaning because the class provides a context: either as pets, or information about them as treated in veterinary science.

For those familiar with XML, it has similar functionality because it makes use of nesting of data elements. In XML you can create something like this:
<drink>
    <lemonade>
        <price>$2.50</price>
        <amount>20</amount>
    </lemonade>
    <pop>
        <price>$1.50</price>
        <amount>10</amount>
    </pop>
</drink>
and it is clear which price goes with which type of drink, and that the bits directly under the <drink> level are all drinks, because that's what <drink> tells you.

Now you have to forget all of this in order to understand RDF, because RDF classes do not work like this at all. In RDF, the "classness" is not expressed hierarchically, with a class defining the elements that are subordinate to it. Instead it works in the opposite way: the descriptive elements in RDF (called "properties") are the ones that define the class of the thing being described. Properties carry the class information through a characteristic called the "domain" of the property. The domain of the property is a class, and when you use that property to describe something, you are saying that the "something" is an instance of that class. It's like building the taxonomy from the bottom up.

This only makes sense through examples. Here are a few:
1. "has child" is of domain "Parent".

If I say "X - has child - 'Fred'" then I have also said that X is a Parent because every thing that has a child is a Parent.

2. "has Worktitle" is of domain "Work"

If I say "Y - has Worktitle - 'Der Zauberberg'" then I have also said that Y is a Work because every thing that has a Worktitle is a Work.

In essence, X or Y is an identifier for something that is of unknown characteristics until it is described. What you say about X or Y is what defines it, and the classes put it in context. This may seem odd, but if you think of it in terms of descriptive metadata, your metadata describes the "thing in hand"; the "thing in hand" doesn't describe your metadata. 

Like in real life, any "thing" can have more than one context and therefore more than one class. X, the Parent, can also be an Employee (in the context of her work), a Driver (to the Department of Motor Vehicles), a Patient (to her doctor's office). The same identified entity can be an instance of any number of classes.
"has child" has domain "Parent"
"has licence" has domain "Driver"
"has doctor" has domain "Patient"

X - has child - "Fred"  = X is a Parent 
X - has license - "234566"  = X is a Driver
X - has doctor - URI:765876 = X is a Patient
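In Turtle, those property definitions and statements would look something like this (a sketch with made-up ex: names):

ex:hasChild rdfs:domain ex:Parent .
ex:hasLicense rdfs:domain ex:Driver .
ex:hasDoctor rdfs:domain ex:Patient .

ex:X ex:hasChild "Fred" ;
    ex:hasLicense "234566" ;
    ex:hasDoctor <http://example.org/doctor/765876> .

# From the domains, a reasoner infers:
# ex:X rdf:type ex:Parent , ex:Driver , ex:Patient .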
Classes are defined in your RDF vocabulary, as are the domains of properties. The above statements require an application to look at the definition of the property in the vocabulary to determine whether it has a domain, and then to treat the subject, X, as an instance of the class declared as the domain of the property. There is another way to provide the class as context in RDF: you can declare it explicitly in your instance data, rather than, or in addition to, having the class characteristics inherent in your descriptive properties when you create your metadata. The term used for this, based on the RDF standard, is "type," in that you are assigning a type to the "thing." For example, you could say:
X - is type - Parent
X - has child - "Fred"
This can be the same class as you would discern from the properties, or it could be an additional class. It is often used to simplify the programming needs of those working in RDF because it means the program does not have to query the vocabulary to determine the class of X. You see this, for example, in BIBFRAME data. The second line in this example gives two classes for this entity:
<http://bibframe.org/resources/FkP1398705387/8929207instance22>
a bf:Instance, bf:Monograph .

One thing that classes do not do, however, is prevent your "thing" from being assigned the "wrong" class. You can, though, define your vocabulary to make "wrong" classes apparent. To do this you define certain classes as disjoint; for example, a class of "dead" would logically be disjoint from a class of "alive." Disjoint means that the same thing cannot be of both classes, whether through the direct declaration of "type" or through the assignment of properties. Let's do an example:
"residence" has domain "Alive"
"cemetery plot location" has domain "Dead"
"Alive" is disjoint "Dead" (you can't be both alive and dead)

X - is type - "Alive"                                         (X is of class "Alive")
X - cemetery plot location - URI:9494747      (X is of class "Dead")
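In OWL/Turtle, again with made-up names, this situation is:

ex:residence rdfs:domain ex:Alive .
ex:cemeteryPlotLocation rdfs:domain ex:Dead .
ex:Alive owl:disjointWith ex:Dead .

ex:X rdf:type ex:Alive ;
    ex:cemeteryPlotLocation <http://example.org/plot/9494747> .
# The domain of ex:cemeteryPlotLocation entails ex:X rdf:type ex:Dead,
# which contradicts the disjointness axiom.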
Nothing stops you from creating this contradiction, but some applications that try to use the data will be stumped because you've created something that, in RDF-speak, is logically inconsistent. What happens next is determined by how your application has been programmed to deal with such things. In some cases, the inconsistency will mean that you cannot fulfill the task the application was attempting. If you reach a decision point where "if Alive do A, if Dead do B" then your application may be stumped and unable to go on.

All of this is to be kept in mind for the next blog post, which talks about the effect of class definitions on bibliographic data in RDF.


Note: Multiple domains are treated in RDF as an AND (an intersection). Using a library-ish example, let's assume you want to define a note field that you can use for any of your bibliographic entities. For this example, we'll declare the entities Work, Person, and Manifestation as domains of ex:note. You define your note property something like:

ex:note
    a rdf:Property ;
    rdfs:label "Note"@en ;
    rdfs:domain ex:Work ;
    rdfs:domain ex:Person ;
    rdfs:domain ex:Manifestation ;
    rdfs:range rdfs:Literal .

Any subject on which you use ex:note would be inferred to be, at the same time, a Work and a Person and a Manifestation - which is manifestly illogical. There is no way in RDF to express the rule "this property CAN be used with these classes." For that, you will need something that does not yet exist in RDF but is being worked on in the W3C community: a set of rules that would allow you to validate property usage. You might also want to see what schema.org has done for domain and range.
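For comparison, schema.org records intended usage with its own deliberately weaker terms, schema:domainIncludes and schema:rangeIncludes, which document where a property is meant to be used without licensing any inference. The ex:note property above could be restated in that style like this (a sketch):

ex:note
    a rdf:Property ;
    rdfs:label "Note"@en ;
    schema:domainIncludes ex:Work ,
        ex:Person ,
        ex:Manifestation ;
    schema:rangeIncludes schema:Text .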

Saturday, October 25, 2014

Citations get HOT

The Public Library of Science research section, PLOS Labs (ploslabs.org), has announced some very interesting news about the work that they are doing on citations, which they are calling "Rich Citations".

Citations are the ultimate "linked data" of academia, linking new work with related works. The problem is that the link is human-readable only and has to be interpreted by a person to understand what the link means. PLOS Labs have been working to make those citations machine-expressive, even though the citations themselves don't natively provide the information needed for a full computational analysis.

Given what one does have in a normal machine-readable document with citations, they are able to pull out an impressive amount of information:
  • What section the citation is found in. There is some difference in meaning whether a citation is found in the "Background" section of an article, or in the "Methodology" section. This gives only a hint to the meaning of the citation, but it's more than no information at all.
  • How often a resource is cited in the article. This could give some weight to its importance to the topic of the article.
  • What resources are cited together. Whenever a sentence ends with "[3][7][9]", you at least know that those three resources equally support what is being affirmed. That creates a bond between those resources.
  • ... and more
As an open access publisher, they also want to be able to take users as directly as possible to the cited resources. For PLOS publications, they can create a direct link. For other resources, they make use of the DOI to provide links. Where possible, they reveal the license of cited resources, so that readers can know which resources are open access and which are pay-walled.
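One could imagine a single rich citation carrying data roughly like this (purely illustrative; these ex: terms and DOIs are invented, not the PLOS vocabulary):

<http://example.org/citation/42>
    ex:citingArticle <https://doi.org/10.1371/example.one> ;
    ex:citedResource <https://doi.org/10.1234/example.two> ;
    ex:foundInSection "Methodology" ;
    ex:timesCitedInArticle 4 ;
    ex:citedTogetherWith <https://doi.org/10.1234/example.three> ;
    ex:citedResourceLicense "CC BY" .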

This is just a beginning, and their demo site, appropriately named "alpha," uses their rich citations on a segment of the PLOS papers. They also have an API that developers can experiment with.

I was fortunate to be able to spend a day recently at their Citation Hackathon where groups hacked on ongoing aspects of this work. Lots of ideas floated around, including adding abstracts to the citations so a reader could learn more about a resource before retrieving it. Abstracts also would add search terms for those resources not held in the PLOS database. I participated in a discussion about coordinating Wikidata citations and bibliographies with the PLOS data.

Being able to datamine the relationships inherent in the act of citation is a way to help make visible and actionable what has long been the rule in academic research, which is to clearly indicate upon whose shoulders you are standing. This research is very exciting, and although the PLOS resources will primarily be journal articles, there are also books in their collection of citations. The idea of connecting those to libraries, and eventually connecting books to each other through citations and bibliographies, opens up some interesting research possibilities.

Sunday, September 22, 2013

Copyright, Metadata, and Attribution

The Berkeley Center for Law and Technology (BCLT) has done some interesting research on copyright, including a white paper that details the issues of performing "due diligence" in a determination of orphan works.

Recently I attended a small meeting co-sponsored by BCLT and the DPLA to begin a discussion of the issues around copyright in metadata, with a particular emphasis on bibliographic metadata. Much of the motivation for this is the uncertainty in the library and archival community about whether they can freely share their metadata. As long as this question remains un-answered, there are barriers to the free flow of data from and between cultural heritage institutions.

At the conclusion of the meeting it was clear that it will take some research to fully define the problem space. Fortunately for all of us, BCLT may be able to devote resources to undertake such a study, similar to what they have done around orphan works.

One of the first questions to undertake is whether bibliographic metadata is copyrightable in the first place. If not, then no further steps need to be taken -- not even putting a CC0 license on the data. In fact, some knowledgeable folks worry that using CC0 implies that there do exist intellectual property rights that must be addressed.

However, before you can attempt to determine if bibliographic metadata can be argued to be a set of facts which, under US copyright law, do not enjoy protection, you must be able to define "bibliographic metadata." During the meeting we did not attempt to create such a definition, but discussion ranged from "anything about a resource" to a specific set of descriptive elements. As there were representatives of archives in the room, we also talked about some of the implications of describing unpublished materials, which have a different legal standing but also provide less self-identification than resources that have been published. Drawing the line between fact and embellishment in bibliographic metadata is not going to be easy. Nor will the determination of the level of creativity of the data, a necessary part of the analysis for US law. Note that other types of metadata were also discussed, such as rights metadata and preservation metadata, as well as a recognition that the exchange of metadata will of course cross national boundaries. Any study will have to determine where it will draw the "metadata" line, and also whether one can address the question with an international scope.

Another complexity is that bibliographic data is already "crowd-sourced" in a sense. For any given bibliographic record, different contributions have been made by different librarians from different institutions and at different times. This recognition makes it hard to ascribe intellectual ownership to any one party. And while library catalog data may be considered to be factual, it is much more than a simple rendering of facts, as the complexity of the cataloging rules attests. I likened library cataloging to a medical diagnosis: the end result (some scribbles in a file and perhaps a prescription given to the patient) does not reveal all of the knowledge and judgment that went into the decision. Metadata is the tip of an iceberg. That may not change its legal status, but I think that unless you have delved into the intricacies of cataloging it is hard to appreciate all that goes into the fairly simple display that users see on the screen.

The legal question is difficult, and to me it isn't entirely clear that solving the question on the legality of bibliographic data exchange will be sufficient to break the logjam. In a sense, projects like DPLA and Europeana, both of which have declared their metadata to be available with a CC0 license, might have more real impact than a determination based in law. Significant discussion at the meeting was about the need for attribution on the part of cultural heritage institutions. Like academics, the reputation and standing of such institutions depends on their getting recognition for their work. Releasing metadata (including thumbnails in the case of visual materials) needs to increase the visibility of those institutions, and to raise public awareness of the value of their collections. It is possible that solving the attribution problem could essentially dissolve the barriers to metadata sharing, since the gain to the institutions would be obvious.

Perhaps my one unique contribution to the group discussion was this:

We all know the © symbol and what it means. What we need now is an equally concise and recognizable symbol for attribution. Something like "(@)The Bancroft Library" or "(@)Dr. Seuss Collection". This would shorten attribution statements but also make them quickly recognizable, and a statement could also be a link to the appropriate web page. Standardizing attribution in this way should make adding attributions easier, and would demonstrate a culture of "giving credit where credit is due." The symbol needs to be simple, and should be easy to understand. It's time to comb through the Unicode charts for just the right one. Any suggestions?

See Also:


Unicode 1F6A9 - Triangular flag meaning "location"

Wednesday, September 28, 2011

Europe's national libraries support Open Data licensing

 "Meeting at the Royal Library of Denmark, the Conference of European National Librarians (CENL), has voted overwhelmingly to support the open licensing of their data. CENL represents Europe’s 46 national libraries, and are responsible for the massive collection of publications that represent the accumulated knowledge of Europe.

What does that mean in practice?
It means that the datasets describing all the millions of books and texts ever published in Europe – the title, author, date, imprint, place of publication and so on, which exists in the vast library catalogues of Europe – will become increasingly accessible for anybody to re-use for whatever purpose they want."

From an announcement by the Conference of European National Libraries.


Friday, September 10, 2010

Libraries, FOAF, and community

Note: this is being posted simultaneously on two blogs: Metadata Matters and Coyle's InFormation

“Why don’t libraries just use FOAF for their Person metadata? Why do they insist on creating their own?”

We don’t know how many times we have heard this on various lists. It often is not really posed as a question; in other words, it isn’t asking for an explanation of why libraries do not choose to use FOAF. It’s more rhetorical, along the lines of “Why can’t we all just get along?” But it is worthy of being asked as a real question, and of getting a real answer.

[Note first that the question of FOAF comes up not so much as we consider the current library standards, but in discussions of upcoming standards that will hopefully be based on the FR** family of standards (FRBR, FRAD, FRSAR). ]

A comparison of FOAF Person and the library Person entity (either in MARC authority files, or RDA, or FRAD) shows that there is not one defined element (or “property” as it is called in Semantic Web-ese) that the two have in common. This is not a coincidence; the two vocabularies serve significantly different communities and purposes. This does not mean that they are irreconcilable; the question therefore becomes: What keeps them apart? and can that be overcome?

The key is in the nature of the two communities.

FOAF stands for ‘Friend of a Friend’, which is a clue to its context: the schema is primarily for use in social networking situations. Its focus is on people who are alive and online, and it includes online contact information like email addresses, web sites, work web sites, Facebook IDs, Skype IDs, etc. The name of the person in FOAF is not an identifier, but presumes that the name of the person plus one or more of the contact IDs is enough to distinguish most humans from one another.

Library name data (which is a form of controlled vocabulary, called “name authority data” in library terms) is focused on creating a unique identifier that brings together the different forms of a name used in published materials under one form. Library users, therefore, can expect to find all of the works by or about a named person under a single entry regardless of the various forms of the name that exist in real data. Uniqueness of names is enforced by adding information to a non-unique name, usually the year of birth, but when that isn’t known (especially for persons of antiquity) titles or even areas of endeavor (“poet”) can be added.

To accommodate both the FOAF (social) function and the libraries' identification function, at the very least the libraries would need to define a subclass of foaf:Person, one that has a more strict definition and usage. However, for the library "Person" to be designated as more specific than foaf:Person does not require that these two be in the same vocabulary. That is one of the important features of Semantic Web classes and properties: like any other resources, they can be linked and related to any other resources on the Web.
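In RDF the bridge is a single statement (a sketch; the lib: namespace is hypothetical):

lib:Person rdfs:subClassOf foaf:Person .

# Every lib:Person is then also a foaf:Person, but lib:Person can carry
# the stricter definition and usage rules of the library community.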

Why not combine the library and FOAF properties into a single metadata vocabulary? The answer has little to do with technology, but instead relates to the functioning of communities. Metadata standards need to be developed by (and for) actual communities. The FOAF and library communities clearly have different needs, different goals, and are working with fundamentally different use cases. They also are significantly different as communities.

FOAF is being developed by an informal group of developers, and is quite recent in origin. The group is small: the FOAF development email list has about 350 members. Another 350 individuals are listed on the FOAF wiki pages as having a FOAF profile available on the Web. This is obviously not the full extent of FOAF usage, but these numbers reflect the recent development of this kind of metadata.

The library community has hundreds of years of investment in the creation of metadata (even though it was not called that when libraries began to create it). There are at least tens of thousands of libraries in the world, many of which have been in existence for centuries. Library data has its origins in early 19th century book catalogs but has been created in a machine-readable format since the late 1960’s. Library data is created following formal rules governed in part by international agreements, and there are many hundreds of millions of machine-readable bibliographic records in existence that were created based on these library cataloging principles.

Libraries have engaged in wide-spread data sharing for centuries, and with the global networking capabilities of today libraries are actually able to exchange and re-use data on a huge scale. Libraries do not each create metadata for the same book or item, but instead share the metadata created by one library in cooperative efforts oriented towards resource sharing and efficiency.

This sharing is built into the very core of library data management. The ability to use data created by others is supported by standards and those standards form the basis for the library systems. While most users see only the library catalog available to the public, that is only one function of a system that supports purchasing, fund accounting, inventory control, circulation and patron management, and collection analysis. In the Western world these systems are not created and maintained by libraries but by a small number of specialized commercial vendors whose products are specifically created for the library customers using agreed library standards. Thus the very same system can be sold to hundreds or thousands of libraries, creating a viable market base for system development.

A number of the 70,000 libraries contributing to OCLC are using a single standard, MARC21, and others are following international standards such as ISBD, which produces standardized bibliographic description. The development of these standards is based on a large-scale community process with international participation. It is not a perfect process by any means, and clearly must be updated to meet modern needs and new technologies that have changed the way we work, but the degree of data sharing libraries depend on requires that a formal process be in place to support the standards of this community.

Sharing of data on a large scale is necessitated by the economic reality of the library sector. Libraries face increasingly shrinking budgets while coping with an upswing in demand for their services. Realistically, this means that changes to library data must be carefully coordinated in order to minimize disruption to the complex network of data sharing that makes cost-effective library services management, based on this data, possible. Libraries may appear to be mistrustful of change agents, and in some cases they certainly are, but there is a real need to minimize risk for the community as a whole in order to assure the health of these often financially fragile institutions.

So we come back to the question of libraries and FOAF. In the final analysis, we’re not at all sure that there’s much gain in trying to combine these two approaches, with the differences in their communities and functions. It could be like trying to combine oil and water, requiring compromises that in the end would be less than satisfactory for both communities. One could argue that the difference between the vocabularies and their contexts is a positive, allowing more than one view of the Person entity. As two separately maintained metadata vocabularies, anyone creating metadata can choose from either as needed without sacrificing precision. One can also imagine other views that will arise, such as Persons in medical data or financial data, which would each carry data elements that are neither in FOAF nor library data, from blood type to bank balance. The important thing is to make sure that these vocabularies are properly described and related to each other where possible. That way, each community can manage its own process based on its needs for standards integration, but data can be shared where appropriate.

We could begin with a more detailed discussion between the FOAF and the library communities about their metadata needs. With hundreds of years of experience in representing names in library catalogs, we feel confident that the library community’s knowledge could contribute in general to the use of personal names in the Semantic Web.

Monday, September 14, 2009

Google Books Metadata and Library Functions

In a recent post in the NGC4LIB list, we got a very welcome answer from Chip Nilges of OCLC about Google's use of WorldCat records:
To answer Karen's most recent post, Google can use any WC metadata field. And it's important to note as well that our agreement with Google is not exclusive. We're happy to work with others in the same way. The goal, as I said in my original post, is to support the efforts of our members to bring their collections online, make them discoverable, and drive traffic to library services.

Regards,

Chip

As we have seen from recent postings about the metadata being presented in the Google Books Search service, there are some problems. Although Google claims to have taken the metadata from its library partners, we can look at records in GBS and the record for that item in the library partner database and see how very different they are. It is clear that Google has not retained all of the fields that libraries have provided, and has made some very odd choices about what to keep. Perhaps what we need to do, to help Google improve the metadata, is to make clear what data elements we anticipate we will need in order to integrate the Google product with library services.

When you ask people what metadata is needed for a service, they will often reply something like "everything" or "more is better." I'm going to take a different approach here because I think it is a good idea to connect metadata needs with actual functionality. This not only justifies the metadata, but the functionality helps explain the nature of the metadata that is required. For example, if we say that we want "date of publication" in our metadata, it may seem that we could use the date from the publication statement, which can have dates like "c1956" or "[1924]." If, instead, we indicate that we want to use dates in computational research, then it is clear (hopefully) that we need the fixed field date (from the 008 field in the MARC record).

So here are the functions that come to my mind, and I welcome additions. (Do remember that at this point we are only talking about books, so many fields relating to other formats will not be included.) I'll add the related MARC fields as I get a chance.

Function: Scholarship
Need: A thorough description of the edition in question. This will include authors, titles, physical description, and series information.


Function: Metasearch
Need: To be able to combine searches with the same data elements in library catalogs. Generally this means "headings," from the bibliographic record (authors, titles, subject headings).


Function: Collection development
Need: To use GBS to fill in gaps (or make comparisons) in a library's holdings, usually using classification numbers.


Function: Linking to other bibliographic collections or databases
Need: Identifiers and headings that may be found in other collections that would allow linking.

Function: Computation
Need: Data elements that can mark a text in time and space (date and place of publication), as well as those that can help segment the file, like language. This function also may need to rely on combining editions into groupings of Works, since this research may need to distinguish Works from Manifestations. Computation will most likely use metadata as a controlled vocabulary, and the full text of the work as the "meat" of the research.

Thursday, March 26, 2009

LC discovers infinity

If you were at ALA Midwinter in Denver (January, 2009) you may have been in one of the meetings where the Library of Congress announced its intention to atone for the lcsh.info fiasco. In case you missed that, Ed Summers of LC created an online version of the Library of Congress Subject Heading authority records, re-organized as a SKOS vocabulary and available for linking on the open Web. After being available for about six months (beginning in May of 2008), Ed was asked by his employer to take down the site on December 18, 2008. This was in spite of the fact that the data had been out there long enough to have a number of users, and that the removal broke existing systems that had developed around the data.

[Note, lcsh.info has been re-born as http://lcsubjects.org/, hosted by Talis.]

The outcry in the community was strong, including a reply to Ed's lcsh.info blog post by Sir Web himself, Tim Berners-Lee. The Library of Congress must have been suitably embarrassed.

Thus the announcement at Midwinter that LC not only understands the value of linked open access to LCSH, but that all of the vocabularies managed by LC -- from the name authorities to the lists of document types, languages, locations, etc. -- need to be openly available in a format suitable for inclusion in Web services. LC has created a web site to host these vocabularies: id.loc.gov. On that site they say:
Initially, within 6 to 8 weeks, the Library of Congress will release its first offering: the Library of Congress Subject Headings. This will be an almost verbatim re-release of the system and content once found at the popular prototype lcsh.info service.
They also say:
We aim to make resources available on this site within 6-8 weeks. Check this site regularly for more updates as we continue to develop this service!
The page is dated 1/22/09. My calculations show that 9 weeks have passed. OK, that's only one week over their stated deadline. But nothing on the page has changed. No resources have been made available. An "almost verbatim" release of lcsh.info should not be too hard given that Ed had code written that he has made publicly available.

But even today, the promised service is 6-8 weeks away. It may stay that way for a long time. Maybe even forever.

Why does this matter? It matters because the availability of these vocabularies is essential for the library world to move forward. Some of us have been asking LC to put the vocabularies online in a machine-actionable format for a very long time. The Dublin Core community worked with LC to create a machine-actionable and URI-identified version of the MARC role terms as early as 2005. You can't find this linked from any of the MARC documentation. Some of us brought up the topic ad nauseam at MARBI meetings, but to no avail. Now LC seems to have "gotten it" conceptually, but they have yet to show us that they can deliver.

I may seem to be unduly impatient on this score, but it's not that we have been waiting for this for 9 weeks: we've been waiting for years. And quite honestly, this is not rocket science, nor does LC lack guidance for how to manage this data. In fact, they could use the NSDL Metadata Registry, or, if they insist on hosting this themselves, the Registry's source code is available. Quite frankly, if LC does not prove to us soon that it can perform this necessary function, I feel that we are quite justified in going forward without them, registering the vocabularies where they can be used and managed by anyone who needs them, and going forward with a transformation of library data that will meet 21st century needs.

Monday, September 29, 2008

DC2008

I recently attended the annual Dublin Core conference in Berlin. I would have blogged the sessions but in fact I spent most of my time in the hallways chatting with folks. The main message of the conference was: Semantic Web. This included an interesting talk by Martin Malmsten on turning MARC records into RDF triples. (See also the work at Talis in this area.)

For me the big deal of the conference was a meeting with some of the Dublin Core folks who developed the DC Abstract Model and the DC Application Profile model. We had a nice long talk about the distance between those views and the actual production of library metadata. What we concluded was that we will work together to bridge this gap, in part by creating simple, re-usable modules that are easy to understand and that, when hooked together, provide the information necessary to engineer a fully functioning, DCAM-compliant application profile.

Yes, I know we need more of an explanation, and I'll be working on that very soon. Don't go too far away.

Saturday, September 13, 2008

Thinking About Linking

In my previous post on affordances, I included inter- and intra-metadata links. I feel like there's a lot of confusion in this area (some of which I may myself have contributed), so I'm going to do a bit of a disorganized brain dump here as an attempt to start a conversation in this area, see if maybe we (or I) can't arrive at some clarity.

In the FRBR vision that RDA has embraced, there is something called the "relational/object-oriented model." I have some basic problems with this because I perceive relational and object-oriented designs to be quite distinct. This concept of relational/object-oriented gives me one of those "blank brain" moments -- when something sounds like it should make sense but I just can't make sense out of it. So I'm going to treat it as a set of relationships within a bibliographic record.

In the FRBR/RDA model there are entities: Work, Expression, Manifestation, Item (WEMI), and Person, Corporate body, Concept, Object, Event, Place. The interesting thing about these is that none of them is intended to stand alone. This is a very inter-dependent group of entities, not a set of separate records. This is hard for us to imagine because today's model is indeed one of separate records for bibliographic data and authority data (covering names and subjects). However, our view is colored by the fact that the bibliographic record carries headings from the authority records, and therefore is complete in itself. Authority records, if you think about them, even those for names, are in the nature of a controlled vocabulary. The view of these vocabularies as contributing to the bibliographic description means that we have to have a way to express both the entities themselves and the links between them.

In addition, we have to decide what counts as a record. If, to describe a work, one must also describe the creator, then it does seem that the Work entity and Person (or Corporate body) entity must be part of the same record. Otherwise, the record cannot stand alone. So what does it mean to include the Person entity, and where does that entity reside? Or is an unresolved link to a (presumed) entity sufficient to complete the bibliographic record? In other words, if the bibliographic record has, as part of the work, a link to a Person entity that resides elsewhere, is that bibliographic record complete?
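To make that concrete, here is a minimal sketch in Python with rdflib (all of the URIs here are invented for illustration) of a description whose creator is nothing more than a link to a Person entity residing elsewhere:

from rdflib import Graph, Literal, Namespace, URIRef

DC = Namespace("http://purl.org/dc/elements/1.1/")

g = Graph()
work = URIRef("http://example.org/work/moby-dick")     # invented URI
person = URIRef("http://example.org/person/melville")  # described elsewhere

g.add((work, DC.title, Literal("Moby Dick")))
g.add((work, DC.creator, person))  # a link only; no Person data held locally

# The graph holds nothing about the person beyond the link itself --
# exactly the "is this record complete?" question posed above.
print(g.serialize(format="turtle"))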

Note: I read back through FRBR and FRANAR regarding the Person entity. FRBR includes only the "name heading" in its Person entity, while the FRANAR Person entity has many more elements. This parallels today's difference between the personal name field and the name authority record.

There are other kinds of relationships between bibliographic entities. To my mind there are two types of relationships here: dependent and independent. The dependent relationships are between the WEMI entities, none of which is considered complete in itself. In fact, I consider the WEMI to be a single entity with dependent parts. (Admittedly, this is how current library cataloging views it, with a single flat record that contains information on all of these bibliographic levels, which exist simultaneously in a single object.) To me, these are indivisible -- you can't have any one of them without the others.
[Note that I consider the WEMI to be a single entity in terms of library cataloging records. The levels of this entity do have meaning on their own. For example, a literary critic will often refer to the Work, perhaps to the Expression. A publisher or bookstore advertises the Manifestation. A library identifies and circulates the Item, and a rare book seller deals almost exclusively in Items.]

The independent relationships are those between different bibliographic entities -
  • Work-Work, two works that reflect or reference each other (cited, cites; works based on other works, like parodies or sequels)
  • Whole-Part, works in which one can be contained in the other (article and journal, chapter and book, volume and series)
  • Item-Item, reproductions of all types
To a large degree, these relationships can all be expressed as properties: isCreatorOf, isExpressionOf, isCitedBy. But I can't shake the feeling that there are at least two distinct kinds of relationships: those that fill in what otherwise would be gaps in a metadata record, and those that inform relationships between bibliographic items. I also wonder about links with and between complex entities. For example, imagine a bibliographic record that links to a member of a subject vocabulary that is stored in SKOS format. The SKOS record has numerous fields covering preferred and alternate headings, definitions, links to broader and narrower terms, and all of this in various languages. What if the property in the bibliographic record has the meaning "definition of term in French"? What does one link to? Or is the only possible link to the vocabulary member as a whole?
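To make the SKOS case concrete, here is a rough sketch (Python with rdflib; the concept URI and labels are invented) of the kind of vocabulary member I have in mind:

from rdflib import Graph, Literal, Namespace, URIRef

SKOS = Namespace("http://www.w3.org/2004/02/skos/core#")

g = Graph()
concept = URIRef("http://example.org/vocab/whales")  # invented URI

g.add((concept, SKOS.prefLabel, Literal("Whales", lang="en")))
g.add((concept, SKOS.prefLabel, Literal("Baleines", lang="fr")))
g.add((concept, SKOS.altLabel, Literal("Cetaceans", lang="en")))
g.add((concept, SKOS.definition, Literal("Large marine mammals.", lang="en")))
g.add((concept, SKOS.definition, Literal("Grands mammiferes marins.", lang="fr")))

A bibliographic record can link to the concept as a whole, but there is no obvious way to link to just "the French definition" -- which is the question posed above.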

So these are a few of the questions I have. Hopefully some of them can be cleared up quickly. I'm interested in hearing how others think about these issues. For those attending DC2008, if this interests you I'm game for some discussion.

Monday, September 08, 2008

Metadata Affordances

In my last post, I promised to spend some time thinking about metadata affordances -- that is, a view of metadata based on what you can do with it. My hope is that this will inform a metadata model that serves our needs (whoever "we" are, but admittedly this will tend toward the metadata needs of the library community). Here are the categories that I have come up with, all open to comment, discussion, correction, etc., so please comment freely.

None (opaque text)

Some metadata will necessarily be of this category, with no particular affordances inherent in the contents. At times plain text is used because that is the nature of the particular metadata element, like the recording of the first paragraph of a text, or transcribing a title from the piece. At other times plain text is used because the metadata community has chosen not to exercise control over the particular metadata element. An example of this is user-input tags. Although human intelligence may be applied to plain text fields, it requires knowledge that is not inherent in the metadata structure itself.

Structure and rules (typed strings)

Typed strings are things like formatted dates (YYYYMMDD) and currency formats ($9,999.99). There are other possible formatted strings, such as common identifiers like the ISBN and ISSN. The affordance of these strings is that you can exercise control over their input, forcing consistency in the values. With consistent values you can perform accurate operations, like adding up a set of figures, sorting or searching by date, etc. Some controlled list values may also have structure: the standard format for personal names used by libraries includes structural rules ("family name followed by comma, then forenames") that facilitate the use of alphabetically ordered lists of names.
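As a small sketch of the affordance (my own illustration in Python, not code from any standard), consistent formats make mechanical checks possible:

import re

def is_yyyymmdd(value):
    # Check the YYYYMMDD shape; a real validator would also check ranges.
    return bool(re.fullmatch(r"\d{8}", value))

def isbn10_is_valid(isbn):
    # ISBN-10 check digit: the weighted sum mod 11 must be 0 ('X' = 10).
    chars = isbn.replace("-", "").upper()
    if len(chars) != 10:
        return False
    total = 0
    for i, ch in enumerate(chars):
        if ch == "X" and i == 9:
            n = 10
        elif ch.isdigit():
            n = int(ch)
        else:
            return False
        total += (10 - i) * n
    return total % 11 == 0

print(is_yyyymmdd("20080913"))           # True
print(isbn10_is_valid("0-306-40615-2"))  # True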

List membership/vocabulary control

One way to assure consistency in metadata is to require that the metadata value be selected from a fixed list of values, rather than being open to free text. This tends to take the form of a list of like terms: languages of text, country names, colors, physical formats.

Although it provides consistency, list membership alone does not provide much in terms of capabilities for data processing. Other information is needed to provide affordances for list members:
  • access to display and indexing forms of the term
  • access to alternate forms, including other languages
  • access to definitions of terms

The information that is needed, therefore, for any list and its members is:
  • list identifier
  • member identifier
  • location of services relating to this list/member, and what services are available

If there are no automated services, then a system will need to provide its own, which is what we generally do today by creating a copy of the list within the system and serving display forms and other features from that internal list. In a web-enabled environment, however, one could imagine lists with web services interfaces that can be queried as needed.
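To sketch the difference (everything here, including the service URL, is hypothetical):

# A local copy of a controlled list, as systems keep today.
LANGUAGE_LIST = {
    "eng": {"display": "English", "fr": "anglais"},
    "ger": {"display": "German", "fr": "allemand"},
}

def display_form(code, lang="en"):
    entry = LANGUAGE_LIST[code]
    return entry["fr"] if lang == "fr" else entry["display"]

print(display_form("ger", lang="fr"))  # "allemand"

# In a web-enabled environment, one could instead query the list's own
# service interface as needed, e.g. (a hypothetical endpoint):
#   GET http://example.org/lists/languages/ger?form=display&lang=fr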


Inter- and intra-metadata links

There is a need to create functional links within metadata segments to other metadata segments or records. For example, the use of name and subject authority records implies a link between those records and the bibliographic metadata records that contain the names and subjects as values. There are also links needed between bibliographic records themselves. These latter represent a number of different relationships, which have been articulated in the FRBR documentation. Some examples are: work-work relationships, work-expression relationships, and part-whole relationships (chapters within books, articles within journals).

There may be other kinds of links that are needed as well, but I think that the main need is to distinguish between identifiers and links. Some identifiers, like ISBNs, can be used to retrieve metadata in a variety of situations, but those should be seen as searches, not links. Searching is appropriate in some circumstances, but the ability to create stable links is a separate affordance and should be treated as such.
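A tiny sketch of that distinction (the URLs are illustrative, not real services):

def search_by_isbn(isbn):
    # An identifier used as a *search*: what comes back depends on the target.
    return "http://example.org/catalog/search?isbn=" + isbn

def stable_link(record_uri):
    # A *link*: a stable reference that always denotes the same resource.
    return record_uri

print(search_by_isbn("0306406152"))
print(stable_link("http://example.org/records/12345"))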

Note: These categories of affordances are not mutually exclusive. Some metadata values will provide more than one type of affordance. Each should be clearly and separately articulated, however, and we should think about the advantages and disadvantages of having metadata values serve multiple functions.

Friday, September 05, 2008

Literals and non-literals, take 2

Jason Thomale responded to my previous post with his insights into literals and non-literals, and I have to say that this really hit me upside the head, in the best of ways. Here are some paragraphs from Jason's comment (which is worth reading in full):

A literal is a value that references nothing other than itself. You could consider it the "end of the line" when you're thinking about linked data. It's data that isn't linked to anything else. For example, the property "FirstName" would probably have as its value a literal. Consider "FirstName=Karen"--Karen isn't referencing the person, it's a literal string (or "value string") that tells what the FirstName is. The FirstName property, in turn, would probably be part of a description set that describes the resource--the person--that could be identified by the string "Karen Coyle."

A non-literal, on the other hand, is a value that serves as a reference to something else. Hence "non-literal"--it's not a data value to be taken literally. It's a pointer--a link--that refers to something else. Properties whose value would logically be another resource should contain non-literals. "Author," for example. Even when we say, "The Author of this blog posting is Karen Coyle," we're not referring to the literal string "Karen Coyle." That string didn't write the posting. We're using that string as a name that references the actual person. The person authored the blog posting. "Karen Coyle" is just a convenient reference--or non-literal--that points to the person. So--first of all, that's the difference between a literal and a non-literal.

...

Since I'm an RDBMS guy at heart, it's easy to think of it in those terms. A non-literal would be like a foreign key. The value itself may or may not mean anything--it just references a record in another table. A literal would be a cell that isn't a foreign or primary key. It's the actual data.

Now--this certainly isn't unambiguous. Going back to the FirstName example, one might use a non-literal for this property if you're actually thinking about first names as entities/resources in your data model. Maybe you have a separate description of each name, complete with history, related names, etc. In this case you could use a URI to identify each name, or each resource that describes any given name, or you could keep using the value string "Karen"--but in the latter case you would also need a URI associated with it that identifies how to interpret that value. Otherwise it's just a literal. So--in this case, you have the same value string ("Karen") that we could use for the same property as a literal or as a non-literal. From my understanding, what matters is whether or not you're using it as an identifier to refer to something else and whether or not you include a URI that describes the identifier--not whether or not it's "structured data."

What Jason does here is to look beyond the way that DCAM defines the structures of literals and non-literals and instead focuses on what UI folks would call the "affordances." In other words, what do these types of values do for me in a linked environment? Although I've heard DC folks talk about this aspect of the DCAM, it is not brought out in the DCAM document itself.

Where I think my concept of this differs from the one that circulates in the DC world is that I'm not at all interested in refining philosophical points about the fine lines between literal and non-literal. This comes up in a second comment of Jason's that I reply to. I believe that Jason's analysis is in agreement with the DCAM definitions, which, however, don't work for me:
Jason: "If I said, "This book's author is Karen Coyle," then the real value of "author" is *the person,* and "Karen Coyle" is being used like a non-literal value to identify *the person.*"

Karen: I believe that you can indeed say: "this book's author is [literal value 'Karen Coyle']." Simple metadata does that all of the time. I think that the distinction is *not* in the string, or even in the fact that you put it in classic RDF-triple terms, but in the intended use. So in a MARC record following AACR2, an author name in the 100 field is a non-literal because it represents a heading in the authority file. In a [metadata] record that is not using any particular cataloging rules (or where you as a recipient have no idea what the rules are), the value in the [author or creator] field, even if it is identical to the entry in the AACR2 record, is a literal because you can make no inference about what it might represent outside of the metadata record.
The difference that I see here is between a theoretical non-literal ("author of this book is Karen Coyle") and a value that one can actually act on ("author of this book is the person identified in library land by LC Control Number: n 89613425"). I realize this means that the context of the data has an effect on whether one would call the data literal or non-literal, but in fact I'm less concerned with what you would call it than with what I can do with it at any given moment in time. It's this knowing what I can do with a value that is of prime importance to me, and finding a way to convey to people and machines what they can do with a value is my main goal. (I don't know if Jason would disagree with this, but he knows how to comment, so I'll let him speak for himself.)
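In RDF terms, the same name can play both roles. A minimal sketch with Python's rdflib (the book URI is invented, and the id.loc.gov URI form is my own illustration based on the control number above):

from rdflib import Graph, Literal, Namespace, URIRef

DC = Namespace("http://purl.org/dc/elements/1.1/")
g = Graph()
book = URIRef("http://example.org/book/1")  # invented

# A literal: just a string; no inference possible beyond this record.
g.add((book, DC.creator, Literal("Karen Coyle")))

# A value one can act on: a link to an identified person.
g.add((book, DC.creator,
       URIRef("http://id.loc.gov/authorities/names/n89613425")))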

I am now arriving at the conclusion that if we focus on real affordances for linking, rather than structure, then we can have a very useful discussion of types of metadata affordances that serve our purposes. These may or may not exactly parallel the DCAM structure, but I don't think that adoption of the DCAM is our task -- I think our task is to create a useful model for the next generation of library data. What DCAM provides us with is an existing model that we can poke at, dissect, try to work with, and throw our own ideas at. Then, once we have defined our affordances we can figure out a way to structure our data profiles so they reveal those affordances to human and machine users.

Tuesday, September 02, 2008

Semantic Dementia

"Semantic dementia" is a term for something many of us of advanced age experience: forgetting words we once knew. It brings to my mind, however, the kind of demented semantics that we often encounter in standards in our field, and the use of or creation of words that obscure the meaning of the standard.

I understand the need that standards have to be very precise in their terminology; to give terms specific meaning. There often is a conflict, however, between that desire for precision and the need to communicate well with the users of the standard. An example of this is the OpenURL* standard, which pioneered the "Context object" and its ever-obscure children like the Referent and the Referring Entity. Quick: give me a definition for Referent.... right, it's not exactly on the tip of anyone's tongue.

I'm going to say that there are two kinds of people in the world: those who think that using a standard should require many hours of study leading to a complete understanding and absorption of the concepts and terminology, so that there cannot be any possible mis-use of the standard; and those who think that a standard should be fairly understandable in a single reading, and usable shortly thereafter. Members of the former group seem to feel that the ideas in their standard are so clever, so unique, that they cannot be comprehended easily. Members of the latter group (to which I obviously belong) assume that standards recombine previous concepts into new structures, and, deep down, are generally simple ideas that one could express simply.

Jeff Young, who clearly has an element of Type 2 in him, managed to unbundle the studied obscurity of the OpenURL with the opening post to his blog, Q6. He replaced the OpenURL terms with Who, What, Where, Why, When, How. I believe that for many people, the light bulb suddenly lit up upon reading his explanation.

A similar simplification is needed for the Dublin Core Abstract Model, and I'm going to attempt that even though I think it's a dangerous thing to do. DCAM defines a set of metadata types that can help us communicate to each other about our metadata. It should simplify crosswalking of metadata sets, and make standards more understandable across communities. Unfortunately, it has not done so, at least in part, because of some rather demented semantics.

DCAM, Simplified

First, you need to understand that the DCAM is about metadata structure, not meaning, or at least not meaning in the human sense of the term. It describes a generalized underlying format for machine-readable metadata. In the most simple terms it provides the information that a program would need to determine what operations it can perform on the metadata that it receives. In this sense, it is a general metadata analogy to the OpenURL's Context Object: a formalized message about something.

The basis of the DCAM is key/value pairs, each of which is called a statement, which is the terminology from RDF. Any group of statements describes a single resource. A resource can be just about anything; what your resources are depends on what your metadata ultimately describes. Examples are: a book; a person; an event. The set of key/value pairs that describes a resource is called a description. If you will describe more than one resource in your metadata record, then you will have multiple descriptions. These make up the description set, which is the sum of the descriptions that you have defined for your purpose. These descriptions can be packaged into a record. It all looks something like this:
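In rough Python terms (this is my own rendering, not the official DCAM diagram; the field names are mine):

record = {
    "description_set": [        # one description per resource described
        {
            "resource": "the book",
            "statements": [     # key/value pairs
                ("title", "Moby Dick"),
            ],
        },
        {
            "resource": "the author",
            "statements": [
                ("name", "Herman Melville"),
            ],
        },
    ]
}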
The statement level is where we get to the real meat of the DCAM, and the part that I think holds great potential. The actual DCAM diagram is very large and filled with terminology that makes it difficult to grasp the meaning of the concepts. I'm going to simplify that meaning here, with the understanding that there is more to DCAM than this simple explanation. Consider this Step 1, not the whole enchilada.

Essentially you have key/value pairs. A key/value pair can look something like:
title = Moby Dick
where "title" is the key and "Moby Dick" is the value.

The first rule in the DCAM is that the key must be identified with a URI, a Uniform Resource Identifier. Here's a URI that you might use for this key/value pair:
http://purl.org/dc/elements/1.1/title

There is nothing new in this; using URIs is a very common convention. It's in the definition of the value that DCAM adds something. Values can be "literals" or not. DCAM makes use of the RDF definition of a literal:
"Literals are used to identify values such as numbers and dates by means of a lexical representation. Anything represented by a literal could also be represented by a URI, but it is often more convenient or intuitive to use literals."

Literals can be plain or typed, as defined in the RDF documentation. An example of a typed literal is a date or currency. A typed literal gives you some control over the format of the string, such as "YYYYMMDD."

This is contrasted with the non-literal values. The non-literal values are not defined in the RDF documentation, except to imply that they are everything that is not a literal. The DCAM goes further and defines non-literals as being of two types: the non-literal value is either a URI that names the value or it is a value from a named, controlled vocabulary. So you can have:
http://purl.org/dc/dcmitype/Event

which is a URI for a controlled value, in this case the DCMI type value, Event. Or you can have:
[URI for ISO 639-1, language codes] + "en" for English
This latter is similar to what we often do in MARC records, which is to record a code as a string and indicate what controlled list that string is from.
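Side by side, the two non-literal forms look something like this (my own shorthand; the vocabulary URI is illustrative):

# 1. The value is itself a URI:
value_as_uri = "http://purl.org/dc/dcmitype/Event"

# 2. The value is a string drawn from a named controlled vocabulary:
value_from_vocabulary = {
    "value_string": "en",                         # the code itself
    "vocabulary": "http://example.org/iso639-1",  # illustrative URI for the list
}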

Obviously, if this were all there was to DCAM, we'd all be all over it by now. What happens next is that we start trying to apply this in a metadata world that is not as neat as we would like. For example, what do we do with an ISBN -- is it a structured value? Yes. Is it a member of a controlled vocabulary? Sometimes, yes, because there is a finite list of ISBNs and each one of them is known (at least to Bowker and other ISBN agencies). So, is it a typed literal, or a non-literal?

In the end, however, perhaps it doesn't matter that this definition of "non-literal" leaves us with some ambiguity. Perhaps what really matters is that we distinguish between these three kinds of values:
  1. Plain strings. These will be necessary for values that simply cannot be controlled or are not being controlled.
  2. Structured strings. In these, the range of values can be great, and is not being housed in a finite list, but because of their structure they can often be acted on for functions like quality control, transformations, etc.
  3. Identified values. An identified value is the essence of the semantic web. It allows algorithms to make connections between values even though those algorithms are not machine intelligences and do not understand the human meaning behind the value.
Our mission, as information professionals, if we choose to accept it, is to move our data, where possible, from point 1 to points 2 or 3 so that there are more possibilities to act on the data in the larger web environment.
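As a sketch of what that move might look like in practice (the mapping table and URIs are invented):

# Upgrading a plain string (point 1) to an identified value (point 3).
LANGUAGE_URIS = {
    "english": "http://example.org/languages/en",
    "french": "http://example.org/languages/fr",
}

def identify(plain_value):
    # Return an identifier for the value, or None if we cannot.
    return LANGUAGE_URIS.get(plain_value.strip().lower())

print(identify("English"))   # http://example.org/languages/en
print(identify("Sanskrit"))  # None -- this one stays a plain string for now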

I welcome discussion, criticism, additions... whatever comes to your mind. Really.

* NOTE: I'd give you a link to the OpenURL standard, but NISO has gone with one of those content management systems that produce unusably long links. So you'll have to go to the NISO site and look for it under Standards.

Monday, June 09, 2008

More patent insanity: Google's virtual bookshelf

I've only read a few patents in my time, and they are very strange documents. Stranger still because they have a real effect on the world.

I don't know if there is a specific language and style for patents, but the ones that I have read are amazingly vague. That is especially frightening given that patents describe technologies -- things you can create in some 'real world' fashion. The latest patent to make it through the magical and mysterious process at the Patent Office that turns a "nutty idea" into "take it to court" is the Google patent called "Computer-implemented interactive, virtual bookshelf system and method."

"A computer-implemented method and system for realizing an interactive, virtual bookshelf representing physical books and digitally stored books of the user. Using a search query, the Web is searched using search metadata to identify a desired book. Library metadata corresponding to the physical books and digitally stored books of the user is then searched using the search metadata to determine whether the book is present in the virtual online bookshelf. Results indicative of whether the desired book is present on the virtual on-line bookshelf can be displayed."
That's the abstract, but even reading the details and looking at the diagrams there are many things that are not clear. Here's one flow:

  • Search term or query of search data and/or search metadata
  • Search hits metadata of desired books

OK, so far we have exactly what happens in a library catalog (but not what happens using Google Book Search, which is based on actual text).

  • Filter or compare to library/metadata of selected virtual bookshelf

  • If found in virtual bookshelf, "User acquires physical or digitally-stored copy of desired book from physical bookshelf of selected virtual bookshelf."

Now this last one is just nonsense. There's something called Physical Bookshelf that somehow the user accesses from an Internet search. Does this mean that the user gets a call number and goes to the shelf? And in the diagrams, the Physical Bookshelf contains a "Memory of Digitally Stored Books." So this must be magic, because I don't know of any physical bookshelves with memory. Well, at least none outside of L-Space.

The last thing that happens in this very odd flow is that if one does not find the book in the virtual book shelf, the question is put:

Acquire physical or digitally stored copy of desired book?

If the answer to that is yes, then the user acquires metadata of the desired book, which is then compared to the virtual bookshelf. The same virtual bookshelf where the item wasn't found.

Since we know that patents tend to be interpreted very broadly, this patent could be seen to cover any search of metadata that results in finding books that can be either digital or physical. That is essentially every library catalog in the nation, and beyond. And indeed what is a library catalog but a "virtual bookshelf"? The one caveat is that it is the Web that is searched, not a library database. But if we go forward with our ideas to have library metadata searchable over the Web, then ...

Patents today are rarely used by their inventors to create actual products. Instead they are used to bludgeon competitors who are also working in the same approximate service space. The patents are ends in themselves and are designed to prevent invention. Quite honestly, if something isn't done about this, we'll find ourselves completely unable to innovate.

At this point I should come up with some clever, satirical example of outrageous patents, but it's really impossible to one-up reality in this particular area.

For a more positive view of the patent, see SEO by the Sea blog post, and a post about that post by Lorcan Dempsey.

Wednesday, May 21, 2008

Authors

The Open Library is, among other things, an interesting experiment in the creation of a book catalog that mixes data from libraries, publishers, and online sites (currently only Amazon).

One of the big issues that comes up, of course, is that of author names. We know that author names are recorded differently in different sources. We also know that only the library data carries what could be considered an author identifier: a unique string for each unique author. (How "unique author" is defined is a discussion for another day.)

Because the Open Library creates a web page for each author, and that web page links to the books by the author, it is important not to split an author's works into separate pages for each form of the author's name. In other words, you wouldn't want a page for "Mark Twain" and another for "Twain, Mark" -- although that would be a simple case.

As book data is added to the Open Library, the incoming bibliographic records are matched to those already in the database. It is only after a match is found that the authors are compared in some detail. The fact that the varying author names appear on the same bibliographic record allows you to make some inferences. So the library names can be switched around to "natural order" for comparison to Amazon data. The main question is, however, if you don't get an exact match at that point, what else would constitute a match?
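That "switch to natural order" step might look something like this (a simplified sketch; real library headings also carry dates, titles, and other subfields):

def natural_order(heading):
    # Turn "Twain, Mark" into "Mark Twain" for comparison purposes.
    if "," in heading:
        family, _, forenames = heading.partition(",")
        return (forenames.strip() + " " + family.strip()).strip()
    return heading.strip()

print(natural_order("Twain, Mark"))  # "Mark Twain"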

There is an interesting set of data (created by data wrangler Edward Betts) that lists matched books that have un-matched authors. This small set is like a microcosm of "the author problem." Edward has added Jaro-Winkler values as a way to quantify the matches, although it isn't clear where the bright line is between match and no match. This is an interesting problem.
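For the curious, here is a rough pure-Python rendering of the Jaro-Winkler measure (a textbook sketch, not Edward's code):

def jaro(s1, s2):
    # Jaro similarity: 0.0 (nothing in common) to 1.0 (identical).
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(len(s1), len(s2)) // 2 - 1
    match1 = [False] * len(s1)
    match2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not match2[j] and s2[j] == ch:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count transpositions among the matched characters.
    k = transpositions = 0
    for i in range(len(s1)):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    m = float(matches)
    return (m / len(s1) + m / len(s2) + (m - transpositions / 2) / m) / 3

def jaro_winkler(s1, s2, p=0.1):
    # Boost the Jaro score for strings sharing a common prefix (up to 4 chars).
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

print(round(jaro_winkler("Erofeev", "Erofeyev"), 3))  # 0.975

Scores near 1.0 suggest a match; where the cutoff belongs is exactly the "bright line" question.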

Here are some things that turn up in the small set:

No author vs. many authors
When a book has many authors, especially when it is a compilation of works by different authors, library catalog data does not record "an author" in the author position (MARC 100). Amazon lists all of the authors as authors. (#2, #14 in the data set)

Anonymous
Libraries do not list "anonymous" as an author. Amazon does (and even has a link so you can click on it!). Compare Open Library and Amazon for the book "Primary Colors."

Transliterated name forms
Erofeev v. Erofeyev (#7), Tsernianski v. Crnjanski (#170). Also see #s 162 and 163 for this same case. Although there are some examples of misspellings of complex names, the issue here is mainly that library cataloging standardizes on a particular form of the name, while publisher and bookseller data probably use the form that appears on the work in question. You can see this in the Crnjanski book, which has "Tsernianski" in the "by" statement (the awkward term I came up with for the statement of responsibility -- suggestions welcome, although don't bother to suggest "statement of responsibility" because I consider that NOT for user consumption).

Bits and Pieces
Names with "bits and pieces" like titles of address (Dr., Mrs.), multi-part names (see # 9 for De Courcy, Catherine vs. Courcy Catherine De) (and #76 for Gregory, Saint, Bishop of Tours vs. Gregory of Tours). The problems here seem to be a combination of not knowing when where and how to include some of these (like the "Jr. " in William F. Buckley Jr., which isn't included in the library heading -- #155), and, once they are there what order to put them in. This is an area where some data match formulas might be able to help out.

Just Plain Wrong?
Obviously, the data isn't always perfect, and in particular many Amazon entries seem to have been hastily input. (I wonder to what extent these represent the used book sellers on Amazon? How does that data get into Amazon's database? See #21, and its Amazon entry.) Also, there are many entries that are incomplete (e.g. just the author's last name).

I truly believe that the future will bring more instances when we will find ourselves needing to combine bibliographic data from different sources, or move data across traditional community boundaries. For this reason I have one specific request for the library community: Please provide the name as it appears on the work in a form that can be used for matching.
This means that burying it in the statement of responsibility with no further mark-up does not help. Meanwhile, we may want to consider string-matching within the SoR against names from other sources. Yeech!