
Monday, February 03, 2020

Use the Leader, Luke!

If you learned the MARC format "on the job" or in some other library context you may have learned that the record is structured as fields with 3-digit tags, each with two numeric indicators, and that subfields are marked with a subfield delimiter (often shown as "$" because it is a non-printable character) and a single-character subfield code (a-z, 0-9). That is all true for the MARC records that libraries create and process, but the MAchine Readable Cataloging standard (Z39.2 or ISO 2709) has other possibilities that we are not using. Our "MARC" (currently MARC21) is a single selection from among those possibilities, in essence an application profile of the MARC standard. The key to the possibilities afforded by MARC is in the MARC Leader, and in particular in two positions that our systems generally ignore because they always contain the same values in our data:
Leader byte 10 -- Indicator count
Leader byte 11 -- Subfield code length
In MARC21 records, Leader byte 10 is always "2", meaning that each field carries two one-byte indicators, and Leader byte 11 is always "2" because the subfield code is two characters long (the delimiter plus one character). That was a decision made early on in the life of MARC records in libraries, and it's easy to forget that there were other options that were not taken. Let's take a short look at the possibilities the record format affords beyond our choice.
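These two bytes are easy to inspect in raw data. Here is a minimal sketch in Python, assuming you already have the record in hand as an ISO 2709 byte string (nothing else about the record is parsed here):

def leader_profile(raw: bytes) -> dict:
    # The Leader is always the first 24 bytes of an ISO 2709 record
    leader = raw[:24].decode("ascii")
    return {
        "indicator_count": leader[10],        # MARC21 data always carries "2" here
        "subfield_code_length": leader[11],   # and "2" here (delimiter plus one character)
    }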

Both of these Leader positions are single bytes that can take values from 0 to 9. An application could use the MARC record format and have zero indicators. It isn't hard to imagine an application that has no need of indicators or that has determined to make use of subfields in their stead. As an example, the provenance of vocabulary data for thesauri like LCSH or the Art and Architecture Thesaurus could always be coded in a subfield rather than in an indicator:
650 $a Religion and science $2 LCSH
Another common use of indicators in MARC21 is to give a byte count for the non-filing initial articles on title strings. Instead of using an indicator value for this, some libraries outside of the US developed a non-printing code to mark the beginning and end of the non-filing portion. I'll use backslashes to represent these codes in this example:
245 $a \The \Birds of North America
I am not saying that all indicators in MARC21 should or even could be eliminated, but that we shouldn't assume that our current practice is the only way to code data.

In the other direction, what if you could have more than two indicators? The MARC record would allow you to have as many as nine. In addition, there is nothing to say that each byte in the indicator has to be a separate data element; you could have nine indicator positions that were defined as two data elements (4 + 5), or some other number (1 + 2 + 6). Expanding the number of indicators, or beginning with a larger number, could have prevented the split in provenance codes for subject vocabularies between one indicator value and the overflow subfield, $2, when the number exceeded the capability of a single numerical byte. Having three or four bytes for those codes in the indicator and expanding the values to include a-z would have been enough to include the full list of authorities for the data in the indicators. (Although I would still prefer putting them all in $2 using the mnemonic codes for ease of input.)

In the first University of California union catalog in the early 1980's we expanded the MARC indicators to hold an additional two bytes (or was it four?) so that we could record, for each MARC field, which library had contributed it. Our union catalog record was a composite MARC record with fields from any and all of the over 300 libraries across the University of California system that contributed to the union catalog as a dozen or so separate record feeds from OCLC and RLIN. We treated the added indicator bytes as sets of bits, turning on bits to represent the catalog feeds from the libraries. If two or more libraries submitted exactly the same MARC field we stored the field once and turned on a bit for each separate library feed. If a library submitted a field that was new to the record, we added the field and turned on the appropriate bit. When we created a user display we selected fields from only one of the libraries. (The rules for that selection process were something of a secret so as not to hurt anyone's feelings, but there was a "best" record for display.) It was a multi-library MARC record, made possible by the ability to use more than two indicators.
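The mechanics of that trick are simple enough to show in a few lines. This sketch is purely illustrative; the feed names (and the use of Python rather than whatever we ran in the 1980's) are my own invention:

FEEDS = ["ucb_oclc", "ucla_rlin", "ucsd_oclc"]     # hypothetical feed names, one bit each

def add_contribution(field_bits: int, feed: str) -> int:
    # Turn on the bit for a feed that supplied an identical copy of this field
    return field_bits | (1 << FEEDS.index(feed))

def contributed_by(field_bits: int) -> list:
    # List the feeds whose bit is set for this field
    return [f for i, f in enumerate(FEEDS) if field_bits & (1 << i)]

bits = add_contribution(0, "ucb_oclc")
bits = add_contribution(bits, "ucsd_oclc")
print(contributed_by(bits))    # ['ucb_oclc', 'ucsd_oclc']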

Now on to the subfield code. The rule for MARC21 is that there is a single subfield code and that is a lower case a-z and 0-9. The numeric codes have special meaning and do not vary by field; the alphabetic codes are a bit more flexible. That gives us 26 possible subfields per tag, plus the 10 pre-defined numeric ones. The MARC21 standard has chosen to limit the alphabetic subfield codes to lower case characters. As the fields reached the limits of the available subfield codes (and many did over time) you might think that the easiest solution would be to allow upper case letters as subfield codes. Although the subfield code limitation was reached decades ago for some fields I can personally attest to the fact that suggesting the expansion of subfield codes to upper case letters was met with horrified glares at the MARC standards meeting. While clearly in 1968 the range of a-z seemed ample, that has not been the case for nearly half of the life-span of MARC.

The MARC Leader allows one to define up to 9 characters total for subfield codes. The value in this Leader position includes the subfield delimiter, so this means that you can have a subfield delimiter and up to 8 characters to encode a subfield. Even expanding from a-z to aa-zz provides vastly more possibilities, and allowing upper case as well gives you a dizzying array of choices.

The other thing to mention is that there is no prescription that field tags must be numeric. They are limited to three characters in the MARC standard, but those could be a-z, A-Z, 0-9, not just 0-9, greatly expanding the possibilities for adding new tags. In fact, if you have been in the position to view internal system records in your vendor system you may have been able to see that non-numeric tags have been used for internal system purposes, like noting who made each edit, whether functions like automated authority control have been performed on the record, etc. Many of the "violations" of the MARC21 rules listed here have been exploited internally -- and since the early days of library systems.

There are other modifiable Leader values, in particular the one that determines the maximum length of a field, Leader byte 20. MARC21 has Leader 20 set at "4", meaning that fields cannot be longer than 9999 characters. That could be larger, although the record length field itself is only 5 bytes, so a record cannot be longer than 99999 characters. However, one could limit fields to 999 characters (Leader 20 set to "3") for an application that does less pre-composing of data than MARC21 and therefore fits comfortably within a shorter field length.

The reason that has been given, over time, why none of these changes were made was always: it's too late, we can't change our systems now. This is, as Caesar might have said, cacas tauri. Systems have been able to absorb some pretty intense changes to the record format and its contents, and a change like adding more subfield codes would not be impossible. The problem is not really with the MARC21 record but with our inability (or refusal) to plan and execute the changes needed to evolve our systems. We could sit down today and develop a plan and a timeline. If you are skeptical, here's an example of how one could manage a change in length to the subfield codes:

a MARC21 record is retrieved for editing
  1. read Leader byte 11 of the MARC21 record
  2. if the value is "2" and you need to add a new subfield that uses a two-character subfield code, convert all of the subfield codes in the record:
    • $a becomes $aa, $b becomes $ba, etc.
    • $0 becomes $01, $1 becomes $11, etc.
    • Leader byte 11 is changed to "3"
  3. (alternatively, convert all records opened for editing)

a MARC21 record is retrieved for display
  1. read Leader byte 11 of the MARC21 record
  2. if the value is "2" use the internal table of subfield codes for records with the value "2"
  3. if the value is "3" use the internal table of subfield codes for records with the value "3"
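To make the edit-time steps above a bit more concrete, here is a minimal sketch in Python. It assumes a toy record structure (a 24-character leader string plus a list of fields holding (code, value) subfield pairs); a real system would of course work against its own record objects:

def widen_subfield_code(code: str) -> str:
    # Follow the mapping above: $a -> $aa, $b -> $ba, $0 -> $01, $1 -> $11
    return code + ("a" if code.isalpha() else "1")

def upgrade_record(leader: str, fields):
    # If Leader/11 is "2", widen every subfield code and set Leader/11 to "3"
    if leader[11] != "2":
        return leader, fields                 # already converted
    new_fields = [
        (tag, [(widen_subfield_code(code), value) for code, value in subfields])
        for tag, subfields in fields
    ]
    return leader[:11] + "3" + leader[12:], new_fields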

Sounds impossible? We moved from AACR to AACR2, and now from AACR2 to RDA, without going back and converting all of our records to the new rules. We have added new fields to our records, such as the 336, 337, and 338 for RDA values, without converting all of the earlier records in our files to include these fields. The same goes for new subfields, like $0, which has only been added in recent years. Our files have been using mixed record types for at least a couple of generations -- generations of systems and generations of catalogers.

Alas, the time to make these kinds of changes was many years ago. Would it be worth doing today? That depends on whether we anticipate a change to BIBFRAME (or some other data format) in the near future. Changes do continue to be made to the MARC21 record; perhaps it would have a longer future if we could broach the subject of fixing some of the errors that were introduced in the past, in particular those that arose because of limitations of MARC21 that could be rectified by expanding the record standard. That might also help us avoid carrying over into a new record format problems that are caused by these limitations, since a new format does not need to be limited in these ways.

Epilogue


Although the MARC record was incredibly advanced compared to other data formats of its time (the mid-1960's), it has some limitations that cannot be overcome within the standard itself. One obvious one is the limitation of the record length field to 5 bytes. Another is the fact that there are only two levels of nesting of data: the field and the subfield. There are times when a sub-subfield would be useful, such as when adding information that relates to only one subfield, not the entire field (provenance, external URL link). I can't advocate for continuing the data format that is often called "binary MARC" simply because it has limitations that require work-arounds. MARCXML, as defined as a standard, gets around the field and record length limitations, but it is not allowed to vary from the MARC21 limitations on field and subfield coding. It would be incredibly logical to move to a "non-binary" record format (XML, JSON, etc.), beginning with the existing MARC21 and allowing expansions where needed. It is the stubborn adherence to the ISO 2709 format that has really limited library data, and it is all the more puzzling because other solutions that can keep the data itself intact have been available for many decades.

Wednesday, April 12, 2017

If It Ain't Broke

For the first time in over forty years there is serious talk of a new metadata format for library bibliographic data. This is an important moment.

There is not, however, a consensus within the profession on the need to replace the long-standing MARC record format with something different. A common reply to the suggestion that library data creation needs a new data schema is the phrase: "If it ain't broke, don't fix it." This is more likely to be uttered by members of the cataloging community - those who create the bibliographic data that makes up library catalogs - than by those whose jobs entail systems design and maintenance. It is worth taking a good look at the relationship that catalogers have with the MARC format, since their view is informed by decades of daily encounters with a screen of MARC encoding.

Why This Matters

When the MARC format was developed, its purpose was clear: it needed to provide the data that would be printed on catalog cards produced by the Library of Congress. Those cards had been printed for over six decades, so there was no lack of examples to use to define the desired outcome. In ways unimagined at the time, MARC would change, nay, expand the role of shared cataloging, and would provide the first online template for cataloging.

Today work is being done on the post-MARC data schema. However, how the proposed new schema might change the daily work of catalogers is unclear. There is some anxiety in the cataloging community about this, and it is understandable. What I unfortunately see is a growing distrust of this development on the part of the data creators in our profession. It has not been made clear what their role is in the development of the next "MARC," not even whether their needs are a driving force in that development. Surely a new model cannot be successful without the consideration (or even better, the participation) of the people who will spend their days using the new data model to create the library's data.

(An even larger question is the future of the catalog itself, but I hardly know where to begin on that one.)


If it Ain't Broke...

The push-back against proposed post-MARC data formats is often seen as a blanket rejection of change. Undoubtedly this is at times the case. However, given that there have now been multiple generations of catalogers who worked and continue to work with the MARC record, we must assume that the members of the cataloging community have in-depth knowledge of how that format serves the cataloging function. We should tap that knowledge as a way to understand the functionality in MARC that has had a positive impact on cataloging for four decades, and should study how that functionality could be carried forward into the future bibliographic metadata schema.

I asked on Twitter for input on what catalogers like about MARC, and received some replies. I also viewed a small number of presentations by catalogers, primarily those about proposed replacements for MARC. From these I gathered the following list of "what catalogers like about MARC." I present these without comment or debate. I do not agree with all of the statements here, but that is no matter; the purpose here is to reflect cataloger perspectives.

(Note: This list is undoubtedly incomplete and I welcome comments or emails with your suggestions for additions or changes.)


What Catalogers Like/Love About MARC



There is resistance to moving away from using the MARC record for cataloging among some in the Anglo-American cataloging community. That community has been creating cataloging data in the MARC formats for forty years. For these librarians, MARC has many positive qualities, and these are qualities that are not perceived to exist in the proposals for linked data. (Throughout the sections below, read "library cataloging" and variants as referring to the Anglo-American cataloging tradition that uses the MARC format and the Anglo-American Cataloging Rules and its newer forms.)

MARC is Familiar

Library cataloging makes use of a very complex set of rules that determine how a resource is described. Once the decisions are made regarding the content of the description, those results are coded in MARC. Because the creation of the catalog record has been done in the MARC format since the late 1970's, working catalogers today have known only MARC as the bibliographic record format and the cataloging interface. Catalogers speak in "MARC" - using the tags to name data elements - e.g. "245" instead of "title proper".

MARC is WYSIWYG

Those who work with MARC consider it to be "human readable." Most of the description is text, therefore what the cataloger creates is exactly what will appear on the screen in the library catalog. If a cataloger types "ill." that is what will display; if the cataloger instead types "illustrations" then that is what will display. In terms of viewing a MARC record on a screen, some cataloger displays show the tags and codes to one side, and the text of those elements is clearly readable as text.

MARC Gives Catalogers Control

The coding is visible, and therefore what the cataloger creates on the screen is virtually identical to the machine-readable record that is being created. Everything that will be shown in the catalog is in the record (with the exception of cover art, at least in some catalogs). The MARC rules say that the order of fields and subfields in the record is the order in which that information should be displayed in the catalog. Some systems violate this by putting the fields in numeric order, but the order of subfields is generally maintained. Catalogers wish to control the order of display and are frustrated when they cannot. In general, changing anything about the record with automated procedures can undo decisions made by catalogers as part of their work, and that is a cause of frustration for catalogers.

MARC is International

MARC is used internationally, and because the record uses numeric tags and alphanumeric codes, a record created in another country is readable by other MARC users. Note that this was also the purpose of the International Standard Bibliographic Description (ISBD), which instead of tags uses punctuation marks to delimit elements of the bibliographic description. If a cataloger sees this, but cannot read the text:

  245 02   |a לטוס עם עין אחת / |c דני בז.

it is still clear that this is a title field with a main title (no subtitle), followed by a statement of the author's name as provided on the title page of the book.

MARC is the Lingua Franca of Cataloging

This is probably the key point that comprises all of the above, but it is important to state it as such. This means that the entire workflow, the training materials, the documentation - all use MARC. Catalogers today think in MARC and communicate in MARC. This also means that MARC defines the library cataloging community in the way that a dialect defines the local residents of a region. There is pride in its "library-ness". It is also seen as expressing the Anglo-American cataloging tradition.

MARC is Concise

MARC is concise as a physical format (something that is less important today than it was in the 1960s when MARC was developed), and it is also concise on the screen. "245" represents "title proper"; "240" represents "uniform title"; "130" represents "uniform title main entry". Often an entire record can be viewed on a single screen, and the tags and subfield codes take up very little display space.

MARC is Very Detailed

MARC21 has about 200 tags currently defined, and each of these can have up to 36 subfields. There are about 2000 subfields defined in MARC21, although the distribution is uneven and depends on the semantics of the field; some fields have only a handful of subfields, and in others there are few codes remaining that could be assigned.

MARC is Flat

The MARC record is fairly flat, with only two levels of coding: field and subfield. This is a simple model that is easy to understand and easy to visualize.

MARC is Extensible

Throughout its history, the MARC record has been extended by adding new fields and subfields. There are about 200 defined fields which means that there is room to add approximately 600 more.

MARC has Mnemonics

Some coding is either consistent or mnemonic, which makes it easier for catalogers to remember the meaning of the codes. There are code blocks that refer to cataloging categories, such as the title block (2XX), the notes block (5XX) and the subject block (6XX). Some subfields have been reserved for particular functions, such as the numeric subfields $0-$8. In other cases, the mnemonic is used in certain contexts, such as the use of subfield "v" for the volume information of series. In other fields, the "v" may be used for something else, such as the "form" subfield in subject fields, but the context makes it clear.

There are also field mnemonics. For example, all tagged fields that have "00" in the second and third places are personal name fields. All fields and subfields that use the number 9 are locally defined (with a few well-known exceptions).

MARC is Finite and Authoritative

MARC defines a record that is bounded. What you see in the record is all of the information that is being provided about the item being described. The concept of "infinite graphs" is hard to grasp, and hard to display on a screen. This also means that MARC is an authoritative statement of the library bibliographic description, whereas graphs may lead users to sources that are not approved by or compatible with the library view.

Thursday, April 06, 2017

Precipitating Forward

Our Legacy, Our Mistake


If you follow the effort taking place around the proposed new bibliographic data standard, BIBFRAME, you may have noticed that much of what is being done with BIBFRAME today begins with our current data in MARC format and converts it to BIBFRAME. While this is a function that will be needed should libraries move to a new data format, basing our development on how our legacy data converts is not the best way to move forward. In fact, it doesn't really tell us what "forward" might look like if we give it a chance.

We cannot define our future by looking only at our past. There are some particular aspects of our legacy data that make this especially true.          

I have said before (video, article) that we made a mistake when we went from printing cards using data encoded in MARC, to using MARC in online catalogs. The mistake was that we continued to use the same data that had been well-adapted to card catalogs without making the changes that would have made it well-adapted to computer catalogs. We never developed data that would be efficient in a database design or compatible with database technology. We never really moved from textual description to machine-actionable data points. Note especially that computer catalogs fail to make use of assigned headings as they are intended, yet catalogers continue to assign them at significant cost.

One of the big problems in our legacy data that makes it hard to take advantage of computing technology is that the data tends to be quirky. Technology developers complain that the data is full of errors (as do catalogers), but in fact it is very hard to define, algorithmically, what is an error in our data.  The fact is that the creation of the data is not governed by machine rules; instead, decisions are made by humans with a large degree of freedom. Some fields are even defined as being either this or that, something that is never the case in a data design. A few fields are considered required, although we've all seen records that don't have those required fields. Many fields are repeatable and the order of fields and subfields is left to the cataloger, and can vary.

The cataloger view is of a record of marked-up text. Computer systems can do little with text other than submit it for keyword indexing and display it on the screen. Technical designers look to the fixed fields for precise data points that they can operate on, but these are poorly supported and are often not included in the records since they don't look like "cataloging" as it is defined in libraries. These coded data elements are not defined by the cataloging code, either, and can be seen as mere "add-ons" that come with the MARC record format. The worst of it is that they are almost uniformly redundant with the textual data yet must be filled in separately, an extra step in the cataloging process that some cannot afford.

The upshot of this is that it is very hard to operate over library catalog data algorithmically. It is also very difficult to do any efficient machine validation to enforce consistency in the data. If we carry that same data and those same practices over to a different metadata schema, it will still be very hard to operate over algorithmically, and it will still be hard to do quality control as a function of data creation.

The counter argument to this is that cataloging is not a rote exercise - that catalogers must make complex decisions that could not be done by machines. If cataloging were subject to the kinds of data entry rules that are used in banking and medical and other modern systems, then the creativity of the cataloger's work would be lost, and the skill level of cataloging would drop to mere data entry.

This is the same argument one could use for any artisanal activity. If we industrialize the act of making shoes, the skills of the master shoe-maker are lost. However, if we do not industrialize shoe production, only a very small number of people will be able to afford to wear shoes.

This decision is a hard one, and I sympathize with the catalogers who are very proud of their understanding of the complexity of the bibliographic world. We need people who understand that complexity. Yet increasingly we are not able to afford to support the kind of cataloging practices of which we are proud. Ideally, we would find a way to channel those skills into a more efficient workflow.

There is a story that I tell often: In the very early days of the MARC record, around the mid-1970's, many librarians thought that we could never have a "computer catalog" because most of our cataloging existed only on cards, and we could NEVER go back and convert the card catalogs, retype every card into MARC. At that same time, large libraries in the University of California system were running over 100,000-150,000 cards behind in their filing. For those of you who never filed cards... it was horribly labor intensive. Falling 150,000 cards behind meant that a book was on the shelf THREE MONTHS before the cards were in the catalog. Some of this was the "fault" of OCLC which was making it almost too easy to create those cards. Another factor was a great increase in publishing that was itself facilitated by word processing and computer-driven typography. Within less than a decade it became more economical to go through the process of conversion from printed cards to online catalogs than to continue to maintain enormous card catalogs. And the rest is history. MARC, via OCLC, created a filing crisis, and in a sense it was the cost of filing that killed the card catalog, not the thrill of the modern online catalog.

The terrible mistake that we made back then was that we did not think about what was different between the card catalog and the online catalog, and we did not adjust our data creation accordingly. We carried the legacy data into the new format which was a disservice to both catalogers and catalog users. We missed an opportunity to provide new discovery options and more efficient data creation.

We mustn't make this same mistake again.

The Precipitant

Above I said that libraries made the move into computer-based catalogs because it was uneconomical to maintain the card catalog. I don't know what the precipitant will be for our current catalog model, but there are some rather obvious places to look to for that straw that will break the MARC/ILS back. These problems will probably manifest themselves as costs that require the library to find a more efficient and less costly solution. Here are some of the problems that I see today that might be factors that require change:

  • Output rates of intellectual and cultural products are increasing. Libraries have already responded to this through shared cataloging and the purchase of cataloging from product vendors. However, the records produced in this way are then loaded into thousands of individual catalogs in the MARC-using community.
  • Those records are often edited for correctness and enhanced. Thus they are costing individual libraries a large amount of money, potentially as much or more than libraries save by receiving the catalog copy.
  • Each library must pay for a vendor system that can ingest MARC records, facilitate cataloging, and provide full catalog user (patron) support for searching and display.
  • "Sharing" in today's environment means exporting data and sending it as a file. Since MARC records can only be shared as whole records, updates and changes generally are done as a "full record replace" which requires a fair amount of cycles. 
  • The "raw" MARC record as such is not database friendly, so records must be greatly massaged in order to store them in databases and provide indexing and displays. Another way to say this is that there are no database technologies that know about the MARC record format. There are database technologies that natively accept and manage other data formats, such as key-value pairs

There are some current technologies that might provide solutions:

  • Open source. There is already use of open source technology in some library projects. Moving more toward open source would be facilitated by moving away from a library-centric data standard and using at least a data structure that is commonly deployed in the information technology world. Some of this advantage has already been obtained with using MARCXML.
  • The cloud. The repeated storing of the same data in thousands of catalogs means not being able to take advantage of true sharing. In a cloud solution, records would be stored once (or in a small number of mirrors), and a record enhancement would enhance the data for each participant without being downloaded to a separate system. This is similar to what is being proposed by OCLC's WorldShare and Ex Libris' Alma, although presumably those are "starter" applications. Use of the cloud for storage might also mean less churning of data in local databases; it could mean that systems could be smaller and more agile.
  • NoSQL databases and triple stores. The current batch of databases are open source, fast, and can natively process data in a variety of formats (although not MARC). Data does not have to be "pre-massaged" in order to be stored in a database or retrieved and the database technology and the data technology are in sync. This makes deployment of systems easier and faster. There are NoSQL database technologies for RDF. Another data format that has dedicated database technology is XML, although that ship may have sailed by now.
  • The web. The web itself is a powerful technology that retrieves distributed data at astonishing rates. There are potential cost/time savings on any function that can be pushed out the web to make use of its infrastructure. 

The change from MARC to ?? will come and it will be forced upon us through technology and economics. We can jump to a new technology blindly, in a panic, or we can plan ahead. Duh.



Monday, July 23, 2012

Futures and Options

No, I'm not talking about the stock market, but about the options that we have for moving beyond the MARC format for library data. You undoubtedly know that the Library of Congress has its Bibliographic Framework Transition Initiative that will consider these options. In an ALA Webinar last week I proposed my own set of options -- undoubtedly not as well-studied as LC's will be, but I offer them as one person's ideas.

It helps to remember the three database scenarios of RDA. These show a progressive view of moving from the flat record format of MARC to a relational database. The three RDA scenarios (which should be read from the bottom up) are:

  1. Relational database model -- In this model, data is stored as separate entities, presumably following the entities defined in FRBR. Each entity has a defined set of data elements and the bibliographic description is spread across these entities which are then linked together using FRBR-like relationships.
  2. Linked authority files -- The database has bibliographic records and has authority records, and there are machine-actionable links between them. These links should allow certain strings, like name headings, to be stored only once, and should reflect changes to the authority file in the related bibliographic records.
  3. Flat file model -- The database has bibliographic records and it has authority records, but there is no machine-actionable linking between the two. This is the design used by some library systems, but it is also a description of the situation that existed with the card catalog.

These move from #3, being the least desirable, to #1, being the intended format of RDA data. I imagine that the JSC may not precisely subscribe to these descriptions today because of course in the few years since the document was created the technology environment has changed, and linked data now appears to be the goal. The models are still interesting in the way that they show a progression.

I also have in mind something of a progression, or at least a set of three options that move from least to most desirable. To fully explain each of these in sufficient detail will require a significant document, and I will attempt to write up such an explanation for the Futurelib wiki site. Meanwhile, here are the options that I see, with their advantages and disadvantages. The order, in this case, is from what I see as least desirable (#3, in keeping with the RDA numbering) to most desirable (#1).

#3 Serialization of MARC in RDF

Advantages

  • mechanical - requires no change to the data
  • would be round-trippable, similar to MARCXML
  • requires no system changes, since it would just be an export format

Disadvantages

  • does not change the data at all - all of the data remains as text strings, which do not link
  • keeps library data in a library-only silo
  • library data will not link to any non-library sources, and even linking to library sources will be limited because of the profusion of text strings

#2 Extraction of linked data from MARC records

Advantages

  • does not require library major system changes because it extracts data from current MARC format
  • some things (e.g. "persons") can be given linkable identifiers that will link to other Web resources
  • the linked data can be re-extracted as we learn more, so we don't have to get it right or complete the first time
  • does not change the work of catalogers

Disadvantages

  • probably not round-trippable with MARC
  • the linked data is entirely created by programs and algorithms, so it doesn't get any human quality control (think: union catalog de-duping algorithms)
  • capabilities of the extracted data are limited by what we have in records today, similar to the limitations of attempting to create RDA in MARC

#1 Linked data "all the way down", e.g. working in linked data natively

Advantages

  • gives us the greatest amount of interoperability with web resources and the most integration with the information in that space
  • allows us to implement the intent of RDA
  • allows us to create interesting relationships between resources and possibly serve users better

Disadvantages

  • requires new library systems
  • will probably instigate changes in cataloging practice
  • presumably entails significant costs, but we have little ability to develop a cost/benefit analysis

There is a lot behind these three options that isn't explained here, and I am also interested in hearing other options that you see. I don't think that our options are only three -- there could be many points between them -- but this is intended to be succinct.

To conclude, I don't see much, if any, value in my option #3; #2 is already being done by the British Library, OCLC, and the National Library of Spain; I have no idea how far in our future #1 is, nor even if we'll get there before the next major technology change. If we can't get there in practice, we should at least explore it in theory because I believe that only #1 will give us a taste of a truly new bibliographic data model.

Friday, April 06, 2012

If not RDF, then what?

There's no question that the data format known as RDF is darned difficult. Let's suppose that we in the library world decide not to hitch our wagon to RDF, but would still like to create a new bibliographic framework. After all, if MARC simply won't work for the creation of RDA records, we still need something besides MARC that we can use to create data. And even if (although this is unlikely) we should decide not to move to RDA, our records still need some upgrading to fit better into current data processing models. We still need to:
  • define our entities
  • use data wherever possible, not text
  • use identifiers for things
  • relate attributes to entities (that is, say things about some thing)
  • use a mainstream serialization

Should we do this, the mainstream serialization could be anything from JSON to XML to RDF. In fact, it could be all of those if we play our cards right and define our data in a format neutral way. RDA does some of this for us, but not all. In particular, RDA does not distinguish between data and text, and although it allows for the use of identifiers it doesn't give any guidance on how to use them. RDA is probably fine as guidance rules for decision-making, but it needs the corresponding data definition before it becomes useful. Having that data definition could help to clarify some ambiguities in RDA. We have to expect that there will need to be some iteration between RDA and a data definition. (I will post shortly on a problem that I have run into.)

It also seems to me that we have everything to gain by beginning our work on a data format with no particular serialization in mind. We could go from RDA to RDA-as-data and then on to RDA-as-RDF. I see some dangers in skipping the middle step, mainly that we could end up making some decisions that fit RDA into RDF but that are problematic for other serializations.
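To make that "format neutral" middle step concrete, here is a rough sketch: one description held as plain element/value pairs, then emitted both as JSON and as RDF-style triples. The element names and URIs are placeholders of my own invention, not registered RDA terms:

import json

description = {
    "titleProper": {"type": "text", "value": "The Birds of North America"},
    "creator":     {"type": "identifier", "value": "http://example.org/person/123"},
}

print(json.dumps(description, indent=2))       # one serialization: JSON

subject = "<http://example.org/manifestation/456>"
for element, obj in description.items():
    predicate = "<http://example.org/elements/" + element + ">"
    if obj["type"] == "identifier":            # data: emit a link
        print(subject + " " + predicate + " <" + obj["value"] + "> .")
    else:                                      # text: emit a literal
        print(subject + " " + predicate + " \"" + obj["value"] + "\" .")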

Tuesday, November 01, 2011

Future Format: Goals and Measures

The LC report on the future bibliographic format (aka replacement for MARC) is out. The report is short and has few specifics, other than the selection of RDF as the underlying data format. A significant part of the report lists requirements; these, too, are general in nature and may not be comprehensive.

What needs to be done before we go much further is to begin to state our specific goals and the criteria we will use to determine if we have met those goals. Some goals we will discover in the course of developing the new environment, so this should be considered a growing list. I think it is important that every goal have measurements associated with it, to the extent possible. It makes no sense to make changes if we cannot know what those changes have achieved. Here are some examples of the kinds of things I am thinking of in terms of goals; these may not be the actual goals of the project, they are just illustrations that I have invented.

COSTS
 - goal: it should be less expensive to create the bibliographic data during the cataloging process
   measurement: using time studies, compare cataloging in MARC and in the new format
 - goal: it should be less expensive to maintain the format
   measurement: compare the total time required for a typical MARBI proposal to the time required for the new format
 - goal: it should be less expensive for vendors to make required changes or additions
   measurement: compare the number of programmer hours needed to make a change in the MARC environment and the new environment

COLLABORATION
 - goal: collaboration on data creation with a wider group of communities
   measurement: count the number of non-library communities that we are sharing data with before and after
 - goal: greater participation of small libraries in shared data
   measurement: count number of libraries that were sharing before and after the change
 - goal: make library data available for use by other information communities
   measurement: count use of library data in non-library web environments before and after

INNOVATION
 - goal: library technology staff should be able to implement "apps" for their libraries faster and easier than they can today.
   measurement: either number of apps created, or a time measure to implement (this one may be hard to compare)
 - goal: library systems vendors can develop new services more quickly and more cheaply than before
   measurement: number of changes made in the course of a year, or number of staff dedicated to those changes. Another measurement would be what libraries are charged and how many libraries make the change within some stated time frame

As you can tell from this list, most of the measurements require system implementation, not just the development of a new format. But the new format cannot be an end in itself; the goal has to be the implementation of systems and services using that format. The first MARC format that was developed was tested in the LC workflow to see if it met the needs of the Library. This required the creation of a system (called the "MARC Pilot Project") and a test period of one year. The testing that took place for RDA is probably comparable and could serve as a model. Some of the measurements will not be available before full implementation, such as the inclusion of more small libraries. Continued measurement will be needed.

So, now, what are the goals that YOU especially care about?

Sunday, September 18, 2011

Meaning in MARC: Indicators

I have been doing a study of the semantics of MARC data on the futurelib wiki. An article on what I learned about the fixed fields (00X) and the number and code fields (0XX) appeared in the code4lib journal, issue 14, earlier this year. My next task was to tackle the variable fields in the MARC range 1XX-8XX.

This is a huge task, so I started by taking a look at the MARC indicators in this tag range, and have expanded this to a short study of the role that indicators play in MARC. I have to say that it is amazing how much one can stretch the MARC format with one or two single-character data elements.

Indicators have a large number of effects on the content of the MARC fields they modify. Here is the categorization that I have come up with, although I'm sure that other breakdowns are equally plausible.

I. Indicators that do not change the meaning of the field

There are indicators that have a function, but it does not change the meaning of the data in the field or subfields.
  • Display constants: some, but not all, display constants merely echo the meaning of the tag, e.g. 775 Other Edition Entry, Second Indicator
    Display constant controller
    # - Other edition available
    8 - No display constant generated
  • Trace/Do not trace: I consider these indicators to be carry-overs from card production.
  • Non-filing indicators: similar to indicators that control displays, these indicators make it possible to sort (formerly, "file") titles properly, ignoring initial articles ("The ", "A ", etc.).
  • Existence in X collection: there are indicators in the 0XX range that let you know if the item exists in the collection of a national library. 
II. Indicators that do change the meaning of the field

Many indicators serve as a way to expand the meaning of a field without requiring the definition of a new tag.
  • Identification of the source or agency: a single field like a 650 topical subject field can have content from an unlimited list of controlled vocabularies because the indicator (or the indicator plus the $2 subfield) provides the identity of the controlled vocabulary. (A small sketch of this follows the list.)
  • Multiple types in a field: some fields can have data of different types, controlled by the indicator. For example, the 246 (Varying form of title) has nine different possible values, like Cover title or Spine title, controlled by a single indicator value. 
  • Pseudo-display controllers: the same indicator type that carries display constants that merely echo the meaning of the field also has a number of instances where the display constant actually indicates a different meaning for the field. One example is the 520 (Summary, etc.) field with display constants for "subject," "review," "abstract," and others. 
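As a sketch of the first category, resolving the vocabulary behind a 650 field takes both the second indicator and, when the indicator is "7", the code carried in $2. The mapping below is deliberately partial; the full value list is in the MARC21 documentation:

SOURCE_BY_INDICATOR = {
    "0": "LCSH",
    "2": "MeSH",
    # ... other indicator values omitted
}

def subject_source(indicator2, subfield_2=None):
    if indicator2 == "7":                  # "7" defers to the source code in $2
        return subfield_2
    return SOURCE_BY_INDICATOR.get(indicator2)

print(subject_source("0"))          # LCSH
print(subject_source("7", "aat"))   # aat (Art & Architecture Thesaurus)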
Some Considerations

Given the complexity of the indicators there isn't a single answer to how this information should be interpreted in a semantic analysis of MARC. I am inclined to consider the display constants and tracing indicators in section I to not have meaning that needs to be addressed. These are parts of the MARC record that served the production of card sets but that should today be functions of system customization. I would argue that some of these have local value but are possibly not appropriate for record sharing.

The non-filing indicators are a solution to a problem that is evident in so many bibliographic applications. When I sort by title in Zotero or Mendeley, a large portion of the entries are sorted under "The." The world needs a solution here, but I'm not sure what it is going to be. One possibility is to create two versions of a title: one for display, with the initial article, and one for sorting, without. Systems could do the first pass at this, as they often do today when taking author names and inverting them into "familyname, forenames" order. Of course, humans would have to have the ability to make corrections where the system got it wrong.
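A first pass at the "two versions of a title" idea is only a few lines; the article list here is illustrative (and, in real data, language-dependent):

ARTICLES = ("the ", "a ", "an ")     # English-only, for illustration

def sort_title(display_title: str) -> str:
    # Keep the display form as-is; derive a second, sortable form without the article
    lowered = display_title.lower()
    for article in ARTICLES:
        if lowered.startswith(article):
            return display_title[len(article):]
    return display_title

print(sort_title("The Birds of North America"))   # Birds of North America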

The indicators that identify the source of a controlled vocabulary could logically be transformed into a separate data element for each vocabulary (e.g. "LCSH," "MeSH"). However, the number of different vocabularies is, while not infinite, very large and growing (as evidenced by the practice in MARC to delegate the source to a separate subfield that carries codes from a controlled list of sources), so producing a separate data element for each vocabulary is unwieldy, to say the least. At some future date, when controlled vocabularies "self-identify" using URIs this may be less of a problem. For now, however, it seems that we will need to have multi-part data elements for controlled vocabularies that include the source with the vocabulary terms.

The indicators that further sub-type a field, like the 520 Summary field, can be fairly easily given their own data element since they have their own meaning. Ideally there would be a "type/sub-type" relationship where appropriate.


And Some Problems

There are a number of areas that are problematic when it comes to the indicator values. In many cases, the MARC standard does not make clear if the indicator modifies all subfields in the field, or only a select few. In some instances we can reason this out: the non-filing indicators only refer to the left-most characters of the field, so they can only refer to the $a (which is mandatory in each of those fields). On the other hand, for the values in the subject area (6XX) of the record, the source indicator relates to all of the subject subfields in the field. I assume, however, that in all cases the control subfields $3-$8 perform functions that are unrelated to the indicator values. I do not know at this point if there are fields in which the indicators function on some other subset of the subfields between $a and $z. That's something I still need to study.

I also see a practical problem in making use of the indicator values in any kind of mapping from MARC to just about anything else. In 60% of MARC tags, one or both indicator positions are undefined. Undefined indicators are represented in the MARC record with blanks. Unfortunately, there are also defined indicators that have a meaning assigned to the character "blank." There is nothing in the record itself to differentiate a blank that is a value from a blank that merely fills an undefined position. Any transformation from MARC to another format has to have knowledge about every tag and its indicators in order to do anything with these elements. This is another example of the complexity of MARC for data processing, and yet another reason why a new format could make our lives easier.
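A sketch of what that tag-level knowledge looks like in practice: a hypothetical, partial table of indicator definitions, without which a converter cannot tell a meaningful blank from mere filler. The 775 case is the one quoted earlier; the 500 General Note field has both indicators undefined:

INDICATOR_DEFS = {
    # (tag, position): map of defined values, or None if the position is undefined
    ("775", 2): {" ": "Other edition available", "8": "No display constant generated"},
    ("500", 1): None,
    ("500", 2): None,
}

def indicator_meaning(tag, position, value):
    defs = INDICATOR_DEFS.get((tag, position))
    if defs is None:
        return None              # undefined position: ignore it in any mapping
    return defs.get(value)       # defined position: blank may carry meaning

print(indicator_meaning("775", 2, " "))   # Other edition available
print(indicator_meaning("500", 2, " "))   # None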

More on the Wiki

For anyone else who obsesses on these kinds of things there is more detail on all of this on the futurelib wiki. I welcome comments here, and on the wiki. If you wish to comment on the wiki, however, I need to add your login to the site (as an anti-spam measure). I will undoubtedly continue my own obsessive behavior related to this task, but I really would welcome collaboration if anyone is so inclined. I don't think that there is a single "right answer" to the questions I am asking, but am working on the principle that some practical decisions in this area can help us as we work on a future bibliographic carrier.

Friday, September 09, 2011

MARC vs RDA

As LC ponders the task of moving to a bibliographic framework, I can't help but worry about how much the past is going to impinge on our future. It seems to me that we have two potentially incompatible needs at the moment: the first is to fix MARC, and the second is to create a carrier for RDA.

Fixing MARC

For well over a decade some of us have been suggesting that we need a new carrier for the data that is currently stored in the MARC format. The record we work with today is full of kludges brought on by limitations in the data format itself. To give a few examples:
  • 041 Language Codes - We have a single language code in the 008 and a number of other language codes (e.g. for original language of an abstract) in 041. The language code in the 008 is not "typed" so it must be repeated in the 041, which has separate subfields for different language codes. However, 041 is only included when more than one language code is needed. This means that there are always two places one must look to find language codes. (A small sketch of this two-place lookup follows this list.)
  • 006 Fixed-Length Data Elements, Additional Material Characteristics - The sole reason for the existence of the 006 is that the 008 is not repeatable. The fixed-length data elements in the 006 are repeats of format-specific elements in the 008 so that information about multi-format items can be encoded.
  • 773 Host Item Entry - All of the fields for related resources (76X-78X) have the impossible task of encoding an entire bibliographic description in a single field. Because there are only 26 possible subfields (a-z) available for the bibliographic data, data elements in these fields are not coded the same as they are in other parts of the record. For example, in the 773 the entire main entry is entered in a single subfield ("$aDesio, Ardito, 1897-") as opposed to the way it is coded in any X00 field ("$aDesio, Ardito,$d1897-").
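Here is the small sketch of the two-place lookup for language codes mentioned in the first bullet, using a simplified dict as a stand-in for the record (the 041 subfield meanings are abbreviated; $h, for example, carries the original language):

def all_language_codes(record):
    # Collect every language code, tagged with where it came from
    codes = []
    fixed = record.get("008", "")
    if len(fixed) >= 38:
        codes.append(("008", fixed[35:38]))       # the single, untyped 008 code
    for subfield, value in record.get("041", []):
        codes.append(("041$" + subfield, value))  # the typed codes, when 041 is present
    return codes

record = {
    "008": 35 * " " + "eng" + 2 * " ",
    "041": [("a", "eng"), ("h", "ger")],          # e.g. a translation from German
}
print(all_language_codes(record))
# [('008', 'eng'), ('041$a', 'eng'), ('041$h', 'ger')]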
Had we "fixed" MARC ten years ago, there might be less urgency today to move to a new carrier. As it is, data elements that were added so that the RDA testing could take place have made the format look more and more like a Rube Goldberg contraption. The MARC record is on life support, kept alive only through the efforts of the poor folks who have to code into this illogical format.

A Carrier for RDA

The precipitating reason for LC's bibliographic framework project is RDA. One of the clearest results of the RDA tests that were conducted in 2010 was that MARC is not a suitable carrier for RDA. If we are to catalog using the new code, we must have a new carrier. I see two main areas where RDA differs "record-wise" from the cataloging codes that informed the MARC record:
  • RDA implements the FRBR entities
  • RDA allows the use of identifiers for entities and terms
Although many are not aware of it, there already is a solid foundation for an RDA carrier in the registered elements and vocabularies in the Open Metadata Registry. Not long ago I was able to show that one could use those elements and vocabularies to create an RDA record. A full implementation of RDA will probably require some expansion of the data elements of RDA because the current list that one finds in the RDA Toolkit was not intended to be fully detailed.

To my mind, the main complications about a carrier for RDA have to do with FRBR and how we can most efficiently create relationships between the FRBR entities and manage them within systems. I suspect that we will need to accommodate multiple FRBR scenarios, some appropriate to data storage and others more appropriate to data transmission.

Can We Do Both?

This is my concern: creating a carrier for RDA will not solve the MARC record problem; solving the MARC record problem will not provide a carrier for RDA. There may be a way to combine these two needs, but I fear that a combined solution would end up creating a data format that doesn't really solve either problem because of the significant difference between the AACR conceptual model and that of RDA/FRBR.

It seems that if we want to move forward, we may have to make a break with the past. We may need to freeze MARC for those users continuing to create pre-RDA bibliographic data, and create an RDA carrier that is true to the needs of RDA and the systems that will be built around RDA data, with any future enhancements taking place only to the new carrier. This will require a strategy for converting data in MARC to the RDA carrier as libraries move to systems based on RDA.

Next: It's All About the Systems

In fact, the big issue is not data conversion but what the future systems will require in order to take advantage of RDA/FRBR. This is a huge question, and I will take it up in a new post, but just let me say here that it would be folly to devise a data format that is not based on an understanding of the system requirements that can fulfill desired functionality and uses.

Wednesday, September 07, 2011

XML and Library Data Future

There is sometimes the assumption that the future data carrier for library data will be XML. I think this assumption may be misleading and I'm going to attempt to clarify how XML may fit into the library data future. Some of this explanation is necessarily over-simplified because a full exposition of the merits and de-merits of XML would be a tome, not a blog post.

What is XML?

The eXtensible Markup Language (XML) is a highly versatile markup language. A markup language is primarily a way to encode text or other expressions so that some machine-processing can be performed. That processing can manage display (e.g. presenting text in bold or italics) or it can be similar to metadata encoding of the meaning of a group of characters ("dateAndTime"). It makes the expression more machine-usable. It is not a data model in itself, but it can be used to mark up data based on a wide variety of models.*

XML is the flagship standard in a large family of markup languages, although not the first: it is an evolution of SGML, which had (perhaps necessary) complexities that rendered it very difficult for most mortals to use. SGML is also the conceptual granddaddy of HTML, a much simplified markup language that many of us take for granted.

Defining Metadata in XML

There is a difference between using XML as a markup for documents or data and using XML to define your data. XML has some inherent structural qualities that may not be compatible with what you want your data to be. There is a reason why XML "records" are generally referred to as "documents": they tend to be quite linear in nature, with a beginning, a middle, and an end, just like a good story.

XML's main structural functionality is that of nesting, or the creation of containers that hold separate bits of data together.

<paragraph>
   <sentence></sentence>
   <sentence></sentence> ...
</paragraph>

<name>
   <familyname></familyname>
   <forenames></forenames>
</name>

This is useful for document markup and also handy when marking up data. It is not unusual for XML documents to have nesting of elements many layers deep. This nesting, however, can be deceptive. Just because you have things inside other things does not mean that the relationship is anything more than a convenience for the application for which it was designed.

<customer>
    <customerNumber></customerNumber>
    <phoneNumber></phoneNumber>
</customer>

Nested elements are most frequently in a whole/part relationship, with the container representing the whole and holding the elements (parts) together as a unit (in particular a unit that can be repeated).

<address>
    <street1></street1>
    <street2></street2>
    <city></city>
    <state></state>
    <zip></zip>
</address>

While usually not hierarchical in the sense of genus/species or broader/narrower, this nesting has some of the same data processing issues that we find in other hierarchical arrangements:
  • The difficulty of placing elements in a single hierarchy when many elements could be logically located in more than one place. That problem has to be weighed against the inconvenience and danger of carrying the same data more than once in a record or system and the chances that these redundant elements will not get updated together.
  • The need to traverse the whole hierarchy to get to "buried" elements. This was the pain-in-the-neck that caused most data processing shops to drop hierarchical database management systems for relational ones. XML tools make this somewhat less painful, but not painless.
  • Poor interoperability. The same data element can be in different containers in different XML documents, but the data elements may not be usable outside the context of the containing element (e.g. "street2").
Nesting, like hierarchy, is necessarily a separation of elements from each other, and XML does not provide a way to bring these together for a different view. Contrast the container structure of XML and a graph structure.



In the nested XML structure some of the same data is carried in separate containers and there isn't any inherent relationship between them. Were this data entered into a relational database it might be possible to create those relationships, somewhat like the graph view. But as a record the XML document has separate data elements for the same data because the element is not separate from the container. In other words, the XML document has two different data elements for the zip code:

  address:zip
  censusDistrict:zip
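
To make the point concrete, here is a small sketch in Python (using the standard-library ElementTree module; the document and element names are invented for illustration) showing that an application has to know both container paths in order to reach what is, logically, the same piece of data:

import xml.etree.ElementTree as ET

# A toy document in which the same piece of data (the zip code)
# appears in two unrelated containers.
doc = ET.fromstring("""
<customer>
  <address><street1>123 Main St</street1><zip>94110</zip></address>
  <censusDistrict><district>210</district><zip>94110</zip></censusDistrict>
</customer>
""")

# Nothing in the markup says these two elements carry the same data;
# the application must know each containing path separately.
print(doc.findtext("address/zip"))          # 94110
print(doc.findtext("censusDistrict/zip"))   # 94110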

To use a library concept as an analogy, the nesting in XML is like pre-coordination in library subject headings. It binds elements together in a way that they cannot be readily used in any other context. Some coordination is definitely useful at the application level, but if all of your data is pre-coordinated it becomes difficult to create new uses for new contexts.

Avoid XML Pitfalls

XML does not make your data any better than it was, and it can be used to mark up data that is illogically organized and poorly defined. A misstep that I often see is data designers beginning to use XML before their data is fully described, and therefore letting the structure and limitations of XML influence what their data can express. Be very wary of any project that decides that the data format will be XML before the data itself has been fully defined.

XML and Library data

If XML had been available in 1965 when Henriette Avram was developing the MARC format, it would have been a logical choice for that data. The task that Avram faced was to create a machine-readable version of the data on the catalog card that would allow cards to be printed that looked exactly like the cards created prior to MARC. It was a classic document mark-up situation. Had XML been available then, our records could very well have evolved in a way that is different from what we have today, because XML would not have needed to separate fixed field data from variable field data, and expansion of some data areas might have been easier. But saying that XML would have been a good format in 1965 does not mean that it would be a good format in 2011.

For the future library data format, I can imagine that it will, at times, be conveyed over the internet in XML. If it can ONLY be conveyed in XML we will have created a problem for ourselves. Our data should be independent of any particular serialization and be designed so that it is not necessary to have any particular combination or nesting of elements in order to make use of the data. Applications that use the data can of course combine and structure the elements however they wish, but for our data to be usable in a variety of applications we need to keep the "pre-coordination" of elements to a minimum.



* For example, there is an XML serialization (essentially a record format) of RDF that is frequently used to exchange linked data, although other serializations are also often available. It is used primarily because there is a wide range of software tools available for making use of XML data in applications, and many fewer tools available for the more "native" RDF expressions such as N-Triples or Turtle. It encapsulates RDF data in a record format, and I suspect that using XML for this data will turn out to be a transitional phase as we move from record-based data structures to graph-based ones.
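
As a small illustration of the point in this footnote, here is a sketch using Python's rdflib (the URI and title are invented examples): the same triple can be written out as RDF/XML or as Turtle without any change to the underlying data.

from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCTERMS

g = Graph()
book = URIRef("http://example.org/book/1")   # invented identifier for illustration
g.add((book, DCTERMS.title, Literal("An example title")))

# The same graph, exchanged in two different serializations:
print(g.serialize(format="xml"))      # RDF/XML, the record-like form
print(g.serialize(format="turtle"))   # Turtle, a more "native" triple syntax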

Friday, August 26, 2011

New bibliographic framework: there is a way


Since my last post undoubtedly left readers with the idea that I have my head in the clouds about the future of bibliographic metadata, I wish to present here some of the reasons why I think this can work. Many of you were probably left thinking: Yeah, right. Get together a committee of a gazillion different folks and decide on a new record format that works for everyone. That, of course, would not be possible. But that's not the task at hand. The task at hand is actually just about the opposite of that. Here are a few parameters.


#1 What we need to develop is NOT a record format

The task ahead of us is to define an open set of data elements. Open, in this case, means usable and re-usable in a variety of metadata contexts. What wrapper (read: record format) you put around them does not change their meaning. Your chicken soup can be in a can, in a box, or in a bowl, but it is still chicken soup. That's the model we need for metadata. Content, not carrier. Meaning, not record format. Usable in many different situations.

#2 Everyone doesn't have to agree to use the exact same data elements

We only need to know the meaning of the data elements and what relationships exist between different data elements. For example, we need to know that my author and your composer are both persons and are both creators of the resource being described. That's enough for either of us to use the other's data under some circumstances. It isn't hard to find overlapping bits of meaning between different types of bibliographic metadata.

Not all metadata elements will overlap between communities. The cartographic community will have some elements that the music library community will never use, and vice versa. That's fine. That's even good. Each specialist community can expand its metadata to the level of detail that it needs in its area. If the music library finds a need to catalog a map, they can "borrow" what they need from the cartographic folks.

Where data elements are equivalent or are functionally similar, data definitions should include this information. Although defined differently, you can see that there are similarities among these data elements:
  • pbcoreTitle = a name given to the media item you are cataloging
  • RDA:titleProper = A word, character, or group of words and/or characters that names a resource or a work contained in it.
  • MARC:245 $a = title of a work
  • dublincore:title = A name given to the resource
All of these are types of titles, and they have a similar role in the descriptive cataloging of their respective communities: each names the target resource. These elements can therefore be considered members of a set defined as: data elements that name the target resource. Having this relationship defined makes it possible to use this data in different contexts, and even to bring these titles together into a unified display. This is no different from the way we create web pages with content from different sources like Flickr, YouTube, and a favorite music artist's web site, like the image here.

In this "My Favorites" case, the titles come from the Internet Movie Database, a library catalog display, the Billboard music site, and Facebook. It doesn't matter where they came from or what the data element was called at that site, what matters is that we know which part is the "name-of-the-thing" that we want to display here.

#3 You don't have to create all new data elements for your resources if appropriate ones already exist

When data elements are defined within the confines of a record, each community has to create an entire data element schema of its own, even if it is coding some elements that are also used by other communities. Yet there is no reason for different communities to each define their own data element for something like the ISBN; one will do. When data elements are fully defined apart from any particular record format, you can mix and match, borrowing from others as needed. This not only saves some time in the creation of metadata schemas but also means that those data elements are 100% compatible across the metadata instances that use them.

In addition, if there are elements that you need only rarely for less common materials in your environment, it may be more economical to borrow data elements created by specialist communities when they are needed, saving your community the effort of defining additional elements under your metadata name space.



To do all of this, we need to agree on a few basic rules.

1) We need to define our data elements in a machine-readable and machine-actionable way, preferably using a widely accepted standard.

This requires a data format for data elements that contains the minimum needed to make use of a defined data element. Generally, this minimum information is:
  • a name (for human readers)
  • an identifier (for machines)
  • a human-readable definition
  • both human- and machine-readable definitions of relationships to other elements (e.g. "equivalent to", "narrower than", "opposite of")
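
As a sketch of what such a machine-actionable element definition might look like, here is an example using Python's rdflib. The namespace, element name, and definition text are invented for illustration; the relationship to dcterms:creator simply shows the kind of mapping described in #2 above.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import DCTERMS, RDF, RDFS

EX = Namespace("http://example.org/elements/")   # invented namespace for the sketch

g = Graph()
composer = EX.composer                           # the URI is the identifier, for machines
g.add((composer, RDF.type, RDF.Property))
g.add((composer, RDFS.label, Literal("Composer", lang="en")))      # name, for human readers
g.add((composer, RDFS.comment,
       Literal("The person who composed the musical work being described.", lang="en")))  # definition
g.add((composer, RDFS.subPropertyOf, DCTERMS.creator))             # relationship to another element
print(g.serialize(format="turtle"))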

2) We must have the willingness and the right to make our decisions open and available online so others can re-use our metadata elements and/or create relationships to them.


3) We also must have a willingness to hold discussions about areas of mutual interest with other metadata creators and with metadata users. That includes the people we think of today as our "users": writers, scholars, researchers, and social network participants. Open communication is the key. Each of us can teach, and each of us can learn from others. We can cooperate on the building of metadata without getting in each other's way. I'm optimistic about this.

Thursday, August 25, 2011

Bibliographic Framework Transition Initiative

The Internet began as a U.S.-sponsored technology initiative that went global while still under U.S. government control. The transition of the Internet to a world-wide communication facility is essentially complete, and few would argue that U.S. control of key aspects of the network is appropriate today. It is, however, hard for those once in control to give it up, and we see that in ICANN, the body charged with making decisions about the name and numbering system that is key to Internet functioning. ICANN is under criticism from a number of quarters for continuing to be U.S.-centric in its decision-making. Letting go is hard, and being truly international is a huge challenge.

I see a parallel here with Library of Congress and MARC. While there is no question that MARC was originally developed by the Library of Congress, and has been maintained by that body for over 40 years, it is equally true that the format is now used throughout the world and in ways never anticipated by its original developers. Yet LC retains a certain ownership of the format, in spite of its now global nature, and it is surely time for that control to pass to a more representative body.

Some Background

MARC began in the mid-1960s as an LC project, at a time when the flow of bibliographic data was from LC to U.S. libraries in the form of card sets. MARC happened at a key point in time, when some U.S. libraries were themselves thinking of making use of bibliographic data in machine-readable form. It was the right idea at the right time.

In the following years numerous libraries throughout the world adopted MARC or adapted MARC to their own needs. By 1977 there had been so much diverse development in this area that libraries used the organizing capabilities of IFLA to create a unified standard called UNIMARC. Other versions of the machine-readable format continued to be created, however.

The tower of Babel that MARC originally spawned has now begun to consolidate around the latest version of the MARC format, MARC21. The reasons for this are manifold. First there are economic reasons: library system vendors have had to support this cacophony of data formats for decades, which increases their costs and decreases their efficiency. Having more libraries on a single standard means that a vendor has fewer code bases to develop and maintain. The second reason is the increased sharing of metadata between libraries. It is much easier to exchange bibliographic data between institutions using the same data format.

Today, MARC records, or at least MARC-like records, abound in the library sphere, and pass from one library system to another like packets over the Internet. OCLC has a database of about 200 million records in MARC format, with data received from some 70,000 libraries, admittedly not all of which use MARC in their own systems. The Library of Congress has contributed approximately 12 million of those. Within the U.S., the various cooperative cataloging programs have distributed the effort of original cataloging among hundreds of institutions. Many national libraries freely exchange their data with their cohorts in other countries as a way to reduce cataloging costs for everyone. The directional flow of bibliographic data is no longer from LC to other libraries, but is a many-to-many web of data creation and exchange.

Yet, much like ICANN and the Internet, LC remains the controlling agency over the MARC standard. The MARC Advisory Committee, which oversees changes to the format, has grown and has added members from Library and Archives Canada, the British Library, and the Deutsche Nationalbibliothek. However, the standard is still primarily maintained and issued by LC.

Bibliographic Framework Transition Initiative

LC recently announced the Bibliographic Framework Transition initiative to "determine a transition path for the MARC21 exchange format."
"This work will be carried out in consultation with the format's formal partners -- Library and Archives Canada and the British Library -- and informal partners -- the Deutsche Nationalbibliothek and other national libraries, the agencies that provide library services and products, the many MARC user institutions, and the MARC advisory committees such as the MARBI committee of ALA, the Canadian Committee on MARC, and the BIC Bibliographic Standards Group in the UK."
In September we should see the issuance of their 18-month plan.

Not included in LC's plan as announced are the publishers, whose data should feed into library systems and does feed into bibliographic systems like online bookstores. Archives and museums create metadata that could and should interact well with library data, and they should be included in this effort. Also not included are the academic users of bibliographic data, users who are so frustrated with library data that they have developed numerous standards of their own, such as BIBO, the Bibliographic Ontology, BIBJson, a JSON format for bibliographic data, and Fabio, the FRBR-Aligned Bibliographic Ontology. Nor are there representatives of online sites like Wikipedia and Google Books, which have an interest in using bibliographic data as well as a willingness to link back to libraries where that is possible. Media organizations, like the BBC and the U. S. public broadcasting community, have developed metadata for their video and sound resources, many of which find their way into library collections. And I almost forgot: library systems vendors. Although there is some representation on the MARC Advisory Committee, they need to have a strong voice given their level of experience with library data and their knowledge of the costs and affordances.

Issues and Concerns

There is one group in particular that is missing from the LC project as announced: information technology (IT) professionals. In normal IT development the users do not design their own system. A small group of technical experts design the system structure, including the metadata schema, based on requirements derived from a study of the users' needs. This is exactly how the original MARC format was developed: LC hired a computer scientist  to study the library's needs and develop a data format for their cataloging. We were all extremely fortunate that LC hired someone who was attentive and brilliant. The format was developed in a short period of time, underwent testing and cost analysis, and was integrated with work flows.

It is obvious to me that standards for bibliographic data exchange should not be designed by a single constituency, and should definitely not be led by a small number of institutions that have their own interests to defend. The consultation with other similar institutions is not enough to make this a truly open effort. While there may be some element of not wanting to give up control of this key standard, it also is not obvious to whom LC could turn to take on this task. LC is to be commended for committing to this effort, which will be huge and undoubtedly costly. But this solution is imperfect, at best, and at worst could result in a data standard that does not benefit the many users of bibliographic information.

The next data carrier for libraries needs to be developed as a truly open effort. It should be led by a neutral organization (possibly ad hoc) that can bring together  the wide range of interested parties and make sure that all voices are heard. Technical development should be done by computer professionals with expertise in metadata design. The resulting system should be rigorous yet flexible enough to allow growth and specialization. Libraries would determine the content of their metadata, but ongoing technical oversight would prevent the introduction of implementation errors such as those that have plagued the MARC format as it has evolved. And all users of bibliographic data would have the capability of metadata exchange with libraries.


Saturday, June 18, 2011

Opportunity knocks

There will soon be a call for reviews of the draft report by the W3C Incubator Group on Library Linked Data. As a member of that group I have had a hand in writing that draft, and I can tell you that it has been a struggle. Now we seriously need to hear from you, not least because the group is not fully representative of the library world; in fact, it leans heavily toward techy-ness and large libraries and services. We need to hear from a wide range of libraries and librarians: public, small, medium, special, management, people who worry about budgets, people who have face time with users. We also need to hear from the library vendor community, since little can happen with library data that will not involve that community. (Note: a site is being set up to take comments, and I am hoping it will be possible to post anonymously or at least pseudonymously, for those who cannot appear to be speaking for their employer.)

In thinking about the possibility of moving to a new approach to bibliographic data in libraries, I created this diagram (which will not be in the report; it was just my thinking) that to me represents a kind of needs assessment. This pyramid is not related just to linked data but to any data format that we might adopt to take the place of the card catalog mark-up that we use today.

We could use this to address the recent LC announcement on replacing MARC. Here's how I see that analysis, starting with the bottom of the pyramid:
  • Motivation: Our current data model lacks the flexibility that we need, and is keeping us from taking advantage of some modern technologies that could help us provide better user service. Libraries are becoming less and less visible as information providers, in part because our data does not play well on the web, and it is difficult for us to make use of web content.
  • Leadership: Creating a new model is going to take some serious coordination among all of the parties. Who should/could provide that leadership, and how can we fund this effort? Although LC has announced its intention to collaborate, for various reasons a more neutral organization might be desired, one that is truly global in scope. Yet who can both lead the conversion effort and be available for the future to provide stability for the long term maintenance that a library data carrier will require? And how can we be collaborative without being glacially slow?
  • Skills: Many of us went through library school before the term "metadata" was in common usage. We learned to follow the cataloging rules, but not to understand the basic principles of data modeling and creation. This is one of the reasons why it is hard for us to change: we are one-trick ponies in the metadata world. The profession needs new skills, and it's not enough for only a few to acquire them: we all need to understand the world we are moving into.
  • Means: This is the really hard one: how do we get the time and funding to make this much-needed change? Both will need to be justified with some clear examples of what we gain by this effort. I favor some demonstration projects, if we can find a way to create them.
  • Opportunity: The opportunity is here now. We could have made this change any time over the past decade or two while cataloging with AACR2, but RDA definitely gives us that golden moment when not changing no longer makes sense.

Tuesday, May 24, 2011

From MARC to Principled Metadata

Library of Congress has announced its intention to "review the bibliographic framework to better accommodate future needs." The translation of this into plain English is that they are (finally!) thinking about replacing the MARC format with something more modern. This is obviously something that desperately needs to be done.

I want to encourage LC and the entire library community to build its future bibliographic data on solid principles. Among these principles would be:

  • Use data, not text. Wherever possible, the stuff of bibliographic description should be computable data, not human-interpretable text. Any part of your metadata that cannot be used in machine algorithms is of limited utility in user services.
  • Give your things identifiers, not language tags. Identification allows you to share meaning without language barriers. Anything that has been identified can be displayed to users as a term in any language of your (or the user's) choice (a minimal sketch follows this list).
  • Adopt mainstream metadata standards. This is not only for the data formats but also in terms of the data itself. If other metadata creators are using a particular standard language list or geographic names, use those same terms. If there are metadata elements for common things like colors or sizes or places or [whatever], use those. Work with international communities to extend metadata if necessary, but do not create library-specific versions.
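
Here is a minimal sketch of the second principle, in Python; the URI and the labels are invented for illustration and are not from any actual authority file. The record carries only the identifier, and a label in the user's language is looked up at display time.

# Labels for an identified thing, keyed by language.
LABELS = {
    "http://example.org/concept/42": {"en": "Map", "fr": "Carte", "de": "Karte"},
}

def display_label(uri: str, lang: str = "en") -> str:
    # Fall back to the identifier itself if no label exists in that language.
    return LABELS.get(uri, {}).get(lang, uri)

print(display_label("http://example.org/concept/42", "fr"))   # Carte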

There is much more to be said, and fortunately a great deal of it is being included in the report of the W3C Incubator Group on Library Linked Data. Although still in draft form, you can see the current state of that group's recommendations, many of which address the transition that LC appears to be about to embark on. A version of the report for comments will be available later this summer.

The existence of this W3C group, however, is the proof of something very important that the Library of Congress must embrace: that bibliographic data is not solely of interest to libraries, and the future of library data should not be created as a library standard but as an information standard. This means that its development must include collaboration with the broader information community, and that collaboration will only be successful if libraries are willing to compromise in order to be part of the greater info-sphere. That's the biggest challenge we face.

Tuesday, October 12, 2010

Beyond MARC-up

In the recent Code4lib journal, Jason Thomale has published an article "Interpreting MARC: Where’s the Bibliographic Data?" in which he struggles to find the separate logical elements in a MARC 245 field. I must admit that I'm not entirely clear on what he means by 'bibliographic data' but I empathize with his attempts to find the data in MARC. In his conclusion he says:
... MARC has as much in common with a textual markup language (such as SGML or HTML) as it does with what we might consider to be “structured data.”
I have myself often referred to MARC as a markup language, to distinguish it from what a computer scientist would call "data." We took the catalog card and marked it up so that we could store the text in a machine-readable form and re-create the card format as precisely as possible. Along the way, a few fields (publication date, language, format) were considered in need of being expressed as actual data, and so the fixed fields were designed to hold those. Oddly enough, though, in most cases the same information was available in the text, meaning that the information had to be entered twice: once as text, and once as data.
008 pos. 07/10 = 1984
260 $c 1984
This fact is proof that at one point the MARC developers were fully aware that the text in the variable fields was ill-suited to machine operations other than printing on a card (or display on a screen).
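
A sketch of how that double entry shows up in processing, assuming the pymarc library and a hypothetical file of MARC records: the date has to be read once from the 008 fixed field and again from the 260 $c text, and the two can be compared (or found to contradict each other).

from pymarc import MARCReader

with open("records.mrc", "rb") as fh:            # hypothetical file of MARC records
    for record in MARCReader(fh):
        if record is None:
            continue
        fields_008 = record.get_fields("008")
        fields_260 = record.get_fields("260")
        if not fields_008 or not fields_260:
            continue
        date_from_008 = fields_008[0].data[7:11]                  # Date 1, positions 07-10
        date_from_260 = "".join(fields_260[0].get_subfields("c"))
        if date_from_008 not in date_from_260:
            print("possible contradiction:", date_from_008, "vs", date_from_260)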

I have been working off and on for a number of years on an analysis of MARC that is perhaps similar to Thomale's search for the bibliographic data of MARC. I characterize my project as an attempt to define the data elements of the MARC record. The logic goes like this: if we want to create a new, more flexible format for library data, one way to begin that process is to break MARC data up into its data elements. These can then be re-combined using a new data carrier. The converse is that if we cannot break MARC up into its data elements, then any new carrier will surely be saddled with some of the problematic aspects of MARC, such as:
  • redundancy, especially the repeat of the same content in many different fields
  • inconsistency, where the content in those different fields is coded differently or with a different level of granularity
  • potential contradiction between data in fixed fields and textual data
I am still just at the beginning of my analysis, but for anyone who wants to follow along and comment/cajole/criticize, I am doing my thinking out loud on the futurelib wiki. I thought I would start with the 0XX fields, but decided to drop back and start with 007/008. I have a database of all of the 007/008 elements and their values (linked in tab-delimited format on this wiki page), so I've been able to sort and eliminate and do other database-y things that help me see what's there.

I'm not interested in replicating MARC, so I do not want to create something that is one-to-one with MARC fields and subfields. As an example, some fixed field data elements and their values appear more than once in the MARC format, such as the 008 "Government publication" element, which is identical in the 008 for books, computer files, maps, continuing resources, and visual materials. As far as I'm concerned that is a single data element. On the other hand, an element named "Color" appears in more than one 007 field, but in each case the values that are valid for the data element are different. These, then, are different data elements.

I am struggling with how to create usable output from my investigations. I may code some things in the Open Metadata Registry, but at the moment that would have to be done by hand and I need something more automated. I would like to represent the controlled lists in the fixed fields in an RDF-compatible way using SKOS. This should be relatively simple once certain decisions are made (naming, URIs, etc.).

A big question is how to link all of this back to MARC. For the fixed fields it's relatively easy to create a string that represents the MARC origins of the data, for example:
  • 007microform05 to represent the data element (field 007, category of material Microform, position 05)
  • 007microform05f to represent the actual value (field 007, category of material Microform, position 05, value=f)
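
A sketch of what the SKOS representation could look like, using Python's rdflib. The base URI uses the marc21.info domain mentioned below, the identifier strings follow the pattern above, and the labels are placeholders to be replaced by the wording in the MARC documentation.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

MARC = Namespace("http://marc21.info/terms/")        # assumed base URI, per the plan below

g = Graph()
g.bind("skos", SKOS)

element = MARC["007microform05"]                     # field 007, category Microform, position 05
g.add((element, RDF.type, SKOS.ConceptScheme))
g.add((element, SKOS.prefLabel, Literal("007 microform, position 05", lang="en")))  # placeholder label

value = MARC["007microform05f"]                      # the value "f" within that element
g.add((value, RDF.type, SKOS.Concept))
g.add((value, SKOS.inScheme, element))
g.add((value, SKOS.notation, Literal("f")))
# skos:prefLabel for the value would come from the published MARC definition.

print(g.serialize(format="turtle"))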
When it comes to the variable fields this is going to be more difficult because, as Thomale points out in his article, a logical element may span more than one field/subfield, and there may also be multiple elements in a single subfield. Working that out is going to be very, very difficult. So it seems best to go for the low-hanging fruit of the fixed fields.

Note that there have been other good starts at defining the MARC fixed fields in SKOS, and eventually we may be able to bring this all together. Meanwhile, I did grab marc21.info for the URI portion of this work and obviously am working toward dereferenceable URIs.

Wednesday, April 07, 2010

After MARC

The report on the Future of Bibliographic Control made it clear that the members of that committee felt that it was time to move beyond MARC:
"The existing Z39.2/MARC “stack” is not an appropriate starting place for a new bibliographic data carrier because of the limitations placed upon it by the formats of the past." p. 24

The recent report from the RLG/OCLC group Implications of MARC Tag Usage on Library Metadata Practices comes to a similar conclusion:
"5. MARC itself is arguably too ambiguous and insufficiently structured to facilitate machine processing and manipulation." p.27

We seem to be reaching a point of consensus in our profession that it is time to move beyond MARC. When faced with that possibility, many librarians will wonder if we have the technical chops to make this transition. I don't have that worry; I am confident that we do. What worries me, however, is the complete lack of leadership for this essential endeavor.

Where could/should this leadership come from? Library of Congress, the maintenance agency for the current format, and OCLC, the major provider of records to libraries, both have a very strong interest in not facilitating (and perhaps even in preventing) a disruptive change. So far, neither has shown any interest in letting go of MARC. The American Library Association has just invested a large sum of money in the development of a new cataloging code. It has neither the funds nor the technical expertise to take the logical next step and help create the carrier for that data. Yet a code without a carrier is virtually useless in today's computer-driven, networked world. NISO, the official standards body for everything "information," is in the same situation as ALA: it cannot fund a large effort, and it has no technical staff to guide such a project.

It seems ironic that there have been projects funded recently to develop library-related software based on MARC even though we consider this format to be overdue for replacement. The one effort I'm aware of to obtain funding for the development of a new carrier was rejected on the grounds that it wasn't technically interesting. In fact, the technology of such an effort isn't all that interesting; the effort requires the creation of a social structure that will nurture and maintain our shared data standard (or standards, as the case may be). It requires an ongoing commitment, broad participation, and stability. Above all, however, it requires vision and leadership. Those are the qualities that are hard to come by.

Friday, March 05, 2010

MARC: from mark-up to data

The main reason that I keep pushing the semantic web is not that I think the semantic web is the answer to all of our problems; rather, I think we need something to be moving toward as we transform our data carrier into something both more modern and web-compatible. The semantic web gives us some basic concepts of data design. I'm not sure that semantic web concepts will hold for data as complex as the library bibliographic record, but there's only one way to find out: do it. That's a huge task, of course.

The first question to be answered is: What are our data elements? In theory, this should be one of the simpler questions, but it's not. I can create a list of all of the MARC fields, subfields, and fixed field elements (which I have, and they are linked from this page of the futurelib wiki), but that doesn't answer the question. Here's why:

Indicators

The indicators in the MARC fields are like a wild card in poker -- they can be used to utterly transform the play. Some of the indicators are simple and probably can be dismissed: the non-filing indicators and the indicators that control printing. Some are data elements in themselves: "Existence in NAL collection" is essentially a binary data element. Many further refine the meaning of the field, allowing the field to carry any one of a number of related subelements:
Second - Type of ring
# - Not applicable
0 - Outer ring
1 - Exclusion ring
Others name the source of the term, such as LCSH or MeSH. It'll take a fair amount of work to figure out what all of these qualifiers mean in terms of actual data elements.

Redundancy

There is non-textual (although not non-string) data in the MARC record, primarily in the fixed fields (00X) but also in some of the number and code fields (0XX). Some of these, actually most of these, are redundant with display information in the body of the record. Should these continue to be separate data elements, or can we remove this redundancy and still have useful user displays? Basically, having the same information entered in two different ways in your data is just begging for trouble and we've all seen fixed field dates and display (260 $c) dates that contradict each other.

Inconsistency

Primarily due to the constraints of the MARC format, the same information has been coded differently in different fields. A personal author entry in the 100 field uses subfields abcdejqu; in the 760 linking entry field, all of that data is entered into subfield a. It's the same data element, and by that I mean that the same content is contained in the concatenation of abcdejqu as in the single subfield a. Bringing together all of these crufty bits into a more rational data definition is something I really long for.
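
A small sketch of that inconsistency in Python, with invented subfield values (not taken from any actual record): to compare the two fields you have to concatenate the 100 subfields back into the single string that the 760 carries.

# Subfields represented as (code, value) pairs; the values are invented.
field_100 = [("a", "Smith, Jane,"), ("d", "1945-")]
field_760 = [("a", "Smith, Jane, 1945-")]

name_from_100 = " ".join(value for code, value in field_100 if code in "abcdejqu")
name_from_760 = field_760[0][1]
print(name_from_100 == name_from_760)   # True, but only after reassembling the pieces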

And of course my favorite... data buried in text

So much of our data isn't data, it's text, or it's data buried in text. My favorite example is the ISBN. Everyone knows how important the ISBN is in all kinds of bibliographic linking operations. But there isn't a place in our record for the ISBN as a data element. Instead, there is a subfield that takes the ISBN as well as other information.
020 __ |a 0812976479 (pbk.)
This means that every system that processes MARC records has to have code that separates out the actual ISBN from whatever else might be in the subfield. Other buried information includes things like pagination and size or other extents:
300 __ |a 1 sound disc : |b analog, 33 1/3 rpm, stereo. ; |c 12 in.

300 __ |a 376 p. ; |c 21 cm.
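
Returning to the ISBN example above, here is a minimal sketch of the kind of string parsing that every MARC-processing system ends up writing, using the 020 $a value shown earlier (the regular expression is illustrative and handles only the plain 10- and 13-digit cases):

import re

subfield_020a = "0812976479 (pbk.)"                  # the 020 $a value from the example above

# Pull the bare ISBN out of the text, leaving the qualifier behind.
match = re.match(r"(\d{13}|\d{9}[\dXx])", subfield_020a)
isbn = match.group(1) if match else None
print(isbn)   # 0812976479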



Once this analysis is done (and I do need help, yes, thank you!), it may be possible to compare MARC to the RDA elements and see where we do and don't have a match. I have a drafty web page where I am putting the lists I'm creating of RDA elements, but I will try to get it all written up on the futurelib wiki so it's all in one place. I encourage others to grab this data and play with it, or to start doing whatever you think you can do with the registered RDA vocabularies. And please post your results somewhere and let me know so that I can gather it all, probably on the wiki.