Open Bibliographic Data: How Should the Ecosystem Work?
The following guest post is from John Wilkin, who is Executive Director of the HathiTrust, a Librarian at the University of Michigan and a member of the OKF’s Working Group on Open Bibliographic Data.
In the conversations about openness of bibliographic data, I often find myself in an odd position, vehemently in support of it but almost as vehemently alarmed at the sort of rhetoric that circulates about the ways that data should be shared.
The problem is that both the arguments OCLC makes and many of the arguments for openness seem to be predicated on the view that bibliographic data are largely inert, lifeless “records” and that these records are the units that should be distributed and consumed.[1]
Nothing could be further from the truth. Good bibliographic data are in a state of fairly constant, even if minor, flux. There are periodic refinements to names and terms (through authority work), corrections to or amplifications of discrete elements (e.g., dates, titles, authors), and constant augmentation of the records through connection with ancillary data (e.g., statements about the copyright status of the specific manifestation of the work).
In fact, bibliographic data are the classic example of data that need to live in the linked data space, where not only constant fixes but constant annotation and augmentation can take place. That fact, and the fact that most of the bibliographic data we have were created through a kind of collaborative paradigm (e.g., in OCLC’s WorldCat), make the OCLC position all the more offensive.
Locking bibliographic data up, particularly through arguments around community norms, means that they won’t be as used or as useful as they might be, and that we will rarely receive the benefits of community in creating and maintaining them. The way these data are often used when shared, however, makes the hue and cry of the other side, which essentially says “give me a copy of your data,” all the more nonsensical: by disseminating these records all over the networked world, we undermine our collective opportunities.
My teeth grind during most of these arguments as equally dismal alternatives are presented and disputed. I’m troubled when the OCLC membership releases a record use policy that defines boundaries between members and non-members. But I’m also troubled when I hear the insistence by, for example, the OpenLibrary that taking a copy of the records in WorldCat and populating the OpenLibrary database (in the name of openness) does something to advance our collective needs.
By walling off the data, we, the members of the OCLC cooperative, lose any possibility of community input around a whole host of problems bigger than the collectivity of libraries: Author death dates? Copyright determination? Unknown authors or places of publication?
These problems can best be solved by linked data and crowd-sourcing. And all of this should happen with a free and generous flow of data. OCLC should define its preeminence not by how big or how strong the walls are, but by how good and how well-integrated the data are. If WorldCat were in the flow of work, with others building services and activities around it, no one would care whether copies of the records existed elsewhere, and most of the legitimate requests for copies of the records would morph into linked data projects.
The role of our library community around the data should not be that we are the only ones privileged to touch the data, but that we play some coordinating management role with a world of very interested users contributing effort to the enterprise.
On the other hand, every time someone says this is a problem that should be solved by having records all over the Internet like so many flower seeds on the wind, I see a “solution” that produces exactly what the metaphor implies, a thousand flowers blooming, each metaphorical flower an instance of the same bibliographic record.
What is being argued is that having bibliographic records move around in this way is the sine qua non and even the purpose of openness. When we do that, instead of the collective action we need, we get dispersed and diluted action. Where we need authority, we get babel.
If the argument is “give us your records, OCLC, because we can do the job you can’t seem to do,” the right approach would be to throw down the gauntlet and see if OCLC picks it up. Frankly, what I typically see is an argument that boils down to “give us your records because we intend to run a competitive business and cut into your membership and funding sources.”
The last thing I want is one copy of my record integrated in this author explication effort, one copy integrated into that copyright determination process, and yet another copy of the record corrected in the other bibliographic refinement effort.
Just as OCLC has a moral responsibility to open the doors to the data to support a linked data effort, these other initiatives have a moral responsibility to conceive of their efforts as linked data efforts. I am not opposed to the free flow of records; however, that should not be the first and most important goal of openness.
I wanted to use this blog forum as an opportunity to make this point, and also, seemingly incongruously, to announce the availability of nearly 700,000 records from the University of Michigan catalog with a CC-0 license, records that can also be found in OCLC. They are now available here: (CKAN package for the Michigan records).
Using automated mechanisms around cataloging conventions (well, the 040 field, to be precise), we determined that these records in Michigan’s catalog were contributed by the University of Michigan library to OCLC.
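For readers unfamiliar with the 040 field, the kind of check described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the process Michigan actually ran: it assumes the test looks at subfield a of the 040 (the original cataloging agency), that records have already been parsed into plain Python dictionaries, and that Michigan is identified by the codes "MiU" and "EYM". The function names and the record layout are hypothetical.

```python
from typing import Optional

# Assumed agency codes for the University of Michigan (illustrative only;
# the post does not say which codes or subfields were checked).
ASSUMED_MICHIGAN_CODES = {"MIU", "EYM"}


def original_cataloging_agency(record: dict) -> Optional[str]:
    """Return the first 040 subfield 'a' value of a parsed record, if any.

    The record layout is hypothetical:
    {"fields": [{"tag": "040", "subfields": [("a", "MiU"), ("c", "MiU")]}]}
    """
    for field in record.get("fields", []):
        if field.get("tag") != "040":
            continue
        for code, value in field.get("subfields", []):
            if code == "a":
                return value.strip()
    return None


def contributed_by(record: dict, codes=ASSUMED_MICHIGAN_CODES) -> bool:
    """True if the record's original cataloging agency is one of `codes`."""
    agency = original_cataloging_agency(record)
    return agency is not None and agency.upper() in codes


def select_contributed(records, codes=ASSUMED_MICHIGAN_CODES) -> list:
    """Keep only the records whose 040 subfield 'a' matches `codes`."""
    return [r for r in records if contributed_by(r, codes)]
```

A real catalog would call for a proper MARC parser and a more careful policy (for instance, deciding how to treat records that other agencies later modified), so this is best read as a picture of the idea and nothing more.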
According to OCLC member rules, we are thus permitted to make the records freely available to anyone, including non-members. Why would I share records in this way after expressing such opposition?
For one thing, I believe the data should be shared, and the fact that we have not developed norms for sharing them in the best possible way should not inhibit that. It is at the same time, however, an attempt to demonstrate the futility of our record-oriented paradigm of data sharing: should I commit staff time to updating this file frequently? Should I notify recipients of the file that a record has changed? Should we develop an update mechanism that identifies elements within records as having undergone some sort of revision?
I hope a simple “no” doesn’t sound curt or uncooperative: Michigan’s library, like other libraries, has limited resources, resources that should always be devoted directly or indirectly to serving our immediate constituencies; I’d be remiss in my duties if I syphoned off significant staff effort to support activities where there was no connection to our institutional needs.
That said, I believe having the records out there will stimulate even more discussion about the value of openness and the role of OCLC. I’ll have my staff update the file periodically, and in the next release will add the CC-0 mark to the records themselves. I hope the records prove useful to all sorts of initiatives, but I also hope that their availability and my argument help spur more collective action around solving these problems through linking and associated strategies of openness, and not through file sharing.
[1] See, for example, much of the language around community norms and responsibilities in the OCLC record use policy; cf. arguments about “shar[ing] your catalog” in this blog post.



