Open Bibliographic Data: How Should the Ecosystem Work?

November 29, 2010, by Rufus Pollock

The following guest post is from John Wilkin, who is Executive Director of the HathiTrust, a Librarian at the University of Michigan and a member of the OKF’s Working Group on Open Bibliographic Data.

In the conversations about openness of bibliographic data, I often find myself in an odd position, vehemently in support of it but almost as vehemently alarmed at the sort of rhetoric that circulates about the ways that data should be shared.

The problem is that both the arguments OCLC makes and many of the arguments for openness seem to be predicated on the view that bibliographic data are largely inert, lifeless “records” and that these records are the units that should be distributed and consumed.¹

Nothing could be further from the truth. Good bibliographic data are in a state of fairly constant, even if minor, flux. There are periodic refinements to names and terms (through authority work), corrections to or amplifications of discrete elements (e.g., dates, titles, authors), and constant augmentation of the records through connection with ancillary data (e.g., statements about the copyright status of the specific manifestation of the work).

In fact, bibliographic data are the classic example of data that need to live in the linked data space, where not only constant fixes but constant annotation and augmentation can take place. That fact and the fact that most of the bibliographic data we have has been created through a kind of collaborative paradigm (e.g., in OCLC’s WorldCat) make the OCLC position all the more offensive.

Locking bibliographic data up, particularly through arguments around community norms, means that they won’t be as used or as useful as they might be, and that we will rarely receive the benefits of community in creating and maintaining them. The way these data are often used when shared, however, makes the hue and cry of the other side, which essentially says “give me a copy of your data,” all the more nonsensical: by disseminating these records all over the networked world, we undermine our collective opportunities.

My teeth grind during most of these arguments as equally dismal alternatives are presented and disputed. I’m troubled when the OCLC membership releases a record use policy that defines boundaries between members and non-members. But I’m also troubled when I hear the insistence by, for example, the OpenLibrary that taking a copy of the records in WorldCat and populating the OpenLibrary database (in the name of openness) does something to advance our collective needs.

By walling off the data, we, the members of the OCLC cooperative, lose any possibility of community input around a whole host of problems bigger than the collectivity of libraries: Author death dates? Copyright determination? Unknown authors or places of publication?

These problems can best be solved by linked data and crowd-sourcing. And all of this should happen with a free and generous flow of data. OCLC should define its preeminence not by how big or how strong the walls are, but by how good and how well-integrated the data are. If WorldCat were in the flow of work, with others building services and activities around it, no one would care whether copies of the records existed elsewhere, and most of the legitimate requests for copies of the records would morph into linked data projects.

The role of our library community around the data should not be that we are the only ones privileged to touch the data, but that we play some coordinating management role with a world of very interested users contributing effort to the enterprise.

On the other hand, every time someone says this is a problem that should be solved by having records all over the Internet like so many flower seeds on the wind, I see a “solution” that produces exactly what the metaphor implies, a thousand flowers blooming, each metaphorical flower an instance of the same bibliographic record.

What is being argued is that having bibliographic records move around in this way is the sine qua non and even the purpose of openness. When we do that, instead of the collective action we need, we get dispersed and diluted action. Where we need authority, we get babel.

If the argument is “give us your records, OCLC, because we can do the job you can’t seem to do,” the right approach would be to throw down the gauntlet and see if OCLC picks it up. Frankly, what I typically see is an argument that boils down to “give us your records because we intend to run a competitive business and cut into your membership and funding sources…”

The last thing I want is one copy of my record integrated in this author explication effort, one copy integrated into that copyright determination process, and yet another copy of the record corrected in the other bibliographic refinement effort.

Just as OCLC has a moral responsibility to open the doors to the data to support a linked data effort, these other initiatives have a moral responsibility to conceive of their efforts as linked data efforts. I am not opposed to the free flow of records; however, that should not be the first and most important goal of openness.

I wanted to use this blog forum as an opportunity to make this point, and also, seemingly incongruously, to announce the availability of nearly 700,000 records from the University of Michigan catalog with a CC-0 license, records that can also be found in OCLC. They are now available here: (CKAN package for the Michigan records).

Using automated mechanisms around cataloging conventions (well, the 040 field, to be precise), we determined that these records in Michigan’s catalog were contributed by the University of Michigan library to OCLC.
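
A check of this kind might be sketched as follows. This is an illustration only, not the process Michigan actually ran: it assumes that "contributed by" means the first 040 $a (original cataloging agency) carries a Michigan code, that records have already been parsed into plain dictionaries, and that the agency codes listed are the right ones. The record layout, the function names and the code list are all hypothetical.

```python
# Assumed agency codes for the University of Michigan. These are
# placeholders for illustration; confirm the real values before use.
LOCAL_AGENCY_CODES = {"MiU", "EYM"}


def original_cataloging_agency(record):
    """Return the first 040 $a value (original cataloging agency), or None."""
    for field in record.get("fields", []):
        if field.get("tag") != "040":
            continue
        for code, value in field.get("subfields", []):
            if code == "a":
                return value.strip()
    return None


def contributed_locally(record, local_codes=LOCAL_AGENCY_CODES):
    """True when the record's original cataloging agency is one of ours."""
    agency = original_cataloging_agency(record)
    return agency is not None and agency in local_codes


# Hypothetical, already-parsed records (not real catalog data).
records = [
    # Originally cataloged locally.
    {"id": "rec-1", "fields": [
        {"tag": "040", "subfields": [("a", "MiU"), ("c", "MiU")]}]},
    # Cataloged elsewhere, only modified locally ($d).
    {"id": "rec-2", "fields": [
        {"tag": "040", "subfields": [("a", "DLC"), ("c", "DLC"), ("d", "MiU")]}]},
    # No 040 field at all, so nothing can be concluded.
    {"id": "rec-3", "fields": [
        {"tag": "245", "subfields": [("a", "Untitled")]}]},
]

shareable = [r["id"] for r in records if contributed_locally(r)]
print(shareable)
```

Only the first record passes here. A record that was merely modified locally is excluded, which is the conservative reading of "contributed by"; a real selection rule could well differ.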

According to OCLC member rules, we are thus permitted to make the records freely available to anyone, including non-members. Why would I share records in this way after expressing such opposition?

For one thing, I believe the data should be shared, and the fact that we have not developed norms for sharing them in the best possible way should not inhibit that. It is at the same time, however, an attempt to demonstrate the futility of our record-oriented paradigm of data sharing: should I commit staff time to updating this file frequently? Should I notify recipients of the file that a record has changed? Should we develop an update mechanism that identifies elements within records as having undergone some sort of revision?

I hope a simple “no” doesn’t sound curt or uncooperative: Michigan’s library, like other libraries, has limited resources, resources that should always be devoted directly or indirectly to serving our immediate constituencies; I’d be remiss in my duties if I syphoned off significant staff effort to support activities where there was no connection to our institutional needs.

That said, I believe having the records out there will stimulate even more discussion about the value of openness and the role of OCLC. I’ll have my staff update the file periodically, and in the next release will add the CC-0 mark to the records themselves. I hope the records prove useful to all sorts of initiatives, but I also hope that their availability and my argument help spur more collective action around solving these problems through linking and associated strategies of openness, and not through file sharing.

¹ See, for example, much of the language around community norms and responsibilities in the OCLC record use policy; cf. arguments about “shar[ing] your catalog” in this blog post.

Rufus Pollock

Rufus Pollock is Founder and President of Open Knowledge.

18 thoughts on “Open Bibliographic Data: How Should the Ecosystem Work?”

George Oates says:

November 29, 2010 at 23:42

Hi John

“the OpenLibrary that taking a copy of the records in WorldCat and populating the OpenLibrary database”

I just thought it might be worth clarifying that that has never actually happened in any wholesale or organized way. Cheers.
MJ Ray says:

November 30, 2010 at 11:09

So are “the members of the OCLC cooperative” pushing their co-op to open up, to show “Concern for Community” and “Co-operation among Co-operatives” by being a bit more sharing with records? Those are two of the basic co-op values, after all.

Well done the University of Michigan for putting records on CKAN, though!
John Wilkin says:

November 30, 2010 at 18:08

I’ve had a little bit of out-of-band communication about my blog piece that I wanted to bring in here. One person questioned our having posted the records (on the UM Library website, incidentally, with a pointer from CKAN) when I made the argument I did. I can only say that I wish I’d said more clearly that I think sharing the records this way is an inferior strategy. What I would add, though, is that simply arguing against sharing the records as an effective strategy isn’t enough: adding to the chaos through this act of sharing adds an exclamation point to the chaos. Frankly, I don’t think railing against this kind of record sharing strategy is going to make the desire to see records shared go away.

I also saw a bit of conversation that had me excoriating OCLC and giving a free pass to the Open Library, which that writer lauded as a paragon of the kind of integration and enrichment strategies I was advocating. This misses the point entirely. What efforts does the Open Library or any other record copying effort make to return the enriched information to the source records? We’re only contributing to the fragmentation through this kind of process, and until we have better record management that’s simultaneously in the flow of the work of users and the institutions managing the objects being described, we’re losing the value that the crowd brings to these problems. Mind you, here I’m not arguing for reinforcing OCLC’s central role, but arguing instead for an effective way of managing the information that doesn’t also make it well-integrated in the rest of the web.
George Oates says:

November 30, 2010 at 20:41

“What efforts does the Open Library or any other record copying effort make to return the enriched information to the source records?”

For what it’s worth, I agree with your implication here, that Open Library isn’t doing as well as it could to return enriched (or simply corrected) information back to source catalogs, and it’s a weakness we’re definitely aware of, and keen to address.

I’ve spoken publicly about the lack of utility in extracting data from Open Library. While it’s true (and always has been) that dumps of the entire OL dataset are available, they are massive files, and in JSON, so aren’t necessarily in a form that’s easy for anyone to work with. (Perhaps this argument holds for most big library datasets…)

Over the past year or so, we’ve put a lot of work into:

a) making the original MARC source records more accessible, per record, by surfacing links to them in a record’s history, if the source is MARC (example). Any/all catalogs we have imported or plan to import are also archived on the Internet Archive.

b) improving the RDF offering for Works, Authors and Editions (with hearty thanks to Karen Coyle for her work on this). Again, we’ve tried to surface these resources more overtly, with links on every page where they’re available. The RDF is evolving, in particular, for Works.

c) documenting more of the Open Library API. We’ve released an additional 5 or 6 new API methods for things like Subjects, the recently released, rewritten Search Inside and Recent Changes.

d) working on ways to extract smaller records sets from the dataset, in the form of a new feature called “Lists” which will be launching soon, hopefully before the end of the year.

All of this work attempts to make it easier for external catalogs to make more use of Open Library data. We are also one of the largest free cover repositories on the web, and that’s probably one of the more prominent uses of the API today.

In terms of the “ecosystem”, Open Library is seeing between 100,000 and 150,000 edits per month now. You’re right again though, that unless we find ways of getting this improved or corrected data (author merges, typos, additional metadata, new covers) back out into the world, it’s a hollow victory.

I’m not sure I’d quite agree with your defining import of records into Open Library as “fragmentation” – in my mind at least, it’s more about aggregation, actually. How wonderful to be able to collate records from several library catalogs around the world about a certain subject?

“until we have better record management that’s simultaneously in the flow of the work of users and the institutions managing the objects being described”

To some degree, this is beyond the influence of Open Library. But, there are precedents emerging amongst our regular editors where Open Library is effectively used as a catalog manager, albeit most often in a commercial context.

I’ve been working with the superstars over at Koha to see how we might integrate Open Library records into the workflow you’re looking for, but nothing to show on that yet. Hopefully soon!

“arguing instead for an effective way of managing the information that doesn’t also make it well-integrated in the rest of the web”

What are some of your ideas?
Jonathan Rochkind says:

November 30, 2010 at 21:38

I see an analogy with open source software and ‘forking’. Generally, if a project forks into two separate projects (with both still having the same goals), that’s a failure, because it represents a dilution of programmer effort.

However, having the ability to fork is what gives us confidence in the open source code — if the code isn’t being maintained as productively as we like, we aren’t stuck with the current maintainers (as in proprietary software), but we can, if we need to, fork the code into a project managed the way we like.

Generally this ends up resulting in one of the two forks dying, and the other one taking on the mantle of central coordination — which one depends on if the forkers manage to get general agreement that the original management is disastrously flawed and the new management is better.

Currently, I think OCLC is not providing us with what we need for management of our collective cataloging patrimony. Not providing us with the features and technical infrastructure, and not providing us with the price points and costs-to-benefits needed for our businesses.

There are real barriers and challenges to figuring out how to give us what we need, I’m not saying it’s easy. But as with other challenging goals, it doesn’t hurt to have multiple entities working on them at once, and learning from each other. It can in fact help a lot. One of the biggest barriers to this happening is OCLC’s effective monopoly control of our metadata patrimony.

Ultimately we definitely want and need some coordinated integration of all our data maintenance efforts, I agree with what John writes entirely. But it helps us if various entities who think they can give us what we need all have the ability to try, whether you call that market competition or simply opening up the field to more hands. And sharing the records at least makes that possible — doesn’t mean the end goal is a thousand separate gardens instead of one common field. But doesn’t hurt to share your seeds with someone who isn’t happy with your garden and would like to give it a go themselves either. Wow, that metaphor held up surprisingly well.
Karen Coyle says:

December 6, 2010 at 11:51

I must say that I see things quite differently from JPW. Although I agree that a bunch of static bibliographic files do not open library linked data make, my view is:

1) Each file represents a person or group who got interested in transforming library data and went through the learning process of actually doing it. Therefore each file is a contribution to our collective knowledge about linked data. When we add these files to heterogeneous stores like Open Library or Freebase, we exercise that knowledge.

2) These files are the fodder for further experimentation with mixing library data and non-library data, which to me is one of the main points of linked library data. We are in the “training wheels” stage of this change, and like training wheels these early files may end up being discarded when we finally learn to ride. I see no harm in that.

3) This experimentation is taking place primarily outside of the US in places where the OCLC record use policy does not apply. The British Library, the National Library of Sweden, soon the Bibliotheque Nationale, and a handful of German libraries are at the forefront of this. If you cannot release your bibliographic data openly, you cannot participate in the linked data movement.
Karen Coyle says:

December 17, 2010 at 14:51

John, since your post, OCLC has made the following statement in a legal document:

“The nature of these documents is not pled: it is not claimed that these documents are anything other than ‘guidelines’ OCLC publishes or that OCLC has ever used these documents to prevent a library from providing its catalog records to Plaintiffs or any other entity.” (Motion, p. 7)

http://www.librarytechnology.org/docs/15273.pdf

Does this agree with your understanding of the policy? It seems to say that you could publish your entire catalog, although that would not be within the guidelines. However, the implication is that there would be no retribution (although perhaps a lot of peer pressure?).
Jennifer Younger says:

December 28, 2010 at 19:47

December 28, 2010
Dear John,
I begin, as you did on the Open Knowledge Foundation blog, with introducing some multiple roles I have in the library profession: chair of the Catholic Research Resources Alliance (CRRA) Board of Directors, president of OCLC Global Council, and co-chair of the OCLC Record Use Policy Council. From this perspective, I read your post with keen interest.
My role with the CRRA and its portal, as well as my representative and advisory roles with OCLC has developed my thinking about the aggregation and distribution of data on the web. In particular, I am increasingly mindful of the importance of finding innovative ways to optimize multiple collections’ and sites’ contributions to the advancement of knowledge both generally as well as in a particular area of study. The CRRA’s mission is to provide enduring global access to all research resources reflecting the Catholic intellectual tradition and scholarship. OCLC’s mission is ‘connecting people to knowledge through library cooperation.’ Both missions are founded in creating something new and valuable from disparate, unconnected sources. I am far from an expert on linked data, but it seems a promising approach for exposing, connecting, and sharing related data that is otherwise dispersed around the web. For that reason, and in the context of the CRRA and OCLC missions, I support further exploration of the potential of linked data projects.
At the same time, I feel it’s important to point out that I don’t see OCLC’s new record use policy as a barrier to the aggregation and sharing of data on the Catholic portal, even if our contributors gave us linked data, which they are not doing nor have they asked to do so, or if we decided to provide the portal as linked data. I’m confident we could work out a way of doing this in the context of the WorldCat rights and responsibilities described in the record use policy. Our clear intent in writing the policy as we did was, and is, to encourage broad use of WorldCat bibliographic data while also supporting the ongoing, long-term viability and utility of WorldCat.
Thanks for providing the opportunity for sending along my comments.
Jennifer Younger
Jonathan Rochkind says:

December 28, 2010 at 20:58

Jennifer, thanks for your comments.

The need to ‘work out a way of doing this’ with OCLC appears to many of us on-the-ground library software developers to be a barrier to collaboration and sharing in a way that I’ve come to think OCLC honestly doesn’t understand. So let me try to explain a bit.

I, like most library software developers (and certainly not unique to libraries either) am awfully busy, and working on many things at once. Sometimes I’m working on something, and I think, gee, with just a bit of extra effort, I could share this with other library developers (or the world in general). Maybe by putting my code in a public repository; maybe by providing a public API to my service; maybe by sharing my data. (That last one of sharing data is often implemented by an API; and the second one of an API often will involve sharing data if it is to be useful).

I’m not really sure if this is going to be useful to others or not, but it’s worth a bit of extra effort just in case. The effort will be less if I do it when I’m in the middle of the project than if I try to go back and do it later. Throw enough of those things out there, some of them will ‘stick’ and wind up useful to others, some of them won’t. Some of them that do ‘stick’ will find synergy with similar things someone else has done and synergistically build into more than the sum of their parts; this is how a lot of collaborative open innovation on the internet happens.

But it’s just worth a BIT of extra effort, I’ve got way too many things to do to make a big project out of sharing something that just MIGHT be useful.

Your suggestion that if you really want to share something you can find a ‘way to work it out’ basically means: 1) I’ve got to get my boss involved, probably. 2) We’ve got to have a discussion with OCLC, which 3) First involves finding the right person at OCLC to talk to, and 4) Then involves having a bunch of conversations, possibly leading up to 5) Signing some contract or agreement with OCLC, which probably will mean getting my boss’s boss involved, and possibly even the next higher up boss. And that agreement may require 6) Additional development time to make sure it’s done the way OCLC wants.

This is a real barrier, even if the end result may indeed usually be OCLC agreeing that the sharing is acceptable. It’s enough of a barrier that many things simply won’t be shared, we won’t be throwing everything out there to see what ‘sticks’ anymore, we’re only willing to go through that effort with things we’re pretty darn sure will be awfully useful (and our bosses and bosses’ bosses will agree are worth putting institutional resources into sharing too). But we don’t always know in advance what those things are, we’re often wrong predicting it — ‘web scale’ innovation (to borrow a phrasal adjective) doesn’t happen with that kind of pre-planning, it happens from throwing a bunch of things out there in an agile way and seeing what happens to it.

Now, it may very well be that OCLC’s business interests and ongoing, long-term viability mean that it’s just not possible to provide for an environment supporting that kind of agile sharing and innovation. I suppose that’s a decision for OCLC to make. But I don’t think OCLC necessarily realizes the trade-off being made here: we developers aren’t unhappy with the record use policy restrictions just out of spite, or because we want to ruin OCLC’s sustainability or viability (most of us want no such thing). We’re unhappy because it really is a barrier to innovation, because many things we’d like to do will end up remaining undone.

(And when I say “I suppose that’s a decision for OCLC to make”, really, as a member cooperative, that’s a decision for OCLC’s members to make, and I’m not sure the administrators at OCLC member institutions realize the trade-off either.)
Robin Murray says:

December 31, 2010 at 08:33

John’s post makes thoughtful observations on the bibliographic environment and the developments needed in this rapidly changing space. John makes some general directional statements, which I can only agree with. He also makes comments and challenges on how OCLC is, and should be, acting to address the requirements of this emerging ecosystem. Being deeply involved in OCLC’s efforts in this space, I have a naturally different vantage point from which to view OCLC’s activities.

On the OCLC Blog at:
http://community.oclc.org/cooperative/2010/12/perspectives-on-worldcat.html

I have taken the opportunity to post a reasonably lengthy response to these challenges and hopefully provide more context to OCLC’s activities and direction.
MJ Ray says:

January 1, 2011 at 18:46

It’s great to see some OCLC folks here, but I’m a bit disappointed that Robin Murray has split the venue of the discussion.

Fundamentally, what I don’t understand is how OCLC reconciles its fairly-closed management of WorldCat with cooperative values and principles. WorldCat looks a bit like an attempt to enclose a commons and become a landlord who can sustain themselves from it, in a perversion of the idea of sustainability. It’s worrying that OCLC folks write about “a club good” instead of “a common good” and “rights and responsibilities” instead of “values and principles”. Does OCLC educate its members about co-ops? Does OCLC distribute a social report to its members? How does WorldCat appear in those lessons and reports?
Ed Summers says:

March 28, 2011 at 12:23

Really nice post John. While it may seem that this dump of data is “inferior” I would just like to stress how it is a really, really important first step. It gets you over the hump of deciding what to release, who to release it to, and how to release it — which typically seem to be the hardest bits.

Now that you’ve got past that, I would encourage you to look at some simple feed based approaches to making updates available. My preference is for Atom since it’s an IETF standard, and widely deployed. It has paging mechanisms (similar to OAI-PMH’s resumption tokens) for large responses. Speaking of PMH, I know UMich has invested heavily in it in the past. But if I were you I’d leapfrog that particular library-centric technology and join the Atom folks.
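
To make the Atom suggestion concrete, here is a minimal sketch of how a consumer might walk a paged update feed by following rel="next" links, the paging convention described in RFC 5005. Everything specific in it is assumed for illustration: the example.org URLs, the entry ids, and the in-memory PAGES dictionary that stands in for real HTTP responses. No such feed is described in the post.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical feed pages keyed by URL, standing in for HTTP responses.
PAGES = {
    "https://example.org/records/updates?page=1": """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Record updates</title>
  <link rel="next" href="https://example.org/records/updates?page=2"/>
  <entry><id>urn:example:rec-1</id><title>Record 1</title>
    <updated>2011-03-01T00:00:00Z</updated></entry>
  <entry><id>urn:example:rec-2</id><title>Record 2</title>
    <updated>2011-03-02T00:00:00Z</updated></entry>
</feed>""",
    "https://example.org/records/updates?page=2": """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Record updates</title>
  <entry><id>urn:example:rec-3</id><title>Record 3</title>
    <updated>2011-03-03T00:00:00Z</updated></entry>
</feed>""",
}


def iter_updates(start_url, fetch):
    """Yield (id, updated) for every entry, following rel="next" links."""
    url = start_url
    seen = set()  # guard against a feed whose pages link in a loop
    while url and url not in seen:
        seen.add(url)
        feed = ET.fromstring(fetch(url))
        for entry in feed.findall(ATOM + "entry"):
            yield (entry.findtext(ATOM + "id"),
                   entry.findtext(ATOM + "updated"))
        # Look for the link to the next page; stop when there is none.
        url = None
        for link in feed.findall(ATOM + "link"):
            if link.get("rel") == "next":
                url = link.get("href")
                break


updates = list(iter_updates(
    "https://example.org/records/updates?page=1", PAGES.__getitem__))
for record_id, updated in updates:
    print(record_id, updated)
```

Passing the fetch function in as a parameter keeps the sketch runnable without a network; a real consumer would swap in an HTTP client and would probably remember the newest updated value it has seen so that later runs can stop early.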