As regular readers of the Open Knowledge Foundation blog will know, bibliographic metadata is a subject close to our heart (see e.g., here, here and here). Hence we were delighted to see today’s announcement that CERN Library are releasing their bibliographic metadata under an open license!

From the announcement:

Librarians are in general very favourable to the principles of Open Access, but surprisingly few libraries have so far set free the data they produce themselves. As one of the first scientific libraries in the world, the CERN Library offers now the bibliographic book records, held in its library catalog, to be freely downloaded by any third party. The records are provided under the Public Domain Data License, a license that permits colleagues around the world to reuse and upgrade the data for any purpose.

Jens Vigen, Head of the CERN Library, says: “Books should only be catalogued once. Currently the public purse pays for having the same book catalogued over and over again. Librarians should act as they preach: data sets created through public funding should be made freely available to anyone interested. Open Access is natural for us, here at CERN we believe in openness and reuse. There is a tremendous potential. By getting academic libraries worldwide involved in this movement, it will lead to a natural atmosphere of sharing and reusing bibliographic data in a rich landscape of so-called mash-up services, where most of the actors who will be involved, both among the users and the providers, will not even be library users or librarians. Our action is made in the spirit of the Berlin Declaration on Open Access to Knowledge in the Sciences and Humanities; bibliographic data belongs to the cultural heritage. All other signatories should align their policy accordingly.”

The data of CERN Library will be used by the Open Library Project to provide a webpage for every book and allow users to add content like table of contents, classifications and summaries.

For massive reuse of data, the data will be provided soon by an open Z39.50, SRU and OAI interface via biblios.net, a repository of open bibliographic data.

This is fantastic news - and we hope that other libraries and archives consider following suit and opening up their bibliographic metadata!

We’ve created a new CKAN package for the data at:

We’re very pleased to see that a large collection of German language digital texts has just been released under an open license.

Yesterday, it was announced that Wikimedia Germany, Creative Commons Germany and TextGrid are releasing a large collection of “culturally valuable” texts either in the public domain or under a CC-BY license, which is compliant with the Open Knowledge Definition.

From the press release:

The research group TextGrid recently obtained the texts of the online library zeno.org with financial support from the Federal Ministry of Education and Research (BMBF). This digital collection is the most comprehensive of its kind in the German-speaking areas and contains texts from the beginning of printing to the first decades of the 20th century.

TextGrid, Wikimedia Germany and Creative Commons Germany are now cooperating in order to make this collection of texts freely usable for the general public. Wikimedia will soon make the collection available with the assistance of TextGrid. Subsequent use of the texts will be possible without restriction if they are comprised of contents that are in the public domain (particularly in terms of the digitalized texts themselves). If additional data for providing access is included (bibliographic metadata, for example), it will be covered under the license CC-BY 3.0 de.

We were also interested to read the following comment from Dr. Heike Neuroth, TextGrid Project Manager at the Lower Saxony State- and University Library Göttingen:

The primary task of the Digital Humanities is no longer digitalization, as it was in the 90s, but instead the methodically innovative development of structured data sets. With this cooperation we will make access to this information possible not only to research communities but also to the general public.

At the Open Knowledge Foundation, we’re also very interested in new ways of analysing, visually representing and otherwise exploring digital humanities texts - from our annotation and text analysis tools in projects such as Open Shakespeare and Open Milton to our Working Group in the Humanities, which we’ve had on the backburner for a little while.

Many congratulations to Wikimedia Germany, Creative Commons Germany and TextGrid for the new release - and we look forward to the material going live online and learning more about what the collection contains!

Featured Project: MusicBrainz

November 27th, 2009


MusicBrainz is a user-maintained community music metadatabase. The MusicBrainz community collects and maintains data about recorded music releases such as artist name, release title and track listing. That data is re-used by music services across the web, including Amazon and Last.fm, as well as in Free and Open Source Software applications.

Robert Kaye is Executive Director of the MetaBrainz Foundation, the non-profit group which operates MusicBrainz. The Open Knowledge Foundation spoke with Robert about the history of the MusicBrainz project, about the role of community, about the way open licensing helps MusicBrainz work, and about the future he sees for music meta data.

This interview is the first in what we hope will become a regular spotlight on individual members of the Open Knowledge community.

The core MusicBrainz data is in the public domain, hence open in accordance with the Open Knowledge Definition. Further details are available at:


Open Knowledge Foundation: Can you give us a brief history of MusicBrainz? When was it started? Who was responsible? And what initially motivated the project?

Robert Kaye: A full history of the project is here: http://musicbrainz.org/doc/MusicBrainz_History

It was started in 1999 as the CD Index and then in 2000/2001 it became MusicBrainz. I’m the person who started MusicBrainz.

OKF: How has it developed? Has anything happened along the way that was completely unexpected?

RK: It developed better than I expected – there are many facets that I didn’t envision when I first started the project. For instance, I didn’t envision people getting quite so involved and passionate about the project. One example is a person in Oslo, Norway who has Asperger’s syndrome and has a hard time leading a normal life. This person will probably never hold down a regular job, but this person has made quite serious contributions to MusicBrainz. In a sense MusicBrainz gave this person an opportunity that was never before available to them.

I didn’t see that one coming at all!

OKF: Has anything turned out differently to the way you expected it to?

RK: Yes. At first I thought automatic collection of data could make the job for people a lot easier. But it turns out that automatic collection is the exact wrong thing to do – a certain subset of people (those who contribute to MusicBrainz and Wikipedia) do not trust automatically collected data and we threw out most of the automatically collected data in favour of human verified data. I also considered importing the FreeDB data at some point in order to bootstrap MusicBrainz, but the community made it clear that they would revolt if I did that. They felt that they didn’t want to clean up that giant mess – instead we opted to build a clean database one step at a time.

There are tons of things that I didn’t expect, but many of those are technical in nature and not so relevant to open data.

OKF: What’s been your biggest challenge and what has been your biggest moment of pride?

RK: The biggest challenge has been dealing with “Poisonous People”. People (developers, interestingly enough) who are well intentioned but don’t always get along with everyone else. These people divide the community as they rally support for their views and that hinders forward progress. When a community is divided it’s hard to get anything done, because everyone is bickering and sniping at each other.

In the summer of 2006 we had this problem and it nearly ripped the project in half – that was clearly a low point for MusicBrainz.

The biggest moment of pride? Getting the BBC on board and having the BBC use MusicBrainz data to organise their whole music play-out and music tracking system. MusicBrainz provides the metadata that gives BBC Music its structure. To see how the BBC integrated MusicBrainz in a publicly visible manner, see: http://www.bbc.co.uk/music/reviews/dw9x

OKF: How big is the community that contributes to MusicBrainz, and what do you think motivates people to contribute?

RK: We have 465,897 registered users and of those 1,385 were active in the last week. The MusicBrainz community fits into roughly three categories:

· The core people: These are the people who are hacking on MusicBrainz or editing profusely. MusicBrainz is their hobby, job or resume builder.

· Regular editors: People who love music more than the average person and want to make sure the data for their artists is clean and that their music collection is sparkling clean as well.

· Tagger users: People who use one of the tagging applications to clean up their music collection. These people tend to be a very transient group. They come, clean up their collection and leave. They may come back in a while to clean up new data in their collection. To them MB is a means to clean up their collection.

OKF: What is the role of the community inside MusicBrainz?

RK: The community is critical for MusicBrainz. If people stopped editing the data in MusicBrainz, the data would stop changing and the business model would instantly vaporise. The software that powers MusicBrainz is worthless without the data. The data is worthless without the people behind it. Given that, we need to make sure that we don’t alienate our contributors - we can’t afford any missteps that would cause the community to lose faith in MusicBrainz.

OKF: Can you give us some stats on the material? Releases? Updates?

RK: All the stats you could possibly ever need are here:

http://musicbrainz.org/show/stats/

OKF: Where do you get your data from? Do you build on any other open material?

RK: It’s all user-curated and the users decide what sources they want to use to verify information. I know our users use Amazon quite a bit to glean information and Discogs when Discogs has the data they are looking for. The only open data source we use is FreeDB.

OKF: Where can you download the data? What format is the data in?

RK: You can download the data here: http://musicbrainz.org/doc/Database

It’s in Postgres data dump format. Normally you would use our open source software to load the data into a Postgres database. The page above also talks about our Live Data Feed, which is how we keep our customers updated on an hourly basis. Commercial use of this service requires a license from us, which is how we make ends meet at the foundation.

OKF: Why did you decide to use an open licence? What are the advantages of using an open licence from your point of view?

RK: Mainly I was upset that CDDB, which used to be freely downloadable, was taken private by Escient (now (dis)GraceNote). I typed in several hundred CDs and now someone else was making money off my work. I was pissed. At the time I was getting into open source and I saw that open data would be a critical play in the future - a future I perceived to be off in a number of months - I wasn’t ready to wait a decade for it to be really ready.

The vision I saw included a well linked data set with stable identifiers that didn’t change so that the data set could be cross-linked in a stable manner. What I saw was the “Semantic Web” or what we’re now calling “linked data” and it was clear to me that in order to play in this field you couldn’t make a walled garden around your data. If you ever hoped that others would link to your data, it was clear to me that I had to bend over backwards in order to make this data available to everyone. I also saw Linux growing steadily and slowly making in-roads against Microsoft – how can Microsoft compete with free AND high quality? It would be hard. We’re seeing the same happening with Wikipedia and classic encyclopaedias – Microsoft recently shelved Encarta, a sign that Wikipedia is edging out some of the smaller players.

This vision was the easy part. Then the hard work started - what licence should I use? The only licence out there was the Open Content licence, which was largely unproven. And it didn’t address the issues that faced data very well. In an email conversation with Richard Stallman he suggested that I use the GFDL… Compared to the GPL the GFDL is a horrid abomination! (I’m still trying to find the front matter and the appendix in my database tables!!) Mr. Stallman also brusquely informed me that the text of the GPL was *NOT* available under the GPL or any licence for that fact. He specifically forbade me from using his text to create a better, more data oriented licence. Not surprisingly, I stopped being a fan of RMS from that point on.

I ended up having many conversations about licences and was quite frustrated… then I got a call from the Creative Commons! They were about to launch and were looking for projects who would adopt their licences before they went public. I read the licences and was immediately jazzed about them. I had already been educating myself about the Public Domain and the Feist vs Rural Telephone company case and thought that my core data needed to be in the Public Domain. Now the CC provided a nice and clean method for doing this – I adopted the licenses clear across the board.

The non-commercial licence was actually the magic that enabled me to found the MetaBrainz Foundation! I was convinced to NOT create a legal entity for MusicBrainz until I could see a business model emerge that didn’t hinge on begging. My concept was to allow free access to the core data, but play gatekeeper on the data and control how quickly and how conveniently someone could get access to the data. By allowing the public non-commercial unfettered access to the data, I would win over the Open Source communities, which we have. But by taking money for timely and convenient access, I could fund the foundation and in turn fund my own paycheque. This has been working well so far – while we’re not making oodles of money (especially in this economic climate), we’ve been in the black year over year since inception. I never resort to begging and yet I can license public domain data to make ends meet.

What’s even more trippy about this is that I may have created the first 100% profit non-profit business model. Since the operations of the project are for the public at large, we make this as cheap as possible. And making the data available to the public is part of that deal – it is written into our IRS charter. When a commercial customer comes along, they tap into our live data feed, which they pick up from our FTP site, which is actually operated by the Oregon State University with support from Google. In other words, the incremental cost for adding a new customer is ZERO. After I sign the contract, I do nothing but cash the cheques. It’s a rather odd arrangement, but the IRS hasn’t given me grief and my community and customers are happy.

OKF: Where has MusicBrainz been re-used?

RK: A roster of our paying customers is here:

http://metabrainz.org/about/customers.html

There have also been quite a few research projects and university papers. The Solr 1.4: Enterprise Search Server book has been written using the MusicBrainz data as examples. There are dozens of start-ups that are using our data and if they ever make it past the seed stage they will become customers of MetaBrainz. Plus the Open Source world uses our data and it can be hard to see who makes use of the data – so there are many places that use our data without us ever knowing about it.

OKF: What are your plans for MusicBrainz in the future? In an ideal world where do you hope it will go?

RK: We want to support classical music much better than we do now. We’re in the process of creating a new schema that allows us to finish support for classical music. There are lots and lots of ways in which we can improve the experience for our users and make it easier for everyone to contribute. We’re also keen on getting music information from the whole world over – not just those who can read English.

Then I want to make sure that MusicBrainz gets more connected to the outside world. I want reviews and concert information to be one click away. And as applications like Google Maps get more impressive, I want to provide the data about musicians who are playing at a given venue when you walk past that venue. There are many more places where music metadata could be used and I want to make sure that MusicBrainz gets into all of these nooks and crannies.

OKF: Where do you need help?

RK: We need help programming the next generation of MusicBrainz. We’re always looking for people who can code Perl/Javascript and understand complex database schemas. Of course we’re also always looking for music fans, musicians to tell us about their music and labels to groom the data about their artists.

OKF: What can contributors do?

RK: We need help editing the data, cleaning it up and throwing out duplicate data. We also need documentation written and help answering emails from people who have questions.

OKF: Is there any work you need done that volunteers could help with?

RK: Tons! We’re driven by volunteers!

OKF: How can people get involved?

RK: Our homepage is suited just for this reason: http://musicbrainz.org/

Start tagging your music collection with MusicBrainz’ Picard and then spot problems in our data and help us fix them!

On the first of January every year works from around the world fall out of copyright and into the public domain. But, how do we know which works fall into the public domain when?

In previous years there have been blog posts about this - for example, see the Everybody’s Libraries posts from 1st January 2008 and 1st January 2009. In preparation for Public Domain Day 2010, we decided to prepare our own list of authors whose works fall into the public domain this coming January.

You can find the list of 563 authors on our Public Domain Works project, which is a simple registry of artistic works that are in the public domain:

The list can be sorted by author surname, birth date, death date and number of works by clicking on the relevant headings. Notable authors include the poets William Butler Yeats and Osip Mandelstam, as well as the father of psychoanalysis, Sigmund Freud.

While this starts to answer the question What works fall into the public domain this year?, the calculation is still very basic and we hope to improve the list in two main ways:

  1. The results above are based on a crude life+70 computation of copyright expiry (and associated entry into the public domain). This is almost certainly wrong for some jurisdictions and for some types of work. The Public Domain Calculators project is actively working to produce jurisdiction-specific algorithms for precisely determining public domain status. Once complete this effort will be integrated into the calculations presented here. If you’d like to help out with a calculator in your jurisdiction, please get in touch!
  2. The list is not comprehensive - there are many authors, composers, artists and other creators whom we are missing. To improve the list we need better data about authors and works - whether from library catalogues, or other archives of information about creative works. If you know where we might be able to get hold of such data, we’d love to hear from you!
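To make the crude life+70 rule concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions rather than the code behind the list: the flat 70-year term, the function names and the third author entry are all made up for the example, and a real calculator would need jurisdiction-specific rules.

```python
from dataclasses import dataclass

# Assumption: a flat life+70 term everywhere, as in the crude computation above.
TERM_YEARS = 70


@dataclass(frozen=True)
class Author:
    name: str
    death_year: int


def public_domain_year(death_year: int, term_years: int = TERM_YEARS) -> int:
    """Return the year on whose 1 January the author's works become free.

    Protection is assumed to run to the end of the calendar year
    death_year + term_years, so the works enter the public domain on
    the following 1 January.
    """
    return death_year + term_years + 1


def entering_public_domain(authors, year):
    """Select the authors whose works enter the public domain in `year`."""
    return [a for a in authors if public_domain_year(a.death_year) == year]


authors = [
    Author("William Butler Yeats", 1939),
    Author("Sigmund Freud", 1939),
    Author("Hypothetical Later Author", 1950),  # made-up entry for contrast
]

for author in entering_public_domain(authors, 2010):
    print(author.name)
```

Under these assumptions an author who died in 1939 is protected until the end of 2009, which is why such authors appear on a list for 1st January 2010.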

If you’d like to participate in the Public Domain Works project, please join our pd-discuss list and introduce yourself!

Our Open Data and Semantic Web workshop is coming up next Friday 13th November in London, kindly sponsored by Talis.

In preparation for the workshop, we have started a Linking Open Data group on CKAN, our open-source registry of open data, based on the new group feature we announced last week. We currently have 83 Linked Data packages listed, which you can see at:

If you know of any datasets we should add, please consider adding them to CKAN! If you’d like to become an administrator for the LOD group, please get in touch.

We have also converted all of the CKAN data to RDF and loaded it onto Talis’s Connected Commons platform (which was launched earlier this year at OKCon 2009). This can be queried at:


There have recently been several posts about what features are desirable in government data catalogues.

The Sunlight Foundation recently announced they are planning to build on data.gov to allow “community participation so that people can submit their own data sources” (including support for adding data that is not open such as data with noncommercial restrictions).

The City of San Francisco’s Open SF project are working on CivicDB, which is an open-source platform for helping people to access government data.

They’ve also been working on a list of Data Consumer Requirements - which includes things like:

  • Downloadable data sets should be available for regular time periods (i.e., by month, year).
  • Proprietary data formats, and non-malleable formats should be avoided wherever possible (i.e., Excel, PDF, etc.).

In addition to data.gov (which was launched back in May), the last few months have seen the launch of several other prominent catalogues for government data, including:

  • New Zealand’s Opengovt.org.nz
    • .. an attempt to collate the many different datasets available through the New Zealand Government Departments and Local Bodies
  • The USA’s IT Dashboard
    • The IT Dashboard provides the public with an online window into the details of Federal information technology investments and provides users with the ability to track the progress of investments over time.

Many of the issues being discussed are things we’ve thought about in relation to CKAN - our registry of (collections of) open data and open content.

Here are a few suggestions for those building catalogues for (open) government data based on our experience developing CKAN:

  • Make the catalogue itself open!
    • By using a legal tool such as CC0, the PDDL or the ODbL to make your data catalogue’s metadata open (even if some of the data it describes isn’t), you ensure that the fruits of your hard work can be integrated with that of others! Also, by making the code open source you allow others to re-use and build on it.
    • All of CKAN’s code and data is available under an open license - which lets other projects like Infochimps use it.
  • Let others download the catalogue data in bulk (not just via an API)
    • Create a regular dump of the metadata in your catalogue describing the data - so that your work can be built upon.
    • CKAN’s data dump is updated daily.
  • Include information on how to get the data, and how it can be used
    • In addition to basic details such as title and description, it should be made clear how to get the data, and how it can be used. If it is in the public domain make this explicit (or use a legal tool, such as CC0 or the PDDL). If it is available under the terms of a license - make this explicit and include the text or a link.
    • Each entry on CKAN includes a license field, which includes a drop down menu for common open content/data licenses and tools, as well as licenses for Free/Open Source Software. There is also a free text field for any further details.
  • Make it versioned!
    • If you are going to allow people to add items to or edit the catalogue you might consider making it versioned like a wiki. This allows others to see changes that have been made to each item - which can be useful for reversing and otherwise keeping track of user contributions.
    • You can see the history of changes for each item on CKAN. Furthermore, CKAN’s code (and its domain model) is versioned.
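The last two suggestions (an explicit license field and wiki-style versioning) can be sketched together. The following Python is a toy model only: the class names, fields and the revert behaviour are assumptions made for illustration and do not reflect how CKAN itself is implemented.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Revision:
    number: int           # 1-based revision number
    editor: str           # who made the change
    data: Dict[str, str]  # full snapshot of the entry's metadata


@dataclass
class CatalogueEntry:
    name: str
    revisions: List[Revision] = field(default_factory=list)

    def edit(self, editor: str, **changes: str) -> Revision:
        """Record a new revision containing the previous metadata plus changes."""
        snapshot = dict(self.revisions[-1].data) if self.revisions else {}
        snapshot.update(changes)
        revision = Revision(len(self.revisions) + 1, editor, snapshot)
        self.revisions.append(revision)
        return revision

    def latest(self) -> Dict[str, str]:
        """Return a copy of the current metadata."""
        return dict(self.revisions[-1].data)

    def revert(self, editor: str, number: int) -> Revision:
        """Restore an earlier snapshot as a new revision; history is never deleted."""
        return self.edit(editor, **self.revisions[number - 1].data)


entry = CatalogueEntry("example-dataset")  # hypothetical catalogue item
entry.edit("alice", title="Example dataset", license="PDDL")
entry.edit("bob", license="unknown")       # an unhelpful edit
entry.revert("alice", 1)                   # undo it, keeping the full history

print(entry.latest()["license"])
print([(r.number, r.editor) for r in entry.revisions])
```

Storing a full snapshot per revision is wasteful for large catalogues, but it keeps the sketch short and makes reverting a matter of re-applying an old snapshot.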

What features do you think are important in catalogues for open government data? We’d love to hear what you think!


We’re pleased to announce that slides, audio and photos from OKCon 2009 are now available at:

Speakers included:

If you have any pictures, notes, slides, and so on, or if you’ve blogged about the event, please email us at info [at] okfn [dot] org and we’ll post a link!


The Community College Open Textbook Project, California Digital Marketplace, and the Open Knowledge Foundation invite those with an interest in open textbooks to a meeting on Wednesday, May 20th at 1330-1530 PDT (2130-2330 GMT or 2230-0030 CET).

The meeting will be primarily focused on metadata, tagging, interoperability issues, and repository efforts for open textbooks.

If you are in California, you can attend in person at the Foothill College Campus - otherwise you can participate virtually. Further details, including how to participate and an agenda, can be found on the wiki at:

We’re pleased to announce that Talis launched their Connected Commons for open data at OKCon 2009 on Saturday!

The Talis Connected Commons scheme is intended to directly support the publishing and reuse of Linked Data in the public domain by removing the costs associated with those activities.

The scheme is intended to support a wide range of different forms of data publishing. For example scientific researchers seeking to share their research data; dissemination of public domain data from a variety of different charitable, public sector or volunteer organizations; open data enthusiasts compiling data sets to be shared with the web community.

Specifically, they are offering (for datasets under the PDDL or CC0):

  • Free hosting of up to 50 million RDF triples and 10Gb of content,
  • Access to data access services that operate on that data, including data retrieval and text search,
  • Free access to a public SPARQL endpoint for each dataset.
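For readers who have not used a SPARQL endpoint before, the sketch below shows how a query is typically sent over HTTP: the SPARQL protocol accepts the query text in a `query` parameter of a GET request. The endpoint URL here is a placeholder, not a real Talis address, and the code only builds the request URL without contacting any server.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: substitute the address published for your dataset.
ENDPOINT = "http://example.org/datasets/my-dataset/sparql"

# A deliberately generic query: list any ten triples in the dataset.
QUERY = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10
""".strip()


def sparql_get_url(endpoint: str, query: str) -> str:
    """Build a SPARQL protocol GET URL with the query safely percent-encoded."""
    return endpoint + "?" + urlencode({"query": query})


url = sparql_get_url(ENDPOINT, QUERY)
print(url)
```

Fetching the resulting URL with any HTTP client would return the result set, in whatever result formats the endpoint supports.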

For more information you can see:

We were also pleased to notice in their FAQ:

We will also be encouraging dataset owners to register the data with CKAN — the Comprehensive Knowledge Archive Network, this will provide another route for open data hackers to find the data.

Great news for open data, and for those looking for hosting and services to do things with it!

You may have heard that lcsh.info - which explored how Library of Congress Subject Headings could be represented as a Semantic Web application - was closed down last month.

The good news is that there are now two new projects publishing library-related open data:

The first, ICONCLASS, is “an experimental service that makes the ICONCLASS Iconographic Classification system available as linked-data using the SKOS vocabulary”.

The second, from the University of Huddersfield Library, publishes circulation and recommendation data under both CC0 and the PDDL. Dave Pattern writes:

In total, there’s data for over 80,000 titles derived from a pool of just under 3 million circulation transactions spanning a 13 year period.

Data is comprised of two parts:

  1. Circulation Data. This breaks down the loans by year, by academic school, and by individual academic courses. This data will primarily be of interest to other academic libraries. UK academic libraries may be able to directly compare borrowing by matching up their courses against ours (using the UCAS course codes).

  2. Recommendation Data. This is the data which drives the “people who borrowed this, also borrowed…” suggestions in our OPAC. This data had previously been exposed as a web service with a non-commercial licence, but is now freely available for you to download. We’ve also included data about the number of times the suggested title was borrowed before, at the same time, or afterwards.
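As a rough idea of how “people who borrowed this, also borrowed…” suggestions can be derived from circulation records, here is a small co-occurrence count in Python. It is not Huddersfield’s method: the loan records are invented, the function names are hypothetical, and the sketch ignores whether a suggested title was borrowed before, at the same time, or afterwards.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Invented loan records: (borrower_id, title_id).
loans = [
    ("b1", "T1"), ("b1", "T2"), ("b1", "T3"),
    ("b2", "T1"), ("b2", "T2"),
    ("b3", "T2"), ("b3", "T3"),
]


def also_borrowed(loans):
    """Count, for each title, how many borrowers also borrowed each other title."""
    titles_by_borrower = defaultdict(set)
    for borrower, title in loans:
        titles_by_borrower[borrower].add(title)

    counts = defaultdict(Counter)
    for titles in titles_by_borrower.values():
        # Every ordered pair of distinct titles borrowed by the same person.
        for first, second in permutations(sorted(titles), 2):
            counts[first][second] += 1
    return counts


def recommend(counts, title, n=3):
    """Return up to n titles, most co-borrowed first, ties broken by title id."""
    ranked = sorted(counts.get(title, Counter()).items(),
                    key=lambda item: (-item[1], item[0]))
    return [other for other, _ in ranked[:n]]


counts = also_borrowed(loans)
print(recommend(counts, "T1"))
```

With real data one would usually also discard pairs seen only a handful of times, both to reduce noise and to avoid revealing the borrowing habits of individuals.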