
What We Hope the Digital Public Library of America Will Become

Jonathan Gray - April 17, 2013 in Bibliographic, Featured, Free Culture, Open Content, Open GLAM, Open Humanities, Policy, Public Domain

Tomorrow is the official launch date for the Digital Public Library of America (DPLA).

If you’ve been following it, you’ll know that it has the long-term aim of realising “a large-scale digital public library that will make the cultural and scientific record available to all”.

More specifically, Robert Darnton, Director of the Harvard University Library and one of the DPLA’s leading advocates to date, recently wrote in the New York Review of Books that the DPLA aims to:

make the holdings of America’s research libraries, archives, and museums available to all Americans—and eventually to everyone in the world—online and free of charge

What will this practically mean? How will the DPLA translate this broad mission into action? And to what extent will they be aligned with other initiatives to encourage cultural heritage institutions to open up their holdings, like our own OpenGLAM or Wikimedia’s GLAM-WIKI?

Here are a few of our thoughts on what we hope the DPLA will become.

A force for open metadata

The DPLA is initially focusing its efforts on making existing digital collections from across the US searchable and browsable from a single website.

Much like Europe’s digital library, Europeana, this will involve collecting information about works from a variety of institutions and linking to digital copies of these works that are spread across the web. A super-catalogue, if you will, that includes information about and links to copies of all the things in all the other catalogues.

Happily, we’ve already heard that the DPLA is releasing all of this data about cultural works that they will be collecting using the CC0 legal tool – meaning that anyone can use, share or build on this information without restriction.

We hope they continue to proactively encourage institutions to explicitly open up metadata about their works, and to release this as machine-readable raw data.

Back in 2007, we – along with the late Aaron Swartz – urged the Library of Congress to play a leading role in opening up information about cultural works. So we’re pleased that it looks like DPLA could take on the mantle.

But what about the digital copies themselves?

A force for an open digital public domain

The DPLA has spoken about using fair use provisions to increase access to copyrighted materials, and has even intimated that they might want to try to change or challenge the state of the law to grant further exceptions or limitations to copyright for educational or noncommercial purposes (trying to succeed where Google Books failed). All of this is highly laudable.

But what about works which have fallen out of copyright and entered the public domain?

Just as they are doing with metadata about works, we hope that the DPLA takes a principled approach to digital copies of works which have entered the public domain, encouraging institutions to publish these without legal or technical restrictions.

We hope they become proactive evangelists for a digital public domain which is open as in the Open Definition, meaning that digital copies of books, paintings, recordings, films and other artefacts are free for anyone to use and share – without restrictive clickwrap agreements, digital rights management technologies or digital watermarks to impose ownership and inhibit further use or sharing.

The Europeana Public Domain Charter, in part based on and inspired by the Public Domain Manifesto, might serve as a model here. In particular, the DPLA might take inspiration from the following sections:

What is in the Public Domain needs to remain in the Public Domain. Exclusive control over Public Domain works cannot be re-established by claiming exclusive rights in technical reproductions of the works, or by using technical and/or contractual measures to limit access to technical reproductions of such works. Works that are in the Public Domain in analogue form continue to be in the Public Domain once they have been digitised.

The lawful user of a digital copy of a Public Domain work should be free to (re-)use, copy and modify the work. The Public Domain status of a work guarantees the right to re-use, modify and make reproductions, and this must not be limited through technical and/or contractual measures. When a work has entered the Public Domain there is no longer a legal basis to impose restrictions on the use of that work.

The DPLA could create their own principles or recommendations for the digital publication of public domain works (perhaps recommending legal tools like the Creative Commons Public Domain Mark) as well as ensuring that new content that they digitise is explicitly marked as open.

Speaking at our OpenGLAM US launch last month, Emily Gore, the DPLA’s Director for Content, said that this is definitely something that they’d be thinking about over the coming months. We hope they adopt a strong and principled position in favour of openness, and help to raise awareness amongst institutions and the general public about the importance of a digital public domain which is open for everyone.

A force for collaboration around the cultural commons

Open knowledge isn’t just about stuff being able to freely move around on networks of computers and devices. It is also about people.

We think there is a significant opportunity to involve students, scholars, artists, developers, designers and the general public in the curation and re-presentation of our cultural and historical past.

Rather than just having vast pools of information about works from US collections, wouldn’t it be great if there were hand-picked anthologies of works by Emerson or Dickinson curated by leading scholars? Or collections of songs or paintings relating to a specific region, chosen by knowledgeable local historians who know about allusions and references that others might miss?

An ‘open by default’ approach would enable use and engagement with digital content that breathes life into it that it might not otherwise have – from new useful and interesting websites, mobile applications or digital humanities projects, to creative remixing or screenings of out-of-copyright films with new live soundtracks (like Air’s magical reworking of Georges Méliès’s 1902 film Le Voyage Dans La Lune).

We hope that the DPLA takes a proactive approach to encouraging the use of the digital material that it federates, to ensure that it is as impactful and valuable to as many people as possible.

Communia condemns the privatisation of the Public Domain by the BnF

Primavera De Filippi - January 21, 2013 in Bibliographic, COMMUNIA, OK France, Public Domain

Last week the Bibliothèque nationale de France (BnF) concluded two new agreements with private companies to digitize over 70,000 old books, 200,000 sound recordings and other documents belonging (either partially or as a whole) to the public domain. While these public-private partnerships enable the digitization of these works, they also contain 10-year exclusivity clauses allowing the private companies carrying out the digitization to commercialize the digitized documents. During this period only a limited number of these works may be offered online by the BnF.

Together with La Quadrature du Net, Framasoft, SavoirsCom1 and the Open Knowledge Foundation France, COMMUNIA has issued a statement (in French) to express our profound disagreement with the terms of these partnerships, which restrict digital access to an important part of Europe’s cultural heritage. The agreements that the BnF has entered into effectively take the works being digitized out of the public domain for the next 10 years.

The value of the public domain lies in the free dissemination of knowledge and the ability for everyone to access and create new works based on previous works. Yet, instead of taking advantage of the opportunities offered by digitization, the exclusivity of these agreements will force public bodies, such as research institutions or university libraries, to purchase digital content that belongs to the common cultural heritage.

As such, these partnerships constitute a commodification of the public domain by contractual means. COMMUNIA, of which the OKFN is a partner, has been critical of such arrangements from the start (see the Public Domain Manifesto and Policy Recommendations 4 & 5). More interestingly, these agreements are also in direct contradiction with the Public Domain Charter published by the Europeana Foundation in 2011. In this context it is interesting to note that the director of the Bibliothèque nationale de France currently serves as the chairman of the Europeana Foundation’s Executive Board.

Goodbye Aaron Swartz – and Long Live Your Legacy

Jonathan Gray - January 14, 2013 in Access to Information, Bibliographic, Campaigning, Featured, News, Open Access, Open Data, Open Government Data, Policy

Aaron Swartz, coder, writer, archivist and activist, took his own life in New York on Friday.

Aaron worked tirelessly to open up and maximise the societal impact of information in three areas which are central to our work at the Foundation: public domain cultural works, public sector information, and open access to publicly funded research.

He was one of the original architects behind the Internet Archive’s Open Library project, which aims to create ‘one web page for every book’. While he was there we compared notes about trying to automatically estimate which works are in the public domain in different countries around the world.

This was part of a broader vision to enable public access to the public domain, and to ensure that digitisation initiatives result in open digital copies of public domain works that everyone is free to use and enjoy, not just copies owned and protected by large corporations who might sell or restrict access to the world’s heritage.

Around this time Aaron and I met in San Francisco to co-draft a petition to the Library of Congress to encourage them to take a leading role in opening up data from the world’s libraries and memory institutions. This was several years before a wave of institutions started explicitly opening up data about their holdings.

We remained in contact regarding his work on open government data in the US. Aaron was involved in drafting the highly influential 8 principles for open government data. We wanted to try to better coordinate developments on either side of the Atlantic.

Later he was in the papers for downloading around a fifth of the US government’s huge Public Access to Court Electronic Records (PACER) system, around 780 gigabytes, and releasing it to the public for free (access was usually charged by the page) – which earned him an FBI file.

In his 2008 Guerilla Open Access Manifesto Aaron argued that “the world’s entire scientific and cultural heritage, published over centuries in books and journals, is increasingly being digitized and locked up by a handful of private corporations” and, “in the grand tradition of civil disobedience”, urged internet users to “fight back”:

We need to take information, wherever it is stored, make our copies and share them with the world. We need to take stuff that’s out of copyright and add it to the archive. We need to buy secret databases and put them on the Web. We need to download scientific journals and upload them to file sharing networks. We need to fight for Guerilla Open Access.

In 2010 he founded Demand Progress, which helped to mobilise over a million people in response to proposed legislation like the Combating Online Infringement and Counterfeits Act (COICA).

In 2011 he again hit the headlines when he was arrested for downloading roughly 4 million subscription-only academic articles from JSTOR by placing a laptop in a computer cupboard at MIT and using this to gain unauthorised access to the JSTOR service. The prosecution alleged that he intended to make these articles freely available on the web.

Last September the US Federal Government raised the felony count from four to thirteen, which meant that Aaron was potentially facing a total of 50+ years and a fine in the area of $4 million for his actions. His family suggested that the case was a factor in his death – and blamed the Massachusetts U.S. Attorney’s office for “intimidation and prosecutorial overreach” and MIT for “refus[ing] to stand up for Aaron and its own community’s most cherished principles”. The president of MIT has just announced that he has ordered an investigation into their role in Aaron’s prosecution.

As Peter Eckersley from the Electronic Frontier Foundation commented on Saturday:

While his methods were provocative, the goal that Aaron died fighting for — freeing the publicly-funded scientific literature from a publishing system that makes it inaccessible to most of those who paid for it — is one that we should all support.

While Aaron was deeply involved in all kinds of technical, scholarly and organising activities to promote an open digital commons and an open internet – from helping to develop RSS 1.0 and Markdown, to early sketches of the semantic web with some of its pioneers and work on the first technical implementations of the Creative Commons licenses – he also never lost sight of the bigger picture, of what it was all for. He was a talented coder and knew how to take a principled stance, but he was never one to get lost in detail or dogma. From his writings about how data-driven transparency initiatives are not enough to effect change in themselves, to his guide to developing software that addresses real needs, he was always aware of the fact that using information, technology and the internet to change the world is not easy, and requires graft, skill, scrutiny, critical reflection and taking risks.

Aaron’s passing is a tremendously sad and significant loss. Long live his legacy.

To find out more about Aaron’s life and works, you can look at his writings and the memorial site set up by his family. You can also read tributes from Tim Berners-Lee, Cory Doctorow, Brewster Kahle, Lawrence Lessig, and Erik Moeller, and read obituaries and news articles on the BBC, the Economist, Forbes, Gigaom, the Guardian, the Huffington Post, the New York Times, The New Yorker, The Observer, Techdirt, The Telegraph, Vice and Wired. In tribute, hundreds of academics have started tweeting links to their research papers using the hashtag #pdftribute. The Internet Archive has started an Aaron Swartz Collection.

The Digital Public Library of America moving forward

Kenny Whitebloom - November 6, 2012 in Bibliographic, External, Open Content, Open Data, Open GLAM

A fuller version of this post is available on the Open GLAM blog

The Digital Public Library of America (DPLA) is an ambitious project to build a national digital library platform for the United States that will make the cultural and scientific record available, free to all Americans. Hosted by the Berkman Center for Internet & Society at Harvard University, the DPLA is an international community of over 1,200 volunteers and participants from public and research libraries, academia, all levels of government, publishing, cultural organizations, the creative community, and private industry devoted to building a free, open, and growing national resource.

Here’s an outline of some of the key developments in the DPLA planning initiative. For more information on the Digital Public Library of America, including ways in which you can participate, please visit the DPLA website.


In the fall of 2012, the DPLA received funding from the National Endowment for the Humanities, the Institute of Museum and Library Services, and the Knight Foundation to support our Digital Hubs Pilot Project. This funding enabled us to develop the DPLA’s content infrastructure, including implementation of state and regional digital service pilot projects. Under the Hubs Pilot, the DPLA plans to connect existing state infrastructure to create a national system of state (or in some cases, regional) service hubs.

The service hubs identified for the pilot are:

  • Mountain West Digital Library (Utah, Nevada and Arizona)
  • Digital Commonwealth (Massachusetts)
  • Digital Library of Georgia
  • Kentucky Digital Library
  • Minnesota Digital Library
  • South Carolina Digital Library

In addition to these service hubs, organizations with large digital collections that are going to make their collections available via the DPLA will become content hubs. We have identified the National Archives and Records Administration, the Smithsonian Institution, and Harvard University as some of the first potential content hubs in the Digital Hubs Pilot Project.

Here’s our director for content, Emily Gore, to give you a full overview:

Technical Development

The technical development of the Digital Public Library of America is being conducted in a series of stages. The first stage (December 2011-April 2012) involved the initial development of a back-end metadata platform. Built on open source code, the platform provides information and services openly and to all without restriction.

We’re now on stage two: integrating continued development of the back-end platform, complete with open APIs, with new work on a prototype front end. It’s important to note that this front-end will serve as a gesture toward the possibilities of a fully built-out DPLA, providing but one interface for users to interact with the millions of records contained in the DPLA platform.

Development of the back-end platform — conducted publicly, with all code published on GitHub under a GNU Affero General Public License — continues so that others can develop additional user interfaces and means of using the data and metadata in the DPLA over time; this openness remains a key design principle for the project overall.


We’ve been hosting a whole load of events, from our large public events like the DPLA Midwest last month in Chicago, to smaller more intimate hackathons. These events have brought together a wide range of stakeholders — librarians, technologists, creators, students, government leaders, and others – and have proved exciting and fruitful moments in driving the project forward.

On November 8-9, 2012, the DPLA will convene its first “Appfest” Hackathon at the Chattanooga Public Library in Chattanooga, TN. The Appfest is an informal, open call for both ideas and functional examples of creative and engaging ways to use the content and metadata in the DPLA back-end platform. We’re looking for web and mobile apps, data visualization hacks, dashboard widgets that might spice up an end-user’s homepage, or a medley of all of these. There are no strict boundaries on the types of submissions accepted, except that they be open source. You can check out some of the apps that might be built at the upcoming hackathon on the Appfest wiki page.

The DPLA remains an extremely ambitious project, and we encourage anyone with an interest in open knowledge and the democratization of information to participate in one form or another. If you have any questions about the project or ways to get involved, please feel free to email me at kwhitebloom[at]

#OpenDataEDB 3

Naomi Lillie - September 14, 2012 in Bibliographic, Events, Join us, Linked Open Data, Meetups, OKScotland, Open Data, Open GLAM, Open Government Data, Open Knowledge Foundation

Amidst the kerfuffle and cacophony of the Fringe Festival packing up for another year, the Edinburgh contingent came together again to meet, greet, present and argue all aspects of Open Data and Knowledge.

OKFN Meet-ups are friendly and informal evenings for people to get together to share and debate all areas of openness. Depending on the number of people on a given evening, we have presentations and/or round-table discussions about Open Knowledge and Open Data – from politics and philosophy to the practicalities of theory and practice. We have had two previous events (see here for the ‘launch’ write-up and here for the invitation to the second instalment); this time we were kindly hosted by the Informatics Forum, and the weather stayed fine enough to explore the roof terrace (complete with vegetable garden, gizmos to record wind-speed and weather, a view across the city to Arthur’s Seat and even a blue moon).

Around 20 of us gathered together and presentations were given by the following people:

  • James Baster – Open Tech Calendar: an introduction to this early-stage project to bring tech meet-ups together, talk about the different ways we are trying to be open and ask for feedback and help;
  • Ewan Klein – a short overview of business models for Open Data, including for government bodies;
  • Gordon Dunsire – library standards and linked data;
  • Gill Hamilton – National Library of Scotland’s perspective of library standards and open data;
  • Bob Kerr – State of the Map Scotland (see here for Bob’s featured OKFN blog post);
  • Naomi Lillie – OKFN as part of the Scottish Open effort.

What struck me overall was that everybody already knows each other… As well as cross-over in the talks, I kept trying to introduce people who would exclaim, “Ah yes! How was the holiday / conference / wedding?” or similar. This was quite useful, though, as it emphasised the point I made in my talk: OKFN doesn’t need to start anything in Scotland, as efforts towards Open are already ongoing and to great effect; we just want to provide support and possibly a brand under which these activities can be coordinated and promoted. With this in mind, we are going to look into a Scotland OKFN group as soon as things settle down again after OKFest – keep your eyes open for updates to follow!

To keep up-to-date with #OpenDataEDB and similar events, with the above and other interesting folks, and with the emerging Scotland OKFN group:

JISC Open Biblio 2 project – final report

Naomi Lillie - August 23, 2012 in Bibliographic, OKI Projects, Open GLAM, WG Open Bibliographic Data, Working Groups

This is cross-posted from

Following on from the success of the first JISC Open Bibliography project we have now completed a further year of development and advocacy as part of the JISC Discovery programme.

Our stated aims at the beginning of the second year of development were to show our community (namely all those interested in furthering the cause of Open via bibliographic data, including coders, academics, and those with an interest in supporting Galleries, Libraries, Archives and Museums) what we are missing if we do not commit to Open Bibliography, and to show that Open Bibliography is a fundamental requirement of a community committed to the discovery and dissemination of ideas. We intended to do this by demonstrating the value of carefully managed metadata collections of particular interest to individuals and small groups, thus realising the potential of the open access to large collections of metadata we now enjoy.

We have been successful overall in achieving our aims, and we present here a summary of our output to date (it may be useful to refer to this guide to terms).


BibServer and FacetView

The BibServer open source software package enables individuals and small groups to present their bibliographic collections easily online. BibServer utilises Elasticsearch in the background to index supplied records, and these are presented via the frontend using the FacetView JavaScript library. This use of JavaScript at the front end allows easy embedding of result displays on any web page.
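Conceptually, the faceted browsing that FacetView provides boils down to counting field values across the indexed records. A minimal Python sketch of that idea follows (illustrative only, not BibServer's actual code, and the sample records and field names are invented):

```python
from collections import Counter

def facet_counts(records, field):
    """Count how often each value of `field` occurs across a set of records.

    Multi-valued fields (lists) contribute one count per value, which is
    how facet engines typically treat fields such as authors or keywords.
    """
    counts = Counter()
    for record in records:
        values = record.get(field, [])
        if not isinstance(values, list):
            values = [values]
        counts.update(values)
    return counts

# Hypothetical bibliographic records, loosely BibJSON-shaped.
records = [
    {"title": "On Liberty", "year": "1859", "keywords": ["philosophy", "politics"]},
    {"title": "Walden", "year": "1854", "keywords": ["philosophy", "nature"]},
    {"title": "Leaves of Grass", "year": "1855", "keywords": ["poetry"]},
]

print(facet_counts(records, "keywords").most_common())
```

In BibServer itself this counting is delegated to Elasticsearch's aggregation facilities, with FacetView rendering the resulting counts as clickable filters.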

BibSoup and more demonstrations

Our own version of BibServer is up and running as BibSoup, where we have seen over 100 users sharing more than 14,000 records across over 60 collections. Some particularly interesting example collections are highlighted there.

Additionally, we have created some niche instances of BibServer for solving specific problems. For example, we have used BibServer to analyse and display collections specific to malaria researchers, as a demonstration of the extent of open access materials in the field. Further analysis allowed us to show where best to look for relevant materials that could be expected to be openly available, and to begin work on the concept of an Open Access Index for research.
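The sort of analysis described above can be sketched simply: given records flagged as open access or not, compute the open-access share per source to see where best to look. The field names and sample data below are invented for illustration; real metadata would need licence-based heuristics rather than a ready-made boolean flag:

```python
from collections import defaultdict

def open_access_share(records):
    """Per-journal fraction of records flagged open access.

    Assumes each record carries a `journal` name and an `open_access`
    boolean; both fields are illustrative assumptions.
    """
    totals = defaultdict(int)
    open_counts = defaultdict(int)
    for record in records:
        journal = record.get("journal", "unknown")
        totals[journal] += 1
        if record.get("open_access"):
            open_counts[journal] += 1
    return {journal: open_counts[journal] / totals[journal] for journal in totals}

records = [
    {"journal": "Malaria Journal", "open_access": True},
    {"journal": "Malaria Journal", "open_access": True},
    {"journal": "The Lancet", "open_access": False},
    {"journal": "The Lancet", "open_access": True},
]

print(open_access_share(records))
```

Aggregating such shares across a whole field's literature is, in essence, what an Open Access Index would report.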

Another example is the German National Bibliography, as provided by the German National Library, which is a work in progress (as explained by Adrian Pohl and Etienne Posthumus here). We are building similar collections for all other national bibliographies that we receive.


BibJSON

We have produced a simple convention, BibJSON, for presenting bibliographic records in JSON. This has seen good uptake so far, with additional use in the JISC TEXTUS project and in Total Impact, amongst others.
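For illustration, a BibJSON record is just plain JSON built around a handful of conventional keys. The sketch below assembles one in Python; the exact field set shown here follows the convention only loosely, so treat the details as an assumption rather than the authoritative spec:

```python
import json

def make_record(title, authors, year, identifiers=None):
    """Assemble a minimal BibJSON-style record.

    Field names follow the convention loosely (illustrative, not
    normative); check the spec before relying on them.
    """
    record = {
        "title": title,
        "author": [{"name": name} for name in authors],
        "year": str(year),
    }
    if identifiers:
        record["identifier"] = [{"type": t, "id": i} for t, i in identifiers]
    return record

record = make_record(
    "On the Origin of Species",
    ["Charles Darwin"],
    1859,
    identifiers=[("isbn", "978-0451529060")],
)

# Because BibJSON is plain JSON, serialisation is trivial.
print(json.dumps(record, indent=2))
```

This simplicity is the point of the convention: any tool that can read JSON can consume the records, with no bibliographic tooling required.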


Pubcrawler

Pubcrawler collects bibliographic metadata, via parsers created for particular sites, and we have used it to create collections of articles. The full post provides more information.

datahub collections

We have continued to collect useful bibliographic collections throughout the year, and these along with all others discovered by the community can be found on the datahub in the bibliographic group.

Open Access / Bibliography advocacy videos and presentations

As part of a Sprint in January we recorded videos of the work we were doing and the roles we play in this project and wider biblio promotion; we also made a how-to for using BibServer, including feedback from a new user:

Setting up a Bibserver and Faceted Browsing (Mark MacGillivray) from Bibsoup Project on Vimeo.

Peter and Tom Murray-Rust’s video, made into a prezi, has proven useful in explaining the basics of the need for Open Bibliography and Open Access:

Community activities

The Open Biblio community have gathered for a number of different reasons over the duration of this project: the project team met in Cambridge and Edinburgh to plan work in Sprints; Edinburgh also played host to a couple of Meet-ups for the wider open community, as did London; and London hosted BiblioHack – a hackathon / workshop for established enthusiasts as well as new faces, both with and without technical know-how.

These events – particularly BiblioHack – attracted people from all over the UK and Europe, and we were pleased that the work we are doing is gaining attention from similar projects world-wide.

Further collaborations


Over the course of this project we have learnt that open source development provides great flexibility and power to do what we need to do, and open access in general frees us from many difficult constraints. There is now a lot of useful information available online for how to do open source and open access. Whilst licensing remains an issue, it becomes clear that making everything publicly and freely available to the fullest extent possible is the simplest solution, causing no further complications down the line. See the open definition as well as our principles for more information.

We discovered during the BibJSON spec development that it must be clear whether a specification is centrally controlled, or more of a communal agreement on use. There are advantages and disadvantages to each method, however they are not compatible – although one may become the other. We took the communal agreement approach, as we found that in the early stages there was more value in exposing the spec to people as widely and openly as possible than in maintaining close control. Moving to a close control format requires specific and ongoing commitment.

Community building remains tricky and somewhat serendipitous. Just as word-of-mouth can enhance reputation, failure of certain communities can detrimentally impact other parts of the project. Again, the best solution is to ensure everything is as open as possible from the outset, thereby reducing the impact of any one particular failure.

Opportunities and Possibilities

Over the two years, the concept of open bibliography has gone from requiring justification to being an expectation; the value of making this metadata openly available to the public is now obvious, and getting such access is no longer so difficult; where access is not yet available, many groups are now moving toward making it available. And of course, there are now plenty of tools to make good use of available metadata.

Future opportunities now lie in the more general field of Open Scholarship, where a default of Open Bibliography can be leveraged to great effect. For example, recent Open Access mandates by many UK funding councils (e.g. following the Finch Report) could be backed up by investigative checks on the accessibility of research outputs, supporting provision of an open access corpus of scholarly material.

We intend now to continue work in this wider context, and we will soon publicise our more specific ideas; we would appreciate contact with other groups interested in working further in this area.

Further information

For the original project overview, the work package descriptions, and a full chronological listing of all our project posts, see the project blog. Links to posts relevant to each work package over the course of the project follow:

  • WP1 Participation with Discovery programme
  • WP2 Collaborate with partners to develop social and technical interoperability
  • WP3 Open Bibliography advocacy
  • WP4 Community support
  • WP5 Data acquisition
  • WP6 Software development
  • WP7 Beta deployment
  • WP8 Disruptive innovation
  • WP9 Project management (NB all posts about the project are relevant to this WP)
  • WP10 Preparation for service delivery

All software developed during this project is available under an open source licence. All the data released during this project falls under OKD-compliant licences such as PDDL or CC0, depending on the licence chosen by the publisher. The content of our site is licensed under a Creative Commons Attribution 3.0 License (all jurisdictions).

The project team would like to thank supporting staff at the Open Knowledge Foundation and Cambridge University Library, the OKF Open Bibliography working group and Open Access working group, Neil Wilson and the team at the British Library, and Andy McGregor and the rest of the team at JISC.


Naomi Lillie - July 9, 2012 in Bibliographic, DM2E, Events, OKI Projects, Open GLAM, Our Work, Sprint / Hackday, TEXTUS, WG Open Bibliographic Data, Working Groups, Workshop


Last month we ran the Open Knowledge Foundation’s largest celebration of open bibliographic data to date. The main focus of the two-day event was to get some hacking done and use the tools the Open Knowledge Foundation has helped to build, or is currently building, for working with bibliographic data, such as BibServer, TEXTUS and BibSoup.

Open GLAM Workshop


The other component to the two-day event was a one-day workshop for those working in cultural heritage institutions. It included an introduction to some of the basic technical concepts of open data such as APIs and Linked Data, as well as advice from experts in the field on how to prepare your data for a hackathon. The workshop also sought to start conversations with the institutions represented from around London about what the challenges were to opening up more of their collections online and how the Open Knowledge Foundation’s Open GLAM initiative could assist in the process.

The write-up of the workshop can be found over on the Talis Systems website (thank you Tim Hodson!). One highlight of the workshop was Harry Harrold’s brilliant talk on how to get your data ready for a hackathon:

Bibliohack: Preparing your data for a hackathon from UKOLN on Vimeo.

The Hacking

The hacking began with an agreed approach: identify one unified problem to work on. The group established the need to create ‘A Bibliographic Toolkit’, bringing together the tools necessary to liberate bibliographic data, make it openly available on the net, and interact with that data.

The main components to this were:

  • Utilising BibServer – adding datasets and using PubCrawler
  • Creating an Open Access Index
  • Developing annotation tools

Project diagram

Groups identified particular Open Knowledge Foundation projects including TEXTUS and BibServer to find out what they could offer as part of this Toolkit, and looked into other available facilities on the web.

It was exciting to see people approaching common problems from different angles and finding new ways around them. One example was the TEXTUS group’s new approach to managing bibliographic references, which can complement the approaches to semantic annotation currently being worked on by the DM2E team, who were present at the hack. Adrian Pohl and Etienne Posthumus’s attempt to load the whole of the German National Bibliography into a BibServer was another such example.

For more detailed information on what occurred each day, check out the daily blog reports we wrote.

Big Thanks

We’d like to thank all the groups who made the two days such a success, especially DevCSI, UK Discovery, DM2E, Open GLAM, Open Biblio, and all of the participants.

The OKFN frequently arranges workshops, hackdays and meet-ups, so do keep an eye on this blog and our meet-up channel for news of upcoming events.

Bibliographic References in Textus

Tom Oinn - June 20, 2012 in Bibliographic, Open Shakespeare, Our Work, TEXTUS

Textus is the OKFN’s open source platform for working with collections of texts. It harnesses the power of semantic web technologies and delivers them in a simple and intuitive interface, so that students, researchers and teachers can share and collaborate around texts.

Sites built on Textus, both existing and upcoming, contain collections of texts annotated by their respective communities. Following some excellent conversations at the recent openBiblio workshop and hack session, we now have a plan to make these text repositories play nicely with the rest of the world.

Many thanks to all the participants at the openBiblio event for their comments and help, in particular to Peter Murray-Rust and Simone Fonda, and to the organisers for getting everyone into the same room.

New features

Within the next few weeks, we’re hoping to add a whole load of new features to Textus, so you’ll be able to:

  • Browse the texts in that instance, filtering by authors, dates etc.
  • Create your own reading lists and control whether they’re publicly visible on your profile page or private. Items in these reading lists can be
    • External references, added either completely manually by filling in all the details or through a search interface.
    • Entire texts or fragments of texts from within the Textus instance itself, allowing you to add very specific content to your reading list.
    • …or you can import the entire reading list from an uploaded file in BibJSON format.
  • Add citations to your annotations. As with reading-list creation, you can add citations entirely manually or through search, with the extra feature that you can search your own reading lists – this means you can create annotations which reference other regions of the same or other texts within the Textus instance.
  • Export your reading lists as BibJSON for import into other tools and services.
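To make the import and export concrete, here is a sketch of a minimal BibJSON reading list in Python. The collection and record field names follow general BibJSON conventions; the exact fields a Textus export would emit, and the `textus_fragment` record type, are assumptions for illustration only.

```python
import json

# A minimal BibJSON collection representing a reading list.
# Field names follow common BibJSON conventions; the exact fields a
# Textus export would include are an assumption for illustration.
reading_list = {
    "metadata": {
        "collection": "my_reading_list",
        "label": "Notes on Hume",
    },
    "records": [
        {
            # An external reference, filled in manually or via search.
            "type": "book",
            "title": "A Treatise of Human Nature",
            "author": [{"name": "David Hume"}],
            "year": "1739",
        },
        {
            # A hypothetical internal reference to a fragment of a text
            # held in the Textus instance, via a character range.
            "type": "textus_fragment",
            "title": "Enquiry, Section X",
            "textus": {"textId": "enquiry", "start": 10200, "end": 10950},
        },
    ],
}

# Exporting is then just serialising the collection to JSON.
exported = json.dumps(reading_list, indent=2)
```

Importing an uploaded file would be the reverse: parse the JSON and check that a `records` list is present before adding the entries to a reading list.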

References in, references out

Currently annotations are free text comments, which may be attributable to a user or may be anonymous, but are rarely any richer than this. Annotations of this kind are valuable, but they lack solid backing. We’d like to allow our annotators to provide evidence through citations.

On the other side of things, we want to be able to reference texts or parts of texts held within a Textus installation from elsewhere, including hyperlinks directly into the reader interface, so that when someone cites a fragment of a play they can provide a link which opens that part of the play in a web browser along with any relevant annotations.

An interesting side effect of having a text in Textus is that citing any arbitrary part of that text becomes possible – traditionally it’s been difficult to create truly fine-grained citations (down to the paragraph, sentence or even word level). We can do this trivially because Textus defines a coordinate system over each text, and a reference refers to a contiguous range of characters within this system. It will be interesting to see how tools which expect very coarse-grained references (entire books, articles etc.) cope with these much more precise citations, but that’s for the future…
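As a rough sketch of the coordinate idea (the data model below is an assumption for illustration, not Textus’s actual schema), a fine-grained citation reduces to a text identifier plus a contiguous character range, and resolving it is a simple slice:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Citation:
    """A fine-grained reference: a text plus a contiguous character range."""
    text_id: str
    start: int  # inclusive character offset
    end: int    # exclusive character offset

def resolve(citation: Citation, texts: dict) -> str:
    """Extract the cited fragment from a store of full texts."""
    return texts[citation.text_id][citation.start:citation.end]

texts = {"hamlet": "To be, or not to be, that is the question."}
cite = Citation("hamlet", 0, 19)
print(resolve(cite, texts))  # -> To be, or not to be
```

Because the range is just a pair of offsets, a citation can cover a word, a sentence or a whole text equally well, which is exactly what makes arbitrarily precise referencing cheap.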

Tech and implementation

To integrate the functionality described above into Textus we’re going to be taking advantage of three existing projects.

  • BibJSON provides a format to express bibliographic information.
  • BibServer provides a set of APIs we can use to search external sources of reference and expose references from Textus (allowing Textus to act as a BibServer instance itself)
  • FacetView provides a rich filtering and browsing interface, embedded in the Textus website, to allow navigation and display of collections. It depends on an ElasticSearch or SOLR instance holding the data; happily, we already use ElasticSearch as the data store for Textus.

So, there is one component we need to write (a sensible search UI across distributed BibServer instances, including the instance embedded within Textus) and a couple to integrate. There will certainly be glitches and things which aren’t as easy as we expect, but thanks to the excellent work from these other projects we should be able to deliver a lot of functionality very quickly.
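The new component, a search UI across distributed BibServer instances, ultimately boils down to querying several endpoints and merging the result lists. A toy sketch of the merge step, assuming each instance returns a list of BibJSON-like records (the title-based de-duplication key is a naive assumption):

```python
def merge_results(result_sets, key=lambda r: r.get("title", "").lower()):
    """Merge record lists from several BibServer instances,
    de-duplicating naively on a key (here: lowercased title)."""
    seen, merged = set(), []
    for records in result_sets:
        for record in records:
            k = key(record)
            if k not in seen:
                seen.add(k)
                merged.append(record)
    return merged

# e.g. results from the embedded Textus instance and a remote BibServer
local = [{"title": "Leviathan", "source": "textus"}]
remote = [{"title": "Leviathan", "source": "bibsoup"},
          {"title": "Utopia", "source": "bibsoup"}]
combined = merge_results([local, remote])  # 2 unique records
```

A real implementation would need smarter duplicate detection (identifiers rather than titles) and asynchronous querying, but the shape of the problem is this simple.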

The Right to Read Is the Right to Mine

Peter Murray-Rust - June 1, 2012 in Bibliographic, OKI Projects, Open Access, Open Content, Open Data, Open Science, Texts, WG Open Bibliographic Data, Working Groups

The following is a draft content mining declaration developed by the Open Knowledge Foundation’s Working Group on Open Access.

In brief: The Right to Read Is the Right to Mine


Researchers can find and read papers online, rather than having to manually track down print copies. Machines (computers) can index the papers and extract the details (titles, keywords etc.) in order to alert scientists to relevant material. In addition, computers can extract factual data and meaning by “mining” the content, opening up the possibility that machines could be used to make connections (and even scientific discoveries) that might otherwise remain invisible to researchers.

However, it is not generally possible today for computers to mine the content of papers, due to constraints imposed by publishers. While Open Access (OA) is improving researchers’ ability to read papers (by removing access barriers), still only around 20% of scholarly papers are OA. The remainder are locked behind paywalls. Under the vast majority of subscription contracts, subscribers may read paywalled papers, but they may not mine them.

Content mining is the way that modern technology locates digital information. Because digitized scientific information comes from hundreds of thousands of different sources in today’s globally connected scientific community [2], and because current data sets can be measured in terabytes [1], it is often no longer possible to simply read a scholarly summary in order to make scientifically significant use of such information [3]. A researcher must be able to copy information, recombine it with other data and otherwise “re-use” it so as to produce truly helpful results. Content mining is not only a deductive tool for analysing research data; it is also how search engines allow discovery of content. To prevent mining is therefore to force scientists into blind alleys and silos where only limited knowledge is accessible. Science does not progress if it cannot incorporate the most recent findings and move forward from there.


‘Open Content Mining’ means the unrestricted right of subscribers to extract, process and republish content manually or by machine in whatever form (text, diagrams, images, data, audio, video, etc.) without prior specific permissions, subject only to community norms of responsible behaviour in the electronic age. Such content includes:

  • Text
  • Numbers
  • Tables: numerical representations of a fact
  • Diagrams (line drawings, graphs, spectra, networks, etc.): graphical representations of relationships between variables. As images, a diagram considered as a whole may not constitute data; however, the individual data points underlying a graph, as with tables, should.
  • Images and video (mainly photographic): where they are the means of expressing a fact.
  • Audio: as with images, where it expresses a factual representation of the research.
  • XML: Extensible Markup Language (XML) defines rules for encoding documents in a format that is both human-readable and machine-readable.
  • Core bibliographic data: “data which is necessary to identify and/or discover a publication”, as defined under the Open Bibliography Principles.
  • Resource Description Framework (RDF): information about the content, such as authors, licensing information and the unique identifier for the article.


Principle 1: Right of Legitimate Accessors to Mine

We assert that there is no legal, ethical or moral reason to refuse to allow legitimate accessors of research content (OA or otherwise) to use machines to analyse the published output of the research community. Researchers expect to access and process the full content of the research literature with their computer programs and should be able to use their machines as they use their eyes.

The right to read is the right to mine

Principle 2: Lightweight Processing Terms and Conditions

Mining by legitimate subscribers should not be prohibited by contractual or other legal barriers. Publishers should add clarifying language in subscription agreements that content is available for information mining by download or by remote access. Where access is through researcher-provided tools, no further cost should be required.

Users and providers should encourage machine processing

Principle 3: Use

Researchers can and will publish facts and excerpts which they discover by reading and processing documents. They expect to disseminate and aggregate statistical results as facts, and contextual text as fair-use excerpts, openly and with no restrictions other than attribution. Publisher efforts to claim rights in the results of mining further retard the advancement of science by making those results less available to the research community; such claims should be prohibited.

Facts don’t belong to anyone.


We plan to assert the above rights by:

  • Educating researchers and librarians about the potential of content mining and the current impediments to doing so, including alerting librarians to the need not to cede any of the above rights when signing contracts with publishers
  • Compiling a list of publishers and indicating what rights they currently permit, in order to highlight the gap between the rights here being asserted and what is currently possible
  • Urging governments and funders to promote and aid the enjoyment of the above rights

[1] Panzer-Steindel, Bernd, Sizing and Costing of the CERN T0 Center, CERN-LCG-PEB-2004-21, 09 June 2004.

[2] The Value and Benefits of Text Mining, JISC, Report Doc #811, March 2012, Section 3.3.8, citing P. J. Herron, “Text Mining Adoption for Pharmacogenomics-based Drug Discovery in a Large Pharmaceutical Company: a Case Study”, Library, 2006, claiming that text mining tools evaluated 50,000 patents in 18 months, a task that would have taken 50 person-years to do manually.

[3] See MEDLINE® Citation Counts by Year of Publication, and National Science Foundation, Science and Engineering Indicators: 2010, Chapter 5, asserting that the annual volume of scientific journal articles published grows on the order of 2.5% per year.
