A couple of weeks back we blogged about the ‘Future of Bibliographic Control’ draft report from a working group at the Library of Congress. Since then, we’ve submitted to the group a brief, collaboratively edited response to the draft and an appendix with some additional detailed comments.

The response was drafted by the Open Knowledge Foundation and Aaron Swartz of the Open Library and was co-signed by over 150 groups and individuals.

Many, many thanks to all of those who helped to publicise this, and to those who co-signed the response! We hope that the working group will consider amending the draft in light of our comments in January.

Articles in CTWatch Quarterly

September 4th, 2007

As some of you may have seen, Open Knowledge Foundation advisory board members Peter Suber and John Wilbanks recently wrote two interesting articles in CTWatch Quarterly.

Peter Suber’s Trends Favoring Open Access is a broad-ranging overview of developments in publishing, research, and technology that look to support Open Access. As well as looking at how publishers and scholars are accepting and adopting OA models, he suggests that misunderstanding about OA is diminishing:

Everyone is getting used to the idea that OA literature can be copyrighted, the idea that OA literature can be peer-reviewed, the idea that the expenses for producing OA literature can be recovered, and the idea that OA and TA literature can co-exist (even for the same work).

While OA supporters have “good arguments and good trends”, he warns that we may see greater lobbying against OA policies in the US and Europe.

John Wilbanks’ Cyberinfrastructure For Knowledge Sharing looks at the way knowledge is shared in the sciences, particularly focusing on the life sciences and with respect to drug development. He says:

While the Web and email pervade pharmaceutical companies, the elusive goal remains “knowledge management:” finding some way to bring sanity to the sprawling mass of figures, emails, data sets, databases, slide shows, spreadsheets, and sequences that underpin advanced life sciences research.

He suggests that though technologies that could facilitate improved knowledge sharing already exist, the problem is that much scientific content is ‘dark’ to the web - “no one has the right to download and index with scholarly literature without burning years of time and money in negotiations”.

He goes on to look at the Neurocommons project - “an open source, open access knowledge management platform, with an initial therapeutic focus on the neurosciences” - as a good example of knowledge sharing in the sciences. Finally, he suggests a list of things that are needed to improve the situation:

We need publishers to look for business models that aren’t based on locking up the full text, because the contents of the journals – the knowledge – is itself part of the infrastructure, and closed infrastructure doesn’t yield network effects. We need open, stable namespaces for scientific entities that we can use in programming and integrating databases on the open Web, because stable names are part of the infrastructure. We need real solutions about long-term preservation of data (long-term meaning a hundred years or more). We need new browsers and better text processing. We need a sense of what it means to “publish” in a truly digital sense, in place of the digitization of the paper metaphor we have in the PDF format. We need infrastructure that makes it easy to share and integrate knowledge, not just publish it on the Web.

When I think of the amount of knowledge that is ‘dead’ because of a lack of explicitness about its ‘openness’ I am always surprised by the number of examples. Consider the following two:

Example 1: Everything2 and h2g2

Years ago, back when I was at university, I remember stumbling across http://www.everything2.com/. Shortly thereafter I remember being shown http://www.h2g2.com/ by a friend who’d just posted a write-up of the Arrow impossibility theorem. Long before Wikipedia, these sites were demonstrating the ability of decentralized, uncoordinated users to generate a huge amount of interesting and valuable (though fairly unstructured) information.

Thinking about these two sites recently I asked myself: ‘what license did they use?’ and, relatedly, ‘am I allowed to download/redistribute/incorporate their data in another project?’. The answer was perhaps unsurprising: neither site seemed to have thought about it, at least not originally, and, as a consequence, their copyright policy was the default: everyone retains copyright in what they contribute. (As is typical of anything involving copyright, things are a little more complex: h2g2, after its takeover by the BBC, adopted a policy whereby contributors retain copyright in their articles but grant the BBC a non-exclusive licence to use them as it sees fit. To further complicate matters, the BBC claims to retain copyright in ‘Edited Entries’ because a BBC editor has checked and/or altered the article.)

Hence, with respect to my second question, ‘am I allowed to redistribute/reuse their material?’, the simple answer was: No. I would have to go out and identify, and then gain permission from, each contributor; an endeavour that would clearly be prohibitively time-consuming. And this is despite the fact that, from their very participation, it is clear that the vast majority of individuals who made contributions to these sites wanted others to be able to freely access their work (and, in all likelihood, freely reuse it as well).

While anything put on the web is implicitly there to be freely accessed, when it comes to (re)using and redistributing (hosting) that material, explicitness really matters. Once you start building any kind of ‘commons’ in which multiple contributors are the norm[^1], this becomes especially important, since relying purely on tacit agreements and implicit consent becomes a major obstacle and a serious threat to the long-term future, and value, of that information.

In a world in which information rots away, in the form of disappearing links and disappearing pages, far faster than that inscribed on the physical paper of books, the ability to copy, and then to redistribute, is the only way for most works to have any permanent existence (albeit one which is fragmented and partial), for only then can a work be ‘mirrored’, archived, made available in myriad ways; in short, kept alive.

Because no effort was made to have an explicit licensing policy, these ‘knowledge-bases’ have, in effect, become partially ‘closed’. While open for access (at least as long as their parent organizations continue to exist), the opportunities for reuse and redistribution have been drastically curtailed. With the advent of Wikipedia, which adopted a ‘share-alike’ type license from the very start, these sites have, in many respects, been superseded, and it is particularly telling that there are dedicated Wikipedia pages with instructions for ‘node’ owners on everything2 and h2g2 on how to move their content to Wikipedia[^2].

[^1]: Here we need not be thinking only of massively collaborative endeavours involving hundreds or thousands of contributors, but also of a popular weblog where, apart from the original author, you may have dozens of different individuals commenting on posts.

[^2]: http://en.wikipedia.org/wiki/Wikipedia:Guide_for_Everything2_Noders and http://en.wikipedia.org/wiki/Wikipedia:Guide_for_h2g2_Researchers

Example 2: Crystallographic data structures

Recently I was chatting with Peter Murray-Rust, head of the Unilever informatics lab at Cambridge University and one of the pioneers of open knowledge (he’s also the man behind SAX, chemical mime and the world-wide-molecular-matrix, and his latest collaboration in open chemistry is http://www.blueobelisk.org/).

He was telling me about how crystallographers get asked to do analyses. Roughly, each analysis costs between 300 and 600 pounds. Now what happens to the data (‘structures’) produced by these analyses? Sometimes they get published (in Acta Crystallographica) but often they just sit in a basement drawer gathering dust. Peter said that he had colleagues in Australia who had close on 1000 such unpublished ‘structures’. At 300 to 600 pounds apiece, that’s between 300k and 600k pounds’ worth of data gathering dust.

So why does this happen? Peter suggested two reasons. They both relate to the circumstances in which the analysis occurs so let me explain that first.

These analyses are often commissioned by someone else (either in industry or academia) in relation to work they are doing. Often the crystallographic analysis is just a check and will only end up being mentioned in a footnote, if mentioned at all (something like: ‘Our hypothesis as to the structure of this molecule was confirmed by crystallographic analysis ….’).

As a result, first, the crystallographer can’t publish immediately, since this might preempt the associated paper or disclose sensitive information about what a company is working on. Second, it is unclear who ‘owns’ the rights in the data: is it the crystallographer or the entity which commissioned the analysis? Together these uncertainties combine to place a dead hand upon publication, except in circumstances where the crystallographer did the analysis on their own account.

With more explicitness about the legal status (particularly if the default were that the data were open) and efforts to address the social issues (perhaps a delay of three years after which publication is allowed) access could be greatly improved.

Conclusion

To stitch together the knowledge commons it’s not good enough for information to be implicitly open; it has to be explicitly open. To be explicitly open it must have an open knowledge license clearly attached. Without this the knowledge produced immediately becomes ‘locked’: doing anything other than having the information sit there on the original server requires a rights-clearance effort of such daunting proportions as to be completely infeasible.

Furthermore, when engaging in any kind of collaborative effort (the norm on the web), the adoption of an explicitly open approach can be considered as providing a form of social contract among the participants which is clearer than the informal tacit arrangements which would otherwise operate.

At ETech 06

March 8th, 2006

I’m currently in San Diego at the O’Reilly Emerging Technology Conference (ETech ‘06) courtesy of co-presenting a talk with Jo Walsh entitled Hack Your Own Conference: the World Summit on Free Information Infrastructures.

The talk that has interested me the most so far was by Tim Bray, who spoke about the Atom syndication format. Rather than being a technical examination, this was a discussion of the process and politics of producing the standard. Summary:

  • RSS 1.0 was broken in major ways and the process to make a new RSS had become toxic (in Bray’s words).
  • This led to work on Atom starting with the ‘Pie’ wiki (as in easy as pie) initiated by Sam Ruby.
  • Ultimately it was conducted within the IETF rather than the W3C.
  • Something like 17,000 messages (60 MB) over 2-3 years for the standard (went stable 2005-07) and 4,600 messages (17 MB) over 1.5 years for the publication API (not yet stable):
    • All done by email/wiki (only email has standing)
    • Anyone could join
    • Consensus decreed by chair; may be appealed
    • Trolls banned for 30 days; may appeal
    • General review by whole IETF

  • This was a slow process but overall it led to a very solid spec with a lot of authority (no-one could say they were excluded)