Peter Murray-Rust — Cambridge University chemist, Open Knowledge Foundation Advisory Board member and tireless advocate for open data in chemistry — has recently started a series of blog posts about open data, focusing on issues related to the Panton Principles for open data in science.
The first is called “Open Data: why I need the Open Knowledge Foundation”, and in it he introduces some of the issues he wishes to discuss and gives his vision for the role he hopes the OKF community will play in relation to open data. He writes:
After a period of silence on this blog (but not on the Open Knowledge Foundation lists) I hope to publish a flurry of ideas on Open Data. There is no doubt that “Open Data” has arrived and there is enormous interest. (By contrast when I started to investigate it 5 years ago there was nothing). It’s desperately important, more complex than I ever imagined, and it’s critical to address it immediately, responsibly, dispassionately and inclusively. If we manage to set out the concerns now, we may manage to avoid the worst problems that were encountered by the Open Source and later Open Access movements. [They have made enormous progress and without their footsteps Open Data would fall into many of the same pitfalls. But Open Data is Difficult – a phrase I shall repeat frequently.]
I am putting my faith and energy into the Open Knowledge Foundation – its people and its infrastructure. This is because it’s an organisation which is wide-ranging (it deals with open content of all sorts, open metadata, services, etc.). It has great expertise in legal problems and solutions (where these are necessary) and also how to find alternative approaches. It’s neutral (apart from urging Openness and developing the infrastructure). It’s very professional, and realises that ideas without implementation have less weight. So there is an impressive range of software and information skills. I am reminded of my favourite motto (from the IETF) – “rough consensus and running code”, one of the greatest productive mantras of our time.
The enthusiasm is palpable. [Today I had a breakfast Skype session with Jonathan Gray (coordinator of OKF) and it’s all about how we can make things happen fast and responsibly.] The OKF works through Working Groups and discussion lists, and so when I had a concern about Open Data I brought it to the OKF and – after a great deal of work – we emerged with the Panton Principles which have now been translated into several languages by OKF members.
Simply, the OKF amplifies the visions of individuals from the almost-impossible to the attainable.
So I am putting some ideas into the OKF melting pot to see what emerges.
In the next post, titled “Open Data: The concept of Panton Papers”, he lays out his ideas for the Panton Papers:
The current theme is “Panton Papers”. The idea is that part of the value of the Panton Principles is that the whole document is short and the key points are simply made. But the “Principles” can therefore only address the motivation and the procedures for Open data in a general manner, and many of the problems are in the details. I believe that many of the problems in Open Access (which is simpler than Open Data) arose because not enough communal effort was given to the practice of Open Access and I want to avoid as many OD problems as possible before they occur.
Over the last 2 years (when Open Data has started to become important and discussed) I have seen several potentially difficult areas. I’ll simply list the ones I have thought of here and then outline the idea of the Panton Papers. This discussion is mirrored in part by the OKF open-science discussion list and you may wish to subscribe. There’s also a regular working group on open-science. (Almost everything in OKF is Open, but it may take a little while to find out where you want to be!). The issues that I currently have are:
- What is data? Images? Graphs? Tables? Equations? Accounts of experiments? This is a major problem and almost completely unexplored. Without solving this we are held back 10 years or more in our ability to re-use the primary scientific literature (e.g. by closed-access publishers who claim that factual graphs belong to them).
- Why should data be open? (and when should it not be?). I’ve put forward ideas here and here. They range from moral, to legal/quasi-legal, to utilitarian.
- Who owns data? This is one of the trickiest areas – there is legal and contractual ownership and there is moral ownership. Generally there is far far too much “ownership” of data.
- When should data be released? This is a key question (see here for an example). Some communities have solved it – most haven’t addressed it and will have to go through the rigour of working out release protocols.
- How and where should data be exposed? I am strongly of the opinion that we need domain-specific repositories (which could be national or international) and that Institutional Repositories are almost never the best place to expose data (I expect and welcome alternative opinions). The “how” depends on understanding what the data and metadata are and is increasingly dependent on specialist software and information standards. “Archival” is often the wrong word to use.
- Datamining and textmining. Most authors, publishers, repository owners are unaware of the enormous power of automated analysis of the literature. Some closed access publishers expressly forbid these activities. We have to liberate the right of the scientific community to do this enthusiastically and efficiently.
- Reproducibility. Science is based on reproducibility – we expect to be able to replicate the “materials and methods” of an experiment and to try to falsify its claims. Physical materials are beyond the immediate discussion (though this may change) but much science is now based on computing. It should be possible to replicate simulations, data cleaning, data analysis, model fitting etc. This is a tricky area. It is difficult (though with virtualization and the cloud it is becoming easier) to reproduce the computing environment. Large or complex data sets are a major problem but must be addressed. This is not without monetary cost.
I may add more.
The idea is that each of these is a “Panton Paper”. It may or may not be crafted in Pantonia (the hectare of the Chemistry Department, the OKF headquarters, and the Panton Arms in Cambridge UK). Everything I now write is mutable.
Each paper will have a top level document of similar form to the Panton Principles, i.e. 3-8 ideas, with short explanatory paragraph(s). This document will be crafted by the OKF in public view on a wiki or Ether/Piratepad. Anyone can take part. We shall welcome contributions from a wide range of disciplines (in fact this is essential). At some stage version 1.0 of the paper will be frozen and will be formally published. We have an offer from a major publisher to do this and I am hoping we can announce this at Open Science Summit.
The Paper should carry a wider range of links to other essays in Open Data and should carry examples from different disciplines. For example there is a well-tried and accepted process in many areas of bioscience and astronomy as to what, when and how data get published.
Peter has started drafting ideas for the first two of these at:
If you’d like to get stuck in, please head on over to the open-science list and say hi! :-)