Open Data Discussion on SPARC List

I was recently involved in an interesting discussion with John Wilbanks on the SPARC open-data list and thought it worth excerpting some of it here.

Email 1: Reply to a message from John Wilbanks

Source: https://mx2.arl.org/Lists/SPARC-OpenData/Message/100.html

Hi all, chiming in here…just joined the list.

The lack of international consensus on data makes use of CC licenses for data problematic. The EU directive doesn’t exist elsewhere, and is not

Yes, the IPR status of collections of data varies greatly by jurisdiction. Some have no protection at all, others have common-law copyright (e.g. Australia, and US pace Feist), the EU has copyright plus a sui generis right, etc.

written into the majority of even the EU licenses. This makes

Yes, though perhaps that can be fixed via a suggestion to the various local drafting teams …

interoperation along the CCi model (where I can upload a file in the US, and you can download it in Brazil) much harder to achieve, as the IPR does not exist in the US.

Sure, though:

(a) I think a lot of the jurisdictions have some kind of IPR that can be used. Furthermore, these kinds of ‘interoperability’ issues already exist with the CC licences for content. A lot of people in England will just point to the standard US CC licences rather than the E&W licence, even though it is not precisely customized to the national law. What one really wants is a clause in any national licence saying: when you license under this licence you license under the equivalent licence in all other jurisdictions. Such a provision would also work for CC licences attached to data.

(b) CC licences aren’t just legal documents; they are also a way of encoding the ‘social contract’. Thus, even if it turns out a licence is not perfectly enforceable, attaching it to a work provides a useful signal to others of what the creator (or owner) wishes to permit. Particularly in the academic community, such ‘intentions’ will carry strong weight, since violation can be sanctioned in all kinds of non-legal ways.

SC is examining the idea that data be simply tagged as public domain, with terms of use requesting attribution. The extension of copyright to

But many may not want their data to be public domain. Look at Gracenote and freedb: the archetypal example of an appropriation of the commons. Many people want a sharealike provision in their data licences. Of course, in jurisdictions where there are no underlying legal rights this is meaningless, but in many jurisdictions it will not be.

data, though it would let CC licenses be used, could also result in the automatic assignment of copyrights to all data sets – which means that if sharing licenses were not attached we would likely see a vast space of orphan data with all rights reserved. It seems to be a feature of copyrighted content.

I am not sure I understand this. Either rights already exist in such datasets (though they may not be exercised) or they do not. If they do not, then attaching CC licences won’t suddenly create these rights. If they do, then attaching CC licences has just made the situation clearer. I do appreciate that in reality there is, of course, quite a lot of greyness over what one is allowed to do, and this can be beneficial because it allows people informally to do stuff they might not be formally allowed to (the classic case of this is provided by the data in Walsh, Cohen and Cho, who show that one reason patents in biotech have not had much impact on researchers is that the patents are routinely ignored out of ignorance, see 1). That said, I think that in the long run it is better to be explicit (more discussion on a similar theme can be found in 2).

Whereas a public domain designation with some terms of use would by definition allow the use, reuse, and distribution of data, without the need for a binding intellectual property license. In some cases, using intellectual property – which is a blunt instrument – can have dreadful unintended consequences.

It can, though as I just said, I’m not sure how attaching a licence would create such IPRs. Rather, it might make people aware that such IPR exists, in which case we are back to the previous point.

Also, our research is unclear as to what “attribution” and “share alike” mean in the context of data. What if I run a query across 10,000 gene expression data sets? If I access only one record per data set? Attribution and derivative works are terms built for copyrights, and the legal implications might mean you have to attribute 10,000 people every time you generate a data set. The normative values of each field of science work pretty well for this already…

These are hard questions but again if the IPR rights already exist these are questions that will have to be faced whether there is a licence or not. Furthermore, the courts have already been struggling (perhaps rather unsuccessfully) here in the EU and elsewhere to define these kinds of things. For example, the EU DB directive talks about the right ‘to prevent extraction and/or reutilization of the whole or of a substantial part, evaluated qualitatively and/or quantitatively, of the contents of that database.’ For more on this see, e.g.:

http://www.ivir.nl/publications/hugenholtz/fordham2001.html

Paul Uhlir made a very important point to me in person at the CODATA meetings in Beijing. A “commons” isn’t just a place where “some rights” are reserved. It’s a place where “some rights, or no rights” are reserved. Data may well fall into the latter category.

Absolutely, though, as mentioned above, I would not underestimate the attractiveness of ‘share-alike’ provisions. In my own experience so far with licensing discussions there has been strong support for these kinds of provisions, and we should also note the prevalence of the GPL in the F/OSS community.

However, as I said, we’re examining the idea, and welcome the discussion.

Now, if you have a database, we have created a FAQ for owners, and uniprot.org (the world’s largest database of biological protein information) uses the CC license in the following manner:

“We have chosen to apply the Creative Commons Attribution-NoDerivs License to all copyrightable parts of our databases. This means that you are free to copy, distribute, display and make commercial use of these databases, provided you give us credit. However, if you intend to distribute a modified version of one of our databases, you must ask us for permission first.” (http://www.pir.uniprot.org/terms.shtml)

To my mind this would mean that the database was not open/free in the sense of the open knowledge/data definition:

http://okd.okfn.org/

What is their motivation for doing this? I assume it is an integrity concern, e.g. they don’t want different versions of the database floating around the net, all slightly different. However, why couldn’t this be addressed by a standard provision of the Perl type: ‘if you modify this database you must not distribute it under the same name and must clearly identify that it has been modified’?

Email 2: From Wilbanks in Reply

Source: https://mx2.arl.org/Lists/SPARC-OpenData/Message/101.html

I keep coming back to the issue of norms.

The desire for citation and attribution makes perfect sense to me. But from a legal perspective it is not the best solution to use IPR of any sort here. Biology has somehow managed to get by with no IPR on core data sets and the norms as encoded in the Bermuda Rules. Patents are a different question, but anyone out there can download the genome…and the cases of violators, as I understand it, were dealt with inside the community.

See: http://www.sciencemag.org/cgi/content/full/291/5507/1192

Now, can you imagine what would have happened if those rules had been encoded in a contract? They’d have been in court, immediately. Outside the hands of the scientists.

As you note, this is really about encoding social intent. So why use a contract? Why not simply state your norms and values via terms of use like the Bermuda Rules? Why risk the other aspects to achieve this when “non-legal” ways – norms, academic recognition and more – have been working for a long, long time with known outputs? Why the desire for binding IPR licenses?

This is one place – one of the only places in the world of digital content – where the default switch in most of the world is already set to allow sharing. I think it should remain set that way.

These are hard questions but again if the IPR rights already exist these are questions that will have to be faced whether there is a licence or not.

If they didn’t exist, we wouldn’t have the problem. For data, I’d like to have a world where we didn’t need contracts in order to share! Where we didn’t write a contract that was good for geology but lousy for biology, physics and anthropology. Because at some point, someone is going to want to get at all those data sets for a complex query, and it ought to be one-click.

As for appropriation…if you put a data set online, and you see that same dataset in a commercial application – but your set is still online and free – has it been appropriated? Or has it been reused? I don’t understand how a data set can be “appropriated” if you retain the rights to post it, free of charge and IPR, for redistribution?

This all brings me back: I understand the seductive idea of attribution and share alike for data. I just don’t think that from a legal perspective it’s a good idea to use binding IPR contracts to get there. Public domain allows for reuse, redistribution, and norms should take care of the rest.