Why does this matter?

Why bother about openness and licensing for data? After all they don’t matter in themselves: what we really care about are things like the progress of human knowledge or the freedom to understand and share.

However, open data is crucial to progress on these more fundamental items. It’s crucial because open data is so much easier to break-up and recombine, to use and reuse. We therefore want people to have incentives to make their data open and for open data to be easily usable and reusable — i.e. for open data to form a ‘commons’.

A good definition of openness acts as a standard that ensures different open datasets are ‘interoperable’ and therefore do form a commons. Licensing is important because it reduces uncertainty. Without a license you don’t know where you, as a user, stand: when are you allowed to use this data? Are you allowed to give to others? To distribute your own changes, etc?

Together, a definition of openness, plus a set of conformant licenses deliver clarity and simplicity. Not only is interoperability ensured but people can know at a glance, and without having to go through a whole lot of legalese, what they are free to do. (For more see this article and this post).

Thus, licensing and definitions are important even though they are only a small part of the overall picture. If we get them wrong they will keep on getting in the way of everything else. If we get them right we can stop worrying about them and focus our full energies on other things.

Background

Over the last couple of years there has been substantial discussion about the licensing (or not) of (open) data and what ‘open’ should mean. In this debate there two distinct, but related, strands:

  1. Some people have argued that licensing is inappropriate (or unnecessary) for data.
  2. Disagreement about what ‘open’ should mean. Specifically: does openness allow for attribution and share-alike ‘requirements’ or should ‘open’ data mean ‘public domain’ data?

These points are related because arguments for the inappropriateness of licensing data usually go along the lines: data equates to facts over which no monopoly IP rights can or should be granted; as such all data is automatically in the public domain and hence there is nothing to license (and worse ‘licensing’ amounts to an attempt to ‘enclose’ the public domain).

However, even those who think that open data can/should only be public domain data still agree that it is reasonable and/or necessary to have some set of community ‘rules’ or ‘norms’ governing usage of data. Therefore, the question of what requirements should be allowed for ‘open’ data is a common one, whatever one’s stance on the PD question.

Of course, even with agreement on requirements, there is still the question of whether these should be ‘enforced’ through a license or via community norms. To summarize, the three main questions are:

Qu 1. Is it important to license?

Qu 2: What ‘restrictive’ requirements are compatible with openness? In particular does ‘open’ equate to PD only or are attribution and share-alike ‘requirements’ permitted?

Qu 3: Community norms or licenses? Should ‘community norms’ or license terms be used in order to encode requirements such as attribution and share-alike?

Below I look at each of these in turn, laying out, as I see it, the current consensus and expressing my own view.

Question 1: Is it Important to License?

The simple answer here is yes. Whether one likes it or not there are a whole bunch of jurisdictions where there are IP rights in data(bases). Note that this does not imply any monopoly rights in any facts that data represents.

Thus, even if you just want your data to be in the ‘public domain’, you need to apply a license — or something very closely resembling a license. (A suitable example is the Open Data Commons Public Domain Dedication and License).

Question 2: What Should Openness Allow?

Despite the sometimes heated discussion, there is, in fact, broad agreement: openness means freedom to use and reuse data in any way you wish. The only debate is over what, if any, conditions can be imposed when allowing use and reuse. In particular, following the example of the software and content domains, the following two items have been proposed as permissible exceptions to the basic rule of ‘allow everything’:

  1. Requirement of attribution (in a non-burdensome manner)
  2. Requirement to share-alike (a reuser or share-alike material must, when making publicly available their own material, make it openly available under a similar share-alike license)

Attribution

Everyone agrees that requiring attribution is OK. Furthermore, it also now generally accepted that having this requirement in a license is not be a problem.

(In the original Protocol for Implementing Open Access Data attribution was alleged to be problematic due to a potential for ‘attribution stacking’. However, these concerns appear to have been allayed. To my mind, it was never clear why data needed to be different: code and content both have plenty of examples of projects with many contributors, much reuse and an attribution requirement).

Share-Alike

Share-alike provisions are more controversial. It has been argued that share-alike conditions are problematic because of the potential for incompatibility between two share-alike licenses (or community norms). At the same time share-alike may provide an important incentive for individuals and communities to make their data openly available since it provides some assurance that this data will remain open. Thus, any evaluation comes down to the balance between:

  1. The costs, if any, of allowing share-alike in terms of e.g. complexity and compatibility.

  2. The benefits, if any, that share-alike provides by encouraging the creation of open data in the first place and in ensuring subsequent ‘sharing back’ by those who build upon that data.

In my view the benefits are substantial while the costs are not. Incompatibility can largely be avoided by only ‘approving’ share-alike licenses that are compatible. At the same time, share-alike enshrines a principle that is important to many communities in the code and content spheres and same seems true of data (consider e.g. Open Street Map).

(Aside: it is important to emphasize that permitting share-alike does not mean it is must be used. In fact, a particular community could recommend against using share-alike as, for example, the Python community does for code hoping to make it into its standard library.)

Question 3: Licenses versus Community Norms

Even if a basic license is used it can be argued that any ‘requirements’ for attribution or share-alike should not be in a license but in ‘community norms’. So which is best?

In my view, when making available data, licenses are much better than community norms. Why?

  1. A license is always needed even if you are taking a PD approach. So ‘norms’ don’t obviate the need to license.
  2. A license is able to encode ‘norms’ both formally and informally (for example, in a preamble — cf. the GPL).
  3. A license is likely to elicit at least as much, and almost certainly more, conformity with its provisions than community norms. This is especially true outside of the community. The future is likely to see a much more mixed data landscape whether in science or elsewhere with many ‘non-community’ (non-academic) business and among ordinary citizens. (Note also that for these groups the simplicity and formality of a license makes it superior to ‘norms’ in almost every respect — transparency, certainty etc.
  • If there are concerns that, in some jurisdictions, the absence of ‘data’ rights make e.g. share-alike provisions unenforceable nothing is lost by using a license: the license de facto reverts to the status of a community norm and any concerns regarding “false expectations” can easily be dealt with by a simple warning.

Flexibility: some have argued that ‘norms’ are more ‘flexible’ than licenses. I’m not clear what this really means:

  • Flexible = not enforceable. Perhaps true but I am unclear why this is an advantage (even to a user it is easy to comply with the open license)
  • Flexible = leeway around the edges. For example I won’t get in trouble if I don’t attribute quite right. But this is true of licenses too: it is very unlikely anyone gets sued for a minor error in attribution and even with share-alike no court is likely to award damages for a mistake made in good faith — especially if it can be easily corrected.
  • Flexible = fuzzy. Fuzziness does not seem an attractive property when sharing data — both sharer and sharee want clarity.
  • Flexible = easily changed. Allowing major changes is a serious problem both for licensors and licensees (certainty and clarity would disappear). For minor changes licenses are just as good.

Thus, in every respect I can think of, licenses are superior to community norms when making available open data.

Conclusion

Summarizing the the conclusions from the above discussion we have:

Qu 0: Does this matter?

Yes. A good definition of openness and the use of some form of licensing is crucial to a healthy future for the open data community (and that will include pretty much everyone …).

Qu 1: Is it important to license?

Ans: A ‘license’ is always necessary — even if you advocate a PD-only approach. There is too much variation (and uncertainty) about what the IP situation is across the world to just go with the default. All providers of data should apply some kind of license or PD dedication.

Qu 2: What ‘restrictive’ requirements are compatible with openness? In particular does ‘open’ equate to PD only or are attribution and share-alike ‘requirements’ permitted?

Ans: Both attribution and share-alike should be permitted. Attribution is widely agreed to be acceptable. The second, ‘share-alike’ is more controversial, but in my view should be allowed: there is no reason to break with the precedent set in code and content domains and its benefits seem substantial while costs are minimal if licenses are correctly managed.

Qu 3: Community norms or licenses?

Ans: Use licenses when making available data. Licenses provide all the benefits of community norms in terms of explicitly encoding the preferences of a community. At the same time they deliver greater clarity and transparency, and, in many jurisdictions, provides a legal enforceability which norms do not with regard to requirements of attribution or share-alike.

Colophon

This essay comes out of ongoing discussions over the last few years with a large assortment of communities and individuals. The primary motivation for sitting down and pulling the threads together came out of reading Michael Nielsen’s post on The role of open licensing in open science (+ thread) and recent emails with John Wilbanks of Science Commons on the Open Definition coord list.

Related work and earlier discussion on this matter include:

Website | + posts

Rufus Pollock is Founder and President of Open Knowledge.

9 thoughts on “Open Data: Openness and Licensing”

  1. Fully agree with your post.

    We at the LogiLogi Foundation per default use the Creative Commons Attribution-Share Alike License on all our texts and data.

  2. Publishing under a Public Domain dedication allows free usage according to community standards, while removing legal concerns. Is there any reason to choose the Open Data Commons, compared to Creative Commons PD or CC0 licenses?

  3. The PDDL is specifically aimed at data so if you are working with data it might be more appropriate.

    I believe that the original CC PD was really up to the job (that was why CCZero was developed). CCZero has been in alpha/beta state (though I think it is about to be/or just has been) released. It was therefore better to use the PDDL which was already ‘1.0’.

Comments are closed.