Why Share-Alike Licenses are Open but Non-Commercial Ones Aren’t
June 24th, 2010
It is sometimes suggested that there isn’t a real difference in terms of “openness” between share-alike (SA) and non-commercial (NC) clauses — both being some restriction on what the user of that material can do, and, as such, a step away from openness.
This is not true. A meaningful distinction can be drawn between share-alike and non-commercial clauses (or any other clause that discriminates against a particular type of person or field of endeavour), with the former being “open” and the latter being not “open”.
This distinction is important. It has relevance, for example, as to why Open Data Commons should not provide NC licenses but will provide a share-alike one. It is also relevant to Creative Commons, whose set of licenses includes both share-alike and non-commercial options. As such, not all CC licenses are open and CC licenses are not all mutually compatible. This is something of an irony, as it means that Creative Commons provides a set of licenses that don’t, in fact, result in a commons.
What’s the Problem? Why Does This Matter?
What’s the problem with NC licenses, aren’t “SA” licenses a step away from open too? And if we debate this, don’t we just end up having a pointless license holy war?
The distinction between NC and SA licenses isn’t about “holy war” but something very practical: license compatibility and the integrity of the “open” commons. The core of a “commons” of data (or code) is that one piece of “open” material contained therein can be freely intermixed with other “open” material.
This interoperability is absolutely key to realizing the main practical benefit of “openness”, which is ease of use and reuse. That, in turn, means more and better stuff getting created and used.
The Open Knowledge/Data Definition functions as a “standard” to ensure interoperability just in the same way as normal tech standards operate (but in this case for licenses rather than for a piece of hardware or software). The aim is to ensure that any license which complies with the definition will be interoperable with any other such license meaning that data or content under the one license can be combined with data or content under the other license.
Share-alike or attribution requirements are allowed within the definition precisely because they do not break this interoperability (and may even help promote the commons by ensuring material is “shared back”). Non-commercial provisions are not permitted because they fundamentally break the commons, not only through being incompatible with other licenses but because they overtly discriminate against particular types of users. (I should emphasize here that the definition is directly following the line set out in the original open source definition …)
Thus, there is a meaningful distinction between attribution and share-alike requirements and others such as non-commercial (NC), and it is a distinction that merits the description of share-alike licenses as being open but non-commercial licenses as not being open.
Isn’t It Just About Degree?
Yes, NC and especially ND are more restrictive, but stating that NC licenses aren’t open is wrong - they’re just not as open.
This is incorrect.
To reiterate: it is a mistake to view the set of licenses as some continuous spectrum of ‘openness’ with PD at one end and full rights reserved at the other — with the implication that all licenses in between are more or less open.
There are significant discontinuities and in particular we can meaningfully partition the set of licenses into open and not-open based on a) their interoperability b) the freedom they provide to all persons (and companies) to use, reuse and redistribute.
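The partition described above can be made concrete with a small sketch. The clause flags and the example license entries are simplifying assumptions for illustration, not a legal classification. The rule encoded is only the one argued for here: attribution and share-alike are permitted, while a clause that discriminates against a type of user or use (NC) or that blocks reuse (ND) puts a license outside the open set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class License:
    """Hypothetical, simplified description of a license by its clauses."""
    name: str
    attribution: bool = False
    share_alike: bool = False
    non_commercial: bool = False
    no_derivatives: bool = False


def is_open(lic: License) -> bool:
    # Attribution and share-alike do not affect openness in this sketch.
    # Any NC or ND clause does: the result is a yes/no partition,
    # not a position on a sliding scale.
    return not (lic.non_commercial or lic.no_derivatives)


# Illustrative entries only; the names are generic, not specific real licenses.
LICENSES = [
    License("PD dedication"),
    License("Attribution", attribution=True),
    License("Attribution Share-Alike", attribution=True, share_alike=True),
    License("Attribution Non-Commercial", attribution=True, non_commercial=True),
    License("Attribution No-Derivatives", attribution=True, no_derivatives=True),
]

open_set = [lic.name for lic in LICENSES if is_open(lic)]
not_open_set = [lic.name for lic in LICENSES if not is_open(lic)]
```

Note that a share-alike license and a PD dedication land in the same set even though one carries more conditions than the other, which is the sense in which the division is a discontinuity rather than a matter of degree.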
But You Can’t Trademark Openness …
it’s annoying that someone claims to be releasing data openly, but it turns out to be NC and no-compete and a bunch of other stuff. It would be nice to say to them - “you can’t claim to be open because you don’t meet this definition”. But unfortunately it would probably be difficult to get the trademark on the word “open”
It’s quite right that you can’t trademark openness — and no-one should want to! However, we can make an effort as a community to have a clear shared meaning for “open” in relation to data and content along the lines of http://opendefinition.org/ — just as the open source definition has done for code. By insisting on this meaning we are doing something valuable: creating a standard and maintaining interoperability.
We’re extremely proud that data.gov.uk - the UK Government’s open data portal - uses CKAN, OKF’s open source registry of open data. In the months in 2009 that led up to the release of data.gov.uk, OKF worked closely with the Cabinet Office to help them realise their vision of making public data publicly available in an open, reusable way. But our involvement with the UK government didn’t start there. Civil servants - particularly members of the Office for Public Sector Information - have been attending OKF events like OKCon since at least 2005. And we know that Sir Tim Berners-Lee - who was brought on as an expert advisor to the Government as they worked up to the data.gov.uk project - was reading the OKF blog prior to his now famous “Raw Data Now!” talk at TED!
A new report released late last month charts the history of open government data in the UK and the US, and it’s a fascinating read. Written by OKF board member Becky Hogge for a consortium of grant-giving organisations including the Hewlett Foundation, the Ford Foundation, the Omidyar Network, the Open Society Institute and DfID, the Open Data Study:
“…explores the feasibility of advocating for open government data catalogues in middle income and developing countries. Its aim is to identify the advocacy strategies used in the US and UK data.gov and data.gov.uk initiatives, with a view to building a set of criteria that predict the success of similar initiatives in other countries and provide a template strategy to opening government data.”
I was interviewed for the report, as were John Wonderlich from the Sunlight Foundation, Tom Steinberg from mySociety and Ory Okolloh from Ushahidi. Other interviewees include experts like Ethan Zuckerman and Toby Mendel, and - of course - Sir Tim Berners-Lee.
The report draws some new and surprising conclusions. As well as recognising the role of organisations like the OKF and mySociety in bringing about data.gov.uk, it emphasises how crucial engagement with civil servants was to the success of the open data project in the UK. It raises interesting questions about what motivates politicians to embrace open data strategies, and even posits that the long battle to open up geospatial data in the UK worked in a positive way: “the barrier [opening geospatial data] imposed in the UK may have served as a common call to action among both civil society and the middle layer government administrators, which in turn served to strengthen the crucial communication between these two groups in the trajectory towards data.gov.uk, and ultimately enrich the final proposition when compared to data.gov.”
The report contains mixed findings about the prospects of similar projects in developing and middle income countries, providing a useful and very detailed checklist for advocates working within those countries to consult, and pointing to the potential role of international donors in this context. In short, I’d recommend reading this report to anyone interested in open government data, or indeed, in advocacy generally. Because, as Becky notes in her blog post introducing the report:
“I’d be hard pressed to think of an idea that has permeated as quickly as open data has from the fringe to the centre.”
7th Communia Workshop, Luxembourg
February 3rd, 2010

We recently attended a workshop in Luxembourg as part of Communia, the EU policy network on the digital public domain. There was a focus on bringing together themes from previous events to make a series of policy recommendations to the European Commission (watch this space!).
Below are a few notes highlighting some of the talks and discussions that we thought might be of particular interest to readers here:
- We had a meeting to review where we are up to with the Public Domain Calculators. So far it looks like we have 10 EU countries covered, 8 maybe covered and 6 that we are still looking for help with (namely: Cyprus, Denmark, Lithuania, Luxembourg, Slovakia, Slovenia). If you’d like to help out - please drop us a line!
- Jill Cousins from the European Digital Library Foundation spoke about the latest state of play with respect to licensing the content of Europeana, a collection of over 6 million images, texts, sound recordings and videos. In particular she spoke about the possibility of libraries and cultural heritage organisations releasing digital content into the public domain or under an open license. There has been some opposition - but we very much hope that institutions contributing to Europeana have the foresight to give this serious consideration!
- Paul Keller and Lucie Guibault presented their work on the recently released public domain manifesto - discussing the rationale behind it, its genesis and various versions, and an overview of its main principles and recommendations. At the time of writing it has been signed by over 50 organisations and 1800 individuals.
- Francesco Fusaro of the European Commission DG Research spoke about the EU initiatives to support open access to scientific publications and data - from background research in this area to piloting open access to approximately 20% of FP7 funded projects.
- Patrick Peiffer gave an excellent presentation on licensing options for bibliographic metadata. In particular he suggested that non-commercial restrictions could cause substantial transaction costs and technical complications. On the other hand, using an ‘attribution, sharealike’ type license that allowed commercial reuse would cause no transaction costs, create a level playing field, allow interoperability with projects like Wikimedia and Wikimedia Commons, avoid exclusive deals and open up new channels of discovery. It would be a big step if Europeana libraries and institutions follow the lead of CERN Library, who last week announced that they were opening up their metadata!
- Mathias Schindler spoke about tools developed by the Wikipedians using open bibliographic metadata. He also described what the Wikipedia community had done to add value to collections of cultural works - such as improving the quality of metadata, adding descriptions to images and so on.
- Rufus Pollock spoke about his work at the University of Cambridge to estimate the size and value of the public domain in Europe.

Comments on the Science Commons Protocol for Implementing Open Access Data
February 9th, 2009
Here I briefly comment on the Science Commons Protocol for Implementing Open Access Data as the protocol strongly advocates a position of ‘PD’-only. As will be apparent from the earlier essay on Open Data: Openness and Licensing I do not entirely share this view.
The Protocol gives 3 basic reasons for preferring the ‘PD’ approach in a section entitled ‘Issues in Database Licensing’. I excerpt and comment on these in turn. As will be clear from the comments below I am not really convinced by any of these points that attribution or share-alike provisions should not be included in open data licenses.
(NB: As the protocol does not discuss any of the possible attractions of allowing such provisions, for example the benefits to contributors of knowing that reusers will contribute back to the ‘commons’, I don’t really discuss them either. However, they are clearly important to this discussion.)
5.1 Category errors
Any solution based on rights will result in categorization errors: the application of obligations based on copyright in situations where it is not necessary (for example, a share-alike license on the copyrightable elements may be falsely assumed to operate on the factual contents of a database). In the reverse, a user might assume that the “Facts Are Free” status of the non-copyrightable elements extends to the entire database and inadvertently infringe.
We do not know what courts will decide in the future. But it is conceivable that in 20 years, a complex semantic query across tens of thousands of data records across the web might return a result which itself populates a new database. If intellectual property rights are involved, that query might well trigger requirements carrying a stiff penalty for failure, including such problems as a copyright infringement lawsuit.
These interpretative problems are exacerbated by differences among countries over the standards for copyright protection for databases, by the existence of sui generis database rights, and by the difficulty of interpreting contractual language.
For these reasons, solutions based on selective waiving of intellectual property rights fail to provide a high degree of legal certainty and ease of use.
I’m at a loss why these are big problems for people wanting to openly license work. Suppose there is an ‘Attribution’ only style license for a DB and I’m a user. OK so I might be uncertain as to whether I can get away with not attributing if I use only the ‘facts’ but all I need to do to have total certainty is to actually do the attribution.
Similarly for share-alike. OK, I might be uncertain, depending on what exactly I’m reusing, as to whether I’m supposed to share-alike, but why don’t I just act conservatively and make the material available? (NB: share-alike only applies when you’re using the derived work in some way publicly. It doesn’t mean being forced to make everything you do available.) Similarly, if I can get away as a user with not following the license (because there are no rights), well, big deal!
Thus, the risk here has got to be for the original licensor who has provided a DB on some conditions only to find those conditions violated (because they don’t bind). But then the protocol is just urging them to remove those conditions a priori, which, given the licensor was interested in imposing conditions, might not be that attractive!
In this case, the only danger is that a licensor is bitter as they feel they have been misled — but this could be avoided with a simple warning. I also cannot understand why anyone will be any happier if they have used a community norm and then found it not being obeyed — and with community norms you wouldn’t even have the threat of actual enforcement.
Lastly, in all of this it is useful to compare data with code and content. In particular, are the uncertainties so much worse for DBs than they are for, say, a complex piece of code or content? In the code domain, for example, you might ask:
- Can I get away with not infringing the GPL if I don’t copy the code exactly but duplicate the structure?
- What happens when I do linking in code? Etc.
To my mind data does not seem more problematic and code has coped quite successfully with the issues that have arisen.
5.2 False expectations
There is also the problem of false expectations. Many users choose to apply common-use licenses such as the GPL and CC in order to declare their intent: thus, a user might choose to apply a “copyleft” term to the copyrightable elements of a database, in hopes that those elements result in additional open access database elements coming online. But a user would be able to extract the entire contents (to the extent those contents are uncopyrightable factual content) and republish those contents without observing the copyleft or share-alike terms. The data provider, based on our research, is likely to feel “tricked” by this outcome. That is not a desired result.
No, it is not a desired result. However, it could simply and easily be avoided by stating that this is a potential ‘risk’. After all, that is what the protocol is effectively doing!
Furthermore, I wonder how many entities (particularly large corporations) would want to take the risk? After all, in a whole variety of jurisdictions there would be a pretty good case if the reextracting were substantial. My point here is that I don’t see legal uncertainty as a great reason not to license. Going back to my original point: much of the uncertainty can be avoided by both parties taking fairly minor steps. Ok there could be abuse but that happens with all open licenses in all domains.
Moreover, observe that this is really only a problem for those trying to impose ‘share-alike’ provisions. But surely these are the very people who aren’t going to be attracted by the ‘waive everything’ approach. Given that the share-alike provisions do have bite in at least some jurisdictions, I don’t see the value in entirely removing them just because they might not bite everywhere, since in doing so you are clearly removing the incentive for some people to openly license their material.
(And once again, I don’t see what community norms buy you here. People have less not more reason to observe a share-alike community norm).
For this reason, the use of such licenses fails to provide a high degree of ease of use and legal certainty.
See previous comments above.
5.3 Attribution stacking
Last, there is a problem of cascading attribution if attribution is required as part of a license approach. In a world of database integration and federation, attribution can easily cascade into a burden for scientists if a category error is made. Would a scientist need to attribute 40,000 data depositors in the event of a query across 40,000 data sets? How does this relate to the evolved norms of citation within a discipline, and does the attribution requirement indeed conflict with accepted norms in some disciplines? Indeed, failing to give attribution to all 40,000 sources could be the basis for a copyright infringement suit at worst, and at best, imposes a significant transaction cost on the scientist using the data.
Therefore, a legal obligation to give attribution violates the principle of low transaction costs.
[This now seems partially resolved in that I understand SC to no longer consider attribution-stacking as a major issue]
Again I’m not convinced here. There seem to be obvious and simple ways to provide attribution in low-cost ways (attribution via url, attribution to the project not the contributor etc etc). Wikipedia has 10s of thousands of contributors. The linux kernel has had many 1000s of contributors and yet they don’t seem to encounter massive problems.
Furthermore, the one major thing scholarly communities and others all mention in discussions about opening up data is the need for credit. Doing this via ‘community norms’ instead of via an attribution requirement in a license does not seem to make much of a difference — if “attribution stacking” is an issue with a license it will be a problem with norms too. If attribution is not going to happen then I think there are going to be serious issues asking people to make data available.
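As a rough illustration of the low-cost approaches mentioned above (attribution via URL, attribution to the project rather than to each contributor), the sketch below collapses the sources behind a query result into one credit line per project. The record fields, project names and URLs are all hypothetical, and this is only one way such aggregation might be done.

```python
from collections import OrderedDict


def project_credits(records):
    """Return one credit line per distinct project, in first-seen order."""
    seen = OrderedDict()
    for rec in records:
        # Keep the first URL seen for each project; later records from the
        # same project add no further attribution burden.
        seen.setdefault(rec["project"], rec["url"])
    return [f"{project} ({url})" for project, url in seen.items()]


# 40,000 hypothetical records, contributed by many depositors
# but drawn from only two projects.
records = [
    {
        "depositor": f"contributor-{i}",
        "project": "Project A" if i % 2 == 0 else "Project B",
        "url": "https://example.org/a" if i % 2 == 0 else "https://example.org/b",
    }
    for i in range(40000)
]

credits = project_credits(records)
```

If every record came from a different project the list would not shrink, and a single URL pointing at a full list of sources would be the more practical route. Either way the cost falls on tooling rather than on the individual scientist.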
Facts and Databases
February 9th, 2009
[This post is an addendum to the earlier essay on Open Data: Openness and Licensing]
It is important to be clear that any IP ‘rights’ in data(bases) are not ‘rights’ in the facts those data represent but in the ‘data collection’ (or database). Here I try to explain the difference (fairly crudely) with some examples. For more on this and IP ‘rights’ in data(bases) in general see the Guide to Open Data Licensing.
Geodata. Suppose we have a database of longitude and latitude pairs for cities. Now, no-one can ‘own’ the fact that London is at a particular long/lat. However, it may be possible for someone to have an ‘IP’ (monopoly) right in their particular collection of such facts. In that case, if you go out and copy the long/lat from the protected database you might well infringe but if you go and calculate the long-lat yourself you won’t.
Chemistry. Alternatively, consider boiling points of substances. No-one can stop you going and calculating (and publishing) the boiling point of some substance but someone might be able to stop you if your data was taken direct from their database.
To summarize: “You can’t get IP rights in facts but you can (in some jurisdictions) get them in a collection of data representing those facts”
Open Data: Openness and Licensing
February 2nd, 2009
Why does this matter?
Why bother about openness and licensing for data? After all they don’t matter in themselves: what we really care about are things like the progress of human knowledge or the freedom to understand and share.
However, open data is crucial to progress on these more fundamental items. It’s crucial because open data is so much easier to break-up and recombine, to use and reuse. We therefore want people to have incentives to make their data open and for open data to be easily usable and reusable — i.e. for open data to form a ‘commons’.
A good definition of openness acts as a standard that ensures different open datasets are ‘interoperable’ and therefore do form a commons. Licensing is important because it reduces uncertainty. Without a license you don’t know where you, as a user, stand: when are you allowed to use this data? Are you allowed to give it to others? To distribute your own changes, etc?
Together, a definition of openness, plus a set of conformant licenses deliver clarity and simplicity. Not only is interoperability ensured but people can know at a glance, and without having to go through a whole lot of legalese, what they are free to do. (For more see this article and this post).
Thus, licensing and definitions are important even though they are only a small part of the overall picture. If we get them wrong they will keep on getting in the way of everything else. If we get them right we can stop worrying about them and focus our full energies on other things.
Background
Over the last couple of years there has been substantial discussion about the licensing (or not) of (open) data and what ‘open’ should mean. In this debate there are two distinct, but related, strands:
- Some people have argued that licensing is inappropriate (or unnecessary) for data.
- Disagreement about what ‘open’ should mean. Specifically: does openness allow for attribution and share-alike ‘requirements’ or should ‘open’ data mean ‘public domain’ data?
These points are related because arguments for the inappropriateness of licensing data usually go along the lines: data equates to facts over which no monopoly IP rights can or should be granted; as such all data is automatically in the public domain and hence there is nothing to license (and worse ‘licensing’ amounts to an attempt to ‘enclose’ the public domain).
However, even those who think that open data can/should only be public domain data still agree that it is reasonable and/or necessary to have some set of community ‘rules’ or ‘norms’ governing usage of data. Therefore, the question of what requirements should be allowed for ‘open’ data is a common one, whatever one’s stance on the PD question.
Of course, even with agreement on requirements, there is still the question of whether these should be ‘enforced’ through a license or via community norms. To summarize, the three main questions are:
Qu 1: Is it important to license?
Qu 2: What ‘restrictive’ requirements are compatible with openness? In particular does ‘open’ equate to PD only or are attribution and share-alike ‘requirements’ permitted?
Qu 3: Community norms or licenses? Should ‘community norms’ or license terms be used in order to encode requirements such as attribution and share-alike?
Below I look at each of these in turn, laying out, as I see it, the current consensus and expressing my own view.
Question 1: Is it Important to License?
The simple answer here is yes. Whether one likes it or not there are a whole bunch of jurisdictions where there are IP rights in data(bases). Note that this does not imply any monopoly rights in any facts that data represents.
Thus, even if you just want your data to be in the ‘public domain’, you need to apply a license — or something very closely resembling a license. (A suitable example is the Open Data Commons Public Domain Dedication and License).
Question 2: What Should Openness Allow?
Despite the sometimes heated discussion, there is, in fact, broad agreement: openness means freedom to use and reuse data in any way you wish. The only debate is over what, if any, conditions can be imposed when allowing use and reuse. In particular, following the example of the software and content domains, the following two items have been proposed as permissible exceptions to the basic rule of ‘allow everything’:
- Requirement of attribution (in a non-burdensome manner)
- Requirement to share-alike (a reuser of share-alike material must, when making publicly available their own material, make it openly available under a similar share-alike license)
Attribution
Everyone agrees that requiring attribution is OK. Furthermore, it is also now generally accepted that having this requirement in a license is not a problem.
(In the original Protocol for Implementing Open Access Data attribution was alleged to be problematic due to a potential for ‘attribution stacking’. However, these concerns appear to have been allayed. To my mind, it was never clear why data needed to be different: code and content both have plenty of examples of projects with many contributors, much reuse and an attribution requirement).
Share-Alike
Share-alike provisions are more controversial. It has been argued that share-alike conditions are problematic because of the potential for incompatibility between two share-alike licenses (or community norms). At the same time share-alike may provide an important incentive for individuals and communities to make their data openly available since it provides some assurance that this data will remain open. Thus, any evaluation comes down to the balance between:
- The costs, if any, of allowing share-alike in terms of e.g. complexity and compatibility.
- The benefits, if any, that share-alike provides by encouraging the creation of open data in the first place and in ensuring subsequent ‘sharing back’ by those who build upon that data.
In my view the benefits are substantial while the costs are not. Incompatibility can largely be avoided by only ‘approving’ share-alike licenses that are compatible. At the same time, share-alike enshrines a principle that is important to many communities in the code and content spheres and the same seems true of data (consider e.g. OpenStreetMap).
(Aside: it is important to emphasize that permitting share-alike does not mean it must be used. In fact, a particular community could recommend against using share-alike as, for example, the Python community does for code hoping to make it into its standard library.)
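The idea of only ‘approving’ share-alike licenses that are compatible can be sketched as a simple check over a proposed approved set. The license names and the compatibility declarations below are hypothetical placeholders; in practice, deciding whether two licenses are compatible is a legal judgment that this code simply takes as input.

```python
from itertools import combinations


def incompatible_pairs(approved, compatible):
    """Return pairs of approved licenses with no declared compatibility.

    `approved` is a set of license names; `compatible` is a set of
    frozensets, each holding two names declared mutually compatible.
    """
    return [
        (a, b)
        for a, b in combinations(sorted(approved), 2)
        if frozenset((a, b)) not in compatible
    ]


# Hypothetical declarations: only SA-1 and SA-2 are known to be compatible.
compatible = {frozenset(("SA-1", "SA-2"))}

ok = incompatible_pairs({"SA-1", "SA-2"}, compatible)
bad = incompatible_pairs({"SA-1", "SA-2", "SA-3"}, compatible)
```

An empty result means material under any two approved licenses can be combined. A non-empty result lists the pairs that would need resolving before the new license could join the approved set.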
Question 3: Licenses versus Community Norms
Even if a basic license is used it can be argued that any ‘requirements’ for attribution or share-alike should not be in a license but in ‘community norms’. So which is best?
In my view, when making available data, licenses are much better than community norms. Why?
- A license is always needed even if you are taking a PD approach. So ‘norms’ don’t obviate the need to license.
- A license is able to encode ‘norms’ both formally and informally (for example, in a preamble — cf. the GPL).
- A license is likely to elicit at least as much, and almost certainly more, conformity with its provisions than community norms. This is especially true outside of the community. The future is likely to see a much more mixed data landscape, whether in science or elsewhere, with many ‘non-community’ (non-academic) users among businesses and ordinary citizens. (Note also that for these groups the simplicity and formality of a license makes it superior to ‘norms’ in almost every respect: transparency, certainty etc.)
- If there are concerns that, in some jurisdictions, the absence of ‘data’ rights make e.g. share-alike provisions unenforceable nothing is lost by using a license: the license de facto reverts to the status of a community norm and any concerns regarding “false expectations” can easily be dealt with by a simple warning.
Flexibility: some have argued that ‘norms’ are more ‘flexible’ than licenses. I’m not clear what this really means:
- Flexible = not enforceable. Perhaps true but I am unclear why this is an advantage, even to a user, since it is easy to comply with an open license.
- Flexible = leeway around the edges. For example I won’t get in trouble if I don’t attribute quite right. But this is true of licenses too: it is very unlikely anyone gets sued for a minor error in attribution and even with share-alike no court is likely to award damages for a mistake made in good faith — especially if it can be easily corrected.
- Flexible = fuzzy. Fuzziness does not seem an attractive property when sharing data — both sharer and sharee want clarity.
- Flexible = easily changed. Allowing major changes is a serious problem both for licensors and licensees (certainty and clarity would disappear). For minor changes licenses are just as good.
Thus, in every respect I can think of, licenses are superior to community norms when making available open data.
Conclusion
Summarizing the conclusions from the above discussion we have:
Qu 0: Does this matter?
Yes. A good definition of openness and the use of some form of licensing is crucial to a healthy future for the open data community (and that will include pretty much everyone …).
Qu 1: Is it important to license?
Ans: A ‘license’ is always necessary — even if you advocate a PD-only approach. There is too much variation (and uncertainty) about what the IP situation is across the world to just go with the default. All providers of data should apply some kind of license or PD dedication.
Qu 2: What ‘restrictive’ requirements are compatible with openness? In particular does ‘open’ equate to PD only or are attribution and share-alike ‘requirements’ permitted?
Ans: Both attribution and share-alike should be permitted. Attribution is widely agreed to be acceptable. The second, ‘share-alike’, is more controversial, but in my view should be allowed: there is no reason to break with the precedent set in code and content domains and its benefits seem substantial while costs are minimal if licenses are correctly managed.
Qu 3: Community norms or licenses?
Ans: Use licenses when making available data. Licenses provide all the benefits of community norms in terms of explicitly encoding the preferences of a community. At the same time they deliver greater clarity and transparency and, in many jurisdictions, provide a legal enforceability which norms do not with regard to requirements of attribution or share-alike.
Colophon
This essay comes out of ongoing discussions over the last few years with a large assortment of communities and individuals. The primary motivation for sitting down and pulling the threads together came out of reading Michael Nielsen’s post on The role of open licensing in open science (+ thread) and recent emails with John Wilbanks of Science Commons on the Open Definition coord list.
