Comments on the Science Commons Protocol for Implementing Open Access Data

Here I briefly comment on the Science Commons Protocol for Implementing Open Access Data as the protocol strongly advocates a position of ‘PD’-only. As will be apparent from the earlier essay on Open Data: Openness and Licensing I do not entirely share this view.

The Protocol gives 3 basic reasons for preferring the ‘PD’ approach in a section entitled ‘Issues in Database Licensing’. I excerpt and comment on these in turn. As will be clear from the comments below I am not really convinced by any of these points that attribution or share-alike provisions should not be included in open data licenses.

(NB: As the protocol does not discuss any of the possible attractions of allowing such provisions, for example the benefits to contributors of knowing that reusers will contribute back to the ‘commons’, I I don’t really discuss them either. However, they are clearly important to this discussion).

5.1 Category errors

Any solution based on rights will result in categorization errors: the application of obligations based on copyright in situations where it is not necessary (for example, a share-alike license on the copyrightable elements may be falsely assumed to operate on the factual contents of a database). In the reverse, a user might assume that the “Facts Are Free” status of the non-copyrightable elements extends to the entire database and inadvertently infringe.

We do not know what courts will decide in the future. But it is conceivable that in 20 years, a complex semantic query across tens of thousands of data records across the web might return a result which itself populates a new database. If intellectual property rights are involved, that query might well trigger requirements carrying a stiff penalty for failure, including such problems as a copyright infringement lawsuit.

These interpretative problems are exacerbated by differences among countries over the standards for copyright protection for databases, by the existence of sui generis database rights, and by the difficulty of interpreting contractual language.

For these reasons, solutions based on selective waiving of intellectual property rights fail to provide a high degree of legal certainty and ease of use.

I’m at a loss why these are big problems for people wanting to openly license work. Suppose there is an ‘Attribution’ only style license for a DB and I’m a user. OK so I might be uncertain as to whether I can get away with not attributing if I use only the ‘facts’ but all I need to do to have total certainty is to actually do the attribution.

Simiarly for share-alike. Ok I might be uncertain, depending on what exactly I’m reusing as to whether I’m supposed to share-alike but why don’t I just act conservative and make available (NB: share-alike only applies when you’re using the derived work in some way publicly. It doesn’t mean forcing to make everything you do available). Similarly, if I can get away as a user with not following the license (because there are no rights), well big deal!

Thus, the risk here has got to be for the original licensor who has provided a db on some conditions only to find those conditions violated (because they don’t bind). But then the protocol is just urging them to remove those conditions but a priori — which given the licensor was interested in imposing conditions might not be that attractive!

In this case, the only danger is that a licensor is bitter as they feel they have been misled — but this could be avoided with a simple warning. I also cannot understand why anyone will be any happier if they have used a community norm and then found it not being obeyed — and with community norms you wouldn’t even have the threat of actual enforcement.

Lastly, in all of this it is useful to compare data with code and content. In particular, are the uncertainties so much worse for DBs than they are for, say, a complex piece of code or content? In the code domain, for example, you might ask:

Can I get away with not infringing the GPL if I don’t copy the code exactly but duplicate the structure?
What happens when I do linking in code? Etc.

To my mind data does not seem more problematic and code has coped quite successfully with the issues that have arisen.

5.2 False expectations

There is also the problem of false expectations. Many users choose to apply common-use licenses such as the GPL and CC in order to declare their intent: thus, a user might choose to apply a “copyleft” term to the copyrightable elements of a database, in hopes that those elements result in additional open access database elements coming online. But a user would be able to extract the entire contents (to the extent those contents are uncopyrightable factual content) and republish those contents without observing the copyleft or share-alike terms. The data provider, based on our research, is likely to feel “tricked” by this outcome. That is not a desired result.

No it is not a desired result. However, it could simply and easily be avoided by stating that this is potential ‘risk’. After all that is what the protocol is effectively doing!

Furthermore, I wonder how many entities (particularly large corporations) would want to take the risk? After all, in a whole variety of jurisdictions there would be a pretty good case if the reextracting were substantial. My point here is that I don’t see legal uncertainty as a great reason not to license. Going back to my original point: much of the uncertainty can be avoided by both parties taking fairly minor steps. Ok there could be abuse but that happens with all open licenses in all domains.

Moreover, observe that this is really only a problem for those trying to impose ‘share-alike’ provisions. But surely these are very people who aren’t going to be attracted by the ‘waive everything’ approach. Given that that the share-alike provisions do have bite in at least some jurisdictions I don’t see the value in entirely removing the share-alike provision just because they might not bite everywhere since in doing so you are clearly removing the incentive for some people to openly license their material.

(And once again, I don’t see what community norms buy you here. People have less not more reason to observe a share-alike community norm).

For this reason, the use of such licenses fails to provide a high degree of of ease of use and legal certainty.

See previous comments above.

5.3 Attribution stacking

Last, there is a problem of cascading attribution if attribution is required as part of a license approach. In a world of database integration and federation, attribution can easily cascade into a burden for scientists if a category error is made. Would a scientist need to attribute 40,000 data depositors in the event of a query across 40,000 data sets? How does this relate to the evolved norms of citation within a discipline, and does the attribution requirement indeed conflict with accepted norms in some disciplines? Indeed, failing to give attribution to all 40,000 sources could be the basis for a copyright infringement suit at worst, and at best, imposes a significant transaction cost on the scientist using the data.

Therefore, a legal obligation to give attribution violates the principle of low transaction costs.

[This now seems partially resolved in that I understand SC to no longer consider attribution-stacking as a major issue]

Again I’m not convinced here. There seem to be obvious and simple ways to provide attribution in low-cost ways (attribution via url, attribution to the project not the contributor etc etc). Wikipedia has 10s of thousands of contributors. The linux kernel has had many 1000s of contributors and yet they don’t seem to encounter massive problems.

Furthermore, the one major thing scholarly communities and others all mention in discussions about opening up data is the need for credit. Doing this via ‘community norms’ instead of via an attribution requirement in a license does not seem to make much of a difference — if “attribution stacking” is an issue with a license it will be a problem with norms too. If attribution is not going to happen then I think there are going to be serious issues asking people to make data available.