CKAN and Finding Open Data in the Life Sciences

Melanie Dulong de Rosnay recently published an excellent paper on open data in the life sciences in Nature Precedings entitled Check Your Data Freedom: A Taxonomy to Assess Life Science Database Openness. From the abstract:

Molecular biology data are subject to terms of use that vary widely between databases and curating institutions. This research presents a taxonomy of contractual and technical restrictions applicable to databases in life science. It builds upon research led by Science Commons demonstrating why open data and the freedom to integrate facilitate innovation and how this openness can be achieved. The taxonomy describes technical and legal restrictions applicable to life science databases, and its metadata have been used to assess terms of use of databases hosted by Life Science Resource Name (LSRN) Schema. While a few public domain policies are standardized, most terms of use are not harmonized, difficult to understand and impose controls that prevent others from effectively reusing data. Identifying a small number of restrictions allows one to quickly appreciate which databases are open. A checklist for data openness is proposed in order to assist database curators who wish to make their data more open to make sure they do so.

Shirley Fung has published a directory of the open datasets examined in the paper, with details of their re-usability, at Molecular Biology Databases (MBDB).

For each dataset, they provided basic metadata, including:

The name and URL of the database,
URL of the download page and URL of the terms of use,
Extracts of the terms of use for further review and comments,
Values for technical accessibility and legal accessibility features […]

They then looked at various technical and legal restrictions for accessing, acquiring and re-using the material – including bulk downloadability, registration, password protection, terms and conditions, and licensing – asking the following questions:

Is there a link to download the whole database?
Is it possible to access the data through a batch feature?
Is it possible to access the data through a query-based system?
Finally, is registration compulsory before downloading or accessing data in the ways described above?
Does the database have a policy?
Are there any restrictions on the right to reformatting and redistributing?
Which restrictions?
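
To make the questions above concrete, here is a minimal sketch of how the answers for one database might be recorded and checked. The field names, the example entries and the `looks_open` rule are our own assumptions for illustration; they are not the paper's taxonomy or its scoring.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AccessAssessment:
    """Answers to the checklist questions for one database (hypothetical field names)."""
    name: str
    has_bulk_download: bool      # is there a link to download the whole database?
    has_batch_access: bool       # can the data be accessed through a batch feature?
    has_query_access: bool       # can the data be accessed through a query-based system?
    registration_required: bool  # is registration compulsory before access?
    has_policy: bool             # does the database state a policy?
    # Which restrictions apply to reformatting and redistributing, if any.
    reuse_restrictions: List[str] = field(default_factory=list)


def looks_open(a: AccessAssessment) -> bool:
    """Assumed rule of thumb: at least one access route, no compulsory
    registration, a stated policy, and no restrictions on re-use."""
    technically_accessible = (
        a.has_bulk_download or a.has_batch_access or a.has_query_access
    )
    return (
        technically_accessible
        and not a.registration_required
        and a.has_policy
        and not a.reuse_restrictions
    )


# Two made-up entries, not taken from the directory.
example_open = AccessAssessment(
    name="ExampleDB A (hypothetical)",
    has_bulk_download=True,
    has_batch_access=False,
    has_query_access=True,
    registration_required=False,
    has_policy=True,
)
example_restricted = AccessAssessment(
    name="ExampleDB B (hypothetical)",
    has_bulk_download=True,
    has_batch_access=True,
    has_query_access=True,
    registration_required=True,
    has_policy=True,
    reuse_restrictions=["noncommercial use only"],
)

print(example_open.name, looks_open(example_open))
print(example_restricted.name, looks_open(example_restricted))
```

A stricter rule could, for example, insist on a bulk download rather than accepting query access alone; the point is only that a handful of yes/no answers is enough to sort databases quickly.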

This is very similar to the work we have been doing with ckan.net, which aims to provide basic metadata for knowledge packages, including:

url
title
download url
tags
license/legal status
unstructured text field with a description of the resource and details about its openness
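
As a rough illustration, a record with the fields listed above could look like the following sketch. The class name `PackageRecord`, the attribute names, the helper `missing_fields` and the example values are hypothetical; this is not CKAN's actual schema or API.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class PackageRecord:
    """One registry entry, mirroring the fields listed above (hypothetical names)."""
    url: str
    title: str
    download_url: str = ""
    tags: List[str] = field(default_factory=list)
    license: str = ""  # license / legal status
    notes: str = ""    # free-text description, including details about openness


def missing_fields(record: PackageRecord) -> List[str]:
    """Return the names of fields that are still empty, in declaration order."""
    return [name for name, value in asdict(record).items() if not value]


# A made-up entry with only some of the fields filled in.
record = PackageRecord(
    url="https://example.org/exampledb",
    title="ExampleDB (hypothetical)",
    tags=["life-sciences"],
)

print(json.dumps(asdict(record), indent=2))
print("Still to fill in:", missing_fields(record))
```

Listing the empty fields is one simple way to spot entries that still need someone to track down a download link or clarify the license.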

Furthermore, CKAN uses certain tags to indicate any technical or legal restrictions on the packages that are listed. For technical access, this includes bulk downloads, registrations, password protection, and access through an API:

For legal terms, tags include noncommercial restrictions and cases where the terms of re-use are not clear:

There are also several ‘todo’ tags to indicate where it might be useful to write to the knowledge publisher or distributor to clarify something, to split up the entry into multiple entries, or to otherwise work on the registry:

Documenting the legal and technological issues involved in accessing and re-using knowledge takes significant work. It would be fantastic if this could be made easier by sharing the results of this kind of research. CKAN is intended to be a community-driven resource to aid the discovery of (open) knowledge in the first instance, its automatic installation in the longer term, and ultimately to support its re-use by providing multiple download links, multiple formats, big datasets broken down into smaller components and so on.

The MBDB is a fantastic project and we hope that in future we can put our heads together with Melanie, Shirley and others to improve the discoverability (and re-usability) of open data in the life sciences!