CKAN and Finding Open Data in the Life Sciences
Melanie Dulong de Rosnay recently published an excellent paper on open data in the life sciences in Nature Precedings, entitled ‘Check Your Data Freedom: A Taxonomy to Assess Life Science Database Openness’. From the abstract:
For each dataset, they provided basic metadata, including:
- The name and URL of the database,
- Values for technical accessibility and legal accessibility features [...]
They then looked at various technical and legal restrictions for accessing, acquiring and re-using the material – including bulk downloadability, registration, password protection, terms and conditions, and licensing – asking the following questions:
- Is there a link to download the whole database?
- Is it possible to access the data through a batch feature?
- Is it possible to access the data through a query-based system?
- Finally, is registration compulsory before downloading or accessing data in the ways described above?
- Does the database have a policy?
- Are there any restrictions on the right to reformat and redistribute the data?
- Which restrictions?
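As a rough illustration, the answers to these questions for a single database could be recorded in a small structure like the one below. This is only a sketch: the class name, the field names and the `barriers` helper are hypothetical and are not taken from the paper, and the example database is made up.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OpennessAssessment:
    """Answers to the questions above for one database (hypothetical field names)."""
    name: str
    url: str
    has_bulk_download: bool        # is there a link to download the whole database?
    has_batch_access: bool         # can the data be accessed through a batch feature?
    has_query_access: bool         # can the data be accessed through a query-based system?
    registration_required: bool    # is registration compulsory before access?
    has_policy: bool               # does the database have a policy?
    reuse_restrictions: List[str] = field(default_factory=list)  # which restrictions, if any

    def barriers(self) -> List[str]:
        """Return the recorded obstacles to accessing and re-using the data."""
        found = []
        if not self.has_bulk_download:
            found.append("no bulk download")
        if self.registration_required:
            found.append("registration required")
        if not self.has_policy:
            found.append("no stated policy")
        found.extend(f"restriction: {r}" for r in self.reuse_restrictions)
        return found


# Made-up example entry, for illustration only.
example = OpennessAssessment(
    name="Example DB",
    url="http://example.org/db",
    has_bulk_download=False,
    has_batch_access=True,
    has_query_access=True,
    registration_required=True,
    has_policy=True,
    reuse_restrictions=["noncommercial"],
)
print(example.barriers())
```

Keeping the answers as plain booleans plus a free list of restrictions keeps the record close to the questions themselves, so it stays easy to compare across databases.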
This is very similar to the work we have been doing with ckan.net, which aims to provide basic metadata for knowledge packages, including:
- download URL
- license/legal status
- unstructured text field with a description of the resource and details about its openness
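To make the shape of such an entry concrete, here is a minimal sketch of a package record holding the fields listed above, serialised to JSON so it can be shared. The field names (`name`, `download_url`, `license`, `notes`) and the sample values are assumptions made for illustration; they are not CKAN's actual schema, and `name` is added here only as an identifier.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class PackageRecord:
    """Minimal package entry (illustrative field names, not CKAN's schema)."""
    name: str           # short identifier for the package (assumed, not in the list above)
    download_url: str   # where the material can be downloaded
    license: str        # license or other legal status
    notes: str = ""     # free-text description and details about openness


def to_json(record: PackageRecord) -> str:
    """Serialise a record to JSON with stable key order."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


# Made-up example entry, for illustration only.
record = PackageRecord(
    name="example-db",
    download_url="http://example.org/example-db.zip",
    license="unknown",
    notes="Bulk download available; terms of re-use not stated.",
)
print(to_json(record))
```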
Furthermore, CKAN uses certain tags to indicate any technical or legal restrictions on the packages that are listed. For technical access, these cover bulk downloads, registration, password protection, and access through an API.
For legal terms, tags flag noncommercial restrictions and cases where the terms of re-use are not clear.
There are also several ‘todo’ tags that mark entries where it would be useful to write to the knowledge publisher or distributor for clarification, to split the entry into multiple entries, or to otherwise work on the registry.
Documenting the legal and technological issues involved in accessing and re-using knowledge takes significant work. It would be fantastic if sharing the results of this kind of research could make that work easier. CKAN is intended to be a community-driven resource that aids the discovery of (open) knowledge in the first instance, supports its automatic installation in the longer term, and ultimately supports its re-use by providing multiple download links, multiple formats, large datasets broken down into smaller components, and so on.
The MBDB is a fantastic project and we hope that in future we can put our heads together with Melanie, Shirley and others to improve the discoverability (and re-usability) of open data in the life sciences!