Opening up scientific data with CKAN and the DataHub

The argument for open-access science has been won. The old model of scientific publishing was laid down when the costs of publishing were so great that charging for access was the sensible way to meet them. As scientists’ work moves online, it is the old model we can no longer afford: the cost to humanity of restricting access is too high. A few scientists may have been saying this for years, but now, not only does open access have the backing of such respected bodies as the Wellcome Trust, but the fact gets lead front-page coverage in the national press. A government-commissioned report published yesterday adds weight to the case. The Open Access tide, we may hope, is unstoppable.

However, it has not yet breached all the defences and overrun the plains. Until then, if you are a researcher, how can you get your research results out where people can read your conclusions – and even work with your data? At the Open Knowledge Foundation, we believe we have one answer.

CKAN: open source data management

CKAN is a free, open-source data management system. It is used to get data out in the open by local and national governments as well as international bodies, but it was originally designed for the more community-oriented use of which the DataHub is an excellent example. On the DataHub, anyone can create a dataset in a couple of minutes. Data can be uploaded or linked to elsewhere on the web. Different data ‘resources’ (such as files of any kind) can be collected together in a dataset, and annotated with information about their author(s), provenance, availability for re-use, etc.

Publishing research on the DataHub

CKAN is agnostic about what kind of data can be published. A scientific paper might be catalogued as one dataset. The resources could be, for example: different versions of the printed paper (say, the author’s TeX file, and a PDF); a link to the paper’s page on a journal website; spreadsheets of experimental results; the source code you wrote to process the results; and others, such as separate image files of your graphs and diagrams. Of course, how much is included will depend, among other things, on which rights you haven’t signed away to the publisher.

The screenshot below shows an example of a paper represented in just this kind of way (the original dataset is here):

[IMG: Dataset screenshot]
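
For readers who think in code, the paper-as-dataset idea can be sketched as a plain Python dictionary. This is only an illustration: the field names (name, title, author, resources, and url/format on each resource) follow CKAN's dataset conventions as we understand them, while the helper function, the paper and every URL below are hypothetical placeholders.

```python
import json


def build_paper_dataset(name, title, author, resources):
    """Return a dict shaped like a CKAN dataset holding several resources.

    `resources` is a list of (resource name, url, format) tuples.
    """
    return {
        "name": name,      # short, URL-friendly identifier
        "title": title,    # human-readable title
        "author": author,
        "resources": [
            {"name": r_name, "url": url, "format": fmt}
            for (r_name, url, fmt) in resources
        ],
    }


# Hypothetical paper: none of these files or URLs exist.
paper = build_paper_dataset(
    name="example-paper-2012",
    title="An Example Paper",
    author="A. Researcher",
    resources=[
        ("Author's TeX source", "https://example.org/paper.tex", "TeX"),
        ("Printed version", "https://example.org/paper.pdf", "PDF"),
        ("Experimental results", "https://example.org/results.csv", "CSV"),
        ("Processing code", "https://example.org/process.py", "Python"),
    ],
)

print(json.dumps(paper, indent=2))
```

Each resource is just a pointer plus a little description, which is why uploaded files and links to a journal website can sit side by side in one dataset.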

Visualising, checking and re-using data

If you publish data it is probably in the hope that other people will use it – whether to check your results or as a starting point for new research of their own. CKAN provides interactive visualisations of your data, as well as an API for querying the data directly across the web – allowing other scientists (or your future self!) to search and process your results without downloading large data files or writing their own interface. Visualisations can also be embedded in blog posts or other web pages. For example, here, live from the DataHub, is a graph of average annual global temperature anomaly, showing the effect of global warming since 1880 in hundredths of a degree:

[IMG: Graph of average annual global temperature anomaly since 1880]
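
To give a flavour of the API route mentioned above, here is a minimal sketch of querying tabular data from a script. It assumes a present-day CKAN site with the DataStore extension enabled and its `datastore_search` action; the API available when this post was written may have differed. The base URL, the resource id and the sample response are hypothetical placeholders, and the helper function names are ours.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def datastore_search_url(base_url, resource_id, limit=5, query=None):
    """Build the URL for a datastore_search call on a CKAN site."""
    params = {"resource_id": resource_id, "limit": limit}
    if query is not None:
        params["q"] = query  # optional full-text query
    return f"{base_url.rstrip('/')}/api/3/action/datastore_search?{urlencode(params)}"


def extract_records(response_body):
    """Pull the list of rows out of a CKAN action API JSON response."""
    payload = json.loads(response_body)
    if not payload.get("success"):
        raise RuntimeError(f"CKAN API call failed: {payload.get('error')}")
    return payload["result"]["records"]


def fetch_records(base_url, resource_id, limit=5, query=None):
    """Fetch rows over the network (not called in this sketch)."""
    url = datastore_search_url(base_url, resource_id, limit, query)
    with urlopen(url) as response:
        return extract_records(response.read().decode("utf-8"))


# Hypothetical site and resource id, used only to show the URL shape.
url = datastore_search_url("https://ckan.example.org", "abc-123")

# Invented placeholder response showing the expected shape, not real data.
SAMPLE_RESPONSE = (
    '{"success": true, "result": {"records": '
    '[{"_id": 1, "x": 10}, {"_id": 2, "x": 20}]}}'
)
records = extract_records(SAMPLE_RESPONSE)
print(url)
print(records)
```

Keeping the URL building and the response parsing separate from the network call means the same code can be checked offline and then pointed at a real CKAN site by calling `fetch_records`.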

Metadata

CKAN stores a rich set of metadata, with versioned history. By default it has standard fields such as title, author, and a free-form description, but as a scientist you will want others: for example, your paper’s journal, volume number and Digital Object Identifier. No problem – you can add as many fields as you like, as the screenshot below shows. A CKAN site specialised for research could include such fields by default.

[IMG: Metadata screenshot]
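
In API terms, extra fields of this kind are usually carried in a dataset's `extras` list as key/value pairs. The sketch below assumes that convention; the helper `with_extras` is our own illustration rather than part of CKAN, and the journal, volume and DOI values are made-up placeholders.

```python
def with_extras(dataset, **fields):
    """Return a copy of `dataset` with custom fields stored as extras.

    Existing extras are kept; a field with the same key is overwritten.
    Values are stored as strings, and the list is sorted by key.
    """
    merged = {e["key"]: e["value"] for e in dataset.get("extras", [])}
    merged.update({key: str(value) for key, value in fields.items()})
    updated = dict(dataset)  # shallow copy, so the original is not modified
    updated["extras"] = [
        {"key": key, "value": value} for key, value in sorted(merged.items())
    ]
    return updated


# Hypothetical dataset and placeholder bibliographic details.
paper = {"name": "example-paper-2012", "title": "An Example Paper"}
annotated = with_extras(
    paper,
    journal="Journal of Examples",
    volume=12,
    doi="10.0000/placeholder",
)

for extra in annotated["extras"]:
    print(extra["key"], "=", extra["value"])
```

Because extras are free-form, a research-oriented site that wants these fields on every dataset would more likely define them in its own schema than rely on each author typing the same keys.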

Benefits to the researcher

You’ve already put your research in all kinds of places. Perhaps there’s a preprint in arXiv.org, a copy on an institutional repository or on your departmental website, and if you’re lucky enough to publish in an open-access journal, it’s on their website too. Are there any benefits of putting it in CKAN as well? Here are some.

Collect all your output together: You can create a group that collects all your output together. You may have moved institutions, published in different journals (and even different fields), leaving a trail of out-of-date home pages behind you with incomplete lists of your publications. But you can always keep a complete record of your output on your favourite CKAN datahub.

Collect publications from other hubs: Conversely, perhaps you are an institution, looking to build a repository, but your departments want to retain their own ‘look and feel’ or even their own sites. They can achieve the former with customisable theming on group pages. Alternatively, CKAN’s advanced harvesting system means you can import and synchronise metadata from other hubs, or even different systems, providing they make their metadata available in a standard format.

Access control: You can control who can see and edit your datasets, so for example joint papers can be edited by any of the authors.

Alt metrics: Get a record of how many people have accessed or downloaded your data. If the appropriate CKAN extension is installed, your dataset can have share buttons (for Twitter, Facebook, etc) and you can also get figures for how often it has been shared.

Try it out

You can try out CKAN right now, by taking your favourite piece of research and heading over to thedatahub.org. Alternatively, if you happen to be a department / university / funding council / research group / etc and fancy your own CKAN site, have a look at ckan.org, or feel free to get in touch.

3 Comments

Mark Wainwright says:

June 20, 2012 at 16:30

Thanks Tom – it seems to be working for me, what happens if you try again now? If it still doesn’t work I’ll ask someone to look into it …

- Tom Roche says:
  
  June 22, 2012 at 17:22
  
  @Mark Wainwright: “it seems to be working for me”
  
  Yep, it was a problem on my end:
  
  NoScript filtered a potential cross-site scripting (XSS) attempt
  
  from https://blog.okfn.org/2012/06/19/ckan-science/ to the URI beginning http://thedatahub.org/dataset/…
  Allowing “unsafe reload” shows the graph.
  
Tom Roche says:

June 20, 2012 at 15:49

“here, live from the DataHub, is a graph of average annual global temperature anomaly”: actually, I’m seeing a 500 error.