Introducing Datapkg: A Tool for Distributing, Discovering and Installing Data “Packages”

[Datapkg][] [0.5][] has been released! This is the first release deemed suitable for public consumption (though we are still in alpha). This announcement therefore serves as both introduction and release announcement.

[datapkg]: http://knowledgeforge.net/ckan/doc/datapkg
[0.5]: http://knowledgeforge.net/ckan/doc/datapkg/CHANGELOG.html

## Introduction

From the [docs][datapkg]:

> datapkg is an user tool for distributing, discovering and installing data (and content) ‘packages’.
>
> datapkg is a simple way to ‘package’ data building on existing packaging tools developed for code (e.g. Debian apt, PyPI, CRAN, Gems, CPAN). datapkg is designed to integrate closely with the CKAN (Comprehensive Knowledge Archive Network).

In terms of the big picture, datapkg is the “apt-get/aptitude/dpkg” part of the vision for a ‘Debian of Data’, i.e. scalable, distributed, open data infrastructure (for more, see [this post][comp-post] or [these recent slides][ccc-slides]):

[comp-post]: http://blog.okfn.org/2007/04/30/what-do-we-mean-by-componentization-for-knowledge/
[ccc-slides]: http://m.okfn.org/files/talks/ccc_20091228/

[Figure: Debian of Data]

Datapkg is a key part of making data sharing **automatable**. As an end-user tool it allows **automated (command-line or scripted) discovery, installation and sharing** of data “packages” either standalone or via interaction with a registry like CKAN.

## Trying It Out

If you’re interested in giving it a spin, here are the [installation instructions][install]. Once you’ve got it running, you can do things like the following (see the [manual][] for more):

[install]: http://packages.python.org/datapkg/install.html
[manual]: http://packages.python.org/datapkg/

> Search for a package in an Index, e.g. on CKAN.net:
>
>     # let's search for iso country/language codes data (iso 3166 ...)
>     $ datapkg search ckan:// iso
>     ...
>     iso-3166-2-data -- Linked ISO 3166-2 Data
>     ...
>
> Get some information about one of them (in this case 2-digit ISO country codes in RDF):
>
>     $ datapkg info ckan://iso-3166-2-data
>     ...
>     ...
>
> Let’s install it (to the current directory):
>
>     $ datapkg install ckan://iso-3166-2-data .
>
> This will download the Package ‘iso-3166-2-data’ together with its “Resources” and unpack it into a directory named ‘iso-3166-2-data’.
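
Because these are ordinary commands, they can be scripted. The following is only a rough sketch of what a batch install might look like: `DRY_RUN`, `run` and `install_packages` are hypothetical names invented for this illustration (they are not part of datapkg), and the only datapkg invocation used is the `datapkg install ckan://<name> <directory>` form shown above. By default the script just prints the commands it would run, so it can be tried without datapkg installed.

```shell
#!/bin/sh
# Sketch: batch-install CKAN packages with datapkg.
# DRY_RUN, run and install_packages are hypothetical helpers, not datapkg features.

# Default to printing commands instead of executing them.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: install_packages <destination-dir> <package-name>...
# Stops at the first package that fails to install.
install_packages() {
    dest="$1"
    shift
    for name in "$@"; do
        run datapkg install "ckan://$name" "$dest" || return 1
    done
}

# Install the package from the example above into the current directory.
install_packages . iso-3166-2-data
```

Setting `DRY_RUN=0` in the environment would make the script actually call datapkg; further package names can be appended to the last line.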

## Extending

datapkg is intended to be a generic tool for data packaging. As such, we want it to deal with as many “distribution” formats and as many different registries as possible. We’ve therefore designed datapkg to be extensible so that it can easily be adapted to talk with other systems. What kinds of plugins might one write?

* A plugin to discover data “packages” from RDFa information in web pages, especially those in government data catalogues (suggested by [Ed Summers](http://inkdroid.org/journal/about/))
* A plugin for Ensembl
* A plugin to extract download URLs or SPARQL endpoints from VoID descriptions (suggested by [Richard Cyganiak](http://dowhatimean.net/))

**We’re looking for more such suggestions as well as for people who’d like to implement plugins.** If you’re interested, please get in touch.
