
CKAN v1.2 Released together with Datapkg v0.7

Rufus Pollock - November 30, 2010 in CKAN, datapkg, News, Releases

We’re delighted to announce CKAN v1.2, a new major release of the CKAN software. This is the largest iteration so far, with 146 tickets closed, and it includes some really significant improvements, most importantly a new extension/plugin system, SOLR search integration, caching and INSPIRE support (more details below). The extension work is especially significant: it means you can now extend CKAN without having to delve into any core code.
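To give a flavour of the new extension mechanism, here is a minimal sketch of what an extension living in its own Python package might look like. The details shown (a SingletonPlugin subclass, registration via a setuptools entry point in a ckan.plugins group, enabling it in the CKAN config) are assumptions for illustration; the definitive interface names for v1.2 are in the CKAN extension documentation.

# ckanext/example/plugin.py -- sketch of a do-nothing CKAN extension.
# The imports and interface names are illustrative assumptions, not
# necessarily the exact v1.2 API.
from ckan.plugins import SingletonPlugin, implements
from ckan.plugins.interfaces import IRoutes


class ExamplePlugin(SingletonPlugin):
    # Assumed to be registered via a setuptools entry point in the
    # 'ckan.plugins' group and enabled with `ckan.plugins = example`
    # in the CKAN configuration file.
    implements(IRoutes, inherit=True)

    def before_map(self, route_map):
        # Adjust routing here without touching CKAN core code.
        return route_map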

In addition there are now over 20 CKAN instances running around the world and CKAN is being used in official government catalogues in the UK, Norway, Finland and the Netherlands. Furthermore, http://ckan.net/ — our main community catalogue — now has over 1500 data ‘packages’ and has become the official home for the LOD Cloud (see the lod group on ckan.net).

We’re also aiming to provide a much more integrated ‘datahub’ experience with CKAN. Key to this is the provision of a ‘storage’ component to complement the registry/catalogue component we already have. Integrated storage will support all kinds of important functionality, from automated archival of datasets to dataset cleaning with Google Refine.

We’ve already been making progress on this front with the launch of a basic storage service at http://storage.ckan.net/ (back in September) and the development of the OFS bucket storage library. The functionality is still at an alpha stage and integration with CKAN is still limited, so improving this area will be a big aim for the next release (v1.3).
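To give a feel for the kind of interface OFS provides, here is a sketch using its local pairtree backend (Python 2, as these tools were at the time). The class and method names (PTOFS, claim_bucket, put_stream, list_labels) are assumptions based on the OFS documentation of the day rather than a guaranteed API.

# Store a small object in OFS bucket storage (sketch; class and method
# names are assumed from the OFS docs of the time).
from StringIO import StringIO
from ofs.local import PTOFS

storage = PTOFS()                 # local, pairtree-backed storage
bucket = storage.claim_bucket()   # create/claim a bucket id
storage.put_stream(bucket, 'hello.txt', StringIO('hello world'))
print(storage.list_labels(bucket))  # expect something like ['hello.txt']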

Even in its alpha stage, we are already making use of the storage system, most significantly in the latest release of datapkg, our tool for distributing, discovering and installing data (and content) ‘packages’. In particular, the v0.7 release (more detail below) includes upload support, allowing you to store (as well as register) your data ‘packages’.

Highlights of CKAN v1.2 release

  • Package edit form: attach package to groups (#652) & revealable help
  • Form API – Package/Harvester Create/New (#545)
  • Authorization extended: authorization groups (#647) and creation of packages (#648)
  • Extension / Plug-in interface classes (#741)
  • WordPress twentyten compatible theming (#797)
  • Caching support (ETag) (#693); see the conditional-GET sketch below
  • Harvesting GEMINI2 metadata records from OGC CSW servers (#566)

Minor:

  • New API key header (#466)
  • Group metadata now revisioned (#231)

All tickets
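The ETag-based caching means a client can revalidate a CKAN page or API response cheaply with a conditional GET instead of re-downloading it. A minimal sketch (Python 2-era standard library; the URL is only an example, and whether a given endpoint emits an ETag in v1.2 is an assumption):

# Conditional GET using an ETag (illustrative sketch).
import urllib2

url = 'http://ckan.net/api/rest/package/gold-prices'  # example URL

first = urllib2.urlopen(url)
etag = first.info().getheader('ETag')   # validator sent by the server
body = first.read()                     # cache this alongside the ETag

if etag:
    request = urllib2.Request(url, headers={'If-None-Match': etag})
    try:
        urllib2.urlopen(request)        # a 200 means the resource changed
    except urllib2.HTTPError as e:
        if e.code == 304:
            print('Not modified - the cached copy is still good')
        else:
            raise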

Datapkg Release Notes

A major new release (v0.7) of datapkg is out!

There’s a quick getting started section below (also see the docs).

About the release

This release brings major new functionality to datapkg, especially in regard to its integration with CKAN. datapkg now supports uploading as well as downloading, and can now be easily extended via plugins. See the full changelog below for more details.
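As an illustration of the plugin approach, a third-party package might register an extra command with a setuptools entry point, which datapkg would then discover at runtime. The entry-point group name used below ('datapkg.cli') and the HelloCommand class are hypothetical placeholders; the real group names are listed in the datapkg documentation.

# setup.py for a hypothetical 'datapkg-hello' plugin package.
# 'datapkg.cli' as the entry-point group and HelloCommand are assumptions
# for illustration only.
from setuptools import setup

setup(
    name='datapkg-hello',
    version='0.1',
    py_modules=['datapkg_hello'],
    entry_points='''
        [datapkg.cli]
        hello = datapkg_hello:HelloCommand
    ''',
)

The same entry-point pattern covers the Index and Distribution plugin types mentioned in the changelog below: datapkg only has to iterate over the registered entry points for each group and load whatever it finds.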

Get started fast

# 1. Install (requires Python and easy_install)
$ easy_install datapkg
# Or, if you don't like easy_install, use pip (or install from the raw source)
$ pip install datapkg

# 2. [optional] Take a look at the manual
$ datapkg man

# 3. Search for something
$ datapkg search ckan:// gold
gold-prices -- Gold Prices in London 1950-2008 (Monthly)

# 4. Get some data
# This will result in a csv file at /tmp/gold-prices/data
$ datapkg download ckan://gold-prices /tmp

# 5. Store some data
# Edit the gold prices csv, making some corrections
$ cp /tmp/gold-prices/data mynew.csv
$ edit mynew.csv
# Now upload back to storage
$ datapkg upload mynew.csv ckan://mybucket/ckan-gold-prices/mynew.csv

Find out more » — including how to create, register and distribute your own ‘data packages’.

Changelog

  • MAJOR: Support for uploading datapkgs (upload.py)
  • MAJOR: Much improved and extended documentation
  • MAJOR: New sqlite-based DB index giving support for a simple, central,
    ‘local’ index (ticket:360)
  • MAJOR: Make datapkg easily extendable

    • Support for adding new Index types with plugins
    • Support for adding new Commands with command plugins
    • Support for adding new Distributions with distribution plugins
  • Improved package download support (also now pluggable)

  • Reimplement url download using only the Python std lib, removing the
    urlgrabber requirement and simplifying installation (a rough sketch of
    the idea follows this changelog)
  • Improved spec: support for db type index + better documentation
  • Better configuration management (especially internally)
  • Reduce dependencies by removing usage of PasteScript and PasteDeploy
  • Various minor bugfixes and code improvements
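With urlgrabber gone, a download amounts to streaming the resource URL to disk with nothing beyond the standard library. A rough sketch of the idea (Python 2-era; this is not datapkg’s actual implementation):

# Minimal std-lib-only download helper (illustrative sketch).
import os
import shutil
import urllib2


def download(url, dest_dir):
    # Stream `url` into `dest_dir` and return the local path.
    filename = url.rstrip('/').split('/')[-1] or 'download'
    dest_path = os.path.join(dest_dir, filename)
    response = urllib2.urlopen(url)
    with open(dest_path, 'wb') as out:
        shutil.copyfileobj(response, out)
    return dest_path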

Credits

A big hat-tip to Mike Chelen and Matthew Brett for beta-testing this release and to Will Waites for code contributions.

A free software model for open knowledge

jwalsh - March 17, 2010 in CKAN, datapkg, Events, OKF Projects, Open Data Commons, Open Knowledge Definition, Open Knowledge Foundation, Talks

Notes from the talk on the work of the Open Knowledge Foundation given last week at Jornadas SIG Libre.

OKF activity graph

I was happily surprised to be asked to give this open knowledge talk at an open source software conference. But it makes sense – the free software movement has created the conditions in which an open data movement is possible. There is lots to learn from open source process, in both a technical and organisational sense.

In English we have one word, “free”, where Spanish, like many languages, has two, gratis and libre, signifying separately “free of cost” and “freedom to”. The Open Source Initiative coined “open source” as a branding or marketing exercise to avoid the primary meaning of “free of cost”. So whenever I say “open” I want you to hear the word “libre”. [Later I was told that libre can be meant in at least 15 different ways.]

The best way to talk about the work of the Open Knowledge Foundation is to look at its projects, which form an open knowledge stack similar to the OSGeo software stack.

Open Definition

The Open Knowledge Definition is based on the OSI’s Open Source Definition (which OSGeo uses as a reference for acceptable software licenses). There are no restrictions on field of endeavour – non-commercial-use licenses do not count as open under the OKD. An open data license will pass the cake test.

Open Data Commons

Open Data Commons is run by Jordan Hatcher, who started work on the Open Database License with support from Talis and continued it through extensive negotiation with the OpenStreetMap community. ODbL is a ShareAlike license for data that gets around the problems of copyright not applying to facts, and of the ShareAlike clause being too greedy when it comes to the use of maps in PDFs, etc.

PDDL is a license that implements the Science Commons protocol for open access data, explicitly placing it in the public domain.

The Panton Principles are four precepts for publishers of scientific research data who wish that data to be freely reusable. Being openly able to inspect, critique and re-analyse data is critical to the effectiveness of scientific research.

Open Data Grid

The Open Data Grid is a project in early incubation, based on the Tahoe distributed filesystem. It’s in need of development effort on Tahoe to really get going. The aim is to provide secure storage for open datasets around the edges of infrastructure that people are already running.

People are handwaving about the Cloud, but storage and backup are not problems that it is really meant to solve. People make different claims about the Cloud – cheaper, greener, more efficient, more flexible. Can we get these things in other ways?

There is a saying: “never underestimate the bandwidth of a truck full of DAT tapes”.

Comprehensive Knowledge Archive Network (CKAN)

CKAN is inspired by free software package repositories: Perl’s CPAN, R’s CRAN, Python’s PyPI. It provides a wiki-like interface for creating minimal metadata for packages, with a versioned domain model and an HTTP API.
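For example, a package’s metadata can be fetched over the HTTP API with a plain GET. A Python 2-era sketch (the /api/rest/ path reflects the API of the time and is an assumption here, so check the CKAN API docs for the current form):

# Read package metadata from the CKAN HTTP API (sketch).
import json
import urllib2

url = 'http://ckan.net/api/rest/package/gold-prices'  # example package
package = json.load(urllib2.urlopen(url))
print(package.get('title'))
print([resource.get('url') for resource in package.get('resources', [])])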

CKAN supports groups, which can curate a package namespace – e.g. climate data – and assess priorities for turning them into fully installable packages.

CKAN’s open source code is being used in the data package catalogue for the data.gov.uk project, part of the Making Public Data Public effort in the UK.

datapkg

The Debian of Data – datapkg takes Debian’s apt tool as inspiration for fully automatable installation of data packages, with dependencies between them. It is currently at a usable alpha stage, with a Python implementation.

Where Does My Money Go?

The next challenge really is to bring the concerns and the solutions to a mainstream public. Agustín Lobo spoke of “a personal consciousness but not an institutional consciousness” when it comes to open source and open data. Media coverage and exemplary government implementations help to create this kind of consciousness.

Pressure for increased open access is coming from academia – for the research data underlying papers, for the right to data mine and correlate different sources, for library data open for re-use. Pressure is also coming from within museums, libraries and archives – memory institutions who want to increase exposure to their collections with new technology, and recognise that open data, linked to a network of resources, will work for sustainability and not against it.

The next generation of researchers, who are kids in school now, will grow up with an expectation that code and data are naturally open. It will be interesting to see what they make!

Meanwhile OpenStreetMap is feeding several startups, and more commercial presence in the open data space will be of benefit. It is illustrative that one does not have to be proprietary to be commercial.

Now higher-profile government projects opening up data are helping to bring it into the mainstream. To what extent is openness a fashionable position, and to what extent is it reflected throughout the way of working?

Open process: early release, public sharing of bugs, public discussion of plans – everything in Nat Torkington’s post on Truly Open Data. The opportunity to fail in public, to learn from others’ problems, and to collaborate out of self-interest.


I had a great time at SIG Libre 10. Oscar Fonts’ talk on OpenSearch Geospatial interfaces to popular services has me itching to add an OpenSearch +Geo interface to CKAN, as well as to work on getting the apparent version skew in the Geo extensions resolved amicably.

Genís Roca spoke thought-provokingly on Retorno y rentabilidad (there isn’t really an equivalent English word – “rentability” – less exploitative or narrowly focused than profitability). Rentability, especially for online services, can come in ways that sustain an organisation predictably and don’t involve fishing in the pockets of ultimate end-users.

Ivan Sanchez showed areas of OpenStreetMap Spain with a stunning level of detail – trees and fences, MasterMap-quality coverage. I’m inspired to pick up JOSM and Merkaartor to add building-level detail from out-of-copyright 1:500 Edinburgh town plans at the National Library of Scotland’s map services.

Agustín Lobo talked about the distributed work, the cross-institutional support and benefit of the R project, and the impact of open source on open access to data in science. He mentioned a Nature open peer review experiment that was discarded – I suspect it wasn’t curated enough. The talk helped me to connect the OKF’s work to the rest of the Jornadas.

The shiny slides are on prezi.com – many people asked for details of them – and should show embedded in this page, I hope. I stupidly forgot to put URLs on the slides, which is partly why I have written this blog post.

Introducing Datapkg: A Tool for Distributing, Discovering and Installing Data “Packages”

Rufus Pollock - February 23, 2010 in CKAN, datapkg, News

Datapkg 0.5 has been released! This is the first release deemed suitable for public consumption (though we are still in alpha)! This announcement therefore serves as both an introduction and a release announcement.

Introduction

From the docs:

datapkg is a user tool for distributing, discovering and installing data (and content) ‘packages’.

datapkg is a simple way to ‘package’ data, building on existing packaging tools developed for code (e.g. Debian apt, PyPI, CRAN, Gems, CPAN). datapkg is designed to integrate closely with CKAN (the Comprehensive Knowledge Archive Network).

In terms of the big picture, datapkg is the “apt-get/aptitude/dpkg” part of the vision for a ‘Debian of Data’ (i.e. scalable, distributed, open data infrastructures! — for more see this post or these recent slides):

debian of data

Datapkg is a key part of making data sharing automatable. As an end-user tool it allows automated (command-line or scripted) discovery, installation and sharing of data “packages” either standalone or via interaction with a registry like CKAN.
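Scripted use can be as simple as driving the command-line tool from a short Python script, reusing the exact commands shown in the next section:

# Drive the datapkg CLI from a script (sketch).
import subprocess

# search CKAN for packages matching 'iso'
subprocess.check_call(['datapkg', 'search', 'ckan://', 'iso'])

# then install one of the results into the current directory
subprocess.check_call(['datapkg', 'install', 'ckan://iso-3166-2-data', '.'])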

Trying It Out

If you’re interested in giving it a spin, here are installation instructions. Once you’ve got it running, you can do things like the following (see the manual for more):

Search for a package in an Index, e.g. on CKAN.net:

# let's search for iso country/language codes data (iso 3166 ...)
$ datapkg search ckan:// iso
...
iso-3166-2-data -- Linked ISO 3166-2 Data
...

Get some information about one of them (in this case 2-digit ISO country codes in RDF):

$ datapkg info ckan://iso-3166-2-data
....
....

Let’s install it (to the current directory):

$ datapkg install ckan://iso-3166-2-data .

This will download the Package ‘iso-3166-2-data’ together with its “Resources” and unpack it into a directory named ‘iso-3166-2-data’.

Extending

datapkg is intended to be a generic tool for data packaging. As such, we want it to deal with as many “distribution” formats and as many different registries as possible. We’ve therefore designed datapkg to be extensible so that it can easily be adapted to talk with other systems. What kinds of plugins might one write?

  • A plugin to discover data “packages” from RDFa information in web pages, especially those in government data catalogues (suggested by Ed Summers)
  • A plugin for Ensembl (http://www.ensembl.org/)
  • A plugin to extract download URLs or SPARQL endpoints from VoID descriptions (suggested by Richard Cyganiak)

We’re looking for more such suggestions as well as for people who’d like to implement plugins. If you’re interested please get in touch: http://www.okfn.org/contact/
