The following is a post by Rufus Pollock, co-founder of the Open Knowledge Foundation.
The Present: A One-Way Street
At present, the basic model for data processing is a “one-way street”. Sources of data, such as government, publish data out into the world where, if we are lucky, it is processed by intermediaries such as app creators or analysts before finally being consumed by end users [1].
It is a one-way street because there is no feedback loop: no sharing of data back to publishers, and no sharing between intermediaries.
So what should be different?
The Future: An Ecosystem
What we should have is an ecosystem. In an ecosystem there are data cycles: infomediaries, the intermediate consumers of data such as app builders and data wranglers, should also be publishers, sharing their cleaned, integrated, and packaged data back into the ecosystem in a reusable way. These cleaned and integrated datasets are, of course, often more valuable than the original source.
In addition, corrected data, or relevant “patches”, should find their way back to data producers so that data quality improves at the source. Finally, end users of data need not be passive consumers: they should also be able to contribute back, flagging errors or submitting corrections themselves.
With the introduction of data cycles we have a real ecosystem rather than a one-way street, and this ecosystem thrives on collaboration, componentization and open data.
What is required to develop this ecosystem model rather than a one way street? Key changes include (suggestions welcome!):
- Infomediaries to publish what they produce (and tools to make this really easy)
- Data packaging and patching format (better ways to publish and share data)
- Publisher notification of patches (pull requests) with automated integration (merge) tools
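To make the “data packaging” bullet concrete, here is one hypothetical shape such a package descriptor might take: a small machine-readable file published alongside the cleaned data so others can discover and reuse it. All field names and values below are illustrative assumptions, loosely in the spirit of the data-package work at the Open Knowledge Foundation, not a fixed standard:

```python
import json

# Illustrative only: a minimal descriptor an infomediary might publish
# alongside cleaned data. Every field name here is an assumption,
# sketching what "publish what they produce" could look like in practice.
descriptor = {
    "name": "uk-25k-spending-cleaned",
    "description": "Departmental spending over 25k, cleaned and normalised",
    "sources": [{"title": "data.gov.uk"}],
    "resources": [
        {
            "path": "spending.csv",
            "schema": {
                "fields": [
                    {"name": "date", "type": "date"},
                    {"name": "department", "type": "string"},
                    {"name": "amount", "type": "number"},
                ]
            },
        }
    ],
}

# Serialising it as JSON next to the data file is all "publishing" needs to be.
print(json.dumps(descriptor, indent=2))
```

The point is not the particular fields but that a few lines of structured metadata make a cleaned dataset citable, discoverable, and mergeable by the next wrangler.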
But it is just a beginning and there’s still a good way to go before we’ve really made the transition from the one-way street to a proper (open) data ecosystem.
Annexe: Some Illustrations of Our Current One-Way Street
Currently it’s common to hear people describe web apps or visualizations that have been built using some particular dataset. However, it’s unusual to hear them then say “and I published the cleaned data and the data-cleaning code back to the community in a reusable way”, and even rarer to hear “and the upstream provider has corrected the errors we found in the data based on our reports”. (It’s also rare to hear people talk about the datasets they’ve created, as opposed to the apps or visualizations they’ve built.)
We know about this first hand. When the UK government first published its 25k spending data last autumn, we worked hard as part of our Where Does My Money Go? project to process and load the data so it was searchable and visualizable. Along the way we found data ‘bugs’ of the kind that are typical in this area: dates presented as bare serial numbers such as 47653 (time since an epoch), dates presented in inconsistent styles (US format versus UK format), occasional typos in the names of departments and other entities, and so on.
We did our best to correct these as part of the load. However, it’s doubtful that any of the issues we found were fixed at source (and certainly not as a result of our work), and we also did little to share our fixes with other data wranglers working on the same data.
Why was this?
First, there is no mechanism for feeding back to the publishers. We did notify data.gov.uk of some of the issues we found, but it is very hard for them to act on this: the precise ‘publisher’ within a department may be hard to identify and may even be a machine (if the data is produced automatically in some way).
Second, there is no easy format in which to share fixes. Our cleaning code was public on the web, but a bunch of Python if-statements is not the best ‘patch’ format for data. In a perfect world we’d have a patch format for data (or even just for CSVs), and one that was algorithmic rather than line-based (so one could specify in a single statement that column X is wrongly formatted in a particular way, rather than record 10k individual line changes); that was easily reviewable (we’re patching government spending data here!); and that was automatically applicable. In short: a patch format with tool support.
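An algorithmic patch of the kind described could be as simple as a list of column-level rules applied to a whole file. The rule names, the misspelled department, and the serial-date convention (1900-epoch, base 1899-12-30) below are all hypothetical, a sketch of the idea rather than any real format:

```python
import csv
import io
from datetime import date, timedelta

# A hypothetical algorithmic patch: each rule fixes an entire column,
# so one statement replaces thousands of line-by-line diff entries.
PATCH = [
    {"column": "Date", "op": "serial_to_iso"},
    {"column": "Department", "op": "replace",
     "from": "Deprtment of Health", "to": "Department of Health"},
]

def serial_to_iso(value: str) -> str:
    # Assumes spreadsheet-style 1900-epoch serial day numbers.
    if value.isdigit():
        return (date(1899, 12, 30) + timedelta(days=int(value))).isoformat()
    return value

def apply_patch(csv_text: str, patch) -> str:
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    for rule in patch:
        col = rule["column"]
        for row in rows:
            if rule["op"] == "serial_to_iso":
                row[col] = serial_to_iso(row[col])
            elif rule["op"] == "replace" and row[col] == rule["from"]:
                row[col] = rule["to"]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames,
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

Because the patch is a small declarative structure rather than code, it is easy to review (each rule states exactly what it changes and where) and easy for a publisher to apply mechanically, which is the “pull request for data” idea in miniature.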
1. I’m inevitably simplifying here. For example, there is of course some direct consumption by end users, and there is some sharing between systems. ↩
2. I discussed some of the work around data revisioning in this previous post: We Need Distributed Revision/Version Control for Data. ↩