Open Data Going Mainstream?

Bret Taylor’s recent post entitled “We Need a Wikipedia for Data” has been garnering a lot of attention around the blogosphere. While his suggestions are not particularly novel, the post and the attention it has received are, I think, indicative of the growing interest in (open) data and its importance for the development of related services and products.

While I am generally in agreement with Bret’s arguments, there are a few differences worth raising. First, Bret appears to favour some kind of centralized repository that everyone can read from and write to:

To this end, I think we should create a Wikipedia for data: a global database for all of these important data sources to which we all contribute and that anyone can use.

As readers of this blog will know, we’re sceptical of this ‘one ring to rule them all’ approach. In this regard, it is also important to distinguish between finding material, parsing it, and plugging it together. These issues were rather run together in the surrounding discussion. As I wrote in a comment to Bret’s post:

There seem to be several distinct issues you (and your commenters) are concerned with:

Discoverability of datasets. For this you want a registry of some kind and this is exactly what the Comprehensive Knowledge Archive Network (CKAN) is designed to do. …

‘Developing’ data, particularly using many contributors and a versioning (wiki-like) model. This seems a general problem, and one which I wrote about in this post on the collaborative development of data back in February last year. Since then various projects have launched or developed which attempt to address this issue, even if only partially (e.g. Freebase, Swivel, Numbrary, http://www.openeconomics.net …). This then leads into:

Componentizing data so that one can easily plug different datasets together rather than having to aggregate data together in one big place (crudely: ‘One Ring to Rule them All’ vs. ‘Small Pieces, Loosely Joined’). After all, it seems unlikely that any one organization, however large, can hold ‘all the data’, and in any case doing so would negate the benefits of having ‘many minds’ working on a problem. It is our hope that CKAN will start to facilitate the kind of packaging that one frequently observes in software but is, as yet, fairly rare for knowledge (data/content/…). More on this can be found in this blog post on componentization plus the slides from our presentation at XTech.

To conclude, I definitely agree about the importance of having more open data and making it easier to find and use, though I’m hoping that it will take a more decentralized and componentized form than simply a ‘Wikipedia’ for data. More important than any details, though, is the fact that this kind of interest from a wider audience indicates that issues of data openness and production are going mainstream, something we as a community should strongly welcome.