9 models to scale open data – past, present and future
The possibilities of open data have been enthralling us for 10 years.
I came to it through wanting to make Government really usable, to build sites like TheyWorkForYou.
But that excitement isn’t what matters in the end.
What matters is scale – which organisational structures will make this movement explode?
Whether by creating self-growing volunteer communities, or by generating flows of money.
This post quickly and provocatively goes through some that haven’t worked (yet!) and some that have.
Ones that are working now
1) Form a community to enter in new data. Open Street Map and MusicBrainz are two big examples. It works as the community is the originator of the data. That said, neither has dominated its industry as much as I thought they would have by now.
2) Sell tools to an upstream generator of open data. This is what CKAN does for central Governments (and the new ScraperWiki CKAN tool helps with). It’s what mySociety does, when selling FixMyStreet installs to local councils, thereby publishing their potholes as RSS feeds.
3) Use open data (quietly). Every organisation does this and never talks about it. It’s key to quite old data resellers like Bloomberg. It is what most of ScraperWiki’s professional services customers ask us to do. The value to society is enormous and invisible. The big flaw is that it doesn’t help scale supply of open data.
4) Sell tools to downstream users. This isn’t necessarily open data specific – existing software like spreadsheets and Business Intelligence can be used with open or closed data. Lots of open data is on the web, so tools like the new ScraperWiki which work well with web data are particularly suited to it.
Ones that haven’t worked
5) Collaborative curation ScraperWiki started as an audacious attempt to create an open data curation community, based on editing scraping code in a wiki. In its original form (now called ScraperWiki Classic) this didn’t scale. Here are some reasons, in terms of open data models, why it didn’t.
a. It wasn’t upstream. Whatever provenance you give, people trust data most that they get it straight from its source. This can also be a partial upstream – for example supplementing scraped data with new data manually gathered by telephone.
b. It isn’t in private. Although in theory there’s lots to gain by wrangling commodity data together in public, it goes against the instincts of most organisations.
c. There’s not enough existing culture. The free software movement built a rich culture of collaboration, ready to be exploited some 15 years in by the open source movement, and 25 years later by tools like Github. With a few exceptions, notably OpenCorporates, there aren’t yet open data curation projects.
6) General purpose data marketplaces, particularly ones that are mainly reusing open data, haven’t taken off. They might do one day, however I think they need well-adopted higher level standards for data formatting and syncing first (perhaps something like dat, perhaps something based on CSV files).
Ones I expect more of in the future
These are quite exciting models which I expect to see a lot more of.
7) Give labour/money to upstream to help them create better data. This is quite new. The only, and most excellent, example of it is the UK’s National Archive curating the Statute Law Database. They do the work with the help of staff seconded from commercial legal publishers and other parts of Government.
It’s clever because it generates money for upstream, which people trust the most, and which has the most ability to improve data quality.
8) Viral open data licensing. MySQL made lots of money this way, offering proprietary dual licenses of GPLd software to embedded systems makers. In data this could use OKFN’s Open Database License, and organisations would pay when they wanted to mix the open data with their own closed data. I don’t know anyone actively using it, although Chris Taggart from OpenCorporates mentioned this model to me years ago.
9) Corporations release data for strategic advantage. Companies are starting to release their own data for strategic gain. This is very new. Expect more of it.
What have I missed? What models do you see that will scale Open Data, and bring its benefits to billions?