9 models to scale open data – past, present and future

Golden spiral, by Kakapo31 CC-BY-NC-SA

The possibilities of open data have been enthralling us for 10 years.

I came to it through wanting to make Government really usable, to build sites like TheyWorkForYou.

But that excitement isn’t what matters in the end.

What matters is scale – which organisational structures will make this movement explode?

Whether by creating self-growing volunteer communities, or by generating flows of money.

This post quickly and provocatively goes through some that haven’t worked (yet!) and some that have.

Ones that are working now

1) Form a community to enter new data. OpenStreetMap and MusicBrainz are two big examples. It works because the community is the originator of the data. That said, neither has dominated its industry as much as I thought they would have by now.

2) Sell tools to an upstream generator of open data. This is what CKAN does for central Governments (and the new ScraperWiki CKAN tool helps with). It’s what mySociety does, when selling FixMyStreet installs to local councils, thereby publishing their potholes as RSS feeds.

3) Use open data (quietly). Every organisation does this and never talks about it. It’s key to quite old data resellers like Bloomberg. It is what most of ScraperWiki’s professional services customers ask us to do. The value to society is enormous and invisible. The big flaw is that it doesn’t help scale supply of open data.

4) Sell tools to downstream users. This isn’t necessarily open data specific – existing software like spreadsheets and Business Intelligence can be used with open or closed data. Lots of open data is on the web, so tools like the new ScraperWiki which work well with web data are particularly suited to it.

Ones that haven’t worked

5) Collaborative curation. ScraperWiki started as an audacious attempt to create an open data curation community, based on editing scraping code in a wiki. In its original form (now called ScraperWiki Classic) this didn’t scale. Here are some reasons, in terms of open data models, why it didn’t.

a. It wasn’t upstream. Whatever provenance you give, people most trust data that they get straight from its source. This can also be a partial upstream - for example supplementing scraped data with new data manually gathered by telephone.

b. It isn’t in private. Although in theory there’s lots to gain by wrangling commodity data together in public, it goes against the instincts of most organisations.

c. There’s not enough existing culture. The free software movement built a rich culture of collaboration, ready to be exploited some 15 years in by the open source movement, and 25 years later by tools like GitHub. With a few exceptions, notably OpenCorporates, there aren’t yet open data curation projects.

6) General purpose data marketplaces, particularly ones that are mainly reusing open data, haven’t taken off. They might do one day, however I think they need well-adopted higher level standards for data formatting and syncing first (perhaps something like dat, perhaps something based on CSV files).

Ones I expect more of in the future

These are quite exciting models which I expect to see a lot more of.

7) Give labour/money to upstream to help them create better data. This is quite new. The only example I know of, and an excellent one, is the UK’s National Archives curating the Statute Law Database. They do the work with the help of staff seconded from commercial legal publishers and other parts of Government.

It’s clever because it generates money for upstream, which people trust the most, and which has the most ability to improve data quality.

8) Viral open data licensing. MySQL made lots of money this way, offering proprietary dual licenses of GPLd software to embedded systems makers. In data this could use OKFN’s Open Database License, and organisations would pay when they wanted to mix the open data with their own closed data. I don’t know of anyone actively using it, although Chris Taggart from OpenCorporates mentioned this model to me years ago.

9) Corporations release data for strategic advantage. Companies are starting to release their own data for strategic gain. This is very new. Expect more of it.

What have I missed? What models do you see that will scale Open Data, and bring its benefits to billions?

  • Jun

    Great post. The good folks at http://datakind.org/ are also looking at helping organisations with their capacity to develop open/closed/better data upstream. A part of “creating better data” surely has to include developing better interoperability and initiatives like http://popoloproject.com/ should help in addition to moving data sets up the http://5stardata.info/ scale. https://www.theengineroom.org/ is also looking at responsible use of data, interoperability as well as lowering the barrier to entry for small local non-profits to publish their data sets. This is one of the topics we’ll address at Okcon 2013 in our workshop “Interoperability Standards for Public Good Data”.

    I’m particularly interested in hearing whether there are good models and technology platforms that are emerging around open data curation? I think Freebase is one of them but proprietary. Using git for data is another more technology driven approach and so is work around linked data provenance. Anything else going on there that’s moving things along to allow open and closed data sets, crowd contributions, public and private curation to co-exist?

  • Sotiris Koussouris

    Dear Francis,

    I slightly disagree with the arguments you pose under the “Ones that haven’t worked” part.

    For instance, I am a big fan of collaborative curation of data (and am actually working on such a project). As you state, there are problems there, such as verifying the end result and, as a consequence, people’s trust in the data. However, the same problems also exist in point 1), where you also mention a community of users. What is so different in such communities, and why is their data trusted? Or is this the same case as with collaborative curation? From my point of view, both ideas derive from the same basis (the fundamental idea of communities of users powering interesting solutions), and thus they share the same benefits and risks. And there is an evident lack of features that could scale up these cases, such as community rating mechanisms, license checking functions, etc.

    Also, as practice has shown, keeping the community out of the loop and not allowing people to co-create on existing things will not work in the long term. Paraphrasing a famous man “… don’t think what you can do with your data, but let others play with them and produce things you have not even imagined …” :)

    So the question, in my opinion, is not whether something is working, but WHY it is not working and WHAT can be done in order to reverse the situation.

    • frabcus

      Great questions Sotiris!

      Yes, the challenge in 5) is, how can we build such a collaborative culture? You’re also right that 1) is very strongly related.

      One difference is about clarity of value of output. For data, it is much clearer to show value in a vertical. And you do whatever it takes – OSM for example both enters its own totally new data, and also ingests (“collaboratively curates”) other sources which are suitably licensed.

      The privacy part is about who has money/energy at the same time as need. Right now, there are few people with both, and those that do aren’t bought into open data – they are still thinking of data silos.

      Hope to have more conversations about this at OKFNCon!

  • Harris Alexopoulos

    This is a really great and helpful post!

    To provide one more example of an open data curation project, I will mention the ENGAGE project: http://www.engagedata.eu/

    The features of ENGAGE position it as a centralized and collaborative PSI e-Infrastructure providing the necessary tools for dataset processing and acquisition and differentiate it from a simple repository of open public datasets. It will be an intelligent social and collaborative space for researchers, data journalists, citizens and other potential end users, who rely on open public data for professional or personal (re-)use. – See more at: http://www.engage-project.eu/engage/wp/?p=904#sthash.gGDSph30.dpuf
