9 models to scale open data – past, present and future


The possibilities of open data have been enthralling us for 10 years.

I came to it through wanting to make Government really usable, to build sites
like TheyWorkForYou.

But that excitement isn’t what matters in the end.

What matters is scale – which organisational structures will make this movement
explode?

Whether by creating self-growing volunteer communities, or by generating flows
of money.

This post quickly and provocatively goes through some that haven’t worked
(yet!) and some that have.

Ones that are working now

1) Form a community to enter new data. OpenStreetMap and MusicBrainz are two big examples. It works
because the community is the originator of the data. That said, neither has
dominated its industry as much as I thought they would have by now.

2) Sell tools to an upstream generator of open data. This is what
CKAN does for central Governments (and the new ScraperWiki CKAN tool helps with). It’s what mySociety does, when selling
FixMyStreet installs to local councils, thereby publishing their potholes as RSS feeds.

3) Use open data (quietly). Every organisation does this and never talks
about it. It’s key to quite old data resellers like Bloomberg. It is what most of
ScraperWiki’s professional services
customers ask us to do. The value to society is enormous and invisible. The
big flaw is that it doesn’t help scale supply of open data.

4) Sell tools to downstream users. This isn’t necessarily open data
specific – existing software like spreadsheets and Business Intelligence can be
used with open or closed data. Lots of open data is on the web, so tools like
the new ScraperWiki which work well with
web data are particularly suited to it.

Ones that haven’t worked

5) Collaborative curation. ScraperWiki started as an audacious attempt to create an open data curation
community, based on editing scraping code in a wiki. In its original form
(now called ScraperWiki Classic) this didn’t scale.
Here are some reasons, in terms of open data models, why it didn’t.

a. It wasn’t upstream. Whatever provenance you give, people most trust data
that they get straight from its source. This can also be a partial upstream –
for example supplementing scraped data with new data manually gathered by
telephone.

b. It isn’t in private. Although in theory there’s lots to gain by wrangling
commodity data together in public, it goes against the instincts of most
organisations.

c. There’s not enough existing culture. The free software movement built a rich
culture of collaboration, ready to be exploited some 15 years in by the open
source movement, and 25 years in by tools like GitHub. With a few
exceptions, notably OpenCorporates, there
aren’t yet open data curation projects.

6) General purpose data marketplaces, particularly ones that are mainly
reusing open data, haven’t taken off. They might do one day, but I think
they need well-adopted higher level standards for data formatting and syncing
first (perhaps something like dat,
perhaps something based
on CSV files).

Ones I expect more of in the future

These are quite exciting models which I expect to see a lot more of.

7) Give labour/money to upstream to help them create better data. This is
quite new. The only example I know of, and an excellent one, is the UK’s National
Archives curating the Statute Law Database. They do the work with the help of staff seconded
from commercial legal publishers and other parts of Government.

It’s clever because it generates money for upstream, which people trust the most,
and which has the most ability to improve data quality.

8) Viral open data licensing. MySQL made lots of money this way, offering
proprietary dual licenses of GPLd software to embedded systems makers. In data
this could use OKFN’s Open Database License,
and organisations would pay when they wanted to mix the open data with their
own closed data. I don’t know of anyone actively using it, although Chris Taggart
from OpenCorporates mentioned this model to me years ago.

9) Corporations release data for strategic advantage. Companies are starting to release
their own data for strategic gain. This is very new. Expect more of it.

What have I missed? What models do you see that will scale Open Data, and bring
its benefits to billions?

Written by

Francis Irving

CEO of ScraperWiki. Made several of the world's first civic websites, such as TheyWorkForYou and WhatDoTheyKnow.

4 Comments

Harris Alexopoulos says:

July 28, 2013 at 11:13

This is a really great and helpful post!

To add one more example of an open data curation
project, I will mention the ENGAGE project: http://www.engagedata.eu/.

The features of ENGAGE position it as a centralized and collaborative PSI e-Infrastructure providing the necessary tools for dataset processing and acquisition and differentiate it from a simple repository of open public datasets. It will be an intelligent social and collaborative space for researchers, data journalists, citizens and other potential end users, who rely on open public data for professional or personal (re-)use. See more at: http://www.engage-project.eu/engage/wp/?p=904

Sotiris Koussouris says:

July 26, 2013 at 14:44

Dear Francis,

I slightly disagree with the arguments you pose under the “Ones that haven’t worked” part.

For instance, I am a big fan of collaborative curation of data (and am actually working on such a project). However, as you state, problems do exist there, such as the verification of the end result and, as a consequence, people’s trust in the data. The same problems also exist in point 1), where you also mention a community of users. What is so different in such communities, and why is their data trusted? Or is this the same case as with collaborative curation? From my point of view, both ideas derive from the same basis (the fundamental idea of communities of users powering interesting solutions) and thus they share the same benefits and risks. And there is an evident lack of features that could scale up these cases, such as community rating mechanisms, license checking functions, etc.

Also, as practice has shown, keeping the community out of the loop and not allowing people to co-create on existing things is something that will not work in the long term. Paraphrasing a famous man: “… don’t think what you can do with your data, but let others play with them and produce things you have not even imagined …” 🙂

So the question, in my opinion, is not whether something is working, but WHY it is not working and WHAT can be done in order to reverse the situation.

- frabcus says:
  
  August 5, 2013 at 12:21
  
  Great questions Sotiris!
  
  Yes, the challenge in 5) is, how can we build such a collaborative culture? You’re also right that 1) is very strongly related.
  
  One difference is about clarity of value of output. For data, it is much clearer to show value in a vertical. And you do whatever it takes – OSM for example both enters its own totally new data, and also ingests (“collaboratively curates”) other sources which are suitably licensed.
  
  The privacy part is about who has money/energy at the same time as need. Right now, there are few people with both, and those that do aren’t bought into open data – they are still thinking of data silos.
  
  Hope to have more conversations about this at OKFNCon!
  
Jun says:

July 19, 2013 at 11:01

Great post. The good folks at http://datakind.org/ are also looking at helping organisations with their capacity to develop open/closed/better data upstream. A part of “creating better data” surely has to include developing better interoperability and initiatives like http://popoloproject.com/ should help in addition to moving data sets up the http://5stardata.info/ scale. https://www.theengineroom.org/ is also looking at responsible use of data, interoperability as well as lowering the barrier to entry for small local non-profits to publish their data sets. This is one of the topics we’ll address at Okcon 2013 in our workshop “Interoperability Standards for Public Good Data”.

I’m particularly interested in hearing whether there are good models and technology platforms emerging around open data curation. I think Freebase is one of them, but it is proprietary. Using git for data is another, more technology-driven approach, and so is work around linked data provenance. Anything else going on there that’s moving things along to allow open and closed data sets, crowd contributions, public and private curation to co-exist?
