Scaling the Open Data Ecosystem

October 31, 2011 in Ideas and musings, News, Open Data

This is a post by Rufus Pollock, co-Founder of the Open Knowledge Foundation. As reported elsewhere I’ve been fortunate enough to have my Shuttleworth Fellowship renewed for the coming year so that I can continue and extend my work at the Open Knowledge Foundation on developing the open data ecosystem. The following text and video formed the main part of my renewal application.

Describe the world as it is.

Over the last several decades the world has seen an explosion of digital technologies which have the potential to transform the way knowledge is disseminated. This world is rapidly evolving and one of its more striking possibilities is the creation of an open data ecosystem in which information is freely used, extended and built on. The resulting open data ‘commons’ is valuable in and of itself, but also, and perhaps even more importantly, because of the social and commercial benefits it generates: whether in helping us to understand climate change, speeding the development of life-saving drugs, or improving governance and public services.

In developing this open data ecosystem three key things are needed: material, tools and people. This is a key point: open information without tools and communities to utilise it is not enough. After all, openness isn’t an end in itself: open material has no value if it isn’t used. We therefore need the capabilities for utilising open material (for processing, analysing and sharing it, especially on a large scale) to be widely available. Relevant tools need to be freely and openly available, and the related infrastructure (after all, tools need somewhere to run and data needs somewhere to be stored) should be capable of effective deployment by distributed communities.

Over the last few years we’ve started to see increasing amounts of open material made available, with release of open data really starting to take off in the last couple of years. But the (open) tools and the communities to use them are still very limited — we’re just starting to see the first self-identified “data wranglers / data hackers / data scientists” (note how the terms have not settled yet!). Key architectural elements of the ecosystem, such as how we create and share data in an open componentized way, are only just beginning to be worked through. We are therefore at a key moment where we transition from just ‘getting the data’ (and building the app) to a real data ecosystem in which data is transformed, shared and reintegrated and we replace a ‘data pipeline’ with ‘data cycles’.

What change do you want to make?

I want to see a world in which open data – data that can be freely shared and used without restriction – is ubiquitous and in which that data is used to improve the world around us, whether by finding you a better route to work, helping us to prevent climate change, or improving reportage. I want open data to allow us to build the tools and systems to help us navigate and manage the increasingly complex information-based world in which we now live.

Specifically, I want to help grow the emerging open data ecosystem. While part of this involves supporting and expanding the ongoing release of material, building on the major progress of the last few years, the biggest change I want to make is to develop the tools and communities so that we can make effective use of the increasing amounts of open data that are now becoming available.

Particular changes I want to make are:

  • Development of real ‘data cycles’ (especially for government data). By data cycles I mean a process whereby material is released, it’s used and improved by the community and then that work finds its way back to the data source.
  • Greater connection of open data to journalists and other types of reporters/analysts who can use this data and bring it to a wider audience.
  • Development of an active and globally-connected community of open data wranglers.
  • Development of better open tools and infrastructure for working with data, especially in a distributed community, using a componentization approach that allows us to scale rapidly and efficiently.

What do you want to explore?

I’m interested in learning more about the actual and potential user communities for open data. I want to explore what they want, in relation to both tools and data, and also their awareness of what is already out there. I’m especially interested in areas like journalism, government, and the general civic hacker community.

I want to explore the processes around ‘data refining’ (obtaining, cleaning and transforming source data into something more useful) and data ‘analysis’, which are usually closely related tasks. I’m especially interested in existing business activity in this area, often labelled with headings like business intelligence and data warehousing. I want to see what we could learn from business regarding tools and process that could be used in the wider open data community, as well as how the business community can take advantage of open data.

I want to explore how we can connect together the distributed community of data wranglers and hacktivists, focusing on a specific area like civic information or finances. How do we allow for loose networks across different locations and different organisations while sharing information and collaborating on the development of tools?

Lastly, I want to explore the tools and processes needed to support decentralised, collaborative, and componentised development of data. How can we build robust and scalable infrastructures? How can we build the technology to allow people to combine multiple sources of official data in a wiki-like manner – so that changes can be tracked, and provenance can be traced? How can we break down data into smaller manageable components, and then successfully recombine them again? How can we ‘package’ data and create knowledge APIs to enable automated distribution and reuse of datasets? How can we achieve real read/write status for official information – not just access alone?

What are you going to do to get there?

I want to focus my efforts in this next year on 3 key areas, breaking new ground but also building on existing work I’ve been doing with the Open Knowledge Foundation.

First, I want to build out the CKAN software and community from a registry to a data hub – a platform for working with data, not just listing it. The last year has seen very significant uptake of CKAN, with dozens of CKAN instances around the world including several official government and institutional deployments. Improving and expanding CKAN will allow us to capitalize on this success to make CKAN into an essential tool and platform for open data “development”.

The most important aspect of the software side of this will be the development of a datastore component supporting the processing and visualization of data within CKAN. With features like these CKAN can become a valuable tool not just for tech-savvy data ‘geeks’ but for the more general users of data such as journalists and civil servants. Engaging this wider, “non-techy” audience is a key part of scaling up the ecosystem. It is important to emphasize that this won’t just be about developing software but about understanding and engaging with a variety of data-user communities, exploring how they work, what they want and how they can be helped.

Second, I want to build out the OpenSpending platform and community. OpenSpending is Where Does My Money Go globalized: a worldwide project to ‘map the money’. Following the successful launch of Where Does My Money Go last autumn in the UK, in the last 6 months we have dramatically expanded coverage, with data now from more than 15 countries (in May our work on Italy received coverage in La Stampa, the Guardian and other major newspapers).

Working with OpenSpending complements work on CKAN because it is a chance to act as a data user and refiner: we already have some integration with CKAN but it’s still very basic. Furthermore, OpenSpending presents the chance to develop a specific data wrangler / data user community, one which can and should have close links with users and analysts of data including journalists and civic ‘hacker’ groups. In this way OpenSpending can act as a microcosm and prototype for developments in the wider open data community.

Third, I want to develop the OKF Open Data Labs. Much like “Google Labs” for Google’s web services, “Mozilla Labs” for the Web, and “Sunlight Labs” for US transparency websites, I would like the “Open Data Labs” to be a place for coders and data wranglers to collaborate, experiment, share ideas and prototypes, and ultimately build a new generation of open source tools and services for working with open data. The Labs would form a natural complement to my other activities with CKAN and OpenSpending – the Labs could build on material and tools from those projects while simultaneously acting as an incubator for new extensions and ideas useful both there and elsewhere.

  • Chris Taggart

    Definitely agree about the data ‘cycles’. Even the UK government is failing to get its head around this one, still less achieve it.

    I think the other point that needs stressing is that an ecosystem requires a variety of participants (community, non-profits, for-profits), with a variety of motives (fun, grants, money). Without those participants the Open Source world would not have flourished, and one of the problems of the current open data landscape is that there are too few participants, and very few businesses.


