An approach to building open databases

This post has been co-authored by Adam Kariv, Vitor Baptista, and Paul Walsh.

Open Knowledge International (OKI) recently coordinated a two-day work sprint as a way to touch base with partners in the Open Data for Tax Justice project. Our initial writeup of the sprint can be found here.

Phase I of the project ended in February 2017 with the publication of What Do They Pay?, a white paper that outlines the need for a public database on the tax contributions and economic activities of multinational companies.

The overarching goal of the sprint was to start some work towards such a database, by replicating data collection processes we’ve used in other projects, and to provide a space for domain expert partners to potentially use this data for some exploratory investigative work. We had limited time, a limited budget, and we are pleased with the discussions and ideas that came out of the sprint.

One attendee, Tim Davies, criticised the approach we took in the technical stream of the sprint. The problem with the criticism is the extrapolation of one stream of activity during a two-day event to posit an entire approach to a project. We think exploration and prototyping should be part of any healthy project, and that is exactly what we did with our technical work in the two-day sprint.

Reflecting on the discussion presents a good opportunity here to look more generally at how we, as an organisation, bring technical capacity to projects such as Open Data for Tax Justice. Of course, we often bring much more than technical capacity to a project, and Open Data for Tax Justice is no different in that regard, being mostly a research project to date.

In particular, we’ll take a look at the technical approach we used for the two-day sprint. While this is not the only approach towards technical projects we employ at OKI, it has proven useful on projects driven by the creation of new databases.

An approach

Almost all projects that OKI either leads on, or participates in, have multiple partners. OKI generally participates in one of three capacities (sometimes, all three):

  • Technical design and implementation of open data platforms and apps.
  • Research and thought leadership on openness and data.
  • Dissemination and facilitating participation, often by bringing the “open data community” to interact with domain specific actors.

Only the first capacity is strictly technical, but each capacity does, more often than not, touch on technical issues around open data.

Some projects have an important component around the creation of new databases targeting a particular domain. Open Data for Tax Justice is one such project, as are OpenTrials, and the Subsidy Stories project, which itself is a part of OpenSpending.

While most projects have partners, usually domain experts, it does not mean that collaboration is consistent or equally distributed over the project life cycle. There are many reasons for this to be the case, such as the strengths and weaknesses of our team and those of our partners, priorities identified in the field, and, of course, project scope and funding.

With this as the backdrop for projects we engage in generally, we’ll focus for the rest of this post on aspects when we bring technical capacity to a project. As a team (the Product Team at OKI), we are currently iterating on an approach in such projects, based on the following concepts:

  • Replication and reuse
  • Data provenance and reproducibility
  • Centralise data, decentralise views
  • Data wrangling before data standards

While not applicable to all projects, we’ve found this approach useful when contributing to projects that involve building a database to, ultimately, unlock the potential to use data towards social change.

Replication and reuse

We highly value the replication of processes and the reuse of tooling across projects. Replication and reuse enables us to reduce technical costs, focus more on the domain at hand, and share knowledge on common patterns across open data projects. In terms of technical capacity, the Product Team is becoming quite effective at this, with a strong body of processes and tooling ready for use.

This also means that each project enables us to iterate on such processes and tooling, integrating new learnings. Many of these learnings come from interactions with partners and users, and others come from working with data.

In the recent Open Data for Tax Justice sprint, we invited various partners to share experiences working in this field and try a prototype we built to extract data from country-by-country reports to a central database. It was developed in about a week, thanks to the reuse of processes and tools from other projects and contexts.

When our partners started looking into this database, they had questions that could only be answered by looking back to the original reports. They needed to check the footnotes and other context around the data, which weren’t available in the database yet. We’ve encountered similar use cases in both and OpenTrials, so we can build upon these experiences to iterate towards a reusable solution for the Open Data for Tax Justice project.

By doing this enough times in different contexts, we’re able to solve common issues quickly, freeing more time to focus on the unique challenges each project brings.

Data provenance and reproducibility of views on data are absolutely essential to building databases with a long and useful futureData provenance and reproducibility

We think that data provenance, and reproducibility of views on data, is absolutely essential to building databases with a long and useful future.

What exactly is data provenance? A useful definition from wikipedia is “… (d)ata provenance documents the inputs, entities, systems, and processes that influence data of interest, in effect providing a historical record of the data and its origins”. Depending on the way provenance is implemented in a project, it can also be a powerful tool for reproducibility of the data.

Most work around open data at present does not consider data provenance and reproducibility as an essential aspect of working with open data. We think this is to the detriment of the ecosystem’s broader goals of seeing open data drive social change: the credible use of data from projects with no provenance or reproducibility built in to the creation of databases is significantly diminished in our “post truth” era.

Our current approach builds data provenance and reproducibility right into the heart of building a database. There is a clear, documented record of every action performed on data, from the extraction of source data, through to normalisation processes, and right to the creation of records in a database. The connection between source data and processed data is not lost, and, importantly, the entire data pipeline can be reproduced by others.

We acknowledge that a clear constraint of this approach, in its current form, is that it is necessarily more technical than, say, ad hoc extraction and manipulation with spreadsheets and other consumer tools used in manual data extraction processes. However, as such approaches make data provenance and reproducibility harder, because there is no history of the changes made, or where the data comes from, we are willing to accept this more technical approach and iterate on ways to reduce technical barriers.

We hope to see more actors in the open data ecosystem integrating provenance and reproducibility right into their data work. Without doing so, we greatly reduce the ability for open data to be used in an investigative capacity, and likewise, we diminish the possibility of using the outputs of open data projects in the wider establishment of facts about the world. Recent work on beneficial ownership data takes a step in this direction, leveraging the PROV-DM standard to declare data provenance facts.

Centralise data, decentralise views

In OpenSpending, OpenTrials, and our initial exploratory work on Open Data for Tax Justice, there is an overarching theme to how we have approached data work, user stories and use cases, and co-design with domain experts: “centralise data, decentralise views”.

Building a central database for open data in a given domain affords ways of interacting with such data that are extremely difficult, or impossible, by actively choosing to decentralise such data. Centralised databases make investigative work that uses the data easier, and allows for the discovery, for example, of patterns across entities and time that can be very hard to discover if data is decentralised.

Additionally, by having in place a strong approach to data provenance and reproducibility, the complete replication of a centralised database is relatively easily done, and very much encouraged. This somewhat mitigates a major concern with centralised databases, being that they imply some type of “vendor lock-in”.

Views on data are better when decentralised. By “views on data” we refer to visualisations, apps, websites – any user-facing presentation of data. While having data centralised potentially enables richer views, data almost always needs to be presented with additional context, localised, framed in a particular narrative, or otherwise presented in unique ways that will never be best served from a central point.

Further, decentralised usage of data provides a feedback mechanism for iteration on the central database. For example, providing commonly used contextual data, establishing clear use cases for enrichment and reconciliation of measures and dimensions in the data, and so on.

Data wrangling before data standards

As a team, we are interested in, engage with, and also author, open data standards. However, we are very wary of efforts to establish a data standard before working with large amounts of data that such a standard is supposed to represent.

Data standards that are developed too early are bound to make untested assumptions about the world they seek to formalise (the data itself). There is a dilemma here of describing the world “as it is”, or, “as we would like it to be”. No doubt, a “standards first” approach is valid in some situations. Often, it seems, in the realm of policy. We do not consider such an approach flawed, but rather, one with its own pros and cons.

We prefer to work with data, right from extraction and processing, through to user interaction, before working towards public standards, specifications, or any other type of formalisation of the data for a given domain.

Our process generally follows this pattern:

  • Get to know available data and establish (with domain experts) initial use cases.
  • Attempt to map what we do not know (e.g.: data that is not yet publicly accessible), as this clearly impacts both usage of the data, and formalisation of a standard.
  • Start data work by prescribing the absolute minimum data specification to use the data (i.e.: meet some or all of the identified use cases).
  • Implement data infrastructure that makes it simple to ingest large amounts of data, and also to keep the data specification reactive to change.
  • Integrate data from a wide variety of sources, and, with partners and users, work on ways to improve participation / contribution of data.
  • Repeat the above steps towards a fairly stable specification for the data.
  • Consider extracting this specification into a data standard.

Throughout this entire process, there is a constant feedback loop with domain expert partners, as well as a range of users interested in the data.


We want to be very clear that we do not think that the above approach is the only way to work towards a database in a data-driven project.

Design (project design, technical design, interactive design, and so on) emerges from context. Design is also a sequence of choices, and each choice has an opportunity cost based on various constraints that are present in any activity.

In projects we engage in around open databases, technology is a means to other, social ends. Collaboration around data is generally facilitated by technology, but we do not think the technological basis for this collaboration should be limited to existing consumer-facing tools, especially if such tools have hidden costs on the path to other important goals, like data provenance and reproducibility. Better tools and processes for collaboration will only emerge over time if we allow exploration and experimentation.

We think it is important to understand general approaches to working with open data, and how they may manifest within a single project, or across a range of projects. Project work is not static, and definitely not reducible to snapshots of activity within a wider project life cycle.

Certain approaches emphasise different ends. We’ve tried above to highlight some pros and cons of our approach, especially around data provenance and reproducibility, and data standards.

In closing, we’d like to invite others interested in approaches to building open databases to engage in a broader discussion around these themes, as well as a discussion around short term and long term goals of such projects. From our perspective, we think there could be a great deal of value for the ecosystem around open data generally – CSOs, NGOs, governments, domain experts, funders – via a proactive discussion or series of posts with a multitude of voices. Join the discussion here if this is of interest to you.

2 thoughts on “An approach to building open databases”

  1. A few years ago we started to include in the scripts used to download the data from the official source and to process it in the source repository, as a way to improve reproducibility and making it easier to keep it up-to-date with the official source.

    It’s nice to see other OKI projects go above and beyond that to even more fully document data provenance, a very important aspect of the data supply and usage chain that so often gets neglected.

    The proposed approach seems to be a pretty sensible to deal with the problem.

Leave a Reply

Your email address will not be published. Required fields are marked *