This blog post was co-authored by Danny Lämmerhirt, Pierre Chrzanowski and Sander van der Waal (*author note at the bottom)
January 22 will mark a crucial moment for the future of open data in Europe. That day, the final trilogue between the European Commission, Parliament, and Council is scheduled to decide on the ratification of the updated PSI Directive. Among other things, the European institutions will decide what counts as ‘high value’ data. What essential information should be made available to the public, and how those data infrastructures should be funded and managed, are critical questions for the future of the EU.
As we will discuss below, there are many ways one might envision the collective ‘value’ of those data. This is a democratic question, and we should not settle for a vague, ill-defined proposal. We therefore propose to organise a public debate to collectively define what counts as high value data in Europe.
What does the PSI Directive say about high value datasets?
The European Commission provides several hints in the current revision of the PSI Directive on how it envisions high value datasets. They are determined by one of the following ‘value indicators’:
- The potential to generate significant social, economic, or environmental benefits;
- The potential to generate innovative services;
- The number of users, in particular SMEs;
- The revenues they may help generate;
- The data’s potential for being combined with other datasets;
- The expected impact on the competitive situation of public undertakings.
Given the strategic role of open data for Europe’s Digital Single Market, these indicators are not surprising. But as we discuss below, there are several challenges in defining them, and there are other ways of understanding the importance of data.
The annex of the PSI Directive also includes a preliminary list of high value datasets, drawing primarily on the key datasets defined by Open Knowledge International’s (OKI’s) Global Open Data Index, as well as the G8 Open Data Charter Technical Annex. See the proposed list in the table below.
List of categories and high value datasets:

| Category | Description |
| --- | --- |
| 1. Geospatial data | Postcodes, national and local maps (cadastral, topographic, marine, administrative boundaries). |
| 2. Earth observation and environment | Space and in situ data (monitoring of the weather and of the quality of land and water, seismicity, energy consumption, the energy performance of buildings, and emission levels). |
| 3. Meteorological data | Weather forecasts, rain, wind, and atmospheric pressure. |
| 4. Statistics | National, regional, and local statistical data with main demographic and economic indicators (gross domestic product, age, unemployment, income, education). |
| 5. Companies | Company and business registers (lists of registered companies, ownership and management data, registration identifiers). |
| 6. Transport data | Public transport timetables for all modes of transport, information on public works, and the state of the transport network, including traffic information. |
According to the proposal, regardless of who provides them, these datasets shall be available for free, machine-readable, accessible for download and, where appropriate, via APIs. The conditions for re-use shall be compatible with open standard licences.
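To illustrate what these conditions mean in practice, here is a minimal sketch in Python. The endpoint URL and column names are hypothetical, purely for illustration: the point is that a re-user should be able to fetch a high value dataset programmatically, free of charge and in a machine-readable format, without registration.

```python
# Minimal sketch: retrieving a machine-readable dataset over an open API.
# The endpoint URL and column names below are hypothetical; any PSI
# 'high value' dataset should be retrievable in a similar way, free of
# charge and without registration.
import csv
import io
import urllib.request

URL = "https://data.example.eu/api/companies/register.csv"  # hypothetical endpoint

with urllib.request.urlopen(URL) as response:
    reader = csv.DictReader(io.TextIOWrapper(response, encoding="utf-8"))
    for row in reader:
        print(row["company_id"], row["name"])  # assumed column names
```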
Towards a public debate on high value datasets at EU level
There have been attempts by EU Member States to define what constitutes high value data at national level, with varying results. In Denmark, ‘basic data’ has been defined as the five core types of information that public authorities use in their day-to-day case processing and should release. In France, the law for a Digital Republic aims to make available reference datasets that have the greatest economic and social impact. Estonia relies on the X-Road infrastructure to connect core public information systems, but most of the data remains restricted.
Now is the time for a shared, common definition of what constitutes high value datasets at EU level. And this implies agreement on how we should define them. However, as it stands, there are several issues with the value indicators that the European Commission proposes.
For example, how does one define the data’s potential for innovative services? How can revenue gains be confidently attributed to the use of open data? How does one assess and compare the social, economic, and environmental benefits of opening up data? Anyone designing these indicators must be cautious, as metrics comparing social, economic, and environmental benefits may come with methodological biases. Research has found, for example, that comparing economic and environmental benefits can unfairly favour data of economic value at the expense of fuzzier social benefits, since economic benefits are often more easily quantified and defined.
One way of debating high value datasets could be to discuss what data currently gets published by governments, and why. For instance, with its Global Open Data Index, Open Knowledge International has long advocated for the publication of disaggregated, transactional spending figures. Another example is OKI’s Open Data For Tax Justice initiative, which sought to shape the requirements for multinational companies to report their activities in each country (so-called ‘country-by-country reporting’) and to influence a standard for publicly accessible key data.
A public debate on high value data should also critically examine the European Commission’s considerations regarding the distortion of competition. What market dynamics are engendered by opening up data? To what extent do existing markets rely on scarce and closed information? Does closed data bring about market failure, as some argue (Zinnbauer 2018)? Could it otherwise hamper fair price mechanisms (for a discussion of these dynamics in open access publishing, see Lawson, Gray and Mauri 2015)? How would open data change existing market dynamics? Which actors claim that opening data would distort markets, and whose interests do they represent?
Lastly, the European Commission does not yet consider cases of government agencies generating revenue from selling particularly valuable data. The Dutch national company register has for a long time been such a case, as has the German Weather Service. Beyond considering competition, a public debate around high value data should take into account how marginal cost recovery regimes currently work.
What we want to achieve
For these reasons, we want to organise a public discussion to collectively define:
- i) What should count as a high value dataset, and based on what criteria;
- ii) What information high value datasets should include;
- iii) What the conditions for access and re-use should be.
The PSI Directive will set the baseline for open data policies across the EU. We are therefore at a critical moment to define what European societies value as key public information. What is at stake is not only a question of economic impact, but the question of how to democratise European institutions, and the role the public can play in determining what data should be opened.
How you can participate
- We will use the Open Knowledge forum as the main channel for coordination, exchange of information, and debate. To join the debate, please add your thoughts to this thread, or feel free to start a new discussion for specific topics.
- We are gathering proposals for high value datasets in this spreadsheet. Please feel free to use it as a discussion document where we can crowdsource alternative ways of valuing data.
- We use the PSI Directive Data Census to assess the openness of high value datasets.
We also welcome references to scientific papers, blog posts, and other material discussing the issue of high value datasets. Once we have gathered suggestions for high value datasets, we would like to assess how open they are. This will help provide European countries with a diagnosis of the openness of key data; the sketch below illustrates what such an assessment could look like.
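As an illustration only, here is a small Python sketch that scores a dataset against a handful of openness criteria similar in spirit to those used by the Global Open Data Index. The criteria names, the equal weighting, and the example assessment are our own simplifications for discussion, not the actual methodology of the PSI Directive Data Census.

```python
# Illustrative sketch: scoring a dataset against simple openness criteria.
# The criteria and equal weighting are a simplification for discussion,
# not the actual methodology of the PSI Directive Data Census.

CRITERIA = ["online", "free_of_charge", "machine_readable",
            "bulk_download", "open_licence", "up_to_date"]

def openness_score(dataset: dict) -> float:
    """Return the share of criteria a dataset meets, between 0 and 1."""
    return sum(bool(dataset.get(c)) for c in CRITERIA) / len(CRITERIA)

company_register = {  # hypothetical assessment of a national company register
    "online": True, "free_of_charge": False, "machine_readable": True,
    "bulk_download": False, "open_licence": False, "up_to_date": True,
}
print(f"openness: {openness_score(company_register):.0%}")  # -> openness: 50%
```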
Author note:
Danny Lämmerhirt is a senior researcher on open data, data governance, data commons, and metrics to improve open governance. He formerly worked with Open Knowledge International, where he led its research activities, including the methodology development of the Global Open Data Index 2016/17. His work focuses, among other things, on the role of metrics for open government and the effects metrics have on the way institutions work and make decisions. He has supervised and edited several pieces on this topic, including the Open Data Charter’s Measurement Guide. You can follow his work on Twitter at @danlammerhirt.
Pierre Chrzanowski is a Data Specialist with the World Bank Group and a co-founder of the Open Knowledge France local group. As part of his work, he developed the Open Data for Resilience Initiative (OpenDRI) Index, a tool to assess the openness of key datasets for disaster risk management projects. He also participated in the impact assessment prior to the new PSI Directive proposal and has contributed to the Global Open Data Index as well as the Web Foundation’s Open Data Barometer.
Sander van der Waal is Programme Lead for Fiscal Transparency at Open Knowledge International. He is also responsible for the team at Open Knowledge International that works in the areas of Research, Communications, and Community. Sander combines a background in computer science with philosophy and has a passion for ‘open’ in all its forms, ranging from open data to open access and open source software.