The Global Open Data Index (GODI) is one of the core projects of Open Knowledge International. Originally launched in 2013, it has quickly grown and now measures open data publication in 122 countries. GODI is a community tool, and throughout the years the open data community has taken an active role in shaping it by reporting problems, discussing issues on GitHub and in our forums, and sharing success stories. We welcome this feedback with open arms, and in 2016 it has proved invaluable in helping us produce an updated set of survey questions.
In this blog post we are sharing the first draft of the revised GODI survey. Our main objective in updating the survey this year has been to improve the clarity of the questions and to provide better guidance, so that contributors understand which datasets they should be evaluating and what they should be looking for in those datasets. Furthermore, we hope the updated survey will help us highlight some of the tangible challenges to data publication and reuse by paying closer attention to the contents of datasets.
Our aim is to adopt this new survey structure for future editions of GODI as well as the Local Open Data Index, and we would love to hear your feedback! We are aware that some changes might affect comparability with older editions of GODI, and it’s for this reason that your feedback is critical. We are especially curious to hear the opinion of the Local Open Data Index community. What do you find positive? Where do you see issues with your local index? Where could we improve?
In the following, we present the ideas behind the new survey. You will find a detailed comparison of the old and new questions in this table.
A brief overview of the proposed changes:
- Better measure and document how easy or difficult it is to find government data online
- Enhance our understanding of the data we measure
- Improve the robustness of our analysis
**Better measure and document how easy or difficult it is to find government data online**
Even if governments publish data, potential users who cannot find it will not be able to use it. In the revised survey, we ask submitters to document where they found a given dataset as well as how much time they needed to find it. We recognise this to be an imperfect measure, as different users are likely to vary in their capacity to find government data online. However, we hope that this question will help us extract critical information about usability challenges that are not easily captured by a legal and technical analysis of a given dataset, even though it would be difficult to quantify the results and therefore use them in the scoring.
**Enhance our understanding of the data we measure**
It is common for governments to publish datasets in separate files and places. Contributors might find department spending data scattered across different department websites or, even when made available in one place such as a portal, split up into multiple files. Some portion of this data might be openly licensed, another machine-readable, while the rest is in PDFs. Sometimes non-machine-readable data is available without charge, while machine-readable files are available for a fee. In the past, this has proven to be an enormous challenge for the Index, as submitters are forced to decide what data should be evaluated (see this discussion in our forum).
The inconsistent publication of government data leads to confusion among our submitters and negatively impacts the reliability of the Index as an assessment tool. Furthermore, we think it is safe to say that if open data experts are struggling to find or evaluate datasets, potential users will face similar challenges; the inconsistent and sporadic data publication policies of governments are therefore likely to affect data uptake and reuse. In order to ensure that we are comparing like with like, GODI assesses the openness of clearly defined datasets. These dataset definitions are what we have determined, in collaboration with experts in the field, to be essential government data – data that contains crucial information for society at large. If a submitter only finds parts of this information in a file, or scattered across different files, then rather than assessing the openness of key datasets, we end up assessing a partial snapshot that is unlikely to be representative. There is more at stake than our ability to assess the “right” datasets – incoherent data publication significantly limits the capacity of civil society to tap into the full value of government data.
**Improve the robustness of our analysis**
In the updated survey, we will determine whether datasets are available from one URL by asking “Are all the data downloadable from one URL at once?” (formerly “Available in bulk?”). To respond in the affirmative, submitters will have to demonstrate that all required data characteristics are made available in one file. If the data cannot be downloaded from one URL, or if submitters find multiple files at one URL, they will be asked to select one dataset, from one URL, which meets the greatest number of requirements and is available free of charge. Submitters will document why they have chosen this dataset and data source, both to help reviewers understand the rationale for choosing a given dataset and to aid in verifying sources.
The subsequent question, “Which of these characteristics are included in the downloadable file?”, will help us verify that the submitted dataset does indeed contain all the requisite characteristics. Submitters will assess the dataset by selecting each individual characteristic contained within it. Not only will this prompt contributors to verify that every established characteristic is met; it will also give us a better understanding of which components are commonly missing when governments publish data, providing civil society with a firmer foundation to advocate for publishing the crucial data. In our results we will flag more explicitly which elements are missing, and we will declare fully open only those datasets that match all of our dataset requirements.
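To make these two rules more concrete, here is a minimal sketch in Python of how the selection and verification logic could work. Everything in it is illustrative: the characteristic names, the example candidate files, and the `pick_candidate` and `assess` helpers are hypothetical and not part of the actual GODI survey code.

```python
# Hypothetical sketch only: the characteristics, candidates, and helpers
# below are illustrative and not part of the real GODI codebase.

# Example requirements for a notional "government spending" dataset;
# the real dataset definitions are maintained by GODI.
REQUIRED_CHARACTERISTICS = {"transaction_amount", "department", "date", "supplier"}

def pick_candidate(candidates):
    """Select one file from one URL: prefer files available free of
    charge, then those meeting the most requirements."""
    return max(
        candidates,
        key=lambda c: (
            c["free_of_charge"],
            len(REQUIRED_CHARACTERISTICS & c["characteristics"]),
        ),
    )

def assess(candidate):
    """Flag missing characteristics; only a dataset matching all
    requirements is declared fully open."""
    missing = REQUIRED_CHARACTERISTICS - candidate["characteristics"]
    return {"missing": sorted(missing), "fully_open": not missing}

# Two files found for the same dataset: a free CSV missing one field,
# and a complete spreadsheet available only for a fee.
candidates = [
    {"url": "https://example.gov/spending.csv", "free_of_charge": True,
     "characteristics": {"transaction_amount", "department", "date"}},
    {"url": "https://example.gov/spending.xls", "free_of_charge": False,
     "characteristics": set(REQUIRED_CHARACTERISTICS)},
]

print(assess(pick_candidate(candidates)))
# -> {'missing': ['supplier'], 'fully_open': False}
```

The point of the sketch is the explicit `missing` list: rather than a simple yes/no score, the result documents exactly which elements governments leave out.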
This year, we are committed to improving the clarity of the survey questions:
- “Does the data exist?” – The first question in previous versions of the Index was often confusing for submitters and has been reformulated to ask: “Is the data published by government (or a third-party related to government)?” If the response is no, contributors will be asked to justify their response. For example, does the collection, and subsequent publication, of this data fall under the remit of a different level of government? Or perhaps the data is collected and published (or not) by a private company? There are a number of legal, social, technical and political reasons why the data we are assessing might simply not exist, and the aim of this question is to help open data activists advocate for coherent policies around data production and publication (see past issues with this question here and here).
- “Is data in digital form?” – The objective of this question was to cover cases where governments provided large datasets on DVDs, for example. However, users have commented that we should not ask about features that do not make data more open. Ultimately, we have concluded that if data is going to be usable by everyone, it should be online. We have therefore deleted this question.
- “Publicly Available?” – We merged “Publicly available?” with “Is the data available online?”. The reason is that we only want to reward data that is publicly accessible online without mandatory registration (see for instance discussions here and here).
- “Is the data machine-readable?” – There have been a number of illuminating discussions about what counts as a machine-readable format (see for example discussions here and here). We found that the question “Is the data machine-readable?” was overly technical. Now we simply ask users “In which file formats are the data?”. When submitters enter the format, our system automatically recognises whether the format is machine-readable and open (see the sketch after this list).
- “Openly licensed?” – Some people argued that the question “Openly licensed?” does not adequately take into account the fact that some government data are in the public domain and not under the protection of copyright. As such, we have expanded the question to “Is the data openly licensed/in the public domain?”. If data are not under the protection of copyright, they do not necessarily need to be openly licensed; however, a clear disclaimer must be provided informing users about their copyright status (which can be in the form of an open licence). This change is in line with the Open Definition 2.1 (see discussions here and here).
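As promised above, here is a minimal sketch of how automatic format recognition could work, assuming a simple lookup table keyed by format name. The table, its example classifications, and the `classify` helper are hypothetical illustrations, not the actual GODI implementation.

```python
# Hypothetical sketch of automatic format recognition. The table and
# its classifications are illustrative; GODI maintains its own list.
FORMAT_PROPERTIES = {
    # format: (machine_readable, open_format)
    "csv":  (True,  True),
    "json": (True,  True),
    "xml":  (True,  True),
    "xls":  (True,  False),  # machine-readable, but a proprietary format
    "pdf":  (False, True),   # openly specified, but hard for machines to parse
}

def classify(file_format):
    """Map a submitted format string to openness properties, defaulting
    to the most conservative answer for unknown formats."""
    machine_readable, open_format = FORMAT_PROPERTIES.get(
        file_format.strip().lower(), (False, False))
    return {"machine_readable": machine_readable, "open_format": open_format}

print(classify("CSV"))  # -> {'machine_readable': True, 'open_format': True}
print(classify("PDF"))  # -> {'machine_readable': False, 'open_format': True}
```

Keeping this classification in the system, rather than asking submitters to judge it, means the answer is applied consistently across all countries and submissions.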
We look forward to hearing your thoughts on the forum or in the comments on this post!
Danny Lämmerhirt works on the politics of data, sociology of quantification, metrics and policy, data ethnography, collaborative data, data governance, as well as data activism. You can follow his work on Twitter at @danlammerhirt. He was research coordinator at Open Knowledge Foundation.