Open data quality – the next shift in open data?

This blog post is part of our Global Open Data Index blog series. It is a call to recalibrate our attention to the many different elements contributing to the ‘good quality’ of open data, the trade-offs between them and how they support data usability (see some vital work on this by the World Wide Web Consortium). Focusing on these elements could help support governments to publish data that can be easily used. The blog post was jointly written by Danny Lämmerhirt and Mor Rubinstein.

Some years ago, open data was heralded as a way to unlock information for the public that would otherwise remain closed. In the pre-digital age, information was locked away, and an array of mechanisms was necessary to bridge the knowledge gap between institutions and people. So when the open data movement demanded “Openness By Default”, many data publishers followed the call by releasing vast amounts of data in its existing form to bridge that gap.

To date, it seems that opening this data has not reduced but rather shifted and multiplied the barriers to the use of data, as Open Knowledge International’s research around the Global Open Data Index (GODI) 2016/17 shows. Together with data experts and a network of volunteers, our team searched, accessed, and verified more than 1400 government datasets around the world.

We found that data is often stored in many different places on the web, sometimes split across documents, or hidden many pages deep on a website. Often data comes in various access modalities. It can be presented in various forms and file formats, sometimes using uncommon signs or codes that are in the worst case only understandable to their producer.

As the Open Data Handbook states, these emerging open data infrastructures resemble the myth of the ‘Tower of Babel’: more information is produced, but it is encoded in different languages and forms, preventing data publishers and their publics from communicating with one another. What makes data usable under these circumstances? How can we close the information chain loop? The short answer: by providing ‘good quality’ open data.

Understanding data quality – from quality to qualities

The open data community needs to shift focus from mass data publication towards an understanding of good data quality. Yet, there is no shared definition of what constitutes ‘good’ data quality.

Research shows that there are many different interpretations and ways of measuring data quality. They include data interpretability, data accuracy, timeliness of publication, reliability, trustworthiness, accessibility, discoverability, processability, and completeness. Since people use data for different purposes, certain data qualities matter more to one user group than to others. Some of these areas are covered by the Open Data Charter, but the Charter does not explicitly name them as ‘qualities’ which add up to high quality.

Current quality indicators are not complete – and miss the opportunity to highlight quality trade-offs

Also, existing indicators assess data quality very differently, potentially framing our language and thinking about data quality in opposing ways. Examples are:

Some indicators focus on the content of data portals (number of published datasets) or access to data. A small fraction focus on datasets, their content, structure, understandability, or processability. Even GODI and the Open Data Barometer from the World Wide Web Foundation do not share a common definition of data quality.

Arguably, the diversity of existing quality indicators prevents a targeted and strategic approach to improving data quality.

At the moment GODI sets out the following indicators for measuring data quality:

Completeness of dataset content
Accessibility (access-controlled or public access?)
Findability of data
Processability (machine-readability and amount of effort needed to use data)
Timely publication

This leaves out other qualities. We could ask if data is actually understandable by people. For example, is there a description of what each part of the data content means (metadata)?

Improving quality by improving the way data is produced

Many data quality metrics are (rightfully so) user-focussed. However, it is critical that governments, as data producers, better understand, monitor and improve the inherent quality of the data they produce. Measuring data quality can incentivise governments to design data for impact by raising awareness of the quality issues that would otherwise make data files practically impossible to use.

At Open Knowledge International, we target data producers and the quality issues of data files mostly via the Frictionless Data project. Notable projects include the Data Quality Spec, which defines some essential quality aspects for tabular data files. GoodTables provides structural and schema validation of government data, and the Data Quality Dashboard enables open data stakeholders to see data quality metrics for entire data collections “at a glance”, including the number of errors in a data file. These tools help to develop a more systematic assessment of the technical processability and usability of data.
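
To give a flavour of what a structural check on a tabular file can look like, here is a minimal sketch in Python. It is not the GoodTables API and does not follow the Data Quality Spec; the function name `structural_issues` and the three checks it runs (blank or duplicate headers, blank rows, rows with the wrong number of cells) are assumptions chosen purely for illustration, using only the standard library.

```python
import csv
import io


def structural_issues(csv_text):
    """Return a list of (row_number, message) structural problems.

    Hypothetical helper for illustration only; real validators
    cover many more cases (encodings, schemas, types, ...).
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return [(0, "file is empty")]

    issues = []
    header = rows[0]

    # Header checks: every column needs a unique, non-blank name.
    seen = set()
    for name in header:
        label = name.strip()
        if not label:
            issues.append((1, "blank header"))
        elif label in seen:
            issues.append((1, f"duplicate header: {label}"))
        seen.add(label)

    # Row checks: no blank rows, and every row as wide as the header.
    for number, row in enumerate(rows[1:], start=2):
        if not any(cell.strip() for cell in row):
            issues.append((number, "blank row"))
        elif len(row) != len(header):
            issues.append(
                (number, f"expected {len(header)} cells, found {len(row)}")
            )
    return issues


# Made-up sample data with three deliberate problems.
sample = "id,name,name\n1,Berlin,DE\n\n2,Paris\n"
for issue in structural_issues(sample):
    print(issue)
```

Counting the returned issues per file is one simple way to arrive at an "errors per data file" figure of the kind a dashboard might show, though what counts as an error is itself a design decision.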

A call for joint work towards better data quality

We are aware that good data quality requires working on solutions together. Therefore, we would love to hear your feedback. What are your experiences with open data quality? Which quality issues hinder you from using open data? How do you define these data qualities? What could the GODI team improve? Please let us know by joining the conversation about GODI on our forum.

Written by

OKFN

The official voice of the Open Knowledge Foundation.

6 Comments

Audrey Lobo-Pulo says:

July 15, 2017 at 04:15

My thoughts around data quality and how that may best be addressed through AI may be found at https://medium.com/@global_digital/open-data-the-promises-the-disillusion-and-the-panacea-part-2-the-disillusion-525ee949cb72

Gérard Chenais says:

May 31, 2017 at 17:55

As a die-hard retired official statistician with an entire professional life spent in developing countries’ NSOs, I couldn’t agree more that data quality is the next shift in open data, and in official statistics. You could refer to the UN National Quality Assurance Framework (for official statistics) https://unstats.un.org/unsd/dnss/QualityNQAF/nqaf.aspx to find some usable good practices. The key concept (re ISO) here is fit for purpose. As you say, improving quality requires improving production so that output is fit for purpose. As ISO observes, the quality of some production outputs like data can’t be assessed simply by examining the resulting data and controlling the relevant characteristics, and often it is impossible to reprocess expecting to get better data.
Opening government data is often simply assuming that they can have some uses other than the one the data were created for in the first place; and this is the key to open data quality, as data can fit government use and not necessarily the use by some unidentified users.
Some statisticians are using the concept of sufficient quality when any further refinement in accuracy or other characteristics wouldn’t change the derived decision. https://www.cbs.nl/nl-nl/achtergrond/2013/21/quality-reporting-and-sufficient-quality
Then improving the way data are processed means that opening should be part of the specifications at the design phase. Metadata is necessary to inform on the usability of the released data.
Regarding official statistics, a subset of government data, the purpose is to be used for any official decision making, law, regulation, policies, reporting, and for accountability.

Sincèrement,

Gérard Chenais

- Danny Lämmerhirt says:
  
  June 1, 2017 at 20:35
  
  Dear Gérard,
  
  Thanks so much for sharing this valuable insight. I couldn’t agree more, and am delighted to see that this shift is happening in NSOs (I wasn’t aware of this). Just to respond to this: Last year we explored exactly the questions you outlined. Your comments resonate very well – that we need to focus on fitness-for-purpose, and the “multi-valence” of data (the many different values ascribed to data).
  
  Just some background (and I would love to hear your opinion): Last year we worked on research for the GP4SDD. The goal was to understand how citizen-generated data can make “all voices count” within the SDGs and the data revolution. “If we want to make all voices count, how can citizens and civil society support this through data practices?”
  
  One key finding: citizen-generated data is often rejected because of “poor” quality. This, however, is an extreme simplification for two reasons: 1) It assumes that there is an essential “data quality”. 2) It neglects which discipline this critique comes from. The critique seemed to be driven by NSOs and their concerns about producing reliable, accurate, large-scale data (since they deal with household surveys, s.o.)
  
  When doing in-depth studies about the impact of citizen-generated data, we found much more complex explanations of why citizen-generated data works. They pretty much boil down to the fact that data needs to be “good enough”. Key factors are perceived usefulness, shared acceptance, and an appreciation for the value of data (politics of data). And this all depends on which organisations use the data and on their priorities. For more information, I recommend reading our reports here: (http://civicus.org/thedatashift/learning-zone-2/research/)
  
  The challenge is to understand where the threshold of “just good enough” lies. This is exacerbated by the fact that even departments within one organisation value data differently (an accountant needs precise disaggregated data, while a top-level manager needs three concise charts to make decisions). User-centric data design is key, and I would love to hear your thoughts and experiences on this!
  
  There is at least some evidence that CSOs understand how to use data qualities in the right way, opening up new spaces for civic participation (see this report: https://blog.okfn.org/2017/02/09/data-and-the-city-new-report-on-how-public-data-is-fostering-civic-engagement-in-urban-regions/)
  
  From my POV, data ethnography and organisational ethnography are good ways forward. Oversimplified ideas of data quality (accuracy, etc.) that neglect fitness-for-purpose only help to design information that never starts to travel across organisations.
  
  With best regards
  Danny
  
  - Gérard Chenais says:
    
    June 15, 2017 at 14:54
    
    Dear Danny,
    I am back on this discussion. I can talk about what I know best, the work of official statisticians, in the hope that it sheds some light on quality for open data.
    To start with, the general purpose and the requirements when producing official statistical data (the first fundamental principle of official statistics):
    . serving the Government, the economy and the public
    . about the economic, demographic, social and environmental situation
    . official statistics that meet the test of practical utility are to be compiled and made available on an impartial basis by official statistical agencies to honour citizens’ entitlement to public information
    
    The meaning and use of official statistics are defined ex ante, before the data is collected and processed. Over the years, statisticians have capitalised on their experience of providing data to government, the economy and the public, and have developed conceptual frameworks and best practices (international recommendations); see https://unstats.un.org/unsd/methods.htm. Practical utility means that it is not about theory or research, and impartial means that no one may be privileged; in other words, official statistics are open by requirement.
    
    Some characteristics are derived from this first principle.
    . Relevant (the degree to which statistics meet current and potential users’ needs)
    . Accurate (the degree to which the information correctly describes the phenomena)
    . Comparable (Comparability aims at measuring the impact of differences in applied statistical concepts and definitions on the comparison of statistics between geographical areas, non-geographical dimensions, or over time).
    
    Edmond Malinvaud (https://en.wikipedia.org/wiki/Edmond_Malinvaud), a French statistician, once said: “The statistician’s mission is to supply concepts and measurements that provide a relevant and rigorous answer to concerns” (“la mission du statisticien est de fournir des concepts et des observations qui apportent une réponse pertinente et rigoureuse à des préoccupations”).
    The concerns come first, then the concepts to allow proper interpretation of data and then data processing and simultaneous dissemination to all and anyone.
    When having to prioritise statistical programmes, the task for statisticians is to identify where they can provide independent, evidence-based information, as well as statistical standards and infrastructure, for use by governments and the broader community (UNECE Generic Activity Model for Statistical Organisations https://statswiki.unece.org/display/GAMSO/GAMSO+v1.1).
    
    Statisticians collect data through sample surveys to get results on a wider population with sufficient accuracy and from a much smaller number of respondents than the whole population; it is simply a cost-effective solution (with limitations) for data collection.
    
    One can find some information on the nature of government data (administrative data) that are not initially produced to be official statistics: Using Administrative and Secondary Sources for Official Statistics, http://www.unece.org/fileadmin/DAM/stats/publications/Using_Administrative_Sources_Final_for_web.pdf
    Très sincèrement,
    Gérard
Paul Malyon says:

May 31, 2017 at 11:01

Hi there. We (Experian) did some work with the ODI last year on open data quality and assessed some of the common challenges that come from publishing data on a shoestring budget or from a collection of sources. The report and data are openly available from the ODI website.
We came up with some ideas on how to improve the publication of data to give publishers more time & resource to focus on gradual quality improvements. I’d be keen to speak to anyone on how to take this forward.
