The Frictionless Data project is about making it effortless to transport high quality data among different tools and platforms for further analysis. We are doing this by developing a set of software, specifications, and best practices for publishing data. The heart of Frictionless Data is the Data Package specification, a containerization format for any kind of data based on existing practices for publishing open-source software.
Through its pilots, Frictionless Data is working directly with organisations to solve real problems managing data. The University of Pittsburgh’s Center for Urban and Social Research is one such organisation.
One of the main goals of the Frictionless Data project is to help improve data quality by providing easy to integrate libraries and services for data validation. We have integrated data validation seamlessly with different backends like GitHub and Amazon S3 via the online service goodtables.io, but we also wanted to explore closer integrations with other platforms.
An obvious choice for that are Open Data portals. They are still one of the main forms of dissemination of Open Data, especially for governments and other organizations. They provide a single entry point to data relating to a particular region or thematic area and provide users with tools to discover and access different datasets. On the backend, publishers also have tools available for the validation and publication of datasets.
Data quality varies widely across different portals, reflecting the publication processes and requirements of the hosting organizations. In general, it is difficult for users to assess the quality of the data and there is a lack of descriptors for the actual data fields. At the publisher level, while strong emphasis has been put in metadata standards and interoperability, publishers don’t generally have the same help or guidance when dealing with data quality or description.
We believe that data quality in Open Data portals can have a central place on both these fronts, user-centric and publisher-centric, and we started this pilot to showcase a possible implementation.
To field test our implementation we chose the Western Pennsylvania Regional Data Center (WPRDC), managed by the University of Pittsburgh Center for Urban and Social Research. WPRDC is a great example of a well managed Open Data portal, where datasets are actively maintained and the portal itself is just one component of a wider Open Data strategy. It also provides a good variety of publishers, including public sector agencies, academic institutions, and nonprofit organizations.
The portal software that we are using for this pilot is CKAN, the world leading open source software for Open Data portals (source). Open Knowledge International initially fostered the CKAN project and is now a member of the CKAN Association.
We created ckanext-validation, a CKAN extension that provides a low level API and readily available features for data validation and reporting that can be added to any CKAN instance. This is powered by goodtables, a library developed by Open Knowledge International to support the validation of tabular datasets.
The ckanext-validation extension allows users to perform data validation against any tabular resource, such as CSV or Excel files. This generates a report that is stored against a particular resource, describing issues found with the data, both at the structural level, such as missing headers and blank rows, and at the data schema level, such as wrong data types and out of range values.
Read the technical details about this pilot study, our learnings and areas we have identified for further work in the coming days here on the Frictionless Data website.