Facilitating data validation and reuse for a scientific community

eLife supports the sharing of research data at the time of publication. We worked with eLife to set up measures to assess the data quality, so that the value of this shared data is better understood by the research community. We also established a way to identify issues to inform a strategy for improving data quality and reuse by supporting researchers to prepare data in machine-friendly ways.

Opportunity

Data sharing is an important cornerstone in the movement towards more reproducible science. It provides a means to validate assertions made, which is why many journals and funders require research data to be shared publicly and appropriately, within a reasonable timeframe, following a research project.

Open research data is an important asset in the record of original research, and its reuse in different contexts helps to make the research enterprise more efficient. Sharing research data is fairly common; however, poor data quality and a lack of documentation are significant factors that prevent shared data from being reused.

Researchers can spend a lot of time cleaning, restructuring, and comparing multiple datasets prior to publication: this is a barrier to making that data available. There is an opportunity to improve the reusability of research data in a way that requires minimal effort on the researcher’s part.

How we helped

At eLife, authors are encouraged to deposit their data in an external repository and to cite the datasets in their articles. Where this is not possible or suitable, they can publish source data as supplements to the articles themselves. The data files are then stored in the eLife data repository and made available through download links within the articles.

Informed by work building and deploying the CKAN open-source data portal platform, and based on learning about various data publication workflows, the Frictionless Data team at OKF has developed solutions to remove the barriers in obtaining, sharing and validating data. These help people truly benefit from the wealth of data being opened up every day.

One such solution is Good Tables, a library and web service developed to support the validation of tabular datasets, both structurally and against a published schema.
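To make the two kinds of check concrete, here is a minimal sketch using only the Python standard library. It is not the Good Tables implementation or its API: the function name `validate_table`, the schema format (column name mapped to a Python type) and the error codes are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical schema for this sketch: column name -> type each value must cast to.
SCHEMA = {"sample_id": str, "replicate": int, "measurement": float}


def validate_table(text, schema):
    """Return a list of (row_number, code, message) tuples for a CSV string."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return [(0, "empty-table", "no header row found")]
    header = rows[0]
    errors = []

    # Structural checks on the header: names must be non-empty and unique.
    seen = set()
    for position, name in enumerate(header, start=1):
        if not name.strip():
            errors.append((1, "blank-header", f"column {position} has no name"))
        elif name in seen:
            errors.append((1, "duplicate-header", f"column {name!r} is repeated"))
        seen.add(name)

    # Schema check on the header: every column the schema names must be present.
    for name in schema:
        if name not in header:
            errors.append((1, "missing-column", f"column {name!r} not found"))

    for row_number, row in enumerate(rows[1:], start=2):
        # Structural check: each row must have as many cells as the header.
        if len(row) != len(header):
            errors.append(
                (row_number, "ragged-row",
                 f"expected {len(header)} cells, found {len(row)}")
            )
            continue
        # Schema check: each non-empty value must cast to the declared type.
        for name, value in zip(header, row):
            cast = schema.get(name)
            if cast is None or value == "":
                continue
            try:
                cast(value)
            except ValueError:
                errors.append(
                    (row_number, "type-error",
                     f"{name}={value!r} is not a valid {cast.__name__}")
                )
    return errors
```

Calling `validate_table(csv_text, SCHEMA)` on a clean file returns an empty list; each problem otherwise comes back with the row it occurred on, which is the kind of information a human-readable report can be built from.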

Good Tables was used to evaluate the quality of source data shared directly through eLife, and through this tool we were able to produce a report highlighting common errors. This allowed us to understand the current state of eLife-published data, and opened the possibility of doing more exciting things with it, such as more comprehensive tests and visualisations. A detailed walkthrough of the processes we went through can be found here.

Results

The resulting analysis gave eLife overall confidence in the quality of their data and showed where improvement was needed.

A key finding was that researchers tend to present data with a view to a human visually inspecting it: for example, cells were highlighted in colours or used to visually separate different groups of data. We found that the reusability of datasets shared with eLife could be improved by publishing data in machine-friendly ways, where this is appropriate. This means that efforts can be made to educate researchers and help them to prepare their data with this in mind, for example through updated documentation and publication advice.

Because certain types of errors are so common, particularly with Excel files, we have introduced default ‘ignore blank rows’ and ‘ignore duplicate rows’ options in our standalone validator, goodtables.io, to help bring more complex issues to the surface. This allowed eLife to focus attention on other errors which were less trivial to resolve.
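The effect of those two options can be sketched as follows. This is an illustrative stand-in written with the standard library, not the goodtables.io code: the function `check_rows` and its flag names `ignore_blank_rows` and `ignore_duplicate_rows` are hypothetical, and the error labels are simply the ones chosen for this example.

```python
import csv
import io


def check_rows(text, ignore_blank_rows=True, ignore_duplicate_rows=True):
    """Return (row_number, code) pairs for a CSV string, honouring the ignore flags."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, [])
    errors = []
    seen = set()
    for row_number, row in enumerate(reader, start=2):
        # A row counts as blank when it has no cells or only whitespace cells.
        if not any(cell.strip() for cell in row):
            if not ignore_blank_rows:
                errors.append((row_number, "blank-row"))
            continue
        # A row counts as a duplicate when an identical row appeared earlier.
        key = tuple(row)
        if key in seen:
            if not ignore_duplicate_rows:
                errors.append((row_number, "duplicate-row"))
            continue
        seen.add(key)
        # Remaining rows are still checked for the less trivial problems.
        if len(row) != len(header):
            errors.append((row_number, "ragged-row"))
    return errors


sample = "gene,count\nabc,3\n\nabc,3\nxyz\n"
default_report = check_rows(sample)
full_report = check_rows(sample, ignore_blank_rows=False, ignore_duplicate_rows=False)
```

With the defaults, only the short row on line 5 is reported; switching both flags off adds the blank row on line 3 and the repeated row on line 4, which shows how the common, low-value errors can crowd out the one that needs attention.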

We also found a few issues with the data itself, beyond presentation preferences, which were more easily resolved. More than three quarters of the articles analysed contained at least one ‘invalid’ file.

Errors related to difficulties retrieving and opening data files were much less frequent. The ability of our validation process to identify this problem was nevertheless of great benefit to eLife, as it allows the publishers to check continued data availability after publication.
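A check of this kind can be approximated in a few lines. The sketch below assumes the data files are reachable by URL and uses only Python's `urllib`; the function name `check_availability` and the shape of its result are hypothetical, and it says nothing about how the validation process used for eLife performs this step.

```python
import urllib.error
import urllib.request


def check_availability(urls, timeout=10):
    """Map each URL to None if it can be opened and read, else to an error message."""
    results = {}
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                # Read a single byte to confirm the body is actually readable.
                response.read(1)
            results[url] = None
        except (urllib.error.URLError, OSError, ValueError) as exc:
            # URLError covers network and HTTP failures; ValueError covers malformed URLs.
            results[url] = f"{type(exc).__name__}: {exc}"
    return results
```

Running this periodically over the download links in published articles would flag files that have become unreachable, without needing to parse the files themselves.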

Once published, this properly structured tabular data is easier to open in different user tools, and schemas describing the data allow easier reuse, which makes results easier to reproduce. Reproducibility is a key benefit and is critical in rigorous scientific testing: clean, well-described data makes it easier to compare results and to check their validity.

Good Tables makes data-quality issues visible by providing a slick user experience and reports that can be understood by anyone working with data. The reports produced will be a vital educational tool in helping eLife communicate the importance of data quality to researchers publishing to the platform and in encouraging them to address quality issues in their own data. Overall this will result in better quality and more useful data on the platform, and will help to highlight the importance of published open data in the research ecosystem.