eLife: Facilitating data validation & reuse with goodtables

Open Knowledge International is working on the Frictionless Data project to remove the friction in working with data. We are doing this by developing a set of tools, standards, and best practices for publishing data. Through a series of pilots, we are working directly with organizations to solve real problems managing data. eLife is a non-profit organisation with a mission to help scientists accelerate discovery by operating a platform for research communication that encourages and recognises the most responsible behaviours in science. eLife publishes important research in all areas of the life and biomedical sciences. The research is selected and evaluated by working scientists and is made freely available to all readers. In this blog, Jo Barrat, Adria Mercader and Naomi Penfold share learnings from a pilot of Frictionless Data’s goodtables on data shared with eLife.

eLife is a non-profit organisation with a mission to help scientists accelerate discovery by operating a platform for research communication that encourages and recognises the most responsible behaviours in science. eLife publishes important research in all areas of the life and biomedical sciences. The research is selected and evaluated by working scientists and is made freely available to all readers.

“Ensuring data availability alone is insufficient for the true potential of open data to be realised. The push from journals and funders at the moment is to encourage sharing, which is the first step towards reuse. The next step is to consider how we ensure actually reusable data. Any efforts to make it easier for researchers to prepare high quality reusable datasets, and to do so with minimal effort, are welcome. Further, tools that reduce the burden of reviewing datasets are of interest to data publishers.”
– Naomi Penfold, eLife

Use Case

Data sharing is an important cornerstone in the movement towards more reproducible science: it provides a means to validate assertions made, which is why many journals and funders require that research data is shared publicly and appropriately within a reasonable timeframe following a research project. At eLife, authors are encouraged to deposit their data in an appropriate external repository and to cite the datasets in their article or, where this is not possible or suitable, publish the source data as supplements to the article itself. The data files are then stored in the eLife data store and made available through download links available within the article.

Source data shared with eLife is listed under the Figures and data tab. Source: Screenshot from eLife 2017;6:e29820.

Open research data is an important asset in the record of the original research, and its reuse in different contexts help make the research enterprise more efficient. Sharing and reuse of research data is fairly common, and researchers may reuse others’ data more readily than they might share their own.

The exact nature of data reuse however is less clear: forty percent of Wellcome Trust-funded researchers make their data available as open access, and three-quarters report reusing existing data for validation, contextualisation, methodological development, and novel analyses, for example (Van den Eynden et al, 2016). Interestingly, a third of researchers who never publish their own data report reusing other researcher’s open data (Treadway et al, 2016) and dataset citation by researchers other than the original authors appears to be growing at least in line with greater availability of data (for gene expression microarray analysis; Piwowar & Vision, 2013). However, only a minority of citations (6% of 138) pertained to actual data reuse when citation context was manually classified in this study.

Indeed, the quality of the data and its documentation were listed as important factors when Wellcome Trust-funded researchers were deciding whether to reuse a dataset or not (Van den Eynden et al, 2016).

Very few formal studies that look into the problems faced by researchers when attempting to reuse open data have been published. Anecdotal evidence from conversations with life scientists indicates that:

The process of preparing open data for reuse — including cleaning, restructuring, and comparing multiple datasets prior to combining — is onerous and time-consuming.
The time and effort it takes for researchers to prepare their own data for repository deposition is considered a barrier to sharing. Further, the quality of the data and its documentation are important factors when deciding whether to reuse a dataset or not. (Van den Eynden et al, 2016)
This is why projects that improve the reusability of research data in a way that requires minimal effort on the researcher’s part are of interest within the eLife Innovation Initiative.
There is also a sparsity of formal structures for secondary data users to openly collaborate with original data providers, to share the work of improving quality of open research data. Such infrastructure could provide the social and academic feedback cycle in a rapid enough timescale to fuel a rich and useful Open Data ecosystem. While the utility of goodtables does not extend to this use case, it is the first step along this pathway.

These problems are relevant not only to open data in academic research but also to government data. Similarly to moving beyond incentivising sharing data to encouraging sharing reusable data for research, we shouldn’t only incentivise governments for raw publication and access. We need to incentivise data quality, towards actual insight and change. Without a simple, solid foundation of structural integrity, schematic consistency, and timely release, we will not meet quality standards higher up in the chain. We need to have essential quality assurances in plain text publication of data first, for data that is published via manual and automated means.

We shouldn’t only incentivise governments for raw publication and access. We need to incentivise data quality, towards actual insight and change.

For our Frictionless Data pilot work, we analyzed 3910 articles, 1085 of which had data files. The most common format was Microsoft Excel Open XML Format Spreadsheet (xlsx), with 89% of all 4318 files being published on this format. Older versions of Excel and CSV files made up the rest.

A summary of the eLife research articles analysed as part of the Frictionless Data pilot work

In terms of validation, more than three quarters of the articles analyzed contained at least one invalid file. Following analysis of a sample of the results, the vast majority of the errors appear to be due to the data being presented in aesthetically pleasing tables, using formatting to make particular elements more visually clear, as opposed to a machine-readable format.

Data from Maddox et al. was shared in a machine-readable format (top), and adapted here to demonstrate how such data are often shared in a format that looks nice to the human reader (bottom). Source: Source data The data file is presented as is and adapted from Maddox et al. eLife 2015;4:e04995 under the Creative Commons Attribution License (CC BY 4.0).

This is not limited to the academic field of course, and the tendency to present data in spreadsheets so it is visually appealing is perhaps more prevalent in other areas – perhaps because consumers of the data are even less likely to have the data processed by machines or because the data is collated by people with no experience of having to use it in their work.

Work to improve the reusability of research data pushes towards an ideal situation where most data is both machine-readable and human-comprehensible. In general the eLife datasets had better quality than for instance those created by government organisations, where structural issues such as missing headers and extra cells are much more common. So although the results here have been good, the community may derive greater benefit from researchers going that extra mile to make files more machine-friendly and embrace more robust data description techniques like Data Packages.

Overall, the findings from this pilot demonstrate that there are different ways of producing data for sharing: datasets are predominantly presented in an Excel file with human aesthetics in mind, rather than structured for use by a statistical program. We found few issues with the data itself beyond presentation preferences. This is encouraging and is a great starting point for venturing forward with helping researchers to make greater use of open data.

You can read more about this work in the Frictionless Data Pilot writeup.

Parts of this piece are cross-posted on eLife Labs.