Recently, there has been a great deal of debate about research progress and impact across certain areas of science. Because a sizeable share of experimental findings has not been successfully replicated, scientists worry that the fruits of their research are not valid and reliable enough to change and benefit society at large, beyond academia. This and other pressing issues that fall under the umbrella of ‘open research practices’ motivated my colleague and me to collaborate with Frictionless Data, as part of our Open Research for Academics workshops organised in London and Manchester. Below I attempt to address this broad matter by focussing on a key element that fuels it.
If you are a scientist, you might already be intimately acquainted with the reproducibility crisis. It has taken many scholars aback, especially those studying the human mind and body. The term ‘reproducibility’ refers to the ability of experiments and experimental findings to be reproduced by different scientists across different laboratories. For scientific discoveries to be valid and reliable, they need to be replicable. For replicability to be possible, research materials need to be shared among investigators. Academics need to communicate.
In this blog post, I’m looking at the replicability crisis by focussing on its communication crux, modestly attempting to provide a short overview of the problem and propose a practical solution.
In 2005, Professor of Medicine and Statistics John Ioannidis drew attention to questionable research practices across the social and medical sciences. In 2015, a seminal review published by the Open Science Collaboration added fuel to the growing fire: only 36% of 100 high-profile psychological findings were successfully replicated by fellow cognitive scientists. Why would this be the case? After all, we live in a digital age in which records of materials, data and analyses can be made publicly available through an array of digital tools. Moreover, the American National Academy of Sciences Committee on Science, Engineering and Public Policy reiterates that researchers have a ‘fundamental obligation to keep quality records of their research and, once it’s published, to render other investigators access to the data and research materials’. The Royal Society Science Policy Centre emphasises that ‘good scientiﬁc communication is assessable communication, which allows those who follow it not only to understand what is claimed, but also to assess the reasoning and evidence behind the claim’. It seems that inter-researcher communication is the elephant in the room.
This reluctance is somewhat understandable. Well-substantiated arguments include the common fear that sharing research materials invites intellectual theft, the realisation that open data need to be hosted on platforms whose maintenance is costly (e.g. about £5.5 million allocated to the UK Data Archive in 2015 alone), and the pressure and limited time academics are subject to, which they rightly prefer to dedicate to generating ideas and publishing rather than to distribution.
The other side of the argument is that the act of sharing research materials is associated with increased citation rate and research impact, rapid research growth and a tacit spread of knowledge, which benefits the novices of today and the authorities of tomorrow. Indeed, early career researchers defer to their mentors when borrowing all manner of research practices, including pristine data screening, cleaning and handling criteria as well as more sophisticated reasoning related to the philosophy of science and collegiality. Thus, an understandable fear is that such standardised ways of spreading knowledge mar young scientists’ creativity by making them dependent upon the collective protocol of their laboratories.
I believe a balance can be achieved, and I am not alone in my conviction. Caspar Addyman, a developmental psychologist passionate about open research and statistics, has saved fellow colleagues many nerve-wracking hours by depositing on GitHub many of the stimuli and scripts he created and employed in his own studies. The cognitive scientists behind the Peer Reviewers’ Openness Initiative have categorically stated that they, and whoever else adheres to their initiative, will not offer manuscript review unless experimental materials are made publicly available, accompanied by guidelines and explanations. The scientific journal Nature has just proclaimed its commitment to openness by encouraging the establishment of clear standards and repositories, while echoing the exceptions mentioned in the Peer Reviewers’ Openness Initiative manifesto – clear legal or ethical reasons for not sharing experimental materials are understandable. Change is slowly occurring, yet more coverage and practicality are needed. What could still be done?
The Frictionless Data project from Open Knowledge International firmly believes that data containerisation removes barriers, thus aiding creative workflow. The Frictionless Data approach encourages data producers to adhere to a set of specifications and best practices that promote a smooth, friction-free path along the produce/improve/share sequence that lies at the heart of research. Concretely, the research process can be streamlined by improving the way spreadsheets are organised, refining the annotation of data-analysis scripts and transparently summarising the reasoning behind all statistical analyses employed in experiments. In addition, records of raw data and of stimulus creation and presentation should be kept and updated whenever changes to paradigms are implemented.
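To make this concrete, here is a minimal sketch of what the Frictionless Data specifications look like in practice: a `datapackage.json` descriptor that sits next to a spreadsheet and declares its columns and types. The study name, file name and field names below are illustrative assumptions, not taken from a real dataset.

```python
import json

# A minimal Data Package descriptor for a hypothetical results table.
# The resource path ("reaction_times.csv") and fields are illustrative only.
descriptor = {
    "name": "reaction-time-study",
    "resources": [
        {
            "name": "reaction-times",
            "path": "reaction_times.csv",
            "schema": {
                "fields": [
                    {"name": "participant_id", "type": "string"},
                    {"name": "condition", "type": "string"},
                    {"name": "rt_ms", "type": "number"},
                ]
            },
        }
    ],
}

# Writing the descriptor alongside the data makes the spreadsheet
# self-describing for anyone who downloads it later.
print(json.dumps(descriptor, indent=2))
```

Because the descriptor is plain JSON, any collaborator (or tool) can read the column names and types without opening the spreadsheet itself, which is precisely the friction the project aims to remove.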
This is where clever, automated tools help. Open Knowledge International has developed GoodTables, a free app that allows one to identify and correct errors in spreadsheets before sharing them, ensuring that the standards for further analysis and usage are met. Similar to how developers manage code projects, GoodTables can be used for continuous data validation, managing data as they change; the process is described on the Open Knowledge International developer Labs blog. The main idea is that, because datasets are often created collaboratively, failure to make accurate notes of updates can result in code crashes. Continuous integration allows validation to run automatically, with notifications released on every update to the dataset’s shared repository. This way, ‘bad’, i.e. poorly structured, data, spreadsheets and code are instantaneously flagged and the appropriate amendments can be made.
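The kinds of structural problems such a validator flags can be illustrated with a short standard-library sketch. This is not GoodTables’ actual API; `validate_table` and its error messages are invented here purely to show the idea of checking a table before it is shared.

```python
import csv
import io

def validate_table(csv_text):
    """Flag common structural problems in a CSV table:
    blank headers, duplicate headers, and rows whose length
    differs from the header's. (A simplified illustration of
    what a tool like GoodTables checks, not the real tool.)"""
    errors = []
    rows = list(csv.reader(io.StringIO(csv_text)))
    header = rows[0]
    if any(not h.strip() for h in header):
        errors.append("blank header")
    if len(set(header)) != len(header):
        errors.append("duplicate header")
    for i, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            errors.append(f"row {i}: expected {len(header)} cells, got {len(row)}")
    return errors

good = "id,score\n1,0.9\n2,0.7\n"
bad = "id,id\n1,0.9,extra\n"
print(validate_table(good))  # []
print(validate_table(bad))   # ['duplicate header', 'row 2: expected 2 cells, got 3']
```

Run on every commit to a shared dataset, a check like this is exactly the ‘continuous data validation’ idea: the collaborator who introduces a ragged row is told immediately, before downstream analysis code crashes on it.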
Another advantage of Frictionless Data is the network of projects and organisations adopting the specifications. Their Case Study series focusses on the experience of researchers from different fields whose workflow and final product have been considerably improved by Open Knowledge International tools. Examples include energy researchers and cell biologists, as well as investigators working with ecological or any other type of tabular data. This list is by no means exhaustive, and hungry minds are encouraged to visit the Frictionless Data website and engage with their comprehensive guides and blog posts.
Let’s assume the philosophical debate surrounding the context-dependent standardisation of research practices is solved. Could projects such as Frictionless Data eradicate irreproducibility? I don’t believe so. The reproducibility crisis is multi-faceted, touching upon many grey areas such as policymaking, finances and ethics. Thus, a word of caution is needed: Frictionless Data and other like-minded initiatives are not a panacea. Where institutions promote the ‘publish or perish’ mindset and research is hurried in order to advance or even preserve careers, the public availability of research materials constitutes only one node within an ever-growing network of contentious compromises. Ironically, though, sharing data actually supports fast-paced research: investigators can run and publish studies more quickly when they don’t waste time cleaning datasets or programming experiments and data-analysis procedures that others have already built in past investigations.