Evidence Appraisal Data-Thon: A recap of our Open Data Day event

This blog has been reposted from Medium

This blog is part of the event report series on International Open Data Day 2018. On Saturday 3 March, groups from around the world organised over 400 events to celebrate, promote and spread the use of open data. 45 events received additional support through the Open Knowledge International mini-grants scheme, funded by Hivos, SPARC, Mapbox, the Hewlett Foundation and the UK Foreign & Commonwealth Office. The event in this blog was supported through the mini-grants scheme under the Open Research Data theme.

Research can save lives, reduce suffering, and advance scientific understanding. But research can also be unethical, unimportant, invalid, or poorly reported. These problems can harm health, waste scientific and health resources, and reduce trust in science. Differentiating good science from bad therefore has big implications. This work is happening amid broader discussions about telling good information from misinformation; the controversy over political ‘fake news’ has received particular attention recently. Scientific misinformation is also published, both for the public and within academia, and much of it derives from low-quality science.

EvidenceBase is a global, informal, voluntary organization that aims to start and support tools and infrastructure that enhance scientific quality and usability. The critical appraisal of science is one of many mechanisms for evaluating and clarifying published science, and evidence appraisal is a key area of EvidenceBase’s work. On March 3rd we held an Open Data Day event to introduce the public to evidence appraisal and to explore and work on an open dataset of appraisals. We reached out to a network of data scientists, software developers, public health professionals, and clinicians in NYC and invited them and their interested friends (including any without health, science, or data training).

Our data came from the US National Library of Medicine’s PubMed and PubMed Central datasets. PubMed offers indexing, metadata, and abstracts for biomedical publications, and PubMed Central (PMC) offers full text in PDF and/or XML. PMC has an open-access subset. We explored the portion of this subset that 1) was indexed in PubMed as a “journal comment” and 2) was a comment on a clinical trial. Our 10-hour event began with a session introducing the general areas of health trials, research issues, and open data. For the remainder of the day, parallel groups tackled three areas: lay exploration and Q&A; dataset processing and word embedding development; and health expertise-guided manual exploration and annotation of comments. We had 2 data scientists, 4 trial experts, 3 physicians, 4 public health practitioners, 4 participants without background but with curiosity, and 1 infant. Our space was donated, and the food was paid for by a mix of an Open Data Day grant provided by SPARC and Open Knowledge International (thank you!) and voluntary participant donations.

On the dataset front, we combined the clinical trial and journal comment metadata in PubMed, the links between PubMed and PMC, and PMC’s open subset IDs to create a data subset consisting solely of journal comments on clinical trials that were in PMC’s open subset with XML data. Initial exploration of this subset for quality issues showed us that PubMed metadata tags misindex some non-trials as trials and some non-comments as comments, so further data curation will be needed. Even so, we used the subset to create word embeddings and do some brief similarity-based expansion.
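For readers unfamiliar with similarity-based expansion, here is a minimal sketch of the general idea, not the code we ran on the day. It assumes each term already has an embedding vector and ranks the other terms by cosine similarity to a seed term. The vectors, the terms, and the function names `cosine` and `expand_terms` are all made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_u * norm_v)


def expand_terms(seed, embeddings, top_n=2):
    """Return the top_n (term, score) pairs most similar to the seed term."""
    seed_vec = embeddings[seed]
    scored = [
        (term, cosine(seed_vec, vec))
        for term, vec in embeddings.items()
        if term != seed
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy, hand-made 3-dimensional vectors (hypothetical, not trained on PMC text).
TOY_EMBEDDINGS = {
    "bias": [0.9, 0.1, 0.0],
    "confounding": [0.8, 0.2, 0.1],
    "blinding": [0.6, 0.5, 0.1],
    "placebo": [0.1, 0.9, 0.2],
    "journal": [0.0, 0.1, 0.9],
}

if __name__ == "__main__":
    for term, score in expand_terms("bias", TOY_EMBEDDINGS):
        print(f"{term}: {score:.3f}")
```

In practice the vectors would come from embeddings trained on the comment text, and a domain expert would review the suggested neighbours before treating any of them as synonyms.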

The domain experts reviewed trials in their area of expertise. Some participants manually extracted text fragments expressing a single appraisal assertion, and attempted to generalize the assertion for future structured knowledge representation work. Overall, participants had a fun, productive, and educational time! From the standpoint of EvidenceBase, the event was both successful and interesting. We are mainly virtual and global, so this in-person event was new and energizing for us, and it helped forge new relationships for the future.

We also learned:

We can’t have too much on one person’s plate for logistics and for facilitation. Issues will happen (e.g. a last-minute food cancellation).
Curiosity abounds, and people are thirsty for meaningful and productive social interactions beyond their jobs. They just need to be invited; otherwise this potential group will not be involved.
Many people who have data science skills have jobs in industries they don’t love, and they have a particular thirst to leverage their skills for good.
People without data science expertise but who have domain expertise are keen on exploring the data and offering insight. This can help make sense of it, and can help identify issues (e.g. data quality issues, synonyms, subfield-specific differences).
People with neither domain expertise nor data science skills still add vibrancy to these events, though the event organizers need more bandwidth to help orient and facilitate the involvement of these attendees.
Public research data sets are messy, and often require further subsetting or transformation to make them usable and high quality.
Open data might have license and accessibility barriers. For us, this meant a large drop from all journal comments to those with full text available and, among those with full text, a further large drop to those whose text was open-access and licensed for use in text mining.
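The attrition described in the last point can be pictured as a simple filtering funnel. The sketch below is illustrative only: the records are invented, the field names are hypothetical rather than real PubMed or PMC fields, and the set of licenses treated as allowing text mining is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Comment:
    record_id: str                # invented ID, not a real PMID
    comments_on_trial: bool       # hypothetical flag derived from metadata
    has_full_text: bool           # hypothetical flag: full text is available
    license_name: Optional[str]   # None means no reuse license recorded


# Assumption for this example: only these licenses count as usable.
TEXT_MINING_LICENSES = {"CC BY", "CC0"}


def funnel(records):
    """Count how many records survive each successive filter."""
    on_trials = [r for r in records if r.comments_on_trial]
    with_text = [r for r in on_trials if r.has_full_text]
    minable = [r for r in with_text if r.license_name in TEXT_MINING_LICENSES]
    return {
        "comments_on_trials": len(on_trials),
        "with_full_text": len(with_text),
        "text_mining_allowed": len(minable),
    }


# Invented sample data, only to show the shape of the calculation.
SAMPLE = [
    Comment("demo-1", True, True, "CC BY"),
    Comment("demo-2", True, True, None),
    Comment("demo-3", True, False, None),
    Comment("demo-4", False, True, "CC0"),
    Comment("demo-5", True, True, "CC0"),
]

if __name__ == "__main__":
    for stage, count in funnel(SAMPLE).items():
        print(f"{stage}: {count}")
```

Reporting the count at each stage, rather than only the final number, makes it easier to see whether availability or licensing is the bigger barrier for a given dataset.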

We’ll be continuing to develop the data set and annotations started here, and we look forward to the next Open Data Day. We may even host a data event before then!