
This is the nineteenth conversation of the 100+ Conversations to Inspire Our New Direction (#OKFN100) project.
Starting in 2023, we are meeting with over 100 people to discuss the future of open knowledge, shaped by a diverse set of visions from artists, activists, academics, archivists, thinkers, policymakers, data scientists, educators, and community leaders from around the world.
How can openness accelerate and strengthen struggles against the complex challenges of our time? This is the key question behind conversations like the one you can read below.
*
In today’s conversation, we speak to Travis Gerke, a biostatistician and epidemiologist with deep experience across both academia and industry.
Travis’s academic labs focused on machine learning (ML) and causal inference with oncology applications, and his subsequent work in hospital IT and contract research organisations (CROs) involved deploying varied data-driven methods into production. A recurring theme across his efforts is a passion for open source development paired with accessible public health and clinical trial data.
In this interview, he discusses the barriers to open science, the promise and pitfalls of AI, and why we need to rethink how we share health data.
This conversation with Travis took place in late October and was conducted by Renata Ávila, CEO of OKFN.
We hope you enjoy reading it.
*
Renata Ávila: Let’s start with a little bit about you and how you chose this path for your career – at the intersection of so many public interest things.
Travis Gerke: I used to play in punk rock bands and work in construction for a long time. And then I said, what’s the opposite of that? Because it wasn’t paying the bills. So I said, math is the opposite of that.
I went through grad school, studied biostatistics and epidemiology, stayed in academia for about six years. When it was time to go up for tenure, I was asking myself, is this really making me happy? And I thought, nah, this isn’t really it for me.
What’s interesting about academia is that it’s still very supportive of the singular hero scientist. And I don’t think that’s very realistic. Most things take teams. I enjoy being a support person. I don’t wanna agonise about where my name lands in the author list – I’d rather just help teams of people produce good things faster.
So I left academia to embrace the full data science track. I did hospital IT for a bit, then jumped into clinical trials where I’ve been for four or five years now. For me, kind of a pet project is making sure that findings can be transparent and findable by patients or new people looking for information. That has not been an initiative that pharma has embraced historically. But there are places where some of those corners are being turned, and it’s promising.
Renata Ávila: One of the things that struck me – if you or someone you care deeply about is ill with something quite rare, you want to get as much data and research as possible to understand the situation yourself. And somehow I feel that this knowledge has been held hostage.
I’ve seen it with many close friends suffering from long COVID or genetic diseases. There’s public money going into this research, and monopolies being granted to make money out of the discoveries. And yet the knowledge that would help us collectively identify problems or find solutions as a community is far away.
What’s your experience with that? What’s your main frustration and what do you think could be unlocked to make the situation better?
Travis Gerke: I think there are a couple of layers of siloing of information from the public. One is the academic publishing industry. I think it’s absurd that trials, which are in many cases funded by the government, have excellent manuscripts written about them, and yet those publications are often behind a paywall.
There have been good strides to improve that here in the US. As a requirement of federal grants, you do have to make at least the landmark publication publicly accessible. Although even when that happens, there are sometimes embargoes by journals. There’s no reason for it in my mind, other than sheer profit.
I love that the preprint movement has really taken off. arXiv and bioRxiv – these are great resources. And as publishers have lifted restrictions on submitting to their journals when you’ve preprinted, that’s been a good thing. That’s one silo: the academic publishing industry.
The second is really just the reluctance to share open data that relates to clinical trials. One reason is that there is not much funding or incentive available to fully de-identify and anonymise data to then ship it into the public domain. From pharma’s or academics’ perspective, why would they do this? They don’t get any returns personally.
And then there’s a general fear of, “What if I’ve done something wrong in the analysis of my data and then I become exposed?” I can appreciate that, but in the end I think it loses sight of why we do this: to make health better and make the world better. Even if we’ve made a mistake somewhere along the way, it’s not like it was ill-intentioned. If we open source our experiences, our wins, and even our missteps, we all learn from it and everyone becomes healthier and better.
Renata Ávila: One thing I wanted to know about is anonymising the data. Is there a low-tech, cheap solution that could impact at scale? Because I think it would make a big difference in the public interest.
Travis Gerke: Yeah, it’s a really good question. I think maybe four or five years ago, the buzz was tokenisation of data. There are companies that do this now at scale, but they’re for-profit and charge astronomical amounts of money.
But there are lower-tech ways. There are rules for scrambling data that classify it as fully anonymised. For example, dates are PHI (protected health information) – if you had a surgery on some particular day, you cannot share that information. However, if the granularity becomes plus or minus a year, there are ways you can shift that data randomly such that it becomes de-identified.
You don’t wanna shift each date independently, because then you lose the causal chain of events. You don’t want a surgery to happen before the actual diagnosis, right? So you shift all the dates in a patient’s record by the same random offset, one direction or the other, so the time ordering of events is preserved. Names are never necessary. For geolocation, you don’t have to get down to county level. You can just aggregate by region.
So yes, there are low-tech ways that just involve some fairly basic data engineering – straightforward application of existing methods.
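As a minimal sketch of the patient-level date shift Travis describes – the table, column names, and offset range below are invented for illustration, not a prescription for compliant de-identification:

```python
import numpy as np
import pandas as pd

# Hypothetical patient-event table; all names and values are made up.
events = pd.DataFrame({
    "patient_id": [1, 1, 2, 2],
    "event":      ["diagnosis", "surgery", "diagnosis", "surgery"],
    "date":       pd.to_datetime(["2021-03-02", "2021-04-15",
                                  "2020-11-20", "2021-01-05"]),
})

rng = np.random.default_rng(seed=42)

# Draw ONE random offset per patient (here +/- up to a year) and apply it to
# all of that patient's dates. Intervals and event ordering are preserved,
# but the calendar dates themselves no longer identify anyone.
offsets = {
    pid: pd.Timedelta(days=int(rng.integers(-365, 366)))
    for pid in events["patient_id"].unique()
}
events["shifted_date"] = events["date"] + events["patient_id"].map(offsets)

print(events[["patient_id", "event", "shifted_date"]])
```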
Renata Ávila: One of the things you mentioned is the frustration with the speed of knowledge in academia. I share that frustration. Sometimes I feel that those in the know have information at the right moment that could really transform people’s lives, but the information doesn’t flow fast – because of format, because of the form of academia.
What would be a different model for you that would make sense on the way that we produce and share and apply knowledge?
Travis Gerke: I’m trying to think of what one of the main barriers to speed is right now. And it’s probably the peer review process.
You conduct a study which takes years, then you write a manuscript and submit it to a journal, and it takes two to three months for the reviewers to review it, and then it gets rejected. So you take it and submit it to the next journal and it takes two to three months and it gets rejected. Before you know it, two years have elapsed. And this research, which has very likely been funded by the people and the government, is still just under lock and key, awaiting a proper stamp of approval from peer review.
I don’t want to say that peer review is bad. It’s of course very good. But I think we could better scale the way it’s operationalised. I do think preprints are a step in the right direction. However, they’re a little bit dangerous because nothing is peer-reviewed in advance of posting.
I think incentivising the scientific community to peer review preprints actively could be a way forward. But I struggle with how to formalise that, because there has to be an incentive structure for academics who read preprints and peer review them.
If there’s a more streamlined way to release data associated with the preprint, that would be huge. If you can pull up a paper and say, “I’m not sure about this result. Let me tinker on my side and see what happens if I adjust for this other variable”, that’d be amazing. I think we’d have more community and crowdsourced science happening faster.
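To make that kind of crowd re-analysis concrete, here is a minimal sketch – the dataset is simulated and every variable name is invented – of refitting a published estimate with one extra covariate:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in for a released trial dataset; all names are hypothetical.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "treatment": rng.integers(0, 2, n),
    "age": rng.normal(60, 10, n),
})
df["outcome"] = 0.5 * df["treatment"] + 0.03 * df["age"] + rng.normal(0, 1, n)

# The "published" analysis: outcome regressed on treatment alone.
unadjusted = smf.ols("outcome ~ treatment", data=df).fit()

# A reader's re-analysis, adjusting for a variable they suspect matters.
adjusted = smf.ols("outcome ~ treatment + age", data=df).fit()

print("unadjusted effect:", round(unadjusted.params["treatment"], 3))
print("adjusted effect:  ", round(adjusted.params["treatment"], 3))
```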
Renata Ávila: But what incentive is there for me to spend time and effort peer reviewing a dataset that could be used for so many researchers? And then where is my credit?
The incentives seem completely distorted. And another interesting issue is how much of a business it’s becoming: the complex technological stack being pushed into the open access space. Super complex, expensive subscription software. It’s become an industry in itself – the monopoly players have figured out a model to monetise in the same way, to lock in the knowledge by limiting who can realistically run an open access journal.
Have you seen interesting incentives that could lead us to a different model?
Travis Gerke: The incentive structure is near and dear to my heart. To work as a supportive team member in academia, one always needs funding. And in general, funders do not want to support infrastructure grants. I would submit infrastructure-looking grants to say, “I’m going to build this tool that will help cancer research groups move faster.” And they would say, “No, but what’s your research question?”
And I get it. Congress mandates that requests for proposals be written a certain way. They have to achieve certain end goals. Unfortunately, those most often align with investigating new research ideas and generating new datasets. So the notion that one would build infrastructure to increase value from existing data resources ends up being at odds with the funding mechanisms that exist here.
You mentioned something about getting citations for peer reviewing a dataset, almost like how GitHub contributions work. That could work. But there would have to be a lot of thought around the metric alignment that makes it digestible by academic administrators who need to promote and retain. It’s tricky. It’s a people problem.
You mentioned siloed tech stacks and vendor lock-in, which is also anathema to me. There are so many emerging tech players whose sole business model is to impose a highly complicated technology and data stack that you cannot back out of.
There’s a reason Excel has been basically undefeated for decades, right? It just works. Flat files work, CSVs work. I don’t know that we need a lot of cloud infrastructure and complicated tooling to analyse what are often simple data structures – ones that also align with the highest-impact questions for science and for humanity. I think we often don’t need highly complicated data to move the needle in terms of progress. But I think we’ve lost sight of that.
Renata Ávila: What do you think of knowledge as a digital public infrastructure? Scientific knowledge as a digital public infrastructure that helps society solve problems, rather than just academia as a sector? How do we need to reframe it so it is a societal investment and also a societal opportunity to participate?
Travis Gerke: In another role, I work with an organisation called cStructure, and we’re concerned specifically with this problem. The end goal is doing good causal inference: how do we make good decisions? Rather than building big black box machine learning (ML) models that predict what may happen next, you seek to answer questions like, “What if we make this policy decision, what happens next?”
To do that well, you have to have a good knowledge base, a structured knowledge graph. We’ve been working towards a platform which will encode knowledge as graphs in a low-code way, easily accessible to experts. And once you have many of those, you can start thinking about how these knowledge graphs, each specific to a given problem, start to touch each other. Then you build a very large knowledge resource that can answer many “what if” style questions, which should impact the way the world works.
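As a toy illustration of what encoding knowledge as a causal graph can look like – the variables and edges below are invented placeholders, not cStructure’s actual schema – one might write:

```python
import networkx as nx

# A minimal causal knowledge graph: nodes are variables, directed edges are
# assumed cause-effect relationships. All names here are made up.
g = nx.DiGraph()
g.add_edges_from([
    ("policy_change", "screening_rate"),
    ("screening_rate", "early_detection"),
    ("early_detection", "survival"),
    ("socioeconomic_status", "screening_rate"),
    ("socioeconomic_status", "survival"),
])

# Causal questions become graph queries: sanity-check that the graph is a
# DAG, then enumerate the causal paths from an intervention to an outcome.
assert nx.is_directed_acyclic_graph(g)
print(list(nx.all_simple_paths(g, "policy_change", "survival")))
```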
Renata Ávila: If you had a magic wand, what would be the change you would push ahead? And do you have any ongoing fear of what is ahead of us?
Travis Gerke: Let’s go fear first. I imagine most people you talk to have the same answer: AI is very dangerous and we are going to stop learning how to think.
As a programmer, I do use LLM-assisted tools – they can be very useful, but one has to use them with extreme caution. One of the challenges I think about is that when I use an LLM system to help me work through a programming task, it draws on the existing knowledge base upon which that model was built. What we’re not doing is inventing new creative solutions to problems.
If we start relying too heavily on AI, then who will write the next textbook? And who will make new knowledge happen in the public domain if we’re all just relying on this machine? Which, by the way, very often doesn’t do a very good job – it hallucinates and gives incorrect answers and misinformation all the time.
I think we could be led very astray by overreliance, and in the US, our entire economy these days seems to be built upon the promise of AI. And that’s very risky and it frightens me.
In terms of if I had a magic wand, I’d probably make everything I just described go away. Let’s revert back to the days when ML was just called ML and not AI, and linear regression was called linear regression and not AI.
And then the most magical wand I would wave would be to go back in time 30 or 40 years and be more thoughtful about what common data models should look like for various application domains. Right now, most domain areas – whether economics, health, or finance – are all struggling with very disparate-looking datasets. No matter how we try to tackle answering questions, we always get stuck for 80% of the journey trying to reassemble and understand how these disparate datasets convey information.
If we could make them all look the same and enforce standards such that organisations conform their data to those models, then all the conversations we’ve had – about how we open up more data, how we incentivise data access – become self-evident. Because there’s already a large infrastructure built around a common standard. And when you adhere to that standard, your data becomes more reusable and understandable by both humans and machines. And we learn faster.
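To make the idea concrete: a common data model is an agreed target schema that every source maps into (real examples exist in health, such as the OMOP common data model). In the minimal sketch below, every table, column, and mapping is invented for illustration:

```python
import pandas as pd

# Two hypothetical sources holding the same information in disparate layouts.
hospital_a = pd.DataFrame({"pt": [1], "dx_code": ["C61"], "dx_dt": ["2021-01-05"]})
hospital_b = pd.DataFrame({"patient": [7], "diagnosis": ["C61"], "when": ["2020-12-11"]})

# The agreed common data model: one target schema every source maps into.
COMMON_COLUMNS = ["person_id", "condition_code", "condition_date"]

def to_common(df: pd.DataFrame, mapping: dict) -> pd.DataFrame:
    """Rename source columns to the common model and keep only its fields."""
    return df.rename(columns=mapping)[COMMON_COLUMNS]

standardised = pd.concat([
    to_common(hospital_a, {"pt": "person_id", "dx_code": "condition_code",
                           "dx_dt": "condition_date"}),
    to_common(hospital_b, {"patient": "person_id", "diagnosis": "condition_code",
                           "when": "condition_date"}),
], ignore_index=True)

# Downstream tools can now be written once, against the common schema.
print(standardised)
```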