Bad Science on Open Data

The following article is from Guardian columnist Dr Ben Goldacre and was originally published on his blog as “Nullius in verba. In verba? Nullius!”. He kindly allowed us to reprint it here. It discusses the pros and cons of publishing data in the context of investigative medical journalism.

Ben Goldacre, Not In The Guardian, Saturday 26 June 2010

Here is some pedantry: I worry about data being published in newspapers rather than academic journals, even when I agree with its conclusions. Much like Bruce Forsyth, the Royal Society has a catchphrase: nullius in verba, or “on the word of nobody”. Science isn’t about assertions on what is right, handed down from authority figures. It’s about clear descriptions of studies, and the results that came from them, followed by an explanation of why they support or refute a given idea.

Last week the Guardian ran a major series of articles on the mortality rates after planned abdominal aortic aneurysm repair in different hospitals. Like many previously published academic studies on the same question, they again discovered that hospitals which perform the operation less frequently have poorer outcomes. I think this is a valid finding.

The Guardian pieces aimed to provide new information, in that they did not use the Hospital Episodes Statistics, which have been used for much previous work on the topic (and on the NHS Choices website to rate hospitals for the public). Instead they approached each hospital with a Freedom of Information Act request, asking the surgeons themselves for the figures of how many operations they did, and how many people died.

Many straightforward academic papers are built out of this kind of investigative journalism work, from early epidemiology research into occupational hazards, through to the famous recent study hunting down all the missing trials of SSRI antidepressants that companies had hidden away. It’s not clear whether this FOI data will be more reliable than the Hospital Episodes numbers – “discuss the strengths and weaknesses of the HES dataset” is a standard public health exam question – and reliability will probably vary from hospital to hospital. One unit, for example, reported a single death after 95 emergency AAA operations on FOI request, when on average about one in three people in the UK die during this procedure, and that suggests to me that there may be problems in the data. But there’s no doubt this was a useful thing to do, and there’s no doubt that hospitals should be helpful and share this information.
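
To put a rough number on why one death in 95 emergency operations looks implausible, here is a minimal sketch in Python. It assumes, purely for illustration, that each operation carries the same independent one-in-three risk of death quoted above; real patients and units differ, so this is a back-of-envelope check rather than a formal test. The function name `prob_at_most` is made up for this example.

```python
from math import comb


def prob_at_most(k_max, n, p):
    """Probability of k_max or fewer deaths in n operations,
    assuming each operation independently carries risk p (binomial model)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_max + 1))


n_operations = 95        # emergency AAA operations reported by the unit
national_risk = 1 / 3    # "about one in three", as quoted in the article

# Deaths we would expect if the unit matched the quoted average.
expected_deaths = n_operations * national_risk

# Chance of seeing one death or none under that assumption.
p_one_or_fewer = prob_at_most(1, n_operations, national_risk)

print(f"Expected deaths at the quoted risk: {expected_deaths:.1f}")
print(f"Chance of one death or fewer: {p_one_or_fewer:.2e}")
```

Under these assumptions the expected number of deaths is a little over 31, and the chance of seeing one or none is vanishingly small. A unit could genuinely do better than average, but not plausibly by this much, which is why a reporting problem is a reasonable suspicion.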

So what’s the problem? It’s not the trivial errors in the piece, although they were there. The article says there are ten hospitals with over 10% mortality, but in the data there are only seven. It says 23 hospitals do over 50 operations a year, but looking at the data there are only 21.

But here’s what I think is interesting. This analysis was published in the Guardian, not an academic journal. Alongside the articles, the Guardian published their data, and as a longstanding campaigner for open access to data, I think this is exemplary. I downloaded it, as the Guardian webpage invited, did a quick scatter plot, and a few other things: I couldn’t see the pattern of greater mortality in hospitals that did the procedure infrequently. It wasn’t barn door. Others had the same problem. I received a trickle of emails from readers who also couldn’t find the claimed patterns (including a professor of stats, if that matters to you). John Appleby, chief economist on health policy at the King’s Fund, posted on Guardian CommentIsFree explaining that he couldn’t find the pattern either.
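
For readers who want to try a check of this kind themselves, here is a minimal sketch in Python. It is not the analysis the Guardian's adviser performed, which is described further on as considerably more complicated. It simply pools deaths and operations for low-volume and high-volume hospitals and compares the two rates. The `Hospital` record, the threshold of 50 operations and the four rows of figures are all invented placeholders, not the Guardian's data.

```python
from dataclasses import dataclass


@dataclass
class Hospital:
    name: str
    operations: int
    deaths: int


def pooled_mortality(hospitals):
    """Total deaths divided by total operations across a group of hospitals."""
    total_ops = sum(h.operations for h in hospitals)
    total_deaths = sum(h.deaths for h in hospitals)
    return total_deaths / total_ops if total_ops else float("nan")


def compare_by_volume(hospitals, threshold):
    """Split hospitals at a volume threshold and return
    (low-volume pooled rate, high-volume pooled rate)."""
    low = [h for h in hospitals if h.operations < threshold]
    high = [h for h in hospitals if h.operations >= threshold]
    return pooled_mortality(low), pooled_mortality(high)


# Made-up placeholder rows for illustration only; NOT the published figures.
sample = [
    Hospital("A", 12, 1),
    Hospital("B", 20, 2),
    Hospital("C", 150, 6),
    Hospital("D", 210, 8),
]

low_rate, high_rate = compare_by_volume(sample, threshold=50)
print(f"low-volume: {low_rate:.1%}, high-volume: {high_rate:.1%}")
```

Pooling matters because a hospital that does only a handful of operations can swing from 0% to 20% mortality on a single death, so a raw scatter plot of individual hospitals can hide a difference that only shows up in aggregate. The output here reflects only the invented rows and says nothing about the real hospitals.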

The journalists were also unable to tell me how to find the pattern. They referred me instead to Peter Holt, an academic surgeon who’d analysed the data for them. Eventually I was able to piece together a rough picture of what was done, and after a few days, more details were posted online. It was a pretty complicated analysis, with safety plots and forest plots. I think I buy it as fair.

So why does it matter, if the conclusion is probably valid? Because science is not a black box. There is a reason why people generally publish results in academic journals instead of newspapers, and it’s got little to do with “peer review” and a lot to do with detail about methods, which tell us how you know if something is true. It’s worrying if a new data analysis is published only in a newspaper, because the details of how the conclusions were reached are inaccessible. This is especially true if the analysis is so complicated that the journalists themselves did not know about it, and could not explain it, and this transparency is especially important if you’re seeking to influence policy. The information needs to be somewhere.

Open data – people posting their data freely for all to re-analyse – is the big hip new zeitgeist, and a vitally important new idea. But I was surprised to find that the thing I’ve advocated for wasn’t enough: open data is sometimes no use unless we also have open methods.