
Bad Science on Open Data

July 2, 2010 in Uncategorized

The following article is from Guardian columnist Dr Ben Goldacre and was originally published on his blog as “Nullius in verba. In verba? Nullius!”. He kindly allowed us to reprint it here. It discusses the pros and cons of publishing data in the context of investigative medical journalism.

Ben Goldacre, Not In The Guardian, Saturday 26 June 2010

Here is some pedantry: I worry about data being published in newspapers rather than academic journals, even when I agree with its conclusions. Much like Bruce Forsyth, the Royal Society has a catchphrase: nullius in verba, or “on the word of nobody”. Science isn’t about assertions on what is right, handed down from authority figures. It’s about clear descriptions of studies, and the results that came from them, followed by an explanation of why they support or refute a given idea.

Last week the Guardian ran a major series of articles on the mortality rates after planned abdominal aortic aneurysm repair in different hospitals. Like many previously published academic studies on the same question, they again discovered that hospitals which perform the operation less frequently have poorer outcomes. I think this is a valid finding.

The Guardian pieces aimed to provide new information, in that they did not use the Hospital Episodes Statistics, which have been used for much previous work on the topic (and on the NHS Choices website to rate hospitals for the public). Instead they approached each hospital with a Freedom of Information Act request, asking the surgeons themselves for the figures of how many operations they did, and how many people died.

Many straightforward academic papers are built out of this kind of investigative journalism work, from early epidemiology research into occupational hazards, through to the famous recent study hunting down all the missing trials of SSRI antidepressants that companies had hidden away. It’s not clear whether this FOI data will be more reliable than the Hospital Episodes numbers – “discuss the strengths and weaknesses of the HES dataset” is a standard public health exam question – and reliability will probably vary from hospital to hospital. One unit, for example, reported a single death after 95 emergency AAA operations on FOI request, when on average about one in 3 people in the UK die during this procedure, and that suggests to me that there may be problems in the data. But there’s no doubt this was a useful thing to do, and there’s no doubt that hospitals should be helpful and share this information.
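
To put a rough number on why that single figure looks odd, here is a minimal sketch in Python. It assumes, purely for illustration, that each operation is an independent event with the same one-in-three risk, which is a simplification and not an analysis from the original article. The function name `binomial_cdf` is my own.

```python
from math import comb

def binomial_cdf(k, n, p):
    """Probability of at most k events in n independent trials, each with risk p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# Figures quoted in the text: 1 death in 95 operations, average risk about 1 in 3.
# Assumption (mine): operations are independent and share the same risk.
p_at_most_one = binomial_cdf(1, 95, 1 / 3)
print(f"P(at most 1 death in 95 operations) = {p_at_most_one:.2e}")
```

Under those assumptions the probability is vanishingly small, which is consistent with suspecting a reporting problem rather than a remarkable unit. A real analysis would need to allow for differences in case mix between hospitals.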

So what’s the problem? It’s not the trivial errors in the piece, although they were there. The article says there are ten hospitals with over 10% mortality, but in the data there are only 7. It says 23 hospitals do over 50 operations a year, but looking at the data there are only 21.

But here’s what I think is interesting. This analysis was published in the Guardian, not an academic journal. Alongside the articles, the Guardian published their data, and as a longstanding campaigner for open access to data, I think this is exemplary. I downloaded it, as the Guardian webpage invited, did a quick scatter plot, and a few other things: I couldn’t see the pattern for greater mortality in hospitals that did the procedure infrequently. It wasn’t barn door. Others had the same problem. I received a trickle of emails from readers who also couldn’t find the claimed patterns (including a professor of stats, if that matters to you). Jon Appleby, chief economist on health policy at the King’s Fund, posted on Guardian CommentIsFree explaining that he couldn’t find the pattern either.
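
The kind of quick check described above can be sketched in a few lines of Python. The rows below are made-up placeholder numbers, not the Guardian's figures, and the names (`rows`, `pearson`) are mine; the point is only to show the shape of a check that anyone could rerun against the published spreadsheet.

```python
from math import sqrt

# Hypothetical placeholder rows: (hospital, operations, deaths).
# These are NOT the Guardian's data.
rows = [
    ("A", 20, 3),
    ("B", 45, 4),
    ("C", 60, 3),
    ("D", 120, 4),
    ("E", 15, 1),
]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Mortality rate per hospital, then the two headline counts and a crude correlation.
rates = {name: deaths / ops for name, ops, deaths in rows}
over_10_percent = [name for name, rate in rates.items() if rate > 0.10]
high_volume = [name for name, ops, _ in rows if ops > 50]
r = pearson([ops for _, ops, _ in rows], [rates[name] for name, _, _ in rows])

print("Over 10% mortality:", over_10_percent)
print("Over 50 operations:", high_volume)
print(f"Correlation between volume and mortality: {r:.2f}")
```

A raw correlation like this is a blunt instrument: small hospitals have noisy rates, so a weak or absent pattern here does not settle the question either way.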

The journalists were also unable to tell me how to find the pattern. They referred me instead to Peter Holt, an academic surgeon who’d analysed the data for them. Eventually I was able to piece together a rough picture of what was done, and after a few days, more details were posted online. It was a pretty complicated analysis, with safety plots and forest plots. I think I buy it as fair.

So why does it matter, if the conclusion is probably valid? Because science is not a black box. There is a reason why people generally publish results in academic journals instead of newspapers, and it’s got little to do with “peer review” and a lot to do with detail about methods, which tell us how you know if something is true. It’s worrying if a new data analysis is published only in a newspaper, because the details of how the conclusions were reached are inaccessible. This is especially true if the analysis is so complicated that the journalists themselves did not know about it, and could not explain it, and this transparency is especially important if you’re seeking to influence policy. The information needs to be somewhere. Open data – people posting their data freely for all to re-analyse – is the big hip new zeitgeist, and a vitally important new idea. But I was surprised to find that the thing I’ve advocated for wasn’t enough: open data is sometimes no use unless we also have open methods.

A tour of climate data at CKAN

February 24, 2010 in CKAN, External, Open Science

The following guest post is by David Jones who is, among other things, a curator of the climate data group on CKAN (the OKF’s open source registry of open data) and co-founder of Clear Climate Code (which was previously featured on our blog here and here).

Take a tour of some of the additions we’ve made to the climate data group at OKF’s CKAN.

The Mauna Loa observatory, Hawaii, has the longest period of continual recording of the amount of CO2 (carbon dioxide) in the air, the airborne fraction. The data are available in CKAN, and here they are in chart form: co2

CO2 is a relatively well mixed gas in the atmosphere, but even so, it would be unwise to rely on a single location for measurements. The Carbon Dioxide Analysis Center maintain a global network of stations collecting CO2. In fact Mauna Loa is not used for the global average because its height gives it a CO2 fraction that is lower than the surface average by 1 to 2 ppm.

What about reconstructing historical CO2 levels? One source is ice cores. On the Antarctic ice sheet snow falls every year and never melts. New snow falls on the snow from the previous year, building up in layers. Eventually the snow builds up to a thickness where it compresses the snow beneath it into solid ice, ice that is impermeable to gas. At that point, air at the surface becomes trapped in little bubbles in the ice.

vostok400

By drilling down through the ice we can reach older and older ice. Vostok Station sits on the Antarctic ice sheet, above Lake Vostok. Researchers have drilled down through the ice to a depth of 3623m and reached ice that is about 400,000 years old. Drilling was stopped, tantalisingly close to the lake surface, because the Scientific Committee on Antarctic Research (SCAR) raised concerns that life in Lake Vostok, potentially forming a unique biome, may be contaminated.

By measuring the CO2 content of the gas trapped in the ice core, we can reconstruct the historical levels. Of course, the data are in CKAN. Here’s a chart: vostokco2

(Other data from the Vostok ice core are also available)

Vostok is well known for being the coldest place on Earth. Vostok Station was established in 1957, and since that time weather records have been kept by researchers working there. The temperature record for Vostok Station is just one of the many thousands of records made available in the Global Historical Climate Network (GHCN). Here’s Vostok’s temperature record for the last three decades (more data are available, but three decades fits nicely):

vostoktemp

The different colours are because for a particular station the whole series can be comprised of individual records that cover only part of the range (due to different equipment, different reporting procedures, and so on); each record gets a different colour (unfortunately the records often overlap, confusing the colours).

The station records in GHCN, sometimes augmented by other similar datasets such as SCAR’s READER, are used to reconstruct global temperature anomalies, like the Japan Meteorological Agency Global Surface Temperature Anomaly, HadCRUT3 from the UK’s Met Office and the University of East Anglia’s Climate Research Unit, and, perhaps most famously, GISTEMP from NASA:

The seasonal cycle is evident in the Mauna Loa CO2 data (it’s caused by photosynthesis of plants, mostly in the Northern Hemisphere, drawing down more CO2 from the atmosphere), and also in the Vostok temperature record. In some cases the seasonal signal and the long term trend are easily visible; in others it takes effort to recover the long term trend. The GISTEMP graph can be thought of as recovering the long term trend from many thousands of individual station records.
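
One simple way to separate a seasonal cycle from a long term trend is to average over whole years, so that each full cycle cancels out. Here is a minimal sketch in Python using a synthetic series (a straight-line trend plus a repeating twelve-month wiggle), not any of the real datasets. The function name `twelve_month_means` is mine, and this is far simpler than what analyses such as GISTEMP actually do.

```python
import math

def twelve_month_means(monthly):
    """Mean of each consecutive 12-month window; a full seasonal cycle averages out."""
    return [sum(monthly[i:i + 12]) / 12 for i in range(len(monthly) - 11)]

# Synthetic monthly series (not real data): linear trend plus a seasonal cycle.
trend_per_month = 0.1
series = [
    300 + trend_per_month * t + 3 * math.sin(2 * math.pi * t / 12)
    for t in range(60)
]

smoothed = twelve_month_means(series)
# Month-to-month change in the smoothed series; this should recover the trend.
steps = [later - earlier for earlier, later in zip(smoothed, smoothed[1:])]
print(f"Recovered trend per month: {steps[0]:.3f}")
```

Because every window contains exactly one of each calendar month, the seasonal term sums to zero and only the trend survives. Real records have gaps and noise, so this only works this cleanly on an idealised series.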

Another well known climate data series with both seasonal and long term trends is the National Snow and Ice Data Centre’s Arctic Sea Ice Extent:

arcticseaice

The seasonal cycle in the Arctic sea ice is of course due to summer melt and winter freeze.

Another data set available as a CKAN package is the Colorado Sea Level data. This is a measurement of global mean sea level obtained by a series of satellites: TOPEX, then JASON-1 and JASON-2. The next satellite in this series is JASON-3, and it has just secured funding from a European consortium. There is a seasonal cycle in this data too:

satellitemeansealevel

Now I think the seasonal cycle in mean sea level is due to the thermal expansion of the oceans. In the summer the ocean warms and expands; both the northern hemisphere and the southern hemisphere are affected but the effects don’t balance so there is a seasonal cycle. Please contact your tour guide (leave a comment!) if you can find a reliable explanation (I looked and was unable to find a good source).

The satellite era has been tremendously useful for earth observation and climate science, but of course the records from satellites are short. For example, the satellite data for sea level only goes back to 1993. Since climate is often a matter of looking at events on long timescales we often have to find longer series from other measurements.

The UK’s Natural Environment Research Council maintains the Permanent Service for Mean Sea Level at the Proudman Oceanographic Laboratory. Using a global network of about 2000 tide gauges they can reconstruct a global mean sea level record that documents sea level rise since 1880. Here’s a chart of the data available from the CKAN package:

meansealevel

The tour is coming to an end now. The data that I’ve shown here are just a selection of the data available, both generally and in CKAN. Often there is much more detailed data (and more detailed science) behind each of these datasets, but one of the reasons I’ve selected many of these datasets is that they are key indicators. They are the headline figures that show increased CO2 emissions, rising sea levels, decreasing Arctic sea ice. These are the data that a curious member of the public will want to engage with, and for that reason it is important that the data are freely accessible.

If you’d like to contribute to the climate data group then please drop us an e-mail. If you’d like to continue the tour on your own you might want to try the Red Sea Sea Level records and the Paleo Tree Ring records which are just around the corner in the Open Archeology wing.

If you’re interested in promoting open data in climate science, you may wish to endorse the Panton Principles, which were launched last week.


Clear Climate Code, and Data

January 28, 2010 in Exemplars, External, Open Data, Open Science

The following guest post is by David Jones who is, among other things, a curator of the climate data group on CKAN (the OKF’s open source registry of open data) and co-founder of Clear Climate Code (which we blogged about back in 2008).

Clear Climate Code have been working on ccc-gistemp, a project to reimplement NASA’s GISTEMP in clear Python. GISTEMP is a global historical temperature analysis; it produces, amongst other things, graphs like this, that tell you whether the Earth is getting warmer or cooler:

Official GISTEMP global anomaly.

Because this graph is important for studying the world’s climate (and determining the signature of global warming), there is a lot of public discussion about where this data comes from. The raw data underlying the graph is surface weather station temperature records. The raw data is processed to produce the data for the graph:

gistemp

The box in the middle, labelled “GISTEMP”, is a process that converts the raw station records into the data for the graph on the right, which is the global temperature anomaly. There are descriptions of this process available, for example Hansen and Lebedeff, 1987. A description is one thing, but it might not tell you everything you need to know. Perhaps the description is sufficiently clear and accurate for you to reproduce the process, perhaps not. The ultimate authority on the process is the source code that implements it, because it’s the source code that is executed in order to produce the processed data. So if you want to know exactly what the process involves, you have to get hold of the source code.
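
To give a flavour of what “converting station records into an anomaly” means, here is a minimal sketch in Python. It is emphatically not the GISTEMP algorithm and not code from ccc-gistemp; it only illustrates the basic idea of an anomaly, which is a temperature expressed relative to the average for the same calendar month over a baseline period. The function name, the data layout and the station values are all hypothetical.

```python
def monthly_anomalies(temps, baseline_years):
    """Convert one station's temperatures into anomalies.

    temps maps (year, month) -> temperature in degrees C.
    Returns (year, month) -> temperature minus that calendar month's
    mean over the baseline years.
    """
    baseline = {}
    for month in range(1, 13):
        values = [temps[(y, month)] for y in baseline_years if (y, month) in temps]
        if values:  # skip months with no baseline data at all
            baseline[month] = sum(values) / len(values)
    return {
        (y, m): t - baseline[m]
        for (y, m), t in temps.items()
        if m in baseline
    }

# Hypothetical station (not real data): a fixed seasonal cycle,
# with every month 0.5 degrees warmer in 2002.
seasonal = [-5, -3, 0, 5, 10, 15, 18, 17, 12, 6, 0, -4]
temps = {
    (y, m): seasonal[m - 1] + (0.5 if y == 2002 else 0.0)
    for y in (2000, 2001, 2002)
    for m in range(1, 13)
}

anoms = monthly_anomalies(temps, baseline_years=(2000, 2001))
print(anoms[(2002, 1)], anoms[(2002, 7)])
```

Working in anomalies removes the seasonal cycle and lets stations in very different climates be compared on the same scale. The real process has many more steps, which is exactly why the source code matters.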

In effect it is the source code that adds value to the raw data to produce processed data. So in a sense, the value of the processed data is embodied in the source code. That’s what makes the source code important.

The source code for GISTEMP is written mostly in Fortran by scientists at NASA, and is available from them. This source code is the working code used by the NASA scientists; it is not necessarily the best source code for explaining how the process works (to an interested and competent member of the general public). There is the question of whether NASA, a publicly funded body, should be paying someone to write code that makes a better tool for communicating with the public (for example by writing better documentation, or writing it in a more exemplary style). I am not going to address that question. The source code NASA use is the source code we have right now.

Our goal at Clear Climate Code is to take this code and produce a new version that is clearer, but does the same thing. We have taken great steps towards this goal: we have recently released a version which is all in Python and which reproduces NASA’s results exactly. We think much of this code is already a great deal clearer than the starting material, but we continue to make it clearer. Of course we would welcome your support. If you want to help, please join our mailing list, or you can follow our progress at our blog and on twitter.

The reasons Clear Climate Code chose Python as the implementation language for ccc-gistemp are: accessibility, clarity, and familiarity. By accessible I mean that there is a large community of Python programmers, but also there are several tutorials and other materials for learning Python should you be motivated. Python is used to teach undergraduates programming. Python is relatively clear; it’s deliberately designed to be free of the clutter that imperils other programming languages. It’s certainly possible for people who are not professional programmers to create small programs in Python, and examine and modify existing Python programs. And lastly, it’s familiar; Nick Barnes and I already knew Python when we started the project. This seems like a trivial consideration, but in fact Clear Climate Code is an unpaid project and it’s pretty easy to come up with reasons to do something else instead, so the fact that we already knew Python was important.

Hopefully Clear Climate Code illustrates how both code and data are central to the public understanding of science. For an issue like global warming it is absolutely crucial that the public are involved. CKAN’s climate data group is a place where non-specialists can access scientists’ data more easily, and hopefully use it to innovate, do their own hobby science, or create visualisations to better communicate with the public. I’m hoping to add more data sources to the climate data group in the near future. If you’re interested in adding more data to this group, please get in touch.