The following guest post is by David Jones, who is, among other things, a curator of the climate data group on CKAN (the OKF’s open source registry of open data) and a co-founder of Clear Climate Code (which we blogged about back in 2008).
Clear Climate Code have been working on ccc-gistemp, a project to reimplement NASA’s GISTEMP in clear Python. GISTEMP is a global historical temperature analysis; it produces, amongst other things, graphs like this one, which tell you whether the Earth is getting warmer or cooler:
Because this graph is important for studying the world’s climate (and for determining the signature of global warming), there is a lot of public discussion about where its data comes from. The raw data underlying the graph consists of surface weather station temperature records. This raw data is processed to produce the data for the graph:
The box in the middle, labelled “GISTEMP”, is a process that converts the raw station records into the data for the graph on the right, which is the global temperature anomaly. There are descriptions of this process available, for example Hansen and Lebedeff, 1987. A description is one thing, but it might not tell you everything you need to know. Perhaps the description is sufficiently clear and accurate for you to reproduce the process; perhaps not. The ultimate authority on the process is the source code that implements it, because it is the source code that is executed in order to produce the processed data. So if you want to know exactly what the process involves, you have to get hold of the source code.
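To make “temperature anomaly” concrete, here is a minimal sketch of the underlying idea (this is not code from GISTEMP or ccc-gistemp; the record layout is invented for illustration): each monthly reading is expressed as a departure from that calendar month’s average over a fixed baseline period, which for GISTEMP is 1951–1980.

    BASELINE = (1951, 1980)  # GISTEMP's baseline period

    def monthly_anomalies(records):
        """records maps (year, month) -> mean temperature in degrees Celsius
        for a single station. Returns a dict mapping (year, month) -> anomaly
        relative to the baseline mean for that calendar month."""
        # Baseline mean for each calendar month (1..12).
        baseline_mean = {}
        for month in range(1, 13):
            values = [t for (year, m), t in records.items()
                      if m == month and BASELINE[0] <= year <= BASELINE[1]]
            if values:
                baseline_mean[month] = sum(values) / len(values)
        # Subtract the relevant baseline mean from every reading.
        return {(year, month): t - baseline_mean[month]
                for (year, month), t in records.items()
                if month in baseline_mean}

The real analysis does a great deal more than this (quality control, combining nearby stations, gridding, area weighting), but every step is, in the end, code of this kind operating on the raw records.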
In effect it is the source code that adds value to the raw data to produce processed data. So in a sense, the value of the processed data is embodied in the source code. That’s what makes the source code important.
The source code for GISTEMP is written mostly in Fortran by scientists at NASA, and is available from them. This source code is the working code used by the NASA scientists; it is not necessarily the best source code for explaining how the process works (to an interested and competent member of the general public). There is the question of whether NASA, a publicly funded body, should be paying someone to turn the code into a better tool for communicating with the public (for example by writing better documentation, or by writing it in a more exemplary style). I am not going to address that question. The source code NASA use is the source code we have right now.
Our goal at Clear Climate Code is to take this code and produce a new version that is clearer, but does the same thing. We have taken great steps towards this goal: we have recently released a version which is all in Python and which reproduces NASA’s results exactly. We think much of this code is already a great deal clearer than the starting material, but we continue to make it clearer. Of course we would welcome your support. If you want to help, please join our mailing list, or follow our progress on our blog and on Twitter.
The reasons Clear Climate Code chose Python as the implementation language for ccc-gistemp are accessibility, clarity, and familiarity. By accessible I mean that there is a large community of Python programmers, and also that there are plenty of tutorials and other materials for learning Python should you be motivated to do so. Python is used to teach undergraduates programming. Python is relatively clear; it’s deliberately designed to be free of the clutter that encumbers other programming languages. It’s certainly possible for people who are not professional programmers to create small programs in Python, and to examine and modify existing Python programs. And lastly, it’s familiar: Nick Barnes and I already knew Python when we started the project. This seems like a trivial consideration, but in fact Clear Climate Code is an unpaid project and it’s pretty easy to come up with reasons to do something else instead, so the fact that we already knew Python was important.
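To give a flavour of what we mean by clarity, a fragment in the spirit of ccc-gistemp (this is not taken from the project; the names and the 9999 missing-value convention are purely illustrative) might look like this:

    MISSING = 9999  # illustrative sentinel for an absent monthly value

    def annual_mean(monthly, min_months=9):
        """Return the mean of the 12 monthly values, ignoring missing ones,
        or None if fewer than min_months values are present."""
        present = [v for v in monthly if v != MISSING]
        if len(present) < min_months:
            return None
        return float(sum(present)) / len(present)

The hope is that someone who has never written Fortran, and perhaps never programmed at all, can read a function like that and check it against the published description.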
Hopefully Clear Climate Code illustrates how both code and data are central to the public understanding of science. For an issue like global warming it is absolutely crucial that the public are involved. CKAN’s climate data group is a place where non-specialists can access scientists’ data more easily, and hopefully use it to innovate, do their own hobby science, or create visualisations to better communicate with the public. I’m hoping to add more data sources to the climate data group in the near future; if you’re interested in adding more data to this group, please get in touch.
Given you have a very good understanding of the code now, does the Hansen and Lebedeff documentation accurately describe it or do you think better documentation is required?
A while ago on the CCC blog I wrote:
Is this an accurate description? In my opinion it is accurate enough to reproduce a result that is “substantially the same” as the GISTEMP result: in other words, well within any meaningful error bounds. To match the GISTEMP result as closely as we have, you probably need access to the actual code. For example, when combining stations to make a gridded data set, stations are combined from longest record to shortest, but this leaves unspecified what happens when several stations have records of the same length. The order matters (a tiny bit), and to get as close to GISTEMP as we have, we had to reproduce their combining order exactly; this is only evident from the code, not the paper.
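To illustrate why that ordering detail matters, here is a sketch (not the actual ccc-gistemp code; the station attributes and the tie-break are invented for illustration) of choosing a combining order for the stations contributing to one grid cell:

    def combining_order(stations):
        """Order stations longest record first. Sorting on length alone leaves
        equal-length records in an order the paper does not specify, so a
        secondary key (here the station identifier) is needed to make the
        order deterministic. Matching GISTEMP exactly means reproducing
        whatever order its code actually uses."""
        return sorted(stations, key=lambda s: (-len(s.records), s.id))

For an independent analysis any documented tie-break would do; the point is simply that the paper leaves it open, so two faithful implementations can differ slightly unless both read the code.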
Accuracy is one thing, but we strive for clarity, and we recognise that academic journals are not (yet?) the place where scientists choose to make their results clear to the public.
Better documentation is required, and we intend to write it. Do you want to help?
David – where do you get your weather station raw data from to test the results of your code against that of the Fortran GISTEMP? Does someone collate the raw weather station data? Lisa