**The following interview was [published earlier this week](http://www.ejc.net/magazine/article/journalism_meets_big_data/) by the [European Journalism Centre](http://www.ejc.net/) in the Netherlands.**
In recent years the practice and philosophy of making data freely available for use and re-use has been taken up by many different institutions, from national governments to international organizations such as the World Bank. Journalists too have started to tap into the potential of freely accessible data.
Now, with major news providers such as The Guardian beginning to open up the data they work with, the opportunities for producing visualisations or web mash-ups based on this information readily exist.
Increasing online open data availability puts the processing power into journalists’ hands; rather than relying on outside specialists such as policy makers to provide the insights, raw data can now be analysed and interpreted in newsrooms.
This is the emerging field of data-driven journalism, in which journalists gather, analyse and visualise ‘big’ data and combine it with compelling, credible storytelling.
The European Journalism Centre (EJC) questioned Jonathan Gray, community coordinator at the Open Knowledge Foundation, a UK non-profit organisation dedicated to promoting open knowledge internationally, about how journalists can exploit the potential of open data. Jonathan is also the founder of WhereDoesMyMoneyGo.org, a web application which allows users to explore and visually represent UK public spending.
EJC: Where and how can journalists access free and open data sources on the Internet?
Jonathan Gray: There is currently a huge wealth of freely accessible data on the Internet, scattered over web real estate belonging to government departments, academics, NGOs, news organisations, technologists and others. Yet despite the fact that there is so much information out there, it is not always easy to find the exact information you are looking for, to get hold of raw data sources (as opposed to reports, articles and other material about these sources), or to find out whether or not you can freely reuse the information (just because something is accessible online does not always mean you’re allowed to reuse or republish it).
To address these issues we started the CKAN project, which aims to make it easy to find collections of documents and datasets which anyone can freely re-use for any purpose.
CKAN is now being used by the UK government in its data.gov.uk project, and we are helping open government data advocates around the world set up new instances to track sources of open data in over a dozen countries. We’re also working hard to make sure that other data catalogues in other countries are interoperable with CKAN, and to promote uptake of the project in different communities creating and using public datasets – from geospatial data analysts to civic hackers, from climate modellers to semantic web technologists. We hope that CKAN will become an international, multilingual, distributed one-stop shop for open data.
If you are interested in using this technology to set up a data registry in your country (whether you’re an advocate or a civil servant), we’d love to hear from you.
EJC: Would journalists need special skills, like programming, to explore and analyse these datasets?
JG: Having some programming skills is no doubt useful for journalists whose work may depend on extracting, analysing, and understanding information in large databases. This may be particularly valuable for investigative journalists trawling public sources to build up a richer picture of complex chains of events or states of affairs. However, new digital technologies are making it increasingly easy for journalists without programming skills to explore and analyse datasets.
Social web services such as Many Eyes or Google Chart Tools mean that anyone can visually represent data sets ‘on the fly’. Free and open source desktop applications also enable people to drill down into databases in increasingly sophisticated ways.
While these kinds of tools can go a long way, there are obviously still limits as to what non-technical journalists can do. A good example of this is the recent release of the COINS data in the UK – a huge release of information on public spending, unprecedented in its scope and detail. We had journalists from many different national newspapers and news outlets calling us to ask what was in the data, and in particular whether we had found any good stories. At the time several people expressed disappointment with the release, basically complaining that getting useful or interesting information from the database was like getting blood from a stone. The good news is that since then we’ve seen at least half a dozen new projects which let people sort, search, explore and comment on the data, and no doubt we will see many more in the next few months and years.
To quote Peter Murray-Rust, OKF Advisory Board member and tireless advocate of open data in chemistry, “Data is difficult”. Whether we’re talking about statistical data, environmental data or spending data, the chain from production to presentation can often be long and complicated. The more people involved in the process of cleaning up, checking, interpreting, and visually representing datasets the better. Hence at the OKF we are strong advocates of a community-driven approach, involving experts from across the board, as well as interested and motivated citizens.
We hope to move to a situation where rather than a single official point of contact for datasets (whether from national governments, international bodies or NGOs), we have an ecosystem of open data with lots of datasets connected together, accessible via many different interfaces and with plenty of tools to help people understand and interact with the data.
Rather than the traditional treasure hunt (looking for data buried deep on an official website or in a PDF document, working out how to use the shiny front-end interface, and so on), we hope there will be more of a two-way relationship with the information around us: delivery on demand according to interests, read/write access, commenting, telling stories with data, enabling people to embed dynamic visual representations which link back to source, and so on. By explicitly opening up datasets for others to freely reuse without restriction we allow a thousand flowers to bloom.
While the core task of journalists will presumably continue to be much the same, i.e. interpreting, communicating and framing the information in meaningful narratives, with comment and analysis, I think the precise division of labour between journalists and others remains to be seen. Hopefully we’ll see some boundaries begin to blur.
EJC: What are the common challenges associated with data usage, production and presentation? And what is there to learn for journalists who want to go in this direction from your perspective?
JG: As I alluded to above, piecing together and interpreting data can be hard. (Getting data in the first place can also be hard, but that’s another story!) There are lots of great examples of this, from Ben Goldacre’s analysis of a recent Guardian article on NHS death rates to dodgy newspaper graphs (see this presentation from Simon Field, CTO of the UK’s Office for National Statistics).
We’ve experienced a number of difficulties when trying to make sense of UK spending data as part of the Where Does My Money Go? project, including (but by no means limited to) missing data, different figures from different government sources, retired schema, absent keys, changing categories, multiple different codes, data delivered on thousands of sheets of paper, and so on.
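As a rough illustration of two of the problems listed above (missing data, and different figures from different sources), the following Python sketch compares two sets of spending figures. The department names, amounts, the `compare_sources` helper and the 1% tolerance are all hypothetical and chosen only for illustration; this is not the method used by the Where Does My Money Go? project.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical figures (in millions of pounds) for the same departments,
# as reported by two imaginary sources. None marks a missing value.
source_a: Dict[str, Optional[float]] = {
    "Health": 100.0,
    "Education": 80.0,
    "Transport": None,
    "Defence": 40.0,
}
source_b: Dict[str, Optional[float]] = {
    "Health": 100.0,
    "Education": 85.0,
    "Transport": 20.0,
}


def compare_sources(
    a: Dict[str, Optional[float]],
    b: Dict[str, Optional[float]],
    tolerance: float = 0.01,
) -> Tuple[List[str], List[Tuple[str, float, float]]]:
    """Return (missing, mismatched) reports for two sets of figures.

    A category is 'missing' if either source lacks a value for it, and
    'mismatched' if the two values differ by more than `tolerance`
    (as a fraction of the larger value).
    """
    missing: List[str] = []
    mismatched: List[Tuple[str, float, float]] = []
    for category in sorted(set(a) | set(b)):
        value_a = a.get(category)
        value_b = b.get(category)
        if value_a is None or value_b is None:
            missing.append(category)
        elif abs(value_a - value_b) > tolerance * max(abs(value_a), abs(value_b)):
            mismatched.append((category, value_a, value_b))
    return missing, mismatched


missing, mismatched = compare_sources(source_a, source_b)
print("Missing in at least one source:", missing)
print("Figures that disagree:", mismatched)
```

A check like this only flags where sources diverge; working out why they diverge still means talking to the people who produced the data.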
Before this project I never knew working out where public funds are spent could be so much like theology, with some intangible, ineffable base layer (the actual spending) and many different denominations of interpreting this for different purposes, with different rules, different explanations, different calendar dates and different terminology. Luckily it’s all falling into place, but not without talking to lots of helpful government folks and outside experts.
To understand a dataset is often not just to understand a bunch of facts about the world, but to understand all kinds of assumptions and processes involved in the production of that data – which can be critical, for example, in understanding whether the books have been cooked. Another good reason why it’s better to have many pairs of eyes and hands!
In terms of things for journalists to learn, I guess I’d like to see us move from a situation where we consult data on a case-by-case basis in order to provide facts for stories, to one where we allow stories to emerge from the data (i.e. data-driven journalism!). For example, more systematically following transcripts and drilling down into datasets from national parliaments and international bodies such as the EU or the UN. Also being able to give events a broader sense of context, for example by being able to link directly to documents and datasets and letting people discuss and comment on these.
I think visualisation tools may turn out to be very powerful with respect to giving people a ‘bigger picture’. We are currently taking baby steps, but I’m sure things will pick up pace as the technologies begin to get more robust, and we start to build a visual language for interfacing with datasets, similar to the kinds of graphical user interfaces we are all now used to with our operating systems, or on web browsers and web applications.
EJC: Do you think journalists are capable of producing offerings similar to your Where Does My Money Go? project or to CKAN?
JG: Yes of course! There is no compelling reason why journalists shouldn’t continue to contribute to these kinds of projects. I guess it’s more a question of people’s time and priorities.
Starting to have more provision for learning to code on journalism courses might be a good way to systematically encourage journalists to start working on and engaging with these kinds of projects. But I think it’s often also to do with motivation. Many really good programmers I know had no formal training, but learned by having a project they wanted to do and teaching themselves with input from various free/open source software communities. Formal training is often neither necessary nor sufficient to code on projects like Where Does My Money Go, They Work For You, and so on – it’s far more important to set aside the time, and stick at it.
Having more widespread acceptance of this sort of thing as something which is ‘legitimate’ for journalists to undertake may well help people to justify spending their time in this way (to themselves, to employers, etc). There already seems to be growing interest here – e.g. from BBC, Guardian, Telegraph, New York Times, and so on – so hopefully it won’t be long before programming comes to be accepted as a valuable part of a journalist’s repertoire.
EJC: Your work provides a great example, though we think one aspect is missing: storytelling. Would you consider presenting short stories, interviews, videos on top of your data in the future?
JG: Yes, definitely. I think one of the main things that sets Hans Rosling’s Gapminder apart from other similar projects is his wonderfully entertaining set of lectures where he talks about an issue (population growth, development, health, quality of life, etc) with reference to Trendalyzer, the Gapminder visualisation tool. We’re really proud to have Hans on the Open Knowledge Foundation’s advisory board, and I think he’s someone who really understands the relationship between stories and interactive visualisation tools. On the one hand his narrative makes the colourful bubbles meaningful, on the other hand the colourful bubbles are indispensable in his lectures! A good illustration of why a picture is worth a thousand words is to watch any of his videos where he narrates hundreds of bubbles moving around the screen, highlighting a significant move like a football commentator highlighting something in a game. While the words may aid understanding and capture key aspects, they are no substitute for the image. The lectures are also clever because after you’ve watched a few, you have a pretty good idea how to use the tool to ask your own questions and build your own visualisations.
These kinds of tools are new, and are often not particularly intuitive to use, partly because we have no ‘accepted’ way of doing things yet. Tutorials and help menus can be boring, and the video lectures are a great way of teaching people to use the tool without them noticing! We’d definitely like to do more of this kind of thing with our own projects. In general I think that storytelling will play a very important part in new visualisation tools, from creating narratives within the visual design and the user’s experience, through to being able to seamlessly embed and integrate visualisations within external articles, texts and other media.