What data can and cannot do

*[Mining for Information](http://www.flickr.com/photos/jdhancock/3386035827/in/photostream/) by JD Hancock on Flickr (CC BY)*

In the early days of photography there was a great deal of optimism around its potential to present the public with an accurate, objective picture of the world. In the 19th century, pioneering photographers (later to be called photojournalists) were heralded for their unprecedented documentary depictions of war scenes in Mexico, Crimea and across the US. Over a century and a half later – after decades of advertising, propaganda, and PR, compositing, enhancement and outright manipulation – we are more cautious about seeing photographs as impartial representations of reality. Photography has lost its privileged position in relation to truth. Photographs are just a part of the universe of evidence that must be weighed up, analysed, and critically evaluated by the journalist, the analyst, the scholar, the critic, and the reader.

The current wave of excitement about data, data technologies and all things data-driven might lead one to suspect that this machine-readable, structured stuff is a special case. The zeitgeist at times bears an uncanny resemblance to the optimism of a loose-knit group of scientists, social scientists, and philosophers at the start of the 20th century, who thought they could eschew value-laden narratives for an objective, fact-driven model of the world. “Facts are sacred” says the Guardian Datablog and “for a fact-based worldview” says Gapminder. The thought of tethering our reportage, analyses and reflection to chunks of data-given truth is certainly consoling. But the notion that data gives us special direct access to the way things are is – for the most part – a chimera.

Data can be an immensely powerful asset, if used in the right way. But as users and advocates of this potent and intoxicating stuff we should strive to keep our expectations of it proportional to the opportunity it represents. We should strive to cultivate a critical literacy with respect to our subject matter. While we can’t expect to acquire the acumen or fluency of an experienced statistician or veteran investigative reporter overnight, we can at least try to keep various data-driven myths from the door. To that end, here are a few reminders for lovers of data:

* **Data is not a force unto itself**. Data clearly does not literally create value or change in the world by itself. We talk of data changing the world metonymically – in more or less the same way that we talk of the printing press changing the world. Databases do not knock on doors, make phone calls, push for institutional reform, create new services for citizens, or educate the masses about the inner workings of the labyrinthine bureaucracies that surround us. The value that data can potentially deliver to society is to be realised by human beings who use data to do useful things. The value of these things is the result of the ingenuity, competence and (perhaps above all) hard work of human beings, not something that follows automatically from the mere presence and availability of datasets over the web in a form which permits their reuse.

* **Data is not a perfect reflection of the world**. Public datasets (unsurprisingly) do not give us perfect information about the world. They are representations of the world gathered, generated, selected, arranged, filtered, collated, analysed and corrected for particular purposes – purposes as diverse as public sector accounting, traffic control, weather prediction, urban planning, and policy evaluation. Data is often incomplete, imperfect, inaccurate or outdated. It is more like a shadow cast on the wall, generated by fallible human beings, refracted through layers of bureaucracy and official process. Despite this partiality and imperfection, data generated by public bodies can be the best source of information we have on a given topic and can be augmented with other data sources, documents and external expertise. Rather than taking them at face value or as gospel, datasets may often serve as an indicative springboard, a starting point or a supplementary source for understanding a topic.

* **Data does not speak for itself**. Sometimes items in a database will stand by themselves, and do not require additional context or documentation to help us interpret them – for example, when we consult transport timetables to find out when the next train leaves. But often data will require further research and analysis in order to make sense of it. In many ways official datasets resemble official texts: we need to learn how to read and interpret them critically, to read between the lines, to notice what is absent or omitted, to understand the gravity and implications of different figures, and so on. We should not imagine that anyone can easily understand any dataset, any more than we would think that anyone can easily read any policy document or academic article.

* **Data is not power**. Data may enable more people to scrutinise official activities and transactions through more detailed, data-driven reportage. In principle it might help more people participate in the formulation of more evidence-based policy proposals. But the democratisation of information is different from the democratisation of power. Knowing that something is wrong or that there is a better way of doing things is not the same thing as being in a position to fix things or to effect change. For better or for worse, flawless arguments and impeccable evidence are usually not sufficient in themselves to effect reform. If you want to change laws, policies or practices it usually helps to have things like implacable advocacy, influential or high-profile supporters, positive press attention, hours of hard graft, bucketloads of cash and so on. Being able to see what happens in the corridors of power through public datasets does not mean you can waltz down them and move the furniture around. Open information about government is not the same as open government, participatory government or good government.

* **Interpreting data is not easy**. Furthermore, there is a tendency to think that the widespread availability of data and data tools represents a democratisation of the analysis and interpretation of data. With the right tools and techniques, anyone can understand the contents of a dataset, right? Here it is important to distinguish between different orders of activity: while it is easier than ever before to do things with data on computers and on the web (scrape it, visualise it, publish it), this does not necessarily entail that it is easier to know what a given dataset means. Revolutionary content management systems that enable us to search and browse legal documents don’t mean that it is easier for us to interpret the law. In this sense it isn’t any easier to be a good data journalist than it is to be a good journalist, a good analyst, a good interpreter. Creating a good piece of data journalism or a good data-driven app is often more like an art than a science. Like photography, it involves selection, filtering, framing, composition and emphasis. It involves making sources sing and pursuing truth – and truth often doesn’t come easily. Amid all of the services and widgets, libraries and plugins, talks and tutorials, there is no sure-fire technique for doing it well.

I’m sure as time goes by we’ll have a more balanced, critical appreciation of the value of data, and its role within our information environment. As former BBC journalist Michael Blastland [writes](http://datajournalismhandbook.org/1.0/en/understanding_data_1.html) in the recently published [Data Journalism Handbook](http://datajournalismhandbook.org/1.0/en/), “we need to be neither cynical nor naive, but alert”.

*This article was [originally posted on the Guardian Datablog](http://www.guardian.co.uk/news/datablog/2012/may/31/data-journalism-focused-critical) on 31st May 2012.*