One thing I find remarkable about many data projects is how much effort goes into developing a shiny front-end for the material. Now I’m not knocking shiny front-ends: they’re important for giving many users a way to get at the material (and very useful for demonstrating to funders where all the money went). But shiny front-ends (SFEs from now on) do have various drawbacks:

  • They often take over completely and start acting as a restriction on the way you can get data out of the system. (A classic example of this is the Millennium Development Goals website, which has lots of shiny ajax that actually makes it really hard to grab all of the data out of the system — please, please just give me a plain old csv file and a plain old url; see the sketch after this list.)
  • Even if the SFE doesn’t actually get in the way, they do take money away from the central job of getting the data out there in a simple form, and …
  • They tend to date rapidly. Think about what a website designed five years ago looks like today (hello css). Then think about what will happen to that nifty ajax+css work you’ve just done. By contrast, ascii text, csv files and plain old sql dumps (at least if done with some respect for the ascii standard) don’t date — they remain forever in style.
  • They reflect an interface-centric, rather than data-centric, point of view. This is wrong. Many interfaces can be written against that data (and not just a web one), and it is likely (if not certain) that a better interface will be written by someone else (albeit perhaps with some delay). Furthermore, the data can be used for many purposes other than read-only access. To summarize: the data is primary, the interface secondary.
  • Taking this issue further, for many projects, because the interface is taken as primary, the data does not get released until the interface has been developed. This can cause significant delay in getting access to that data.
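
To make the “plain old csv file and a plain old url” point concrete, here is a minimal sketch (standard-library Python only, with a hypothetical example url) of why raw, stable urls matter: any script can consume them in a few lines, with no browser and no javascript in the way.

    import csv
    import urllib.request

    # A plain, stable url pointing straight at a raw csv file
    # (hypothetical address -- substitute any real one).
    URL = "http://example.org/data/indicators.csv"

    with urllib.request.urlopen(URL) as response:
        text = response.read().decode("utf-8")

    # Plain csv needs nothing beyond the standard library to consume.
    rows = list(csv.reader(text.splitlines()))
    print(len(rows), "rows; columns:", rows[0])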

When such points are made people often reply: “But you don’t want the data raw, in all its complexity. We need to clean it up and present it for you.” To which we should reply:

“No, we want the data raw, and we want the data now”

Rufus Pollock is Founder and President of Open Knowledge.

30 thoughts on “Give Us the Data Raw, and Give it to Us Now”

  1. I agree with you that the argument that users need protecting from data is specious. I’ve heard it re-phrased as “the data is nuanced, proles could accidentally draw incorrect conclusions without proper statistical training”. Access to the same data is often given to a restricted group of agencies who will deliberately draw incorrect conclusions.

    What if it isn’t trivial to provide access to the raw data? Would it be acceptable for a data project to produce downloadable data and no SFE?

    This issue is close to my heart at the moment, in making the data from CrystalEye more easily available (http://wwmm.ch.cam.ac.uk/blogs/downing/?p=142).

  2. Jim: Good to hear your views on this (and I’ve just read your blog post). I’m glad you share my opinion that SFEs can often get in the way, and your point that they can even be used for intentional obstructionism, whether to close off data or simply out of paternalism, is spot on — something I should have emphasized even more in the post.

    On your second point: I definitely think it would be acceptable for a data project to produce downloadable data only and no SFE. After all, producers of software don’t feel they have to make a production installation of it available to the world before they distribute the tarball.

  3. This website: http://mdgs.un.org/unsd/mdg/

    The data page is at: http://mdgs.un.org/unsd/mdg/Data.aspx

    That page is pretty much run on javascript, which makes getting the data out in a machine-automated way rather difficult! However, once you’ve navigated to a given dataset the site is pretty good, as you can download the material in CSV, XML or Excel format (and that is a nice url which can be accessed by anything and does not require js). Furthermore, after a bit of hacking around in the source HTML of the original Data.aspx page (to get round the javascript) you can grab a full list of datasets. The results of this kind of effort can be seen at (produced back in April/May 2007 — see the __init__.py file for info on the scraping):

    http://knowledgeforge.net/econ/svn/trunk/data/mdg/
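
    For anyone wanting to reproduce that scraping step, here is a minimal sketch of the approach. It assumes the series ids can be picked out of the raw HTML with a regular expression; “SeriesId” is my guess at the parameter name, so check the live page source for the real one.

        import re
        import urllib.request

        DATA_PAGE = "http://mdgs.un.org/unsd/mdg/Data.aspx"

        html = urllib.request.urlopen(DATA_PAGE).read().decode("utf-8", "replace")

        # Pull candidate series ids straight out of the raw HTML,
        # sidestepping the javascript navigation. "SeriesId" is an
        # assumption about the markup, not a documented parameter.
        series_ids = sorted(set(re.findall(r"SeriesId=(\d+)", html)))
        print("found", len(series_ids), "series ids")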

    However, accessing the metadata for series (via http://mdgs.un.org/unsd/mdg/Metadata.aspx) was harder, since all of that was run by js and there seemed to be no easy way to find a nice url which just returned the metadata for a given series id.

    PS: Having just checked their site again it seems that you can now grab the full dataset in one go from:

    http://mdgs.un.org/unsd/mdg/Handlers/ExportHandler.ashx?Type=Csv
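
    In other words, one plain url now gets you everything. A minimal sketch of mirroring it (streamed to disk, since the full export may be large):

        import shutil
        import urllib.request

        EXPORT_URL = "http://mdgs.un.org/unsd/mdg/Handlers/ExportHandler.ashx?Type=Csv"

        # One plain url, one file: stream the full csv export straight to disk.
        with urllib.request.urlopen(EXPORT_URL) as response:
            with open("mdg_full.csv", "wb") as out:
                shutil.copyfileobj(response, out)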

    There is also a link from the metadata page entitled “View all metadata (printable)” which gives all the metadata in one go. However, the link is still operated via js, alas, and so is not machine-automatable, but it is still better: I can now just download one file by hand and write some parsing code to break it into individual items and match them back up with the series.
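
    That parsing step might look something like the sketch below. The heading pattern is pure assumption (I haven’t checked what actually delimits entries in the printable page), as is the saved filename, so both would need adjusting against the real downloaded file.

        import re

        # Split a hand-downloaded "View all metadata (printable)" dump
        # into per-series chunks. The heading regex is an assumption;
        # adjust it to whatever actually delimits entries in the file.
        HEADING = re.compile(r"^Series\s*:\s*(.+)$", re.MULTILINE)

        def split_metadata(text):
            """Return a dict mapping series name -> metadata text."""
            matches = list(HEADING.finditer(text))
            items = {}
            for i, match in enumerate(matches):
                start = match.end()
                end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
                items[match.group(1).strip()] = text[start:end].strip()
            return items

        with open("mdg_metadata.txt", encoding="utf-8") as f:
            by_series = split_metadata(f.read())
        print("parsed metadata for", len(by_series), "series")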

  4. Great article, very interesting.

    Just wanted to note that us “proles” can just as easily draw incorrect conclusions from refined (and possibly manipulated) data as raw data.

    Keep writing, thanks!

  5. I have browsed the website the admin has posted here, i.e. the millennium indicators site. An impressive job has been done there.
    In your post, the most intriguing point is that the data has primary importance, not the interface. I am developing a project and this line has really helped.
    Regards and thanks,
    Tina Mecloed.

  6. You make some great points, but I think there can be a balance. Most advanced users would like the data as soon as possible, raw or not, so that they can do their own manipulation. But for the rest of us, we like a little bit of pretty and some interpretation of the information.

Comments are closed.