Give Us the Data Raw, and Give it to Us Now
November 7th, 2007
One thing I find remarkable about many data projects is how much effort goes into developing a shiny front-end for the material. Now I’m not knocking shiny front-ends: they’re important for providing a way for many users to get at the material (and very useful for demonstrating to funders where all the money went). But shiny front-ends (SFEs from now on) do have various drawbacks:
- They often take over completely and start acting as a restriction on the way you can get data out of the system. (A classic example of this is the Millennium Development Goals website, which has lots of shiny ajax that actually makes it really hard to grab all of the data out of the system. Please, please just give me a plain old csv file and a plain old url.)
- Even if the SFE doesn’t actually get in the way, they do take money away from the central job of getting the data out there in a simple form, and …
- They tend to date rapidly. Think what a website designed five years ago looks like today (hello css). Then think about what will happen to that nifty ajax+css work you’ve just done. By contrast ascii text, csv files and plain old sql dumps (at least if done with some respect for the ascii standard) don’t date — they remain forever in style.
- They reflect an interface centric, rather than data centric, point of view. This is wrong. Many interfaces can be written to that data (and not just a web one) and it is likely (if not certain) that a better interface will be written by someone else (albeit perhaps with some delay). Furthermore the data can be used for many other purposes than read-only access. To summarize: The data is primary, the interface secondary.
- Taking this issue further, for many projects, because the interface is taken as primary, the data does not get released until the interface has been developed. This can cause significant delay in getting access to that data.
When such points are made people often reply: “But you don’t want the data raw, in all its complexity. We need to clean it up and present it for you.” To which we should reply:
“No, we want the data raw, and we want the data now”

November 8th, 2007 at 1:40 pm
[...] Us the Data Raw, and Give it to Us Now Rufus Pollock (OKFN) expresses a common sentiment Give Us the Data Raw, and Give it to Us Now One thing I find remarkable about many data projects is how much effort goes into developing a shiny front-end for the material. Now I’m not knocking shiny front-ends, they’re important for providing a way for many users to get at the material (and very useful for demonstrating to funders where all the money went). But shiny front ends (SFEs from now on) do have various drawbacks: [...]
November 8th, 2007 at 1:49 pm
[...] The “give me a big bundle of your raw data” request was one I’d heard before, from Rufus Pollock at OKFN, when I was working on the DSpace@Cambridge project, a topic he returned to yesterday, arguing that data projects should put making raw data available as a higher priority than developing “Shiny Front Ends” (SFE). [...]
November 8th, 2007 at 1:53 pm
I agree with you that the argument that users need protecting from data is specious. I’ve heard it re-phrased as “the data is nuanced, proles could accidentally draw incorrect conclusions without proper statistical training”. Access to the same data is often given to a restricted group of agencies who will deliberately draw incorrect conclusions.
What if it isn’t trivial to provide access to the raw data? Would it be acceptable for a data project to produce downloadable data and no SFE?
This issue is close to my heart at the moment, in making the data from CrystalEye more easily available (http://wwmm.ch.cam.ac.uk/blogs/downing/?p=142).
November 9th, 2007 at 8:47 am
Jim: Good to hear your views on this (and I’ve just read your blog post). I’m glad you share my opinion that SFEs can often get in the way and your point that they can even be used for intentional obstructionism, whether in order to close off data or just for paternalism, is spot on — something I should have emphasized even more in the post.
On your second point: I definitely think it would be acceptable for a data project to produce downloadable data only and no SFE. After all, producers of software don’t feel they’ve got to make a production installation of it available to the world before they distribute the tarball.
November 13th, 2007 at 2:16 pm
Which Millennium Development Goals website are you talking about?
November 13th, 2007 at 2:40 pm
This website: http://mdgs.un.org/unsd/mdg/
The data page is at: http://mdgs.un.org/unsd/mdg/Data.aspx
That page is pretty much run on javascript, which can make getting the data out in an automated way rather difficult! However, once you’ve navigated to a given dataset the site is pretty good, as you can download the material in CSV, XML or Excel format (and that is a nice url which can be accessed by anything and does not require js). Furthermore, after a bit of hacking around in the source HTML of the original Data.aspx (to get round javascript) you can grab a full list of datasets. The results of this kind of effort can be seen at (produced back in April/May 2007; see the __init__.py file for info on scraping):
http://knowledgeforge.net/econ/svn/trunk/data/mdg/
However, accessing the metadata for series (via http://mdgs.un.org/unsd/mdg/Metadata.aspx) was harder, since all of that was run by js and there seemed no easy way to find a nice url which just returned the metadata for a given series id.
PS: Having just checked their site again it seems that you can now grab the full dataset in one go from:
http://mdgs.un.org/unsd/mdg/Handlers/ExportHandler.ashx?Type=Csv
There is also a link from the metadata page entitled “View all metadata (printable)” which does give all the metadata in one go. However, the link is still operated via js, alas, and so is not machine automatable, but it is still better (I can now just download one file by hand and write some parsing code to break it into individual items to match it back up with the series).
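For what it’s worth, here is a minimal sketch of why a plain csv at a plain url is so convenient: a few lines of standard-library Python are enough to fetch it and split it up by series. This is not the scraping code from the repository mentioned above. The function names, the utf-8 default and the column name used in the usage note are assumptions made purely for illustration, and the export url may well no longer resolve.

```python
import csv
import io
import urllib.request

# URL quoted in the comment above; it may no longer be live.
EXPORT_URL = "http://mdgs.un.org/unsd/mdg/Handlers/ExportHandler.ashx?Type=Csv"


def fetch_text(url, encoding="utf-8"):
    """Download a url and return its body as text.

    The utf-8 default is an assumption; check what the server really sends.
    """
    with urllib.request.urlopen(url) as response:
        return response.read().decode(encoding)


def group_rows(csv_text, key_column):
    """Parse csv text (first row = header) and group rows by key_column.

    Returns a dict mapping each distinct key value to a list of row dicts.
    """
    groups = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups.setdefault(row[key_column], []).append(row)
    return groups
```

Usage would be something like `group_rows(fetch_text(EXPORT_URL), "SeriesCode")`, where "SeriesCode" is a hypothetical column name: look at the real header row first and substitute whichever column identifies the series.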
February 4th, 2008 at 9:47 pm
[...] This is an excellent particular case of a more general line we take at the OKF (e.g. see Give Us the Data Raw, and Give it to Us Now and Dead Knowledge: why being explicit about openness matters). Surely much is lost if data that could prove useful to cancer researchers sits collecting dust. Much could be gained if more trials data was open. [...]
April 17th, 2008 at 4:55 pm
[...] Give us the data raw and give it us now! [...]
August 18th, 2008 at 7:01 pm
[...] The impetus behind CKAN was to make it easier for people to find open data, as well as to make their data available to others (especially in a way that can be automated). If you use Google to search for data, you’re much more likely to find a page about data than you are to find the data itself. As a scientist, you don’t want to find just one bit of information — you want the whole set. And you don’t want shiny front ends or permission barriers at any point in the process. We’ve been making updates to CKAN so machines can better interact with the data, which makes it so people who want data don’t have to jump as many hurdles to get it. Ultimately, we want people to be able to request data sets and have the software automatically install any additions and updates on their computers. [...]
September 15th, 2008 at 11:11 am
[...] One of the active Open Knowledge Foundation projects is Open Economics. A substantial part of that effort ends up being data acquisition and ‘cleaning’: getting hold of economic data, parsing it into (computer) usable form and adding it to the Store. (Wouldn’t it be nice if that data was already nicely packaged up or at least in a decent raw form …). [...]
January 20th, 2009 at 3:50 pm
[...] Publish public information in a way which makes it easy to re-use. For example, publish in XML or Text/CSV, not PDF files from which data must be extracted. Allow direct, bulk downloading, rather than access through an API or piecemeal access via a web service. (For more on this see our post Give Us the Data Raw, and Give it to Us Now.) The Data Catalogue of Vivek Kundra’s Office in the District of Columbia is a great example of this. [...]
August 4th, 2009 at 5:50 pm
[...] government data in the UK. We are pleased to see that OKF Director Rufus Pollock’s call for Raw Data Now, which Sir Tim cited in a talk at TED, has played a prominent part in his [...]
August 8th, 2009 at 12:17 pm
[...] raw data and for the mashed up data. Data publishers (e.g., government departments) just produce raw data now, and consumer-facing sites (e.g., soccer sites) mash up data from many sources. I might talk about [...]
September 10th, 2009 at 1:12 pm
Hi! I was surfing and found your blog post… nice! I love your blog.
Cheers! Sandra. R.
October 13th, 2009 at 5:12 am
[...] a format which allows it to be re-used easily. As Open Knowledge Foundation Director Rufus Pollock wrote in 2007 (echoed by Tim Berners-Lee at TED): We want the data raw, and we want the data now! Possibly [...]
November 15th, 2009 at 5:07 pm
[...] or not there are facilities to download raw data in bulk - i.e. whether they easily allow users to directly download all the data in open, machine readable [...]
December 7th, 2009 at 1:22 pm
[...] in raw form - as OKF Director Rufus Pollock first blogged about two years ago last month, and alluded to by Sir Tim Berners-Lee at [...]
December 8th, 2009 at 9:27 pm
[...] pleased to see a focus on legal and technical reusability. It looks like the new data will be raw and machine readable as well as compliant with the Open Knowledge [...]
June 29th, 2010 at 12:10 am
Great article, very interesting.
Just wanted to note that us “proles” can just as easily draw incorrect conclusions from refined (and possibly manipulated) data as raw data.
Keep writing, thanks!
July 7th, 2010 at 6:47 am
[...] and citizens”. In times of war, the British government entrusts the young professor who said “We want the data raw and we want it now” with the agenda of the “liberation of [...]
August 2nd, 2010 at 10:02 am
[...] In 2009 Sir Tim Berners-Lee, cited by many as the innovative mind behind the development of the world wide web, encouraged the audience at one of his talks to join him in the chant “Raw Data Now!”. The arguments for open data (and even the chant itself!) had been developing over the previous few years. [...]
August 5th, 2010 at 4:13 pm
[...] has let me know that TBL’s shout of “Raw Data Now!” was a meme that started with this OKF post, and which Tim cites here. Influential folk these. [...]