The following post is by Friedrich Lindenberg (and on Twitter), originally posted here.
What is the next European dataset that investigative journalists should look at? Back in 2012 at the DataHarvest conference, Brigitte, investigative superstar from FarmSubsidy and co-host of the conference, had a clear answer: let’s open up TED (Tenders Electronic Daily). TED is the EU’s shared procurement mechanism, and is at the heart of the EU contracting process. Opening it up would shine a light on the key questions of who receives public money, and what they receive it for.
Her suggestion triggered a two-year project, OpenTED, which, as of last week, has finally matured into a useful resource for journalists and researchers. While gaps remain, we hope it will now start to be used by journalists, NGOs, analysts and citizens to get information on everything from large-scale trends to local municipal developments.
OpenTED
TED collects tender notices for large public projects so that companies from all EU countries can bid on those contracts. For journalists, there are many exciting questions such a database would be able to answer: What major projects are being announced? Who is winning the contracts for these projects, and is that decision made prudently and impartially? Who are the biggest suppliers in a particular country or industry?
The OpenTED project, started by Anders Pedersen and Joost Cassee, was initially born as an attempt to scrape the official TED web site. Soon, however, this first version of OpenTED was faced with a number of practical problems: the data was impossible for journalists to use without an interface, and the stuff was so messy that even Sunlight Foundation’s finance data genius Kaitlin Devine couldn’t help us pull apart the errors. To make things worse, in June 2013 the EU Publications Office updated the TED web site to make bulk scraping impossible – leaving us without a way to update the data.
We were out of options. To answer our questions, we were going to need to look at the database directly – not just at the website provided by the EU Publications Office.
Scraping with words
We decided to take a radical step for a bunch of nerds: talk to the EU. Speaking to the Publications Office’s unit lead, we were surprised to learn that they were already in the process of changing their licensing regime: while access to machine-readable data had been sold to re-users in the past, the plan was to make the data freely available in January 2014. Thanks, Neelie!
So, in early January, I pinged @EUTenders on Twitter, asking what happened to the publication plan. Expecting some sort of rejection, I was surprised to promptly receive a direct message with credentials for their raw data file server. The site offered DVD images for download, with XML dumps of TED’s data since 2011 – this was exactly what we were looking for.
Building a community
As DataHarvest 2014 approached, we decided to make an updated version of OpenTED, offering slices of the newly opened data in an accessible format (CSV) and in small portions, divided by country and year, so that journalists without database skills would be able to grab the data and explore it in a spreadsheet application.
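To give a rough idea of what slicing by country and year involves, here is a minimal sketch in Python. It is not the actual OpenTED export code: the column names (`country`, `year`, `authority`, `supplier`, `value_eur`) and the sample rows are made up for illustration, and the real TED data has a far richer structure.

```python
import csv
import io
from collections import defaultdict

# Hypothetical column names, chosen for illustration only.
FIELDS = ["country", "year", "authority", "supplier", "value_eur"]

# Made-up sample rows; note that a contract value can be missing.
SAMPLE = [
    {"country": "DE", "year": "2012", "authority": "Example City Council",
     "supplier": "Example GmbH", "value_eur": "120000"},
    {"country": "DE", "year": "2012", "authority": "Example Ministry",
     "supplier": "Sample AG", "value_eur": ""},
    {"country": "DE", "year": "2013", "authority": "Example City Council",
     "supplier": "Example GmbH", "value_eur": "80000"},
    {"country": "IE", "year": "2013", "authority": "Example County Council",
     "supplier": "Sample Ltd", "value_eur": "45000"},
]


def slice_by_country_and_year(rows):
    """Group rows into a dict keyed by (country, year)."""
    slices = defaultdict(list)
    for row in rows:
        slices[(row["country"], row["year"])].append(row)
    return dict(slices)


def to_csv_text(rows, fieldnames):
    """Render rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


slices = slice_by_country_and_year(SAMPLE)
for (country, year), rows in sorted(slices.items()):
    # Each slice would become one small file, e.g. DE-2012.csv.
    print(f"{country}-{year}.csv")
    print(to_csv_text(rows, FIELDS))
```

Each resulting slice is small enough to open directly in a spreadsheet application, which is the whole point of dividing the data up this way.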
The discussion of this release at DataHarvest focussed on the quality and completeness of the data. Many pieces of essential information are missing – including many contract values and supplier names. Additionally, the existing data is very messy, particularly when it comes to clearly identifying the public body and economic operator involved in a contract.
What now?
In many ways, the next step is up to the journalists who attended DataHarvest. We have, I think, created a rich resource for them to use in investigations and have set up a network of technologists who are ready to support analysis of the data. However, while we now have access to contract metadata (the recipient, amount and topic of EU contracts), it became clear during the workshop that this is not enough to answer the in-depth questions journalists want to ask. That requires access to the actual contract documents, which detail the terms and precise scope of the agreements our governments make on our behalf. For this, we need to insist on greater contracting transparency and tell our governments to Stop Secret Contracts.
Oh, and I’d love to know more about that 700 trillion Euro building they’re constructing in Galway…
RESOURCES
The data & tools:
- OpenTED data archive, the main result of this work.
- The Usual Suppliers – A little analysis application for journalists.
Some code:
- pudo/monnet – the scrapers for TED (and many other EU data sources)
- opented/opented – the CSV export site
- opented/the-usual-suppliers – the analysis mini-app linked above
This post is by a guest poster. If you would like to write something for the Open Knowledge Foundation blog, please see the submissions page.
As clarification, the reason I couldn’t perform any data analysis was because not enough data was being scraped to allow for a meaningful analysis. The awards data on the open ted site was not as parseable as some of the other basic data. So I have no idea if there were a lot of errors on ted, just that the data are very difficult to access in bulk.
-Kaitlin Devine