This guest blog post has been written by Marc Joffe, of Public Sector Credit Solutions.
Open government data is valuable only to the extent that it can be used cost-effectively. When governments provide “open data” in the form of voluminous PDFs, they offer the appearance of openness without its benefits.
In this situation, the open government movement has two options: demand machine-readable data or hack the PDFs – using technology to liberate the interesting data from them. The two approaches are complementary; we can pursue both at the same time.
When it comes to liberating data from PDFs, advanced technologies are available but expensive. In my previous life as a technology manager at a financial firm, I was given the opportunity to purchase a sophisticated PDF extraction tool for USD 200,000 – not counting annual maintenance and implementation consulting costs.
This amount is beyond the reach of just about every startup and non-profit in the open data world. It is also beyond the means of most media organizations, so lowering the cost of PDF extraction is a priority for journalists as well.
The data journalism community has responded by developing software to harvest usable information from PDFs. Tabula, a tool written by Knight-Mozilla OpenNews Fellow Manuel Aristarán, extracts data from PDF tables in a form that can be readily imported to a spreadsheet – if the PDF was “printed” from a computer application. Introduced earlier this year, Tabula continues to evolve thanks to the volunteer efforts of Manuel, with help from OpenNews Fellow Mike Tigas and New York Times interactive developer Jeremy Merrill. Meanwhile, DocHive, a tool whose continuing development is being funded by a Knight Foundation grant, addresses PDFs that were created by scanning paper documents. DocHive is a project of Raleigh Public Record and is led by Charles and Edward Duncan.
These open source tools join a number of commercial offerings, such as Able2Extract and ABBYY FineReader, that extract data from PDFs. A more comprehensive list of open source and commercial resources is available here.
Unfortunately, the free and low-cost tools available to data journalists and transparency advocates have limitations that hinder their ability to handle large-scale tasks. If, like me, you want to submit hundreds of PDFs to a software tool, press “Go” and see large volumes of cleanly formatted data, you are out of luck. These limits reduce our ability to analyze and report on Parliamentary/Congressional financial disclosures, campaign contribution records and government budgets – which often arrive in volume, in PDF form.
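To make the “press Go” idea concrete, here is a minimal sketch of a batch driver in Python. It is an illustration only, not how any tool named in this post works: `batch_extract` and the `extract_tables` callable are hypothetical names, and `extract_tables` stands in for whichever extraction tool you choose to wrap. The sketch assumes that the extractor takes a PDF path and returns tables as lists of rows.

```python
import csv
from pathlib import Path
from typing import Callable, Iterable, List

# A table is a list of rows; each row is a list of cell strings.
Table = List[List[str]]
# Hypothetical extractor signature: given a PDF path, return the tables found in it.
Extractor = Callable[[Path], Iterable[Table]]


def batch_extract(pdf_dir: Path, out_dir: Path, extract_tables: Extractor) -> dict:
    """Run extract_tables over every PDF in pdf_dir and write one CSV per table.

    Returns a dict mapping each PDF file name to either the number of tables
    written or a message describing the error raised while processing it.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    report = {}
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        try:
            count = 0
            for index, table in enumerate(extract_tables(pdf_path), start=1):
                csv_path = out_dir / f"{pdf_path.stem}_table{index}.csv"
                with csv_path.open("w", newline="", encoding="utf-8") as handle:
                    csv.writer(handle).writerows(table)
                count = index
            report[pdf_path.name] = count
        except Exception as error:  # one bad PDF should not stop the whole batch
            report[pdf_path.name] = f"failed: {error}"
    return report
```

The returned report matters as much as the CSVs: with hundreds of documents, you need to know which files failed or yielded no tables so that they can be checked by hand.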
PDF hacking has uses outside the government transparency / data journalism nexus. As Peter Murray-Rust has argued, the progress of science is being held back because valuable data are “jailed” within PDF journal articles. For this reason, Dr. Murray-Rust and several colleagues have been developing AMI – a tool that leverages Apache PDFBox to mine usable content from scientific documents.
Whether your motive is to improve government, lower the cost of data journalism or free scientific data, you are welcome to join The PDF Liberation Hackathon on January 18-19, 2014 – sponsored by The Sunlight Foundation, Knight-Mozilla OpenNews and others. We’ll have hack sites at the NYU-Poly Incubator in New York, Chicago Community Trust, Sunlight’s Washington DC office and at RallyPad in San Francisco (one or two locations will have an opening social on the evening of the 17th). Developers can also join remotely because we will publish a number of clearly specified PDF extraction challenges before the hackathon.
Participants can work on one of the pre-specified challenges or choose their own PDF extraction projects. Ideally, hackathon teams will use (and hopefully improve upon) open source tools to meet the hacking challenges, but they will also be allowed to embed commercial tools into their projects as long as each tool's licensing cost is less than $1,000 and an unlimited trial is available.
Prizes of up to $500 will be awarded to winning entries. To receive a prize, a team must publish its source code in a public GitHub repository. To join the hackathon in DC or remotely, please sign up at Eventbrite; to hack with us in SF, please sign up via this Meetup. Signup links for New York and Chicago will be posted here. Please also complete our Google Form survey.
The PDF Liberation Hackathon is going to be a great opportunity to advance the state of the art when it comes to harvesting data from public documents. I hope you can join us.