Francis Irving

CEO of ScraperWiki. Made several of the world's first civic websites, such as TheyWorkForYou and WhatDoTheyKnow.

3 Comments

  • Julian tried parsing it, but the tables were inconsistent not just in formatting but in content too.

    He says with more time he could have built a tool that would make it easier to scrape that kind of case, but we didn’t have the time because we had to be finished before the programme was broadcast!

    In the end we had a less sophisticated solution – Nicola went through it by hand and copied the data we wanted into a spreadsheet.

    She also used Julian’s PDF clipping tool to select the areas for highlighting.

    Doing it by hand turned out to be useful journalistically: she could choose what was and wasn’t interesting (bundling dull things into one lump), which made the final visualisation just right in terms of interest.

  • This is very impressive and I’m really pleased that ScraperWiki is involved in these types of programmes.

    But from your and Nicola’s posts I don’t understand how you coped with the problem of the asset register being only available in a monster PDF. How did you get at the data? I guess one solution would be to open the PDF in Acrobat Professional and cut and paste the relevant tables, but I’m wondering if you have a more sophisticated solution?
