The following post is from Francis Irving, CEO of ScraperWiki.
‘Should Britain flog off the family silver to cut our national debt?’ — that’s the question the UK current affairs documentary Dispatches tackled last Monday.
ScraperWiki worked with Channel 4 News and Dispatches to make two supporting data visualisations, to help viewers understand what assets the UK Government owns. This blog post tells you a bit about the background to them – where the data came from, what it was like, and how and why we made the visualisations.
1. Asset bubbles
Inspired by Where Does My Money Go’s bubble chart of public spending, the first is a bubble chart of what central Government owns.
We couldn’t find any detailed national asset register more recent than 2005 (assembled in the National Asset Register 2007). With a good accounting system, and properly published data all the way through Government, such a thing would be kept constantly up to date.
In some ways drill-down is less of a problem here than with Government spending. There isn’t the equivalent need to know who the contractor for a piece of spending is, or to see the contract. Instead, you want to know assessments of value, what investment could do to that value, and the strategic consequences of losing control of the asset – detailed information that perhaps the authorities themselves often don’t have.
The PDFs were mined by hand (by Nicola) to make the visualisation, and if you drill down you will see an image of the PDF with the source of the data highlighted. That’s quite an innovation – one of the goals of the new data industry is transparency of source. Without knowing the source of data, you can’t fully understand the implications of making a decision based on it.
Julian used RaphaelJS to code the bubbles (source code here). You can think of it as “jQuery for in-browser SVG”. Amazingly, it even works in (most) versions of Internet Explorer (via a compatibility layer that falls back to VML).
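For a flavour of how that works, here’s a minimal sketch of drawing clickable bubbles with Raphaël – not the actual Dispatches code (that’s linked above); the container id, asset data and styling are invented for illustration:

```javascript
// Minimal sketch: clickable "asset bubbles" drawn with Raphaël.
// The container id, data and styling below are illustrative only.
var paper = Raphael("asset-bubbles", 600, 400);   // draws into <div id="asset-bubbles">

var assets = [
  { name: "Roads",   value: 87.2, x: 180, y: 200 },   // values made up
  { name: "Housing", value: 30.1, x: 380, y: 220 }
];

assets.forEach(function (asset) {
  // Make bubble area proportional to value, so radius scales with the square root
  var radius = Math.sqrt(asset.value) * 10;
  var bubble = paper.circle(asset.x, asset.y, radius)
                    .attr({ fill: "#4a90d9", stroke: "none" });
  paper.text(asset.x, asset.y, asset.name);
  bubble.click(function () {
    // In the real visualisation, clicking drills down to the highlighted PDF source
    console.log("Drill down into " + asset.name);
  });
});
```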
This has some advantages over Flash – you at least get iPad compatibility. It’s also easier for people with other web skills to maintain than Flex, plus people can “view source” and learn from each other just like in the good old days of the web.
That said, on the down side, CSS compatibility with the stylesheets of the site it is embedded in was a pain. We had to override a few higher-level styles (e.g. background transparency) to get it to work. Perhaps next time we should use an iframe :)
2. Brownfield sites
The second is a map of brownfield land owned by local councils in England.
Or at least, the land they owned in 2008. There isn’t a more recent version of the National Land Use Database yet. One of the main pieces of feedback we got was from people frustrated that the data wasn’t up to date, or always complete. There is definitely an expectation among the public that something as basic as what the Government owns should be available online and kept up to date.
The dataset is compiled by the Homes and Communities Agency, which has a goal of improving the use of brownfield land to help reduce the housing shortage. This makes it reasonably complete, and it covers the whole of England. That’s important, as it gives everyone a good chance of finding something near them.
The data is prepared by local authorities and sent to the agency as an Excel or GIS file (see the guidance notes linked near the bottom of this page). Depending on where you live, the detail and thoroughness will vary.
The same dataset contains lots of information about privately owned land, but we deliberately show only the land owned by local authorities, as the Dispatches show was about what the state could sell off. It’s quite interesting that a dataset gathered for the purposes of developing housing is also useful, as an aside, for measuring what the state owns. It’s that kind of repurposing that really requires an understanding of the data’s source.
The actual application is fairly straightforward Google Maps API and jQuery, although as with the asset bubbles, Zarino made it look and behave fantastically. The main innovative thing is that it tells a story about each site, constructed from the dataset.
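To give a rough idea of the shape of that code (the real source is linked below), here’s a sketch using the Google Maps JavaScript API and jQuery; the element id, data URL and field names are placeholders rather than what the application actually uses:

```javascript
// Sketch: plot brownfield sites on a Google Map and show a "story" on click.
// The element id, JSON URL and field names are placeholders.
var map = new google.maps.Map(document.getElementById("map"), {
  center: new google.maps.LatLng(53.41, -2.97),   // roughly Liverpool
  zoom: 12,
  mapTypeId: google.maps.MapTypeId.ROADMAP
});

$.getJSON("/brownfield-sites.json", function (sites) {
  $.each(sites, function (i, site) {
    var marker = new google.maps.Marker({
      position: new google.maps.LatLng(site.lat, site.lng),
      map: map,
      title: site.name
    });
    // Each marker opens a little story paragraph built from the dataset
    var info = new google.maps.InfoWindow({ content: site.story });
    google.maps.event.addListener(marker, "click", function () {
      info.open(map, marker);
    });
  });
});
```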
For example, what was originally quite a hard-to-read line in an Excel file comes out as:
JUNCTION OF PARK ROAD, NORTHUMBERLAND STREET
Liverpool City Council own this brownfield land. This site was dwellings and is now derelict. It is proposed that it is used for housing. Planning permission is detailed. A developer could build an estimated 14 homes here, selling for £1,820,000 (if they were at £130,000 per home, the median North West price).
Nicola did a lot of testing to make the wording as natural as possible, although we could have done even more. You can see the source code here. We think of these paragraphs as mini constructed stories, local to the viewer, a kind of visualisation as text.
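As an illustration of the idea (not the production code, which is linked above), a paragraph like the one quoted can be assembled from a single data row with a simple template function; the field names and the median price used here are invented for this sketch:

```javascript
// Illustrative only: assemble a "story" sentence from one row of the land-use data.
// Field names and the price figure are made up for this sketch.
function buildStory(site) {
  var medianPrice = 130000;   // e.g. median North West house price
  var saleValue = site.estimatedHomes * medianPrice;
  return site.owner + " own this brownfield land. " +
         "This site was " + site.previousUse + " and is now " + site.currentState + ". " +
         "It is proposed that it is used for " + site.proposedUse + ". " +
         "A developer could build an estimated " + site.estimatedHomes + " homes here, " +
         "selling for £" + saleValue.toLocaleString() + ".";
}

// Example row, loosely based on the Liverpool site quoted above
console.log(buildStory({
  owner: "Liverpool City Council",
  previousUse: "dwellings",
  currentState: "derelict",
  proposedUse: "housing",
  estimatedHomes: 14
}));
```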
Conclusion
This kind of visualisation, helping a viewer dig into the details of an overall story or theme that they are most interested in, is just the start of how the use of (open!) data can help media organisations.
I’d like to see more work to integrate data early on in the development of stories – so it acts as another source, generating leads in an investigation. And I think there are lots of opportunities for news organisations to build ongoing applications, which build audience, revenue and personal stories even when the story isn’t in the 24-hour news cycle.
See also Nicola’s post 600 Lines of Code, 748 Revisions = A Load of Bubbles on the ScraperWiki blog.
This is very impressive and I’m really pleased that ScraperWiki is involved in these types of programmes.
But from your and Nicola’s posts I don’t understand how you coped with the problem of the asset register being only available in a monster PDF. How did you get at the data? I guess one solution would be to open the PDF in Acrobat Professional and cut and paste the relevant tables, but I’m wondering if you have a more sophisticated solution?
Julian tried parsing it, but the tables were inconsistent not just in formatting, but in content too.
He says with more time he could have built a tool that would make it easier to scrape that kind of case, but we didn’t have the time because we had to be finished before the programme was broadcast!
In the end we had a less sophisticated solution – Nicola went through it by hand and copied the data we wanted into a spreadsheet.
She also used Julian’s PDF clipping tool to select the areas for highlighting.
Doing it by hand turned out to be partly useful in journalistic terms – it meant she could choose what was and wasn’t interesting (bundling dull things into one lump), making the final visualisation just right in terms of interest.