The following guest post is from Julian Todd, who works on projects such as Public Whip, UNdemocracy, and ScraperWiki. He is also a member of the Open Knowledge Foundation’s Working Group on Open Government Data. The post was originally published on Julian’s blog, Freesteel.
Yesterday Transport for London made a data dump of various locations and links to their traffic cameras, station locations, and so on.
A quick and effective use of some of the data is CheckMyRoute by Stefan Wehrmeyer that shows you all the CCTV traffic-cams on the route between two points in London.
This makes use of Google Maps’ route-finding function to thin out the awesome overload of camera locations you would otherwise see if you plotted them all at once.
It’s an attractive application because it’s an end product, rather than a stepping stone to the big solution of getting all data structured all ways so it can be used everywhere all the time for everything.
Over on ScraperWiki I’m working towards this big solution by parsing the self-service cycle hire locations. The data allowed me to plot the following attractive map — as a byproduct.
This map is not an end in itself. It’s just to prove I have the data. The pins are coloured according to whether the hire locations have under 20, between 20 and 30, or more than 30 listed under “Capacity”.
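The three-way capacity banding can be sketched as a small helper. (The colour names here are my own illustrative choices, not taken from the original map.)

```python
def capacity_colour(capacity):
    """Bucket a docking station's listed 'Capacity' into a pin colour.

    Bands follow the map: under 20, between 20 and 30, more than 30.
    """
    if capacity < 20:
        return "red"
    if capacity <= 30:
        return "yellow"
    return "green"
```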
I believe these are the docking stations of London’s exciting new self-service public cycle hire scheme.
Let’s discuss the data and the code for parsing it.
What we have are about 400 comma-separated value (CSV) lines with the following fields:
‘Name’, ‘Postcode_District’, ‘TfL_Ref’, ‘Capacity’, ‘Lat’, ‘Long’, ‘Easting’, ‘Northing’
Here is an example row:
Embankment (Horse Guards),SW1,01/610104,32,51.50494561,-0.123247648,530350.1,180121.23
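A minimal sketch of parsing such a line with Python’s csv module, using the field names listed above (reading from a string here for illustration; the real scraper would read the downloaded file):

```python
import csv
import io

FIELDS = ['Name', 'Postcode_District', 'TfL_Ref', 'Capacity',
          'Lat', 'Long', 'Easting', 'Northing']

row_text = ("Embankment (Horse Guards),SW1,01/610104,32,"
            "51.50494561,-0.123247648,530350.1,180121.23")

reader = csv.DictReader(io.StringIO(row_text), fieldnames=FIELDS)
station = next(reader)

# The csv module gives us strings; convert the numeric fields.
station['Capacity'] = int(station['Capacity'])
for key in ('Lat', 'Long', 'Easting', 'Northing'):
    station[key] = float(station[key])

print(station['Name'], station['Capacity'])
```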
As you can see, there is redundancy. We can assume that ‘Lat’ and ‘Long’ are in WGS84 coordinates, because that’s what Google Maps takes and what most GPS devices deliver, even though coordinate schemes and datum shifts are an extremely complicated issue.
Because we are in Britain, the ‘Easting’ and ‘Northing’ must be in the British national grid reference system, which is the grid we do our maps in.
This is a useful grid, because it’s flat and measured in metres. You can tell that Horse Guards is about 2km north of Vauxhall Bridge — which is very useful if you’re making maps with a ruler on paper, as most of them used to be made.
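Because the grid is flat and in metres, the ruler-on-paper measurement is just Pythagoras. A sketch using the easting/northing from the row above (the Vauxhall Bridge coordinates below are rough illustrative values of my own, not from the dataset):

```python
import math

def grid_distance(e1, n1, e2, n2):
    """Straight-line distance in metres between two OSGB grid points."""
    return math.hypot(e2 - e1, n2 - n1)

horse_guards = (530350.1, 180121.23)    # from the dataset row above
vauxhall_bridge = (530300.0, 178150.0)  # rough, illustrative only

d = grid_distance(*horse_guards, *vauxhall_bridge)
print(round(d), "metres")  # roughly 2km, as eyeballed with a ruler
```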
The ‘Lat’ and ‘Long’ values, on the other hand, are magic numbers that require a computer that understands ellipsoidal geometry and transverse Mercator projections to use. There is no perfect conversion from one pair of numbers to the other unless you also know the altitudes, because each system has its own idea of the down vector.
You don’t need to be interested in this stuff (I’m not particularly), but it is a very good idea to develop an appreciation for where the hard problems are, so you can avoid them rather than walk straight into them with your eyes closed.
This redundancy in the dataset shows that the person who created it has a similar appreciation for the difficulties.
The ‘Postcode_District’ field is obviously redundant.
The ‘Name’ field occasionally contains an inexplicable ‘\xa0’ character (a non-breaking space) that broke our string handling routines, so it had to be substituted out, sometimes for a space and sometimes for nothing. I have no idea what it’s doing there.
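A cleaning sketch for the common case, where the stray character stands in for a space (where it stands in for nothing, you would need a per-name fix instead):

```python
def clean_name(name):
    """Replace non-breaking spaces with ordinary ones, then tidy up.

    Collapsing whitespace runs afterwards means it doesn't matter
    whether the \xa0 sat next to a real space or not.
    """
    return " ".join(name.replace("\xa0", " ").split())

print(clean_name("Embankment\xa0(Horse Guards)"))
```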
The ‘TfL_Ref’ is the unique identifier. Unique identifiers are so handy that most datasets have one (e.g. invoice numbers, document codes). Unfortunately this one has a ‘/’ in it, which means you’ll have difficulty if you try to use it as part of a URL. In other datasets (e.g. my UNdemocracy.com) I tried substituting every ‘/’ for a ‘-’, and then found that ‘-’ characters were pretty common too, so the substitution couldn’t be reversed.
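One way out of the lossy-substitution trap is percent-encoding, which round-trips cleanly (though note that some web servers decode %2F in paths before routing, so it is not a universal fix):

```python
from urllib.parse import quote, unquote

ref = "01/610104"
url_safe = quote(ref, safe="")  # safe="" forces '/' to be encoded too
print(url_safe)                 # 01%2F610104

# Unlike substituting '/' with '-', this escapes back perfectly.
assert unquote(url_safe) == ref
```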
As I said, I have coloured the map symbols according to capacity. It would be better to make the pins larger or smaller according to the capacity, but I wanted to use that little bike symbol, and the Google Chart API does not allow me to vary its size.
Well, obviously if there are public CCTV cameras looking at the cycle racks, I ought to be able to merge the datasets so I could check whether there were any bikes at a location before I walked there.
Maybe a universal travel planner that had all the bus and underground timetables and routes could offer a cycle leg across town to my friend’s office as an alternative option. Perhaps the computer could plot the hypothetical route for me, compare it to the Annual Average Daily Traffic Flows at various junctions, and decide: “No, maybe it’s not a good idea, as you won’t actually enjoy this particular journey with all the trucks at this time of day.”
And don’t forget the datasets of accident and crime statistics that must be kicking around somewhere. We know that cycle fatalities are reported in far greater detail than run-of-the-mill car crashes or bus muggings in the kinds of papers your mother reads, so it’s important to obtain the actual numbers to argue the case about what is really safe.
Integrate that with the lower mortality you get from actually taking some exercise for a change (the facts are around somewhere), matched to your actual age and demographic (if you’re 98 like my grandfather, I will concede that you will live longer if you take the bus), and we’ll never need to think for ourselves again.
Aside from having to answer the question at the top: “Where do you want to go?”
ScraperWiki is ready for business if you know of any other datasets you would like to draw into the common pool and are willing to code. Soon it will have PHP support (not just Python).
Oh, and you mustn’t forget to register every dataset on CKAN, as I have done with this cycle information, so that more people will be able to find it again.
Once a critical mass of connected and consistent information develops, the bigger projects become possible.