The following guest post is by Christopher Gutteridge, a Web & Systems Programmer and Open Data Architect at the University of Southampton. When he was young he wrote the “coffee stain” filter for GIMP, and he is the developer of the Graphite RDF PHP library & tools. He is a member of the OKF Working Group on Open Bibliographic Information.
I know that it’s best to practice a new technique before employing it on anything major. I also like over-engineering for its own sake. (Beware the Modeller!) That’s how I ended up building data.totl.net.
There are essentially 3 kinds of dataset on the site.
First of all, the occult stuff. There’s something really satisfying about how over-engineered some of the occult materials are: the agonising effort made to map one system into another, like tarot cards onto the Kabbalah. Aleister Crowley produced a book that was practically just a huge occult spreadsheet. I really enjoy the process of messing around with data, and it seemed too good a data source to miss, and one nobody else would worry about.
In an over-enthusiastic moment, I asked Professor Nigel Shadbolt (the UK Government advisor on Open Data) if occult correspondence was transitive. He gave me one of those expressions that makes you feel like you’ve just put back your career. Not much… maybe a couple of weeks. Still, he’s now got me doing some interesting stuff with Open Data so hopefully he’s forgotten.
Don’t get me wrong, I’m not asserting any of that data is true. But the complex interrelating patterns of information remind me of some spiritual precursor to RDF.
The second category of dataset is the perverse ones. I kinda hate OWL and over-modelling. (While I love over-engineering, I think it’s unhelpful. I’ve never actually benefited from writing a machine-readable schema, except to get the AI people off my back.) The perverse datasets kinda reflect this. The zzStructures dataset is a homage to the fact that zzStructures were only a teeny-tiny bit different to RDF, and literally decades earlier (Ted Nelson for the nearly-win!), but those differences hobbled the system. RDF is hobbled too, of course, but less so.
Right now most open data is put out by well-meaning liberal sources, and there’s a dangerous assumption that you can find truth in it. There is no absolute truth, and the Daily Mail Cancer Causes dataset is intended as a cautionary dataset for those used to getting open data from the Guardian. It’s screen-scraped from the oncology-ontology site.
###How about a nice game of Chess?
The last set of data is accidentally quite interesting. It defines URIs for states of deterministic games like Noughts & Crosses and Chess. The chess one was done because everybody who saw the O&X one suggested it. I’d not recommend actually loading it into a triplestore as it’s generated on-the-fly and, while finite, the data would probably take billions of years to download.
Describing a game of chess (or tic-tac-toe) using URIs for the states and moves means you could resolve the URIs to find alternate moves, the state of the board and so on. My dataset in fact defines a URI for every move from every possible state of a chess game.
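To make the idea concrete, here is a minimal sketch for Noughts & Crosses. It is not the code behind data.totl.net: the `http://example.org/oxo/` base, the 9-character board string and the function names are all assumptions made up for illustration. It shows how a state can be its own identifier, and how every legal move from that state gets a URI that leads on to the next state.

```python
# Hypothetical URI scheme for illustration only; data.totl.net may differ.
BASE = "http://example.org/oxo/"


def to_move(board: str) -> str:
    """X goes first, so X is to move whenever the piece counts are equal."""
    return "X" if board.count("X") == board.count("O") else "O"


def state_uri(board: str) -> str:
    """A board is 9 characters ('X', 'O' or '-'), read row by row."""
    return f"{BASE}state/{board}"


def successors(board: str) -> list[tuple[str, str]]:
    """Return (move URI, resulting state URI) for every empty square.

    This sketch does not check whether the game has already been won.
    """
    player = to_move(board)
    result = []
    for i, cell in enumerate(board):
        if cell == "-":
            next_board = board[:i] + player + board[i + 1:]
            result.append((f"{BASE}move/{board}/{i}", state_uri(next_board)))
    return result


if __name__ == "__main__":
    for move, state in successors("X--------"):
        print(move, "->", state)
```

Because each state URI can be computed from the board alone, nothing needs to be stored; the data can be generated on request, which is how a finite but enormous dataset stays servable.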
A final bit of headache on this was trying to use the Unicode chess symbols as part of the URIs to represent chess pieces. This wasn’t too bad, but I’d never done it before.
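For anyone else who hasn’t done it before, the usual trick is to percent-encode the UTF-8 bytes of the symbol. The sketch below uses Python’s standard `urllib.parse`; the `http://example.org/chess/piece/` base and the `piece_uri` helper are hypothetical and are not the scheme the dataset actually uses.

```python
from urllib.parse import quote, unquote

# Hypothetical base URI for illustration only.
PIECE_BASE = "http://example.org/chess/piece/"

WHITE_KING = "\u2654"  # the Unicode WHITE CHESS KING symbol


def piece_uri(symbol: str) -> str:
    """Percent-encode the symbol's UTF-8 bytes so the URI is plain ASCII."""
    return PIECE_BASE + quote(symbol, safe="")


def piece_from_uri(uri: str) -> str:
    """Recover the original symbol from a URI built by piece_uri."""
    return unquote(uri[len(PIECE_BASE):])


if __name__ == "__main__":
    uri = piece_uri(WHITE_KING)
    print(uri)
    print(piece_from_uri(uri))
```

The encoded form is what travels over the wire; many browsers and RDF tools will display the symbol itself, which is the IRI form of the same identifier.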
###Just because it’s stupid doesn’t mean it’s not Open Linked Data
I have to say I’m rather proud that data.totl.net made it onto the last Linked Open Data Map (down and left a bit from DBpedia). Richard Cyganiak, who produces the map (as well as the invaluable prefix.cc site), was slightly snarky about it, as he does it as a service to the community and it gets bigger every time. But I wrote him a handy little analytical tool by way of apology and he says he’s now forgiven me, so long as I buy him a pint if we ever meet in person.
Hopefully his job’s going to get even harder. Our EPrints software puts out Open Data out of the box, with v3.2.1 released last year. The second someone builds a dataset which links hundreds of EPrints repositories (and similar tools) into the diagram, it becomes unmaintainable… And that’s a good sign of healthy take-up of the ideas. I remember my student homepage, back in 1995, appeared in the list of “new pages on the web, today” – as one of over 1,000 pages that day. Soon after that, counting became a pointless exercise, thanks to exponential growth.
Anyway, now I’ve hopefully got all that over-engineered, low-utility RDF out of my system I can return to my day job, producing sensible, useful open data.
###Silliness with SPARQL
Just before I go, here are a couple of silly uses for sensible data. These use SPARQL and DBpedia for very silly ends and were produced as a result of the Oxford Open Data Hack Day.