Opening up university infrastructure data
The following guest post is from Christopher Gutteridge, Web Projects Manager at the School of Electronics and Computer Science (ECS), University of Southampton, and member of the OKF’s Working Group on Open Bibliographic Data.
Around five years ago we (the School of Electronics and Computer Science, University of Southampton) ran a project to publish our infrastructure data as open data. This included staff, teaching modules, research groups, seminars and projects. This year we have been overhauling the site based on what we’ve learned in the interim. We made plenty of mistakes, but that’s fine; it’s what being a university is all about. We’ll continue to blog about what we’ve learned.
We have formally added a “CC0” public domain license to all our infrastructure RDF data, such as staff contact details, research groups and publication lists. One reason few people took an interest in working with our data is that we didn’t explicitly say what was and wasn’t OK, and people are disinclined to build anything on top of data which they have no explicit permission to use. Most people instinctively want to preserve some rights over their data, but we can see no value in restricting what this data can be used for. Restricting commercial use is not helpful and restricting derivative works of data is nonsensical!
Here’s an example: someone is building a website to list academics by their research area and they use our data to add our staff to it. How does it benefit us to force them to attribute our data to us? They are already assisting us by making our staff and pages more discoverable, so why would we want to impose a restriction? If they want to build a service that compiles and republishes data from many sources, they would need to track every license, and that’s going to be a bother on a similar scale to clause 3 (the advertising clause) of the original BSD license.
Our attitude is that we’d like an attribution where convenient, but not if it’s a bother. “Must-attribute” is a legal requirement; we say “please-attribute”. It’s our hope that this step will help other similar organisations take the same step with the confidence of not being the first to do so.
The CC0 license does not currently extend to our research publication documents (just the metadata) or to research data. It is my personal view that research funders should make it a requirement of funding that a project publishes all data produced, in open formats, along with any custom software used to produce or process it, including the source and (ideally) the complete cvs/git/svn history. This is beyond the scope of what we’ve done recently in ECS, but the University is taking the management of research data very seriously and it is my hope that this will result in more openness.
Another mistake we have learned from is that we made a huge effort to model and describe our data as semantically accurately as possible. Nobody cares enough about our data to explain to their tool what an “ECS Person” is. We’re in the process of adding more generic schemes such as FOAF and SIOC. The awesome thing about the RDF format is that we can do this gently and incrementally. So now everybody is both (is rdf:type of) an ecs:Person and a foaf:Person (example). The process of making this more generic will continue for a while, and we may eventually expire most of the extraneous ecs:xyz site-specific relationships except where no better ones exist.
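To make the “both types at once” idea concrete, here is a minimal sketch in plain Python, with triples held as tuples. The ecs:Person URI and the person URI are placeholders I have made up for illustration (they are not our real identifiers); the FOAF and rdf:type URIs are the standard ones. It shows one way the generic type could be added alongside the local one without removing anything.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"
# Hypothetical URI standing in for the site-specific ecs:Person class.
ECS_PERSON = "http://example.org/ecs/ns#Person"


def add_generic_types(triples, mapping):
    """Return the triples plus a generic rdf:type for each mapped local type.

    Existing triples are kept, so the local and generic types coexist.
    """
    result = set(triples)
    for subject, predicate, obj in triples:
        if predicate == RDF_TYPE and obj in mapping:
            result.add((subject, RDF_TYPE, mapping[obj]))
    return result


# Hypothetical person URI and data, for illustration only.
person = "http://example.org/ecs/person/1"
data = {
    (person, RDF_TYPE, ECS_PERSON),
    (person, "http://xmlns.com/foaf/0.1/name", "Example Person"),
}
enriched = add_generic_types(data, {ECS_PERSON: FOAF_PERSON})
```

Because the original triples are left in place, anything that already understands the local type keeps working while generic FOAF-aware tools start to recognise the same people.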
The key turning point for us was when we started trying to use this data to solve our own problems. We frequently build websites for projects and research groups, and these want views on staff, projects, publications etc. Currently this is done with an SQL connection to the database, and we hope the postgrad running the site doesn’t make any cock-ups which result in data being made public which should not have been. We’ve never had any (major) problems with this approach, but we think that loading all our RDF data into a SPARQL server (like an SQL server, but for RDF data and queried over HTTP) is a better approach. The SPARQL server only contains information we are making public, so the risk of leaks (e.g. staff who’ve not given formal permission to appear on our website) is minimised. We’ve taken our first faltering steps and discovered immediately that our data sucked (well, wasn’t as useful as we’d imagined). We’d modelled it with an eye to accuracy, not usefulness, believing that if you build it they will come. The process of “eating our own dogfood” rapidly revealed many typos and poor design decisions which had not come to light in the previous 4 or 5 years!
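For readers who haven’t met SPARQL, here is a rough sketch of what “queried over HTTP” means in practice. The endpoint address is a placeholder (this post does not give a real one) and the query is only an illustration. The sketch builds the request URL using the `query` parameter defined by the SPARQL protocol and does not contact any server.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; the post does not give the real address.
ENDPOINT = "http://sparql.example.org/query"

# Illustrative query: list ten people and their names.
QUERY = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?person ?name WHERE {
  ?person a foaf:Person ;
          foaf:name ?name .
}
LIMIT 10
"""


def build_query_url(endpoint, query):
    """Encode a SPARQL query as an HTTP GET URL using the `query` parameter."""
    return endpoint + "?" + urlencode({"query": query.strip()})


url = build_query_url(ENDPOINT, QUERY)
print(url)
```

A project website would fetch that URL and render the results, and since the server only holds public data, a mistake in the query cannot expose anything private.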
Currently we’re also thinking about what the best “boilerplate” data is to put in each document. Again, we’re now thinking about how to make it useful to other people rather than how to accurately model things.
There’s no definitive guidance on this. I’m interested to hear from people who wish to consume data like this to tell us what they *need* to be told, rather than what we want to tell them. Currently we’ve probably got overkill!
One field I believe should be standard, and which we don’t have, is where to send corrections. Some of the data on data.gov.uk is out of date, and an instruction on how to correct it would be nice and would benefit everyone.
At the same time we have started making our research publication metadata available as RDF, also CC0, via our EPrints server. It helps that I’m also lead developer for the EPrints project! By default any site upgrading to EPrints 3.2.1 or later will get linked data being made available automatically (albeit with an unspecified license).
Now let me tell you how open linked data can save a university time and money!
Scenario: The university cartography department provides open data in RDF form describing every building, its GPS coordinates and its ID number. (I was able to create such a file for 61 university buildings in less than an hour’s work. It is already freely published on maps on our website, so it is no big deal making it available.)
The university teaching support team maintain a database of learning spaces, and the features they contain (projectors, seating layout, capacity etc.) and what building each one is in. They use the same identifier (URI) for buildings as the cartography dept. but don’t even need to talk to them, as the scheme is very simple. Let’s say:
Each team undertakes to keep their bit up to date, which is basically work they were doing anyway. They source any of their systems from this data so there’s only one place to maintain it. They maintain it in whatever form works for them (SQL, raw RDF, text file, Excel file in a shared directory!) and data.exampleuniversity.ac.uk knows how to get at this and provide it in well-formed RDF.
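As a rough idea of how little is needed for that last step, here is a sketch that turns a CSV export of learning spaces into N-Triples. The column names, room IDs, property URIs and the URI layout under data.exampleuniversity.ac.uk are all assumptions made up for this example, not a recommended scheme.

```python
import csv
import io

# Hypothetical base URI and property names; the post does not specify a scheme.
BASE = "http://data.exampleuniversity.ac.uk/"
IN_BUILDING = BASE + "ns/inBuilding"
CAPACITY = BASE + "ns/capacity"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def rows_to_ntriples(csv_text):
    """Turn rows with room, building and capacity columns into N-Triples lines."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        room = f"<{BASE}room/{row['room']}>"
        building = f"<{BASE}building/{row['building']}>"
        capacity = int(row["capacity"])  # fails loudly on a malformed number
        lines.append(f"{room} <{IN_BUILDING}> {building} .")
        lines.append(f'{room} <{CAPACITY}> "{capacity}"^^<{XSD_INTEGER}> .')
    return lines


# Made-up sample export from a team's spreadsheet.
SAMPLE = "room,building,capacity\n32-1015,32,40\n59-1257,59,120\n"
for line in rows_to_ntriples(SAMPLE):
    print(line)
```

The team keeps editing their spreadsheet as before; only the small conversion step has to know about RDF.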
The timetabling team wants to build a service to allow lecturers and students to search for empty rooms with certain features, near where they are now. (This is a genuine request made of our Timetable team at Southampton, and one they would like a solution for.)
The coder tasked with this gets the list of empty rooms from the timetabling team. Possibly this won’t be open data, but it still uses the same room IDs (URIs). eg.
She can then mash this up with the learning-space data and the building location data to build a search to show empty rooms, filtered by required feature(s). She could even take the building you’re currently in and sort the results by distance away from you. The key thing is that she doesn’t have to recreate any existing data, and as the data is open she doesn’t need to jump through any hoops to get it. She may wish to register her use so that she’s informed of any planned outages or changes to the data she’s using, but that’s about it. She has to do no additional maintenance as the data is being sourced directly from the owners. You could do all this with SQL, but this approach allows people to use the data with confidence without having to get a bunch of senior managers to agree a business case. An academic from another university, running a conference at exampleuniversity, can use the same information without having to navigate any of the politics and bureaucracy, and can improve their conference site’s value to delegates by joining each session to its accurate location. If they make the conference programme into linked data (see http://programme.ecs.soton.ac.uk/ for my work in this area!) then a 3rd party could develop an iPhone app to mash up the programme & university building location datasets and help delegates navigate.
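The empty-room search described above can be sketched in a few lines once the three datasets share identifiers. Everything in this example is invented for illustration: the URIs, the coordinates, the feature names and the list of empty rooms. A real version would load these from the published data rather than hard-code them.

```python
from math import asin, cos, radians, sin, sqrt

# All URIs, coordinates and features below are made up for illustration.
B = "http://data.exampleuniversity.ac.uk/building/"
R = "http://data.exampleuniversity.ac.uk/room/"

BUILDINGS = {  # cartography data: building URI -> (latitude, longitude)
    B + "32": (50.9365, -1.3960),
    B + "59": (50.9373, -1.3977),
    B + "85": (50.9340, -1.3940),
}
ROOMS = {  # teaching support data: room URI -> (building URI, features)
    R + "32-1015": (B + "32", {"projector"}),
    R + "59-1257": (B + "59", {"projector", "tiered-seating"}),
    R + "85-2207": (B + "85", {"projector", "tiered-seating"}),
}
# Timetabling data: rooms that are empty right now.
EMPTY_NOW = {R + "32-1015", R + "59-1257", R + "85-2207"}


def distance_km(a, b):
    """Great-circle distance between two (lat, lon) points (haversine)."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def find_rooms(required, current_building):
    """Empty rooms that have all required features, nearest building first."""
    here = BUILDINGS[current_building]
    matches = [room for room in EMPTY_NOW if required <= ROOMS[room][1]]
    return sorted(matches, key=lambda room: distance_km(here, BUILDINGS[ROOMS[room][0]]))


print(find_rooms({"tiered-seating"}, B + "32"))
```

The join works only because all three sources use the same room and building URIs, which is the whole point of agreeing on identifiers up front.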
But the key thing is that making your information machine readable, discoverable and openly licensed is of most value to the members of your own organisation. It stops duplication of work and reduces time wasted trying to get a copy of data other staff maintain.
“If HP knew what HP knows, we’d be three times more profitable.” – Hewlett-Packard Chairman and CEO Lew Platt
I’ve been working on a mindmap to brainstorm every potential entity a university may eventually want to identify with a URI. Many of these would benefit from open data. Please contact me if you’ve got ones to add! It would be potentially useful to start recommending styles for URIs for things like rooms, courses and seminars as most of our data will be of a similar shape, and it makes things easier if we can avoid needless inconsistency!