I am in Vienna, along with my colleague Ira, for a plenary meeting of the assorted partners of the LOD2 project. LOD2 is an EU-funded research project on Linked Open Data, the vision of an interlinked web of data known to many from Tim Berners-Lee’s TED talk. The meeting runs for 3 days, in which there will be discussions about the various work packages, but I have been given the task of blogging about the opening introductory session on Wednesday afternoon. (Full disclosure: I have received a handsome LOD2 mug as advance payment for my efforts.) The Open Knowledge Foundation is one of the partners, because the pan-European CKAN data portal publicdata.eu is part of the project. But being personally a relative newcomer, I was looking forward to finding out in this introductory session what the project is really all about.
Delegates at the LOD2 plenary
Sören Auer, the project co-ordinator, kicked off, giving an overview of the overview. He described the lifecycle of Linked Data, from extraction (from other structured or unstructured data) through to linking in to existing data, enrichment (perhaps by adding more structure), to the point where it can be explored for interesting patterns. For each stage in the lifecycle, there are tools being developed by the project – many are already released. Collectively these tools, which are all Open Source, form the LOD2 ‘stack’. Sören also mentioned some recent milestones, including a Serbian CKAN portal holding a lot of data in RDF, the native format for Linked Data; and a planned new data-oriented conference, the European Data Forum.
The tools: Work Packages 2-6
WP2: Optimising the store
Peter Boncz of CWI spoke about Work Package 2. (What happened to WP1, you ask? It was a prototype which finished earlier in the project.) WP2 concerns Virtuoso, the database part of the LOD2 stack. The challenge with RDF is to make a database that runs efficiently with huge quantities of data, as the potential for rich interlinking means the data is not neatly segmented into tables as in a normal database. A lot of progress has already been made, and he hopes that Virtuoso 7 will be released soon. It will be structured to enable better compression (speeding up processing by reducing I/O), and use adaptive caching to try to minimise the number of queries that need to be done more than once.
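To make the contrast with a 'normal' table-based database concrete, here is a minimal sketch of RDF-style data held as a flat set of subject–predicate–object triples and queried by pattern. It is purely illustrative and says nothing about how Virtuoso works internally; the `ex:` identifiers and the `match` helper are made up for the example.

```python
# A toy triple store: every fact is one (subject, predicate, object) tuple.
# All identifiers below are hypothetical examples, not real LOD2 data.
triples = {
    ("ex:Vienna", "rdf:type", "ex:City"),
    ("ex:Vienna", "ex:capitalOf", "ex:Austria"),
    ("ex:LOD2", "ex:meetsIn", "ex:Vienna"),
    ("ex:LOD2", "ex:fundedBy", "ex:EU"),
}


def match(store, s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Everything that links *to* Vienna, whatever kind of thing it is.
for triple in match(triples, o="ex:Vienna"):
    print(triple)
```

Because any resource can link to any other, there is no fixed set of columns to index on, which is one reason a store has to work hard to stay fast at scale.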
WP3: Getting the data
Jens Lehmann of AKSW at the University of Leipzig was next, talking about WP3 on ‘extraction, enrichment and repair’: the creation of Linked Data from existing structured or unstructured sources, its enrichment with suitable taxonomies to describe it, and the detection of inconsistencies or other problems with its structure. If that sounds like a wide-ranging package, it is: as Jens told me later over dinner (not entirely seriously), ‘anything that doesn’t fit in one of the other packages gets stuffed into WP3’! There are currently over 20 tools playing a role in this stage, including Natural Language Processing techniques for extracting data from free text.
WP4: Creating links
Next up was Robert Isele of the Freie Universität Berlin. WP4 aims to enrich RDF data by adding links to other data sources, as well as linking data together by identifying duplicate entities within or between datasets. Automatic tools suggest links that a user can confirm or reject. WP4 also includes work to create an RDF-enabled version of the open source data cleaning tool Google Refine.
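As a rough illustration of the 'suggest, then confirm or reject' idea, the sketch below compares labels from two small datasets and proposes candidate links above a similarity threshold. It is not the WP4 tooling: the `suggest_links` function, the 0.85 threshold, the sample data and the use of Python's `difflib` for string similarity are all assumptions made for the example.

```python
from difflib import SequenceMatcher
from itertools import product


def suggest_links(source, target, threshold=0.85):
    """Return candidate (source_id, target_id, score) links, best first.

    `source` and `target` map entity identifiers to human-readable labels.
    """
    candidates = []
    for (s_id, s_label), (t_id, t_label) in product(source.items(), target.items()):
        # Case-insensitive similarity between the two labels, from 0.0 to 1.0.
        score = SequenceMatcher(None, s_label.lower(), t_label.lower()).ratio()
        if score >= threshold:
            candidates.append((s_id, t_id, round(score, 2)))
    return sorted(candidates, key=lambda c: c[2], reverse=True)


# Hypothetical sample data: two datasets describing overlapping organisations.
dataset_a = {"a:1": "University of Leipzig", "a:2": "Freie Universität Berlin"}
dataset_b = {"b:7": "university of leipzig", "b:9": "Open Knowledge Foundation"}

# Each suggestion would be shown to a user to confirm or reject.
for s_id, t_id, score in suggest_links(dataset_a, dataset_b):
    print(f"suggest: {s_id} owl:sameAs {t_id} (similarity {score})")
```

A real link-discovery tool would compare several properties rather than a single label, but the shape of the workflow is the same: the machine proposes, the human disposes.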
WP5: User interfaces
Sean Policarpio of DERI reported on WP5 on browsing, visualisation and authoring interfaces. He demonstrated geospatial data on a map, filtered with a structured (faceted) search – combining the power of Linked Data with a mapping search like Google Maps. Associated with this, they have produced a ‘semantic authoring’ tool, allowing the user to add or edit Linked Data via the map. Their next tasks are to implement ‘social semantic networking’ – for example, notifications based on semantic content – and mobile interfaces for their semantic tools.
WP6: Integrating the tools
Finally, the engaging and very Belgian Bert van Nuffelen of TenForce spoke about WP6, which aims to make the various disparate tools in the LOD2 stack play nicely together. They have worked on making the stack tools easier for users to install, on a shared interface, and on shared authorisation using WebID. They have also recently released an intermediate version of the stack (version 1.1) with new and upgraded tools and better documentation.
By now it was 3 o’clock and, against all expectations, the meeting was ahead of schedule. So we had a relatively luxurious half-hour break for tea. Your correspondent and another relative newcomer, Jan from TenForce, took the opportunity to get some fresh air and a feel for the Viennese genius loci. Or should that be Ortsgeist?
The use cases
We had heard about the tools that had been, and are being, developed to manipulate Linked Data. But how will they be used? Refreshed by tea, we returned to the meeting to hear about the three Work Packages concerned with use cases. Perhaps the most exciting talk of the afternoon came from Christian Dirschl of WP7 and Wolters Kluwer Germany (WKD). WKD is a legal and accountancy publisher who are already adapting and using the LOD2 stack tools to enhance their publishing business. Christian told us that ‘semantic technologies enable publishing media to create added value’, and WKD’s first release of news and media datasets created using Linked Data tools is on course for publication in April. By December they will release an interlinked version of the datasets, including links to DBpedia and further optimised tools.
Amar-Djalil Mezaour of Exalead presented the ‘enterprise’ use case WP8, an application to human resources with the aim of matching job vacancies to applicants. Some early work trying to model CVs had met criticism on the grounds, among others, that the EU reviewers doubted the volume of data freely available. WP8 has therefore refocused its attention from CVs to job vacancies, for which there is plenty of data and better RDF support. They hope to release the results later this year, with vacancies ‘dashboards’ and analytics, faceted by sector, region, salary, etc, using Linked Data, and enriched with mashups with other sites such as social networks.
WP9: Government data
After a long wait in the wings, it was time for the OKF’s own Ira Bolychevsky to take centre stage at last. WP9 aims to explore how Linked Data can be applied to making government data available and maximising its use. Its main visible output is publicdata.eu, which republishes open data from government portals throughout the European Union. publicdata.eu has recently been upgraded and repaired: it now runs the latest version of CKAN, introducing features such as data previews and – live on the DataHub and coming soon to publicdata.eu – a data API for structured data. Two subjects we hope to discuss more later in the plenary are closer integration with the LOD2 stack, and metadata standards.
Ira presenting WP9
Jindřich Mynarz briefly mentioned the new Czech CKAN portal. They have developed a detailed methodology as well as a ‘Quick Start guide’ for publishers, both of which they promise to make available in English soon (hurrah!)
Finally Vojtech Svatek of UEP gave a quick overview of WP9a, which aims to use Linked Data technology in the field of public procurement, with ontologies for public sector contracts – providing matchmaking and analytics not dissimilar from those in WP8.
A jug of wine, a loaf of bread
Perhaps the reader has read enough of Work Packages for now. Anticipating your satiety, the organisers had decided to defer the presentations from WP10-12 until Friday. In their place an outsider to the LOD2 project, Allan Hanbury, gave a lightning talk on a slightly related EU project, Khresmoi, which aims to provide useful searching tools for large medical databases.
Thus concluded the day’s business, and we all dispersed to our various hotels. The OKF contingent, along with TenForce, are staying in one just a couple of roads away. Crossing a road is hazardous in Vienna, because there are sometimes cars parked in what seems to be the middle of the road. You keep half-expecting some lights to change and the cars to zoom off. In fact they are parked between the road and the tramlines, along which long and elderly trams snake through the city.
In the evening, everyone from the day’s meetings reconvened and was whisked away on one such tram to an outlying district of the city, for an evening at a (more or less) traditional Austrian Heurige, an untranslatable type of wine tavern. A true Heurige, Helmut from the Semantic Web Company explains to me as we hurtle along, is run by a vineyard, and gives people an opportunity to sample its new year’s crop of wine. (‘Heurige’ in Austrian German literally means ‘this year’.) It will have a licence to open for only 2 or 3 weeks a year, and when open will hang out a spray of branches and a lamp to signify the fact.
There is still some wine grown in Vienna, I am told, but most of the Viennese Heurigen are open all year round and are really just restaurants. But they recreate the atmosphere of the real thing. Patrons are served wine and a mixed plate of traditional local foods, which, for readers not familiar with Austrian cuisine, mainly consist of various kinds of sausage, potato and cabbage. They are delicious, and so is the Apfelstrudel that comes along later. The only thing I cannot recommend in Vienna is the tea. When will these foreigners learn that it must be made with boiling hot water?
To follow blogs from the LOD2 plenary, see the blog parade from the project blog.