Many of you will be familiar with the now ubiquitous Linked Open Data cloud diagram, maintained by Richard Cyganiak. The diagram illustrates efforts to link together many different data sources, from the CIA World Factbook to DBpedia, a structured database of information extracted from Wikipedia. It looks like this:

We’re very pleased that the diagram’s maintainers, Anja Jentzsch, Richard Cyganiak, and Chris Bizer, have decided to use CKAN to maintain a registry of information about the datasets, from which the diagram will be automatically updated. They have put out a call for up-to-date information about datasets included in the diagram, open until next Wednesday, 8th September.

From their announcement:

We are in the process of drawing the next version of the LOD cloud diagram. This time it is likely to contain around 180 datasets altogether having a size of around 20 billion RDF triples.

For drawing the next version of the LOD cloud, we have started to collect meta-information about the datasets to be included on CKAN, a registry of open data and content packages provided by the Open Knowledge Foundation.

The list of datasets about which we have already collected information can be found here:

In addition to basic meta-information about a dataset such as its size and the number of links pointing at other datasets, we also collect additional meta-information about the license of the dataset, alternative access options like SPARQL endpoints or dataset dumps, and whether there exists a voiD description of the dataset or a Semantic Web Sitemap.

So if your dataset is not listed yet and you want to have it included into the next version of the LOD cloud, please add it to CKAN until next Wednesday (September 8th, 2010).

Also, if we have collected wrong information about your dataset or if your dataset is only partially described up till now, it would be great if you could add the missing information.

Guidelines about how to add datasets to CKAN as well as about the tags that we are using to annotate the datasets are found here:

We thank all contributors in advance for their input and help, which hopefully will allow us to draw the next version of the LOD cloud as accurate as possible.

We’re delighted to see that the data.gov.uk folks have released the code for their CKAN Drupal module. As many will know, the OKF’s CKAN powers data.gov.uk as well as over a dozen other data catalogues around the world.

From the blog post:

As part of the government’s ongoing work around transparency, today we are releasing some of the custom software code we’ve developed – a CKAN module for Drupal. This is available for anyone to review, use, or modify. We’re excited to see how developers and colleagues across the world put this work to good use in their own applications and projects.

The code itself is attached to this blog post as a tar.gz file and contains one main package with two sub-packages within. This code release allows content to be synched from CKAN into Drupal. CKAN is the system we use as our “back end” to store information about all the data government has released. Drupal is a system to publish web content, and serves as our “front end” through which people can find our datasets and comment on them.

The main CKANPackage code creates a Drupal custom content type to represent data in the same way as CKAN. The first sub-package is the CKANImporter which imports packages from CKAN into Drupal and allows this to take place as a one-off batch import or as an update to the latest changes since a specified time. The second sub-package is CKANDatagovuk which correlates fields in CKAN with Drupal hooks.

The code release includes comments in the files to assist users with the functionality. You can of course contact us should you have any questions.

We’re delighted to announce a meetup on Data Journalism in Berlin in September organised by the Open Knowledge Foundation and Georgi Kobilarov at Uberblic Labs. Details are as follows:

  • When? 1st September 2010
  • Where? Fjord Office, Friedrichstrasse 210, Berlin
  • Register? You can register here!

Speakers will include:

  • Martin Belam, The Guardian
  • Jonathan Gray, The Open Knowledge Foundation
  • Christian Heise, ZEIT Online
  • Gerd Kamp, Deutsche Presse Agentur
  • Georgi Kobilarov, Uberblic Labs
  • John O’Donovan, BBC News
  • Tom Scott, BBC Earth
  • Ole Wintermann, Bertelsmann Foundation

From the blurb:

Data Journalism and the new and exciting possibilities that the Web of Data opens up for creators and consumers of news and media online will be the topic of this first meetup.

We have a brilliant lineup of speakers from media organisations like the BBC, The Guardian, the Deutsche Presse Agentur, the Bertelsmann Foundation coming to Berlin and talking about data journalism and the latest developments and projects in this field, and our friends from ZEIT Online will join the discussion.

The event takes place at the office of our friends at Fjord in the heart of Berlin. Starting at 2pm, you’ll hear talks followed by a panel discussion and an open space for working groups, and when the official programme ends at 7pm we’ll of course have drinks with all of you.

Language of all talks at the event will be English, but don’t be surprised to hear a bit of German here and there in conversations.

Announcement below — voting ends 27 August

Raw Data Now: Building an Open Data Ecosystem

Rufus Pollock and Jordan Hatcher of the Open Knowledge Foundation have submitted a proposal for a workshop highlighting the great work of the Open Knowledge Foundation, including Where Does My Money Go?, Open Shakespeare, CKAN, the Open Definition, and Open Data Commons (among many many more great projects!). The panel will cover:
  • What legal rights apply to databases?
  • What tools are available to developers and data publishers involved in public sector data?
  • How do I encourage public sector institutions to release data?
  • If I’m in the public sector, what’s the best way for me to release my data?
  • Why is open data different from open source or open content?
Voting is a key part of the SXSW selection process, so please vote for our panel.

===

Also, a plug for The Itinerant Poetry Librarian’s panel, which will very likely be of interest to OKFN folks into open bibliographic data and all things librarian: “They stopped coming?”: Librarians Don’t Cry They Re-View

We are pleased to announce a one day workshop on Open Bibliographic Data and the Public Domain. Details are as follows:

Here’s the blurb:

This one day workshop will focus on open bibliographic data and the public domain. In particular it will address questions like:

  • What is the role of freely reusable metadata about works in calculating which works are in the public domain in different jurisdictions?
  • How can we use existing sources of open data to automate the calculation of which works are in the public domain?
  • What data sharing policies in libraries and cultural heritage institutions would support automated calculation of copyright status?
  • How can we connect databases of information about public domain works with digital copies of public domain works from different sources (Wikipedia, Europeana, Project Gutenberg, …)?
  • How can we map existing sources of public domain works in different countries/languages more effectively?

The day will be very much focused on productive discussion and ‘getting things done’ — rather than presentations. Sessions will include policy discussions about public domain calculation under the auspices of Communia (a European thematic network on the digital public domain), as well as hands on coding sessions run by the Open Knowledge Foundation. The workshop is a satellite event to the 3rd Free Culture Research Conference on 8-9th October.

If you would like to participate, you can register at:

If you have ideas for things you’d like to discuss, please add them at:

To take part in discussion on these topics before and after this event, please join:

The following article was originally published on the Guardian Datablog by Lisa Evans, the Lead Researcher on the OKF’s Where Does My Money Go? project.

We thought we were getting everything with the COINS release. In fact we were missing the best part of all: the Whole of Government Accounts.

Before he became chancellor George Osborne promised:

We will publish, shortly after coming to office, the Treasury’s COINS database that reports several thousand programme spending items in a consistent format across departments

Sure enough, in June, with George as our brand new chancellor, we saw the publication of COINS.

I’d been investigating COINS (the Combined Online Information System) prior to its release and was expecting great things.

Like many others, we thought we would get a very detailed picture of the financial health of every government-funded body, because as the Treasury’s guide to COINS (pdf) explained: COINS is used for “the preparation of Whole of Government Accounts (WGA)”.

Now, I knew that the Whole of Government Accounts (WGA) requires each public authority to complete a detailed record of what they own and what they have bought.

You can take a look at the form each authority has to fill out; it is called an L-pack.

You’ll see the kind of information the WGA gathers, details about bank accounts, shares owned and services bought. There were 553 Local Authorities and 320 NHS trusts and foundations who completed this form last year - that’s a lot of data.

On top of that, each central government body has to fill out a C-pack. Once complete, all the L-Packs and C-Packs are uploaded to COINS.

Then, on COINS, the completed records are audited. The auditing involves the WGA team checking that each exchange of money between departments is accurately recorded by both parties.

Auditing, I believe, means “matching up” buyers and providers of services and goods. For example, a perfect match would be if Barnet Council records the purchase of an item costing £5.5m from Enfield Council, and Enfield Council records the sale of the same item at £5.5m to Barnet Council. The COINS scripts would eliminate this to zero.

However, if Barnet Council records the purchase of an item costing £5.0m from Enfield Council and Enfield Council records the sale of the item at £5.5m to Barnet Council, then COINS would eliminate £5.0m and put £0.5m into suspense. The suspense account then needs to be investigated further, to see where the mistake is. This investigation is the job of the WGA team.
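The arithmetic of the two examples above can be sketched in a few lines of Python. This only illustrates the matching rule as described here and is not the actual COINS scripts; the function name match_transaction and the shape of its return value are assumptions made for the example.

```python
from decimal import Decimal


def match_transaction(purchase, sale):
    """Return (eliminated, suspense) for one buyer/seller pair of records.

    Hypothetical helper: the amount both parties agree on is eliminated,
    and any difference is parked in suspense for later investigation.
    """
    purchase = Decimal(purchase)
    sale = Decimal(sale)
    eliminated = min(purchase, sale)
    suspense = abs(purchase - sale)
    return eliminated, suspense


# Perfect match: both councils record 5.5 (in £m), nothing is left in suspense.
perfect = match_transaction("5.5", "5.5")

# Mismatch: 5.0 is eliminated and the 0.5 difference goes into suspense.
mismatch = match_transaction("5.0", "5.5")
```

Amounts are handled as Decimal rather than float so that a difference such as 0.5 is represented exactly.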

The WGA has been running every year, for 10 years. And how many results have the public seen from the whole exercise? Exactly zero.

When COINS was published I expected to see this rich body of WGA data, but none of it was there.

So, I investigated, resulting in my request for the WGA for 2008/09.

The reply was unlike anything else I have seen. The Treasury conducted a public interest survey which consisted of a list of pros and cons for release of the WGA data. The list of pros were that the public would benefit by seeing more of the process.

Amongst the cons were:

Ministers and officials need space in which to develop policy, including space for the development of policy through an interactive process of testing and refining ideas. This process could be weakened if information was released prematurely or when proposals were not finalised, as this could lead to poorer decision-making

Overall the cons won and my request was rejected.

There are no plans to publish any of the 10 years’ worth of “dry run” data from the WGA. But the 2009/10 data will be published in spring 2011, and I’m told this report will have a level of detail similar to company accounts.

So, when we hear about greater transparency on public spending, it is important to bear in mind that we have made great progress but we don’t have the full picture yet.

About Lisa Evans

Lisa Evans is Lead Researcher on Where Does My Money Go?, an independent, non-partisan project run by the Open Knowledge Foundation which makes government spending and finances understandable to the general public, showing each of us where every pound of our taxes goes.

The following guest post is from Ivan Begtin, who is a member of the Open Knowledge Foundation’s Working Group on Open Government Data.

I would like to announce a new open data project on Russian government spending…

Background

Russian Federal Law 94-FZ of 21.07.2005 declared that the Russian Federal Treasury and Russian regional procurement agencies should publish online limited but valuable information about government contracts. So since 1 January 2006 all government procurement systems have been rebuilt to be online and to publish online registries of contracts. These registries are just tables, DOC, XLS and PDF files and so on. Nothing visualized, no analytics at all, but lots of unstructured raw material.

So there is quite a bit of material available for anyone interested. But so far nobody seems to have converted this data into a public service!

The Project

This project named RosGosZatraty (Russian Government Spending) can be found at:

This project is dedicated to all Russian government spending through government contracts and presidential grants. It’s completely public: it contains all the raw data and provides details of any contract and any grant, and it includes lots of reports and other analytics, as well as a quick and simple search function.

It was initiated and launched by the Institute of Contemporary Development, a non-profit fund on whose board Russian president Dmitry Medvedev sits.

For now we have information for the years 2007-2009, which includes:

  • 137 high level government agencies (including dissolved)
  • 26 654 government bodies
  • 266 032 government suppliers
  • 1 390 704 individual contracts
  • 1306 grants

And lots of reports:

What next?

Sure, not everything is yet complete. We currently don’t have:

  • API
  • Better visualization
  • Data export
  • … and so on.

We will keep working on and improving the site.

For now it’s just a non-profit mashup project based on existing open data. But I hope that later we will be able to make it more semantic web ready.

I personally represent the small software development company behind this project and, as an e-Gov expert and public spending specialist, I am its project manager. So if you have any questions, feel free to get in touch.

P.S. We don’t yet have any English pages on the site, so if you don’t know Russian then Google Translate can help you to find out more.

This is a post by Lisa Evans, lead researcher on Where Does My Money Go?.

When I saw the COINS data that was published at the beginning of June, I suspected there was something missing.

I had been reading about the Whole of Government Accounts (WGA) — a project to provide a really good detailed overview of government finances (more information in this previous post).

I was therefore expecting to see the local council assets and accruals data of the sort that is recorded in the L-packs as well as central government spending captured annually in the C-packs. But it wasn’t there.

I conducted some more investigation, speaking to the team at the Whole of Government Accounts. Their team is really quite small (only two people in the Communities and Local Government WGA team and five or six people in the Treasury), but they do an amazing job of documenting all public assets and accruals. What is more, they have been running it every year for 10 years, each year gathering a detailed picture of local authorities’ financial health.

Anyway, based on my existing knowledge and my conversations with the WGA team and others, I can now confidently confirm that the WGA is completely absent from the COINS data that was released. This means there is no reporting of local authorities’ spending in COINS. A report from the WGA is planned for spring next year. But I believe this will be at a very high level (the sum of the whole government’s assets and accruals), not the details of individual authorities and departments.

I have requested the 2008/2009 WGA data, with the Department of Health and the Department of Defence data removed, as I believe these two departments may have failed the relevant audit.

Now we’ll wait to see what happens.

The following guest post is from Chris Taggart of OpenlyLocal, who advises the Where Does My Money Go? project on local spending data, and is a member of the Open Knowledge Foundation’s Working Group on Open Government Data. This is a cross-post — Chris’ original post here.

Since the coalition announced that councils would have to publish all spending over £500 by January next year, there has been palpable excitement in the open data and transparency community at the thought of what could be done with it (not least understanding and improving the balance of councils’ relationships with suppliers).

Secretary of State for Communities & Local Government Eric Pickles followed this up with a letter to councils saying, “I don’t expect everyone to do it right first time, but I do expect everyone to do it.” Great. Raw Data Now, in the words of Tim Berners-Lee.

Now, however, with barely the ink dry, the reality is looking not just a bit messy, a bit of a first attempt (which would be fine and understandable given the timescale), but Not Open At All.

As a member of the Local Public Data Panel, I’ve worked with other members and councils to draw up some clear and pragmatic draft guidelines for publishing the local spending data. We’ve had a great response in the comments and in conversations, and together with some lessons I learned from importing the existing data, I think these will allow us to do a second draft soon.

One thing we weren’t explicit in that first draft – because we took it for granted – was that the data had to be open, and free for reuse by all. Equality of access by all is essential.

So I’ve been watching the activities of Spikes Cavell’s SpotlightOnSpend with some wariness and now those fears seem to have been borne out, as the company seems to set out not to consume the open data that councils are publishing, but to control this data.

The idea seems to be that councils should give Spikes Cavell privileged access to their detailed invoice information, which the company then adds to its proprietary and definitely non-open database, and then publishes an extract of this information on the SpotlightOnSpend website. Exactly what information they get, and under what terms, isn’t disclosed anywhere.

The website’s got most of the buzzwords: transparency, accessible, efficiency. It’s even got a friendly .org.uk domain. If that’s not enough to convince councils, liberally sprinkled around the site is an apparent endorsement from the Secretary of State himself:

I’m really excited about the opportunities of transparency and it’s something this government is utterly committed to. spotlightonspend demonstrates that, when innovative businesses work with far-sighted public bodies, we can inform the public, reduce costs and improve democracy both locally and nationally.
Eric Pickles
Secretary of State
Communities and Local Government

However, when you go to the data and click on the download link this is what you get:

Note the “This data is for your personal use only” (not to mention the fact that the use of a captcha to screen out machines downloading the data means, er, you can’t use machines to automatically download the data, which is sort of the point of publishing the data in a machine-readable way).

Never mind, surely you can just head over to the council’s website and download the data from there? No chance. This is what you get on the Guildford website:

You can search and view this financial data using a new Spotlight on Spend national website. Just follow the link found in the offsite links section of this page.

What about Mole Valley Council:

This data is now available on the spotlight on spend website. You can look at categories and individual suppliers to see how much has been spent in each area or you can download all the data to see individual transactions.

But what about Windsor & Maidenhead, who are closely affiliated with the project, and who are publishing data on their website? Well, download the data from SpotlightOnSpend and it’s rather different from the published data. Different in that it is missing core data that is in W&M published data (e.g. categories), and that includes data that isn’t in the published data (e.g. data from 2008).

So the upshot seems to be this: councils hand over all their valuable financial data to a company which aggregates it for its own purposes, and, er, doesn’t open up the data, shooting down all those goals of mashing up the data and using the community to analyse it, and undermining much of the good work that’s been done.

It’s worth linking here to the Open Knowledge Foundation’s draft guidelines on reporting of Government Finances (disclosure: I helped draw them up), of which the first point is ‘Make data openly available using an explicit license’. And let me be absolutely clear here: this is not open data, not a desirable approach, will not achieve the results of transparency or of equality of access, and is not good for the public sector.

I’m hoping this is a matter of councils and the Secretary of State not understanding the process and implications of giving this data to Spikes Cavell on a privileged basis. If not, perhaps it could be the first test case for the newly set up Public Sector Transparency Board to rule on.

Some months ago we started looking at how we might use an RDF store instead of a SQL database behind data-driven websites, of which OKF has several. The reasons have to do with making the data reusable in a better way than ad-hoc JSON APIs.

As we tend to program in Python and use the Pylons framework, this led us to consider some alternatives like RDFAlchemy and SuRF. Both of those build on top of RDFLib and try to present a programming interface reminiscent of SQL-ORM middleware like SQLObject and SQLAlchemy. They assume a single database-like storage for the RDF data and in some cases make some assumptions about the form of the data itself.

One important thing that they do not directly handle is customised indexing — and triplestores vary widely in terms of how well certain types of queries will perform, if they are supported at all. Overall, using RDFAlchemy or SuRF didn’t seem like much of a gain over using RDFLib directly. So we started writing our own middleware which we’ve named ORDF (OKFN RDF Library).

Code and documentation is at http://ordf.org/

ORDF Features and Structure

Key features of ORDF:

  • Open-source and Python-based (builds on RDFLib)
  • Clean separation of functionality such as storage, indexing, web frontend
  • Easy pluggability of different storage and indexing engines (all those supported by RDFLib, 4store, simple-disk using pairtree etc)
  • Extensibility via messaging (we use RabbitMQ)
  • Built-in RDF “revisioning”: every set of changes to the RDF store is kept in a “changeset”. This enables provenance, roll-back and change reporting “out-of-the-box”

To illustrate how this works, here’s a diagram showing a write operation in ORDF using most of the features described above. Below we go into detail describing how it all works.

Write operations in ORDF diagram

Forward Compatibility with RDFLib

The ORDF middleware solves several problems. The first, and most mundane, is to paper over the significant API changes between versions 2.4.2 and 3.0.0 of RDFLib. The RDFLib developers moved modules around, and this tends to break things because statements like from rdflib import Graph need to be changed to from rdflib.graph import Graph. So the first thing ORDF does is let you write from ordf.graph import Graph, which will work no matter which version of RDFLib you have installed. This is important because the changes in 3.0.0 go deeper than the renaming of modules: some software, such as the FuXi reasoner and anything that uses the SPARQL query language, will not work well with the new version. In effect we have a forward compatibility layer, so software developed with ORDF should continue to work once the newer RDFLib stabilises.
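A compatibility layer of this kind usually boils down to trying one import location and falling back to another. The following is a minimal sketch of that pattern and not ORDF’s actual code; the helper name import_first is hypothetical, and the demonstration uses standard-library modules so that it runs without RDFLib installed.

```python
import importlib


def import_first(candidates):
    """Return the first attribute importable from a list of (module, name) pairs."""
    for module_name, attr in candidates:
        try:
            module = importlib.import_module(module_name)
            return getattr(module, attr)
        except (ImportError, AttributeError):
            # This location does not exist in the installed version; try the next.
            continue
    raise ImportError("none of the candidates could be imported: %r" % (candidates,))


# With RDFLib installed, a shim module could try the newer location first
# and fall back to the older one:
#   Graph = import_first([("rdflib.graph", "Graph"), ("rdflib", "Graph")])

# Demonstration with standard-library modules (the first candidate is
# deliberately a module that does not exist):
OrderedDict = import_first([("no_such_module_xyz", "OrderedDict"),
                            ("collections", "OrderedDict")])
```

Client code then imports the name from the shim module only, and never needs to know which version of the underlying library is present.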

Pylons Support

Only slightly less mundane than the previous, ORDF includes some code that should be common amongst web applications using the Pylons framework for accessing the ORDF facilities. This means controllers for obtaining copies of graphs in various serialisations and for implementing a SPARQL endpoint.

Indices and Message Queues

Then we have indexes and queueing. Named graphs, the moral equivalent of the objects from the SQL-ORM world, are stored in more than one place to facilitate different kinds of queries:

  • The pairtree filesystem index, which is good for retrieving a graph if you know its name and simply stores it as a file in a specialised directory hierarchy on the disk. This is not good for querying but is pretty much future-proof — at least as long as it is possible to read a file from the disk.
  • An RDFLib-supported store, suitable for small to medium-sized datasets, which does not depend on any external software and allows SPARQL queries over the data for graph traversal operations.
  • The 4store quad-store, which fulfils a similar role for larger datasets, allowing SPARQL queries, but which requires an additional piece of software to be running (possibly on a cluster for very large datasets) and is somewhat harder to set up.
  • A Xapian full-text search index, which allows free-form queries over text strings, something that no triplestore does very well.
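To give a feel for the “specialised directory hierarchy” used by the pairtree index in the list above, here is a simplified sketch of how a graph name can be turned into a directory path. It covers only the single-character substitutions and the splitting into two-character segments; the hex-escaping of other special characters that a full pairtree implementation performs is omitted, and the function name pairtree_path is an assumption made for the example.

```python
def pairtree_path(identifier):
    """Map an identifier to a pairtree-style relative directory path.

    Simplified: only '/', ':' and '.' are substituted; other special
    characters are left as they are.
    """
    cleaned = identifier.replace("/", "=").replace(":", "+").replace(".", ",")
    # Split the cleaned string into two-character directory names.
    pairs = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return "/".join(pairs)


path = pairtree_path("ark:/13030/xt12t3")
```

Because the path is derived from the name alone, retrieving a graph needs no database at all, only the ability to read a file from disk.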

There are plans for further storage back-ends, specifically using Solr as well as other triplestores such as Jena and Virtuoso.

A key element of this indexing architecture is that it is distributed. Whilst you can configure all of these index types into a single running program — and it is common to do so for development — in reality some indexing operations are expensive and you don’t necessarily want the client program sitting and waiting while they are done synchronously. So there is also a pseudo-index that sends graphs to a rabbitmq messaging server and for each index a daemon is run that listens to a queue on a fan-out exchange.
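The fan-out arrangement can be illustrated with a toy, in-process model. This sketch uses Python’s standard queue module as a stand-in and does not use the RabbitMQ API; the class names FanoutExchange and RecordingIndex are hypothetical.

```python
import queue


class FanoutExchange:
    """Toy in-process stand-in for a fan-out exchange."""

    def __init__(self):
        self._queues = []

    def bind(self):
        """Create and return a new queue that receives every published message."""
        q = queue.Queue()
        self._queues.append(q)
        return q

    def publish(self, message):
        # Fan-out: every bound queue gets its own copy of the message.
        for q in self._queues:
            q.put(message)


class RecordingIndex:
    """Hypothetical index that just remembers which graphs it has seen."""

    def __init__(self, name, exchange):
        self.name = name
        self.seen = []
        self._queue = exchange.bind()

    def drain(self):
        # A real daemon would block on the queue; here we process what is waiting.
        while not self._queue.empty():
            self.seen.append(self._queue.get())


exchange = FanoutExchange()
indexes = [RecordingIndex(n, exchange) for n in ("pairtree", "4store", "xapian")]

# The client only publishes and returns immediately; indexing happens later.
exchange.publish("http://example.org/graph/1")

for index in indexes:
    index.drain()
```

The point of the design is that the client’s cost is a single publish call, however many indexes are listening and however slow each one is.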

Introducing a layer of message queueing also makes it possible to support inferencing, the process of deriving new information or statements from the given data. This is an operation that is considerably more computationally expensive than mere indexing. It is accomplished by using two queues. When a graph is saved, it first gets put on a queue conventionally called reason. The FuXi reasoner (the kind of system known in the literature as a production rule or forward-chaining system) listens to that queue, computes some new statements, and then puts the resulting augmented graph onto a queue called index, and thence to the indexers.
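The two-queue flow can be sketched as follows. The reasoner here is a deliberately tiny forward-chaining loop over plain tuples rather than FuXi, the queues are simple in-memory deques, and the rule and the ex: names are made up for illustration.

```python
from collections import deque


def book_is_work(triples):
    """Hypothetical rule: everything typed ex:Book is also an ex:Work."""
    return {(s, "rdf:type", "ex:Work")
            for (s, p, o) in triples
            if p == "rdf:type" and o == "ex:Book"}


def forward_chain(triples, rules):
    """Apply the rules repeatedly until no new statements appear."""
    closure = set(triples)
    while True:
        new = set()
        for rule in rules:
            new |= rule(closure)
        if new <= closure:
            return closure
        closure |= new


reason_queue = deque()
index_queue = deque()

# Saving a graph puts it on the "reason" queue first.
reason_queue.append({("ex:hamlet", "rdf:type", "ex:Book")})

# The reasoner consumes from "reason" and publishes the augmented graph to "index".
while reason_queue:
    graph = reason_queue.popleft()
    index_queue.append(forward_chain(graph, [book_is_work]))
```

The indexers only ever see graphs from the second queue, so they index the inferred statements along with the original ones.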

Ontology Logic

Until recently there was only one ontology-specific behaviour coded into ORDF and that was the ChangeSet. It is still important. It provides low level, per-statement, provenance and change history information. This is built into the system. A save operation on a graph is accomplished by obtaining a change context and adding one or more graphs to it, then committing the changes. Before sending the graphs out for indexing or reasoning or queueing or whatnot, a copy of the previous version of the graphs is obtained (usually from pairtree storage) and the differences are calculated. These differences along with some metadata make up the ChangeSet which is saved and indexed along with the graphs themselves. This accomplishes what we call Syntactic Provenance because it operates at the level of individual statements.
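The core of the difference calculation is just set arithmetic over statements. The sketch below models a graph as a Python set of triples and a changeset as a dictionary; it is an illustration of the idea rather than ORDF’s ChangeSet implementation, and the function names and metadata fields are assumptions.

```python
def make_changeset(previous, current, author):
    """Describe how to get from `previous` to `current` (both sets of triples)."""
    return {
        "author": author,                # metadata recorded with the change
        "removed": previous - current,   # statements that were deleted
        "added": current - previous,     # statements that are new
    }


def roll_back(current, changeset):
    """Undo a changeset: drop what it added and restore what it removed."""
    return (current - changeset["added"]) | changeset["removed"]


old = {("ex:book/1", "dc:title", "Hamlet"),
       ("ex:book/1", "dc:date", "1603")}
new = {("ex:book/1", "dc:title", "Hamlet"),
       ("ex:book/1", "dc:date", "1604")}

changeset = make_changeset(old, new, author="ex:user/alice")
restored = roll_back(new, changeset)
```

Unchanged statements appear in neither set, so a changeset stays small even when the graph is large.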

Lately several more modules have been added to support other vocabularies. The work on the Bibliographica project led to the introduction of the OPMV vocabulary for Semantic Provenance. This is used to describe the way a data record from an external source (in this case MARC data) is transformed by a Process into a collection of statements or graph, and the way other graphs are derived from this first one. This is a distinct problem from Syntactic Provenance since it deals with the relationships between entities or objects and not simply add/remove operations on their attributes.

Another addition has been the ORE Aggregation vocabulary, which is also used in Bibliographica. In our system, since distinct entities or objects are stored as named graphs, we want to avoid having data duplicated in places where it should not be. For example, a book might have an author, and users are ultimately interested in seeing the author’s name when they are looking at data about the book. But we do not want to store the author’s details in the book’s graph, because that means that if someone notices and corrects an error, the correction must be made both in the author’s graph and in the book’s. Better to keep such changes in one place. So what we actually do is create an aggregation. The aggregation contains (points at, aggregates) the book and author graphs and also includes a pointer to some information on how to display it.
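A small sketch shows why pointing at graphs rather than copying them helps. Here the store is a plain dictionary from graph name to a set of triples, and the resolve function is hypothetical; only the ore:aggregates property name comes from the ORE vocabulary, and the display pointer is left out.

```python
store = {
    "ex:book/1": {("ex:book/1", "dc:title", "Hamlet"),
                  ("ex:book/1", "dc:creator", "ex:author/1")},
    # Note the misspelt first name, which is corrected further down.
    "ex:author/1": {("ex:author/1", "foaf:name", "Wiliam Shakespeare")},
    "ex:aggr/1": {("ex:aggr/1", "ore:aggregates", "ex:book/1"),
                  ("ex:aggr/1", "ore:aggregates", "ex:author/1")},
}


def resolve(store, aggregation_name):
    """Merge the statements of every graph the aggregation points at."""
    merged = set()
    for (s, p, o) in store[aggregation_name]:
        if p == "ore:aggregates":
            merged |= store[o]
    return merged


# Correct the name once, in the author's own graph only.
store["ex:author/1"] = {("ex:author/1", "foaf:name", "William Shakespeare")}

# The combined view picks up the correction without the book graph changing.
view = resolve(store, "ex:aggr/1")
```

The book’s graph never held the author’s name, so there is no second copy that could go stale.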

More to come on concrete implementation of ontology-specific behaviour, MARC processing and Aggregations in a follow-up post on Bibliographica.

Next Steps

There is much more ontology-specific work to be done. First on the list is an implementation in Python of the Fresnel vocabulary that is used to describe how to display RDF data in HTML. It is more a set of instructions than a templating language and we have already written an implementation in JavaScript. It is crucial, however, that websites built with ORDF do not rely on JavaScript for presentation and we should rely on custom templates as little as possible.

ORDF is now stable enough to start using in other projects, at least within the OKF family. A first and fairly easy case will be updating the RDF interface to CKAN to use it — fitting as ORDF actually started out as a refactor of that very codebase.