by admin

Open Data – Louder Voices?

June 20, 2011 in External, OKCon

The following guest post is by Michael Gurstein from the Centre for Community Informatics Research, Development and Training in Vancouver. Michael will be joining us at OKCon 2011 for his talk Open Data – Louder Voices?

This post follows on from earlier posts on Michael’s blog here, here, and here.

There is a great deal of celebration these days about the shift in many governmental jurisdictions towards Open Data, and there is much to celebrate in this.

However, some caution is needed in how this is being approached. In particular, considerable attention must be given to making sure that the use/user perspective is not lost in the rush to design pretty apps for mobile platforms, apps that satisfy the cravings of the information empowered for even more data and for the personal empowerment that goes along with it.

Recognizing that at least for now most Open Data initiatives are based on accessing and using this data via the Internet, here are a few notes suggesting caution:

Access

According to Internet World Statistics, only 30% of the people in the world have Internet access (10% in Africa).

Hardware

Some 20% of the world’s population owns a computer.

So how many people in the world are able to directly access “open data”?

Software

500,000,000 use Microsoft Office

Can we take this as a surrogate for the number of those who are able to actively manage “open data”?

Content

While the world adult literacy rate is 83%, ranging from 63% in Africa to 99% in Europe, it is estimated that “nearly a quarter of 16 to 65-year-olds in the world’s richest countries are functionally illiterate”.

This suggests a global level of “functional literacy” (defined by the OECD as the ability to complete real-life tasks, such as reading and understanding brochures, train timetables, road maps, and simple instructions for household appliances) of below 50%.

Understanding

The average readability level of American state and federal websites is at the 11th grade, and yet half of Americans read at the eighth grade level or lower, according to this report.

An important question we therefore need to ask ourselves is what would be the proportion of those in various jurisdictions able to read/comprehend various “open data” sites/initiatives?

Use

As Tobias Escher observed in his analysis of users and usage of the WriteToThem online citizen democracy tool:

The overall demographics of these users extend the traditional biases in political participation: compared to the profile of British Internet users, WriteToThem users are twice as likely to have a higher degree and are twice as often on a higher income (more than £37,500 per year). Apart from this, WriteToThem attracts more male users and those 45 years and older, while Internet users younger than 35 are less likely to use the site. In particular, teenagers (<18 years) stay largely out of reach – they account for only one in a hundred users. … In part the reported biases mirror traditional patterns of engagement in this particular form of political participation as comparative data show that people who have contacted a politician via any means are similarly biased towards men or high-income groups. At the same time WriteToThem extends some of these already present biases, for example the overrepresentation of people with higher education and those in the 55-64 age bracket. [However] Low-income groups including the unemployed are well represented, a sign of success in reaching out to the poorer citizens and not just a side effect of a young people or student involvement.

This suggests that even for the most basic “open government” site there is a direct relationship between use and education.

Governance

No, we are not party political, and this project is neither left nor right wing. It is about building useful digital tools for anyone who wants to use them. And unlike most think tanks that say they’re non-partisan, we really are – none of that ‘It’s not official, but everybody knows they’re really close to party X’ nonsense here.

From the My Society website.

WriteToThem.com is a website that allows everyone to find out who their elected representatives are and to send them messages. These goals are to establish a dialogue between constituent and representative as well as to let representatives focus on genuine emails (and not on sorting out spam) by preventing mass emailing of copy-and-paste letters.

From Tobias Escher’s report on WriteToThem.

TheyWorkForYou is a website, launched in 2004, that provides detailed information on members of parliament (including their voting behaviour and expenses) as well as parliamentary proceedings such as debates … to allow fact checking (e.g. give access to source evidence) and make MPs feel accountable; to reward truthful MPs, to allow fair judgement of MPs on basis of what they do.

From Tobias Escher’s report on TheyWorkForYou.

To take the TheyWorkForYou.com and WriteToThem.com sites as broadly representative of (at least) an important genre of “open data/open government” initiatives, the implicit model of political behaviour that is represented here is one of an individual interacting directly with the individual representative. There is no mention of parties (whose function of course is to integrate and frame the actions of individual representatives) nor is there an opportunity for individuals to aggregate their responses to individual representatives (meet up) and thus through aggregation amplify their voices.

In the absence of this aggregation the capacity of the individual to act in any other manner than as either an individual complainer or supplicant would appear to be very small.

Equally, in the absence of linking individual actions by representatives into parties and their overall policy responsibilities there is an implicit assumption that individual representatives are in fact “accountable” for their political actions and capable of independent political action in their respective spheres.

Finally the given demographics of the users of these sites should be noted i.e. they are those who would otherwise already be influential—older, richer, more likely to be male, better educated.

So what does this all tell us?

1. The vast majority of people in the world, even in the most Developed Countries, are unable for a variety of reasons to benefit from “open data/open government”.

  1. Attention must be paid to ensuring the Internet access, computer access, literacy, readability of websites etc. that would make “open data/open government” more accessible/usable to the general population.

  2. The absence of such attention as a component of “open data/open government” means that additional opportunities for accessing and using government information are for the most part simply a means to further enable/empower those already well provided by society with the means to influence government: the educated, the well off, older persons, males. Making the already louder voices even louder.

Given the clear advantage that those with the already louder voices have in making use of facilities like TheyWorkForYou.com and WriteToThem.com, what will the net effect be of these kinds of initiatives? Certainly the opening up of information on the actions of representatives and facilitating means for communicating with representatives should extend the opportunities for democratic engagement. However, whether they do this by making democratic participation more inclusive or simply by reinforcing existing patterns of influence rooted in long-standing structures of privilege and position seems still to be an open question.

How relevant will these opportunities be in responding to the needs of the excluded and the marginalized? And overall, what measures are in place to ensure that those who have otherwise been excluded are not in fact further excluded, finding their exclusion reinforced in this new data environment? How will open data and open government respond to the needs of, and open up opportunities for, the urban and rural poor, indigenous people in both developed and particularly developing countries, the landless and the migrants? Recognizing that there is a risk, is sufficient attention being given to developing and implementing measures that might ensure a balance, that is, to facilitating the effective use of the data and networking resources that are being made available through training, design, facilitation, and where necessary direct intervention? In many cases it will be precisely those with the most to gain from access to open data who will have the least capability of gaining access and obtaining the means to make effective use of this data. What responsibility do those who are making the data available have for ensuring the broadest and most inclusive base, not only for access but for the opportunity for effective use of these significant new resources, and for ensuring a balance between the louder and the weaker voices?

See the OKCon programme here

You can register for OKCon here

by admin

Open Biblio Principles Announced

January 24, 2011 in Bibliographic, News, Open Data, Open Definition, WG Open Bibliographic Data

The following post is by Mark McGillivrary, a member of the Open Knowledge Foundation Working Group on Open Bibliographic Data.

Last week the Open Biblio Principles were launched by the Open Knowledge Foundation’s Working Group on Open Bibliographic Data. The principles are the product of six months of development and discussion within the working group and the wider bibliographic community:

Producers of bibliographic data such as libraries, publishers, universities, scholars or social reference management communities have an important role in supporting the advance of humanity’s knowledge. For society to reap the full benefits from bibliographic endeavours, it is imperative that bibliographic data be made open — that is available for anyone to use and re-use freely for any purpose.

As this makes clear, the principles have a simple message: make bibliographic data open data as defined by the Open Definition (http://opendefinition.org/). Specifically, there are 4 core principles:

  1. When publishing bibliographic data make an explicit and robust license statement.
  2. Use a recognized waiver or license that is appropriate for data.
  3. If you want your data to be effectively used and added to by others it should be open as defined by the Open Definition (http://opendefinition.org/) – in particular non-commercial and other restrictive clauses should not be used.
  4. Where possible, explicitly place bibliographic data in the Public Domain via PDDL or CC0.

You can read the full version of the principles at: http://openbiblio.net/principles

And, perhaps even more importantly, you can endorse them: http://openbiblio.net/principles/endorse

Please help us spread the word, and the links, to individuals and organisations across the academic, library and publisher community.

Lastly, we are also working on alternative language versions so if you are interested in doing a translation please leave a comment or email mark [dot] macgillivrary [at] okfn [dot] org.

by admin

Beginnings of an Object Description Mapper

August 21, 2010 in Uncategorized

The analogue to an Object-Relational Mapper for RDF. Helping to make OWL Description Logic accessible from Python in a way that will seem familiar to people who are accustomed to things like SQLAlchemy and Django.

by admin

About Inferencing

August 2, 2010 in Uncategorized

Inferencing, or machine reasoning, has a slightly unsavoury reputation, perhaps stemming from the failure of Strong AI and its association with science fiction. This is a bit unfortunate, and it could be argued that it has led Semantic Web technologies to be underdeveloped.

With the Semantic Web and RDF we are concerned with simple statements, or assertions. When humans make statements they generally rely on a large amount of background knowledge and contextual information to make their meaning clear without it being explicit. For example, if I say, “Mary had a little lamb,” it is unnecessary to explain that Mary is a person, Mary is female, a lamb is a young sheep, a sheep is a kind of quadrupedal animal or to digress in a discussion of what it means to “have” something, to what extent notions of ownership can extend to animals, or even the idea of time, past, present and future.

If we transcribe this first line of the nursery rhyme into the Notation 3 language so that it can be ingested by a computer, we might get,

    Mary had [ a Lamb; size little ].

And because this transcription has to be done manually by a human, unless we can invent some very good Natural Language Processing software, this is probably the most we can expect someone to do before it begins to get very tedious. We certainly don’t want to have to teach the computer about background facts more than once, but we can imagine that we have some set of background information at our disposal; OpenCyc and SUMO can help here. Indeed, OpenCyc can teach us that a lamb is a young sheep, a domesticated animal, a ruminant, a quadruped, a terrestrial organism, etc.

It is natural to want to be able to ask simple questions such as, “what type of animal did Mary have” which are actually quite easy to express in a query language like SPARQL,

    SELECT ?animal ?type
    WHERE {
        Mary had ?animal .
        ?animal a Animal .
        ?animal a ?type
    }

however, if this query were to be evaluated only against the facts transcribed from the nursery rhyme it would return no results. Mary had a lamb, not an animal. To get the right answer we need to introduce rules in addition to our background facts. In this case the rule we need is very simple,

    { ?x a ?class . ?class subClassOf ?superclass } => { ?x a ?superclass }

This simply says that, for all things (x), if they have a class, and that class is a subclass of some superclass, then the thing is also whatever the superclass is. So if x is a lamb and lamb is a subclass of sheep then x is a sheep. Likewise, since sheep is a subclass of animal then x is an animal.
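
The effect of this rule can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: triples are bare tuples of strings, the identifier lamb1 and the function name infer_types are made up for the example, and no RDF library or reasoner is involved.

```python
# Facts from the nursery rhyme plus two background facts, as
# (subject, predicate, object) tuples. All names are illustrative.
facts = {
    ("Mary", "had", "lamb1"),
    ("lamb1", "a", "Lamb"),
    ("Lamb", "subClassOf", "Sheep"),
    ("Sheep", "subClassOf", "Animal"),
}


def infer_types(triples):
    """Apply the subclass rule repeatedly until nothing new is derived."""
    known = set(triples)
    while True:
        derived = {
            (x, "a", superclass)
            for (x, p1, cls) in known
            if p1 == "a"
            for (sub, p2, superclass) in known
            if p2 == "subClassOf" and sub == cls
        }
        if derived <= known:
            return known
        known |= derived


closure = infer_types(facts)
# Everything the lamb is now known to be, including the inferred classes.
types_of_lamb = sorted(o for (s, p, o) in closure if s == "lamb1" and p == "a")
print(types_of_lamb)
```

After the rule has been applied, the question “what type of animal did Mary have” can be answered, because the statement that lamb1 is an Animal is now present alongside the original one.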

It is precisely this kind of situation where machine reasoning is helpful, to evaluate this type of simple rule. It is nothing spectacular, just following chains of statements made by humans to answer questions that would be obvious to a two-year-old. That said, this is just a toy example; the same principle can be used with facts and questions that are not quite so obvious. However, if the rules get much more complicated or numerous, it becomes quite a lot more computationally expensive to evaluate them.

We have a very good reasoning engine, called FuXi, that is supported in ORDF for implementing these sorts of rules. Behind the scenes it is used in searching for specific types of things in Bibliographica. One can search for publications, or articles, or books or, in some cases, chapters; a search at a higher level of granularity will return all types of results, and a search at a lower level will return only the types sought.

by admin

ORDF – the OKFN RDF Library

July 2, 2010 in Bibliographica, Technical

Some months ago we started looking at how we might use an RDF store instead of a SQL database behind data-driven websites, of which OKF has several. The reasons have to do with making the data reusable in a better way than ad-hoc JSON APIs.

As we tend to program in Python and use the Pylons framework, this led us to consider some alternatives like RDFAlchemy and SuRF. Both of those build on top of RDFLib and try to present a programming interface reminiscent of SQL-ORM middleware like SQLObject and SQLAlchemy. They assume a single database-like storage for the RDF data and in some cases make some assumptions about the form of the data itself.

One important thing that they do not directly handle is customised indexing — and triplestores vary widely in terms of how well certain types of queries will perform, if they are supported at all. Overall, using RDFAlchemy or SuRF didn’t seem like much of a gain over using RDFLib directly. So we started writing our own middleware which we’ve named ORDF (OKFN RDF Library).

Code and documentation are at http://ordf.org/

ORDF Features and Structure

Key features of ORDF:

  • Open-source and python-based (builds on RDFLib)
  • Clean separation of functionality such as storage, indexing, web frontend
  • Easy pluggability of different storage and indexing engines (all those supported by RDFLib, 4store, simple-disk using pairtree etc)
  • Extensibility via messaging (we use rabbitmq)
  • Built-in rdf “revisioning”: every set of changes to the RDF store is kept in a “changeset”. This enables provenance, roll-back, change reporting “out-of-the-box”

To illustrate how this works, here’s a diagram showing a write operation in ORDF using most of the features described above. Below we go into detail describing how it all works.

Write operations in ORDF diagram

Forward Compatibility with RDFLib

The ORDF middleware solves several problems. The first, and most mundane, is to paper over the significant API changes between versions 2.4.2 and 3.0.0 of RDFLib. RDFLib moved a number of things around, which tends to break existing code because statements like from rdflib import Graph need to be changed to from rdflib.graph import Graph. So the first thing ORDF does is let you write from ordf.graph import Graph, which will work no matter which version of RDFLib you have installed. This is important because the changes in 3.0.0 go deeper than the renaming of modules: some software, namely the FuXi reasoner and anything that uses the SPARQL query language, will not work well with the new version. In effect ORDF provides a forward compatibility layer, so that software developed with ORDF should continue to work once the newer RDFLib stabilises.
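
The general technique behind such a compatibility import can be sketched as follows. This is not ORDF’s actual code: the helper first_importable is a hypothetical name, and the demonstration uses only the standard library so that it runs whether or not RDFLib is installed.

```python
import importlib


def first_importable(candidates):
    """Return the first attribute that imports successfully.

    candidates is a list of (module_name, attribute_name) pairs, tried in order.
    """
    for module_name, attribute_name in candidates:
        try:
            module = importlib.import_module(module_name)
            return getattr(module, attribute_name)
        except (ImportError, AttributeError):
            continue
    raise ImportError("none of the candidates could be imported: %r" % (candidates,))


# For the RDFLib case described above one would try the new location first
# and fall back to the old one (left commented out, as it requires RDFLib):
# Graph = first_importable([("rdflib.graph", "Graph"), ("rdflib", "Graph")])

# Demonstration using the standard library: the first candidate does not
# exist, so the helper falls through to the second.
OrderedDict = first_importable([
    ("no_such_module", "OrderedDict"),
    ("collections", "OrderedDict"),
])
print(OrderedDict.__name__)
```

Client code then imports the name from one wrapper module and never needs to know which layout the installed library uses.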

Pylons Support

Only slightly less mundane than the previous, ORDF includes some code that should be common amongst web applications using the Pylons framework for accessing the ORDF facilities. This means controllers for obtaining copies of graphs in various serialisations and for implementing a SPARQL endpoint.

Indices and Message Queues

Then we have indexes and queueing. Named graphs, the moral equivalent of the objects from the SQL-ORM world, are stored in more than one place to facilitate different kinds of queries:

  • The pairtree filesystem index, which is good for retrieving a graph if you know its name and simply stores it as a file in a specialised directory hierarchy on the disk. This is not good for querying but is pretty much future-proof — at least as long as it is possible to read a file from the disk.
  • An rdflib supported storage, suitable for small to medium sized datasets, does not depend on any external software and allows SPARQL queries over the data for graph traversal operations
  • The 4store quad-store which fulfills a similar role for larger datasets, allowing SPARQL queries but requires an additional piece of software running (possibly on a cluster for very large datasets) and is somewhat harder to set up.
  • A xapian full-text search index, which allows free-form queries over text strings, something that no triplestore does very well.
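
To give a feel for the “specialised directory hierarchy” used by the pairtree index in the list above, here is a simplified sketch of the Pairtree convention for turning an identifier into a path. It is an assumption-laden illustration: it applies only the single-character substitutions and the two-character splitting, omits the hex-encoding of other special characters that the full convention calls for, and the function name pairtree_path is made up. The exact layout ORDF writes to disk may differ.

```python
def pairtree_path(identifier):
    """Map an identifier to a nested directory path, two characters per level."""
    # Replace characters that are awkward in file names.
    cleaned = identifier.replace("/", "=").replace(":", "+").replace(".", ",")
    # Split the cleaned identifier into two-character segments.
    segments = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return "/".join(segments)


print(pairtree_path("ark:/13030/xt12t3"))
```

Because the path is derived from the name alone, retrieving a graph needs no index lookup, which is why this kind of store suits retrieval by name but not querying.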

There are plans for further storage back-ends, specifically using Solr as well as other triplestores such as Jena and Virtuoso.

A key element of this indexing architecture is that it is distributed. Whilst you can configure all of these index types into a single running program — and it is common to do so for development — in reality some indexing operations are expensive and you don’t necessarily want the client program sitting and waiting while they are done synchronously. So there is also a pseudo-index that sends graphs to a rabbitmq messaging server and for each index a daemon is run that listens to a queue on a fan-out exchange.

Introducing a layer of message queueing also makes it possible to support inferencing, the process of deriving new information or statements from the given data. This is an operation that is considerably more computationally expensive than mere indexing. It is accomplished by using two queues. When a graph is saved, it first gets put on a queue conventionally called reason. The FuXi reasoner (the kind of system known in the literature as a production rule or forward-chaining system) listens to that queue, computes some new statements, and then puts the resulting, augmented, graph onto a queue called index and thence to the indexers.
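
The two-queue flow can be sketched with the standard library alone. This is an in-process stand-in, not the real deployment: Python’s queue.Queue takes the place of rabbitmq, a one-line function takes the place of the FuXi reasoner, and all function and variable names are hypothetical.

```python
from queue import Queue

reason_queue = Queue()  # graphs waiting for the reasoner
index_queue = Queue()   # augmented graphs waiting for the indexers


def save(graph_name, triples):
    """A saved graph goes to the reason queue first."""
    reason_queue.put((graph_name, set(triples)))


def reasoner_step(rule):
    """Take one graph, add the inferred statements, pass it on for indexing."""
    graph_name, triples = reason_queue.get()
    index_queue.put((graph_name, triples | rule(triples)))


def fan_out(indexers):
    """Deliver one message to every indexer, as a fan-out exchange would."""
    message = index_queue.get()
    for indexer in indexers:
        indexer(message)


def lambs_are_animals(triples):
    # Toy stand-in for a reasoner: every Lamb is also an Animal.
    return {(s, "a", "Animal") for (s, p, o) in triples if p == "a" and o == "Lamb"}


by_name = {}     # stand-in for a store keyed by graph name
seen_names = []  # stand-in for a second, independent index


def store_by_name(message):
    name, triples = message
    by_name[name] = triples


def record_name(message):
    seen_names.append(message[0])


save("http://example.org/lamb1", [("lamb1", "a", "Lamb")])
reasoner_step(lambs_are_animals)
fan_out([store_by_name, record_name])
print(sorted(by_name["http://example.org/lamb1"]))
```

In a real deployment each step would run in its own daemon, so the client returns as soon as the graph is on the first queue.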

Ontology Logic

Until recently there was only one ontology-specific behaviour coded into ORDF and that was the ChangeSet. It is still important. It provides low level, per-statement, provenance and change history information. This is built into the system. A save operation on a graph is accomplished by obtaining a change context and adding one or more graphs to it, then committing the changes. Before sending the graphs out for indexing or reasoning or queueing or whatnot, a copy of the previous version of the graphs is obtained (usually from pairtree storage) and the differences are calculated. These differences along with some metadata make up the ChangeSet which is saved and indexed along with the graphs themselves. This accomplishes what we call Syntactic Provenance because it operates at the level of individual statements.
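
The difference calculation at the heart of this can be sketched with set operations. This is a minimal illustration, assuming graphs are sets of triple tuples; the dictionary layout and the names compute_changeset and roll_back are hypothetical and do not reflect the ChangeSet vocabulary or ORDF’s API.

```python
def compute_changeset(previous, current, author, reason):
    """Record which statements were added and removed, plus some metadata."""
    previous, current = set(previous), set(current)
    return {
        "author": author,
        "reason": reason,
        "removed": previous - current,
        "added": current - previous,
    }


def roll_back(current, changeset):
    """Undo a changeset: drop what it added and restore what it removed."""
    return (set(current) - changeset["added"]) | changeset["removed"]


old_graph = {("book1", "title", "Twelth Night"), ("book1", "a", "Book")}
new_graph = {("book1", "title", "Twelfth Night"), ("book1", "a", "Book")}

changeset = compute_changeset(old_graph, new_graph, "admin", "fix typo in title")
print(sorted(changeset["added"]))
print(sorted(changeset["removed"]))
```

Keeping both the added and the removed statements is what makes roll-back and change reporting possible from the changesets alone.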

Lately several more modules have been added to support other vocabularies. The work on the Bibliographica project led to the introduction of the OPMV vocabulary for Semantic Provenance. This is used to describe the way a data record from an external source (in this case MARC data) is transformed by a Process into a collection of statements or graph, and the way other graphs are derived from this first one. This is a distinct problem from Syntactic Provenance since it deals with the relationships between entities or objects and not simply add/remove operations on their attributes.

Another addition has been the ORE Aggregation vocabulary which is also used in Bibliographica. In our system since distinct entities or objects are stored as named graphs, we want to avoid having data duplicated in places where it should not be. For example, a book might have an author and users are ultimately interested in seeing the author’s name when they are looking at data about the book. But we do not want to store the author’s details in the book’s graph because that means that if someone notices and corrects an error the error must be corrected both in the author’s graph and their book’s. Better to keep such changes in one place. So what we actually do is create an aggregation. The aggregation contains (points at, aggregates) the book and author graph and also includes a pointer to some information on how to display it.
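
The idea can be illustrated with plain dictionaries. This is a sketch only: the graph names, the display pointer and the resolve function are invented for the example and are not the ORE vocabulary terms or ORDF’s API.

```python
# Each entity lives in its own named graph (a set of triples).
graphs = {
    "book/1": {
        ("book/1", "title", "Twelfth Night"),
        ("book/1", "author", "author/1"),
    },
    "author/1": {("author/1", "name", "Wiliam Shakespeare")},  # note the typo
}

# The aggregation points at the graphs and at display information;
# it does not copy any statements.
aggregation = {"aggregates": ["book/1", "author/1"], "display": "lens/book"}


def resolve(aggregation, graphs):
    """Merge the aggregated graphs at read time."""
    merged = set()
    for name in aggregation["aggregates"]:
        merged |= graphs[name]
    return merged


# The error is corrected once, in the author's graph only.
graphs["author/1"] = {("author/1", "name", "William Shakespeare")}

view = resolve(aggregation, graphs)
print(sorted(view))
```

Because the book’s graph never held a copy of the author’s name, the corrected name appears in the aggregated view without the book being touched.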

More to come on concrete implementation of ontology-specific behaviour, MARC processing and Aggregations in a follow-up post on Bibliographica.

Next Steps

There is much more ontology-specific work to be done. First on the list is an implementation in Python of the Fresnel vocabulary that is used to describe how to display RDF data in HTML. It is more a set of instructions than a templating language and we have already written an implementation in JavaScript. It is crucial, however, that websites built with ORDF do not rely on JavaScript for presentation and we should rely on custom templates as little as possible.

ORDF is now stable enough to start using in other projects, at least within the OKF family. A first and fairly easy case will be updating the RDF interface to CKAN to use it — fitting as ORDF actually started out as a refactor of that very codebase.

by admin

Latest Developments on Open Shakespeare (v0.8)

October 21, 2009 in News, Open Shakespeare, Releases

The last six months have seen significant developments on our Open Shakespeare project, many of which are reflected on the website: http://www.openshakespeare.org/

The most major advance is the availability of new HTML and PDF editions of the texts, see, for example, these versions of Twelfth Night:

We’ve also made improvements to multiview, cleaned up the web interface, revamped the domain model (proper Work/Edition/Resource distinction), and much more!

Going forward our main efforts are, on the “tech” side, to integrate a new (JavaScript) annotation system, and on the content side, to develop our open “critical edition” (an effort now being led by some students at Oxford and Cambridge).

We’re also holding a regular Open Shakespeare (virtual) meetup every other Saturday @ 4pm (London time) with the next one this coming Saturday (the 24th). All are welcome, so if you’re interested in Shakespeare why not drop in — details for how to participate are on the project wiki page.

by admin

Conservatives Pledge to Open 20 Most Socially Useful Datasets

October 19, 2009 in Open Data, Open Government Data, Policy

Thanks to a pointer from the ever-aware Julian Todd we’re heartened to see these pledges being made at the Conservative Party Conference in the UK:

  • Use open source software as much as possible
  • Publish on a website details of all government spending over £25,000. [ed: great news for Where Does My Money Go]
  • Allow the public to comment on all legislation before it is debated in depth by MPs and peers.
  • Publish online 20 of the most socially useful government datasets within 12 months of a General Election.
  • All government contracts over £10,000 being tendered by the government would also be online.
  • Fewer mega-projects; a rigid insistence on open standards and inter-operability; a level playing field for open source software and for smaller suppliers.

Interested in having a say in what those “20 most socially useful government datasets” should be? You can actually do so right now, courtesy of the Office of Public Sector Information’s data unlocking service.

OPSI has been quietly offering the public the chance to make and vote on unlocking requests over the last six months. So if you have a moment head over and make a request or vote on an existing one.

by admin

Data.gov.uk Launched – and it’s Using CKAN

October 8, 2009 in CKAN, News, Open Data, Open Government Data

The UK Government’s public sector data site launched last week in a private beta — and it’s using CKAN as its backend for storing all its dataset info!

data.gov.uk

They’ve got more than 1000 existing data sets, from 7 departments, all brought together for the first time in a re-useable form. They’re eager to get feedback from developers and users on all aspects of the project. Just head over and sign up to their google group.

All kudos here to the Digital Engagement team at the Cabinet Office who have done a great job in putting this together in a really short time. We’re also delighted that they’ve chosen to use CKAN (and Talis’ Connected Commons) as a key part of their infrastructure — not only for the obvious reasons but also because it shows such a clear commitment to working collaboratively with the wider open data community into the future.

Lastly, for those wondering where all those UK government packages are on CKAN: as this is a private beta, no UK government data packages will be showing up on the CKAN site at the moment, but we expect this to change just as soon as data.gov.uk goes officially public …

by admin

Abusing “Open”: Macmillan’s Open Dictionary

September 9, 2009 in Events, Open Knowledge Definition, Open/Closed

Jonathan recently wrote about the availability of open dictionaries. In a recent comment to that post someone pointed us to Macmillan’s “Open” Dictionary (the reasons for the quotes will soon be apparent).

With a sense of excitement I followed the link: “Could it be”, I thought, “that a mainstream dictionary producer has decided that open is the way to go?”

Sadly, the answer is no: Macmillan’s “Open Dictionary” isn’t open — at least not in any way we mean by that term.

Their “open” means letting you give them information for free (by submitting word suggestions) but getting nothing back. As the terms and conditions make quite clear, you’re not allowed to reproduce the material in any way, and even linking could be problematic (emphasis added):

Unless otherwise indicated, this Web Site and its contents are the property of Macmillan Publishers Limited, … The copyright in the material contained on this Web Site belongs to Macmillan or its licensors. … Reproduction of material on this Web Site is prohibited unless express permission is given by Macmillan.

No licence is granted in respect of any intellectual property rights vested in Macmillan or other third parties.

You may not redistribute any of the Content of this Web Site without the prior authorisation of Macmillan or create a database in electronic form or manually by downloading and storing any content.

You may link to the home page and any HTML page of the Web Site provided you do not create a frame or any other bordered environment around the content … You may not link to any other page of the Web Site, other than the home page or any HTML page, without the prior written consent of Macmillan. Macmillan reserves the right to require you to remove any link to this Web Site. You may not replicate the Content on this Web Site.

To my mind this is clear abuse of the term “open” and more than a little exploitative — you do work for them for free and they don’t even promise to give you credit, let alone permission to use the material you helped create.

Such potential for abuse of the “open” label is a major reason we created the open definition — where open content and data are clearly defined as material that you, and others, are free to use, reuse and redistribute without restriction.

by admin

Speaking at OpenTech 2009

July 3, 2009 in Events, External, Open Government Data, Talks

Tomorrow I’ll be talking at OpenTech 2009 in a session on “Open Government Data” with Richard Stirling of the Cabinet Office and John Sheridan of OPSI.

With the recent, and very welcome, news on opening up government data both here and abroad I’ll be giving some suggested dos and don’ts for this process under the title: “Opening Up Government Data: Give it to us Raw, Give it to us Now”. See people there!