Many of you will be familiar with the now ubiquitous Linked Open Data cloud diagram, maintained by Richard Cyganiak. The diagram illustrates efforts to link together many different data sources, from the CIA World Factbook to DBpedia, a structured database of information extracted from Wikipedia. It looks like this:

We’re very pleased that the diagram’s maintainers, Anja Jentzsch, Richard Cyganiak, and Chris Bizer, have decided to use CKAN to maintain a registry of information about the datasets, from which the diagram will be automatically updated. They have put out a call for up-to-date information about the datasets included in the diagram, to be submitted by next Wednesday, 8th September.

From their announcement:

We are in the process of drawing the next version of the LOD cloud diagram. This time it is likely to contain around 180 datasets altogether having a size of around 20 billion RDF triples.

For drawing the next version of the LOD cloud, we have started to collect meta-information about the datasets to be included on CKAN, a registry of open data and content packages provided by the Open Knowledge Foundation.

The list of datasets about which we have already collected information can be found here:

In addition to basic meta-information about a dataset such as its size and the number of links pointing at other datasets, we also collect additional meta-information about the license of the dataset, alternative access options like SPARQL endpoints or dataset dumps, and whether there exists a voiD description of the dataset or a Semantic Web Sitemap.

So if your dataset is not listed yet and you want to have it included into the next version of the LOD cloud, please add it to CKAN until next Wednesday (September 8th, 2010).

Also, if we have collected wrong information about your dataset or if your dataset is only partially described up till now, it would be great if you could add the missing information.

Guidelines about how to add datasets to CKAN as well as about the tags that we are using to annotate the datasets are found here:

We thank all contributors in advance for their input and help, which hopefully will allow us to draw the next version of the LOD cloud as accurate as possible.

The following guest post is from Holger Drewes, who is a member of Open Knowledge Foundation Germany and the Open Data Network in Berlin.

As interfaces for open datasets from political and societal institutions become more and more available, the possibilities for easy and uncomplicated data visualization are expanding in very promising ways. With a little programming knowledge, or a bit of support, journalists and bloggers are able to back up the conclusions in their articles with facts in an illustrative way, using diagrams or maps. Going further, they can create, demonstrate, or underline interrelations by integrating different datasets through programming interfaces.

The Worldbank is very active in offering such programming interfaces (APIs): it provides an API for querying indicators that describe the world’s development status, such as birth rates, CO2 emission levels, and education expenditure. The aim of this article is to show how such data can be used and, as an example, how it can be visualized on a map with the help of Google Maps. The following map shows the income level in different countries through colored pins. Clicking on a pin brings up additional information about the capital of a country and the meaning of the corresponding colour. The explanation is (hopefully) not too technical, so it should be comprehensible to non-programmers as well, at least in its essentials. Some programming skills will be necessary to build it yourself, but it shouldn’t take more than 2-3 hours.



The following three steps are necessary on the way to your own Worldbank open data mashup:

1. Worldbank API - Select indicators and formulate query

The API from the Worldbank can be used directly via a URL in the browser. You can choose which indicator or which country you wish to find results for by specifying the parameters in the URL. For example, the following query:

returns a list of all countries with a low income level (LIC). (You can get a more structured view of the result by selecting the source view in your browser.) A more detailed explanation of how to use the API can be found on the website of the Worldbank. Because the API can be used directly through the browser, you can also play around with the different parameters to get a better feeling for what it can do. Once you have created a useful API URL (in our example: http://open.worldbank.org/countries?format=json&per_page=500), the URL can be queried from your program with a suitable function (e.g. cURL in PHP, the programming language used in our example).
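As a rough illustration of this step, here is a minimal sketch in Python rather than the PHP used for the original example. The base URL and the two parameters come from the example URL above. Everything else is an assumption: the helper names are made up for this sketch, and the code assumes the JSON response is a two-element array (paging information first, then the list of country records). Check this against the output you actually see in your browser.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL taken from the example above; parameters go into the query string.
BASE_URL = "http://open.worldbank.org/countries"


def build_url(**params):
    """Return the API URL with the given query parameters appended."""
    return BASE_URL + "?" + urlencode(params)


def parse_countries(raw_json):
    """Extract the country records from a raw JSON response.

    Assumption: the response looks like [paging_info, [record, record, ...]].
    """
    payload = json.loads(raw_json)
    if len(payload) > 1 and payload[1]:
        return payload[1]
    return []


def fetch_countries(url):
    """Query the API over HTTP (hypothetical helper, needs network access)."""
    with urlopen(url) as response:
        return parse_countries(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Only builds and prints the URL; call fetch_countries(url) to query it.
    print(build_url(format="json", per_page=500))
```

Keeping the URL building and the parsing separate from the network call makes it easy to experiment with parameters, or to test the parsing against a response saved from the browser.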

2. Convert the result to a format readable by Google Maps

Now the queried data has to be converted into a format readable by Google Maps. A good way to go here is KML, which is a descriptive language for geodata, used for example to locate places on a map or annotate them with additional information. There is also the alternative of using the Google Maps API directly to visualize the data. The advantage of the KML option is that at the end of the process you get a code snippet which can be copied straight into weblogs and content management systems.

The datasets from the Worldbank API are returned in XML or JSON, both structured data formats used to represent several datasets of different kinds and their properties. Most languages have standard functions to process these formats; in PHP, for example, the function “json_decode()” is used to read datasets given in the JSON format. Now you can loop through the individual datasets and write the properties that should be presented on the map into a KML string. A list of the available KML properties can be found in the KML documentation hosted by Google.

In our example the main properties are the name of the country and the income level, which should be shown when selecting a pin on the Google map, and the longitude and latitude of the capital of the corresponding country (see illustration below). During this transformation it is also possible to apply some graphical formatting, for example representing every country with a low income level by a red pin. The resulting file now has to be saved on a web server as filename.kml, so that it is accessible through the web.
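The conversion loop can be sketched as follows, again in Python instead of PHP. This is only an illustration under stated assumptions: the record fields (`name`, `capitalCity`, `longitude`, `latitude`, `incomeLevel`) are assumed to match what the API returns, the style table is invented for this sketch, and whether a given map viewer honours the icon colour is not guaranteed.

```python
from xml.sax.saxutils import escape

# Hypothetical style table: style id -> KML colour in aabbggrr hex notation.
STYLES = {"LIC": "ff0000ff", "other": "ff00ffff"}  # red, yellow


def placemark(country):
    """Return a KML <Placemark> string for one country record."""
    level = country["incomeLevel"]
    style_id = level["id"] if level["id"] in STYLES else "other"
    return (
        "<Placemark>"
        f"<name>{escape(country['name'])}</name>"
        f"<description>{escape(country['capitalCity'])}: "
        f"{escape(level['value'])}</description>"
        f"<styleUrl>#{style_id}</styleUrl>"
        # KML expects coordinates in the order longitude,latitude.
        f"<Point><coordinates>{country['longitude']},{country['latitude']}"
        "</coordinates></Point>"
        "</Placemark>"
    )


def to_kml(countries):
    """Build a complete KML document from a list of country records."""
    styles = "".join(
        f'<Style id="{style_id}"><IconStyle><color>{colour}</color>'
        "</IconStyle></Style>"
        for style_id, colour in sorted(STYLES.items())
    )
    # Skip records without coordinates (for example regional aggregates).
    placemarks = "".join(
        placemark(c) for c in countries
        if c.get("longitude") and c.get("latitude")
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<kml xmlns="http://www.opengis.net/kml/2.2"><Document>'
        + styles + placemarks
        + "</Document></kml>"
    )
```

Writing the result of `to_kml(...)` to a file ending in `.kml` and uploading it gives you the publicly reachable URL needed for the next step.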

3. Integrate into blog/article

Phew! Maybe that last section really was a bit technical! But the good news is that you are now more or less done. Google Maps can process KML files directly, so you can copy the corresponding URL straight into the search field of Google Maps. If you have done everything correctly, the datasets taken from the Worldbank API should be shown on the resulting map. Anyone who wants to try this can take the KML file URL used for this example:

Copy the link into the search field and see what happens. Via “Link” -> “Customize and preview embedded map” you can select the desired section and zoom level of the map, and you’re done! The HTML code which you have thus created can now be copied into your own website, and the map with the data overlay will automatically be loaded via Google!

Conclusion

Hopefully this article shows how easy it is - even by today’s standards - to integrate data from openly available data sources into your own website. With a little imagination and some programming skills, much more can be realized than shown in this example. Comparisons can be made by overlaying different datasets, or through the use of timelines. Maps can be complemented by your own datasets, or by data from other open programming interfaces. So, grab your keyboard! :-) And anyone who has experimented a bit and has created interesting visualizations: it would be great if you added a comment below!

The following guest post is from Stefano Costa at the University of Siena. He is Founder of the IOSA initiative and Coordinator of the Open Knowledge Foundation’s Working Group on Open Data in Archaeology. Stefano wishes to thank Thomas Kluyver and David Jones for their help in reviewing the post.

Since the 19th century, the study of archaeobotanical remains has been very important for combining “strict” archaeological knowledge with environmental data. Pollen data make it possible to assess the introduction of certain domesticated species of plants, or the presence of other species that typically grow where humans dwell. Not all pollen data come from archaeological fieldwork, and pollen analysis is often done by ecologists without a particular focus on human-associated plants. However, from an archaeologist’s perspective the relationship between the two sets is strong enough to take an interested look at pollen data worldwide, their availability and most importantly their openness, for which we follow the Open Knowledge Definition.

We found that universities and research centers seriously misunderstand their role in society as places of research and innovation that are available to everyone. As with dendrochronological data, academia is a closed system producing data (at very high costs for society) that are only available inside its walls, even though it is all done with public money.

Finding pollen data

The starting point for finding pollen data is the NOAA website.

The Global Pollen Database hosted by the NOAA is a good starting point, but apparently its coverage is quite limited outside the US. Furthermore, data from 2005 onwards aren’t available via FTP in simple documented formats, but are instead downloadable as Access databases from another external website. Calling Access databases a Bad Choice™ for data exchange is perhaps an understatement.

Unfortunately, the number of databases covering single continents or smaller regions is growing, and the approaches to data dissemination show marked differences.

Americas

For both North and South America, you can get data from more than one thousand sites directly via FTP. There are no explicit terms of use. Usually, data retrieved from federal agencies are public domain data.

The README document only states NOTE: PLEASE CITE ORIGINAL REFERENCES WHEN USING THIS DATA!!!!!. Fair enough, the requirement for attribution is certainly compatible with the Open Knowledge Definition.

Europe

From the GPD website we can easily reach the European Pollen Database, which is hosted on another website, though (and things can get even more confusing, given that the NOAA website has some dead links).

You can download EPD data in PostgreSQL dump format (one file for each table, with a separate SQL script create_epd_db.sql). Data in the EPD can be restricted or unrestricted. That’s fine, let’s see how many unrestricted datasets there are. Following the database documentation, the P_ENTITY table contains the use status of each dataset:

steko@gibreel:~/epd-postgres-distribution-20100531$ cat p_entity.dump |
awk -F "\t" {' print $5 '} | sort | uniq -c
    154 R
   1092 U

which is pretty good because almost 88% of them are unrestricted (NB I write most of my programs in Python but I love one-liners that involve awk, sort and uniq). We could easily create an “unrestricted” subset and make it available for easy download to all those who don’t want to deal with restricted data.
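For readers who prefer Python to shell pipelines, the same count can be sketched as below. It assumes, like the one-liner, that the dump is tab-separated and that the use status sits in the fifth field; the function names are made up for this sketch.

```python
from collections import Counter


def count_use_status(lines, column=4):
    """Count the values found in one tab-separated column.

    The column index is 0-based, so 4 corresponds to awk's $5.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > column:
            counts[fields[column]] += 1
    return counts


def unrestricted_share(counts):
    """Return the fraction of entities marked 'U' (unrestricted)."""
    total = sum(counts.values())
    return counts["U"] / total if total else 0.0


if __name__ == "__main__":
    # Plugging in the counts printed above: 154 restricted, 1092 unrestricted.
    counts = Counter({"R": 154, "U": 1092})
    print(f"{unrestricted_share(counts):.1%}")
```

To run it against the real data, pass an open file object for p_entity.dump to `count_use_status`.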

But what does “unrestricted” mean for EPD data? Let’s take a more careful look (emphasis mine):

  1. Data will be classified as restricted or unrestricted. All data will be available in the EPD, although restricted data can be used only as provided below.
  2. Unrestricted data are available for all uses, and are included in the EPD on various electronic sites.
  3. Restricted data may be used only by permission of the data originator. Appropriate and ethical use of restricted data is the responsibility of the data user.
  4. Restrictions on data will expire three years after they are submitted to the EPD. Just prior to the time of expiration, the data originator will be contacted by the EPD database manager with a reminder of the pending change. The originator may extend restricted status for further periods of three years by so informing the EPD each time a three-year period expires.

Sounds quite good, doesn’t it? “For all uses” is reassuring and the short time limit is a good trade-off. The horror comes a few paragraphs below with the following scary details:

  1. The data are available only to non-profit-making organizations and for research.
Profit-making organizations may use the data, even for legitimate uses, only with the written consent of the EPD Board, who will determine or negotiate the payment of any fee required.

Here the false assumption that only academia is entitled to perform research is taken for granted. And there are even more rules about “normal ethics”: basically, if you use EPD data in a publication, the original data author should be listed among the authors of the work. I always thought citation and attribution were invented for exactly that purpose, but it looks like they have a distinctly different approach to attribution. The EPD even decides what counts as a “legitimate” use of pollen data (I can hardly think of any possible illegitimate use).

Africa

For “Africa” read “Europe” again, because most research projects are from French and English universities. For this reason, the situation is almost the same. What is even worse is that in developing countries there are far fewer people or organizations that can afford to buy those data, notwithstanding the fact that in regions under rapid development the study and preservation of environmental resources are of major importance.

Data are downloadable for individual sites using a search engine, in Tilia format (not ASCII unfortunately). The problems start with the license:

The wording is almost exactly the same as for the EPD seen above:

Normal ethics pertaining to co-authorship of publications applies. The contributor should be invited to be a co-author if a user makes significant use of a single contributor’s site, or if a single contributor’s data comprise a substantial portion of a larger data set analysed, or if a contributor makes a significant contribution to the analysis of the data or to the interpretation of the results. The data will be available only to non-profit-making organisations and for research. Profit-making organisations may use the data for legitimate purposes, only with the written consent of the majority of the members of the Advisory board, who will determine or negotiate the payment of any fee required. Such payment will be credited to the APD.

Conclusions

The only positive bit of the story, if any, is that these datasets are nevertheless available on the web, and their terms of use are clearly stated, no matter how restrictive. It would be simply impossible to write a similar article about archaeological pottery, or zooarchaeological finds.

Appendix: Using pollen data

Pollen data are usually presented in the form of summary charts where both stratigraphic data and quantitative pollen data are easily readable. Each “column” of the chart stands for a species or genus. You can create this kind of visualization with free software tools.

The stratigraph package for R can be used for

plotting and analyzing paleontological and geological data distributed through time in stratigraphic cores or sections. Includes some miscellaneous functions for handling other kinds of palaeontological and paleoecological data.

See the chart for an example of what they look like.

Some months ago we started looking at how we might use an RDF store instead of a SQL database behind data-driven websites, of which OKF has several. The reasons have to do with making the data reusable in a better way than ad-hoc JSON APIs.

As we tend to program in Python and use the Pylons framework, this led us to consider some alternatives like RDFAlchemy and SuRF. Both of those build on top of RDFLib and try to present a programming interface reminiscent of SQL-ORM middleware like SQLObject and SQLAlchemy. They assume a single database-like storage for the RDF data and in some cases make some assumptions about the form of the data itself.

One important thing that they do not directly handle is customised indexing — and triplestores vary widely in terms of how well certain types of queries will perform, if they are supported at all. Overall, using RDFAlchemy or SuRF didn’t seem like much of a gain over using RDFLib directly. So we started writing our own middleware which we’ve named ORDF (OKFN RDF Library).

Code and documentation are at http://ordf.org/

ORDF Features and Structure

Key features of ORDF:

  • Open-source and python-based (builds on RDFLib)
  • Clean separation of functionality such as storage, indexing, web frontend
  • Easy pluggability of different storage and indexing engines (all those supported by RDFLib, 4store, simple-disk using pairtree etc)
  • Extensibility via messaging (we use rabbitmq)
  • Built-in rdf “revisioning”: every set of changes to the RDF store is kept in a “changeset”. This enables provenance, roll-back, change reporting “out-of-the-box”

To illustrate how this works, here’s a diagram showing a write operation in ORDF using most of the features described above. Below we go into detail describing how it all works.

Write operations in ORDF diagram

Forward Compatibility with RDFLib

The ORDF middleware solves several problems. The first, and most mundane, is to paper over the significant API changes between versions 2.4.2 and 3.0.0 of RDFLib. RDFLib moved quite a few things around, and this tends to break code because statements like from rdflib import Graph need to be changed to from rdflib.graph import Graph. So the first thing ORDF does is let you write from ordf.graph import Graph, which will work no matter which version of RDFLib you have installed. This is important because the changes in 3.0.0 go deeper than the renaming of modules: some software, such as the FuXi reasoner and anything that uses the SPARQL query language, will not work well with the new version. ORDF therefore acts as a forward compatibility layer, so that software developed with it should continue to work once the newer RDFLib stabilises.
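The general idea of such a compatibility shim can be sketched in a few lines. This is not ORDF’s actual code, only an illustration of the technique: the `import_first` helper is hypothetical, and the two candidate locations are the ones named in the paragraph above.

```python
import importlib


def import_first(candidates):
    """Return the first object importable from a list of (module, name) pairs."""
    for module_name, attr in candidates:
        try:
            module = importlib.import_module(module_name)
            return getattr(module, attr)
        except (ImportError, AttributeError):
            continue  # not available at this location, try the next one
    raise ImportError(f"none of {candidates!r} could be imported")


# Try the RDFLib 3.x location first, then fall back to the 2.4.x one.
GRAPH_LOCATIONS = [("rdflib.graph", "Graph"), ("rdflib", "Graph")]

if __name__ == "__main__":
    try:
        Graph = import_first(GRAPH_LOCATIONS)
    except ImportError:
        Graph = None  # RDFLib is not installed at all
    print(Graph)
```

Application code then imports `Graph` from the shim module only, so a later change of location in the underlying library needs to be handled in one place.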

Pylons Support

Only slightly less mundane than the previous, ORDF includes some code that should be common amongst web applications using the Pylons framework for accessing the ORDF facilities. This means controllers for obtaining copies of graphs in various serialisations and for implementing a SPARQL endpoint.

Indices and Message Queues

Then we have indexes and queueing. Named graphs, the moral equivalent of the objects from the SQL-ORM world, are stored in more than one place to facilitate different kinds of queries:

  • The pairtree filesystem index, which is good for retrieving a graph if you know its name and simply stores it as a file in a specialised directory hierarchy on the disk. This is not good for querying but is pretty much future-proof — at least as long as it is possible to read a file from the disk.
  • An rdflib-supported storage, which is suitable for small to medium-sized datasets, does not depend on any external software and allows SPARQL queries over the data for graph traversal operations.
  • The 4store quad-store, which fulfils a similar role for larger datasets, allowing SPARQL queries, but requires an additional piece of software to be running (possibly on a cluster for very large datasets) and is somewhat harder to set up.
  • A xapian full-text search index, which allows free-form queries over text strings, something that no triplestore does very well.

There are plans for further storage back-ends, specifically using Solr as well as other triplestores such as Jena and Virtuoso.

A key element of this indexing architecture is that it is distributed. Whilst you can configure all of these index types into a single running program — and it is common to do so for development — in reality some indexing operations are expensive and you don’t necessarily want the client program sitting and waiting while they are done synchronously. So there is also a pseudo-index that sends graphs to a rabbitmq messaging server and for each index a daemon is run that listens to a queue on a fan-out exchange.

Introducing a layer of message queueing also makes it possible to support inferencing, the process of deriving new information or statements from the given data. This operation is considerably more computationally expensive than mere indexing. It is accomplished by using two queues. When a graph is saved, it first gets put on a queue conventionally called reason. The FuXi reasoner (known in the literature as a production rule or forward-chaining system) listens to that queue, computes some new statements, and then puts the resulting, augmented graph onto a queue called index, from where it goes to the indexers.
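The flow of a graph through the two queues can be mimicked with a small in-memory toy. It is only meant to show the shape of the pipeline: there is no rabbitmq and no FuXi here, graphs are plain Python sets of triples, and the single inference rule is invented for the example.

```python
from collections import deque


class FanoutExchange:
    """Toy stand-in for a fan-out exchange: every bound queue gets each message."""

    def __init__(self):
        self.queues = {}

    def bind(self, name):
        """Create a queue for one consumer and return it."""
        self.queues[name] = deque()
        return self.queues[name]

    def publish(self, message):
        for queue in self.queues.values():
            queue.append(message)


def reason(graph):
    """Invented forward-chaining rule: (s, knows, o) implies (o, knownBy, s)."""
    inferred = {(o, "knownBy", s) for (s, p, o) in graph if p == "knows"}
    return graph | inferred


def run_pipeline(graph, index_names):
    """Send one graph through the 'reason' queue and fan it out to the indexes."""
    reason_queue = deque([graph])      # the queue conventionally called "reason"
    index_exchange = FanoutExchange()  # plays the role of the "index" exchange
    queues = {name: index_exchange.bind(name) for name in index_names}
    while reason_queue:
        index_exchange.publish(reason(reason_queue.popleft()))
    return {name: list(queue) for name, queue in queues.items()}


if __name__ == "__main__":
    result = run_pipeline({("alice", "knows", "bob")}, ["pairtree", "xapian"])
    for name, messages in result.items():
        print(name, messages)
```

Each index receives its own copy of the augmented graph, which is what lets the real indexers run as independent daemons.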

Ontology Logic

Until very recently there was only one ontology-specific behaviour coded into ORDF and that was the ChangeSet. It is still important. It provides low-level, per-statement provenance and change history information. This is built into the system. A save operation on a graph is accomplished by obtaining a change context and adding one or more graphs to it, then committing the changes. Before sending the graphs out for indexing or reasoning or queueing or whatnot, a copy of the previous version of the graphs is obtained (usually from pairtree storage) and the differences are calculated. These differences along with some metadata make up the ChangeSet which is saved and indexed along with the graphs themselves. This accomplishes what we call Syntactic Provenance because it operates at the level of individual statements.
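The difference calculation at the heart of this can be sketched with plain Python sets. This is not ORDF’s ChangeSet implementation or vocabulary; the dictionary layout and function names are invented, and statements are simply tuples.

```python
def diff_graphs(previous, current):
    """Return (removed, added) statements between two versions of a graph."""
    previous, current = set(previous), set(current)
    return previous - current, current - previous


def make_changeset(name, previous, current, author, reason):
    """Bundle the statement-level differences with some metadata."""
    removed, added = diff_graphs(previous, current)
    return {
        "graph": name,
        "author": author,
        "reason": reason,
        "removed": sorted(removed),
        "added": sorted(added),
    }


def apply_changeset(graph, changeset):
    """Replay a changeset on top of a graph."""
    return (set(graph) - set(changeset["removed"])) | set(changeset["added"])


def revert_changeset(graph, changeset):
    """Undo a changeset: re-add what it removed and drop what it added."""
    return (set(graph) - set(changeset["added"])) | set(changeset["removed"])
```

Because every save is recorded as a set of removed and added statements, roll-back is just the same operation applied in the opposite direction.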

Lately several more modules have been added to support other vocabularies. The work on the Bibliographica project led to the introduction of the OPMV vocabulary for Semantic Provenance. This is used to describe the way a data record from an external source (in this case MARC data) is transformed by a Process into a collection of statements or graph, and the way other graphs are derived from this first one. This is a distinct problem from Syntactic Provenance since it deals with the relationships between entities or objects and not simply add/remove operations on their attributes.

Another addition has been the ORE Aggregation vocabulary, which is also used in Bibliographica. In our system, since distinct entities or objects are stored as named graphs, we want to avoid having data duplicated in places where it should not be. For example, a book might have an author, and users are ultimately interested in seeing the author’s name when they are looking at data about the book. But we do not want to store the author’s details in the book’s graph, because that means that if someone notices and corrects an error, the error must be corrected both in the author’s graph and in their book’s. Better to keep such changes in one place. So what we actually do is create an aggregation. The aggregation contains (points at, aggregates) the book and author graphs and also includes a pointer to some information on how to display it.

More to come on concrete implementation of ontology-specific behaviour, MARC processing and Aggregations in a follow-up post on Bibliographica.

Next Steps

There is much more ontology-specific work to be done. First on the list is an implementation in Python of the Fresnel vocabulary that is used to describe how to display RDF data in HTML. It is more a set of instructions than a templating language and we have already written an implementation in JavaScript. It is crucial, however, that websites built with ORDF do not rely on JavaScript for presentation and we should rely on custom templates as little as possible.

ORDF is now stable enough to start using in other projects, at least within the OKF family. A first and fairly easy case will be updating the RDF interface to CKAN to use it — fitting as ORDF actually started out as a refactor of that very codebase.

CKAN v1.0 Released

May 18th, 2010

We are pleased to announce the availability of version 1.0 of the CKAN software, our open source registry system for datasets (and other resources). After 3 years of development, twelve point releases and several successful production deployments around the world, CKAN has come of age!

CKAN around the world

As well as being used to power http://ckan.net and http://data.gov.uk CKAN is now helping run 7 data catalogues around the world including ones in Canada (http://datadotgc.ca / http://ca.ckan.net), Germany (http://de.ckan.net/) and Norway (http://no.ckan.net).

CKAN.net has also continued to grow steadily and now has over 940 registered packages:

Changelog

This is our largest release so far (56 tickets) with lots of new features and improvements. Main highlights (for a full listing of tickets please see the trac milestone):

  • Package edit form: new pluggable architecture for custom forms (#281, #286)
  • Package revisions: diffs now include tag, license and resource changes (#303)
  • Web interface: visual overhaul (#182, #206, #214-#227, #260) including a tag cloud (#89)
  • i18n: completion in Web UI - now covers package edit form (#248)
  • API extended: revisions (#251, #265), feeds per package (#266)
  • Developer documentation expanded (#289, #290)
  • Performance improved and CKAN stress-tested (#201)
  • Package relationships (Read-Write in API, Read-Only in Web UI) (#253-257)
  • Statistics page (#184)
  • Group edit: add multiple packages at once (#295)
  • Package view: RDF and JSON formatted metadata linked to from package page (#247)

Bugfixes:

  • Resources revision history (#292)
  • Extra fields now work with spaces in the name (#278, #280) and international characters (#288)
  • Updating resources in the REST API (#293)

Infrastructural:

  • Licenses: now uses external License Service (’licenses’ Python module)
  • Changesets introduced to support distributed revisioning of CKAN data - see doc/distributed.rst for more information.

Thanks

Lastly a big thank-you to everyone who has contributed to this release and especially to the folks at data.gov.uk!

The following guest post is from Christina Angelopoulos at the Institute for Information Law (IViR) and Maarten Zeinstra at Nederland Kennisland who are working on building a series of Public Domain Calculators as part of the Europeana project. Both are also members of the Open Knowledge Foundation’s Working Group on the Public Domain.


Over the past few months the Institute for Information Law (IViR) of the University of Amsterdam and Nederland Kennisland have been collaborating on the preparation of a set of six Public Domain Helper Tools as part of the EuropeanaConnect project. The Tools are intended to assist Europeana data providers in determining whether or not a certain work or other subject matter vested with copyright or neighbouring rights (related rights) has fallen into the public domain and can therefore be freely copied or re-used. They do so by functioning as a simple interface between the user and the often complex set of national rules governing the term of protection. The issue is of significance for Europeana, as contributing organisations will be expected to clearly mark the material in their collection as being in the public domain, through the attachment of a Europeana Public Domain Licence, whenever possible.

The Tools are based on six National Flowcharts (Decision Trees) built by IViR on the basis of research into the duration of the protection of subject matter in which copyright or neighbouring rights subsist in six European jurisdictions (the Czech Republic, France, Italy, the Netherlands, Spain and the United Kingdom). By means of a series of simple yes-or-no questions, the Flowcharts are intended to guide the user through all important issues relevant to the determination of the public domain status of a given item.

Researching Copyright Law

The first step in the construction of the flowcharts was the careful study of the EU Term Directive. The Directive attempts to harmonise the rules on the term of protection of copyright and neighbouring rights across all EU Member States. The rules of the Directive were integrated by IViR into a set of Generic Skeleton European Flowcharts. Given the essential role that the Term Directive has played in shaping national laws on the duration of protection, these generic charts functioned as the prototype for the six National Flowcharts. An initial version of the Generic European Flowchart, as well as the National Flowcharts for the Netherlands and the United Kingdom, was put together with the help of the Open Knowledge Foundation at a Communia workshop in November 2009.

Further information necessary for the refinement of these charts as well as the assembly of the remaining four National Flowcharts was collected either through the collaboration of National Legal Experts contacted by IViR (Czech Republic, Italy and Spain) or independently through IViR’s in-house expertise (EU, France, the Netherlands and the UK).

Both the Generic European Flowcharts and the National Flowcharts have been split into two categories: one dedicated to the rules governing the duration of copyright and the sui generis database right and one dedicated to the rules governing neighbouring rights. Although this division was made for the sake of usability and in accordance with the different subject matter of these categories of rights (works of copyright and unoriginal databases on the one hand and performances, phonograms, films and broadcasts on the other), the two types of flowcharts are intended to be viewed as connected and should be applied jointly if a comprehensive conclusion as to the public domain status of an examined item is to be reached (in fact the final conclusion in each directs the user to the application of the other). This is due to the fact that, although the protected subject matter of these two categories of rights differs, they may not be entirely unrelated. For example, it does not suffice to examine whether the rights of the author of a musical work have expired; it may also be necessary to investigate whether the rights of the performer of the work or of the producer of the phonogram onto which the work has been fixed have also expired, in order to reach an accurate conclusion as to whether or not a certain item in a collection may be copied or re-used.

Legal Complexities

A variety of legal complexities surfaced during the research into the topic. Condensing the complex rules that govern the term of protection in the examined jurisdictions into a user-friendly tool presented a substantial challenge. One of the most perplexing issues was that of the first question to be asked. Rather than engage in complicated descriptions of the scope of the subject matter protected by copyright and related rights, IViR decided to avoid this can of worms. Instead, the flowchart’s starting point is provided by the question “is the work an unoriginal database?” However, this solution seems unsatisfactory and further thought is being put into an alternative approach.

Other difficult legal issues stumbled upon include the following:

  • Term of protection vis-à-vis third countries
  • Term of protection of works of joint authorship and collective works
  • The term of protection (or lack thereof) for moral rights
  • Application of new terms and transitional provisions
  • Copyright protection of critical and scientific publications and of non-original photographs
  • Copyright protection of official acts of public authorities and other works of public origins (e.g. legislative texts, political speeches, works of traditional folklore)
  • Copyright protection of translations, adaptations and typographical arrangements
  • Copyright protection of computer-generated works

On the national level, areas of uncertainty related to such matters as the British provisions on the protection of films (no distinction is made under British law between the audiovisual or cinematographic work and its first fixation, contrary to the system applied on the EU level) or exceptional extensions to the term of protection, such as that granted in France due to World Wars I and II or in the UK to J.M. Barrie’s “Peter Pan”.

Web-based Public Domain Calculators

Once the Flowcharts had been prepared they were translated into code by IViR’s colleagues at Kennisland, thus resulting in the creation of the current set of six web-based Public Domain Helper Tools.

Technically, the flowcharts needed to be translated into formats that computers can read. In this project Kennisland chose an Extensible Markup Language (XML) approach for describing the questions in the flowcharts and the relations between them. The resulting XML documents are both human and computer readable. Using XML documents also allowed Kennisland to keep the decision structure separate from the actual programming language, which makes maintenance of both content and code easier.

Kennisland then needed to build an XML reader that could either translate the structures and questions of these XML files into a questionnaire or apply a set of data to the available questions, so as to make the automatic calculation of large datasets possible. For the EuropeanaConnect project Kennisland developed two of these XML readers. The first translates these XML schemes into a graphical user interface tool (this can be found at EuropeanaLabs) and the second can potentially determine automatically the status of a work which resides in the Public Domain Works project Mercurial repository on KnowledgeForge. Both of these applications are open source and we encourage people to download, modify and work on these tools.
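
To make the idea concrete, here is a minimal sketch in Python of how a flowchart described in XML might be walked with a set of answers. It is not Kennisland's actual reader or file format: the element names (flowchart, question, answer, conclusion), the attributes and every node after the first question are placeholders invented for illustration. Only the opening question is taken from the flowcharts described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical flowchart document. Element and attribute names are
# assumptions for this sketch, not the format used by the real tools.
FLOWCHART_XML = """
<flowchart start="q1">
  <question id="q1" text="Is the work an unoriginal database?">
    <answer value="yes" next="end_database"/>
    <answer value="no" next="q2"/>
  </question>
  <question id="q2" text="Placeholder follow-up question">
    <answer value="yes" next="end_a"/>
    <answer value="no" next="end_b"/>
  </question>
  <conclusion id="end_database" text="Placeholder: continue with the database right branch"/>
  <conclusion id="end_a" text="Placeholder conclusion A"/>
  <conclusion id="end_b" text="Placeholder conclusion B"/>
</flowchart>
"""


def load_flowchart(xml_text):
    """Parse the XML and return (start_id, nodes), where nodes maps id -> element."""
    root = ET.fromstring(xml_text.strip())
    nodes = {node.get("id"): node for node in root}
    return root.get("start"), nodes


def evaluate(xml_text, answers):
    """Walk the flowchart using a dict of {question_id: answer_value}.

    Returns the text of the conclusion that is reached. A missing answer
    raises KeyError; an answer the question does not offer raises ValueError.
    """
    start, nodes = load_flowchart(xml_text)
    current = nodes[start]
    while current.tag == "question":
        given = answers[current.get("id")]
        for answer in current.findall("answer"):
            if answer.get("value") == given:
                current = nodes[answer.get("next")]
                break
        else:
            raise ValueError(
                "no answer %r for question %s" % (given, current.get("id"))
            )
    return current.get("text")


if __name__ == "__main__":
    print(evaluate(FLOWCHART_XML, {"q1": "no", "q2": "yes"}))
```

Because the decision structure lives entirely in the XML, the same evaluate function could drive an interactive questionnaire (asking each question in turn) or a batch run over records that already contain the answers, which mirrors the two kinds of reader described above.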

It should be noted that, as part of Kennisland’s collaboration with the Open Knowledge Foundation, Kennisland is currently assisting in the development of an XML-based scheme for automatic determination of the rights status of a work using bibliographic information. Unfortunately, however, this information alone is usually not enough for automatic identification on a European level. This is due to the many international treaties that have accumulated over the years; the rules change, for example, depending on whether an author was born in a country party to the Berne Convention, an EU Member State or a third country.

It should of course also be noted that there is a limit to the extent to which an electronic tool can replace a case-by-case assessment of the public domain status of a copyrighted work or other protected subject matter in complicated legal situations. The Tools are accordingly accompanied by a disclaimer indicating that they cannot offer an absolute guarantee of legal certainty.

Further fine-tuning is necessary before the Helper Tools are ready to be deployed. For the moment test versions of the electronic Tools can be found here. We invite readers to try these beta tools and give us feedback on the pd-discuss list!

Note from the authors: If the whole construction process for the Flowcharts has highlighted one thing, it is the bewildering complexity of the current rules governing the term of protection for copyright and related rights. Despite the Term Directive’s attempts at creating a level playing field, national legislative idiosyncrasies are still going strong in the post-harmonisation era – a single European term of protection remains very much a chimera. The relevant rules are hardly simple on the level of the individual Member States either. In particular in countries such as the UK and France, the term of protection currently operates under confusing entanglements of rules and exceptions that make the confident calculation of the term of protection almost impossible for a copyright layperson and difficult even for experts.

PD Calculators

Generic copyright flowchart by Christina Angelopoulos. PDF version available from Public Domain Calculators wiki page

CKAN 0.11 Released

February 12th, 2010

We are pleased to announce the release of version 0.11 of the CKAN software, our open source registry of open data used in ckan.net and data.gov.uk.

CKAN tag cloud

This is our biggest release so far (55 tickets) with lots of new features and improvements. This release also saw a major new production deployment with the CKAN software powering http://data.gov.uk/ which had its public launch on Jan 21st!

Main highlights (for a full listing of tickets please see the trac milestone):

  • Package Resource object (multiple download urls per package): each package can have multiple ‘resources’ (urls) with each resource having additional metadata such as format, description and hash (#88, #89, #229)
  • “Full-text” searching of packages (#187)
  • Semantic web integration: RDFization of all data plus integration with an online RDF store (e.g. for http://www.ckan.net/ at http://semantic.ckan.net/ or Talis store) (#90 #163)
  • Package ratings (#77 #194)
  • i18n: we now have translations into German and French with deployments at http://de.ckan.net/ and http://fr.ckan.net/ (#202)
  • Package diffs available in package history (#173)
  • Minor:
    • Package undelete (#21, #126)
    • Automated CKAN deployment via Fabric (#213)
    • Listings are sorted alphabetically (#195)
    • Add extras to REST API and to ckanclient (#158 #166)
  • Infrastructural:
    • Change to UUIDs for revisions and all domain objects
    • Improved search performance and better pagination
    • Significantly improved performance in API and WUI via judicious caching
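
As a rough illustration of the new package resource idea from the highlights above, the sketch below models a package that holds several resources, each with a URL plus format, description and hash metadata. The class names, field names and example URLs are assumptions made for this sketch; they are not CKAN's actual domain model or API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Resource:
    """One download URL for a package, with optional metadata (names assumed)."""
    url: str
    format: str = ""
    description: str = ""
    hash: str = ""


@dataclass
class Package:
    """A registry entry that can point at several resources."""
    name: str
    resources: List[Resource] = field(default_factory=list)


def resources_by_format(package, fmt):
    """Return the package's resources whose format matches fmt, ignoring case."""
    return [r for r in package.resources if r.format.lower() == fmt.lower()]


# Hypothetical package with two resources; the URLs are placeholders.
example = Package(
    name="example-dataset",
    resources=[
        Resource(url="http://example.org/data.csv", format="CSV",
                 description="Full dump"),
        Resource(url="http://example.org/data.rdf", format="RDF/XML",
                 description="Linked data version"),
    ],
)

if __name__ == "__main__":
    for res in resources_by_format(example, "csv"):
        print(res.url)
```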

We’re pleased to announce the publication of a new report, Unlocking the potential of aid information. The report, by the Open Knowledge Foundation and Aidinfo, looks at how to make information related to international development (i) legally open, (ii) technically open and (iii) easy to find.

The report and relevant background information can be found at:

It aims to inform the development of a new platform for publishing and sharing aid information:

The International Aid Transparency Initiative (IATI) aims to improve the availability and accessibility of aid information by designing common standards for publication of information about aid. It is not about creating another database on aid activities, but creating a platform that will enable existing databases – and potential new services – to access this aid information and create compelling applications providing more detailed, timely, and accessible information about aid.

The idea of openness is crucial to creating this platform and achieving transparency. Information must be openly available with as few restrictions in how the information is accessed and used as possible. To this end, we need to design a technical architecture that enables information to be published and accessed in an open way.

There are three main recommendations in the report, which are as follows:

  • Recommendation 1 - Aid information should be legally open. The standard should require a core set of standard licenses under which aid information is published. It should require that either:
    • (i) information is published under one of a small number of recommended options:
      • Licenses for content: Creative Commons Attribution or Attribution Sharealike license
      • Legal tools for data: Open Data Commons Public Domain Dedication and License (PDDL), Open Data Commons Open Database License (ODbL) or Creative Commons CC0
    • or that (ii) information is published using a license/legal tool that is compliant with a standard such as the Open Knowledge Definition.
  • Recommendation 2 - Aid information should be technically open. The standard should require that raw data is made available in bulk (not just via an API or web interface) with any relevant schema information and either:
    • (i) in one of a small number of recommended formats:
      • Text: HTML, ODF, TXT, XML
      • Data: CSV, XML, RDF/XML
    • or (ii) in a format:
      • (a) which is machine readable and
      • (b) for which the specification is publicly and freely available and usable
  • Recommendation 3 - Aid information should be easily findable. The standard should require that aid organisations add their knowledge assets to a registry with some basic metadata describing the information.
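
Recommendation 2 can be read as a simple rule, sketched below in Python. The function name and its parameters are invented for illustration and are not part of the report or of any standard; the format lists are copied from the recommendation above.

```python
# Recommended formats as listed in Recommendation 2.
RECOMMENDED_FORMATS = {
    "text": {"HTML", "ODF", "TXT", "XML"},
    "data": {"CSV", "XML", "RDF/XML"},
}


def meets_recommendation_2(fmt, bulk_available,
                           machine_readable=False, open_specification=False):
    """Return True if a publication looks technically open under this reading.

    fmt                -- format label, e.g. "CSV" (case-insensitive)
    bulk_available     -- raw data can be obtained in bulk, not only via an API
    machine_readable   -- only consulted for formats outside the recommended list
    open_specification -- specification is publicly and freely available and usable
    """
    if not bulk_available:
        return False
    recommended = any(fmt.upper() in formats
                      for formats in RECOMMENDED_FORMATS.values())
    return recommended or (machine_readable and open_specification)


if __name__ == "__main__":
    print(meets_recommendation_2("csv", bulk_available=True))
```

This check leaves out the requirement to publish relevant schema information alongside the data, which would need to be assessed separately.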

We are now welcoming comments on the report until Sunday 1st November 2009. To submit comments you can:

  1. Directly annotate the documents with your comments:
  2. Submit your comments for discussion on the open development mailing list.
  3. Email your comments to info at okfn dot org.

CKAN 0.9 Released

August 13th, 2009

We are pleased to announce the release of CKAN version 0.9! CKAN is the Comprehensive Knowledge Archive Network, a registry of open knowledge packages and projects.

Changes include:

  • Add version attribute for package
  • Fix purge to use new version of Versioned Domain Model (vdm) (0.4)
  • Link to changed packages when listing revision
  • Show most recently registered or updated packages on front page
  • Bookmarklet to enable easy package registration on CKAN
  • Usability improvements (package search and creation on front page)
  • Use external list of licenses from license repository
  • Convert from py.test to nosetests

There are now over 560 packages in the registry - which means that on average we’ve been adding a package a day since version 0.8 was released in May!

Linking Open Data cloud

We’re currently organising a workshop on ‘open data and the semantic web’, which will take place in London this autumn. Details are as follows:

  • When: Friday 13th November 2009, 1000-1800
  • Where: London Knowledge Lab, 23-29 Emerald Street, London, WC1N 3QS. (See map)
  • Wiki: http://wiki.okfn.org/SemanticWeb
  • Participation: Attendance is free. If you are planning to come along please add your name to the wiki.
  • Microbloggers: See notices on identi.ca and Twitter

Further details:

Semantic web technologists and advocates are increasingly beginning to see the value of ‘open data’ for the data web. Tim Berners-Lee has spoken about the importance of open data, and being able to access raw data in easy to use formats, and the Linking Open Data project demonstrates what can be done by linking together a rich variety of publicly re-usable datasets.

This informal, hands-on workshop will bring together researchers, technologists, and people interested in open data and the semantic web from both public and private sector organisations for a day of talks and discussions.

Themes will include:

  • Linking Open Data
  • Legal tools for open data
  • Finding open data