
You are browsing the archive for Open Knowledge Definition.

Opening up linguistic data at the American National Corpus

Guest - January 15, 2011 in External, Featured Project, Open Data, Open Knowledge Definition, Open Knowledge Foundation, Open/Closed, Releases, WG Linguistics, Working Groups

The following guest post is from Nancy Ide, Professor of Computer Science at Vassar College, Technical Director of the American National Corpus project and member of the Open Knowledge Foundation’s Working Group on Open Linguistic Data.

The American National Corpus (ANC) project is creating a collection of texts produced by native speakers of American English since 1990. Its goal is to provide at least 100 million words of contemporary language data covering a broad and representative range of genres, including but not limited to fiction, non-fiction, technical writing, newspaper writing, spoken transcripts of various verbal communications, as well as new genres (blogs, tweets, etc.). The project, which began in 1998, was originally motivated by the needs of three major groups: linguists, who use corpus data to study language use and change; dictionary publishers, who use large corpora to identify new vocabulary and provide examples; and computational linguists, who need very large corpora to develop robust language models—that is, to extract statistics concerning patterns of lexical, syntactic, and semantic usage—that drive natural language understanding applications such as machine translation and information search and retrieval (à la Google).

Corpora for computational linguistics and corpus linguistics research are typically annotated for linguistic features, so that, for example, every word is tagged with its part of speech, every sentence is annotated for syntactic structure, etc. To be of use to the research and development community, it should be possible to re-distribute the corpus with its annotations so that others can reuse and/or enhance it, if only to replicate results, as is the norm for most scientific research. The redistribution requirement has proved to be a major roadblock to creating large linguistically annotated corpora, since most language data, even on the web, is not freely redistributable. As a result, the large corpora most often used for computational linguistics research on English are the Wall Street Journal corpus, consisting of material from that publication produced in the early ‘90s, and the British National Corpus (BNC), which contains British English from a variety of genres, all produced prior to 1994, when it was first released. Neither corpus is ideal: the first because of its limited range of genres, and the second because it contains only British English and is annotated for part of speech only. In addition, neither reflects current usage (for example, words like “browser” and “google” do not appear).
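To make the notion of a part-of-speech-annotated corpus concrete, here is a minimal Python sketch. The sentences and tags are invented for illustration (using Penn-Treebank-style tag names) and are not taken from the ANC itself; real corpora use much richer formats and far larger inventories.

```python
from collections import Counter

# A tiny invented fragment of a POS-tagged corpus: each token is paired
# with a Penn-Treebank-style part-of-speech tag.
tagged = [
    ("The", "DT"), ("browser", "NN"), ("crashed", "VBD"),
    ("twice", "RB"), (".", "."),
    ("Google", "NNP"), ("indexes", "VBZ"), ("the", "DT"),
    ("web", "NN"), (".", "."),
]

# The kind of statistic computational linguists extract from such data:
# how often each part of speech occurs.
tag_counts = Counter(tag for _, tag in tagged)
print(tag_counts["NN"])  # 2 — the common nouns "browser" and "web"

# Retrieve all nouns (common and proper) from the fragment.
nouns = [word for word, tag in tagged if tag.startswith("NN")]
print(nouns)  # ['browser', 'Google', 'web']
```

Scaled up to hundreds of millions of words, counts like these are what drive the statistical language models mentioned above.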

The ANC was established to remedy the lack of large, contemporary, richly annotated American English corpora representing a wide range of genres. In the original plan, the project would follow the BNC development model: a consortium of dictionary publishers would provide both the initial funding and the data to include in the corpus, which would be distributed by the Linguistic Data Consortium (LDC) under a set of licenses reflecting the restrictions (or lack thereof) imposed by these publisher-donors. These publishers would get the corpus and its linguistic annotations for free and could use it as they wished to develop their products; commercial users who had not contributed either money or data would have to pay a whopping $40,000 to the LDC for the privilege of using the ANC for commercial purposes. The corpus would be available for research use only for a nominal fee.

The first and second releases (a total of 22 million words) of the ANC were distributed through LDC from 2003 onward under the conditions described above. However, shortly after the second ANC release in 2005, we determined that the license for 15 of the 22 million words in the ANC did not restrict its use in any way—it could be redistributed and used for any purpose, including commercial. We had already begun to distribute additional annotations (which are separate from and indexed into the corpus itself) on our web site, and it occurred to us that we could freely distribute this unrestricted 15 million words as well. This gave birth to the Open ANC (OANC), which was immediately embraced by the computational linguistics community. As a result, we decided that from this point on, additions to the ANC would include only data that is free of restrictions concerning redistribution and commercial use. Our overall distribution model is to enable anyone to download our data and annotations for research or commercial development, asking (but not requiring) that they give back any additional annotations or derived data they produce that might be useful for others, which we will in turn make openly available.
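The phrase “separate from and indexed into the corpus itself” describes standoff annotation: the primary text is distributed once, and annotation layers point into it by offsets, so anyone can add new layers without modifying or redistributing the text. A toy Python sketch of the idea follows; the tuple format here is invented for illustration, and the ANC’s actual annotation format is considerably richer.

```python
# The primary text is distributed once; annotations live separately
# and point into it by character offset (a hypothetical, simplified format).
text = "The cat sat."

# Standoff part-of-speech annotations: (start, end, tag) spans.
pos_annotations = [
    (0, 3, "DT"),    # "The"
    (4, 7, "NN"),    # "cat"
    (8, 11, "VBD"),  # "sat"
]

# Resolve each span against the unmodified primary text.
for start, end, tag in pos_annotations:
    print(text[start:end], tag)
```

Because the text itself is never altered, a third party can produce and share a syntactic or semantic layer over the same offsets and give it back, which is exactly the distribution model described above.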

Unfortunately, the ANC has not been funded since 2005, and only a few of the consortium publishers provided us with texts for the ANC. However, we have continued to gather millions of words of data from the web that we hope to be able to add to the OANC in the near future. We search for current American English language data that is either clearly identified as public domain or licensed with a Creative Commons “attribution” license. We stay away from “share-alike” licenses because of the potential restriction for commercial use: a commercial enterprise would not be able to release a product incorporating share-alike data or resources derived from it under the same conditions. It is here that our definition of “open” differs from the Open Knowledge Definition—until we can be sure that we are wrong, we regard the viral nature of the share-alike restriction as prohibitive for some uses, and therefore data with this restriction are not completely “open” for our purposes.

Unfortunately, because we don’t use “share-alike” data, the web texts we can put in the OANC are severely limited. A post on this blog by Jordan Hatcher a little while ago mentioned that the popularity of Creative Commons licenses has muddied the waters, and we at the ANC project agree, although for different reasons. We notice that many people—particularly producers of the kinds of data we most want to get our hands on, such as fiction and other creative writing—tend to automatically slap at least a “share-alike” and often also a “non-commercial” CC license on their web-distributed texts. At the same time, we have some evidence that when asked, many of these authors have no objection to our including their texts in the OANC, despite the OANC’s lack of similar restrictions. It is not entirely clear how the SA and NC categories became an effective default standard license, but my guess is that many people feel that SA and NC are the “right” and “responsible” things to do for the public good. This, in turn, may result from the fact that the first widely-used licenses, such as the GNU General Public License, were intended for use with software. In this context, share-alike and non-commercial make some sense: sharing seems clearly to be the civic-minded thing to do, and no one wants to provide software for free that others could subsequently exploit for a profit. But for web texts, these criteria may make less sense. The market value of a text that one puts on the web for free use (e.g., blogs, vs. works published via traditional means and/or available through electronic libraries such as Amazon) is potentially very small, compared to that of a software product that provides some functionality that a large number of people would be willing to pay for. For that reason, use of web texts in a corpus like the ANC might qualify as Fair Use—but so far, we have not had the courage to test that theory.

We would really like to see something like the Open Data Commons Attribution License (ODC-BY) become the license that authors automatically reach for when they publish language data on the web, in the way the CC BY-NC-SA license is now. ODC-BY was developed primarily for databases, but it would not take much to apply it to language data, if it has not been done already (see, e.g., the Definition of Free Cultural Works). Either that, or we could determine whether, in fact, because of the lack of monetary value, Fair Use could apply to whole texts (see, for example, Bill Graham Archives v. Dorling Kindersley Ltd., 448 F. 3d 605 – Court of Appeals, 2nd Circuit 2006, concerning Fair Use applied to entire works).

In the meantime, we continue to collect texts from the web that are clearly usable for our purposes. We also have a web page set up where one can contribute their writing of any kind (fiction, blog, poetry, essay, letters, email) – with a sign off on rights – to the OANC. So far, we have managed to collect mostly college essays, which college seniors seem quite willing to contribute for the benefit of science upon graduation. We welcome contributions of texts (check the page to see if you are a native speaker of American English), as well as input on using web materials in our corpus.

Richard Poynder interviews Jordan Hatcher

Guest - October 19, 2010 in Interviews, Legal, Open Data, Open Data Commons, Open Definition, Open Government Data, Open Knowledge Definition, Open Knowledge Foundation, Public Domain, WG Open Licensing

Open Access journalist extraordinaire Richard Poynder recently interviewed the Open Knowledge Foundation’s Jordan Hatcher about data licensing, the public domain, and lots more. An excerpt is reproduced below. The full version is available on Richard’s website.

Over the past twenty years or so we have seen a rising tide of alternative copyright licences emerge — for software, music and most types of content. These include the Berkeley Software Distribution (BSD) licence, the General Public Licence (GPL), and the range of licences devised by Creative Commons (CC). More recently a number of open licences and “dedications” have also been developed to help people make data more freely available.

The various new licences have given rise to terms like “copyleft” and “libre” licensing, and to a growing social and political movement whose ultimate end-point remains to be established.

Why have these licences been developed? How do they differ from traditional copyright licences? And can we expect them to help or hinder reform of the traditional copyright system — which many now believe has got out of control? I discussed these and other questions in a recent email interview with Jordan Hatcher.

A UK-based Texas lawyer specialising in IT and intellectual property law, Jordan Hatcher is a co-founder of Open Data Commons and a board member of the Open Knowledge Foundation (OKF); he blogs under the name opencontentlawyer.


Jordan Hatcher

Big question

RP: Can you begin by saying something about yourself and your experience in the IP/copyright field?

JH: I’m a Texas lawyer living in the UK and focusing on IP and IT law. I concentrate on practical solutions and legal issues centred on the intersection of law and technology. While I like the entire field of IP, international IP and copyright are my favourite areas.

As to more formal qualifications, I have a BA in Radio/TV/Film, a JD in Law, and an LLM in Innovation, Technology and the Law. I’ve been on the team that helped bring Creative Commons licences to Scotland and have led, or been a team member on, a number of studies looking at open content licences and their use within universities and the cultural heritage sector.

I was formerly a researcher at the University of Edinburgh in IP/IT, and for the past 2.5 years have been providing IP strategy and IP due diligence services with a leading IP strategy consultancy in London.

I’m also the co-founder and principal legal drafter behind Open Data Commons, a project to provide legal tools for open data, and the Chair of the Advisory Council for the Open Definition. I sit on the board for the Open Knowledge Foundation.

More detail than you can ask for is available on my web site here, and on my LinkedIn page here.

RP: It might also help if you reminded us what role copyright is supposed to play in society, how that role has changed over time (assuming that you feel it has) and whether you think it plays the role that society assigned to it successfully today.

JH: Wow, that’s a big question, and one whose answer has changed quite a bit since the origin of copyright. As with most law, I take a utilitarian / legal realist view that the law is there to encourage a set of behaviours.

Copyright law is often described as being created to encourage more production and dissemination of works, and like any law, it’s imperfect in its execution.

I think what’s most interesting about copyright history is the technology side (without trying to sound like a technological determinist!). As new and potentially disruptive technologies have come along and changed the balance — from the printing press all the way to digital technology — the way we have reacted has been fairly consistent: some try to hang on to the old model as others eagerly adopt the new model.

For those interested in learning more about copyright’s history, I highly recommend the work of Ronan Deazley, and suggest people look at the first sections in Patry on Copyright. They could also usefully read Patry’s Moral Panics and the Copyright Wars. Additionally, there are many historical materials on copyright available at the homepage for a specific research project on the topic here.

Three tranches

RP: In the past twenty years or so we have seen a number of alternative approaches to licensing content develop — most notably through the General Public Licence and the set of licences developed by the Creative Commons. Why do you think these licences have emerged, and what are the implications of their emergence in your view?

JH: I see free and open licence development as happening within three tranches, all related to a specific area of use.

1. FOSS for software. Alongside the GPL, there have been a number of licences developed since the birth of the movement (and continuing to today), all aimed at software. These licences work best for software and tend to fall over when applied to other areas.

2. Open licences and Public licences for content. These are aimed at content, such as video, images, music, and so on. Creative Commons is certainly the most popular, but definitely not the first. The birth of CC does however represent a watershed moment in thinking about open licensing for content.

I distinguish open licences from public licences here, mostly because Creative Commons is so popular. Open has so many meanings to people (as does “free”) that it is critical to define from a legal perspective what is meant when one says “open”. The Open Knowledge Definition does this, and states that “open” means users have the right to use, reuse, and redistribute the content with very few restrictions — only attribution and share-alike are allowed restrictions, and commercial use must specifically be allowed.

The Open Definition means that only two out of the main six CC licences are open content licences — CC-BY and CC-BY-SA. The other four involve either the No Derivatives (ND) restriction (thus prohibiting reuse) or the Non-Commercial (NC) restriction. These four are what I refer to as “public licences”; in other words, they are licences provided for use by the general public.

Of course CC’s public domain tools, such as CC0, all meet the Open Definition as well because they have no restrictions on use, reuse, and redistribution.

I wrote about this in a bit more detail recently on my blog.

3. Open Data Licences. Databases are different from content and software — they are a little like both in what users want to do with them and how licensors want to protect them, but are different from software and content in both the legal rights that apply and how database creators want to use open data licences.

As a result, there’s a need for specific open data licences, which is why we founded Open Data Commons. Today we have three tools available. It’s a new area of open licensing and we’re all still trying to work out all the questions and implications.

Open data

RP: As you say, data needs to be treated differently from other types of content, and for this reason a number of specific licences have been developed — including the Public Domain Dedication and Licence (PDDL), the Public Domain Dedication Certificate (PDDC) and Creative Commons Zero. Can you explain how these licences approach the issue of licensing data in an open way?

JH: The three you’ve mentioned are all aimed at placing work into the public domain. The public domain has a very specific meaning in a legal context: It means that there are no copyright or other IP rights over the work. This is the most open/free approach as the aim is to eliminate any restrictions from an IP perspective.

There are some rights that can be hard to eliminate, and of course patents may still be an issue depending on the context (but perhaps that’s a conversation for another time).

In addition to these tools, we’ve created two additional specific tools for openly licensing databases — the ODbL and the ODC-Attribution licences.

RP: Can you say something about these tools, and what they bring to the party?

JH: All three are tools to help increase the public domain and make it better known and more accessible.

There’s some really exciting stuff going on with the public domain right now, including with PD calculators — tools to automatically determine whether a work is in the public domain. The great thing about work in the public domain is that it is completely legally interoperable, as it eliminates copyright restrictions.

See the rest of the interview on Open and Shut

Open Licenses vs Public Licenses

Guest - October 15, 2010 in Legal, OKI Projects, Open Data, Open Data Commons, Open Definition, Open Knowledge Definition, Open Knowledge Foundation, Open Standards, Open/Closed

The following post is from Jordan Hatcher, a Director at the Open Knowledge Foundation and founder of the Open Data Commons project. It was originally posted on his blog.

Let’s face it, we often have a definition problem.

It’s critical to distinguish “open licenses” from “public licenses” when discussing IP licensing, especially online — mostly because Creative Commons is so popular and as a result has muddied the waters a bit.

“Open” has so many meanings to people (as, of course, do “free software” and free cultural works) that it is critical to define from a legal perspective what is meant when one says “open”. The Open Knowledge Definition does this, and states that “open” means users have the right to use, reuse, and redistribute the content with very few restrictions — only attribution and share-alike restrictions are OK, and commercial use must specifically be allowed.

Which CC licenses are Open?

The Open Definition means that only two out of the main six CC licenses are open content licenses — CC-BY and CC-BY-SA. The other four involve one of the two non-open license elements: the No Derivatives (ND) restriction (thus prohibiting reuse) or the Non-Commercial (NC) restriction. These four are “public licenses”; in other words, they are licenses provided for use by the general public.

Of course CC’s public domain tools, such as CC0, all meet the Open Definition as well because they have no restrictions on use, reuse, and redistribution.
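The classification rule is mechanical enough to express as a toy Python sketch. It simply encodes the Open Definition test described above (attribution and share-alike elements allowed; non-commercial and no-derivatives not); it is an illustration, not a legal tool.

```python
# Simplified Open Definition test: attribution (BY) and share-alike (SA)
# elements are permitted in an open license; non-commercial (NC) and
# no-derivatives (ND) are not.
ALLOWED = {"BY", "SA"}

def is_open(elements):
    """Return True if a license built from these CC-style elements
    would satisfy the Open Definition (simplified)."""
    return all(e in ALLOWED for e in elements)

# The six main CC licenses, decomposed into their license elements.
cc_licenses = {
    "CC-BY":       {"BY"},
    "CC-BY-SA":    {"BY", "SA"},
    "CC-BY-NC":    {"BY", "NC"},
    "CC-BY-ND":    {"BY", "ND"},
    "CC-BY-NC-SA": {"BY", "NC", "SA"},
    "CC-BY-NC-ND": {"BY", "NC", "ND"},
}

open_ones = sorted(name for name, els in cc_licenses.items() if is_open(els))
print(open_ones)  # ['CC-BY', 'CC-BY-SA'] — only these two pass
```

As the output shows, exactly two of the six main CC licenses come out as open under this test, matching the argument above.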

The Open Data Commons legal tools, including the PDDL, the ODbL and the ODC Attribution License, all comply with the Open Definition, and so are all open public licenses.

I haven’t done a full survey, but the majority of open licenses (in terms of popularity) probably also fit the definition of public licenses, as open license authors tend to draft licenses for public consumption (and these naturally tend to be the most used ones). Many open licenses aren’t public licenses though — mainly those drafted for use by a specific licensor, such as a government or business. So the UK government’s new Open Government Licence isn’t a public license, because it’s not meant to be used without alteration by other governments; but, provided it meets the Open Definition, it would be an open license.

A simple Venn Diagram might be:

Update on Open Source Initiative’s adoption of the Open Knowledge Definition

Jonathan Gray - August 4, 2010 in External, Open Data, Open Definition, Open Knowledge Definition, Open Knowledge Foundation

A few weeks back we blogged about Russ Nelson’s proposals for the Open Source Initiative (OSI) to adopt the Open Knowledge Definition, our standard for openness in relation to content and data.

Russ has written back to us with some notes and questions from a session on this at OSCON:

Okay, so, as promised, here is my report on the “Open Data Definition” BOF held on Wednesday, July 21, at 7PM. There were about ten people present, which is a reasonable attendance, particularly when set against the Google Android Hands-on session at which they gave out free Nexus One phones.

Didn’t seem wise to me to start from scratch, especially given the good work done by the Open Knowledge Foundation on their Open Knowledge Definition. So we read through it section by section, by way of review. Here are the questions we arrived at (thanks to Skud aka Kirrily Robert for taking notes):

  1. What happens with data that’s not copyrightable?
  1a. What about data that consists of facts about the world, so that even a collection of it cannot be copyrighted, but where the exact file format can be copyrighted? Many sub-federal-level governments in the US have to publish facts on demand but claim a copyright on the formatting.
  2. What about data that’s not accessible as a whole, but only through an API?
  3. We’re thinking that OKD #9 should read “execution of an additional agreement” rather than “additional license”.
  4. Does OKD #4 apply to works distributed in a particular file format? Is a movie not open data if it’s encoded in a patent-encumbered codec? Does it become open data if it’s re-encoded?
  5. What constitutes onerous attribution in OKD #5? If you get open data from somebody, and they have an attribution page, is it sufficient for you to comply with the attribution requirement if you point to the attribution page?

This serves as an invitation to discuss these issues on the new list.

If these issues are successfully resolved, then this committee will recommend to the OSI board that the OKD should be adopted as OSI approved. If they can’t be resolved by, say, the end of 2010, then we will give up on trying. Either way, the intent is to lay down the list by the end of this year unless the participants desire otherwise.

So if you’d like to join the conversation, please join the list! We’ve also created an Etherpad to gather responses to some of these issues:

Belarusian translation of the Open Knowledge Definition (OKD)

Daniel Dietrich - July 28, 2010 in Open Data, Open Definition, Open Knowledge Definition

We’ve just added a Belarusian translation of the Open Knowledge Definition thanks to Patricia Clausnitzer!

If you’d like to translate the Definition into another language, or if you’ve already done so, please get in touch on our discuss list, or on info at the OKF’s domain name (okfn dot org).

Should the Open Source Initiative adopt the Open Knowledge Definition?

Jonathan Gray - July 19, 2010 in Open Data, Open Definition, Open Knowledge Definition

Russ Nelson, License Approval Chair at the Open Source Initiative (OSI), recently proposed a session at OSCON about OSI adopting a definition for open data:

I’m running a BOF at OSCON on Wednesday night July 21st at 7PM, with the declared purpose of adopting an Open Source Definition for Open Data. Safe enough to say that the OSD has been quite successful in laying out a set of criteria for what is, and what is not, Open Source. We should adopt a definition of Open Data, even if it means merely endorsing an existing one. Will you join me there?

Subsequently a bunch of people wrote to Russell letting him know about the Open Knowledge Definition that we created a few years ago:

The Open Knowledge Definition (OKD) sets out principles to define ‘openness’ in knowledge – that’s any kind of content or data ‘from sonnets to statistics, genes to geodata’. The definition can be summed up in the statement that “A piece of knowledge is open if you are free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and share-alike.”

Russell suggested there was scope for the OSI to adopt the OKD, and emailed us a further blurb for the event:

Should the Open Source Initiative write its own definition of Open Data? Or is the Open Knowledge Foundation’s definition up to snuff? Come help us decide at OSCON next week. We have a BOF scheduled at 19:00 on 21 July 2010. We’ll present the results of our decision to the OSI for adoption at its next board meeting.

We’re excited at the prospect that the OKD might get adopted as an official open data definition by OSI, and would love to hear from folks who plan to attend the session!

Why Share-Alike Licenses are Open but Non-Commercial Ones Aren’t

Rufus Pollock - June 24, 2010 in Ideas and musings, Open Data, Open Definition, Open Knowledge Definition

It is sometimes suggested that there isn’t a real difference in terms of “openness” between share-alike (SA) and non-commercial (NC) clauses — both being some restriction on what the user of that material can do, and, as such, a step away from openness.

This is not true. A meaningful distinction can be drawn between share-alike and non-commercial clauses (or any other clause that discriminates against a particular type of person or field of endeavour), with the former being “open” and the latter being not “open”.

This distinction is important. It has relevance, for example, as to why Open Data Commons should not provide NC licenses but will provide a share-alike one. It is also relevant to Creative Commons, whose set of licenses includes both share-alike and non-commercial options. As such, not all CC licenses are open, and CC licenses are not all mutually compatible. This is something of an irony, as it means that Creative Commons provides a set of licenses that don’t, in fact, result in a commons.

What’s the Problem? Why Does This Matter?

> What’s the problem with NC licenses, aren’t “SA” licenses a step away from open too? And if we debate this, don’t we just end up having a pointless license holy war?

The distinction between NC and SA licenses isn’t about “holy war” but something very practical: license compatibility and the integrity of the “open” commons. The core of a “commons” of data (or code) is that one piece of “open” material contained therein can be freely intermixed with other “open” material.

This interoperability is absolutely key to realizing the main practical benefit of “openness”, which is ease of use and reuse — which, in turn, means more and better stuff getting created and used.

The Open Knowledge/Data Definition functions as a “standard” to ensure interoperability, in just the same way as normal tech standards do (but in this case for licenses rather than for a piece of hardware or software). The aim is to ensure that any license which complies with the definition will be interoperable with any other such license, meaning that data or content under one license can be combined with data or content under the other.

Share-alike or attribution requirements are allowed within the definition precisely because they do not break this interoperability (and may even help promote the commons by ensuring material is “shared back”). Non-commercial provisions are not permitted because they fundamentally break the commons, not only through being incompatible with other licenses but because they overtly discriminate against particular types of users. (I should emphasize here that the definition is directly following the line set out in the original open source definition …)

Thus, there is a meaningful distinction between attribution and share-alike requirements and others such as non-commercial (NC), and it is a distinction that merits describing share-alike licenses as open and non-commercial licenses as not open.

Isn’t It Just About Degree?

> Yes, NC and especially ND are more restrictive, but stating that NC licenses aren’t open is wrong – they’re just not as open.

This is incorrect.

To reiterate: it is a mistake to view the set of licenses as some continuous spectrum of ‘openness’ with PD at one end and full rights reserved at the other — with the implication that all licenses in between are more or less open.

There are significant discontinuities, and in particular we can meaningfully partition the set of licenses into open and not-open based on (a) their interoperability and (b) the freedom they provide to all persons (and companies) to use, reuse and redistribute.

But You Can’t Trademark Openness …

> it’s annoying that someone claims to be releasing data openly, but it turns out to be NC and no-compete and a bunch of other stuff. It would be nice to say to them – “you can’t claim to be open because you don’t meet this definition”. But unfortunately it would probably be difficult to get the trademark on the word “open”

It’s quite right that you can’t trademark openness — and no-one should want to! However, we can make an effort as a community to have a clear shared meaning for “open” in relation to data and content, along the lines of the Open Definition — just as the open source definition has done for code. By insisting on this meaning we are doing something valuable: creating a standard and maintaining interoperability.

Russian translation of the Open Knowledge Definition (OKD)

lisa - April 27, 2010 in Open Definition, Open Knowledge Definition, Open Knowledge Foundation

We’ve just added a Russian translation of the Open Knowledge Definition thanks to Maxim Dubinin.

If you’d like to translate the Definition into another language, or if you’ve already done so, please get in touch on our discuss list, or on info at the OKF’s domain name (okfn dot org).

Norwegian translation of the Open Knowledge Definition (OKD)

lisa - April 22, 2010 in OKI Projects, Open Definition, Open Knowledge Definition, Open Knowledge Foundation

We are pleased to now have a Norwegian translation of the Open Knowledge Definition thanks to Svein-Magnus Sørensen, Harald Groven and Olav Anders Øvrebø.

If you’d like to translate the Definition into another language, or if you’ve already done so, please get in touch on our discuss list, or on info at the OKF’s domain name (okfn dot org).

A free software model for open knowledge

jwalsh - March 17, 2010 in CKAN, datapkg, Events, OKI Projects, Open Data Commons, Open Knowledge Definition, Open Knowledge Foundation, Talks

Notes describing the talk on the work of the Open Knowledge Foundation given last week at Jornadas SIG Libre.

OKF activity graph

I was happily surprised to be asked to give this open knowledge talk at an open source software conference. But it makes sense – the free software movement has created the conditions in which an open data movement is possible. There is a lot to learn from open source processes, in both a technical and an organisational sense.

In English we have one word, “free”, where Spanish, like most languages, has two, gratis and libre, signifying separately “free of cost” and “freedom to”. The Open Source Initiative coined “Open Source” as a branding or marketing exercise to avoid the primary meaning “free of cost”. So whenever I say “open” I want you to hear the word “libre”. [Later I was told that libre can have meaning in at least 15 different ways.]

The best way to talk about the work of the Open Knowledge Foundation is to look at its projects, which form an open knowledge stack similar to the OSGeo software stack.

Open Definition

The Open Knowledge Definition is based on the OSI Open Source Definition (which OSGeo uses as a reference for acceptable software licenses). The OKD permits no restrictions on field of endeavour, so non-commercial-use licenses are not open under it. An open data license will pass the cake test.

Open Data Commons

Open Data Commons is run by Jordan Hatcher, who started work on the Open Database License with support from Talis, followed by extensive negotiation with the OpenStreetMap community. The ODbL is a ShareAlike license for data that addresses both the inapplicability of copyright to facts and the greediness of the ShareAlike clause when it comes to uses such as maps embedded in PDFs.

The PDDL is a license that implements the Science Commons protocol for open access data, explicitly placing data in the public domain.

The Panton Principles are four precepts for publishers of scientific research data who wish that data to be freely reusable. Being openly able to inspect, critique and re-analyse data is critical to the effectiveness of scientific research.

Open Data Grid

The Open Data Grid is a project in early incubation, based on the Tahoe distributed filesystem; it needs development effort on Tahoe to really get going. The idea is to provide secure storage for open datasets around the edges of infrastructure that people are already running.

People are handwaving about the Cloud, but storage and backup are not problems that it is really meant to solve. People make different claims about the Cloud – cheaper, greener, more efficient, more flexible. Can we get these things in other ways?

There is a saying: “never underestimate the bandwidth of a truck full of DAT tapes”.

Comprehensive Knowledge Archive Network (CKAN)

CKAN is inspired by free software package repositories: Perl’s CPAN, R’s CRAN, Python’s PyPI. It provides a wiki-like interface for creating minimal metadata for packages, with a versioned domain model and an HTTP API.
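As a sketch of what the HTTP API makes possible — assuming CKAN’s later v3 “action” API, with a hypothetical base URL and package name — fetching and unpacking a package’s metadata looks something like:

```python
import json
from urllib.parse import urlencode

def package_show_url(base, name):
    """Build the URL for CKAN's package_show action (API v3)."""
    return f"{base}/api/3/action/package_show?{urlencode({'id': name})}"

def parse_package(payload):
    """Extract the minimal wiki-style metadata CKAN keeps for a package."""
    doc = json.loads(payload)
    if not doc.get("success"):
        raise ValueError("CKAN API call failed")
    pkg = doc["result"]
    return {
        "name": pkg["name"],
        "title": pkg.get("title"),
        "resources": [r["url"] for r in pkg.get("resources", [])],
    }

# A canned response standing in for a live CKAN instance (hypothetical data)
sample = json.dumps({
    "success": True,
    "result": {
        "name": "climate-data",
        "title": "Climate Data",
        "resources": [{"url": "http://example.org/d.csv"}],
    },
})
print(parse_package(sample)["name"])  # climate-data
```

In a real session one would GET the URL from `package_show_url` and feed the response body to `parse_package`; the canned `sample` here just illustrates the response shape.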

CKAN supports groups, which can curate a package namespace — e.g. climate data — and assess priorities for turning packages into fully installable ones.

CKAN’s open source code is also being used in a government data package catalogue, part of the Making Public Data Public effort in the UK.


The Debian of Data

datapkg takes Debian’s apt tool as inspiration for fully automatable installation of data packages, with dependencies between them. It is currently at a usable alpha stage, with a Python implementation.
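datapkg’s actual command set and metadata format aren’t shown here; as an illustrative sketch (with a made-up registry of package names), apt-style resolution of data-package dependencies amounts to a depth-first walk that installs dependencies before dependents:

```python
def install_order(packages, target):
    """Return an install order for `target` with dependencies first.

    `packages` maps package name -> list of dependency names,
    mimicking apt-style resolution over data packages.
    """
    order, seen = [], set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dep in packages.get(name, []):
            visit(dep)          # install dependencies first
        order.append(name)

    visit(target)
    return order

# Hypothetical registry: a derived dataset depending on two source packages
registry = {
    "uk-spending-analysis": ["uk-spending-raw", "uk-region-codes"],
    "uk-spending-raw": [],
    "uk-region-codes": [],
}
print(install_order(registry, "uk-spending-analysis"))
# ['uk-spending-raw', 'uk-region-codes', 'uk-spending-analysis']
```

A real tool would additionally fetch each package and verify versions, but the ordering problem is the apt-like core.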

Where Does My Money Go?

The next challenge really is to bring these concerns and solutions to a mainstream public. Agustín Lobo spoke of “a personal consciousness but not an institutional consciousness” when it comes to open source and open data. Media coverage and exemplary government implementations help to create that institutional consciousness.

Pressure for increased open access is coming from academia – for the research data underlying papers, for the right to data mine and correlate different sources, for library data open for re-use. Pressure is also coming from within museums, libraries and archives – memory institutions who want to increase exposure to their collections with new technology, and recognise that open data, linked to a network of resources, will work for sustainability and not against it.

The next generation of researchers, who are kids in school now, will grow up with an expectation that code and data are naturally open. It will be interesting to see what they make!

Meanwhile OpenStreetMap is feeding several startups, and more commercial presence in the open data space will be a benefit; it illustrates that one does not have to be proprietary to be commercial.

Now higher-profile government projects that open up data are helping to bring the movement mainstream. To what extent is openness a fashionable position, and to what extent is it reflected throughout the way of working?

Open process: early release, public sharing of bugs, public discussion of plans — everything in Nat Torkington’s post on Truly Open Data. The opportunity to fail in public, to learn from others’ problems, and to collaborate out of self-interest.

I had a great time at SIG Libre 10. Oscar Fonts’ talk on OpenSearch Geospatial interfaces to popular services has me itching to add an OpenSearch +Geo interface to CKAN, as well as to work on getting the apparent version skew in the Geo extensions resolved amicably.

Genís Roca spoke thought-provokingly on Retorno y rentabilidad (“return and profitability” — though rentabilidad, “rentability”, is less exploitative and less profit-focused than the English word suggests). Rentability, especially for online services, can come in ways that sustain an organisation predictably and don’t involve fishing in the pockets of ultimate end-users.

Ivan Sanchez showed areas of OpenStreetMap Spain with a stunning level of detail — trees and fences, MasterMap-quality coverage. I’m inspired to pick up JOSM and Merkaartor to add building-level detail from out-of-copyright 1:500 Edinburgh town plans at the National Library of Scotland’s map services.

Agustín Lobo talked about the distributed work, cross-institutional support and benefit of the R project, and the impact of open source on open access to data in science. He mentioned a Nature open peer review experiment that was discarded — I suspect it wasn’t curated enough. The talk helped me connect the OKF’s work to the rest of the Jornadas.

The shiny slides, which many people asked for, should show embedded in this page, I hope. I stupidly forgot to put URLs on the slides, which is partly why I have written this post.
