<?xml version="1.0" encoding="UTF-8"?>
<!-- generator="wordpress/2.0.1" -->
<rss version="2.0" 
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	>

<channel>
	<title>Open Knowledge Foundation Weblog</title>
	<link>http://blog.okfn.org</link>
	<description></description>
	<pubDate>Wed, 14 May 2008 22:21:35 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.0.1</generator>
	<language>en</language>
			<item>
		<title>CKAN 0.5 Released</title>
		<link>http://blog.okfn.org/2008/02/01/ckan-05-released/</link>
		<comments>http://blog.okfn.org/2008/02/01/ckan-05-released/#comments</comments>
		<pubDate>Fri, 01 Feb 2008 11:39:45 +0000</pubDate>
		<dc:creator>Jonathan Gray</dc:creator>
		
	<category>News</category>
	<category>Open Knowledge</category>
	<category>OKF Projects</category>
	<category>Metadata</category>
	<category>Technical</category>
		<guid isPermaLink="false">http://blog.okfn.org/2008/02/01/ckan-05-released/</guid>
		<description><![CDATA[The Comprehensive Knowledge Archive Network (CKAN) version 0.5 has just been released.

Changes include:


feature to list and search tags
feature to make data available in machine-usable form via sql dump
feature to purge a revision and associated changes
support for reserved html characters in urls
upgrade to Pylons 0.9.6
new spam management utilities including (partial) blacklist support


The CKAN code is available [...]]]></description>
			<content:encoded><![CDATA[<p>The Comprehensive Knowledge Archive Network (CKAN) version 0.5 has just been released.</p>

<p>Changes include:</p>

<ul>
<li>feature to list and search tags</li>
<li>feature to make data available in machine-usable form via sql dump</li>
<li>feature to purge a revision and associated changes</li>
<li>support for reserved html characters in urls</li>
<li>upgrade to Pylons 0.9.6</li>
<li>new spam management utilities including (partial) blacklist support</li>
</ul>

<p>The CKAN code is available from:</p>

<ul>
<li><a href="http://pypi.python.org/pypi/ckan/0.5">http://pypi.python.org/pypi/ckan/0.5</a></li>
</ul>

<p>The data is available from:</p>

<ul>
<li><a href="http://ckan.net/license/">http://ckan.net/license/</a></li>
</ul>

<p>We&#8217;ve currently got 135 packages. If you come across a large dataset or substantial collection, please consider registering it on CKAN!</p>
]]></content:encoded>
			<wfw:commentRSS>http://blog.okfn.org/2008/02/01/ckan-05-released/feed/</wfw:commentRSS>
		</item>
		<item>
		<title>Collaborative Development of Data</title>
		<link>http://blog.okfn.org/2007/02/20/collaborative-development-of-data/</link>
		<comments>http://blog.okfn.org/2007/02/20/collaborative-development-of-data/#comments</comments>
		<pubDate>Tue, 20 Feb 2007 12:29:09 +0000</pubDate>
		<dc:creator>Rufus Pollock</dc:creator>
		
	<category>Open Knowledge</category>
	<category>Musings</category>
	<category>Technical</category>
	<category>Open Data</category>
		<guid isPermaLink="false">http://blog.okfn.org/2007/02/20/collaborative-development-of-data/</guid>
		<description><![CDATA[$ This version: 2007-02-15 (First version 2006-05-24) $

We already have some fairly good working processes for collaborative development of unstructured text: the two most prominent examples being source code of computer programs and wikis for general purpose content (encyclopedias etc). However these tools perform poorly (or not at all) when we come to structured data.

The [...]]]></description>
			<content:encoded><![CDATA[<p>$ This version: 2007-02-15 (First version 2006-05-24) $</p>

<p>We already have some fairly good working processes for collaborative development of unstructured text: the two most prominent examples being source code of computer programs and wikis for general purpose content (encyclopedias etc). However these tools perform poorly (or not at all) when we come to structured data.</p>

<p>The purpose of this short essay is to pose the question: how do we collaboratively develop knowledge when it is in the form of structured data (as opposed to unstructured text)?</p>

<p>There are two aspects of structured data that distinguish it from plain text:</p>

<ol>
<li>Referential integrity (objects point to other objects)</li>
<li>Labelling to enable machine processing (and the addition of &#8216;semantics&#8217;)</li>
</ol>

<p>To illustrate what I mean consider the following use case which comes from our own <a href="http://www.publicdomainworks.net/">public domain works project</a>. Here we are storing data about cultural works. In the simplest possible setup we have two types of object: a work and a creator. A given work may have many creators (authors) and a given creator may have created many works. Furthermore each work and each creator have various attributes. For the purposes of this discussion let us focus on only two:</p>

<ol>
<li>name (creator) and title (work)</li>
<li>date of death (creator) and date of creation (work)</li>
</ol>

<p>If we were to adopt a wiki setup (<a href="http://en.wikipedia.org/wiki/Ludwig_van_Beethoven">a la wikipedia</a>) we would create a web page for each creator and each work. There would be a url pointing to any associated objects, with some kind of human-processable (but likely not machine-processable) indicator of the nature of the link. Attributes would also be included as plain text, perhaps with some simple markup to indicate their nature, perhaps not. The unique identifier for a given object would come in the form of a url.</p>

<p>This is a not unattractive approach as it is very easy to implement &#8212; at least initially &#8212; because wikis for plain text are so well developed (and in fact it is the approach taken by the current v0.1 of public domain works). Problems arise once one goes beyond simple data entry. For example:</p>

<ol>
<li><p>Searching, particularly structured searching (e.g. find all creators who died more than seventy years ago and whose works are more than 100 years old), is slow and cumbersome compared to working with a database. Referential integrity isn&#8217;t enforced and the unique identifiers (url names) aren&#8217;t</p></li>
<li><p>Programmatic insertion and querying of the data is very limited. For example suppose we obtain a library catalogue and wish to merge it into the existing data. To do this we need to query the existing db repeatedly to try and identify matches between existing objects and objects in the catalogue.</p></li>
<li><p>No support for ACID, in particular no way:</p>

<ol><li>To have (and enforce) referential integrity in your data structures [^wiktionaryz]</li>
<li>To do atomic commits which preserve referential integrity (even in   a simple wiki this is a problem in that renaming a page and changing references to it have to be separate operations rather than one atomic commit)</li></ol></li>
<li><p>&#8216;Data loss&#8217;/No data structure: when data structure isn&#8217;t &#8220;enforced&#8221; it may be extremely difficult (or impossible) to extract relevant information (e.g. date of death in the above example). In such circumstances, at least from a programmer&#8217;s point of view, the data is now &#8216;lost&#8217;. It also makes it much harder to enforce data constraints when data is entered or to check data validity once entered.</p></li>
</ol>
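
<p>For contrast, here is a minimal sketch of the structured search from point 1 against a relational store, using Python&#8217;s standard sqlite3 module. The table and column names (creator, work, work_creator, death_year, creation_year), the helper names and the second sample row are assumptions made up for illustration; this is not the public domain works schema.</p>

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with creators, works and a link table."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE creator (
            id INTEGER PRIMARY KEY, name TEXT NOT NULL, death_year INTEGER);
        CREATE TABLE work (
            id INTEGER PRIMARY KEY, title TEXT NOT NULL, creation_year INTEGER);
        CREATE TABLE work_creator (
            work_id INTEGER NOT NULL REFERENCES work(id),
            creator_id INTEGER NOT NULL REFERENCES creator(id),
            PRIMARY KEY (work_id, creator_id));
        """
    )
    conn.executemany(
        "INSERT INTO creator VALUES (?, ?, ?)",
        [(1, "Ludwig van Beethoven", 1827),
         (2, "Example Living Author", None)],  # invented sample row
    )
    conn.executemany(
        "INSERT INTO work VALUES (?, ?, ?)",
        [(1, "Symphony No. 5", 1808), (2, "Example Recent Work", 1990)],
    )
    conn.executemany("INSERT INTO work_creator VALUES (?, ?)", [(1, 1), (2, 2)])
    conn.commit()
    return conn


def old_works(conn, as_of_year):
    """Creators dead for more than 70 years whose works are over 100 years old."""
    rows = conn.execute(
        """
        SELECT DISTINCT c.name, w.title
        FROM creator c
        JOIN work_creator wc ON wc.creator_id = c.id
        JOIN work w ON w.id = wc.work_id
        WHERE ? - c.death_year > 70 AND ? - w.creation_year > 100
        ORDER BY c.name, w.title
        """,
        (as_of_year, as_of_year),
    )
    return rows.fetchall()


if __name__ == "__main__":
    print(old_works(build_demo_db(), 2007))
```

<p>With foreign keys switched on, linking a work to a creator id that does not exist raises sqlite3.IntegrityError, which is the kind of referential integrity a plain wiki page cannot enforce.</p>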

<p>Thus we really want an approach that supports:</p>

<ol>
<li>Versioning at the <em>model</em> level (i.e. not just of individual attributes)</li>
<li>Other data types than plain text</li>
<li>Associated tools:
<ul><li>No off-the-shelf tools that will version</li>
<li>No off-the-shelf tools to do visualization (e.g. showing diffs)[^1]</li>
<li>Web interface to provide for direct editing (and integration of associated tools such as diffs, changelogs etc)</li>
<li>Programmatic API to access data</li></ul></li>
</ol>

<p>The obvious way to proceed with this is to develop &#8216;versioned domain models&#8217;. That is, to develop a traditional software-based or database-backed &#8216;domain model&#8217; which can then be versioned. This would be very similar to the way that subversion first models a filesystem and then adds versioning of that filesystem[^3][^4].</p>
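
<p>As a rough illustration of what a &#8216;versioned domain model&#8217; might look like, the sketch below keeps a whole-model snapshot per revision and checks references before a commit is accepted. The class name VersionedStore, its methods, the &#8216;creators&#8217; attribute and the object ids are all hypothetical, and storing full snapshots is a simplification chosen for brevity; a real system would more likely store deltas.</p>

```python
import copy


class VersionedStore:
    """Toy versioned domain model: each commit yields a new model revision."""

    def __init__(self):
        self._revisions = [{}]  # revision 0 is the empty model

    @property
    def head(self):
        """Number of the latest revision."""
        return len(self._revisions) - 1

    def commit(self, changes):
        """Apply {object_id: attrs or None} atomically; None deletes an object.

        Either every change is stored as one new revision, or (if a
        reference would dangle) nothing is stored at all.
        """
        state = copy.deepcopy(self._revisions[-1])
        for object_id, attrs in changes.items():
            if attrs is None:
                state.pop(object_id, None)
            else:
                state[object_id] = dict(attrs)
        self._check_references(state)  # raises before anything is stored
        self._revisions.append(state)
        return self.head

    def _check_references(self, state):
        """Every id listed under 'creators' must exist in the same revision."""
        for object_id, attrs in state.items():
            for ref in attrs.get("creators", []):
                if ref not in state:
                    raise ValueError(f"{object_id} refers to missing {ref}")

    def get(self, object_id, revision=None):
        """Return a copy of an object as it was at the given revision."""
        rev = self.head if revision is None else revision
        return copy.deepcopy(self._revisions[rev].get(object_id))

    def diff(self, old, new):
        """Map each changed object id to its (before, after) attributes."""
        before, after = self._revisions[old], self._revisions[new]
        changed = {}
        for key in sorted(set(before) | set(after)):
            if before.get(key) != after.get(key):
                changed[key] = (before.get(key), after.get(key))
        return changed


if __name__ == "__main__":
    store = VersionedStore()
    store.commit({
        "creator/beethoven": {"name": "Ludwig van Beethoven", "death_year": 1827},
        "work/symphony-5": {"title": "Symphony No. 5",
                            "creators": ["creator/beethoven"]},
    })
    store.commit({"work/symphony-5": {"title": "Symphony No. 5 in C minor",
                                      "creators": ["creator/beethoven"]}})
    print(store.diff(1, 2))
```

<p>Because the creator and the work go in as one commit, the reference between them is never observable in a broken state, which is the property the wiki approach lacks when a rename and the fix-up of its references are separate operations.</p>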

<h2>Footnotes</h2>

<p>[^wiktionaryz]: The WiktionaryZ project (now renamed OmegaWiki) have been working on integrating referential integrity into a wiki-like interface.</p>

<p>[^1]: there are a bunch of (pre 1.0 AFAICT) tools for doing diffs on xml data. See e.g. <a href="http://www.logilab.org/projects/xmldiff">http://www.logilab.org/projects/xmldiff</a></p>

<p>[^2]: see <a href="http://www.martinfowler.com/ap2/timeNarrative.html">http://www.martinfowler.com/ap2/timeNarrative.html</a> for software patterns for objects that change with time. There is also a <em>very</em> extensive book on the topic of time-oriented db applications in sql by the father of the temporal parts of sql3: <a href="http://www.cs.arizona.edu/people/rts/tdbbook.pdf">http://www.cs.arizona.edu/people/rts/tdbbook.pdf</a></p>

<p>[^3]: the subversion model can best be gleaned from its API. A pythonic version of that API can be seen in: http://www.rufuspollock.org/code/svnrepo/svnrepo.py</p>

<p>[^4]: http://www.musicbrainz.org/ already go some way towards having a versioned domain model in relation to music and its creators.</p>
]]></content:encoded>
			<wfw:commentRSS>http://blog.okfn.org/2007/02/20/collaborative-development-of-data/feed/</wfw:commentRSS>
		</item>
		<item>
		<title>Thinking about Annotation</title>
		<link>http://blog.okfn.org/2007/01/24/thinking-about-annotation/</link>
		<comments>http://blog.okfn.org/2007/01/24/thinking-about-annotation/#comments</comments>
		<pubDate>Wed, 24 Jan 2007 18:06:17 +0000</pubDate>
		<dc:creator>Rufus Pollock</dc:creator>
		
	<category>Open Knowledge</category>
	<category>Musings</category>
	<category>Technical</category>
		<guid isPermaLink="false">http://blog.okfn.org/2007/01/24/thinking-about-annotation/</guid>
		<description><![CDATA[Annotation means the adding of comments/notes/etc to an underlying resource. For the present I&#8217;ll focus on the situation where the underlying resource is textual (as opposed to being an image, or a piece of film or some data). Various things to consider when implementing an annotation/comment system:


Addressing and atomisation: Are annotations specific to particular parts [...]]]></description>
			<content:encoded><![CDATA[<p>Annotation means the adding of comments/notes/etc to an underlying resource. For the present I&#8217;ll focus on the situation where the underlying resource is textual (as opposed to being an image, or a piece of film or some data). Various things to consider when implementing an annotation/comment system:</p>

<ol>
<li><p>Addressing and atomisation: Are annotations specific to particular parts of the resource? If so how do we store this address (relatedly: how is the resource &#8216;atomised&#8217; and how do we address these atoms, or ranges of atoms)? For example, do we address by word, by character, by paragraph or by section? Do we wish to store ranges rather than a single address? Do we wish to allow a given annotation to be associated with multiple ranges/atoms?</p></li>
<li><p>Permissions: Are there restrictions on the creation (deletion/updating etc) of annotations?</p></li>
<li><p>Will the underlying resource change and if so are annotations intended to be robust to those changes?</p></li>
</ol>

<p>Let&#8217;s concentrate on the first issue for the time being as it is the most immediately important. Furthermore, defining the &#8216;atoms&#8217; of the resource sharply narrows the implementation options.</p>

<h3>The Simple Case: Mod a Blog</h3>

<p>If one is happy to have fairly large atoms (pages, or even sections of some piece of text) then implementing an annotation system can be reduced to grabbing your favourite CMS or blogging software and feeding the text in, in appropriate chunks. This is often satisfactory and is a simple, low tech solution that will pretty much work out of the box. A classic example of this approach is <a href="http://www.pepysdiary.com/">http://www.pepysdiary.com/</a> which works so well because the subject matter (Samuel Pepys&#8217;s diary) has a very obvious atomisation (namely the daily diary entries) perfectly suited to blog software (in this case Movable Type).</p>

<p>You can even start doing a bit of modding, for example to present recent annotations (<a href="http://www.pepysdiary.com/recent/">http://www.pepysdiary.com/recent/</a>) or to present the text plus annotations all in one piece. (Given that <a href="http://www.commentonpower.org/">commentonpower</a> seems to fall neatly into this category, with most commentable atoms of the right size for &#8216;blog&#8217; entries, I wonder why they didn&#8217;t just implement it as a plugin for wordpress &#8212; perhaps it was such a simple app that it was easier to &#8216;roll their own&#8217;. <strong>Update</strong>: since this was written we&#8217;ve had a chat with the developers and apparently commentonpower <em>does</em> use wordpress though it isn&#8217;t a plugin).</p>

<h3>Getting More Atomic</h3>

<p>Once you want to have atoms below a size comfortable for individual html pages/blog entries, wish to allow people to comment on chunks too large for an individual page, or to comment on ranges, one starts to have problems with this approach. The main challenge at this point is to find some way to extract the addressing information from the client doing the annotation. Confining ourselves to the web, the challenge becomes how to structure the interface and the text so that one can determine range start and end points. This is a non-trivial matter. Possible options include:</p>

<ul>
<li>Javascript: in theory the selection/range objects should help us out here; unfortunately cross-browser support is patchy (firefox as usual is excellent and IE pretty bad). If one does not want to be so precise as to get ranges, javascript could also be used to extract e.g. element ids.</li>
<li>Copy and paste of the quote to annotate with some backend algorithm to determine the actual range. Nice and simple but not clear that one can &#8216;invert&#8217; (i.e. find a unique range from a given selection) unless the selection is large.</li>
<li>If addressing fairly large atoms (e.g. a paragraph or larger) one could just insert a unique piece of user interface equipment (e.g. a button or link) with each atom. Note however that this prevents support for ranges.</li>
</ul>
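
<p>The copy-and-paste option can be sketched in a few lines of Python. This is only an illustration of the &#8216;inversion&#8217; problem: the function names are made up, matching is exact and case-sensitive, and a real backend would probably need to normalise whitespace first.</p>

```python
def find_ranges(text, quote):
    """Return every (start, end) character range where quote occurs in text."""
    if not quote:
        return []
    ranges = []
    start = text.find(quote)
    while start != -1:
        ranges.append((start, start + len(quote)))
        # Step forward by one so overlapping occurrences are found as well.
        start = text.find(quote, start + 1)
    return ranges


def resolve_quote(text, quote):
    """Return the single range for quote, or None if absent or ambiguous."""
    ranges = find_ranges(text, quote)
    return ranges[0] if len(ranges) == 1 else None


if __name__ == "__main__":
    line = "to be or not to be, that is the question"
    print(resolve_quote(line, "to be"))      # ambiguous: occurs twice
    print(resolve_quote(line, "not to be"))  # long enough to be unique
```

<p>A short selection such as &#8216;to be&#8217; matches twice and so cannot be inverted, whereas the longer &#8216;not to be&#8217; pins down exactly one range.</p>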

<h3>Separating Data and Presentation</h3>

<p>Whatever one chooses to do it does seem sensible to clearly separate data and presentation. This is particularly important when there is so much uncertainty over the user interface. In particular, it would be good to clearly specify the annotation format and implement a programmatic interface to it independent of the standard (human) user interface. That way it is easy to switch interfaces (or have multiple ones). Given that an annotation is essentially just a comment it would seem sensible to try and reuse an existing format such as Atom (or RSS) for the machine interface to the comment store. [marginalia] already had such a format based on atom. I&#8217;ve recently reimplemented a stripped down version of this format for the annotation store backend in python in preparation for adding annotation support to the openshakespeare web interface, see:</p>

<p><a href="http://project.knowledgeforge.net/shakespeare/svn/annotater/trunk/">http://project.knowledgeforge.net/shakespeare/svn/annotater/trunk/</a></p>

<p>Of course as discussed above this isn&#8217;t quite as simple as it looks as your user interface can constrain what you can and can&#8217;t store (using a blog approach you can&#8217;t store ranges and from what I have read getting reliable character offsets is problematic). Nevertheless it seems the best place to start.</p>
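
<p>To give a feel for what separating the annotation data from the interface could mean in practice, here is a minimal sketch of a store with an Atom-flavoured serialisation. It is not the marginalia format nor the annotater code linked above: the class and function names, the extension namespace and the &#8216;range&#8217; element are assumptions invented for illustration, and the entry omits elements (such as &#8216;updated&#8217;) that a valid Atom entry requires.</p>

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

ATOM = "http://www.w3.org/2005/Atom"
# Hypothetical extension namespace for range information (an assumption).
EXT = "http://example.org/annotation-sketch"


@dataclass
class Annotation:
    id: str
    resource: str  # URL of the annotated text
    body: str      # the comment itself
    ranges: list = field(default_factory=list)  # (start, end) character offsets


class AnnotationStore:
    """In-memory store that any user interface could sit on top of."""

    def __init__(self):
        self._items = {}

    def add(self, annotation):
        self._items[annotation.id] = annotation

    def for_resource(self, resource):
        """Return all annotations attached to the given resource URL."""
        return [a for a in self._items.values() if a.resource == resource]


def to_atom_entry(annotation):
    """Build a partial Atom entry element describing one annotation."""
    entry = ET.Element(f"{{{ATOM}}}entry")
    ET.SubElement(entry, f"{{{ATOM}}}id").text = annotation.id
    ET.SubElement(entry, f"{{{ATOM}}}title").text = "Annotation"
    content = ET.SubElement(entry, f"{{{ATOM}}}content", {"type": "text"})
    content.text = annotation.body
    ET.SubElement(entry, f"{{{ATOM}}}link",
                  {"rel": "related", "href": annotation.resource})
    for start, end in annotation.ranges:
        ET.SubElement(entry, f"{{{EXT}}}range",
                      {"start": str(start), "end": str(end)})
    return entry


if __name__ == "__main__":
    store = AnnotationStore()
    store.add(Annotation("urn:example:1", "http://example.org/text/1",
                         "A famous line", [(9, 18)]))
    for item in store.for_resource("http://example.org/text/1"):
        print(ET.tostring(to_atom_entry(item), encoding="unicode"))
```

<p>A blog-style interface would simply leave the ranges empty, while a javascript-based one could fill them in, without the store or the format having to change.</p>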
]]></content:encoded>
			<wfw:commentRSS>http://blog.okfn.org/2007/01/24/thinking-about-annotation/feed/</wfw:commentRSS>
		</item>
		<item>
		<title>Storing and Visualizing Open Data</title>
		<link>http://blog.okfn.org/2006/06/07/storing-and-visualizing-open-data/</link>
		<comments>http://blog.okfn.org/2006/06/07/storing-and-visualizing-open-data/#comments</comments>
		<pubDate>Wed, 07 Jun 2006 16:21:03 +0000</pubDate>
		<dc:creator>Rufus Pollock</dc:creator>
		
	<category>News</category>
	<category>Open Knowledge</category>
	<category>Technical</category>
	<category>Open Data</category>
		<guid isPermaLink="false">http://blog.okfn.org/2007/06/07/storing-and-visualizing-open-data/</guid>
		<description><![CDATA[The basic purpose of the Open Knowledge Foundation is to &#8216;promote open knowledge&#8217;. In particular we want to:


Get data out there &#8212; that&#8217;s why we&#8217;re developing KnowledgeForge
Make sure that data is open data (i.e. is properly licensed) &#8212; that&#8217;s why we&#8217;re developing open knowledge definition
Make sure that data can be found &#8212; that&#8217;s why we&#8217;re [...]]]></description>
			<content:encoded><![CDATA[<p>The basic purpose of the Open Knowledge Foundation is to &#8216;promote open knowledge&#8217;. In particular we want to:</p>

<ul>
<li>Get data out there &#8212; that&#8217;s why we&#8217;re developing <a href="http://www.knowledgeforge.net">KnowledgeForge</a></li>
<li>Make sure that data is open data (i.e. is properly licensed) &#8212; that&#8217;s why we&#8217;re developing <a href="http://okd.okfn.org/">open knowledge definition</a></li>
<li>Make sure that data can be found &#8212; that&#8217;s why we&#8217;re developing <a href="http://www.ckan.net/">CKAN</a></li>
</ul>

<p>But as <a href="http://www.publicwhip.org/">Francis Irving</a> recently pointed out to me, you need to give people a reason to put data out there in a form that is discoverable. In particular, he suggested that giving people tools to do something with their data, such as visualize it, would be one of the best ways to encourage data sharing. To this end we&#8217;ve been putting together a very simple demonstration of a data store with some visualization capabilities at:</p>

<p><strike><a href="http://econ.dev.okfn.org/store/">http://econ.dev.okfn.org/store/</a></strike> <a href="http://www.openeconomics.net/store/">http://www.openeconomics.net/store/</a></p>

<p>The data itself (along with any associated metadata) is stored in a subversion repository here:</p>

<p><a href="http://project.knowledgeforge.net/econ/svn/trunk/data/">http://project.knowledgeforge.net/econ/svn/trunk/data/</a></p>

<p>At present to add data to the store you just need to upload it to subversion but in future we hope to implement a web interface to do this.</p>
]]></content:encoded>
			<wfw:commentRSS>http://blog.okfn.org/2006/06/07/storing-and-visualizing-open-data/feed/</wfw:commentRSS>
		</item>
	</channel>
</rss>
