Collaborative Development of Data

$ This version: 2007-02-15 (First version 2006-05-24) $

We already have some fairly good working processes for collaborative development of unstructured text: the two most prominent examples being source code of computer programs and wikis for general purpose content (encyclopedias etc). However these tools perform poorly (or not at all) when we come to structured data.

The purpose of this short essay is to pose the question: how do we collaboratively develop knowledge when it is in the form of structured data (as opposed to unstructured text)?

There are two aspects of structured data that distinguish it from plain text:

Referential integrity (objects point to other objects)
Labelling to enable machine processing (and the addition of ‘semantics’)

To illustrate what I mean consider the following use case which comes from our own public domain works project. Here we are storing data about cultural works. In the simplest possible setup we have two types of object: a work and a creator. A given work may have many creators (authors) and a given creator may have created many works. Furthermore each work and each creator have various attributes. For the purposes of this discussion let us focus on only two:

name (creator) and title (work)
date of death (creator) and date of creation (work)
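As a minimal sketch of this two-object setup, the model might look roughly as follows in Python. The class and field names (`Creator`, `Work`, `death_year`, `creation_year`, `link`) are illustrative assumptions rather than the schema actually used by the public domain works project, and dates are reduced to years for simplicity.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# eq=False keeps identity-based comparison, which avoids infinite
# recursion when creators and works refer to each other.
@dataclass(eq=False)
class Creator:
    name: str
    death_year: Optional[int] = None  # None if unknown
    works: List["Work"] = field(default_factory=list, repr=False)


@dataclass(eq=False)
class Work:
    title: str
    creation_year: Optional[int] = None  # None if unknown
    creators: List[Creator] = field(default_factory=list, repr=False)


def link(work: Work, creator: Creator) -> None:
    """Record the many-to-many relation in both directions, once."""
    if creator not in work.creators:
        work.creators.append(creator)
    if work not in creator.works:
        creator.works.append(work)


# Hypothetical sample data, for illustration only.
author = Creator("A. N. Author", death_year=1900)
example = Work("An Example Work", creation_year=1880)
link(example, author)
```

Even in this tiny form the two features noted above are visible: the objects point at each other, and every attribute carries a label that a program can rely on.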

If we were to adopt a wiki setup (à la Wikipedia) we would create a web page for each creator and each work. Each page would contain URLs pointing to any associated objects, with some kind of human-processable (but likely not machine-processable) indicator of the nature of the link. Attributes would also be included as plain text, perhaps with some simple markup to indicate their nature, perhaps not. The unique identifier for a given object would come in the form of a URL.

This is a not unattractive approach as it is very easy to implement, at least initially, because wikis for plain text are so well developed (and in fact it is the approach taken by the current v0.1 of public domain works). The problems arise once one goes beyond simple data entry. For example:

Searching, particularly structured searching (e.g. find all creators who died more than seventy years ago and whose works are more than 100 years old), is slow and cumbersome compared to working with a database. Referential integrity isn’t enforced and the unique identifiers (url names) aren’t
Programmatic insertion and querying of the data is very limited. For example suppose we obtain a library catalogue and wish to merge it into the existing data. To do this we need to query the existing db repeatedly to try and identify matches between existing objects and objects in the catalogue.
No support for ACID, in particular no way:
1. To have (and enforce) referential integrity in your data structures ¹
2. To do atomic commits which preserve referential integrity (even in a simple wiki this is a problem in that renaming a page and changing references to it have to be separate operations rather than one atomic commit)
‘Data loss’/No data structure: when data structure isn’t “enforced” it may be extremely difficult (or impossible) to extract relevant information (e.g. date of death in the above example). In such circumstances, at least from a programmer’s point of view, the data is now ‘lost’. It also makes it much harder to enforce data constraints when data is entered or to check data validity once entered.
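For contrast, here is a rough sketch of what the database side of this comparison offers, using Python's standard `sqlite3` module. The table and column names, the sample rows, and the helper functions `public_domain_candidates` and `try_dangling_link` are all assumptions made for illustration, and dates are reduced to years.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce these by default
conn.executescript("""
CREATE TABLE creator (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    death_year INTEGER
);
CREATE TABLE work (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    creation_year INTEGER
);
CREATE TABLE work_creator (
    work_id INTEGER NOT NULL REFERENCES work(id),
    creator_id INTEGER NOT NULL REFERENCES creator(id),
    PRIMARY KEY (work_id, creator_id)
);
""")

# Hypothetical sample data; "with conn" commits on success, rolls back on error.
with conn:
    conn.execute("INSERT INTO creator VALUES (1, 'A. N. Author', 1900)")
    conn.execute("INSERT INTO work VALUES (1, 'An Example Work', 1880)")
    conn.execute("INSERT INTO work_creator VALUES (1, 1)")


def public_domain_candidates(conn, current_year):
    """Creators dead more than 70 years whose works are over 100 years old."""
    rows = conn.execute(
        """
        SELECT creator.name, work.title
        FROM creator
        JOIN work_creator ON work_creator.creator_id = creator.id
        JOIN work ON work.id = work_creator.work_id
        WHERE creator.death_year < ? - 70
          AND work.creation_year < ? - 100
        ORDER BY creator.name, work.title
        """,
        (current_year, current_year),
    )
    return rows.fetchall()


def try_dangling_link(conn):
    """Try to add a work that points at a creator which does not exist."""
    try:
        with conn:
            conn.execute("INSERT INTO work VALUES (2, 'Another Work', 1850)")
            conn.execute("INSERT INTO work_creator VALUES (2, 99)")  # no creator 99
    except sqlite3.IntegrityError:
        return False  # the whole transaction, including work 2, was rolled back
    return True
```

The structured search becomes a single query, the dangling reference is rejected, and the failed transaction leaves no half-inserted work behind. What a plain database like this lacks, of course, is the history and open editing that make the wiki attractive in the first place.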

Thus we really want an approach that supports:

Versioning at the model level (i.e. not just of individual attributes)
Other data types than plain text
Associated tools (few of which exist off the shelf today):
- Versioning tools (no off-the-shelf tools will version structured data)
- Visualization tools, e.g. for showing diffs (again none off the shelf)²
- Web interface to provide for direct editing (and integration of associated tools such as diffs, changelogs etc)
- Programmatic API to access data

The obvious way to proceed with this is to develop ‘versioned domain models’. That is, to develop a traditional software-based or database-backed ‘domain model’ which can then be versioned. This would be very similar to the way that Subversion first models a filesystem and then adds versioning of that filesystem³⁴.
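To make the idea a little more concrete, here is a deliberately naive sketch of a versioned domain model in Python. Every commit produces a new numbered revision of the whole set of objects, loosely in the spirit of Subversion's repository-wide revision numbers. The `VersionedStore` class, its methods, the object ids and the `creators` reference attribute are all hypothetical, and storing a full snapshot per revision is an assumption made for brevity, not a description of how Subversion or any existing tool works.

```python
import copy
from typing import Any, Dict, List, Optional, Tuple

Snapshot = Dict[str, Dict[str, Any]]  # object id -> attributes


class VersionedStore:
    """Keeps every revision of a whole set of objects, not of single pages."""

    def __init__(self) -> None:
        self._revisions: List[Snapshot] = [{}]  # revision 0 is empty
        self._log: List[str] = ["initial empty revision"]

    @property
    def head(self) -> int:
        return len(self._revisions) - 1

    def commit(self, changes: Dict[str, Optional[Dict[str, Any]]], message: str) -> int:
        """Apply all changes as one new revision; a value of None deletes."""
        new = copy.deepcopy(self._revisions[-1])
        for obj_id, attrs in changes.items():
            if attrs is None:
                new.pop(obj_id, None)
            else:
                new[obj_id] = copy.deepcopy(attrs)
        self._check_references(new)  # raises before anything is stored
        self._revisions.append(new)
        self._log.append(message)
        return self.head

    @staticmethod
    def _check_references(snapshot: Snapshot) -> None:
        for obj_id, attrs in snapshot.items():
            for ref in attrs.get("creators", []):
                if ref not in snapshot:
                    raise ValueError(f"{obj_id} refers to missing object {ref}")

    def get(self, obj_id: str, revision: Optional[int] = None) -> Dict[str, Any]:
        rev = self.head if revision is None else revision
        return copy.deepcopy(self._revisions[rev][obj_id])

    def diff(self, old: int, new: int) -> Dict[str, Tuple[Any, Any]]:
        """Map each changed object id to its (old, new) attributes."""
        a, b = self._revisions[old], self._revisions[new]
        return {
            key: (a.get(key), b.get(key))
            for key in sorted(set(a) | set(b))
            if a.get(key) != b.get(key)
        }


store = VersionedStore()
r1 = store.commit(
    {
        "creator/1": {"name": "A. N. Author", "death_year": 1900},
        "work/1": {"title": "An Example Work", "creators": ["creator/1"]},
    },
    "add a creator and a work together",
)
# Renaming an object and updating the reference to it is one atomic commit.
r2 = store.commit(
    {
        "creator/1": None,
        "creator/a-n-author": {"name": "A. N. Author", "death_year": 1900},
        "work/1": {"title": "An Example Work", "creators": ["creator/a-n-author"]},
    },
    "rename creator and update the reference",
)
```

Here the rename that needs two separate operations in a wiki is a single revision, old revisions stay readable via `store.get("work/1", r1)`, and `store.diff(r1, r2)` reports changes at the level of objects rather than lines of text.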

Footnotes

1. The WiktionaryZ project (now renamed OmegaWiki) has been working on integrating referential integrity into a wiki-like interface.
2. There are a bunch of (pre-1.0 AFAICT) tools for doing diffs on XML data. See e.g.
3. The Subversion model can best be gleaned from its API. A Pythonic version of that API can be seen in: http://www.rufuspollock.org/code/svnrepo/svnrepo.py
4. MusicBrainz (http://www.musicbrainz.org/) already goes some way towards having a versioned domain model in relation to music and its creators.