We Need Distributed Revision/Version Control for Data

In the open data community, we need tools for distributed revision/version control of data, like the ones that already exist for code.

(Don’t know what I mean by revision control or distributed revision control? Read this)

Distributed revision control systems for code, like Mercurial and Git, have had a massive impact on software development, especially in the F/OSS community: the distributed methodology works particularly well with open material.

The same would be true for data. Revision control, and specifically distributed revision control, would support (cf this and this earlier post):

Incremental development: “patches”, changelogs, etc.
Provenance tracking: showing who did what, and when, is built into a revisioning system.
Broader participation: you don’t have to worry (as much) about who you let in, because changes can be reverted. It’s also easier to get involved because you can have your own independent copy to play around with (Distributed).
Easier collaboration: updates don’t mean making a full copy (and applying updates is automatic), and you can see who is making changes, when, etc.
Peer-to-peer model: different contributors can work simultaneously and independently (Distributed). Extra “features” can be added independently of mainline development, with re-integration later (Distributed).

Because this is all a bit abstract it is worth giving a concrete example of why “distributed” revision control could be so useful.

Example

Imagine wikis on two related topics, say water sanitation technology and building construction technology for the developing world (alternatively, just think of the first wiki and Wikipedia). It is likely that some pages overlap significantly while many others don’t overlap at all. At the moment, for these projects to reuse information, their only options are:

1. Copy the article from one wiki into the other
2. Or standardize on one wiki as the authoritative wiki for common content

Both have serious problems. For (1), the page rapidly goes out of date and you’ve forked the resource, reducing the value of effort on each copy. For (2), people on the wiki that does not hold the content have to go off to another wiki to edit (a disruptive experience), the material is not embedded within its relevant context, and it is harder to adapt the material for each specific site. Furthermore, and in part because of these issues, (2) is socially hard, as it likely involves one wiki/community (whoever owns the “common” content) coming to dominate the other.

However, in a world where things are distributed there is a completely different option: each wiki could have its own copy but be able to push and pull changes from the other wiki, with changes being merged. This allows collaborative activity to continue but in a relatively independent way, and solves the big social issue of who’s in charge (no-one is!).
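To make “pull and merge” a little less abstract, here is a minimal sketch of a three-way merge at page granularity. It assumes each wiki can be treated as a simple mapping from page title to text and that both wikis remember the last version they had in common (`base`). The function name `merge_pages` and the page titles are hypothetical, and a real system would merge within pages rather than whole pages.

```python
def merge_pages(base, ours, theirs):
    """Three-way merge of two wiki copies, one whole page at a time.

    base   -- pages as they were when the two wikis last agreed
    ours   -- our current pages
    theirs -- the other wiki's current pages
    Returns (merged_pages, titles_in_conflict).
    """
    merged, conflicts = {}, []
    for title in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(title), ours.get(title), theirs.get(title)
        if o == t:
            result = o              # both sides agree (or both deleted it)
        elif o == b:
            result = t              # only the other wiki changed this page
        elif t == b:
            result = o              # only we changed this page
        else:
            conflicts.append(title) # both changed it differently
            result = o              # keep ours and flag it for a human
        if result is not None:
            merged[title] = result
    return merged, conflicts


base = {"Latrines": "v1", "Rainwater": "v1"}
sanitation = {"Latrines": "v2 (sanitation edit)", "Rainwater": "v1"}
construction = {"Latrines": "v1", "Rainwater": "v1", "Roofing": "new page"}

merged, conflicts = merge_pages(base, sanitation, construction)
print(merged)
print(conflicts)
```

Here the sanitation wiki keeps its own edit to “Latrines” and picks up the new “Roofing” page, without either wiki having to be the master copy.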

The key takeaway is that a piece of technology (distributed version control) alters the social processes of collaboration, thereby radically reducing the barriers to effective collaboration. And remember, social stuff is both a) hard and b) important.

Implementation (or why this is not trivial)

Two key features are involved, neither of which is much in evidence in (open) data at present:

Data versioning/revisioning — the creation of “changesets”
Transmission and management of associated changesets between multiple peer nodes.
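As a rough illustration of both features together, the sketch below treats a changeset as a small record identified by a hash of its contents (loosely in the spirit of Mercurial and Git), and lets peer nodes pull whatever changesets they are missing from one another. The `Node` class, the `make_changeset` helper, and the row data are all hypothetical, and the format of `changes` is deliberately left open.

```python
import hashlib
import json


def make_changeset(parent_id, changes, author):
    """Build a changeset and derive its id from its own content."""
    payload = {"parent": parent_id, "changes": changes, "author": author}
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha1(encoded).hexdigest(), payload


class Node:
    """A peer holding its own full copy of the changeset history."""

    def __init__(self, name):
        self.name = name
        self.changesets = {}  # changeset id -> payload

    def commit(self, parent_id, changes):
        cs_id, payload = make_changeset(parent_id, changes, self.name)
        self.changesets[cs_id] = payload
        return cs_id

    def pull(self, other):
        """Copy over every changeset the other node has and we lack."""
        missing = {
            cs_id: cs
            for cs_id, cs in other.changesets.items()
            if cs_id not in self.changesets
        }
        self.changesets.update(missing)
        return sorted(missing)


alice, bob = Node("alice"), Node("bob")
root = alice.commit(None, {"row-1": {"name": "Well A"}})
bob.pull(alice)

# Both peers now work independently on top of the same parent.
bob_change = bob.commit(root, {"row-1": {"depth_m": 30}})
alice_change = alice.commit(root, {"row-2": {"name": "Well B"}})

pulled = alice.pull(bob)
bob.pull(alice)
print(pulled == [bob_change])
print(set(alice.changesets) == set(bob.changesets))
```

After the exchange both nodes hold the same three changesets, two of which share a parent. Reconciling those two is the merge step, and that depends on the structure of the data.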

It is the P2P nature of this model (as opposed to the classic client-server approach) that leads to it being termed “Distributed Revision Control”. Given the existence of distributed revision control for code, one might hope that we could just reuse those technologies for data. Unfortunately it is not that simple:

The key aspect of developing a revision control system (distributed or not) is working out the diff and changeset format. This has not been done for data.
- Diffing and revision control for code work because code can be treated as (structured) text, where a line-based (or, occasionally, character-based) approach makes sense. For data it usually doesn’t:
- Consider a hacky way to version a relational database using traditional text revisioning tools:
  1. dump the database to SQL
  2. revision that dump using standard code tools.
  The impact of renaming a column or table in this scenario is that hundreds or maybe thousands of lines in the dump would change (depending on how the inserts were set up). Furthermore, the diff format for the SQL dump provides no easy way to apply changes to the live database. In essence, the diff has given you nothing over just taking snapshots. What is required here is some way to describe changes to a relational database in its own terms (there are plenty, btw; this is just illustrating that simple text diffs don’t work well …)
Unlike for code, we probably have to talk about “what kind of data”. This is because the diff format we use to build “changesets” will depend on the structure of the data.
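The column-rename problem is easy to reproduce. The sketch below fakes a SQL dump with one INSERT per row (the `places` table, its columns, and the `sql_dump` helper are made up for illustration), renames a column, and counts the lines a standard text diff reports. It then shows the same change as a single structured operation, whose format is equally hypothetical.

```python
import difflib


def sql_dump(table, columns, rows):
    """Render rows as one INSERT statement per line, as a dump might."""
    column_list = ", ".join(columns)
    return [f"INSERT INTO {table} ({column_list}) VALUES {row!r};" for row in rows]


# Hypothetical table with 1000 rows.
rows = [(i, f"place-{i}") for i in range(1000)]
before = sql_dump("places", ["id", "name"], rows)
after = sql_dump("places", ["id", "title"], rows)  # same data, column renamed

# Count added/removed lines in the text diff, skipping the file headers.
text_diff_lines = [
    line
    for line in difflib.unified_diff(before, after, lineterm="")
    if line[:1] in ("+", "-") and not line.startswith(("+++", "---"))
]

# The same change described in the database's own terms is one operation.
structural_change = {
    "op": "rename_column",
    "table": "places",
    "from": "name",
    "to": "title",
}

print(len(text_diff_lines), "changed lines in the text diff")
print(structural_change)
```

Every row appears once as removed and once as added, so the text diff grows with the size of the table even though no data changed.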

However, once we have diff (and merge) figured out for a given type of data, we can directly reuse most of the ideas (and maybe even code) from the frameworks used for software code. To put it briefly: it’s the diffing and merging that’s (relatively) hard; the rest we can copy!

Colophon

We have already made an attempt to implement distributed revision control ourselves for the specific case of the data stored in CKAN instances like .

Our approach was based heavily on the Mercurial/Git conceptual model and used as its data structure the natural one implied by the domain model (~ database rows but not quite): in essence, we dump each field to JSON and then do diffs on the JSON.
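For a feel of what “dump each field to JSON and diff the JSON” might look like, here is a small sketch. It is not the CKAN code. The record fields, `field_diff`, and `apply_field_diff` are all assumptions made for illustration, and each field is simply compared as a whole JSON string.

```python
import json


def field_diff(old, new):
    """Compare two records field by field via their JSON serialisation."""
    diff = {}
    for name in sorted(set(old) | set(new)):
        before = json.dumps(old[name], sort_keys=True) if name in old else None
        after = json.dumps(new[name], sort_keys=True) if name in new else None
        if before != after:
            diff[name] = {"old": before, "new": after}
    return diff


def apply_field_diff(record, diff):
    """Return a copy of record with the field-level changes applied."""
    patched = dict(record)
    for name, change in diff.items():
        if change["new"] is None:
            patched.pop(name, None)   # field was removed
        else:
            patched[name] = json.loads(change["new"])
    return patched


# Hypothetical dataset record, before and after an edit.
v1 = {"name": "gdp-data", "title": "GDP", "tags": ["economics"]}
v2 = {
    "name": "gdp-data",
    "title": "World GDP",
    "tags": ["economics", "gdp"],
    "license": "odc-pddl",
}

changes = field_diff(v1, v2)
print(json.dumps(changes, indent=2))
print(apply_field_diff(v1, changes) == v2)
```

Because the diff records both the old and new value of each field, it can be applied to another copy of the record, which is exactly what the text diff of a SQL dump could not offer.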

If you’re interested in finding out more here’s the code. Big kudos here to CKAN developer John Bywater who actually did almost all the work of getting this from concept to running code.

7 Comments

Rufus Pollock says:

July 17, 2010 at 14:26

@Jo: the question of data aggregates is an important one but I’m not entirely sure what you mean by it. Do you mean that people create changesets against an aggregate dataset (with all the issues about how that is persisted down to the underlying dataset or how the changeset is applied when the aggregate is refreshed) or are you thinking of the issue of derived datasets?

In either case, the problems aren’t easy though I do not know how much harder they are than in code. In code areas you will have software that ‘aggregates’ underlying libraries. There the approach so far seems mainly based on the use of versioning and the specification of the version in dependencies.

Rufus Pollock says:

July 17, 2010 at 14:23

@John: not sure why you need to trust everyone in a distributed setup. The whole point there is that anyone can make changes to their copy of the data but who I choose to pull changes from is up to me.

I’m also concerned that the wiki example I gave (which was trying to make things more non-technical) may have misled people. Even in that model, the point was that it would be up to each group if and when they pulled changes from another wiki.

John Griessen says:

July 13, 2010 at 23:27

http://south.aeracode.org/docs/about.html

is a description of an open source tool used to migrate Python Django databases independently of 5 different database back ends.

That FOSS project could be mined for some of your goal’s code.

John

John Griessen says:

July 13, 2010 at 16:21

Data language translation and developing compression for every generalizable type of data are first steps for your goal.

John Griessen says:

July 13, 2010 at 16:18

As Jo Walsh says above, without trusting who you are doing the version control with, it’s no use.

So the distributed part of your wish is going to be limited to trustworthy members of groups, not a Wikipedia type of collaboration.

JSON may be handy, but may not be enough to deal well with the volume of what you are wishing for… the data in database driven sites…

from Wikipedia: “JSON parsing must ironically be accomplished on a character-by-character basis. Additionally, the standard has no provision for data compression, interning of strings, or object references.”

For every data type you want to version control, the compression you want to always have in a distributed system will probably have to come from a translator tool for the specific data language into a “compressed lowest common denominator form”, a form that is lossless and translates back or to another language completely.

Data language translation and developing compression for every generalizable type of data are first steps for your goal.

John

Jo Walsh says:

July 13, 2010 at 07:12

A problem that we can’t avoid having:

http://mappinghacks.com/2007/04/19/the-openstreetmap-new-data-model-army/

The problem unforeseen here, and in this post, is generalised data (aggregates, collections, sliced up from the original sources).

As Ben Goldacre catchily, recently put it – “open data is sometimes no use unless we also have open methods.”