In the open data community, we need tools for doing distributed revision/version control for data like the one’s that already exist for code.
(Don’t know what I mean by revision control or distributed revision control? Read this)
Distributed revision control systems for code, like mercurial and git, have had a massive impact on software development, and especially so in the F/OSS community — the distributed methodology works particularly well with open material.
- Incremental development: “patches”, changelogs etc
- Provenance tracking: showing who did what, when is built in to a revisioning system
- Broader participation: you don’t have to worry (as much) about who you let in because changes can be reverted. It’s also easier to get involved because you can have your own independent copy to play around with (Distributed).
- Easier collaboration: updates don’t mean making a full copy (and applying updates is automatic), you can see who is making changes, when etc etc
- Peer-2-peer model: different contributors can work simultaneously and independently (Distributed). Extra “features” can added independently of mainline development with re-integration later (Distributed).
Because this is all a bit abstract it is worth giving a concrete example of why “distributed” revision control could be so useful.
Imagine wikis on two related topics, say water sanitation technology and building construction technology for the developing world (alternatively just think of the first wiki and wikipedia). It is likely there are some significant overlaps in the wiki pages but also many pages that don’t overlap. At the moment, for these projects to reuse information their only option is:
- Copy the article from one wiki into the other
- (OR) Standardize on one wiki as the authoratative wiki for common content
These both have serious problems. For (1), the page goes out of date rapidly and you’ve forked the resource reducing the value of effort on each. For (2), for the wiki that does not have the content, people have to go off from one wiki to edit in another (disruptive experience), the material is not embedded within its relevant context and it is harder to adapt the material for each specific site. Furtheremore, and in part because of these issues, (2) is socially hard as it likely involves one wiki/community coming to dominate the other (whoever owns the “common” content).
However, in a world where things are distributed there a completely different option: each wiki could have its own copy but be able to push and pull changes from the other wiki with changes being merged. This allows for collaborative activity to continue but in a relatively independent way and solves the big social issue of who’s in charge (no-one is!).
The key take away from this is that a piece of technology (distributed version control) alters the social processes of collaboration thereby radically reducing the barriers to effective collaboration. And remember, social stuff is both a) hard and b) important.
Implementation (or why this is not trivial)
Two key features are involved, neither of which are much in evidence in the (open) data at present:
- Data versioning/revisioning — the creation of “changesets”
- Transmission and management of associated changesets between multiple peer nodes.
It is the P2P nature of this model (as opposed to classic server-client approach) that leads to it being termed: “Distributed Revision Control”. Given the existence of distributed revision control for code one might hope that we could just reuse those technologies for data. Unfortunately it is not that simple:
- The key aspect to developing a revision control (distributed or not) is to work out the diff and changeset format. This has not been done for data.
- Diffing and revision control for code works because code can be considered as (structured) text where a line-based-approach (or, occasionally character-based-approach) to code makes sense. For data it usually doesn’t make sense:
- Consider a hacky way to version a relational database using traditional text revisioning tools:
- dump the database to sql
2 . revision the dump that using standard code tools.
Tthe impact of renaming a column or table in this scenario is that hundreds or maybe thousands of line in the dump would change (depending on how inserts were set up). Furthermore the diff format for the sql dump provides no easy way to apply changes to the live database — in essence, the diff has given you nothing over just taking snapshots. What is required here is some way to describe changes to a relational database in its terms (there are plenty btw this is just illustrating that simple text diffs don’t work well …)
- dump the database to sql
- Unlike for code we probably have to talk about “what kind of data”. This is because the diff format we use to build “changesets” will depend on the structure of the data.
However, once you have diff (and merge) figured out for a given type of data we can directly reuse most of the ideas (and maybe even code) from frameworks used for software code. To put it briefly: it’s the diffing and merging that’s (relatively) hard — the rest we can copy!
We have already made an attempt to implement distributed revision control ourselves for the specific case of the data stored in CKAN instances like .
Our approach was based heavily on the mercurial/git conceptual model and used as data structure the natural one implied by the domain model (~ database rows but not quite) — in essence we dump to json for each field and then do diffs on the json.
If you’re interested in finding out more here’s the code. Big kudos here to CKAN developer John Bywater who actually did almost all the work of getting this from concept to running code.