The ability to do “version control” for data is a big deal. There are various options, but one of the most attractive is to reuse the existing tools for doing this with code, like **git** and **mercurial**. This post describes a simple “data pattern” for storing and versioning data with those tools, one we’ve been using for some time and have found very effective.
The ability to revision and version data – to store changes and share them with others, especially in a distributed way – would be a huge benefit to the (open) data community. I’ve discussed why at [some length before][versioning] (see also [this earlier post]) but to summarize:
* It allows effective distributed collaboration – you can take my dataset, make changes, and share those back with me (and different people can do this at once!)
* It allows one to track provenance better (i.e. what changes came from where)
* It allows for sharing updates and synchronizing datasets in a simple, effective way – e.g. an automated way to get the last month’s GDP or employment data without pulling the whole file again
There are several ways to address the “revision control for data” problem. The approach here is to get data into a form that lets us take the existing, powerful distributed version control systems designed for code – like **git** and **mercurial** – and apply them to the data. As such, the best [github for data][clay] may, in fact, be [github](https://github.com/) (of course, you may want to layer data-specific interfaces on top of git(hub) – this is what we do with [data.okfn.org](http://data.okfn.org/), described below).
There are limitations to this approach and I discuss some of these and alternative models below. In particular, it’s best for “[small (or even micro) data][small data def]” – say, under 10Mb or 100k rows. (One alternative model can be found in the very interesting [Dat project][Dat] recently started by Max Ogden — with whom I’ve [talked many times][rp] on this topic).
However, given the maturity and power of the tooling – and its likely evolution – and the fact that [so much data *is* small][small data] we think this approach is very attractive.
[small data]: http://blog.okfn.org/2013/04/22/forget-big-data-small-data-is-the-real-revolution/
[small data def]: http://blog.okfn.org/2013/04/26/what-do-we-mean-by-small-data/
## The Pattern
The essence of the pattern is:
> 1. **Store data as *line-oriented text*, and specifically as CSV[^csv] (comma-separated values) files**. “Line-oriented text” just means that individual units of the data, such as a row of a table (or an individual cell), correspond to one line[^line].
> 2. **Use best-of-breed (code) versioning tools like git or mercurial to store and manage the data**.
Line-oriented text is important because it enables the powerful distributed version control tools like **git** and **mercurial** to work effectively (this, in turn, is because those tools are built for code which is (usually) line-oriented text). It’s not just version control though: there is a **large and mature set of tools** for managing and manipulating these types of files (from grep to Excel!).
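To make the point concrete, here is a minimal sketch using Python’s standard-library `difflib` (the same line-oriented diff model git uses). The tiny CSV and its values are made up for illustration; editing one cell of one row produces a diff that touches only that one line:

```python
import difflib

# Two versions of a tiny, made-up CSV: v2 changes one cell in one row.
v1 = "id,country,gdp\n1,DE,3.4\n2,FR,2.6\n3,UK,2.5\n"
v2 = "id,country,gdp\n1,DE,3.4\n2,FR,2.7\n3,UK,2.5\n"

diff = difflib.unified_diff(v1.splitlines(), v2.splitlines(), lineterm="")
# Keep only the added/removed lines, skipping the "---"/"+++" file headers.
changed = [l for l in diff
           if l.startswith(("+", "-")) and not l.startswith(("+++", "---"))]
print(changed)  # → ['-2,FR,2.6', '+2,FR,2.7']
```

Because the edit to one record maps to an edit of one line, the diff is small and human-readable – exactly the property git’s history, blame, and merge machinery relies on.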
In addition to the basic pattern, there are a few optional extras you can add:
> * Store the data in GitHub (or Gitorious or Bitbucket or …) – all the examples below follow this approach
> * Turn the collection of data into a [Simple Data Format][sdf] [data package][dp] by adding a datapackage.json file which provides a small set of essential information like the license, sources, and schema (this column is a number, this one is a string)
> * Add the scripts you used to process and manage data — that way everything is nicely together in one repository
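As a sketch of the second extra, here is a minimal, hypothetical datapackage.json being written from Python. The top-level keys (name, licenses, sources, resources, schema) follow the Data Package spec; the dataset details are invented for illustration:

```python
import json

# A minimal, hypothetical datapackage.json for a single CSV resource.
# Key names follow the Data Package spec; the values are made up.
datapackage = {
    "name": "gdp",
    "licenses": [{"id": "odc-pddl"}],
    "sources": [{"name": "World Bank"}],
    "resources": [{
        "path": "data/gdp.csv",
        "schema": {"fields": [
            {"name": "Country", "type": "string"},  # this column is a string
            {"name": "Year", "type": "date"},
            {"name": "Value", "type": "number"},    # this one is a number
        ]},
    }],
}

# Write it alongside the data so the repo is self-describing.
with open("datapackage.json", "w") as f:
    json.dump(datapackage, f, indent=2)
```

The CSV itself stays pure data; license, provenance, and column types all live in this one small sidecar file.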
[^csv]: Note that, by CSV, we really mean “DSV”, as the delimiter in the file does *not* have to be a comma. However, the row terminator *should* be a line break (or a carriage return plus line break).
[^line]: CSVs do not *always* have one row to one line (it is possible to have line-breaks in a field with quoting). However, most CSVs are one-row-to-one-line. CSVs are pretty much the simplest possible structured data format you can have.
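The footnote’s caveat is easy to demonstrate with Python’s standard `csv` module: a quoted field containing a line break produces a file with more physical lines than logical rows (the field contents here are invented):

```python
import csv
import io

# One header row plus one data row whose "notes" field contains a line break.
buf = io.StringIO()
csv.writer(buf).writerows([["id", "notes"], ["1", "line one\nline two"]])
text = buf.getvalue()

# Parsing recovers 2 logical rows, but the file has 3 physical lines.
rows = list(csv.reader(io.StringIO(text)))
print(len(rows), len(text.splitlines()))  # → 2 3
```

Such files are still valid CSV, but they break the row-equals-line assumption that makes line-oriented diffs clean, which is why the pattern works best when fields are free of embedded line breaks.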
## What’s good about this approach?
The set of tools that exists for managing and manipulating line-oriented files is **huge and mature**. In particular, powerful distributed version control systems like **git** and **mercurial** are already extremely robust ways to do distributed, peer-to-peer collaboration around code, and this pattern takes that model and makes it applicable to data. Here are some concrete examples of why it’s good.
#### Provenance tracking
Git and mercurial provide a complete history of individual contributions with “simple” provenance via commit messages and diffs.
![Example of commit messages](http://i.imgur.com/9RZvCLl.png)
#### Peer-to-peer collaboration
Forking and pulling data allows independent contributors to work on it simultaneously.
![Timeline of pull requests](http://i.imgur.com/ZDmoZQL.png)
#### Data review
By using git or mercurial, tools for code review can be repurposed for data review.
#### Simple packaging
The repo model provides a simple way to store data, code, and metadata in a single place.
![A repo for data](http://i.imgur.com/3UQRf4V.png)
This method of storing and versioning data is very low-tech. The format and tools are both mature and ubiquitous. For example, every spreadsheet and every relational database can handle CSV, and every unix platform has a [suite of tools][cl-tools] like grep, sed, and cut that can be used on these kinds of files.
We’ve been using this approach for a long time: in 2005 we first stored CSVs in subversion, then in mercurial, and then when we switched to git (and github) 3 years ago we started storing them there. In 2011 we started the datasets organization on github, which contains a [whole list of datasets managed according to the pattern above][gh-datasets]. Here are a couple of specific examples:
* [Country, regional, and world GDP][gdp], collected from the World Bank and turned into a normalized CSV. The Python script used to clean the data is included in [scripts][scripts].
* List of companies in [Standard & Poor’s 500][sp500], an index of the top 500 publicly listed stocks in the US. Includes Python scripts to process the data and instructions on how to replicate the cleaned data from source. This dataset provides interesting examples of [what diffs][whats_a_diff1] [look like][whats_a_diff2], including [bad diffs (e.g. a column rearrange)][bad_diff].
*Note:* Most of these examples not only show CSVs being managed in github but are also [Simple Data Format][sdf] [data packages][dp] – see the datapackage.json file they contain.
### Limitations and Alternatives
Line-oriented text and its tools are, of course, far from *perfect* solutions to data storage and versioning. They will not work for datasets of every shape and size, and in some respects they are awkward tools for tracking and merging changes to tabular data. For example:
* Simple actions on data stored as line-oriented text can lead to a very large changeset. For example, swapping the order of two fields (= columns) leads to a change in *every single line*. Given that diffs, merges, etc. are line-oriented, this is unfortunate.[^line_diffs]
* It works best for smallish data (e.g. < 100k rows, < 50Mb files, optimally < 5Mb files). git and mercurial don’t handle big files that well, and features like diffs get more cumbersome with larger files.[^s3]
* It works best for data made up of lots of similar records, ideally tabular data. For line-oriented storage and tools to be appropriate, the record structure of the data needs to fit the CSV line-oriented structure. The pattern is less good if your CSV is not very line-oriented (e.g. you have a lot of fields with line breaks in them), since that causes problems for diff and merge.
* CSV lacks a lot of information, e.g. on the types of fields (everything is a string). There is no way to add metadata to a CSV without compromising its simplicity or making it no longer usable as pure data. You can, however, put this kind of information in a separate file, and this is exactly what the [Data Package standard][dp] provides with its datapackage.json file.

[^s3]: For larger data, we suggest swapping out git (and e.g. GitHub) for simple file storage like S3. Note that S3 can support basic copy-on-write versioning. However, being copy-on-write, it is comparatively very inefficient.

[^line_diffs]: As a concrete example, the merge function will probably work quite well in reconciling two sets of changes that affect different sets of records, and hence different lines. Two sets of changes which each move a column will not merge well, however.

The most fundamental limitations above all arise from applying line-oriented diffs and merges to structured data whose atomic unit is *not* a line (it’s a cell, or a transform of some kind like swapping two columns). The first issue discussed above, where a simple change to a table is treated as a change to every line of the file, is a clear example. In a perfect world, we’d have both a convenient structure *and* a whole set of robust tools to support it, e.g. tools that recognize swapping two columns of a CSV as a single, simple change, or that work at the level of individual cells.

Fundamentally, a revision system is built around a diff format and a merge protocol. Get these right and much of the rest follows. The basic three options you have are:

* Serialize to line-oriented text and use the great tools like git (what we’ve described above)
* Identify the atomic structure (e.g. document) and apply diffs at that level (think CouchDB, or standard copy-on-write for an RDBMS at row level)
* Record transforms (e.g. Refine)

At the Open Knowledge Foundation we built a system along the lines of (2) and have been involved in exploring and researching both (2) and (3) – see [changes and syncing for data on dataprotocols.org][syncing]. These options are definitely worth exploring. For example, Max Ogden, with whom I’ve had many great discussions on this topic, is currently working on an exciting project called [Dat], a collaborative data tool which will use the [“sleep” protocol][sleep]. However, our experience so far is that the line-oriented approach beats any currently available options along those other lines (at least for smaller files!).

[Dat]: http://github.com/maxogden/dat
[sleep]: http://www.dataprotocols.org/en/latest/sleep.html
[syncing]: http://www.dataprotocols.org/en/latest/syncing.html

### data.okfn.org

Having already been storing data in github like this for several years, we recently launched [data.okfn.org](http://data.okfn.org/). Its key features:
* Data is stored as CSV in git repos on GitHub
* All datasets are data packages with datapackage.json metadata
* The frontend site is ultra-simple – it just provides a catalog and API and pulls the data directly from GitHub
### Why line-oriented
Line-oriented text is the natural form of code and so is supported by a huge number of excellent tools. But line-oriented text is also the simplest and most parsimonious form for storing general record-oriented data—and most data can be turned into records.
At its most basic, structured data requires a delimiter for fields and a delimiter for records. Comma- or tab-separated values (CSV, TSV) files are a very simple and natural implementation of this encoding. They delimit records with the most natural separation character besides the space, the line break. For a field delimiter, since spaces are too common in values to be appropriate, they naturally resort to commas or tabs.
Version control systems require an atomic unit to operate on. A versioning system for data can quite usefully treat *records* as the atomic units. Using line-oriented text as the encoding for record-oriented data automatically gives us a record-oriented versioning system in the form of existing tools built for versioning code.
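Both sides of this argument can be sketched with Python’s standard `difflib` on an invented three-record CSV: a change to one record touches one line, while a structural change like the column swap discussed in the limitations section touches every line:

```python
import difflib

header = "country,year,gdp"
v1 = [header, "DE,2012,3.4", "FR,2012,2.6", "UK,2012,2.5"]

# Editing one record changes one line...
v2 = [header, "DE,2012,3.4", "FR,2012,2.7", "UK,2012,2.5"]
# ...but swapping the first two columns rewrites every line.
v3 = [",".join([f[1], f[0], f[2]]) for f in (line.split(",") for line in v1)]

def changed_lines(a, b):
    """Count added/removed lines in a unified diff, ignoring file headers."""
    return sum(1 for l in difflib.unified_diff(a, b, lineterm="")
               if l.startswith(("+", "-")) and not l.startswith(("+++", "---")))

print(changed_lines(v1, v2), changed_lines(v1, v3))  # → 2 8
```

The record edit produces a 2-line diff (one removal, one addition), while the column swap shows up as 8 changed lines for a 4-line file, i.e. every line rewritten – which is why record-level changes version beautifully under this pattern and structural changes do not.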