

Open Data Privacy

Laura James - August 27, 2013 in Featured, Ideas and musings, Open Data, Open Data and My Data, Open Government Data, Privacy

“yes, the government should open other people’s data”

Traditionally, the Open Knowledge Foundation has worked to open non-personal data – things like publicly-funded research papers, government spending data, and so on. Where individual data was a part of some shared dataset, such as a census, great amounts of thought and effort had gone into ensuring that individual privacy was protected and that the aggregate data released was a shared, communal asset.

But times change. Increasing amounts of data are collected by governments and corporations, vast quantities of it about individuals (whether or not they realise that it is happening). The risks to privacy through data collection and sharing are probably greater than they have ever been. Data analytics – whether of “big” or “small” data – has the potential to provide unprecedented insight; however, some of that insight may come at the cost of personal privacy, as separate datasets are connected/correlated.

Medical data loss dress

Both open data and big data are hot topics right now, and at such times it is tempting for organisations to get involved without necessarily thinking through all the issues. The intersection of big data and open data is somewhat worrying, as the temptation to combine the economic benefits of open data with the current growth potential of big data may lead to privacy concerns being disregarded. Privacy International are right to draw attention to this in their recent article on data for development, but of course other domains are affected too.

Today, we’d like to suggest some terms to help the growing discussion about open data and privacy.

Our Data is data with no personal element, and a clear sense of shared ownership. Some examples would be where the buses run in my city, what the government decides to spend my tax money on, how the national census is structured and the aggregate data resulting from it. At the Open Knowledge Foundation, our default position is that our data should be open data – it is a shared asset we can and should all benefit from.

My Data is information about me personally, where I am identified in some way, regardless of who collects it. It should not be made open or public by others without my direct permission – but it should be “open” to me (I should have access to data about me in a usable form, and the right to share it myself, however I wish, if I choose to do so).

Transformed Data is information about individuals, where some effort has been made to anonymise or aggregate the data to remove individually identified elements.
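
To make the idea of transformation a little more concrete, here is a minimal sketch (our own illustration, not a prescribed method) of aggregating individual records into area-level counts and suppressing small groups. Real anonymisation is much harder than this, which is exactly why the principles discussed below matter.

```python
import csv
from collections import Counter

# Purely illustrative records: one row per individual, already reduced to a
# coarse area code. This is a sketch of aggregation only, not a complete
# anonymisation procedure.
individuals = [
    {"area": "AB1", "has_condition": "yes"},
    {"area": "AB1", "has_condition": "no"},
    {"area": "AB2", "has_condition": "yes"},
]

counts = Counter(row["area"] for row in individuals)

# Suppress counts small enough to risk identifying individuals.
SUPPRESSION_THRESHOLD = 5

with open("area_counts.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["area", "count"])
    for area, n in sorted(counts.items()):
        writer.writerow([area, n if n >= SUPPRESSION_THRESHOLD else "suppressed"])
```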


We propose that there should be some clear steps which need to be followed to confirm whether transformed data can be published openly as our data. A set of privacy principles for open data, setting out the considerations that need to be made, would be a good start. These might include things like consulting key stakeholders – representatives of whatever group(s) the data is about, and data privacy experts – about how the data is transformed. For some datasets, it may not prove possible to transform them sufficiently that a reasonable level of privacy can be maintained for citizens; these datasets simply should not be opened up. For others, it may be that further work on transformation is needed to achieve an acceptable standard of privacy before the data is fit to be released openly. Ensuring the risks are considered and managed before data release is essential. If the transformations provide sufficient privacy for the individuals concerned, and the principles have been adhered to, the data can be released as open data.

We note that some of “our data” will have personal elements. For instance, members of parliament have made a positive choice to enter the public sphere, and some information about them is therefore necessarily available to citizens. Data of this type should still be considered against the principles of open data privacy we propose before publication, although the standards compared against may be different given the public interest.

This is part of a series of posts exploring the areas of open data and privacy, which we feel is a very important issue. If you are interested in these matters, or would like to help develop privacy principles for open data, join the working group mailing list. We’d welcome suggestions and thoughts on the mailing list or in the comments below, or talk to us and the Open Rights Group, who we are working with, at the Open Knowledge Conference and other events this autumn.

9 models to scale open data – past, present and future

Francis Irving - July 18, 2013 in Business, Featured, Ideas and musings, Open Data

Golden spiral, by Kakapo31 CC-BY-NC-SA

The possibilities of open data have been enthralling us for 10 years.

I came to it through wanting to make Government really usable, to build sites
like TheyWorkForYou.

But that excitement isn’t what matters in the end.

What matters is scale – which organisational structures will make this movement
explode?

Whether by creating self-growing volunteer communities, or by generating flows
of money.

This post quickly and provocatively goes through some that haven’t worked
(yet!) and some that have.

Ones that are working now

1) Form a community to enter in new data. Open Street Map
(http://www.openstreetmap.org/) and MusicBrainz (http://musicbrainz.org/) are
two big examples. It works as the community is the originator of the data.
That said, neither has dominated its industry as much as I thought they would
have by now.

2) Sell tools to an upstream generator of open data. This is what
CKAN does for central Governments (and what the new ScraperWiki CKAN tool helps with). It’s what mySociety does, when selling
FixMyStreet
installs to local councils, thereby publishing their potholes as RSS feeds.

3) Use open data (quietly). Every organisation does this and never talks
about it. It’s key to quite old data resellers like Bloomberg. It is what most of
ScraperWiki’s professional services
customers ask us to do. The value to society is enormous and invisible. The
big flaw is that it doesn’t help scale supply of open data.

4) Sell tools to downstream users. This isn’t necessarily open data
specific – existing software like spreadsheets and Business Intelligence can be
used with open or closed data. Lots of open data is on the web, so tools like
the new ScraperWiki which work well with
web data are particularly suited to it.

Ones that haven’t worked

5) Collaborative curation. ScraperWiki started as an audacious attempt to create an open data curation
community, based on editing scraping code in a wiki. In its original form
(now called ScraperWiki Classic) this didn’t scale.
Here are some reasons, in terms of open data models, why it didn’t.

a. It wasn’t upstream. Whatever provenance you give, people trust data most
when they get it straight from its source. This can also be a partial upstream -
for example supplementing scraped data with new data manually gathered by
telephone.

b. It isn’t in private. Although in theory there’s lots to gain by wrangling
commodity data together in public, it goes against the instincts of most
organisations.

c. There’s not enough existing culture. The free software movement built a rich
culture of collaboration, ready to be exploited some 15 years in by the open
source movement, and 25 years later by tools like Github. With a few
exceptions, notably OpenCorporates, there
aren’t yet many open data curation projects.

6) General purpose data marketplaces, particularly ones that are mainly
reusing open data, haven’t taken off. They might do one day, however I think
they need well-adopted higher level standards for data formatting and syncing
first (perhaps something like dat, perhaps something based on CSV files).

Ones I expect more of in the future

These are quite exciting models which I expect to see a lot more of.

7) Give labour/money to upstream to help them create better data. This is
quite new. The only, and most excellent, example of it is the UK’s National
Archives curating the Statute Law Database
(http://blog.okfn.org/2012/10/04/worlds-first-real-commercial-open-data-curation-project/).
They do the work with the help of staff seconded from commercial legal
publishers and other parts of Government.

It’s clever because it generates money for upstream, which people trust the most,
and which has the most ability to improve data quality.

8) Viral open data licensing. MySQL made lots of money this way, offering
proprietary dual licenses of GPLd software to embedded systems makers. In data
this could use OKFN’s Open Database License,
and organisations would pay when they wanted to mix the open data with their
own closed data. I don’t know anyone actively using it, although Chris Taggart
from OpenCorporates mentioned this model to me years ago.

9) Corporations release data for strategic advantage. Companies are starting
to release their own data for strategic gain
(http://blog.okfn.org/2011/07/27/and-so-corporations-begin-to-open-data/).
This is very new. Expect more of it.

What have I missed? What models do you see that will scale Open Data, and bring
its benefits to billions?

Git (and Github) for Data

Rufus Pollock - July 2, 2013 in Featured, Ideas and musings, Open Data, Small Data, Technical

The ability to do “version control” for data is a big deal. There are various options but one of the most attractive is to reuse existing tools for doing this with code, like git and mercurial. This post describes a simple “data pattern” for storing and versioning data using those tools which we’ve been using for some time and found to be very effective.

Introduction

The ability to do revisioning and versioning of data – store changes made and share them with others – especially in a distributed way would be a huge benefit to the (open) data community. I’ve discussed why at some length before (see also this earlier post) but to summarize:

  • It allows effective distributed collaboration – you can take my dataset, make changes, and share those back with me (and different people can do this at once!)
  • It allows one to track provenance better (i.e. what changes came from where)
  • It allows for sharing updates and synchronizing datasets in a simple, effective way – e.g. an automated way to get the last month’s GDP or employment data without pulling the whole file again

There are several ways to address the “revision control for data” problem. The approach here is to get data in a form that means we can take existing powerful distributed version control systems designed for code like git and mercurial and apply them to the data. As such, the best github for data may, in fact, be github (of course, you may want to layer data-specific interfaces on top of git(hub) – this is what we do with http://data.okfn.org/).

There are limitations to this approach and I discuss some of these and alternative models below. In particular, it’s best for “small (or even micro) data” – say, under 10Mb or 100k rows. (One alternative model can be found in the very interesting Dat project recently started by Max Ogden — with whom I’ve talked many times on this topic).

However, given the maturity and power of the tooling – and its likely evolution – and the fact that so much data is small we think this approach is very attractive.

The Pattern

The essence of the pattern is:

  1. Storing data as line-oriented text and specifically as CSV [1] (comma-separated values) files. “Line-oriented text” just indicates that individual units of the data, such as a row of a table (or an individual cell), correspond to one line. [2]

  2. Use best-of-breed (code) versioning tools like git and mercurial to store and manage the data.

Line-oriented text is important because it enables the powerful distributed version control tools like git and mercurial to work effectively (this, in turn, is because those tools are built for code which is (usually) line-oriented text). It’s not just version control though: there is a large and mature set of tools for managing and manipulating these types of files (from grep to Excel!).
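
As a minimal sketch of the pattern (assuming git is installed and on your PATH; the file name and figures are invented for illustration), creating and versioning a small CSV looks like this:

```python
import csv
import subprocess

# Write a small, line-oriented CSV: one record per line.
rows = [
    ["country", "year", "gdp_usd_bn"],
    ["Freedonia", "2012", "123.4"],   # invented figures, for illustration only
    ["Sylvania", "2012", "98.7"],
]
with open("gdp.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Put the file under ordinary git version control.
subprocess.run(["git", "init"], check=True)
subprocess.run(["git", "add", "gdp.csv"], check=True)
subprocess.run(["git", "commit", "-m", "Add 2012 GDP figures"], check=True)

# From here on, edits to individual records show up as one-line diffs:
#   git diff            -- review changes record by record
#   git log gdp.csv     -- provenance: who changed what, and when
```

Nothing here is data-specific: it is just the normal git workflow applied to a CSV file.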

In addition to the basic pattern, there are a few optional extras you can add:

  • Store the data in GitHub (or Gitorious or Bitbucket or …) – all the examples below follow this approach
  • Turn the collection of data into a Simple Data Format data package by adding a datapackage.json file which provides a small set of essential information like the license, sources, and schema (this column is a number, this one is a string) – see the sketch after this list
  • Add the scripts you used to process and manage data — that way everything is nicely together in one repository
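
For the datapackage.json mentioned above, a minimal version might look like the following sketch. The dataset details are invented, and the exact field names should be checked against the Data Package specification.

```python
import json

# An illustrative datapackage.json describing a single CSV resource.
# Check the Data Package / JSON Table Schema specs for the exact field names.
datapackage = {
    "name": "gdp-example",
    "title": "Example GDP figures (illustrative)",
    "licenses": [{"id": "odc-pddl", "url": "http://opendatacommons.org/licenses/pddl/"}],
    "sources": [{"name": "Example statistics office"}],
    "resources": [
        {
            "path": "gdp.csv",
            "schema": {
                "fields": [
                    {"name": "country", "type": "string"},
                    {"name": "year", "type": "integer"},
                    {"name": "gdp_usd_bn", "type": "number"},
                ]
            },
        }
    ],
}

with open("datapackage.json", "w") as f:
    json.dump(datapackage, f, indent=2)
```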

What’s good about this approach?

The set of tools that exists for managing and manipulating line-oriented files is huge and mature. In particular, powerful distributed version control systems like git and mercurial are already extremely robust ways to do distributed, peer-to-peer collaboration around code, and this pattern takes that model and makes it applicable to data. Here are some concrete examples of why it’s good.

Provenance tracking

Git and mercurial provide a complete history of individual contributions with “simple” provenance via commit messages and diffs.

Example of commit messages

Peer-to-peer collaboration

Forking and pulling data allows independent contributors to work on it simultaneously.

Timeline of pull requests

Data review

By using git or mercurial, tools for code review can be repurposed for data review.

Pull screen

Simple packaging

The repo model provides a simple way to store data, code, and metadata in a single place.

A repo for data

Accessibility

This method of storing and versioning data is very low-tech. The format and tools are both very mature and ubiquitous. For example, every spreadsheet and every relational database can handle CSV. Every unix platform has a suite of tools like grep, sed and cut that can be used on these kinds of files.

Examples

We’ve been using this approach for a long time: in 2005 we first stored CSVs in subversion, then in mercurial, and then when we switched to git (and github) 3 years ago we started storing them there. In 2011 we started the datasets organization on github which contains a whole list of datasets managed according to the pattern above. Here are a couple of specific examples:

Note: Most of these examples not only show CSVs being managed in github but are also Simple Data Format data packages – see the datapackage.json they contain.


Appendix

Limitations and Alternatives

Line-oriented text and its tools are, of course, far from perfect solutions to data storage and versioning. They will not work for datasets of every shape and size, and in some respects they are awkward tools for tracking and merging changes to tabular data. For example:

  • Simple actions on data stored as line-oriented text can lead to a very large changeset. For example, swapping the order of two fields (= columns) leads to a change in every single line. Given that diffs, merges, etc. are line-oriented, this is unfortunate. [3]
  • It works best for smallish data (e.g. < 100k rows, < 50MB files, optimally < 5MB files). git and mercurial don’t handle big files that well, and features like diffs get more cumbersome with larger files. [4]
  • It works best for data made up of lots of similar records, ideally tabular data. In order for line-oriented storage and tools to be appropriate, you need the record structure of the data to fit with the CSV line-oriented structure. The pattern is less good if your CSV is not very line-oriented (e.g. you have a lot of fields with line breaks in them), causing problems for diff and merge.
  • CSV lacks a lot of information, e.g. information on the types of fields (everything is a string). There is no way to add metadata to a CSV without compromising its simplicity or making it no longer usable as pure data. You can, however, add this kind of information in a separate file, and this is exactly what the Data Package standard provides with its datapackage.json file.

The most fundamental limitations above all arise from applying line-oriented diffs and merges to structured data whose atomic unit is not a line (it’s a cell, or a transform of some kind like swapping two columns).

The first issue discussed above, where a simple change to a table is treated as a change to every line of the file, is a clear example. In a perfect world, we’d have both a convenient structure and a whole set of robust tools to support it, e.g. tools that recognize swapping two columns of a CSV as a single, simple change or that work at the level of individual cells.
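
A small sketch with the standard library makes the asymmetry concrete (the values are invented): editing one record produces a one-line diff, while swapping columns changes every line.

```python
import difflib

original = [
    "country,year,population",
    "Freedonia,2012,1200000",
    "Sylvania,2012,900000",
]

# Case 1: change a single record -- the line-oriented diff touches one line.
one_record_changed = list(original)
one_record_changed[1] = "Freedonia,2012,1250000"

# Case 2: move the last column to the front -- every line changes.
columns_swapped = [
    ",".join([c, a, b]) for a, b, c in (line.split(",") for line in original)
]

print("--- single record edited ---")
print("\n".join(difflib.unified_diff(original, one_record_changed, lineterm="")))
print("--- column order changed ---")
print("\n".join(difflib.unified_diff(original, columns_swapped, lineterm="")))
```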

Fundamentally a revision system is built around a diff format and a merge protocol. Get these right and much of the rest follows. The basic 3 options you have are:

  • Serialize to line-oriented text and use the great tools like git (what we’ve described above)
  • Identify atomic structure (e.g. document) and apply diff at that level (think CouchDB or standard copy-on-write for RDBMS at row level)
  • Record transforms (e.g. Refine)

At the Open Knowledge Foundation we built a system along the lines of (2) and have been involved in exploring and researching both (2) and (3) – see changes and syncing for data on dataprotocols.org. These options are definitely worth exploring — and, for example, Max Ogden, with whom I’ve had many great discussions on this topic, is currently working on an exciting project called Dat, a collaborative data tool which will use the “sleep” protocol.

However, our experience so far is that the line-oriented approach beats any currently available options along those other lines (at least for smaller sized files!).

data.okfn.org

Having already been storing data in github like this for several years, we recently launched http://data.okfn.org/ which is explicitly based on this approach:

  • Data is CSV stored in git repos on GitHub at https://github.com/datasets
  • All datasets are data packages with datapackage.json metadata
  • Frontend site is ultra-simple – it just provides a catalog and API and pulls data directly from github

Why line-oriented

Line-oriented text is the natural form of code and so is supported by a huge number of excellent tools. But line-oriented text is also the simplest and most parsimonious form for storing general record-oriented data—and most data can be turned into records.

At its most basic, structured data requires a delimiter for fields and a delimiter for records. Comma- or tab-separated values (CSV, TSV) files are a very simple and natural implementation of this encoding. They delimit records with the most natural separation character besides the space, the line break. For a field delimiter, since spaces are too common in values to be appropriate, they naturally resort to commas or tabs.

Version control systems require an atomic unit to operate on. A versioning system for data can quite usefully treat records as the atomic units. Using line-oriented text as the encoding for record-oriented data automatically gives us a record-oriented versioning system in the form of existing tools built for versioning code.


  1. Note that, by CSV, we really mean “DSV”, as the delimiter in the file does not have to be a comma. However, the row terminator should be a line break (or a carriage return plus line break).

  2. CSVs do not always have one row to one line (it is possible to have line-breaks in a field with quoting). However, most CSVs are one-row-to-one-line. CSVs are pretty much the simplest possible structured data format you can have. 

  3. As a concrete example, the merge function will probably work quite well in reconciling two sets of changes that affect different sets of records, hence lines. Two sets of changes which each move a column will not merge well, however. 

  4. For larger data, we suggest swapping out git (and e.g. GitHub) for simple file storage like s3. Note that s3 can support basic copy-on-write versioning. However, being copy-on-write, it is comparatively very inefficient. 

Follow the Money, Follow the Data

Martin Tisne - May 3, 2013 in Ideas and musings, Open Data, Open Government Data, Open Spending


The following guest post from Martin Tisné was first published on his personal blog.

Money tunnel by RambergMediaImages, CC-BY-SA on Flickr

Some thoughts which I hope may be helpful in advance of the ‘follow the data‘ hack day this weekend:

The open data sector has quite successfully focused on socially-relevant information: fixing potholes a la http://www.fixmystreet.com/, adopting fire hydrants a la http://adoptahydrant.org/. My sense is that the next frontier will be to free the data that can enable citizens, NGOs and journalists to hold their governments to account. What this will likely mean is engaging in issues such as data on extractives’ transparency, government contracting, political finance, budgeting etc. So far, these are not the bread and butter of the open data movement (which isn’t to say there aren’t great initiatives like http://openspending.org/). But they should be:

At its heart, this agenda revolves around ‘following the money’. Without knowing the ‘total resource flow’:

  • Parents’ associations cannot question the lack of textbooks in their schools by interrogating the school’s budget
  • Healthcare groups cannot access data related to local spending on doctors, nurses
  • Great orgs such as Open Knowledge Foundation or BudgIT cannot get the data they need for their interpretative tools (e.g. budget tracking tool)
  • Investigative journalists cannot access the data they need to pursue a story

Our field has sought to ‘follow the money’ for over two decades, but in practice we still lack the fundamental ability to trace funding flows from A to Z, across the revenue chain. We should be able to get to what aid transparency experts call ‘traceability’ (the ability to trace aid funds from the donor down to the project level) for all, or at least most, fiscal flows.

Open data enables this to happen. This is exciting: it’s about enabling follow the money to happen at scale. Up until now, ‘following the money’ has been the fruit of the hard work of investigative journalists, in isolated instances.

If we can ensure that data on revenues (extractives, aid, tax etc), expenditures (from planning to allocation to spending to auditing), and results (service delivery data) is timely, accessible, comparable and comprehensive, we will have gone a long way to helping ‘follow the money’ efforts reach the scale they deserve.

Follow the Money is a pretty tangible concept (if you disagree, please let me know!) – it helps demonstrate how government funds buy specific outcomes, and how/whether resources are siphoned away. We need to now make it a reality.

Open Knowledge: much more than open data

Laura James - May 1, 2013 in Featured, Ideas and musings, Join us, Open Data, Open Knowledge Foundation, Our Work


Book, Ball and Chain

We’ve often used “open knowledge” simply as a broad term to cover any kind of open data or content from statistics to sonnets, and more. However, there is another deeper, and far more important, reason why we are the “Open Knowledge” Foundation and not, for example, the “Open Data” Foundation. It’s because knowledge is something much more than data.

Open knowledge is what open data becomes when it’s useful, usable and used. At the Open Knowledge Foundation we believe in open knowledge: not just that data is open and can be freely used, but that it is made useful – accessible, understandable, meaningful, and able to help someone solve a real problem. Open knowledge should be empowering – it should enable citizens and organizations to understand the world, create insight and effect positive change.

It’s because open knowledge is much more than just raw data that we work both to have raw data and information opened up (by advocating and campaigning) and also, as makers, to create the tools that turn that raw material into knowledge that people can act upon. For example, we build technical tools, open source software to help people work with data, and we create handbooks which help people acquire the skills they need to do so. This combination, that we are both evangelists and makers, is extremely powerful in helping us change the world.

Achieving our vision of a world transformed through open knowledge, a world where a vibrant open knowledge commons empowers citizens and enables fair and sustainable societies, is a big challenge. We firmly believe it can be done, with a global network of amazing people and organisations fighting for openness and making tools and more to support the open knowledge ecosystem, although it’s going to take a while!

We at the Open Knowledge Foundation are committed to this vision of a global movement building an open knowledge ecosystem, and we are here for the long term. We’d love you to join us in improving the world through open knowledge; there will be many different ways you can help coming up during the months ahead, so get started now by keeping in touch – by signing up to receive our Newsletter, or finding a local group or meetup near you.


What Do We Mean By Small Data

Rufus Pollock - April 26, 2013 in Featured, Ideas and musings, Labs, Open Data, Small Data

Earlier this week we published the first in a series of posts on small data: “Forget Big Data, Small Data is the Real Revolution”. In this second in the series, we discuss small data in more detail providing a rough definition and drawing parallels with the history of computers and software.

What do we mean by “small data”? Let’s define it crudely as:

“Small data is the amount of data you can conveniently store and process on a single machine, and in particular, a high-end laptop or server”

Why a laptop? What’s interesting (and new) right now is the democratisation of data and the associated possibility of a large-scale distributed community of data wranglers working collaboratively. What matters here then is, crudely, the amount of data that an average data geek can handle on their own machine, their own laptop.

A key point is that the dramatic advances in computing, storage and bandwidth have far bigger implications for “small data” than for “big data”. The recent advances have increased the realm of small data, the kind of data that an individual can handle on their own hardware, far more relatively than they have increased the realm of “big data”. Suddenly, working with significant datasets – datasets containing tens of thousands, hundreds of thousands or millions of rows – can be a mass-participation activity.

(As should be clear from the above definition – and any recent history of computing – small (and big) are relative terms that change as technology advances – for example, in 1994 a terabyte of storage cost several hundred thousand dollars, today it’s under a hundred. This also means today’s big is tomorrow’s small).

Our situation today is similar to microcomputers in the late 70s and early 80s or the Internet in the 90s. When microcomputers first arrived, they seemed puny in comparison to the “big” computing and “big” software then around and there was nothing strictly they could do that existing computing could not. However, they were revolutionary in one fundamental way: they made computing a mass-participation activity. Similarly, the Internet was not new in the 1990s – it had been around in various forms for several decades – but it was at that point it became available at a mass-scale to the average developer (and ultimately citizen). In both cases “big” kept on advancing too – be it supercomputers or the high-end connectivity – but the revolution came from “small”.

This (small) data revolution is just beginning. The tools and infrastructure to enable effective collaboration and rapid scaling for small data are in their infancy, and the communities with the capacities and skills to use small data are in their early stages. Want to get involved in the small data revolution? Sign up now.


This is the second in a series of posts about the power of Small Data – follow the Open Knowledge Foundation blog, Twitter or Facebook to learn more and join the debate at #SmallData on Twitter.

Frictionless Data: making it radically easier to get stuff done with data

Rufus Pollock - April 24, 2013 in Featured, Ideas and musings, Labs, Open Data, Open Standards, Small Data, Technical

Frictionless Data is now in alpha at http://data.okfn.org/ – and we’d like you to get involved.

Our mission is to make it radically easier to make data used and useful – our immediate goal is to make it as simple as possible to get the data you want into the tool of your choice.

This isn’t about building a big datastore or a data management system – it’s simply saving people from repeating all the same tasks of discovering a dataset, getting it into a format they can use, cleaning it up – all before they can do anything useful with it! If you’ve ever spent the first half of a hackday just tidying up tabular data and getting it ready to use, Frictionless Data is for you.

Our work is based on a few key principles:

  • Narrow focus — improve one small part of the data chain; standards and tools are limited in scope and size
  • Build for the web – use formats that are web “native” (JSON) and work naturally with HTTP (plain-text, CSV is streamable etc)
  • Distributed not centralised — designed for a distributed ecosystem (no centralized, single point of failure or dependence)
  • Work with existing tools — don’t expect people to come to you, make this work with their tools and their workflows (almost everyone in the world can open a CSV file, every language can handle CSV and JSON)
  • Simplicity (but sufficiency) — use the simplest formats possible and do the minimum in terms of metadata but be sufficient in terms of schemas and structure for tools to be effective

We believe that making it easy to get and use data and especially open data is central to creating a more connected digital data ecosystem and accelerating the creation of social and commercial value. This project is about reducing friction in getting, using and connecting data, making it radically easier to get data you need into the tool of your choice. Frictionless Data distills much of our learning over the last 7 years into some specific standards and infrastructure.

What’s the Problem?

Today, when you decide to cook, the ingredients are readily available at local supermarkets or even already in your kitchen. You don’t need to travel to a farm, collect eggs, mill the corn, cure the bacon etc – as you once would have done! Instead, thanks to standard systems of measurement, packaging, shipping (e.g. containerization) and payment, ingredients can get from the farm direct to your local shop or even your door.

But with data we’re still largely stuck at this early stage: every time you want to do an analysis or build an app you have to set off around the internet to dig up data, extract it, clean it and prepare it before you can even get it into your tool and begin your work proper.

What do we need to do for working with data to be like cooking today – where you get to spend your time making the cake (creating insights), not preparing and collecting the ingredients (digging up and cleaning data)?

The answer: radical improvements in the “logistics” of data associated with specialisation and standardisation. In analogy with food, we need standard systems of “measurement”, packaging, and transport so that it’s easy to get data from its original source into the application where you can start working with it.

Frictionless Data idea

What’s Frictionless Data going to do?

We start with an advantage: unlike for physical goods, transporting digital information from one computer to another is very cheap! This means the focus can be on standardizing and simplifying the process of getting data from one application to another (or one form to another). We propose work in 3 related areas:

  • Key simple standards. For example, a standardized “packaging” of data that makes it easy to transport and use (think of the “containerization” revolution in shipping)
  • Simple tooling and integration – you should be able to get data in these standard formats into or out of Excel, R, Hadoop or whatever tool you use
  • Bootstrapping the system with essential data – we need to get the ball rolling

frictionless data components diagram

What’s Frictionless Data today?

1. Data

We have some exemplar datasets which are useful for a lot of people – these are:

  • High Quality & Reliable

    • We have sourced, normalized and quality checked a set of key reference datasets such as country codes, currencies, GDP and population.
  • Standard Form & Bulk Access

    • All the datasets are provided in a standardized form and can be accessed in bulk as CSV together with a simple JSON schema – see the example after this list.
  • Versioned & Packaged

    • All data is in data packages and is versioned using git so all changes are visible and data can be collaboratively maintained.
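
As a quick illustration of the bulk access point above, pulling one of the reference datasets into a script takes only a few lines. The URL below is a guess at the layout of the github.com/datasets repositories, so treat it as a placeholder and substitute the real file path.

```python
import csv
import io
import urllib.request

# Placeholder URL: substitute the raw CSV path of the dataset you want
# from one of the repositories under https://github.com/datasets.
URL = "https://raw.githubusercontent.com/datasets/country-codes/master/data/country-codes.csv"

with urllib.request.urlopen(URL) as response:
    text = response.read().decode("utf-8")

rows = list(csv.DictReader(io.StringIO(text)))
print(len(rows), "records; first columns:", list(rows[0].keys())[:5])
```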

2. Standards

We have two simple data package formats, described as ultra-lightweight, RFC-style specifications. They build heavily on prior work. Simplicity and practicality were guiding design criteria.

Frictionless Data: package standard diagram

Data package: minimal wrapping, agnostic about the data it’s “packaging”, designed for extension. This flexibility is good as it can be used as a transport for pretty much any kind of data, but it also limits integration and tooling. Read the full Data Package specification.

Simple data format (SDF): focuses on tabular data only and extends data package (data in simple data format is a data package) by requiring data to be “good” CSVs and the provision of a simple JSON-based schema to describe them (“JSON Table Schema”). Read the full Simple Data Format specification.

3. Tools

It’s early days for Frictionless Data, so we’re still working on this bit! But there’s a need for validators, schema generators, and all kinds of integration. You can help out – see below for details or check out the issues on github.
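
To give a flavour of what even a tiny validator might do (this is only a sketch, not part of any official tooling; the schema and file name are invented), here is a check of a CSV’s header against a JSON Table Schema-style field list:

```python
import csv

# An illustrative JSON Table Schema-style field list; in practice this would
# come from the resource's entry in datapackage.json.
schema = {
    "fields": [
        {"name": "country", "type": "string"},
        {"name": "year", "type": "integer"},
        {"name": "gdp_usd_bn", "type": "number"},
    ]
}
expected = [field["name"] for field in schema["fields"]]

# Hypothetical CSV file to check against the schema.
with open("gdp.csv", newline="") as f:
    header = next(csv.reader(f))

if header == expected:
    print("OK: header matches schema:", header)
else:
    print("Mismatch: expected", expected, "but found", header)
```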

Doesn’t this already exist?

People have been working on data for a while – doesn’t something like this already exist? The crude answer is yes and no. People, including folks here at the Open Knowledge Foundation, have been working on this for quite some time, and there are already some parts of the solution out there. Furthermore, many of these ideas are directly borrowed from similar work in software. For example, the Data Packages spec (first version in 2007!) builds heavily on packaging projects and specifications like Debian and CommonJS.

Key distinguishing features of Frictionless Data:

  • Ultra-simplicity – we want to keep things as simple as they possibly can be. This includes formats (JSON and CSV) and a focus on end-user tool integration, so people can just get the data they want into the tool they want and move on to the real task
  • Web orientation – we want an approach that fits naturally with the web
  • Focus on integration with existing tools
  • Distributed and not tied to a given tool or project – this is not about creating a central data marketplace or similar setup. It’s about creating a basic framework that would enable anyone to publish and use datasets more easily and without going through a central broker.

Many of these are shared with (and derive from) other approaches but as a whole we believe this provides an especially powerful setup.

Get Involved

This is a community-run project coordinated by the Open Knowledge Foundation as part of Open Knowledge Foundation Labs. Please get involved:


  • Spread the word! Frictionless Data is a key part of the real data revolution – follow the debate on #SmallData and share our posts so more people can get involved

Forget Big Data, Small Data is the Real Revolution

Rufus Pollock - April 22, 2013 in Featured, Ideas and musings, Labs, Open Data, Small Data

There is a lot of talk about “big data” at the moment. For example, this is Big Data Week, which will see events about big data in dozens of cities around the world. But the discussions around big data miss a much bigger and more important picture: the real opportunity is not big data, but small data. Not centralized “big iron”, but decentralized data wrangling. Not “one ring to rule them all” but “small pieces loosely joined”.

Big data smacks of the centralization fads we’ve seen in each computing era. The thought that ‘hey there’s more data than we can process!’ (something which is no doubt always true year-on-year since computing began) is dressed up as the latest trend with associated technology must-haves.

Meanwhile we risk overlooking the much more important story here, the real revolution, which is the mass democratisation of the means of access, storage and processing of data. This story isn’t about large organisations running parallel software on tens of thousands of servers, but about more people than ever being able to collaborate effectively around a distributed ecosystem of information, an ecosystem of small data.

Just as we now find it ludicrous to talk of “big software” – as if size in itself were a measure of value – we should, and will one day, find it equally odd to talk of “big data”. Size in itself doesn’t matter – what matters is having the data, of whatever size, that helps us solve a problem or address the question we have.

For many problems and questions, small data in itself is enough. The data on my household energy use, the times of local buses, government spending – these are all small data. Everything processed in Excel is small data. When Hans Rosling shows us how to understand our world through population change or literacy he’s doing it with small data.

And when we want to scale up, the way to do that is through componentized small data: by creating and integrating small data “packages”, not building big data monoliths, by partitioning problems in a way that works across people and organizations, not through creating massive centralized silos.

This next decade belongs to distributed models not centralized ones, to collaboration not control, and to small data not big data.

Want to create the real data revolution? Come join our community creating the tools and materials to make it happen — sign up here:


This is the first in a series of posts about the power of Small Data – follow the Open Knowledge Foundation blog, Twitter or Facebook to learn more and join the debate at #SmallData on Twitter.

Further Reading

  • Nobody ever got fired for buying a cluster
    • Even at enterprises like Microsoft and Yahoo most jobs could run on a single machine. E.g. the median job size is 14GB at Microsoft and 80% of jobs are less than 1TB; at Yahoo the estimated median job size is 12GB.
    • “Ananthanarayanan et al. show that Facebook jobs follow a power-law distribution with small jobs dominating; from their graphs it appears that at least 90% of the jobs have input sizes under 100 GB,” the paper states. “Chen et al. present a detailed study of Hadoop workloads for Facebook as well as 5 Cloudera customers. Their graphs also show that a very small minority of jobs achieves terabyte scale or larger and the paper claims explicitly that ‘most jobs have input, shuffle, and output sizes in the MB to GB range.’”
  • PACMan: Coordinated Memory Caching for Parallel Jobs – Ganesh Ananthanarayanan, Ali Ghodsi, Andrew Wang, Dhruba Borthakur, Srikanth Kandula, Scott Shenker, Ion Stoica

Open Data & My Data

Laura James - February 22, 2013 in Featured, Ideas and musings, Open Data, Working Groups

The Open Knowledge Foundation believes in open knowledge: not just that some data is open and freely usable, but that it is useful – accessible, understandable, meaningful, and able to help someone solve a real problem.

A lot of the data which could help me improve my life is data about me – “MyData” if you like. Many of the most interesting questions and problems we have involve personal data of some kind. This data might be gathered directly by me (using my own equipment or commercial services), or it could be harvested by corporations from what I do online, or assembled by public sector services I use, or voluntarily contributed to scientific and other research studies.

Tape library, CERN, Geneva 2

Image: “Tape library, CERN, Geneva 2″ by Cory Doctorow, CC-BY-SA.

This data isn’t just interesting in the context of our daily lives: it bears on many global challenges in the 21st century, such as supporting an aging population, food consumption and energy use.

Today, we rarely have access to these types of data, let alone the ability to reuse and share them, even when it’s my data, about just me. Who owns data about me, who controls it, who has access to it? Can I see data about me, can I get a copy of it in a form I could reuse or share, can I get value out of it? Would I even be allowed to publish openly some of the data about me, if I wanted to?

But how does this relate to open data? After all, a key tenet of our work at the Open Knowledge Foundation is that personal data should not be made open (for obvious privacy reasons)!

However there are, in fact, obvious points where “Open Data” and “My Data” connect:

  • MyData becomes Open Data (via transformation): Important datasets that are (or could be) open come from “my data” via aggregation, anonymisation and so on. Much statistical information ultimately comes from surveys of individuals, but the end results are heavily aggregated (for example, census data). This means “my data” is an important source but also that it is essential that the open data community have a good appreciation of the pitfalls and dangers here – e.g. when anonymisation or aggregation may fail to provide appropriate privacy.

  • MyData becomes Open Data (by individual choice): There may be people who want to share their individual, personal, data openly to benefit others. A cancer patient could be happy to share their medical information if that could assist with research into treatments and help others like them. Alternatively, perhaps I’m happy to open my household energy data and share it with my local community to enable us collectively to make sustainable energy choices. (Today, I can probably only see this data on the energy company’s website, remote, unhelpful, out of my control. I may not even be able to find out what I’m permitted to do with my data!)

  • The Right to Choose: if it’s my data, just about me, I should be able to choose to access it, reuse it, share it and open it if I wish. There is an obvious translation here of key Open Data principles to MyData. Where the Open Definition states that material should be freely available for use, reuse and redistribution by anyone, we could say that my data should be freely available for use, reuse and redistribution by me.

We think it is important to explore and develop these connections and issues. The Open Knowledge Foundation is therefore today launching an Open Data & MyData Working Group. Sign up here to participate:

This will be a place to discuss and explore how open data and personal data intersect. How can principles around openness inform approaches to personal data? What issues of privacy and anonymisation do we need to consider for datasets which may become openly published? Do we need “MyData Principles” that include the right of the individual to use, reuse and redistribute data about themselves if they so wish?

Appendix

There are plenty of challenging issues and questions around this topic. Here are a few:

Anonymization

Are big datasets actually anonymous? Anonymisation is incredibly hard. This isn’t a new problem (Ars Technica had a great overview in 2009), although it gets more challenging as more data becomes available, openly or otherwise: more data which can be cross-correlated means anonymisation is more easily breached.

Releasing Value

There’s a lot of value in personal data – Boston Consulting Group claim €1tn. But even BCG point out that this value can only be realised if the processes around personal data are more transparent. Perhaps we can aspire to more than transparency, and have some degree of personal control, too.

Governments

Governments are starting to offer some proposals here such as “MiData” in the UK. This is a good start but do they really serve the citizen?

There’s also some proposed legislation to drive companies to give consumers the right to see their data.

But is access enough?

The consumer doesn’t own their data (even when they have “MiData”-style access to it), so can they publish it under an open licence if they wish?

Whose data is it anyway?

Computers, phones, energy monitors in my home, and so on, aren’t all personal to me. They are used by friends and family. It’s hard to know whose data is involved in many cases. I might want privacy from others in my household, not just from anonymous corporations.

This gets even more complicated when we consider the public sphere – surveillance cameras and internet of things sensors are gathering data in public places, about groups of independent people. Can the people whose images or information are being captured access or control or share this data, and how can they collaborate on this? How can consent be secured in these situations? Do we have to accept that some information simply cannot be private in a networked world?

(Some of these issues were raised at the Open Internet of Things Assembly in 2012, which led to a draft declaration. The declaration doesn’t indicate the breadth of complex issues around data creation and processing which were hotly debated at the assembly.)

MyData Principles

We will need clear principles. Perhaps, just as the Open Definition has helped clarify and shape the open data space, we need analogous “MyData” Principles which set out how personal data should be handled. These could include, for example:

  • That my data should be made available to me in machine-readable bulk form
  • That I should have the right to use that data as I wish (including use, reuse and redistribution if I so wish).
  • That none of my data (where it contains personal information) should be made open without my full consent.

4 Ideas for Defending the Open Data Commons

Samuel Goëta - January 10, 2013 in Featured, Ideas and musings, OKF France, Open Data, Open Standards

The following post was written by Simon Chignard, author of L’Open data: Comprendre l’ouverture des données publiques. The post was originally posted on Simon’s blog following the launch of the Open Knowledge Foundation French national group, and has been translated by Samuel Goëta from OKFN France.

Open data and the commons: an old story?

Open Data Commons
There is a direct link between the open data movement and the philosophy of common goods. Open data are an illustration of the notion of common informational goods proposed by Elinor Ostrom, winner of the 2009 Nobel Prize for economics. Open data belong to everyone and, unlike water and air (and other common goods), they are non-exclusive: use by one person does not prevent use by others. If I reuse an open data set, this does not prevent other reusers from doing so. This proximity between the commons and open data is also suggested by the presence of the initiator of Creative Commons licences, Lawrence Lessig, at the 2007 Sebastopol meeting at which the concept of open data itself was defined.


But despite the strong conceptual and historical linkages, it seems that we, as actors of open data, are often shy to reaffirm the relationship. In our efforts to encourage public and private bodies to embrace open data, we seem almost embarrassed of this cornerstone philosophy. The four proposals I make here aim at one thing: not letting it drop!

Idea #1: defend a real choice in terms of open data licences (“pro-choice” approach)

On paper, that sounds clear: there is a real choice in France in terms of open data licences. On one side, the open licence offered by Etalab (the French government institution in charge of government open data); on the other side, the Open Database License (ODbL). Government services must use the former; some local authorities have chosen the latter, generally based on some conception of the relationship between the commons and open data.

In practice, this choice is hindered by the difficulties, real or perceived, of the ODbL licence. The two licences are distinguished by the ODbL’s obligation to share alike, which is clearly a product of a belief in the common pot (if I use it, I must recontribute). But a strange music is playing in France, which warns against this “contaminating” licence. ODbL is accused of being against business, coming “from abroad”, or being the source of unpredictable dangers (such as counterfeiting).


We find ourselves in a situation where, at the same moment as big projects such as Open Street Map are adopting ODbL, new entrants in open data apply – sometimes in good faith – the principle of least effort: “that share-alike thing seems complicated, we don’t really know the potential risks, I’d rather choose Licence Ouverte”.

As the initiator of the ODbL licence, the Open Knowledge Foundation should be its first promoter, explaining its mechanisms and opportunities (not only to block Google), so that a real choice of open data licences stays possible (the pro-choice approach)!

But the ODbL licence cannot by itself defend open data as part of the digital commons – below are three further tactics which need to be employed alongside it.

Idea #2: the General Interest Data, G.I.D.


Let’s take an example that matters to everyone, which was addressed during a recent workshop run by Net:Lab – access to housing. In France, who has the best knowledge of the housing market? Who knows rent prices in great detail and in real time, with an address and a complete description of the accommodation? Not the council, nor the tax services, nor even the housing minister – but a private actor in real estate ads.

In France, we have a law for personal data (the CNIL law) and another for public data (the CADA law). But what about data – personal, public or private – which serves the general interest? With a clearer and more dynamic vision of rents, one can imagine that everyone would be better informed about the real prices of the market (while making sure to limit the side effects of transparency).

Without demanding the requisition of the data (and of empty flats), one can imagine a digital tax system encouraging its release. There is already a tax break in France for research – why not for open data?
As mentioned previously, this would require the definition of a new class of data, the G.I.D. (General Interest Data), associated with specific rights of access and reuse.

(Obviously, G.I.D. raises as many questions as it tackles – for example who will define general interest?)

Idea #3: Contribution peering: I contribute/I receive

The first period of open data has seen public actors (local authorities or governments) release their data to users, mainly developers. The emerging open data movement is becoming infinitely richer and more complex. Although the division of roles between producers and re-users seems quite established, it is evolving: public and collaborative open data are starting to enrich each other, and companies are starting to deliver data about themselves back to clients. How can we design a contribution mechanism which takes these evolutions into account, so as to make “common pots”?


The first step I would suggest is “peering of contribution” – as already exists for boat positioning systems (AIS data). The collaborative website Marine Traffic, launched in 2007, is now the world’s leading website for tracking global ship traffic. More than 1,000 contributors (each equipped with an AIS receiver connected to the Internet) allow the daily tracking of 65,000 ships. The website now receives more than 2 million page views – per day (source: interview by S. Chignard with Dimitris Lekkas, the Greek scholar who developed the project). Everyone can visualise the data on the map displayed on the website, but if you wish to access the raw data, you need to contribute to the service by connecting a new AIS receiver. Hence contribution peering encourages everyone to enhance the service (Marine Traffic is not the only website doing this – see for example AIS Hub).

Idea #4: Contributive pricing on use (GET>POST)


The last suggestion I would like to make for the development and defence of an open data commons is pricing on use – an idea already mentioned in my blog post about transport data. This would involve a variable pricing scheme for the use of data. APIs lend themselves particularly well to this pricing method.

Let’s imagine, for example, that access to our G.I.D. is free for all, but that a contribution may be asked of the biggest users of an API who behave almost as free riders (in economic theory, those who make use of others’ contributions without ever contributing themselves). Everyone would then be free to choose whether to contribute by enhancing the data (updating, correcting) or by paying out of pocket!
