

#OKStory

Heather Leson - July 9, 2014 in Events, Ideas and musings, Interviews, Network, OKFest, OKFestival, Open Knowledge Foundation

Everyone is a storyteller! We are just one week away from the big Open Brain Party that is OKFestival, and we need all the storytelling help you can muster. Trust us: from photos to videos to art to blogs to tweets – share away.

The Storytelling team is a community-driven project. We will work with all participants to decide which tasks are possible and which stories they want to cover. We remix together.

We’ve written up this summary of how to Storytell, some story ideas and suggested formats.

There are a few ways to join:

  • AT the Event: We will host an in-person planning meetup at the Science Fair on Tuesday, July 15th. Watch #okstory for details. Look for the folks with blue ribbons.
  • Digital Participants: Join in and add all your content with the #okfest14 @heatherleson #OKStory tags.
  • Share: Use the #okstory hashtag. Drop a line to heather.leson AT okfn dot org to get connected.

We highlighted some ways to storytell in this brief 20-minute chat:

The “right to be forgotten” – a threat to Transparency and Open Data?

Rufus Pollock - May 22, 2014 in Featured, Ideas and musings, Privacy

A recent European Court of Justice (ECJ) ruling may affect how privacy, transparency, and open data interact, and relates directly to the growing discussion about the “right to be forgotten”. Roughly summarized, the ruling finds that organisations which publish information may be obliged to “take down” and remove information when an individual requests its removal, even when the information is true and is a matter of “public record”.

This is potentially a significant change, adding to the work and responsibilities not just of big corporations like Google, but also to the creators of open databases big and small. The so-called “right to be forgotten” undoubtedly encapsulates a justified fear that lots of us have about our loss of personal privacy. However, this decision also appears to have the potential for significant (unintended) negative consequences for the publication and availability of key public interest information – the kind of information that is central to government and corporate accountability.

More discussion on this and related topics in the area of open data and privacy takes place in the Personal Data, Privacy and Open Data working group.


The Ruling and What it Means

The core of the case was the request by a citizen to have web pages about him dating from 1998 removed from online newspaper archives of La Vanguardia, and significantly, for the Google Search results linking to that article also to be removed.

Now the pages in question contained information that one would normally consider to be of reasonable “public record”; specifically, as summarized by the ECJ, they “contained an announcement for a real-estate auction organised following attachment proceedings for the recovery of social security debts owed [by the citizen]”.

The Spanish Data Protection Agency (AEPD), which handled this in the first instance, made what seemed a somewhat surprising ruling:

  • They rejected the complaint against La Vanguardia, taking the view that the information in question had been lawfully published by it.
  • But they upheld the complaint against Google and “requested those two companies [Google Spain and Google Inc] to take the necessary measures to withdraw the data from their index and to render access to the data impossible in the future.”

The ECJ (which opines on law not facts) essentially upheld the legal logic of AEPD’s decision, stating:

Court holds that the operator [e.g. Google] is, in certain circumstances, obliged to remove links to web pages that are published by third parties and contain information relating to a person from the list of results displayed following a search made on the basis of that person’s name. The Court makes it clear that such an obligation may also exist in a case where that name or information is not erased beforehand or simultaneously from those web pages, and even, as the case may be, when its publication in itself on those pages is lawful.

At first glance, this decision has some rather substantial implications, for example:

  • It imposes potentially very substantial obligations on those who collect and curate “public” (open) data and information. For example, to respond to requests to remove information (and to continue to track this going forward to ensure continuing compliance).
  • It appears to entitle individuals to request the take-down of information with a strong “public-interest” component. For example, imagine an online database providing information on corporate entities which may list the (true) fact that someone was a director of a company convicted of fraud. Would this ruling allow the director to request their removal?

What is especially noteworthy is that the decision appears to imply that even if the data comes from an official source (and is correct), a downstream collector or aggregator of that information may be required to remove it – even where the original source does not have to remove the information.

We should, of course, remember that any holder of information (whether an original source or an aggregator) has legal (and moral) obligations to remove content in a variety of circumstances. Most obviously, there is an obligation to remove if something is false or some private information has been mistakenly published. This already has implications for transparency and open data projects.

For example, in the OpenSpending project information is collected from official sources about government finances including (in the UK) details of individual spending transactions. It is possible that (by accident) the description of a published transaction could provide sensitive information about a person (for example, it could be a payment to social services regarding an abused child where the child’s name is listed). In such circumstances both the original source (the government data) and OpenSpending would have a responsibility to redact the personal information as quickly as possible.

However, the case discussed here concerned what one would normally consider “public-interest” information. Traditionally, society has accepted that transparency concerns trump privacy in a variety of public interest areas: for example, one should be able to find who are the directors of limited liability companies, or know the name of one’s elected representatives, or know who it is who was convicted of a crime (though we note that some countries have systems whereby an offender’s conviction is, after some period, expunged from the record).

This ruling appears to seriously undermine this, either in theory or in fact.

In particular, whilst a company like Google may dislike this ruling, they have the resources ultimately to comply (in fact it may be good for them as it will increase the barriers to entry!). But for open data projects this ruling creates substantial issues – for example, it now seems possible that open projects like Wikipedia, Poderopedia, OpenCorporates or even OpenSpending will have to deal with requests to remove information on the basis of infringing personal data protection, even though the information collected only derives from material published elsewhere and has a clear public interest component.

The everlasting memory of the internet, and the control of our personal data by corporations like Facebook and Google, undoubtedly present huge challenges to our rights to privacy and our very conception of the public/private divide. But we mustn’t let our justified concerns about ancient Facebook photos prejudicing our job prospects lead to knee-jerk reactions that will harm transparency and undermine the potential of open data.

More discussion on this and related topics in the area of open data and privacy takes place in the Personal Data, Privacy and Open Data working group.

Excerpt from the ECJ Summary

Excerpted from the ECJ Summary:

In 2010 Mario Costeja González, a Spanish national, lodged with the Agencia Española de Protección de Datos (Spanish Data Protection Agency, the AEPD) a complaint against La Vanguardia Ediciones SL (the publisher of a daily newspaper with a large circulation in Spain, in particular in Catalonia) and against Google Spain and Google Inc. Mr Costeja González contended that, when an internet user entered his name in the search engine of the Google group (‘Google Search’), the list of results would display links to two pages of La Vanguardia’s newspaper, of January and March 1998. Those pages in particular contained an announcement for a real-estate auction organised following attachment proceedings for the recovery of social security debts owed by Mr Costeja González.

With that complaint, Mr Costeja González requested, first, that La Vanguardia be required either to remove or alter the pages in question (so that the personal data relating to him no longer appeared) or to use certain tools made available by search engines in order to protect the data. Second, he requested that Google Spain or Google Inc. be required to remove or conceal the personal data relating to him so that the data no longer appeared in the search results and in the links to La Vanguardia. In this context, Mr Costeja González stated that the attachment proceedings concerning him had been fully resolved for a number of years and that reference to them was now entirely irrelevant.

The AEPD rejected the complaint against La Vanguardia, taking the view that the information in question had been lawfully published by it. On the other hand, the complaint was upheld as regards Google Spain and Google Inc. The AEPD requested those two companies to take the necessary measures to withdraw the data from their index and to render access to the data impossible in the future. Google Spain and Google Inc. brought two actions before the Audiencia Nacional (National High Court, Spain), claiming that the AEPD’s decision should be annulled. It is in this context that the Spanish court referred a series of questions to the Court of Justice.

[The ECJ then summarizes its interpretation. Basically Google can be treated as a data controller and ...]

… the Court holds that the operator is, in certain circumstances, obliged to remove links to web pages that are published by third parties and contain information relating to a person from the list of results displayed following a search made on the basis of that person’s name. The Court makes it clear that such an obligation may also exist in a case where that name or information is not erased beforehand or simultaneously from those web pages, and even, as the case may be, when its publication in itself on those pages is lawful.

Finally, in response to the question whether the directive enables the data subject to request that links to web pages be removed from such a list of results on the grounds that he wishes the information appearing on those pages relating to him personally to be ‘forgotten’ after a certain time, the Court holds that, if it is found, following a request by the data subject, that the inclusion of those links in the list is, at this point in time, incompatible with the directive, the links and information in the list of results must be erased.

Image: Forgotten by Stephen Nicholas, CC-BY-NC-SA

Open Data Privacy

Laura James - August 27, 2013 in Featured, Ideas and musings, Open Data, Open Data and My Data, Open Government Data, Privacy

“yes, the government should open other people’s data”

Traditionally, the Open Knowledge Foundation has worked to open non-personal data – things like publicly-funded research papers, government spending data, and so on. Where individual data was a part of some shared dataset, such as a census, great amounts of thought and effort had gone into ensuring that individual privacy was protected and that the aggregate data released was a shared, communal asset.

But times change. Increasing amounts of data are collected by governments and corporations, vast quantities of it about individuals (whether or not they realise that it is happening). The risks to privacy through data collection and sharing are probably greater than they have ever been. Data analytics – whether of “big” or “small” data – has the potential to provide unprecedented insight; however, some of that insight may come at the cost of personal privacy, as separate datasets are connected and correlated.

Image: Medical data loss dress

Both open data and big data are hot topics right now, and at such times it is tempting for organisations to get involved in such topics without necessarily thinking through all the issues. The intersection of big data and open data is somewhat worrying, as the temptation to combine the economic benefits of open data with the current growth potential of big data may lead to privacy concerns being disregarded. Privacy International are right to draw attention to this in their recent article on data for development, but of course other domains are affected too.

Today, we’d like to suggest some terms to help the growing discussion about open data and privacy.

Our Data is data with no personal element, and a clear sense of shared ownership. Some examples would be where the buses run in my city, what the government decides to spend my tax money on, how the national census is structured and the aggregate data resulting from it. At the Open Knowledge Foundation, our default position is that our data should be open data – it is a shared asset we can and should all benefit from.

My Data is information about me personally, where I am identified in some way, regardless of who collects it. It should not be made open or public by others without my direct permission – but it should be “open” to me (I should have access to data about me in a usable form, and the right to share it myself, however I wish, if I choose to do so).

Transformed Data is information about individuals, where some effort has been made to anonymise or aggregate the data to remove individually identifying elements.


We propose that there should be some clear steps which need to be followed to confirm whether transformed data can be published openly as our data. A set of privacy principles for open data, setting out the considerations that need to be made, would be a good start. These might include things like consulting key stakeholders, including representatives of whatever group(s) the data is about as well as data privacy experts, on how the data is transformed. For some datasets, it may not prove possible to transform them sufficiently that a reasonable level of privacy can be maintained for citizens; these datasets simply should not be opened up. For others, it may be that further work on transformation is needed to achieve an acceptable standard of privacy before the data is fit to be released openly. Ensuring the risks are considered and managed before data release is essential. If the transformations provide sufficient privacy for the individuals concerned, and the principles have been adhered to, the data can be released as open data.

We note that some of “our data” will have personal elements. For instance, members of parliament have made a positive choice to enter the public sphere, and some information about them is therefore necessarily available to citizens. Data of this type should still be considered against the principles of open data privacy we propose before publication, although the standards compared against may be different given the public interest.

This is part of a series of posts exploring the areas of open data and privacy, which we feel is a very important issue. If you are interested in these matters, or would like to help develop privacy principles for open data, join the working group mailing list. We’d welcome suggestions and thoughts on the mailing list or in the comments below, or talk to us and the Open Rights Group, who we are working with, at the Open Knowledge Conference and other events this autumn.

9 models to scale open data – past, present and future

Francis Irving - July 18, 2013 in Business, Featured, Ideas and musings, Open Data

Golden spiral, by Kakapo31 CC-BY-NC-SA

The possibilities of open data have been enthralling us for 10 years.

I came to it through wanting to make Government really usable, to build sites like TheyWorkForYou.

But that excitement isn’t what matters in the end.

What matters is scale – which organisational structures will make this movement explode?

Whether by creating self-growing volunteer communities, or by generating flows of money.

This post quickly and provocatively goes through some that haven’t worked (yet!) and some that have.

Ones that are working now

1) Form a community to enter in new data. Open Street Map (http://www.openstreetmap.org/) and MusicBrainz (http://musicbrainz.org/) are two big examples. It works because the community is the originator of the data. That said, neither has dominated its industry as much as I thought they would have by now.

2) Sell tools to an upstream generator of open data. This is what CKAN does for central Governments (and what the new ScraperWiki CKAN tool helps with). It’s what mySociety does when selling FixMyStreet installs to local councils, thereby publishing their potholes as RSS feeds.

3) Use open data (quietly). Every organisation does this and never talks about it. It’s key to quite old data resellers like Bloomberg. It is what most of ScraperWiki’s professional services customers ask us to do. The value to society is enormous and invisible. The big flaw is that it doesn’t help scale the supply of open data.

4) Sell tools to downstream users. This isn’t necessarily open data specific – existing software like spreadsheets and Business Intelligence can be used with open or closed data. Lots of open data is on the web, so tools like the new ScraperWiki, which work well with web data, are particularly suited to it.

Ones that haven’t worked

5) Collaborative curation. ScraperWiki started as an audacious attempt to create an open data curation community, based on editing scraping code in a wiki. In its original form (now called ScraperWiki Classic) this didn’t scale. Here are some reasons, in terms of open data models, why it didn’t.

a. It wasn’t upstream. Whatever provenance you give, people trust data most when they get it straight from its source. This can also be a partial upstream – for example, supplementing scraped data with new data manually gathered by telephone.

b. It isn’t in private. Although in theory there’s lots to gain by wrangling commodity data together in public, it goes against the instincts of most organisations.

c. There’s not enough existing culture. The free software movement built a rich culture of collaboration, ready to be exploited some 15 years in by the open source movement, and 25 years later by tools like Github. With a few exceptions, notably OpenCorporates, there aren’t yet open data curation projects.

6) General purpose data marketplaces, particularly ones that are mainly reusing open data, haven’t taken off. They might do one day; however, I think they need well-adopted higher level standards for data formatting and syncing first (perhaps something like dat, perhaps something based on CSV files).

Ones I expect more of in the future

These are quite exciting models which I expect to see a lot more of.

7) Give labour/money to upstream to help them create better data. This is quite new. The only, and most excellent, example of it is the UK’s National Archives curating the Statute Law Database (http://blog.okfn.org/2012/10/04/worlds-first-real-commercial-open-data-curation-project/). They do the work with the help of staff seconded from commercial legal publishers and other parts of Government.

It’s clever because it generates money for upstream, which people trust the most, and which has the most ability to improve data quality.

8) Viral open data licensing. MySQL made lots of money this way, offering proprietary dual licenses of GPL’d software to embedded systems makers. In data this could use OKFN’s Open Database License, and organisations would pay when they wanted to mix the open data with their own closed data. I don’t know anyone actively using it, although Chris Taggart from OpenCorporates mentioned this model to me years ago.

9) Corporations release data for strategic advantage. Companies are starting to release their own data for strategic gain (http://blog.okfn.org/2011/07/27/and-so-corporations-begin-to-open-data/). This is very new. Expect more of it.

What have I missed? What models do you see that will scale Open Data, and bring its benefits to billions?

Git (and Github) for Data

Rufus Pollock - July 2, 2013 in Featured, Ideas and musings, Open Data, Small Data, Technical

The ability to do “version control” for data is a big deal. There are various options, but one of the most attractive is to reuse existing tools for doing this with code, like git and mercurial. This post describes a simple “data pattern” for storing and versioning data using those tools – a pattern we’ve been using for some time and have found to be very effective.

Introduction

The ability to do revisioning and versioning of data – storing the changes made and sharing them with others – especially in a distributed way, would be a huge benefit to the (open) data community. I’ve discussed why at some length before (see also this earlier post) but to summarize:

  • It allows effective distributed collaboration – you can take my dataset, make changes, and share those back with me (and different people can do this at once!)
  • It allows one to track provenance better (i.e. what changes came from where)
  • It allows for sharing updates and synchronizing datasets in a simple, effective way – e.g. an automated way to get last month’s GDP or employment data without pulling the whole file again

There are several ways to address the “revision control for data” problem. The approach here is to get data in a form that means we can take existing, powerful distributed version control systems designed for code – like git and mercurial – and apply them to the data. As such, the best GitHub for data may, in fact, be GitHub (of course, you may want to layer data-specific interfaces on top of git(hub) – this is what we do with http://data.okfn.org/).

There are limitations to this approach and I discuss some of these and alternative models below. In particular, it’s best for “small (or even micro) data” – say, under 10Mb or 100k rows. (One alternative model can be found in the very interesting Dat project recently started by Max Ogden — with whom I’ve talked many times on this topic).

However, given the maturity and power of the tooling – and its likely evolution – and the fact that so much data is small we think this approach is very attractive.

The Pattern

The essence of the pattern is:

  1. Storing data as line-oriented text, specifically as CSV (comma-separated variable) files [1]. “Line-oriented text” just indicates that individual units of the data, such as a row of a table (or an individual cell), correspond to one line [2].

  2. Using a best-of-breed (code) versioning tool like git or mercurial to store and manage the data.

Line-oriented text is important because it enables the powerful distributed version control tools like git and mercurial to work effectively (this, in turn, is because those tools are built for code which is (usually) line-oriented text). It’s not just version control though: there is a large and mature set of tools for managing and manipulating these types of files (from grep to Excel!).
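To make the pattern concrete, here is a minimal sketch in Python of the two steps: write a small table as line-oriented CSV, then commit it with git. The file name, field names and commit message are placeholders, and it assumes git is installed and that you are happy to initialise a repository in the current directory.

```python
import csv
import subprocess

# A tiny table: one row of the data corresponds to one line of the CSV.
rows = [
    {"country": "GB", "year": 2012, "gdp": 2.6},
    {"country": "FR", "year": 2012, "gdp": 2.7},
]

# 1. Store the data as line-oriented text (CSV).
with open("data.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["country", "year", "gdp"])
    writer.writeheader()
    writer.writerows(rows)

# 2. Use a code versioning tool (git here) to store and manage it.
subprocess.run(["git", "init"], check=True)
subprocess.run(["git", "add", "data.csv"], check=True)
subprocess.run(["git", "commit", "-m", "Add GDP sample data"], check=True)
```

From here, every subsequent edit to data.csv is just another commit, giving diffs and history for free.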

In addition to the basic pattern, there are a few optional extras you can add:

  • Store the data in GitHub (or Gitorious or Bitbucket or …) – all the examples below follow this approach
  • Turn the collection of data into a Simple Data Format data package by adding a datapackage.json file which provides a small set of essential information like the license, sources, and schema (this column is a number, this one is a string) – see the sketch after this list
  • Add the scripts you used to process and manage data — that way everything is nicely together in one repository
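As referenced in the list above, here is a rough sketch of what such a datapackage.json might contain. The fields shown (name, license, sources, resources, schema) follow the general shape described in the text, but treat the exact structure as illustrative rather than a definitive rendering of the spec.

```python
import json

# An illustrative datapackage.json: license, sources, and a simple schema
# saying which columns are numbers and which are strings.
datapackage = {
    "name": "gdp-sample",
    "license": "ODC-PDDL-1.0",
    "sources": [{"name": "Example statistics office", "web": "http://example.org"}],
    "resources": [
        {
            "path": "data.csv",
            "schema": {
                "fields": [
                    {"name": "country", "type": "string"},
                    {"name": "year", "type": "integer"},
                    {"name": "gdp", "type": "number"},
                ]
            },
        }
    ],
}

with open("datapackage.json", "w") as f:
    json.dump(datapackage, f, indent=2)
```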

What’s good about this approach?

The set of tools that exists for managing and manipulating line-oriented files is huge and mature. In particular, powerful distributed version control systems like git and mercurial are already extremely robust ways to do distributed, peer-to-peer collaboration around code, and this pattern takes that model and makes it applicable to data. Here are some concrete examples of why it’s good.

Provenance tracking

Git and mercurial provide a complete history of individual contributions with “simple” provenance via commit messages and diffs.

Example of commit messages

Peer-to-peer collaboration

Forking and pulling data allows independent contributors to work on it simultaneously.

Timeline of pull requests

Data review

By using git or mercurial, tools for code review can be repurposed for data review.

Pull screen

Simple packaging

The repo model provides a simple way to store data, code, and metadata in a single place.

A repo for data

Accessibility

This method of storing and versioning data is very low-tech. The format and tools are both very mature and ubiquitous. For example, every spreadsheet and every relational database can handle CSV. Every unix platform has a suite of tools like grep, sed and cut that can be used on these kinds of files.
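To illustrate how low-tech this is, the following sketch does a grep/cut-style operation on such a CSV using only the Python standard library; the file and column names are the hypothetical ones from the earlier sketch.

```python
import csv

# "cut" two columns and "grep" for rows matching a condition.
with open("data.csv", newline="") as f:
    for row in csv.DictReader(f):
        if row["country"] == "GB":          # grep-like row filter
            print(row["year"], row["gdp"])  # cut-like column selection
```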

Examples

We’ve been using this approach for a long time: in 2005 we first stored CSVs in Subversion, then in Mercurial, and then when we switched to git (and GitHub) 3 years ago we started storing them there. In 2011 we started the datasets organization on GitHub, which contains a whole list of datasets managed according to the pattern above. Here are a couple of specific examples:

Note: most of these examples not only show CSVs being managed in GitHub but are also Simple Data Format data packages – see the datapackage.json they contain.


Appendix

Limitations and Alternatives

Line-oriented text and its tools are, of course, far from perfect solutions to data storage and versioning. They will not work for datasets of every shape and size, and in some respects they are awkward tools for tracking and merging changes to tabular data. For example:

  • Simple actions on data stored as line-oriented text can lead to a very large changeset. For example, swapping the order of two fields (= columns) leads to a change in every single line. Given that diffs, merges, etc. are line-oriented, this is unfortunate [3] – see the sketch after this list.
  • It works best for smallish data (e.g. < 100k rows, < 50mb files, optimally < 5mb files). git and mercurial don’t handle big files that well, and features like diffs get more cumbersome with larger files [4].
  • It works best for data made up of lots of similar records, ideally tabular data. In order for line-oriented storage and tools to be appropriate, you need the record structure of the data to fit with the CSV line-oriented structure. The pattern is less good if your CSV is not very line-oriented (e.g. you have a lot of fields with line breaks in them), causing problems for diff and merge.
  • CSV lacks a lot of information, e.g. information on the types of fields (everything is a string). There is no way to add metadata to a CSV without compromising its simplicity or making it no longer usable as pure data. You can, however, add this kind of information in a separate file, and this is exactly what the Data Package standard provides with its datapackage.json file.
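As flagged in the first point above, here is a small self-contained illustration of how a purely cosmetic change – swapping two columns – shows up as a change to every line in a line-oriented diff. The data is made up.

```python
import difflib

# Version 1: columns (a, b); Version 2: the same data with the columns swapped.
v1 = ["a,b", "1,x", "2,y", "3,z"]
v2 = ["b,a", "x,1", "y,2", "z,3"]

diff = list(difflib.unified_diff(v1, v2, lineterm=""))
print("\n".join(diff))
# Every data line appears as a removal plus an addition,
# even though only the column order changed.
```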

The most fundamental limitations above all arise from applying line-oriented diffs and merges to structured data whose atomic unit is not a line (it’s a cell, or a transform of some kind like swapping two columns).

The first issue discussed above, where a simple change to a table is treated as a change to every line of the file, is a clear example. In a perfect world, we’d have both a convenient structure and a whole set of robust tools to support it, e.g. tools that recognize swapping two columns of a CSV as a single, simple change or that work at the level of individual cells.

Fundamentally, a revision system is built around a diff format and a merge protocol. Get these right and much of the rest follows. The three basic options you have are:

  • Serialize to line-oriented text and use the great tools like git (what we’ve described above)
  • Identify atomic structure (e.g. document) and apply diff at that level (think CouchDB or standard copy-on-write for RDBMS at row level) – see the sketch after this list
  • Record transforms (e.g. Refine)
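For contrast with the line-oriented approach, here is a rough sketch of what option (2) can look like: key each record by an identifier and compute the diff at the record level, so that reordering fields (or rows) no longer registers as a change. The records and keys are invented for the example.

```python
# Two versions of a dataset, keyed by an identifier rather than by line.
old = {
    "gb": {"name": "United Kingdom", "gdp": 2.6},
    "fr": {"name": "France", "gdp": 2.7},
}
new = {
    "fr": {"gdp": 2.7, "name": "France"},          # same record, fields reordered
    "gb": {"name": "United Kingdom", "gdp": 2.8},  # value actually changed
    "de": {"name": "Germany", "gdp": 3.4},         # record added
}

added = {k: new[k] for k in new.keys() - old.keys()}
removed = {k: old[k] for k in old.keys() - new.keys()}
changed = {k: (old[k], new[k]) for k in old.keys() & new.keys() if old[k] != new[k]}

print("added:", added)
print("removed:", removed)
print("changed:", changed)  # only 'gb' shows up; 'fr' is untouched
```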

At the Open Knowledge Foundation we built a system along the lines of (2) and have been involved in exploring and researching both (2) and (3) – see changes and syncing for data on dataprotocols.org. These options are definitely worth exploring – and, for example, Max Ogden, with whom I’ve had many great discussions on this topic, is currently working on an exciting project called Dat, a collaborative data tool which will use the “sleep” protocol.

However, our experience so far is that the line-oriented approach beats any currently available options along those other lines (at least for smaller sized files!).

data.okfn.org

Having already been storing data in github like this for several years, we recently launched http://data.okfn.org/ which is explicitly based on this approach:

  • Data is CSV stored in git repos on GitHub at https://github.com/datasets
  • All datasets are data packages with datapackage.json metadata
  • The frontend site is ultra-simple – it just provides a catalog and API and pulls data directly from GitHub (see the sketch below)
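As a sketch of what pulling data directly from GitHub can look like from the consumer side, the snippet below fetches a datapackage.json and its CSV over HTTP and loads the rows. The repository name and file paths are hypothetical placeholders; substitute a real data package from https://github.com/datasets.

```python
import csv
import io
import json
import urllib.request

# Hypothetical raw URL for one data package; replace with a real repository
# under https://github.com/datasets and its actual file paths.
BASE = "https://raw.githubusercontent.com/datasets/example-package/master/"

with urllib.request.urlopen(BASE + "datapackage.json") as resp:
    package = json.loads(resp.read().decode("utf-8"))

csv_path = package["resources"][0]["path"]  # assumes a "path" field as in the earlier sketch
with urllib.request.urlopen(BASE + csv_path) as resp:
    text = resp.read().decode("utf-8")

rows = list(csv.DictReader(io.StringIO(text)))
print(len(rows), "rows loaded")
```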

Why line-oriented

Line-oriented text is the natural form of code and so is supported by a huge number of excellent tools. But line-oriented text is also the simplest and most parsimonious form for storing general record-oriented data—and most data can be turned into records.

At its most basic, structured data requires a delimiter for fields and a delimiter for records. Comma- or tab-separated values (CSV, TSV) files are a very simple and natural implementation of this encoding. They delimit records with the most natural separation character besides the space, the line break. For a field delimiter, since spaces are too common in values to be appropriate, they naturally resort to commas or tabs.

Version control systems require an atomic unit to operate on. A versioning system for data can quite usefully treat records as the atomic units. Using line-oriented text as the encoding for record-oriented data automatically gives us a record-oriented versioning system in the form of existing tools built for versioning code.


  1. Note that, by CSV, we really mean “DSV”, as the delimiter in the file does not have to be a comma. However, the row terminator should be a line break (or a line break plus carriage return). 

  2. CSVs do not always have one row to one line (it is possible to have line-breaks in a field with quoting). However, most CSVs are one-row-to-one-line. CSVs are pretty much the simplest possible structured data format you can have. 

  3. As a concrete example, the merge function will probably work quite well in reconciling two sets of changes that affect different sets of records, hence lines. Two sets of changes which each move a column will not merge well, however. 

  4. For larger data, we suggest swapping out git (and e.g. GitHub) for simple file storage like s3. Note that s3 can support basic copy-on-write versioning. However, being copy-on-write, it is comparatively very inefficient. 

Follow the Money, Follow the Data

Martin Tisne - May 3, 2013 in Ideas and musings, Open Data, Open Government Data, Open Spending


The following guest post from Martin Tisné was first published on his personal blog.

Money tunnel by RambergMediaImages, CC-BY-SA on Flickr

Some thoughts which I hope may be helpful in advance of the ‘follow the data‘ hack day this weekend:

The open data sector has quite successfully focused on socially-relevant information: fixing potholes a la http://www.fixmystreet.com/, adopting fire hydrants a la http://adoptahydrant.org/. My sense is that the next frontier will be to free the data that can enable citizens, NGOs and journalists to hold their governments to account. What this will likely mean is engaging in issues such as data on extractives’ transparency, government contracting, political finance, budgeting etc. So far, these are not the bread and butter of the open data movement (which isn’t to say there aren’t great initiatives like http://openspending.org/). But they should be:

At its heart, this agenda revolves around ‘following the money’. Without knowing the ‘total resource flow’:

  • Parents’ associations cannot question the lack of textbooks in their schools by interrogating the school’s budget
  • Healthcare groups cannot access data related to local spending on doctors and nurses
  • Great orgs such as Open Knowledge Foundation or BudgIT cannot get the data they need for their interpretative tools (e.g. budget tracking tool)
  • Investigative journalists cannot access the data they need to pursue a story

Our field has sought to ‘follow the money’ for over two decades, but in practice we still lack the fundamental ability to trace funding flows from A to Z, across the revenue chain. We should be able to get to what aid transparency experts call ‘traceability’ (the ability to trace aid funds from the donor down to the project level) for all, or at least most, fiscal flows.

Open data enables this to happen. This is exciting: it’s about enabling follow the money to happen at scale. Up until now, instances of ‘following the money’ have been the fruit of the hard work of investigative journalists, in isolated instances.

If we can ensure that data on revenues (extractives, aid, tax etc), expenditures (from planning to allocation to spending to auditing), and results (service delivery data) is timely, accessible, comparable and comprehensive, we will have gone a long way to helping ‘follow the money’ efforts reach the scale they deserve.

Follow the Money is a pretty tangible concept (if you disagree, please let me know!) – it helps demonstrate how government funds buy specific outcomes, and how/whether resources are siphoned away. We need to now make it a reality.

Open Knowledge: much more than open data

Laura James - May 1, 2013 in Featured, Ideas and musings, Join us, Open Data, Open Knowledge Foundation, Our Work


Book, Ball and Chain

We’ve often used “open knowledge” simply as a broad term to cover any kind of open data or content from statistics to sonnets, and more. However, there is another deeper, and far more important, reason why we are the “Open Knowledge” Foundation and not, for example, the “Open Data” Foundation. It’s because knowledge is something much more than data.

Open knowledge is what open data becomes when it’s useful, usable and used. At the Open Knowledge Foundation we believe in open knowledge: not just that data is open and can be freely used, but that it is made useful – accessible, understandable, meaningful, and able to help someone solve a real problem. Open knowledge should be empowering – it should enable citizens and organizations to understand the world, create insight and effect positive change.

It’s because open knowledge is much more than just raw data that we work both to have raw data and information opened up (by advocating and campaigning) and to create the tools that turn that raw material into knowledge people can act upon. For example, we build technical tools – open source software to help people work with data – and we create handbooks which help people acquire the skills they need to do so. This combination, that we are both evangelists and makers, is extremely powerful in helping us change the world.

Achieving our vision of a world transformed through open knowledge, a world where a vibrant open knowledge commons empowers citizens and enables fair and sustainable societies, is a big challenge. We firmly believe it can be done, with a global network of amazing people and organisations fighting for openness and making tools and more to support the open knowledge ecosystem, although it’s going to take a while!

We at the Open Knowledge Foundation are committed to this vision of a global movement building an open knowledge ecosystem, and we are here for the long term. We’d love you to join us in improving the world through open knowledge; there will be many different ways you can help coming up during the months ahead, so get started now by keeping in touch – by signing up to receive our Newsletter, or finding a local group or meetup near you.


What Do We Mean By Small Data

Rufus Pollock - April 26, 2013 in Featured, Ideas and musings, Labs, Open Data, Small Data

Earlier this week we published the first in a series of posts on small data: “Forget Big Data, Small Data is the Real Revolution”. In this second in the series, we discuss small data in more detail providing a rough definition and drawing parallels with the history of computers and software.

What do we mean by “small data”? Let’s define it crudely as:

“Small data is the amount of data you can conveniently store and process on a single machine, and in particular, a high-end laptop or server”

Why a laptop? What’s interesting (and new) right now is the democratisation of data and the associated possibility of a large-scale, distributed community of data wranglers working collaboratively. What matters here then is, crudely, the amount of data that an average data geek can handle on their own machine, their own laptop.

A key point is that the dramatic advances in computing, storage and bandwidth have far bigger implications for “small data” than for “big data”. The recent advances have increased the realm of small data – the kind of data that an individual can handle on their own hardware – far more, relatively, than they have increased the realm of “big data”. Suddenly working with significant datasets – datasets containing tens of thousands, hundreds of thousands or millions of rows – can be a mass-participation activity.

(As should be clear from the above definition – and any recent history of computing – small (and big) are relative terms that change as technology advances. For example, in 1994 a terabyte of storage cost several hundred thousand dollars; today it’s under a hundred. This also means today’s big is tomorrow’s small.)

Our situation today is similar to microcomputers in the late 70s and early 80s, or the Internet in the 90s. When microcomputers first arrived, they seemed puny in comparison to the “big” computing and “big” software then around, and there was nothing strictly they could do that existing computing could not. However, they were revolutionary in one fundamental way: they made computing a mass-participation activity. Similarly, the Internet was not new in the 1990s – it had been around in various forms for several decades – but it was at that point that it became available at mass scale to the average developer (and ultimately citizen). In both cases “big” kept on advancing too – be it supercomputers or high-end connectivity – but the revolution came from “small”.

This (small) data revolution is just beginning. The tools and infrastructure to enable effective collaboration and rapid scaling for small data are in their infancy, and the communities with the capacities and skills to use small data are in their early stages. Want to get involved in driving the small data revolution forward? Sign up now.


This is the second in a series of posts about the power of Small Data – follow the Open Knowledge Foundation blog, Twitter or Facebook to learn more and join the debate at #SmallData on Twitter.

Frictionless Data: making it radically easier to get stuff done with data

Rufus Pollock - April 24, 2013 in Featured, Ideas and musings, Labs, Open Data, Open Standards, Small Data, Technical

Frictionless Data is now in alpha at http://data.okfn.org/ – and we’d like you to get involved.

Our mission is to make it radically easier to make data used and useful – our immediate goal is to make it as simple as possible to get the data you want into the tool of your choice.

This isn’t about building a big datastore or a data management system – it’s simply saving people from repeating all the same tasks of discovering a dataset, getting it into a format they can use, cleaning it up – all before they can do anything useful with it! If you’ve ever spent the first half of a hackday just tidying up tabular data and getting it ready to use, Frictionless Data is for you.

Our work is based on a few key principles:

  • Narrow focus — improve one small part of the data chain; standards and tools are limited in scope and size
  • Build for the web – use formats that are web “native” (JSON) and work naturally with HTTP (plain text; CSV is streamable, etc. – see the sketch after this list)
  • Distributed not centralised — designed for a distributed ecosystem (no centralized, single point of failure or dependence)
  • Work with existing tools — don’t expect people to come to you; make this work with their tools and their workflows (almost everyone in the world can open a CSV file, and every language can handle CSV and JSON)
  • Simplicity (but sufficiency) — use the simplest formats possible and do the minimum in terms of metadata, but be sufficient in terms of schemas and structure for tools to be effective
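The “CSV is streamable” point in the second principle can be illustrated with a small sketch: because the format is line-oriented plain text, a consumer can process a remote CSV as it arrives over HTTP and stop early without downloading the whole file. The URL below is a placeholder.

```python
import csv
import io
import urllib.request

URL = "https://example.org/some-large-table.csv"  # placeholder URL

# Read the CSV as it streams in over HTTP and stop after a few records.
with urllib.request.urlopen(URL) as resp:
    reader = csv.reader(io.TextIOWrapper(resp, encoding="utf-8"))
    header = next(reader)
    for i, row in enumerate(reader):
        print(dict(zip(header, row)))
        if i >= 4:  # only the first five data rows are ever read
            break
```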

We believe that making it easy to get and use data and especially open data is central to creating a more connected digital data ecosystem and accelerating the creation of social and commercial value. This project is about reducing friction in getting, using and connecting data, making it radically easier to get data you need into the tool of your choice. Frictionless Data distills much of our learning over the last 7 years into some specific standards and infrastructure.

What’s the Problem?

Today, when you decide to cook, the ingredients are readily available at local supermarkets or even already in your kitchen. You don’t need to travel to a farm, collect eggs, mill the corn, cure the bacon etc – as you once would have done! Instead, thanks to standard systems of measurement, packaging, shipping (e.g. containerization) and payment, ingredients can get from the farm direct to your local shop or even your door.

But with data we’re still largely stuck at this early stage: every time you want to do an analysis or build an app you have to set off around the internet to dig up data, extract it, clean it and prepare it before you can even get it into your tool and begin your work proper.

What do we need to do for working with data to be like cooking today – where you get to spend your time making the cake (creating insights), not preparing and collecting the ingredients (digging up and cleaning data)?

The answer: radical improvements in the “logistics” of data, associated with specialisation and standardisation. By analogy with food, we need standard systems of “measurement”, packaging, and transport so that it’s easy to get data from its original source into the application where you can start working with it.

Frictionless Data idea

What’s Frictionless Data going to do?

We start with an advantage: unlike for physical goods, transporting digital information from one computer to another is very cheap! This means the focus can be on standardizing and simplifying the process of getting data from one application to another (or one form to another). We propose work in 3 related areas:

  • Key simple standards. For example, a standardized “packaging” of data that makes it easy to transport and use (think of the “containerization” revolution in shipping)
  • Simple tooling and integration – you should be able to get data in these standard formats into or out of Excel, R, Hadoop or whatever tool you use
  • Bootstrapping the system with essential data – we need to get the ball rolling

frictionless data components diagram

What’s Frictionless Data today?

1. Data

We have some exemplar datasets which are useful for a lot of people – these are:

  • High Quality & Reliable

    • We have sourced, normalized and quality checked a set of key reference datasets such as country codes, currencies, GDP and population.
  • Standard Form & Bulk Access

    • All the datasets are provided in a standardized form and can be accessed in bulk as CSV together with a simple JSON schema.
  • Versioned & Packaged

    • All data is in data packages and is versioned using git so all changes are visible and data can be collaboratively maintained.

2. Standards

We have two simple data package formats, described as ultra-lightweight, RFC-style specifications. They build heavily on prior work. Simplicity and practicality were guiding design criteria.

Frictionless Data: package standard diagram

Data package: minimal wrapping, agnostic about the data it’s “packaging”, designed for extension. This flexibility is good as it can be used as a transport for pretty much any kind of data, but it also limits integration and tooling. Read the full Data Package specification.

Simple data format (SDF): focuses on tabular data only and extends data package (data in simple data format is a data package) by requiring data to be “good” CSVs and the provision of a simple JSON-based schema to describe them (“JSON Table Schema”). Read the full Simple Data Format specification.

3. Tools

It’s early days for Frictionless Data, so we’re still working on this bit! But there’s a need for validators, schema generators, and all kinds of integration. You can help out – see below for details or check out the issues on github.
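To give a flavour of what such tooling involves, here is a toy validator sketch: it checks that the header of each CSV matches the fields declared in the accompanying datapackage.json. It assumes the illustrative file layout used in the sketches earlier on this page and is not the official tooling.

```python
import csv
import json

def validate(package_path="datapackage.json"):
    """Toy check: each CSV header must match the declared schema fields."""
    with open(package_path) as f:
        package = json.load(f)

    for resource in package.get("resources", []):
        declared = [field["name"] for field in resource["schema"]["fields"]]
        with open(resource["path"], newline="") as f:
            header = next(csv.reader(f))
        if header != declared:
            print(f"{resource['path']}: header {header} != schema {declared}")
        else:
            print(f"{resource['path']}: OK")

if __name__ == "__main__":
    validate()
```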

Doesn’t this already exist?

People have been working on data for a while – doesn’t something like this already exist? The crude answer is yes and no. People, including folks here at the Open Knowledge Foundation, have been working on this for quite some time, and there are already some parts of the solution out there. Furthermore, many of these ideas are directly borrowed from similar work in software. For example, the Data Packages spec (first version in 2007!) builds heavily on packaging projects and specifications like Debian and CommonJS.

Key distinguishing features of Frictionless Data:

  • Ultra-simplicity – we want to keep things as simple as they possibly can be. This includes formats (JSON and CSV) and a focus on end-user tool integration, so people can just get the data they want into the tool they want and move on to the real task
  • Web orientation – we want an approach that fits naturally with the web
  • Focus on integration with existing tools
  • Distributed and not tied to a given tool or project – this is not about creating a central data marketplace or similar setup. It’s about creating a basic framework that would enable anyone to publish and use datasets more easily and without going through a central broker.

Many of these are shared with (and derive from) other approaches but as a whole we believe this provides an especially powerful setup.

Get Involved

This is a community-run project coordinated by the Open Knowledge Foundation as part of Open Knowledge Foundation Labs. Please get involved:


  • Spread the word! Frictionless Data is a key part of the real data revolution – follow the debate on #SmallData and share our posts so more people can get involved

Forget Big Data, Small Data is the Real Revolution

Rufus Pollock - April 22, 2013 in Featured, Ideas and musings, Labs, Open Data, Small Data

There is a lot of talk about “big data” at the moment. For example, this is Big Data Week, which will see events about big data in dozens of cities around the world. But the discussions around big data miss a much bigger and more important picture: the real opportunity is not big data, but small data. Not centralized “big iron”, but decentralized data wrangling. Not “one ring to rule them all” but “small pieces loosely joined”.

Big data smacks of the centralization fads we’ve seen in each computing era. The thought that ‘hey, there’s more data than we can process!’ (something which has no doubt been true year-on-year since computing began) is dressed up as the latest trend with associated technology must-haves.

Meanwhile we risk overlooking the much more important story here, the real revolution, which is the mass democratisation of the means of access, storage and processing of data. This story isn’t about large organisations running parallel software on tens of thousands of servers, but about more people than ever being able to collaborate effectively around a distributed ecosystem of information, an ecosystem of small data.

Just as we now find it ludicrous to talk of “big software” – as if size in itself were a measure of value – we should, and will one day, find it equally odd to talk of “big data”. Size in itself doesn’t matter – what matters is having the data, of whatever size, that helps us solve a problem or address the question we have.

For many problems and questions, small data in itself is enough. The data on my household energy use, the times of local buses, government spending – these are all small data. Everything processed in Excel is small data. When Hans Rosling shows us how to understand our world through population change or literacy he’s doing it with small data.

And when we want to scale up, the way to do that is through componentized small data: by creating and integrating small data “packages”, not building big data monoliths, and by partitioning problems in a way that works across people and organizations, not through creating massive centralized silos.

This next decade belongs to distributed models not centralized ones, to collaboration not control, and to small data not big data.

Want to create the real data revolution? Come join our community creating the tools and materials to make it happen — sign up here:


This is the first in a series of posts about the power of Small Data – follow the Open Knowledge Foundation blog, Twitter or Facebook to learn more and join the debate at #SmallData on Twitter.

Further Reading

  • Nobody ever got fired for buying a cluster
    • Even at enterprises like Microsoft and Yahoo most jobs could run on a single machine. For example, the median job size at Microsoft is 14GB and 80% of jobs are less than 1TB; at Yahoo the estimated median job size is 12GB.
    • “Ananthanarayanan et al. show that Facebook jobs follow a power-law distribution with small jobs dominating; from their graphs it appears that at least 90% of the jobs have input sizes under 100 GB,” the paper states. “Chen et al. present a detailed study of Hadoop workloads for Facebook as well as 5 Cloudera customers. Their graphs also show that a very small minority of jobs achieves terabyte scale or larger and the paper claims explicitly that ‘most jobs have input, shuffle, and output sizes in the MB to GB range.’”
  • PACMan: Coordinated Memory Caching for Parallel Jobs – Ganesh Ananthanarayanan, Ali Ghodsi, Andrew Wang, Dhruba Borthakur, Srikanth Kandula, Scott Shenker, Ion Stoica
