Why the Open Definition Matters for Open Data: Quality, Compatibility and Simplicity

The Open Definition performs an essential function as a “standard”, ensuring that when you say “open data” and I say “open data” we both mean the same thing. This standardization, in turn, ensures the quality, compatibility and simplicity essential to realizing one of the main practical benefits of “openness”: the greatly increased ability to combine different datasets together to drive innovation, insight and change.

Recent years have seen an explosion in the release of open data by dozens of governments including the G8. Recent estimates by McKinsey put the potential benefits of open data at over $100bn and others estimate benefits at more than 1% of global GDP.

However, these benefits are at significant risk both from quality-dilution and “open-washing”” (non-open data being passed off as open) as well as from fragmentation of the ecosystem as the proliferation of open licenses each with their own slightly different terms and conditions leads to incompatibility.

The Open Definition helps eliminates these risks and ensure we realize the full benefits of open. It acts as the “gold standard” for open content and data guaranteeing quality and preventing incompatibility.

This post explores in more detail why it’s important to have the Open Definition and the clear standard it provides for what “open” means in open data and open content.

Three Reasons

There are three main reasons why the Open Definition matters for open data:

Quality: open data should mean the freedom for anyone to access, modify and share that data. However, without a well-defined standard detailing what that means we could quickly see “open” being diluted as lots of people claim their data is “open” without actually providing the essential freedoms (for example, claiming data is open but actually requiring payment for commercial use). In this sense the Open Definition is about “quality control”.

Compatibility: without an agreed definition it becomes impossible to know if your “open” is the same as my “open”. This means we cannot know whether it’s OK to connect your open data and my open data together since the terms of use may, in fact, be incompatible (at the very least I’ll have to start consulting lawyers just to find out!). The Open Definition helps guarantee compatibility and thus the free ability to mix and combine different open datasets which is one of the key benefits that open data offers.

Simplicity: a big promise of open data is simplicity and ease of use. This is not just in the sense of not having to pay for the data itself, its about not having to hire a lawyer to read the license or contract, not having to think about what you can and can’t do and what it means for, say, your business or for your research. A clear, agreed definition ensures that you do not have to worry about complex limitations on how you can use and share open data.

Let’s flesh these out in a bit more detail:

Quality Control (avoiding “open-washing” and “dilution” of open)

A key promise of open data is that it can freely accessed and used. Without a clear definition of what exactly that means (e.g. used by whom, for what purpose) there is a risk of dilution especially as open data is attractive for data users. For example, you could quickly find people putting out what they call “open data” but only non-commercial organizations can access the data freely.

Thus, without good quality control we risk devaluing open data as a term and concept, as well as excluding key participants and fracturing the community (as we end up with competing and incompatible sets of “open” data).

Compatibility

A single piece of data on its own is rarely useful. Instead data becomes useful when connected or intermixed with other data. If I want to know about the risk of my home getting flooded I need to have geographic data about where my house is located relative to the river and I need to know how often the river floods (and how much).

That’s why “open data”, as defined by the Open Definition, isn’t just about the freedom to access a piece of data, but also about the freedom to connect or intermix that dataset with others.

Unfortunately, we cannot take compatibility for granted. Without a standard like the Open Definition it becomes impossible to know if your “open” is the same as my “open”. This means, in turn, that we cannot know whether it’s OK to connect (or mix) your open data and my open data together (without consulting lawyers!) – and it may turn out that we can’t because your open data license is incompatible with my open data license.

Think of power sockets around the world. Imagine if every electrical device had a different plug and needed a different power socket. When I came over to your house I’d need to bring an adapter! Thanks to standardization, at least in a given country, power-sockets are almost always the same – so I bring my laptop over to your house without a problem. However, when you travel abroad you may have to take adapter with you. What drives this is standardization (or its lack): within your own country everyone has standardized on the same socket type but different countries may not share a standard and hence you need to get an adapter (or run out of power!).

Whilst for power sockets, incompatibility may be a minor inconvenience, easily solved by buying a $10 adapter, for data it is huge issue. If the license for two datasets is incompatible it may by close to impossible to combine the. For example, if each dataset has many contributors, which is common for “open” projects where hundreds or thousands of volunteers have helped build the dataset, then one would need to get the agreement of all of those individual contributors – a huge task. Even where it is easier, where there is just one or a few “owners” for a dataset, resolving incompatibility may be very expensive both in terms of lawyers and in terms of payments.

For open data, the risk of incompatibility is growing as more open data is released and more and more open data publishers such as governments write their own “open data licenses” (with the potential for these different licenses to be mutually incompatible).

The Open Definition helps prevent incompatibility by: