
Forget Big Data, Small Data is the Real Revolution

Rufus Pollock - April 22, 2013 in Featured, Ideas and musings, Labs, Open Data, Small Data

This is the first in a series of posts. The next post in the series is What Do We Mean by Small Data?

There is a lot of talk about “big data” at the moment. For example, this is Big Data Week, which will see events about big data in dozens of cities around the world. But the discussions around big data miss a much bigger and more important picture: the real opportunity is not big data, but small data. Not centralized “big iron”, but decentralized data wrangling. Not “one ring to rule them all” but “small pieces loosely joined”.

Big data smacks of the centralization fads we’ve seen in each computing era. The thought that ‘hey there’s more data than we can process!’ (something which is no doubt always true year-on-year since computing began) is dressed up as the latest trend with associated technology must-haves.

Meanwhile we risk overlooking the much more important story here, the real revolution, which is the mass democratisation of the means of access, storage and processing of data. This story isn’t about large organisations running parallel software on tens of thousands of servers, but about more people than ever being able to collaborate effectively around a distributed ecosystem of information, an ecosystem of small data.

Just as we now find it ludicrous to talk of “big software” – as if size in itself were a measure of value – we should, and will one day, find it equally odd to talk of “big data”. Size in itself doesn’t matter – what matters is having the data, of whatever size, that helps us solve a problem or address the question we have.

For many problems and questions, small data in itself is enough. The data on my household energy use, the times of local buses, government spending – these are all small data. Everything processed in Excel is small data. When Hans Rosling shows us how to understand our world through population change or literacy he’s doing it with small data.

And when we want to scale up the way to do that is through componentized small data: by creating and integrating small data “packages” not building big data monoliths, by partitioning problems in a way that works across people and organizations, not through creating massive centralized silos.
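The idea of componentized small data “packages” can be made concrete with a small sketch, loosely following the Open Knowledge Foundation’s own Data Package approach (a `datapackage.json` descriptor alongside plain CSV files). The file names, fields and data below are illustrative, not a full implementation of the spec.

```python
import csv
import json
import pathlib
import tempfile

# A minimal "small data" package: one CSV resource plus a JSON descriptor.
base = pathlib.Path(tempfile.mkdtemp())

rows = [{"month": "2013-01", "kwh": 312}, {"month": "2013-02", "kwh": 280}]
with open(base / "energy.csv", "w", newline="") as f:
    w = csv.DictWriter(f, fieldnames=["month", "kwh"])
    w.writeheader()
    w.writerows(rows)

descriptor = {
    "name": "household-energy",
    "title": "Monthly household energy use",
    "resources": [{"path": "energy.csv", "format": "csv"}],
}
(base / "datapackage.json").write_text(json.dumps(descriptor, indent=2))

# Any consumer can now discover and load the data via the descriptor alone,
# which is what makes small packages easy to join into larger ecosystems.
meta = json.loads((base / "datapackage.json").read_text())
with open(base / meta["resources"][0]["path"]) as f:
    loaded = list(csv.DictReader(f))
print(loaded[0]["kwh"])  # prints 312
```

Because each package is self-describing, integration happens at the level of descriptors rather than inside one central silo.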

This next decade belongs to distributed models not centralized ones, to collaboration not control, and to small data not big data.

Want to create the real data revolution? Come join our community creating the tools and materials to make it happen — sign up here:

This is the first in a series of posts about the power of Small Data – follow the Open Knowledge Foundation blog, Twitter or Facebook to learn more and join the debate at #SmallData on Twitter.

Further Reading

  • Nobody ever got fired for buying a cluster
    • Even at enterprises like Microsoft and Yahoo, most jobs could run on a single machine: the median job size at Microsoft is 14GB and 80% of jobs are smaller than 1TB, while the estimated median job size at Yahoo is 12GB.
    • Ananthanarayanan et al. show that Facebook jobs follow a power-law distribution, with small jobs dominating; from their graphs it appears that at least 90% of jobs have input sizes under 100GB. Chen et al. present a detailed study of Hadoop workloads for Facebook as well as 5 Cloudera customers; their graphs likewise show that only a very small minority of jobs reaches terabyte scale, and the paper states explicitly that “most jobs have input, shuffle, and output sizes in the MB to GB range”.
  • PACMan: Coordinated Memory Caching for Parallel Jobs – Ganesh Ananthanarayanan, Ali Ghodsi, Andrew Wang, Dhruba Borthakur, Srikanth Kandula, Scott Shenker, Ion Stoica

Open Data & My Data

Laura James - February 22, 2013 in Featured, Ideas and musings, Open Data, Working Groups

The Open Knowledge Foundation believes in open knowledge: not just that some data is open and freely usable, but that it is useful – accessible, understandable, meaningful, and able to help someone solve a real problem.

A lot of the data which could help me improve my life is data about me – “MyData” if you like. Many of the most interesting questions and problems we have involve personal data of some kind. This data might be gathered directly by me (using my own equipment or commercial services), or it could be harvested by corporations from what I do online, or assembled by public sector services I use, or voluntarily contributed to scientific and other research studies.

Image: “Tape library, CERN, Geneva 2” by Cory Doctorow, CC-BY-SA.

This data isn’t just interesting in the context of our daily lives: it bears on many global challenges in the 21st century, such as supporting an aging population, food consumption and energy use.

Today, we rarely have access to these types of data, let alone the ability to reuse and share them, even when it’s my data, about just me. Who owns data about me, who controls it, who has access to it? Can I see data about me, can I get a copy of it in a form I could reuse or share, can I get value out of it? Would I even be allowed to publish some of the data about me openly, if I wanted to?

But how does this relate to open data? After all, a key tenet of our work at the Open Knowledge Foundation is that personal data should not be made open (for obvious privacy reasons)!

However there are, in fact, obvious points where “Open Data” and “My Data” connect:

  • MyData becomes Open Data (via transformation): Important datasets that are (or could be) open come from “my data” via aggregation, anonymisation and so on. Much statistical information ultimately comes from surveys of individuals, but the end results are heavily aggregated (for example, census data). This means “my data” is an important source but also that it is essential that the open data community have a good appreciation of the pitfalls and dangers here – e.g. when anonymisation or aggregation may fail to provide appropriate privacy.

  • MyData becomes Open Data (by individual choice): There may be people who want to share their individual, personal, data openly to benefit others. A cancer patient could be happy to share their medical information if that could assist with research into treatments and help others like them. Alternatively, perhaps I’m happy to open my household energy data and share it with my local community to enable us collectively to make sustainable energy choices. (Today, I can probably only see this data on the energy company’s website, remote, unhelpful, out of my control. I may not even be able to find out what I’m permitted to do with my data!)

  • The Right to Choose: if it’s my data, just about me, I should be able to choose to access it, reuse it, share it and open it if I wish. There is an obvious translation here of key Open Data principles to MyData. Where the Open Definition states that material should be freely available for use, reuse and redistribution by anyone, we could say that my data should be freely available for use, reuse and redistribution by me.
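The pitfalls of aggregation mentioned above can be illustrated with a toy sketch: a released cross-tabulation with a cell of size 1 effectively reveals one person’s information, which is why statistical agencies use techniques such as small-cell suppression. All data and the threshold below are invented for illustration.

```python
from collections import Counter

# Toy sketch of why aggregation alone may not protect privacy: a released
# cross-tabulation with a cell of size 1 reveals one person's diagnosis.
# A common mitigation is small-cell suppression. All data is invented.
records = [("02138", "asthma"), ("02138", "asthma"), ("02138", "asthma"),
           ("02139", "flu"), ("02141", "diabetes")]

counts = Counter(records)
THRESHOLD = 3  # cells below this size are withheld before release (illustrative)

released = {cell: (n if n >= THRESHOLD else "suppressed")
            for cell, n in counts.items()}

# The ("02141", "diabetes") cell had count 1: publishing it would have
# identified a single individual's condition, so it is suppressed.
print(released[("02138", "asthma")])    # prints 3 - safe to release
print(released[("02141", "diabetes")])  # prints suppressed
```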

We think it is important to explore and develop these connections and issues. The Open Knowledge Foundation is therefore today launching an Open Data & MyData Working Group. Sign up here to participate:

This will be a place to discuss and explore how open data and personal data intersect. How can principles around openness inform approaches to personal data? What issues of privacy and anonymisation do we need to consider for datasets which may become openly published? Do we need “MyData Principles” that include the right of the individual to use, reuse and redistribute data about themselves if they so wish?


There are plenty of challenging issues and questions around this topic. Here are a few:


Are big datasets actually anonymous? Anonymisation is incredibly hard. This isn’t a new problem (Ars Technica had a great overview in 2009), but it gets more challenging as more data becomes available, openly or otherwise: the more data that can be cross-correlated, the more easily anonymisation is breached.
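A toy sketch of how cross-correlation breaches anonymisation: quasi-identifiers left in a “de-identified” release (postcode, birth year, sex) can be joined against a public register to single out individuals. All data below is invented.

```python
# Toy linkage attack. An "anonymised" medical release still carries
# quasi-identifiers that match a public register; when a combination of
# quasi-identifiers is unique in the register, the record is re-identified.
medical = [
    {"zip": "02138", "birth_year": 1945, "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_year": 1972, "sex": "M", "diagnosis": "flu"},
]
voter_roll = [
    {"name": "A. Smith", "zip": "02138", "birth_year": 1945, "sex": "F"},
    {"name": "B. Jones", "zip": "02141", "birth_year": 1980, "sex": "M"},
]

def reidentify(released, register):
    hits = []
    for r in released:
        matches = [v for v in register
                   if (v["zip"], v["birth_year"], v["sex"])
                   == (r["zip"], r["birth_year"], r["sex"])]
        if len(matches) == 1:  # quasi-identifiers single out exactly one person
            hits.append((matches[0]["name"], r["diagnosis"]))
    return hits

print(reidentify(medical, voter_roll))  # [('A. Smith', 'asthma')]
```

The more auxiliary datasets exist, the more such unique joins become possible, which is exactly why anonymisation degrades as data accumulates.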

Releasing Value

There’s a lot of value in personal data – Boston Consulting Group claim €1tn. But even BCG point out that this value can only be realised if the processes around personal data are more transparent. Perhaps we can aspire to more than transparency, and have some degree of personal control, too.


Governments are starting to offer some proposals here such as “MiData” in the UK. This is a good start but do they really serve the citizen?

There’s also some proposed legislation to drive companies to give consumers the right to see their data.

But is access enough?

The consumer doesn’t own their data (even when they have “MiData”-style access to it), so can they publish it under an open licence if they wish?

Whose data is it anyway?

Computers, phones, energy monitors in my home, and so on, aren’t all personal to me. They are used by friends and family. It’s hard to know whose data is involved in many cases. I might want privacy from others in my household, not just from anonymous corporations.

This gets even more complicated when we consider the public sphere – surveillance cameras and internet of things sensors are gathering data in public places, about groups of independent people. Can the people whose images or information are being captured access or control or share this data, and how can they collaborate on this? How can consent be secured in these situations? Do we have to accept that some information simply cannot be private in a networked world?

(Some of these issues were raised at the Open Internet of Things Assembly in 2012, which led to a draft declaration. The declaration doesn’t indicate the breadth of complex issues around data creation and processing which were hotly debated at the assembly.)

MyData Principles

We will need clear principles. Perhaps, just as the Open Definition has helped clarify and shape the open data space, we need analogous “MyData” Principles which set out how personal data should be handled. These could include, for example:

  • That my data should be made available to me in machine-readable bulk form
  • That I should have the right to use that data as I wish (including using, reusing and redistributing it if I so wish).
  • That none of my data (where it contains personal information) should be made open without my full consent.

4 Ideas for Defending the Open Data Commons

Open Knowledge France - January 10, 2013 in Featured, Ideas and musings, OKF France, Open Data, Open Standards

The following post was written by Simon Chignard, author of L’Open data: Comprendre l’ouverture des données publiques. It was originally posted on Simon’s blog following the launch of the Open Knowledge Foundation French national group, and has been translated by Samuel Goëta from OKFN France.

Open data and the commons: an old story?

There is a direct link between the open data movement and the philosophy of common goods. Open data are an illustration of the notion of common informational goods proposed by Elinor Ostrom, winner of the 2009 Nobel Prize for economics. Open data belong to everyone and, unlike water and air (and other common goods), they are non-rival: use by one does not prevent use by others. If I reuse an open data set, this does not prevent other reusers from doing so. This kinship between the commons and open data is also suggested by the presence of Lawrence Lessig, initiator of the Creative Commons licences, at the 2007 Sebastopol meeting where the concept of open data itself was defined.

But despite the strong conceptual and historical linkages, it seems that we, as actors of open data, are often shy to reaffirm the relationship. In our efforts to encourage public and private bodies to embrace open data, we seem almost embarrassed of this cornerstone philosophy. The four proposals I make here aim at one thing: not letting it drop!

Idea #1: defend a real choice in terms of open data licences (“pro-choice” approach)

On paper, that sounds clear: there is a real choice in France in terms of open data licences. On one side, the open licence offered by Etalab (the French government institution in charge of government open data), on the other side, the Open Database License (ODbL). Government services must use the former, some local authorities have chosen the latter, generally based on some conception of the relationship between the commons and open data.

In practice, this choice is hindered by the difficulties, real or perceived, of the ODbL licence. The two licences are distinguished by the ODbL’s obligation to share alike, which is clearly a product of a belief in the common pot (if I use it, I must recontribute). But a strange music is playing in France, which warns against this “contaminating” licence. ODbL is accused of being against business, coming “from abroad”, or being the source of unpredictable dangers (such as counterfeiting).

We find ourselves in a situation where, at the same moment as big projects such as OpenStreetMap are adopting ODbL, new entrants in open data apply – sometimes in good faith – the principle of least effort: “that share-alike thing seems complicated, we don’t really know the potential risks, I’d rather choose the Licence Ouverte”.

As the initiator of the ODbL licence, the Open Knowledge Foundation should be its foremost promoter, explaining its mechanisms and opportunities (not only its use to block Google), so that a real choice of open data licences remains possible (the pro-choice approach)!

But the ODbL licence cannot by itself defend open data as part of the digital commons – below are three further tactics which need to be employed alongside it.

Idea #2: the General Interest Data, G.I.D.

Let’s take an example that matters to everyone, which was addressed during a recent workshop run by Net:Lab – access to housing. In France, who has the best knowledge of the housing market? Who knows rent prices in great detail and in real time, with an address and a complete description of the accommodation? Not the council, nor the tax services, nor even the housing minister – but a private actor in real estate ads.

In France, we have a law for personal data (CNIL law), another for public data (CADA law). But what about data – personal, public or private – which serves the general interest? With a clearer and more dynamic vision of rents, one can imagine that everyone would be more informed on the real prices of the market (while making sure to limit the side effects of transparency).

Without demanding the requisition of the data (and of empty flats), one can imagine a digital tax system encouraging its release. There is already a tax break in France for research, why not for open data? As mentioned previously, this would require the definition of a new class of data, the G.I.D. (General Interest Data), associated with specific rights of access and reuse.

(Obviously, G.I.D. raises as many questions as it tackles – for example who will define general interest?)

Idea #3: Contribution peering: I contribute/I receive

The first period of open data has seen public actors (local authorities or governments) release their data to users, mainly developers. The emerging open data movement is becoming infinitely richer and more complex. Although the division of roles between producers and re-users seems quite established, it is evolving: public and collaborative open data are starting to mutually enrich each other, companies are starting to deliver data on themselves back to clients. How can we design a contribution mechanism which takes into account these evolutions, so as to make “common pots”?

The first step I would suggest is “contribution peering” – as already exists for boat positioning systems (AIS data). The collaborative website Marine Traffic, launched in 2007, is now the world’s leading website for tracking global naval traffic. More than 1,000 contributors (each equipped with an AIS receiver connected to the Internet) allow the daily tracking of 65,000 ships. The website now serves more than 2 million page views – per day (source: interview by S. Chignard with Dimitris Lekkas, the Greek scholar who developed the project). Everyone can visualise the data on the map displayed on the website, but to access the raw data you need to contribute to the service by connecting a new AIS receiver. Contribution peering thus encourages everyone to enhance the service. (Marine Traffic is not the only website doing this – see for example AIS Hub.)

Idea #4: Contributive pricing on use (GET>POST)

The last suggestion I would like to make for the development and defence of an open data commons is pricing on use – an idea already mentioned in my blog post about transport data. This would involve a variable pricing scheme for the use of data, and APIs lend themselves particularly well to this pricing method.

Let’s imagine, for example, that access to our G.I.D. is free for all, but that a contribution may be asked of the heaviest users of an API, who behave almost as free riders (in economic theory, those who make use of others’ contributions without ever contributing themselves). Everyone would then be free to choose whether to contribute by enhancing the data (updating, correcting) or by paying out of pocket!
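A metering scheme like this is straightforward to sketch. The quota, price and contribution offset below are invented for illustration, not a real pricing proposal:

```python
# Sketch of "contributive pricing on use": API access is free up to a quota;
# heavy users (free riders) must either contribute data back or pay.
# All numbers are invented for illustration.
FREE_QUOTA = 1000          # free requests per month
PRICE_PER_REQUEST = 0.001  # EUR per request beyond the quota
OFFSET_PER_CONTRIBUTION = 100  # extra free requests per accepted contribution

def monthly_bill(requests, contributions):
    """Each accepted data contribution offsets part of the metered overage."""
    overage = max(0, requests - FREE_QUOTA
                  - contributions * OFFSET_PER_CONTRIBUTION)
    return overage * PRICE_PER_REQUEST

print(monthly_bill(requests=500, contributions=0))    # light user: 0.0
print(monthly_bill(requests=5000, contributions=0))   # free rider pays: 4.0
print(monthly_bill(requests=5000, contributions=40))  # contributor: 0.0
```

The point of the sketch is the choice it encodes: heavy users can settle their overage either in data or in money.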

Open Data and Privacy Concerns in Biomedical Research

Sabina Leonelli - November 26, 2012 in Ideas and musings, Open Data, Open Science, WG Open Data in Science

Privacy has long been the focus of debates about how to use and disseminate data taken from human subjects during clinical research. The increasing push to share data freely and openly within biomedicine poses a challenge to the idea of private individual information, whose dissemination patients and researchers can control and monitor.

In order to address this challenge, however, it is not enough to think about (or simply re-think) the meaning of ‘informed consent’ procedures. Rather, addressing privacy concerns in biomedical research today, and the ways in which the Open Data movement might transform how we think about the privacy of patients, involves understanding the ways in which data are disseminated and used to generate new results. In other words, one needs to study how biomedical researchers confront the challenges of making data intelligible and useful for future research.

Efficient data re-use comes from what the Royal Society calls ‘intelligent openness’ – the development of standards for data dissemination which make data both intelligible and assessable. Data are intelligible when they can be used as evidence for one or more claims, thus helping scientists to advance existing knowledge. Data are assessable when scientists can evaluate their quality and reliability as evidence, usually on the basis of their format, visualisation and extra information (metadata) also available in databases.

Yet the resources and regulatory apparatus for securing proper curation of data, and so their adequate dissemination and re-use, are far from being in place. Making data intelligible and assessable requires labour, infrastructures and funding, as well as substantial changes to the institutional structures surrounding scientific research. While the funding to build reliable and stable biomedical databases and Open Data Repositories is increasing, there is no appropriate business model to support the long-term sustainability of these structures, with national funders, industry, universities and publishing houses struggling to agree on their respective responsibilities in supporting data sharing.

Several other factors are important. For instance, the free dissemination of data is not yet welcomed by the majority of researchers, who do not have the time or resources for sharing their data, are not rewarded for doing so and who often fear that premature data-sharing will damage their competitive advantage over other research groups. There are intellectual property concerns too, especially when funding for research comes from industry or specific parts of government such as defence. Further, there are few clear standards for what counts as evidence in different research contexts and across different geographical locations. And more work needs to be done on how to relate datasets collected at different times and with different technologies.

The social sciences and humanities have an important role to help scientific institutions and funders develop policies and infrastructures for the evaluation of data-sharing practices, particularly the collaborative activities that fuel data-intensive research methods. An improved understanding of how data can be made available so as to maximise their usefulness for future research can also help tackle privacy concerns relating to sensitive data about individuals.

When it comes to sharing medical records, it is now generally agreed that providing ‘informed consent’ to individual patients is simply not possible, as neither patients nor researchers themselves can predict how the data could be used in the future. Even the promise of anonymity is failing, as new statistical and computational methods make it possible to retrieve the identity of individuals from large, aggregated datasets, as shown by genome-wide association studies.

A more effective approach is the development of ‘safe havens’: data repositories which would give access to data only to researchers with appropriate credentials. This could potentially safeguard data from misuse, without hampering researchers’ ability to extract new knowledge from them. Whether this solution succeeds ultimately depends on the ability of researchers to work with data providers, including patients, to establish how data travel online, how they are best re-used and how data sharing is likely to affect, and hopefully improve, future medicine. This work is very important, and should be supported and rewarded by universities, research councils and other science funders as an integral part of the research process.

To learn more, read the report ‘Making Data Accessible to All’

Towards a public digital infrastructure: why do governments have a responsibility to go open?

Guillermo Moncecchi - November 1, 2012 in Featured, Ideas and musings, Open Government Data, WG Open Government Data

The most common argument in favor of open data is that it enhances transparency, and while the link may not always be causal, it is certainly true that both tend to go hand-in-hand. But there is another, more expansive perspective on open government data: that it is part of an effort to build public infrastructure.

Does making a shapefile available with all Montevideo’s traffic lights make Montevideo’s government more transparent? We don’t think so. But one of our duties as civil servants is building the city infrastructure. And we should understand that data is mainly infrastructure. People do things on it, as they do things on roads, bridges or parks. For money, for amusement, for philanthropy, there are myriads of uses for infrastructure: we should not try to determine or even guess which those uses are. We must just provide the infrastructure and ensure it will be available. Open data should be seen as a component of an effort to build a public digital infrastructure, where people could, within the law, do whatever they want. Exactly as they do with roads.

When you see open data in this light, several decisions become easier. Should we ask people for identification before giving them our data? Answer: do you ask them for identification to use the street? No, you don’t – so no, you shouldn’t. Why should we use open, non-proprietary standards for publishing data? For the same reason you do not build a street where only certain car brands can drive. What happens if there are problems with my data which cause problems for its users? Well, you will be liable if the law so decides … but would you avoid lawsuits over accidents caused by pavement problems by not building streets? Of course you are responsible for your data: you are paid to create it, just as you are paid to build bridges. Every question about open data we can imagine has already been answered for traditional infrastructure.

But of course the infrastructure required to enable people to create an information society goes beyond data. We will give you four examples.

The most direct infrastructure component is hardware and communications. The Uruguayan government recognises this, and is planning to have each home connected with fibre by the end of 2015, with 1 Gb of traffic free for everybody with a phone line. Meanwhile, since 2007 every public school child has received an OLPC laptop and an internet connection. This programme should be understood as being primarily about infrastructure: education encompasses much more than laptops, but infrastructure enables the development of new education paths.

Secondly, services. Sometimes it’s better to provide services than to provide data. Besides publishing cartography data, in Montevideo we provide WMS and WFS services to retrieve a map using just a URL. Services, like data, should be open: no registration, no access limits. Open services allow developers to use not only government data, but also government computation power, and, of course, government knowledge: the knowledge needed to, say, estimate the arrival time of a bus.
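To illustrate “retrieving a map just using a URL”, here is a Python sketch that builds an OGC WMS 1.3.0 GetMap request. The endpoint and layer name are hypothetical placeholders, not Montevideo’s actual service; the query parameters follow the WMS standard.

```python
from urllib.parse import urlencode

# Hypothetical WMS endpoint and layer - placeholders for illustration only.
ENDPOINT = "https://geo.example.org/wms"

params = {
    "SERVICE": "WMS",
    "VERSION": "1.3.0",
    "REQUEST": "GetMap",
    "LAYERS": "traffic_lights",          # hypothetical layer name
    "STYLES": "",
    "CRS": "EPSG:4326",
    "BBOX": "-34.95,-56.25,-34.85,-56.10",  # lat/lon axis order in WMS 1.3.0
    "WIDTH": 800,
    "HEIGHT": 600,
    "FORMAT": "image/png",
}
url = ENDPOINT + "?" + urlencode(params)
print(url)  # fetching this URL returns a rendered PNG map tile
```

Because the whole request is a plain URL with standard parameters, any client that can make an HTTP GET can consume the service: no registration, no proprietary SDK.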

Thirdly, sometimes services are not enough, and we have to develop complete software components to enable public servants to do their work. Sometimes these software components should also be part of the public digital infrastructure. The people of Brazil are very clear on this: in 2007 they developed the Portal do Software Publico Brasileiro, where applications developed by or for the government are publicly available. Of course, this is not a new concept: its general version is called open source software. We believe that within this framework of public infrastructure, the debate between open source and proprietary software makes no sense. Nobody would let a company be the owner of a street. If it is public, it should be open.

Finally, there is knowledge. We, as the government, must tell the people what we are doing, and how we are doing it. Our knowledge should be open. We have the duty to publish our knowledge and to let others use it, so that we can participate actively in communities, propose changes, and act as an innovation factor in every task we face. Because we are paid for that: for building knowledge infrastructure.

We do not think government should let others do its work: on the contrary, we want a strong government, building the blocks of infrastructure to achieve its tasks, and making this infrastructure available to people to do whatever they want, within the law.

Exactly the same thing they do with streets.

Is Open Access Open?

Peter Murray-Rust - October 26, 2012 in Featured, Ideas and musings, Open Access

This post is cross-posted from Peter’s blog

I’m going to ask questions. They are questions I don’t know the answers to – maybe I am ignorant in which case please comment with information, or maybe the “Open Access Community” doesn’t know the answers. Warning: I shall probably be criticized by some of the mainstream “OA Community”. Please try to read beyond any rhetoric.

As background, I am well versed in Openness. I have taken a leading role in creating and launching many Open efforts – SAX, Chemical MIME, Chemical Markup Language, The Blue Obelisk, Panton Principles, Open Bibliography, Open Content Mining – and helped to write a significant number of large software frameworks (OSCAR, JUMBO, OPSIN, AMI2). I’m on the advisory board of the Open Knowledge Foundation and I have contributed to or worked with Wikipedia, OpenStreetMap, Stack Overflow, the Open Science Summit, Mat Todd (Open Source Drug Discovery) and been to many hackathons. So I am very familiar with the modern ideology and practice of “Open”. Is “Open Access” the same sort of beast?

The features of “Open” that I value are:

  • Meritocracy. That doesn’t mean that decisions are made by hand counting, but it means that people’s views are listened to, and they enter the process when it seems right to the community. That’s happened with SAX, very much with the Blue Obelisk, and the Open Knowledge Foundation.
  • Universality of participation, particularly from citizens without formal membership or qualifications. A feeling of community.
  • A willingness to listen to other views and find means of changing strategy where necessary
  • Openness of process. It is clear what is happening, even if you are not in command.
  • Openness of results. This is universally fundamental. Although there have been major differences of opinion in Free/Open Source Software (F/OSS) everyone is agreed that the final result is free to use, modify, redistribute without permission and for any purpose. Free software is a matter of liberty, not price.
  • A mechanism to change current practice. The key thing about Wikipedia is that it dramatically enhances the way we use knowledge. Many activities in the OKF (and other Open organisations) are helping to change practice in government, development agencies and companies. It’s not about price restrictions, it’s about giving back control to the citizens of the world. OpenStreetMap produces BETTER and more innovative maps that people can use to change lives right now – e.g. after the Haitian earthquake.

How does Open Access measure up against these? Not very well. That doesn’t mean it isn’t valuable, but it means that it doesn’t have obvious values I can align with. I have followed OA for most of the last 10 years and tried to contribute, but without success. I have practiced it by publishing all my own single-author papers over the last 5 years in Gold CC-BY journals. But I have never had much feeling of involvement – certainly not the involvement that I get from SAX or BlueObelisk.

That’s a harsh statement and I will elaborate:

Open Access is not universal – it looks inward to universities and research institutions. Among the membership categories listed for OA Week, there is no space for “citizen”. Indeed, some in the OA movement emphasize this. Stevan Harnad has said that the purpose of OA is for “researchers to publish to researchers” and that ordinary people won’t understand scholarly papers. I take a strong and public stance against this – the success of Galaxy Zoo has shown how citizens can become as expert as many practitioners. In my new area of phylogenetic trees I would feel confident that anyone with a university education (and many without) would have little difficulty understanding much of the literature, and many could become involved in the calculations. For me, Open Access has little point unless it reaches out to the citizenry, and I see very little evidence of this (please correct me).

There is, in fact, very little role for the individual. Most of the infrastructure has been built by university libraries without involving anyone outside (regrettably, since university repositories are poor compared to other tools in the Open movements). There is little sense of community. The main events are organised round library practice and funders – which doesn’t map onto other Opens. Researchers have little involvement in the process – the mainstream vision is that their university will mandate them to do certain things and they will comply or be sacked. This might be effective (although no signs yet), but it is not an “Open” attitude.

Decisions are made in the following ways:

  • An oligarchy, represented in the BOAI processes and Enabling Open Scholarship (EOS). EOS is a closed society that releases briefing papers; membership costs 50 EUR per year and has to be formally approved by the committee. (I have represented to several members of EOS that I don’t find this inclusive and I can’t see any value in my joining – it’s primarily for university administrators and librarians.)
  • Library organizations (e.g. SPARC)
  • Organizations of OA publishers (e.g. OASPA)

Now there are many successful and valuable organizations that operate on these principles, but they don’t use the word “Open”.

So is discussion “Open”? Unfortunately not very. There is no mailing list with both a large volume of contributions and effective freedom to present a range of views. Probably the highest-volume list for citizens (as opposed to librarians) is GOAL, and there differences of opinion are unwelcome. Again that’s a hard statement, but the reality is that if you post anything that does not support Green Open Access then Stevan Harnad and the Harnadites will publicly shout you down. I have been denigrated on more than one occasion by members of the OA oligarchy (look at the archive if you need proof). It’s probably fair to say that this attitude has effectively killed Open discussion in OA. Jan Velterop and I are probably the only people prepared to challenge opinions: most others walk away.

Because of this lack of discussion it isn’t clear to me what the goals and philosophy of OA are. I suspect that different practitioners have many different views, including:

  • A means to reach out to citizenry beyond academia, especially for publicly funded research. This should be the top reason IMO but there is little effective practice.
  • A means to reduce journal prices. This is (one of) Harnad’s arguments. We concentrate on making everything Green and when we have achieved this the publishers will have to reduce their prices. This seems most unlikely to me – any publisher losing revenue will fight this.
  • A way of reusing scholarly output. This is ONLY possible if the output is labelled as CC-BY, and only about 5–10 percent of it is. Again this is high on my list and the only reason Ross Mounce and I can do research into phylogenetic trees.
  • A way of changing scholarship. I see no evidence at all for this in the OA community. In fact OA is holding back innovation in new methods of scholarship, as it emphasizes the conventional roles of the “final manuscript” and the “publisher”. Green OA relies (in practice) on having publishers and so legitimizes them.

And finally, is the product “Open”? The BOAI declaration is, in Cameron Neylon’s words, “clear, direct, and precise”. To remind you:

“By ‘open access’ to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.”

This is in the tradition of Stallman’s software freedoms, The Open Knowledge Definition and all the other examples I have quoted. Free to use, re-use and redistribute for any lawful purpose. For manuscripts this is cleanly achieved by adding a visible CC-BY licence. But unfortunately many people, including the mainstream OA community and many publishers, use “(fully) Open Access” to mean just about anything. Very few of us challenge this. So the result is that much current “OA” is so badly defined that it adds little value. There have been attempts to formalize this, but they have all ended in messy (and to me unacceptable) compromise. In all other Open communities “libre” has a clear meaning – freedom as in speech. In OA it means almost nothing, and anyone pushing for a tighter definition is shouted down. So, and this is probably the greatest tragedy, Open Access does not by default produce Open products.

For that reason we have set up our own Open-access list in the OKF.

If we can have a truly Open discussion we might make progress on some of these issues.

[1] Phylogenetic tree diagram by David Hillis, Derreck Zwickil and Robin Gutell.

The future of Open Access

Theodora Middleton - October 24, 2012 in Featured, Ideas and musings, Open Access

At the start of this week, which is Open Access week, we heard from Martin Weller about some of his fears for the future of Open Access. We’ve been collecting a few opinions from around the OKFN on the future of OA. Here’s a selection. What do you think?

Ross Mounce: The future of publicly-funded research is inevitably Open Access.

With increasing realisation that research is best distributed electronically – for speed, economic efficiency, and fairness – Open Access to publicly-funded academic research is inevitable.

It costs money to implement, maintain and enforce artificial paywalls to restrict access to research online. These create frustrating and time-consuming barriers to accessing research. Open Access is thus an obviously beneficial system that simply allows ALL to read, re-use and remix academic research, thereby truly maximising the potential return on investment from these works.

Peter Murray-Rust: Is Open Access Open?

Is Open Access really “Open”? The features of Open I value are:

  • Meritocracy: That doesn’t mean that decisions are made by hand counting, but it means that people’s views are listened to, and they enter the process when it seems right to the community.
  • Universality of participation, particularly from citizens without formal membership or qualifications. A feeling of community.
  • A willingness to listen to other views and find means of changing strategy where necessary
  • Openness of process. It is clear what is happening, even if you are not in command.
  • Openness of results. This is universally fundamental. Although there have been major differences of opinion in Free/Open Source Software (F/OSS) everyone is agreed that the final result is free to use, modify, redistribute without permission and for any purpose.
  • A mechanism to change current practice. The key thing about Wikipedia is that it dramatically enhances the way we use knowledge. Many activities in the OKF (and other Open Organisations) are helping to change practice in government, development agencies, companies. It’s not about price restrictions, it’s about giving back control to the citizens of the world.

How does OA match up? Not very well:

  • It’s not universal: it looks inwards to universities. There is no space for the ‘citizen’, or even the individual.
  • It has oligarchic and closed decision procedures: the Enabling Open Scholarship committee costs 50 euros per year to join, and requires recommendation by an existing member.
  • Discussion is closed: differing opinions aren’t listened to or wanted.
  • The product isn’t, necessarily, open either: whilst a CC-BY license would easily ensure manuscript openness, in fact the term “open access” is applied to almost anything, and means very little.

Only if we can have a truly Open discussion about these issues will we make any progress.

A longer version of Peter’s thoughts will be published later this week.

Christian Heise: Open Access is the foundation of Open Science.

In February 2002 the Budapest Open Access Initiative (BOAI) launched a worldwide campaign for open access (OA). Although it did not invent the idea, the initiative articulated the first major international statement and public definition of open access. Now, ten years later, it has made new recommendations for the next ten years (summarized by me in five points):

1. Every institution of higher education should have access to an open access repository (through a consortium or outsourcing), and every publishing scholar in every field and country, including those not affiliated with institutions of higher education, should have deposit rights.

2. Every institution of higher education should have a policy that all future scholarly articles by faculty members and all future theses and dissertations are made open access as soon as practicable, and deposited in the institution’s designated open access repository, preferably licensed CC-BY.

3. Research institutions, including funders, should support the development and maintenance of the tools, directories, and resources essential to the progress and sustainability of open access, including: tools and APIs to convert deposits made in PDF format into machine-readable formats such as XML; the means to harvest from and re-deposit to other repositories; and tools working with alternative impact metrics.

4. The use of classic journal impact factors is discouraged. The Initiative encourages the development of alternative metrics for impact and quality which are less simplistic, more reliable, and entirely open for use and reuse.

5. The open access community should act in concert more often, and we should do more to make universities, publishers, editors, referees and researchers aware of standards of professional conduct for open access publishing. We also need to articulate more clearly, with more evidence, and to more stakeholder groups the advantages and potential of open access.

These recommendations are pretty detailed on what has to be done to achieve a sustainable open access process in the near future. In the longer term, however, Open Access has to evolve into the holistic concept of Open Science (open access + open science data).

Tom Olijhoek: Open Interconnected Specialist Communities

In my view the future of science will ultimately depend on the formation of many interconnected scientific communities covering all possible areas. Making optimal use of the internet and social media, scientists and citizens within and between these communities will collaborate to produce more useful knowledge than ever before and to store, maintain and provide information for those who seek it. Especially for medical scientists in the developing world, these communities would provide vehicles for innovation, health improvement and development in their respective countries. Following this line of thought, the only hope of winning the battle against malaria, AIDS, neglected diseases and other tropical infections will lie in free access to and sharing of information, and in joining forces by way of social media and open science communities. MalariaWorld is our first experiment with this kind of specialist open access scientific community.

Laurent Romary: Open access is a state of mind

Open access is a state of mind for the researcher. Every means of furthering the dissemination of knowledge – publications, data, expertise – is worth pursuing. One may doubt that the commercial publishing system as we currently know it truly meets researchers’ expectations or the challenge of interconnecting knowledge. Tomorrow’s research infrastructures, managed by researchers themselves, will have to include virtual research environments in which every scientist (in the hard sciences as well as the humanities) will manage their observations, comments and results, and will choose freely, without financial barriers, to disseminate them or to have them evaluated.

The great Open Access swindle

Martin Weller - October 22, 2012 in Featured, Ideas and musings, Open Access

This week is Open Access week, and we’ll be running a few pieces mulling over where Open Access has got to, and where it’s going. Here Martin Weller discusses some reservations…

The Cunning Thief, by Chocarne-Moreau. PD

Just to be clear from the outset, I am an advocate for open access, and long ago took a stance to only publish OA and to only review for OA. I’m not suggesting here that open access is itself a swindle, but rather that the current implementation, in particular commercial publishers adopting Gold OA, is problematic.

In my digital scholarship book, I made two pleas: the first was for open access publishing, the second for scholars to own the process of change. On this second point, the book ends thus:

“This is a period of transition for scholarship, as significant as any other in its history, from the founding of universities to the establishment of peer review and the scientific method. It is also a period that holds tension and even some paradoxes: it is both business as usual and yet a time of considerable change; individual scholars are being highly innovative and yet the overall picture is one of reluctance; technology is creating new opportunities while simultaneously generating new concerns and problems…. …For scholars it should not be a case of you see what goes, you see what stays, you see what comes, but rather you *determine* what goes, what stays and what comes.”

The open access element has proceeded faster than even I imagined when writing this back in 2010/2011. The Finch Report can be seen as the crowning achievement of the open access movement, in setting out a structure for all UK scholarly articles to be published as open access. But in rather typical “you academics are never happy” mode I’ve become increasingly unhappy about the route Open Access is taking. And the reason is that it fails to meet the second of my exhortations, in that it is a method being determined by the publishing industry and not by academics themselves.

The favoured route is that of Gold OA, under which authors pay publishers to have open access articles published, usually through research funds. This is good in that it means these research papers will be openly available to all, but bad from a digital scholarship perspective. And here’s why:

1) Ironically, openness may lead to elitism. If you need to pay to publish then, particularly in cash-strapped times, it becomes something of a luxury. New researchers or smaller universities won’t have these funds available. Many publishers have put in waivers for new researchers (PLoS are excellent at this), but there’s no guarantee of these, and after all, the commercial publishers are concerned with maximising profits. If there are enough paying customers around then it’s not in their interest to give out too many freebies. It also means richer universities can flood journals with articles. Similarly, those with research grants can publish, as this is where the funding will come from, and those without can’t. This will increase competition in an already ludicrously competitive research funding regime. “You’re either in the boat or out of it” will be the outcome. The Scholarly Kitchen blog has a good piece on OA increasing the so-called Matthew Effect. It would indeed be a strange irony if those of us who have been calling for open access out of a belief in wider access and a more democratic knowledge society end up creating a self-perpetuating elite.

2) It will create additional cost. Once the cost is shifted to research funders, then the author doesn’t really care about the price. There is no strong incentive to keep costs down or find alternative funding mechanisms. This is great news for publishers who must be rubbing their hands with glee. It is not only a licence to carry on as they were, but they have successfully fended off the threat of free publication and dissemination that the internet offers. Music industry moguls must be looking on with envy. The cost for publication is shifted to taxpayers (who ultimately fund research) or students (if it comes out of university money). The profits and benefits stay with the publishers. It takes some strained squinting to view this as a victory.

Stevan Harnad argues again for Green OA, claiming that

“Publishers – whose primary concern is not with maximizing research usage and progress but with protecting their current revenue streams and modus operandi – are waiting for funders or institutions to pledge the money to pay Gold OA publishing fees. But research funds are scarce and institutional funds are heavily committed to journal subscriptions today. There is no extra money to pay for Gold OA fees.”

3) It doesn’t promote change – in my book I also talked about how a digital, networked and open approach could change what we perceive as research, and how much of our interpretation of research is dictated by the output forms we have. So, for instance, we could see smaller granularity of outputs, post-publication review and different media formats all beginning to change our concept of what research means. But Gold OA reinforces the power of commercial publishers, simply maintains the status quo, and keeps the peer-reviewed 5000-word article as the primary research output to be attained.

I’ve heard Stephen Downes say that as soon as any form of commercial enterprise touches education it ruins it (or words to that effect). I wouldn’t go that far – I think, for instance, that commercial companies often make a better job of software and technology than universities – but academic publishing is such an odd business that maybe it doesn’t make sense as a commercial enterprise. As David Wiley so nicely parodies in his trucker’s parable, there isn’t really another industry like it: academics (paid by the taxpayer or students) provide free content, the same academics then provide free services (editorship and peer review), hand over rights and ownership to a commercial company, which provides a separate set of services and then sells the content back to the same group of academics.

I know a few people who work in commercial publishing, and they are smart, good people who genuinely care about promoting knowledge and publishing as a practice. This is not a cry for such people to be out on the streets, but rather for their skills and experience to be employed by and for universities, the research communities and the taxpayer rather than for shareholders. In this business Downes’ contamination theory seems to hold: there is simply no space in the ecosystem for profit to exist, and when it does it corrupts the whole purpose of the enterprise, which is to share and disseminate knowledge.

Gold OA is not inherently detrimental. There are plenty of non-profit publishers who operate this model; they keep costs down to a minimum and have generous fee-waiver policies. They are, after all, not concerned with making a profit, but with knowledge dissemination. Other models exist also, including subsidised university presses, centralised publishing platforms, etc. The swindle is that there is no real incentive to explore these possibilities, because the standard model has been reinforced through the manner in which OA has been implemented. As Tim O’Reilly comments, “If we’re going to get science policy right, it’s really important for us to study the economic benefit of open access and not just accept the arguments of incumbents”.

[An earlier version of this post was originally posted on Martin Weller’s blog]

Video: Julia Kloiber on Open Data

Rufus Pollock - October 3, 2012 in Ideas and musings, Interviews, OKF Germany, OKFest, Our Work

Here’s Julia Kloiber from OKFN-DE’s Stadt-Land-Code project, talking at the OKFest about the need for more citizen apps in Germany, the need for greater openness, and how to persuade companies to open up.

Open Data, Technology and Government 2.0 – What Should We, and Should We Not, Expect?

Rufus Pollock - September 13, 2012 in Featured, Ideas and musings, Open Data, Policy

This is the second of two pieces about “managing expectations” (the first is here). Open data has come a long way in the last few years and so have expectations. There’s a growing risk that open data will be seen as a panacea that will magically solve climate change, eliminate corruption or “fix” democracy. This is dangerous because it will inevitably fail to do so, and hope and enthusiasm will be replaced by disappointment and disengagement.

This would be a tragedy as open data is valuable to us socially, commercially and culturally. However, we do need to think hard about how to make effective use of open data. Open data is usually only one part of a solution and we need to identify and work on the other key factors, such as institutions and tools, needed to bring about real change.

Some steps in a theory of change – see below for discussion. Note that the 3rd step is both by far the most important and most complex.

Government 2.0, Open Data and IT

More than two years ago a UK Government civil servant came to visit me. A new government had taken power in the UK for the first time in a decade and she wanted to ask me about “Government 2.0”, open data and transparency.

One thing was immediately apparent from our discussion: while she was already excited by these ideas it wasn’t entirely clear to her what they involved or exactly what problems they would help with — something that has remained a common feature of conversations I’ve had since.

In my view (a view expressed in that conversation two years ago), there are (at least) two distinct — albeit related — ideas for what “Government 2.0” means:

  • Improving (Government) services by utilizing current information technology and open data — open data being especially interesting (and novel) as it could turn Government from the direct supplier of services to the supplier of the data (and infrastructure) needed to run those services (‘Government as a Platform’)
  • More interactive, participatory governance (and therefore more “democratic”) via the use, again, of open information and technology (though the connection was somewhat vaguer).

Put like this it’s clear why “Government 2.0” can appear so exciting — after all it appears to promise a radical improvement, even transformation, of government.

But it also should make us concerned. Unrealistic expectations can be dangerous — something that is generally beneficial can get confused with a miracle cure and then blamed when it fails to deliver.

Moreover, there’s the risk that we start fixating on this wondrous new possibility (open data and technology) and ignore other key (but less exciting) elements in solving our problems – with the consequence that we greatly reduce the actual benefit we get from these new innovations in policy and technology.

This second point seemed especially important, as it could allow the dangerous assumption to develop that open information + IT would magically turn into better (and more participatory) governance, without much examination of how exactly this would come about or of the changes to the form and structure of governance that would be needed.

The danger here is of confusing necessary with sufficient conditions: open data may be a necessary part of better and more participatory governance, but it is likely not sufficient without, say, substantial other changes in the structure and machinery of government (e.g. who gets to vote, when and where).1 These latter changes are normally costly and much more difficult than adopting new IT or opening up data. Thus, whilst new IT and open data are important, they are likely only one (possibly small) part of a solution.

When is Open Data (Part of) the Solution?

To help think about and clarify this question — of the role of open data and IT vis-a-vis other factors in a solution — I drew the first version of this diagram (a diagram I have drawn again and again over the last few years).

The purpose of the diagram is to provide a rough-and-ready way to think about the role of open data (and IT) in solving a specific problem compared to other factors such as institutions.

Some specific problems are listed for illustrative purposes. For example, Climate Change is situated up at the top-left implying that Open Data + IT likely play a relatively limited role compared to other factors such as institutions and governance change (roughly: the real problem here is reaching international agreement on a solution not more (open) information).

Conversely, finding a better way to get to work is likely a problem where Open Data + IT can have a very large impact irrespective of any other factors. Meanwhile, for an issue like Corruption there would be debate as to where to situate it: on the one hand, Transparency and Open Data can have a big impact with relatively little institutional or governance change; on the other hand, one could argue that without reasonably significant governance and institutional change, open data and transparency would have little effect.

Note that the diagram and examples given are for illustration purposes and don’t necessarily reflect my views (you could argue, for example, that Climate Change should be situated somewhere quite different!).

A Theory of Change

What this line of thought suggests is that we need to delve deeper into the exact “theory of change” for a given area. Around open data I think there is a general chain of logic, which runs, roughly, as follows:

  1. Open (digital) data + IT dramatically lowers the cost of access to information
  2. This includes information about what the government is doing (be that in terms of laws or filling in pot-holes)
  3. Armed with this better information citizens (or other groups) will

    1. Be able to hold government accountable and/or drive change
    2. Have a better sense of how their polity operates (improving trust etc)

In essence it presumes some theory of change like this (a diagram I also drew that day for the civil servant):

The key question is around step 3: “Action (& Change)”. It highlights the often missing (but implicit) assumption in much of this discussion that once information is available, action and change will follow. But action, even in highly developed democracies, can be hard for several reasons:

  1. Understanding and action requires attention: analyzing and acting on information requires time and attention and these seem to be (increasingly) scarce. Crudely put: do I go out to the cinema with my friends or do I read up on the latest draft law?

    The key cost of becoming politically active is not the direct cost of acquiring information (be that what is happening or the email address of the representative to contact) but the attention and time cost of analyzing, understanding and acting on that information. If so, open data and IT may only have a limited impact on reducing the cost of taking “political action”.

  2. Digital technology, by reducing simple transmission costs, has substantially increased the amount of “information” competing for attention – information should be interpreted here in the broad sense, including entertainment and anything else that can be shipped as digital “bits”.

  3. The problems of coordination and collective action: crudely, why should I bother to act if it requires a million of us to act for something to happen? Coordination problems are as old as humanity itself. While one can argue that modern communication technologies can assist us in coordination2 they would seem at best to offer a mild improvement on a fundamentally difficult problem (see the appendix below for more on this).


I should emphasize that I am far from arguing that open data (and IT) are not important. However, we need to temper our enthusiasm with an appreciation that they are only one part of the solution. As next steps I think we need to:

  • Think hard about what problems to tackle — if technology and open information are the tools to hand we want to focus on problems where they are especially effective. Using the first of the diagrams above can be a useful exercise in clarifying where a particular problem is located.
  • Be clear that other changes or improvements in, say, institutions, will be needed. We should work out what these are and then endeavour to make them. We should be aware that often these changes will be both more important and much harder than those that we can achieve with technology and open information alone.
  • Appreciate that open data and technology are attractive tools because they are (relatively) very cheap and straightforward to use. This is worth bearing in mind: even if open data and information technology are 10% of solving a problem they are an incredibly cheap 10% to do.
  • Acknowledge that open information and technology will often be complements to institutional change not substitutes. If so we cannot just do more open information and less governance reform — that would be like giving you a second hammer to compensate for having no nails.

Appendix: Principal Agent Theory and Government

Imagine I hire a real-estate agent to sell my house for me. In legal/economic parlance I am the principal (I own the house) and the real-estate agent is the “agent” – the person acting for the principal.

The normal “problem” of such relationships is that the interests and goals of the two parties are not aligned.

Take the real-estate case: suppose the two parties negotiate a 6% commission plus an up-front fee — then for every additional $1 in the sale price the agent manages to get from a buyer for the house the agent will receive only 6 cents in commission. This implies a strong divergence, at least in pure monetary terms, between the incentives of the principal (the owner of the house) and the agent (the real-estate agent).

To make this even more concrete, consider a situation where, by working a weekend and doing extra showings of the house, it will be sold for an extra $10,000. Suppose that the agent crudely values this weekend time at $1000 a day (imagine their daughter has a birthday party!). For them the “payoff” is: $600 (6% of $10k) – $2000 (their time cost) = –$1400.3 So they have a strong incentive not to bother. Meanwhile the principal would clearly make the effort: even assuming a higher cost of their time of $5k, their “payoff” is $10k – $5k = $5k.
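The two payoffs can be checked with a few lines of arithmetic – a minimal sketch using only the illustrative figures from the example above:

```python
# Principal-agent payoffs for the real-estate example above.
# All dollar figures are the illustrative ones from the text.

commission_rate = 0.06      # agent earns 6% of the sale price
extra_sale_price = 10_000   # extra price obtained by working the weekend

# Agent: receives only the commission on the extra $10k,
# but bears the full cost of the weekend's time.
agent_time_cost = 2 * 1_000  # two weekend days valued at $1000/day
agent_payoff = commission_rate * extra_sale_price - agent_time_cost
print(agent_payoff)          # -1400.0: the agent prefers to skip the showings

# Principal: keeps the whole price rise, even at a higher time cost.
principal_time_cost = 5_000
principal_payoff = extra_sale_price - principal_time_cost
print(principal_payoff)      # 5000: the principal would clearly make the effort
```

The divergence is structural: the agent internalizes all of the effort cost but only 6% of the gain, so any effort costing more than 6% of its effect on the sale price is irrational for the agent while remaining profitable for the principal.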

So how does this relate to government? Well, Lincoln may have said that Government is by, for and of the people but in truth only the middle of these is true: Government is a classic principal agent setup in which the principal (the people) appoint agents (their elected representatives) to govern the polity.

Thus, Government faces exactly the kind of principal-agent problems described above: the incentives of elected officials (or bureaucrats) may differ markedly from those of the citizens as a whole. For example, elected officials may primarily care about remaining in office (just as the real-estate agent cares about commission) rather than ensuring the best outcome for citizens – this trade-off will be especially acute when narrow but powerful groups, able to provide monetary or other support for, say, re-election, have interests that conflict with the general welfare of society.

The focus of most principal-agent analysis is on how to better align the interests of the agent with that of the principal. Normally this involves some form of monitoring (so the principal has a better sense of what the agent is doing) combined with some form of reward and sanctions based on outcomes and whatever information a principal has managed to glean about an agent’s actions.

In a perfect world, the principal would know exactly what the agent was doing and, with the right set of rewards and sanctions, could then ensure they did exactly what was wanted. However, in this situation the principal would essentially be the agent (how else would one know exactly what they are doing?), and so the real question is how well one can do with imperfect information and imperfect rewards and sanctions. We need not go into detail here, but the key (and obvious) point is that the more imperfect the principal’s information and the more imperfect their rewards and sanctions, the poorer the alignment will be.

Unfortunately on this basis there are several reasons to think the governance principal-agent problem is especially bad (with the “people” as principals and “government” as agents):

  • Government is complex – this makes it hard for the principal (the “people”) to know what the agent(s) are doing. Remember it’s more about knowing the agents’ actions than outcomes, since outcomes, due to uncertainty, only partially reflect an agent’s effort. Very strong incentives based on uncertain outcomes can be counter-productive – if I work very hard and it may all come to nothing because things go badly for random reasons, then maybe I should not bother and just see what random chance brings.
  • The incentives that can be offered to agents (the governors) are relatively crude — being voted out at the next election (or being overthrown in an uprising!)
  • Governance in fact has multiple levels of principal-agent relationships: the “people” may elect representatives who in turn appoint or utilize a managerial bureaucracy to run government – in this case the elected officials are the principals and the bureaucracy is the agent.
  • The very large number of individual citizens makes the coordination problem of acting to sanction or reward an agent especially difficult – the simplest form of this sanctioning (or rewarding) is elections, yet the incentive for any given citizen to make the effort to participate is very low: why should I bother to vote when I am only one among millions? (We note that turnout in most countries has been consistently dropping.)

  1. Of course, it is true that technology and the open flow of information can enable certain forms of governance that are otherwise very difficult — for example, modern IT makes it possible literally to hold daily votes of all citizens, something that would otherwise be impossible except in the very smallest of polities. 

  2. cf. the debate on the role of social media in the Arab Spring and generally around “social media revolutions” 

  3. N.B. I have made no allowance for risk aversion here, given that the gain is an expected $10k. However, the basic point would still stand. 
