

Frictionless Data: making it radically easier to get stuff done with data

Rufus Pollock - April 24, 2013 in Featured, Ideas and musings, Labs, Open Data, Open Standards, Small Data, Technical

Frictionless Data is now in alpha at http://data.okfn.org/ – and we’d like you to get involved.

Our mission is to make it radically easier to get data used and useful – our immediate goal is to make it as simple as possible to get the data you want into the tool of your choice.

This isn’t about building a big datastore or a data management system – it’s simply about saving people from repeating the same tasks of discovering a dataset, getting it into a format they can use, and cleaning it up – all before they can do anything useful with it! If you’ve ever spent the first half of a hackday just tidying up tabular data and getting it ready to use, Frictionless Data is for you.

Our work is based on a few key principles:

  • Narrow focus – improve one small part of the data chain; keep standards and tools limited in scope and size
  • Build for the web – use formats that are web “native” (JSON) and that work naturally with HTTP (plain-text CSV is streamable, etc.)
  • Distributed not centralised – designed for a distributed ecosystem (no centralised single point of failure or dependence)
  • Work with existing tools – don’t expect people to come to you; make this work with their tools and their workflows (almost everyone in the world can open a CSV file, and every language can handle CSV and JSON)
  • Simplicity (but sufficiency) – use the simplest formats possible and do the minimum in terms of metadata, but be sufficient in terms of schemas and structure for tools to be effective

We believe that making it easy to get and use data and especially open data is central to creating a more connected digital data ecosystem and accelerating the creation of social and commercial value. This project is about reducing friction in getting, using and connecting data, making it radically easier to get data you need into the tool of your choice. Frictionless Data distills much of our learning over the last 7 years into some specific standards and infrastructure.

What’s the Problem?

Today, when you decide to cook, the ingredients are readily available at local supermarkets or even already in your kitchen. You don’t need to travel to a farm, collect eggs, mill the corn, cure the bacon etc – as you once would have done! Instead, thanks to standard systems of measurement, packaging, shipping (e.g. containerization) and payment, ingredients can get from the farm direct to your local shop or even your door.

But with data we’re still largely stuck at this early stage: every time you want to do an analysis or build an app you have to set off around the internet to dig up data, extract it, clean it and prepare it before you can even get it into your tool and begin your work proper.

What do we need to do for working with data to be like cooking today – where you get to spend your time making the cake (creating insights), not preparing and collecting the ingredients (digging up and cleaning data)?

The answer: radical improvements in the “logistics” of data, through specialisation and standardisation. By analogy with food, we need standard systems of “measurement”, packaging and transport so that it’s easy to get data from its original source into the application where you can start working with it.

Frictionless Data idea

What’s Frictionless Data going to do?

We start with an advantage: unlike for physical goods, transporting digital information from one computer to another is very cheap! This means the focus can be on standardizing and simplifying the process of getting data from one application to another (or from one form to another). We propose work in 3 related areas:

  • Key simple standards. For example, a standardized “packaging” of data that makes it easy to transport and use (think of the “containerization” revolution in shipping)
  • Simple tooling and integration – you should be able to get data in these standard formats into or out of Excel, R, Hadoop or whatever tool you use
  • Bootstrapping the system with essential data – we need to get the ball rolling

Frictionless Data components diagram

What’s Frictionless Data today?

1. Data

We have some exemplar datasets which are useful to a lot of people. These datasets are:

  • High Quality & Reliable

    • We have sourced, normalized and quality checked a set of key reference datasets such as country codes, currencies, GDP and population.
  • Standard Form & Bulk Access

    • All the datasets are provided in a standardized form and can be accessed in bulk as CSV together with a simple JSON schema.
  • Versioned & Packaged

    • All data is in data packages and is versioned using git so all changes are visible and data can becollaboratively maintained.

2. Standards

We have two simple data package formats, each described in an ultra-lightweight, RFC-style specification. They build heavily on prior work. Simplicity and practicality were the guiding design criteria.

Frictionless Data: package standard diagram

Data package: minimal wrapping, agnostic about the data it’s “packaging”, designed for extension. This flexibility is good, as it means the format can be used as a transport for pretty much any kind of data – but it also limits integration and tooling. Read the full Data Package specification.

Simple data format (SDF): focuses on tabular data only and extends data package (data in simple data format is a data package) by requiring that the data be “good” CSVs and that a simple JSON-based schema (“JSON Table Schema”) be provided to describe them. Read the full Simple Data Format specification.
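
To make this concrete, here is a minimal sketch of what a datapackage.json descriptor for a single CSV resource might look like, written out with Python. The package name, file path and field names are invented for illustration, and the exact required keys should be checked against the Data Package and JSON Table Schema specifications rather than taken from this sketch.

    # A minimal sketch (not the authoritative spec) of a datapackage.json
    # descriptor for one CSV resource, written out with Python's json module.
    # The package name, paths and fields below are invented for illustration.
    import json

    descriptor = {
        "name": "country-codes",                   # hypothetical package name
        "title": "Country codes",                  # hypothetical title
        "resources": [
            {
                "path": "data/country-codes.csv",  # hypothetical CSV file
                "schema": {                        # JSON Table Schema
                    "fields": [
                        {"name": "name", "type": "string"},
                        {"name": "iso2", "type": "string"},
                        {"name": "iso3", "type": "string"},
                    ]
                },
            }
        ],
    }

    with open("datapackage.json", "w") as f:
        json.dump(descriptor, f, indent=2)

The point of keeping the descriptor this small is that any tool that can read JSON and CSV can work with it – no special infrastructure required.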

3. Tools

It’s early days for Frictionless Data, so we’re still working on this bit! But there’s a need for validators, schema generators, and all kinds of integration. You can help out – see below for details or check out the issues on GitHub.
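
As a flavour of the kind of tooling we mean, here is an illustrative sketch (not one of the project’s tools) of a tiny validator: it checks that the header row of each CSV resource matches the field names declared in the package’s schema, assuming the descriptor layout sketched above.

    # Illustrative sketch only, not one of the project's tools: check that the
    # header row of each CSV resource matches the field names declared in its
    # JSON Table Schema (assuming the descriptor layout sketched earlier).
    import csv
    import json

    def validate_package(descriptor_path="datapackage.json"):
        with open(descriptor_path) as f:
            descriptor = json.load(f)

        problems = []
        for resource in descriptor.get("resources", []):
            expected = [field["name"] for field in resource["schema"]["fields"]]
            with open(resource["path"], newline="") as csvfile:
                header = next(csv.reader(csvfile), [])
            if header != expected:
                problems.append((resource["path"], expected, header))
        return problems

    if __name__ == "__main__":
        for path, expected, found in validate_package():
            print(f"{path}: expected columns {expected}, found {found}")

A real validator would also check field types, missing values and so on – this sketch only shows how little machinery the formats demand.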

Doesn’t this already exist?

People have been working on data for a while – doesn’t something like this already exist? The crude answer is yes and no. People, including folks here at the Open Knowledge Foundation, have been working on this for quite some time, and there are already some parts of the solution out there. Furthermore, many of these ideas are directly borrowed from similar work in software. For example, the Data Packages spec (first version in 2007!) builds heavily on packaging projects and specifications like Debian and CommonJS.

Key distinguishing features of Frictionless Data:

  • Ultra-simplicity – we want to keep things as simple as they possibly can be. This includes formats (JSON and CSV) and a focus on end-user tool integration, so people can just get the data they want into the tool they want and move on to the real task
  • Web orientation – we want an approach that fits naturally with the web
  • Focus on integration with existing tools
  • Distributed and not tied to a given tool or project – this is not about creating a central data marketplace or similar setup. It’s about creating a basic framework that would enable anyone to publish and use datasets more easily and without going through a central broker.

Many of these are shared with (and derive from) other approaches but as a whole we believe this provides an especially powerful setup.

Get Involved

This is a community-run project coordinated by the Open Knowledge Foundation as part of Open Knowledge Foundation Labs. Please get involved:

  • Spread the word! Frictionless Data is a key part of the real data revolution – follow the debate on #SmallData and share our posts so more people can get involved

Forget Big Data, Small Data is the Real Revolution

Rufus Pollock - April 22, 2013 in Featured, Ideas and musings, Labs, Open Data, Small Data

This is the first in a series of posts. The next post in the series is What Do We Mean by Small Data.

There is a lot of talk about “big data” at the moment. For example, this is Big Data Week, which will see events about big data in dozens of cities around the world. But the discussions around big data miss a much bigger and more important picture: the real opportunity is not big data, but small data. Not centralized “big iron”, but decentralized data wrangling. Not “one ring to rule them all” but “small pieces loosely joined”.

Big data smacks of the centralization fads we’ve seen in each computing era. The thought that ‘hey, there’s more data than we can process!’ (something which has no doubt been true every year since computing began) is dressed up as the latest trend, with its associated technology must-haves.

Meanwhile we risk overlooking the much more important story here, the real revolution, which is the mass democratisation of the means of access, storage and processing of data. This story isn’t about large organisations running parallel software on tens of thousands of servers, but about more people than ever being able to collaborate effectively around a distributed ecosystem of information, an ecosystem of small data.

Just as we now find it ludicrous to talk of “big software” – as if size in itself were a measure of value – we should, and will one day, find it equally odd to talk of “big data”. Size in itself doesn’t matter – what matters is having the data, of whatever size, that helps us solve a problem or address the question we have.

For many problems and questions, small data in itself is enough. The data on my household energy use, the times of local buses, government spending – these are all small data. Everything processed in Excel is small data. When Hans Rosling shows us how to understand our world through population change or literacy he’s doing it with small data.

And when we want to scale up the way to do that is through componentized small data: by creating and integrating small data “packages” not building big data monoliths, by partitioning problems in a way that works across people and organizations, not through creating massive centralized silos.

This next decade belongs to distributed models not centralized ones, to collaboration not control, and to small data not big data.

Want to create the real data revolution? Come join our community creating the tools and materials to make it happen — sign up here:

This is the first in a series of posts about the power of Small Data – follow the Open Knowledge Foundation blog, Twitter or Facebook to learn more and join the debate at #SmallData on Twitter.

Further Reading

  • Nobody ever got fired for buying a cluster
    • Even at enterprises like Microsoft and Yahoo, most jobs could run on a single machine: e.g. the median job size at Microsoft is 14GB and 80% of jobs are less than 1TB, while at Yahoo the estimated median job size is 12GB.
    • “Ananthanarayanan et al. show that Facebook jobs follow a power-law distribution with small jobs dominating; from their graphs it appears that at least 90% of the jobs have input sizes under 100 GB,” the paper states. “Chen et al. present a detailed study of Hadoop workloads for Facebook as well as 5 Cloudera customers. Their graphs also show that a very small minority of jobs achieves terabyte scale or larger and the paper claims explicitly that ‘most jobs have input, shuffle, and output sizes in the MB to GB range.'”
  • PACMan: Coordinated Memory Caching for Parallel Jobs – Ganesh Ananthanarayanan, Ali Ghodsi, Andrew Wang, Dhruba Borthakur, Srikanth Kandula, Scott Shenker, Ion Stoica

Open Data & My Data

Laura James - February 22, 2013 in Featured, Ideas and musings, Open Data, Working Groups

The Open Knowledge Foundation believes in open knowledge: not just that some data is open and freely usable, but that it is useful – accessible, understandable, meaningful, and able to help someone solve a real problem.

A lot of the data which could help me improve my life is data about me – “MyData” if you like. Many of the most interesting questions and problems we have involve personal data of some kind. This data might be gathered directly by me (using my own equipment or commercial services), or it could be harvested by corporations from what I do online, or assembled by public sector services I use, or voluntarily contributed to scientific and other research studies.


Image: “Tape library, CERN, Geneva 2” by Cory Doctorow, CC-BY-SA.

This data isn’t just interesting in the context of our daily lives: it bears on many global challenges in the 21st century, such as supporting an aging population, food consumption and energy use.

Today, we rarely have access to these types of data, let alone the ability to reuse and share them, even when it’s my data, about just me. Who owns data about me, who controls it, who has access to it? Can I see data about me, can I get a copy of it in a form I could reuse or share, can I get value out of it? Would I even be allowed to publish openly some of the data about me, if I wanted to?

But how does this relate to open data? After all, a key tenet of our work at the Open Knowledge Foundation is that personal data should not be made open (for obvious privacy reasons)!

However there are, in fact, obvious points where “Open Data” and “My Data” connect:

  • MyData becomes Open Data (via transformation): Important datasets that are (or could be) open come from “my data” via aggregation, anonymisation and so on. Much statistical information ultimately comes from surveys of individuals, but the end results are heavily aggregated (for example, census data). This means “my data” is an important source, but also that it is essential that the open data community has a good appreciation of the pitfalls and dangers here – e.g. when anonymisation or aggregation may fail to provide appropriate privacy.

  • MyData becomes Open Data (by individual choice): There may be people who want to share their individual, personal, data openly to benefit others. A cancer patient could be happy to share their medical information if that could assist with research into treatments and help others like them. Alternatively, perhaps I’m happy to open my household energy data and share it with my local community to enable us collectively to make sustainable energy choices. (Today, I can probably only see this data on the energy company’s website, remote, unhelpful, out of my control. I may not even be able to find out what I’m permitted to do with my data!)

  • The Right to Choose: if it’s my data, just about me, I should be able to choose to access it, reuse it, share it and open it if I wish. There is an obvious translation here of key Open Data principles to MyData: where the Open Definition states that material should be freely available for use, reuse and redistribution by anyone, we could say that my data should be freely available for use, reuse and redistribution by me.

We think it is important to explore and develop these connections and issues. The Open Knowledge Foundation is therefore today launching an Open Data & MyData Working Group. Sign up here to participate:

This will be a place to discuss and explore how open data and personal data intersect. How can principles around openness inform approaches to personal data? What issues of privacy and anonymisation do we need to consider for datasets which may become openly published? Do we need “MyData Principles” that include the right of the individual to use, reuse and redistribute data about themselves if they so wish?

Appendix

There are plenty of challenging issues and questions around this topic. Here are a few:

Anonymization

Are big datasets actually anonymous? Anonymisation is incredibly hard. This isn’t a new problem (Ars Technica had a great overview in 2009), although it gets more challenging as more data becomes available, openly or otherwise, since more data which can be cross-correlated means anonymisation is more easily breached.

Releasing Value

There’s a lot of value in personal data – Boston Consulting Group claim €1tn. But even BCG point out that this value can only be realised if the processes around personal data are more transparent. Perhaps we can aspire to more than transparency, and have some degree of personal control, too.

Governments

Governments are starting to offer some proposals here such as “MiData” in the UK. This is a good start but do they really serve the citizen?

There’s also some proposed legislation to drive companies to give consumers the right to see their data.

But is access enough?

The consumer doesn’t own their data (even when they have “MiData”-style access to it), so can they publish it under an open licence if they wish?

Whose data is it anyway?

Computers, phones, energy monitors in my home, and so on, aren’t all personal to me. They are used by friends and family. It’s hard to know whose data is involved in many cases. I might want privacy from others in my household, not just from anonymous corporations.

This gets even more complicated when we consider the public sphere – surveillance cameras and internet of things sensors are gathering data in public places, about groups of independent people. Can the people whose images or information are being captured access or control or share this data, and how can they collaborate on this? How can consent be secured in these situations? Do we have to accept that some information simply cannot be private in a networked world?

(Some of these issues were raised at the Open Internet of Things Assembly in 2012, which led to a draft declaration. The declaration doesn’t indicate the breadth of complex issues around data creation and processing which were hotly debated at the assembly.)

MyData Principles

We will need clear principles. Perhaps, just as the Open Definition has helped clarify and shape the open data space, we need analogous “MyData” Principles which set out how personal data should be handled. These could include, for example:

  • That my data should be made available to me in machine-readable bulk form
  • That I should have the right to use that data as I wish (including use, reuse and redistribution if I so wish).
  • That none of my data (where it contains personal information) should be made open without my full consent.

4 Ideas for Defending the Open Data Commons

Open Knowledge France - January 10, 2013 in Featured, Ideas and musings, OK France, Open Data, Open Standards

The following post was written by Simon Chignard, author of L’Open data: Comprendre l’ouverture des données publiques. The post was originally published on Simon’s blog following the launch of the Open Knowledge Foundation French national group, and has been translated by Samuel Goëta from OKFN France.

Open data and the commons: an old story?

There is a direct link between the open data movement and the philosophy of common goods. Open data are an illustration of the notion of common informational goods proposed by Elinor Ostrom, winner of the 2009 Nobel Prize for economics. Open data belong to everyone and, unlike water and air (and other common goods), they are non-exclusive: their use by one person does not prevent use by others. If I reuse an open data set, this does not prevent other reusers from doing so. This proximity between the commons and open data is also suggested by the presence of the initiator of Creative Commons licences, Lawrence Lessig, at the 2007 Sebastopol meeting at which the concept of open data itself was defined.

But despite the strong conceptual and historical linkages, it seems that we, as actors in open data, are often shy to reaffirm the relationship. In our efforts to encourage public and private bodies to embrace open data, we seem almost embarrassed by this cornerstone philosophy. The four proposals I make here aim at one thing: not letting it drop!

Idea #1: defend a real choice in terms of open data licences (“pro-choice” approach)

On paper, that sounds clear: there is a real choice in France in terms of open data licences. On one side, the open licence offered by Etalab (the French government institution in charge of government open data), on the other side, the Open Database License (ODbL). Government services must use the former, some local authorities have chosen the latter, generally based on some conception of the relationship between the commons and open data.

In practice, this choice is hindered by the difficulties, real or perceived, of the ODbL licence. The two licences are distinguished by the ODbL’s obligation to share alike, which is clearly a product of a belief in the common pot (if I use it, I must recontribute). But a strange music is playing in France, which warns against this “contaminating” licence. ODbL is accused of being against business, coming “from abroad”, or being the source of unpredictable dangers (such as counterfeiting).

We find ourselves in a situation where, at the same moment as big projects such as Open Street Map are adopting ODbL, new entrants in open data apply – sometimes in good faith – the principle of the least effort: “that share-alike thing seems complicated, we don’t really know the potential risks, I’d rather choose Licence Ouverte”.

As the initiator of the ODbL licence, the Open Knowledge Foundation should be its first promoter, explaining its mechanisms and opportunities (not only to block Google), so that a real choice of open data licences stays possible (the pro-choice approach)!

But the ODbL licence cannot by itself defend open data as part of the digital commons – below are three further tactics which need to be employed alongside it.

Idea #2: the General Interest Data, G.I.D.

Let’s take an example that matters to everyone, which was addressed during a recent workshop run by Net:Lab – access to housing. In France, who has the best knowledge of the housing market? Who knows rent prices in great detail and in real time, with an address and a complete description of the accommodation? Not the council, nor the tax services, nor even the housing minister – but a private actor in real estate ads.

In France, we have a law for personal data (CNIL law), another for public data (CADA law). But what about data – personal, public or private – which serves the general interest? With a clearer and more dynamic vision of rents, one can imagine that everyone would be more informed on the real prices of the market (while making sure to limit the side effects of transparency).

Without demanding the requisition of the data (and of empty flats), one can imagine a digital tax system encouraging its release. There is already a tax break in France for research – why not for open data? As mentioned previously, this would require the definition of a new class of data, the G.I.D. (General Interest Data), associated with specific rights of access and reuse.

(Obviously, G.I.D. raises as many questions as it tackles – for example who will define general interest?)

Idea #3: Contribution peering: I contribute/I receive

The first period of open data has seen public actors (local authorities or governments) release their data to users, mainly developers. The emerging open data movement is becoming infinitely richer and more complex. Although the division of roles between producers and re-users seems quite established, it is evolving: public and collaborative open data are starting to mutually enrich each other, companies are starting to deliver data on themselves back to clients. How can we design a contribution mechanism which takes into account these evolutions, so as to make “common pots”?

The first step I would suggest is “peering of contribution” – as already exists for ship-positioning systems (AIS data). The collaborative website Marine Traffic, launched in 2007, is now the leading website in the world for tracking global shipping traffic. More than 1,000 contributors (equipped with an AIS receiver connected to the Internet) allow the daily tracking of 65,000 ships. The website now receives more than 2 million page views – per day (source: interview by S. Chignard with Dimitris Lekkas, the Greek scholar who developed the project). Everyone can visualise the data on the map displayed on the website, but if you wish to access the raw data, you need to contribute to the service by connecting a new AIS receiver. Hence contribution peering encourages everyone to enhance the service (Marine Traffic is not the only website doing this – see for example AIS Hub).

Idea #4: Contributive pricing on use (GET>POST)

The last suggestion I would like to make for the development and defence of an open data commons is pricing on use – an idea already mentioned in my blog post about transport data. This would involve a variable pricing scheme for the use of data. APIs lend themselves particularly well to this pricing method.

Let’s imagine, for example, that access to our G.I.D. is free for all, but that a contribution may be asked of the biggest users of an API, who behave almost as free riders (in economic theory, those who make use of others’ contributions without ever contributing themselves). Everyone would then be free to choose whether to contribute by enhancing the data (updating it, correcting it) or by paying out of pocket!

Open Data and Privacy Concerns in Biomedical Research

Sabina Leonelli - November 26, 2012 in Ideas and musings, Open Data, Open Science, WG Open Data in Science

Privacy has long been the focus of debates about how to use and disseminate data taken from human subjects during clinical research. The increasing push to share data freely and openly within biomedicine poses a challenge to the idea of private individual information, whose dissemination patients and researchers can control and monitor.

In order to address this challenge, however, it is not enough to think about (or simply re-think) the meaning of ‘informed consent’ procedures. Rather, addressing privacy concerns in biomedical research today, and the ways in which the Open Data movement might transform how we think about the privacy of patients, involves understanding the ways in which data are disseminated and used to generate new results. In other words, one needs to study how biomedical researchers confront the challenges of making data intelligible and useful for future research.

Efficient data re-use comes from what the Royal Society calls ‘intelligent openness’ – the development of standards for data dissemination which make data both intelligible and assessable. Data are intelligible when they can be used as evidence for one or more claims, thus helping scientists to advance existing knowledge. Data are assessable when scientists can evaluate their quality and reliability as evidence, usually on the basis of their format, visualisation and extra information (metadata) also available in databases.

Yet the resources and regulatory apparatus for securing proper curation of data, and so their adequate dissemination and re-use, are far from being in place. Making data intelligible and assessable requires labour, infrastructures and funding, as well as substantial changes to the institutional structures surrounding scientific research. While the funding to build reliable and stable biomedical databases and Open Data Repositories is increasing, there is no appropriate business model to support the long-term sustainability of these structures, with national funders, industry, universities and publishing houses struggling to agree on their respective responsibilities in supporting data sharing.

Several other factors are important. For instance, the free dissemination of data is not yet welcomed by the majority of researchers, who do not have the time or resources for sharing their data, are not rewarded for doing so and who often fear that premature data-sharing will damage their competitive advantage over other research groups. There are intellectual property concerns too, especially when funding for research comes from industry or specific parts of government such as defence. Further, there are few clear standards for what counts as evidence in different research contexts and across different geographical locations. And more work needs to be done on how to relate datasets collected at different times and with different technologies.

The social sciences and humanities have an important role to play in helping scientific institutions and funders develop policies and infrastructures for the evaluation of data-sharing practices, particularly the collaborative activities that fuel data-intensive research methods. An improved understanding of how data can be made available so as to maximise their usefulness for future research can also help tackle privacy concerns relating to sensitive data about individuals.

When it comes to sharing medical records, it is now generally agreed that providing ‘informed consent’ to individual patients is simply not possible, as neither patients nor researchers themselves can predict how the data could be used in the future. Even the promise of anonymity is failing, as new statistical and computational methods make it possible to retrieve the identity of individuals from large, aggregated datasets, as shown by genome-wide association studies.

A more effective approach is the development of ‘safe havens’: data repositories which would give access to data only to researchers with appropriate credentials. This could potentially safeguard data from misuse, without hampering researchers’ ability to extract new knowledge from them. Whether this solution succeeds ultimately depends on the ability of researchers to work with data providers, including patients, to establish how data travel online, how they are best re-used and how data sharing is likely to affect, and hopefully improve, future medicine. This work is very important, and should be supported and rewarded by universities, research councils and other science funders as an integral part of the research process.

To learn more, read the report ‘Making Data Accessible to All’

Towards a public digital infrastructure: why do governments have a responsibility to go open?

Guillermo Moncecchi - November 1, 2012 in Featured, Ideas and musings, Open Government Data, WG Open Government Data

The most common argument in favor of open data is that it enhances transparency, and while the link may not always be causal, it is certainly true that both tend to go hand-in-hand. But there is another, more expansive perspective on open government data: that it is part of an effort to build public infrastructure.

Does making a shapefile available with all Montevideo’s traffic lights make Montevideo’s government more transparent? We don’t think so. But one of our duties as civil servants is building the city infrastructure. And we should understand that data is mainly infrastructure. People do things on it, as they do things on roads, bridges or parks. For money, for amusement, for philanthropy, there are myriads of uses for infrastructure: we should not try to determine or even guess which those uses are. We must just provide the infrastructure and ensure it will be available. Open data should be seen as a component of an effort to build a public digital infrastructure, where people could, within the law, do whatever they want. Exactly as they do with roads.

When you see open data in this light, several decisions become easier. Should we ask people for identification before giving them our data? Answer: do you ask them for identification to use the street? No, you don’t – so no, you shouldn’t. Why should we use open, non-proprietary standards for publishing data? For the same reason you do not build a street where only certain car brands can pass. What happens if there are problems with my data, which cause problems for its users? Well, you will be liable, if the law decides that … but would you avoid claims for accidents caused by pavement problems by not building streets? Of course you are responsible for your data: you are paid to create it, just as you are paid to build bridges. Every question about open data we can imagine has already been answered for traditional infrastructure.

But of course the infrastructure required to enable people to create an information society goes beyond data. We will give you four examples.

The most direct infrastructure component is hardware and communications. The Uruguayan government recognises this, and is planning to have each home connected with fibre by the end of 2015, with 1 Gb of traffic for free for everybody with a phone line. Meanwhile, since 2007, every public school child has received an OLPC laptop and an internet connection. This programme should be understood as being primarily about infrastructure: education encompasses much more than laptops, but infrastructure enables the development of new education paths.

Secondly, services. Sometimes it’s better to provide services than to provide data. Besides publishing cartography data, in Montevideo we provide WMS and WFS services to retrieve a map just using a URL. Services, as data, should be open: no registration, no access limit. Open services allow developers to use not only government data, but also government computation power, and, of course, government knowledge: the knowledge needed to, say, estimate the arrival time of a bus.
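
As an illustration of what “retrieve a map just using a URL” means in practice, here is a hypothetical sketch of a standard OGC WMS GetMap request in Python. The endpoint and layer name are placeholders rather than Montevideo’s actual service, and the bounding box is only a rough guess at the city’s extent.

    # Hypothetical example of "retrieving a map just using a URL" via a
    # standard OGC WMS GetMap request. The endpoint and layer name are
    # placeholders, not Montevideo's actual service; the query parameters
    # follow the WMS 1.1.1 specification.
    from urllib.parse import urlencode
    from urllib.request import urlretrieve

    WMS_ENDPOINT = "https://example.gov/geoserver/wms"  # placeholder endpoint

    params = {
        "service": "WMS",
        "version": "1.1.1",
        "request": "GetMap",
        "layers": "traffic_lights",              # placeholder layer name
        "styles": "",
        "srs": "EPSG:4326",
        "bbox": "-56.25,-34.95,-56.10,-34.85",   # rough Montevideo extent
        "width": "800",
        "height": "600",
        "format": "image/png",
    }

    url = WMS_ENDPOINT + "?" + urlencode(params)
    print(url)                      # the whole request is just this URL
    urlretrieve(url, "map.png")     # save the rendered map as a PNG

Because the whole request is a plain URL with no registration step, anyone – a browser, a script, another government service – can consume it, which is exactly the point being made about open services.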

Thirdly, sometimes services are not enough, and we have to develop complete software components to enable public servants to do their work. Sometimes these software components should also be part of the public digital infrastructure. The people of Brazil are very clear on this: in 2007 they developed the Portal do Software Publico Brasileiro, where applications developed by or for the government are publicly available. Of course, this is not a new concept: its general version is called open source software. We believe that within this framework of public infrastructure, the debate between open source and proprietary software makes no sense. Nobody would let a company be the owner of a street. If it is public, it should be open.

Finally, there is knowledge. We, as the government, must tell the people what we are doing, and how we are doing it. Our knowledge should be open. We have the duty to publish our knowledge and to let others use it, so that we can participate actively in communities, propose changes, and act as an innovation factor in every task we face. Because we are paid for that: for building knowledge infrastructure.

We do not think government should let others do its work: on the contrary, we want a strong government, building the blocks of infrastructure to achieve its tasks, and making this infrastructure available to people to do whatever they want, within the law.

Exactly the same thing they do with streets.

Is Open Access Open?

Peter Murray-Rust - October 26, 2012 in Featured, Ideas and musings, Open Access

This post is cross-posted from Peter’s blog

I’m going to ask questions. They are questions I don’t know the answers to – maybe I am ignorant, in which case please comment with information, or maybe the “Open Access Community” doesn’t know the answers. Warning: I shall probably be criticized by some of the mainstream “OA Community”. Please try to read beyond any rhetoric.

As background, I am well versed in Openness. I have taken a leading role in creating and launching many Open efforts – SAX, Chemical MIME, Chemical Markup Language, The Blue Obelisk, Panton Principles, Open Bibliography, Open Content Mining – and helped to write a significant number of large software frameworks (OSCAR, JUMBO, OPSIN, AMI2). I’m on the advisory board of the Open Knowledge Foundation and I have contributed to or worked with Wikipedia, Open Streetmap, Stackoverflow, Open Science Summit, Mat Todd (Open Source Drug Discovery) and been to many hackathons. So I am very familiar with the modern ideology and practice of “Open”. Is “Open Access” the same sort of beast?

The features of “Open” that I value are:

  • Meritocracy. That doesn’t mean that decisions are made by hand counting, but it means that people’s views are listened to, and they enter the process when it seems right to the community. That’s happened with SAX, very much with the Blue Obelisk, and the Open Knowledge Foundation.
  • Universality of participation, particularly from citizens without formal membership or qualifications. A feeling of community.
  • A willingness to listen to other views and find means of changing strategy where necessary
  • Openness of process. It is clear what is happening, even if you are not in command.
  • Openness of results. This is universally fundamental. Although there have been major differences of opinion in Free/Open Source Software (F/OSS) everyone is agreed that the final result is free to use, modify, redistribute without permission and for any purpose. Free software is a matter of liberty, not price.
  • A mechanism to change current practice. The key thing about Wikipedia is that it dramatically enhances the way we use knowledge. Many activities in the OKF (and other Open Organisations) are helping to change practice in government, development agencies, companies. It’s not about price restrictions, it’s about giving back control to the citizens of the world. Open Streetmap produces BETTER and more innovative maps that people can use to change the lives of people living right now – e.g. the Haitian earthquake.

How does Open Access measure up against these? Not very well. That doesn’t mean it isn’t valuable, but it means that it doesn’t have obvious values I can align with. I have followed OA for most of the last 10 years and tried to contribute, but without success. I have practiced it by publishing all my own single-author papers over the last 5 years in Gold CC-BY journals. But I have never had much feeling of involvement – certainly not the involvement that I get from SAX or BlueObelisk.

That’s a harsh statement and I will elaborate:

Open Access is not universal – it looks inward to Universities (and Research Institutions). In OA week the categories for membership are:

“click here if you’re a: RESEARCH FUNDER | RESEARCHER/FACULTY MEMBER | ADMINISTRATOR | PUBLISHER | STUDENT | LIBRARIAN”

[1] There is no space for “citizen” in OA. Indeed, some in the OA movement emphasize this. Stevan Harnad has said that the purpose of OA is for “researchers to publish to researchers” and that ordinary people won’t understand scholarly papers. I take a strong and public stance against this – the success of Galaxy Zoo has shown how citizens can become as expert as many practitioners. In my new area of phylogenetic trees I would feel confident that anyone with a University education (and many without) would have little difficulty understanding much of the literature and many could become involved in the calculations. For me, Open Access has little point unless it reaches out to the citizenry and I see very little evidence of this (please correct me).

There is, in fact, very little role for the individual. Most of the infrastructure has been built by university libraries without involving anyone outside (regrettably, since university repositories are poor compared to other tools in the Open movements). There is little sense of community. The main events are organised around library practice and funders – which doesn’t map onto the other Opens. Researchers have little involvement in the process – the mainstream vision is that their university will mandate them to do certain things and they will comply or be sacked. This might be effective (although there are no signs yet), but it is not an “Open” attitude.

Decisions are made in the following ways:

  • An oligarchy, represented in the BOAI processes and Enabling Open Scholarship (EOS). EOS is a closed society that releases briefing papers; membership costs 50 EUR per year and members have to be formally approved by the committee (I have represented to several members of EOS that I don’t find this inclusive and I can’t see any value in my joining – it’s primarily for university administrators and librarians).
  • Library organizations (e.g. SPARC)
  • Organizations of OA publishers (e.g. OASPA)

Now there are many successful and valuable organizations that operate on these principles, but they don’t use the word “Open”.

So is discussion “Open”? Unfortunately not very. There is no mailing list with both a large volume of contributions and effective freedom to present a range of views. Probably the highest-volume list for citizens (as opposed to librarians) is GOAL, and here differences of opinion are unwelcome. Again that’s a hard statement, but the reality is that if you post anything that does not support Green Open Access then Stevan Harnad and the Harnadites will publicly shout you down. I have been denigrated on more than one occasion by members of the OA oligarchy (look at the archive if you need proof). It’s probably fair to say that this attitude has effectively killed Open discussion in OA. Jan Velterop and I are probably the only people prepared to challenge opinions: most others walk away.

Because of this lack of discussion it isn’t clear to me what the goals and philosophy of OA are. I suspect that different practitioners have many different views, including:

  • A means to reach out to citizenry beyond academia, especially for publicly funded research. This should be the top reason IMO but there is little effective practice.
  • A means to reduce journal prices. This is (one of) Harnad’s arguments. We concentrate on making everything Green and when we have achieved this the publishers will have to reduce their prices. This seems most unlikely to me – any publisher losing revenue will fight this.
  • A way of reusing scholarly output. This is ONLY possible if the output is labelled as CC-BY, and only about 5-10 percent of it is. Again this is high on my list and it is the only reason Ross Mounce and I can do research into phylogenetic trees.
  • A way of changing scholarship. I see no evidence at all for this in the OA community. In fact OA is holding back innovation in new methods of scholarship, as it emphasizes the conventional roles of the “final manuscript” and the “publisher”. Green OA relies (in practice) on having publishers and so legitimizes them.

And finally, is the product “Open”? The BOAI declaration is, in Cameron Neylon’s words, “clear, direct, and precise”. To remind you:

“By ‘open access’ to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.”

This is in the tradition of Stallman’s software freedoms, The Open Knowledge Definition and all the other examples I have quoted. Free to use, re-use and redistribute for any lawful purpose. For manuscripts it is cleanly achieved by adding a visible CC-BY licence. But unfortunately many people, including the mainstream OA community and many publishers use “(fully) Open Access” to mean just about anything. Very few of us challenge this. So the result is that much current “OA” is so badly defined that it adds little value. There have been attempts to formalize this, but they have all ended in messy (and to me unacceptable) compromise. In all other Open communities “libre” has a clear meaning – freedom as in speech. In OA it means almost nothing. Unfortunately anyone trying to get tighter approaches is shouted down. So, and this is probably the greatest tragedy, Open Access does not by default produce Open products.

For that reason we have set up our own Open-access list in the OKF.

If we can have a truly Open discussion we might make progress on some of these issues.

[1] Phylogenetic tree diagram by David Hillis, Derreck Zwickil and Robin Gutell.

The future of Open Access

Theodora Middleton - October 24, 2012 in Featured, Ideas and musings, Open Access

At the start of this week, which is Open Access week, we heard from Martin Weller about some of his fears for the future of Open Access. We’ve been collecting a few opinions from around the OKFN on the future of OA. Here’s a selection. What do you think?

Ross Mounce: The future of publicly-funded research is inevitably Open Access.

With increasing realisation that research is best distributed electronically – for speed, economic efficiency, and fairness – Open Access to publicly-funded academic research is inevitable.

It costs money to implement, maintain and enforce artificial paywalls to restrict access to research online. These create frustrating and time-consuming barriers to accessing research. Open Access is thus an obviously beneficial system that simply allows ALL to read, re-use and remix academic research, thereby truly maximising the potential return on investment from these works.

Peter Murray-Rust: Is Open Access Open?

Is Open Access really “Open”? The features of Open I value are:

  • Meritocracy: That doesn’t mean that decisions are made by hand counting, but it means that people’s views are listened to, and they enter the process when it seems right to the community.
  • Universality of participation, particularly from citizens without formal membership or qualifications. A feeling of community.
  • A willingness to listen to other views and find means of changing strategy where necessary
  • Openness of process. It is clear what is happening, even if you are not in command.
  • Openness of results. This is universally fundamental. Although there have been major differences of opinion in Free/Open Source Software (F/OSS) everyone is agreed that the final result is free to use, modify, redistribute without permission and for any purpose.
  • A mechanism to change current practice. The key thing about Wikipedia is that it dramatically enhances the way we use knowledge. Many activities in the OKF (and other Open Organisations) are helping to change practice in government, development agencies, companies. It’s not about price restrictions, it’s about giving back control to the citizens of the world.

How does OA match up? Not very well:

  • It’s not universal: it looks inwards to universities. There is no space for the ‘citizen’, or even the individual.
  • It has oligarchic and closed decision procedures: the Enabling Open Scholarship committee costs 50 euros per year to join, and requires recommendation by an existing member.
  • Discussion is closed: differing opinions aren’t listened to or wanted.
  • The product isn’t, necessarily, open either: whilst a CC-BY license would easily ensure manuscript openness, in fact the term “open access” is applied to almost anything, and means very little.

Only if we can have a truly Open discussion about these issues, will we make any progress.

A longer version of Peter’s thoughts will be published later this week.

Christian Heise: Open Access is the foundation for Open Science.

In February 2002 the Budapest Open Access Initiative (BOAI) launched a worldwide campaign for open access (OA). Even if it did not invent the idea, the initiative articulated the first major international statement and public definition of open access. Now, ten years later, it has made new recommendations for the next ten years (summarized by me in five points):

1. Every institution of higher education should have access to an open access repository (through a consortium or outsourcing), and every publishing scholar in every field and country, including those not affiliated with institutions of higher education, should have deposit rights.

2. Every institution of higher education should have a policy that all future scholarly articles by faculty members and all future theses and dissertations are made open access as soon as practicable, and deposited in the institution’s designated open access repository, preferably licensed CC-BY.

3. Research institutions, including funders, should support the development and maintenance of the tools, directories, and resources essential to the progress and sustainability of open access, including: tools and APIs to convert deposits made in PDF format into machine-readable formats such as XML; the means to harvest from and re-deposit to other repositories; and tools working with alternative impact metrics.

4. The use of classic journal impact factors is discouraged. The Initiative encourages the development of alternative metrics for impact and quality which are less simplistic, more reliable, and entirely open for use and reuse.

5. The open access community should act in concert more often, and we should do more to make universities, publishers, editors, referees and researchers aware of standards of professional conduct for open access publishing. We also need to articulate more clearly, with more evidence, and to more stakeholder groups, the advantages and potential of open access.

These recommendations are pretty detailed on what has to be done to achieve a sustainable open access process in the near future. In the longer term, however, open access has to evolve into the holistic concept of Open Science (open access + open science data).

Tom Olijhoek: Open Interconnected Specialist Communities

In my view the future of science will ultimately depend on the formation of many interconnected scientific communities covering all possible areas. Making optimal use of the internet and social media, scientists and citizens within and between these communities will collaborate to produce more useful knowledge than ever before and to store, maintain and provide information for those who seek it. Especially for medical scientists in the developing world, these communities would provide vehicles for innovation, health improvement and development in their respective countries. Following this line of thought, the only hope of winning the battle against malaria, AIDS, neglected diseases and other tropical infections will lie in free access to and sharing of information, and in joining forces by way of social media and open science communities. MalariaWorld is our first experiment with this kind of specialist open access scientific community.

Laurent Romary: Open access is a state of mind

Open access is a state of mind for the researcher. Any means of furthering the dissemination of knowledge – publications, data, expertise – is a good one. One may doubt whether the commercial publishing system as we currently know it truly meets researchers’ expectations or the challenges of interconnecting knowledge. The research infrastructures of tomorrow, managed by researchers themselves, will have to include virtual research environments in which every scientist (in the hard sciences as well as the humanities) will manage their observations, comments and results, and will choose freely, without financial barriers, whether to disseminate them or have them evaluated.

The great Open Access swindle

Martin Weller - October 22, 2012 in Featured, Ideas and musings, Open Access

This week is Open Access week, and we’ll be running a few pieces mulling over where Open Access has got to, and where it’s going. Here Martin Weller discusses some reservations…

The Cunning Thief, by Chocarne-Moreau. PD

Just to be clear from the outset, I am an advocate for open access, and long ago took a stance to only publish OA and to only review for OA. I’m not suggesting here that open access is itself a swindle, but rather that the current implementation, in particular commercial publishers adopting Gold OA, is problematic.

In my digital scholarship book, I made two pleas, the first was for open access publishing, and the second was for scholars to own the process of change. On this second point, the book ends thus:

“This is a period of transition for scholarship, as significant as any other in its history, from the founding of universities to the establishment of peer review and the scientific method. It is also a period that holds tension and even some paradoxes: it is both business as usual and yet a time of considerable change; individual scholars are being highly innovative and yet the overall picture is one of reluctance; technology is creating new opportunities while simultaneously generating new concerns and problems…. …For scholars it should not be a case of you see what goes, you see what stays, you see what comes, but rather you *determine* what goes, what stays and what comes.”

The open access element has proceeded faster than even I imagined when writing this back in 2010/2011. The Finch Report can be seen as the crowning achievement of the open access movement, in setting out a structure for all UK scholarly articles to be published as open access. But in rather typical “you academics are never happy” mode I’ve become increasingly unhappy about the route Open Access is taking. And the reason is that it fails to meet the second of my exhortations, in that it is a method being determined by the publishing industry and not by academics themselves.

The favoured route is that of Gold OA, under which authors pay publishers to have open access articles published, usually through research funds. This is good in that it means these research papers will be openly available to all, but bad from a digital scholarship perspective. And here’s why:

1) Ironically, openness may lead to elitism. If you need to pay to publish, then, particularly in cash-strapped times, it becomes something of a luxury. New researchers, or smaller universities, won’t have these funds available. Many publishers have put in place waivers for new researchers (PLoS are excellent at this), but there’s no guarantee of these, and after all, the commercial publishers are concerned with maximising profits. If there are enough paying customers around then it’s not in their interest to give out too many freebies. And it also means richer universities can flood journals with articles. Similarly, those with research grants can publish, as this is where the funding will come from, and those without can’t. This will increase competition in an already ludicrously competitive research funding regime. “You’re either in the boat or out of it” will be the outcome. The Scholarly Kitchen blog has a good piece on OA increasing the so-called Matthew Effect. It would indeed be a strange irony if those of us who have been calling for open access because of a belief in wider access and a more democratic knowledge society end up creating a self-perpetuating elite.

2) It will create additional cost. Once the cost is shifted to research funders, then the author doesn’t really care about the price. There is no strong incentive to keep costs down or find alternative funding mechanisms. This is great news for publishers who must be rubbing their hands with glee. It is not only a licence to carry on as they were, but they have successfully fended off the threat of free publication and dissemination that the internet offers. Music industry moguls must be looking on with envy. The cost for publication is shifted to taxpayers (who ultimately fund research) or students (if it comes out of university money). The profits and benefits stay with the publishers. It takes some strained squinting to view this as a victory.

Stevan Harnad argues again for Green OA, claiming that

“Publishers – whose primary concern is not with maximizing research usage and progress but with protecting their current revenue streams and modus operandi – are waiting for funders or institutions to pledge the money to pay Gold OA publishing fees. But research funds are scarce and institutional funds are heavily committed to journal subscriptions today. There is no extra money to pay for Gold OA fees.”

3) It doesn’t promote change – in my book I also talked about how a digital, networked and open approach could change what we perceive as research, and how much of our interpretation of research is dictated by the output forms we have. So, for instance, we could see smaller granularity of outputs, post-publication review, different media formats, all beginning to change our concept of what research means. But Gold OA, by reinforcing the power of commercial publishers, simply maintains the status quo and keeps the peer-reviewed 5000-word article as the primary research output that must be attained.

I’ve heard Stephen Downes say that as soon as any form of commercial enterprise touches education it ruins it (or words to that effect). I wouldn’t go that far, I think for instance that commercial companies often make a better job of software and technology than universities, but academic publishing is such an odd business that maybe it doesn’t make sense as a commercial enterprise. As David Wiley so nicely parodies in his trucker’s parable, there isn’t really another industry like it. Academics (paid by the taxpayer or students) provide free content, and then the same academics provide free services (editorship and peer-review) and then hand over rights and ownership to a commercial company, who provide a separate set of services, and then sell back the content to the same group of academics.

I know a few people who work in commercial publishing, and they are smart, good people who genuinely care about promoting knowledge and publishing as a practice. This is not a cry for such people to be out on the streets, but rather for their skills and experience to be employed by and for universities, the research communities and the taxpayer rather than for shareholders. In this business Downes’ contamination theory seems to hold: there is simply no space in the ecosystem for profit to exist, and when it does it corrupts the whole purpose of the enterprise, which is to share and disseminate knowledge.

Gold OA is not inherently detrimental. There are plenty of non-profit publishers who operate this model; they keep costs down to a minimum and have generous fee-waiver policies. They are, after all, not concerned with making a profit but with knowledge dissemination. Other models exist also, including subsidised university presses, centralised publishing platforms, etc. The swindle is that there is no real incentive to explore these possibilities, because the standard model has been reinforced through the manner in which OA has been implemented. As Tim O’Reilly comments, “If we’re going to get science policy right, it’s really important for us to study the economic benefit of open access and not just accept the arguments of incumbents”.

[An earlier version of this post was originally posted on Martin Weller’s blog]

Video: Julia Kloiber on Open Data

Rufus Pollock - October 3, 2012 in Ideas and musings, Interviews, OK Germany, OKFest, Our Work

Here’s Julia Kloiber from OKFN-DE’s Stadt-Land-Code project, talking at the OKFest about the need for more citizen apps in Germany, the need for greater openness, and how to persuade companies to open up.
