Open data and the voluntary sector
August 2nd, 2010
The following guest post is from David Kane, who is a research officer at NCVO. He blogs on NCVO’s website and can be found on twitter @kanedr. The author wishes to acknowledge Louise Brown from NCVO’s ICT and Collaboration Team for her valuable input.
Here at the National Council for Voluntary Organisations (NCVO) we’ve recently started taking an interest in open data, and its implications for charities and the voluntary sector.
We know that some voluntary organisations which specialise in open data have been leading the charge - the Open Knowledge Foundation is a not-for-profit company, mySociety is a registered charity - and often the most exciting and innovative uses of open data are made by volunteers in their spare time. But we know that many voluntary organisations find it difficult to find the time and skills to develop their ICT capabilities, and can find the challenge of implementing new technologies in their organisation daunting. This is daunting not just because of the time and resources required, but also because it requires a change in organisational culture.
Our interest in open data culminated recently when the Coalition government published its paper Building the Big Society, which included five themes. The NCVO research team have looked at the evidence behind each of these themes - you can see the results on our website.
The first four themes relate well to NCVO’s usual work - they talk about participation, transferring power, communities and supporting organisations - but the fifth (publish government data) was less familiar, and is a new concept to much of the voluntary and community sector.
Looking at this fifth theme, I came up with a number of opportunities and challenges that open data presents for charities and voluntary organisations:
1. Open data will give charities new ways to find and share information on the needs of their beneficiaries - who needs their services most and where they are located.
The sharing of information will be key to this - it’s not just about using data that the government has opened up, but also opening your own data. Organisations like New Philanthropy Capital and NCVO’s own Strategy and Impact unit stress the need for charities to demonstrate the impact that their services have - opening their data can help to do this.
This can create a more joined up service for users, provide cost savings and mean that organisations can meet unmet needs. But organisations need to think about how to access and manipulate this information - will it need specialist staff or volunteers? Some organisations might need outside help to be able to do this.
2. Charities will be able to use the evidence found in open data to boost their campaigns and lobby government. Voluntary sector organisations have been at the forefront of opening up data.
Providing services directly to those in need isn’t the only way that charities help the most vulnerable - their campaigning and lobbying also has a very important role in this. Open data can help charities speak truth to power, whether it’s challenging government spending or using their own data to lobby for better targeting of services. Data that charities gather themselves through their beneficiaries and communities can make the case even more forcefully, but again charities need to have the skills to be able to do this.
3. Many of the skills needed to create, access and use open data are not yet widespread in the voluntary sector. There is a cost to effectively creating and using this data, while sharing commercially sensitive data could reduce competitiveness.
This is a really important point - the uneven spread of the skills, knowledge and resources needed means that some organisations risk being left behind while others use open data to its full potential. At the moment much of the work being done is by interested and passionate individuals, but there may not be enough of these to go around.
4. As open data becomes embedded in government, voluntary organisations which contract with government may be compelled to produce and share data as part of those contracts.
This is a bit hypothetical at the moment - I’ve seen no evidence of this happening in government contracts yet. But it seems possible to me that as a culture of open data becomes embedded in government, this culture will come to inform its contracting arrangements. If this does happen, charities will need to be ready.
Conclusion
So how does the voluntary sector keep up with the open data revolution? Well it needs to make sure that staff and volunteers have the skills and knowledge needed to create and use open data. Charities need to learn from each other too, particularly by talking to organisations that are ahead of the game. Perhaps most importantly, examples of the power of open data will show charities how important this is.
The following interview was published earlier this week by the European Journalism Centre in the Netherlands.
In recent years the practice and philosophy of making data freely available for use and re-use has been taken up by many different institutions, from national governments to international organizations such as the World Bank. Journalists too have started to tap into the potential of free accessible data.
Now, with major news providers such as The Guardian beginning to open up the data they work with, the opportunities for producing visualisations or web mash-ups based on this information readily exist.
Increasing online open data availability puts the processing power into journalists’ hands; rather than relying on outside specialists such as policy makers to provide the insights, raw data can now be analysed and interpreted in newsrooms.
This is the emerging field of data-driven journalism, in which journalists gather, analyse and visualise ‘big’ data and combine it with compelling, credible storytelling.
European Journalism Centre (EJC) questioned Jonathan Gray, community coordinator at the Open Knowledge Foundation, a UK non-profit organisation dedicated to promoting open knowledge internationally, about how journalists can exploit the potential of open data. Jonathan is also the founder of WhereDoesMyMoneyGo.org, a web application which allows users to explore and visually represent UK public spending.
EJC: Where and how can journalists access free and open data sources on the Internet?
Jonathan Gray: There is currently a huge wealth of freely accessible data on the Internet, scattered over web real estate belonging to government departments, academics, NGOs, news organisations, technologists and others. Yet despite the fact that there is so much information out there, it is not always easy to find the exact information you are looking for, get hold of raw data sources (as opposed to reports, articles and other material about these sources), and find out whether or not you can freely reuse the information (just because something is accessible online does not always mean you’re allowed to reuse or republish it).
To address these issues we started the CKAN project, which aims to make it easy to find collections of documents and datasets which anyone can freely re-use for any purpose.
CKAN is now being used by the UK government in its data.gov.uk project, and we are helping open government data advocates around the world set up new instances to track sources of open data in over a dozen countries. We’re also working hard to make sure that other data catalogues in other countries are interoperable with CKAN, and to promote uptake of the project in different communities creating and using public datasets - from geospatial data analysts to civic hackers, from climate modellers to semantic web technologists. We hope that CKAN will become an international, multilingual, distributed one-stop shop for open data.
If you are interested in using this technology to set up a data registry in your country (whether you’re an advocate or a civil servant), we’d love to hear from you.
EJC: Would journalists need special skills, like programming, to explore and analyze these datasets?
JG: Having some programming skills is no doubt useful for journalists whose work may depend on extracting, analysing, and understanding information in large databases. This may be particularly valuable for investigative journalists trawling public sources to build up a richer picture of complex chains of events or states of affairs. However new digital technologies are making it increasingly easy for journalists without programming skills to explore and analyse datasets.
Social web services such as Many Eyes or Google Chart Tools mean that anyone can visually represent data sets ‘on the fly’. Free and open source desktop applications also enable people to drill down into databases in increasingly sophisticated ways.
While these kinds of tools can go a long way, there are obviously still limits as to what non-technical journalists can do. A good example of this is the recent release of the COINS data in the UK – a huge release of information on public spending, unprecedented in its scope and detail. We had journalists from many different national newspapers and news outlets calling us to ask what was in the data, and in particular whether we had found any good stories. At the time several people expressed disappointment with the release, basically complaining that getting useful or interesting information from the database was like getting blood from a stone. The good news is that since then we’ve seen at least half a dozen new projects which let people sort, search, explore and comment on the data, and no doubt we will see many more in the next few months and years.
To quote Peter Murray-Rust, OKF Advisory Board member and tireless advocate of open data in chemistry, “Data is difficult”. Whether we’re talking about statistical data, environmental data or spending data, the chain from production to presentation can often be long and complicated. The more people involved in the process of cleaning up, checking, interpreting, and visually representing datasets the better. Hence at the OKF we are strong advocates of a community-driven approach, involving experts from across the board, as well as interested and motivated citizens.
We hope to move to a situation where rather than a single official point of contact for datasets (whether from national governments, international bodies or NGOs), we have an ecosystem of open data with lots of datasets connected together, accessible via many different interfaces and with plenty of tools to help people understand and interact with the data.
Rather than the traditional treasure hunt, for example looking for data buried deep on an official website or PDF document, working out how to use the shiny front end interface, etc., we hope there will be more of a two way relationship with information around us, i.e. delivery on demand according to interests, read/write access, commenting, telling stories with data, enabling people to embed dynamic visual representations which link back to source, and so on. By explicitly opening up datasets for others to freely reuse without restriction we allow a thousand flowers to bloom.
While the core task of journalists will presumably continue to be much the same, i.e. interpreting, communicating and framing the information in meaningful narratives, with comment and analysis, I think the precise division of labour between journalists and others remains to be seen. Hopefully we’ll see some boundaries begin to blur.
EJC: What are the common challenges associated with data usage, production and presentation? And what is there to learn for journalists who want to go in this direction from your perspective?
JG: As I alluded to above, piecing together and interpreting data can be hard. (Getting data in the first place can also be hard - but that’s another story!) There are lots of great examples of this, from Ben Goldacre’s analysis of a recent Guardian article on NHS death rates to dodgy newspaper graphs (see this presentation from Simon Field, CTO of the UK’s Office for National Statistics).
We’ve experienced a number of difficulties when trying to make sense of UK spending data as part of the Where Does My Money Go? project, including (but by no means limited to) missing data, different figures from different government sources, retired schema, absent keys, changing categories, multiple different codes, data delivered on thousands of sheets of paper, and so on.
Before this project I never knew working out where public funds are spent could be so much like theology, with some intangible, ineffable base layer (the actual spending) and many different denominations of interpreting this for different purposes, with different rules, different explanations, different calendar dates and different terminology. Luckily it’s all falling into place, but not without talking to lots of helpful government folks and outside experts.
To understand a dataset is often not just to understand a bunch of facts about the world, but to understand all kinds of assumptions and processes involved in the production of that data - which can be critical, for example, in understanding whether the books have been cooked. Another good reason why it’s better to have many pairs of eyes and hands!
In terms of things for journalists to learn, I guess I’d like to see us move from a situation where we consult data on a case-by-case basis in order to provide facts for stories, to one where we allow stories to emerge from the data (i.e. data-driven journalism!). For example, more systematically following transcripts and drilling down into datasets from national parliaments and international bodies such as the EU or the UN. Also being able to give events a broader sense of context, for example by being able to link directly to documents and datasets and letting people discuss and comment on these.
I think visualisation tools may turn out to be very powerful with respect to giving people a ‘bigger picture’. We are currently taking baby steps, but I’m sure things will pick up pace as the technologies begin to get more robust, and we start to build a visual language for interfacing with datasets, similar to the kinds of graphical user interfaces we are all now used to with our operating systems, or on web browsers and web applications.
EJC: Do you think journalists are capable of producing offerings similar to your Where does my money go? project or to CKAN?
JG: Yes of course! There is no compelling reason why journalists shouldn’t continue to contribute to these kinds of projects. I guess it’s more a question of people’s time and priorities.
Starting to have more provision for learning to code on journalism courses might be a good way to systematically encourage journalists to start working on and engaging with these kinds of projects. But I think it’s often also to do with motivation. Many really good programmers I know had no formal training, but learned by having a project they wanted to do and teaching themselves with input from various free/open source software communities. Formal training is often neither necessary nor sufficient to code on projects like Where Does My Money Go, They Work For You, and so on - it’s far more important to set aside the time, and stick at it.
Having more widespread acceptance of this sort of thing as something which is ‘legitimate’ for journalists to undertake may well help people to justify spending their time in this way (to themselves, to employers, etc). There already seems to be growing interest here - e.g. from BBC, Guardian, Telegraph, New York Times, and so on – so hopefully it won’t be long before programming comes to be accepted as a valuable part of a journalist’s repertoire.
EJC: Your work provides a great example, though we think one aspect is missing: storytelling. Would you consider presenting short stories, interviews, videos on top of your data in the future?
JG: Yes definitely. I think one of the main things that sets Hans Rosling’s Gapminder apart from other similar projects is his wonderfully entertaining set of lectures where he talks about an issue (population growth, development, health, quality of life, etc) with reference to Trendalyzer, the Gapminder visualisation tool. We’re really proud to have Hans on the Open Knowledge Foundation’s advisory board, and I think he’s someone who really understands the relationship between stories and interactive visualisation tools. On the one hand his narrative makes the colourful bubbles meaningful, on the other hand the colourful bubbles are indispensable in his lectures! A good illustration of why a picture is worth a thousand words is to watch any of his videos where he narrates hundreds of bubbles moving around the screen - highlighting a significant move like a football commentator highlighting something in a game. While the words may aid understanding and capture key aspects, they are no substitute for the image. The videos are also clever because after you’ve watched a few, you have a pretty good idea how to use the tool to ask your own questions and build your own visualisations.
These kinds of tools are new, and can often not be particularly intuitive to use, partly because we have no ‘accepted’ way of doing things yet. Tutorials and help menus can be boring, and the video lectures are a great way of teaching people to use the tool without them noticing! We’d definitely like to do more of this kind of thing with our own projects. In general I think that storytelling will play a very important part in new visualisation tools - from creating narratives within the visual design and the user’s experience, through to being able to seamlessly embed and integrate visualisations within external articles, texts and other media.
Belarusian translation of the Open Knowledge Definition (OKD)
July 28th, 2010
We’ve just added a Belarusian translation of the Open Knowledge Definition thanks to Patricia Clausnitzer!
If you’d like to translate the Definition into another language, or if you’ve already done so, please get in touch on our discuss list, or on info at the OKF’s domain name (okfn dot org).
Data Driven Journalism, Amsterdam, 24th August 2010
July 27th, 2010
I’m very much looking forward to an event on Data Driven Journalism in Amsterdam in late August, which will bring together representatives from various media organisations (e.g. The New York Times, The Financial Times, The Times, …) and other stakeholders for a day of talks and discussions on the role of new digital technologies in the future of journalism:
The European Journalism Centre in collaboration with the University of Amsterdam organises the first round table on data-driven journalism on 24 August in Amsterdam. The one day event brings together specialists in fields which intersect with data-driven journalism: data mining, data visualisation and multimedia storytelling to discuss the possibilities of this emerging field, examine and understand the needed tools and workflows, and spread the know-how for data-driven journalism. What can we learn from the existing projects? How can we integrate the existing tools in the journalistic workflows? What skills are needed to enter this field? These are just a few of the issues which will be addressed in this event.
In particular I’m keen to talk with the other participants about open data, and the role journalists can play in helping to open up official information and present it to the public in new ways. They asked me for a quote to use for the event:
Opening up content and data produced by public bodies will enable new forms of reportage as well as a new generation of services enabling the public to participate in the news making process. New tools to analyse, represent, deliver and give context to public data are beginning to revolutionise the way we understand large and complex issues, from Hans Rosling’s analysis of flu statistics, to the Guardian MP expenses crowdsourcing tool, and to the Afghanistan Election Data project. An ecosystem of open data that anyone can reuse or contribute to will be critical for a new generation of data driven journalism to flourish.
You can find out more at:
If you’re going to be in Amsterdam, participation is free and you can register here.
Introducing the Panton Papers
July 26th, 2010
Peter Murray-Rust — Cambridge University chemist, Open Knowledge Foundation Advisory Board member and tireless advocate for open data in chemistry — has recently started a series of blog posts about open data, focusing on issues related to the Panton Principles for open data in science.
The first is called Open Data: why I need the Open Knowledge Foundation, and in it he introduces some of the issues he wishes to discuss and gives his vision for the role he hopes the OKF community will play in relation to open data. He writes:
After a period of silence on this blog (but not on the Open Knowledge Foundation lists) I hope to publish a flurry of ideas on Open Data. There is no doubt that “Open Data” has arrived and there is enormous interest. (By contrast when I started to investigate it 5 years ago there was nothing). It’s desperately important, more complex than I ever imagined, and it’s critical to address it immediately, responsibly, dispassionately and inclusively. If we manage to set out the concerns now, we may manage to avoid the worst problems that were encountered by the Open Source and later Open Access movements. [They have made enormous progress and without their footsteps Open Data would fall into many of the same pitfalls. But Open Data is Difficult – a phrase I shall repeat frequently.]
I am putting my faith and energy into the Open Knowledge Foundation – its people and its infrastructure. This is because it’s an organisation which is wide-ranging (it deals with open content of all sorts, open metadata, services, etc.). It has great expertise in legal problems and solutions (where these are necessary) and also how to find alternative approaches. It’s neutral (apart from urging Openness and developing the infrastructure). It’s very professional, and realises that ideas without implementation have less weight. So there is an impressive range of software and information skills. I am reminded of my favourite motto (from the IETF) – “rough consensus and running code”, one of the greatest productive mantras of our time.
The enthusiasm is palpable. [Today I had a breakfast Skype session with Jonathan Gray (coordinator of OKF) and it's all about how we can make things happen fast and responsibly.] The OKF works through Working Groups and discussion lists, and so when I had a concern about Open Data I brought it to the OKF and – after a great deal of work – we emerged with the Panton Principles which have now been translated into several languages by OKF members.
Simply, the OKF amplifies the visions of individuals from the almost-impossible to the attainable.
So I am putting some ideas into the OKF melting pot to see what emerges.
In the next post, titled Open Data: The concept of Panton Papers, he lays out his ideas for the Panton Papers:
The current theme is “Panton Papers”. The idea is that part of the value of the Panton Principles is that the whole document is short and the key points are simply made. But the “Principles” can therefore only address the motivation and the procedures for Open data in a general manner, and many of the problems are in the details. I believe that many of the problems in Open Access (which is simpler than Open Data) arose because not enough communal effort was given to the practice of Open Access and I want to avoid as many OD problems as possible before they occur.
Over the last 2 years (when Open Data has started to become important and discussed) I have seen several potentially difficult areas. I’ll simply list the ones I have thought of here and then outline the idea of the Panton Papers. This discussion is mirrored in part by the OKF open-science discussion list and you may wish to subscribe. There’s also a regular working group on open-science. (Almost everything in OKF is Open, but it may take a little while to find out where you want to be!). The issues that I currently have are:
- What is data? Images? Graphs? Tables? Equations? Accounts of experiments? This is a major problem and almost completely unexplored. Without solving this we are held back 10 years or more in our ability to re-use the primary scientific literature (e.g. by closed-access publishers who claim that factual graphs belong to them).
- Why should data be open? (and when should it not be?). I’ve put forward ideas here and here. They range from moral, to legal/quasi-legal to utilitarian.
- Who owns data? This is one of the trickiest areas – there is legal and contractual ownership and there is moral ownership. Generally there is far far too much “ownership” of data.
- When should data be released? This is a key question (see here for an example). Some communities have solved it – most haven’t addressed it and will have to go through the rigour of working out release protocols.
- How and where should data be exposed? I am strongly of the opinion that we need domain-specific repositories (which could be national or international) and the Institutional Repositories are almost never the best place to expose data (I expect and welcome alternative opinions). The “how” depends on understanding what the data and metadata are and is increasingly dependent on specialist software and information standards. “Archival” is often the wrong word to use.
- Datamining and textmining. Most authors, publishers, repository owners are unaware of the enormous power of automated analysis of the literature. Some closed access publishers expressly forbid these activities. We have to liberate the right of the scientific community to do this enthusiastically and efficiently.
- Reproducibility. Science is based on reproducibility – we expect to be able to replicate the “materials and methods” of an experiment and to try to falsify its claims. Physical materials are beyond the immediate discussion (though this may change) but much science is now based on computing. It should be possible to replicate simulations, data cleaning, data analysis, model fitting etc. This is a tricky area. It is difficult (though with virtualization and the cloud is becoming easier) to reproduce the computing environment. Large or complex data sets are a major problem but must be addressed. This is not without monetary cost.
I may add more.
The idea is that each of these is a “Panton Paper”. It may or may not be crafted in Pantonia (the hectare of the Chemistry Department, The OKF headquarters, and the Panton Arms in Cambridge UK). Everything I now write is mutable.
Each paper will have a top level document of similar form to the Panton Principles, i.e. 3-8 ideas, with short explanatory paragraph(s). This document will be crafted by the OKF in public view on a wiki or Ether/Piratepad. Anyone can take part. We shall welcome contributions from a wide range of disciplines (in fact this is essential). At some stage version 1.0 of the paper will be frozen and will be formally published. We have an offer from a major publisher to do this and I am hoping we can announce this at Open Science Summit.
The Paper should carry a wider range of links to other essays in Open Data and should carry examples from different disciplines. For example there is a well tried and accepted process in many areas of bioscience and astronomy as to what, when and how data get published.
Peter has started drafting ideas for the first two of these at:
If you’d like to get stuck in, please head on over to the open-science list and say hi!
How to Visualise Worldbank Data with Google Maps
July 23rd, 2010
The following guest post is from Holger Drewes, who is a member of Open Knowledge Foundation Germany and the Open Data Network in Berlin.
As interfaces for open datasets from political and societal institutions become more and more available, the possibilities for easy and uncomplicated data visualization are expanding in very promising ways. With a little programming knowledge, or a bit of support, journalists and bloggers are able to back up the conclusions in their articles with facts in an illustrative way, using diagrams or maps. Even further, they can create, demonstrate, or underline interrelations through the integration of different datasets, using programming interfaces.
Very active in offering such programming interfaces (APIs) is the Worldbank, which provides an API for querying indicators relevant for describing the world’s development status, like birth rates, CO2 emission levels, and education expenditure. The aim of this article is to show how such data can be used, particularly, as an example, how it could be visualized on a map with the help of Google Maps. The following map shows the income level in different countries through colored pins. Clicking on a pin brings up additional information about the capital of a country and the meaning of the corresponding colour. The explanation is (hopefully) not too technical, so that it should be comprehensible to non-programmers as well, at least in its essentials. Some programming skills will be necessary for a realization, but it shouldn’t take more than 2-3 hours.
The following three steps are necessary on the way to your own Worldbank open data mashup:
1. Worldbank API - Select indicators and formulate query
The API from the Worldbank can be used directly via a URL in the browser. You can choose which indicator or which country you wish to find results for by specifying the parameters in the URL. For example, the following query:
returns a list with all countries with a low income level (LIC). (You can get a more structured view of the result by selecting the source view in your browser.) A more detailed explanation about the usage of the API can be found on the website of the Worldbank. The fact that you can use the API directly through the browser also gives you the chance to play around a bit with the different parameters to get a better feeling for what the API can do. Once you have created a useful API URL (in our example: http://open.worldbank.org/countries?format=json&per_page=500), the URL can be queried through the corresponding programming function (e.g. cURL for the programming language PHP used in our example).
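For readers who would like to see this first step in code, here is a minimal sketch. It is written in Python rather than the PHP/cURL used for the original example, and it makes some assumptions: the endpoint is the one quoted above (it may well have moved since), the response is assumed to be a JSON array of the form [paging information, list of country records], and the field names (name, capitalCity, incomeLevel, longitude, latitude) are assumed too. The helper names and the sample records are invented purely for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint as quoted in the article; it may no longer be live.
BASE_URL = "http://open.worldbank.org/countries"


def build_query_url(base_url=BASE_URL, **params):
    """Append the query parameters (e.g. format, per_page) to the base URL."""
    return base_url + "?" + urlencode(params)


def fetch_text(url, timeout=30):
    """Download the raw response body as text (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_countries(payload):
    """Return the list of country records from a JSON response.

    ASSUMPTION: the payload is a two-element JSON array,
    [paging_info, [record, record, ...]].
    """
    data = json.loads(payload)
    if isinstance(data, list) and len(data) == 2 and isinstance(data[1], list):
        return data[1]
    raise ValueError("unexpected response layout")


def select_by_income(records, level_id):
    """Keep only the records whose income level id matches (e.g. 'LIC')."""
    return [r for r in records if r.get("incomeLevel", {}).get("id") == level_id]


# Made-up sample standing in for a real response (not real World Bank data).
SAMPLE_PAYLOAD = json.dumps([
    {"page": 1, "pages": 1, "per_page": "500", "total": 2},
    [
        {"name": "Examplia", "capitalCity": "Sample City",
         "incomeLevel": {"id": "LIC", "value": "Low income"},
         "longitude": "10.5", "latitude": "-3.25"},
        {"name": "Samplestan", "capitalCity": "Testville",
         "incomeLevel": {"id": "HIC", "value": "High income"},
         "longitude": "25.0", "latitude": "40.1"},
    ],
])
```

With network access, `parse_countries(fetch_text(build_query_url(format="json", per_page=500)))` would fetch and decode the live list; offline, `parse_countries(SAMPLE_PAYLOAD)` lets you try the rest of the pipeline on the made-up records.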
2. Convert the result to a format readable by Google Maps
Now the queried data has to be converted into a format readable by Google Maps. A good way to go here is KML, which is a descriptive language for geodata, used for example to locate places on a map or annotate them with additional information. There is also the alternative possibility of using the Google Maps API directly to visualize the data. The advantage of the KML option is that at the end of the process it comes out with a code snippet which can be copied straight into weblogs and content management systems. The datasets from the Worldbank API are returned in XML or JSON, both structured data formats used to represent several datasets of different kinds and corresponding properties. Generally there are standard programming functions to process these formats in the different languages, for example in PHP the function “json_decode()” is used to read datasets given in the JSON format. Now you can loop through the single datasets and write the properties which should be presented on a map into a KML string. A list of the possible KML properties which can be used can be found in the KML documentation hosted by Google. In our example the main properties are the name of the country and the income level, which should be shown when selecting a pin on the Google map, and the longitude and latitude of the capital of the corresponding country (see illustration below). In the process of this transformation it is also possible to carry out some graphical formatting, for example representing every country with a low income level through a red pin. The created file now has to be saved somewhere on a web server as filename.kml, so that it is accessible through the web.
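The conversion described in this step could look roughly like the sketch below. Again this is Python rather than the PHP of the original example, and it is only one possible way of doing it: the record field names are assumptions about the API response, the sample records are made up, and the colour scheme (red for low income, blue for everything else) is a hypothetical choice. Pins are coloured by tinting the default icon with the KML color element, which is written in the order alpha, blue, green, red.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"

# Hypothetical colour scheme; KML colours are written aabbggrr.
PIN_COLOURS = {"LIC": "ff0000ff"}   # opaque red for low income
DEFAULT_COLOUR = "ffff0000"         # opaque blue for everything else

# Made-up records in the assumed response layout (not real data).
SAMPLE_RECORDS = [
    {"name": "Examplia", "capitalCity": "Sample City",
     "incomeLevel": {"id": "LIC", "value": "Low income"},
     "longitude": "10.5", "latitude": "-3.25"},
    {"name": "Samplestan", "capitalCity": "Testville",
     "incomeLevel": {"id": "HIC", "value": "High income"},
     "longitude": "25.0", "latitude": "40.1"},
    # A record without coordinates, which cannot be placed on a map.
    {"name": "Some aggregate", "capitalCity": "",
     "incomeLevel": {"id": "NA", "value": "Aggregates"},
     "longitude": "", "latitude": ""},
]


def countries_to_kml(records):
    """Build a KML document with one coloured pin per country capital."""
    kml = ET.Element("kml", xmlns=KML_NS)
    doc = ET.SubElement(kml, "Document")

    # One shared style per colour, referenced from the placemarks below.
    for colour in sorted(set(PIN_COLOURS.values()) | {DEFAULT_COLOUR}):
        style = ET.SubElement(doc, "Style", id="pin-" + colour)
        icon_style = ET.SubElement(style, "IconStyle")
        ET.SubElement(icon_style, "color").text = colour

    for rec in records:
        lon, lat = rec.get("longitude"), rec.get("latitude")
        if not lon or not lat:
            continue  # skip records that have no coordinates
        level = rec.get("incomeLevel", {})
        colour = PIN_COLOURS.get(level.get("id"), DEFAULT_COLOUR)

        placemark = ET.SubElement(doc, "Placemark")
        ET.SubElement(placemark, "name").text = rec["name"]
        ET.SubElement(placemark, "description").text = (
            "Capital: {}. Income level: {}.".format(
                rec.get("capitalCity", ""), level.get("value", "unknown")))
        ET.SubElement(placemark, "styleUrl").text = "#pin-" + colour
        point = ET.SubElement(placemark, "Point")
        # KML expects longitude first, then latitude, then altitude.
        ET.SubElement(point, "coordinates").text = "{},{},0".format(lon, lat)

    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            + ET.tostring(kml, encoding="unicode"))
```

Writing the returned string to a file, for example with `open("filename.kml", "w", encoding="utf-8")`, gives you the file to upload to your web server. Building the document with an XML library instead of gluing strings together means that characters such as & in country names are escaped automatically.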

3. Integrate into blog/article
Phew! Maybe that last section really was a bit technical! But the good news is that you are now more or less done! Google Maps can process KML files directly, so you can copy the file’s URL straight into the search field of Google Maps. If you have done everything correctly, the datasets taken from the World Bank API should be shown on the resulting map. Anyone who wants to try this can take the KML file URL used for this example:
Copy the link into the search field and see what happens. Via “Link” -> “Customize and preview embedded map” you can select the desired map section and zoom level, and you’re done! The HTML code which you have thus created can now be copied into your own website, and the map with the data overlay will automatically be loaded via Google!
Conclusion
Hopefully this article shows how easy it is - even by today’s standards - to integrate data from openly available data sources into your own website. With a little imagination and some programming skills, much more can be realized than shown in this example. Comparisons can be made by overlaying different datasets, or through the use of timelines. Maps can be complemented by your own datasets, or by data from other open programming interfaces. So, grab your keyboard!
And anyone who has experimented a bit and has created interesting visualizations: it would be great if you added a comment below!
Open Data in Agriculture and Why It Matters
July 22nd, 2010
The following guest post is from Elizabeth McVay Greene, Founder of Food+Tech Connect. It was originally posted on Food+Tech Connect, Provenance and the Huffington Post.
The farmer usually knows best — for his or her land, crop, livestock, and profitability, among other things. As a girl on a Minnesota farm in the ’80s and ’90s, I was in awe of my grandfather’s ability to know just what his beef cattle needed — more water, richer pasture, better nutrients — and to have a crop rotation schedule seemingly in his head. It was as though my Granddad could feel his way through the unpredictability of weather, supply and demand, and price fluctuations to make the optimal decisions for his operation.
A couple of years ago I was with him on his farm when the cows were getting checked for pregnancy. He sat in the middle of the cattle yard while cow after cow ran through the chute. After each check, the cowboy gave a signal to indicate the cow’s status, and Granddad jotted down on a spreadsheet the results that would later that evening get saved to someone’s hard drive. What could I build, I thought to myself, that would make this process easier, and link the data that emerged from this days-long affair with other information from the farm itself, the region, and the markets to help my family make even smarter operating choices?
It’s a question that participants from across the food sector have pondered for years, and to which new information technology is beginning to provide answers. With capabilities like social media that offers instantaneous mini-reports, remote sensing that announces field-level conditions, and user-generated mapping that offers an on-the-ground view of production, merchandising, and consumption activity, we are beginning to get the tools at our fingertips to optimize decision-making with connected, real-time information, not just intuition. Farm management software, mobile applications, and web-based tools are increasingly available to farmers around the world and present an opportunity for us to understand and act on the global interconnections among food, agriculture, water, energy, soil, farm profitability, and human nutrition as never before.
I do not want farmers’ wisdom to evaporate in the face of technology. Quite the contrary, I want that specialized knowledge of acre, crop, and herd to be augmented and preserved. Agricultural and food system data is important because it lets us see what we couldn’t see before, and in a world in which the expertise to sustain our food supply lives in the minds and senses of aging farmers, I would like to see a 21st century agricultural revolution that builds on farmers’ talent and perception to capture and interpret newly available signals from the ecosystem.
Imagine, for example, that a farmer needs to decide how much to irrigate during a drought. It’s a decision that affects just his farm in the short run, but has systemic costs and benefits. If the farmer could connect historical commodity prices, weather charts, financial and environmental costs, and soil conditions to assess the trade-offs in the choice he makes, he could complement his highly refined intuition with the long-term effects that his decision has on his farm and beyond. The more widely information and tools like this are available, the more optimal decisions participants can make throughout the food system.
Opening up food and agricultural data requires an information architecture and infrastructure that does not currently exist. The United Nations’ Food and Agriculture Organization is a leader in providing easily accessible, highly usable, and surprisingly current data, but right now it is far ahead of the pack in terms of transparency in reporting. The USDA released its Open Government Plan in April and the possibilities the agency’s data presents for developers and entrepreneurs are many. However, there exists no single platform for coordinating the numerous strands of measurements, probabilities, risks, and fluctuations in real-time. We need to build toward a high level of integration and openness in data in order to truly be stewards of the land and sustainable producers and consumers of agricultural products.
At a time when food is becoming a political issue instead of being discussed as the fundamental need that it is, we must access competing data and analysis to inform the investment, innovation, and policy behind food production and consumption. To transform data into metrics that empower decision-making across the food system, we need to get a broad spectrum of actors in the sector to communicate and collaborate. Let this essay serve as a call for a networked food system that harnesses and applies robust information through data generation, database architecture, open research and collaboration, and agile, relevant metrics, in pursuit of more efficient, more sustainable, more productive food and farming.
One Information Policy for Freedom of Information and Re-use
July 21st, 2010
The following guest post is from Katleen Janssen, researcher at the Interdisciplinary Centre for Law and ICT at Katholieke Universiteit Leuven, and member of the Open Knowledge Foundation’s Working Groups on EU Open Data and Open Government Data.
In Belgium – and I can imagine this is the case in more countries – we look at data.gov.uk with a mix of admiration and envy. The goal of the PSI directive to stimulate any re-use of public sector information is taken to heart and translated into a portal opening up large numbers of data sets for any type of use.
While in the UK and in many other EU Member States (e.g. Netherlands, Denmark, Spain), the awareness is growing that the open availability of public sector data can stimulate innovation and increase accountability, some other countries are still turning a blind eye to the opportunities that open access to public sector data can bring. A big part of the problem seems to be culture. Public bodies do not realize the value of their data for others, or they are worried that their data will be interpreted wrongly or used for wrongful purposes, putting their reputation on the line. In addition, due to lack of resources or lack of vision, some governments were satisfied with just transposing the directive in a law – to never look at it again, let alone develop an actual implementation policy or guidelines for the public bodies. Left to their own devices, some public bodies have risen to the occasion and developed a well-working re-use policy, while others have not bothered, or may simply not even be aware that there is such a concept as ‘re-use’.
As bad as I make the Member States sound, I must admit that they did not have an easy job in transposing and implementing the PSI directive, as this directive has left many difficult and unclear issues for the Member States to sort out themselves. Even the concept of re-use itself raises a lot of issues, particularly in relation to the citizens’ right to access information under national freedom of information legislation. Although the PSI directive has its roots in economic considerations and was developed to support the information industry, and European Commission representatives have often emphasized its economic character, the fact remains that the definition of re-use in the PSI directive is much broader: “the use by persons or legal entities of documents held by public sector bodies, for commercial or non-commercial purposes other than the initial purpose within the public task for which the documents were produced” (article 2.4). Hence, it involves not only commercial use, but also any other use as long as it is outside of the public task.
Considering this broad definition, it is not surprising that some of the Member States linked re-use immediately to their existing legislation on freedom of information (FOI) and decided to transpose the directive by amending their laws on access to government information. Some Member States felt that this legislation already covered all they needed to transpose the PSI directive (e.g. Sweden, Finland, Poland). Of course, access and re-use are closely related, in the sense that public sector data has to be accessible before you can re-use it, but they have a different background: access is rooted in traditions of democracy and public participation, while re-use has an economic slant and is intended to stimulate the common and internal market. These two different mindsets have only rarely been recognized by government, public bodies or appeal bodies. One of the few attempts to explain the distinction was made in 2004 by the UK Advisory Panel on Crown Copyright (which has been replaced by APPSI since then):
Although the subject matter (public information) and the broad scope (public bodies) of these instruments are similar, the underpinning policies are quite different. The FOI Act seeks to promote greater transparency and openness in the conduct of public affairs, while the PSI Directive recognises the value, and aims to encourage the commercial exploitation, of public information. The focus of the FOI Act is enhancing the rights of individuals in a democratic society. At the heart of the PSI Directive is the smoother running of the internal market; the stimulation of the European information industry so it can compete more effectively in the global marketplace.
Due to these differences, incorporation of access and re-use into the same legislation is not a simple task, and some implementations have been criticized for trying (e.g. by Mireille van Eechoud and Marc De Vries in the Netherlands). However, at least these countries have realized that there is a link between both and they should be applauded for seeing the relationship between them. The main problem with this practice is not the incorporation into one text, but rather the incorporation into one text without the incorporation into one information policy. If no attention is paid to the coherence between different information policies, they end up being very difficult to apply, or in the worst case end up contradicting each other. An example: in France and Belgium, the freedom of information legislation contained a prohibition to use the documents that were obtained under this legislation for commercial purposes. In Belgium, this article was abolished during the implementation of the PSI directive, to ensure that commercial use would not be hindered. While this was a nice attempt to harmonize access and re-use, it actually had an opposite effect. For years, the article had been interpreted in a way that prohibited commercial use of the documents as they were, but any reworking of the data or value-adding was allowed, without any extra conditions. The introduction of the PSI legislation changed this and made such re-use also subject to the freedom of the public bodies to decide whether they allowed re-use or not. Hence, the PSI legislation actually decreased the possibilities for re-use. In addition, it limited the extent and interpretation of what you can do with information obtained under access legislation.
In my opinion, that is one of the biggest issues of the PSI legislation: where does access stop and where does re-use begin? How are public bodies supposed to know which rules they have to apply to a request for information? The example that I usually give is journalists: throughout the history of freedom of information legislation (in Sweden it dates from 1766), they have been among the main users of FOI to obtain information from the government and the public sector. However, the news is also big business: newspapers and news channels have to be competitive and gain revenues. So while traditionally journalists have always obtained their information under FOI, if you want to apply the PSI rules to the letter, they would be re-using the information, possibly even for a commercial purpose. This could mean dealing with licences, fees and use conditions. However, journalists are not the only example of possible confusion between re-use and access. The development of Web 2.0 could potentially increase this confusion exponentially. As Mayo and Steinberg said in their Power of Information Review, a lot of new and innovative services are created by citizens and organisations on their websites, blogs, fora, etc. (e.g. mtraffic, Openstreetmap, Where does my money go?, Fixmystreet). These services are re-using public sector data, but before the PSI legislation, they might have already been possible under FOI legislation, due to their role in increasing public participation and democratic accountability.
Initiatives like data.gov.uk, however fantastic they are, increase the grey zone between access and re-use, as their aim in releasing public sector data is not only economic growth and innovation (like the PSI directive), but also increasing accountability and transparency (what FOI legislation was originally intended for). However, the enthusiasm and acclaim with which it has been received shows that maybe this is the way to go: forget about dogmatic issues like the distinction between re-using PSI and accessing it under FOI, and just think in terms of making public sector data open to anyone who wants to use it. This entails having a streamlined information policy that takes into account all the possible uses that could be made of public sector data.
However, such an overarching policy may work if everything is available free of charge and with hardly any use conditions, for example under a CC-zero licence or data.gov.uk’s open licence conditions, but it may have unwelcome results in countries or public bodies that wish to maintain a more complicated licensing policy with charges for using the data. We may not like such charges and conditions, but the truth still is that some public bodies creating interesting data have to earn their own money, and will have to continue doing so unless the government sees the importance of their data and starts to fund it from the central budget. In such cases, a combined policy for access and re-use might instead lead to the public having to pay or sign a licence to get the data on more occasions than under the traditional FOI legislation. Considering the fundamental character of the right to access government information, this should be avoided at all times. Any information policy that intends to do away with the distinction between FOI and re-use should start from the largest common denominator of what people can already do with the information they obtain from government, and ensure that these existing rights are maintained. This almost automatically leads to a very open data policy.
This exercise will be one for the Member States, without much assistance from the European Commission, as the Commission has repeatedly indicated that it has no competence to act on freedom of information issues. From a European Union perspective, this is a shame, as the harmonization intended by the PSI directive may be set back again. However, there are other guidelines and legislations to take inspiration from, such as the OECD Recommendation for Enhanced Access and More Effective Use of Public Sector Information and the Council of Europe Convention on Access to Official Documents. Based on these, the Member States should start thinking about developing information policies, going beyond occasional good practices, based on open data for any purpose.
Opening up university infrastructure data
July 20th, 2010
The following guest post is from Christopher Gutteridge, Web Projects Manager at the Electronics and Computer Science (ECS), University of Southampton and member of the OKF’s Working Group on Open Bibliographic Data.
Around five years ago we (The School of Electronics and Computer Science, University of Southampton) had a project to publish our infrastructure data as open data. This included staff, teaching modules, research groups, seminars and projects. This year we have been overhauling the site based on what we’ve learned in the interim. We made plenty of mistakes, but that’s fine; it’s what being a university is all about. We’ll continue to blog about what we’ve learned.
We have formally added a “CC0” public domain license to all our infrastructure RDF data, such as staff contact details, research groups and publication lists. One reason few people took an interest in working with our data is that we didn’t explicitly say what was and wasn’t OK, and people are disinclined to build anything on top of data which they have no explicit permission to use. Most people instinctively want to preserve some rights over their data, but we can see no value in restricting what this data can be used for. Restricting commercial use is not helpful and restricting derivative works of data is nonsensical!
Here’s an example: someone is building a website to list academics by their research area and they use our data to add our staff to this. How does it benefit us to force them to attribute our data to us? They are already assisting us by making our staff and pages more discoverable, so why would we want to impose a restriction? If they want to build a service that compiles and republishes data they would need to track every license, and that’s going to be a bother on a similar scale to the original BSD Clause 3.
Our attitude is that we’d like an attribution where convenient, but not if it’s a bother. “Must-attribute” is a legal requirement; we say “please-attribute”. It’s our hope that this step will help other similar organisations take the same step with the confidence of not being the first to do so.
The CC0 license does not currently extend to our research publications documents (just the metadata) or to research data. It is my personal view that research funders should make it a requirement of funding that a project publishes all data produced, in open formats, along with any custom software used to produce it, or required to process it, along with the source and (ideally) the complete cvs/git/svn history. This is beyond the scope of what we’ve done recently in ECS, but the University is taking the management of research data very seriously and it is my hope that this will result in more openness.
Another mistake we have learned from is that we made a huge effort to model and describe our data as semantically accurately as possible. Nobody cares enough about our data to explain to their tool what an “ECS Person” is. We’re in the process of adding in more generic schemes like FOAF and SIOC. The awesome thing about the RDF format is that we can do this gently and incrementally. So now everybody is both (is rdf:type of) an ecs:Person and a foaf:Person. The process of making this more generic will continue for a while, and we may eventually expire most of the extraneous ecs:xyz site-specific relationships except where no better ones exist.
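To make the "gently and incrementally" point concrete, here is a minimal Python sketch that treats an RDF graph as a plain set of triples and adds a generic type alongside a site-specific one. The rdf:type and foaf:Person URIs are the standard ones; the ECS class URI and the person URIs are hypothetical placeholders, and the helper names are invented for this illustration.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"
# Hypothetical stand-in for the site-specific ECS class URI.
ECS_PERSON = "http://example.org/ecs#Person"

# A graph is just a set of (subject, predicate, object) triples.
graph = {
    ("http://example.org/person/1", RDF_TYPE, ECS_PERSON),
    ("http://example.org/person/2", RDF_TYPE, ECS_PERSON),
}


def add_generic_type(triples, specific, generic):
    """Return the graph plus a generic rdf:type for every subject of the specific type."""
    extra = {
        (s, RDF_TYPE, generic)
        for (s, p, o) in triples
        if p == RDF_TYPE and o == specific
    }
    return triples | extra


def instances_of(triples, cls):
    """List every subject declared to be of the given class."""
    return sorted(s for (s, p, o) in triples if p == RDF_TYPE and o == cls)


# Existing triples are untouched; the generic types are simply added.
graph = add_generic_type(graph, ECS_PERSON, FOAF_PERSON)
```

After this step a tool that only understands foaf:Person finds both people via `instances_of(graph, FOAF_PERSON)`, while anything that relied on the site-specific type keeps working.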
The key turning point for us was when we started trying to use this data to solve our own problems. We frequently build websites for projects and research groups and these want views on staff, projects, publications etc. Currently this is done with an SQL connection to the database and we hope the postgrad running the site doesn’t make any cock-ups which result in data being made public which should not have been. We’ve never had any (major) problems with this approach, but we think that loading all our RDF data into a SPARQL server (like an SQL server, but for RDF data and queried over HTTP) is a better approach. The SPARQL server only contains information we are making public, so the risk of leaks (e.g. staff who’ve not given formal permission to appear on our website) is minimised. We’ve taken our first faltering steps and discovered immediately that our data sucked (well, wasn’t as useful as we’d imagined). We’d modelled it with an eye to accuracy, not usefulness, believing that if you build it they will come. The process of “eating our own dogfood” rapidly revealed many typos and poor design decisions which had not come to light in the previous 4 or 5 years!
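For readers who have not met SPARQL, the following Python sketch shows roughly what querying such a server over HTTP could look like. The endpoint address is a hypothetical placeholder, not the real ECS server, and the helper names are invented here. The request and response handling follow the standard SPARQL protocol and JSON results format rather than anything specific to our setup.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoint; substitute the address of a real SPARQL server.
ENDPOINT = "http://sparql.example.org/query"

# Ask for every foaf:Person and their name.
QUERY = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?person ?name WHERE {
  ?person a foaf:Person ;
          foaf:name ?name .
}
ORDER BY ?name
"""


def build_request(endpoint, query):
    """Build a SPARQL-protocol GET request asking for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def rows(results):
    """Flatten SPARQL JSON results into plain dicts of variable -> value."""
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in results["results"]["bindings"]
    ]


def run_query(endpoint=ENDPOINT, query=QUERY):
    """Send the query and return the result rows."""
    with urlopen(build_request(endpoint, query), timeout=30) as response:
        return rows(json.loads(response.read().decode("utf-8")))
```

Because the website only ever talks to the SPARQL server, it can only see data that was deliberately loaded there, which is the safety property described above.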
Currently we’re also thinking about what the best “boilerplate” data is to put in each document. Again, we’re now thinking about how to make it useful to other people rather than how to accurately model things.
There’s no definitive guidance on this. I’m interested to hear from people who wish to consume data like this to tell us what they *need* to be told, rather than what we want to tell them. Currently we’ve probably got overkill!
One field I believe should be standard, and which we don’t have, is where to send corrections. Some of the data on data.gov.uk is out of date, and an instruction on how to correct it would be nice and benefit everyone.
At the same time we have started making our research publication metadata available as RDF, also CC0, via our EPrints server. It helps that I’m also lead developer for the EPrints project! By default any site upgrading to EPrints 3.2.1 or later will get linked data being made available automatically (albeit, with an unspecified license).
Now let me tell you how open linked data can save a university time and money!
Scenario: The university cartography department provides open data in RDF form describing every building, its GPS coordinates and its ID number. (I was able to create such a file for 61 university buildings in less than an hour’s work. It is already freely published on maps on our website, so it is no big deal making it available.)
The university teaching support team maintain a database of learning spaces, and the features they contain (projectors, seating layout, capacity etc.) and what building each one is in. They use the same identifier (URI) for buildings as the cartography dept. but don’t even need to talk to them, as the scheme is very simple. Let’s say:
http://data.exampleuniversity.ac.uk/location/building/23
Each team undertakes to keep their bit up to date, which is basically work they were doing anyway. They source any of their systems from this data so there’s only one place to maintain it. They maintain it in whatever form works for them (SQL, raw RDF, textfile, Excel file in a shared directory!) and data.exampleuniversity.ac.uk knows how to get at this and provide it in well formed RDF.
The timetabling team wants to build a service to allow lecturers and students to search for empty rooms with certain features, near where they are now. (This is a genuine request made of our Timetable team at Southampton, and one they would like a solution for.)
The coder tasked with this gets the list of empty rooms from the timetabling team. This possibly won’t be open data, but it still uses the same room IDs (URIs), e.g. http://data.exampleuniversity.ac.uk/location/building/23/room/101
She can then mash this up with the learning-space data and the building location data to build a search to show empty rooms, filtered by required feature(s). She could even take the building you’re currently in and sort the results by distance away from you. The key thing is that she doesn’t have to recreate any existing data, and as the data is open she doesn’t need to jump through any hoops to get it. She may wish to register her use so that she’s informed of any planned outages or changes to the data she’s using, but that’s about it. She has to do no additional maintenance as the data is being sourced directly from the owners. You could do all this with SQL, but this approach allows people to use the data with confidence without having to get a bunch of senior managers to agree a business case. An academic from another university, running a conference at exampleuniversity, can use the same information without having to navigate any of the politics and bureaucracy, and improve their conference site’s value to delegates by joining each session to its accurate location. If they make the conference programme into linked data (see http://programme.ecs.soton.ac.uk/ for my work in this area!) then a 3rd party could develop an iPhone app to mash up the programme & university building location datasets and help delegates navigate.
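The room-finder scenario can be sketched in a few lines of Python to show how shared URIs let the three datasets be joined without any coordination between teams. Everything here is illustrative: the coordinates, features and room lists are made-up values, the data is held in plain dictionaries rather than RDF, and the function names are invented for this example.

```python
from math import asin, cos, radians, sin, sqrt

BASE = "http://data.exampleuniversity.ac.uk/location/building/"

# Cartography dept: building URI -> (latitude, longitude). Made-up values.
buildings = {
    BASE + "23": (50.9360, -1.3960),
    BASE + "32": (50.9365, -1.3975),
    BASE + "59": (50.9340, -1.3990),
}

# Teaching support: room URI -> its building and features. Made-up values.
learning_spaces = {
    BASE + "23/room/101": {"building": BASE + "23", "features": {"projector"}},
    BASE + "32/room/2": {"building": BASE + "32", "features": {"projector", "whiteboard"}},
    BASE + "59/room/7": {"building": BASE + "59", "features": {"projector", "whiteboard"}},
}

# Timetabling: URIs of rooms that are free right now.
empty_rooms = [BASE + "59/room/7", BASE + "23/room/101", BASE + "32/room/2"]


def distance_m(a, b):
    """Great-circle (haversine) distance in metres between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(h))


def find_rooms(required, current_building):
    """Empty rooms with all required features, nearest building first."""
    here = buildings[current_building]
    matches = []
    for room in empty_rooms:
        space = learning_spaces.get(room)
        if space is None or not required <= space["features"]:
            continue
        matches.append((distance_m(here, buildings[space["building"]]), room))
    return [room for _, room in sorted(matches)]
```

For example, `find_rooms({"projector", "whiteboard"}, BASE + "23")` returns the two whiteboard rooms, nearest first. The join works only because all three datasets use the same building and room URIs as keys, which is the point of agreeing a URI scheme up front.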
But the key thing is that making your information machine readable, discoverable and openly licensed is of most value to your own members in an organisation. It stops duplication of work and reduces time wasted trying to get a copy of data other staff maintain.
“If HP knew what HP knows, we’d be three times more profitable.” – Hewlett-Packard Chairman and CEO Lew Platt
I’ve been working on a mindmap to brainstorm every potential entity a university may eventually want to identify with a URI. Many of these would benefit from open data. Please contact me if you’ve got ones to add! It would be potentially useful to start recommending styles for URIs for things like rooms, courses and seminars as most of our data will be of a similar shape, and it makes things easier if we can avoid needless inconsistency!
Russ Nelson, License Approval Chair at the Open Source Initiative (OSI), recently proposed a session at OSCON about OSI adopting a definition for open data:
I’m running a BOF at OSCON on Wednesday night July 21st at 7PM, with the declared purpose of adopting an Open Source Definition for Open Data. Safe enough to say that the OSD has been quite successful in laying out a set of criteria for what is, and what is not, Open Source. We should adopt a definition Open Data, even if it means merely endorsing an existing one. Will you join me there?
Subsequently a bunch of people wrote to Russell letting him know about the Open Knowledge Definition that we created a few years ago:
The Open Knowledge Definition (OKD) sets out principles to define ‘openness’ in knowledge – that’s any kind of content or data ‘from sonnets to statistics, genes to geodata’. The definition can be summed up in the statement that “A piece of knowledge is open if you are free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and share-alike.”
Russell suggested there was scope for the OSI to adopt the OKD, and emailed us a further blurb for the event:
Should the Open Source Initiative write its own definition of Open Data? Or is the Open Knowledge Foundation’s definition up to snuff? Come help us decide at OSCON next week. We have a BOF scheduled at 19:00 on 21 July 2010. We’ll present the results of our decision to the OSI for adoption at its next board meeting.
We’re excited at the prospect that the OKD might get adopted as an official open data definition by OSI, and would love to hear from folks who plan to attend the session!

