The following guest post is from Stefano Costa at the University of Siena. He is Founder of the IOSA initiative and Coordinator of the Open Knowledge Foundation’s Working Group on Open Data in Archaeology. Stefano wishes to thank Thomas Kluyver and David Jones for their help in reviewing the post.

Since the 19th century, the study of archaeobotanical remains has been very important for combining “strict” archaeological knowledge with environmental data. Pollen data make it possible to assess the introduction of certain domesticated species of plants, or the presence of other species that typically grow where humans dwell. Not all pollen data come from archaeological fieldwork, and pollen analysis is often done by ecologists without a particular focus on human-associated plants. However, from an archaeologist’s perspective the relationship between the two sets is strong enough to take an interested look at pollen data worldwide, their availability and, most importantly, their openness, for which we follow the Open Knowledge Definition.

We found that universities and research centers seriously misunderstand their role in society as places of research and innovation that are available to everyone. As with dendrochronological data, academia is a closed system producing data (at very high cost to society) that are only available inside its walls, even though it is all done with public money.

Finding pollen data

The starting point for finding pollen data is the NOAA website.

The Global Pollen Database hosted by the NOAA is a good starting point, but apparently its coverage is quite limited outside the US. Furthermore, data from 2005 onwards aren’t available via FTP in simple documented formats, but are instead downloadable as Access databases from another external website. Defining Access databases as a Bad Choice™ for data exchange is perhaps a euphemism.

Unfortunately, the number of databases covering single continents or smaller regions is large and growing, and their approaches to data dissemination show marked differences.

Americas

For both North and South America, you can get data from more than one thousand sites directly via FTP. There are no explicit terms of use. Usually, data retrieved from federal agencies are public domain data.

The README document only states NOTE: PLEASE CITE ORIGINAL REFERENCES WHEN USING THIS DATA!!!!!. Fair enough, the requirement for attribution is certainly compatible with the Open Knowledge Definition.

Europe

From the GPD website we can easily reach the European Pollen Database, which is hosted on another website, though (and things can get even more confusing, given that the NOAA website has some dead links).

You can download EPD data in PostgreSQL dump format (one file for each table, with a separate SQL script create_epd_db.sql). Data in the EPD can be restricted or unrestricted. That’s fine, let’s see how many unrestricted datasets there are. Following the database documentation, the P_ENTITY table contains the use status of each dataset:

steko@gibreel:~/epd-postgres-distribution-20100531$ cat p_entity.dump |
awk -F "\t" '{ print $5 }' | sort | uniq -c
    154 R
   1092 U

which is pretty good because almost 88% of them are unrestricted (NB I write most of my programs in Python but I love one-liners that involve awk, sort and uniq). We could easily create an “unrestricted” subset and make it available for easy download to all those who don’t want to mess with restricted data.
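For readers who prefer Python to awk, the same count can be sketched as follows. This is only an illustrative equivalent of the one-liner above, assuming (as the one-liner does) that p_entity.dump is tab-separated and that the use status sits in the fifth field. The function names are made up for this example and are not part of the EPD distribution.

```python
import csv
from collections import Counter


def count_use_status(lines, column=4):
    """Count the values found in one column of tab-separated rows.

    `column` is 0-based, so 4 is the fifth field (awk's $5).
    Rows that are too short to have that column are skipped.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(row[column] for row in reader if len(row) > column)


def unrestricted_share(counts):
    """Return the fraction of datasets whose status is 'U' (unrestricted)."""
    total = sum(counts.values())
    return counts.get("U", 0) / total if total else 0.0


# With the real file one would write something like:
#   with open("p_entity.dump", newline="") as fh:
#       counts = count_use_status(fh)
# Here we simply reuse the figures reported by the shell one-liner.
counts = Counter({"R": 154, "U": 1092})
print(f"{unrestricted_share(counts):.1%} unrestricted")
```

Keeping only the rows whose status is U would be the natural next step towards the “unrestricted” subset mentioned above.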

But what does “unrestricted” mean for EPD data? Let’s take a more careful look (emphasis mine):

  1. Data will be classified as restricted or unrestricted. All data will be available in the EPD, although restricted data can be used only as provided below.
  2. Unrestricted data are available for all uses, and are included in the EPD on various electronic sites.
  3. Restricted data may be used only by permission of the data originator. Appropriate and ethical use of restricted data is the responsibility of the data user.
  4. Restrictions on data will expire three years after they are submitted to the EPD. Just prior to the time of expiration, the data originator will be contacted by the EPD database manager with a reminder of the pending change. The originator may extend restricted status for further periods of three years by so informing the EPD each time a three-year period expires.

Sounds quite good, doesn’t it? “For all uses” is reassuring and the short time limit is a good trade-off. The horror comes a few paragraphs below with the following scary details:

  1. The data are available only to non-profit-making organizations and for research.
Profit-making organizations may use the data, even for legitimate uses, only with the written consent of the EPD Board, who will determine or negotiate the payment of any fee required.

Here the false assumption that only academia is entitled to perform research is taken for granted. And there are even more rules about the “normal ethics”: basically, if you use EPD data in a publication, the original data author should be listed among the authors of the work. I always thought citation and attribution were invented for that exact purpose, but it looks like they have a distinctly different approach to attribution. The EPD is even deciding what are “legitimate” uses of pollen data (I can hardly think of any possible illegitimate use).

Africa

For “Africa” read “Europe” again, because most research projects are from French and English universities. For this reason, the situation is almost the same. What is even worse is that in developing countries there are far fewer people or organizations that can afford to buy those data, notwithstanding the fact that in regions under rapid development the study and preservation of environmental resources are of major importance.

Data are downloadable for individual sites using a search engine, in Tilia format (not ASCII unfortunately). The problems come out with the license:

The wording is almost exactly the same as for the EPD seen above:

Normal ethics pertaining to co-authorship of publications applies. The contributor should be invited to be a co-author if a user makes significant use of a single contributor’s site, or if a single contributor’s data comprise a substantial portion of a larger data set analysed, or if a contributor makes a significant contribution to the analysis of the data or to the interpretation of the results. The data will be available only to non-profit-making organisations and for research. Profit-making organisations may use the data for legitimate purposes, only with the written consent of the majority of the members of the Advisory board, who will determine or negotiate the payment of any fee required. Such payment will be credited to the APD.

Conclusions

The only positive bit of the story, if any, is that these datasets are nevertheless available on the web, and their terms of use are clearly stated, no matter how restrictive. It would be just impossible to write a similar article about archaeological pottery, or zooarchaeological finds.

Appendix: Using pollen data

Pollen data are usually presented in the form of synthetic charts where both stratigraphic data and quantitative pollen data are easily readable. Each “column” of the chart stands for a species or genus. You can create this kind of visualization with free software tools.

The stratigraph package for R can be used for

plotting and analyzing paleontological and geological data distributed through through time in stratigraphic cores or sections. Includes some miscellaneous functions for handling other kinds of palaeontological and paleoecological data.

See the chart for an example of what they look like.

The following guest post is from Chris Taggart of OpenlyLocal, who advises the Where Does My Money Go? project on local spending data, and is a member of the Open Knowledge Foundation’s Working Group on Open Government Data. This is a cross-post — Chris’ original post here.

Ever since the coalition announced that councils would have to publish all spending over £500 by January next year, there’s been a palpable excitement in the open data and transparency community at the thought of what could be done with it (not least understanding and improving the balance of councils’ relationships with suppliers).

Secretary of State for Communities & Local Government Eric Pickles followed this up with a letter to councils saying, “I don’t expect everyone to do it right first time, but I do expect everyone to do it.” Great. Raw Data Now, in the words of Tim Berners-Lee.

Now, however, with barely the ink dry, the reality is looking not just a bit messy, a bit of a first attempt (which would be fine and understandable given the timescale), but Not Open At All.

As a member of the Local Public Data Panel, I’ve worked with other members and councils to draw up some clear and pragmatic draft guidelines for publishing the local spending data. We’ve had a great response in the comments and in conversations, and together with some lessons I learned from importing the existing data, I think these will allow us to do a second draft soon.

One thing we weren’t explicit about in that first draft – because we took it for granted – was that the data had to be open, and free for reuse by all. Equality of access by all is essential.

So I’ve been watching the activities of Spikes Cavell’s SpotlightOnSpend with some wariness and now those fears seem to have been borne out, as the company seems to set out not to consume the open data that councils are publishing, but to control this data.

The idea seems to be that councils should give Spikes Cavell privileged access to their detailed invoice information, which the company then adds to its proprietary and definitely non-open database, and then publishes an extract of this information on the SpotlightOnSpend website. Exactly what information they get, and under what terms, isn’t disclosed anywhere.

The website’s got most of the buzzwords: transparency, accessible, efficiency. It’s even got a friendly .org.uk domain. If that’s not enough to convince councils, liberally sprinkled around the site is an apparent endorsement from the Secretary of State himself:

I’m really excited about the opportunities of transparency and it’s something this government is utterly committed to. spotlightonspend demonstrates that, when innovative businesses work with far-sighted public bodies, we can inform the public, reduce costs and improve democracy both locally and nationally.
Eric Pickles
Secretary of State
Communities and Local Government

However, when you go to the data and click on the download link this is what you get:

Note the “This data is for your personal use only” (not to mention the fact that the use of a captcha to screen out machines downloading the data means, er, you can’t use machines to automatically download the data, which is sort of the point of publishing the data in a machine-readable way).

Never mind, surely you can just head over to the council’s website and download the data from there? No chance. This is what you get on the Guildford website:

You can search and view this financial data using a new Spotlight on Spend national website. Just follow the link found in the offsite links section of this page.

What about Mole Valley Council:

This data is now available on the spotlight on spend website. You can look at categories and individual suppliers to see how much has been spent in each area or you can download all the data to see individual transactions.

But what about Windsor & Maidenhead, who are closely affiliated with the project, and who are publishing data on their website? Well, download the data from SpotlightOnSpend and it’s rather different from the published data. Different in that it is missing core data that is in W&M published data (e.g. categories), and that includes data that isn’t in the published data (e.g. data from 2008).

So the upshot seems to be this: councils hand over all their valuable financial data to a company which aggregates it for its own purposes and, er, doesn’t open up the data, shooting down all those goals of mashing up the data and using the community to analyse it, and undermining much of the good work that’s been done.

It’s worth linking here to the Open Knowledge Foundation’s draft guidelines on reporting of Government Finances (disclosure: I helped draw them up), of which the first point is ‘Make data openly available using an explicit license’. And let me be absolutely clear here: this is not open data, not a desirable approach, will not achieve the results of transparency or of equality of access, and is not good for the public sector.

I’m hoping this is a matter of councils and the Secretary of State not understanding the process and implications of giving this data to Spikes Cavell on a privileged basis. If not, perhaps it could be the first test case for the newly set up Public Sector Transparency Board to rule on.

The cake test of freedom

March 15th, 2010

At last week’s Jornadas SIG Libre in Girona, Ivan Sanchez of the Spanish OpenStreetMap community told me about the cake test of data freedom.

What is the cake test? Easy: geographic data, or a map, is open only if someone can make you a gift of a cake with your map on it.

The cake test is inspired by the dissident test and the desert island test used by the Debian community to gauge software freedom for packages to be included in a free and open distribution.

For data to pass the cake test, you must be able to freely share the data with someone (the baker) who can re-use it for a profitable activity (the baking of cakes) and is then freely able to redistribute the resulting derived work (the cake).

The cake test can apply to all kinds of information resources, not just geodata. A resource that passes the cake test will be open in the sense of the Open Knowledge Definition. You could print a research paper onto a cake, a chart based on a dataset, some code describing an algorithm. Obviously a map just looks prettier on a cake.

The objective of the Cake Test is quite simple:

If a layperson can’t decide if one can or cannot give away a cake, or cannot do this easily, then the data or the maps cannot be freely used.

And you could be sure that if two datasets each passed the cake test, then it should be fine to give someone a cake decorated with parts of both of them - that is the intention of the data makers.

Is it open data? Does the data pass the cake test?

The following guest post is from Regards Citoyens, a French association of citizens with a shared interest in opening up information about the functioning of democratic institutions in France.

France is lagging behind…

There is no doubt about it: compared to other countries, France is definitely late in opening up its data. For a country so proud of its human rights and democratic revolution, it took a while before it finally joined the open data movement! The first “Open Data Camp” organized in Paris last December is a good example of this new momentum.

While the US and the UK have taken enormous steps in the past two years with the release of data.gov and data.gov.uk, France and many other southern European countries are still being very conservative about making public data public. To catch up in the world of open data will require more than just a few political measures. French institutions need a drastic change to their approach to the production and dissemination of official data. But nothing will be possible without support, demand and engagement from groups of citizens.

Interesting — and often relatively little known — projects already lead the way. For example, the HAL Archives opens up access to scientific journal articles and IREP offers access to data about pollutants. But this is just a very small fraction of material that is out there. The vast majority of official documents, datasets and publicly funded research remains inaccessible to citizens. Indeed, it can be very difficult for an individual to gain access to specific public documents. In 1978, a committee called CADA was created to provide advice on such demands, but such public services often won’t process the requests easily.

For historical reasons, it is especially difficult to change French officials’ approach to data release. For a very long time, most public data sharing has been done by public administrations classified as EPIC (Etablissement Public à caractère Industriel et Commercial or Public Administration for Industrial and Commercial purposes). These administrations have a prior commercial purpose even though their data are considered public. Examples include key providers of meteorological data and geospatial data. Having both public and commercial purposes, such administrations tend to be interested in making profit from the data by selling it to corporate businesses. Therefore, it can be a real challenge for citizens to get free access to these data and reuse them for civil society projects to strengthen democracy, to increase citizen engagement or to improve the delivery of public services.

The former DJO (Direction des Journaux Officiels or Directorate of Official Publications), now called DILA (Direction de l’Information Légale et Administrative, Directorate of Legal and Administrative Information), is another good example of this situation. This administration is in charge of all legal data including laws issued by the parliament and official government decisions. Before 2002, online access to French legislation was restricted through a régime de concession à titre onéreux. This means only those able and willing to pay for a license, mainly companies like Reuters or Lamy, were allowed to use the documents. The situation changed in 2002 and now any individual has access to these key legal documents thanks to LégiFrance. Extra features like access to the rich XML feed of every change to legislation could be of great help in improving legislative monitoring projects like Regards Citoyens’ Simplify the law. Unfortunately these features are still restricted to users able to pay the fee.

Government initiatives: limited access but not openness

Despite all of this, the global movement for openness has recently taken a radical turn thanks to the data.gov projects, the 2007 EU INSPIRE Directive (planned to be transposed in France in June 2010) and Sweden’s initiative to promote eGovernment projects during its presidency of the European Union. All of these seem to have shifted the French government’s view of public data, and some things have started to change.

A new administration, the DILA, was recently created to replace the DJO and to drive improved production and dissemination of public data. In this context, a new agency called APIE (Agence du Patrimoine Immatériel de l’Etat, the State’s Intangible Heritage Agency) was set up to lead the thinking on, and to coordinate, assess and organize, a common data effort between the different administrations. The objective is to propose by the middle of 2010 a platform that will promote all the different sources of data and describe their respective licenses.

Unfortunately, the French government’s historical lack of openness left the field open to the private sector. Some companies benefit greatly from this situation: they make a profit out of the data by becoming an intermediary between the administration and data users. A good example of this is the GFII (the Groupement Français de l’Industrie de l’Information, or French Association of Electronic Information Industry). Disappointed by its difficult contacts with the government, this active lobbying group began to take charge of civil servants’ training itself, and progressively became the official investor in and organizer of training programmes instead of the government. This entry of the private sector into matters of public administration certainly contributed to the APIE’s information licensing decisions: there is an obvious inclination to sell the data to companies without considering the benefits of allowing reuse by citizen driven projects using open licences. This situation is good neither for innovation nor for the production of common knowledge.

Citizen driven open data initiatives in France

As in many countries, the first steps into open data came from the research and the Free and Open Source Software (F/OSS) communities. Wikimedia France and OpenStreetMap.fr are probably the most popular open knowledge projects in France. Early websites like Mon-Depute.fr (a vote monitoring project created by an archivist) or droit.org (a very active project from l’Ecole des Mines on legal publication) helped a lot to make democratic data available. Our work at Regards Citoyens on parliamentary activity with NosDéputés.fr and on electoral data is a new step for French open data for democracy and civil society.

OpenStreetMap.fr is a very good example of a citizen driven open data project. The Public Land Registry (Cadastre) has a website intended to publish their map, which provides interesting information but not openly. Therefore, some contributors of OpenStreetMap found out how to technically access the raw data. But this still was not enough to open up the data for anyone. So the OSM community studied the legal situation and contacted the French Ministry of Finance in charge of this service. They finally got an answer in January 2009: a global export of their whole database is not allowed, but a partial one is. So hundreds of volunteers began a crowdsourcing effort and OpenStreetMap.fr is now able to free more and more data from the Land Registry.

All of these are good examples that open data is not only about technology: it also often depends on the efforts of a community in order to legally secure the data and encourage others to allow it to be reused for any purpose. That is why we helped organise the first French Open Data Camp in Paris, where more than 120 people came to learn and share their skills. We learned a lot about information visualisation techniques from existing projects and from interesting theoretical ones! We also had a good conversation with activists, ‘hacktivists’, and others about the political, economic and administrative benefits of open data.

The success of this event seems like a pretty good demonstration that France is ready and has already taken its first steps into the global world of open data. Regards Citoyens will follow these changes and will try to contribute modestly to the global open data movement by working together with international organisations such as the Open Knowledge Foundation. With our fellow “campers”, we are convinced that making public data accessible and reusable will bring great benefits to commercial innovation, democratic organisations, and to civil society.

A radiant turret lit by the midsummer midnight sun by the State Library of New South Wales collection on Flickr

With the United Nations Climate Change Conference in Copenhagen starting on Monday, it is of vital importance that there is consensus on the scientific evidence about climate change, in order to inform debates about the best course of action for the international community. Sharing the same basic picture about the climate, global warming and the impact of human sources of carbon dioxide (regardless of the details of this picture, regardless of differences in opinion about the most appropriate course of action in response to it) is surely a critical prerequisite to effective and fruitful negotiations.

The recent illegally obtained emails from the University of East Anglia’s Climatic Research Unit (so-called ‘Climategate’) and the subsequent accusations of secrecy and malpractice from climate change sceptics have provoked debate in the media about the openness and availability of datasets related to climate change.

Partly in response to accusations of secrecy and falsification of key datasets from sceptics, the UK Met Office announced today they will be publishing new climate datasets. Earlier the Telegraph reported:

Sceptics alleged that emails stolen from the Climatic Research Unit at the university show scientists were willing to manipulate data to show global warming.

They also complain that the raw data for the climate models was not made available to the public.

To try to restore public confidence the Met Office is talking to other meteorological organisations around the world about recreating the model using the same raw data but more modern computers.

The whole process will also use any new information and be more open to the public.

This evening, the BBC reported:

Meanwhile, the Met Office said it would publish all the data from weather stations worldwide, which it said proved climate change was caused by humans.

Its database is a main source of analysis for the IPCC.

It has written to 188 countries for permission to publish the material, dating back 160 years from more than 1,000 weather stations.

As UEA said in an announcement from the end of November, over 95% of the CRU climate data is already available and permission to publish the remaining data will have to be sought from each of the relevant National Meteorological Services (NMSs) around the world on a case by case basis. Professor Davies of UEA suggests there are partly commercial reasons for this:

We are grateful for the necessary support of the Met Office in requesting the permissions for releasing the information but understand that responses may take several months and that some countries may refuse permission due to the economic value of the data.

An editorial piece in Nature from a couple of days ago suggests:

Researchers are barred from publicly releasing meteorological data from many countries owing to contractual restrictions. Moreover, in countries such as Germany, France and the United Kingdom, the national meteorological services will provide data sets only when researchers specifically request them, and only after a significant delay. The lack of standard formats can also make it hard to compare and integrate data from different sources. Every aspect of this situation needs to change: if the current episode does not spur meteorological services to improve researchers’ ease of access, governments should force them to do so.

Mike Hulme of UEA and Jerome Ravetz of Oxford University argue in a recent BBC article that climate scientists will have to become better at engaging the public in their research:

While there will always be a unique function for expert scientific reviewers to play in authenticating knowledge, this need not exclude other interested and motivated citizens from being active.

These demands for more openness in science are intensified by the embedding of the internet and Web 2.0 media as central features of many people’s social exchanges.

In particular they suggest that scientists should respond to demands that:

  • To be validated, knowledge must also be subject to the scrutiny of an extended community of citizens who have legitimate stakes in the significance of what is being claimed
  • And to be empowered for use in public deliberation and policy-making, knowledge must be fully exposed to the proliferating new communication media by which such extended peer scrutiny takes place.

Roger Pielke, Professor of Environmental Studies at the University of Colorado, argues in a recent interview in the Washington Post that:

More openness, more transparency, more diversity, and more attention to the social construction of expertise is needed.

While it is important to remember, as Cameron Neylon notes, that proper interpretation of climate change data requires significant background knowledge and a thorough grounding in relevant scientific literature and tools, nevertheless it is clear that there is an increasing demand from interested non-expert non-scientists to access and reuse climate data. The Times recently published two pieces analysing and refuting a climate change sceptic’s interpretation of the publicly available HADCRU data. Another blogger points out that public environmental datasets allow non-expert members of the public to explore the evidence and draw different conclusions about climate change - and argues that the peer review process will act as a quality filter for their research.

In response to the demand for data, Real Climate (who were also hacked, and who provide two excellent posts on the CRU hack and background context) have published a very useful list of public climate datasets as well as a blog post asking the climate science community for further suggestions.

All of this interest in public sources of climate data reminded us of our Open Environmental Data project which we started two years ago this autumn. The project aimed to answer the question:

  • What environmental data is out there, and how open is it?

It also aimed to document legislation and policy relevant to environmental data in different jurisdictions.

We have picked up this work again by starting a climate data group on CKAN, our open source registry of open data:

We have started to go through available public sources of climate data, looking at:

  • Whether datasets are open as in the Open Knowledge Definition - i.e. whether they explicitly say that they can be used by anyone, for any purpose, without restriction (except perhaps attribution, integrity or sharealike requirements).
  • Whether or not there are facilities to download raw data in bulk - i.e. whether they easily allow users to directly download all the data in open, machine readable formats.

Environmental data is an excellent case of where sharing is the key to scaling. Research institutions must share data with each other in order to build up as detailed a picture as possible, incorporating as much evidence as possible from around the world. As much of this research is publicly funded, and due to increasing public interest, there are now strong arguments for extending this sharing from sharing between research institutions to sharing with the public.

Furthermore, often access is not enough. Datasets need to be combined with other datasets, or reused in visual representations. Hence there are arguments for making data open as in the Open Knowledge Definition, which means that anyone can reuse and redistribute it for any purpose. This also allows for innovation in the ways in which the data can be presented to the public by third parties, including not-for-profit organisations and companies - such as through the creation of new web services to allow the data to be explored.

There are currently 38 data sources listed, over half of which are fully open. However many datasets are still not explicitly legally open, and many of them have restrictions on how they can be reused. There are still plenty of datasets to add! We’ve been in touch with the folks at Real Climate, and they’ve been supportive of the project and encouraged us to reuse and build on their list of data sources.

In order to mark the occasion of the Copenhagen Conference, over the next few weeks we will be continuing to add publicly available climate data to CKAN. By better documenting existing open environmental data, we hope to make some small contribution to laying the groundwork for the shared picture about the state of our climate that we currently need.

If you are interested in contributing to the climate data group - please either drop us a line, or get stuck in and register a package!

Over the last few months there have been lots of exciting announcements about open data from cities around the world. We decided to take a look at what is currently out there - in particular taking note of:

  1. Whether datasets are open as in the Open Knowledge Definition - i.e. whether they explicitly say that they can be used by anyone, for any purpose, without restriction (except perhaps attribution, integrity or sharealike requirements).
  2. Whether or not there are facilities to download raw data in bulk - i.e. whether they easily allow users to directly download all the data in open, machine readable formats.

We’ve now got 16 packages with the city tag on CKAN, our open-source registry of open data:

United States

Boston

Chicago

New Orleans

New York

Portland

San Francisco

Washington, D.C.

  • Background: One of the earliest and best examples of publishing local government data online - publicised by Vivek Kundra, who went on to work on data.gov.
  • Open?: Yes. Users must notify the OCTO and redistribute a disclaimer.
  • Bulk download?: Yes. All datasets are linked to from main page.

Canada

Calgary

  • Background: A draft motion to make the City of Calgary’s data open was reported in July 2009. At time of writing no open data appears to be published yet.
  • Open?: No. Not yet published.
  • Bulk download?: No. Not yet published.

Nanaimo

Toronto

Vancouver

UK

Birmingham

  • Background: Digital Birmingham announced their ‘Open City’ initiative to increase access to public datasets in April 2009. They hosted an event in August 2009, reported here. At time of writing no open data appears to be published yet.
  • Open?: No. Not yet published.
  • Bulk download?: No. Not yet published.

Lichfield

London

  • Background: Initiative to open up the City of London’s data was reported in the Guardian in October 2009. At time of writing some datasets are published.
  • Open?: No. Currently no permission is granted to reuse data.
  • Bulk download?: No. Datasets cannot be downloaded in bulk.

How to open up city data

There are some excellent examples of publishing open data on cities - in particular New York, Washington and Vancouver. However not all data is explicitly open, or made available in bulk. Below is our recipe for opening up city data:

  1. Use a license or legal tool to make datasets legally open! - If you are using your own custom copyright notice, license, disclaimer or terms and conditions, make sure they are compliant with the Open Knowledge Definition. You can also use existing licenses and legal tools, such as:
    • the PDDL, the ODbL, or CC0 for data
    • and CC-BY or CC-BY-SA for content
  2. Make the raw data available in bulk! - Publish data in open, machine readable formats in a way which makes it easy for users to automatically download it. This could mean directly linking to all files in a single HTML page, or putting files in a single publicly accessible directory. Don’t make it difficult for users to download material by only allowing access to data via a shiny interface. Keep it simple!
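To illustrate why a single page of direct links makes life easy for reusers, here is a minimal sketch in Python of how someone might gather every data file linked from one HTML page. The list of file extensions, the function and class names, and the example.org addresses are all assumptions made for illustration, not part of any city's actual site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: extensions treated here as "open, machine readable" formats.
DATA_EXTENSIONS = (".csv", ".json", ".xml", ".txt")


class DataLinkCollector(HTMLParser):
    """Collect absolute URLs of data files linked from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Keep only links whose path ends in one of the chosen extensions.
        if href and href.lower().endswith(DATA_EXTENSIONS):
            self.links.append(urljoin(self.base_url, href))


def find_data_links(html, base_url):
    """Return the data file URLs found in an HTML string."""
    collector = DataLinkCollector(base_url)
    collector.feed(html)
    return collector.links


if __name__ == "__main__":
    # Hypothetical index page of the kind recommended above.
    page = """
    <ul>
      <li><a href="data/parks.csv">Parks</a></li>
      <li><a href="/data/budget.json">Budget</a></li>
      <li><a href="about.html">About</a></li>
    </ul>
    """
    for url in find_data_links(page, "http://data.example.org/index.html"):
        print(url)
```

When every file is linked from one page, a few lines like these are all a user needs to fetch the lot. Data that is only reachable through an interactive interface offers no such shortcut.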

Get involved!

Does your city publish open data? Do any of the details above need to be amended or added to? If you would like to get involved we encourage you to:

Jonathan recently wrote about the availability of open dictionaries. In a recent comment to that post someone pointed us to Macmillan’s “Open” Dictionary (the reasons for the quotes will soon be apparent).

With a sense of excitement I followed the link: “Could it be,” I thought, “that a mainstream dictionary producer has decided that open is the way to go?”

Sadly, the answer is no: Macmillan’s “Open Dictionary” isn’t open — at least not in any way we mean by that term.

Their “open” means letting you give them information for free (by submitting word suggestions) but getting nothing back. As the terms and conditions make quite clear, you’re not allowed to reproduce the material in any way, and even linking could be problematic (emphasis added):

Unless otherwise indicated, this Web Site and its contents are the property of Macmillan Publishers Limited, … The copyright in the material contained on this Web Site belongs to Macmillan or its licensors. … Reproduction of material on this Web Site is prohibited unless express permission is given by Macmillan.

No licence is granted in respect of any intellectual property rights vested in Macmillan or other third parties.

You may not redistribute any of the Content of this Web Site without the prior authorisation of Macmillan or create a database in electronic form or manually by downloading and storing any content.

You may link to the home page and any HTML page of the Web Site provided you do not create a frame or any other bordered environment around the content … You may not link to any other page of the Web Site, other than the home page or any HTML page, without the prior written consent of Macmillan. Macmillan reserves the right to require you to remove any link to this Web Site. You may not replicate the Content on this Web Site.

To my mind this is clear abuse of the term “open” and more than a little exploitative — you do work for them for free and they don’t even promise to give you credit, let alone permission to use the material you helped create.

Such potential for abuse of the “open” label is a major reason we created the open definition — where open content and data are clearly defined as material that you, and others, are free to use, reuse and redistribute without restriction.

Jed Sundwall of Netsquared just published an interview with Rufus Pollock, co-founder of the Open Knowledge Foundation.

The interview includes discussion about the distinction between price and value, about the Open Knowledge Definition, about CKAN, about decentralised approaches to working with large quantities of data, about packaging for knowledge and about ‘Shiny Front End Syndrome’. It ends with 3 suggestions for people publishing collections of content or data.

Here’s an excerpt:

Well, one day soon we’re going to have a lot of material that is open and what’s really exciting about open stuff is that it can easily be shared and recombined. That means we can break very complicated problems down into small bits, which people can manage. But then, we can put it back together again. So, let’s say you were interested in U.S. unemployment, a hot topic, and you’re interested in understanding how it changes. Maybe there’s a data site out there just on unemployment itself. But maybe there’s another one on house repossessions or the housing market, and then, there’s another one on manufacturing. There are a whole bunch of different data sites.

Now, maybe one person could just maintain them all but that might become too big a job. You may need expertise in the housing market to maintain the housing data site, but you really want to bring these together often when you want to do analysis, or compute things, or make pretty pictures, or whatever it is you want to do. This is very similar to building a large building, let’s say, or developing an operating system plus all the applications to use. Maybe one person could build them all and make sure they all work together but that would be quite a big task. Even the world’s greatest monopolist struggles to do this effectively.

So, the typical way we go about doing this is by exploiting divide and conquer. But when you divide stuff up, there was this question about how you bring it back together. So then, we say we’re moving toward a world where you can start getting lots of these data sets and then start putting them out there in the world. They can just start taking this unemployment data or this housing data. But, how do you find that and how do you get a hold of it? So often in software, there’s been this tradition of building some kind of registry where you can find things, and then you start to impose some structure on that material, you start packaging. So rather than just saying: here’s my website, here’s my Wiki, look, there’s lots of data on it, you are going to start packaging that data in a slightly more structured form.

The point of CKAN is to start saying, look, there’s a better way than just having our stuff in wikis or in some random form on a website. We can start registering this material, and packaging it up a bit. That way other people, when they want them, can come and get hold of them easily and the wheel of reuse can start to turn.

If you are starting a new public interest organisation, there are three obvious principles you might like to have.

  • Finance - have all bank transactions automatically public in real time. Plus accounts.
  • Software - all software made by the organisation to be open source.
  • Information - voluntarily subscribe to some sort of FOI law.

The software one is reasonably well covered.

There are problems with the finance one. For example, you probably need to anonymise individual donations, or at least those that are ‘small’. It would be lovely if somebody could think through all this, and come up with an “Open Finance Definition”, for describing when an organisation’s finances are truly open.

There are also problems with the Freedom of Information one. In the UK at least, subscribing to public sector FOI law voluntarily would be dangerous. You wouldn’t get the protection from defamation that a public sector body gets, and you may have trouble applying the public interest test clearly. So again, it would be lovely if somebody could come up with an “Open Information Organisation Definition” which encoded a good principle to have for this.

Amazingly really, the more you think about this openness, the more things you find that could be open, and the more definitions you need. There’s work for you forever, Rufus :)

Melanie Dulong de Rosnay recently published an excellent paper on open data in the life sciences in Nature Precedings entitled Check Your Data Freedom: A Taxonomy to Assess Life Science Database Openness. From the abstract:

Molecular biology data are subject to terms of use that vary widely between databases and curating institutions. This research presents a taxonomy of contractual and technical restrictions applicable to databases in life science. It builds upon research led by Science Commons demonstrating why open data and the freedom to integrate facilitate innovation and how this openness can be achieved. The taxonomy describes technical and legal restrictions applicable to life science databases, and its metadata have been used to assess terms of use of databases hosted by Life Science Resource Name (LSRN) Schema. While a few public domain policies are standardized, most terms of use are not harmonized, difficult to understand and impose controls that prevent others from effectively reusing data. Identifying a small number of restrictions allows one to quickly appreciate which databases are open. A checklist for data openness is proposed in order to assist database curators who wish to make their data more open to make sure they do so.

Shirley Fung has published a directory of open datasets examined in the paper, and details of their re-usability on Molecular Biology Databases.

For each dataset, they provided basic metadata, including:

  • The name and URL of the database,
  • URL of the download page and URL of the terms of use,
  • Extracts of the terms of use for further review and comments,
  • Values for technical accessibility and legal accessibility features [...]

They then looked at various technical and legal restrictions for accessing, acquiring and re-using the material - including bulk downloadability, registration, password protection, terms and conditions, and licensing - asking the following questions:

  • Is there a link to download the whole database?
  • Is it possible to access the data through a batch feature?
  • Is it possible to access the data through a query-based system?
  • Finally, is registration compulsory before downloading or accessing data in the ways described above?
  • Does the database have a policy?
  • Are there any restrictions on the right to reformatting and redistributing?
  • Which restrictions?
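As a rough illustration, the questions above could be recorded per database in a small structure like the following Python sketch. The class name, the field names, and the choice of which answers count as a barrier are assumptions made here for illustration. They are not taken from the paper's taxonomy.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AccessAssessment:
    """Answers to the access questions for one database (illustrative names)."""
    whole_database_download: bool
    batch_access: bool
    query_access: bool
    registration_required: bool
    has_policy: bool
    # Free-text summary of restrictions on reformatting and redistributing,
    # or None when there are none.
    redistribution_restrictions: Optional[str] = None

    def barriers(self) -> List[str]:
        """List the answers treated here (by assumption) as barriers to reuse."""
        found = []
        if not self.whole_database_download:
            found.append("no link to download the whole database")
        if self.registration_required:
            found.append("registration is compulsory")
        if not self.has_policy:
            found.append("no stated policy")
        if self.redistribution_restrictions:
            found.append("restrictions: " + self.redistribution_restrictions)
        return found


if __name__ == "__main__":
    # Hypothetical database with mixed answers.
    example = AccessAssessment(
        whole_database_download=False,
        batch_access=True,
        query_access=True,
        registration_required=True,
        has_policy=True,
        redistribution_restrictions="noncommercial use only",
    )
    for barrier in example.barriers():
        print(barrier)
```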

This is very similar to the work we have been doing with ckan.net, which aims to provide basic metadata for knowledge packages, including:

  • url
  • title
  • download url
  • tags
  • license/legal status
  • unstructured text field with a description of the resource and details about its openness
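A record with these fields might be sketched in Python as follows. This is only an illustration of the metadata listed above: the class name, the attribute names, the `to_json` helper and the example values are assumptions, and nothing here should be read as CKAN's actual data model or API.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class PackageRecord:
    """Basic metadata for one knowledge package (illustrative field names)."""
    url: str
    title: str
    download_url: str
    tags: List[str] = field(default_factory=list)
    license: str = ""  # license or legal status
    notes: str = ""    # free-text description, including details about openness


def to_json(record: PackageRecord) -> str:
    """Serialise a record so it can be shared or stored as plain text."""
    return json.dumps(asdict(record), indent=2)


if __name__ == "__main__":
    # Hypothetical entry; the URLs and values are placeholders.
    record = PackageRecord(
        url="http://data.example.org/",
        title="Example city datasets",
        download_url="http://data.example.org/all.zip",
        tags=["city"],
        license="PDDL",
        notes="All datasets are linked from the main page.",
    )
    print(to_json(record))
```

Keeping the description as unstructured text leaves room for the nuances of a publisher's terms, while the tags and license fields stay simple enough to search on.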

Furthermore, CKAN uses certain tags to indicate any technical or legal restrictions on the packages that are listed. For technical access, this includes bulk downloads, registrations, password protection, and access through an API:

For legal terms tags include noncommercial restrictions, and cases where terms of re-use are not clear:

There are also several ‘todo’ tags to indicate where it might be useful to write to the knowledge publisher or distributor to clarify something, to split up the entry into multiple entries, or to otherwise work on the registry:

There is significant work involved in documenting the legal and technological issues involved in accessing and re-using knowledge. It would be fantastic if this could be made easier by sharing the results of this kind of research. CKAN is intended to be a community-driven resource to aid the discovery of (open) knowledge in the first instance, its automatic installation in the longer term, and ultimately to support its re-use by providing multiple download links, multiple formats, big datasets broken down into smaller components and so on.

The MBDB is a fantastic project and we hope that in future we can put our heads together with Melanie, Shirley and others to improve the discoverability (and re-usability) of open data in the life sciences!