Open Data Going Mainstream?
April 10th, 2008
Bret Taylor’s recent post entitled “We Need a Wikipedia for Data” has been garnering a lot of attention around the blogosphere. While his suggestions are not particularly novel, the post and the attention it has garnered are, I think, indicative of the growing interest in the issues of (open) data and its importance for the development of related services and products.
While I am generally in agreement with Bret’s arguments, there are a few differences that are worth raising. First, Bret appears to favour some kind of centralized repository that everyone can read from and write to:
To this end, I think we should create a Wikipedia for data: a global database for all of these important data sources to which we all contribute and that anyone can use.
As readers of this blog will know, we’re sceptical of this ‘one ring to rule them all’ approach. In this regard, it is also important to distinguish between finding material, parsing it, and plugging it together: issues that were rather run together in the surrounding discussion. As I wrote in a comment to Bret’s post:
There seem to be several distinct issues you (and your commenters) are concerned with:
1. Discoverability of datasets. For this you want a registry of some kind and this is exactly what the Comprehensive Knowledge Archive Network (CKAN) is designed to do. …
2. ‘Developing’ data particularly using many contributors and a versioning (wiki-like) model. This seems a general problem and one which I wrote about in this post on the collaborative development of data back in February last year. Since then various projects have launched or developed which attempt to address this issue, even if only partially (e.g. Freebase, Swivel, Numbrary, http://www.openeconomics.net …). This then leads into:
3. Componentizing data so that one can easily plug different datasets together rather than having to aggregate data together in one big place (crudely: ‘One Ring to Rule them All’ vs. ‘Small Pieces, Loosely Joined’). After all it seems unlikely that any one organization, however large, can hold ‘all the data’, and in any case doing so would negate the benefits of having ‘many minds’ working on a problem. It is our hope that CKAN would start to facilitate the kind of packaging that one frequently observes in software but is, as yet, fairly rare for knowledge (data/content/…). More on this can be found in this blog post on componentization plus the slides from our presentation at XTech.
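As a rough illustration of the ‘Small Pieces, Loosely Joined’ idea in point 3, here is a minimal Python sketch in which two separately maintained datasets are plugged together on a shared key rather than being merged into one central store. The dataset contents, column names, and the `load` and `join` helpers are all made up for the example; this is not how CKAN itself works, only a sketch of the general pattern.

```python
import csv
import io

# Two hypothetical datasets, imagined as being published and maintained
# by different groups. The values are placeholders, not real figures.
POPULATION_CSV = "region,population\naa,10\nbb,20\n"
SPENDING_CSV = "region,spending\naa,200\ncc,300\n"


def load(text, key):
    """Parse CSV text into a dict of rows indexed by the given key column."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(text))}


def join(left, right):
    """Combine rows that share a key; keys present in only one dataset are skipped."""
    return {k: {**left[k], **right[k]} for k in left.keys() & right.keys()}


combined = join(load(POPULATION_CSV, "region"), load(SPENDING_CSV, "region"))
print(combined)
```

Each dataset stays a small, independent piece; the only thing the two publishers need to agree on is the shared key.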
To conclude, I definitely agree about the importance of having more open data and making it easier to find and use though I’m hoping that it will take a more decentralized and componentized form than simply a ‘wikipedia’ for data. More important though than any details is the fact that this kind of interest from a wider audience indicates that issues of data openness and production are going mainstream — something we as a community should strongly welcome.
Public Domain Dedication & License (PDDL) v.1.0 released at OKCon!
March 18th, 2008
Jordan Hatcher, of opencontentlawyer.com and chair of the Advisory Council for the Open Knowledge Definition, took the Public Domain Dedication & License out of beta on Saturday at OKCon.
The PDDL (which we blogged about in December) was initially sponsored by Talis and is specifically aimed at providing a suitable license for open data — taking account of rights in databases, such as those created by the EU Database Directive. As Jordan’s announcement states - the license is now ready for use. This is great news for producers and promoters of open data.
Open Bibliographic Data: The State of Play
March 6th, 2008
Given the public role of libraries and the fact that bibliographic metadata (i.e. the material in library catalogues) doesn’t seem that exciting from a commercial point of view you might think that, of all the types of data out there, it would be bibliographic data that would be the most open. You might even think, given the public-spiritedness of librarians, that this is the kind of area where not only could it be openly available but it would be openly available (in nice little bzip or gzipped dumps …).
In fact the situation is quite the opposite. Most libraries appear to implicitly or explicitly exert rights over their data with some libraries licensing access to their catalogue data for substantial sums of money. The following lists some of the examples (both closed and open) that we know of:
- Library of Congress: public domain in the US (or at least free) but copyrighted outside the US. See [1] and comments in the fred2.0 readme, which state:
These data are works of the United States Government and as such are not subject to copyright within the United States. (17 U.S.C §105).
The Library of Congress has copyrighted these data for use outside the United States. Contact the LC for permission prior to use or distribution of this data outside the United States. [http://www.loc.gov/cds/mds.html]
- fred2.0 (fred2.0 CKAN package): an excellent example of the effort to make material available but unfortunately has the same restrictions as the Library of Congress (from which the material is sourced).
- British Library: closed (and apparently gets sold for substantial sums).
- OCLC/Worldcat: closed. See the OCLC CKAN page.
- Barton/Simile: semi-open. Sourced from OCLC. Originally taken down but now back under CC non-commercial. See [1] for further discussion.
- OpenLibrary: in theory open (though no formal license or dump as yet and some material may have been sourced from LoC making it suspect outside of the US)
- isbndb.com: not really fully bibliographic data and status uncertain (see isbndb.com CKAN page)
- LibraryThing: closed. Does not seem to make data available, and its sources would likely make doing so problematic (from the about page):
LibraryThing uses Amazon and libraries that provide open access to their collections with the Z39.50 protocol. The protocol is used by a variety of desktop programs, notably bibliographic software like EndNote. LibraryThing appears to be the first mainstream web use.
As we continue to search for open sources of bibliographic data we’d love to hear from anyone who knows of examples not already on this list.
[1] http://www.bookism.org/open/2007/04/02/open-data-what-would-kilgour-think/
Creative Commons adopts ‘Free Cultural Works’ seal of approval
February 22nd, 2008
Yesterday Creative Commons announced that their Attribution and Attribution Sharealike licenses will feature a seal of approval and link to Freedom Defined - the Definition of Free Cultural Works. We’ve been in touch with Freedom Defined since May 2006 (we blogged about the project last year) as their aims are so similar to those of opendefinition.org and the Open Knowledge Definition.
While there was discussion last year of merging the two projects, it now looks as though they will remain complementary - with Freedom Defined focusing on cultural works, and with the Open Knowledge Definition retaining a broader conception of ‘knowledge’ that includes data (see e.g. Good news for open data).
Mike Linksvayer of Creative Commons comments:
This added signaling is part of an ongoing effort to distinguish among the range of Creative Commons licenses — never say the Creative Commons license, as there is no such thing. Our license deeds have always communicated the distinct properties of each license with icons and brief descriptions.
This is great news and will hopefully contribute to the strengthening of a more robust sense of free culture/open knowledge within the plethora of liberal licensing options that are now available!
Open Definition Advisory Council launched
February 15th, 2008
We are pleased to announce the launch of an Advisory Council for opendefinition.org. The Council will be formally responsible for maintaining and developing the Definitions and associated material found on the Open Definition site - including the Open Knowledge Definition and the Open Service Definition. As many of you will know, these definitions aim to provide clear and succinct sets of conditions for ‘openness’ in knowledge and services.
Jordan Hatcher of opencontentlawyer.com has kindly agreed to be Chair of the Council, which includes:
- Paul Jacobson, iCommons
- Paul Miller, Talis
- Peter Murray-Rust, Cambridge University
- Rufus Pollock, Open Knowledge Foundation & Cambridge University
- Rob Styles, Talis
- Peter Suber, Scholarly Publishing and Academic Resources Coalition (SPARC) & Earlham College
- Luis Villa, Columbia Law School, GNOME Foundation & Open Source Initiative
- Jo Walsh, Open Knowledge Foundation & Open Source Geo-Spatial Foundation
- John Wilbanks, Science Commons
More detailed biographies are available on the Advisory Council page.
It is our intention that the overall development of the material on the site will continue in the same community based and collaborative manner. The Council’s role will be to provide oversight, guidance and input into this process, not to replace it.
This is fantastic news for the definitions projects!
On Getting Raw Data for Cancer Research
February 4th, 2008
Andrew Vickers, a biostatistician at the Memorial Sloan-Kettering Cancer Center, New York, recently published an article in the New York Times about his experiences trying to get hold of raw data for cancer research: Cancer Data? Sorry, Can’t Have It. In it he describes various difficulties he has encountered trying to get hold of the data that could “make an immediate and important impact on the lives of cancer patients”. Reasons for reluctance to share data included:
- potentially making researchers uncomfortable that their analyses could be undermined;
- refusal on the grounds that the original research team might “consider a similar analysis at some point in the future”;
- privacy concerns;
- red tape;
- unwillingness to co-operate;
- the “difficulty of putting together a dataset”;
- potential for misinterpretation or misrepresentation.
Vickers states:
Given the enormous physical, emotional and financial toll of cancer, one might expect researchers to promote the free and open exchange of information. The patients who volunteer for cancer trials often suffer through painful procedures and harsh experimental treatments in the hope of hastening a cure. The data they provide ought to belong to all of us. Yet cancer researchers typically treat it as their personal property.
He cites the research of Dr John Kirwan at the University of Bristol into researchers’ attitudes towards data sharing:
He found that three-quarters of researchers he surveyed, as well as a major industry group, opposed making original trial data available. It is worth restating this finding: most scientists doing research on how best to help those in pain, or at risk of death, want to keep their data a secret.
Vickers makes a strong case for the importance of sharing data and for “robust debate” in the domain of cancer research. He notes the ease with which raw data can now be shared.
This is an excellent particular case of a more general line we take at the OKF (e.g. see Give Us the Data Raw, and Give it to Us Now and Dead Knowledge: why being explicit about openness matters). Surely much is lost if data that could prove useful to cancer researchers sits collecting dust. Much could be gained if more trials data were open.
Meeting on UK Public Sector Information Re-use Request Service
January 15th, 2008
On Saturday I attended a ‘BarCamp’ on the Power of Information Review Recommendation 8 - which suggests there should be a re-use request service for UK Public Sector Information (we blogged about this in October).
The event was organised by John Sheridan of the Office of Public Sector Information and was attended by representatives from government, the private sector, the media, and nonprofits - including mySociety’s Tom Steinberg, who co-wrote the review in June 2007.
The meeting went well - and quite a bit of time was spent planning what the service will look like and what it will do. Below are some jottings for those that are interested. (These are rough and not comprehensive; please don’t hesitate to get in touch if there’s anything missing or incorrect!)
Notes
John Sheridan: Government have limited expertise in developing certain kinds of web services. What are barriers to being able to re-use UK PSI?
Brainstorming session: all participants were invited to suggest the kinds of things they’d like to see and discuss throughout the BarCamp.
Rob McKinnon, mySociety NZ
- suggested a kind of e-democracy existed in the 19th c. with suffragists?
- parliament = paper?
- insert vote output legislation
- more paper inside parliament
- parliament is about data
- creating data rather than paper
- he was inspired by Public Whip and They Work For You
- screenscraping HTML from government website
- NZ theyworkforyou site
- (aside, Francis Irving: we used data, then the click use license came along)
- (aside, John Sheridan: several hundred thousand pounds of potential revenue lost through switch to click license)
- no copyright exists on NZ parliamentary debates
- (aside, Richard Quarrell: distinction between local government and central government?)
- screenscraping parliament.nz
- getting metadata from HTML SPAN tags
- trying on a small sample, testing on a larger sample
- doesn’t have to be structured/semantic data, rdf
- making ‘source’ information available - people will make use of it
- politics about politicians or people?
- networked democracy
- making information transparent, facilitate social collaboration, participation
- make data discoverable
- 80% of TheyWorkForYou NZ’s visitors come via Google’s search
- use canonical, reliable and readable URLs
- make data linkable
- let people mark content
- dopplr, upcoming
- more participation through transparency of requests
- (aside, Michael Cross: legal basis for requesting information remained available? National Archives. famous case of documents from East India Company smelling of a certain oil - which turned out to be source of information about health)
- (aside, Richard Quarrell: it’s an interesting question whether metadata/html tags count as ‘information’ under, e.g. FOI requests…)
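The notes above mention screenscraping HTML and getting metadata from SPAN tags, trying on a small sample first. A minimal sketch of that idea, using only Python’s standard library, might look like the following. The sample markup and the class names (`speaker`, `date`) are hypothetical and are not taken from parliament.nz; nested spans are not handled.

```python
from html.parser import HTMLParser


class SpanMetadataParser(HTMLParser):
    """Collect the text of <span> elements, paired with their class attribute."""

    def __init__(self):
        super().__init__()
        self.records = []           # list of (class name, text) pairs
        self._current_class = None  # class of the span being read, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # Only start tracking when we are not already inside a tracked span.
        if tag == "span" and self._current_class is None:
            cls = dict(attrs).get("class")
            if cls:
                self._current_class = cls
                self._buffer = []

    def handle_data(self, data):
        if self._current_class is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._current_class is not None:
            text = "".join(self._buffer).strip()
            self.records.append((self._current_class, text))
            self._current_class = None


def extract_span_metadata(html):
    """Return (class, text) pairs for every classed <span> in the HTML."""
    parser = SpanMetadataParser()
    parser.feed(html)
    parser.close()
    return parser.records


# Hypothetical small sample, standing in for one scraped page fragment.
SAMPLE = (
    '<p><span class="speaker">A. Member</span> spoke on '
    '<span class="date">2008-01-12</span></p>'
)
records = extract_span_metadata(SAMPLE)
print(records)
```

Running the extractor on a handful of pages first, and only then on a larger batch, is one way to follow the ‘small sample, then larger sample’ approach from the notes.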
Discussion
- John Sheridan: NZ gov effectively developed own microformats. One thing he [John] does is convince government metadata working group to develop microformat. Microformats for licensing like Creative Commons. Adding licensing information to URI.
- John Sheridan: Heaps of work on microformats already exists.
- Tom Steinberg: Respond in a flexible manner to demand for formats rather than blanket mandate for all material to be in format X.
- Richard Quarrell: government sharing - egovernment standards?
- Tom Steinberg: different tags for different government departments
Francis Irving, mySociety
- Freedom of Information requests
- discussed mySociety’s Freedom of Information Filer and Archive service which is under development
- database dumps
- (aside, Glyn Wintle: sometimes government find it easier to give whole database rather than answer a particular query)
- encouraging real names on requests rather than pseudonyms
- (aside, John Sheridan: snag that not all data gov has it owns. can’t re-publish third party information without investigation… local authority can serve requests, but cannot give permission to republish.)
- (aside: rights in, e.g. address data. 3 different bodies own rights: Post Office, …)
- make copyright clear on data
- (aside, John Sheridan: suggest we move naturally on to licensing psi, rights, etc.)
Michael Cross, Free Our Data
- APSI looking at bigger economic picture
- raw data should be made available for free
- (aside, John Sheridan: government encrypting all in sight, if think no-one wants it. whilst policy framework encourages re-use…)
- (aside, Stephan Carlyle: Deal with 40,000 FOI requests. 900,000 environmental information regulation requests. Value added requests. Point to info already published. Publishing often costs less than production. Balanced approach to access. Most requests are members of public wanting to find out certain things in their locality. Danger of having too much of a focus on boundaries and exceptions. 97% of requests easy to respond to within 20 days.)
- (Tom Steinberg demonstrated Department of Health Information Asset Register)
- (aside, Richard Quarrell: plan to publish comprehensive list of IARNs - information asset register numbers)
- (aside, Tom Steinberg: do we design a request service to put pressure on government FOI policy, or one that works within the bounds of existing legislation?)
- (aside, Rob McKinnon: refinement with proprietary content?)
- (aside, John Sheridan: Cross Cutting Review [of Knowledge] says material is available for re-use by default)
- Discussion of point 9.7 of click use license, about using PSI in misleading context.
(Lunch)
- Demoing maptasm
Brainstorming for request service
- What are the goals?
- Who are the users?
- What sort of things are people going to request?
- How will people find out that they can request info?
- How will the service help PSIHs (Public Sector Information Holders) to be more responsive?
- How do we make sure the data is released in internet time?
- Relation to FOI + distinction?
- The process - how should it work?
Aims
- Government information provision is driven by what people want. Provision should become driven by demand.
- Not enough to make culture change documents. Need sticks. Quite strong incentives. Shame and money. Pressure through revealing failure to serve. Threat of budget cuts if information isn’t served.
- Increase knowledge of what is available and what has value (if not published).
- What ‘raw’ info is available and consistent way of gaining access. Expressing this.
- Complement other initiatives
- To be safety net for all other information provision.
Types of request
- Too expensive to get under other rights of access or re-use
- Clarification of licenses
- Change of licensing terms
- Can’t find
- Change of law relating to info publication
- data that doesn’t exist but should
- Change cost of information
- Different formats of information
- Change purchasing or obtaining
Who are the users
- Civic society
- Academics
- Private sector (data products, open models)
- Public servants/departments
- Individuals
How will people find out
- In every copyright statement
- Within click use license pages
- OPSI front page
- IFTS site
- Anywhere you can buy information from government
- PSIH how to get info pages
- FOI officer training
- Information Commissioner’s office (ico)
- Success stories?
- Business office
- Its own blog
- Distinction between procedural and policy complaints?
How does it work
- Category of change requested
- Details of user
- PSIH(s) concerned (and ‘don’t know’)
- Dataset requested
- Problem is wrong policy
- Problem is execution
- Nature of problem
- What I could do if I had it
- How it should work in future
- Story/history
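The fields brainstormed under ‘How does it work’ could be sketched as a simple record type. The following Python is only an illustration of how such a request might be represented: the class name, field names, and the `holders_label` helper are invented for the example and do not describe any actual OPSI system.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class ProblemKind(Enum):
    """Whether the requester sees the problem as one of policy or of execution."""
    POLICY = "problem is wrong policy"
    EXECUTION = "problem is execution"


@dataclass
class ReuseRequest:
    """One request to the re-use request service (hypothetical structure)."""
    category: str              # category of change requested
    user_details: str          # details of user
    dataset: str               # dataset requested
    problem_kind: ProblemKind
    nature_of_problem: str
    intended_use: str          # "what I could do if I had it"
    desired_future: str = ""   # "how it should work in future"
    history: str = ""          # story/history
    holders: List[str] = field(default_factory=list)  # PSIH(s) concerned

    def holders_label(self):
        """Name the holders concerned, or fall back to 'don't know'."""
        return ", ".join(self.holders) if self.holders else "don't know"


# Example values are placeholders only.
request = ReuseRequest(
    category="Different formats of information",
    user_details="example user",
    dataset="example dataset",
    problem_kind=ProblemKind.EXECUTION,
    nature_of_problem="only available as scanned pages",
    intended_use="build a searchable index",
)
print(request.holders_label())
```

Leaving `holders` empty by default mirrors the ‘don’t know’ option in the notes, since a requester may not know which body holds the data.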
The process - what happens?
- Published on a page
- Published on a new items page
- Problems email alerts + rss feeds
- Mail named contact?
- Provides tips and tricks + explanation for obtaining what they want
- Endorse function inc. status of endorses + use cases
- Discussion thread inc. authentication for PSIH responses
- Write to creator of report
- Canonical list of bodies which publish PSI? See civil service handbook for lists.
Endorsing page
- Name
- Allow me to be emailed by (opsi/poster/anyone)
Response to ‘The Future of Bibliographic Control’ draft from the Library of Congress
December 19th, 2007
A couple of weeks back we blogged about the ‘Future of Bibliographic Control’ draft report from a working group at the Library of Congress. Since then, we’ve submitted to the group a brief, collaboratively edited response to the draft and an appendix with some additional detailed comments.
The response was drafted by the Open Knowledge Foundation and Aaron Swartz of the Open Library and was co-signed by over 150 groups and individuals, including:
- Lawrence Lessig, Founder, Creative Commons
- Brewster Kahle, Founder, Internet Archive
- Tim O’Reilly, Founder and CEO O’Reilly Media
- Tim Spalding, Founder, LibraryThing.com
- Peter Suber, Senior Researcher, The Scholarly Publishing and Academic Resources Coalition
- John Wonderlich, Program Director and John Brothers, CTO, Sunlight Foundation
- Paul Miller, Rob Styles, Terry Willan, Talis
- Rick and Megan Prelinger, Prelinger Library & Archives
- … and librarians, system librarians, catalogers, assistant librarians, library support staff, library users, library school lecturers and students, consultants, academics and software developers from Australia, Belgium, Canada, Germany, India, Italy, the Netherlands, Norway, Portugal, Spain, the Ukraine, the UK and the US.
Many, many thanks to all of those who helped to publicise this, and to those who co-signed the response! We hope that the working group consider amending the draft in light of our comments in January.
Good news for open data: Protocol for Implementing Open Access Data, Open Data Commons PDDL and CCZero
December 17th, 2007
Last night Science Commons announced the release of the Protocol for Implementing Open Access Data:
The Protocol is a method for ensuring that scientific databases can be legally integrated with one another. The Protocol is built on the public domain status of data in many countries (including the United States) and provides legal certainty to both data deposit and data use. The protocol is not a license or legal tool in itself, but instead a methodology for a) creating such legal tools and b) marking data already in the public domain for machine-assisted discovery.
As well as working closely with the Open Knowledge Foundation, Talis and Jordan Hatcher, Science Commons have spent the last year consulting widely with international geospatial and biodiversity scientific communities. They’ve also made sure that the protocol is conformant with the Open Knowledge Definition:
We are also pleased to announce that the Open Knowledge Foundation has certified the Protocol as conforming to the Open Knowledge Definition. We think it’s important to avoid legal fragmentation at the early stages, and that one way to avoid that fragmentation is to work with the existing thought leaders like the OKF.
Also, Jordan Hatcher has just released a draft of the Public Domain Dedication & Licence (PDDL) and an accompanying document on open data community norms. This is also conformant with the Open Knowledge Definition:
The current draft PDDL is compliant with the newly released Science Commons draft protocol for the “Open Access Data Mark” and with the Open Knowledge Foundation’s Open Definition.
Furthermore Creative Commons have recently made public a new protocol called CCZero which will be released in January. CCZero will allow people:
(a) ASSERT that a work has no legal restrictions attached to it, OR
(b) WAIVE any rights associated with a work so it has no legal restrictions attached to it,
and
(c) “SIGN” the assertion or waiver.
All of this is fantastic news for open data!
‘The Future of Bibliographic Control’ and Licensing Policies for Bibliographic Data
December 6th, 2007
Last week the Working Group on the Future of Bibliographic Control at the Library of Congress released their Draft Report. They are soliciting public comment until the 15th December, in good time for final submission on the 9th January.
The aim of the working group is to:
Present findings on how bibliographic control and other descriptive practices can effectively support management of and access to library materials in the evolving information and technology environment.
They will make recommendations for the library world in general, and the Library of Congress in particular. The group includes representatives from:
- the American Association of Law Libraries
- the American Library Association
- the Association of Research Libraries
- the Special Libraries Association
- Microsoft
- the Coalition for Networked Information
- OCLC
Some notes on the draft
The draft continually emphasises that our information environment is changing and that libraries must seek to keep abreast of these changes through new policies and new kinds of partnerships. Alongside urges for libraries to take heed of nontraditional third party content (e.g. book reviews, cover images), and to work towards new kinds of shared standards, there is mention of greater sharing of bibliographic material.
Paragraphs such as the following suggest that the authors are proposing new ways of ‘opening up’ silos of bibliographic data:
“The future of bibliographic control will be collaborative, decentralized, international in scope, and Web-based. Its realization will occur in cooperation with the private sector, and with the active collaboration of library users. Data will be gathered from multiple sources; change will happen quickly; and bibliographic control will be dynamic, not static. The underlying technology that makes this future possible and necessary—the World Wide Web—is now almost two decades old. Libraries must continue the transition to this future without delay in order to retain their relevance as information providers.” (p.1)
“Library bibliographic data will move from the closed database model to the open Web-based model wherein records are addressable by programs and are in formats that can be easily integrated into Web services and computer applications. This will enable libraries to make better use of networked data resources and to take advantage of the relationships that exist (or could be made to exist) among various data sources on the Web.” (p.23)
“The Working Group envisions a bibliographic infrastructure wherein data about entities of interest (e.g., works, places, people, concepts, and chronological periods) are encoded in agreed-upon ways and made available through agreed-upon Web protocols for ready and efficient use by other applications and services. LC and the library community need to find ways of “releasing the value” of the rich historic investment in semantic data onto the Web.” (p.29)
Moreover, the first of the five key recommendations made by the group is to:
“Increase the efficiency of bibliographic production for all libraries through increased cooperation and increased sharing of bibliographic records, and by maximizing the use of data produced throughout the entire “supply chain” for information resources.” (p.1)
More particularly, recommendations include:
1.1.1.5 All: Work with resource providers to coordinate data sharing in a way that works well for all partners. (p.12)
1.1.4.1 LC: Convene a representative group consisting of libraries (large and small), vendors, and OCLC members to address costs, barriers to change, and the value of potential gains arising from greater sharing of data, and to develop recommendations for change. 1.1.4.2 LC: Promote widespread discussion of barriers to sharing data. 1.1.4.3 LC: Reevaluate the pricing of LC’s product line with a view to developing a business model that enables more substantial cost recovery. (p. 13)
However, there is no specific mention of licensing policies for bibliographic data per se. While there is talk of new ways of sharing bibliographic material, there is a resounding silence regarding blanket licensing of the data - particularly open licensing - which would allow anyone to re-use the data, from technology companies to individual enthusiasts. Without further clarification, the draft might imply that sharing and collaboration with regard to bibliographic material might extend only to other libraries and companies who are willing to take a more active role in adding value to the material by developing new products or services. As the authors state:
“Once considered a public good, information access is today a commodity in a rapidly-growing marketplace.” (p. 12)
While of course cost-benefit analysis will be involved in taking licensing policy decisions, and it is unsurprising that open licensing of bibliographic data is not recommended outright for all libraries - it does seem surprising that open licensing is not even mentioned in the draft. (As an aside: surely as a US government operation, material produced by the Library of Congress is exempt from copyright - and hence effectively open by default, at least in the US?)
Open bibliographic data?
While prominent bibliographic projects such as OCLC are closed (see the oclc package entry in CKAN), projects such as The Open Library (which we’ve blogged about here and here) exemplify the benefits of an open approach. (See this post on Jay Datema’s blog for an interesting view of open licensing for bibliographic data.)
Open bibliographic data could bring about significant benefits to the general public (by allowing anyone to redistribute, re-use, and build on it), as well as to other institutions and commercial developers.
The British Library recently issued a press release in which CEO Lynne Brindley declares the current balance between private rights and the public domain is “not working”. Though their bibliographic data is closed (as are the products of their digitisation efforts), Lynne’s recent statement is germane:
“I think we at the British Library, echoing the intent of the Adelphi Charter, believe that while market economics are very important, the public interest also needs to be actively protected – this can be done in many different ways but one important, if not the most important way, is through enlightened and well informed legislation balancing the conflicting public and private interests that seek to create and inform our IP regime. There is a need for real innovation in business models and for the legislation to become fit-for-purpose for the digital age.”
We hope that the working group add explicit mention of the potential benefits of open licensing to their report. It would be great if the Library of Congress got a wealth of responses to their draft from the open knowledge community!
