Open Data Going Mainstream?
April 10th, 2008
Bret Taylor’s recent post entitled “We Need a Wikipedia for Data” has been garnering a lot of attention around the blogosphere. While his suggestions are not particularly novel, the post, and the attention it has garnered, are, I think, indicative of the growing interest in the issues of (open) data and its importance for the development of related services and products.
While I am generally in agreement with Bret’s arguments, there are a few differences that are worth raising. First, Bret appears to favour some kind of centralized repository that everyone can read from and write to:
To this end, I think we should create a Wikipedia for data: a global database for all of these important data sources to which we all contribute and that anyone can use.
As readers of this blog will know, we’re sceptical of this ‘one ring to rule them all’ approach. In this regard, it is also important to distinguish finding material, parsing it, and plugging it together, issues that got rather run together in the surrounding discussion. As I wrote in a comment to Bret’s post:
There seem to be several distinct issues you (and your commenters) are concerned with:
1. Discoverability of datasets. For this you want a registry of some kind and this is exactly what the Comprehensive Knowledge Archive Network (CKAN) is designed to do. …
2. ‘Developing’ data particularly using many contributors and a versioning (wiki-like) model. This seems a general problem and one which I wrote about in this post on the collaborative development of data back in February last year. Since then various projects have launched or developed which attempt to address this issue, even if only partially (e.g. Freebase, Swivel, Numbrary, http://www.openeconomics.net …). This then leads into:
3. Componentizing data so that one can easily plug different datasets together rather than having to aggregate data together in one big place (crudely: ‘One Ring to Rule them All’ vs. ‘Small Pieces, Loosely Joined’). After all it seems unlikely that any one organization, however large, can hold ‘all the data’, and in any case doing so would negate the benefits of having ‘many minds’ working on a problem. It is our hope that CKAN would start to facilitate the kind of packaging that one frequently observes in software but is, as yet, fairly rare for knowledge (data/content/…). More on this can be found in this blog post on componentization plus the slides from our presentation at XTech.
To conclude, I definitely agree about the importance of having more open data and making it easier to find and use though I’m hoping that it will take a more decentralized and componentized form than simply a ‘wikipedia’ for data. More important though than any details is the fact that this kind of interest from a wider audience indicates that issues of data openness and production are going mainstream — something we as a community should strongly welcome.
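As a rough illustration of the ‘Small Pieces, Loosely Joined’ idea, here is a minimal Python sketch of plugging two separately maintained datasets together on a shared key. Everything in it is an assumption made for the example: the region codes, the column names and the values are invented placeholders, and this is not how CKAN itself packages data.

```python
import csv
import io

# Two hypothetical, independently maintained datasets. They share nothing
# except a common key column ("region"). All values are invented placeholders.
POPULATION_CSV = """region,population
AA,100
BB,200
"""

OUTPUT_CSV = """region,output
AA,10
BB,20
CC,30
"""


def load(text, key):
    """Parse CSV text into a dict of rows indexed by the given key column."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(text))}


def join(left, right):
    """Combine the rows whose key appears in both datasets."""
    return {k: {**left[k], **right[k]} for k in left if k in right}


population = load(POPULATION_CSV, "region")
output = load(OUTPUT_CSV, "region")
combined = join(population, output)

for region in sorted(combined):
    print(region, combined[region]["population"], combined[region]["output"])
```

Neither dataset needs to know about the other, or to live in the same place; all that is required is an agreed key and a format that is easy to parse.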
On data transport through payment networks
December 13th, 2007
I recently ran across the Cruickshank Report, a review written in 2000 of the state of payment information systems in the UK, and enjoyed what it had to say about “money transmission” (think ATM networks, point-of-sale networks in shops, credit card networks, as well as intra-bank schemes for larger sums).
A lot of value is apparently created by the transport mechanism itself, in the form of per-use access fees: “around three quarters of a billion pounds per year are paid in this way to UK debit and credit card issuers. The interests of bank run schemes do not coincide with the public interest.” These are interchange fees, paid between individual members to cover the cost of services supplied from one member to another.
Inflated interchange fees create a number of detriments. First, they raise the cost to retailers of card payments. … Second, allowing issuers to recover costs through interchange payments weakens the incentive to cut costs through greater efficiency … Third, competition between payment mechanisms is distorted in favour of products with artificially high interchange fees.
Money transmission infrastructures such as ATM networks exhibit “network effects”. Initially there is a high entry cost to the builder of a network. The more people can be reached over a network, the more likely others are to join it, and the more value there is to each participating node - in how widely a credit card is accepted, or how widely a videophone is used. Each new user can be given the same level of service for less new cost than the previous one. Once maintenance cost is covered, up until capacity is full, that extra cost is effectively zero.
“Network effects also have profound implications for competition, efficiency and innovation in markets where they arise… Once a network is well established, it can be extremely difficult to create a new network in direct competition.” A very high cost of initial capital investment, and a great deal of redundancy in services, “raises entry barriers” to value transmission markets which “in turn leads to higher customer charges and lower levels of service in these markets. It also affects the geographic distribution…” where densely populated areas may become over-served, sparse ones neglected.
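The arithmetic behind this can be sketched in a few lines of Python. This is only an illustration under stated assumptions: counting distinct pairs of participants is one common, crude proxy for the value of a network, and the fixed build cost and zero marginal cost are invented numbers, not figures from the Cruickshank Report.

```python
def possible_links(n):
    """Number of distinct pairs among n participants: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def average_cost(n, fixed_cost, marginal_cost):
    """Cost per participant when a fixed build cost is shared by n users."""
    return fixed_cost / n + marginal_cost


# Hypothetical figures: a build cost of 1000 units and, while capacity
# lasts, no extra cost for serving one more participant.
for n in (10, 100, 1000):
    print(n, possible_links(n), average_cost(n, fixed_cost=1000.0, marginal_cost=0.0))
```

Under these assumptions the number of possible connections grows roughly with the square of the number of participants, while the cost per participant falls as the fixed cost is spread more widely.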
Except at times of high congestion, there isn’t a per-access cost impact over and above the maintenance costs of the underlying network. Cost to install, fix and improve services may be significant; but the benefit of being able to join such a network, collectively amongst participants, far outweighs this cost. A per-access fee levied against the ultimate end user may hold back the generation of network effects and act against the economic interests of the whole network.
The Cruickshank report’s analysis suggests a license for entry into “money transmission” networks which reflects the risk involved in trusting other participants to behave consistently, and the high levels of value being committed. The current schemes run by the payments industry have a “mutual governance” model, where the underlying network is operated by a not-for-profit company co-owned by the participating “competing” companies.
Yet the industry associations have formed not-for-profit mutuals on the grounds that some aspects of their business are better run collaboratively. There are many situations beyond payments networks that look like this and which tend to involve the underlying transmission medium for moving things from one place to another. A network - a road network carrying a bus network, or a communications network with public terminals - becomes so widespread and the necessity of interchange with it so complete that the cost of replicating it - where that is physically possible - must tend to be prohibitively more than the cost of joining it.
In a few very congested areas, private toll roads may be viable, but even then following the topology of a main network. Planning and licensing restrictions in the dependencies additionally limit who can participate in building infrastructure.
What does this look like? Well, it looks a lot like another non-Internet network which the Internet increasingly depends on to be of commercial interest, the cellular network. To become a full member of the GSM alliance and therefore entitled to read, and use, the specifications for phone call data exchange, one needs to have a “license” for a slice of spectrum and at least a minimal physical infrastructure of cell towers. A moratorium on new phone mast installation means that new market entrants must sublicense from competitors; even a really significant capital investment cannot do enough. This starts to look like what economists have called a “two-sided” market, where services depend on platforms, and an effective monopoly on the latter allows an entity
I want to claim a strong argument that there is a whole class of enterprises in which competition at the infrastructure layer cannot produce a better result than cooperation, and is likely to produce a worse one. If there really is a class of works which are “natural cooperatives”, I want to find out more about how they are constituted and how their runnings are best expressed in rules. For want of a better word, all these enterprises are some kind of “transport” and that’s what I’m trying to get at in proposing an “Open Transport” session for next year’s Open Knowledge Foundation conference.
An Economist Writes: What’s the best structure in welfare terms for society: standardization (cooperatively via an open or semi-open standard), standardization via monopoly, or multi-platform/network competition? Each of these structures will have different static (how good is it right now) and dynamic (investment in quality and innovation for the future) effects. For example open standards might be great statically but take ages to hammer out (while everyone negotiates) while a proprietary standard might be bad statically but fast to do (and be of good quality — since there are fewer compromises).
Mutual governance may come in for criticism, but perhaps it is rather the governance of mutual governance that is the problem. A revision of the rules describing these sorts of systems ought to be workable; the creation of a status for which these networks can apply in order to get tax breaks etc. could achieve the same as any regulatory regime which aimed to increase transparency, lower costs and encourage innovation. Surely all participants in the building of transmission networks which move value around - in the form of water and waste, data and energy - must share these aims?
Big Art Mob, public art and open heritage resources
November 30th, 2007
I’ve just been poking around at the Big Art Mob website which was launched by Channel 4 earlier this year and picked up a Royal Television Society Innovation Award earlier this month. It aims to “create the UK’s first comprehensive survey of Public Art” using user-submitted camera phone pictures and a Google maps API.
Though part of the project seems to involve soliciting feedback on what ‘public art’ is, and means, Big Art Mob also looks to adopt a legal definition of ‘public art’:
The Copyright, Designs and Patents act 1988 defines Public Art as sculptures, buildings, models for buildings and “works of artistic craftsmanship” which are permanently situated in a public place or in premises open to the public. This means you cannot walk into a gallery, for example, even a public gallery, and photograph a painting to send to Big Art Mob. Likewise you cannot go to a privately owned place, say a stately home, and photograph and send pictures of art there without permission.
This is a convenient way to try to avoid possible copyright infringement, or other legal difficulties surrounding the images that users submit. In keeping with the ‘public spirit’ of the project, the terms and conditions state that images contributed will be made available to others under a Creative Commons Attribution-NonCommercial-ShareAlike license.
It looks like a great project, and as well as being one of the first broadcaster endorsements of mobile blogging (as many people have pointed out), it looks as if it could generate a significant collection of CC licensed images displayed on the Big Art Map. However it would be even better if their images were fully open, and if the project made raw dumps of site location data and associated tags available for others to re-use!
The potential of open heritage resources - and an anecdote
Some of us at the OKF have been brainstorming about local heritage projects like this for a while. One line of thought is that linking user-generated material (including material from Flickr, Wikipedia, and so on) to material from local museums, libraries and archives could encourage the growth of a ‘public information ecology’ for local heritage. Naturally we think open licensing would help such an ecology to flourish - and would let developers experiment with different kinds of interfaces to enable users to explore, modify, extract and reuse material they are interested in. ‘Public art’ such as architecture, sculpture, and other landmarks is ideal subject matter for this!
I started thinking about the potential of open local heritage resources after my father and I spotted a stained glass window we both liked in a country church. He sent a picture of it to me (from his mobile phone), and later I tried to find out who might have done it. I was amazed at how much enthusiast-generated information was out there. For example, Stained Glass Window Records contains over 20,000 records from one hobbyist! After finding the organisation who invoiced the church for the glass I was able to narrow down the possible artists by comparing my picture with other images available on personal websites - until I spotted a striking resemblance with another window from the same period. I was furnished with a rough biography by cross-referencing Maltese genealogy and newspaper records.
This kind of impromptu amateur research has only fairly recently become possible. Many freely available resources out there are still not open. Imagine what kinds of applications would be possible if more hobbyists and institutions allowed the fruits of their labour to be re-combined and built upon!
Keeping “Open” Libre
November 20th, 2007
Last week I attended the Jornadas gvSIG, the developer/user gathering for the open source GIS project supported by the regional government in Valencia. There seems to be a very supportive climate towards free software and open licensed data in Spain. I was impressed to hear people from commercial consultancies and local government information and infrastructure departments talking so strongly about software libre and the need to compartir el conocimiento, where tecnologia proprietaria has no place in a proyecto cooperativo. Government is increasingly moving toward an explicit Creative Commons based open licensing approach to public data and its Spatial Data Infrastructure - census data, political and administrative shapes, street networks and aerial imagery - all kinds of geographic information, open and libre.
Our household only knows about Indo-European languages, but can’t think of another language than English where a distinction between libre (free) and gratis (free) isn’t explicitly made. Talk of datos libres or freie daten has both rhetorical strength and public plausibility in a way in which free, in English, hasn’t. The term “open source software” originally came about as a softening of the term “free software”, in an attempt to introduce a non-radical plausibility. Free and Open Source software can be essentially the same thing, under a different name, open licensed in the same way.
In the last few weeks I’ve heard of Google’s launch of “OpenSocial” and its bootstrapping of the “Open Handset Alliance”. The latter, certainly, is based on patent/license-encumbered hardware and does not offer an “Open Platform” that will run on more truly libre telephony hardware platforms such as OpenMoko. How libre is “open”, in these cases? How libre can a system be that relies on data formats and hardware recipes that require royalties and/or membership of a consortium in order to use it?
In such circumstances I am very glad of an effort like opendefinition.org, which attempts to describe a yardstick by which the libre qualities of open data, data services, data formats and works can be assessed. I hope that, in helping to keep the definition of usefully “open” clear, this may help to keep open free.
The IPCC Data Distribution Centre - environmental data licensing
November 20th, 2007
We’ve recently started looking into how much environmental data made available on the web is open in accordance with the Open Knowledge Definition. The Intergovernmental Panel on Climate Change (IPCC) has a Data Distribution Centre (DDC) - which is a good place to start to see what data is available. The DDC “offers access to baseline and scenario data for representing the evolution of climatic, socio-economic, and other environmental conditions”. Many datasets from research centres around the world are available from the centre.
The “Why does the DDC exist?” page states:
Data are being provided by the DDC over the World Wide Web. All research groups supplying datasets have agreed to these being in the public domain. The data are provided free of charge, but all users are requested to register to ensure both that the data are used for public scientific research rather than for commercial applications and also that they can be informed of possible modifications, additions and other new developments at the DDC.
It is unfortunate that the Centre is restricting commercial re-use of the datasets they provide - especially given that many important environmental datasets are produced by US government research groups and are effectively open.
Some datasets have more specific licensing information or terms of use, such as the Special Report on Emissions Scenarios 4th Assessment Report (SRES-AR4) Global Climate Model data page, which states:
These data are licensed for use in Research Projects only. A ‘Research Project’ is any project organised by a university, a scientific institute, or similar organisation (private or public), for non-commercial research purposes only. A necessary condition of the recognition of non-commercial purposes is that all the results obtained are openly available at delivery costs only, without any delay linked to commercial objectives, and that the research itself is submitted for open publication.
It would be great if more data producers and distributors had clearer metadata about the licensing and terms of use of their datasets! This would allow a more fine-grained approach to re-use, as opposed to the blanket approach of the IPCC DDC, and several other environmental dataset distributors.
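To give a flavour of what clearer licensing metadata could enable, here is a minimal Python sketch of per-dataset license records and a filter over them. It is an illustration only: the record fields, the dataset names and the `is_open` check are hypothetical, and the check is a crude approximation of the Open Knowledge Definition rather than a faithful encoding of it.

```python
from dataclasses import dataclass


@dataclass
class DatasetRecord:
    """Hypothetical per-dataset licensing metadata (field names are invented)."""
    name: str
    license_id: str
    allows_commercial_use: bool
    allows_modification: bool
    allows_redistribution: bool


def is_open(record):
    """Crude approximation: open means free to use, modify and redistribute."""
    return (
        record.allows_commercial_use
        and record.allows_modification
        and record.allows_redistribution
    )


# Invented example records, not real entries from any distributor.
CATALOGUE = [
    DatasetRecord("hypothetical-model-output", "research-only", False, True, False),
    DatasetRecord("hypothetical-station-data", "public-domain", True, True, True),
]

open_names = [record.name for record in CATALOGUE if is_open(record)]
print(open_names)
```

With metadata like this attached to each dataset, a re-user could pick out the material they are allowed to build on without reading every terms-of-use page by hand.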
(As an aside: we’ve started an Open Environmental Data wiki page and we’d warmly welcome any contributions to this!)
Give Us the Data Raw, and Give it to Us Now
November 7th, 2007
One thing I find remarkable about many data projects is how much effort goes into developing a shiny front-end for the material. Now I’m not knocking shiny front-ends, they’re important for providing a way for many users to get at the material (and very useful for demonstrating to funders where all the money went). But shiny front ends (SFEs from now on) do have various drawbacks:
- They often take over completely and start acting as a restriction on the way you can get data out of the system. (A classic example of this is the Millennium Development Goals website which has lots of shiny ajax which actually makes it really hard to grab all of the data out of the system; please, please just give me a plain old csv file and a plain old url).
- Even if the SFE doesn’t actually get in the way, they do take money away from the central job of getting the data out there in a simple form, and …
- They tend to date rapidly. Think what a website designed five years ago looks like today (hello css). Then think about what will happen to that nifty ajax+css work you’ve just done. By contrast ascii text, csv files and plain old sql dumps (at least if done with some respect for the ascii standard) don’t date — they remain forever in style.
- They reflect an interface centric, rather than data centric, point of view. This is wrong. Many interfaces can be written to that data (and not just a web one) and it is likely (if not certain) that a better interface will be written by someone else (albeit perhaps with some delay). Furthermore the data can be used for many other purposes than read-only access. To summarize: The data is primary, the interface secondary.
- Taking this issue further, for many projects, because the interface is taken as primary, the data does not get released until the interface has been developed. This can cause significant delay in getting access to that data.
When such points are made people often reply: “But you don’t want the data raw, in all its complexity. We need to clean it up and present it for you.” To which we should reply:
“No, we want the data raw, and we want the data now”
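To show how little is needed once the raw data is available, here is a minimal Python sketch that reads a plain csv file using only the standard library. The column names and values are invented placeholders, and an in-memory string stands in for the file so the sketch runs offline; with a real plain old url one could pass the decoded response from `urllib.request.urlopen` to the same function.

```python
import csv
import io

# Invented placeholder data standing in for a raw csv download.
RAW = """indicator,year,value
example_indicator,2000,1.5
example_indicator,2005,2.5
"""


def read_rows(fileobj):
    """Read every row of a csv file object into a list of dicts."""
    return list(csv.DictReader(fileobj))


rows = read_rows(io.StringIO(RAW))
values = [float(row["value"]) for row in rows]
mean_value = sum(values) / len(values)

print(len(rows), mean_value)
```

No front-end is involved at any point: anyone with the file can compute what they need, or build their own interface on top of it.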
Open Learn 2007
November 5th, 2007
Last week I went to the OpenLearn 2007 conference hosted at the Open University. A lot was packed into the couple of days, and there was representation from different OER (Open Educational Resources) groups from around the world. There was an abundance of new projects, papers, groups and initiatives mentioned, and a recurring sentiment was that it is difficult to keep track of all the things that are happening!
In terms of coverage: on-the-fly notes from conference bloggers are available from OCHRE and other blog posts should appear at the OpenLearn blog aggregator. I think the OU also intend to release video/audio footage of the conference.
Below are some musings from the event…
Towards an ‘open participatory learning ecosystem’
John Seely Brown’s talk started the conference with the idea that ‘we participate therefore we are’, with respect to learning. He emphasised the advantages of a collaborative, participatory approach to education. The architecture studio - where all of the models are on view and everyone is able to listen to appraisals of everyone else’s work - was used to convey the paradigm of collaborative, ‘open’ development, and, indirectly, the value of ‘releasing early and releasing often’.
He said that ‘tinkering’ is an important form of learning - and suggested we are experiencing a new wave of tinkering as a result of open software and content. He also described a vision of a world where learners are also educators in an ‘open participatory learning ecosystem’. Central to this vision is the notion of a culture of sharing, remixing, blending, and modifying which is enabled by open licensing practices. In his view, the combination of eScience, eHumanities, OERs and web 2.0 is creating a ‘perfect storm of opportunity’ for such an ecosystem to flourish.
Two examples he gave were the Faulkes Telescope Project, which gives students remote access to astronomical apparatus to perform experiments and pool/analyze their data, and Decameron Web, a user generated portal for resources dedicated to exploring Boccaccio’s work. I was reminded of what the OKF set out to do with Open Economics and Open Shakespeare - i.e. to create open knowledge ‘exemplar’ projects with open material and open ‘tools’ to allow users to explore and analyse the material. Also I’m sure open datasets such as those listed on CKAN could be the basis for interesting ‘social learning/research’ projects, by being integrated with visualisation tools (we’ve blogged about this before).
There was also discussion of new user-focused and user-led ways of collecting data for education and research. Patrick McAndrew told me about the Biodiversity Observatory, a joint project of the OU, Imperial, the Natural History Museum and 12 other projects to allow the public to contribute data about British wildlife. I wonder what kind of license they plan to make user-contributed data available under! Vijay Kumar spoke about iLabs - an architecture developed by MIT to allow students to gain remote access to laboratories.
Conceptions of ‘Openness’ and licensing practices
It was clear listening to the different talks that there were various different conceptions about what the ‘open’ in OER meant. There was certainly a strong sense that it is fundamentally related to liberal/open licensing practices (as opposed to just cost-free access) but it often seemed to have wider connotations than this. Erik Duval said that to him openness meant ‘removing barriers’ - including legal barriers, poor findability, and inconvenience to the user. Removing socio-economic obstacles to access, allowing access to source files, and creating a culture of inclusion and participation were recurring themes. I would be interested to hear more about how more people involved in OER felt about the Open Knowledge Definition!
Regarding licensing practices, speakers rarely made distinctions between different types of Creative Commons licenses. The term ‘open content’ was often taken to include material available under a license with noncommercial restrictions. In conversations I had about licenses with noncommercial restrictions (notably with people from MIT and the OU) - I was given the impression that many organisations were not opposed to the commercial usage of educational resources in principle. Commonly cited reasons for adopting one included wanting to incorporate other material available under noncommercial sharealike licenses (especially that which had been donated by other commercial organisations), the reluctance of content contributors (publishers, authors, educators, researchers…) and other parties, and wanting to prevent people mirroring with ads.
It would be great if more OER projects started using licenses requiring only attribution, or attribution sharealike so as to impose minimal restrictions on re-use! The absence of noncommercial restrictions could allow people to experiment with new models for sustaining the development of educational materials.
Repositories, registries and metadata
Chris Pegler gave an interesting talk about the wide range of repositories that now exist - from informal personal repositories to national, international and discipline-specific repositories. She also discussed the continuum of ‘user concerns’ and the different kinds of technologies available to aid different kinds of repository usage - from rights management and metadata standards to search facilities and RSS feeds. She used Jan Hylén’s taxonomy from his 2006 paper on OER for the OECD to analyse a range of repositories and uses.
Erik Duval gave a talk about ‘open metadata for open educational resources’ - alluding to his experiences with:
- ARIADNE - “A European Association open to the World, for Knowledge Sharing and Reuse”
- GLOBE - a global alliance aiming to make educational material accessible worldwide
- MELT - which “has been designed to provide users of learning content in schools with access to more useful types of metadata that will allow them to find resources that fit their needs, language, cultures and preferred ways of teaching and learning”
- MACE - an EU project “aimed at improving architectural education, by integrating and connecting vast amounts of content from diverse repositories, including past European projects existing architectural design communities.”
He stressed the importance of open metadata and spoke of ARIADNE’s work on ‘attention metadata’ - or metadata generated automatically from users’ clickstreams, and Kuleuven’s work on automatic metadata generation.
Finally Giovanni Fulantelli spoke about ‘OpenLOs’ (open learning objects), and the EU SLOOP (‘Sharing Learning Objects in an Open Perspective’) project. He described the importance of treating metadata as dynamic and changing information that is essential in supporting the evolution of learning resources.
It’s good to see the work being done on metadata for OER (though it looks like some of the data that’s being made available has NC restrictions - and is hence not ‘open’ as in the OKD). It’d be fantastic to have more discussions with members of the OER community about how CKAN should be able to handle metadata!
Update, 2007-11-14: As ibbo commented below, there were many interesting discussions of Learning Object Metadata (LOM) and of LOM standards, such as the 2002 standard, IEEE 1484.12.1. We’re certainly keen to keep track of developments in this area!
British History Online: Why the Restrictions?
October 31st, 2007
British History Online is a site created and run by the Institute of Historical Research (part of the University of London, I believe) and the History of Parliament Trust, located at: http://www.british-history.ac.uk/ (note the ‘ac.uk’ domain name signifying the official academic status, though rather unusually they do run ads). Their purpose is clearly stated on the front page:
“British History Online is the digital library containing some of the core printed primary and secondary sources for the medieval and modern history of the British Isles. Created by the Institute of Historical Research and the History of Parliament Trust, we aim to support academic and personal users around the world in their learning, teaching and research.”
Great stuff. And it looks like they are doing a fine job. For example, a quick browse indicates a recent addition was “A Catalogue of Ancient Deeds”, a digitization of a work originally produced by a Mr H. C. Maxwell Lyte in 1890. Obscure material perhaps, but undoubtedly of worth and precisely the kind whose value would be maximized by being made open: free for anyone to use, reuse and redistribute. In particular it’s good to remember The Many Minds Principle (the coolest thing to do with your material will be thought of by someone else) and what it means in this context:
- Openness would permit re-presentation of the material in different formats, different layouts and even different media (you want it in plain text: no problem, you want to mark it up in fancy xml — or even RDF: no problem …).
- Openness would permit recombination of this material with other sources. After all much of this kind of material, while interesting, on its own has limited value. By interlinking, annotating and combining it with other data and content we can multiply its utility massively.
- Openness would permit redistribution, easier archiving and distributed hosting (the site’s down: no problem here’s a mirror).
But surprise, surprise, what do we find at the bottom of every page:
“Copyright © 2007 University of London & History of Parliament Trust - All rights reserved”
Taking a look at their terms and conditions we find:
…
1. Licence
Unless otherwise stated, the copyright and other intellectual property rights in all material on this site, including photographs and graphical images, and the organisation and layout of the site are owned or controlled by the University of London or the History of Parliament Trust.
You are permitted to access, print and download extracts from this site on the following conditions:
- use of all material on this site is for information and for non-commercial or your own personal use only; any copies of these pages saved to disk or to any other storage medium may only be used for subsequent viewing purposes or to print extracts for non-commercial or your own personal use,
ed: do they have some kind of business model here or is this just “let’s restrict just in case”. Do they also realize that:
- (for example) hosting this material on a site which ran ads (just like they do) likely counts as commercial (and just the uncertainty as to whether it is or not is a deal-killer)
- they’ve just excluded a large number of those who might be interested in archiving (e.g. Google) or promoting this material (producers of open educational materials for schools).
- material on this site must not be modified in any way,
ed: ok, so there goes reuse
- graphics on this site must not be used separately from accompanying text, and
ed: why?
- any use of the material for a permitted purpose must be accompanied by (i) a full source citation; (ii) the University of London and History of Parliament Trust copyright notice; and (iii) this permission notice.
No part of this site may be reproduced or stored in any other web site or included in any public or private electronic retrieval system or service without the University of London and History of Parliament Trust’s prior written permission.
ed: ok so even if I felt public-spirited and wanted to archive this — and I probably couldn’t even redistribute it — I’d need to ask permission. Really makes it an attractive proposition.
The University of London and History of Parliament Trust reserve all rights not expressly granted in these terms. [ed: one final irony is that much of the stuff on there appears to be (at least in its original unprocessed form) public domain!]
Fantastic, we’ve now pretty much disallowed all uses except plain access and printing (for non-commercial and personal purposes). Given their concern about commercial usage, one really has to wonder what revenue streams they have (or expect to develop) from the likes of “A Descriptive Catalogue of Ancient Deeds” or “Feet of fine for Sussex for 1190-1509”.
Moreover, like so many others, they just don’t get that the main benefit of making material digital is the potential for reuse, representation and distributed archiving and distribution. Let’s repeat one more time:
The Best Thing To Do With Your Material Will Be Thought of By Someone Else
WorldMapper: Is Its Data Open?
October 10th, 2007
WorldMapper produces a whole variety of illuminating cartograms to show the distribution of various statistics across the world from royalties to the level of military spending. While looking at the site I immediately started to wonder about the openness both of the maps themselves and the underlying data (to my mind while the maps are lovely the datasets are in many ways much more valuable because of their greater scope for reuse).
They certainly give out all the data (and the maps as well) for easy download but, as anyone familiar with previous posts will know, such ‘implicit’ openness does not imply ‘explicit’ openness. Looking at the copyright page one finds the following:
This website and its contents are copyright SASI Group (University of Sheffield) and Mark Newman (University of Michigan). Permission will usually be granted to non-profit organisations, and we welcome enquiries from other interested parties.
Please contact info@worldmapper.org
So far so bad. Even for non-profit organizations ‘permission’ is granted only ‘usually’ (so one will have to write to them to check in each case, and no transaction costs are saved). Furthermore it is unclear whether this is supposed to apply to just the maps (subject to copyright) or the underlying data which may be subject to both copyright (in the arrangement) and, here in the EU, database rights (note that one of the producers is the University of Sheffield). One might guess that they are only concerned with the maps, an impression reinforced by faq item 8, but unfortunately since the default is all rights reserved a presumption is not good enough.
Certainly the signs are that they are happy for people to use the data. After all, the main data files have been brought together for easy download on the data page, but no explicit statement regarding permissions is given. The data sources page further lists where all the data came from, and most of those sources seem open (UN, World Bank etc), though of course the fact that sources were open need not prevent WorldMapper from getting rights in the ‘database’ they have compiled. (One big exception to note regarding sources is Angus Maddison’s world historical statistics, which used to be freely available on Maddison’s site but access to which is now restricted to purchasers of the relevant book; see details on the world economy website. It should be noted, though, that Maddison’s data was likely never open: for example, Maddison’s World Population, GDP and Per Capita GDP, 1-2003 AD dataset (link to ckan package page) has an explicit copyright notice attached. This makes it particularly interesting that WorldMapper makes Maddison’s dataset available for download from their data page.)
However as already stated presumption ain’t enough. WorldMapper are certainly doing wonderful work and it would be even more wonderful if their data was open, free for others to use and reuse (for example we’d love to package it up and make it available via http://openeconomics.net). At present it seems like that’s the case but without an explicit statement in the form of a license one can’t be sure. One thing that is for sure is that we’ll be writing to info@worldmapper.org in the near future to find out.
What Do We Mean by Componentization (for Knowledge)?
April 30th, 2007
Background
Nearly a year ago I wrote a short essay entitled The Four Principles of (Open) Knowledge Development in which I proposed that the four key features of a successful (open) knowledge development process were that it was:
- Incremental
- Decentralized
- Collaborative
- Componentized
As I emphasized at the time, the most important feature, and currently the least advanced, was the last: Componentization. Since then I’ve had the chance to discuss the issue further, most recently and extensively at Open Knowledge 1.0, and this has prompted me to re-evaluate and extend the ideas I put forward in the original essay.
What Do We Mean By Componentization?
Componentization is the process of atomizing (breaking down) resources into separate reusable packages that can be easily recombined.
Componentization is the most important feature of (open) knowledge development as well as the one which is, at present, least advanced. If you look at the way software has evolved, it is now highly componentized into packages/libraries. Doing this allows one to ‘divide and conquer’ the organizational and conceptual problems of highly complex systems. Even more importantly it allows for greatly increased levels of reuse.
The power and significance of componentization really comes home to one when using a package manager (e.g. apt-get for debian) on a modern operating system. A request to install a single given package can result in the automatic discovery and installation of all packages on which that one depends. The result may be a list of tens — or even hundreds — of packages in a graphic demonstration of the way in which computer programs have been broken down into interdependent components.
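To make the dependency-resolution step concrete, here is a minimal sketch in Python. The package names and the DEPENDS table are invented for illustration, and a real package manager such as apt-get does considerably more (version constraints, conflicts and so on). The sketch only shows how one request fans out into an ordered list of everything the requested package depends on.

```python
# Hypothetical dependency table: package -> packages it directly depends on.
# None of these names refer to real packages.
DEPENDS = {
    "photo-viewer": ["image-lib", "gui-toolkit"],
    "image-lib": ["compression-lib"],
    "gui-toolkit": ["font-lib", "compression-lib"],
    "compression-lib": [],
    "font-lib": [],
}


def install_order(package, depends=DEPENDS):
    """Return every package needed for `package`, dependencies first."""
    order = []
    seen = set()

    def visit(name, stack):
        # A package that is already on the current path means a cycle.
        if name in stack:
            raise ValueError(f"dependency cycle involving {name!r}")
        if name in seen:
            return
        seen.add(name)
        for dep in depends[name]:
            visit(dep, stack + (name,))
        # Append only after all dependencies have been appended.
        order.append(name)

    visit(package, ())
    return order


if __name__ == "__main__":
    print(install_order("photo-viewer"))
```

Asking for the single package "photo-viewer" yields five packages in total, each listed after the packages it depends on.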
Atomization
Atomization denotes the breaking down of a resource such as a piece of software or collection of data into smaller parts (though the word atomic connotes irreducibility it is never clear what the exact irreducible, or optimal, size for a given part is). For example a given software application may be divided up into several components or libraries. Atomization can happen on many levels.
At a very low level when writing software we break things down into functions and classes, into different files (modules) and even group together different files. Similarly when creating a dataset in a database we divide things into columns, tables, and groups of inter-related tables.
But such divisions are only visible to the members of that specific project. Anyone else has to get the entire application or entire database to use one particular part of it. Furthermore anyone working on any given part of the application or database needs to be aware of, and interact with, anyone else working on it: decentralization is impossible or extremely limited.
Thus, atomization at such a low level is not what we are really concerned with. Instead we are concerned with atomization into Packages:
Packaging
By packaging we mean the process by which a resource is made reusable by the addition of an external interface. The package is therefore the logical unit of distribution and reuse and it is only with packaging that the full power of atomization’s “divide and conquer” comes into play — without it there is still tight coupling between different parts of a given set of resources.
Developing packages is a non-trivial exercise precisely because developing good stable interfaces (usually in the form of a code or knowledge API) is hard. One way to manage this need to provide stability but still remain flexible in terms of future development is to employ versioning. By versioning the package and providing ‘releases’ those who reuse the packaged resource can use a specific (and stable) release while development and changes are made in the ‘trunk’ and become available in later releases. This practice of versioning and releasing is already ubiquitous in software development — so ubiquitous it is practically taken for granted — but is almost unknown in the area of knowledge.
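The release/trunk distinction can be sketched for a data package in a few lines of Python. This is an illustrative toy under stated assumptions, not an existing tool: the Package class, its methods and the "example-stats" dataset are all hypothetical, and the data is a flat dictionary so that a shallow copy is enough to freeze it.

```python
class Package:
    """A packaged resource with frozen releases and a mutable trunk."""

    def __init__(self, name, trunk):
        self.name = name
        self.trunk = dict(trunk)  # work in progress, may change at any time
        self._releases = {}       # version string -> frozen snapshot

    def release(self, version):
        """Freeze the current state of the trunk as a numbered release."""
        if version in self._releases:
            raise ValueError(f"{self.name} {version} already released")
        # Shallow copy: sufficient here because the values are plain numbers.
        self._releases[version] = dict(self.trunk)

    def get(self, version):
        """Return a copy of one specific release."""
        return dict(self._releases[version])


if __name__ == "__main__":
    stats = Package("example-stats", {"a": 1})
    stats.release("1.0")

    # Development continues in the trunk after the release.
    stats.trunk["a"] = 2
    stats.trunk["b"] = 3
    stats.release("1.1")

    print(stats.get("1.0"))  # users pinned to 1.0 are unaffected by later edits
    print(stats.get("1.1"))
```

Someone reusing release 1.0 keeps getting the same content however the trunk changes, and can move to 1.1 when they choose to.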
A Basic Example: A Photo Collection
Imagine we had a large store of photos, say more than 100k individual pictures (~50GB of data at 500k per picture). Suppose that initially this data is just sitting as a large set of files on disk somewhere. Consider several possibilities for how we could make them available:
Bundle all the photos together (zip/tgz) and post them for download. Comment: this is a very crude approach to componentization. There is little atomization and the ‘knowledge-API’ is practically non-existent (it consists solely of the filenames and directory structure).
In addition tag or categorize the photos and make this database available as part of the download. Comment: By adding some structured metadata we have started to develop a ‘knowledge-API’ for the underlying resource that makes it more useful. One could now write a screensaver program which showed photos from a particular category or auto-import photos by their area.
In addition suppose the photos fall into several well-defined and distinct classes (e.g. photos of animals, of buildings and of works of art). Divide the photo collection into these three categories and make each of them available as a separate download. Comment: An initial step in atomizing the resource to make it more useful; after all, 50GB is rather a lot to download for one photo.
In addition to dividing them up allow different people to maintain the tags for different categories (one might imagine those knowledgeable about animals are different from those knowledgeable about art). Comment: Atomization assists the development of good knowledge-APIs (the human mind is limited and divide and conquer helps us deal with the complexity).
Standardize the ids for each photo (if this hasn’t been done already) and separate the tags/categories data from the underlying photo data. This way multiple (independent) groups can provide tags/categorization data for the photos. Comment: Repackaging, along with the development of a better knowledge-API for the basic resource, allows a dramatic decrease in the level of coupling and an increase in the scope for independent development of complementary libraries (the tags). This in turn will increase the utility to end users.
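The last possibility above, standardized ids with tags kept apart from the photos, can be sketched as follows. Every id, file path and tag here is made up for illustration. The point is only that tag packages maintained by different groups can be combined with the photo package because they share the same ids.

```python
# Hypothetical photo package: standardized id -> file path.
PHOTOS = {
    "photo-0001": "animals/0001.jpg",
    "photo-0002": "buildings/0002.jpg",
    "photo-0003": "art/0003.jpg",
}

# Two independently maintained tag packages, keyed by the same ids.
ANIMAL_TAGS = {"photo-0001": {"animal", "bird"}}
ART_TAGS = {
    "photo-0003": {"art", "painting"},
    "photo-0001": {"illustration"},
}


def merge_tags(*tag_sets):
    """Combine several tag packages into one id -> set-of-tags mapping."""
    merged = {}
    for tags in tag_sets:
        for photo_id, labels in tags.items():
            # Build fresh sets so the source packages are never modified.
            merged.setdefault(photo_id, set()).update(labels)
    return merged


def photos_with_tag(tag, photos, tags):
    """Return the sorted file paths of all photos carrying `tag`."""
    return sorted(
        path for photo_id, path in photos.items()
        if tag in tags.get(photo_id, set())
    )


if __name__ == "__main__":
    all_tags = merge_tags(ANIMAL_TAGS, ART_TAGS)
    print(photos_with_tag("painting", PHOTOS, all_tags))
```

A screensaver of the kind mentioned earlier would only need photos_with_tag and whichever tag packages its user chose to install.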
Conclusion
In the early days of software there was also little arm’s-length reuse because there was little packaging. Hardware was so expensive, and so limited, that it made sense for all software to be bespoke and little effort to be put into building libraries or packages. Only gradually did the modern complex, though still crude, system develop.
The same evolution can be expected for knowledge. At present knowledge development displays very little componentization but as the underlying pool of raw, ‘unpackaged’, information continues to increase there will be increasing emphasis on componentization and the reuse it supports. (One can conceptualize this as a question of the interface vs. the content. Currently 90% of effort goes into the content and 10% goes into the interface. With components this will change to 90% on the interface and 10% on the content.)
The change to a componentized architecture will be complex but, once achieved, will revolutionize the production and development of open knowledge.
