Notes and reflections from #ScotGovCamp
August 1st, 2010
Yesterday I went to ScotGovCamp in Edinburgh and had a lovely time. Spent more of it chatting in the hallway than participating in the sessions; but have detailed notes from the Open Data session led by Chris Taggart of Openly Local, and scatterings from elsewhere.
Open Data
Chris cites his membership of OKF’s Open Government Data Working Group, the London Datastore advisory body, and the Westminster Local Public Data Panel. Good, we now know we are dealing with a pretty serious guy.
His focus has been on the “English Experience” and he’s come to make contacts in Scotland. Citing as recent developments with impact yet to be fully felt, the Ordnance Survey Open Data release and the disclosure of Westminster MPs’ expenses. Looking for “drivers and levers” that will surface as yet unseen issues in local government.
It’s much less clear (at least here in the UK) how local, as opposed to central, communications and decision-making networks actually work. Local authorities are in an unclear legal situation - European PSI law should oblige local government to publish more data, but the knowledge of the law is often just not there (people are too busy).
OpenlyLocal has been going for a mere 15 months. It was inspired by a Manchester version of They Work For You and by the ScraperWiki project. OpenlyLocal collects information about local government data sources and, critically, the people involved: the social networks involved in decision making at council level. The site now has some amount of data (scraped from websites and republished as Linked Data) for 158 councils in England and Wales - but for only 4 in Scotland. One ultimate aim is to encourage local authorities to re-adopt the data, and the practices, being created by Chris and the contributors to OpenlyLocal. Other motivations for publishing local administration info as pure data:
- Accessibility concerns. Publication of data, as opposed to pictures of data (like PDFs) avoids accessibility concerns. Creation of interfaces to data is expensive and incurs a maintenance burden…
- Possible to tie in to other hyperlocal resources - a good example in Edinburgh is Greener Leith
- Creation of an index, or directory, to existing council resources, that is easier to explore than a conventional website
Chris outlined 4 key reasons why open local data is important (though the reasons seem to alter with every re-telling).
- Transparency - we can see for ourselves, and draw our own conclusions.
- Engagement - citing Planning Alerts - casual engagement is possible, you don’t need to be obsessive
- Equality - “open data is about equality of access, because all this data is currently available for a price, and that’s not right”
- Relevance - to local temporal reality of affairs - less decoupled synthesis of prepared or reported data - just data.
“Quality of data is important and opening that helps (and is used as a blocker) but not as important as other points”
Can we make interfaces that work for our grandparents?
“There’s a much bigger step between creating nothing, and creating something, than between creating something stupid, and creating something great… just make a start, somewhere, anywhere.”
To local administrations - “it should cost nothing to release open data. If it doesn’t cost nothing, you’ve got a really bad outsourcing deal”.
To everyone else - “Fundamentally, it’s our data.”
Questions about quality
Recently, I’ve been thinking a lot about data quality within the geo ghetto, so it surprised me to hear several audience questions from local administrative workers, directly asking about data quality. How imperfect/unreliable/uncertain is the data? Given inevitable uncertainty, how is this doubt stopping us (or the decision makers for whom we are responsible) from opening data?
Data quality problems can have severe cost and social effects - one case cited was a database recording details of children, in which 5% of dates of birth were wrong, so 5% of people are being treated administratively as children when they are not, or treated as adults when they are not (at least according to the administrative definitions, processes etc).
It’s quite possible to measure quality, to test and to describe it. Data package tests, like software package tests, extracting what’s useful from the formal standards thinking on quality. But this is too much of a digression, some of which is here, some of which is on the way.
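As a rough illustration of what a "data package test" might look like for the date-of-birth case mentioned above, here is a minimal Python sketch. The record layout, the field names (`name`, `date_of_birth`), the sample rows and the 120-year plausibility limit are all assumptions made up for this example; nothing like it was shown at the session.

```python
from datetime import date

# Hypothetical records; names and values are invented for illustration.
RECORDS = [
    {"name": "A", "date_of_birth": "2004-05-17"},
    {"name": "B", "date_of_birth": "1850-01-01"},  # implausibly old
    {"name": "C", "date_of_birth": "2011-02-30"},  # not a real calendar date
    {"name": "D", "date_of_birth": "1990-11-03"},
]


def check_date_of_birth(value, today, max_age_years=120):
    """Return None if the value looks plausible, else a short reason."""
    try:
        dob = date.fromisoformat(value)
    except ValueError:
        return "not a valid ISO date"
    if dob > today:
        return "in the future"
    if today.year - dob.year > max_age_years:
        return "implausibly old"
    return None


def quality_report(records, today):
    """Run the check over every record and summarise the failures."""
    failures = []
    for record in records:
        reason = check_date_of_birth(record["date_of_birth"], today)
        if reason is not None:
            failures.append((record["name"], reason))
    error_rate = len(failures) / len(records) if records else 0.0
    return {"failures": failures, "error_rate": error_rate}


REPORT = quality_report(RECORDS, today=date(2010, 8, 1))
print(REPORT)
```

A report like this describes quality rather than fixing it, which may be all that is needed to publish the data with an honest statement of its reliability.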
Law and Computers
An interesting session, which I only caught the end of, and which is more fully described on the ScotGovCamp blog by my EDINA colleague, Nicola Osborne. My notes say this:
German reform in the early 19C. | Biblical census. Legislation | Standards | Influence e-Care records, ATOS Origin distributed versioning in citizen data - propagation, provenance, merging. Robot Queen? Automaton? Target specification - e.g. music education, department of education directive. specifications, models, records management overspecification in law, cost, fear. Westminster Information Act (ontology-like)
Cuts
Dropped into the session on cuts, which wasn’t all gloom and doom, but more vendor optimism about shared services. Asked vendors about whether they made free software, or could find a place where business benefit to themselves and organisational benefit to their (public administration) clients could be created by freeing their software (in parallel to building shared hosted services). Not sure there was an answer.
Wondering about open demographic data, social credit data, and what’s the non-proprietary answer to Experian.
Good comments from Chris Taggart in this session too - “specialising in one thing, as a service provider. Low barriers to entry - low barriers to exit equally important”. Wondering about a JISC-like body for stewardship of shared services for local authorities. Would probably become a beast.
Fragments of insight
The big consultancies that form consortia to do government work, work by mimicry - by mirroring the hierarchical administrative structures that they are serving. But then internally, they actually do iterative micro-procurements - as in EU consortia the bulk of the actual work is done by very small providers. Many large and small companies work across local authorities, and it would be fascinating to see the map of who and where they are, which Chris is beginning to derive from spending data.
Shadow networks, shadow systems form, inevitably, in organisations at scale. But a paradox - the more superficial openness there is (coming from cultural change, or coming from legal or quasi-legal mandates, or meeting in the middle) the less is actually recorded. Data implies audit, audit invokes fear of loss. So organisation becomes about emotional concerns - perhaps it would be helpful to recognise this more?
Note, I corrected a bit of this, Equality rather than Quality, with which I must be temporarily obsessed. Thanks Chris for notes. Thanks Tim Howgego for insights.
We Need Distributed Revision/Version Control for Data
July 12th, 2010
In the open data community, we need tools for doing distributed revision/version control for data like the ones that already exist for code.
(Don’t know what I mean by revision control or distributed revision control? Read this)
Distributed revision control systems for code, like mercurial and git, have had a massive impact on software development, and especially so in the F/OSS community — the distributed methodology works particularly well with open material.
The same would be true for data. Revision control, and specifically distributed revision control, would support (cf this and this earlier post):
- Incremental development: “patches”, changelogs etc
- Provenance tracking: showing who did what, when is built in to a revisioning system
- Broader participation: you don’t have to worry (as much) about who you let in because changes can be reverted. It’s also easier to get involved because you can have your own independent copy to play around with (Distributed).
- Easier collaboration: updates don’t mean making a full copy (and applying updates is automatic), you can see who is making changes, when etc etc
- Peer-2-peer model: different contributors can work simultaneously and independently (Distributed). Extra “features” can be added independently of mainline development with re-integration later (Distributed).
Because this is all a bit abstract it is worth giving a concrete example of why “distributed” revision control could be so useful.
Example
Imagine wikis on two related topics, say water sanitation technology and building construction technology for the developing world (alternatively just think of the first wiki and Wikipedia). It is likely there are some significant overlaps in the wiki pages but also many pages that don’t overlap. At the moment, for these projects to reuse information their only options are:
- (1) Copy the article from one wiki into the other
- (2) Or standardize on one wiki as the authoritative wiki for common content
These both have serious problems. For (1), the page goes out of date rapidly and you’ve forked the resource, reducing the value of effort on each. For (2), for the wiki that does not have the content, people have to go off from one wiki to edit in another (a disruptive experience), the material is not embedded within its relevant context, and it is harder to adapt the material for each specific site. Furthermore, and in part because of these issues, (2) is socially hard as it likely involves one wiki/community coming to dominate the other (whoever owns the “common” content).
However, in a world where things are distributed there is a completely different option: each wiki could have its own copy but be able to push and pull changes from the other wiki, with changes being merged. This allows for collaborative activity to continue but in a relatively independent way and solves the big social issue of who’s in charge (no-one is!).
The key take away from this is that a piece of technology (distributed version control) alters the social processes of collaboration thereby radically reducing the barriers to effective collaboration. And remember, social stuff is both a) hard and b) important.
Implementation (or why this is not trivial)
Two key features are involved, neither of which is much in evidence in (open) data at present:
- Data versioning/revisioning — the creation of “changesets”
- Transmission and management of associated changesets between multiple peer nodes.
It is the P2P nature of this model (as opposed to the classic client-server approach) that leads to it being termed “Distributed Revision Control”. Given the existence of distributed revision control for code one might hope that we could just reuse those technologies for data. Unfortunately it is not that simple:
- The key aspect to developing a revision control (distributed or not) is to work out the diff and changeset format. This has not been done for data.
- Diffing and revision control for code works because code can be considered as (structured) text where a line-based-approach (or, occasionally character-based-approach) to code makes sense. For data it usually doesn’t make sense:
- Consider a hacky way to version a relational database using traditional text revisioning tools:
  1. Dump the database to SQL.
  2. Revision the dump using standard code tools.
  The impact of renaming a column or table in this scenario is that hundreds or maybe thousands of lines in the dump would change (depending on how inserts were set up). Furthermore, the diff format for the SQL dump provides no easy way to apply changes to the live database: in essence, the diff has given you nothing over just taking snapshots. What is required here is some way to describe changes to a relational database in its own terms (there are plenty, by the way; this is just illustrating that simple text diffs don’t work well …)
- Unlike for code we probably have to talk about “what kind of data”. This is because the diff format we use to build “changesets” will depend on the structure of the data.
However, once you have diff (and merge) figured out for a given type of data we can directly reuse most of the ideas (and maybe even code) from frameworks used for software code. To put it briefly: it’s the diffing and merging that’s (relatively) hard — the rest we can copy!
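To make the idea of a changeset "in the data's own terms" a little more concrete, here is a minimal Python sketch for one kind of data: a table whose rows are identified by a primary key. The function names (`diff_rows`, `apply_changeset`), the changeset layout and the sample rows are all hypothetical, and the sketch assumes every row has the same columns, so it does not handle schema changes such as the column rename discussed above.

```python
def diff_rows(old, new):
    """Build a changeset between two tables, each a dict of key -> row dict."""
    changeset = []
    for key in sorted(old.keys() - new.keys()):
        changeset.append(("delete", key, None))
    for key in sorted(new.keys() - old.keys()):
        changeset.append(("insert", key, dict(new[key])))
    for key in sorted(old.keys() & new.keys()):
        # Record only the fields whose values differ.
        changed = {f: v for f, v in new[key].items() if old[key].get(f) != v}
        if changed:
            changeset.append(("update", key, changed))
    return changeset


def apply_changeset(table, changeset):
    """Return a new table with the changeset applied; the input is not mutated."""
    result = {key: dict(row) for key, row in table.items()}
    for op, key, payload in changeset:
        if op == "delete":
            result.pop(key, None)
        elif op == "insert":
            result[key] = dict(payload)
        elif op == "update":
            result[key].update(payload)
    return result


# Invented sample data.
OLD = {
    1: {"name": "alpha", "value": 10},
    2: {"name": "beta", "value": 20},
    3: {"name": "gamma", "value": 30},
}
NEW = {
    1: {"name": "alpha", "value": 11},
    3: {"name": "gamma", "value": 30},
    4: {"name": "delta", "value": 40},
}

CHANGES = diff_rows(OLD, NEW)
print(CHANGES)
```

Unlike a line diff of a dump, a changeset of this shape stays small when one value changes and can be replayed against a live copy of the table.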
Colophon
We have already made an attempt to implement distributed revision control ourselves for the specific case of the data stored in CKAN instances like http://ckan.net/.
Our approach was based heavily on the mercurial/git conceptual model and used as data structure the natural one implied by the domain model (~ database rows but not quite) — in essence we dump to json for each field and then do diffs on the json.
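For readers who want a feel for what "dump to json for each field and then do diffs" could mean in practice, here is a small Python sketch of a field-level three-way merge. It is an illustration only and is not the actual CKAN implementation; the function names, the record fields and the sample values are all assumptions.

```python
import json


def dump_fields(record):
    """Serialise each field to canonical JSON text so values compare reliably."""
    return {field: json.dumps(value, sort_keys=True) for field, value in record.items()}


def merge_records(base, ours, theirs):
    """Three-way merge of one record; returns (merged, conflicting_fields)."""
    b, o, t = dump_fields(base), dump_fields(ours), dump_fields(theirs)
    merged, conflicts = {}, []
    for field in sorted(set(b) | set(o) | set(t)):
        bv, ov, tv = b.get(field), o.get(field), t.get(field)
        if ov == tv:
            chosen = ov          # both sides agree (or both removed the field)
        elif ov == bv:
            chosen = tv          # only "theirs" changed
        elif tv == bv:
            chosen = ov          # only "ours" changed
        else:
            conflicts.append(field)
            chosen = ov          # keep "ours" and flag the field for a human
        if chosen is not None:
            merged[field] = json.loads(chosen)
    return merged, conflicts


# Invented sample records.
BASE = {"title": "Water data", "tags": ["water"], "licence": "unknown"}
OURS = {"title": "Water quality data", "tags": ["water"], "licence": "unknown"}
THEIRS = {"title": "Water data", "tags": ["water", "sanitation"], "licence": "cc-by"}

MERGED, CONFLICTS = merge_records(BASE, OURS, THEIRS)
print(MERGED, CONFLICTS)
```

Working per field means two people editing different fields of the same record never conflict, which is roughly what a line-based merge gives you for code.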
If you’re interested in finding out more here’s the code. Big kudos here to CKAN developer John Bywater who actually did almost all the work of getting this from concept to running code.
Dig the new breed, Part III - wrapping it all up
June 11th, 2010
This is the third in the amazing series of guest blogs from Ant Beck on the impact of linked open data for archaeology.
Part 1: New approaches to archaeological data analysis, as seen in the DART and STAR projects Part 2: Considering the ethics of sharing archaeological knowledge
OK, to recap we have:
- A scientific movement that advocates open approaches to data, theory and practice
- Emerging foundational interoperability using semantic web technology
- The potential to remove a barrier and facilitate the submission of primary data
These three powerful factors could prove to be highly disruptive. In combination they have the potential to turn archaeological data and data repositories from static siloed islands (containing data that is increasingly stale) into an interlinked network of data nodes that reflect changes dynamically.
The linch-pin is the use of triplestores (RDF databases) that provide persistent identifiers. Persistent identifiers allow us to refer to a digital object (a statement, a file or set of files) in perpetuity, even if the underlying storage location moves. This means links between objects are persistent: therefore, when an observation or interpretation changes, its effects are propagated through to all the data/events that link to it. I see organisations such as the ADS, Talis (an innovative semantic web technology provider whose Talis Platform includes a free RDF hosting service for open data) and national heritage bodies providing such services.
Some open science projects are likely to adopt RDF as their de-facto data sharing format. RDF triples (subject, predicate, object) provide a schema transparent mechanism for data storage. They are not ideal for all data types (raster data structures for example) but when used with Ontology and SKOS, as demonstrated by STAR, they are powerful analytical, search and inference tools.
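For anyone who has not met triples before, the following Python sketch shows the (subject, predicate, object) idea using plain tuples. A real project would use an RDF library and full URIs; the `ex:` identifiers, the `ex:` predicates and the date range here are invented for illustration, and only `skos:exactMatch` is a real SKOS property.

```python
# Each statement is one (subject, predicate, object) tuple; no table schema is needed.
TRIPLES = {
    ("ex:context42", "ex:contains", "ex:sherd7"),
    ("ex:sherd7", "ex:hasType", "ex:TypeIIb"),
    ("ex:TypeIIb", "skos:exactMatch", "ex:TypeIVd"),
    ("ex:TypeIVd", "ex:dateRange", "AD 100-200"),  # invented value
}


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    )


# Everything we know about one sherd, then every type mapping in the store.
print(match(TRIPLES, subject="ex:sherd7"))
print(match(TRIPLES, predicate="skos:exactMatch"))
```

Adding a new kind of statement means adding another tuple rather than altering a table definition, which is one way to read "schema transparent".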
So, what is the importance of storing heritage data in RDF? Well, it depends which point of view you take. From a data management perspective there is no longer any need to migrate data formats. However, to facilitate re-use, different “views” of the RDF model can be generated and incorporated into traditional analytical software, such as GIS. Importantly, analysis stops being a “knowledge backwater”: new knowledge can be appended back into the triplestore.

Linked Data concepts in archaeology
From a data curation, re-use and analysis perspective the quality of the data has the potential to be dramatically improved. Deposition is no longer the final act of the excavation process: rather it is where the dataset can be integrated with other digital resources and analysed as part of the complex tapestry of heritage data. The data does not have to go stale: as the source data is re-interpreted and interpretation frameworks change these are dynamically linked through to the archives, hence, the data sets retain their integrity in light of changes in the surrounding and supporting knowledge system.
An example is probably useful at this juncture: In addition to many other things pottery provides essential dating evidence for archaeological contexts. However, pottery sequences are developed on a local basis by individuals with imperfect knowledge of the global situation. This means there is overlap, duplication and conflict between different pottery sequences which are periodically reconciled (your Type IIb sherd is the same as my Type IVd sherd and we can refine the dating range…… Hurrah… now let’s have another beer). This is the perennial process of lumping and splitting inherent in any classification system. Updated classifications and probable dates allow us to re-examine our existing classifications. One can reason over the data to find out which contexts, relationships and groups are impacted by a change in the dating sequences either by proxy or by logical inference (a change in the date of a context produces a logical inconsistency with a stratigraphically related group). While we’re on the topic of stratigraphy, an area of notorious tedium and poor quality data (often with conflicting relationships), RDF allows rapid logical consistency checking, as stratigraphic relationships are basically a graph and RDF triples also form a graph. Publicly deposited RDF data should be linked data: this means that all the primary data archives are linked to their supporting knowledge frameworks (such as a pottery sequence). When a knowledge framework changes the implications are propagated through to the related data dynamically. This means that policy, development control and research decisions are based upon data that reflects the most up-to-date information and knowledge….. cool huh.
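The consistency check on stratigraphy can be sketched without any RDF tooling at all. The Python below treats each "is above" relationship as a directed edge and looks for a cycle, since a context cannot be both above and below another. The context names and the `find_cycle` function are hypothetical, and a real triplestore would express this as a query or inference rule rather than hand-written code.

```python
def find_cycle(relations):
    """Return a list of contexts forming a cycle in 'above' relations, or None.

    `relations` is an iterable of (upper, lower) pairs meaning that `upper`
    lies stratigraphically above `lower`. A cycle means the record is
    logically inconsistent.
    """
    graph = {}
    for upper, lower in relations:
        graph.setdefault(upper, []).append(lower)
        graph.setdefault(lower, [])

    unvisited, in_progress, done = 0, 1, 2
    state = {node: unvisited for node in graph}
    path = []

    def visit(node):
        state[node] = in_progress
        path.append(node)
        for nxt in graph[node]:
            if state[nxt] == in_progress:
                # Found a way back to a context already on the current path.
                return path[path.index(nxt):] + [nxt]
            if state[nxt] == unvisited:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        state[node] = done
        return None

    for node in sorted(graph):
        if state[node] == unvisited:
            found = visit(node)
            if found:
                return found
    return None


# Invented context identifiers.
CONSISTENT = [("c1", "c2"), ("c2", "c3")]
INCONSISTENT = [("c1", "c2"), ("c2", "c3"), ("c3", "c1")]

print(find_cycle(CONSISTENT))
print(find_cycle(INCONSISTENT))
```

The recursion is fine for a sketch but would want replacing with an explicit stack for very deep sequences.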
Incorporating excavation data into RDF means that ontology and SKOS can be used to dynamically repurpose the data for policy formulation, planning impact, regional heritage control and mitigation purposes in conjunction with the data in the Sites and Monuments Record (SMR). Raw data can be integrated from multiple different sources with different degrees of spatial and attribute granularity and, where appropriate, generalised so that the data is fit for the end users’ purpose. From a policy perspective curatorial officers no longer have to battle to stop datasets becoming stale and add new datasets to the local SMR. The SMR will remain an essential dataset: even though it is a generalised resource it is the only location of a digital record for resources that are unlikely to be digitised in the future (unless there is a very unlikely reverse in funding patterns). Thus the curatorial officer can develop more effective regional research agendas based upon up-to-date and accurate data.
This has the potential to change the way Historic Environment Information Resources (HEIRs) are managed by curatorial officers and transform how developers (property and software), policy makers and the general public engage with and consume any data. They will be able to support innovative access to primary linked data resources by researchers, planners and most importantly the public. This is a significant and important change in role. In addition the heritage data can be mashed up with other data resources to produce tailor made resources for different end-user communities – following the model successfully employed by data.gov.uk.
Data re-use and mashups are also important for those undertaking research and analysis. The big difference will be for those who undertake research or collect data that transcends different traditional analytical scales. For example, the National Mapping Programme which aims to “enhance the understanding of past human settlement, by providing primary information and synthesis for all archaeological sites and landscapes visible on aerial photographs or other airborne remote sensed data” will provide deeper insights when it is integrated with other data. However, this integration can occur in real time and add tangible interpretative depth. If an interpreter is digitising data from an aerial photograph and they see two ditches cutting one another they are unlikely to be able to tell the relative stratigraphic sequence of the two features. Direct access to excavation or other data will allow the full relationships and their interpretative relevance to be deduced during data collection.
In the longer term consumers of archaeological data will be more used to dealing with primary data, will become more aware of its potential and demand more of the resource. This should produce a ground up re-appraisal of recording systems and a better understanding of archaeological hermeneutics. The interpretative interplay between theory, practice and data as part of a dynamic knowledge system is essential. Although this has been recognised, in reality theory, practice and data have never really been joined up. We don’t have to use a one size fits all approach to conducting excavations, but we can tailor bespoke systems that address local, regional and national research challenges. We can generate interesting and provocative data that can be used to test theory and inform practice and move away from recording systems mired in the theoretical and intellectual paradigms of the mid 70’s.
The virtuous circle is re-established; theory will influence practice, which will change the nature of the data, which will impact on interpretative frameworks, which will provide a body of knowledge against which theory can be tested.
Final comments
There is a new breed: there are people and organisations who don’t want to do what’s always been done. People who are empowered and don’t believe that established institutions and hierarchies are the gatekeepers of progress: organisations that can, and want to, change the way we ‘play the game’, people who want to collaborate. Organisations that want to share. Open approaches can help to make all this happen. This is all facilitated by disruptive technology which is increasingly mature, broadly available for free (or at a low cost) and with low barriers of use and re-use. In the nearly twenty years of studying and working in the heritage sector I’ve seen it change dramatically. I feel we are on the cusp of changing the way we engage with our data which could profoundly alter the way we understand the past, how we can communicate this in the present and how we can sustainably manage a complex resource for the future.
Dig the new breed, Part II - open archaeology and ethics
June 11th, 2010
The second in this great series of three guest blogs by Ant Beck. See Part 1 for applications of linked data and remote sensing in archaeology. Part 3 will wrap things up and talk about the disruptive implications of linked open data for the impact of archaeology.
Open Science provides the framework for producing transparent and reproducible science by providing open access to raw data, algorithms and interpretations. Efforts such as STAR and STELLAR provide the foundation from which fine granularity excavation data can be made available as part of the semantic web and feed into Open Science analysis. This provides answers to the questions of how and why we should have open access to archaeological data. However, it does not provide answers to what data should be opened or if archaeological data should be opened at all. We move into the sphere of ethics and open archaeology.
Recently I have chatted to a number of people and organisations who want to open up heritage data. The conversations tend to have an ethical component. Like other disciplines, such as ecology, there are potential ethical issues in making heritage data open. The oft touted reason, in the UK at least, is that if access is given to this information then it will be exploited by “night hawkers” (irresponsible metal-detectorists) and other “treasure hunters” and sites (a term I don’t really like) will be destroyed.
This argument is polarised and plays to the lowest common denominator: it is based on the premise that “accessible knowledge will inevitably be abused” and eschews any of the benefits that data sharing can provide. Nor does it consider the nuanced ethical arguments concerning re-appropriation of artifacts collected under imperialist regimes or the ethical conundrum surrounding research into aboriginal or other indigenous communities (which, now that I’ve raised them, I won’t comment on further). The Portable Antiquities Scheme has done much to improve this argument.
The elephant in the room in this debate concerns those archaeologists who have sat on their archive for decades. We know of its significance but it is not available for academic and research analysis and does not inform the planning process. This has enormous impact on local planning policy, public and academic understanding, theory, practice etc. The 1990 introduction of Planning Policy Guidance 16 (PPG16: essentially commercial archaeology) in the UK, and the later Planning Policy Statement 5, have improved the situation a bit.
But I find the situation somewhat paradoxical. The UK curatorial systems expect that a generalised summary, or synthesis, of any investigation is deposited with the regional curatorial officers. This data is entered into the Sites and Monuments Record (SMR) and is publicly accessible. Therefore, the public has access to a generalised dataset. The expectations for primary, or raw, data are different: it’s considered ethically appropriate to deposit fine granularity data (i.e. non-generalised, primary, data, such as those from excavation) with the Archaeology Data Service (ADS). However, there are issues raised if an individual wants to do this outside such formal structures (although the Perry Oaks Project have released redacted versions of their site data).
Is this an issue of ethics, or where formal and informal work practices collide; or is this simply an issue of cost, where individuals and organisations have the will but not the finances? Alternatively, and possibly most likely, do archaeologists just feel uncomfortable making their fine grained data available to a mass audience without going through a representative authority such as the ADS? My feeling is that within the archaeology domain there is an informal belief that if data is deposited with a repository then the repository also takes the ethical responsibility if the data is released. Deposition so that data is available in perpetuity is part of business and academic best practice, however, deposition does not necessarily mean release and subsequent consumption by other parties (public or otherwise).
Whatever the answer the point remains: archaeologists, for right or wrong, consider the implications of placing fine grained data in the public domain and “Ethical considerations” have been identified as a “barrier” to deposition. However, there appears to be limited guidance as to how to resolve these issues. This means that many archaeologists are re-inventing the wheel. The challenge is to provide some supporting “thing” that makes it easy for individuals and organisations to get to a clear, and hopefully unambiguous, ethical position. Such a “thing” will reduce uncertainty thereby removing one of the barriers to data sharing. The current default position is the equivalent of doing nothing: surely this must change.
Supporting “stuff” which is recognised and approved by national heritage organisations and standards bodies will act as an important lubricant to help individuals and groups to release data through informal channels. It should be recognised that the relationship between the “citizen”, the archaeologists and heritage data will change: citizen science and citizen data will play more of a role in heritage than ever before. Hence, a focus on the informal is important: we don’t want more grey data, do we? The Portable Antiquities Scheme is the “poster boy” for archaeological approaches to citizen science - although they do have a range of different user access levels.
I raised this as a topic for the Archaeology working group at the Open Knowledge Foundation. Response so far has been positive and has spilled over to colleagues in the curatorial sector and beyond (the discussion thread can be found here). We’ll be setting up a meeting to discuss these issues later in 2010. Both the Archaeology Data Service and the University of Leeds have kindly offered a venue.
There’s also a start at creating an ethics statement on open access to raw archaeological data – a statement that should be supportable by institutions and individual researchers alike. If you’d like to get involved, please join the Open Archaeology working group and mailing list – involvement could be helping to craft the ethics statement, asking your institution to contribute its own statement, helping to plan and document the workshop.
Book Search, Museum View, and Exploitation
February 6th, 2010
Read today a Google Books PR piece on the Guardian website. Of out-of-print or hard-to-get books, it says, “Although copies may be available in libraries, they are effectively dead to the wider world.” Also heard today that Google Street View is proposing inside views, museum interiors.
Last week, I and some OKF people heard a Google Books lawyer, Antoine Aubert, speak at the 7th COMMUNIA workshop on the public domain.
Google digitise the holdings of libraries free of cost, returning the library a copy, retaining some exclusivity over further re-use for Google. For example, a library is asked not to allow other search engines to index the digitised full text of the works.
Rufus commented on the Public Domain Calculator cross-European project that “A library who will remain nameless would not provide us with their catalogue metadata because of an exclusive arrangement with Google in rights to re-use the catalogue. Were they mistaken?” Antoine was not able to give a definite answer, to this and other questions.
A library’s raison d’etre is to provide physical access to books. With high-quality digitisations online for free, physical traffic will definitely fall. Space used for storage in prime central locations is inefficient; why not just scan the books and keep them in an air-conditioned warehouse in Swindon?
Meanwhile a library’s purchasing power is partly determined by the number of people borrowing books. New books will be indexed and stored by Google directly from publishers. There won’t be much reason to visit a library.
The library will become a museum of books. The museum will become a mausoleum of things.
To survive as institutions, museums, libraries and archives need a sustainability model, one which cannot depend on state funding alone.
One path to explore is commercial services for special purposes - re-use of very large high-resolution scans, printing of images and facsimiles, new or custom images, new interfaces and search functions.
If Google now has the right to restrict the use of the works online, those libraries accepting the “free” digitisation offer are not free to build and maintain the services that, as memory institutions in a digital age, they really should be providing.
Well, there’s always Wikipedia, and particularly the Britain Loves Wikipedia events going on through February 2010, focused on photographing heritage objects.
Matthias Schindler spoke at the same COMMUNIA meeting about a German Wikipedia effort to fix and link metadata from authority files by the German National Library - some background slides. His message went, “Give us your metadata. Really, just give us your metadata right now.”
INSPIRE Directive heading towards UK law
May 24th, 2009
INSPIRE, the directive establishing a spatial data infrastructure for environmental information in Europe, is heading into UK law at last. DEFRA is doing a consultation on the transposition of the law and OKFN will hopefully co-submit a response by 26th May with the Open Rights Group; a summary of the responses is on the okfn-discuss mailing list.
In short it is fairly good news for those of you who are tiring of having requests for information about data holdings from the likes of Ordnance Survey, Transport for London, refused under FOI on the grounds of commercial confidence. Public authorities affected by the Freedom of Information Act 2000 Schedule 1 will be obliged to make the metadata for their geodata holdings available to the public free of cost, from 24th December 2010 (okay, so it’s still a bit of a wait). Additionally, “view services” complying with the Web Map Service spec will have to be available in just over 2 years time, for which there will be a “presumption of public access”.
So we will see Ordnance Survey’s MasterMap available in full via WMS (if still restricted for commercial use) or there will be a very good reason why it is not. What will happen to searches for data sets contained in MasterMap - will they come back as “here is the metadata for MasterMap as a whole, here is where to license it?”
I wrote about the metadata issue for Terradue a lot; the model contains 30-plus fields that must be completed, many of which don’t have a direct bearing on data search. But it will be mandatory, and it will be free of cost to all, and that will be a great improvement on where we are now. There will also have to be changes in how data licensing is managed online, as any data that is “restricted” for download must be made available through an e-commerce service. In the case of bodies like TfL this will mean a lot of data that has never surfaced publicly before. I am looking forward to it!
There is some uncertainty over what will happen to the information held by Trading Funds if they are fully privatised and thus no longer meet the definition of “public authority”, and I am trying not to worry about that.
Open organisations, need for two more definitions!
October 5th, 2008
If starting a new, public interest, organisation, there are three obvious principles you might like to have.
- Finance - have all bank transactions automatically public in real time. Plus accounts.
- Software - all software made by the organisation to be open source.
- Information - voluntarily subscribe to some sort of FOI law.
The software one is reasonably well covered.
There are problems with the finance one. For example, you probably need to anonymise individual donations, or at least those that are ‘small’. It would be lovely if somebody could think through all this, and come up with an “Open Finance Definition”, for describing when an organisation’s finances are truly open.
There are also problems with the Freedom of Information one. In the UK at least, subscribing to public sector FOI law voluntarily would be dangerous. You wouldn’t get the protection from defamation that a public sector body gets, and you may have trouble applying the public interest test clearly. So again, would be lovely if somebody could come up with an “Open Information Organisation Definition” which encoded a good principle to have for this.
Amazingly really, the more you think about this openness, the more things you find that could be open, and the more definitions you need. There’s work for you forever, Rufus
Some Agricultural History via Open Economics
September 15th, 2008
One of the active Open Knowledge Foundation projects is Open Economics. A substantial part of that effort ends up being data acquisition and ‘cleaning’: getting hold of economic data, parsing it into (computer) usable form and adding it to the Store. (Wouldn’t it be nice if that data was already nicely packaged up or at least in a decent raw form …).
Once this job is done, the data is there in a nice clean state for others to use — plus we can draw some nice graphs (as we will see below). As an illustration of this process, we’ll look at one particular dataset acquired earlier this year when, motivated by the large increases in commodity prices and the concerns expressed regarding their impact, I decided to see what data I could dig up on food prices (starting with Wheat).
As usual, it was US government material that was most easily available (in a decent format) and I decided to start off with historical information on wheat to be found in the Wheat Yearbook, in particular the contents of:
http://www.ers.usda.gov/data/wheat/yearbook/WheatYearbookTables-Recent.xls
While the data was available (and open — since US Govt provided) it was in a format that was not immediately computer usable (lots of blank lines etc). Thus, the first step was to parse this into standard csv file format (see script here) and then upload this to Open Economics. The result:
http://www.openeconomics.net/store/517d7c4e-3cb7-4e8f-aaa1-745dd665ad1f
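The cleaning step described above can be illustrated with a minimal Python sketch. This is not the script linked from the post. It assumes the spreadsheet has already been exported to CSV text and that every data row starts with a four-digit year; the function names and the sample rows are hypothetical, with made-up numbers.

```python
import csv
import io


def clean_rows(raw_text):
    """Keep only rows whose first cell starts with a four-digit year.

    Blank lines, titles, column headers and source notes are dropped.
    Assumption: data rows begin with a year such as 1866.
    """
    cleaned = []
    for row in csv.reader(io.StringIO(raw_text)):
        cells = [cell.strip() for cell in row]
        if not cells or not cells[0][:4].isdigit():
            continue  # blank line, title, header or footnote
        # Strip thousands separators so the values parse as plain numbers.
        values = [cell.replace(",", "") for cell in cells[1:]]
        cleaned.append([cells[0][:4]] + values)
    return cleaned


def to_csv(rows, header):
    """Serialise the cleaned rows as standard CSV text with one header line."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()


# Hypothetical sample imitating a spreadsheet export (numbers are made up).
SAMPLE = '''Table 1--Wheat (illustrative numbers only)
,,
Year,Acreage,Yield
,,
1866,15.4,11.0
,,
1867,"1,234.5",12.6
Source: made up for this sketch
'''

if __name__ == "__main__":
    print(to_csv(clean_rows(SAMPLE), ["year", "acreage", "yield"]))
```

Filtering on the first cell is crude, but it handles the "lots of blank lines" problem without needing to know where the header ends in any particular sheet.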
Not only do we now have nice clean data but, thanks to plotkit, Open Economics has javascript graphing so without any more effort we can automatically have graphs of the resulting material. Not only does this allow us to answer our original question (see Fig 4) but these graphs also tell a fascinating historical story:
US Wheat: 1866 - 2007
NB: if the figures are too small click through for the full-size versions on Open Economics (the dates at the bottom run from 1866 to 2007)
Figure 1: Output (Millions of Bushels)
First up is output. As can be seen here, output rose steadily (approximately linearly) up until the First World War. It then stayed flat or even fell during the inter-war period — the Great Depression and the Dust Bowl can be seen in the sharp dip in the early 1930s. Following the Second World War output rose, accelerating (exponentially?) up until the early 1980s, since when it has flattened out, even declining (with sharp variations) to the present.
Looking at these raw output figures, the immediate question one asks (at least as an economic historian) is: what underlying causes drove these changes in output? In particular, output is the product of two factors: total acreage in use and yield (average output per acre), so it would be interesting to see time-series for them as well. Fortunately this data is also available:
Figure 2: Acreage (Millions of Acres)
The first thing to note is that these series start in 1866, the year after the American Civil War ended. This was a period of great westward expansion in cultivation in the United States — the “Opening of the Prairies”. The graph bears graphic witness to these changes: we can see that harvested acreage tripled between 1866 and the outbreak of WWI in 1914.
This massive expansion was to have a profound effect far outside of the US: food prices dropped around the world due to the increase in supply. In Western Europe this led to a ‘Great Depression’ in agriculture right up until the First World War (which in turn had a significant effect on European politics, creating protectionist alliances between peasants and landowners in many European countries). It also assisted industrialization by keeping the price of bread low for the fast growing industrial proletariat.
However, by the end of WWI most of the acreage that could be cultivated was already in use. After that point, while there has been variation in planted acreage (perhaps driven by substitution between wheat and other crops), there has been no long term trend (whether increasing or decreasing). Thus, while the increase in output up to WWI can be largely explained by increases in acreage under cultivation [1], the large increases in output in the post-WWII period can’t be. This brings us then to the second major factor in explaining changes in output: yields.
Figure 3: Yield (Bushels / Acre)
One could not ask for a sharper confirmation of our previous hypothesis than Figure 3. As it shows, average yields were almost perfectly flat from 1866 up until the end of the Second World War. From that point yields took off, growing sharply but at an almost constant rate up until the mid 70s, following which the growth rate slowed substantially (though yields still continued to grow, albeit with increased variability). In concrete terms this corresponded to a rise in yield from around 12 bushels per acre at the end of WWII to somewhere around 35 bushels per acre in the 70s — and around 40 today.
To put this most starkly: there was a roughly 3-fold increase in yields in this 30 year period. Again this is a particularly ‘graphic’ testament to the ‘green revolution’ of the post-war period which was driven largely by the development and adoption of new corn varieties (hybrid corn), fertilizers etc.
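The arithmetic behind these statements can be checked with a short Python sketch. The yield figures (around 12 and around 35 bushels per acre over roughly 30 years) are the eyeballed values quoted above; the function names are hypothetical, and the acreage passed to `output` in the example is a placeholder rather than a value from the dataset.

```python
def output(acreage_millions, yield_per_acre):
    """Output (millions of bushels) = acreage (millions of acres) x yield (bushels/acre)."""
    return acreage_millions * yield_per_acre


def fold_increase(start, end):
    """How many times larger the end value is than the start value."""
    return end / start


def average_annual_gain(start, end, years):
    """Average absolute gain per year, treating growth as a straight line."""
    return (end - start) / years


def compound_annual_rate(start, end, years):
    """Equivalent constant percentage growth rate per year, as a fraction."""
    return (end / start) ** (1.0 / years) - 1.0


if __name__ == "__main__":
    start_yield, end_yield, years = 12.0, 35.0, 30  # figures quoted in the text

    print("fold increase:", round(fold_increase(start_yield, end_yield), 2))
    print("bushels/acre gained per year:",
          round(average_annual_gain(start_yield, end_yield, years), 2))
    print("compound annual growth:",
          round(100 * compound_annual_rate(start_yield, end_yield, years), 2), "%")

    # With acreage held flat (placeholder value), output scales with yield alone.
    flat_acreage = 50.0
    print("output ratio:",
          output(flat_acreage, end_yield) / output(flat_acreage, start_yield))
```

Because output is simply acreage times yield, holding acreage flat means the fold increase in output equals the fold increase in yield, which is the point the post-war part of the story rests on.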
Figure 4: Price ($ per Bushel)
Lastly we come to price. Here, despite substantial fluctuations, the basic trends fit with our historical intuition. There is little change between 1866 and WWI, a sharp rise during the war, a substantial decline in the inter-war period, then another sharp rise during WWII (wars are good for farmers!) followed by stabilization (or even slight decline) until the mid 1970s when there is another sharp rise. Following that there is substantial variation but no great changes until the present, when the line shoots up again (doubling from around $3 per bushel to somewhere near $6 in a year).
As basic economics tells us, price should reflect the interaction of supply and demand. The marked stability of price over long periods (particularly those where supply has increased) suggests then that demand has matched supply (or vice-versa) fairly well over this period (one might also need to take account of the fact that there may also have been substantial government intervention to stabilize prices).
Given that supply has risen substantially through the whole period, and especially since WWII (see Fig 1), this means that demand has also been climbing sharply. This is true: world population has increased at least 5x since 1850 and roughly tripled since WWII (in addition, many people, especially in developed countries, have increased their per-capita consumption, by eating more and better — as well as wasting more).
It would be interesting to imagine what would have happened if this kind of population increase, particularly that since WWII, had occurred without the massive increase in yields shown in Figure 3 (part of the answer may be that population would not have increased so much …). Certainly the price increases seen recently may reflect the kind of growing surplus of demand over supply that we would have seen without the ‘green revolution’. As such, they may be signals of the significant readjustments that will be needed in the near future, whether that be increases in supply, reductions in demand or more efficient use of existing supplies.
[1] A crude eyeballing suggests that output increased somewhere between 3 and 4 times between 1866 and WWI. This is in line with the increase in acreage. That said, diminishing returns arguments (best land is cultivated first) would suggest that maintaining yield per acre on a vastly increased acreage would have necessitated some increase in yields.
Open Data Going Mainstream?
April 10th, 2008
Bret Taylor’s recent post entitled “We Need a Wikipedia for Data” has been garnering a lot of attention around the blogosphere. While his suggestions are not particularly novel, the post, and the attention it has garnered, are, I think, indicative of the growing interest in the issues of (open) data and its importance for the development of related services and products.
While generally in agreement with Bret’s arguments, there are a few differences that are worth raising. First, Bret appears to favour some kind of centralized repository that everyone can read from and write to:
To this end, I think we should create a Wikipedia for data: a global database for all of these important data sources to which we all contribute and that anyone can use.
As readers of this blog will know, we’re sceptical of this ‘one ring to rule them all’ approach. In this regard, it is also important to distinguish finding material, parsing it, and plugging it together, issues that got rather run together in the surrounding discussion. As I wrote in a comment to Bret’s post:
There seem to be several distinct issues you (and your commenters) are concerned with:
1. Discoverability of datasets. For this you want a registry of some kind and this is exactly what the Comprehensive Knowledge Archive Network (CKAN) is designed to do. …
2. ‘Developing’ data particularly using many contributors and a versioning (wiki-like) model. This seems a general problem and one which I wrote about in this post on the collaborative development of data back in February last year. Since then various projects have launched or developed which attempt to address this issue, even if only partially (e.g. Freebase, Swivel, Numbrary, http://www.openeconomics.net …). This then leads into:
3. Componentizing data so that one can easily plug different datasets together rather than having to aggregate data together in one big place (crudely: ‘One Ring to Rule them All’ vs. ‘Small Pieces, Loosely Joined’). After all it seems unlikely that any one organization, however large, can hold ‘all the data’, and in any case doing so would negate the benefits of having ‘many minds’ working on a problem. It is our hope that CKAN would start to facilitate the kind of packaging that one frequently observes in software but is, as yet, fairly rare for knowledge (data/content/…). More on this can be found in this blog post on componentization plus the slides from our presentation at XTech.
To conclude, I definitely agree about the importance of having more open data and making it easier to find and use though I’m hoping that it will take a more decentralized and componentized form than simply a ‘wikipedia’ for data. More important though than any details is the fact that this kind of interest from a wider audience indicates that issues of data openness and production are going mainstream — something we as a community should strongly welcome.
On data transport through payment networks
December 13th, 2007
I recently ran across the Cruickshank Report, a review written in 2000 of the state of payment information systems in the UK, and enjoyed what it had to say about “money transmission” (think ATM networks, point-of-sale networks in shops, credit card networks, as well as intra-bank schemes for larger sums).
A lot of value is apparently created by the transport mechanism itself, in the form of per-use access fees: “around three quarters of a billion pounds per year are paid in this way to UK debit and credit card issuers. The interests of bank run schemes do not coincide with the public interest.” These are interchange fees, paid between individual members to cover the cost of services supplied from one member to another.
Inflated interchange fees create a number of detriments. First, they raise the cost to retailers of card payments. … Second, allowing issuers to recover costs through interchange payments weakens the incentive to cut costs through greater efficiency … Third, competition between payment mechanisms is distorted in favour of products with artificially high interchange fees.
Money transmission infrastructures such as ATM networks exhibit “network effects”. Initially there is a high entry cost to the builder of a network. The more people can be reached over a network, the more likely others are to join it, and the more value there is to each participating node, whether in how widely a credit card is accepted or how widely a videophone is used. Each new user can be given the same level of service for less new cost than the previous one. Once maintenance cost is covered, and up until capacity is full, that extra cost is effectively zero.
“Network effects also have profound implications for competition, efficiency and innovation in markets where they arise… Once a network is well established, it can be extremely difficult to create a new network in direct competition.” A very high cost of initial capital investment, and a great deal of redundancy in services, “raises entry barriers” to value transmission markets which “in turn leads to higher customer charges and lower levels of service in these markets. It also affects the geographic distribution…” where densely populated areas may become over-served, sparse ones neglected.
Except at times of high congestion, there isn’t a per-access cost impact over and above the maintenance costs of the underlying network. Cost to install, fix and improve services may be significant; but the benefit of being able to join such a network, collectively amongst participants, far outweighs this cost. A per-access fee levied against the ultimate end user may hold back the generation of network effects and act against the economic interests of the whole network.
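The cost argument here can be made concrete with a toy model in Python. It is a sketch under stated assumptions, not anything taken from the Cruickshank report: a fixed build-and-maintenance cost shared among users, a marginal cost per extra user that defaults to zero below capacity, and the number of possible pairwise connections as a rough stand-in for the value of the network. All names and figures are hypothetical.

```python
def average_cost_per_user(fixed_cost, users, marginal_cost=0.0):
    """Fixed cost shared equally, plus any per-user marginal cost.

    Assumption: below capacity the marginal cost of one more user is
    close to zero, so the shared fixed cost dominates.
    """
    if users < 1:
        raise ValueError("a network needs at least one user")
    return fixed_cost / users + marginal_cost


def possible_connections(users):
    """Number of distinct pairs of users who could transact with each other."""
    return users * (users - 1) // 2


if __name__ == "__main__":
    fixed_cost = 1_000_000.0  # hypothetical build and maintenance cost
    for users in (10, 1_000, 100_000):
        print(users,
              average_cost_per_user(fixed_cost, users),
              possible_connections(users))
```

In this toy model the cost per user falls as the network grows while the number of possible connections rises much faster than the user count, which is why a per-access fee that discourages joining can work against the interests of the whole network.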
The Cruickshank report’s analysis suggests a license for entry into “money transmission” networks which reflects the risk involved in trusting other participants to behave consistently, and the high levels of value being committed. The current schemes run by the payments industry have a “mutual governance” model, where the underlying network is operated by a not-for-profit company co-owned by the participating “competing” companies.
Yet the industry associations have formed not-for-profit mutuals on the grounds that some aspects of their business are better run collaboratively. There are many situations beyond payments networks that look like this and which tend to involve the underlying transmission medium for moving things from one place to another. A network - a road network carrying a bus network, or a communications network with public terminals - becomes so widespread and the necessity of interchange with it so complete that the cost of replicating it - where that is physically possible - must tend to be prohibitively more than the cost of joining it.
In a few very congested areas, private toll roads may be viable, but even then following the topology of a main network. Planning and licensing restrictions in the dependencies additionally limit who can participate in building infrastructure.
What does this look like? Well, it looks a lot like another non-Internet network which the Internet increasingly depends on to be of commercial interest, the cellular network. To become a full member of the GSM alliance and therefore entitled to read, and use, the specifications for phone call data exchange, one needs to have a “license” for a slice of spectrum and at least a minimal physical infrastructure of cell towers. A moratorium on new phone mast installation means that new market entrants must sublicense from competitors; even a really significant capital investment cannot do enough. This starts to look like what economists have called a “two-sided” market, where services depend on platforms, and an effective monopoly on the latter allows an entity
I want to claim a strong argument that there is a whole class of enterprises in which competition at the infrastructure layer cannot produce a better result than cooperation, and is likely to produce a worse one. If there really is a class of works which are “natural cooperatives”, I want to find out more about how they are constituted and how their runnings are best expressed in rules. For want of a better word, all these enterprises are some kind of “transport” and that’s what I’m trying to get at in proposing an “Open Transport” session for next year’s Open Knowledge Foundation conference.
An Economist Writes: What’s the best structure in welfare terms for society: standardization (cooperatively via an open or semi-open standard), standardization via monopoly, or multi-platform/network competition? Each of these structures will have different static (how good is it right now) and dynamic (investment in quality and innovation for the future) effects. For example open standards might be great statically but take ages to hammer out (while everyone negotiates) while a proprietary standard might be bad statically but fast to do (and be of good quality — since there are fewer compromises).
Mutual governance may come in for criticism, but perhaps it is rather the governance of mutual governance that is the problem. A revision of the rules describing these sorts of systems ought to be workable; the creation of a status to which these networks can apply in order to get tax breaks etc. could achieve the same as any regulatory regime which aimed to increase transparency, lower costs and encourage innovation. Surely all participants in the building of transmission networks which move value around - in the form of water and waste, data and energy - must share these aims?

