You are browsing the archive for 2010 June.

Can You Close the Deficit Gap?

June 21, 2010 in News, Visualization, Where Does My Money Go

Where Does My Money Go? challenges you to beat the Chancellor to it before tomorrow’s budget and close the UK’s financial deficit. Will you increase taxes, make cuts or a mix of both? No decision is going to be popular, but are some more palatable than others? You decide.

More information:

The application was created by the Where Does My Money Go? team. It was researched by Lisa Evans and Tim Hubbard using many figures from the Institute for Fiscal Studies, and visualized by Rufus Pollock and Tim Hubbard using thejit and jQuery.

Open Geoprocessing Standards and Open Geospatial Data

June 21, 2010 in External, Open Data, Open Geodata, Open Standards, WG Open Geospatial Data

The following guest post is from Lance McKee, who is Senior Staff Writer at the Open Geospatial Consortium (OGC) and a member of the Open Knowledge Foundation’s Working Group on Open Geospatial Data.

As the founding outreach director for the Open Geospatial Consortium (OGC) and now as senior staff writer for the OGC, I have been promoting the OGC consensus process and consensus-derived geoprocessing interoperability standards for sixteen years.

From the time I first learned about geographic information systems in the mid-1980s, I have been fascinated by the vision of an ever-deepening accumulation of onion-like spatial data layers covering the Earth.

For those unfamiliar with geographic information systems (GIS): a “spatial data layer” is a digital map that can be processed with other maps of the same geographic area. With an elevation map and a road map, for example, you can derive a road slope map. Today, geospatial information has escaped the confines of the GIS to become a ubiquitous element of the world’s information infrastructure. This is largely a result of standards: Communication means transmitting or exchanging through a common system of symbols, signs, or behavior. Standardization means agreeing on a common system. OGC runs an open standardization process, and OGC standards enable communication between GISs, Earth imaging systems, navigation systems, map browsers, geolocated sensors, databases with address fields etc.
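As a rough illustration of the road slope example, here is a minimal Python sketch that combines two tiny layers covering the same area. The grid values, the cell size and the `road_slopes` helper are all invented for illustration; real GIS software and OGC services work with far richer data models than this.

```python
import math

# Two tiny "layers" over the same area (all values are made up):
# elevation in metres per grid cell, and a road as an ordered list of cells.
CELL_SIZE_M = 100.0  # assumed ground distance between adjacent cell centres

elevation = [
    [10.0, 12.0, 15.0],
    [11.0, 14.0, 18.0],
    [12.0, 19.0, 22.0],
]
road = [(0, 0), (0, 1), (1, 1), (2, 1)]  # (row, column) cells along the road


def road_slopes(elevation, road, cell_size):
    """Return the slope (rise over run) of each segment of the road."""
    slopes = []
    for (r1, c1), (r2, c2) in zip(road, road[1:]):
        rise = elevation[r2][c2] - elevation[r1][c1]
        run = cell_size * math.hypot(r2 - r1, c2 - c1)
        slopes.append(rise / run)
    return slopes


slopes = road_slopes(elevation, road, CELL_SIZE_M)
for segment, slope in zip(zip(road, road[1:]), slopes):
    print(segment, f"{slope:.1%}")
```

The point is only that the derived layer (slope along the road) exists because both inputs describe the same piece of ground.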

I was disappointed when I discovered that, in practice, despite extraordinary advances in technical capabilities for data sharing, much of the geospatial data created by scientists, perhaps most of it (other than data from civil agencies’ satellite-borne imaging systems), never becomes available to their colleagues. This lack of open access to geospatial data seems to me to be more tragic than the lack of open access to other kinds of scientific data, not only because humanity faces critical environmental challenges, but also because all geospatial data refer to the same Earth, and thus every new data layer is rich with possibilities for exploration of relationships to other data layers. I am, therefore, very glad that the Panton Principles have been published and a geospatial open access working group has been established.

In preparation for eventually writing an article on the subject of open access to geospatial data, working with a few OGC member representatives (special thanks to Simon Cox of CSIRO) and OGC staff, I collected a list of 17 reasons why scientists’ geospatial data ought to be published online, with metadata registered in a catalog, using OGC interoperability standards. (The 17 reasons are appended to this blog entry.)

In January I put these reasons into slides that I used in a talk at the Marsh Institute at Clark University in Worcester, Massachusetts. After briefly stating each reason, I explained how OGC standards and the progress of information technology make open access feasible. I provided evidence that the geosciences are rapidly moving in the direction of open access, and I offered ideas on how academics might contribute to and benefit from this progress.

I’m quite sure the Panton Principles are consistent with the goals of the geoscientists in the OGC. But I hasten to add that I am not speaking for them, and most of the 390+ OGC members are not geoscience organizations; most are technology providers, data providers and technology users with other roles in the geospatial technology ecosystem. But this diversity makes the OGC, I think, a particularly valuable “idea space” for academics who have an interest in open access to geospatial data and services. (Services are the future. A land use change model, for example, is a service when it is made available online “in the cloud” for others to use without downloading.)

One domain in the OGC that has value for open science is the work of the OGC Geo Rights Management Working Group (GeoRM WG). The Panton Principles discourage the use of licenses that limit commercial re-use or limit the production of derivative works, because the authors recognize the value of integrating and re-purposing datasets and enabling commercial activities that could be used to support data preservation. That’s important with respect to geospatial data, both because they are so often integrated and repurposed and because geospatial data sets are often complex and voluminous and thus potentially more expensive to curate than other kinds of data. The GeoRM WG has written a remarkable document, the GeoDRM Reference Model for use in developing standards for management of digital rights in the complex area of geospatial data and services. I think this will be a key resource as open access to geospatial data unfolds. The GeoDRM Reference Model provides a technical foundation necessary for implementing the Panton Principles.

Another valuable domain within the larger OGC idea space is the OGC Sensor Web Enablement (SWE) activity. Most geospatial data are collected by means of sensors, and thus it is important in the geosciences to have rigorous standard ways to describe sensors and sensor data in human-readable and machine-readable form. It is also important to have standard ways to schedule sensor tasks and aggregate sensor readings into data layers. Use of SWE standards is becoming important in some scientific areas such as ocean observation, hydrology and meteorology.

Both Web-resident sensors and data collections can be published and discovered by means of catalogs that implement the OGC Catalog Services – Web Interface Standard. This standard will likely become an integral infrastructure element for open access to geospatial data. It is designed to work with the ISO geospatial metadata standards, but those who begin implementing in this area discover that some work remains to make those standards more generally useful.

There are, in fact, many technical and institutional obstacles to overcome before science becomes as empowered by information technology as other estates such as business and entertainment. Technical interoperability obstacles are being overcome in the OGC by groups working in technology domains such as geosemantics, workflow, grid computing, data quality and oblique imagery; and in application domains such as hydrology, meteorology and Earth system science. Overcoming technical obstacles often precedes the obsolescence of institutional policies that stand as obstacles to progress.

I recently read Richard Ogle’s “Smart World,” a book about the new science of networks. In network terms, the OGC is a “hub” in an “open dynamic network”. What were once weak links between the OGC and other hubs such as the World Meteorological Organization and the International Environmental Modeling & Software Society (iEMSs) have been strengthened, and these stronger links make both the OGC and its partner hubs more likely to form new connections with other hubs. Hubs that directly contribute to digital connectivity, as the OGC does, have a special “pizzazz,” I would say. (I haven’t yet mastered the network science vocabulary). It seems to me the Open Knowledge Foundation and the Science Commons are hubs or idea spaces with a bright future of rich connections, and I look forward to seeing what connections they form with the OGC.

17 Reasons why scientific geospatial data should be published online using OGC standard interfaces and ISO standard metadata

Reason 1: Data transparency

Science demands transparency regarding data collection methods, data semantics, and processing methods. Rigor, documented!

Reason 2: Verifiability

Science demands verifiability. Any competent person should be able to examine a researcher’s data to see if those data support the researcher’s conclusions.

Reason 3: Useful unification of observations

Being able to characterize, in a standardized human-readable and machine-readable way, the parameters of sensors, sensor systems and sensor-integrated processing chains (including human interventions) enables useful unification of many kinds of observations, including those that yield a term rather than a number.

(From Simon Cox, JRC Europe and CSIRO Australia, editor of ISO 19156 (Observations and Measurements), coordinator of One-Geology geoinformatics, a designer of GeoSciML, and chair of the OGC Naming Authority.)

Reason 4: Data Sharing & Cross-Disciplinary Studies

Diverse data sets with well documented data models can be shared among diverse information communities*. Cross-disciplinary data sharing provides improved opportunities for cross-disciplinary studies.

* OGC defines an information community as a group of people (such as a discipline or profession) who share a common geospatial feature data dictionary, including definitions of feature relationships, and a common metadata schema.

Reason 5: Longitudinal studies

Archiving, publishing and preserving well-documented data yields improved opportunities for longitudinal studies. As data formats, data structures, and data models evolve, scientists will need to access historical data and understand the assumptions so that meaningful scientific comparisons can be conducted. Community standards will help ensure long-term consistency of data representation.

Reason 6: Re-use

Open data enables scientists to re-use or repurpose data for new investigations, reducing redundant data collection and enabling more science to be done.

Reason 7: Planning

Open data policies enable collaborative planning of data collection and publishing efforts to serve multiple defined and yet-to-be-defined uses.

Reason 8: Return on investment

With open data policies, institutions and society overall will see greater return on their investment in research.

Reason 9: Due diligence

Open data policies will help research funding institutions perform due diligence and policy development.

Reason 10: Maximizing value

The value of data increases with the number of potential users*. This benefits science in a general way. It also creates opportunities for businesses that will collect, curate (document, archive, host, catalog, publish), and add value to data.

* Similar to Metcalfe’s law: “The value of a telecommunications network is proportional to the square of the number of connected users of the system.”

Reason 11: Data Discoverability

Open data is discoverable data. Data are not efficiently discovered through literature searches. Searches of data registered using ISO-standard XML-encoded metadata can be efficient and fine-grained.

Reason 12: Data Exploration

Robust data descriptions and quick access to data will enable more frequent and rapid exploration of data – “natural experiments” (http://en.wikipedia.org/wiki/Natural_experiment) – to explore hypothetical spatial relationships and to discover unexpected spatial relationships.

Reason 13: Data Fusion

Open data improves the ability to “fuse” in-situ measurements with data from scanning sensors. This bridges the divide between communities using unmediated raw spatial-temporal data and communities using spatial-temporal data that is the result of a complex processing chain.

(From Simon Cox)

Reason 14: Service chaining

Open data (and open online processing services) will improve scientists’ ability to “chain” Web services for data reduction, analysis and modeling.

Reason 15: Pace of science

Open data enables an accelerated pace of scientific discovery, as automation and improved institutional arrangements give researchers more time for field work, study and communication.

“Changes to the Earth that used to take 10,000 years now take three, one reason we need real-time science. … Governances must be able to see and act upon key intervention points.” Brian Walker, Program Director Resilience Alliance and a scientist with the CSIRO, Australia

Reason 16: Citizen science & PR

Open science will help Science win the hearts and minds of the non-scientific public, because it will make science more believable and it will help engage amateur scientists – citizen scientists – who contribute to science and help promote science. It will also increase the quality and quantity of amateur scientists’ contributions.

Reason 17: Forward compatibility

Open Science improves the ability to adopt and utilize new/better data storage, format, discovery, and transmission technologies as they become available.

(Offered to OGC’s David Arctur for this list on 6 January 2010 by Sharon LeDuc, Chief of Staff, NOAA’s National Climatic Data Center, Asheville, North Carolina, USA.)

(Another reason – cross-checking for sensor accuracy — occurred to me while writing this post.)

Consuming the Transport for London Data

June 17, 2010 in External, Open Government Data, WG EU Open Data, WG Open Government Data, Working Groups

The following guest post is from Julian Todd, who works on projects such as Public Whip, UNdemocracy, and ScraperWiki. He is also a member of the Open Knowledge Foundation’s Working Group on Open Government Data. The post was originally published on Julian’s blog, Freesteel.

Yesterday Transport for London made a data dump of various locations and links to their traffic cameras, station locations, and so on.

A quick and effective use of some of the data is CheckMyRoute by Stefan Wehrmeyer that shows you all the CCTV traffic-cams on the route between two points in London.

This makes use of Google Maps’ route-finding function to thin out the awesome overload of camera locations you would otherwise see if you plotted them all at once.

It’s an attractive application because it’s an end product, rather than a stepping stone to the big solution of getting all data structured all ways so it can be used everywhere all the time for everything.

Over on ScraperWiki I’m working towards this big solution by parsing the self-service cycle hire locations. As a byproduct, the data allowed me to plot the following attractive map.


This map is not an end in itself. It’s just to prove I have the data. The pins are coloured according to whether the hire locations have under 20, between 20 and 30, or more than 30 listed under “Capacity”.
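The colouring rule is simple enough to sketch in a few lines of Python. This is only an illustration of the three bands described above: the function name and the colour names are assumptions, not the ones used for the actual map.

```python
def pin_colour(capacity):
    """Pick a pin colour from a station's listed capacity.

    Bands follow the text: under 20, between 20 and 30, more than 30.
    The colour names themselves are placeholders.
    """
    if capacity < 20:
        return "green"
    if capacity <= 30:
        return "orange"
    return "red"


for capacity in (15, 20, 30, 32):
    print(capacity, pin_colour(capacity))
```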

I believe these are the docking stations of London’s new exciting self-service public cycle hire scheme.

Let’s discuss the data and the code for parsing it.

What we have are about 400 comma separated value (CSV) lines with the following fields:

‘Name’, ‘Postcode_District’, ‘TfL_Ref’, ‘Capacity’, ‘Lat’, ‘Long’, ‘Easting’, ‘Northing’

Here are a couple of rows:

Embankment (Horse Guards),SW1,01/610104,32,51.50494561,-0.123247648,530350.1,180121.23
Vauxhall Bridge?,SW1,01/610109,35,51.48836528,-0.129361842,529972.93,178266.55

As you can see, there is redundancy. We can assume that ‘Lat’ and ‘Long’ are in WGS84 coordinates, because that’s what Google Maps takes and most GPS devices deliver, even though coordinate schemes and datum shifts are an extremely complicated issue.

Because we are in Britain, the ‘Easting’ and ‘Northing’ must be in the British national grid reference system, which is the grid we do our maps in.

This is a useful grid, because it’s flat and written in metres. You can tell that the Horse Guards is about 2km north of Vauxhall Bridge, which is very useful if you’re making maps with a ruler on paper, as most of the time they were.
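Because the grid is flat and in metres, the ruler-on-paper calculation is plain subtraction and Pythagoras. A minimal sketch using the Easting and Northing values from the two sample rows (the `grid_offset` helper is made up for this example):

```python
import math

# (Easting, Northing) in metres, copied from the two sample rows.
horse_guards = (530350.1, 180121.23)
vauxhall_bridge = (529972.93, 178266.55)


def grid_offset(a, b):
    """Return (east, north, straight-line) offsets in metres from b to a."""
    east = a[0] - b[0]
    north = a[1] - b[1]
    return east, north, math.hypot(east, north)


east, north, dist = grid_offset(horse_guards, vauxhall_bridge)
print(f"{north:.0f} m north, {east:.0f} m east, {dist:.0f} m apart")
```

The northing difference comes out at roughly 1.85 km, which matches the "about 2km" estimate. The same shortcut does not work on the ‘Lat’ and ‘Long’ columns, since degrees are not a fixed length on the ground.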

The ‘Lat’ and ‘Long’ values, on the other hand, are magic numbers that require a computer that understands ellipsoidal geometry and transverse mercators to use. There is no perfect conversion from one pair of numbers to the other unless you also know the altitudes, because each system has its own idea of the down vector.

You don’t need to be interested in this stuff (I’m not particularly), but it is a very good idea to develop an appreciation for where the hard problems are, so you can avoid them rather than walk straight into them with your eyes closed.

This redundancy in the dataset shows that the person who created it has a similar appreciation for the difficulties.

The Postcode_district field is obviously redundant.

The Name field occasionally contains an inexplicable “xa0” character that broke our string handling routines, so it had to be substituted out, sometimes for a space, and sometimes for nothing. I have no idea what it’s doing there.
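For reference, "\xa0" is how Python displays the non-breaking space character (U+00A0), which often slips in when text is copied from web pages or spreadsheets. Below is a minimal sketch of parsing a row and tidying the name. It is not the actual ScraperWiki scraper: the helper names are invented, and the sample row has a "\xa0" appended to the name purely to mimic the problem.

```python
import csv
import io

FIELDS = ["Name", "Postcode_District", "TfL_Ref", "Capacity",
          "Lat", "Long", "Easting", "Northing"]

# One of the sample rows, with a "\xa0" added to the name for illustration.
raw = ("Vauxhall Bridge\xa0,SW1,01/610109,35,"
       "51.48836528,-0.129361842,529972.93,178266.55\n")


def clean_name(name):
    """Turn non-breaking spaces into ordinary ones, then tidy the whitespace.

    A stray "\xa0" inside the name becomes a single space; one at either
    end disappears, which covers both substitutions described above.
    """
    return " ".join(name.replace("\xa0", " ").split())


def parse_rows(text):
    """Yield one dict per CSV line, with the numeric fields converted."""
    for row in csv.DictReader(io.StringIO(text), fieldnames=FIELDS):
        row["Name"] = clean_name(row["Name"])
        row["Capacity"] = int(row["Capacity"])
        for key in ("Lat", "Long", "Easting", "Northing"):
            row[key] = float(row[key])
        yield row


stations = list(parse_rows(raw))
print(stations[0]["Name"], stations[0]["Capacity"])
```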

The TfL_Ref is the unique identifier. Unique identifiers are so handy that most datasets have one (eg invoice numbers, document codes). Unfortunately this one has a ‘/’ in it which means you’ll have difficulty if you try to use it as part of a URL. In other datasets (eg my undemocracy.com) I tried substituting every ‘/’ for a ‘-’, and then found that ‘-’ characters were pretty common too, so I couldn’t escape back.
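One reversible alternative to swapping characters is percent-encoding, which Python's standard library provides. A small sketch follows; the URL layout is hypothetical, and whether a given web server passes an encoded slash through untouched is a separate question that would need checking.

```python
from urllib.parse import quote, unquote

tfl_ref = "01/610104"

# safe="" makes quote() escape "/" as well; by default it is left alone.
encoded = quote(tfl_ref, safe="")
url = "http://example.org/cycle-hire/" + encoded  # hypothetical URL layout

print(encoded)
print(unquote(encoded) == tfl_ref)  # the original identifier is recoverable
```

Unlike the ‘/’ to ‘-’ substitution, this round-trips cleanly even when the identifier already contains ‘-’ characters.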

As I said, I have coloured the map symbols according to capacity. It would be better to make the pins larger or smaller according to the capacity, but I wanted to use that little bike symbol, and the google chart API does not allow me to vary the size.

What’s next?

Well, obviously if there are public CCTV cameras looking at the cycle racks, I ought to be able to merge the datasets so I could check whether there were any bikes at a location before I walked there.

Maybe a universal travel planner that had all the bus and underground timetables and routes could offer a cycle journey across town to my friend’s office as an alternative option. Perhaps the computer could plot the hypothetical route for me, and compare it to the Annual Average Daily Traffic Flows at various junctions and decide, “No, maybe it’s not a good idea as you actually won’t enjoy this particular journey and all the trucks at this time of day.”

And don’t forget the datasets of accident and crime statistics that must be kicking around somewhere. We know that cycle fatalities are reported in far greater detail than the average run-of-the-mill car crashes or bus muggings in the kinds of papers your mother reads, so it’s important to obtain the actual live numbers to argue the case of what is safe.

Integrate that with having a lesser mortality due to actually getting some exercise for a change (the facts are around somewhere) matched to your actual age and demographic (if you’re 98 like my grandfather, I will concede that you will live longer if you take the bus), and we’ll never need to think for ourselves again.

Aside from having to answer the question at the top: “Where do you want to go?”

ScraperWiki is ready for business if you know any other datasets you would like to draw in to the common pool and are willing to code. Soon it will have PHP support (not just for Python).

Oh, also, you mustn’t forget to declare every dataset onto CKAN as I have done with this cycle information so that more people will be able to find it again.

Once a critical mass of connected and consistent information develops, the bigger projects become possible.

by lisa

Understanding COINS

June 17, 2010 in OKF Projects, Open Data, Open Government Data, Uncategorized, Visualization, Where Does My Money Go

Something amazing has happened since the government spending recorded in the COINS database was made openly available to everyone. I’m talking about the impressive range of free, and in many cases open source, products to display the COINS data.

So far there are COINS search engines from The Guardian and The Open Knowledge Foundation and graphs from Rapid Gate Way and Alpine Interactive, and bloggers like Martin Budden have been powering away on their own projects to describe the COINS data. What a triumph for publishing government data. It beats the alternative of using public funds to pay for these tools when the skills and enthusiasm are clearly out there in the community.

That’s not to say that the products to display the data are complete right now, or that we have understood the COINS data completely. We had a few clues about the structure of the data from previous research, but there is no substitute for having the data itself, and we are still building up our knowledge. But given it’s been just over a week since we first laid eyes on the data, I think it’s fair to say that we are making good progress by most IT project standards.

In this post I want to address two questions that drive our thinking at the Open Knowledge Foundation, since the COINS publication. They are: ‘what’s important in COINS?’ and ‘how do we get meaningful results out of it?’

It has taken some discussion with the exceptionally helpful staff at HM Treasury, and reading the COINS Guidance (PDF) and other related materials that make more sense now we can see the data, but finally I feel we have more accurate answers to both of these questions.

What’s important in COINS?

The COINS Guidance lists every field in the version of COINS that was released. One of the big challenges with a big complicated data set, like COINS, is knowing which of these fields are important.

To determine this I’ve spoken with the Treasury team about the fields they consider most useful, and the combination of fields they use most frequently.

The answers I got focused mainly on the central government spending and income data.

The spending and income are described for each central government department, which you can see in the ‘Department description’ field. Each department has a number of programmes that will either require or generate money. The department’s programmes are in the ‘programmes object group description’ part of COINS, more detail is in the ‘programme objects description’, and yet more detail is in the ‘account codes’, which are all listed in Annex B.

The ‘Value’ field gives the actual spending or income in thousands of pounds. If the number is positive it refers to the department’s spending; if negative it refers to the department’s income. It should also be possible to check whether the amount is spending or income from the ‘account code’.
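In code, the sign and units rule could be sketched like this. The function name is invented for illustration, and how a zero should be labelled is my own assumption, since the guidance described here does not say.

```python
def describe_value(value_thousands):
    """Interpret a COINS 'Value' entry, which is in thousands of pounds.

    Positive means spending and negative means income, as described above.
    Labelling zero as "zero" is an assumption made for this sketch.
    """
    pounds = abs(value_thousands) * 1000
    if value_thousands > 0:
        return "spending", pounds
    if value_thousands < 0:
        return "income", pounds
    return "zero", pounds


print(describe_value(120.5))
print(describe_value(-40))
```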

In addition to the spending programme and ‘account code’ information, there are two further categories in COINS that describe the data very usefully, those are:

  • the ‘budget boundary’. There are three choices for ‘budget boundary’: 1) DEL, which stands for Departmental Expenditure Limits. These are items that have been budgeted for 3 years; it is estimated that DEL makes up about 80% of the items in COINS. 2) AME, which stands for Annually Managed Expenditure. These are the budget items that are difficult to predict accurately, and the risk for these is taken by the Exchequer as a whole. We are ignoring everything in AME where the ‘Programme/admin’ is not set to ‘Other’. 3) ‘not DEL/AME’, which is budgeting for arm’s length bodies. We are not too concerned about these budget items.
  • the ‘resource capital’. There are two options for ‘resource capital’, both of which are useful: 1) ‘capital’, which is investment and capital assets. 2) ‘resource’, which includes all wages, salaries and operating costs.

There are some parts of COINS that we are less concerned with at the moment.

Other than the expenditure and income data, there are plans and estimates in COINS. You can see plans and estimates that should roughly correspond to the supplementary budget information and the supply estimates, respectively. We have been less concerned with plans and estimates as, by their nature, they will be less detailed than the outturn.

There is a CPID code in COINS which is there for a special project within the Treasury called the Whole of Government Accounts (WGA). This project will ensure that there is no double counting of the money when a transaction occurs between government departments. As I understand it, if body A gives money to body B then WGA would be responsible for subtracting the amount body B received from body A’s total. There are scripts in COINS to ‘best guess’ these subtractions using the CPID code, along with the WGA staff performing lots of checks too, but once this matching has been successful the CPID code is largely redundant.

The Whole of Government Accounts also collects information about spending by local authorities and records this spending in COINS, but this is not in a publishable state. However it is possible to view central government grants for local authorities with the field called ‘Local Government Use only’.

How do I get meaningful results out of COINS?

On the advice of the Treasury guidance we are focusing on the Fact Table more than the Adjustment Table in COINS. In the Fact Table, actual spending and income are identified by ‘Data_type’ being set to ‘Outturn’ and ‘Data_subtype’ being set to ‘approved’ or ‘submitted_outturn’ (both of these conditions are required).

In addition we can set Budget_Boundary to DEL or, if we require the shorter-term budget spending, to AME with programme/admin set to ‘Other’.

For the 2009-2010 COINS data we can also set Resource_capital2 to Resource (on 2010-11 budgeting basis).

With the COINS data defined this way it is then possible to look at the spending programmes and associated account codes, certain that the results are actual spending and actual income for the time frame, rather than estimated or planned spending or income.
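Put together, the selection rules might look something like the Python sketch below. Treat it as an illustration under assumptions: the exact column headers (in particular `Programme_admin`), the spelling of the field values and every row in `SAMPLE` are invented, and the optional Resource_capital2 condition is left out for brevity.

```python
import csv
import io

# Invented sample rows; the real COINS export has many more columns.
SAMPLE = """Data_type,Data_subtype,Budget_Boundary,Programme_admin,Value
Outturn,approved,DEL,Programme,120.5
Plans,approved,DEL,Programme,130.0
Outturn,submitted_outturn,AME,Other,-40.0
Outturn,approved,AME,Programme,15.0
Outturn,approved,Not DEL/AME,Other,7.0
"""


def is_actual(row):
    """Apply the rules above: outturn data, in DEL or in AME marked 'Other'."""
    if row["Data_type"] != "Outturn":
        return False
    if row["Data_subtype"] not in ("approved", "submitted_outturn"):
        return False
    if row["Budget_Boundary"] == "DEL":
        return True
    return row["Budget_Boundary"] == "AME" and row["Programme_admin"] == "Other"


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
actual_values = [float(row["Value"]) for row in rows if is_actual(row)]
print(actual_values)
```

Only the first and third sample rows pass: the second is a plan rather than outturn, the fourth is AME without ‘Other’, and the last falls outside DEL and AME.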

It is wonderful that the publication of COINS has brought so much innovation in the open software community. It will be even more wonderful if we can continue to develop to make public spending data easier to understand, particularly when so many important decisions are being made that will affect our lives.

Open Correspondence

June 16, 2010 in External, Free Culture, OKF Projects, Public Domain, Texts, WG Humanities

The following guest post is from Iain Emsley, who is a member of the Open Knowledge Foundation Working Group on Open Resources in the Humanities, and a contributor to the Open Shakespeare and Open Milton projects.

Using the social graph, one can find the connections between seemingly disparate groups of people on different services. Most of the projects in the area are focussed on social media, such as Facebook, Twitter and so on. There is, however, a layer of social information that was created before this. Letters were, and still are, used as a method of communication. To some extent it is the Internet before the technology became available. There is a host of data that is shared in each missive. For example, the author and their correspondent. That is only the tip of the metadata though:

  • What are they writing about?
  • Whom are they writing about?
  • When was the letter written?
  • Where was it written?

The Open Letters project grew out of some musings when working on the timeline for the Open Milton website. I could see the links between the texts and some of the events but I was curious about how things linked together. Neither texts nor authors exist in a vacuum. Authors write to other people – agents, authors, casual acquaintances, friends and family – and they write about books. Sometimes they write about books that they have read, sometimes about what they are writing.

From these we can infer which books or authors influenced the author, or were being influenced by the author, at the time. From this, we can see the growth of the social graph into the cultural graph. Essentially it is the same notion as the social graph but the cultural graph links items like books, poems and events together. In itself it means nothing, but linked to the social graph it allows the user to discover who was being written to whilst a book was being written. Is the author talking to other authors or only to his agent about it?

Charles Dickens was a prolific letter writer, which is why he was chosen as the first author for the project. From his own letters, we can see him writing to authors, such as George Eliot or Wilkie Collins, to scientists like Charles Babbage, inventor of the Difference Engine, and to his agent about his works in progress. His letters shed some light on the nineteenth century literary world and also contextualise it within the wider world. His wide range of writing gave me a chance to cast the widest net possible and set up as many nodes on the graph as possible.

A brief peek at the correspondents to whom Dickens was writing about the Pickwick Papers, Dickens’s first novel, suggests that it is more than just a book: it is an item of conversation, which is revealed through his letters about the book. He managed to offend Mr David Dickson, a reader, with a passage in the novel, though he invited W C Macready to a dinner to celebrate its publication. Later in his life, he wrote to Wilkie Collins, the author, complaining that “I have never seen anything about myself in print which has much correctness in it–any biographical account of myself I mean”. The set of letters sheds a little light on the public and private worlds of Dickens, from his mortification at offending a reader to complaining about his own portrayal. He comes alive as a person rather than just an author, as does his social graph, and the relationships with his correspondents are illuminated by the way that he addresses them with varying degrees of formality.
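To make the two graphs concrete, here is a small Python sketch built only from the three letters mentioned above. The record layout and variable names are invented for illustration; the project's real data model is the RDF vocabulary, not these dictionaries.

```python
from collections import Counter, defaultdict

# One record per letter, using only details given in the paragraph above.
letters = [
    {"author": "Charles Dickens", "correspondent": "David Dickson",
     "mentions": ["The Pickwick Papers"]},
    {"author": "Charles Dickens", "correspondent": "W C Macready",
     "mentions": ["The Pickwick Papers"]},
    {"author": "Charles Dickens", "correspondent": "Wilkie Collins",
     "mentions": []},
]

# Social graph: how many times each author wrote to each correspondent.
times_written = Counter(
    (letter["author"], letter["correspondent"]) for letter in letters
)

# Cultural graph linked to the social one: who was written to about each work.
discussed_with = defaultdict(set)
for letter in letters:
    for work in letter["mentions"]:
        discussed_with[work].add(letter["correspondent"])

print(times_written.most_common())
print(sorted(discussed_with["The Pickwick Papers"]))
```

Adding a date and a place to each record would answer the "when" and "where" questions in the same way.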

Now that the site is set up, the next step is to complete the set of Dickens letters, edited and published by his daughters, from the Project Gutenberg texts. The next major step is to try to collect the letters of his correspondents and, from them, the new correspondent nodes. As well as HTML representations of the letters, the project uses RDF, reusing Dublin Core and Friend of a Friend (FOAF) with its own extensions for the collection of letters, called letter. Rufus Pollock has already created a graph that visualises the relationships between authors, the time of being written to and the number of times they were written to, and timelines for the letters are being developed.

There are, of course, more things that I would like to do, but the one major task is building the collections of letters under open licenses. The project can be contacted through the open-literature mailing list if you would like to find out more or to contribute.

Learning from Libraries: The Literacy Challenge of Open Data

June 15, 2010 in CKAN, External, Open Data, Open Government Data, WG Open Government Data, Working Groups

The following guest post is from David Eaves who is the founder of datadotgc.ca, an open data portal powered by our CKAN software that crowdsources the location of open data sets in Canada (Canada has no equivalent of data.gov or data.gov.uk). David is also a member of the OKF’s Working Group on Open Government Data. The post originally appeared on eaves.ca.

We didn’t build libraries for a literate citizenry. We built libraries to help citizens become literate. Today we build open data portals not because we have public policy literate citizens, we build them so that citizens may become literate in public policy.

In a brilliant article on The Guardian website, Charles Arthur argued that a global flood of government data is being opened up to the public (sadly, not in Canada) and that we are going to need an army of people to make it understandable.

I agree. We need a data-literate citizenry, not just a small elite of hackers and policy wonks. And the best way to cultivate that broad-based literacy is not to release in small or measured quantities, but to flood us with data. To provide thousands of niches that will interest people in learning, playing and working with open data. But more than this we also need to think about cultivating communities where citizens can exchange ideas as well as involve educators to help provide support and increase people’s ability to move up the learning curve.

Interestingly, this is not new territory. We have a model for how to make this happen – one from which we can draw lessons or foresee problems. What model? Consider a process similar in scale and scope that happened just over a century ago: the library revolution.

In the late 19th and early 20th century, governments and philanthropists across the western world suddenly became obsessed with building libraries – lots of them. Everything from large ones like the New York Main Library to small ones like the thousands of tiny, one-room county libraries that dot the countryside. Big or small, these institutions quickly became treasured and important parts of any city or town. At the core of this project was the belief that literate citizens would be both more productive and more effective citizens.

But like open data, this project was not without controversy. It is worth noting that at the time some people argued libraries were dangerous. Libraries could spread subversive ideas – especially about sexuality and politics – and that giving citizens access to knowledge out of context would render them dangerous to themselves and society at large. Remember, ideas are a dangerous thing. And libraries are full of them.

Cora McAndrews Moellendick, a Masters of Library Studies student who draws on the work of Geller, sums up the challenge beautifully:

…for a period of time, censorship was a key responsibility of the librarian, along with trying to persuade the public that reading was not frivolous or harmful… many were concerned that this money could have been used elsewhere to better serve people. Lord Rodenberry claimed that “reading would destroy independent thinking.” Librarians were also coming under attack because they could not prove that libraries were having any impact on reducing crime, improving happiness, or assisting economic growth, areas of keen importance during this period… (Geller, 1984)

Today when I talk to public servants, think tank leaders and others, most grasp the benefit of “open data” – of having the government sharing the data it collects. A few, however, talk about the problem of just handing data over to the public. Some question whether the activity is “frivolous or harmful.” They ask “what will people do with the data?” “They might misunderstand it” or “They might misuse it.” Ultimately they argue we can only release this data “in context”. Data, after all, is a dangerous thing. And governments produce a lot of it.

As in the 19th century, these arguments must not prevail. Indeed, we must do the exact opposite. Charges of “frivolousness” or a desire to ensure data is only released “in context” are code to obstruct or shape data portals to ensure that they only support what public institutions or politicians deem “acceptable”. Again, we need a flood of data, not only because it is good for democracy and government, but because it increases the likelihood of more people taking interest and becoming literate.

It is worth remembering: We didn’t build libraries for an already literate citizenry. We built libraries to help citizens become literate. Today we build open data portals not because we have a data or public policy literate citizenry, we build them so that citizens may become literate in data, visualization, coding and public policy.

This is why coders in cities like Vancouver and Ottawa come together for open data hackathons, to share ideas and skills on how to use and engage with open data.

But smart governments should not only rely on small groups of developers to make use of open data. Forward-looking governments – those that want an engaged citizenry, a 21st-century workforce and a creative, knowledge-based economy in their jurisdiction – will reach out to universities, colleges and schools and encourage them to get their students using, visualizing, writing about and generally engaging with open data. Not only to help others understand its significance, but to foster a sense of empowerment and sense of opportunity among a generation that could create the public policy hacks that will save lives, make public resources more efficient and effective and make communities more livable and fun. The recent paper published by the University of British Columbia students who used open data to analyze graffiti trends in Vancouver is a perfect early example of this phenomenon.

When we think of libraries, we often just think of a building with books. But 19th century libraries mattered not only because they had books, but because they offered literacy programs, book clubs, and other resources to help citizens become literate and thus, more engaged and productive. Open data catalogs need to learn the same lesson. While they won’t require the same centralized and costly approach as the 19th century, governments that help foster communities around open data, that encourage their school system to use it as a basis for teaching, and then support their citizens’ efforts to write and suggest their own public policy ideas will, I suspect, benefit from happier and more engaged citizens, along with better services and stronger economies.

So what is your government/university/community doing to create its citizen army of open data analysts?

Launch of it.ckan.net for open data in Italy!

June 14, 2010 in CKAN, External, OKF, OKF Projects, Open Data, Open Government Data, Releases, WG EU Open Data, WG Open Government Data, Working Groups

The following guest post is by Stefano Costa and Federico Morando. Stefano Costa is a researcher at the University of Siena and Coordinator of the OKF’s Working Group on Open Data in Archaeology. Federico Morando is Managing Director & Research Fellow at the NEXA Center for Internet & Society and a member of the Working Group on EU Open Data.

We are delighted to announce that an Italian instance of CKAN is now live! You can see this at:

There are currently 67 packages available — thanks to the Extracting Value from Public Sector Information (EVPSI) project. In particular, the NEXA Center contributed material generated as part of the EVPSI project, which is funded by the Piedmont Region and coordinated by the University of Turin.

The site was launched on Sunday by OKF Director Rufus Pollock and NEXA Center co-director Juan Carlos De Martin at the 2010 Festival of Economics in Trento and is a collaboration between the Open Knowledge Foundation, the EVPSI project and the NEXA Center for Internet & Society.

The datasets that are currently available on the Italian instance of CKAN come from a first mapping of some of the main silos of public sector information (PSI) in Italy. Many more packages will be provided soon by EVPSI and the NEXA Center, as a product of a much more detailed mapping of PSI holding entities in the Italian Region of Piedmont.

Open data in Italy

Is Italy behind other countries with respect to open data? Judging from the data of the EVPSI project (and from the infringement procedure that the EU started against Italy), the answer to this question is ‘yes’, but things are changing. The Italian CKAN will hopefully help accelerate this change – providing a way for open data users and distributors to find datasets and see whether or not they can reuse them!

The new datasets on it.ckan.net include many which aren’t open, to help people get a ‘big picture’ about what datasets are out there, who holds them, how to download them and how open they are.

There are several bodies that produce data for their own institutional purposes, but most of the databases with clear commercial interest are only available by paying. And even when data are made available on the web they are distributed under restrictive terms of use, or under unclear or no terms of use at all. Given the default status of material potentially protected by copyright and/or database right (i.e. “All rights reserved”), this implicitly means that no re-use is possible. This attitude is caused by a combination of factors, including:

  • lack of knowledge about the open data initiative and the benefits of open data for citizens and society at large
  • complex sub-licensing of datasets among many different public and private bodies, so that nobody can be considered the actual owner of data
  • a general fear of situations implying a loss of control over the re-use of data (coupled with a lack of internal guidelines about the access and re-use of data)
  • a difficult financial situation of PSI holders, pushing them to maximize their short run monetary income, without appropriately taking into account positive spillovers for the rest of society and in the medium/long run

For example ISTAT, the national institute of statistics, put their data online for free use, but unfortunately commercial reuse is not allowed – which may inhibit the development of innovative applications and services. See an overview of ISTAT datasets at CKAN.

A notable exception to this mindset is Regione Piemonte, which has recently launched a portal for open data at:

That result has been facilitated by the existence of common regional guidelines about the re-use of public data. What is more, all their currently available data are released under the CC0 license, enabling unrestricted re-use and dissemination by anyone, even for commercial purposes.

There are other regional governments offering some of their data (for example geospatial data) for free, but Piemonte is the only one explicitly adopting an open license. In all other cases, one has to ask for each case, and usually the answer is “free for non-commercial use” only.

The key point is that national and regional governments own large datasets that could quite easily be made available to the public. This process would however require 3 distinct actors, as outlined in the Open Data study by Becky Hogge:

  • government heads
  • civil servants (acting as the “middle layer”)
  • a small but determined group of citizens (or “civic hackers”)

Minister Brunetta promised “data.gov.it” in 6 months, but in the meantime we would like to get a more detailed picture of how open Italian public information is. In particular it will be interesting to see if any local authorities besides Regione Piemonte will consider following in the footsteps of many other local and national bodies around the world – and open up their data!

Interested in starting a new CKAN instance in your country?

If you’re interested in starting a new instance of CKAN for open data in your country, the Open Knowledge Foundation would be delighted to help! If you are able to help coordinate the translation and liaise with other local folks interested in open data — we can set up, host, and maintain the instance on our servers. Just pop us a line on the ckan-discuss list:

by jwalsh

Dig the new breed, Part III – wrapping it all up

June 11, 2010 in External, Ideas and musings, Uncategorized, WG Archaeology

This is the third in the amazing series of guest blogs from Ant Beck on the impact of linked open data for archaeology.

Part 1: New approaches to archaeological data analysis, as seen in the DART and STAR projects

Part 2: Considering the ethics of sharing archaeological knowledge

OK, to recap we have:

  • A scientific movement that advocates open approaches to data, theory and practice
  • Emerging foundational interoperability using semantic web technology
  • The potential to remove a barrier and facilitate the submission of primary data

These three powerful factors could prove to be highly disruptive. In combination they have the potential to turn archaeological data and data repositories from static siloed islands (containing data that is increasingly stale) into an interlinked network of data nodes that reflect changes dynamically.

The linchpin is the use of triplestores (RDF databases) that provide persistent identifiers. Persistent identifiers allow us to refer to a digital object (a statement, a file or set of files) in perpetuity, even if the underlying storage location moves. This means links between objects are persistent: therefore, when an observation or interpretation changes, its effects are propagated through to all the data/events that link to it. I see organisations such as the ADS, Talis (an innovative semantic web technology provider whose Talis Platform includes a free RDF hosting service for open data) and national heritage bodies providing such services.

Some open science projects are likely to adopt RDF as their de-facto data sharing format. RDF triples (subject, predicate, object) provide a schema transparent mechanism for data storage. They are not ideal for all data types (raster data structures for example) but when used with Ontology and SKOS, as demonstrated by STAR, they are powerful analytical, search and inference tools.

So, what is the importance of storing heritage data in RDF? Well, it depends which point of view you take. From a data management perspective there is no longer any need to migrate data formats. However, to facilitate re-use, different “views” of the RDF model can be generated and incorporated into traditional analytical software, such as GIS. Importantly, analysis stops being a “knowledge backwater”: new knowledge can be appended back into the triplestore.

Linked Data concepts in archaeology

From a data curation, re-use and analysis perspective the quality of the data has the potential to be dramatically improved. Deposition is no longer the final act of the excavation process: rather it is where the dataset can be integrated with other digital resources and analysed as part of the complex tapestry of heritage data. The data does not have to go stale: as the source data is re-interpreted and interpretation frameworks change these are dynamically linked through to the archives, hence, the data sets retain their integrity in light of changes in the surrounding and supporting knowledge system.

An example is probably useful at this juncture. In addition to many other things, pottery provides essential dating evidence for archaeological contexts. However, pottery sequences are developed on a local basis by individuals with imperfect knowledge of the global situation. This means there is overlap, duplication and conflict between different pottery sequences, which are periodically reconciled (your Type IIb sherd is the same as my Type IVd sherd and we can refine the dating range… Hurrah… now let’s have another beer). This is the perennial process of lumping and splitting inherent in any classification system. Updated classifications and probable dates allow us to re-examine our existing classifications. One can reason over the data to find out which contexts, relationships and groups are impacted by a change in the dating sequences, either by proxy or by logical inference (a change in the date of a context produces a logical inconsistency with a stratigraphically related group).

While we’re on the topic of stratigraphy, an area of notorious tedium and poor quality data (often with conflicting relationships), RDF allows rapid logical consistency checking, as stratigraphic relationships are basically a graph and RDF triples describe a graph.

Publicly deposited RDF data should be linked data: this means that all the primary data archives are linked to their supporting knowledge frameworks (such as a pottery sequence). When a knowledge framework changes, the implications are propagated through to the related data dynamically. This means that policy, development control and research decisions are based upon data that reflects the most up-to-date information and knowledge… cool huh.
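To make the stratigraphy point concrete, here is a minimal sketch of a consistency check. It assumes stratigraphic relationships have been reduced to plain (subject, predicate, object) tuples; the context identifiers and the "above" predicate are hypothetical, not taken from any real recording system or triplestore. If context A is above B, and B is above C, then C cannot also be above A, so any loop in the graph is a recording conflict.

```python
from collections import defaultdict

# Hypothetical stratigraphic statements as (subject, predicate, object) triples.
# "above" means the subject context is stratigraphically later than the object.
triples = [
    ("context:101", "above", "context:102"),
    ("context:102", "above", "context:103"),
    ("context:103", "above", "context:101"),  # conflicts with the two statements above
    ("context:104", "above", "context:103"),
]

def find_cycle(triples, predicate="above"):
    """Return one loop of contexts as a list, or None if no loop exists."""
    graph = defaultdict(list)
    for s, p, o in triples:
        if p == predicate:
            graph[s].append(o)

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = defaultdict(int)
    path = []

    def visit(node):
        state[node] = GREY
        path.append(node)
        for nxt in graph[node]:
            if state[nxt] == GREY:  # nxt is already on the current path: a loop
                return path[path.index(nxt):] + [nxt]
            if state[nxt] == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        state[node] = BLACK
        return None

    for node in list(graph):
        if state[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None

print(find_cycle(triples))
```

A real triplestore would more likely run this as a query or a reasoner rule; the depth-first search here is only one simple way to show that the check is a standard graph operation.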

Incorporating excavation data into RDF means that ontology and SKOS can be used to dynamically repurpose the data for policy formulation, planning impact, regional heritage control and mitigation purposes in conjunction with the data in the Sites and Monuments Record (SMR). Raw data can be integrated from multiple different sources with different degrees of spatial and attribute granularity and, where appropriate, generalised so that the data is fit for the end users’ purpose. From a policy perspective curatorial officers no longer have to battle to stop datasets becoming stale and add new datasets to the local SMR. The SMR will remain an essential dataset: even though it is a generalised resource it is the only location of a digital record for resources that are unlikely to be digitised in the future (unless there is a very unlikely reverse in funding patterns). Thus the curatorial officer can develop more effective regional research agendas based upon up-to-date and accurate data.

This has the potential to change the way Historic Environment Information Resources (HEIRs) are managed by curatorial officers and transform how developers (property and software), policy makers and the general public engage with and consume any data. They will be able to support innovative access to primary linked data resources by researchers, planners and most importantly the public. This is a significant and important change in role. In addition the heritage data can be mashed up with other data resources to produce tailor made resources for different end-user communities – following the model successfully employed by data.gov.uk.

Data re-use and mashups are also important for those undertaking research and analysis. The big difference will be for those who undertake research or collect data that transcends different traditional analytical scales. For example, the National Mapping Programme which aims to “enhance the understanding of past human settlement, by providing primary information and synthesis for all archaeological sites and landscapes visible on aerial photographs or other airborne remote sensed data” will provide deeper insights when it is integrated with other data. However, this integration can occur in real time and add tangible interpretative depth. If an interpreter is digitising data from an aerial photograph and they see two ditches cutting one another they are unlikely to be able to tell the relative stratigraphic sequence of the two features. Direct access to excavation or other data will allow the full relationships and their interpretative relevance to be deduced during data collection.

In the longer term consumers of archaeological data will be more used to dealing with primary data, will become more aware of its potential and demand more of the resource. This should produce a ground up re-appraisal of recording systems and a better understanding of archaeological hermeneutics. The interpretative interplay between theory, practice and data as part of a dynamic knowledge system is essential. Although this has been recognised, in reality theory, practice and data have never really been joined up. We don’t have to use a one size fits all approach to conducting excavations, but we can tailor bespoke systems that address local, regional and national research challenges. We can generate interesting and provocative data that can be used to test theory and inform practice and move away from recording systems mired in the theoretical and intellectual paradigms of the mid 70’s.

The virtuous circle is re-established; theory will influence practice, which will change the nature of the data, which will impact on interpretative frameworks, which will provide a body of knowledge against which theory can be tested.

Final comments

There is a new breed: there are people and organisations who don’t want to do what’s always been done. People who are empowered and don’t believe that established institutions and hierarchies are the gatekeepers of progress: organisations that can, and want to, change the way we ‘play the game’, people who want to collaborate. Organisations that want to share. Open approaches can help to make all this happen. This is all facilitated by disruptive technology which is increasingly mature, broadly available for free (or at a low cost) and with low barriers of use and re-use. In the nearly twenty years of studying and working in the heritage sector I’ve seen it change dramatically. I feel we are on the cusp of changing the way we engage with our data which could profoundly alter the way we understand the past, how we can communicate this in the present and how we can sustainably manage a complex resource for the future.

by jwalsh

Dig the new breed, Part II – open archaeology and ethics

June 11, 2010 in External, Ideas and musings, WG Archaeology

The second in this great series of three guest blogs by Ant Beck. See Part 1 for applications of linked data and remote sensing in archaeology. Part 3 will wrap things up and talk about the disruptive implications of linked open data for archaeology.

Open Science provides the framework for producing transparent and reproducible science by providing open access to raw data, algorithms and interpretations. Efforts such as STAR and STELLAR provide the foundation from which fine granularity excavation data can be made available as part of the semantic web and feed into Open Science analysis. This provides answers to the questions of how and why we should have open access to archaeological data. However, it does not provide answers to what data should be opened or if archaeological data should be opened at all. We move into the sphere of ethics and open archaeology.

Treasure seeking - CC-BY-SA-NC

Recently I have chatted to a number of people and organisations who want to open up heritage data. The conversations tend to have an ethical component. Like other disciplines, such as ecology, there are potential ethical issues in making heritage data open. The oft touted reason, in the UK at least, is that if access is given to this information then it will be exploited by “night hawkers” (irresponsible metal-detectorists) and other “treasure hunters” and sites (a term I don’t really like) will be destroyed.

This argument is polarised and plays to the lowest common denominator: it is based on the premise that “accessible knowledge will inevitably be abused” and eschews any of the benefits that data sharing can provide. Nor does it consider the nuanced ethical arguments concerning re-appropriation of artifacts collected under imperialist regimes or the ethical conundrum surrounding research into aboriginal or other indigenous communities (which, now that I’ve raised them, I won’t comment on further). The Portable Antiquities Scheme has done much to improve this argument.

The elephant in the room in this debate concerns those archaeologists who have sat on their archive for decades. We know of its significance but it is not available for academic and research analysis and does not inform the planning process. This has enormous impact on local planning policy, public and academic understanding, theory, practice etc. The 1990 introduction of Planning Policy Guidance 16 (PPG16: essentially commercial archaeology) in the UK, and the later Planning Policy Statement 5, have improved the situation a bit.

But I find the situation somewhat paradoxical. The UK curatorial systems expect that a generalised summary, or synthesis, of any investigation is deposited with the regional curatorial officers. This data is entered into the Sites and Monuments Record (SMR) and is publicly accessible. Therefore, the public has access to a generalised dataset. The expectations for primary, or raw, data are different: it’s considered ethically appropriate to deposit fine granularity data (i.e. non-generalised, primary, data, such as those from excavation) with the Archaeology Data Service (ADS); however, there are issues raised if an individual wants to do this outside such formal structures (although the Perry Oaks Project have released redacted versions of their site data).

Is this an issue of ethics, or where formal and informal work practices collide; or is this simply an issue of cost, where individuals and organisations have the will but not the finances? Alternatively, and possibly most likely, do archaeologists just feel uncomfortable making their fine grained data available to a mass audience without going through a representative authority such as the ADS? My feeling is that within the archaeology domain there is an informal belief that if data is deposited with a repository then the repository also takes the ethical responsibility if the data is released. Deposition so that data is available in perpetuity is part of business and academic best practice, however, deposition does not necessarily mean release and subsequent consumption by other parties (public or otherwise).

Whatever the answer the point remains: archaeologists, for right or wrong, consider the implications of placing fine grained data in the public domain and “Ethical considerations” have been identified as a “barrier” to deposition. However, there appears to be limited guidance as to how to resolve these issues. This means that many archaeologists are re-inventing the wheel. The challenge is to provide some supporting “thing” that makes it easy for individuals and organisations to get to a clear, and hopefully unambiguous, ethical position. Such a “thing” will reduce uncertainty thereby removing one of the barriers to data sharing. The current default position is the equivalent of doing nothing: surely this must change.

Supporting “stuff” which is recognised and approved by national heritage organisations and standards bodies will act as an important lubricant to help individuals and groups to release data through informal channels. It should be recognised that the relationship between the “citizen”, the archaeologists and heritage data will change: citizen science and citizen data will play more of a role in heritage than ever before. Hence, a focus on the informal is important: we don’t want more grey data, do we? The Portable Antiquities Scheme is the “poster boy” for archaeological approaches to citizen science – although they do have a range of different user access levels.

I raised this as a topic for the Archaeology working group at the Open Knowledge Foundation. Response so far has been positive and has spilled over to colleagues in the curatorial sector and beyond (the discussion thread can be found here). We’ll be setting up a meeting to discuss these issues later in 2010. Both the Archaeology Data Service and the University of Leeds have kindly offered a venue.

There’s also a start at creating an ethics statement on open access to raw archaeological data – a statement that should be supportable by institutions and individual researchers alike. If you’d like to get involved, please join the Open Archaeology working group and mailing list – involvement could be helping to craft the ethics statement, asking your institution to contribute its own statement, or helping to plan and document the workshop.

Dig the New Breed: How open approaches can empower archaeologists – Part I

June 10, 2010 in External, WG Archaeology

Very happy to post the first in an amazing series of OKFN guest blogs by Ant Beck, a member of the Open Archaeology working group. Ant discusses the DART project and the STAR project, both of which employed Linked Data in a heritage context. Later we’ll get into the ethics of open heritage, and a vision for the future of archaeological data.

The title “Dig the New Breed” is taken from the presentation I gave at the Open Knowledge Conference 2010. I did this for two reasons: It’s a terrible play on words (dig is employed as a synonym for “excavation” and “To like”) and I like name-checking “The Jam”. As this series of posts has taken form, it’s changed from being a piece about Open Science and Ethics into something about how disruptive technologies can be implemented to transform how the heritage sector operates.

STAR & STELLAR – Anyone for linked heritage data?

DART

DARTProject Flickr page

I recently attended a STAR project workshop and saw a glimpse of the future. The Semantic Technologies for Archaeological Resources (STAR) project investigated “the potential of semantic terminology tools for widening and improving access to heritage resources, exploring the possibilities of combining a high level, core ontology with domain thesauri and natural language processing techniques”. The project has looked at extracting structured knowledge from “grey literature” using Natural Language Processing (NLP) tools – all very worthy and interesting but not something I’m directly excited by as “grey literature” is essentially tertiary data (an extraction of synthetic data derived from the primary record). In addition they have developed an RDF based approach to query data stored in heterogeneous excavation databases. WOW!

And in case you missed that…. querying data stored in heterogeneous excavation databases. Essentially they have resolved syntactic (platform/format), schematic (structural) and semantic (language) heterogeneities by generating mappings of key fields (i.e. a sub-set of the source data) to the English Heritage extension of the CIDOC Conceptual Reference Model (CRM), extracting the data as RDF and providing semantic interoperability through Knowledge Organization Systems (KOS) represented in SKOS format from standard heritage thesauri. In essence, they extract RDF from relational databases using hand crafted mappings to both SKOS and ontology articulating semantics and canonical concepts respectively.
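The "hand crafted mapping" idea can be sketched in a few lines. This is only an illustration under stated assumptions: the column names, the example.org namespaces and the predicate names are all hypothetical, and a real mapping would target the CRM extension and the heritage thesauri rather than these placeholders. The point is simply that a small per-database mapping table lets rows from differently structured sources come out as the same kind of triple.

```python
# Hypothetical namespaces; a real mapping would target the CRM and thesaurus URIs.
ONT = "http://example.org/ontology/"
THES = "http://example.org/thesaurus/"
DATA = "http://example.org/site-a/"

# Hand-crafted mapping for one source database: column -> (predicate, value kind).
MAPPING = {
    "ctx_type": (ONT + "hasType", "concept"),       # value becomes a thesaurus concept
    "within": (ONT + "isContainedIn", "resource"),  # value refers to another context
    "notes": (ONT + "hasNote", "literal"),          # value kept as plain text
}

def row_to_triples(row, key="ctx_id", mapping=MAPPING):
    """Turn one relational row (a dict) into (subject, predicate, object) triples."""
    subject = DATA + "context/" + str(row[key])
    triples = []
    for column, (predicate, kind) in mapping.items():
        value = row.get(column)
        if value is None or value == "":
            continue  # empty fields produce no triple
        if kind == "concept":
            obj = THES + str(value).strip().lower().replace(" ", "_")
        elif kind == "resource":
            obj = DATA + "context/" + str(value)
        else:
            obj = str(value)
        triples.append((subject, predicate, obj))
    return triples

# Only the mapped sub-set of fields is extracted; "depth_cm" is ignored here.
row = {"ctx_id": 17, "ctx_type": "Post Hole", "within": 4, "notes": "", "depth_cm": 30}
for triple in row_to_triples(row):
    print(triple)
```

Each source database would get its own mapping table, which is where the manual effort goes; everything downstream then works on the shared triples.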

The combination of RDF, ontology and SKOS has allowed the team to produce a demonstrator capable of cross searching different excavation databases with “difficult queries”. The team demonstrated that they could address questions such as, show me contexts that satisfy the following criteria:

  • Roman corn drying ovens with palaeobotanical analysis
  • Charred plant remains and charcoal from 4 post structures
  • Post holes that contain ritual deposits

Granted there are limitations: it currently supports a sub-set of the data collected during excavation and the RDF model is viewed as an interim tool with users going back to the source databases to conduct further analysis. However, the concept has been definitively demonstrated. Great stuff! The impact of this work is profound: the SKOS and ontology will allow inferencing/reasoning over the data which will transform the way the data can be re-used, analysed and generalised (more on this in a bit).
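As a rough illustration of why a thesaurus helps with cross searching, the sketch below expands a query term through broader/narrower links before matching. Everything in it is made up for the example: the concept names, the site identifiers and the predicates are hypothetical, and the `BROADER` dictionary merely stands in for the kind of hierarchy a SKOS thesaurus expresses. It is not the STAR demonstrator's method.

```python
# Hypothetical thesaurus: narrower concept -> broader concept.
BROADER = {
    "corn_drying_oven": "oven",
    "bread_oven": "oven",
    "oven": "heating_feature",
}

# Triples extracted from two differently structured databases (illustrative only).
TRIPLES = [
    ("siteA:ctx17", "hasType", "corn_drying_oven"),
    ("siteA:ctx17", "hasPeriod", "roman"),
    ("siteB:F204", "hasType", "bread_oven"),
    ("siteB:F204", "hasPeriod", "medieval"),
    ("siteB:F310", "hasType", "post_hole"),
    ("siteB:F310", "hasPeriod", "roman"),
]

def is_a(concept, target, broader=BROADER):
    """True if concept equals target or reaches it by following broader links."""
    seen = set()
    while concept is not None and concept not in seen:
        if concept == target:
            return True
        seen.add(concept)
        concept = broader.get(concept)
    return False

def search(triples, type_concept, period):
    """Contexts whose type falls under type_concept and whose period matches."""
    typed = {s for s, p, o in triples if p == "hasType" and is_a(o, type_concept)}
    dated = {s for s, p, o in triples if p == "hasPeriod" and o == period}
    return sorted(typed & dated)

print(search(TRIPLES, "oven", "roman"))
```

A query for Roman ovens finds the corn drying oven even though neither source database uses the word "oven" on its own, which is the kind of generalisation the thesaurus makes possible.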

The Glamorgan team have a follow on project called Semantic Technologies Enhancing Links and Linked data for Archaeological Resources (STELLAR) funded by the AHRC. One of the aims of STELLAR is to develop “best practice guidelines and tools … both for mapping/extracting archaeological data as RDF and for generating archaeological Linked Data”. This will take the research developed in STAR and provide tools so that it can be deployed to mainstream archaeological data. I’m really looking forward to seeing the roll-out of this technology.

DART and Open Science

HeritageDetectionProblem

DART is an acronym for Detection of Archaeological Residues using remote sensing Techniques. DART is a three year Science and Heritage initiative funded by AHRC and EPSRC, led by the School of Computing at the University of Leeds. The project aims to improve the understanding of the physical, chemical, biological and environmental factors that determine whether an archaeological feature (pit, ditch, posthole etc.) can be detected by a sensor (camera, Ground Penetrating Radar, etc.). DART brings together consultants and researchers from the areas of computer vision, geophysics, remote sensing, knowledge engineering and soil science.

Archaeological sites and features are created by localized processes of formation and deformation. There is a range of imaging instruments that can be used to detect these archaeological residues, although the knowledge required to determine what, when, how and why to use each different type of sensor is patchy. Seasonal, environmental and vegetation dynamics play a part, although the complexities of interaction and how they modify “contrast signatures” derived from the existing formation and deformation processes are uncertain.

This is important, so I will provide an example: as a mud-brick built farmstead erodes, the silt, sand, clay, large clasts and organics in the mud-brick, along with other anthropogenic debris, are incorporated into the soil. This produces a localised variation in soil particle size and structure. This in turn affects drainage and localised crop stress and vigour. These localised variations can all provide measurable differences, or contrasts, that indicate the presence of archaeology.

For example, archaeological residues can affect drainage of the soil, which then affects the appearance of crops. Different drainage characteristics result in different soil moisture retention properties, and local variations in crop stress/vigour can be observed as differences in crop height or crop colour (essentially crop marks). Archaeological contrasts can be expressed through, for example, variations in chemistry, magnetic field, resistance, topography, temperature and spectral reflectance.
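To make the idea of a “contrast” concrete, here is a minimal sketch in Python. It is not the DART project's method: the normalised-difference formula is just one common way to express contrast, and the soil moisture readings are made up. It compares measurements taken over a suspected feature with measurements from the surrounding soil.

```python
from statistics import mean


def contrast(feature_values, background_values):
    """Normalised difference between the mean reading over a feature and the mean background reading.

    Assumption: this formula is chosen for illustration; other contrast
    measures could equally be used.
    """
    feature_mean = mean(feature_values)
    background_mean = mean(background_values)
    total = feature_mean + background_mean
    if total == 0:
        raise ValueError("contrast is undefined when both means sum to zero")
    return (feature_mean - background_mean) / total


# Hypothetical soil moisture readings (percent by volume); not real survey data.
over_ditch = [30.0, 32.0, 34.0]
surrounding_soil = [20.0, 22.0, 24.0]

moisture_contrast = contrast(over_ditch, surrounding_soil)
```

A value near zero means the feature is indistinguishable from its surroundings for that measurement, while a larger positive or negative value suggests a sensor responding to that property might pick it up. The same calculation could be repeated for any of the properties listed above, and at different times of year.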

The DART project is trying to identify physical, chemical and biological contrast factors that may allow us to detect archaeological residues (both directly and by proxy) under different land-use and environmental conditions. We address the following research issues:

  • What are the factors that produce archaeological contrasts?
  • How do these contrast processes vary over space and time?
  • What processes cause these variations?
  • How can we best detect these contrasts (sensors and conditions)?

DART is committed to open science principles and aims to act as an exemplar for how data, tools, and analysis can be made available to the wider academic, heritage and general community. Data, software, algorithms and services developed throughout the project will be made available for re-use with appropriate open licences.

Licensing is an issue, as licence incompatibility can severely restrict re-use. Science Commons is establishing protocols in this area. Publicly accessible dissemination is preferred; however, where necessary, domain-specific or institutional repositories will be utilised for long-term preservation. Cameron Neylon is part of the project consortium and provides a steer on these issues.

The whole point of taking an open science position on this project is so that we can maximise the benefit and impact. The research problem is large and complex: one project will not solve it. Inevitably the science will need refining; adequate articulation will require long-term data collection under different conditions, followed by iterative hypothesis testing and modelling. The challenge is to get this information in the quickest, cheapest and easiest ways. An Open Science approach means that DART is openly collaborating with researchers and individuals throughout the world. The body of work developed within DART can be easily re-used by others: because the data and algorithms will be in the public domain, our results can be rapidly tested and evaluated. Unlocking the “body of knowledge” and “know how” surrounding a programme of research should significantly reduce the barriers to re-use. This may generate a critical mass of surrounding research, which can only improve the underlying models and science. Providing scientists with the methodology of how to make the wheel will not only stop us reinventing it, but will also improve the manufacturing process.