You are browsing the archive for Francis Irving.

From CMS to DMS: C is for Content, D is for Data

March 9, 2012 in Featured, Ideas and musings, Open Standards

This is a joint blog post by Francis Irving, CEO of ScraperWiki, and Rufus Pollock, Founder of the Open Knowledge Foundation. It’s being cross-posted to both blogs.

Content Management Systems, remember those?

Tim Berners-Lee in thought

It’s 1994. You haven’t heard of the World Wide Web yet.

Your brother goes to a top university. He once overheard some geeks in the computer room making a ‘web site’ consisting of a photo tour of their shared house. He thought it was stupid, Usenet is so much better.

The question – in 1994 did you understand what a Content Management System (CMS) was?

In the intervening years, CMS’s have gone through ups and downs.

Building massive businesses, crashing in the .com collapse. Then a glut, web design agencies all building their own CMS in the early noughties. Ending up with the situation now.

A mature market, commoditised by open source WordPress. Anyone can get a page on the web using Facebook. There’s still room for expensive, proprietary players, newspapers custom make their own, and businesses have fancy intranets.

Data Management Systems, time to meet them!

DMSs are also called "data hubs". Hopefully less patented than this wheel!

It’s 2012. You’ve just about heard of Open Data.

Your nephew researches the Internet at a top university. He says there’s no future in Open Data, no communities have formed round it. Companies aren’t publishing much data yet, and Governments the wrong data reluctantly.

The question – what is a Data Management System (DMS)?

There isn’t a very good one yet. We’re at round about where CMS’s were in the mid 1990s. Most people get by fine without them.

Just as then we wrote HTML in text files by hand and uploaded it by FTP, now we analyse data on our laptops using Excel, and share it with friends by emailing CSV files.

But it reaches the point where using the filesystem and Outlook as your DMS stretches to breaking point. You’ll need a proper one.

Nobody really knows what a proper one will look like yet. We’re all working on it. But we do know what it will enable.

What must a DMS do?

All the things people expect a DMS to do!

A mature DMS will let people do all the following things. Whether as a proprietary monolith, or by slick integration across the web:

  • Load and update data from any source (ETL)
  • Store datasets and index them for querying
  • View, analyse and update data in a tabular interface (spreadsheet)
  • Visualise data, for example with charts or maps
  • Analyse data, for example with statistics and machine learning
  • Organise many people to enter or correct data (crowd-sourcing)
  • Measure and ensure the quality of data, and its provenance
  • Permissions; data can be open, private or shared
  • Find datasets, and organise them to help others find them
  • Sell data, sharing processing costs between users

If it sounds like a fat list for a product, that’s because it is. But sometimes the need, the market, pulls you – something simple just won’t do. It has to do or enable, best it can, everything above. (Compare it to the same list for CMSs)

In short, it’s what the elite data wrangling teams inside places like Wolfram Alpha and Google’s Metaweb do. But made easier and more visible using standardised tools and protocols.

Who’s making a DMS?

More people than I realise. From the largest IT company to the tiniest startup. Here are some I know about, mention more in the comments:

  • Windows / OSX (+ Excel / LibreOffice / …) – the desktop serves as a (good enough so far) DMS
  • CKAN software – started as a data catalog, but has grown into more and powers the DataHub, a community data hub and market. Created by the Open Knowledge Foundation
  • ScraperWiki – coming from the viewpoint of a programmer, good at ETL
  • Infochimps/DataMarket – approaching it as a data marketplace
  • BuzzData – specialising in the social aspects
  • Tableau Public – specialising in visualisation
  • Google Spreadsheets – coming from the web spreadsheet direction
  • Microsoft Data Hub – corporate information management
  • PANDA – making a DMS for newsrooms

They’re all DMS’s because they all naturally grow bad versions of each other’s features. Two examples.

ScraperWiki is particularly good at complex ETL (loading data into a system), yet every DMS has to have a data ingestion interface of at least choosing CSV columns.

CKAN has particularly good metadata, usage and provenance, yet every DMS has to have a way for people to find the data stored in it.

So will they be giant monolithic bits of software?

We standardised the shipping container, can we standardise data interoperation?

We hope not! That didn’t turn out great for CMSs, although there are some businesses providing that.

CMS’s only really came of age when in the mid-noughties everyone realised that WordPress (open source blogging software!) was a better CMS than most CMS’s.

It’s in everyone’s interest that users aren’t locked into one DMS. One of them might have a whizzy content analysis tool that somebody who has data in another DMS wants to use. They should be able to, and easily.

OKFN is about to launch a standards initiative to bring together such things. It’s called Data Protocols.

So far the clearest needs are twofold and mirror each other – pulling and pushing data:

a) a data query protocol/format to allow realtime querying, for example for exploring data. Imagine a Google Refine instance live querying a large dataset on OKFN’s Data Hub.

b) a data sync protocol/format that is akin to CouchDB’s protocol. It would let datasets get updated in real time across the web. Imagine a set of scrapers on ScraperWiki automatically updating a visualisation on Many Eyes as the data changed.
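To make (b) a little more concrete, here is a minimal Python sketch of a sequence-numbered change feed, loosely in the spirit of CouchDB’s changes feed. Everything in it is hypothetical: the names ChangeFeed, Replica, put, changes_since and pull are invented for illustration, it runs in memory rather than over HTTP, and it is not part of any real Data Protocols specification.

```python
class ChangeFeed:
    """Hypothetical publisher side: records every row update with a sequence number."""

    def __init__(self):
        self._changes = []  # list of (seq, row_id, row) tuples, oldest first

    def put(self, row_id, row):
        """Store an update and return the sequence number assigned to it."""
        seq = len(self._changes) + 1
        self._changes.append((seq, row_id, dict(row)))
        return seq

    def changes_since(self, seq):
        """Return all updates made after the given sequence number."""
        return [change for change in self._changes if change[0] > seq]


class Replica:
    """Hypothetical consumer side: keeps a copy of the rows and a checkpoint."""

    def __init__(self):
        self.rows = {}
        self.last_seq = 0  # 0 means nothing has been pulled yet

    def pull(self, feed):
        """Apply only the updates not seen before; return how many were applied."""
        changes = feed.changes_since(self.last_seq)
        for seq, row_id, row in changes:
            self.rows[row_id] = row
            self.last_seq = seq
        return len(changes)


feed = ChangeFeed()
feed.put("site-1", {"homes": 14})
feed.put("site-2", {"homes": 3})

replica = Replica()
first_pull = replica.pull(feed)    # picks up both updates
feed.put("site-1", {"homes": 15})  # the scraper finds new data
second_pull = replica.pull(feed)   # picks up only the one new update
```

The point of the checkpoint is that a visualisation only has to ask “what changed since sequence N?” rather than re-downloading the whole dataset each time.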

Later even more imaginative things… I reckon Google’s Web Intents can be used to make the whole experience of the user slick when using multiple DMS’s at once. And hopefully somebody, somewhere is making a simplified version of SPARQL/RDF just as XML simplified SGML and then really took off.

Enough of me! What do you think?

Join in. Make standards. Write code.

Leave a comment below, and join the data protocols list.

And so corporations begin to open data…

July 27, 2011 in Business, Open Data

The following post is by Francis Irving, CEO of ScraperWiki.

Now it seems almost normal that red in tooth and claw competitors, like Microsoft and Google, are both major contributors to the latest version of a popular open source operating system kernel.

Businesses are gradually realising they can share the costs of anything based on intellectual property which isn’t key to their business advantage. For example, this year Facebook opened up the hardware designs of their data centres.

Governments are struggling forwards and backwards to open up data about their work, ultimately to gain similar efficiencies and strategic advantages.

What about companies opening data?

On the right is Hannah Jones. She heads up Nike’s Sustainable Business and Innovation group. Yes, the trainer company.

She’s backed in what she does by Nike’s seemingly very forward thinking CEO, Mark Parker. He previously set up GreenXchange, for better sharing patents related to the environment (detailed writeup, scraped list of all the patents).

Nike have a surprisingly long history of releasing data. Back in 2000, they started publishing a list of all their contracted factories (scraped list by Selena Deckelmann) and related audit information. The aim? To improve their factory working conditions, both by improved scrutiny of Nike’s own measurement systems, and by enabling direct on the ground inspection and campaigning by activists.

Recently Nike have shifted it up a gear. They’re lucky to be based in Portland in the US, where there is a rampant community of open data activists. Following a strange story of an advertising agency, a sort of startup incubator and a hack day, back in April Nike started advertising to recruit a “Code for a Better World Fellow”.

Anyone reading this blog will find the description in the job advert strangely alluring.

The ideal candidate will be part developer/programmer, part researcher, part designer, part business analyst. He or she will have demonstrated expert experience with databases, programming in multiple languages, in visual design and with statistics. He or she will have an understanding of the existing open data communities and networks of visual designers and researchers who love data.

 

Why do they want this person? You can piece the broad picture, but not the details, together from the job advert and an article in Forbes.

They’re terrified.

Not terrified of bad PR due to human rights violations like in the 1990s (“Nike suffered from these blows, losing contracts and its good rapport with many consumers”).

Terrified that there won’t be enough water to grow cheap cotton. Terrified that oil prices will continue to shoot up, and they won’t be able to afford international shipping. Terrified that the delicate, beautiful-if-flawed civilisation that we’ve built won’t last in a form where enough people can afford to take part in organised, well equipped sport.

Some of those problems can only be fixed by changing entire supply chains. For example, someone told me that cotton recycling will only work if all their competitors, with whom they share upstream factories, also change to it. Doing that well requires sharing data, and helping others know your data about the scale of the difficulties.

Some of those problems can only be fixed by the invention of new products and services. Startups radically disrupting things – startups that will be more likely the more people understand Nike’s problems.

They want to release data to get:

Disruptive, radical, jaw-dropping innovation. Innovation we cannot imagine. That kind of innovation is not going to come only from within.

 

That quote is from the job advert again.

Unfortunately the bad news is that Nike have made their hire, so you can’t apply for it. The good people at Code for America helped them do their recruitment.

Who have they hired?

It’s barely leaked out onto the Internet (you can spot it in here, an announcement of an event on the 6th August that you must go to if you are in San Francisco). So, dear reader of this blog post, you’re the first to know.

Ward Cunningham.

You might recognise the name, he invented the wiki.

“Should Britain flog off the family silver to cut our national debt?”

March 14, 2011 in External, Open Data, Uncategorized

The following post is from Francis Irving, CEO of ScraperWiki.

‘Should Britain flog off the family silver to cut our national debt?’ – that’s the question the UK current affairs documentary Dispatches tackled last Monday.

ScraperWiki worked with Channel 4 News and Dispatches to make two supporting data visualisations, to help viewers understand what assets the UK Government owns. This blog post tells you a bit about the background to them – where the data came from, what it was like, and how and why we made the visualisations.

1. Asset bubbles

Inspired by Where Does My Money Go’s bubble chart of public spending, the first is a bubble chart of what central Government owns.

We couldn’t find any detailed national asset registry more recent than 2005 (assembled in the National Asset Registry 2007). With a good accounting system, and properly published data all the way through Government, such a thing would constantly update.

In some ways the need to drill down is less of a problem here than with Government spending. There isn’t the equivalent problem of wanting to know who the contractor for some spending is, or to see the contract. Instead, you want to know assessments of value, and what investments could do to that value, as well as strategic consequences of losing control of the asset – detailed information that perhaps the authorities themselves often don’t have.

The PDFs were mined by hand (by Nicola) to make the visualisation, and if you drill down you will see an image of the PDF with the source of the data highlighted. That’s quite an innovation – one of the goals of the new data industry is transparency of source. Without knowing the source of data, you can’t fully understand the implications of making a decision based on it.

Julian used RaphaelJS to code the bubbles (source code here). You can think of it as “JQuery for in browser SVG”. Amazingly, it even works in (most) versions of Internet Explorer (using a compatibility layer via VML).

This has some advantages over Flash – you at least get iPad compatibility. It’s also easier for people with other web skills to maintain than Flex, plus people can “view source” and learn from each other just like in the good old days of the web.

That said, on the down side, CSS compatibility with the stylesheets of the site it is embedded in was a pain. We had to override a few higher level styles (e.g. background transparency) to get it to work. Perhaps next time we should use an iframe :)

2. Brownfield sites

The second is a map of brownfield land owned by local councils in England.

Or at least, that they owned in 2008. There isn’t a more recent version, yet, of the National Land Use Database. One of the main pieces of feedback we got was people frustrated that we didn’t have up to date, or always complete data. There is definitely an expectation in the public that something as basic as what the Government owns should be available in an online, up to date fashion.

The dataset is compiled by the Homes and Communities Agency, who have a goal of improving use of brownfield land to help reduce the housing shortage. This makes it reasonably complete, and it covers the whole of England. That’s important, as it gives everyone a good chance that they will find something near them.

The data is prepared by local authorities, sent using an Excel or a GIS file (see the guidance notes linked near the bottom of this page) to the agency. Depending where you live, the detail and thoroughness will vary.

The same dataset contains lots of information about privately owned land, but we deliberately only show the local authority owned land, as the Dispatches show was about what the state could sell off. It’s quite interesting that a dataset gathered for purposes of developing housing is also useful, as an aside, for measuring what the state owns. It’s that kind of twist of use of data that really requires understanding of the source of the data.

The actual application is fairly straightforward Google Maps API and JQuery, although as with the asset bubbles, Zarino made it look and behave fantastic. The main innovative thing is that it tells a story about each site which is constructed from the dataset.

For example, what was originally quite a hard to read line in an Excel file comes out as:

JUNCTION OF PARK ROAD, NORTHUMBERLAND STREET

Liverpool City Council own this brownfield land. This site was dwellings and is now derelict. It is proposed that it is used for housing. Planning permission is detailed. A developer could build an estimated 14 homes here, selling for £1,820,000 (if they were at £130,000 per home, the median North West price).

Nicola did a lot of testing to make the wording as natural as possible, although we could have done even more. You can see the source code here. We think of these paragraphs as mini constructed stories, local to the viewer, a kind of visualisation as text.
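As a rough idea of how such a story can be built from one row of the dataset, here is a minimal Python sketch. It is not the code behind the map: the field names (owner, previous_use, status and so on) and the describe_site function are invented for illustration, and the real National Land Use Database columns are named differently. The estimated sale value is simply the number of homes multiplied by the regional median price.

```python
def describe_site(row, median_price):
    """Turn one (hypothetical) dataset row into a short readable story."""
    homes = row["estimated_homes"]
    total_value = homes * median_price  # e.g. 14 homes x £130,000
    return (
        f"{row['owner']} own this brownfield land. "
        f"This site was {row['previous_use']} and is now {row['status']}. "
        f"It is proposed that it is used for {row['proposed_use']}. "
        f"Planning permission is {row['planning_permission']}. "
        f"A developer could build an estimated {homes} homes here, "
        f"selling for £{total_value:,} "
        f"(if they were at £{median_price:,} per home, "
        f"the median {row['region']} price)."
    )


# Values taken from the Liverpool example above; the keys are assumptions.
example_row = {
    "owner": "Liverpool City Council",
    "previous_use": "dwellings",
    "status": "derelict",
    "proposed_use": "housing",
    "planning_permission": "detailed",
    "estimated_homes": 14,
    "region": "North West",
}

story = describe_site(example_row, median_price=130_000)
print(story)
```

Most of the real work is in the wording for awkward cases (missing fields, a single home, unknown planning status), which this sketch ignores.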

Conclusion

This kind of visualisation, to help a viewer dig into the details they are most interested in of an overall story or theme, is just the start of how use of (open!) data can help media organisations.

I’d like to see more work to integrate the data early on in the development of stories – so it acts as another source, finding leads in an investigation. And I think there are lots of opportunities for news organisations to build ongoing applications, which build audience, revenue and personal stories even when the story isn’t in the 24 hour news cycle.

See also Nicola’s post 600 Lines of Code, 748 Revisions = A Load of Bubbles on the ScraperWiki blog.

Election data!

May 5, 2010 in Campaigning, Open Data, Open Government Data

If you’d asked me back in 2005, I’d have told you that the 2010 election would be the first online election. It turned out not to be.

For example, the YouTube and Facebook leaders debate was much less important than the Television debates.

However, there are a few places relating to data where the Internet did something genuinely new this time.

MP candidate data

The most basic data about an election is the names of the people you can vote for. Shockingly, there is no central, official source. Never mind one with an open data license. The data from newspaper websites is usually incomplete, particularly for independent candidates.

In the UK, we have Parliaments of an irregular length. This means that officially, you only know the list of candidates 2 weeks before the election. Before that you have to make do with screen scraping party website lists of Prospective Parliamentary Candidates (the obtuse term for someone who is going to be a candidate, before they can officially be one. e.g. the Conservatives) and hoping to get data from Wikipedia.

This year Edmund von der Burg bravely overcame all these problems with his YourNextMP, which he has run entirely as a volunteer. Not only does it have the names and parties of candidates, but it has extra information like email addresses, photos, schools attended and web site links. The data is available under a CC BY-SA license.

After nominations close, just 2 weeks before the election, each local authority publishes notices of poll (e.g. Liverpool, Riverside). They are the official list. Amazingly, the volunteers building up the YourNextMP data set cross checked their data against 650 notices of poll, in just a few days after nominations closed.

Really, all the basic work of YourNextMP should be done by the state. We could have fixed term Parliaments, with nominations that close two months before the election. The local authorities could upload basic candidate data, including electronic and paper contact addresses, to a central website, perhaps run by the Electoral Commission.

Then YourNextMP could concentrate on the added value. What we ought to be doing – researching the candidates to find out more about them. What companies have they been directors of? What charities do they support? Will they voluntarily declare their interests in advance of the election (as recommended by the Ministry of Justice)?

Maybe candidates should have to declare a bunch of “same as” RDF identifiers – such as their unique codes in the companies house database, the land registry and Wikipedia.

Julian Todd from the Straight Choice thinks every candidate should be obliged to publish a full CV, perhaps as structured data (see last paragraph of this article). And why not? Currently, we ask for far less of our new employee in Parliament than we would of somebody we employ in our business.

Election leaflets

As I said, this election was another offline election. Part of that is the mass media, big leaders. But the other key part is getting out the vote. It is the door to door canvassing, the hard labours of local party workers up and down the country. Vital are election leaflets, a data set hitherto hidden from us.

The Straight Choice has crowd sourced 5173 election leaflets, from all parties and most constituencies (disclosure, I do some systems administration for them). You can see a zoomable map of them, and a mosaic of the party leaders made of their leaflets, in this blog post where they report back on what they’ve found.

Have a read through the presentation at the end of that blog post. The Straight Choice have a series of campaigning demands. They’re all data related.

As I said above, they’d like CVs of candidates. If we continue to have a non-proportional electoral system, they’d like local voting intention polling – essential data to properly tactically vote. And finally, they want every electoral leaflet to be sent to the Electoral Commission and published. Like a copyright library, so electoral law can be properly enforced.

Just like YourNextMP, The Straight Choice is run entirely by volunteers. Julian Todd and Richard Pope did the central work.

Please please please, upload any leaflets you have – it’s vital to catch lots in the “end game”, as they can be particularly dirty.

Candidate opinions

Wouldn’t it be nice to have structured data on what the candidates think on a series of local and national issues? Luckily some volunteers, along with a small charity, found out using an incredibly complicated crowd sourcing operation.

The hinge of this was Democracy Club, a network of over 6000 transparency activists in nearly every constituency in the country. It’s amazing what you need to build, when you don’t have handy JSON files.

Once again, Democracy Club was started and is run by volunteers – Seb Bacon and Tim Green. For the last few weeks, mySociety got an emergency grant from the Joseph Rowntree Reform Trust to pay Seb, so Democracy Club could do even more in the run up to the election. Thank you to them!

The Democracy Club members built up the YourNextMP database of candidates, and uploaded lots of the leaflets to the Straight Choice. They also made a database of local issues in each constituency. These were munged together by mySociety (who I work for), into a survey to all candidates.

You can view the results for your constituency by entering your postcode on TheyWorkForYou’s election site. Please pass it around.

(By the way, the data for even that postcode lookup caused complications because the election is fought on new constituency boundaries. Matthew Somerville from mySociety worked them out and offers an API, although some political parties have had trouble.)

You can download the candidate survey response data from the TheyWorkForYou Election API.

Election results

Finally, by the time the counts are finished on Friday morning, you’ll want to find out who won. There are two elections happening tomorrow. One for Westminster, but also one for local councillors in your area.

Chris Taggart, another volunteer, has come to the rescue with his Open Election Project. He’s been promoting an electoral results RDFa – a neat, lightweight form of the semantic web that embeds in extra HTML tags, so it is easy for councils to add them in their existing content management system.

He’s persuaded quite a few councils to start publishing their data in this format, and invented a new technique of asking ‘are you an enabler or a blocker?’. Hopefully in a few years’ time, he’ll have got every council to publish data in this format.

The learning process will have taught everyone how to make progress on the difficult question of going from the theory of national, open sets of local data, to the practice.

Finally

Have a great General Election!

Sources of data on data.gov.uk

January 26, 2010 in External, Open Data, Open Government Data

When data.gov.uk was launched, I had a quick browse around the data, to get a feel for what was in it. Most data sets that I randomly looked at were from statistics.gov.uk (from the Office for National Statistics).

Today, I decided to investigate, and work out some basic statistics about the source of the data. Hopefully this will help find what the interesting new data sets are.

I secretly hoped that I’d have to screenscrape data.gov.uk to work this out. Irony. Luckily, a comment on this blog revealed that there is a handy data dump of all the CKAN data behind data.gov.uk in CSV and JSON formats.

I downloaded the JSON file (21st January 2010 dump) and used basic Unix text processing commands such as grep, sort and uniq to do some calculations.

How many data sets are there, and what protocol are their downloads?

First I did some basic counts, to check how many data sets had a download link, and what protocol the link was in.

Normal HTTP (http://) – 2623 data sets
Secure HTTP (https://) – 178 data sets
No download URL (download_url in the .json dump) – 78 data sets
Total – 2879 data sets
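I did these counts with grep, sort and uniq, but the same tally can be sketched in a few lines of Python. This is an illustration rather than the commands I actually ran: it assumes the dump is a list of records, each with an optional download_url field, and it uses a tiny made-up sample instead of the real file.

```python
from collections import Counter
from urllib.parse import urlparse


def count_protocols(packages):
    """Tally data sets by the URL scheme of their download link."""
    counts = Counter()
    for package in packages:
        url = (package.get("download_url") or "").strip()
        scheme = urlparse(url).scheme if url else ""
        counts[scheme or "none"] += 1  # "none" = no download URL at all
    return counts


# Made-up sample records standing in for the real JSON dump.
sample_packages = [
    {"name": "a", "download_url": "http://www.statistics.gov.uk/a.csv"},
    {"name": "b", "download_url": "https://example.nhs.uk/b.xls"},
    {"name": "c", "download_url": ""},
    {"name": "d"},
]

protocol_counts = count_protocols(sample_packages)
print(protocol_counts)
```

With the real dump you would load the file with json.load and pass the resulting list in place of sample_packages.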

What are the top level domain names of the data sets?

Of the data sets which have a download URL, they are distributed about the following top level domains.

.gov.uk – 2009 data sets
.nhs.uk – 412 data sets
.co.uk – 114 data sets
.org.uk – 79 data sets
.org – 78 data sets
.mod.uk – 34 data sets
.net – 25 data sets
.ac.uk – 14 data sets
.com – 9 data sets
.police.uk – 5 data sets
other (IP, not fully qualified domain) – 21 data sets
Total – 2801 data sets

Top ten sites the data sets are from

Here are the top domains that download links on data.gov.uk go to. I removed any www from them before analysis, to make sure URLs with and without www were counted together.

257 statistics.gov.uk
245 neighbourhood.statistics.gov.uk
231 hesonline.nhs.uk
176 fti.communities.gov.uk
173 communities.gov.uk
150 wales.gov.uk
125 dcsf.gov.uk
110 scotland.gov.uk
106 nomisweb.co.uk
95 hmrc.gov.uk

First thing to notice is that even including its neighbourhood section, statistics.gov.uk still only accounts for about 18% of the total number of data sets. So there is lots else to find in there!
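For anyone who wants to repeat the domain count, here is a hedged Python sketch of the same steps: take the host name from each download URL, strip a leading www, and count. The sample URLs are made up and site_of is just a name chosen for this example. The last lines redo the 18% sum from the figures above.

```python
from collections import Counter
from urllib.parse import urlparse


def site_of(url):
    """Return the host name of a URL, lower-cased, without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


# Made-up URLs standing in for the download links in the dump.
sample_urls = [
    "http://www.statistics.gov.uk/one.csv",
    "http://statistics.gov.uk/two.csv",
    "http://neighbourhood.statistics.gov.uk/three.csv",
    "https://www.hesonline.nhs.uk/four.xls",
]

site_counts = Counter(site_of(url) for url in sample_urls)
for site, count in site_counts.most_common():
    print(count, site)

# Share of statistics.gov.uk plus its neighbourhood section,
# using the counts listed above and the 2801 data sets with a download URL.
share = (257 + 245) / 2801
print(f"{share:.1%}")
```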

The full table is available here as a file: domain-counts.txt. There are 114 different domains.

What license do the data sets have?

Update: in fact data.gov.uk has its own set of terms and conditions which cover all the datasets on the site. These terms are OKD-compliant as they allow anyone to freely use, reuse and redistribute the data. It would be nice for the license field to reflect this though.

Most are marked as being in a straightforward “crown copyright” section. I’d like to see some work on the licensing, to use more standard licenses, or a new OKD compliant license, where possible.

Non-OKD Compliant::Crown Copyright – 2871 data sets
OKD Compliant::UK Click Use PSI – 8 data sets

And a question for you

What interesting data sets have you spotted while browsing about data.gov.uk? Has anything sparked an idea for an application? Have you used any of the new data sets?

Please post in the comments!

Open organisations, need for two more definitions!

October 5, 2008 in Ideas and musings, Open/Closed

If starting a new, public interest, organisation, there are three obvious principles you might like to have.

  • Finance – have all bank transactions automatically public in real time. Plus accounts.
  • Software – all software made by the organisation to be open source.
  • Information – voluntarily subscribe to some sort of FOI law.

The software one is reasonably well covered.

There are problems with the finance one. For example, you probably need to anonymise individual donations, or at least those that are ‘small’. It would be lovely if somebody could think through all this, and come up with an “Open Finance Definition”, for describing when an organisation’s finances are truly open.

There are also problems with the Freedom of Information one. In the UK at least, subscribing to public sector FOI law voluntarily would be dangerous. You wouldn’t get the protection from defamation that a public sector body gets, and you may have trouble applying the public interest test clearly. So again, would be lovely if somebody could come up with an “Open Information Organisation Definition” which encoded a good principle to have for this.

Amazingly really, the more you think about this openness, the more things you find that could be open, and the more definitions you need. There’s work for you forever, Rufus :)

Clearer Climate Code

September 17, 2008 in Exemplars, External, Open Data, Open Science

GISTEMP is a crucial open data set, because it contains the historical global temperature record. Not very important right now, but in the medium term absolutely vital for the continuing functioning of our society given the likelihood of adverse climate change.

Stations that measure temperature naturally do so at specific points in space, and the historical record is additionally contaminated by changes in hardware, urbanisation and other issues. Because of this GISTEMP is made using software that estimates a single global temperature from the measurements using a basic scheme invented by James Hansen in the 1970s.

What is interesting from an open knowledge point of view, is that without this software the GISTEMP data itself is fairly meaningless. It defines clearly what the data is. There have been arguments about the derivation, and to address these the original Fortran software was released into the public domain by NASA in September 2007.

Of course, the software is no use if people can’t read and understand it. Because of this, Nick Barnes (from a company called Ravenbrook) has started a project to rewrite the GISTEMP software in Python, ensuring it produces the same output as the original Fortran.

This is called the Clear Climate Code project. They intend eventually to make clear climate modelling code, they are just starting with the global temperature record.

This open approach to the scientific code and data has already found some rewards. The August 11, 2008 GISTEMP update describes a bug in the original Fortran code which the Python rewrite unearthed:

Nick Barnes and staff at Ravenbrook Limited have generously offered to reprogram the GISTEMP analysis using python only, to make it clearer to a general audience. In the process, they have discovered in the routine that converts USHCN data from hundredths of °F to tenths of °C an unintended dropping of the hundredths of °F before the conversion and rounding to the nearest tenth of °C. This did not significantly change any results since the final rounding dominated the unintended truncation. The corrected code has been used for the current update and is now part of the publicly available source.
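To show the kind of bug being described, here is a small Python sketch. It is not the GISTEMP Fortran nor the Clear Climate Code rewrite; the function names are invented, and it assumes that “dropping the hundredths” means truncating the reading to tenths of °F before converting. A reading in hundredths of °F becomes tenths of °C by subtracting 32°F (3200 hundredths) and dividing by 18, since (5/9) × (10/100) = 1/18.

```python
def to_tenths_c(hundredths_f):
    """Convert hundredths of °F to the nearest tenth of °C."""
    return round((hundredths_f - 3200) / 18)


def to_tenths_c_truncating(hundredths_f):
    """Same conversion, but discarding the hundredths digit first (the assumed bug)."""
    tenths_f = int(hundredths_f / 10)  # truncates towards zero, e.g. 3228 -> 322
    return round((tenths_f * 10 - 3200) / 18)


reading = 3228  # 32.28 °F
correct = to_tenths_c(reading)               # 2, i.e. 0.2 °C
truncated = to_tenths_c_truncating(reading)  # 1, i.e. 0.1 °C

# Largest disagreement over a wide range of readings (-50.00 °F to 120.00 °F).
worst = max(
    abs(to_tenths_c(h) - to_tenths_c_truncating(h))
    for h in range(-5000, 12001)
)
print(correct, truncated, worst)
```

Under these assumptions the two versions never differ by more than one tenth of °C, which fits the note’s remark that the final rounding dominated the truncation.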

So two lessons – 1) Free that scientific code and data. The proper peer review might save more than you think, one day. 2) Good software engineering is worth it in the case of critically important, academic software.

(Some details and references in this Wikipedia article)

A Wikipedia of English law

August 20, 2008 in External, Ideas and musings

Writing in Times Online in April 2006 the eminent Professor Richard Susskind, legal tech guru and adviser to the great and good, spelt out his vision for a “Wikipedia of English law”:

This online resource could be established and maintained collectively by the legal profession; by practitioners, judges, academics and voluntary workers. If leaders in the English legal world are serious about promoting the jurisdiction as world class, here is a genuine opportunity to pioneer, to excel, to provide a wonderful social service, and to leave a substantial legacy. The initiative would evolve a corpus of English law like no other: a resource readily available to lawyers and lay people; a free web of inter-linked materials; packed with scholarly analysis and commentary, supplemented by useful guidance and procedure; rendered intensely practical by the addition of action points and standard documents; and underpinned by direct access to legislation and case law, made available by the Government, perhaps through BAILII. A Wikipedia of English law could be an evolving, interactive, multimedia legal resource of unprecedented scale and utility. (Quick, get into wikis – before everyone else)

Now a group, led by Nick Holmes of infolaw, are implementing that vision, calling it the Free Legal Web. Read their manifesto.

Of course, this project will need a vast amount of basic legal information to be open first. Luckily, lots of people have been working on this already for the last few years. We’ve got the Statute Law Database, and OPSI publish Acts and Statutory Instruments. The licenses you get for these are good, quite free ones (anybody know if they meet the Open Knowledge Definition?)

All those sources have holes, in terms of missing historical data, timeliness of new data, and most importantly lack of structure. I hope the timeliness and historical completeness will be dealt with by the Ministry of Justice and OPSI as the Statute Law project matures. mySociety are digging away at the structure hole from the direction of new laws, with their Free Our Bills campaign, and in any case the Free Legal Web project can add the structure.

But bigger than those holes is the complete lack of case law. Even the initial court decisions are not published by the courts. They send them in secret to a separate charity called BAILII, who understand the web so badly they don’t let search engines send traffic to their website. I’ve submitted a request to OPSI’s excellent new public sector data unlocking service asking for the court decisions to be freed.

Even with that done, we’re still in a bit of trouble, as an important part of English law is entirely owned by a private charity, the Incorporated Council of Law Reporting. They publish the law reports, which contain summaries of important cases, and strongly define which case law a lawyer can refer to. They are in sore need of a new business model which lets them publish their information as open knowledge.

Finally, the most important bit is the commentary – the equivalent of all the expensive legal textbooks. I’m confident a good community can build round a Wikipedia-like resource to replace that.

So work to do, but all now so doable.

Sign up to the Free Legal Web planning barcamp on 18th October. We’re going to free the law. How empowering is that for everyone in society? Be part of it.

Free our Bills!

March 27, 2008 in Uncategorized

Free Our Bills! is a campaign led by a cheeky platypus, just escaping from the portcullis of Parliament. Sign up now, or read on…

Sometimes data being free isn’t good enough – it needs to be released in a properly structured format. If you want to reproduce the text of Bills (proposed new laws in the UK), you can get a reasonably good click-use license and go for it.

However, the PDF or HTML you get is not very intelligible to machines. For example, consider the current version of the controversial Human Fertilisation and Embryology Bill. It contains lots of amendments to an earlier 1990 act, and it is very hard to follow without being able to see how those amendments alter the earlier act. As the data isn’t structured, nobody can easily make a user interface to do this. If the Bill were published in a 21st-century way, then lots of people could and would do so. This is just one example – there are lots of other ways the data for bills and amendments could be better structured, and more timely.
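To make “structured” a bit more concrete, here is a rough Python sketch of what becomes possible once amendments are data, not prose. Everything in it is an assumption for illustration: the `Amendment` record, the section numbering and the provision text are invented, and this is not a format Parliament publishes or one the campaign has proposed.

```python
from dataclasses import dataclass


@dataclass
class Amendment:
    """One hypothetical structured amendment to an earlier act."""
    section: str  # which section of the earlier act is changed
    old: str      # the words being replaced
    new: str      # the words substituted in their place


def apply_amendments(act, amendments):
    """Return a copy of the act with each amendment applied in order."""
    amended = dict(act)  # leave the original act untouched
    for a in amendments:
        text = amended[a.section]
        if a.old not in text:
            raise ValueError(f"section {a.section} does not contain {a.old!r}")
        amended[a.section] = text.replace(a.old, a.new, 1)
    return amended


# Invented example text, not taken from any real act or bill
act = {"1(1)": "Example provision: the period is five years."}
bill = [Amendment(section="1(1)", old="five years", new="three years")]

amended = apply_amendments(act, bill)
print(amended["1(1)"])  # Example provision: the period is three years.
```

With the original and amended text both to hand, showing a reader exactly what a Bill changes is an ordinary diff away.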

It’s an esoteric campaign, but a very important one. Having good quality law is vital to all of us. So please do sign up now, and help get Parliament to publish Bills better!

Review of economics of trading funds published by UK Government

March 12, 2008 in Uncategorized

Hot off the press – the UK’s Department for Business, Enterprise and Regulatory Reform has published a review of the economics of trading funds. The review follows (I think) recommendation 9 of the Power of Information review:

Recommendation 9. By Budget 2008, government should commission and publish an independent review of the costs and benefits of the current trading fund charging model for the re-use of public sector information, including the role of the five largest trading funds, the balance of direct versus downstream economic revenue, and the impact on the quality of public sector information.

Today that is published on the BERR website as Models of Public Sector Information Provision via Trading Funds by Professor David Newbery, Professor Lionel Bently and Rufus Pollock, all of Cambridge University.

If you read any of it, please post tidbits and thoughts in the comments below!