When data.gov.uk was launched, I had a quick browse around the data, to get a feel for what was in it. Most data sets that I randomly looked at were from statistics.gov.uk (from the Office for National Statistics).
Today, I decided to investigate, and work out some basic statistics about the source of the data. Hopefully this will help find what the interesting new data sets are.
I secretly hoped that I’d have to screenscrape data.gov.uk to work this out. Irony. Luckily, a comment on this blog revealed that there is a handy data dump of all the CKAN data behind data.gov.uk in CSV and JSON formats.
I downloaded the JSON file (21st January 2010 dump) and used basic Unix text processing commands such as grep, sort and uniq to do some calculations.
How many data sets are there, and what protocol do their downloads use?
First I did some basic counts, to check how many data sets had a download link, and which protocol the link used.
Normal HTTP (http://) – 2623 data sets
Secure HTTP (https://) – 178 data sets
No download URL (download_url in the .json dump) – 78 data sets
Total – 2879 data sets
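I did these counts with grep, sort and uniq, but the same tally can be sketched in Python. This is only an illustration, not the commands I actually ran: it assumes the dump has already been parsed into a list of dicts, each with an optional download_url field, and the sample records and the count_protocols name are made up for the example.

```python
from collections import Counter
from urllib.parse import urlparse


def count_protocols(packages):
    """Tally download links by URL scheme; records with no link count as 'none'."""
    counts = Counter()
    for package in packages:
        # Assumption: a missing, null or blank download_url all mean "no link".
        url = (package.get("download_url") or "").strip()
        scheme = urlparse(url).scheme.lower() if url else ""
        counts[scheme or "none"] += 1
    return counts


# Hypothetical records standing in for entries from the JSON dump.
sample = [
    {"name": "a", "download_url": "http://www.statistics.gov.uk/a.csv"},
    {"name": "b", "download_url": "https://www.hesonline.nhs.uk/b"},
    {"name": "c", "download_url": ""},
    {"name": "d"},
]
protocol_counts = count_protocols(sample)
for scheme, count in protocol_counts.most_common():
    print(scheme, count)
```

On the real dump you would load the file with json.load first; whether the top level is a list or a dict keyed by package name would need checking against the file itself.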
What are the top level domain names of the data sets?
The data sets that have a download URL are distributed across the following top level domains.
.gov.uk – 2009 data sets
.nhs.uk – 412 data sets
.co.uk – 114 data sets
.org.uk – 79 data sets
.org – 78 data sets
.mod.uk – 34 data sets
.net – 25 data sets
.ac.uk – 14 data sets
.com – 9 data sets
.police.uk – 5 data sets
other (IP address, or not a fully qualified domain name) – 21 data sets
Total – 2801 data sets
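A breakdown like the one above comes down to matching each download host against a list of suffixes. Here is a rough Python sketch of that idea, again not the commands I actually used: the suffix list is simply copied from the table, the classify_host name and the example URLs are invented, and anything that matches no suffix (a bare IP address or an unqualified host name) falls into "other".

```python
from collections import Counter
from urllib.parse import urlparse

# Suffixes taken from the table above; none of them is a tail of another,
# so the order of checking does not matter here.
SUFFIXES = [
    ".gov.uk", ".nhs.uk", ".co.uk", ".org.uk", ".mod.uk",
    ".ac.uk", ".police.uk", ".org", ".net", ".com",
]


def classify_host(url):
    """Return the matching suffix for a download URL, or 'other'."""
    host = (urlparse(url).hostname or "").lower()
    for suffix in SUFFIXES:
        if host.endswith(suffix):
            return suffix
    return "other"


# Hypothetical URLs, one per interesting case.
examples = [
    "http://www.statistics.gov.uk/x",
    "https://example.org/y",
    "http://example.org.uk/z",
    "http://192.0.2.10/data.csv",
    "http://intranet/data",
]
suffix_counts = Counter(classify_host(url) for url in examples)
print(suffix_counts)
```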
Top ten sites the data sets are from
Here are the top domains that download links on data.gov.uk go to. I removed any www from them before analysis, to make sure URLs with and without www were counted together.
257 statistics.gov.uk
245 neighbourhood.statistics.gov.uk
231 hesonline.nhs.uk
176 fti.communities.gov.uk
173 communities.gov.uk
150 wales.gov.uk
125 dcsf.gov.uk
110 scotland.gov.uk
106 nomisweb.co.uk
95 hmrc.gov.uk
The first thing to notice is that, even including its neighbourhood section, statistics.gov.uk still only accounts for about 18% of the total number of data sets. So there is lots else to find in there!
The full table is available here as a file: domain-counts.txt. There are 114 different domains.
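For anyone who wants to redo the ranking, stripping the www and counting hosts can be sketched like this in Python. It is a sketch under assumptions rather than my original shell pipeline: normalise_host, top_domains and the example URLs are made up, and only a leading "www." is removed. The last lines just check the 18% figure from the counts in the table.

```python
from collections import Counter
from urllib.parse import urlparse


def normalise_host(url):
    """Lower-case the host and drop a leading 'www.' so both forms count together."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def top_domains(urls, n=10):
    """Return the n most common normalised hosts as (host, count) pairs."""
    hosts = (normalise_host(url) for url in urls)
    return Counter(host for host in hosts if host).most_common(n)


# Hypothetical download URLs.
urls = [
    "http://www.statistics.gov.uk/a",
    "http://statistics.gov.uk/b",
    "http://www.statistics.gov.uk/c",
    "http://neighbourhood.statistics.gov.uk/d",
    "https://www.hesonline.nhs.uk/e",
    "https://www.hesonline.nhs.uk/f",
]
ranking = top_domains(urls)
print(ranking)

# Share of statistics.gov.uk plus its neighbourhood section, using the
# counts from the table and the 2801 data sets that have a download URL.
share = (257 + 245) / 2801
print(f"{share:.1%}")
```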
What license do the data sets have?
Update: in fact data.gov.uk has its own set of terms and conditions which cover all the datasets on the site. These terms are OKD-compliant as they allow anyone to freely use, reuse and redistribute the data. It would be nice for the license field to reflect this though.
Most are marked as being in a straightforward “crown copyright” section. I’d like to see some work on the licensing, to use more standard licenses, or a new OKD-compliant license, where possible.
Non-OKD Compliant::Crown Copyright – 2871 data sets
OKD Compliant::UK Click Use PSI – 8 data sets
And a question for you
What interesting data sets have you spotted while browsing about data.gov.uk? Has anything sparked an idea for an application? Have you used any of the new data sets?
Please post in the comments!
CEO of ScraperWiki. Made several of the world's first civic websites, such as TheyWorkForYou and WhatDoTheyKnow.
Francis, really nice summary and glad you found the data dumps useful :)
One important “correction”: all the datasets/packages on data.gov.uk are licensed in an OKD-compliant manner thanks to the data.gov.uk terms and conditions which apply to all the datasets on data.gov.uk.
You are quite right though that this should be reflected in the license field of each dataset. The current value of the license field is a legacy of the pre-public launch phase I believe and is one of those minor bugs that slips through the release process!
Thanks for the analysis of the released datasets. Notice though that the data.gov.uk terms and conditions are a Crown Copyright waiver (what’s that? Ed.) and not a proper licence with definitions of terms, and clarity on liability, derivation, applicable law and termination. At the moment these T&Cs are open to wide interpretation and also revocable, and so it is urgent that we get a CC-compliant licence out ASAP.
Never look a gift horse in the mouth: it is good that this data is out there. But someone should have seen the licence question coming, especially given the OpenStreetMap debates on the Open Database Licence. For now I wouldn’t want to commercialise on the basis of these T&Cs as it is just not secure enough legally.
I’m curious: what metadata do the datasets have? Does it follow some sort of standard?