When data.gov.uk was launched, I had a quick browse around the data, to get a feel for what was in it. Most of the data sets I picked at random were from statistics.gov.uk (from the Office for National Statistics).
Today, I decided to investigate, and work out some basic statistics about the source of the data. Hopefully this will help find what the interesting new data sets are.
I secretly hoped that I’d have to screenscrape data.gov.uk to work this out. Irony. Luckily, a comment on this blog revealed that there is a handy data dump of all the CKAN data behind data.gov.uk in CSV and JSON formats.
I downloaded the JSON file (21st January 2010 dump) and used basic Unix text processing commands such as grep, sort and uniq to do some calculations.
How many data sets are there, and which protocol do their downloads use?
First I did some basic counts, to check how many data sets had a download link, and which protocol each link used.
Normal HTTP (http://) – 2623 data sets
Secure HTTP (https://) – 178 data sets
No download URL (download_url in the .json dump) – 78 data sets
Total – 2879 data sets
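For anyone who would rather reproduce this kind of count in Python than with grep, sort and uniq, here is a minimal sketch. It assumes the dump has been loaded into a list of dictionaries, each with a download_url key (the field named above). The function name count_protocols and the sample records are made up for illustration, and the real dump's layout may differ.

```python
from collections import Counter
from urllib.parse import urlparse


def count_protocols(packages):
    """Count data sets by the URL scheme of their download link.

    `packages` is assumed to be a list of dicts with a "download_url" key.
    """
    counts = Counter()
    for package in packages:
        url = (package.get("download_url") or "").strip()
        if not url:
            # Missing or empty download link
            counts["none"] += 1
        else:
            # urlparse gives "http", "https", etc.; fall back if no scheme
            counts[urlparse(url).scheme.lower() or "unknown"] += 1
    return counts


# Tiny invented sample standing in for the real dump
sample = [
    {"download_url": "http://www.statistics.gov.uk/a.csv"},
    {"download_url": "https://example.nhs.uk/b.xls"},
    {"download_url": ""},
]
print(count_protocols(sample))
```

Run against the real dump, the three buckets should correspond to the http, https and "no download URL" lines above, provided the assumed layout holds.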
What are the top level domain names of the data sets?
The data sets that have a download URL are distributed across the following top-level domains.
.gov.uk – 2009 data sets
.nhs.uk – 412 data sets
.co.uk – 114 data sets
.org.uk – 79 data sets
.org – 78 data sets
.mod.uk – 34 data sets
.net – 25 data sets
.ac.uk – 14 data sets
.com – 9 data sets
.police.uk – 5 data sets
other (IP address, not a fully qualified domain name) – 21 data sets
Total – 2801 data sets
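The grouping above can be approximated in Python roughly as follows. This is a sketch under my own assumptions, not the original grep pipeline: .uk hosts are bucketed by their last two labels (so .gov.uk and .nhs.uk stay separate), everything else by its last label, and IP addresses or single-label hosts go into "other". The function name suffix_of is hypothetical.

```python
import ipaddress
from urllib.parse import urlparse


def suffix_of(url):
    """Return a bucket such as ".gov.uk" or ".org" for a download URL."""
    host = urlparse(url).hostname or ""
    try:
        # Bare IP addresses have no domain suffix at all
        ipaddress.ip_address(host)
        return "other"
    except ValueError:
        pass
    labels = host.split(".")
    if len(labels) < 2:
        # Empty or single-label host, so not a fully qualified name
        return "other"
    if labels[-1] == "uk" and len(labels) >= 3:
        # Assumption: keep the second-level label for .uk hosts
        return "." + ".".join(labels[-2:])
    return "." + labels[-1]


print(suffix_of("http://www.statistics.gov.uk/data.csv"))
print(suffix_of("http://192.0.2.10/data.csv"))
```

Feeding every download URL through suffix_of and tallying the results with collections.Counter would give a table of the same shape as the one above.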
Top ten sites the data sets are from
Here are the top domains that download links on data.gov.uk go to. I removed any leading www from them before analysis, to make sure URLs with and without www were counted together.
The first thing to notice is that, even including its neighbourhood section, statistics.gov.uk still only accounts for about 18% of the total number of data sets. So there is lots else to find in there!
The full table is available here as a file: domain-counts.txt. There are 114 different domains.
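The www-stripping and per-domain counting can be sketched in Python like this. It is a rough equivalent of sort and uniq -c rather than the commands actually used, and the names normalised_host and top_domains, as well as the example URLs, are invented for illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def normalised_host(url):
    """Return the host of a URL, lower-cased, without a leading "www."."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def top_domains(urls, n=10):
    """Return the n most common hosts as (host, count) pairs."""
    counts = Counter(normalised_host(u) for u in urls if u)
    return counts.most_common(n)


# Invented URLs: the first two should be counted as the same site
urls = [
    "http://www.statistics.gov.uk/a.csv",
    "http://statistics.gov.uk/b.csv",
    "https://example.nhs.uk/c.xls",
]
for host, count in top_domains(urls):
    print(count, host)
```

Only an exact leading "www." is removed here; other prefixes such as www2 are left alone, which may or may not match what the original analysis did.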
What license do the data sets have?
Update: in fact data.gov.uk has its own set of terms and conditions which cover all the data sets on the site. These terms are OKD-compliant as they allow anyone to freely use, reuse and redistribute the data. It would be nice for the license field to reflect this, though.
Most are marked as being in a straightforward “crown copyright” section. I’d like to see some work on the licensing, to use more standard licenses, or a new OKD-compliant license, where possible.
Non-OKD Compliant::Crown Copyright – 2871 data sets
OKD Compliant::UK Click Use PSI – 8 data sets
And a question for you
What interesting data sets have you spotted while browsing about data.gov.uk? Has anything sparked an idea for an application? Have you used any of the new data sets?
Please post in the comments!