There have recently been several posts about what features are desirable in government data catalogues.
The Sunlight Foundation recently announced that they are planning to build on data.gov to allow “community participation so that people can submit their own data sources” (including support for adding data that is not open, such as data with noncommercial restrictions).
The City of San Francisco’s Open SF project are working on CivicDB, which is an open-source platform for helping people to access government data.
They’ve also been working on a list of Data Consumer Requirements – which includes things like:
- Downloadable data sets should be available for regular time periods (i.e., by month, year).
- Proprietary data formats, and non-malleable formats should be avoided wherever possible (i.e., Excel, PDF, etc.).
In addition to data.gov (which was launched back in May), the last few months have seen the launch of several other prominent catalogues for government data, including:
- New Zealand’s Opengovt.org.nz
  - .. an attempt to collate the many different datasets available through the New Zealand Government Departments and Local Bodies
- The USA’s IT Dashboard
  - The IT Dashboard provides the public with an online window into the details of Federal information technology investments and provides users with the ability to track the progress of investments over time.
Many of the issues being discussed are things we’ve thought about in relation to CKAN – our registry of (collections of) open data and open content.
Here are a few suggestions for those building catalogues for (open) government data based on our experience developing CKAN:
- Make the catalogue itself open!
  - By using a legal tool such as CC0, the PDDL or the ODbL to make your data catalogue’s metadata open (even if some of the data it describes isn’t), you ensure that the fruits of your hard work can be integrated with that of others! Also, by making the code open source you allow others to re-use and build on it.
  - All of CKAN’s code and data is available under an open license – which lets other projects like Infochimps use it.
- Let others download the catalogue data in bulk (not just via an API)
  - Create a regular dump of the metadata in your catalogue describing the data – so that your work can be built upon.
  - CKAN’s data dump is updated daily.
- Include information on how to get the data, and how it can be used
  - In addition to basic details such as title and description, it should be made clear how to get the data, and how it can be used. If it is in the public domain, make this explicit (or use a legal tool, such as CC0 or the PDDL). If it is available under the terms of a license, make this explicit and include the text or a link.
  - Each entry on CKAN includes a license field, which includes a drop-down menu for common open content/data licenses and tools, as well as licenses for Free/Open Source Software. There is also a free text field for any further details.
- Make it versioned!
  - If you are going to allow people to add items to or edit the catalogue, you might consider making it versioned like a wiki. This allows others to see changes that have been made to each item – which can be useful for reverting changes and otherwise keeping track of user contributions.
  - You can see the history of changes for each item on CKAN. Furthermore, CKAN’s code (and its domain model) is versioned.
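To make the last three suggestions a little more concrete, here is a minimal sketch in Python of a catalogue that records an explicit license for each entry, keeps every revision, and can dump all of its metadata in bulk. The names used (`Entry`, `Catalogue`, `dump_json`), the fields and the example URLs are made up for illustration; this is not CKAN’s actual code or API.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Dict, List


@dataclass(frozen=True)
class Entry:
    """One revision of a catalogue entry (hypothetical schema)."""
    name: str
    title: str
    url: str          # where to get the data
    license: str      # explicit license or public domain tool, e.g. "CC0"
    notes: str = ""   # free text for any further details


@dataclass
class Catalogue:
    # Maps entry name -> list of revisions, oldest first.
    history: Dict[str, List[Entry]] = field(default_factory=dict)

    def save(self, entry: Entry) -> int:
        """Store a new revision and return its number (starting at 1)."""
        revisions = self.history.setdefault(entry.name, [])
        revisions.append(entry)
        return len(revisions)

    def latest(self, name: str) -> Entry:
        return self.history[name][-1]

    def revert(self, name: str, revision: int) -> int:
        """Re-save an old revision as the newest one, so no history is lost."""
        old = self.history[name][revision - 1]
        return self.save(old)

    def dump_json(self) -> str:
        """Bulk dump of the latest metadata for every entry."""
        latest = [asdict(revs[-1]) for revs in self.history.values()]
        return json.dumps(latest, indent=2, sort_keys=True)


catalogue = Catalogue()
catalogue.save(Entry("postcodes", "Postal codes", "http://example.org/postcodes.csv", "CC0"))
catalogue.save(Entry("postcodes", "Postal codes (bad edit)", "http://example.org/x", "unknown"))
catalogue.revert("postcodes", 1)  # undo the bad edit by restoring revision 1
print(catalogue.dump_json())
```

Reverting appends a copy of the old revision rather than deleting the bad one, which is roughly how a wiki behaves: the full trail of contributions stays visible. Writing the output of `dump_json()` to a file on a schedule would give a simple regular dump.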
What features do you think are important in catalogues for open government data? We’d love to hear what you think!
Dr. Jonathan Gray is Lecturer in Critical Infrastructure Studies at the Department of Digital Humanities, King’s College London, where he is currently writing a book on data worlds. He is also Cofounder of the Public Data Lab; and Research Associate at the Digital Methods Initiative (University of Amsterdam) and the médialab (Sciences Po, Paris). More about his work can be found at jonathangray.org and he tweets at @jwyg.
Great post. Agree on all points. One additional suggestion is to allow users to browse the catalog of data sets via multiple facets of metadata. For example, users should be able to sort and filter lists of available data by country, topic, subtopic, region, source, etc., and should be able to filter by more than one facet at a time. A couple of unrelated examples of this include:
http://resource.smartdesktop.org/rescon/
and
http://tinyurl.com/3y9zd8
As the registry grows, this will allow users to find stuff more easily.
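For anyone wondering what filtering by more than one facet might look like in practice, here is a rough Python sketch. The sample records and the facet names (`country`, `topic`) are invented for illustration and do not come from any real catalogue.

```python
from typing import Dict, Iterable, List

Dataset = Dict[str, str]

# Made-up sample records; the facet names are assumptions for illustration.
DATASETS: List[Dataset] = [
    {"title": "Road traffic counts", "country": "UK", "topic": "transport"},
    {"title": "School spending", "country": "UK", "topic": "education"},
    {"title": "Transit ridership", "country": "US", "topic": "transport"},
]


def filter_by_facets(datasets: Iterable[Dataset], **facets: str) -> List[Dataset]:
    """Keep datasets whose metadata matches every requested facet value."""
    return [
        d for d in datasets
        if all(d.get(key) == value for key, value in facets.items())
    ]


def facet_counts(datasets: Iterable[Dataset], facet: str) -> Dict[str, int]:
    """Count how many datasets fall under each value of one facet."""
    counts: Dict[str, int] = {}
    for d in datasets:
        value = d.get(facet, "unknown")
        counts[value] = counts.get(value, 0) + 1
    return counts


matches = filter_by_facets(DATASETS, country="UK", topic="transport")
print([d["title"] for d in matches])
print(facet_counts(DATASETS, "topic"))
```

The per-facet counts are what a browsing interface would typically show next to each filter, so users can see how many data sets remain before they click.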
Great list. Very helpful.
A couple of other guidelines I’m trying to follow in my catalog…
Make sure segmented file sets are actually grouped together. I should be able to see that I can put Northwest financials 2008 and Northwest financials 2009 together. Ungrouped sets like this are all over data.gov.
Provide direct links to supporting datasets and shared segments. If a supporting dataset is used by more than one set (say, a list of postal codes), make it a separate entity and link each dataset to it so users can find compatible datasets.
Provide direct download links to all files. Avoid links to pages that lead you to files or force you to use a query tool to find the data.
If you’re actually hosting data:
Use platform-independent archive types. For instance, no self-extracting .exe files.
Provide at least the small sets (such as CSV) uncompressed, or just gzipped by the web server, so they can be piped directly to other web services like Google Docs without having to download, extract and push.
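As a rough sketch of why plain or gzipped CSV is so convenient for consumers, the Python below reads rows straight from a gzip-compressed byte stream without writing anything to disk. The sample data is invented, and the in-memory `payload` stands in for a real download; it is an assumption of this example that the stream would in practice be a file-like HTTP response for a directly linked `.csv.gz` file.

```python
import csv
import gzip
import io
from typing import BinaryIO, Iterator, List


def rows_from_gzipped_csv(stream: BinaryIO) -> Iterator[List[str]]:
    """Yield CSV rows from a gzip-compressed byte stream, one at a time."""
    with gzip.open(stream, mode="rt", encoding="utf-8", newline="") as text:
        yield from csv.reader(text)


# Stand-in for a download: in practice `stream` could be a file-like
# HTTP response body for a direct link to a gzipped CSV file.
payload = gzip.compress(b"postcode,region\nN1 9GU,London\n")
rows = list(rows_from_gzipped_csv(io.BytesIO(payload)))
print(rows)
```

A self-extracting archive or a multi-file zip would need a download-and-unpack step before any of this could start, which is the friction the guideline is trying to avoid.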