Notes from the Big Clean in Prague
The following is a guest post from Jindrich Mynarz at the National Technical Library in Prague, Czech Republic, member of the OpenData.cz initiative, and one of the organizers of the Big Clean in Prague.
On Saturday, March 19th, the Big Clean workshop took place as a twin event in two cities, Prague (Czech Republic) and Jyväskylä (Finland). Dedicated to the practice of improving the quality of data available from the websites of public sector institutions, it was a preliminary event preceding the global Big Clean that is planned for this summer.
The Open Knowledge Foundation provided the workshop’s website and the #okfn IRC channel, which let participants in both cities stay in touch throughout the event. ScraperWiki also provided direct support on this channel. A second communication channel was set up on Twitter, using the @BigCleanCZ account and the #bigcleancz hashtag.
The number of attendees took us by surprise. We had 80 participants, of whom roughly 15 took part in the afternoon hacking session. The workshop attracted a varied audience, and it was an uncommon chance for quite different groups of people, such as web developers and journalists, to meet.
(by Pavel Farkaš)
Throughout the day the workshop offered two parallel tracks. One track was aimed at the active involvement of the participants, whereas the other provided talks and practical demonstrations. While one part of the programme was dedicated to data acquisition and cleansing, and as such required a tech-savvy audience, the other introduced easy-to-use tools that can be applied to existing structured data.
The main goal of the workshop was to teach the participants by example how to take unstructured data available on the websites of public sector institutions, turn it into clean and structured data, and publish the results for others to re-use. This lowers the barrier to using such data: if the data is already in a structured format, its users do not need to convert it first to make it suitable for processing with various software tools. This simplifies access to raw, structured public sector data.
The programme was designed to mimic a cycle that starts with data acquisition and cleaning, continues with light-weight data analysis, and ends with the use of data as a source for data-driven journalism. With these steps we wanted to demonstrate how to distill data from public sector websites and turn it into stories or source material for a journalistic article. During the day, we went through this cycle with example data on air pollution published by the Czech Hydrometeorological Institute. We demonstrated how to extract this data from the website where it is published, how to measure its quality, and how to project it onto a map.
As part of the Big Clean there was a discussion about the disclosure of open government data in the Czech Republic. Various national-level organizations were introduced, including OpenData.cz, the initiative for a transparent data infrastructure of the public sector in the Czech Republic. The discussion created an opportunity to meet like-minded people, to discover existing efforts in the Czech open data space, and to find available tools and projects built with public sector data.
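To give a feel for the extraction step, here is a minimal sketch of a screen-scraper in Python that turns an HTML table into a list of records. It is not the code used at the workshop: the sample table, its column names, and the helper names `TableExtractor` and `table_to_records` are all made up for illustration, and a real page would first have to be downloaded and would likely need more careful handling.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every table cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being read, or None
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_records(html):
    """Treat the first table row as the header and return a list of dicts."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical sample page; station names and values are invented.
SAMPLE_HTML = """
<table>
  <tr><th>station</th><th>pm10</th></tr>
  <tr><td>Station A</td><td>41</td></tr>
  <tr><td>Station B</td><td></td></tr>
</table>
"""

records = table_to_records(SAMPLE_HTML)
print(records)
```

The empty cell for the second station is kept as an empty string rather than dropped, so that gaps in the source remain visible in the structured output.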
The hands-on track started with an introduction to ScraperWiki, showing how to acquire structured data by writing screen-scrapers. Next was a demonstration of Google Refine, a powerful yet approachable tool for cleansing tabular data. The session continued with a talk on data quality, which discussed the factors that influence it and the ways in which it can be measured.
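One simple way to put a number on data quality is completeness, the share of records that actually have a value in a given field. The sketch below is only an illustration of that one measure, not the method presented in the talk; the function name `completeness` and the sample records are assumptions made for this example.

```python
def is_filled(value):
    """A value counts as filled if it is not None and not blank text."""
    return value is not None and str(value).strip() != ""


def completeness(records, fields):
    """Return, for each field, the fraction of records with a filled value."""
    total = len(records)
    result = {}
    for field in fields:
        filled = sum(1 for record in records if is_filled(record.get(field)))
        result[field] = filled / total if total else 0.0
    return result


# Invented sample records with two missing measurements.
sample = [
    {"station": "Station A", "pm10": "41"},
    {"station": "Station B", "pm10": ""},
    {"station": "Station C", "pm10": None},
    {"station": "Station D", "pm10": "17"},
]

scores = completeness(sample, ["station", "pm10"])
print(scores)
```

Completeness is only one of several possible measures; checks for valid value ranges or consistent units would follow the same pattern of counting records that pass a test.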
The workshop continued with talks on how structured data can be used and built upon. The intent was to show how to pre-process, filter, combine, and present the data in a story or in an information visualization. Tools such as Yahoo! Pipes, Yahoo! Query Language, and Google Fusion Tables were introduced. A brief showcase of social network analysis was given, using a graph of the voting of Prague’s local councilors as an example.
The Big Clean’s programme included talks by journalists about the emerging practice of data-driven journalism. There was a discussion about the new role journalists may adopt, that of a provider of interpretation and presentation of data. This novel approach was contrasted with traditional journalism, where data is lost in the course of writing an article.
Information visualization is a domain where the need for raw data can be seen with great clarity. Journalists are starting to grasp that they need to be able to present data in an accessible visual format, for example by plotting it on a map. To illustrate the point, we were shown a visualization of gambling clubs in the city of Brno that violate the directive prohibiting them in close proximity to schools.
We started the afternoon hacking session with a video introduction to ScraperWiki recorded by Francis Irving, one of the ScraperWiki developers. Having seen the video, the participants split into small groups to write screen-scrapers in their programming language of choice. Each group took care of one of the pre-selected datasets. For instance, we scraped the register of public collections and the list of insolvency administrators. As a result, a couple of screen-scrapers were written, all of which are available on the workshop’s website and on ScraperWiki under the bigclean tag.
One of the outputs of the Big Clean is an information visualization of public collections registered in the Czech Republic. Based on the data provided by one of the screen-scrapers written during the workshop, it displays a map showing where the public collections took place. In one of the early versions of this visualization, screened at the end of the workshop, there were clearly recognizable clusters in three areas. At first sight the reason for the clustering was unclear, but when the data was laid out on a timeline visualization we discovered it: floods. This example shows that having structured data enables us to discover hidden relationships, such as the causes that led to the public collections being announced.
It might not be obvious why we need public sector data in an open, structured format. We might not even know what such data can be good for. This is why we wanted to use the Big Clean to show by example why public sector data should be publicly available as raw data, and what types of use such data makes feasible.
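The kind of time-based view that exposed the pattern can be approximated with very little code once the data is structured. The following sketch counts records per month and picks out the busiest one. The dates are invented and do not come from the register of public collections, and `count_by_month` is a hypothetical helper, not part of the workshop's visualization.

```python
from collections import Counter
from datetime import date


def count_by_month(dates):
    """Return (year, month) buckets with their counts, in chronological order."""
    counts = Counter((d.year, d.month) for d in dates)
    return sorted(counts.items())


# Invented announcement dates, for illustration only.
sample_dates = [
    date(2000, 1, 12),
    date(2000, 6, 3),
    date(2000, 6, 9),
    date(2000, 6, 21),
    date(2000, 7, 2),
    date(2000, 11, 30),
]

monthly = count_by_month(sample_dates)
busiest = max(monthly, key=lambda item: item[1])

for (year, month), count in monthly:
    # A crude text timeline: one mark per record in that month.
    print(f"{year}-{month:02d} {'#' * count}")
```

A spike in one month, like the three records in the sixth month of this made-up sample, is the sort of signal that prompts a closer look at what happened at that time.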