The following guest post is by Boyan Yurukov, blogger and open government data activist.
At the beginning of 2011 the Bulgarian government released some open data on www.parliament.bg. Visitors could export information on bills and members of parliament as XML or CSV. They could also download the votes of individual MPs or parliamentary groups as Excel files. While that data was useful and an important step forward, I found problems in the format and in the exported files. The website also held a lot more information that could not be exported as open structured data.
So I started a project to scrape the website, fix the available data, and refine, enrich and link it. After several versions of the schema, the final dataset was released at the beginning of December. It contains over 11,000 data points and over 1.12 GB of data. The items are as follows:
- Profiles for each MP – general biography, previous parliamentary terms, participation in parliamentary groups, committees, “friendship groups” and delegations, supported bills, absences, external consultants, questions during plenary meetings.
- Data on bills – laws, legislative proposals, decisions and official declarations.
- Parliamentary groups and committees – current members and member history, proposed bills, external consultants, meeting schedule, agenda, transcripts and reports.
- Parliamentary delegations and “friendship groups” – current members and member history.
- Parliamentary sittings – program for the sitting with questions and legislative proposals; transcripts; voting history for each MP on each discussion point.
- Parliamentary procurements – description, topic, procurement registry code.
The dataset can be downloaded as two ZIP files together with the XSD schema. The scraping scripts are also open-sourced on GitHub. You can find all this open Bulgarian Parliament data on the DataHub.
Although refined, this data is not without its flaws. Some historical data on MPs’ biographies and questions is missing. Also, the transcripts are not structured but in free text, which makes them almost impossible to parse. There is some hope that the parliamentary administration will release the transcripts in XML, but I’m not holding my breath. Currently the transcripts go back 20 years, and those going back to the ’70s are being parsed and will be released soon. All other data goes back to 2001, except individual votes, which go back to 2009.
This data can be quite useful for parliamentary journalism, but in itself it consists only of raw XML files. This is why another project is being set up to build a platform for analyzing and visualizing the refined dataset. It will be targeted at data journalists and visualization experts. It is sponsored by the Institute for Public Environment Development, and all results will be released as open data. I hope the first beta will come out in the first quarter of 2012.