You are browsing the archive for Rufus Pollock.

New Brazilian Data Portal dados.gov.br – powered by CKAN

May 10, 2012 in Featured, Open Government Data

Last Friday (May 4), the Ministry of Planning in Brazil launched the final version of the Brazilian Open Data Portal. In line with the federal government policy to promote the use of free software in public administration, the portal was made using only free and open source tools. Among them is the Open Knowledge Foundation’s open-source data portal software CKAN. Moreover, the whole process of development of the portal was conducted with the participation of concerned citizens in an open way to promote open data.

dados.gov.br

Opening Data Openly

The development project of the Brazilian Open Data Portal takes the concept of social participation to the extreme. From the beginning, planning meetings and development forums were open to any interested citizen, announced in advance on open discussion lists and where possible relayed via live streaming video to the Internet (webcast).

In each planning meeting, the development tasks were selected through a flexible process in which people presented their ideas of what was needed on small ticket records. At the end of each round, the tickets were grouped, categorized and prioritized. At the end of the meeting, the outcomes were recorded in a publicly accessible wiki (the INDA wiki) and a publicly visible task manager (Trac).

We engaged the participation of people from civil society and of civil servants, who collaborated in various ways. Some people were involved right through the process, while others made contributions along the way. We had contributions in the form of software development, design, and information architecture, among others. The latter began with an experimental “card sorting” conducted with the participants of the event Campus Party 2012 in Sao Paulo. This synergy between government and citizens working together for the common good is what we mean by open government.

The Portal has gone through several versions, the most important being the first (a simple HTML page with a tag cloud of catalogue data), followed by its beta, which was better prepared and documented, and then the current version, with a new set of features and extensive reference and learning material.

dados.gov.br now has 78 data sets with 849 resources. These have mostly been catalogued based on a survey of data that public bodies already publish on the Internet, but that until now were scattered and lacked a central access point where the public could find them. They are, however, the tip of the iceberg compared to what there is to be opened around public data in Brazil.

Recognizing this and the urgency in meeting the new law on access to information, the Secretariat for Logistics and Information Technology is preparing a workshop to guide public bodies on how to include their data in the catalogue. This will take place in early June.

The portal is part of a larger project called the National Infrastructure for Open Data – INDA. The general idea of INDA is to establish technical standards for open data, promote training and support public bodies in the task of publishing open data. This entire process is done through intra-government cooperation and cooperation between government and citizens, always aiming to achieve a real platform for open government.

Talk at LIFT 2012: Open Data – How We Got Here, and Where We’re Going

April 2, 2012 in Featured, Ideas and musings, Open Data, Our Work, Talks

I’m pleased to announce that the video of my talk, Open Data: How We Got Here, and Where We’re Going, that I gave a few weeks ago at the LIFT 2012 conference has now been published:

Over the past few years, there has been explosive growth in open data with significant uptake in government, research and elsewhere. Open data has the potential to transform society, government and the economy, from how we travel to work to how we decide to vote. But we have only just begun down this road, and the going, even so far, has not always been easy.

My talk introduced the idea of open data, explaining how, and why, we are where we are today, and, finally, took a look at the future of the rapidly evolving open data ecosystem.

Slides from the talk – Link to full version

Introducing the DataStore

March 27, 2012 in CKAN, Our Work, Technical

A major new feature in the DataHub is good news for data wranglers. The DataStore allows users to store and load structured data into a database, where it can be queried, filtered, or accessed from other programs via a rich data API.

The API is also used by CKAN’s inbuilt Recline Data Explorer, giving in-page previews of the data with full text search, filtering, sorting and graphing, as in the screenshots below:

[IMG: Sorting] [IMG: Graph]

These new DataHub capabilities are powered by the recently enhanced DataStore and Data API functionality of our open-source CKAN data management system, which as well as powering the DataHub runs many other data portals including data.gov.uk.
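To give a feel for what querying such a data API from another program might look like, here is a minimal Python sketch. It is an illustration under stated assumptions rather than the DataHub's documented interface: the endpoint path and the `resource_id`, `q` and `limit` parameters are modelled on the `datastore_search` action found in later CKAN releases (the API available when this post was written may have differed), and the host name and resource id are hypothetical. The response is a hand-written sample so the sketch runs without network access.

```python
import json
from urllib.parse import urlencode


def build_search_url(base_url, resource_id, text=None, limit=5):
    """Build a query URL for a DataStore-style search endpoint (assumed shape)."""
    params = {"resource_id": resource_id, "limit": limit}
    if text:
        # Full-text search term, as offered by the in-page data explorer.
        params["q"] = text
    return f"{base_url}/api/3/action/datastore_search?{urlencode(params)}"


def extract_records(response_text):
    """Return the list of rows from a JSON response, or raise if the call failed."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("data API reported a failure")
    return payload["result"]["records"]


# Hypothetical portal and resource id, for illustration only.
url = build_search_url("https://example.org", "my-resource-id", text="budget")

# Hand-written sample response; a real call would fetch `url` instead.
sample = '{"success": true, "result": {"records": [{"year": 2011, "amount": 42.5}]}}'
records = extract_records(sample)
```

Keeping URL construction and response parsing in separate functions makes it easy to swap in whichever endpoint shape a given portal actually exposes.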

An introduction to the DataStore and Data API



Announcing the Open Definition Licenses Service

February 16, 2012 in Open Content, Open Data, Open Definition, Open Knowledge Definition, Open Standards, Our Work, WG Open Licensing

We’re pleased to announce a simple new service from the Open Knowledge Foundation as part of the Open Definition Project: the (Open) Licenses Service.

open licensing

The service is ultra simple in purpose and function. It provides:

  • Information on licenses for open data, open content, and open-source software in machine readable form (JSON)
  • A simple web API that allows you to retrieve this information over the web — including using JavaScript in a browser via JSONP
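As a rough illustration of consuming this kind of machine-readable license data outside a browser, the Python sketch below unwraps a JSONP response and picks out the open licenses. The field names (`id`, `title`, `is_okd_compliant`), the callback name and the sample payload are assumptions made for the example, not a statement of the service's actual schema.

```python
import json
import re


def unwrap_jsonp(text):
    """Strip a JSONP wrapper such as callback({...}); and return the parsed JSON."""
    match = re.fullmatch(r"\s*[\w$.]+\s*\((.*)\)\s*;?\s*", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("not a JSONP response")
    return json.loads(match.group(1))


# Hand-written sample payload; field names are illustrative assumptions.
sample = (
    'callback({"cc-by": {"id": "cc-by", "title": "Creative Commons Attribution", '
    '"is_okd_compliant": true}, '
    '"other-closed": {"id": "other-closed", "title": "Other (Not Open)", '
    '"is_okd_compliant": false}});'
)

licenses = unwrap_jsonp(sample)

# Keep only the ids of licenses flagged as open in the sample data.
open_ids = sorted(lid for lid, lic in licenses.items() if lic["is_okd_compliant"])
```

A server-side program would normally request plain JSON and skip the unwrapping step; the JSONP form exists mainly so that browser scripts can load the data across domains.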

In addition to the service there’s also:

What’s Included

There’s data on more than 100 open licenses, including all OSI-approved open source licenses and all Open Definition conformant open data and content licenses. Also included are a few closed licenses as well as ‘generics’ — licenses representing a category (useful where a user does not know the exact license but knows, for example, that the material only requires attribution).

View all the licenses available »

In addition various generic groups are provided that are useful when constructing license choice lists, including non-commercial options, generic Public Domain and more. Pre-packaged groups include:

The source for all this material is a git licenses repo on GitHub. Not only does it provide another way to get the data, but it also means that if you spot an error, or have a suggestion for an improvement, you can file an issue on the GitHub repo or fork, patch and submit a pull request.

Why this Service?

The first reason is the most obvious: having a place to record license data in a machine readable way, especially for open licenses (i.e. for content and data those conforming to the Open Definition and for software the Open Source Definition).

The second reason is to make it easier for other people to include license info in their own apps and services. Literally daily, new sites and services are being created that allow users to share or create content and data. But when they do that, if there’s any intention for that data to get used and reused by others, it’s essential that the material get licensed — and preferably, openly licensed.

By providing license data in a simple machine-usable, web friendly format we hope to make it easier for people to integrate license choosers — and good license defaults — into their sites. This will provide not only greater clarity, but also more open content and data — remember, no license usually means defaulting to the most restrictive, all rights reserved, condition.
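A license chooser with a sensible default could be sketched along the following lines in Python. This is only one possible design: the `is_open` flag, the function name `license_choices` and the three sample entries are hypothetical, chosen for illustration rather than taken from the service.

```python
def license_choices(licenses, default_id):
    """Return (id, title) pairs for a chooser: default first, then open, then others."""
    by_id = {lic["id"]: lic for lic in licenses}
    if default_id not in by_id or not by_id[default_id]["is_open"]:
        raise ValueError("the default should be one of the open licenses")
    ordered = sorted(
        licenses,
        # False sorts before True: default first, open before closed, then by title.
        key=lambda lic: (lic["id"] != default_id, not lic["is_open"], lic["title"]),
    )
    return [(lic["id"], lic["title"]) for lic in ordered]


# Hypothetical catalogue entries, for illustration only.
catalogue = [
    {"id": "other-closed", "title": "Other (Not Open)", "is_open": False},
    {"id": "cc-by", "title": "Creative Commons Attribution", "is_open": True},
    {"id": "odc-pddl", "title": "Open Data Commons Public Domain Dedication and Licence", "is_open": True},
]

choices = license_choices(catalogue, default_id="odc-pddl")
```

Refusing a closed default is a deliberate choice here: it means that a site which never asks its users about licensing still ends up with openly licensed material.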

Dreams of a Unified Text

January 24, 2012 in Ideas and musings, Public Domain

The following is a blog post by Rufus Pollock co-Founder of the Open Knowledge Foundation.

I have a dream, one which I’ve had for a while.

In this dream I’m able to explore, seamlessly, online, every text ever written. With the click of a button I can go from Pynchon to Proust, from Musil to Machiavelli, from Homer to Hugo.

And in this dream not only can I read, but I myself am able to contribute, to write upon these texts — to annotate, to anthologize, to interlink, to translate, to borrow — and to share what I do with others.

I can see what others have shared, what notes they have added, what selections they have made. I can see the interweaving of these texts created by borrowing, by inspiration, by reference, all made concrete by the insight and efforts of myself and others and their ability to layer their insights freely upon those original texts — just as those writers built upon the works that had gone before them.

And while each text can still stand alone — in all its greatness or mediocrity — we have something new, a single unified corpus woven together out of this multitude of separate texts — e pluribus unum.

A whole that is a concrete instantiation in an immaterial realm of the cultural achievement of mankind as expressed in the written word.

Dream Meets Reality

Why is this dream not yet a reality? After all, don’t we have the tools and technology?

One answer is legal, one answer is technological, and one answer is social. The legal issue is copyright, at least in its current exclusive rights form.1 Copyright means this vision is only really possible for works in the public domain, which in most countries means works that are a hundred years or more old. This isn’t necessarily that big a problem, at least for texts: the public domain, though old, is already incredibly rich, so we already have more than enough material to be getting on with.

On the technology front we have the cost of digitization, processing and storage. Digitization costs are significant. This has meant that digitization activities have either been limited or the material created has not been released openly (for example, the material produced by Google’s efforts with its Books project, which is probably the largest effort to date, is not open). That said, efforts like Project Gutenberg and the Internet Archive have already made available tens of thousands of texts, and there are now several digitization projects underway that will result in even larger amounts of material being freely and openly available.

Then third we have the social issue, or rather it is a question of how technology can support the social activities required for this dream of a unified text to become real. Specifically, to realize our dream we need to bring material — texts and the writing upon them — together in a single coherent experience. Yet the centralization (and ownership) that implies may be a significant obstacle to mass participation.2 Similarly, we need it to be possible for anyone with ‘net access to be able to contribute to the weaving of the unified inter-text but, at the same time, to be able to select which contributions we want to see (if we are not to be overwhelmed by an avalanche of material, much of it possibly of dubious quality).

Conclusion

We have then, within our grasp, the realization of the dream of a unified text. Combining text and technology, we can create something truly extraordinary.

Interested in making this happen? Come join us at the Textus Project.


  1. Let me be clear, I’m not saying that copyright per se is bad or that everything should be ‘free’. Time, energy and capital are required to create books, music and films and that expenditure often needs to be recompensed. However, the current system of copyright is by no means the best way to achieve this. This is not something I wish to explore in detail here. More can be found on my personal website and in papers such as Forever Minus a Day: Theory and Empirics of Optimal Copyright

  2. This tension between distributed collaboration and centralizing tendencies of coordination and scale is a common theme in many ‘net projects. 

Two Open Knowledge Events in Cape Town: Africa@Home and Open Knowledge Meetup

November 18, 2011 in Events, Meetups, News, Open Data, Open Government Data, Open Science, Sprint / Hackday

The following post is by Francois Grey and Rufus Pollock. Francois is a recent Shuttleworth Fellow, visiting professor at Tsinghua University and coordinator of the Citizen Cyberscience Centre. Rufus is a co-Founder of the Open Knowledge Foundation.

There are two exciting open data and open knowledge events taking place in Cape Town, South Africa in the next week (in which we’ll both be participating).

First up, this Saturday and Sunday, 19-20 November 2011, we’ll be holding an Open Data and Science hackfest at the African Institute of Mathematical Sciences.

Then next Tuesday, from 6:30pm-8:30pm an Open Knowledge Meetup is being organized for those interested in Open Data, Open Content and Open Source. More details on both below.

Open Knowledge Meetup – Open Data, Open Content, Open Source

  • Event page (signup):
  • When: Tuesday, November 22, 2011, 6:30 PM
  • Where: Open Innovation Studio, 27 Buitenkant St, Cape Town, ZA 7925
  • Hashtag: #OpenMeetupCT

This meetup is for those in Cape Town interested in open data and content. This is the first in what we hope will be a regular event. Come find out about other projects and activities and share your own.

Africa@Home

  • Event page:
  • When: 19-20 November, 9am-5pm both days
  • Where: African Institute of Mathematical Sciences

What’s it all about?

Volunteers on the Web can now help researchers with a host of scientific and social challenges.

From collecting data about government spending to folding proteins to simulating the future of our planet’s climate.

The scope for citizens and schools to benefit from all this online science is enormous. But there’s a catch. This is a grassroots movement, so it needs YOUR help!

If you are a scientist, if you have programming and web-design skills you’d like to contribute to science, or if you are just passionate about the idea of volunteer science on the Web, then you should come!

Who’s organizing this?

Kindly hosted by AIMS, the African Institute of Mathematical Sciences, in Muizenberg, Cape Town, South Africa. With participation of the Department of Computer Science at UCT, the Citizen Cyberscience Centre, Connexions, the Open Knowledge Foundation, P2PU, Siyavula and SACEMA, through the support of the Shuttleworth Foundation.

What will I get out of it?

This is a two-day event, and the goal is to learn about some cool science, play with some neat software, and above all meet people with a passion for public participation in cutting-edge research. We will form teams and work together to design and produce some really nifty demos and prototypes.

The sort of projects you can work on will be based on the real needs of scientists, many of whom will be actively participating. Concretely, you might get involved in …

  • Working on a mobile-phone-based scheme for monitoring the spread of AIDS in Southern Africa.
  • Building an interface for a project that will allow anyone on the Web to help digitize historical documents.
  • Designing a course to help others create their own citizen science project on the Web.
  • Turning an online project for simulating the spread of malaria in Africa into an educational tool that teachers could use in a high-school math class.

… by all means bring your own ideas for projects to the event, as well!

Scaling the Open Data Ecosystem

October 31, 2011 in Ideas and musings, News, Open Data

This is a post by Rufus Pollock, co-Founder of the Open Knowledge Foundation. As reported elsewhere I’ve been fortunate enough to have my Shuttleworth Fellowship renewed for the coming year so that I can continue and extend my work at the Open Knowledge Foundation on developing the open data ecosystem. The following text and video formed the main part of my renewal application.

Scaling the Open Data Ecosystem

Describe the world as it is.

Over the last several decades the world has seen an explosion of digital technologies which have the potential to transform the way knowledge is disseminated. This world is rapidly evolving and one of its more striking possibilities is the creation of an open data ecosystem in which information is freely used, extended and built on. The resulting open data ‘commons’ is valuable in and of itself, but also, and perhaps even more importantly, because of the social and commercial benefits it generates — whether in helping us to understand climate change; speeding the development of life-saving drugs; or improving governance and public services.

In developing this open data ecosystem three key things are needed: material, tools and people. This is a key point: open information without tools and communities to utilise it is not enough. After all, openness isn’t an end in itself – open material has no value if it isn’t used. We need therefore to have widely available the capabilities for utilising open material, for processing, analysing and sharing it, especially on a large scale. Relevant tools need to be freely and openly available and the related infrastructure — after all tools need somewhere to run, and data needs somewhere to be stored — should be capable of effective deployment by distributed communities.

Over the last few years we’ve started to see increasing amounts of open material made available, with release of open data really starting to take off in the last couple of years. But the (open) tools and the communities to use them are still very limited — we’re just starting to see the first self-identified “data wranglers / data hackers / data scientists” (note how the terms have not settled yet!). Key architectural elements of the ecosystem, such as how we create and share data in an open componentized way, are only just beginning to be worked through. We are therefore at a key moment where we transition from just ‘getting the data’ (and building the app) to a real data ecosystem in which data is transformed, shared and reintegrated and we replace a ‘data pipeline’ with ‘data cycles’.

What change do you want to make?

I want to see a world in which open data – data that can be freely shared and used without restriction – is ubiquitous and in which that data is used to improve the world around us, whether by finding you a better route to work, helping us to prevent climate change, or improving reportage. I want open data to allow us to build the tools and systems to help us navigate and manage the increasingly complex information-based world in which we now live.

Specifically, I want to help grow the emerging open data ecosystem. While part of this involves supporting and expanding the ongoing release of material — building on the major progress of the last few years — the biggest change I want to make is to develop the tools and communities so that we can make effective use of the increasing amounts of open data that are now becoming available.

Particular changes I want to make are:

  • Development of real ‘data cycles’ (especially for government data). By data cycles I mean a process whereby material is released, it’s used and improved by the community and then that work finds its way back to the data source.
  • Greater connection of open data to journalists and other types of reporters/analysts who can use this data and bring it to a wider audience.
  • Development of an active and globally-connected community of open data wranglers.
  • Development of better open tools and infrastructure for working with data, especially in a distributed community using a componentization approach that allows us to scale rapidly and efficiently.

What do you want to explore?

I’m interested in learning more about the actual and potential user communities for open data. I want to explore what they want — in relation to both tools and data — and, also their awareness of what is already out there. I’m especially interested in areas like journalism, government, and the general civic hacker community.

I want to explore the processes around ‘data refining’, that is, obtaining, cleaning and transforming source data into something more useful, and around data ‘analysis’ (usually closely related tasks). I’m especially interested in existing business activity in this area — often labelled with headings like business intelligence and data warehousing. I want to see what we could learn from business regarding tools and process that could be used in the wider open data community as well as how the business community can take advantage of open data.

I want to explore how we can connect together the distributed community of data wranglers and hacktivists, focusing on a specific area like civic information or finances. How do we allow for loose networks across different locations and different organisations while sharing information and collaborating on the development of tools?

Lastly, I want to explore the tools and processes needed to support decentralised, collaborative, and componentised development of data. How can we build robust and scalable infrastructures? How can we build the technology to allow people to combine multiple sources of official data in a wiki-like manner – so that changes can be tracked, and provenance can be traced? How can we break down data into smaller manageable components, and then successfully recombine them again? How can we ‘package’ data and create knowledge APIs to enable automated distribution and reuse of datasets? How can we achieve real read/write status for official information – not just access alone?

What are you going to do to get there?

I want to focus my efforts in this next year on 3 key areas, breaking new ground but also building on existing work I’ve been doing with the Open Knowledge Foundation.

First, I want to build out the CKAN software and community from a registry to a data hub – a platform for working with data not just listing it. The last year has seen very significant uptake of CKAN, with dozens of CKAN instances around the world including several official government and institutional deployments. Improving and expanding CKAN will allow us to capitalize on this success to make CKAN into an essential tool and platform for open data “development”.

The most important aspect of the software side of this will be the development of a datastore component supporting the processing and visualization of data within CKAN. With features like these CKAN can become a valuable tool not just for tech-savvy data ‘geeks’ but for the more general users of data such as journalists and civil servants. Engaging this wider, “non-techy” audience is a key part of scaling up the ecosystem. It is important to emphasize that this won’t just be about developing software but is about understanding and engaging with a variety of data-user communities, exploring how they work, what they want and how they can be helped.

Second, I want to build out the OpenSpending platform and community. OpenSpending is Where Does My Money Go globalized — a worldwide project to ‘map the money’. Following the successful launch of Where Does My Money Go last autumn in the UK, in the last 6 months we have dramatically expanded coverage, with data now from more than 15 countries (in May our work on Italy received coverage in La Stampa, the Guardian and other major newspapers).

Working with OpenSpending complements work on CKAN because it is a chance to act as a data user and refiner — we already have some integration with CKAN but it’s still very basic. Furthermore, OpenSpending presents the chance to develop a specific data wrangler / data user community and one which can and should have close links with users and analysts of data including journalists and civic ‘hacker’ groups. In this way OpenSpending can act as a microcosm and prototype for developments in the wider open data community.

Third, I want to develop the OKF Open Data Labs. Much like the “Google Labs” for Google’s web services, Mozilla Labs for the Web, and the “Sunlight Labs” for US transparency websites, I would like the “Open Data Labs” to be a place for coders and data wranglers to collaborate, experiment, share ideas and prototypes, and ultimately build a new generation of open source tools and services for working with open data. The labs would form a natural complement to my other activities with CKAN and OpenSpending – the Labs could build on material and tools from those projects while simultaneously acting as an incubator for new extensions and ideas useful both there and elsewhere.

Open Data: Wishlist for the Next Year

October 23, 2011 in OGDCamp

In our closing session at Open Government Data Camp, we asked keynoters to reflect on what developments they would most like to see in the next year in relation to open government data and open data more generally. Here’s the resulting list:

  • Open Government Data as a Right
  • More Schemas (Knowledge APIs) – keep it focused, let’s not try to boil the ocean
  • Open Data as a Platform, Not a Commodity
  • Massive Interconnection Between Open Data Sites
  • Open Corporate Data (for and by Corporates)
  • Standards (e.g. for catalog metadata) for Data Portals and Data Hubs
  • Open Data for Growth – making clear the connection
  • Strong international norms for data inventories
  • Organizational identifiers – Dun & Bradstreet should be replaced with open data
  • MiData – getting personal data out of corporates and government back into the hands of the people whose data it is

Open Data: a means to an end, not an end in itself

September 15, 2011 in Ideas and musings, Open Data

The following is a post by Rufus Pollock, co-Founder of the Open Knowledge Foundation.

In almost all the talks I give about open data or content, I aim, at least once, to make a statement along the lines of:

“Openness for data and content is not an end in itself, it’s a means to an end”

This, of course, begs the question: if open data is a means and not an end in itself, what are the real ends that we are seeking?

The real ends are the improved creation, processing and use of information for the purpose of bettering our lives and the world around us — finding a better way to travel to work, understanding and addressing climate change, finding better ways to cure and prevent disease, deciding who to vote for. The list goes on and on because it includes almost anything where information, and more specifically digital information, is or could be important.

Now, there are many things that contribute to us improving the “creation, processing and use of information” but the following are especially important (and interlink):

  1. Scalability — i.e. dealing with larger and larger amounts of information
  2. Improved tools, techniques and process for handling that information
  3. Wide access to the raw data and content

(I’d also add a fourth item: to create, process and use information in a collaborative, distributed and decentralized manner that puts ‘information power’ — the power to access, understand and utilize information — in the hands of the many rather than concentrating it in the hands of the few. However, I have left this out as it could be argued that this is not a requirement for improvement but an additional, and separate, desideratum.)

It is at this point that openness enters: openness — both of data and of tools — is central to making rapid progress in each of these areas:

  1. Scalability: successful ‘data scaling’ requires componentization — the breaking up of material into maintainable chunks (components) that can be recombined. However, without openness componentization cannot function because the recombination of components will rapidly become impossible due to the need to check and clear rights with so many different sources of data (and incompatibilities between the conditions imposed by different sources).

  2. Tools, technique and process. Open data makes it much easier to develop and share tools, techniques and processes for working with data. Moreover, without open data the application of those tools can be severely limited.

  3. Wider access to the material: given the vast amount of material becoming available we’re going to want as many people as possible (and not just ‘professionals’) to be able to access, experiment with and redistribute that data as easily as possible. Remember the many minds principle: the best thing to do with your data will be thought of by someone else.
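The componentization point in the list above can be made concrete with a small Python sketch. It is purely illustrative: the `Component` type, the `recombine` function and the set of license ids treated as open are all assumptions introduced for the example, not part of any existing tool.

```python
from dataclasses import dataclass

# Hypothetical set of license ids treated as open, for illustration only.
OPEN_LICENSES = {"cc-zero", "odc-pddl"}


@dataclass(frozen=True)
class Component:
    """A maintainable chunk of data together with the license it carries."""
    name: str
    license_id: str
    rows: tuple


def recombine(components):
    """Merge components into one dataset, refusing if any needs rights clearance."""
    blocked = [c.name for c in components if c.license_id not in OPEN_LICENSES]
    if blocked:
        raise PermissionError(f"rights must be cleared for: {', '.join(blocked)}")
    return [row for c in components for row in c.rows]


spending = Component("spending", "odc-pddl", (("2011", 10),))
population = Component("population", "cc-zero", (("2011", 200),))

# With open components, recombination needs no case-by-case negotiation.
combined = recombine([spending, population])
```

With only open components the check is trivial; every additional non-open source adds a clearance step, which is the scaling problem described in the first item.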

Summing Up

Open data, then, is a means to an end not an end in itself. Openness is important to the extent it helps us do something “useful” — not because it is valuable in and of itself.

I think it’s important to emphasize this point because as the open data movement grows, we need to be clear that open data is not some magic potion that, on its own, will automatically solve problems. Fundamentally, to be useful data (open or otherwise) needs to be used: it needs individuals and institutions to analyze it and to act on that analysis, it needs companies and communities to build apps and services with it, and it needs tools and processes developed to facilitate doing those activities.

This is not to underestimate the value of openness: as argued above, it is central to making significant progress in “doing useful stuff”, but we must also avoid the trap of confusing means with ends, and thereby neglecting the many other changes that are needed if open data is to deliver full value.

DataPatterns.org: let’s collect some tricks for data wrangling!

August 4, 2011 in Ideas and musings, OKF Projects, Open Data Handbook

Friedrich Lindenberg, data wrangler and member of OKF Germany, advocates the creation of a Data Patterns book to complement the existing Open Data Manual.

How do you scrape a massive online archive? How do you fix a broken CSV file? How do you normalize entity names in a large collection of records?
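As one illustration of the kind of pattern meant here, the following Python sketch takes a first pass at the last question, normalizing entity names so that spelling variants group together. It is a rough heuristic rather than a recommended method: the suffix list is deliberately short and the function names are made up for the example.

```python
import re
import unicodedata
from collections import defaultdict

# Illustrative, far from exhaustive, list of legal-form suffixes to ignore.
SUFFIXES = {"ltd", "limited", "inc", "plc", "gmbh"}


def normalize_name(name):
    """Reduce an entity name to a crude comparison key."""
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lower-case, turn punctuation into spaces and split into tokens.
    tokens = re.sub(r"[^a-z0-9]+", " ", text.lower()).split()
    # Drop trailing legal-form suffixes such as "Ltd." or "GmbH".
    while tokens and tokens[-1] in SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def group_by_entity(names):
    """Group raw names that share the same normalized key."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize_name(name)].append(name)
    return dict(groups)


groups = group_by_entity(["ACME Ltd.", "Acme, Limited", "Beta GmbH"])
```

Keys produced this way are best treated as candidates for review, since two genuinely different organisations can collapse to the same key.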

There is a lot of practical skill in handling newly opened data, and the implicit promise of the open data movement is that we will help more people to access and re-use data. And while it would be desirable to be able to offer simple web-based tools for data wrangling, the truth is that what’s required is often a wild mix of web tools, desktop and command-line tools and programming skills.

So what we need is the other half of the Open Data Manual.

datapatterns.org will be a collaborative attempt to collect specific tips on how to code, wrangle and hack your way through messy data. The site will not be the be-all and end-all of data literacy, but will rather adopt a focussed point of view:

  • We try to provide methods that are immediately useful for coders, data journalists, researchers etc. If it doesn’t solve a data acquisition, cleanup or use problem, it can probably wait a bit.
  • Assume basic knowledge of python programming and web technologies. There are many ways to learn this, and we’d probably have a hard time trumping Zed Shaw.
  • Provide opinionated advice: it’s impossible to give a comprehensive overview of all tools, concerns or strategies relating to data and knowledge management. While it’s certainly interesting to discuss pros and cons of various technologies, it’s not always useful in practice. datapatterns.org will pick sides, and follow them through.
  • Link out. There’s no reason not to provide contextualized links instead of explaining things ourselves wherever possible.

So how will we create this? Luckily, we have at least two sources of information about data wrangling: the excellent questions on getthedata.org and our own attempts at making sense of data, e.g. in the OpenSpending project. Using these two sources of both questions and answers will probably mean we’ll start off with a slightly odd set of issues, but as with all OKF projects the answer is: bring your own! Either post questions to getthedata.org or write a chapter and commit it to the datapatterns repository on github.