DataPatterns.org: let’s collect some tricks for data wrangling!
How do you scrape a massive online archive? How do you fix a broken
CSV file? How do you normalize entity names in a large collection of
There is a lot of practical skill in handling newly opened
data, and the implicit promise of the open data movement is that we
will help more people to access and re-use data. And while it would
be desirable to be able to offer simple web-based tools for data
wrangling, the truth is that what’s required is often a wild mix of
web tools, desktop and command-line tools and programming skills.
So what we need is the other half of the Open Data Manual.
datapatterns.org will be a collaborative attempt to collect specific tips on how to code, wrangle and hack your way through messy data. The site will not be the be-all and end-all of data literacy, but will rather adopt a focussed point of view:
- We try to provide methods that are immediately useful for coders,
data journalists, researchers etc. If it doesn’t solve a data acquisition, cleanup or use problem, it can probably wait a bit.
- Assume basic knowledge of Python programming and web technologies.
There are many ways to learn this, and we’d probably have a hard
time trumping Zed Shaw.
- Provide opinionated advice: it’s impossible to give a
comprehensive overview of all tools, concerns or strategies relating
to data and knowledge management. While it’s certainly interesting
to discuss pros and cons of various technologies, it’s not always
useful in practice. datapatterns.org will pick sides, and follow
- Link out. There’s no reason not to provide contextualized links
instead of explaining things ourselves wherever possible.
So how will we create this? Luckily, we have at least two sources of
information about data wrangling: the excellent questions on
getthedata.org and our own attempts at making sense of data, e.g. in the OpenSpending project. Using these two sources of both questions and answers will probably mean we’ll start off with a
slightly odd set of issues, but as with all OKF projects the answer is: bring your own! Either post questions to getthedata.org or write a chapter and commit it to the datapatterns repository on GitHub.