DataPatterns.org: let’s collect some tricks for data wrangling!
How do you scrape a massive online archive? How do you fix a broken CSV file? How do you normalize entity names in a large collection of records?
There is a lot of practical skill in handling newly opened data, and the implicit promise of the open data movement is that we will help more people to access and re-use data. And while it would be desirable to offer simple web-based tools for data wrangling, the truth is that what’s required is often a wild mix of web tools, desktop and command-line tools, and programming skills.
So what we need is the other half of the Open Data Manual.
datapatterns.org will be a collaborative attempt to collect specific tips on how to code, wrangle and hack your way through messy data. The site will not be the be-all and end-all of data literacy, but will rather adopt a focussed point of view:
- We try to provide methods that are immediately useful for coders, data journalists, researchers etc. If it doesn’t solve a data acquisition, cleanup or use problem, it can probably wait a bit.
- Assume basic knowledge of Python programming and web technologies. There are many ways to learn this, and we’d probably have a hard time trumping Zed Shaw.
- Provide opinionated advice: it’s impossible to give a comprehensive overview of all tools, concerns or strategies relating to data and knowledge management. While it’s certainly interesting to discuss the pros and cons of various technologies, it’s not always useful in practice. datapatterns.org will pick sides, and follow them through.
- Link out. There’s no reason not to provide contextualized links instead of explaining things ourselves wherever possible.
So how will we create this? Luckily, we have at least two sources of information about data wrangling: the excellent questions on getthedata.org and our own attempts at making sense of data, e.g. in the OpenSpending project. Using these two sources of both questions and answers will probably mean we’ll start off with a slightly odd set of issues, but as with all OKF projects, the answer is: bring your own! Either post questions to getthedata.org or write a chapter and commit it to the datapatterns repository on GitHub.
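To give a flavour of the kind of pattern a chapter might cover, here is a minimal sketch for one of the questions at the top: normalizing entity names so that spelling variants of the same organisation end up under one key. It is an illustration only, not a recipe taken from the site. The function names and the `SUFFIXES` list are made up for this example, and real data will usually need a longer suffix list and some manual review.

```python
import re
import unicodedata

# Hypothetical list of legal-form suffixes to drop; extend it for your own data.
SUFFIXES = {"ltd", "limited", "inc", "plc", "gmbh", "llc"}


def normalize_name(name):
    """Reduce an entity name to a crude comparison key."""
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lowercase and turn punctuation into spaces.
    text = re.sub(r"[^\w\s]", " ", text.lower())
    # Drop legal-form suffixes and collapse runs of whitespace.
    tokens = [token for token in text.split() if token not in SUFFIXES]
    return " ".join(tokens)


def group_by_normalized(names):
    """Group the original spellings under their normalized key."""
    groups = {}
    for name in names:
        groups.setdefault(normalize_name(name), []).append(name)
    return groups
```

Calling `group_by_normalized(["Acme Ltd.", "ACME Limited"])` puts both spellings under the key `acme`. A key this crude will also merge names that merely look alike, so it is best treated as a way to propose candidate matches rather than as a final answer.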