What Do We Mean By Small Data
Earlier this week we published the first in a series of posts on small data: “Forget Big Data, Small Data is the Real Revolution”. In this second in the series, we discuss small data in more detail providing a rough definition and drawing parallels with the history of computers and software.
What do we mean by “small data”? Let’s define it crudely as:
“Small data is the amount of data you can conveniently store and process on a single machine, and in particular, a high-end laptop or server”
Why a laptop? What’s interesting (and new) right now is the democratisation of data and the associated possibility of large-scale distributed community of data wranglers working collaboratively. What matters here then is, crudely, the amount of data that an average data geek can handle on their own machine, their own laptop.
A key point is that the dramatic advances in computing, storage and bandwidth have far bigger implications for “small data” than for “big data”. The recent advances have increased the realm of small data, the kind of data that an individual can handle on their own hardware, far more relatively than they have increased the realm of “big data”. Suddenly working with significant datasets – datasets containing tens of thousands, hundreds of thousands or millions of rows can be a mass-participation activity.
(As should be clear from the above definition – and any recent history of computing – small (and big) are relative terms that change as technology advances – for example, in 1994 a terabyte of storage cost several hundred thousand dollars, today its under a hundred. This also means today’s big is tomorrow’s small).
Our situation today is similar to microcomputers in the late 70s and early 80s or the Internet in the 90s. When microcomputers first arrived, they seemed puny in comparison to the “big” computing and “big” software then around and there was nothing strictly they could do that existing computing could not. However, they were revolutionary in one fundamental way: they made computing a mass-participation activity. Similarly, the Internet was not new in the 1990s – it had been around in various forms for several decades – but it was at that point it became available at a mass-scale to the average developer (and ultimately citizen). In both cases “big” kept on advancing too – be it supercomputers or the high-end connectivity – but the revolution came from “small”.
This (small) data revolution is just beginning. The tools and infrastructure to enable effective collaboration and rapid scaling for small data are in their infancy, and the communities with the capacities and skills to use small data are in their early stages. Want to get involved in the small data forward revolution — sign up now