Earlier this week we published the first in a series of posts on small data: “Forget Big Data, Small Data is the Real Revolution”. In this second post in the series, we discuss small data in more detail, providing a rough definition and drawing parallels with the history of computers and software.
What do we mean by “small data”? Let’s define it crudely as:
“Small data is the amount of data you can conveniently store and process on a single machine, and in particular, a high-end laptop or server”
Why a laptop? What’s interesting (and new) right now is the democratisation of data and the associated possibility of a large-scale, distributed community of data wranglers working collaboratively. What matters here, then, is, crudely, the amount of data that an average data geek can handle on their own machine – their own laptop.
A key point is that the dramatic advances in computing, storage and bandwidth have far bigger implications for “small data” than for “big data”. Relatively speaking, these advances have expanded the realm of small data – the kind of data an individual can handle on their own hardware – far more than they have expanded the realm of “big data”. Suddenly, working with significant datasets – those containing tens of thousands, hundreds of thousands or even millions of rows – can be a mass-participation activity.
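As a rough illustration of the scale involved, a million-row dataset can be built, written out and aggregated on an ordinary laptop using nothing beyond a scripting language’s standard library. This is only a sketch; the column names and values are invented for the example.

```python
import csv
import io
import random

# Build a synthetic "significant" dataset: one million rows of
# (country, year, value). Hypothetical data, purely for illustration.
random.seed(42)
rows = [("country-%d" % (i % 200), 1990 + i % 25, random.random())
        for i in range(1_000_000)]

# Write it out as CSV (in memory here; a file on disk works the same way).
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["country", "year", "value"])
writer.writerows(rows)

# Read it back and aggregate: mean value per country.
buf.seek(0)
reader = csv.DictReader(buf)
totals, counts = {}, {}
for row in reader:
    c = row["country"]
    totals[c] = totals.get(c, 0.0) + float(row["value"])
    counts[c] = counts.get(c, 0) + 1
means = {c: totals[c] / counts[c] for c in totals}
print(len(means))  # prints 200 (distinct countries)
```

The whole round trip – a million rows serialised, parsed and summarised – runs in seconds on commodity hardware, which is the sense in which such datasets now count as “small”.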
(As should be clear from the above definition – and any recent history of computing – small (and big) are relative terms that change as technology advances. For example, in 1994 a terabyte of storage cost several hundred thousand dollars; today it’s under a hundred. This also means today’s big is tomorrow’s small.)
Our situation today is similar to that of microcomputers in the late 70s and early 80s, or the Internet in the 90s. When microcomputers first arrived, they seemed puny in comparison to the “big” computing and “big” software then around, and there was nothing, strictly, they could do that existing computing could not. However, they were revolutionary in one fundamental way: they made computing a mass-participation activity. Similarly, the Internet was not new in the 1990s – it had been around in various forms for several decades – but it was at that point that it became available at mass scale to the average developer (and ultimately citizen). In both cases “big” kept on advancing too – be it supercomputers or high-end connectivity – but the revolution came from “small”.
This (small) data revolution is just beginning. The tools and infrastructure to enable effective collaboration and rapid scaling for small data are in their infancy, and the communities with the capacities and skills to use small data are in their early stages. Want to get involved in the small data revolution? Sign up now.
This is the second in a series of posts about the power of Small Data – follow the Open Knowledge Foundation blog, Twitter or Facebook to learn more and join the debate at #SmallData on Twitter.
9 thoughts on “What Do We Mean By Small Data”
Great article – there’s a lot to be said for pushing back against the tidal wave of big data talk and getting people to see just how much can be accomplished with “small data”. We’re working on a data platform that contributes to the empowerment of small data through its democratization, providing easy access to datasets in a form that people can use with almost any software they’re comfortable with. http://www.quandl.com/
I think as time goes on we’ll start to see some incredible uses of data that really make a difference in people’s lives, and IMO parallel the 3D printing industry in terms of how much it can change the world.
Small data is still Big Data when it requires mass participation of thinkers independently working with smaller data sets. One major difference is in the freedom to independently abstract data without as much permission or access to large data sets. I think linked-open data is a nice way to hybridize big data (which is publicly available) with more independent data abstraction processes.
It should be My Data, not Big Data or small Data. You want massively scalable data architectures as the default on end user machines, with facilities to add nodes/hardware to your endpoint. We are all Google.
I agree with other commenters, the data are only “small” when you look at them narrowly; they still emerge from an infrastructure of “big” data. “Localized” data is perhaps my favorite term. You need perspective on the larger whole, but you also need the immediate context.
I dislike the proposed definition of small data. For most people, the term “Small Data” conjures images of the polar opposite of “Big Data”. Rather than large quantities of high-variance information, small data might refer to small quantities of low-variance data. This category includes the outputs of big data analysis, which speaks to the proposed “final mile” view of the term small data.