This is the first in a series of posts. The next post in the series is What Do We Mean by Small Data

There is a lot of talk about “big data” at the moment. For example, this is Big Data Week, which will see events about big data in dozens of cities around the world. But the discussions around big data miss a much bigger and more important picture: the real opportunity is not big data, but small data. Not centralized “big iron”, but decentralized data wrangling. Not “one ring to rule them all” but “small pieces loosely joined”.

Big data smacks of the centralization fads we’ve seen in each computing era. The thought that ‘hey, there’s more data than we can process!’ (something that has no doubt been true year-on-year since computing began) is dressed up as the latest trend, with associated technology must-haves.

Meanwhile we risk overlooking the much more important story here, the real revolution, which is the mass democratisation of the means of access, storage and processing of data. This story isn’t about large organisations running parallel software on tens of thousands of servers, but about more people than ever being able to collaborate effectively around a distributed ecosystem of information, an ecosystem of small data.

Just as we now find it ludicrous to talk of “big software” – as if size in itself were a measure of value – we should, and will one day, find it equally odd to talk of “big data”. Size in itself doesn’t matter – what matters is having the data, of whatever size, that helps us solve a problem or address the question we have.

For many problems and questions, small data in itself is enough. The data on my household energy use, the times of local buses, government spending – these are all small data. Everything processed in Excel is small data. When Hans Rosling shows us how to understand our world through population change or literacy he’s doing it with small data.

And when we want to scale up, the way to do that is through componentized small data: by creating and integrating small data “packages” rather than building big data monoliths, and by partitioning problems in a way that works across people and organizations rather than through massive centralized silos.
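As a concrete illustration of what a componentized small data “package” might look like, here is a minimal sketch: a plain-JSON descriptor that names the package, lists its resources, and declares a schema, alongside the data itself as plain CSV. The field names and layout are illustrative assumptions (loosely in the spirit of the datapackage.json idea), not a formal specification.

```python
import csv
import io
import json

# A minimal, self-describing "small data" package: a descriptor
# plus the data it describes. Names here are illustrative.
descriptor = {
    "name": "household-energy-use",
    "resources": [
        {
            "path": "energy.csv",
            "schema": {
                "fields": [
                    {"name": "month", "type": "string"},
                    {"name": "kwh", "type": "number"},
                ]
            },
        }
    ],
}

# The data itself stays as plain CSV, so any tool (including
# Excel) can read it without special software.
energy_csv = "month,kwh\n2013-01,350\n2013-02,310\n"

# "Small pieces loosely joined": a consumer needs only the
# descriptor to discover what the package contains and how to
# parse each resource.
resource = descriptor["resources"][0]
field_names = [f["name"] for f in resource["schema"]["fields"]]
rows = list(csv.DictReader(io.StringIO(energy_csv)))

print(json.dumps(descriptor["name"]))  # "household-energy-use"
print(field_names)                     # ['month', 'kwh']
print(len(rows))                       # 2
```

Because the descriptor and the data are both plain text, packages like this can be published, versioned, and integrated independently by different people and organizations, which is exactly the point of componentization.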

This next decade belongs to distributed models not centralized ones, to collaboration not control, and to small data not big data.

Want to create the real data revolution? Come join our community creating the tools and materials to make it happen — sign up here:


This is the first in a series of posts about the power of Small Data – follow the Open Knowledge Foundation blog, Twitter or Facebook to learn more and join the debate at #SmallData on Twitter.

Further Reading

  • Nobody ever got fired for buying a cluster
    • Even at enterprises like Microsoft and Yahoo most jobs could run on a single machine. E.g. the median job size at Microsoft is 14GB and 80% of jobs are less than 1TB; at Yahoo the estimated median job size is 12GB.
    • “Ananthanarayanan et al. show that Facebook jobs follow a power-law distribution with small jobs dominating; from their graphs it appears that at least 90% of the jobs have input sizes under 100 GB,” the paper states. “Chen et al. present a detailed study of Hadoop workloads for Facebook as well as 5 Cloudera customers. Their graphs also show that a very small minority of jobs achieves terabyte scale or larger and the paper claims explicitly that ‘most jobs have input, shuffle, and output sizes in the MB to GB range.’”
  • PACMan: Coordinated Memory Caching for Parallel Jobs – Ganesh Ananthanarayanan, Ali Ghodsi, Andrew Wang, Dhruba Borthakur, Srikanth Kandula, Scott Shenker, Ion Stoica

Rufus Pollock is Founder and President of Open Knowledge.

23 thoughts on “Forget Big Data, Small Data is the Real Revolution”

  1. I agree with you Rufus. Great blog post! It is time to build platform services where you don’t have silos and can integrate small datasets to analyze your problem. That will be the real success!

  2. Great stuff. I agree, decentralisation is the way forward. The current obsession with big data is frustrating: it’s about getting the right data for your purpose.

  3. I would agree that the whole “Big Data” meme is yet another bubble. Not too dissimilar to those of social media, cloud etc. Not to say that they are not relevant, but they are just an evolution of the digital space.

    People have always wanted to connect and share, so in that sense social media is simply a digital fulfillment of that human need. The cloud is also an evolution to hosted solutions. Again, not much new there. It’s been a long time since most of us have hosted our own websites, or email servers etc. Those trends will continue, but they will not revolutionize how our world works.

    You’re right, Small Data is the future. I think “Small, Open and Linked Data” (SOLD) is the future! But whereas the others are incremental evolutions of the digital space, SOLD is what has the potential to change our society: distribute creation, access, updating, linking, and most importantly control.

    Let’s learn to value small, not big. Let’s learn to value open, not proprietary. Let’s learn to value connected, not silos. I’m SOLD ;-)

  4. Rufus, great post!

    However, I’m pretty sure that big data platforms and technologies, while crossing the chasm, are also becoming more and more “democratic” than we can imagine. Isn’t it true that big data technologies, until recently, were only available to very large corporations? And isn’t it true that today Hadoop platforms can be installed, scaled and used in a matter of a few clicks by almost anyone?
    It seems to me that all the platform vendors, with few exceptions, are committed to pursuing the open source approach and to elevating the technology to enterprise grade, ready for mass adoption. There are vendors who already predict that in 2013 the “Big” prefix will disappear and we will just talk about Data.
    Therefore, as you say, the point is not Big Data versus Small Data but open versus closed data, and once it’s open it’s all about quality and affordability.

  5. Terminology aside, while there is plenty of innovation in big data visualisation and new tools, most ‘small data’ is created in Excel and communicated in PowerPoint – tools that are pretty old and don’t really fit the workflows people use today. I just wrote a blog on this subject (check out my posterous blog), but anyone interested should have a look at SharpCloud!

  6. Dear Rufus

    Thanks for this blog post. I agree with you that big data can reveal much more information than we may be willing to admit or be aware of.

    For instance, even a blog like this can, if we analyze reactions, social sharing and comments, reveal plenty of information about the organization, its clients (here supporters :-) ) and so forth.

    I have tried to demonstrate this by just benchmarking this blog and some small data.

    @rufuspollock:twitter thanks for sharing.

    Urs (see small data here)
    http://blogrank.cytrap.eu/rank/blog.okfn.org

    PS. I also like small data because it is far less of a nightmare in cases of data breach. Hence, we are more able to protect user rights and their privacy.

    Something we should maybe also keep in mind?

  7. Good article, but your excessive use of bold text is irritating to say the least. Your readers are supposed to be smart, they deserve better than this.

  8. Thanks for the post. Good points.

    The latest edition of the Datastories podcast, run by some eminent information design people, partly touches on this issue. http://datastori.es/data-stories-21-visualization-save-the-world/

    While the episode is about information design work for NGOs, they mention that working with big data assumes one can find an answer in the data, rather than beginning with a question.

    When starting out with a question, one needs to look at which “small data” ( and maybe the big data, depending on what one wants to find) could be relevant.

    Hence, bragging about how “big” one’s data is, is a bit like bragging about how many features one’s phone has, rather than how useful it is.

    Of course, all data has its place – and the democratic access to this is important, as Rufus mentioned – and it’s a good question of whether to start with opening data, or thinking of good questions to ask, and then seeing which data one should open.

    Perhaps one answer is to get as many domain-experts – and people in general – into the discussion, so they can come up with good questions leading to relevant data being opened.

  9. Love the post/topic! The way I’ve been thinking about Small Data is as the ‘last mile’ of Big Data – we (often) need Big Data behind the scenes, but the trick is to provide simpler, more consumer-style apps and tools at the front end that work on any device, foster social sharing and help non-technical users turn insights into actions…that are actually helpful in the moment. Readers can see my latest thoughts here: http://www.digitalclaritygroup.com/blog/small-data-goes-big-time/

    cheers,
    Allen

  10. Agreed. “small data” may be a harder problem to tackle than Big Data though, similar to how organisations learned to deal with writing and maintaining big software systems in-house, but when it came to distributing and reusing functionality, the problem becomes quite thorny (think Python’s setuptools, pip; win32’s DLL hell; Ruby’s problems with multiple gem versions on 1 system).

    Another problem which may arise with Big Data is the ability to co-operate – in the software industry, before the API age there wasn’t really a way to maximise interoperability “out-of-the-box”. And it’s still difficult to get an API quite right.

    It is, however, a worthwhile problem, certainly.

    (Another related topic is that software and the software industry have really mostly been about data…)

  11. Hi,

    I read your article and a few things stuck out as odd. The revolution talk is all nice, but you didn’t explain why big data is unprofitable, nor how you would reach a consensus about data formats in a decentralized environment.

    Small data has a tendency to fragment and lose meaning as data formats permute. Why mention Excel as an example when it’s produced by a company with a history of proprietary format lock-in? From your examples it seems like you encourage a movement of “keep track of your household expenses” and “know what you’re doing”, but those are not exactly revolutionary concepts.

    You mention “componentization”, but how do you make sure that components have any meaningful connection? What is a “data package”? Is it like an XML-file or a zip-file? How do you define the data exchange protocols if all data permutes faster than your consensus grows?

    If all data is decentralized and sporadic, wouldn’t there be big business opportunities in organizing the data and making it searchable? You know, like Google and The Web. What about selling redundancy for data that could go missing? Or selling a package of related data in a chunk, so you’d know that you have all the data you need for a specific purpose.

    What do you do with the email addresses from the people that sign up at the bottom of your article? Do you make any kind of money from those?

    1. I forgot. One of the “poster children” of big data would be Google Translate, which tries to use so much data that grammar becomes a “soft problem”.

  12. Everybody seems to have a different definition of “Big Data”. Some see it as a single huge repository of information that can tell you the meaning of Life, the Universe, and Everything. That model dates back to the mainframe era.

    Just as that system gave way to a distributed system of computing, data will become more local. It’s still “Big Data”, you’ll just get data that’s more relevant to the end user. But “Local Data” or “Your Data” doesn’t quite have the same awe-inspiring, jaw-dropping, come-to-Jesus effect, does it?

  13. I believe that our ultimate goal is to reuse already-generated knowledge rather than reinvent the wheel.
    Experience of working within P2P environments shows the vital importance of indexing decentralised knowledge and data. “Loosely joined” packages would be difficult to find unless they are organised within the Unified Conceptual Space:
    http://confocal-manawatu.pbworks.com/w/page/62073491/Why%20Unified%20Access%20to%20Information%20is%20Required
    and linked in the way similar to this:
    http://confocal-manawatu.pbworks.com/w/page/67722926/Integrated%20Virtual%20Associative%20Network
    I would appreciate joining a group interested in development of “small data” initiative under the principles similar to IVAN and UCS.
    Please visit the prototype of IVAN as an example of how a set of about 15,000 packages of “small data” is arranged, with each package (“sense domain”) accessible in seconds:
    http://confocal-manawatu.pbworks.com/w/page/68435296/What%20is%20noaSphere
    I would be happy to help you get started.
    Dimitri

Comments are closed.