This is a joint blog post by Francis Irving, CEO of ScraperWiki, and Rufus Pollock, Founder of the Open Knowledge Foundation. It’s being cross-posted to both blogs.
Content Management Systems, remember those?
It’s 1994. You haven’t heard of the World Wide Web yet.
Your brother goes to a top university. He once overheard some geeks in the computer room making a ‘web site’ consisting of a photo tour of their shared house. He thought it was stupid; Usenet was so much better.
The question – in 1994 did you understand what a Content Management System (CMS) was?
In the intervening years, CMS’s have gone through ups and downs.
Building massive businesses, crashing in the .com collapse. Then a glut, web design agencies all building their own CMS in the early noughties. Ending up with the situation now.
A mature market, commoditised by open source WordPress. Anyone can get a page on the web using Facebook. There’s still room for expensive, proprietary players, newspapers custom make their own, and businesses have fancy intranets.
Data Management Systems, time to meet them!
It’s 2012. You’ve just about heard of Open Data.
Your nephew researches the Internet at a top university. He says there’s no future in Open Data, no communities have formed round it. Companies aren’t publishing much data yet, and Governments are publishing the wrong data, reluctantly.
The question – what is a Data Management System (DMS)?
There isn’t a very good one yet. We’re at round about where CMS’s were in the mid 1990s. Most people get by fine without them.
Just as then we wrote HTML in text files by hand and uploaded it by FTP, now we analyse data on our laptops using Excel, and share it with friends by emailing CSV files.
But there comes a point where using the filesystem and Outlook as your DMS stretches to breaking point. You’ll need a proper one.
Nobody really knows what a proper one will look like yet. We’re all working on it. But we do know what it will enable.
What must a DMS do?
A mature DMS will let people do all the following things. Whether as a proprietary monolith, or by slick integration across the web:
- Load and update data from any source (ETL)
- Store datasets and index them for querying
- View, analyse and update data in a tabular interface (spreadsheet)
- Visualise data, for example with charts or maps
- Analyse data, for example with statistics and machine learning
- Organise many people to enter or correct data (crowd-sourcing)
- Measure and ensure the quality of data, and its provenance
- Permissions; data can be open, private or shared
- Find datasets, and organise them to help others find them
- Sell data, sharing processing costs between users
If it sounds like a fat list for a product, that’s because it is. But sometimes the need, the market, pulls you – something simple just won’t do. It has to do or enable, as best it can, everything above. (Compare it to the same list for CMSs)
In short, it’s what the elite data wrangling teams inside places like Wolfram Alpha and Google’s Metaweb do. But made easier and more visible using standardised tools and protocols.
Who’s making a DMS?
More people than I realise. From the largest IT company to the tiniest startup. Here are some I know about; mention more in the comments:
- Windows / OSX (+ Excel / LibreOffice / …) – the desktop serves as a (good enough so far) DMS
- CKAN software – started as a data catalog, but has grown into more and powers the DataHub, a community data hub and market. Created by the Open Knowledge Foundation
- ScraperWiki – coming from the viewpoint of a programmer, good at ETL
- Infochimps/DataMarket – approaching it as a data marketplace
- BuzzData – specialising in the social aspects
- Tableau Public – specialising in visualisation
- Google Spreadsheets – coming from the web spreadsheet direction
- Microsoft Data Hub – corporate information management
- PANDA – making a DMS for newsrooms
They’re all DMS’s because they all naturally grow bad versions of each other’s features. Two examples.
ScraperWiki is particularly good at complex ETL (loading data into a system), yet every DMS has to have a data ingestion interface of at least choosing CSV columns.
CKAN has particularly good metadata, usage and provenance, yet every DMS has to have a way for people to find the data stored in it.
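To make the first example concrete, here is a minimal sketch of the bare-minimum ingestion step mentioned above: reading a CSV and keeping only the columns the user picked. It is not how ScraperWiki, CKAN or any other product does it. The function name `load_columns` and the sample data are made up purely for illustration.

```python
import csv
import io


def load_columns(csv_text, wanted):
    """Read CSV text and keep only the named columns, in the order given.

    Hypothetical helper for illustration; not part of any real DMS.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [name for name in wanted if name not in header]
    if missing:
        raise KeyError(f"columns not found in CSV header: {missing}")
    # Values stay as strings; a real ingestion step would also guess types.
    return [{name: row[name] for name in wanted} for row in reader]


# Invented sample data, standing in for a file someone emailed you.
sample = "council,year,spend,notes\nLeeds,2011,1200,draft\nYork,2011,800,final\n"
rows = load_columns(sample, ["council", "spend"])
```

Even this toy version has to decide what happens when a chosen column is missing, which hints at why every DMS ends up growing its own (usually worse) version of ETL.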
So will they be giant monolithic bits of software?
We hope not! That didn’t turn out great for CMSs, although there are some businesses providing that.
CMS’s only really came of age when in the mid-noughties everyone realised that WordPress (open source blogging software!) was a better CMS than most CMS’s.
It’s in everyone’s interest that users aren’t locked into one DMS. One of them might have a whizzy content analysis tool that somebody who has data in another DMS wants to use. They should be able to, and easily.
OKFN is about to launch a standards initiative to bring together such things. It’s called Data Protocols.
So far the clearest needs are twofold and mirror each other – pulling and pushing data:
a) a data query protocol/format to allow realtime querying, for example for exploring data. Imagine a Google Refine instance live querying a large dataset on OKFN’s Data Hub.
b) a data sync protocol/format that is akin to CouchDB’s protocol. It would let datasets get updated in real time across the web. Imagine a set of scrapers on ScraperWiki automatically updating a visualisation on Many Eyes as the data changed.
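As a rough illustration of the sync idea in (b), here is a toy, in-memory sketch in which a dataset numbers every change and a consumer pulls only the changes it has not yet seen. It is loosely inspired by the general idea of a sequence-numbered changes feed; it is not CouchDB’s actual API, nor anything Data Protocols has specified. The class names `Dataset` and `Mirror`, and the sample rows, are assumptions made up for this example.

```python
class Dataset:
    """A dataset that logs every change with an increasing sequence number."""

    def __init__(self):
        self.rows = {}
        self.log = []  # entries are (seq, key, value); value None means deleted

    def put(self, key, value):
        self.rows[key] = value
        self.log.append((len(self.log) + 1, key, value))

    def delete(self, key):
        self.rows.pop(key, None)
        self.log.append((len(self.log) + 1, key, None))

    def changes(self, since=0):
        """Return all changes with a sequence number greater than `since`."""
        return [change for change in self.log if change[0] > since]


class Mirror:
    """A consumer (say, a visualisation) that keeps a copy up to date."""

    def __init__(self):
        self.rows = {}
        self.last_seq = 0

    def pull(self, source):
        # Apply only the changes made since the last pull, in order.
        for seq, key, value in source.changes(since=self.last_seq):
            if value is None:
                self.rows.pop(key, None)
            else:
                self.rows[key] = value
            self.last_seq = seq
        return self.last_seq


# Invented data: a scraper writes rows, a mirror follows along.
source = Dataset()
mirror = Mirror()
source.put("leeds", {"spend": 1200})
source.put("york", {"spend": 800})
mirror.pull(source)
source.put("leeds", {"spend": 1250})
source.delete("york")
mirror.pull(source)
```

A real protocol would do this over HTTP between different DMSs and would need to handle conflicts, but the core contract is small: "give me everything since sequence N".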
Later even more imaginative things… I reckon Google’s Web Intents can be used to make the whole experience of the user slick when using multiple DMS’s at once. And hopefully somebody, somewhere is making a simplified version of SPARQL/RDF just as XML simplified SGML and then really took off.
Enough of me! What do you think?
Join in. Make standards. Write code.
Leave a comment below, and join the data protocols list.
CEO of ScraperWiki. Made several of the world's first civic websites, such as TheyWorkForYou and WhatDoTheyKnow.
Interesting article; I assume you’re not aware of Semantic MediaWiki.
DataCouch has great potential for data sharing/crowd-sourced cleanup. Since it uses CouchDB, which automatically versions every change, it provides a method for data set forking & merging. It also lets users create visualization/etc. apps on top of datasets.
Yaron – interesting, yes it does look like Semantic MediaWiki is a data hub!
Data loading is very important but must be so intuitive as to be almost transparent. For example, http://www.dynamicalsoftware.com/convocontent/ccm.html is an Alfresco and Hippo CMS integration where online team discussion gets summarized into actionable documents and published automatically.
I code for a living!
When do we start???
You must really hate The DataTank because this is exactly what we started to do from the start ;).
http://data.appsforghent.be
I still don’t quite understand. Why do Windows and OSX serve as a DMS? May I have a brief explanation?
Nice blog post!
An old auntie, like me, can tell you about some of the “DMS” we had long before 1994: systems to manage, for example, payroll data and accounting data, running on so-called mainframes. And you may have an old uncle who can tell you about how they managed, for example, manufacturing data and spare part data with systems on so-called minicomputers. We didn’t call them “DMS” but “ADB system” (Automatisk DataBehandling in Swedish).
Many things were different from now: no global scale (often local homegrown systems), no crowd (often just a bunch of terminal users) and no SQL (but a lot of Get-Hold-Unique-within-Parent calls in hierarchical database management systems).
Other things have become even more important given global scale, crowdsourcing, and the recognition that “anyone can say anything about any topic” on a web of data: things such as data integrity, data traceability, the need for data context, and coping with the variances in the reality represented in the data.
Fariz, it’s always easy when making an innovative product to claim that you have no competition. This isn’t true – people always do something at the moment, get by in some way. The world wouldn’t end without your new product. That current solution – getting by some other way, without it – is the competition.
In the case of data hubs, people’s operating systems are the incumbent competition. We use a combination of tools, such as their filesystem, email clients, and applications like Excel, SPSS, Matlab etc.
Finally, I have a better idea of what CKAN or the DataTank are trying to do. This article was an eye opener for me. That said, with my background, your argumentation is a bit weakened by the fact that you use WordPress as the example of what a DMS should try to be, while it is quite a bad CMS (but a decent blog platform). Just a minor nitpick.
Glenn, some data hubs have transparent loading, yes! But they don’t have to, and when they do it is of its nature limiting. I think actually if something totally automatically gets all its data (like say an RSS reader), then all its datasets are “of one kind”. And that means it isn’t really a data hub, just an app and a database. Hmmm, so maybe my definition of data hub is somewhere that has “a list of datasets, of an indefinite number of kinds (i.e. schemas)”.
Pieter, I’d never heard of DataTank before. Their website (thedatatank.com) that appsforghent.be links to seems to be down. Is it the same as this DataTank? http://datatank.co.uk/ Anyway, yes Apps for Ghent looks like a Data Hub, in the publishing Government information vertical! Like Socrata and CKAN.
Kerstin, that sounds to me like ERP running on databases… But now I think about it, perhaps ERP is a data hub vertical, in theory, if it was across the web and more transparent.
Hi Francis,
Doesn’t seem down to me now: http://thedatatank.com. Our hosting provider announced some maintenance down-time yesterday though. Sorry for that.
Kind regards,
Pieter
Pieter, it looks fantastic!
This article completely failed to mention Drupal – a DMS that has been around for a long time. In fact, Drupal was never much of a CMS, if you ask me.
I don’t personally like Drupal, for a number of reasons – I won’t get into those here. But I wonder if part of the reason it doesn’t meet my expectations for either a CMS or a DMS is the fact that they think they’re building a CMS – when clearly it’s much closer to a DMS in terms of features.
In fact, I think those who pick Drupal are those who actually need a DMS more than a CMS. I wonder how many people tried out Drupal as a CMS and were confused and disappointed? Perhaps it’s time they change their description from CMS to DMS.
Francis, Rufus,
Congrats, great post, thanks for establishing the necessary terms here. We really need more discussion and awareness in this area (data management systems). Minor nit: the abbreviation DMS, I think, is somehow attached to Document Management Systems, which, at least in my experience is a rather suboptimal thing to do or further encourage. Can we come up with something better?
FYI: I’ve linked your post from my recent presentation [1] at the Irish Local Government Management Agency (LGMA) Open Source Forum in Dublin, where I explain the transition from Doc MS over Content MS to Data MS.
Again, thanks and KUTGW!
Cheers,
Michael
[1] http://bit.ly/consumer-pull-through-open-data
Rasmus – intriguing! Have never thought about Drupal. Does it handle multiple kinds of data sets for its end users? I don’t think we should call everything a DMS just because it has data in it! Reckon dissecting Drupal as a DMS or not would take a proper conversation though not just blog comments…
Michael – thanks! Hmmm, that’s a shame regarding DMS. I find myself calling them “data hubs” more often, so perhaps that is what we should go for. What do you think?
Martin, yeah! WordPress is like the commodity, lowest-common-denominator CMS. Just as in the last couple of decades, Excel has been the go-to for data analysis. In both cases, there are better higher end products!