From CMS to DMS: C is for Content, D is for Data
This is a joint blog post by Francis Irving, CEO of ScraperWiki, and Rufus Pollock, Founder of the Open Knowledge Foundation. It’s being cross-posted to both blogs.
Content Management Systems, remember those?

It’s 1994. You haven’t heard of the World Wide Web yet.
Your brother goes to a top university. He once overheard some geeks in the computer room making a ‘web site’ consisting of a photo tour of their shared house. He thought it was stupid; Usenet was so much better.
The question – in 1994 did you understand what a Content Management System (CMS) was?
In the intervening years, CMSs have gone through ups and downs.
They built massive businesses, then crashed in the .com collapse. Then came a glut, with web design agencies all building their own CMS in the early noughties. That leaves us with the situation now.
A mature market, commoditised by open source WordPress. Anyone can get a page on the web using Facebook. There’s still room for expensive, proprietary players: newspapers custom-build their own, and businesses have fancy intranets.
Data Management Systems, time to meet them!

DMSs are also called "data hubs". Hopefully less patented than this wheel!
It’s 2012. You’ve just about heard of Open Data.
Your nephew researches the Internet at a top university. He says there’s no future in Open Data, because no communities have formed round it. Companies aren’t publishing much data yet, and governments publish the wrong data, reluctantly.
The question – what is a Data Management System (DMS)?
There isn’t a very good one yet. We’re at round about where CMSs were in the mid 1990s. Most people get by fine without them.
Just as then we wrote HTML in text files by hand and uploaded it by FTP, now we analyse data on our laptops using Excel, and share it with friends by emailing CSV files.
But eventually, using the filesystem and Outlook as your DMS stretches to breaking point. You’ll need a proper one.
Nobody really knows what a proper one will look like yet. We’re all working on it. But we do know what it will enable.
What must a DMS do?
A mature DMS will let people do all the following things. Whether as a proprietary monolith, or by slick integration across the web:
- Load and update data from any source (ETL)
- Store datasets and index them for querying
- View, analyse and update data in a tabular interface (spreadsheet)
- Visualise data, for example with charts or maps
- Analyse data, for example with statistics and machine learning
- Organise many people to enter or correct data (crowd-sourcing)
- Measure and ensure the quality of data, and its provenance
- Control permissions, so data can be open, private or shared
- Find datasets, and organise them to help others find them
- Sell data, sharing processing costs between users
If it sounds like a fat list for a product, that’s because it is. But sometimes the need, the market, pulls you – something simple just won’t do. It has to do or enable, as best it can, everything above. (Compare it to the same list for CMSs.)
In short, it’s what the elite data wrangling teams inside places like Wolfram Alpha and Google’s Metaweb do. But made easier and more visible using standardised tools and protocols.
Who’s making a DMS?
More people than I realise. From the largest IT company to the tiniest startup. Here are some I know about; mention more in the comments:
- Windows / OSX (+ Excel / LibreOffice / …) – the desktop serves as a (good enough so far) DMS
- CKAN software – started as a data catalog, but has grown into more and powers the DataHub, a community data hub and market. Created by the Open Knowledge Foundation
- ScraperWiki – coming from the viewpoint of a programmer, good at ETL
- Infochimps/DataMarket – approaching it as a data marketplace
- BuzzData – specialising in the social aspects
- Tableau Public – specialising in visualisation
- Google Spreadsheets – coming from the web spreadsheet direction
- Microsoft Data Hub – corporate information management
- PANDA – making a DMS for newsrooms
They’re all DMSs because they all naturally grow bad versions of each other’s features. Two examples.
ScraperWiki is particularly good at complex ETL (loading data into a system), yet every DMS has to have a data ingestion interface of at least choosing CSV columns.
CKAN has particularly good metadata, usage and provenance, yet every DMS has to have a way for people to find the data stored in it.
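As a rough illustration of that minimal "choose your CSV columns" ingestion step, here is a sketch using only the Python standard library. It is not how ScraperWiki or CKAN actually do it; the function name `load_csv`, the table name and the sample rows are all made up for the example.

```python
import csv
import io
import sqlite3


def load_csv(conn, table, csv_text, columns):
    """Load only the chosen columns of a CSV into a SQLite table.

    Hypothetical helper for illustration; returns the number of rows loaded.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in columns if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"columns not in CSV: {missing}")

    # Store everything as text; a real DMS would also guess column types.
    col_defs = ", ".join(f'"{c}" TEXT' for c in columns)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({col_defs})')

    placeholders = ", ".join("?" for _ in columns)
    rows = [[row[c] for c in columns] for row in reader]
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()
    return len(rows)


# Made-up sample data, purely to exercise the function.
SAMPLE = "town,region,population,notes\nAville,North,900,x\nBtown,South,300,y\n"

conn = sqlite3.connect(":memory:")
loaded = load_csv(conn, "towns", SAMPLE, ["town", "population"])

# Once loaded, the data can be queried rather than just eyeballed.
big_towns = conn.execute(
    "SELECT town FROM towns WHERE CAST(population AS INTEGER) > 500"
).fetchall()
```

Even this toy version has to decide about column types, missing columns and repeated loads, which is why every DMS ends up growing some version of ETL.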
So will they be giant monolithic bits of software?

We standardised the shipping container, can we standardise data interoperation?
We hope not! That didn’t turn out great for CMSs, although there are some businesses providing that.
CMSs only really came of age in the mid-noughties, when everyone realised that WordPress (open source blogging software!) was a better CMS than most CMSs.
It’s in everyone’s interest that users aren’t locked into one DMS. One of them might have a whizzy content analysis tool that somebody who has data in another DMS wants to use. They should be able to, and easily.
OKFN is about to launch a standards initiative to bring together such things. It’s called Data Protocols.
So far the clearest needs are twofold and mirror each other – pulling and pushing data:
a) a data query protocol/format to allow realtime querying, for example for exploring data. Imagine a Google Refine instance live querying a large dataset on OKFN’s Data Hub.
b) a data sync protocol/format that is akin to CouchDB’s protocol. It would let datasets get updated in real time across the web. Imagine a set of scrapers on ScraperWiki automatically updating a visualisation on Many Eyes as the data changed.
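To make the sync idea concrete, here is a toy, in-memory sketch of one way it could work: the source keeps an append-only, sequence-numbered change log, and a consumer asks for everything since the last sequence number it saw. That is loosely the idea behind CouchDB’s changes feed, but this is not the Data Protocols design (which doesn’t exist yet) and all the class and method names below are invented.

```python
class Dataset:
    """A toy dataset that records every write in an append-only change log."""

    def __init__(self):
        self.rows = {}     # key -> row dict
        self.changes = []  # (seq, key, row) tuples; row is None for a delete

    def put(self, key, row):
        self.rows[key] = row
        self.changes.append((len(self.changes) + 1, key, row))

    def delete(self, key):
        self.rows.pop(key, None)
        self.changes.append((len(self.changes) + 1, key, None))

    def changes_since(self, seq):
        """Return all changes with a sequence number greater than seq."""
        # Sequence numbers run 1..n with no gaps, so slicing is enough here.
        return self.changes[seq:]


class Replica:
    """A consumer (say, a visualisation) that keeps its own copy up to date."""

    def __init__(self):
        self.rows = {}
        self.last_seq = 0

    def pull(self, source):
        """Apply every change not yet seen; return how many were applied."""
        applied = 0
        for seq, key, row in source.changes_since(self.last_seq):
            if row is None:
                self.rows.pop(key, None)
            else:
                self.rows[key] = row
            self.last_seq = seq
            applied += 1
        return applied


source = Dataset()
source.put("a", {"value": 1})
source.put("b", {"value": 2})

replica = Replica()
first = replica.pull(source)   # picks up both initial rows

source.put("a", {"value": 3})  # the scraper updates a row
source.delete("b")             # and removes another
second = replica.pull(source)  # only the two new changes are transferred
```

The point of the design is that the consumer only needs to remember one number. A real protocol would have to carry this over HTTP and cope with conflicts, which this sketch ignores.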
Later, even more imaginative things… I reckon Google’s Web Intents could make the whole user experience slick when using multiple DMSs at once. And hopefully somebody, somewhere is making a simplified version of SPARQL/RDF, just as XML simplified SGML and then really took off.
Enough of me! What do you think?
Join in. Make standards. Write code.
Leave a comment below, and join the data protocols list.




