
You are browsing the archive for Technical.

The DataTank 4.0

Guest - December 5, 2013 in OKF Belgium, Open Data, Technical

This post was written by Pieter Colpaert, a member of the Open Knowledge Foundation Belgium Chapter.

The DataTank is open source software, just like CKAN, Drupal or Elasticsearch, which you can use to transform a dataset into an HTTP API. Today (the 5th of December 2013), we are proud to launch version 4.0, for which professional support will be provided. The project was started in 2008 by one of the founders of Open Knowledge Foundation Belgium. Today it is still developed mainly by OKFN Belgium, but we welcome new contributors from all over the world.

To get an idea of what The DataTank can do, check http://thedatatank.com or our demo server: http://demo.thedatatank.com.

With this new version of The DataTank, we hope that hackathon developers will have a tool to set up an API in no time, that start-ups will be able to combine different Open Datasets from all over the world in one Web service without trouble, that Open Data Portal developers will integrate The DataTank with CKAN, and that data owners will see a faster return on investment from publishing their data.

The platform is written in PHP using the Laravel framework. If this is a language you speak, feel free to dig in and fork us on github: http://github.com/tdt/core

Oh, did we mention it also publishes RDF where suitable? http://demo.thedatatank.com/api/dcat
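
For developers who want to consume that catalogue programmatically, here is a minimal Python sketch that fetches the DCAT document from the demo server. The use of the requests library and the text/turtle content negotiation are our own assumptions; the server may return a different RDF serialisation.

```python
import requests

# Fetch the DCAT catalogue published by the demo DataTank instance.
# Asking for Turtle via content negotiation is an assumption; the server
# may respond with another RDF serialisation (e.g. RDF/XML or JSON-LD).
response = requests.get(
    "http://demo.thedatatank.com/api/dcat",
    headers={"Accept": "text/turtle"},
    timeout=10,
)
response.raise_for_status()

print(response.headers.get("Content-Type"))
print(response.text[:500])  # peek at the start of the catalogue
```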

If you want to use The DataTank at your organisation but you’re not a technical person, we can help you! Contact our team at info@thedatatank.com

Open Data Training at the Open Knowledge Foundation

Laura James - September 26, 2013 in Business, CKAN, Featured, Open Data, Open Government Data, Open Knowledge Foundation, Our Work, School of Data, Technical, Training

We’re delighted to announce today the launch of a new portfolio of open data training programs.

For many years the Open Knowledge Foundation has been working — both formally and informally — with governments, civil society organisations and others to provide this kind of advice and training. Today marks the first time we’ve brought it all together in one place with a clear structure.

These training programs are designed for two main groups of people interested in open data:

  1. Those within government and other organisations seeking a short introduction to open data – what it is, why to “do” open data, what the challenges are, and how to get started with an open data project or policy.

  2. The growing group of those specialising in open data, perhaps as policy experts, open data program managers, technology specialists, and so on, generally within government or other organisations. Here we offer more in-depth training including detailed material on how to run an open data program or project, and also a technical course for those deploying or maintaining open data portals.

Our training programs are designed and delivered by our team of open data experts with many years of experience creating, maintaining and supporting open data projects around the world.

Please contact us for details on any of these courses, or if you’d be interested in discussing a custom program tailored to your needs.

Our Open Data Training Programs

Open Data Introduction

Who is this for?

This course is a short introduction to open data for anyone and is perfectly suited to teams from diverse functions across organisations who are thinking about or adopting open data for the first time.

Topics covered

Everything you need to understand and start working in this exciting new area: what is open data, why should institutions open data, what are the benefits and opportunities of doing so, and of course how you can get started with an open data policy or project.

This is a one day course to help you and your team get started with open data.


Administrative Open Data Management

Who is this for?

Those specialising in open data, whether as policy experts, open data program managers or in similar roles in government, civil service and other organisations. This course is specifically for non-technical staff who are responsible for managing Open Data programs in their organisation. Such activities typically include implementing an Open Data strategy, designing/launching an Open Data portal, coordinating publication processes, preparing data for publication, and fostering data re-use.

Topics covered

Basics of Open Data (legal, managerial, technical); Success factors for the design and execution of an Open Data program; Overview of the technology landscape; Success factors for community re-use.

Open Data Portal Technology

Who is this for?

Those specialising in open data, whether as software or data experts, open data delivery managers or in similar roles in government, civil service and other organisations. This course is for technical staff who are responsible for maintaining or running an enterprise Open Data portal. Such activities typically include deployment, system administration and hosting, site theming, development of custom extensions and applications, ETL procedures, data conversions, and data life-cycle management.

Topics covered

Basics of Open Data, publication process, and technology landscape; architecture and core functionality of a modern Open Data Management System (CKAN used as example). Deployment, administration and customisation; deploying extensions; integration; geospatial and other special capabilities; engaging with the CKAN community.


Custom training

We can offer training programs tailored to your specific needs, for your organisation, data domain, or locale. Get in touch today to discuss your requirements!

Working with data

We also run the School of Data, which helps civil society organisations, journalists and citizens learn the skills they need to use data effectively, through both online and in-person “learning through doing” workshops. The School of Data runs data-driven investigations and explorations, and data clinics and workshops from “What is Data” up to advanced visualisation and data handling. As well as general training and materials, we offer topic-specific and custom courses and workshops. Please contact schoolofdata@okfn.org to find out more.

As with all of our work, all relevant materials will be openly licensed, and we encourage others (in the global Open Knowledge Foundation network and beyond) to use and build on them.

Git (and Github) for Data

Rufus Pollock - July 2, 2013 in Featured, Ideas and musings, Open Data, Small Data, Technical

The ability to do “version control” for data is a big deal. There are various options, but one of the most attractive is to reuse existing tools for doing this with code, like git and mercurial. This post describes a simple “data pattern” for storing and versioning data using those tools, one we have been using for some time and have found to be very effective.

Introduction

The ability to revision and version data – to store changes made and share them with others, especially in a distributed way – would be a huge benefit to the (open) data community. I’ve discussed why at some length before (see also this earlier post), but to summarize:

  • It allows effective distributed collaboration – you can take my dataset, make changes, and share those back with me (and different people can do this at once!)
  • It allows one to track provenance better (i.e. what changes came from where)
  • It allows for sharing updates and synchronizing datasets in a simple, effective way – e.g. an automated way to get the last month’s GDP or employment data without pulling the whole file again

There are several ways to address the “revision control for data” problem. The approach here is to get data in a form that means we can take existing powerful distributed version control systems designed for code, like git and mercurial, and apply them to the data. As such, the best github for data may, in fact, be github (of course, you may want to layer data-specific interfaces on top of git(hub) – this is what we do with http://data.okfn.org/).

There are limitations to this approach and I discuss some of these and alternative models below. In particular, it’s best for “small (or even micro) data” – say, under 10Mb or 100k rows. (One alternative model can be found in the very interesting Dat project recently started by Max Ogden — with whom I’ve talked many times on this topic).

However, given the maturity and power of the tooling – and its likely evolution – and the fact that so much data is small we think this approach is very attractive.

The Pattern

The essence of the pattern is:

  1. Storing data as line-oriented text, and specifically as CSV1 (comma-separated values) files. “Line-oriented text” just indicates that individual units of the data, such as a row of a table (or an individual cell), correspond to one line2.

  2. Using best-of-breed (code) version control systems like git or mercurial to store and manage the data.

Line-oriented text is important because it enables the powerful distributed version control tools like git and mercurial to work effectively (this, in turn, is because those tools are built for code which is (usually) line-oriented text). It’s not just version control though: there is a large and mature set of tools for managing and manipulating these types of files (from grep to Excel!).

In addition to the basic pattern, there are a few optional extras you can add (a minimal sketch combining the pattern and these extras follows the list):

  • Store the data in GitHub (or Gitorious or Bitbucket or …) – all the examples below follow this approach
  • Turn the collection of data into a Simple Data Format data package by adding a datapackage.json file which provides a small set of essential information like the license, sources, and schema (this column is a number, this one is a string)
  • Add the scripts you used to process and manage data — that way everything is nicely together in one repository
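
Here is that minimal sketch in Python. It writes a small CSV, describes it with a bare-bones datapackage.json (the field names, types and metadata shown are illustrative and cover only a small subset of the spec), and commits everything with git via subprocess. It assumes git is installed and configured, and is a sketch of the workflow rather than a finished tool.

```python
import csv
import json
import subprocess
from pathlib import Path

repo = Path("gdp-package")
repo.mkdir(exist_ok=True)

# 1. Store the data as a line-oriented CSV file.
rows = [("GB", 2012, 2.47e12), ("BE", 2012, 4.83e11)]
with open(repo / "data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["country", "year", "gdp_usd"])
    writer.writerows(rows)

# 2. Describe it with a minimal datapackage.json (illustrative subset of the spec).
datapackage = {
    "name": "gdp",
    "licenses": [{"id": "odc-pddl"}],
    "sources": [{"name": "example source"}],
    "resources": [{
        "path": "data.csv",
        "schema": {"fields": [
            {"name": "country", "type": "string"},
            {"name": "year", "type": "integer"},
            {"name": "gdp_usd", "type": "number"},
        ]},
    }],
}
(repo / "datapackage.json").write_text(json.dumps(datapackage, indent=2))

# 3. Version data, metadata and (eventually) processing scripts together with git.
subprocess.run(["git", "init"], cwd=repo, check=True)
subprocess.run(["git", "add", "."], cwd=repo, check=True)
subprocess.run(["git", "commit", "-m", "Add GDP data as a data package"], cwd=repo, check=True)
```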

What’s good about this approach?

The set of tools that exists for managing and manipulating line-oriented files is huge and mature. In particular, powerful distributed version control systems like git and mercurial are already extremely robust ways to do distributed, peer-to-peer collaboration around code, and this pattern takes that model and makes it applicable to data. Here are some concrete examples of why it’s good.

Provenance tracking

Git and mercurial provide a complete history of individual contributions with “simple” provenance via commit messages and diffs.

Example of commit messages

Peer-to-peer collaboration

Forking and pulling data allows independent contributors to work on it simultaneously.

Timeline of pull requests

Data review

By using git or mercurial, tools for code review can be repurposed for data review.

Pull screen

Simple packaging

The repo model provides a simple way to store data, code, and metadata in a single place.

A repo for data

Accessibility

This method of storing and versioning data is very low-tech. The format and tools are both mature and ubiquitous. For example, every spreadsheet and every relational database can handle CSV, and every unix platform has a suite of tools, like grep, sed and cut, that can be used on these kinds of files.

Examples

We’ve been using this approach for a long time: in 2005 we first stored CSVs in subversion, then in mercurial, and when we switched to git (and github) 3 years ago we started storing them there. In 2011 we started the datasets organization on github, which contains a whole list of datasets managed according to the pattern above. Here are a couple of specific examples:

Note Most of these examples not only show CSVs being managed in github but are also simple data format data packages – see the datapackage.json they contain.


Appendix

Limitations and Alternatives

Line-oriented text and its tools are, of course, far from perfect solutions to data storage and versioning. They will not work for datasets of every shape and size, and in some respects they are awkward tools for tracking and merging changes to tabular data. For example:

  • Simple actions on data stored as line-oriented text can lead to a very large changeset. For example, swapping the order of two fields (= columns) leads to a change in every single line. Given that diffs, merges, etc. are line-oriented, this is unfortunate.3
  • It works best for smallish data (e.g. < 100k rows, < 50mb files, optimally < 5mb files). git and mercurial don’t handle big files that well, and features like diffs get more cumbersome with larger files.4
  • It works best for data made up of lots of similar records, ideally tabular data. In order for line-oriented storage and tools to be appropriate, you need the record structure of the data to fit with the CSV line-oriented structure. The pattern is less good if your CSV is not very line-oriented (e.g. you have a lot of fields with line breaks in them), causing problems for diff and merge.
  • CSV lacks a lot of information, e.g. information on the types of fields (everything is a string). There is no way to add metadata to a CSV without compromising its simplicity or making it no longer usable as pure data. You can, however, add this kind of information in a separate file, and this is exactly what the Data Package standard provides with its datapackage.json file.

The most fundamental limitations above all arise from applying line-oriented diffs and merges to structured data whose atomic unit is not a line (it’s a cell, or a transform of some kind, like swapping two columns).

The first issue discussed above, where a simple change to a table is treated as a change to every line of the file, is a clear example. In a perfect world, we’d have both a convenient structure and a whole set of robust tools to support it, e.g. tools that recognize swapping two columns of a CSV as a single, simple change or that work at the level of individual cells.

Fundamentally a revision system is built around a diff format and a merge protocol. Get these right and much of the rest follows. The basic 3 options you have are:

  • Serialize to line-oriented text and use the great tools like git (what we’ve described above)
  • Identify atomic structure (e.g. document) and apply diff at that level (think CouchDB or standard copy-on-write for RDBMS at row level)
  • Record transforms (e.g. Refine)

At the Open Knowledge Foundation we built a system along the lines of (2) and have been involved in exploring and researching both (2) and (3) – see changes and syncing for data on dataprotocols.org. These options are definitely worth exploring — and, for example, Max Ogden, with whom I’ve had many great discussions on this topic, is currently working on an exciting project called Dat, a collaborative data tool which will use the “sleep” protocol.

However, our experience so far is that the line-oriented approach beats any currently available options along those other lines (at least for smaller sized files!).

data.okfn.org

Having already been storing data in github like this for several years, we recently launched http://data.okfn.org/ which is explicitly based on this approach:

  • Data is CSV stored in git repos on GitHub at https://github.com/datasets
  • All datasets are data packages with datapackage.json metadata
  • The frontend site is ultra-simple – it just provides a catalog and API and pulls data directly from github

Why line-oriented

Line-oriented text is the natural form of code and so is supported by a huge number of excellent tools. But line-oriented text is also the simplest and most parsimonious form for storing general record-oriented data—and most data can be turned into records.

At its most basic, structured data requires a delimiter for fields and a delimiter for records. Comma- or tab-separated values (CSV, TSV) files are a very simple and natural implementation of this encoding. They delimit records with the most natural separation character besides the space, the line break. For a field delimiter, since spaces are too common in values to be appropriate, they naturally resort to commas or tabs.

Version control systems require an atomic unit to operate on. A versioning system for data can quite usefully treat records as the atomic units. Using line-oriented text as the encoding for record-oriented data automatically gives us a record-oriented versioning system in the form of existing tools built for versioning code.
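
To make that concrete, here is a small Python sketch that compares two versions of a CSV at the record level, reporting rows added and removed. It is a toy comparison (it ignores row order, duplicate rows and in-place edits) meant only to illustrate treating records as the atomic unit, not to replace git’s own diff; the file names in the usage comment are hypothetical.

```python
import csv

def read_rows(path):
    """Read a CSV file and return its header plus the data rows as a set of tuples."""
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    return rows[0], {tuple(r) for r in rows[1:]}

def row_diff(old_path, new_path):
    """Report records added and removed between two versions of a CSV file."""
    _, old_rows = read_rows(old_path)
    _, new_rows = read_rows(new_path)
    return {
        "added": sorted(new_rows - old_rows),
        "removed": sorted(old_rows - new_rows),
    }

# Example usage (file names are hypothetical):
# print(row_diff("data_v1.csv", "data_v2.csv"))
```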


  1. Note that, by CSV, we really mean “DSV”, as the delimiter in the file does not have to be a comma. However, the row terminator should be a line break (or a line break plus carriage return). 

  2. CSVs do not always have one row to one line (it is possible to have line-breaks in a field with quoting). However, most CSVs are one-row-to-one-line. CSVs are pretty much the simplest possible structured data format you can have. 

  3. As a concrete example, the merge function will probably work quite well in reconciling two sets of changes that affect different sets of records, hence lines. Two sets of changes which each move a column will not merge well, however. 

  4. For larger data, we suggest swapping out git (and e.g. GitHub) for simple file storage like s3. Note that s3 can support basic copy-on-write versioning. However, being copy-on-write, it is comparatively very inefficient. 

Announcing CKAN 2.0

Mark Wainwright - May 10, 2013 in CKAN, Featured, Featured Project, News, OKF Projects, Open Data, Open Government Data, Releases, Technical

CKAN is a powerful, open source, open data management platform, used by governments and organizations around the world to make large collections of data accessible, including the UK and US government open data portals.

Today we are very happy and excited to announce the final release of CKAN 2.0. This is the most significant piece of CKAN news since the project began, and represents months of hectic work by the team and other contributors since before the release of version 1.8 last October, and of the 2.0 beta in February. Thank you to the many CKAN users for your patience – we think you’ll agree it’s been worth the wait.

[Screenshot: Front page]

CKAN 2.0 is a significant improvement on 1.x versions for data users, programmers, and publishers. Enormous thanks are due to the many users, data publishers, and others in the data community, who have submitted comments, code contributions and bug reports, and helped to get CKAN to where it is. Thanks also to OKF clients who have supported bespoke work in various areas that has become part of the core code. These include data.gov, the US government open data portal, which will be re-launched using CKAN 2.0 in a few weeks. Let’s look at the main changes in version 2.0. If you are in a hurry to see it in action, head on over to demo.ckan.org, where you can try it out.

Summary

CKAN 2.0 introduces a new sleek default design, and easier theming to build custom sites. It has a completely redesigned authorisation system enabling different departments or bodies to control their own workflow. It has more built-in previews, and publishers can add custom previews for their favourite file types. News feeds and activity streams enable users to keep up with changes or new datasets in areas of interest. A new version of the API enables other applications to have full access to all the capabilities of CKAN. And there are many other smaller changes and bug fixes.

Design and theming

The first thing that previous CKAN users will notice is the greatly improved page design. For the first time, CKAN’s look and feel has been carefully designed from the ground up by experienced professionals in web and information design. This has affected not only the visual appearance but many aspects of the information architecture, from the ‘breadcrumb trail’ navigation on each page, to the appearance and position of buttons and links to make their function as transparent as possible.

[Screenshot: dataset page]

Under the surface, an even more radical change has affected how pages are themed in CKAN. Themes are implemented using templates, and the old templating system has been replaced with the newer and more flexible Jinja2. This makes it much easier for developers to theme their CKAN instance to fit in with the overall theme or branding of their web presence.

Authorisation and workflow: introducing CKAN ‘Organizations’

Another major change affects how users are authorised to create, publish and update datasets. In CKAN 1.x, authorisation was granted to individual users for each dataset. This could be augmented with a ‘publisher mode’ to provide group-level access to datasets. A greatly expanded version of this mode, called ‘Organizations’, is now the default system of authorisation in CKAN. This is much more in line with how most CKAN sites are actually used.

[Screenshot: Organizations page]

Organizations make it possible for individual departments, bodies, groups, etc, to publish their own data in CKAN, and to have control over their own publishing workflow. Different users can have different roles within an Organization, with different authorisations. Linked to this is the possibility for each dataset to have different statuses, reflecting their progress through the workflow, and to be public or private. In the default set-up, Organization user roles include Members (who can read the Organization’s private datasets), Editors (who can add, edit and publish datasets) and Admins (who can add and change roles for users).

More previews

In addition to the existing image previews and table, graph and map previews for spreadsheet data, CKAN 2.0 includes previews for PDF files (shown below), HTML (in an iframe), and JSON. Additionally there is a new plugin extension point that makes it possible to add custom previews for different data types, as described in this recent blog post.

[Screenshot: PDF preview]

News feeds and activity streams

CKAN 2.0 provides users with ways to see when new data or changes are made in areas that they are interested in. Users can ‘follow’ datasets, Organizations, or groups (curated collections of datasets). A user’s personalised dashboard includes a news feed showing activity from the followed items – new datasets, revised metadata and changes or additions to dataset resources. If there are entries in your news feed since you last read it, a small flag shows the number of new items, and you can opt to receive notifications of them via e-mail.

Each dataset, Organization etc also has an ‘activity stream’, enabling users to see a summary of its recent history.

[Screenshot: News feed]

Programming with CKAN: meet version 3 of the API

CKAN’s powerful application programming interface (API) makes it possible for other machines and programs to automatically read, search and update datasets. CKAN’s API was previously designed according to REST principles. RESTful APIs are deservedly popular as a way to expose a clean interface to certain views on a collection of data. However, for CKAN we felt it would be better to give applications full access to CKAN’s own internal machinery.

A new version of the API – version 3 – trialled in beta in CKAN 1.8, replaced the REST design with remote procedure calls, enabling applications or programmers to call the same procedures as CKAN’s own code uses to implement its user interface. Anything that is possible via the user interface, and a good deal more, is therefore possible through the API. This proved popular and stable, and so, with minor tweaks, it is now the recommended API. Old versions of the API will continue to be provided for backward compatibility.
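
As a minimal illustration of the RPC style, the Python sketch below calls two standard actions against the demo site mentioned above. The actions shown (package_list and package_show) are part of the CKAN action API, but treat the response handling as indicative rather than a complete reference; the dataset id used is hypothetical.

```python
import requests

BASE = "http://demo.ckan.org/api/3/action"

# Each call simply names a procedure ("action") and passes its parameters;
# the response wraps the action's return value in a JSON envelope.
datasets = requests.get(f"{BASE}/package_list").json()
print(datasets["success"], len(datasets["result"]), "datasets")

# The same procedures that back the web UI are callable here, e.g. fetching
# the full metadata for one dataset (the dataset id below is hypothetical).
detail = requests.get(f"{BASE}/package_show", params={"id": "example-dataset"}).json()
if detail.get("success"):
    print(detail["result"]["title"])
```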

Documentation, documentation, documentation

CKAN comes with installation and administration documentation which we try to keep complete and up-to-date. The major changes in the rest of CKAN have thus required a similarly concerted effort on the documentation, and the docs have therefore been overhauled for 2.0. It’s great when we hear that others have implemented their own installation of CKAN, something that’s been increasing lately, and we hope to see even more of this. CKAN is a large and complex system to deploy, and work on improving the docs continues: version 2.1 will be another step forward. Where people do run into problems, help remains available as usual on the community mailing lists.

… And more

There are many other minor changes and bug fixes in CKAN 2.0. For a full list, see the CKAN changelog.

Installing

To install your own CKAN, or to upgrade an existing installation, you can install it as a package on Ubuntu 12.04 or do a source installation. Full installation and configuration instructions are at docs.ckan.org.

Try it out

You can try out the main features at demo.ckan.org. Please let us know what you think!

Frictionless Data: making it radically easier to get stuff done with data

Rufus Pollock - April 24, 2013 in Featured, Ideas and musings, Labs, Open Data, Open Standards, Small Data, Technical

Frictionless Data is now in alpha at http://data.okfn.org/ – and we’d like you to get involved.

Our mission is to make it radically easier to make data used and useful – our immediate goal is to make it as simple as possible to get the data you want into the tool of your choice.

This isn’t about building a big datastore or a data management system – it’s simply about saving people from repeating all the same tasks of discovering a dataset, getting it into a format they can use, and cleaning it up – all before they can do anything useful with it! If you’ve ever spent the first half of a hackday just tidying up tabular data and getting it ready to use, Frictionless Data is for you.

Our work is based on a few key principles:

  • Narrow focus — improve one small part of the data chain, standards and tools are limited in scope and size
  • Build for the web – use formats that are web “native” (JSON) and work naturally with HTTP (plain text, CSV is streamable, etc.; a short streaming sketch follows this list)
  • Distributed not centralised — designed for a distributed ecosystem (no centralized, single point of failure or dependence)
  • Work with existing tools — don’t expect people to come to you, make this work with their tools and their workflows (almost everyone in the world can open a CSV file, every language can handle CSV and JSON)
  • Simplicity (but sufficiency) — use the simplest formats possible and do the minimum in terms of metadata but be sufficient in terms of schemas and structure for tools to be effective
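
To illustrate the “CSV is streamable” point from the list above, here is a small Python sketch that parses a remote CSV row by row without downloading the whole file first; the URL is hypothetical and stands in for any CSV published on the web.

```python
import csv
import requests

# Hypothetical URL of a CSV file published somewhere on the web.
URL = "http://example.org/data/gdp.csv"

# Stream the response and feed it to the csv module line by line, so rows can
# be processed as they arrive rather than after a full download.
with requests.get(URL, stream=True, timeout=30) as response:
    response.raise_for_status()
    response.encoding = response.encoding or "utf-8"  # fall back if no charset is given
    reader = csv.DictReader(response.iter_lines(decode_unicode=True))
    for row in reader:
        print(row)  # each row is a dict keyed by the CSV header
        break       # just peek at the first record in this sketch
```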

We believe that making it easy to get and use data and especially open data is central to creating a more connected digital data ecosystem and accelerating the creation of social and commercial value. This project is about reducing friction in getting, using and connecting data, making it radically easier to get data you need into the tool of your choice. Frictionless Data distills much of our learning over the last 7 years into some specific standards and infrastructure.

What’s the Problem?

Today, when you decide to cook, the ingredients are readily available at local supermarkets or even already in your kitchen. You don’t need to travel to a farm, collect eggs, mill the corn, cure the bacon and so on, as you once would have done! Instead, thanks to standard systems of measurement, packaging, shipping (e.g. containerization) and payment, ingredients can get from the farm direct to your local shop or even your door.

But with data we’re still largely stuck at this early stage: every time you want to do an analysis or build an app you have to set off around the internet to dig up data, extract it, clean it and prepare it before you can even get it into your tool and begin your work proper.

What do we need to do for working with data to be like cooking today – where you get to spend your time making the cake (creating insights), not preparing and collecting the ingredients (digging up and cleaning data)?

The answer: radical improvements in the “logistics” of data associated with specialisation and standardisation. In analogy with food, we need standard systems of “measurement”, packaging, and transport so that it’s easy to get data from its original source into the application where you can start working with it.

Frictionless Data idea

What’s Frictionless Data going to do?

We start with an advantage: unlike for physical goods, transporting digital information from one computer to another is very cheap! This means the focus can be on standardizing and simplifying the process of getting data from one application to another (or one form to another). We propose work in 3 related areas:

  • Key simple standards. For example, a standardized “packaging” of data that makes it easy to transport and use (think of the “containerization” revolution in shipping)
  • Simple tooling and integration – you should be able to get data in these standard formats into or out of Excel, R, Hadoop or whatever tool you use
  • Bootstrapping the system with essential data – we need to get the ball rolling

frictionless data components diagram

What’s Frictionless Data today?

1. Data

We have some exemplar datasets which are useful for a lot of people – these are:

  • High Quality & Reliable

    • We have sourced, normalized and quality checked a set of key reference datasets such as country codes, currencies, GDP and population.
  • Standard Form & Bulk Access

    • All the datasets are provided in a standardized form and can be accessed in bulk as CSV together with a simple JSON schema.
  • Versioned & Packaged

    • All data is in data packages and is versioned using git, so all changes are visible and data can be collaboratively maintained.

2. Standards

We have two simple data package formats, described as ultra-lightweight, RFC-style specifications. They build heavily on prior work. Simplicity and practicality were guiding design criteria.

Frictionless Data: package standard diagram

Data package: minimal wrapping, agnostic about the data it is “packaging”, designed for extension. This flexibility is good, as it can be used as a transport for pretty much any kind of data, but it also limits integration and tooling. Read the full Data Package specification.

Simple data format (SDF): focuses on tabular data only and extends data package (data in simple data format is a data package) by requiring data to be “good” CSVs and the provision of a simple JSON-based schema to describe them (“JSON Table Schema”). Read the full Simple Data Format specification.
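
To show how little a consumer needs to do, here is a hedged Python sketch that reads a data package directory: it loads datapackage.json and then reads each local CSV resource. It interprets only a small subset of the specifications (the resources and path keys) and ignores the schema, so it is a sketch of the idea rather than a conforming implementation; the directory name in the usage comment is hypothetical.

```python
import csv
import json
from pathlib import Path

def load_data_package(package_dir):
    """Read a simplified data package: datapackage.json plus its local CSV resources."""
    package_dir = Path(package_dir)
    descriptor = json.loads((package_dir / "datapackage.json").read_text())

    tables = {}
    for resource in descriptor.get("resources", []):
        path = resource.get("path")
        if not path or not path.endswith(".csv"):
            continue  # this sketch only handles local CSV resources
        with open(package_dir / path, newline="") as f:
            tables[path] = list(csv.DictReader(f))
    return descriptor, tables

# Example usage (the directory name is hypothetical):
# descriptor, tables = load_data_package("gdp-package")
# print(descriptor["name"], {name: len(rows) for name, rows in tables.items()})
```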

3. Tools

It’s early days for Frictionless Data, so we’re still working on this bit! But there’s a need for validators, schema generators, and all kinds of integration. You can help out – see below for details or check out the issues on github.

Doesn’t this already exist?

People have been working on data for a while – doesn’t something like this already exist? The crude answer is yes and no. People, including folks here at the Open Knowledge Foundation, have been working on this for quite some time, and there are already some parts of the solution out there. Furthermore, many of these ideas are directly borrowed from similar work in software. For example, the Data Packages spec (first version in 2007!) builds heavily on packaging projects and specifications like Debian and CommonJS.

Key distinguishing features of Frictionless Data:

  • Ultra-simplicity – we want to keep things as simple as they possibly can be. This includes formats (JSON and CSV) and a focus on end-user tool integration, so people can just get the data they want into the tool they want and move on to the real task
  • Web orientation – we want an approach that fits naturally with the web
  • Focus on integration with existing tools
  • Distributed and not tied to a given tool or project – this is not about creating a central data marketplace or similar setup. It’s about creating a basic framework that would enable anyone to publish and use datasets more easily and without going through a central broker.

Many of these are shared with (and derive from) other approaches but as a whole we believe this provides an especially powerful setup.

Get Involved

This is a community-run project coordinated by the Open Knowledge Foundation as part of Open Knowledge Foundation Labs. Please get involved:


  • Spread the word! Frictionless Data is a key part of the real data revolution – follow the debate on #SmallData and share our posts so more people can get involved

Announcing: Linked Open Vocabularies (LOV), enabling the vocabulary commons

Pierre-Yves Vandenbussche - July 10, 2012 in Featured Project, Linked Open Data, News, Open Data, Our Work, Technical

We are delighted to announce that Linked Open Vocabularies is now being hosted on Open Knowledge Foundation servers and is now officially an Open Knowledge Foundation project.

LOV Project in 5 points

  • LOV is about vocabularies (aka. metadata element sets or ontologies) in OWL / RDFS used to describe linked data.
  • LOV provides single-stop access to the Vocabulary Commons ecosystem
  • LOV helps to improve the understanding, visibility, usability, synergy, sustainability and overall quality of vocabularies
  • LOV promotes a technically and socially sustainable management of the Vocabulary Commons ecosystem
  • LOV is an open, community-driven project. You are welcome to join the team of gardeners of the Vocabulary Commons!

Project context

The LOV project was born in the framework of the Datalift project, which aims at providing a platform to lift data from semi-structured formats (csv, xls, etc.) to linked data. The part of this project under the responsibility of the Mondeca company was focused on vocabulary selection and re-use. The purpose of the LOV project now goes far beyond this original catalogue. The LOV dataset is maintained by Bernard Vatant and Pierre-Yves Vandenbussche.

Project purposes

  • To identify vocabularies used or usable to express linked data in RDF
  • To harvest or create metadata and links between vocabularies
  • To suggest to vocabulary curators some vocabulary description improvements
  • To foster sustainable and responsible behavior of vocabulary creators and publishers
  • To provide advanced search features among vocabulary ecosystem elements

Project features

Among the various features of the LOV project, you can explore the vocabularies dataset using an intuitive UI. You can also access an RDF dump directly, as a file or via an endpoint.
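
For programmatic access, here is a hedged Python sketch that queries a SPARQL endpoint for vocabularies in the catalogue using the SPARQLWrapper library. The endpoint URL and the use of the voaf:Vocabulary class are assumptions about how the LOV data is published and modelled, so check the project documentation for the exact details.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# The endpoint URL is an assumption; use the one documented by the LOV project.
ENDPOINT = "http://lov.okfn.org/endpoint/lov"

sparql = SPARQLWrapper(ENDPOINT)
# List a handful of vocabularies described in the catalogue; modelling them
# with VOAF's Vocabulary class is an assumption about the LOV dataset.
sparql.setQuery("""
    PREFIX voaf: <http://purl.org/vocommons/voaf#>
    PREFIX dcterms: <http://purl.org/dc/terms/>
    SELECT ?vocab ?title WHERE {
        ?vocab a voaf:Vocabulary ;
               dcterms:title ?title .
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["vocab"]["value"], "-", binding["title"]["value"])
```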

Lov 1

For every vocabulary, as much metadata as possible is harvested (gathered in the RDF file, in the documentation or via interaction with authors). For example, the links between a particular vocabulary and the ecosystem are shown, as well as its different versions.

Lov 2

One may search for a particular vocabulary element using the LOV search feature, filtering results by domain, type, or vocabulary. This feature is enabled thanks to the LOV-bot, which monitors all the vocabularies on a daily basis.

Lov 3

OKFN support and the future of Vocabulary Commons

Along with a sustainable and resilient future for vocabularies, we believe the LOV project should live far beyond the Datalift research project in which it was born. With that perspective in mind, the Open Knowledge Foundation has agreed to support our project in the coming years. We are really delighted by this support, which strengthens our belief that heritage organizations will play a major role in vocabulary preservation.

The future of LOV and the Vocabulary Commons belongs to its community. You are therefore, as an individual or organization, most welcome to participate in the future of LOV in many ways:

Introducing PyBossa – the open-source micro-tasking platform

Sam Leon - June 8, 2012 in Featured, Featured Project, OKF Projects, Our Work, PyBossa, Technical, WG Open Data in Science, Working Groups

PyBossa Logo

For a while now our network has been working on applications, tools and platforms for crowd-sourcing and micro-tasking. At the end of last year, we posted about a cute little app called the Data Digitizer, developed at a hackday, that was being used to transcribe Brazilian budgetary data.

In recent months we’ve been working closely with the Citizen Cyberscience Center on an exciting new platform called PyBossa. In a nutshell, PyBossa is a free, open-source crowd-sourcing and micro-tasking platform. It enables people to create and run projects that utilise online assistance in performing tasks that require human cognition, such as image classification, transcription, geocoding and more. PyBossa is there to help researchers, civic hackers and developers to create projects where anyone around the world with some time, interest and an internet connection can contribute.

There is already a wealth of such projects, including long-running ones such as FreeBMD – a huge effort to transcribe the Civil Registration of births, marriages and deaths in the UK – as well as more recent ones such as GalaxyZoo – a hugely successful project based on volunteer efforts to classify photographs of galaxies taken by the Hubble telescope.

With PyBossa we want to make the creation of such potentially transformative projects as easy as possible and so PyBossa is different to existing efforts:

  • It’s 100% open-source
  • Unlike, say, “mechanical turk” style projects, PyBossa is not designed to handle payment or money — it is designed to support volunteer-driven projects.
  • It’s designed as a platform and framework for developing and deploying crowd-sourcing and micro-tasking apps, rather than being a crowd-sourcing application itself. Individual crowd-sourcing apps are written as simple snippets of Javascript and HTML which are then deployed on a PyBossa instance (such as PyBossa.com). This way you can easily develop custom apps while using the PyBossa platform to store your data, manage users, and handle workflow.

You can read more about the architecture in the PyBossa Documentation and follow the step-by-step tutorial to create your own apps.
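
To give a feel for the platform side of that split, here is a hedged Python sketch that registers an application and loads a batch of tasks over PyBossa’s REST API using requests. The server URL, endpoint paths, field names and API-key handling are all assumptions made for illustration, so treat the documentation and tutorial linked above as authoritative.

```python
import requests

# All of these values are placeholders: point them at a real PyBossa server
# and API key, and check the documentation for the exact endpoint names.
SERVER = "http://pybossa.example.org"
API_KEY = "your-api-key"

# Register an application (the /api/app path is an assumption).
app = requests.post(
    f"{SERVER}/api/app",
    params={"api_key": API_KEY},
    json={
        "name": "Flickr Person",
        "short_name": "flickrperson",
        "description": "Do you see a human in this photo?",
    },
).json()

# Load one task per photo; volunteers answer it through the app's JS/HTML snippet.
photos = ["http://example.org/photo1.jpg", "http://example.org/photo2.jpg"]
for url in photos:
    requests.post(
        f"{SERVER}/api/task",
        params={"api_key": API_KEY},
        json={"app_id": app["id"], "info": {"photo_url": url}},
    )
```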

Demos

PyBossa currently comes with several demo applications that showcase two types of projects:

Flickr Person shows how easily you can create a project where you have a set of photos or figures that need a classification or a description. In this demo application, the latest 20 published public photos from Flickr are used as input, and volunteers answer a simple question about each one: Do you see a human in this photo?

PyBossa Baby

The demo project Melanoma comes from an idea conceived by a team at Sage Bionetworks. Melanoma is one of the most life-threatening forms of cancer and its incidence is on the rise. It is often difficult for medical professionals to determine if a skin lesion is cancerous or not, but if diagnosed early patients have a 95% chance of survival. Advances in computer-aided image manipulation have improved the diagnostic process, and the hope is that combining these techniques with crowd-sourcing will improve them further, making early diagnosis more common.

In the demo you are asked to say whether a skin lesion shows signs of being cancerous, and are taken through the various key questions: is it asymmetrical? are its borders blurred? is its colour uneven? is it bigger than 6mm in diameter? The plan is to extend this demo into a project that will help citizens recognise the early signs of skin cancer and also enable scientists to evaluate the role of crowd-sourcing in medical diagnosis.

PyBossa Melanoma

Urban Parks is a rather different kind of project. It shows a web mapping tool where volunteers are asked to locate an urban park for a given city. The goal is to show how web mapping tools can be used to address tasks like geo-locating items in a map.

PyBossa Urban Parks

If you want to try the demos and PyBossa, go to PyBossa.com and get clicking. If you are interested in the framework you can download the source code from the Github repo and access the documentation here.

The Future

The focus of PyBossa has initially been on online citizen science projects, but it could have important applications in a host of other domains. For one, PyBossa could be used to help transcribe handwritten manuscripts of historical significance and contribute to existing efforts to make more of our shared cultural heritage available for free online and in a structured form.

We have no doubt that there are hundreds of other use-cases for PyBossa which we haven’t conceived of yet, and we’re looking forward to seeing the unexpected projects that emerge from it.

Call to action

Does PyBossa sound like something you’d like to get involved in? If so…

For any questions that you would like to address directly to the development team please use info [at] pybossa.com

Introducing the DataStore

Rufus Pollock - March 27, 2012 in CKAN, Our Work, Technical

A major new feature in the DataHub is good news for data wranglers. The DataStore allows users to store and load structured data into a database, where it can be queried, filtered, or accessed from other programs via a rich data API.
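
As a rough illustration of accessing the DataStore from another program, here is a hedged Python sketch that runs a full-text query over HTTP. The resource id is hypothetical, and the endpoint shown (the datastore_search action) reflects the form used by recent CKAN releases, so the exact path and parameters on a given portal may differ; check the Data API documentation for your instance.

```python
import requests

# The resource id is hypothetical; the action name and parameters follow the
# CKAN datastore_search API, but check your portal's docs for the exact form.
RESOURCE_ID = "your-resource-id"
URL = "http://datahub.io/api/3/action/datastore_search"

# Full-text search with a small result limit, returned as JSON.
response = requests.get(
    URL,
    params={"resource_id": RESOURCE_ID, "q": "gold", "limit": 5},
)
response.raise_for_status()
result = response.json()["result"]
for record in result["records"]:
    print(record)
```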

The API is also used by CKAN’s inbuilt Recline Data Explorer, giving in-page previews of the data with full text search, filtering, sorting and graphing, as in the screenshots below:

[IMG: Sorting]
[IMG: Graph]

These new DataHub capabilities are powered by the recently enhanced DataStore and Data API functionality of our open-source CKAN data management system, which as well as powering the DataHub runs many other data portals including data.gov.uk.

An introduction to the DataStore and Data API

Major new CKAN release: v1.5!

Theodora Middleton - November 14, 2011 in CKAN, News, OKF Projects, Open Data, Technical

The following post is by David Read, on behalf of the CKAN team.


We’re proud to announce a major new release of CKAN!

Version 1.5 brings major improvements including:

  • Major user experience upgrades around dataset publication and access plus a new theme
  • Integrated structured and blob data storage, with associated data previewing and visualization
  • Extended catalog API providing the ability to access every piece of the CKAN system
  • Documentation overhaul and extension including a new administrator and development manual at http://docs.ckan.org/
  • Easier installation and deployment, specifically via new debian / ubuntu packages of CKAN — CKAN installation and deployment can now take less than 5 minutes

CKAN, the Open Knowledge Foundation’s data hub and catalogue software, has now been deployed in over 20 countries around the world, providing Open Data hubs for governments and communities. Started five years ago, CKAN has gradually gained momentum – the development team is now at 6 full-time developers. CKAN was originally developed to power the community data site http://thedatahub.org/ (previously http://ckan.net), which can be freely used by anyone in the open data communities; the Foundation has also now been involved in assisting governments and other public organisations to make use of CKAN through the development of customisations and new features, as well as the provision of hosted solutions.

For more about this release, see: http://ckan.org/2011/11/09/ckan-1-5-release/

We’d also like to take this opportunity to thank the amazing on-line community around CKAN for their continued ideas, suggestions and support in the development of this open source data hub software.
