Help Turn Voices from BBC Radio into Open Data for Wikipedia

This is a cross-posting from the OpenGLAM blog written by Michael Smethurst, development producer at the BBC – see the original post.

An invite

On Saturday, 18th January 2014 between 10am and 5pm the BBC is teaming up with the Open Knowledge Foundation’s OpenGLAM initiative, Creative Commons UK and the Wikimedia community to host an open event in the Media Cafe, New Broadcasting House, Portland Place, London (map here). Attendees will be given access to the Radio 4 permanent audio archive, the tools to take samples of voice recordings and the opportunity to upload them to Wikimedia Commons for inclusion into Wikipedia.

If you’d like to come along you can sign up for the day on EventBrite. The first 25 non-BBC sign ups will be given a free tour of Broadcasting House.

Some background

Back in 2012 Andy Mabbett, an active Wikipedian and open cultural data advocate, published a blog post requesting open-licensed, open-format recordings of the voices of Wikipedia subjects for Wikimedia Commons. The request went on to become the Voice Intro Project and some examples can be found here. In September Andy talked about the project to the Wikimedia UK blog:

The idea is to let Wikipedia readers find out what the people we write about sound like [..] It’s great that we can hear the voices of people like Gandhi and Alexander Graham Bell, but what about all the other historic figures, whose voices are lost forever? We shouldn’t let that happen when we have the technology and resources so easily available. Sure, some of our subjects are known for media appearances, but those aren’t necessarily available globally nor under an open licence.

Andy’s original post was spotted by Tristan and passed around R&D. Our first thought was, “we’ve got lots of voices”. Our second thought was, with some adjustment this could be a useful hook for institutions like the BBC and beyond with large, digitised audio archives but sparse metadata and no way to know who’s speaking in them.

Generating Linked Open Data from Open Content

As part of the ABC-IP project Yves built a speaker recognition algorithm that scales to a large number of speakers, based on the LIUM speaker diarization toolkit. The software is able to recognise voice patterns and identify where the same voice box speaks across a large audio (or video) archive. Unfortunately, it doesn’t identify an actual person and doesn’t give us a name / identity for the person speaking.

The results can be seen on this episode of From Our Own Correspondent and this aggregation of episodes featuring the voice of Orla Guerin from the World Service archive (you’ll need to be signed in to see them). The names have been provided by users of the archive and are just strings, not identifiers for “things”.

The BBC makes extensive use of identifiers from the Wikimedia family (Wikipedia and Wikidata) and related projects (e.g. DBpedia) so it would be better if we could associate voice boxes with Wikipedia, DBpedia or Wikidata concept identifiers. This would allow us to surface programmes about and featuring person X.

As a piece of research we’re looking to investigate whether voice samples on Wikimedia Commons and Wikipedia could be used to generate a voice box “fingerprint” which could then be used to identify speakers across a large archive. This would close the circle: archive audio to speaker recognition to Wikimedia voice fingerprint to Wikipedia, DBpedia or Wikidata identifier to Linked Open Data for speakers in an archive.
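The matching step in that circle could be sketched roughly as follows. This is only an illustration, not the software R&D is building: it assumes each anonymous speaker cluster and each reference voice sample has already been reduced to a numeric vector, and it compares them with cosine similarity, which is a stand-in technique chosen for simplicity. The function names, the threshold, the vectors and the identifiers are all made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify_speaker(cluster_vector, reference_fingerprints, threshold=0.8):
    """Return the identifier of the best-matching reference fingerprint,
    or None if nothing scores at or above the (hypothetical) threshold."""
    best_id, best_score = None, threshold
    for identifier, fingerprint in reference_fingerprints.items():
        score = cosine_similarity(cluster_vector, fingerprint)
        if score >= best_score:
            best_id, best_score = identifier, score
    return best_id


# Hypothetical fingerprints derived from openly licensed voice samples,
# keyed by placeholder identifiers (not real Wikidata IDs).
reference_fingerprints = {
    "wikidata:EXAMPLE_PERSON_A": [0.9, 0.1, 0.3],
    "wikidata:EXAMPLE_PERSON_B": [0.1, 0.8, 0.5],
}

# Hypothetical vectors for anonymous speaker clusters found in an archive.
archive_clusters = {
    "cluster-1": [0.88, 0.12, 0.31],
    "cluster-2": [0.0, 0.1, -0.9],
}

# Attach an identifier to each cluster where a confident match exists.
labels = {
    name: identify_speaker(vector, reference_fingerprints)
    for name, vector in archive_clusters.items()
}
```

In this toy data the first cluster sits very close to one reference fingerprint and picks up its identifier, while the second matches nothing and stays unlabelled. Leaving a cluster unnamed rather than guessing is probably the safer default for archive metadata.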

To do that we’d need longer (duration) and higher quality samples than suggested by the Voice Intro Project. So we’re looking to upload 30-40 second voice samples losslessly encoded as FLAC. We’ve created a few examples:

Mark Carney

Mary Robinson

Justin Welby

Any software we create to do this will be open sourced and (obviously) the voice samples will be openly licensed, so other researchers and cultural institutions will be able to use the same methods to annotate audio / video with identified speakers and, hopefully, contribute to the project by uploading voice samples from their own archives. By releasing small nuggets of their archives they’d be both improving Wikipedia and putting just enough in place to make the further contextualisation of their (and other) archives possible.

Details of the day

On the day we’ll be giving out access to Snippets, an R&D tool built on top of Redux. Snippets gives access to everything broadcast by the BBC since ~2007. For rights reasons we’ll only be uploading voice samples from the selection of Radio 4 news and factual programmes with permanent availability. A list of permanently available Radio 4 programmes can be found below the original post on the BBC’s website.

Before we meet up it would be good if you could have a listen to some of these programmes and identify interesting people and suitable 30-40 second samples. If you’d like to come along you can sign up to attend here. Please bring along a laptop and some headphones. Food and drink will be provided.