Roundtable Recap: How to digitise a physical archive to work strategically with AI?

In an age of AI, what should someone digitising a physical archive consider before they begin? This question was at the heart of an online roundtable Open Knowledge hosted last week as part of our AI Learning Labs partnership with AVANCSO – the Association for the Advancement of Social Sciences in Guatemala – whose documentation centre is at risk.

The full recording is available with automatically generated subtitles above.

We heard from initiatives around the world that digitised archives of documents, photos, and audio, following different paths to preservation. Each told a story of breathing life into dusty archives by changing the way people are able to interact with the content.

AVANCSO’s documentation centre holds historical records of social and indigenous movements for human rights through decades of war in Central America. With Open Knowledge’s AI Learning Labs, anthropologist Alejandro Flores Aguilar is developing a plan for how to preserve its knowledge for future generations, including by using AI. So how should an organisation in the Global South approach digitisation, open source platforms, sustainability, governance, security risks, and data and technological sovereignty?

Languages, oral history, and multimedia

Subhashish Panigrahi has helped document indigenous and under-resourced languages in India and South Asia for over a decade. He founded OpenSpeaks in 2017 to build technical infrastructure to support more communities in documenting their own languages in order to fight epistemic inequality. Recently, they launched new open captioning and subtitling tools that work offline or with slow internet. In addition to software, OpenSpeaks develops educational resources and helps train people. Regarding safely digitising human rights testimonies, Subhashish said they try to avoid subjecting people to trauma from retelling their stories, unless they can offer mental health support. OpenSpeaks also developed an Oral Knowledge Framework to help guide ethical recording and publishing.

Subhashish also shared stories of digitising materials from the private archives of writers and researchers that may otherwise never have seen the light of day. One example was the research for a book about traditional healers and medicinal plants that was never published. Once materials are shared with Wikimedia and Wiktionary, their utility can quickly expand with volunteer community support. “When a physical archive’s knowledge is digitised, sometimes they get a life of their own, and you can’t always predict what it is,” he said.

Open archives for historic photography

For the past decade, Felipe Bengoa Trucco has led the project Enterreno, an archive of more than 100,000 historic photos from Chile submitted by more than 7,000 people. He says they now use AI to support the digitisation, coding and collaboration process, including for the development of tools and games on the platform. Providing access to archives is not enough, he said: you also need good design and participation tools for the discovery of photos. “So that it’s not just a repository, but a living space,” he said.

Some years ago, he was approached by a group of people forming a community of “bottom-up” (meaning crowd-sourced) archives. It’s called the Open Portal Archive Network (OPAN), and today Felipe is its president. “We neglect the family archives, the day to day memories and emotions that people have and want to share,” he said. They list projects from more than a dozen countries. Anyone can join. OPAN develops guides on best practices. “Usually when we talk with little archives they don’t have a lot of resources, so we give them guidelines on which equipment and day to day technology to use,” he said.

Recently, Felipe has noticed tens of thousands of bots visiting Enterreno to scrape images for AI. “We are open access, but we have to begin to give friction to the bots,” he said. “We have Creative Commons licences, but they don’t care about this.” He said it would be going backwards to put images behind a paywall or block them from indexing by search engines, but he’s still trying to figure out what to do. Felipe said OPAN’s ambition is to federate the power of the archives, so they can negotiate in the future. “These companies are so big, and we don’t have any say. The economic part of this is that the bots are using our resources. We don’t want to have an archive for bots, but for living people.”

Paper journals and journalism

Gabriela Manuli is the director of special projects at the International Press Institute (IPI), the oldest global network supporting press freedom and independent journalism.

The day after our roundtable, IPI announced the launch of its new digital archives: 18,000 pages (and 75 years) of press freedom history, now available to and searchable by journalists, researchers, and the general public (for a small fee).

Gabriela told the story of how they explored many paths for the project, including how to work towards an AI commons with Open Knowledge. Eventually, they found that the most practical and affordable option in the near term was to strike up a partnership with a private company in Budapest called Arcanum, which already holds a vast repository of digitised news and scholarly archives that more than 200 educational institutions have access to. From one day to the next, digitisation and distribution were made possible at zero cost.

For now, Gabriela said, they have only digitised copies of the IPI Report, the organisation’s flagship monthly publication from 1952 to 2005. There are still many photos, documents, and other archive materials that they have yet to decide how to handle.

Regarding sensitive materials, Gabriela advised AVANCSO to set up an advisory committee to define criteria and governance rules for what is shared publicly (or not).

Another project mentioned in the roundtable was Open Knowledge Greece’s ARXIVE-Echoes (European Collaborative Cloud for Cultural Heritage). The team is working on interoperable methods for sharing archive data from cultural heritage institutions in Europe, like libraries and museums, so the data can be more openly available, including to AI models. The project is at an early stage of building a network of people in different countries and gathering requirements for digital tools.

We would like to extend our heartfelt thanks to our guest speakers and to everyone who contributed to the discussion in the chat.

If you are interested in our continually evolving learning exchanges on this topic, drop into the Open Knowledge Forum, where you can share links or reflections.

About

Open Knowledge’s AI Learning Labs is an initiative that aims to experiment with AI, translate knowledge from social sector organisations around the world, and produce public, multilingual AI-literacy resources tailored for organisations addressing similar issues elsewhere.

Together, we will catalyse learning and develop replicable methods to help organisations build AI skills, use AI responsibly, and develop their own AI projects. All resources will be openly available at School of Data.

This project has been made possible thanks to the generous support of the Patrick J. McGovern Foundation (PJMF). We are grateful for our ongoing partnership in promoting digital literacy and investing in AI for the public good. Learn more about its funding programmes here.