Open Data Editor in Action: Building Culturally Accurate and Global South-Sensitive AI Datasets

This text shows a real case of how the Open Data Editor (ODE) impacted the workflow of an organisation working to serve the public good.

Community engagement and data collection on Smot and on Indigenous Knowledge Systems relating to indigenous plants, in Cambodia (top left and bottom) and South Africa (top right).

Organisation: An AI of Our Own (AAOO)
Location: Cambodia 🇰🇭 / South Africa 🇿🇦
Knowledge Area: Cultural Heritage
Type of Data: Indigenous & Traditional Knowledge Systems (Medicinal Plants and Cultural Art Forms)

How do you build an artificial intelligence model that accurately represents the cultural heritage of communities often excluded from the digital record? For An AI of Our Own (AAOO), the challenge was not just collecting data from rural communities in South Africa and Cambodia, but transforming a mix of structured and unstructured information into a clean, standardised, and well-annotated dataset. Their work, aimed at creating Afrocentric and Global-South-centric AI models, required a tool that could bring clarity and consistency to complex cultural data. The Open Data Editor became the central tool for this meticulous process.

The Challenge

AAOO’s practice is to collect unbiased cultural data directly from source communities. In South Africa, this involved documenting indigenous knowledge of medicinal and ritual plants through interviews with community members and traditional healers. Data was collected via digital forms (using tools like KoboToolbox and Epicollect5) and through unstructured methods such as audio recordings, pictures, and video.

This multi-method approach created a significant data harmonisation problem. The exported data was messy and difficult to interpret: viewing the raw data in a standard spreadsheet resulted in a “misrepresentation of the data,” where information was jumbled and unclear. The team faced two core issues:

Lack of Standardisation: Data from digital forms and transcribed interviews had different formats (e.g., dates were inconsistent), making it impossible to combine them for analysis.
Poor Data Presentation: The raw tables were not “presentable,” hindering the team’s ability to interpret the information and prepare it for the critical steps of annotation and labeling required for AI model training.
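To make the standardisation issue concrete, here is a minimal Python sketch of what inconsistent dates look like and how they can be brought into one format before two sources are combined. ODE itself is a no-code tool, so this is only an illustration of the underlying problem, not AAOO's or ODE's method. The column names, sample rows, and the list of date formats are all assumptions; the source does not say which formats the team actually encountered.

```python
from datetime import datetime

# Hypothetical formats: the case study does not list the real ones.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d-%m-%Y")


def normalise_date(value, formats=KNOWN_FORMATS):
    """Return the date as ISO 8601 (YYYY-MM-DD), or None if no format matches."""
    text = value.strip()
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next candidate format
    return None


# Illustrative rows only: one from a digital form, two from transcribed interviews.
form_rows = [{"plant": "example plant A", "recorded": "2024-03-05"}]
interview_rows = [
    {"plant": "example plant B", "recorded": "05/03/2024"},
    {"plant": "example plant C", "recorded": "05-03-2024"},
]

# Once every date uses the same format, the two sources can share one table.
combined = [
    {**row, "recorded": normalise_date(row["recorded"])}
    for row in form_rows + interview_rows
]

for row in combined:
    print(row)
```

Values that match none of the listed formats come back as `None` rather than being guessed, so they can be reviewed by hand.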

Structured field data exported from a data collection tool (Epicollect5) as a CSV file, shown in an Excel sheet.

The Solution

AAOO imported their collected data – from both structured forms and transcribed interviews – into the Open Data Editor. ODE provided an immediate visual advantage, presenting the same data that was messy in a spreadsheet in a clean, organised, and easily interpretable table.

The key features that empowered their work were:

Data Standardisation: ODE allowed the team to identify and rectify formatting inconsistencies. They could standardise date formats and other variables, ensuring that data from different collection methods could be seamlessly unified. This was essential for creating a single, reliable master dataset.
Descriptive Annotation: A critical feature for AAOO was the ability to add detailed descriptions to each column. This process of adding context and meaning to each data point is fundamental to building a high-quality, culturally nuanced AI dataset.
Error Identification: When gathering online source data for the traditional Cambodian art form of Smot, ODE helped flag inconsistencies, such as mismatched data sources (e.g., a blog entry in a list of YouTube channels), which were easy to miss in a manual review.
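The mismatched-source example can be pictured with a short Python sketch. The case study does not describe how ODE performs this check, so the following is an independent illustration of the idea rather than ODE's implementation. The function name `flag_non_youtube`, the `url` field, the set of accepted hosts, and the sample rows are all hypothetical.

```python
from urllib.parse import urlparse

# Assumption: the hosts that count as YouTube for this illustrative check.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def flag_non_youtube(rows, url_field="url"):
    """Return (row_number, url) pairs whose host is not a known YouTube host."""
    flagged = []
    for number, row in enumerate(rows, start=1):
        host = (urlparse(row[url_field]).hostname or "").lower()
        if host not in YOUTUBE_HOSTS:
            flagged.append((number, row[url_field]))
    return flagged


# Illustrative list that is supposed to contain only YouTube channels.
sources = [
    {"title": "Channel A", "url": "https://www.youtube.com/@example-channel"},
    {"title": "Blog post", "url": "https://blog.example.org/smot-recordings"},
]

print(flag_non_youtube(sources))
```

Here the second row would be reported, which is the kind of entry that is easy to overlook when scanning a long list by eye.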

Structured field data exported from a data collection tool (Epicollect5) as a CSV file, shown in ODE.

Errors flagged in unstructured data that was converted to a structured format without a standard format for fields such as timestamps (data on Indigenous Knowledge Systems on plant use).

Flagged inconsistencies, such as mismatched data sources (e.g., a blog entry in a list of YouTube channels on Smot data)

ODE lets the team describe the data in each column in full, for easy tracking and understanding.
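As an illustration of what column-level descriptions can look like when written down as metadata, the sketch below uses a structure modelled on the Frictionless Table Schema, where each field can carry a `name`, a `type`, and a `description`. Whether ODE stores AAOO's descriptions in exactly this shape is an assumption, and the column names and wording are hypothetical.

```python
import json

# Hypothetical columns for a plant-knowledge table; not AAOO's real schema.
schema = {
    "fields": [
        {
            "name": "plant_name",
            "type": "string",
            "description": "Local name of the plant as given by the interviewee.",
        },
        {
            "name": "recorded",
            "type": "date",
            "description": "Date of the interview, in ISO 8601 format.",
        },
        {"name": "use", "type": "string", "description": ""},
    ]
}


def undocumented_columns(schema):
    """List the names of columns whose description is missing or blank."""
    return [
        field["name"]
        for field in schema["fields"]
        if not field.get("description", "").strip()
    ]


print(json.dumps(schema, indent=2))
print("Columns still needing a description:", undocumented_columns(schema))
```

Keeping the descriptions next to the column definitions means the context travels with the dataset, and a simple check like `undocumented_columns` shows which columns still need annotating.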

The Results

By using ODE as their central data cleaning and standardisation platform, AAOO achieved foundational results for their ethical AI project.

Created a Master Index: ODE enabled the team to build a clean, well-documented “master index” for their cultural heritage datasets. This index serves as the single source of truth for all subsequent data curation and model training activities.
Enabled Accurate Interpretation: The clear presentation of data within ODE allowed the team to interpret their findings accurately, moving from a “misrepresentation” to a true reflection of the knowledge shared by communities.
Laid the Groundwork for Language Models: The process of standardising and annotating data within ODE is a direct contribution to AAOO’s goal of creating AI models that are built on respectful, accurately represented, and ethically sourced data from the Global South.

Data collection in progress (Ikat weaving technique in Cambodia)

Quote

Tatenda Tavingeyi, Program Coordinator

“With ODE, we can actually interpret our data so easily and it’s well presented. It has allowed us to standardise our data and create a master index for the dataset we are trying to build.”

About the Open Data Editor

The Open Data Editor (ODE) is Open Knowledge’s open-source desktop application for nonprofits, data journalists, activists, and public servants, designed to help them detect errors in their datasets. It’s a free tool for people working with tabular data (Excel, Google Sheets, CSV) who don’t know how to code or don’t have the programming skills to automate the data exploration process.

Simple, lightweight, privacy-friendly, and built for real-world challenges like offline work and low-resource settings, ODE is part of Open Knowledge’s initiative The Tech We Want — our ambitious effort to reimagine how technology is built and used. In October 2025, ODE was recognised as a digital public good by the Digital Public Goods Alliance.

And there’s more! ODE comes with a free online course that can help you improve the quality of your datasets, making your life and work easier.

Download Open Data Editor

↪ Take the course: Learn how to use ODE

All of Open Knowledge’s work with the Open Data Editor is made possible thanks to a charitable grant from the Patrick J. McGovern Foundation. Learn more about its funding programmes here.