In collaboration with Romina Colman
How can AI help non-technical users validate and improve the quality of their data in the Open Data Editor, taking into account transparency, privacy, and functionality?
In this blog post, we reflect on the collaboration, process and outcomes of integrating an AI feature into the Open Data Editor (ODE). We describe the challenges for which AI could provide a solution, our exploration of potential AI features, and the first implemented AI feature, which helps users better understand their tables of data. Finally, we reflect on this integration and outline the roadmap for further AI features to improve ODE’s functionality and user experience.
Objectives and challenges
The current functionality of the Open Data Editor is aimed at providing “data validation and basic cleaning” capabilities to improve the quality of data in tables. In plain language, ODE checks for errors in tables according to specific rules. In ODE, these rules are defined by Frictionless Data, an OKFN initiative that provides standards and software implementations to improve data quality and interoperability.
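To make “checking for errors according to specific rules” concrete, here is a minimal sketch of rule-based table validation using only the Python standard library. It is not the Frictionless implementation: the function `check_table` and the rule names are illustrative assumptions, chosen only to show the general idea of applying fixed rules to a header and its rows.

```python
import csv
import io


def check_table(csv_text: str) -> list[tuple[int, str]]:
    """Return (row_number, rule_name) pairs for every rule violation found."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return [(0, "empty-table")]

    errors = []
    header = rows[0]

    # Header rules: every column needs a non-blank, unique label.
    seen = set()
    for label in header:
        if not label.strip():
            errors.append((1, "blank-label"))
        elif label in seen:
            errors.append((1, "duplicate-label"))
        seen.add(label)

    # Row rules: no fully blank rows, and each row must match the header width.
    for number, row in enumerate(rows[1:], start=2):
        if not any(cell.strip() for cell in row):
            errors.append((number, "blank-row"))
        elif len(row) != len(header):
            errors.append((number, "cell-count-mismatch"))
    return errors


# Illustrative sample with a duplicated label, a blank row and a short row.
sample = "id,name,name\n1,Ana,Lopez\n,,\n2,Ben\n"
for row_number, rule in check_table(sample):
    print(f"row {row_number}: {rule}")
```

A tool for non-technical users would then translate each rule name into a plain-language message, which is where much of the interface work described below comes in.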
ODE is a unique tool in that it offers these capabilities to those non-technical data practitioners who typically analyse individual data files or data from public sources in a more ad hoc manner. Existing ‘data observability’ tools, such as Metaplane and Monte Carlo Data, are typically aimed at a technical audience to facilitate robust integration into large-scale data pipelines. Building a data preparation tool for non-technical users remains a major challenge and requires special attention to the interface, level of abstraction, and interactions. Writing the code is therefore only one side of the coin. A combination of soft and technical skills is needed to ensure that complex technical terms and feature implementations are understandable and transparent to people who will not necessarily investigate how something that looks simple, such as an AI button in the app, actually works.
The collaboration between the ODE team and myself, Madelon Hulsebos, as an AI Consultant, was prompted by a desire to explore how Artificial Intelligence (AI) features could be used to enhance the core functionality of ODE and help users.
First pre-work meeting
As a first step, the team’s Product Owner, Romina Colman, and I met to discuss the status of the Open Data Editor and the shortcomings of related tools that the team had identified through a survey. We found that it can be difficult for users to understand how to use the tool and how to interpret its interactions (e.g. error messages). We also concluded that for non-technical users of ODE, it is important to provide transparency about what is happening, why and how, and to ensure the privacy of the user’s data.
The key question, therefore, is how AI can enhance the ODE’s ability to validate and improve data quality in a way that is transparent, privacy-preserving and trustworthy to a non-technical audience.
Exploring AI-driven features
The Open Data Editor team had initially identified 3 ideas:
- Improving key metadata elements
- Suggesting analysis questions
- Reporting table statistics
Based on this, it seemed that the first idea, semantic metadata refinement, such as “column descriptions” and “descriptive column names”, was at the core of ODE’s capabilities and that this feature would significantly help users to better understand their data.
I reviewed the ideas generated by the ODE team and enriched them with further suggestions that would improve the core functionality as well as some ideas for extending the functionality of the application. The ideas were described along the following list of dimensions:
- Feature type (core/extension)
- Motivation
- Implementation complexity
- User data need
- Priority
I proposed three additional AI features that would assist users in using the ODE by 1) interpreting error messages, 2) answering questions about the ODE documentation, and 3) summarising and contextualising the table metadata generated by ODE.
In addition to these features, I identified three other features that would extend ODE capabilities:
- Suggesting relevant data quality checks for the dataset at hand
- Suggesting ‘data repair’ based on error messages after running quality checks
- Suggesting a complete ‘data processing plan’ based on the dataset and the intended analysis
These features would proactively guide a non-technical user in validating and improving the quality of their data with ODE, extending its current capabilities.
Refinement and prioritisation
The team met to reflect on the ideas from different perspectives: backend, frontend, and product. The aim of this meeting was to come up with a list of priorities and an action plan. The ODE team prioritised four ideas based on functionality, in order:
- User-friendly error interpretation
- Answering questions about documentation
- Generating table statistics
- Improving table metadata (column names)
The team crystallised these feature ideas and I provided additional input based on open questions.
Based on insights from a few testing sessions with community members, the implementation effort versus the release timeline, and coordination with the Frictionless community, the team decided to start with improving table metadata, by suggesting improved column names and column content descriptions.
Development and experimentation
After identifying the AI feature that the team wanted to focus on first, they developed an initial implementation of the feature, taking into account the key values:
- Privacy of user data
- Functionality
- Transparency of the use of AI and references to OpenAI’s terms and conditions
After providing instructions for the AI implementation, Romina tested it from the product side. Later, she had a meeting with me to ensure that the ODE team had not overlooked any relevant elements in the implementation. In fact, during this call, I noticed that the AI box, which asks the user to insert an OpenAI key, did not contain any references to explain how to obtain it. We added a link to the OpenAI documentation, and as OKFN has just launched a general course to help people work with open data, we asked the instructors to explain what a key is. We also added text and a link to allow users to check the terms and conditions of OpenAI. Finally, the link to this blog post will be added to the ODE’s user guide, so that people can also read more about the implementation process and decisions there.
The current pipeline for the AI feature is as follows:
- Given a particular table uploaded to ODE, the user clicks on the “AI” button.
- A dialogue informs the user that only the table header is sent to OpenAI.
[Note to readers: On 18 December, our team held the first group user test for the Open Data Editor stable release. One of the participants suggested changes to this message, such as including the name of the third party that ODE uses for AI integration (OpenAI), and some additional clarification regarding the steps that follow when the user clicks ‘Confirm’. We will be releasing a new version with these changes soon.]
- On proceeding, the user is asked for their OpenAI key.
- On proceeding, the user is shown the editable prompt that will be sent to OpenAI.
- On proceeding, ODE makes an LLM call with the key, the table header, and the current prompt, asking for, per column, 1) an improved column name and 2) a description.
- The user is shown the LLM-generated table description.
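The pipeline above can be sketched roughly as follows. This is a hedged illustration rather than ODE’s actual code: the function names, the prompt wording and the placeholder key are assumptions, and the real OpenAI request is replaced by a stub (`fake_llm`) so that the example runs offline. The point it shows is that only the header row ever reaches the prompt.

```python
import csv
import io
from typing import Callable


def read_header(csv_text: str) -> list[str]:
    """Return only the first row (the column names); data rows are not read."""
    reader = csv.reader(io.StringIO(csv_text))
    return next(reader, [])


def build_prompt(header: list[str]) -> str:
    """Compose the request text from the column names alone."""
    columns = ", ".join(header)
    return (
        "For each of these column names, suggest an improved column name "
        f"and a short description: {columns}"
    )


def describe_table(
    csv_text: str, api_key: str, call_llm: Callable[[str, str], str]
) -> str:
    """Run the steps in order: header -> prompt -> LLM call -> generated text."""
    prompt = build_prompt(read_header(csv_text))
    return call_llm(api_key, prompt)


def fake_llm(api_key: str, prompt: str) -> str:
    # Hypothetical stand-in for the real OpenAI request; it echoes the prompt.
    return f"[stub reply to] {prompt}"


table = "cust_id,dob\n17,1990-04-02\n"
reply = describe_table(table, api_key="sk-placeholder", call_llm=fake_llm)
print(reply)
```

Passing the LLM call in as a function keeps the privacy-relevant part (what goes into the prompt) separate from the provider-specific part, which makes it easy to check that no cell values are sent.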
Upon further review, we identified several revisions that were important for the first release of ODE with the integrated AI feature:
- The user experience for activating the feature, i.e. generating column-level metadata, should be more descriptive than “AI”, e.g. “describe this table”.
- The user should be told where and how to find their OpenAI key, the terms and conditions of OpenAI, and that the data will not be stored in ODE or shared externally.
- To avoid confusion, the prompt to the LLM should not be displayed or editable in advance. Instead, it can be shown and edited after the initial output is generated, for advanced use.
- The prompt should be preset, and the LLM should be forced to adhere to the desired “structured output” of the metadata (e.g. by providing a schema and requesting JSON output). Requirements that cannot be enforced in this way can be built into the natural language prompt, e.g. that the output should be short, or that a particular language is required.
- Persist the generated output as metadata for future use or publication, and make the output useful. For example, give the user the option (via a button) to use the generated column names to replace the current ones.
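The last two revisions can be illustrated with a small sketch: checking that the LLM reply follows an expected structure, and then letting the user apply the suggested column names. The field names (`original_name`, `suggested_name`, `description`) and both functions are hypothetical, chosen for illustration only; ODE’s actual schema may differ.

```python
import json

# Assumed shape of the reply: a JSON array with one object per original column.
REQUIRED_KEYS = {"original_name", "suggested_name", "description"}


def parse_column_metadata(raw_reply: str, header: list[str]) -> list[dict]:
    """Parse the reply and reject anything that deviates from the expected shape."""
    data = json.loads(raw_reply)  # invalid JSON raises a ValueError subclass
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of column entries")
    for entry in data:
        if not isinstance(entry, dict) or set(entry) != REQUIRED_KEYS:
            raise ValueError(f"unexpected entry: {entry!r}")
        if not all(isinstance(value, str) for value in entry.values()):
            raise ValueError(f"non-text value in entry: {entry!r}")
    if [entry["original_name"] for entry in data] != header:
        raise ValueError("reply does not cover the table's columns in order")
    return data


def apply_suggested_names(header: list[str], metadata: list[dict]) -> list[str]:
    """Return a new header in which each column uses its suggested name."""
    renames = {m["original_name"]: m["suggested_name"] for m in metadata}
    return [renames[name] for name in header]


header = ["cust_id", "dob"]
# Invented example reply, standing in for what an LLM might return.
raw_reply = json.dumps([
    {"original_name": "cust_id", "suggested_name": "customer_id",
     "description": "Unique identifier of the customer."},
    {"original_name": "dob", "suggested_name": "date_of_birth",
     "description": "Customer's date of birth."},
])
metadata = parse_column_metadata(raw_reply, header)
print(apply_suggested_names(header, metadata))
```

Validating the reply before showing it means a malformed answer can be retried or reported, instead of being persisted as metadata.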
AI roadmap for the Open Data Editor
Integrating AI into the Open Data Editor can have significant value in providing a data quality validation and improvement tool that is accessible to non-technical users. The roadmap consists of three directions:
- Reuse the built-in AI feature to extend capabilities that fit into the same pipeline as described above, so taking as input the table or just its column names, making a single LLM call to generate as output, for example, data validation rules or data analysis questions.
- Link the error message from an executed data validation rule with the context of the ODE features and how to use them (e.g. from the user manual, code or documentation) to generate suggestions on how to “repair” the data.
- Question-answering through the documentation, so that users can ask any question and be directed to the right information in the documentation, for example using a retrieval-augmented generation approach such as that developed by the Scikit-learn team (see blog post). Given the effort this would require, it may be efficient to develop this pipeline together with other product teams in OKFN.
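To give a feel for the documentation question-answering idea, here is a minimal sketch of the retrieval step. It uses a simple bag-of-words cosine similarity instead of the embedding models a production system would normally use, and the documentation passages are invented placeholders, not text from the ODE user guide. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, passages: list[str], top_k: int = 1) -> list[str]:
    """Return the top_k passages most similar to the question."""
    query = tokenize(question)
    ranked = sorted(passages, key=lambda p: cosine(query, tokenize(p)), reverse=True)
    return ranked[:top_k]


def build_grounded_prompt(question: str, passages: list[str]) -> str:
    """Put the retrieved passages in front of the question for the LLM."""
    context = "\n".join(f"- {p}" for p in retrieve(question, passages))
    return f"Answer using only this documentation:\n{context}\n\nQuestion: {question}"


# Placeholder passages for illustration only.
docs = [
    "To upload a file, use the upload button and select a table from your computer.",
    "The errors report lists each validation error with its row and column.",
    "To export a table, open the publish dialog and choose a file format.",
]
question = "Where can I see the validation errors in my table?"
print(build_grounded_prompt(question, docs))
```

Because the answer is grounded in retrieved passages, the user can also be pointed to the relevant documentation page, which supports the transparency goal.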
Conclusion
The key question was how AI can strengthen the Open Data Editor’s ability to validate and improve data quality in a transparent, privacy-preserving and trustworthy way.
In this blog post, we reflected on the process and outcome of the AI feature, and outlined a roadmap for future integrations of AI functionality in ODE. The team successfully integrated its first AI feature: using an LLM to generate enhanced column names along with column descriptions, which helps users understand their data and improve metadata. The implementation of the feature minimises the amount of data actually passed to the LLM: only the table column names are provided, ensuring privacy. The user is actively informed of what is being shared with the LLM, ensuring transparency. When sending the table metadata to the LLM, the prompt is preset in ODE, while the LLM call restricts the generated metadata to be formatted in a structured way, ensuring trustworthy output.
Overall, the final AI feature strengthens the core of ODE by helping users better understand their data before anything is done with it, taking into account the key values of transparency, privacy and trustworthiness.
Read more
- [Announcement] Open Data Editor 1.2.0 stable version release
- Open Data Editor: learnings from the user testing sessions
- Open Data Editor: The tormented journey of an app
- Open Data Editor: 5 tips for building data products that work for people
- Open Data Editor: Meet the team behind the app
- Open Data Editor: What we learned from user research