A few weeks ago we announced our collaboration with Brazil’s Office of the Comptroller General and Uruguay’s Agency for Electronic Government and the Information and Knowledge Society to prototype a Model Context Protocol (MCP) bridge between LLMs and open data portals. The goal is simple: let citizens ask natural‑language questions and get answers that are traceable back to official datasets.
But once we started building, we discovered that connecting an AI assistant to an open data portal was relatively straightforward. The bigger challenge was helping the system understand what the data actually means. A column name is not a definition, a dataset title is not a methodology, and a time series does not explain its own gaps. When an AI system lacks that context, it will often try to fill the gap itself.
That is where a technical integration becomes a trust problem.
Two countries, two datasets, one architecture
The pilots in Brazil and Uruguay shared the same architecture but focused on very different datasets.
In Brazil, the pilot focused on a dataset about parliamentary amendments, one of the most requested datasets on the country’s open data portal. In Uruguay, the pilot worked with the National Energy Balance, a set of datasets covering areas such as energy imports, generation, and installed capacity.
The setup was simple: an MCP server connecting the AI to the datasets, a chat gateway providing a basic user interface, and tools that describe which questions can be asked and how to answer them properly.
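To make the idea of "datasets as tools" concrete, here is a minimal sketch in plain Python. It is not the pilots' code and does not use the MCP SDK. The tool name, the source identifier, and the rows are hypothetical placeholders, chosen only to show how a tool pairs a description (what the model reads) with a handler (what fetches the data).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tool:
    name: str
    description: str  # the text the model reads when deciding whether to call the tool
    handler: Callable[[dict], dict]


def query_energy_imports(args: dict) -> dict:
    # Placeholder rows with made-up values; a real handler would fetch
    # fresh data from the open data portal instead.
    rows = [{"year": 2020, "value": 1.0}, {"year": 2021, "value": 2.0}]
    year = args.get("year")
    selected = [r for r in rows if year is None or r["year"] == year]
    # Returning the source alongside the rows keeps the answer traceable.
    return {"source": "energy-imports (hypothetical id)", "rows": selected}


TOOLS = {
    tool.name: tool
    for tool in [
        Tool(
            name="query_energy_imports",
            description=(
                "Returns yearly energy import rows. Optional argument: year. "
                "Contains no climate or price data; do not infer causes."
            ),
            handler=query_energy_imports,
        )
    ]
}


def list_tools() -> list[dict]:
    # What the assistant sees: names and descriptions, not the implementation.
    return [{"name": t.name, "description": t.description} for t in TOOLS.values()]


def call_tool(name: str, args: dict) -> dict:
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name].handler(args)
```

In this sketch the description is the only place where the model learns what the tool does not contain, which is why its wording matters as much as the handler.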

Both countries selected datasets based on real user demand, and partners provided frequently asked questions that became the basis for testing. These were not demonstration datasets; they were datasets that people already use and ask questions about.
Understanding the data took more time than coding
One of the reassuring findings from the pilots was that the MCP layer did what it was supposed to do. It connected to the data, exposed tools, and gave the language model a structured way to get fresh data from the portal instead of relying only on its general training or on copied-and-pasted context.
But once the connection worked, the question changed. It was no longer simply: can the model access the data? The question became: does the model understand what the data represents well enough to answer responsibly?
The quality of the answers depended heavily on the quality of the descriptions around the data.
For every dataset and tool, someone had to explain what the data meant, what the fields represented, what units were being used, what time periods were covered, and what assumptions should not be made.
This sounds simple until you try to write those descriptions well.
What does an “amendment” include in this dataset? Are different types comparable? For an energy dataset, what units are being used? Does “import” include transit and re-exports? Is the data provisional or final?
These are not details that either developers or AI systems can safely infer from a column name. They require domain knowledge, and they determine whether an AI-generated answer is useful, misleading, or simply wrong.
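One way to keep track of those questions is to record the context in a structured form and list what is still missing. The sketch below is an illustration under assumed names (FieldDoc, DatasetDoc, and missing_context are hypothetical, not part of the pilots or of MCP). It cannot write the definitions, which still takes someone who knows the data, but it can show where they are absent.

```python
from dataclasses import dataclass, field


@dataclass
class FieldDoc:
    name: str
    definition: str = ""
    unit: str = ""  # use "n/a" for fields that have no unit, such as a year


@dataclass
class DatasetDoc:
    title: str
    coverage: str = ""  # time period covered
    status: str = ""    # for example "provisional" or "final"
    caveats: list[str] = field(default_factory=list)  # assumptions that should not be made
    fields: list[FieldDoc] = field(default_factory=list)


def missing_context(doc: DatasetDoc) -> list[str]:
    """List the pieces of context that nobody has written down yet."""
    gaps = []
    if not doc.coverage:
        gaps.append("dataset: coverage period not stated")
    if not doc.status:
        gaps.append("dataset: provisional/final status not stated")
    if not doc.caveats:
        gaps.append("dataset: no caveats listed")
    for f in doc.fields:
        if not f.definition:
            gaps.append(f"{f.name}: no definition")
        if not f.unit:
            gaps.append(f"{f.name}: no unit")
    return gaps


# Hypothetical, partly documented dataset used only to exercise the check.
doc = DatasetDoc(
    title="Energy imports (hypothetical example)",
    fields=[
        FieldDoc(name="year", definition="Calendar year", unit="n/a"),
        FieldDoc(name="import_value"),
    ],
)
gaps = missing_context(doc)
```

Here gaps lists the three dataset-level omissions plus the missing definition and unit for import_value, which is the kind of list a domain expert can then work through.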
When the AI invented a plausible explanation
One example from the Uruguay pilot made this especially clear.
The system was answering a question about changes in the energy dataset when it introduced “climate factors” as part of its explanation.
The problem was that there was no climate data in the dataset being used.
The explanation sounded plausible, but the system had no evidence for that claim in the data available to it.
Traceability helps, but it does not solve the whole problem. A cited source can show where a number came from, but it does not automatically prevent the model from adding an explanation that the source does not support.
That is why human review during development was essential. Someone familiar with the data had to check whether an answer was actually supported by the dataset. The goal is not to stop the AI from explaining the data, but to ensure it can distinguish between what the data supports and what it cannot verify.
Tool descriptions improved through real questions
That review did more than catch individual mistakes. It also showed us where the system needed better guidance.
In an MCP-based architecture, tool descriptions help the model understand what a tool does, when to use it, and how to interpret the results. But the best descriptions did not emerge from writing documentation in isolation. They emerged through testing real questions, reviewing the answers, and refining the guidance with the pilot partners.
Their frequently asked questions became test cases. Their feedback revealed where the model misunderstood the data, where descriptions were too vague, and where the system needed clearer limits.
Instead of asking whether the AI could answer questions about a dataset in general, we focused on whether it could answer the questions people actually ask, show where the answers came from, and avoid unsupported interpretations.
That proved far more useful than a generic benchmark.
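As a rough illustration of how frequently asked questions can become repeatable test cases, the sketch below checks two things for each answer: that the expected dataset is cited, and that the text does not mention topics the dataset cannot support. All names here (FaqCase, review, the "energy-imports" identifier, the sample answer) are hypothetical. The keyword match is deliberately crude and is a first filter at most; it does not replace the review by people who know the data.

```python
from dataclasses import dataclass, field


@dataclass
class FaqCase:
    question: str
    expected_source: str  # dataset the answer should cite
    unsupported_terms: list[str] = field(default_factory=list)  # topics absent from the data


def review(case: FaqCase, answer: dict) -> list[str]:
    """Return the problems found in one answer; an empty list means none were detected."""
    problems = []
    if case.expected_source not in answer.get("sources", []):
        problems.append("expected source not cited")
    text = answer.get("text", "").lower()
    for term in case.unsupported_terms:
        if term.lower() in text:
            problems.append(f"mentions '{term}', which the dataset cannot support")
    return problems


# Hypothetical case modelled on the "climate factors" example described earlier.
case = FaqCase(
    question="How did energy imports change over the period?",
    expected_source="energy-imports",
    unsupported_terms=["climate"],
)
answer = {
    "text": "Imports changed, partly due to climate factors.",
    "sources": ["energy-imports"],
}
problems = review(case, answer)
```

In this example the answer cites the right dataset but is still flagged for mentioning climate, which mirrors the point that traceability alone does not catch unsupported explanations.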
What this means for open data teams
A major motivation for using MCP in this context is traceability. If an AI assistant answers a question about public data, users should be able to see where the answer came from.
But the pilots reminded us that traceability is only one part of trust. A system can cite the right dataset and still make an unsupported inference, misunderstand a unit, or overstate what the data shows.
For open data publishers, AI readiness is not only about making data machine-readable. It is also about providing enough context (clear metadata, definitions of key concepts, limitations, and examples of valid and invalid questions) for an AI system to interpret the data correctly.
The early lesson from these pilots is simple: the technical architecture can make data accessible to a model, but only the people who understand the data can make the answers trustworthy.
What comes next
So far, the work has focused on building the bridge: connecting open data portals to AI assistants, defining datasets as tools, and testing answers with pilot teams.
The next step is user testing. That will raise a different set of questions. What happens when real users ask questions in unexpected ways? How much ‘hallucination’ remains after better tool descriptions and source traceability? When does traceability build trust, and when does it add friction? These are the questions we need to answer next.
You are welcome to join this discussion at the Open Knowledge Forum and share your thoughts, too.
Acknowledgement
We are grateful for the Patrick J. McGovern Foundation’s (PJMF) generous support and our continued partnership in enhancing digital literacy and investing in AI for the public good. Learn more about its charitable programmes here.
Photo: Spencer Davis/Unsplash






