
We have been seeing projects and lots of conversations around how to integrate the latest AI technologies into open data portals. However, all of them fail to tackle one key challenge: LLM-generated content is not trustworthy, when what we need are factual, data-backed answers.
The evidence on hallucinations is overwhelming, and the most widespread use cases of AI depend on a human supervisor who knows the topic well enough to detect the errors and the bias and steer the AI towards better results. As every major AI provider tells us in their interfaces: “AI-generated content is for reference only”. Given this structural limitation, is there any path for trustworthy integrations of these technologies into open data portals?
At the Open Knowledge Foundation (OKFN), we are focusing our 2026 work to answer this question.
Starting Point: An open and honest technical discussion on the challenges we face
We know the limits of these emerging technologies, so to build trustworthy and lasting solutions we need to create solid foundations and robust architectures. Here are five problems that we need to address for any kind of integration.
Problem #1: Lack of Trustworthiness
We cannot trust what comes out of an LLM, because these models are not designed to tell facts.
Bias, gaps in training data, architectural limitations: we can give this problem many names, but the bottom line is clear: AI content is for reference only.
Problem #2: Lack of Transparency
How do we know how LLMs arrive at their answers? We have spent decades building reproducible research, only to replace it with a machine that follows a different “reasoning” each time it answers. We have no clarity on what data is used to reach a conclusion and no way of reproducing the results. An AI citing where its content comes from is a by-product of specific implementations, not a built-in feature of the technology.
Problem #3: We cannot trust LLM-generated SQL queries
This one is more technical. Instead of asking the AI directly, some solutions use the AI to generate code that retrieves the answer from a database. Again, the evidence on the ability of LLMs to query data points to untrustworthy systems. Commercial LLMs reach an “exactness” of roughly 50%, meaning that almost one in two queries returns wrong data. Optimised engines can push that figure up to about 80%, although the well-known benchmarks are themselves under scrutiny, since solutions could be overfitting to them.
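Even if generated SQL cannot be fully trusted, it can at least be guarded. A minimal sketch of the idea (using SQLite and a hypothetical `run_validated_query` helper, not any specific production system): reject anything that is not a single read-only `SELECT`, and dry-run the statement with `EXPLAIN` before touching the data.

```python
import sqlite3

def run_validated_query(conn: sqlite3.Connection, sql: str):
    """Run an LLM-generated query only if it passes basic guardrails."""
    statement = sql.strip().rstrip(";")
    # Guardrail 1: a single statement only (no piggybacked commands).
    if ";" in statement:
        raise ValueError("multiple statements are not allowed")
    # Guardrail 2: read-only SELECT queries only (a crude check; a real
    # system would parse the SQL rather than inspect the first keyword).
    if not statement.lower().startswith("select"):
        raise ValueError("only SELECT queries are allowed")
    # Guardrail 3: dry-run with EXPLAIN so syntax errors surface
    # before the query runs against the data.
    conn.execute(f"EXPLAIN {statement}")
    return conn.execute(statement).fetchall()

# Demo on an in-memory table with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE datasets (name TEXT, rows INTEGER)")
conn.execute("INSERT INTO datasets VALUES ('census', 1200), ('budget', 300)")
print(run_validated_query(conn, "SELECT name FROM datasets ORDER BY name"))
```

Guardrails like these do not make the generated query correct, but they narrow what a wrong query can do.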
Problem #4: LLMs are not intelligent
LLMs are probabilistic systems that predict the next word based on the data they were trained on. They do not know anything; they just guess the most likely continuation of what they have seen. Moreover, LLMs are not good at processing data: given a list of numbers, an LLM will fail at basic arithmetic (the total in the screenshot is 41, not 42).
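The structural fix for this failure mode is to never let the model do the arithmetic. A minimal sketch, with made-up numbers (not the ones from the screenshot):

```python
def column_total(values):
    """Compute a total deterministically in code, instead of asking
    the LLM to add the numbers up in its answer."""
    return sum(values)

# A hypothetical list of values an LLM might be asked to total in a chat:
reported_values = [7, 12, 5, 9, 8]
print(column_total(reported_values))  # deterministic: always 41
```

The model can still phrase the sentence around the number, but the number itself comes from code that gives the same answer every time.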

Problem #5: Eager strategies for problem solving
LLMs rarely answer “I don’t know”, and they tend to overuse whatever tools they have at their disposal to get to an answer. This “behaviour” makes LLMs really good at discovery and generative tasks, but it becomes a threat when we want them to state facts based on data. In our analysis, even when LLMs can properly access the data they need, they will sometimes use it incorrectly, falling back into problem #4.
Our Proposal: Pilots to test MCP tools for trustworthy data retrieval from open data portals
With the previous context, we are working this year on a project to explore how Model Context Protocol (MCP) servers can be designed and implemented to interact with open data portals in a trustworthy manner. Is there a path to implement this technology while maintaining the accuracy, reliability and legal obligations of communicating factual information?
We don’t trust AI, but we think that, with the help of an MCP server and a well-designed system, we can implement a set of tools that retrieve accurate information from open data portals and control what information the LLM uses for its answer: data engineering for the information retrieval, LLMs only for the presentation layer.
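The split between the two layers can be sketched in a few lines. This is an illustration of the architecture, not the pilot implementation; the function names, the dataset URL and the JSON shape are all hypothetical.

```python
import json

def retrieve_indicator(dataset: dict, indicator: str) -> dict:
    """Retrieval layer: select and aggregate the data deterministically
    in code. The LLM never sees raw rows it could misread, only the
    finished facts together with their source."""
    values = [row[indicator] for row in dataset["rows"]]
    return {
        "source": dataset["source"],
        "indicator": indicator,
        "total": sum(values),
        "count": len(values),
    }

def presentation_prompt(facts: dict) -> str:
    """Presentation layer: the LLM is asked only to phrase pre-computed
    facts, not to compute, recall or invent anything."""
    return (
        "Rewrite these verified facts as one readable sentence. "
        "Do not add, change or compute any number:\n"
        + json.dumps(facts)
    )

# Hypothetical extract from an open data portal.
dataset = {
    "source": "https://example.org/dataset/budget-2025",
    "rows": [{"amount": 100}, {"amount": 250}, {"amount": 50}],
}
facts = retrieve_indicator(dataset, "amount")
print(presentation_prompt(facts))
```

Exposed as MCP tools, functions like `retrieve_indicator` keep every number traceable to a query over the portal, while the model is confined to wording the result.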
Why is our approach different?
- We don’t trust AI-generated content. Our starting point is the opposite of the prevailing one.
- We know the technical limitations, and we are not ignoring them, but rather designing systems to overcome them.
- We work collaboratively. We are co-designing this solution with the help of governments and with two pilot programs.
- It is not our goal to build the next shiny tool, but rather to have an honest and open technical conversation on possible solutions to these problems.
- We understand the importance of quality data, access to information and fact-checking, and we will not jeopardize our 20-year history with false claims.
![[VIDEO] Meet Open Knowledge’s Digital Public Goods](https://i0.wp.com/blog.okfn.org/wp-content/uploads/2025/11/CKAN-ODE-DPGs-thumb.png?fit=768%2C432&ssl=1)

