Building a Governed Conversational BI Agent

Written by | Jun 23, 2026 8:00:00 AM

Technical Implementation · 2026

Lessons from a retail POC. How we translated the conceptual architecture of Conversational BI into a working proof of concept — and what building it revealed about governance, grounding and analytical truth.

Authors

Dr. Elías Zamora Sillero¹ · Joe Olson² · Juan A. Cabeza Souza¹ · Dr. Agata Ferretti² · Milan Jobard-Hoffmann³ · Juan A. Montalbán Vidal¹ · Andrea Greggo²

Affiliations

¹ Technology Department, Sevilla FC · AI Alliance Member
² IBM Research · AI Alliance Member
³ ETH Zurich · AI Alliance Member

Stack

FastMCP · DuckDB · Python · Claude Desktop

Format

Technical post · companion to essay

Repository

→ AI Alliance GitHub

Part 2 of 2 · Published June 2026 · 18 min read

Contents

01 · From concept to working POC
02 · The core architectural decision
03 · The governed analytical workflow
04 · The epistemic architecture
05 · Confidence regimes
06 · End-to-end trace
07 · What the POC proves

01 · Introduction

From concept to working POC

An earlier essay published on this blog made a conceptual claim: that the next frontier of enterprise AI is not conversing with documents, but conversing with structured data in a governed, reliable and decision-ready way. It argued that what keeps most organizations from being truly data-driven is not the absence of data but the cognitive rupture between the moment a question arises and the moment reliable evidence reaches the person deciding.

That essay defined the problem and the architecture. It did not build anything.

This post is the proof.

We built a minimal proof of concept in a real enterprise retail domain. The system receives questions in natural language, routes them through a governed analytical pipeline, and returns qualified responses backed by institutional knowledge — with explicit indication of the authority level behind each answer.

The goal was not to build a general-purpose Conversational BI system. It was narrower and more honest than that:

To test whether the core architectural thesis of the earlier essay is technically actionable — whether you can, in a real enterprise domain, separate the conversational interface from the analytical authority, enforce a governed workflow, and qualify responses according to their epistemic status.

The post that follows is not a repeat of the earlier essay's argument, nor documentation of the codebase. It is an account of the architectural decisions made, the problems encountered, and the degree to which the implementation validates the claims of the essay that preceded it.

Note · Dataset

The system was validated on production retail data. The examples, traces and figures in this post use a synthetic electronics dataset — smaller in scale, designed for reproducibility and open exploration. The architectural claims hold in both cases. Some operational complexity that motivates specific design decisions — in particular the multi-pass SKU filtering — is more visible at production scale.

02 · The core architectural decision

A system organized around a separation

The system rests on one deliberate structural decision: the conversational layer and the analytical authority layer are strictly separated.

The conversational shell manages the interaction with the user — it holds context, surfaces results, presents answers. In this POC, that role is implemented by Claude Desktop. But regardless of implementation, the shell does not own the analytical model. It does not contain the product ontology. It does not define the metrics. It does not decide what SQL should be trusted.

All of that authority lives in the backend.

The LLM in the frontend is the orchestrator. The governed backend is the analyst. That distinction is not cosmetic — it is the architectural expression of the essay's central claim that analytical truth must remain under institutional control.

The system has five structural layers. Each has a defined role and a concrete implementation in the POC:

| Architectural layer | POC implementation | Role |
| --- | --- | --- |
| Conversational shell | Claude Desktop | Manages dialogue and user interaction. Surfaces results. Has no access to analytical knowledge. |
| Orchestration protocol | Project Instructions | Defines the exact workflow the shell must follow: which tools to call, in what order, under what conditions. |
| Governed analytical backend | FastMCP Server | Executes the governed pipeline. Consults the semantic knowledge base at each analytical step. |
| Semantic knowledge base | Ontology · TLK · DDL · Metrics | Institutional knowledge that makes the pipeline governed. Explicit, inspectable assets that define what the system is allowed to know and reason about. |
| Execution layer | DuckDB + Parquet | Executes analytical queries locally over data mirroring the production schema. |

Figure · Nested layers, outermost to innermost

User ↕ Natural language in · qualified analytical response out

Conversational shell + visualization · Claude Desktop · Receives questions, presents results, renders charts and tables when needed

Orchestration protocol · Project Instructions · Enforced workflow, no improvisation allowed

Governed analytical backend · FastMCP Server (retail-electronics) · Analytical pipeline, validation, SQL generation, confidence regimes

Semantic knowledge base · Institutional analytical knowledge · ontology_product.json, TLK_Library.json, retail_analytics.ddl, ontology_metrics.json, synonyms.json

Execution layer · DuckDB + Parquet · Local execution, mirrors Athena production schema

The nested structure makes the separation concrete: each layer contains the one below it and constrains what that layer can see and do. The semantic knowledge base is what the analytical pipeline consults — not what it executes. Claude Desktop is the outermost layer, the only one that touches the user in both directions.

This is what a governed architecture looks like in practice. Not a system where the model is instructed to behave correctly, but one where the structure itself enforces the boundaries.

Note · Shell

Claude Desktop is the conversational shell used in this POC, but the architecture does not depend on it. Any conversational client with MCP support and the ability to operate under a structured orchestration protocol could take its place. The governed backend and the orchestration protocol are the architecturally essential elements. The shell is replaceable.

03 · The governed analytical workflow

A pipeline where every step earns its place

A question enters the system as natural language and exits as a qualified analytical answer. Between those two points, the workflow is not a sequence of tasks — it is a system of control points. Every step can stop the pipeline, reject the question, or qualify the response. That capacity to halt, reject and qualify is what makes the workflow governed rather than merely structured.
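The control-point idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the POC's code: `StageResult`, `classify`, `validate` and `run_pipeline` are hypothetical names, and the keyword gate only stands in for the real S1 classifier.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class StageResult:
    ok: bool                      # False halts the pipeline
    reason: str = ""              # why the stage rejected the question
    payload: dict = field(default_factory=dict)


Stage = Callable[[dict], StageResult]


def classify(state: dict) -> StageResult:
    # Hypothetical keyword gate standing in for the real S1 classifier.
    retail_terms = ("sell", "sold", "returned", "stock", "units")
    if any(term in state["question"].lower() for term in retail_terms):
        return StageResult(True, payload={"domain": "electronics_retail"})
    return StageResult(False, reason="out_of_domain")


def validate(state: dict) -> StageResult:
    # Hypothetical S2 check: the structured intent needs a concept and a metric.
    intent = state.get("intent", {})
    missing = [k for k in ("concept", "metric") if not intent.get(k)]
    if missing:
        return StageResult(False, reason=f"missing:{','.join(missing)}")
    return StageResult(True, payload={"validated": True})


def run_pipeline(state: dict, stages: list[tuple[str, Stage]]) -> dict:
    """Run stages in order; stop at the first one that rejects."""
    for name, stage in stages:
        result = stage(state)
        if not result.ok:
            return {"status": "rejected", "stopped_at": name, "reason": result.reason}
        state = {**state, **result.payload}
    return {"status": "ok", "state": state}
```

The point of the shape is that a rejection is an ordinary return value carrying the stage name and the reason, so the shell can report why a question was stopped rather than improvising an answer.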

The workflow has seven logical stages but exposes only four tools to the conversational shell. That asymmetry is intentional: it is the Orchestration Protocol in action. Stages S3 through S4 (S3, S3.25, S3.5 and S4) are grouped inside a single backend tool, step3_to_4_pipeline, precisely because they are the most sensitive: semantic mapping, live SKU resolution, SQL generation and response-authority logic all stay inside the governed backend, invisible to the shell.
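One way to keep generated SQL invisible to the shell is to hand back only an opaque handle. The trace later in this post notes that a sql_id is stored server-side; the sketch below shows how such a handle could work. `SqlVault` and its methods are hypothetical names, not the POC's API.

```python
import uuid


class SqlVault:
    """Keeps generated SQL on the backend; the shell only ever sees an opaque id."""

    def __init__(self) -> None:
        self._store: dict[str, dict] = {}

    def put(self, sql: str, confidence: str) -> dict:
        sql_id = uuid.uuid4().hex
        self._store[sql_id] = {"sql": sql, "confidence": confidence}
        # Only the handle and the authority level leave the backend.
        return {"sql_id": sql_id, "confidence": confidence}

    def get(self, sql_id: str) -> dict:
        if sql_id not in self._store:
            raise KeyError(f"unknown sql_id: {sql_id}")
        return self._store[sql_id]
```

Under this design the execute step would accept a sql_id rather than a SQL string, so the shell has no path to run a query that the backend did not produce.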

| Logical step | MCP tool | Function | Governance purpose |
| --- | --- | --- | --- |
| S1 · Classify | step1_classify_question | Gates entry to the pipeline. Only retail data questions proceed. | Prevents out-of-domain answers. |
| S2 · Validate | step2_validate_retail_domain | Decomposes and validates concept, metric, date range, location and size. Produces the structured intent all subsequent steps use. | Prevents invalid analytical requests entering the pipeline. |
| S3 · Ontology mapping | step3_to_4_pipeline | Maps the user concept to the product ontology: 7 families, 29 subfamilies, 95 product types. | Prevents semantic drift and invented product categories. |
| S3.25 · SKU prefilter | step3_to_4_pipeline | Eliminates impossible SKU (Stock Keeping Unit) candidates deterministically, before any model call. | Reduces noisy SKU pools and unnecessary model exposure. |
| S3.5 · SKU filter | step3_to_4_pipeline | Selects the exact live item strings that will anchor the SQL filter. | Prevents false positives, false negatives and stale product-name matching. |
| S4 · TLK / SQL | step3_to_4_pipeline | Matches the question against certified TLK (Trusted Knowledge Library) patterns and generates SQL according to explicit confidence regimes. | Prevents unconstrained SQL generation. |
| S5 · Execute | step5_execute_query | Executes the query and returns a qualified answer with a visible confidence level. | Prevents unqualified answers being presented as equally reliable. |
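The deterministic prefilter in S3.25 is the easiest stage to illustrate, because it needs no model call. The sketch below assumes that "impossible" means no activity inside the requested date window, or a size that does not match the request. The `Sku` fields and the `prefilter` function are hypothetical, since the post does not show the actual criteria.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Sku:
    item_name: str
    first_sale: date          # hypothetical field: first day with activity
    last_sale: date           # hypothetical field: last day with activity
    size: Optional[str] = None


def prefilter(pool, start, end, size=None):
    """Drop SKUs that cannot match: no activity in the window, or wrong size."""
    kept = []
    for sku in pool:
        # Two date ranges overlap when each starts before the other ends.
        active_in_window = sku.first_sale <= end and sku.last_sale >= start
        size_ok = size is None or sku.size == size
        if active_in_window and size_ok:
            kept.append(sku)
    return kept
```

Because the rule is plain comparison logic, the same input always yields the same pool, and every exclusion can be explained without reference to a model.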

This design changes the role of the language model. The LLM is not asked to answer the business question — it is asked to participate in a controlled analytical process where each step has a defined jurisdiction. That is what distinguishes this system from a chatbot with SQL access.
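In the POC the tool order is defined by the Project Instructions. A backend could also refuse out-of-order calls on its own, which is sketched below as an assumption rather than a description of the repository. `GovernedSession` and `WorkflowError` are hypothetical names; the tool names are the four listed in the table above.

```python
class WorkflowError(RuntimeError):
    """Raised when a tool is called outside its place in the workflow."""


class GovernedSession:
    """Tracks one question and refuses tool calls that skip a step."""

    ORDER = (
        "step1_classify_question",
        "step2_validate_retail_domain",
        "step3_to_4_pipeline",
        "step5_execute_query",
    )

    def __init__(self) -> None:
        self._next = 0  # index of the next tool the session will accept

    def call(self, tool_name: str) -> str:
        if self._next >= len(self.ORDER):
            raise WorkflowError("workflow already complete; start a new session")
        expected = self.ORDER[self._next]
        if tool_name != expected:
            raise WorkflowError(f"expected {expected}, got {tool_name}")
        self._next += 1
        return tool_name
```

A guard of this kind would turn the jurisdiction of each step into something the shell cannot bypass, even if its instructions were ignored.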

Supp. A

For the full algorithmic detail of what happens inside each step — including decision logic, model calls, and failure modes — see Supplementary Material A: Pipeline Deep Dive →

04 · The epistemic architecture

Institutional knowledge made executable

A governed analytical system requires an explicit epistemic architecture — a body of knowledge organized into layers, each corresponding to a fundamental dimension of the analytical universe of the organization. This section shows how that architecture became concrete. Each conceptual layer has a direct counterpart in the repository: a file, a structure, a set of definitions that the system consults at runtime.

Without those assets, the system has no epistemic architecture. It has a language model with access to a SQL interface.

| Asset | Role in the system | Epistemic layer |
| --- | --- | --- |
| ontology_product.json (7 families · 29 subfamilies · 95 types · 271 SKUs) | Definition and classification of existing products. Node metadata for LLM navigation without hallucination. Active in S3 (ontology mapping). | Ontology of the analytical world |
| ontology_metrics.json | Definition of computable metrics and their properties. Temporal nature (flow vs. stock) and multilingual aliases. Active in S2 (metric validation). | Ontology of metrics and KPIs |
| TLK_Library.json (68 verified question-SQL templates) | 68 verified question-SQL pairs. Institutional memory for certified query patterns. Active in S4 (template matching and SQL instantiation). | Certified analytical patterns |
| retail_analytics.ddl | Map of available tables and columns. Explicit exclusion of non-existent tables. Active in S4 (SQL generation at LOW confidence). | Structural schema of the analytical world |
| synonyms.json | Translation of terms to ontology nodes across Spanish, English and retail jargon. Deterministic fallback when LLM returns zero matches. Active in S3 (ontology mapping). | Semantic resolution layer |
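The synonym fallback described for synonyms.json can be sketched as follows. The function name, the node ids and the shape of the synonym table are assumptions made for illustration. The sketch also discards model-proposed nodes that the ontology does not contain, which is one plausible reading of "navigation without hallucination" rather than confirmed POC behaviour.

```python
def map_concept(term, llm_matches, ontology_nodes, synonyms):
    """Return ontology nodes for a user term.

    llm_matches: node ids proposed by the model (may be empty or invented).
    ontology_nodes: set of node ids that really exist.
    synonyms: dict mapping lowercase terms to lists of node ids.
    """
    # Discard anything the model proposed that the ontology does not contain.
    valid = [n for n in llm_matches if n in ontology_nodes]
    if valid:
        return {"nodes": valid, "source": "llm"}
    # Deterministic fallback: exact lookup in the synonym table.
    candidates = synonyms.get(term.strip().lower(), [])
    fallback = [n for n in candidates if n in ontology_nodes]
    if fallback:
        return {"nodes": fallback, "source": "synonyms"}
    return {"nodes": [], "source": "none"}
```

Recording the source of each mapping keeps the resolution auditable: a reviewer can see whether a concept was resolved by the model or by the curated table.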

The epistemic architecture is the mechanism that moves analytical knowledge out of people's heads and into the system, where it can be audited, versioned and evolved independently of the individuals who created it.

Maintaining that layer is a continuous analytical responsibility. The ontology grows as the catalogue grows. The TLK expands as new classes of questions are validated. The metric definitions evolve as the organization's measurement practices evolve. In that sense, the epistemic architecture redefines the role of the analyst — from operational intermediary to curator of institutional knowledge.
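The metric layer shows why such definitions have to be explicit. A flow metric is summed over a period, while a stock metric is read at the end of it, so the same daily series yields different answers depending on the declared nature. The sketch below is illustrative only: the real schema of ontology_metrics.json is not shown in this post, and `stock_on_hand`, the aliases and both function names are hypothetical.

```python
METRICS = {
    # Hypothetical entries; items_sold is the only metric id named in this post.
    "items_sold": {"nature": "flow", "aliases": ["units sold", "unidades vendidas"]},
    "stock_on_hand": {"nature": "stock", "aliases": ["inventory", "existencias"]},
}


def resolve_metric(phrase):
    """Map a user phrase to a metric id, or None if the phrase is not governed."""
    wanted = phrase.strip().lower()
    for metric_id, spec in METRICS.items():
        if wanted == metric_id or wanted in spec["aliases"]:
            return metric_id
    return None


def aggregate(metric_id, daily_values):
    """Flows add up over a period; stocks are read at the end of the period."""
    nature = METRICS[metric_id]["nature"]
    if not daily_values:
        return 0
    return sum(daily_values) if nature == "flow" else daily_values[-1]
```

For example, `aggregate("items_sold", [3, 4, 5])` sums to 12, whereas `aggregate("stock_on_hand", [30, 27, 22])` returns the closing value 22.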

05 · Response authority

Confidence regimes: knowing what you know

The epistemic architecture defines what the system knows. The confidence regimes define how honestly the system communicates what it knows. A system that presents every response with the same degree of confidence is not being transparent — it is deferring a judgment to the user without giving them the information they need to make it.

The POC implements four response-authority states. Each is a direct consequence of how the SQL was produced:

| Level | Condition | What it means |
| --- | --- | --- |
| ACCURATE | Direct match in the TLK library. SQL instantiated deterministically from a verified template using live item_names. | Answer derived from a certified institutional pattern. No free generation involved. |
| MEDIUM | Partial TLK match. The LLM adapts the closest verified template. Template lineage is preserved and visible. | Answer follows an institutional pattern but has been adapted. Usable with awareness of the adaptation. |
| LOW | No TLK match, or date expression too complex for template instantiation. LLM generates SQL from DDL and ontology context. | Answer is generated, not institutionally verified. Warning prepended. User must validate before acting. |
| BLOCKED | No valid item_names after the SKU filter. Query rejected before SQL generation. | Not a failure but a governance decision. Prevents silently returning unfiltered grand totals. |
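The four states can be read as a small decision function. The sketch below assumes that a direct match corresponds to 100% similarity (as in Trace 01) and introduces a hypothetical `partial_threshold` for what counts as a partial match. The post does not state the actual cut-offs, so both are assumptions.

```python
def assign_confidence(item_names, best_similarity, date_is_simple=True,
                      partial_threshold=0.6):
    """Pick a response-authority level from how the SQL can be produced.

    best_similarity: similarity of the closest TLK template, in [0, 1].
    partial_threshold: hypothetical cut-off for a "partial" match.
    """
    if not item_names:
        return "BLOCKED"      # no governed product scope: reject before SQL
    if not date_is_simple:
        return "LOW"          # date too complex for template instantiation
    if best_similarity >= 1.0:
        return "ACCURATE"     # direct template match
    if best_similarity >= partial_threshold:
        return "MEDIUM"       # closest template adapted by the LLM
    return "LOW"              # generated from DDL and ontology context
```

The BLOCKED check comes first on purpose: an empty product scope rejects the query before any template lookup or SQL generation is attempted.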

Two distinctions matter here. LOW does not mean wrong. It means the answer has not been validated against a certified institutional pattern — which is a fundamentally different thing. A LOW response can be correct; it simply has not earned institutional authority. BLOCKED is not an error. It is the system asserting that executing the query without a governed product scope would be analytically unsafe.
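Making the authority level visible is a matter of attaching it to every response before it reaches the shell. In this minimal sketch, `qualify` and the response fields are hypothetical, and the BLOCKED message is illustrative wording. The warning text is the one shown in Trace 02 of this post.

```python
LOW_WARNING = (
    "⚠ LOW CONFIDENCE: This SQL was generated from scratch without a "
    "verified template. Results may be inaccurate."
)


def qualify(level, answer=None):
    """Attach the authority level to an answer before it reaches the shell."""
    if level == "BLOCKED":
        # A governance decision, not an error: nothing was executed.
        return {
            "confidence": level,
            "executed": False,
            "message": "No valid item_names after the SKU filter; query rejected.",
        }
    # LOW answers carry the warning in front; other levels pass through as-is.
    text = f"{LOW_WARNING}\n{answer}" if level == "LOW" else answer
    return {"confidence": level, "executed": True, "message": text}
```

Keeping `confidence` as a separate field, rather than only inside the message text, lets the shell render the level consistently regardless of how the answer is phrased.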

06 · End-to-end trace

Two questions. Two different answers.

The two questions below were chosen deliberately. The first has a direct match in the TLK library — the system answers it with full institutional authority. The second requires a derived ratio that no certified template covers — the system answers it, but qualifies that answer honestly. Same architecture. Same pipeline. Different epistemic authority.

Trace 01 · ACCURATE · "How many smartphones did we sell in 2025?"

Result: 2,205 units sold · no warnings

S1 · Classify
electronics_retail · retail data query · enters pipeline

S2 · Validate
Valid. concept: smartphones · metric: items_sold / total · date: 2025 · location: not specified

S3 · Ontology map
16 nodes matched: 5 smartphone types + 11 accessory types (cases, cables, chargers)
Pool: 52 SKUs · Smartphone 6.1" Standard · 6.7" Pro · 5G Mid-range · Foldable · Compact 5.4"

S3.25 · SKU prefilter
52 → 52 SKUs · 0% reduction
Date window: 2025-01-01 → 2025-12-31 · 0 items excluded · size filter: not applied

S3.5 · SKU filter
52 → 16 SKUs · −69% · Path B · 1.058s
36 accessories, cases, cables and chargers removed · 16 smartphone SKUs retained
LLM correctly discriminates smartphones from co-matched accessories

S4 · TLK lookup
ACCURATE · TLK #12 · 100% similarity
Template: "How many units of <ITEM_NAME> were sold in <YYYY>?" · SQL instantiated with 16 item_names · 1,811 chars
Direct TLK match · no LLM generation · sql_id stored server-side

S5 · Execute
2,205 units · net_qty_sold = 2205.0 · 1 row · no warnings
In 2025, a total of 2,205 smartphones were sold.

What this shows: A well-scoped concept with a clear metric produces an exact TLK match. The answer is built deterministically from a certified institutional pattern, with no SQL generation involved. The LLM filter correctly removes 36 co-matched accessories while retaining all 16 smartphone SKUs.
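Deterministic instantiation of a template like TLK #12 can be sketched as below. Several assumptions apply: the `sales` table and its columns are hypothetical (only net_qty_sold appears in the trace), `sqlite3` from the standard library stands in for DuckDB so the example runs anywhere, and bound parameters are a design choice of this sketch rather than a description of how the POC builds its SQL string.

```python
import sqlite3

# Hypothetical certified template: table and column names are illustrative,
# apart from net_qty_sold, which appears in the trace output.
TEMPLATE = (
    "SELECT SUM(qty) AS net_qty_sold FROM sales "
    "WHERE item_name IN ({placeholders}) AND sale_year = ?"
)


def instantiate(item_names, year):
    """Fill the template deterministically; values travel as bound parameters."""
    if not item_names:
        raise ValueError("BLOCKED: no item_names to scope the query")
    placeholders = ", ".join("?" for _ in item_names)
    return TEMPLATE.format(placeholders=placeholders), [*item_names, year]


def run(conn, item_names, year):
    """Execute the instantiated template and return the single aggregate."""
    sql, params = instantiate(item_names, year)
    return conn.execute(sql, params).fetchone()[0]
```

Only the number of placeholders varies with the SKU list. The structure of the query is fixed by the template, which is what makes the result traceable to a certified pattern.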

Trace 02 · LOW · "What percentage of laptops were returned in 2025?"

Result: 1.54% return rate · warning active

S1 · Classify
electronics_retail · retail data query · enters pipeline

S2 · Validate
Valid. concept: laptops · metric: items_sold / derived_ratio · date: 2025
Note: "percentage returned" mapped as derived_ratio. The phrasing "return rate" was rejected at this same step as out_of_scope.
Early signal: derived_ratio operation, no certified TLK template likely

S3 · Ontology map
8 nodes matched · 5 laptop types + 3 bag/sleeve types · pool: 23 SKUs
Laptop 13" Ultrabook · 15" Notebook · 17" Gaming · Convertible 2-in-1 14" · Chromebook 11"

S3.25 · SKU prefilter
23 → 23 SKUs · 0% reduction
Date window: full year 2025 · 0 items excluded

S3.5 · SKU filter
23 → 15 SKUs · −35% · Path B · 1.517s
8 bags, sleeves and briefcases removed · 15 laptop SKUs retained
Smaller pool and slower than Trace 01: broader concept, no attribute to narrow scope

S4 · TLK lookup
LOW · 0% similarity · no template match
Derived ratio queries (credit note units ÷ invoice units × 100) not yet in TLK · SQL generated from scratch: 2,280 chars · 59 lines
derived_ratio requires joining sales + returns in a division; no certified pattern covers this

S5 · Execute
1.54% return rate · 1 row

⚠ LOW CONFIDENCE: This SQL was generated from scratch without a verified template. Results may be inaccurate.

1.54% of laptops were returned in 2025. Verify before acting on this result.

What this shows: A derived ratio metric has no certified TLK template. The system answers, but qualifies the answer honestly. The result may be correct; it has simply not earned institutional authority. This query is a candidate for promotion to a TLK template after manual validation.
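The ratio behind this trace is credit note units ÷ invoice units × 100. A minimal sketch of that arithmetic is shown below with a hypothetical function name. The guard against a zero denominator is an assumption about sensible behaviour, not something the post specifies.

```python
def return_rate(returned_units, invoiced_units):
    """Return rate in percent: credit note units / invoice units * 100."""
    if invoiced_units <= 0:
        return None           # undefined without sales; do not report 0%
    return round(returned_units / invoiced_units * 100, 2)
```

With illustrative inputs of 3 returned units against 200 invoiced units, the function yields 1.5. Writing the formula down this explicitly is the kind of manual validation step that would precede promoting the query to a certified template.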

The contrast between the two traces is not incidental. It is what the architecture is designed to produce: different answers carry different authority, and the system makes that difference visible. The ACCURATE trace demonstrates what the system can do at its best — deterministic, verified, institutional. The LOW trace demonstrates what the system does when it reaches the boundary of its certified knowledge — it continues, but it is honest about what lies beyond that boundary.

07 · Conclusion

What the POC actually proves

The earlier essay argued that the first wave of enterprise generative AI made unstructured knowledge conversational, and that the next challenge is making structured data conversational without sacrificing analytical truth. This proof of concept is the test of that claim in a real enterprise domain.

This POC shows that the problem cannot be solved by simply connecting a chatbot to a database. Natural language access to structured data becomes useful only when it is governed by explicit semantics, validated workflows, certified analytical patterns and visible response authority. The key to that governance is architectural: separating the conversational interface from the analytical authority, so that the language model orchestrates interaction while the institution retains ownership of truth.

The key architectural lesson is clear:

The LLM should orchestrate the interaction.
The institution should own the truth.
The backend should operationalize that truth.
The user should see the authority level of every answer.

This is the difference between conversational analytics and decision-grade Conversational BI.

The proof of concept described here is not the final form of the architecture. It is a controlled first materialization of its logic. It demonstrates that governed semantics, enforced workflows and natural-language interaction can coexist. More importantly, it shows that building governance into the system from the beginning is easier — and safer — than trying to retrofit it after the system has already learned to answer freely.

The frontier is not simply giving every manager the ability to "chat with data". The frontier is giving every manager access to governed evidence at the moment of decision.

That is the real promise of Conversational BI.

Decision-grade conversational analytics depends less on unrestricted generation than on governed semantics, enforced workflow and qualified response behaviour.

In this series

Essay From Document RAG to Conversational BI: Rewiring Enterprise Decision-Making →

Defines the conceptual thesis: why governed conversational access to structured data is the next frontier of enterprise AI.

Technical Post Building a Governed Conversational BI Agent: Lessons from a Retail POC (current)

Translates the thesis into a working proof of concept and documents the architectural decisions behind it.

Repository conversational_bi · AI Alliance GitHub →

Exposes the technical artifact: the governed pipeline, semantic assets and workflow logic, open for inspection and reuse.
