How Can We Test Enterprise AI Applications?

Dean Wampler

When the AI Alliance started, I became co-leader of the Trust and Safety Focus Area, because it was clear that without the ability to trust AI, it would not be widely adopted into enterprise and consumer applications. We have made a lot of progress since then, but another, related blocking issue became apparent to me about a year ago.

I realized that most of my fellow enterprise software developers and I don't know how to test these applications, because the probabilistic nature of generative AI is new to us. We are accustomed to more deterministic behavior when we design, implement, and test our "pre-AI" code.

Figure 1: The spectrum between deterministic and stochastic behavior, and the people accustomed to them!

So, I started a project we now call Achieving Confidence in Enterprise AI Applications to bridge this gap. It is a "living" guide designed to explore how we can adapt and adopt the evaluation technologies AI experts use for the fine-grained testing of use cases and requirements that we developers must do, in order to be confident in our AI-empowered applications. It also explores the implications for how we design these applications.

Today, I'm pleased to announce version 0.2.0 of our living guide. While there is much still to do, we have added content you can use today to start testing your AI-empowered applications.

We added a working example that demonstrates how to adapt benchmark techniques to create unit benchmarks, the analog of unit tests. Similarly, integration benchmarks and acceptance benchmarks are the analogs of integration tests, which explore interactions between system components, and acceptance tests, which provide the final confirmations that features are done.
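To make the idea concrete, here is a minimal sketch of what a unit benchmark might look like, as opposed to a conventional unit test: instead of asserting one exact output, we run many inputs from a focused data set and assert that an aggregate score clears a threshold. All names here (`ask_model`, `refill_cases`, the threshold value) are hypothetical illustrations, not code from the guide's working example, and the model call is stubbed out.

```python
def ask_model(question: str) -> str:
    # Stand-in for a real model call; a deployed version would invoke an LLM
    # and map its response to one of a fixed set of labels.
    q = question.lower()
    return "refill" if "refill" in q or "renew" in q else "other"

# A focused data set for one use case: each entry pairs an input with the
# label we expect the model to produce.
refill_cases = [
    ("Can I get my prescription refilled?", "refill"),
    ("Please renew my blood pressure medication.", "refill"),
    ("What are your office hours?", "other"),
    ("I need a refill on my inhaler.", "refill"),
]

def unit_benchmark(cases, threshold=0.9):
    """Pass if the fraction of correct answers meets the threshold,
    rather than requiring every probabilistic output to match exactly."""
    correct = sum(1 for q, expected in cases if ask_model(q) == expected)
    score = correct / len(cases)
    return score, score >= threshold

score, passed = unit_benchmark(refill_cases)
print(f"score={score:.2f} passed={passed}")
```

The shift from "this input must produce exactly this output" to "this distribution of inputs must score above this bar" is the essential adaptation that makes unit-level testing workable for probabilistic components.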

We show how to use large language models (LLMs) to synthesize focused data sets for these benchmarks, how to validate that data, and how to interpret the results of tests that use it.
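The validation step deserves emphasis: synthesized data must be cleaned before it can be trusted. The following sketch shows the shape of that step, under assumed, hypothetical names; a trivial template generator stands in for an LLM prompt that would produce question variants, because the deduplication and sanity checks are the point here.

```python
def synthesize_variants(base: str) -> list:
    # Stand-in for an LLM call such as "rewrite this question three ways".
    templates = ["{0}", "Hi, {0}", "{0} Thanks!"]
    return [t.format(base) for t in templates]

def validate(samples, min_len=5, max_len=200):
    """Drop degenerate, runaway, or duplicate generations before they can
    silently inflate (or deflate) benchmark scores."""
    seen, valid = set(), []
    for s in samples:
        s = s.strip()
        if not (min_len <= len(s) <= max_len):
            continue  # degenerate or runaway generation
        if s.lower() in seen:
            continue  # duplicate that would double-count one case
        seen.add(s.lower())
        valid.append(s)
    return valid

data = synthesize_variants("Can I refill my prescription?")
data += data  # simulate an LLM emitting duplicated generations
clean = validate(data)
print(len(clean))
```

Real validation would also check label correctness, often with a second model or human review, but even these mechanical filters catch common failure modes of synthetic data.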

Perhaps most important, we explore design concepts that allow us to reduce the randomness in parts of our applications, while still exploiting AI benefits. We also think about how to discover the AI equivalents of "features", units of functionality that we can implement incrementally and iteratively, as we prefer to do in Agile Development.

We show an example of a healthcare chatbot designed to let patients ask questions of their healthcare provider. Frequently asked questions (FAQs), like requests for prescription refills, enable simplified, even deterministic handling. Finding the FAQs, or their equivalents for a domain, enables more rapid progress: each one can be developed incrementally, without the impossible burden of tackling all input-output combinations at once.

Even small LLMs can easily interpret the many ways patients might ask for refills. When such a request is detected, we can direct the application to return a deterministic response; the LLM effectively becomes a sophisticated classifier. Once we know a query's class, we know exactly how to handle it, deterministically and confidently. We keep the benefit of an LLM's ability to interpret the many variations of human questions, while enjoying deterministic outcomes, at least for use cases like this one. These insights reduce our overall design, implementation, and testing burden.
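The routing pattern just described can be sketched as follows. This is an illustration under stated assumptions, not the guide's implementation: `classify_query` is a stub for an LLM prompted to emit exactly one label from a fixed set, and the handler names and response text are invented for the example.

```python
REFILL_RESPONSE = (
    "To request a prescription refill, please confirm your name, "
    "date of birth, and the medication name."
)

def classify_query(query: str) -> str:
    # Stand-in for an LLM constrained to return one label from a fixed set.
    q = query.lower()
    if "refill" in q or "renew" in q:
        return "prescription_refill"
    return "general"

def handle(query: str) -> str:
    label = classify_query(query)
    if label == "prescription_refill":
        # Deterministic path: same class, same response, easy to test.
        return REFILL_RESPONSE
    # Fallback path: a generative reply, which needs its own
    # benchmark-based testing strategy.
    return "[generative response]"

print(handle("Could you renew my statin prescription?"))
```

Note how the design splits the testing problem: the classifier can be measured with a unit benchmark over labeled queries, while each deterministic handler can be tested with ordinary, exact-match unit tests.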

We still need to design and test for the many other possible user queries, and to learn how to reason about and test the generative replies. We have begun this journey, which we will continue in subsequent releases of this guide.

Please let us know what you think and consider joining us on this exploration!
