When the AI Alliance started, I became co-leader of the Trust and Safety Focus Area, because it was clear that AI would not be widely adopted in enterprise and consumer applications unless it could be trusted. We have made a lot of progress since then, but another, related blocking issue became apparent to me about a year ago.
I realized that most of my fellow enterprise software developers and I don't know how to test these applications, because the probabilistic nature of generative AI is new to us. We are accustomed to more deterministic behavior when we design, implement, and test our "pre-AI" code.
Figure 1: The spectrum between deterministic and stochastic behavior, and the people accustomed to them!
So, I started a project we now call Achieving Confidence in Enterprise AI Applications to bridge this gap. It is a "living" guide designed to explore how we can adapt and adopt the evaluation technologies AI experts use for the fine-grained testing of use cases and requirements that we developers must do, in order to be confident in our AI-empowered applications. It also explores the implications for how we design these applications.
Today, I'm pleased to announce version 0.2.0 of our living guide. While there is much we still need to do, we have added content you can use today to start testing your AI-empowered applications.
We added a working example that demonstrates how to adapt benchmark techniques to create unit benchmarks, the analog of unit tests. Similarly, integration benchmarks and acceptance benchmarks are the analogs of integration tests, which explore interactions between system components, and acceptance tests, which provide the final confirmations that features are done.
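To make the idea concrete, here is a minimal sketch of what a unit benchmark might look like, written in a pytest style. The `ask_chatbot` function and the tiny dataset are hypothetical placeholders, not code from the guide; the point is that we assert a minimum pass rate over a focused dataset rather than exact equality on a single output.

```python
# A minimal sketch of a "unit benchmark", written in a pytest style.
# `ask_chatbot` and the tiny dataset below are hypothetical placeholders.

REFILL_EXAMPLES = [
    ("I need my blood pressure meds refilled", "prescription_refill"),
    ("Can you renew my prescription?", "prescription_refill"),
    ("When is my next appointment?", "appointment_inquiry"),
]

def ask_chatbot(question: str) -> str:
    """Placeholder for the application call that returns a predicted intent label."""
    raise NotImplementedError

def test_refill_intent_unit_benchmark():
    # Because generative output can vary, we assert a minimum pass rate over
    # a focused dataset instead of exact equality on one expected value.
    passed = sum(
        1 for question, expected in REFILL_EXAMPLES
        if ask_chatbot(question) == expected
    )
    assert passed / len(REFILL_EXAMPLES) >= 0.9
```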
We show how to use large language models (LLMs) to synthesize focused data sets for these benchmarks, how to validate this data, and how to think about the results of tests that use it.
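As an illustration of the idea (not the guide's actual code), a data-synthesis step might look roughly like this, where `complete` stands in for whichever model client you use:

```python
# A sketch of synthesizing benchmark data with an LLM. `complete` stands in
# for your model client of choice; it is an assumption, not an API from the guide.
import json

def complete(prompt: str) -> str:
    """Placeholder: send the prompt to your LLM and return its text reply."""
    raise NotImplementedError

def synthesize_refill_questions(n: int = 20) -> list[str]:
    prompt = (
        f"Generate {n} distinct ways a patient might ask their healthcare "
        "provider for a prescription refill. Return only a JSON array of strings."
    )
    questions = json.loads(complete(prompt))
    # Basic structural validation of the synthetic data; real validation
    # would go further (deduplication, coverage checks, human or model review).
    assert isinstance(questions, list) and len(questions) == n
    assert all(isinstance(q, str) and q.strip() for q in questions)
    return questions
```

The validation here is deliberately minimal; before trusting benchmark results built on synthetic data, you would want to check it much more carefully.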
Perhaps most important, we explore design concepts that allow us to reduce the randomness in parts of our applications, while still exploiting AI benefits. We also think about how to discover the AI equivalents of "features", units of functionality that we can implement incrementally and iteratively, as we prefer to do in Agile Development.
We show an example of a healthcare ChatBot designed to allow patients to ask questions of their healthcare provider. We show that frequently asked questions (FAQs), like requests for prescription refills, enable simplified, even deterministic handling. Finding the FAQs or their equivalents for a domain enables more rapid progress: we can develop each one incrementally, without the impossible burden of trying to tackle all input-output combinations at once.
Even small LLMs can easily interpret the many ways patients might ask for refills. When such a request is detected, we can direct the LLM to return a deterministic response. Effectively, the LLM becomes a sophisticated classifier. Once we know a query's class, we know exactly how to handle it, deterministically and confidently. So we keep the benefit of an LLM's ability to interpret the many variations of human questions, while also enjoying deterministic outcomes, at least for use cases like this one. These insights reduce our overall design, implementation, and testing burden.
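One way to realize this pattern is sketched below. The model only picks a label, and the application responds deterministically for known classes; `complete`, the label set, and the canned refill response are assumptions for illustration, not the guide's implementation.

```python
# A sketch of the "LLM as classifier" pattern: the model only picks a label,
# and the application responds deterministically for known classes.
# `complete`, the label set, and the canned response are illustrative assumptions.

KNOWN_INTENTS = {"prescription_refill", "appointment_inquiry", "other"}

REFILL_RESPONSE = (
    "To request a refill, please confirm the medication name and your "
    "pharmacy, and we will forward the request to your care team."
)

def complete(prompt: str) -> str:
    """Placeholder: send the prompt to your LLM and return its text reply."""
    raise NotImplementedError

def classify_intent(question: str) -> str:
    prompt = (
        "Classify the patient question into exactly one label from "
        f"{sorted(KNOWN_INTENTS)}. Respond with the label only.\n\n"
        f"Question: {question}"
    )
    label = complete(prompt).strip().lower()
    return label if label in KNOWN_INTENTS else "other"

def answer(question: str) -> str:
    if classify_intent(question) == "prescription_refill":
        return REFILL_RESPONSE  # deterministic, easily testable path
    # Fall back to generative handling for everything else.
    return complete(question)
```

Because the refill path returns a fixed response, it can be tested like ordinary deterministic code; only the classification step needs a benchmark.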
We still need to design and test for the many other possible user queries, and learn how to reason about and test the generative replies. We have begun this journey, which we will continue in subsequent releases of this guide.
Please let us know what you think and consider joining us on this exploration!