Return to Articles

DoomArena: A Security Testing Framework for AI Agents

Technical Report

As AI agents become increasingly autonomous and widely deployed, they introduce new attack surfaces and amplify existing security vulnerabilities. The DoomArena framework, developed by researchers at ServiceNow Research, addresses critical gaps in current AI agent security testing approaches.

The Problem: Current Security Testing Falls Short

AI agents are poised to transform how we interact with technology, but with great autonomy comes significant security challenges. Privacy and security concerns remain top blockers for agentic AI adoption, yet current testing methods have fundamental limitations:

Limitations of Existing Approaches

Ad Hoc Red Teaming

  • Works for well-known attacks but fails to achieve systematic, continuous risk assessment
  • Cannot capture the dynamic nature of real-world threats

Static Benchmarks (HarmBench, AgentHarmBench, ST WebAgent Bench)

  • Excellent for evaluating known attacks
  • Cannot capture dynamic and adaptive attacks relevant to agentic scenarios
  • Miss the interactive nature of agent deployments

Dedicated Dynamic Benchmarks (Agent Dojo)

  • Treat cybersecurity as a siloed process separate from agent evaluation
  • Don't provide integrated evaluation of security and task performance

The Guardrail Problem

Many organizations turn to AI-powered guardrails like LlamaGuard for protection. However, research shows these guardrail models are:

  • Porous and unreliable in agentic settings
  • Easy to evade with basic techniques
  • Often implemented with generic, case-agnostic definitions that miss context-specific threats

Testing revealed that LlamaGuard failed to identify any attacks in the study, even obvious ones visible to human reviewers.

Introducing DoomArena

DoomArena is a comprehensive security testing framework designed to address these weaknesses through four core principles:

1. Fine-Grained Evaluation

  • Modular, configurable, and extensible architecture
  • Detailed threat modeling with component-specific attack targeting
  • Granular analysis of both attacks and defenses

2. Realistic Deployment Testing

  • Tests agents in realistic environments with actual user-agent-environment loops
  • Supports web agents, tool-calling agents, and computer-use agents
  • Integrates security evaluation with task performance assessment

3. Attack Decoupling and Reusability

  • Complete separation of attacks from environments
  • Library of attacks usable across multiple benchmarks
  • Support for combining multiple attack types

4. Extensible Framework Design

  • Easy integration of new attack types and threat models
  • Simple wrapper-based approach for existing environments
  • Plug-and-play architecture for rapid iteration

Technical Architecture

DoomArena models deployed agents as a three-way interaction loop:

  • Agent: The AI system (OpenAI, Google's Project Mariner, ServiceNow agents, etc.)
  • Environment: The deployment context (Browser Gym, ToolBench, OS World, etc.)
  • User: The human making requests

Any of these three components can potentially be malicious, requiring comprehensive security evaluation at every node.

Key Components

Attack Gateway: A wrapper around existing environments that enables attack insertion without modifying the underlying benchmark.

Attack Configuration: Defines:

  • The specific attack type
  • Success criteria and filters
  • Attackable components
  • Timing and conditions for attack deployment

Success Filters: Granular evaluation of attack effectiveness separate from task completion.

Attack Types and Results

Web Agent Attacks (WebArena)

Evaluated on Reddit tasks with three attack types:

  • Banner Attacks: Malicious banners with hidden instructions in alt text
  • Pop-up Attacks: Benign-appearing pop-ups with hidden redirection commands
  • Combined Attacks: Multiple simultaneous attack vectors

Results: Agents showed high vulnerability with attack success rates reaching 97% for Claude 3.5 Sonnet in some scenarios.

Tool-Calling Agent Attacks (ToolBench)

Tested in retail and airline customer service scenarios:

  • Malicious User Attacks: Prompt injection to obtain unauthorized refunds
  • Malicious Catalog Attacks: Compromised tools providing false information
  • Combined Attacks: Multiple simultaneous threat vectors

Key Findings:

  • GPT-4 showed higher vulnerability to malicious user attacks than Claude 3.5
  • Combined attacks had constructive effects, increasing overall vulnerability
  • No single agent dominated across all threat models

Defense Evaluation

Basic defenses tested included:

  • LlamaGuard: Completely unsuccessful at detecting attacks
  • GPT-4 as Judge: Effective but generated many false positives
  • Simple Interruption: Stops tasks when attacks detected (heavy-handed approach)

Critical Insights

No Universal Security Champion

Research revealed no single agent consistently outperformed others across all threat models, highlighting the complexity of security-performance trade-offs.

Attack Combination Effects

Combining multiple attack types often increased vulnerability, demonstrating the need for comprehensive multi-vector testing.

Detection vs. Mitigation Gap

Current defenses focus on detection and task interruption rather than sophisticated mitigation strategies.

Future Directions

Advanced Attack Development

  • Stealth Attacks: More subtle techniques that avoid detection
  • Backdoor Attacks: Fine-tuning based vulnerabilities triggered at test time
  • Scale-Based Attacks: Low individual probability but effective at scale

Sophisticated Defenses

  • Filtering Systems: Remove malicious elements rather than stopping tasks
  • Multi-Step Processes: Intermediate filtering between agents and actions
  • Advanced Guardrails: Evaluation of newer systems like Llama Firewall

Expanded Coverage

  • Computer-Use Agents: OS-level agent evaluation (in development)
  • Domain-Specific Testing: Healthcare, finance, and other specialized applications
  • Large-Scale Analysis: Systematic evaluation across diverse attack types

Getting Involved

The DoomArena framework is open source and designed for community collaboration. The team particularly welcomes:

  • Benchmark Integration: Help adapting existing agent benchmarks for security testing
  • Attack Development: Contributing new attack types and threat models
  • Defense Research: Developing more sophisticated mitigation strategies

Conclusion

As AI agents become more capable and widely deployed, security testing must evolve beyond static benchmarks and ad hoc red teaming. DoomArena provides a principled, extensible framework for evaluating agent security in realistic deployment scenarios.

The framework's early results demonstrate significant vulnerabilities in current agent systems and highlight the inadequacy of existing defense mechanisms. By providing fine-grained, systematic security evaluation alongside task performance assessment, DoomArena enables developers to build more secure and reliable AI agents.

The path forward requires continued collaboration between the security and AI communities to develop both more sophisticated attacks and more effective defenses. Only through comprehensive, realistic security testing can we ensure AI agents are ready for safe deployment in critical applications.

For more information, visit the DoomArena GitHub repository or read the accompanying research paper on arXiv. The framework is actively maintained and welcomes community contributions.

Related Articles

View All

Mastering Data Cleaning for Fine-Tuning LLMs and RAG Architectures

News

In the rapidly advancing field of artificial intelligence, data cleaning has become a mission-critical step in ensuring the success of Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) architectures. This blog emphasizes the importance of high-quality, structured data in preventing AI model hallucinations, reducing algorithmic bias, enhancing embedding quality, and improving information retrieval accuracy. It covers essential AI data preprocessing techniques like deduplication, PII redaction, noise filtering, and text normalization, while spotlighting top tools such as IBM Data Prep Kit, AI Fairness 360, and OpenRefine. With real-world applications ranging from LLM fine-tuning to graph-based knowledge systems, the post offers a practical guide for data scientists and AI engineers looking to optimize performance, ensure ethical compliance, and build scalable, trustworthy AI systems.

Architecture of Data Prep Kit Framework 

Technical Report

The Data Prep Kit (DPK) framework enables scalable data transformation using Python, Ray, and Spark, while supporting various data sources such as local disk, S3, and Hugging Face datasets. It defines abstract base classes for transformations, allowing developers to implement custom data and folder transforms that operate seamlessly across different runtimes. DPK also introduces a data abstraction layer to streamline data access and facilitate checkpointing. To support large-scale processing, it provides three runtimes: Python for small datasets, Ray for distributed execution across clusters, and Spark for highly scalable processing using Resilient Distributed Datasets (RDDs). Additionally, DPK integrates with Kubeflow Pipelines (KFP) for automating transformations within Kubernetes environments. The framework includes transform utilities, testing support, and simplified APIs for invoking transforms efficiently. By abstracting complexity, DPK simplifies development, deployment, and execution of data processing pipelines in both local and distributed environments.

Transform Pipelines in Data Prep Kit 

Technical Report

The blog post explores how Kubeflow Pipelines (KFP) automate Data Prep Kit (DPK) transforms on Kubernetes, simplifying execution, scaling, and scheduling. It details the required Kubernetes infrastructure, reusable KFP components, and a pipeline generator for automating workflows. By integrating KFP, DPK streamlines orchestrating and managing complex data transformations.