Interesting open source projects, models, and evals...
In addition to the survey, we also looked across the ecosystem to see what was trending. While open LLMs have continued to be released by industry and startups including Granite, Mistral, Llama, Gemma, Phi and many others, safety has been less of a focus. Here is a brief survey of what we’ve seen in the community but please reach out if you know of other great projects to include and we’d be happy to update!
Open source guardrails
NeMo-Guardrails: NeMo Guardrails is an open-source toolkit for easily adding programmable guardrails to LLM-based conversational applications. Guardrails (or "rails" for short) are specific ways of controlling the output of a large language model, such as not talking about politics, responding in a particular way to specific user requests, following a predefined dialog path, using a particular language style, extracting structured data, and more.
Llama Guard: Llama Guard models consist of a series of high-performance input and output moderation models designed to support developers to detect various common types of violating content. The latest versions filter both image input and text input/outputs in multiple languages.
Prompt Guard: Prompt Guard is a classifier model trained on a large corpus of attacks, capable of detecting both explicitly malicious prompts as well as data that contains injected inputs. The model is useful as a starting point for identifying and guardrailing against the most risky realistic inputs to LLM-powered applications; for optimal results we recommend developers fine-tune the model on their application-specific data and use cases.
Granite Guardian: A collection of models designed to detect risks in user prompts and LLM responses, along the risk dimensions catalogued in IBM’s AI Risk Atlas.
CodeShield: CodeShield is a robust inference time filtering tool engineered to prevent the introduction of insecure code generated by LLMs into production systems.
ShieldGemma: A series of safety classifiers, trained on top of Gemma 2, for developers to filter inputs and outputs of their applications.
Guardrails AI: Guardrails is a Python framework that helps build reliable AI applications by running Input/Output Guards in your application that detect, quantify and mitigate the presence of specific types of risks
Garak: checks if an LLM can be made to fail in a way we don't want. Garak probes for hallucination, data leakage, prompt injection, misinformation, toxicity generation, jailbreaks, and many other weaknesses.
Roblox’s Voice Safety Classifier: checks for voice toxicity and classification trained on a manually curated real-world dataset. The model weights either by downloading from from HuggingFace under roblox/voice-safety-classifier
PII Masker: a Python tool designed to identify and mask Personally Identifiable Information (PII) in text using a pre-trained NLP model based on the DeBERTa-v3 architecture.
Safety measurement & benchmarking
TrustyAI: TrustyAI is, at its core, a Java library and service for Explainable AI (XAI). TrustyAI offers fairness metrics, explainable AI algorithms, and various other XAI tools at a library-level as well as a containerized service and Kubernetes deployment.
Unitxt: Unitxt is a Python library for textual data preparation and evaluation of generative language models. It deconstructs the data preparation and evaluation flows into modular components, enabling easy customization and sharing between practitioners. Unitxt is an AI Alliance Affiliated Project.
Attaq, ProvoQ, and SocialStigmaQA: Red-teaming and social bias evaluation datasets.
BenchBench: a Python package designed to facilitate benchmark agreement testing for NLP models. It allows users to easily compare multiple models against various benchmarks and generate comprehensive reports on their agreement. A safety-oriented version of BenchBench is deployed on the AI Alliance space, as SafetyBAT.
Lm-evaluation-harness: This project provides a unified framework to test generative language models on a large number of different evaluation tasks and is the backend for 🤗 Hugging Face's popular Open LLM Leaderboard, has been used in hundreds of papers, and is used internally by dozens of organizations including NVIDIA, Cohere, BigScience, BigCode, Nous Research, and Mosaic ML.
Project Moonshot: Developed by the AI Verify Foundation, Moonshot is one of the first tools to bring Benchmarking and Red-Teaming together to help AI developers, compliance teams and AI system owners evaluate LLMs and LLM applications.
Giskard: Giskard is an open-source Python library that automatically detects performance, bias & security issues in AI applications. The library covers LLM-based applications such as RAG agents, all the way to traditional ML models for tabular data.
CyberSec Evaluation: CyberSecEval 3 is an extensive benchmark suite designed to assess the cybersecurity vulnerabilities of Large Language Models (LLMs). Building on its predecessor, CyberSecEval 2, this latest version introduces three new test suites: visual prompt injection tests, spear phishing capability tests, and autonomous offensive cyber operations tests. Created to meet the increasing demand for secure AI systems, CyberSecEval 3 offers a comprehensive set of tools to evaluate various security domains. It has been applied to well-known LLMs such as Llama2, Llama3, codeLlama, and OpenAI GPT models. The findings underscore substantial cybersecurity threats, underscoring the critical need for continued research and development in AI safety.
Detoxify: Trained models & code to predict toxic comments on 3 Jigsaw challenges: Toxic comment classification, Unintended Bias in Toxic comments, Multilingual toxic comment classification. Built by Laura Hanu at Unitary, where they are working to stop harmful content online by interpreting visual content in context.
MLCommons AILuminate: a community based effort focused on: 1) Curate a pool of safety tests from diverse sources; 2) Defining benchmarks for specific AI use-cases, each of which uses a subset of the tests and summarizes the results in a way that enables decision making by non-experts; and 3) Developing a community platform for safety testing of AI systems that supports registration of tests, definition of benchmarks, testing of AI systems, management of test results, and viewing of benchmark scores.
DecodingTrust: comprehensive and unified evaluation platform dedicated to assessing the trustworthiness of LLMs.
RedArena: Part of the LMSys effort, RedTeam Arena provides a time bound gamified platform for red teamers to attempt to jailbreak models.
HydroX AI safety community: periodically publishes the latest safety ranking of both open-source and closed-source models. Results are based on evaluation across 30+ safety categories (e.g. bias) and 20+ advanced jailbreaks (e.g. AutoDAN). Its affiliated community, Silicon Wall-E, provides a gamified LLM jailbreak experience for public awareness.