Do Not Answer: A Dataset for Evaluating Safeguards in LLMs

The Do Not Answer project provides an open-source English-language dataset for evaluating LLMs' safety mechanisms at low cost. The dataset is curated and filtered to contain only prompts that responsible language models should not answer. In addition to human annotations, Do Not Answer implements model-based evaluation, in which a fine-tuned 600M BERT-like evaluator achieves results comparable to those of human and GPT-4 evaluation.
