By Joe Spisak (Meta), Dean Wampler (IBM), Jonathan Bnayahu (IBM)
The era of generative AI has ushered in new challenges and risks, which have completely changed the way we think about product development. With increasingly diverse ways to evaluate models, we learn not only about emergent capabilities but also about the potential harms they bring. One of the major challenges, as evaluations become more specialized, is that experts from the many fields that intersect with generative AI have limited ways to collaborate.
For example, at Meta there is a team of cyber security experts who build models for coding and productivity and, in parallel, are building safeguards against things like malicious code generation. For many CBRNE (chemical, biological, radiological, nuclear, and explosives) risks, these experts don’t have a central place to aggregate their ideas, nor are they working alongside generative AI experts. Furthermore, there isn’t a de facto place today for the open community to evaluate its models across a growing number of potential harms.
This is one of the key reasons we created the AI Alliance: to bring together global experts to collaborate and ultimately understand not only how to measure these emergent capabilities and what we mean by evaluations, but also how best to mitigate the risks of using generative AI.
State of the AI model eval world
There are a number of ways to evaluate a model (coding, reasoning, etc.), but the line of sight from something like the Massive Multitask Language Understanding (MMLU) or HellaSwag datasets to what the downstream consumer (i.e., the developer) wants in terms of application performance is unclear and certainly non-linear.
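To make that line of sight concrete, the sketch below shows how multiple-choice benchmarks in the MMLU/HellaSwag family are typically scored: the model ranks each candidate completion by log-likelihood, and accuracy is how often the top-ranked candidate matches the labeled answer. This is a minimal illustration, not the code of any particular harness; the checkpoint name and the single example item are placeholders.

```python
# Minimal sketch of multiple-choice scoring (MMLU/HellaSwag style).
# The checkpoint and the example item are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; substitute the model under evaluation
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def completion_logprob(context: str, completion: str) -> float:
    """Sum of token log-probabilities the model assigns to `completion` given `context`."""
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # position i predicts token i+1
    targets = full_ids[0, 1:]
    start = ctx_len - 1  # keep only the completion tokens
    return log_probs[start:].gather(1, targets[start:, None]).sum().item()

# One illustrative item: the "benchmark" asks which continuation the model prefers.
item = {
    "context": "The chef cracked the eggs into the bowl and",
    "choices": [" whisked them together.", " drove to the airport.", " painted the ceiling."],
    "label": 0,
}
scores = [completion_logprob(item["context"], c) for c in item["choices"]]
predicted = max(range(len(scores)), key=scores.__getitem__)
print("correct" if predicted == item["label"] else "incorrect", scores)
```

Aggregate that over a few thousand items and you get a leaderboard number; whether that number predicts how the model will behave in your application is exactly the open question.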
In many ways, we are really talking about the evaluation of a model or agent as the new PRD (product requirements document). This flips product development on its head: defining an eval up front and working backwards requires foundation model developers to derive everything from safety mitigations to data mixtures for both pretraining and post-training. We almost need a neural net as a co-pilot to help us design all of this!
While this isn’t a perfect comparison, since it’s hard for a product manager to specify emergent capabilities, the analogy is apt. Having evaluations that accurately represent success in the application environment gives model developers something to shoot for - and we all know that when we get a good benchmark to optimize toward, we as a community go nuts!
Moreover, while it’s still very early, there are signs that we can more confidently start from desired capabilities and adjust the data mixture at various stages of the model development process to align our models more closely with how we want them to behave in the eventual application. Whether we need to go all the way back to pre-training is still a good question, but we certainly know we can do a lot at the continual pre-training (CPT) and supervised fine-tuning (SFT) stages to get models to behave in the ways we want (e.g., speak languages other than English).
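As a rough illustration of what “adjusting the data mixture” means, the toy sketch below samples training examples from several corpora in proportion to explicit weights, the kind of knob one might turn at the CPT or SFT stage to upweight, say, non-English text. The corpus names, contents, and weights are invented for illustration and are not a recommended recipe.

```python
# Toy weighted data mixture for continual pre-training or supervised fine-tuning.
# Corpus names, contents, and weights are illustrative placeholders.
import random
from collections import Counter

def mix_corpora(corpora: dict, weights: dict, n_samples: int, seed: int = 0):
    """Yield (corpus_name, example) pairs, sampling corpora in proportion to their weights."""
    rng = random.Random(seed)
    names = list(corpora)
    probs = [weights[name] for name in names]
    for _ in range(n_samples):
        name = rng.choices(names, weights=probs, k=1)[0]
        yield name, rng.choice(corpora[name])

corpora = {
    "english_web": ["an example English document ..."],
    "multilingual_web": ["un exemple de document en français ..."],
    "code": ["def add(a, b): return a + b"],
}
# Upweight multilingual text relative to a hypothetical baseline mixture.
weights = {"english_web": 0.50, "multilingual_web": 0.35, "code": 0.15}

counts = Counter(name for name, _ in mix_corpora(corpora, weights, n_samples=10_000))
print(counts)  # counts land roughly in proportion to the weights
```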
So what are the goals of evaluations and leaderboards?
The answer is clearly dependent on the target application. It can be a combination of wanting to find the best model across performance, type (i.e., base, chat, multimodal, different languages, etc.), latency, and cost - or a local optimum across all of them, depending on who the developer is and what their application goals, level of sophistication, and desired level of customization are.
For the average model developer today, a common pattern is to start with the HELM, Open LLM, or LMSYS leaderboards, sort on the provided metrics, see which models rise to the top and which can fit onto a Colab instance, and then quickly load something up and hack a prototype.
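That prototyping step often looks something like the sketch below: pick a checkpoint off the leaderboard, load it with the transformers pipeline API, and start prompting. The checkpoint here is just a small stand-in; substitute whichever model ranked well for your needs and fits a Colab instance.

```python
# Quick prototype after shortlisting a model from a leaderboard.
# "gpt2" is a small stand-in; swap in the checkpoint you actually shortlisted.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "The three main tradeoffs when choosing a foundation model are"
print(generator(prompt, max_new_tokens=60, do_sample=False)[0]["generated_text"])
```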
Let’s walk through some of the leaderboards and evaluations:
1. Open LLM Leaderboard and HELM
- Maintainer: Hugging Face & Stanford, respectively
- Summary: A sortable set of models, open and closed, evaluated on mostly academic benchmarks. Covers core scenarios such as Q&A, MMLU (Massive Multitask Language Understanding), MATH, GSM8K (Grade School Math), LegalBench, MedQA, WMT 2014 along with other benchmarks. These are largely open source evaluations that have been developed over the years by various researchers.
- Developer goal: General AI developers looking for models to build on will come here first to understand, for a given size, which models perform best, and then leverage the results to fine-tune for a custom application. One thing that is clear, however, is that the community has entered a stage of disillusionment about how models actually perform in the real world versus how they rank on academic evaluations. This has driven the development of other leaderboards such as Chatbot Arena (discussed later).
2. Hallucinations Leaderboard
- Maintainer: Hugging Face
- Summary: It evaluates the propensity for hallucination in LLMs across a diverse array of tasks, including Closed-book Open-domain QA, Summarization, Reading Comprehension, Instruction Following, Fact-Checking, Hallucination Detection, and Self-Consistency. The evaluation encompasses a wide range of datasets such as NQ Open, TriviaQA, TruthfulQA, XSum, CNN/DM, RACE, SQuADv2, MemoTrap, IFEval, FEVER, FaithDial, True-False, HaluEval, and SelfCheckGPT, offering a comprehensive assessment of each model's performance in generating accurate and contextually relevant content.
- Developer goal: Have a starting point for understanding which models hallucinate the least. There is some overlap with other available leaderboards here, and, given that these models will always hallucinate without RAG or search integration, this leaderboard is less interesting for developers.
3. Chatbot Arena
- Maintainer: UC Berkeley
- Summary: A benchmark platform for large language models (LLMs) that features anonymous, randomized battles in a crowdsourced manner. Some experts, like Andrej Karpathy, find it to be the best test of a model’s performance as the arena is much more dynamic and compares models head to head.
- Developer goal: Understand how models will perform in a real environment that challenges them in ways that static datasets and prompts can’t. Like #1 above, this may influence the developer's starting point for further work.
4. MTEB Leaderboard
- Maintainer: Hugging Face & Cohere
- Summary: MTEB, the “Massive Text Embedding Benchmark,” contains 8 embedding tasks covering a total of 58 datasets and 112 languages. By benchmarking 33 models, it establishes the most comprehensive benchmark of text embeddings to date.
- Developer goal: Understand which models could perform best in applications where retrieval augmented generation (RAG) is used.
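In that spirit, a quick sanity check on an embedding model shortlisted from the MTEB leaderboard might look like the sketch below: embed a toy corpus and a query, then rank documents by cosine similarity, which is the core retrieval step in a RAG pipeline. The checkpoint and documents are illustrative; MTEB’s own `mteb` package is the rigorous way to run the full benchmark.

```python
# Toy retrieval check for an embedding model shortlisted from the MTEB leaderboard.
# The checkpoint and the documents are illustrative placeholders.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder candidate model

docs = [
    "The AI Alliance Trusted Evals RFP invites proposals on model evaluation.",
    "MTEB covers embedding tasks across dozens of datasets and over a hundred languages.",
    "Chatbot Arena ranks models through crowdsourced head-to-head battles.",
]
query = "Which benchmark focuses on text embeddings?"

doc_emb = model.encode(docs, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_emb, doc_emb)[0]

for doc, score in sorted(zip(docs, scores.tolist()), key=lambda pair: -pair[1]):
    print(f"{score:.3f}  {doc}")
```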
5. Artificial Analysis
- Maintainer: Independent / Startup
- Summary: Artificial Analysis provides benchmarks and information to support developers, customers, researchers, and other users of AI models to make informed decisions in choosing which AI model to use for a given task, and which hosting provider to use to access the model.
- Developer goal: Understand cost, quality, and speed tradeoffs across models and model providers (e.g., OpenAI, Microsoft Azure, Together.ai, Mistral, Google, Anthropic, Amazon Bedrock, Perplexity, Fireworks, Lepton, and Deepinfra).
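One lightweight way to reason about those tradeoffs is to give each candidate endpoint an explicit weighted score, as in the toy sketch below. Every name, number, and weight here is invented for illustration; these are not Artificial Analysis figures.

```python
# Toy weighted scoring of hosted-model options across quality, cost, and speed.
# All names and figures are invented placeholders, not real benchmark data.
candidates = [
    # (name, quality score 0-100, USD per 1M tokens, tokens per second)
    ("provider-a/model-x", 82, 10.0, 35),
    ("provider-b/model-y", 74, 1.2, 90),
    ("provider-c/model-z", 68, 0.4, 120),
]

def tradeoff_score(quality, cost, speed, w_quality=0.6, w_cost=0.25, w_speed=0.15):
    """Higher is better: reward quality and speed, penalize cost (crude normalization)."""
    return w_quality * (quality / 100) + w_speed * (speed / 150) - w_cost * (cost / 10)

for name, quality, cost, speed in sorted(
    candidates, key=lambda c: tradeoff_score(*c[1:]), reverse=True
):
    print(f"{name}: score={tradeoff_score(quality, cost, speed):.3f}")
```

The weights are exactly the part that depends on who the developer is and what the application needs, which is the point of consulting a leaderboard like this one.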
6. Enterprise Scenarios Leaderboard
- Maintainer: Hugging Face
- Summary: The Enterprise Scenarios leaderboard evaluates the performance of language models on real-world enterprise use cases such as finance and legal.
- Developer goal: This leaderboard is still very much nascent, as you can see from the lack of public models competing. Overall, though, this set of evaluations would give developers an idea of which base models to use as a starting point for a given task. Again, it is still very early.
7. ToolBench
- Maintainer: SambaNova Systems
- Summary: A tool manipulation benchmark consisting of diverse software tools for real-world tasks.
- Developer goal: Illustrate which available models perform best at generating actions described in natural language. Note that this project also looks like a great resource for developers more broadly on how to implement a particular tool usage (i.e., plug-in) for their model.
We are absolutely in the early innings, and there is no one-stop shop. If you can dream up a use case or requirement, you could probably define an evaluation and leaderboard for it. As the PARC researcher Alan Kay said, “the best way to predict the future is to invent it.”
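As a concrete footnote to the ToolBench entry above, here is a minimal sketch of the kind of check a tool-use benchmark can perform: parse the model’s emitted API call and compare it against a reference call for the same natural-language instruction. The call format, tool schema, and exact-match rule below are invented for illustration; this is not ToolBench’s actual scoring code.

```python
# Toy grader for action generation: does the emitted tool call match the reference?
# The call format, tool schema, and exact-match rule are illustrative only.
import json

def grade_tool_call(generated: str, reference: dict) -> bool:
    """Expect the model to emit JSON like {"tool": ..., "arguments": {...}}."""
    try:
        call = json.loads(generated)
    except json.JSONDecodeError:
        return False
    return (
        call.get("tool") == reference["tool"]
        and call.get("arguments") == reference["arguments"]
    )

reference = {"tool": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}
model_output = '{"tool": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}'
print(grade_tool_call(model_output, reference))  # True only for an exact match
```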
Meet the AI Alliance Trust and Safety Working Group
Model evaluation, or ‘evals’, in the generative AI era is simultaneously one of the most important areas of investment and one of the highest-entropy, given the breadth of possible concerns and approaches. Our goals for the working group are to:
- Raise awareness of community efforts around trust and safety, including, but not limited to, work happening globally in various languages and in various domains (e.g., cyber security, CBRNE), as well as evals not related to safety, such as those for evaluating sustainability and the general alignment of a system with its intended purposes. Foster and grow the academic and technical communities working on trustworthy AI and, by extension, create a center of mass for domain experts who can help us push beyond where evals stand today and into areas where we don’t yet have good visibility; and
- Drive the development of comprehensive, reliable, and stable tools for model evaluation: tools that provide repeatable, reproducible, and diverse results, and that are regularly refreshed and ever evolving as we learn about new risks for generative AI and other evaluation concerns. This means, as a community, creating new benchmarks and metrics to address the quality, safety, robustness, and performance aspects of generative AI models. The approach is to be as broad and diverse as possible, with the goal of uncovering new evals in as many domains as possible so we can learn as a community. The goal is NOT to create a standard for model evaluations, but to work closely with MLCommons to help shepherd a subset of these evals into their standardization effort.
To foster further innovation in this area, we are pleased to invite the community to participate in the AI Alliance’s Trusted Evals working group by submitting a response to the AI Alliance Trusted Evals RFP to be included in efforts to:
- Raise awareness - selected proposals will be showcased through AI Alliance communications, including our newsletter, blog, whitepapers, and website; and
- Drive the development of comprehensive, reliable, and stable tools: the AI Alliance intends to support select project proposals with resources to help teams accelerate progress in building the foundations of safe, trusted AI.
For this RFP, we are excited to work with those in academia, industry, and startups, and with anyone else eager to collaborate in the open and build an ecosystem around their work.
Areas of interest
- Cybersecurity threats
- Sensitive data detection including areas such as toxic content (e.g. hate speech), personally identifiable information (PII), bias, etc.
- Model performance including helpfulness, quality, alignment, robustness, etc., (as opposed to operational concerns like throughput, latency, scalability, etc.)
- Knowledge and factuality
- Multilingual evaluation
- Mediation
- Balancing harms and helpfulness
- Personal data memorization / data governance
- Vertical domains such as legal, financial, medical
- Areas related to CBRNE - Chemical, Biological, Radiological, Nuclear, and high yield Explosives
- Weapons acquisition specifically
- Measuring dataset integrity when data is created by AI: label fairness, prompt generation for RLHF
- Effectiveness of tool use that exacerbates malicious intent
- Demographic representation across different countries
- Distribution bias
- Political bias
- Capability fairness
- Undesirable use cases
- Regional bias, discrimination
- Violence and hate
- Terrorism and sanctioned individuals
- Defamation
- Misinformation
- Guns, illegal weapons, controlled substances