Despite possessing impressive skills, Large Language Models (LLMs) often fail unpredictably, demonstrating inconsistent success even on basic common-sense reasoning tasks. This unpredictability poses a significant challenge to their safe deployment, as identifying and operating within a reliable "safe zone" is essential for mitigating risks. To address this, we present PredictaBoard, a novel collaborative benchmarking framework designed to evaluate the ability of score predictors (referred to as assessors) to anticipate LLM errors on specific task instances (i.e., prompts) from existing datasets. PredictaBoard evaluates pairs of LLMs and assessors by considering the rejection rate at different error tolerances. In doing so, PredictaBoard stimulates research into developing better assessors and into making LLMs more predictable, rather than merely raising their average performance. We conduct illustrative experiments using baseline assessors and state-of-the-art LLMs. PredictaBoard highlights the critical need to evaluate predictability alongside performance, paving the way for safer AI systems in which errors are not only minimised but also anticipated and effectively mitigated.
In the plot above, each system corresponds to an LLM-assessor pair. Systems A and B have overall accuracies a and b respectively, with b larger than a. However, at an error tolerance of 20%, System A has a larger non-rejection rate. We therefore say that System A's 0.8 Predictably Valid Region (PVR) is larger than System B's, where 0.8 refers to the minimum required accuracy.
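To make this concrete, below is a minimal Python sketch of one way to compute such a PVR from instance-level results: given the correctness of a fixed LLM on each instance and the assessor's predicted success probabilities, it sweeps a rejection threshold and returns the largest non-rejection rate for which accuracy on the accepted instances still meets the minimum required accuracy. This is an illustrative operationalisation under our own simplifying assumption that the threshold is chosen on the same data, not the official PredictaBoard implementation; the function name `pvr` and the toy arrays are hypothetical.

```python
import numpy as np

def pvr(success_probs: np.ndarray, correct: np.ndarray, min_accuracy: float = 0.8) -> float:
    """Illustrative PVR: largest non-rejection rate achievable by thresholding the
    assessor's predicted success probabilities, subject to the accuracy on the
    accepted (non-rejected) instances being at least `min_accuracy`."""
    order = np.argsort(-success_probs)                # accept most-confident instances first
    cum_acc = np.cumsum(correct[order]) / np.arange(1, len(correct) + 1)
    valid = np.flatnonzero(cum_acc >= min_accuracy)   # cutoffs meeting the accuracy requirement
    return 0.0 if len(valid) == 0 else (valid[-1] + 1) / len(correct)

# Toy example: 5 of 6 instances can be accepted while keeping accuracy >= 0.8
probs   = np.array([0.95, 0.90, 0.80, 0.70, 0.60, 0.30])
correct = np.array([1, 1, 1, 0, 1, 0])
print(pvr(probs, correct, min_accuracy=0.8))          # ~0.83
```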
We perform experiments with a set of baseline assessors based on various text embedding approaches and 41 LLMs. PredictaBoard includes the code to train these assessors, together with the instance-level LLM results on MMLU-Pro and BIG-Bench Hard (BBH), allowing researchers to work on improving assessors or LLMs independently. One split of MMLU-Pro is used to train the assessors and another to test them, while BBH is kept as an out-of-distribution (OOD) dataset.
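As a rough sketch of what such an embedding-based baseline assessor might look like, the snippet below encodes prompts with a sentence embedder and fits a classifier on a given LLM's instance-level correctness labels. The specific choices here (`sentence-transformers`, the `all-MiniLM-L6-v2` model, scikit-learn's logistic regression) are our own assumptions for illustration and need not match the baselines shipped with PredictaBoard.

```python
# Sketch of an embedding-based assessor: predict, from the prompt text alone,
# whether a given LLM will answer correctly. Library and model choices here
# (sentence-transformers, all-MiniLM-L6-v2, logistic regression) are illustrative
# assumptions, not necessarily PredictaBoard's baselines.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

def train_assessor(train_prompts, train_correct):
    """Fit a success-probability predictor on instance-level results (e.g., the MMLU-Pro train split)."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    clf = LogisticRegression(max_iter=1000).fit(encoder.encode(train_prompts), train_correct)
    return encoder, clf

def predict_success(encoder, clf, prompts):
    """Assessor scores: predicted probability that the LLM answers each prompt correctly."""
    return clf.predict_proba(encoder.encode(prompts))[:, 1]
```

The scores returned by `predict_success` on the MMLU-Pro test split (or on BBH for the OOD setting) can then be combined with the observed correctness labels to compute the PVR as in the sketch above.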
The figure above shows the top LLM-assessor pairs on the in-distribution MMLU-Pro test split for three minimum accuracy thresholds (0.8, 0.9, 0.95). At a threshold of 0.8, the top LLM-assessor pairs perform well, as assessors can reliably predict success. However, when the threshold increases to 0.9 or higher, the PVR drops significantly, as assessors face greater difficulty in anticipating failures. The best-performing pairs consistently include LLMs with high average accuracy, which aligns with expectations. Interestingly, the ranking of LLM-assessor pairs is not preserved across thresholds, highlighting variations in assessor effectiveness. These findings underscore the importance of developing AI systems capable of operating under strict error constraints, particularly in high-risk scenarios. Future research will explore even higher thresholds to further enhance AI safety.
The figure above shows the top LLM-assessor pairs on the out-of-distribution dataset (BBH) for the same three minimum accuracy thresholds (0.8, 0.9, 0.95). The lower values compared with the in-distribution case highlight the difficulty of predicting performance OOD. Additionally, the top pairs differ from the in-distribution ones, suggesting that the pairs performing best in distribution do not necessarily generalise best.
Figure: results for OpenAI/GPT-4o-2024-08-06 on MMLU-Pro.
@misc{pacchiardi2025predictaboardbenchmarkingllmscore,
title={{PredictaBoard}: Benchmarking {LLM} Score Predictability},
author={Lorenzo Pacchiardi and Konstantinos Voudouris and Ben Slater and Fernando Martínez-Plumed and José Hernández-Orallo and Lexin Zhou and Wout Schellaert},
year={2025},
eprint={2502.14445},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.14445},
}