PredictaBoard: Benchmarking LLM Score Predictability

1Leverhulme Centre for the Future of Intelligence, University of Cambridge, United Kingdom
2Institute for Human-Centered AI, Helmholtz Zentrum Munich, Germany
3VRAIN, Universitat Politècnica de València, Spain
The PredictaBoard framework

We introduce PredictaBoard, a novel benchmarking framework that simultaneously evaluates an LLM's score and how well an "assessor" module can predict that score on individual questions. This makes the size of the LLM's predictably valid operating region the crucial consideration. PredictaBoard thus shifts the focus away from average performance and promotes the development of AI systems whose errors can be anticipated, an essential requirement in high-stakes scenarios.

Abstract

Despite possessing impressive skills, Large Language Models (LLMs) often fail unpredictably, demonstrating inconsistent success even on basic commonsense reasoning tasks. This unpredictability poses a significant challenge to ensuring their safe deployment, as identifying and operating within a reliable "safe zone" is essential for mitigating risks. To address this, we present PredictaBoard, a novel collaborative benchmarking framework designed to evaluate the ability of score predictors (referred to as assessors) to anticipate LLM errors on specific task instances (i.e., prompts) from existing datasets. PredictaBoard evaluates pairs of LLMs and assessors by considering the rejection rate at different error tolerances. As such, PredictaBoard stimulates research into developing better assessors and into making LLMs more predictable, not merely better on average. We conduct illustrative experiments using baseline assessors and state-of-the-art LLMs. PredictaBoard highlights the critical need to evaluate predictability alongside performance, paving the way for safer AI systems where errors are not only minimised but also anticipated and effectively mitigated.

Predicting validity

We use assessors to predict the score (or, more generally, the validity) of a model on a question. Assessors can be trained to predict the model's score from features of the question, using a dataset of (question, score) pairs. We require assessors to make their predictions without relying on the model's outputs, to prevent them from leveraging knowledge of the ground truth; they can, however, potentially use model internals. Moreover, while we focus on assessors, PredictaBoard can also be used to evaluate other approaches to validity prediction, such as self-confidence or guarantees.
How an assessor is trained
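
To make this concrete, here is a minimal baseline assessor sketch in Python. It is not the PredictaBoard code: TF-IDF features plus logistic regression stand in for the embedding-based baselines used in the paper, and the prompts and scores are toy placeholders.

# A minimal assessor sketch: predict from the prompt alone whether a given LLM
# will answer correctly. TF-IDF + logistic regression stands in for the
# embedding-based baselines used in the paper; prompts and scores are toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical (question, score) pairs: 1 = the LLM answered correctly, 0 = it did not.
train_prompts = [
    "What is 2 + 2?",
    "Prove the Riemann hypothesis in one line.",
    "Name the capital of France.",
    "How many prime numbers are even?",
]
train_scores = [1, 0, 1, 1]

assessor = make_pipeline(
    TfidfVectorizer(),                  # features of the question only; no LLM output is used
    LogisticRegression(max_iter=1000),
)
assessor.fit(train_prompts, train_scores)

# Predicted probability of a correct answer for a new prompt.
p_correct = assessor.predict_proba(["Is every even number greater than 2 a sum of two primes?"])[:, 1]
print(p_correct)

Thresholding the predicted probability of correctness turns the assessor into an accept/reject rule, which is what the metrics below evaluate.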

Metrics for predictability

To jointly measure LLM score and predictability, we consider the Accuracy-Rejection Curve (ARC, displayed below). This is obtained by varying the assessor's rejection threshold and plotting the LLM's accuracy on the non-rejected samples as a function of the rejection rate corresponding to each threshold. For a fixed error tolerance (e.g., 20%), we can then compute the highest non-rejection rate whose error remains below the tolerance; we call this the Predictably Valid Region (PVR). To summarise behaviour across all rejection thresholds, we can also compute the Area Under the ARC (AUARC).

PredictaBoard also includes other metrics that score the assessor or the LLM independently, but the ARC and its derived metrics capture their combined behaviour.
ARC curves

In the plot above, each system corresponds to an LLM-assessor pair. Systems A and B have overall accuracy a and b respectively, with b larger than a. However, for an error tolerance of 20%, System A has a larger non-rejection rate. We therefore say that System A's 0.8 PVR is larger than System B's (where 0.8 refers to the minimum required accuracy).
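
The following sketch shows one way to compute these quantities from per-instance results. It is illustrative rather than the PredictaBoard implementation; the function names and toy values are our own.

# Sketch of the metrics above (not the PredictaBoard implementation). Given the
# per-instance correctness of an LLM and an assessor's confidence that each
# instance will be answered correctly, compute the accuracy-rejection curve,
# the Predictably Valid Region (PVR) at a chosen minimum accuracy, and the AUARC.
import numpy as np

def accuracy_rejection_curve(correct, confidence):
    """Accuracy on the kept instances as the least-confident ones are rejected first."""
    order = np.argsort(confidence)             # reject in order of increasing confidence
    correct = np.asarray(correct, dtype=float)[order]
    n = len(correct)
    rejection_rates = np.arange(n) / n         # rejecting the k least-confident instances
    kept_accuracy = np.array([correct[k:].mean() for k in range(n)])
    return rejection_rates, kept_accuracy

def pvr(correct, confidence, min_accuracy=0.8):
    """Largest kept fraction whose accuracy still meets the required minimum."""
    rej, acc = accuracy_rejection_curve(correct, confidence)
    valid = acc >= min_accuracy
    return float((1.0 - rej[valid]).max()) if valid.any() else 0.0

def auarc(correct, confidence):
    """Area under the accuracy-rejection curve (trapezoidal approximation)."""
    rej, acc = accuracy_rejection_curve(correct, confidence)
    return float(np.trapz(acc, rej))

# Toy usage with hypothetical correctness/confidence values.
correct = [1, 0, 1, 1, 0, 1, 1, 1]
confidence = [0.9, 0.2, 0.8, 0.7, 0.4, 0.95, 0.6, 0.85]
print(pvr(correct, confidence, min_accuracy=0.8))  # 0.875: keep 7/8 instances at >= 80% accuracy
print(auarc(correct, confidence))

Defining the curve by rejecting the k least-confident instances for k = 0, ..., n-1 avoids choosing explicit thresholds; any monotone confidence score can be plugged in.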

Findings

We perform experiments with a set of baseline assessors based on various text embedding approaches and with 41 LLMs. PredictaBoard includes the code to train these assessors, together with the instance-level LLM results on MMLU-Pro and BIG-Bench Hard (BBH). This allows researchers to work on improving assessors or LLMs independently. One split of MMLU-Pro is used to train the assessors and another to test them, while BBH is kept as an out-of-distribution (OOD) dataset.
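
As an illustration of this split protocol (with an assumed data layout rather than the repository's actual loading code), the sketch below trains an assessor on one split and evaluates its confidence scores both in-distribution and out-of-distribution; all prompts and scores are toy placeholders.

# Hedged sketch of the split protocol described above (assumed data layout, not
# the repository's actual code): fit an assessor on one MMLU-Pro split, then check
# how well its confidence ranks instances in-distribution (the held-out MMLU-Pro
# split) and out-of-distribution (BBH).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline

mmlu_pro_train = [("MMLU-Pro question A", 1), ("MMLU-Pro question B", 0),
                  ("MMLU-Pro question C", 1), ("MMLU-Pro question D", 0)]
mmlu_pro_test  = [("MMLU-Pro question E", 1), ("MMLU-Pro question F", 0)]
bbh_ood        = [("BBH instance G", 0), ("BBH instance H", 1)]

assessor = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
train_prompts, train_scores = zip(*mmlu_pro_train)
assessor.fit(list(train_prompts), list(train_scores))

for name, split in [("in-distribution (MMLU-Pro test split)", mmlu_pro_test),
                    ("out-of-distribution (BBH)", bbh_ood)]:
    prompts, scores = zip(*split)
    confidence = assessor.predict_proba(list(prompts))[:, 1]
    # Simple check of how well the confidence discriminates correct from incorrect
    # instances; the same confidences can be fed into the ARC/PVR/AUARC computation above.
    print(name, roc_auc_score(scores, confidence))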

In-distribution results

In-distribution Predictably Valid Regions and Area Under the Accuracy Rejection Curve

The figure above shows the top LLM-assessor pairs for three minimum accuracy thresholds (0.8, 0.9, 0.95). At a threshold of 0.8, the top LLM-assessor pairs perform well, as assessors can reliably predict success. However, when the threshold increases to 0.9 or higher, the predictably valid region drops significantly, as assessors face greater difficulty in anticipating failures. The best-performing pairs consistently include LLMs with high average accuracy, which aligns with expectations. Interestingly, the ranking of LLM-assessor pairs is not preserved across different thresholds, highlighting variations in assessor effectiveness. These findings underscore the importance of developing AI systems capable of operating under strict error constraints, particularly in high-risk scenarios. Future research will explore even higher thresholds to further enhance AI safety.

Out-of-distribution results

Out-of-distribution Predictably Valid Regions and Area Under the Accuracy Rejection Curve

The figure above shows the top LLM-assessor pairs for three minimum accuracy thresholds (0.8, 0.9, 0.95). The lower values compared with the in-distribution case highlight the difficulty of predicting performance out of distribution. Additionally, the top pairs differ from the in-distribution ones, suggesting that the pairs performing best in distribution do not necessarily generalise best.

Visualisation of some Accuracy-Rejection Curves


Comparison of the ARC curves of the different in-distribution assessors for the highest-accuracy LLM (OpenAI/GPT-4o-2024-08-06) on MMLU-Pro.

Overall, current LLMs are poorly predictable, particularly at low error tolerances, which is precisely the regime that matters in high-stakes scenarios. PredictaBoard makes predictability the key consideration, stimulating research into more predictable AI systems.

BibTeX

@misc{pacchiardi2025predictaboardbenchmarkingllmscore,
      title={{PredictaBoard}: Benchmarking {LLM} Score Predictability},
      author={Lorenzo Pacchiardi and Konstantinos Voudouris and Ben Slater and Fernando Martínez-Plumed and José Hernández-Orallo and Lexin Zhou and Wout Schellaert},
      year={2025},
      eprint={2502.14445},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.14445},
}

Acknowledgements

This research project has benefitted from the Microsoft Accelerate Foundation Models Research (AFMR) grant program.