PredictaBoard: Benchmarking LLM Score Predictability

1Leverhulme Centre for the Future of Intelligence, University of Cambridge, United Kingdom
2Institute for Human-Centered AI, Helmholtz Zentrum Munich, Germany
3VRAIN, Universitat Politècnica de València, Spain
The PredictaBoard framework

We introduce PredictaBoard, a novel benchmarking framework that jointly evaluates an LLM's score and how well an "assessor" module can predict that score on individual questions. This makes the size of the LLM's predictably valid operating region the crucial consideration. PredictaBoard thus shifts the focus away from average performance alone and promotes the development of AI systems whose errors can be anticipated, an essential property in high-stakes scenarios. To contribute to the PredictaBoard project, please see this repo.

Abstract

Despite possessing impressive skills, Large Language Models (LLMs) often fail unpredictably, demonstrating inconsistent success in even basic common sense reasoning tasks. This unpredictability poses a significant challenge to ensuring their safe deployment, as identifying and operating within a reliable "safe zone" is essential for mitigating risks. To address this, we present PredictaBoard, a novel collaborative benchmarking framework designed to evaluate the ability of score predictors (referred to as assessors) to anticipate LLM errors on specific task instances (i.e., prompts) from existing datasets. PredictaBoard evaluates pairs of LLMs and assessors by considering the rejection rate at different error tolerances. As such, PredictaBoard stimulates research into developing better assessors and into making LLMs more predictable, rather than only raising their average performance. We conduct illustrative experiments using baseline assessors and state-of-the-art LLMs. PredictaBoard highlights the critical need to evaluate predictability alongside performance, paving the way for safer AI systems where errors are not only minimised but also anticipated and effectively mitigated.

Predicting validity

We use assessors to predict the score (or, more generally, validity) of a model on a question. Assessors can be trained to predict the model's score from features of the question, using a dataset of (question, score) pairs. We require assessors to make predictions without relying on the model's outputs, so that they cannot leverage knowledge of the ground truth. However, assessors can potentially use model internals. Moreover, while we focus on assessors, PredictaBoard can also be used to evaluate other approaches to validity prediction, such as self-confidence or guarantees.
How an assessor is trained
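
As a rough illustration of the training setup described above, the following sketch fits a tiny assessor on (question, score) pairs. It assumes unigram bag-of-words features and a logistic regression trained by plain stochastic gradient descent with L2 shrinkage. The toy questions, the scores, and all function names are hypothetical and are not taken from the PredictaBoard repo. Note that the assessor only sees the question text, never the LLM's output.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplifying assumption)."""
    return text.lower().split()


def build_vocab(questions):
    """Map each unigram seen in the training questions to a feature index."""
    vocab = {}
    for question in questions:
        for token in tokenize(question):
            vocab.setdefault(token, len(vocab))
    return vocab


def featurize(question, vocab):
    """Sparse unigram counts {feature_index: count}; unseen words are dropped."""
    counts = Counter(t for t in tokenize(question) if t in vocab)
    return {vocab[t]: float(c) for t, c in counts.items()}


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_success(features, weights, bias):
    """Assessor output: estimated probability that the LLM answers correctly."""
    z = bias + sum(weights[i] * v for i, v in features.items())
    return sigmoid(z)


def train_assessor(questions, scores, epochs=200, lr=0.5, l2=0.01):
    """Fit logistic regression on question features with per-sample SGD."""
    vocab = build_vocab(questions)
    weights = [0.0] * len(vocab)
    bias = 0.0
    data = [(featurize(q, vocab), float(s)) for q, s in zip(questions, scores)]
    for _ in range(epochs):
        for features, y in data:
            # Gradient of the log-loss with respect to the logit.
            err = predict_success(features, weights, bias) - y
            for i, v in features.items():
                weights[i] -= lr * (err * v + l2 * weights[i])
            bias -= lr * err
    return vocab, weights, bias


# Hypothetical toy data: 1 = the LLM answered correctly, 0 = it failed.
train_questions = [
    "what is 2 plus 2",
    "what is 3 plus 5",
    "what is 7 plus 1",
    "prove the riemann hypothesis",
    "prove the collatz conjecture",
    "prove the goldbach conjecture",
]
train_scores = [1, 1, 1, 0, 0, 0]

vocab, weights, bias = train_assessor(train_questions, train_scores)
p_easy = predict_success(featurize("what is 4 plus 4", vocab), weights, bias)
p_hard = predict_success(featurize("prove the twin prime conjecture", vocab), weights, bias)
```

In this toy setting the assessor should assign a higher success probability to the arithmetic question than to the proof question. A real assessor would be trained on many more instances and could use richer features, such as text embeddings.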

Metrics for predictability

To jointly measure LLM score and predictability, we consider the Accuracy-Rejection Curve (ARC, displayed below). This is obtained by considering varying rejection thresholds for the assessor and plotting the LLM accuracy on the non-rejected samples as a function of the rejection rate corresponding to each threshold. For a fixed error tolerance (e.g., 20%), we can then compute the highest non-rejection rate that leads to an error below the tolerance. We call this the Predictably Valid Region (PVR). Moreover, to consider multiple rejection thresholds, we can compute the Area Under the ARC (AUARC).

PredictaBoard also includes other metrics that independently score the assessor or the LLM, but the ARC and its derived metrics allow understanding their combined behaviour.
ARC curves

In the plot above, each system corresponds to an LLM-assessor pair. Systems A and B have overall accuracy a and b respectively, with b larger than a. However, for an error tolerance of 20%, System A has a larger non-rejection rate. We therefore say that System A's 0.8 PVR is larger than System B's (where 0.8 refers to the minimum required accuracy).
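
The metrics in this section can be sketched in a few lines. The code below is a minimal illustration, not the PredictaBoard implementation, and it makes several assumptions: rejection keeps the k instances with the highest assessor confidence (ties are broken arbitrarily), the PVR uses "accuracy at least the minimum" as the acceptance condition, and the AUARC is approximated by averaging accuracy over evenly spaced rejection rates. The function names and the toy data are hypothetical.

```python
def accuracy_rejection_curve(confidences, correct):
    """Return (rejection_rate, accuracy) points, sorted by increasing rejection rate.

    For each k = 1..n, the k instances with the highest assessor confidence
    are accepted and the LLM accuracy on them is recorded.
    """
    n = len(confidences)
    order = sorted(range(n), key=lambda i: confidences[i], reverse=True)
    points = []
    hits = 0
    for k, idx in enumerate(order, start=1):
        hits += correct[idx]
        points.append((1.0 - k / n, hits / k))
    points.reverse()
    return points


def predictably_valid_region(curve, min_accuracy):
    """Largest non-rejection rate whose accepted set reaches min_accuracy, else 0.0."""
    rates = [1.0 - rejection for rejection, acc in curve if acc >= min_accuracy]
    return max(rates, default=0.0)


def area_under_arc(curve):
    """Approximate the area under the ARC as the mean accuracy over its points."""
    if not curve:
        return 0.0
    return sum(acc for _, acc in curve) / len(curve)


# Hypothetical toy data: assessor confidences and whether the LLM was correct.
confidences = [0.9, 0.8, 0.7, 0.6, 0.5]
correct = [1, 1, 0, 1, 0]

curve = accuracy_rejection_curve(confidences, correct)
pvr_08 = predictably_valid_region(curve, min_accuracy=0.8)
auarc = area_under_arc(curve)
```

Here the LLM's overall accuracy is 0.6, so accepting everything violates a 20% error tolerance. Accepting only the two most confident instances gives an accuracy of 1.0, so the 0.8 PVR is 0.4 under these assumptions.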

Leaderboards

PredictaBoard uses one split of MMLU-Pro to train assessors and another to test them, while BBH is held out as an out-of-distribution (OOD) dataset. The in-distribution and out-of-distribution leaderboards below show the Predictably Valid Region (PVR) of the considered LLM-assessor pairs for three minimum accuracy thresholds (0.8, 0.9, 0.95) and the Area Under the ARC (AUARC).

Our initial results use 41 LLMs and a set of baseline assessors based on various text-embedding approaches. Our project repo makes it easy to contribute new LLMs or assessors, test them on this fixed setup, and add them to the leaderboards. Please refer to the instructions in the repo.

In-distribution results

The table below shows the results of the considered LLM-assessor pairs for three minimum accuracy thresholds (0.8, 0.9, 0.95). At a threshold of 0.8, the top LLM-assessor pairs perform well, as assessors can reliably predict success. However, when the threshold increases to 0.9 or higher, the predictably valid region drops significantly, as assessors face greater difficulty in anticipating failures. The best-performing pairs consistently include LLMs with high average accuracy, which aligns with expectations.

LLM | Predictive Method | Features | 0.8 PVR | 0.9 PVR | 0.95 PVR | Area under ARC (AUARC)
Orenguteng/Llama-3.1-8B-Lexi-Uncensored-V2 Logistic Regression (l1) OpenAI 0 0 0 0.39
Orenguteng/Llama-3.1-8B-Lexi-Uncensored Logistic Regression (l1) OpenAI 0 0 0 0.377
MaziyarPanahi/calme-2.1-qwen2.5-72b Logistic Regression (l1) OpenAI 0.172 0.081 0 0.726
awnr/Mistral-7B-v0.1-signtensors-1-over-4 Logistic Regression (l1) OpenAI 0 0 0 0.248
wave-on-discord/qwent-7b Logistic Regression (l1) OpenAI 0 0 0 0.185
Aryanne/SuperHeart Logistic Regression (l1) OpenAI 0 0 0 0.431
Intel/neural-chat-7b-v3-2 Logistic Regression (l1) OpenAI 0 0 0 0.277
BlackBeenie/llama-3.1-8B-Galore-openassistant-guanaco Logistic Regression (l1) OpenAI 0 0 0 0.336
Intel/neural-chat-7b-v3-1 Logistic Regression (l1) OpenAI 0 0 0 0.298
ssmits/Qwen2.5-95B-Instruct Logistic Regression (l1) OpenAI 0.162 0.061 0 0.679
LenguajeNaturalAI/leniachat-qwen2-1.5B-v0 Logistic Regression (l1) OpenAI 0 0 0 0.21
migtissera/Tess-v2.5.2-Qwen2-72B Logistic Regression (l1) OpenAI 0.212 0.04 0.01 0.715
migtissera/Llama-3-70B-Synthia-v3.5 Logistic Regression (l1) OpenAI 0.131 0.03 0.02 0.654
migtissera/Trinity-2-Codestral-22B-v0.2 Logistic Regression (l1) OpenAI 0 0 0 0.34
Stark2008/VisFlamCat Logistic Regression (l1) OpenAI 0 0 0 0.337
migtissera/Trinity-2-Codestral-22B Logistic Regression (l1) OpenAI 0 0 0 0.337
awnr/Mistral-7B-v0.1-signtensors-7-over-16 Logistic Regression (l1) OpenAI 0 0 0 0.32
ymcki/gemma-2-2b-ORPO-jpn-it-abliterated-18-merge Logistic Regression (l1) OpenAI 0 0 0 0.25
apple/DCLM-7B Logistic Regression (l1) OpenAI 0 0 0 0.307
Intel/neural-chat-7b-v3-3 Logistic Regression (l1) OpenAI 0 0 0 0.284
migtissera/Tess-3-Mistral-Nemo-12B Logistic Regression (l1) OpenAI 0 0 0 0.306
migtissera/Llama-3-8B-Synthia-v3.5 Logistic Regression (l1) OpenAI 0 0 0 0.308
migtissera/Tess-v2.5-Phi-3-medium-128k-14B Logistic Regression (l1) OpenAI 0 0 0 0.417
awnr/Mistral-7B-v0.1-signtensors-5-over-16 Logistic Regression (l1) OpenAI 0 0 0 0.303
NousResearch/Hermes-3-Llama-3.1-70B Logistic Regression (l1) OpenAI 0 0 0 0.507
OpenAI/GPT-4o-2024-08-06 Logistic Regression (l1) OpenAI 0.788 0.162 0.051 0.847
BlackBeenie/llama-3-luminous-merged Logistic Regression (l1) OpenAI 0 0 0 0.398
LenguajeNaturalAI/leniachat-gemma-2b-v0 Logistic Regression (l1) OpenAI 0 0 0 0.119
migtissera/Tess-3-7B-SFT Logistic Regression (l1) OpenAI 0 0 0 0.307
awnr/Mistral-7B-v0.1-signtensors-3-over-8 Logistic Regression (l1) OpenAI 0 0 0 0.318
meta-llama/Meta-Llama-3.1-70B-Instruct Logistic Regression (l1) OpenAI 0.232 0.051 0.02 0.72
OpenAI/GPT-4o-mini Logistic Regression (l1) OpenAI 0.354 0 0 0.772
upstage/solar-pro-preview-instruct Logistic Regression (l1) OpenAI 0 0 0 0.568
Stark2008/LayleleFlamPi Logistic Regression (l1) OpenAI 0 0 0 0.336
NousResearch/Yarn-Llama-2-13b-128k Logistic Regression (l1) OpenAI 0 0 0 0.257
OpenAI/GPT-4o-2024-05-13 Logistic Regression (l1) OpenAI 0.808 0.162 0 0.844
AbacusResearch/Jallabi-34B Logistic Regression (l1) OpenAI 0 0 0 0.505
Stark2008/GutenLaserPi Logistic Regression (l1) OpenAI 0 0 0 0.341
argilla-warehouse/Llama-3.1-8B-MagPie-Ultra Logistic Regression (l1) OpenAI 0 0 0 0.323
jaredjoss/pythia-410m-roberta-lr 8e7-kl 01-steps 12000-rlhf-model Logistic Regression (l1) OpenAI 0 0 0 0.119
xxx777xxxASD/L3.1-ClaudeMaid-4x8B Logistic Regression (l1) OpenAI 0 0 0 0.366
Orenguteng/Llama-3.1-8B-Lexi-Uncensored-V2 Logistic Regression (l1) Word2Vec 0 0 0 0.401
Orenguteng/Llama-3.1-8B-Lexi-Uncensored-V2 Logistic Regression (l1) FastText 0 0 0 0.39
Orenguteng/Llama-3.1-8B-Lexi-Uncensored Logistic Regression (l1) Word2Vec 0 0 0 0.384
Orenguteng/Llama-3.1-8B-Lexi-Uncensored Logistic Regression (l1) FastText 0 0 0 0.389
MaziyarPanahi/calme-2.1-qwen2.5-72b Logistic Regression (l1) Word2Vec 0.121 0 0 0.707
MaziyarPanahi/calme-2.1-qwen2.5-72b Logistic Regression (l1) FastText 0.02 0 0 0.692
awnr/Mistral-7B-v0.1-signtensors-1-over-4 Logistic Regression (l1) Word2Vec 0 0 0 0.244
awnr/Mistral-7B-v0.1-signtensors-1-over-4 Logistic Regression (l1) FastText 0 0 0 0.239
wave-on-discord/qwent-7b Logistic Regression (l1) Word2Vec 0 0 0 0.163
wave-on-discord/qwent-7b Logistic Regression (l1) FastText 0 0 0 0.167
Aryanne/SuperHeart Logistic Regression (l1) Word2Vec 0 0 0 0.406
Aryanne/SuperHeart Logistic Regression (l1) FastText 0 0 0 0.406
Intel/neural-chat-7b-v3-2 Logistic Regression (l1) Word2Vec 0 0 0 0.283
Intel/neural-chat-7b-v3-2 Logistic Regression (l1) FastText 0 0 0 0.285
BlackBeenie/llama-3.1-8B-Galore-openassistant-guanaco Logistic Regression (l1) Word2Vec 0 0 0 0.347
BlackBeenie/llama-3.1-8B-Galore-openassistant-guanaco Logistic Regression (l1) FastText 0 0 0 0.341
Intel/neural-chat-7b-v3-1 Logistic Regression (l1) Word2Vec 0 0 0 0.276
Intel/neural-chat-7b-v3-1 Logistic Regression (l1) FastText 0 0 0 0.269
ssmits/Qwen2.5-95B-Instruct Logistic Regression (l1) Word2Vec 0.03 0 0 0.652
ssmits/Qwen2.5-95B-Instruct Logistic Regression (l1) FastText 0 0 0 0.638
LenguajeNaturalAI/leniachat-qwen2-1.5B-v0 Logistic Regression (l1) Word2Vec 0 0 0 0.201
LenguajeNaturalAI/leniachat-qwen2-1.5B-v0 Logistic Regression (l1) FastText 0 0 0 0.206
migtissera/Tess-v2.5.2-Qwen2-72B Logistic Regression (l1) Word2Vec 0.091 0 0 0.689
migtissera/Tess-v2.5.2-Qwen2-72B Logistic Regression (l1) FastText 0 0 0 0.675
migtissera/Llama-3-70B-Synthia-v3.5 Logistic Regression (l1) Word2Vec 0.03 0 0 0.626
migtissera/Llama-3-70B-Synthia-v3.5 Logistic Regression (l1) FastText 0.01 0 0 0.618
migtissera/Trinity-2-Codestral-22B-v0.2 Logistic Regression (l1) Word2Vec 0 0 0 0.337
migtissera/Trinity-2-Codestral-22B-v0.2 Logistic Regression (l1) FastText 0 0 0 0.346
Stark2008/VisFlamCat Logistic Regression (l1) Word2Vec 0 0 0 0.345
Stark2008/VisFlamCat Logistic Regression (l1) FastText 0 0 0 0.335
migtissera/Trinity-2-Codestral-22B Logistic Regression (l1) Word2Vec 0 0 0 0.335
migtissera/Trinity-2-Codestral-22B Logistic Regression (l1) FastText 0 0 0 0.336
awnr/Mistral-7B-v0.1-signtensors-7-over-16 Logistic Regression (l1) Word2Vec 0 0 0 0.321
awnr/Mistral-7B-v0.1-signtensors-7-over-16 Logistic Regression (l1) FastText 0 0 0 0.326
ymcki/gemma-2-2b-ORPO-jpn-it-abliterated-18-merge Logistic Regression (l1) Word2Vec 0 0 0 0.248
ymcki/gemma-2-2b-ORPO-jpn-it-abliterated-18-merge Logistic Regression (l1) FastText 0 0 0 0.263
apple/DCLM-7B Logistic Regression (l1) Word2Vec 0 0 0 0.318
apple/DCLM-7B Logistic Regression (l1) FastText 0 0 0 0.317
Intel/neural-chat-7b-v3-3 Logistic Regression (l1) Word2Vec 0 0 0 0.277
Intel/neural-chat-7b-v3-3 Logistic Regression (l1) FastText 0 0 0 0.278
migtissera/Tess-3-Mistral-Nemo-12B Logistic Regression (l1) Word2Vec 0 0 0 0.279
migtissera/Tess-3-Mistral-Nemo-12B Logistic Regression (l1) FastText 0 0 0 0.291
migtissera/Llama-3-8B-Synthia-v3.5 Logistic Regression (l1) Word2Vec 0 0 0 0.316
migtissera/Llama-3-8B-Synthia-v3.5 Logistic Regression (l1) FastText 0 0 0 0.295
migtissera/Tess-v2.5-Phi-3-medium-128k-14B Logistic Regression (l1) Word2Vec 0 0 0 0.412
migtissera/Tess-v2.5-Phi-3-medium-128k-14B Logistic Regression (l1) FastText 0 0 0 0.395
awnr/Mistral-7B-v0.1-signtensors-5-over-16 Logistic Regression (l1) Word2Vec 0 0 0 0.308
awnr/Mistral-7B-v0.1-signtensors-5-over-16 Logistic Regression (l1) FastText 0 0 0 0.316
NousResearch/Hermes-3-Llama-3.1-70B Logistic Regression (l1) Word2Vec 0 0 0 0.5
NousResearch/Hermes-3-Llama-3.1-70B Logistic Regression (l1) FastText 0 0 0 0.498
OpenAI/GPT-4o-2024-08-06 Logistic Regression (l1) Word2Vec 0.657 0 0 0.805
OpenAI/GPT-4o-2024-08-06 Logistic Regression (l1) FastText 0.515 0 0 0.803
BlackBeenie/llama-3-luminous-merged Logistic Regression (l1) Word2Vec 0 0 0 0.387
BlackBeenie/llama-3-luminous-merged Logistic Regression (l1) FastText 0 0 0 0.391
LenguajeNaturalAI/leniachat-gemma-2b-v0 Logistic Regression (l1) Word2Vec 0 0 0 0.121
LenguajeNaturalAI/leniachat-gemma-2b-v0 Logistic Regression (l1) FastText 0 0 0 0.12
migtissera/Tess-3-7B-SFT Logistic Regression (l1) Word2Vec 0 0 0 0.295
migtissera/Tess-3-7B-SFT Logistic Regression (l1) FastText 0 0 0 0.308
awnr/Mistral-7B-v0.1-signtensors-3-over-8 Logistic Regression (l1) Word2Vec 0 0 0 0.311
awnr/Mistral-7B-v0.1-signtensors-3-over-8 Logistic Regression (l1) FastText 0 0 0 0.327
meta-llama/Meta-Llama-3.1-70B-Instruct Logistic Regression (l1) Word2Vec 0.131 0.03 0 0.703
meta-llama/Meta-Llama-3.1-70B-Instruct Logistic Regression (l1) FastText 0.061 0 0 0.695
OpenAI/GPT-4o-mini Logistic Regression (l1) Word2Vec 0.111 0.01 0.01 0.737
OpenAI/GPT-4o-mini Logistic Regression (l1) FastText 0.051 0 0 0.721
upstage/solar-pro-preview-instruct Logistic Regression (l1) Word2Vec 0 0 0 0.55
upstage/solar-pro-preview-instruct Logistic Regression (l1) FastText 0 0 0 0.524
Stark2008/LayleleFlamPi Logistic Regression (l1) Word2Vec 0 0 0 0.321
Stark2008/LayleleFlamPi Logistic Regression (l1) FastText 0 0 0 0.334
NousResearch/Yarn-Llama-2-13b-128k Logistic Regression (l1) Word2Vec 0 0 0 0.271
NousResearch/Yarn-Llama-2-13b-128k Logistic Regression (l1) FastText 0 0 0 0.26
OpenAI/GPT-4o-2024-05-13 Logistic Regression (l1) Word2Vec 0.717 0.03 0 0.811
OpenAI/GPT-4o-2024-05-13 Logistic Regression (l1) FastText 0.586 0.01 0 0.805
AbacusResearch/Jallabi-34B Logistic Regression (l1) Word2Vec 0 0 0 0.52
AbacusResearch/Jallabi-34B Logistic Regression (l1) FastText 0 0 0 0.52
Stark2008/GutenLaserPi Logistic Regression (l1) Word2Vec 0 0 0 0.34
Stark2008/GutenLaserPi Logistic Regression (l1) FastText 0 0 0 0.342
argilla-warehouse/Llama-3.1-8B-MagPie-Ultra Logistic Regression (l1) Word2Vec 0 0 0 0.33
argilla-warehouse/Llama-3.1-8B-MagPie-Ultra Logistic Regression (l1) FastText 0 0 0 0.326
jaredjoss/pythia-410m-roberta-lr 8e7-kl 01-steps 12000-rlhf-model Logistic Regression (l1) Word2Vec 0 0 0 0.116
jaredjoss/pythia-410m-roberta-lr 8e7-kl 01-steps 12000-rlhf-model Logistic Regression (l1) FastText 0 0 0 0.112
xxx777xxxASD/L3.1-ClaudeMaid-4x8B Logistic Regression (l1) Word2Vec 0 0 0 0.364
xxx777xxxASD/L3.1-ClaudeMaid-4x8B Logistic Regression (l1) FastText 0 0 0 0.361
Orenguteng/Llama-3.1-8B-Lexi-Uncensored-V2 Logistic Regression (l1) Ngrams-1 0 0 0 0.394
Orenguteng/Llama-3.1-8B-Lexi-Uncensored Logistic Regression (l1) Ngrams-1 0 0 0 0.397
MaziyarPanahi/calme-2.1-qwen2.5-72b Logistic Regression (l1) Ngrams-1 0.071 0.01 0 0.694
awnr/Mistral-7B-v0.1-signtensors-1-over-4 Logistic Regression (l1) Ngrams-1 0 0 0 0.257
wave-on-discord/qwent-7b Logistic Regression (l1) Ngrams-1 0 0 0 0.161
Aryanne/SuperHeart Logistic Regression (l1) Ngrams-1 0 0 0 0.4
Intel/neural-chat-7b-v3-2 Logistic Regression (l1) Ngrams-1 0 0 0 0.276
BlackBeenie/llama-3.1-8B-Galore-openassistant-guanaco Logistic Regression (l1) Ngrams-1 0 0 0 0.338
Intel/neural-chat-7b-v3-1 Logistic Regression (l1) Ngrams-1 0 0 0 0.285
ssmits/Qwen2.5-95B-Instruct Logistic Regression (l1) Ngrams-1 0.051 0 0 0.644
LenguajeNaturalAI/leniachat-qwen2-1.5B-v0 Logistic Regression (l1) Ngrams-1 0 0 0 0.199
migtissera/Tess-v2.5.2-Qwen2-72B Logistic Regression (l1) Ngrams-1 0 0 0 0.686
migtissera/Llama-3-70B-Synthia-v3.5 Logistic Regression (l1) Ngrams-1 0.03 0.01 0 0.627
migtissera/Trinity-2-Codestral-22B-v0.2 Logistic Regression (l1) Ngrams-1 0 0 0 0.341
Stark2008/VisFlamCat Logistic Regression (l1) Ngrams-1 0 0 0 0.337
migtissera/Trinity-2-Codestral-22B Logistic Regression (l1) Ngrams-1 0 0 0 0.331
awnr/Mistral-7B-v0.1-signtensors-7-over-16 Logistic Regression (l1) Ngrams-1 0 0 0 0.32
ymcki/gemma-2-2b-ORPO-jpn-it-abliterated-18-merge Logistic Regression (l1) Ngrams-1 0 0 0 0.241
apple/DCLM-7B Logistic Regression (l1) Ngrams-1 0 0 0 0.306
Intel/neural-chat-7b-v3-3 Logistic Regression (l1) Ngrams-1 0 0 0 0.271
migtissera/Tess-3-Mistral-Nemo-12B Logistic Regression (l1) Ngrams-1 0 0 0 0.281
migtissera/Llama-3-8B-Synthia-v3.5 Logistic Regression (l1) Ngrams-1 0 0 0 0.31
migtissera/Tess-v2.5-Phi-3-medium-128k-14B Logistic Regression (l1) Ngrams-1 0 0 0 0.4
awnr/Mistral-7B-v0.1-signtensors-5-over-16 Logistic Regression (l1) Ngrams-1 0 0 0 0.319
NousResearch/Hermes-3-Llama-3.1-70B Logistic Regression (l1) Ngrams-1 0 0 0 0.5
OpenAI/GPT-4o-2024-08-06 Logistic Regression (l1) Ngrams-1 0.677 0.071 0 0.829
BlackBeenie/llama-3-luminous-merged Logistic Regression (l1) Ngrams-1 0 0 0 0.391
LenguajeNaturalAI/leniachat-gemma-2b-v0 Logistic Regression (l1) Ngrams-1 0 0 0 0.119
migtissera/Tess-3-7B-SFT Logistic Regression (l1) Ngrams-1 0 0 0 0.298
awnr/Mistral-7B-v0.1-signtensors-3-over-8 Logistic Regression (l1) Ngrams-1 0 0 0 0.311
meta-llama/Meta-Llama-3.1-70B-Instruct Logistic Regression (l1) Ngrams-1 0.02 0 0 0.689
OpenAI/GPT-4o-mini Logistic Regression (l1) Ngrams-1 0.162 0 0 0.74
upstage/solar-pro-preview-instruct Logistic Regression (l1) Ngrams-1 0 0 0 0.558
Stark2008/LayleleFlamPi Logistic Regression (l1) Ngrams-1 0 0 0 0.321
NousResearch/Yarn-Llama-2-13b-128k Logistic Regression (l1) Ngrams-1 0 0 0 0.249
OpenAI/GPT-4o-2024-05-13 Logistic Regression (l1) Ngrams-1 0.717 0.02 0 0.817
AbacusResearch/Jallabi-34B Logistic Regression (l1) Ngrams-1 0 0 0 0.504
Stark2008/GutenLaserPi Logistic Regression (l1) Ngrams-1 0 0 0 0.335
argilla-warehouse/Llama-3.1-8B-MagPie-Ultra Logistic Regression (l1) Ngrams-1 0 0 0 0.319
jaredjoss/pythia-410m-roberta-lr 8e7-kl 01-steps 12000-rlhf-model Logistic Regression (l1) Ngrams-1 0 0 0 0.117
xxx777xxxASD/L3.1-ClaudeMaid-4x8B Logistic Regression (l1) Ngrams-1 0 0 0 0.371
Orenguteng/Llama-3.1-8B-Lexi-Uncensored-V2 Logistic Regression (l2) OpenAI 0 0 0 0.395
Orenguteng/Llama-3.1-8B-Lexi-Uncensored Logistic Regression (l2) OpenAI 0 0 0 0.39
MaziyarPanahi/calme-2.1-qwen2.5-72b Logistic Regression (l2) OpenAI 0.192 0.071 0 0.731
awnr/Mistral-7B-v0.1-signtensors-1-over-4 Logistic Regression (l2) OpenAI 0 0 0 0.245
wave-on-discord/qwent-7b Logistic Regression (l2) OpenAI 0 0 0 0.169
Aryanne/SuperHeart Logistic Regression (l2) OpenAI 0 0 0 0.427
Intel/neural-chat-7b-v3-2 Logistic Regression (l2) OpenAI 0 0 0 0.285
BlackBeenie/llama-3.1-8B-Galore-openassistant-guanaco Logistic Regression (l2) OpenAI 0 0 0 0.332
Intel/neural-chat-7b-v3-1 Logistic Regression (l2) OpenAI 0 0 0 0.294
ssmits/Qwen2.5-95B-Instruct Logistic Regression (l2) OpenAI 0.172 0.02 0 0.683
LenguajeNaturalAI/leniachat-qwen2-1.5B-v0 Logistic Regression (l2) OpenAI 0 0 0 0.206
migtissera/Tess-v2.5.2-Qwen2-72B Logistic Regression (l2) OpenAI 0.242 0.061 0.03 0.719
migtissera/Llama-3-70B-Synthia-v3.5 Logistic Regression (l2) OpenAI 0.152 0.051 0 0.66
migtissera/Trinity-2-Codestral-22B-v0.2 Logistic Regression (l2) OpenAI 0 0 0 0.34
Stark2008/VisFlamCat Logistic Regression (l2) OpenAI 0 0 0 0.339
migtissera/Trinity-2-Codestral-22B Logistic Regression (l2) OpenAI 0 0 0 0.34
awnr/Mistral-7B-v0.1-signtensors-7-over-16 Logistic Regression (l2) OpenAI 0 0 0 0.328
ymcki/gemma-2-2b-ORPO-jpn-it-abliterated-18-merge Logistic Regression (l2) OpenAI 0 0 0 0.242
apple/DCLM-7B Logistic Regression (l2) OpenAI 0 0 0 0.311
Intel/neural-chat-7b-v3-3 Logistic Regression (l2) OpenAI 0 0 0 0.291
migtissera/Tess-3-Mistral-Nemo-12B Logistic Regression (l2) OpenAI 0 0 0 0.301
migtissera/Llama-3-8B-Synthia-v3.5 Logistic Regression (l2) OpenAI 0 0 0 0.302
migtissera/Tess-v2.5-Phi-3-medium-128k-14B Logistic Regression (l2) OpenAI 0 0 0 0.395
awnr/Mistral-7B-v0.1-signtensors-5-over-16 Logistic Regression (l2) OpenAI 0 0 0 0.319
NousResearch/Hermes-3-Llama-3.1-70B Logistic Regression (l2) OpenAI 0 0 0 0.495
OpenAI/GPT-4o-2024-08-06 Logistic Regression (l2) OpenAI 0.808 0.222 0 0.85
BlackBeenie/llama-3-luminous-merged Logistic Regression (l2) OpenAI 0 0 0 0.391
LenguajeNaturalAI/leniachat-gemma-2b-v0 Logistic Regression (l2) OpenAI 0 0 0 0.118
migtissera/Tess-3-7B-SFT Logistic Regression (l2) OpenAI 0 0 0 0.318
awnr/Mistral-7B-v0.1-signtensors-3-over-8 Logistic Regression (l2) OpenAI 0 0 0 0.311
meta-llama/Meta-Llama-3.1-70B-Instruct Logistic Regression (l2) OpenAI 0.242 0.071 0.01 0.723
OpenAI/GPT-4o-mini Logistic Regression (l2) OpenAI 0.444 0 0 0.782
upstage/solar-pro-preview-instruct Logistic Regression (l2) OpenAI 0 0 0 0.565
Stark2008/LayleleFlamPi Logistic Regression (l2) OpenAI 0 0 0 0.332
NousResearch/Yarn-Llama-2-13b-128k Logistic Regression (l2) OpenAI 0 0 0 0.254
OpenAI/GPT-4o-2024-05-13 Logistic Regression (l2) OpenAI 0.848 0.192 0 0.85
AbacusResearch/Jallabi-34B Logistic Regression (l2) OpenAI 0 0 0 0.506
Stark2008/GutenLaserPi Logistic Regression (l2) OpenAI 0 0 0 0.338
argilla-warehouse/Llama-3.1-8B-MagPie-Ultra Logistic Regression (l2) OpenAI 0 0 0 0.322
jaredjoss/pythia-410m-roberta-lr 8e7-kl 01-steps 12000-rlhf-model Logistic Regression (l2) OpenAI 0 0 0 0.118
xxx777xxxASD/L3.1-ClaudeMaid-4x8B Logistic Regression (l2) OpenAI 0 0 0 0.373
Orenguteng/Llama-3.1-8B-Lexi-Uncensored-V2 Logistic Regression (l2) Word2Vec 0 0 0 0.395
Orenguteng/Llama-3.1-8B-Lexi-Uncensored-V2 Logistic Regression (l2) FastText 0 0 0 0.398
Orenguteng/Llama-3.1-8B-Lexi-Uncensored Logistic Regression (l2) Word2Vec 0 0 0 0.378
Orenguteng/Llama-3.1-8B-Lexi-Uncensored Logistic Regression (l2) FastText 0 0 0 0.395
MaziyarPanahi/calme-2.1-qwen2.5-72b Logistic Regression (l2) Word2Vec 0.101 0 0 0.706
MaziyarPanahi/calme-2.1-qwen2.5-72b Logistic Regression (l2) FastText 0.02 0.01 0.01 0.695
awnr/Mistral-7B-v0.1-signtensors-1-over-4 Logistic Regression (l2) Word2Vec 0 0 0 0.241
awnr/Mistral-7B-v0.1-signtensors-1-over-4 Logistic Regression (l2) FastText 0 0 0 0.251
wave-on-discord/qwent-7b Logistic Regression (l2) Word2Vec 0 0 0 0.163
wave-on-discord/qwent-7b Logistic Regression (l2) FastText 0 0 0 0.175
Aryanne/SuperHeart Logistic Regression (l2) Word2Vec 0 0 0 0.405
Aryanne/SuperHeart Logistic Regression (l2) FastText 0 0 0 0.4
Intel/neural-chat-7b-v3-2 Logistic Regression (l2) Word2Vec 0 0 0 0.274
Intel/neural-chat-7b-v3-2 Logistic Regression (l2) FastText 0 0 0 0.287
BlackBeenie/llama-3.1-8B-Galore-openassistant-guanaco Logistic Regression (l2) Word2Vec 0 0 0 0.344
BlackBeenie/llama-3.1-8B-Galore-openassistant-guanaco Logistic Regression (l2) FastText 0 0 0 0.338
Intel/neural-chat-7b-v3-1 Logistic Regression (l2) Word2Vec 0 0 0 0.276
Intel/neural-chat-7b-v3-1 Logistic Regression (l2) FastText 0 0 0 0.276
ssmits/Qwen2.5-95B-Instruct Logistic Regression (l2) Word2Vec 0.03 0 0 0.654
ssmits/Qwen2.5-95B-Instruct Logistic Regression (l2) FastText 0.03 0 0 0.644
LenguajeNaturalAI/leniachat-qwen2-1.5B-v0 Logistic Regression (l2) Word2Vec 0 0 0 0.202
LenguajeNaturalAI/leniachat-qwen2-1.5B-v0 Logistic Regression (l2) FastText 0 0 0 0.208
migtissera/Tess-v2.5.2-Qwen2-72B Logistic Regression (l2) Word2Vec 0.101 0.01 0.01 0.691
migtissera/Tess-v2.5.2-Qwen2-72B Logistic Regression (l2) FastText 0 0 0 0.679
migtissera/Llama-3-70B-Synthia-v3.5 Logistic Regression (l2) Word2Vec 0.04 0 0 0.627
migtissera/Llama-3-70B-Synthia-v3.5 Logistic Regression (l2) FastText 0.02 0 0 0.621
migtissera/Trinity-2-Codestral-22B-v0.2 Logistic Regression (l2) Word2Vec 0 0 0 0.34
migtissera/Trinity-2-Codestral-22B-v0.2 Logistic Regression (l2) FastText 0 0 0 0.333
Stark2008/VisFlamCat Logistic Regression (l2) Word2Vec 0 0 0 0.338
Stark2008/VisFlamCat Logistic Regression (l2) FastText 0 0 0 0.339
migtissera/Trinity-2-Codestral-22B Logistic Regression (l2) Word2Vec 0 0 0 0.339
migtissera/Trinity-2-Codestral-22B Logistic Regression (l2) FastText 0 0 0 0.331
awnr/Mistral-7B-v0.1-signtensors-7-over-16 Logistic Regression (l2) Word2Vec 0 0 0 0.321
awnr/Mistral-7B-v0.1-signtensors-7-over-16 Logistic Regression (l2) FastText 0 0 0 0.319
ymcki/gemma-2-2b-ORPO-jpn-it-abliterated-18-merge Logistic Regression (l2) Word2Vec 0 0 0 0.25
ymcki/gemma-2-2b-ORPO-jpn-it-abliterated-18-merge Logistic Regression (l2) FastText 0 0 0 0.256
apple/DCLM-7B Logistic Regression (l2) Word2Vec 0 0 0 0.316
apple/DCLM-7B Logistic Regression (l2) FastText 0 0 0 0.316
Intel/neural-chat-7b-v3-3 Logistic Regression (l2) Word2Vec 0 0 0 0.265
Intel/neural-chat-7b-v3-3 Logistic Regression (l2) FastText 0 0 0 0.275
migtissera/Tess-3-Mistral-Nemo-12B Logistic Regression (l2) Word2Vec 0 0 0 0.284
migtissera/Tess-3-Mistral-Nemo-12B Logistic Regression (l2) FastText 0 0 0 0.297
migtissera/Llama-3-8B-Synthia-v3.5 Logistic Regression (l2) Word2Vec 0 0 0 0.31
migtissera/Llama-3-8B-Synthia-v3.5 Logistic Regression (l2) FastText 0 0 0 0.292
migtissera/Tess-v2.5-Phi-3-medium-128k-14B Logistic Regression (l2) Word2Vec 0 0 0 0.42
migtissera/Tess-v2.5-Phi-3-medium-128k-14B Logistic Regression (l2) FastText 0 0 0 0.409
awnr/Mistral-7B-v0.1-signtensors-5-over-16 Logistic Regression (l2) Word2Vec 0 0 0 0.31
awnr/Mistral-7B-v0.1-signtensors-5-over-16 Logistic Regression (l2) FastText 0 0 0 0.31
NousResearch/Hermes-3-Llama-3.1-70B Logistic Regression (l2) Word2Vec 0 0 0 0.497
NousResearch/Hermes-3-Llama-3.1-70B Logistic Regression (l2) FastText 0 0 0 0.499
OpenAI/GPT-4o-2024-08-06 Logistic Regression (l2) Word2Vec 0.667 0 0 0.804
OpenAI/GPT-4o-2024-08-06 Logistic Regression (l2) FastText 0.636 0 0 0.806
BlackBeenie/llama-3-luminous-merged Logistic Regression (l2) Word2Vec 0 0 0 0.385
BlackBeenie/llama-3-luminous-merged Logistic Regression (l2) FastText 0 0 0 0.388
LenguajeNaturalAI/leniachat-gemma-2b-v0 Logistic Regression (l2) Word2Vec 0 0 0 0.124
LenguajeNaturalAI/leniachat-gemma-2b-v0 Logistic Regression (l2) FastText 0 0 0 0.129
migtissera/Tess-3-7B-SFT Logistic Regression (l2) Word2Vec 0 0 0 0.299
migtissera/Tess-3-7B-SFT Logistic Regression (l2) FastText 0 0 0 0.297
awnr/Mistral-7B-v0.1-signtensors-3-over-8 Logistic Regression (l2) Word2Vec 0 0 0 0.312
awnr/Mistral-7B-v0.1-signtensors-3-over-8 Logistic Regression (l2) FastText 0 0 0 0.317
meta-llama/Meta-Llama-3.1-70B-Instruct Logistic Regression (l2) Word2Vec 0.131 0 0 0.701
meta-llama/Meta-Llama-3.1-70B-Instruct Logistic Regression (l2) FastText 0.03 0.01 0.01 0.699
OpenAI/GPT-4o-mini Logistic Regression (l2) Word2Vec 0.101 0.01 0 0.735
OpenAI/GPT-4o-mini Logistic Regression (l2) FastText 0 0 0 0.72
upstage/solar-pro-preview-instruct Logistic Regression (l2) Word2Vec 0 0 0 0.548
upstage/solar-pro-preview-instruct Logistic Regression (l2) FastText 0 0 0 0.535
Stark2008/LayleleFlamPi Logistic Regression (l2) Word2Vec 0 0 0 0.321
Stark2008/LayleleFlamPi Logistic Regression (l2) FastText 0 0 0 0.33
NousResearch/Yarn-Llama-2-13b-128k Logistic Regression (l2) Word2Vec 0 0 0 0.262
NousResearch/Yarn-Llama-2-13b-128k Logistic Regression (l2) FastText 0 0 0 0.263
OpenAI/GPT-4o-2024-05-13 Logistic Regression (l2) Word2Vec 0.727 0 0 0.811
OpenAI/GPT-4o-2024-05-13 Logistic Regression (l2) FastText 0.727 0.01 0.01 0.809
AbacusResearch/Jallabi-34B Logistic Regression (l2) Word2Vec 0 0 0 0.529
AbacusResearch/Jallabi-34B Logistic Regression (l2) FastText 0 0 0 0.519
Stark2008/GutenLaserPi Logistic Regression (l2) Word2Vec 0 0 0 0.333
Stark2008/GutenLaserPi Logistic Regression (l2) FastText 0 0 0 0.347
argilla-warehouse/Llama-3.1-8B-MagPie-Ultra Logistic Regression (l2) Word2Vec 0 0 0 0.327
argilla-warehouse/Llama-3.1-8B-MagPie-Ultra Logistic Regression (l2) FastText 0 0 0 0.32
jaredjoss/pythia-410m-roberta-lr 8e7-kl 01-steps 12000-rlhf-model Logistic Regression (l2) Word2Vec 0 0 0 0.108
jaredjoss/pythia-410m-roberta-lr 8e7-kl 01-steps 12000-rlhf-model Logistic Regression (l2) FastText 0 0 0 0.113
xxx777xxxASD/L3.1-ClaudeMaid-4x8B Logistic Regression (l2) Word2Vec 0 0 0 0.366
xxx777xxxASD/L3.1-ClaudeMaid-4x8B Logistic Regression (l2) FastText 0 0 0 0.371
Orenguteng/Llama-3.1-8B-Lexi-Uncensored-V2 Logistic Regression (l2) Ngrams-1 0 0 0 0.392
Orenguteng/Llama-3.1-8B-Lexi-Uncensored Logistic Regression (l2) Ngrams-1 0 0 0 0.391
MaziyarPanahi/calme-2.1-qwen2.5-72b Logistic Regression (l2) Ngrams-1 0.131 0.01 0 0.704
awnr/Mistral-7B-v0.1-signtensors-1-over-4 Logistic Regression (l2) Ngrams-1 0 0 0 0.245
wave-on-discord/qwent-7b Logistic Regression (l2) Ngrams-1 0 0 0 0.171
Aryanne/SuperHeart Logistic Regression (l2) Ngrams-1 0 0 0 0.4
Intel/neural-chat-7b-v3-2 Logistic Regression (l2) Ngrams-1 0 0 0 0.263
BlackBeenie/llama-3.1-8B-Galore-openassistant-guanaco Logistic Regression (l2) Ngrams-1 0 0 0 0.329
Intel/neural-chat-7b-v3-1 Logistic Regression (l2) Ngrams-1 0 0 0 0.271
ssmits/Qwen2.5-95B-Instruct Logistic Regression (l2) Ngrams-1 0.091 0 0 0.66
LenguajeNaturalAI/leniachat-qwen2-1.5B-v0 Logistic Regression (l2) Ngrams-1 0 0 0 0.204
migtissera/Tess-v2.5.2-Qwen2-72B Logistic Regression (l2) Ngrams-1 0.172 0 0 0.694
migtissera/Llama-3-70B-Synthia-v3.5 Logistic Regression (l2) Ngrams-1 0.111 0 0 0.638
migtissera/Trinity-2-Codestral-22B-v0.2 Logistic Regression (l2) Ngrams-1 0 0 0 0.332
Stark2008/VisFlamCat Logistic Regression (l2) Ngrams-1 0 0 0 0.329
migtissera/Trinity-2-Codestral-22B Logistic Regression (l2) Ngrams-1 0 0 0 0.327
awnr/Mistral-7B-v0.1-signtensors-7-over-16 Logistic Regression (l2) Ngrams-1 0 0 0 0.314
ymcki/gemma-2-2b-ORPO-jpn-it-abliterated-18-merge Logistic Regression (l2) Ngrams-1 0 0 0 0.237
apple/DCLM-7B Logistic Regression (l2) Ngrams-1 0 0 0 0.308
Intel/neural-chat-7b-v3-3 Logistic Regression (l2) Ngrams-1 0 0 0 0.266
migtissera/Tess-3-Mistral-Nemo-12B Logistic Regression (l2) Ngrams-1 0 0 0 0.287
migtissera/Llama-3-8B-Synthia-v3.5 Logistic Regression (l2) Ngrams-1 0 0 0 0.292
migtissera/Tess-v2.5-Phi-3-medium-128k-14B Logistic Regression (l2) Ngrams-1 0 0 0 0.4
awnr/Mistral-7B-v0.1-signtensors-5-over-16 Logistic Regression (l2) Ngrams-1 0 0 0 0.304
NousResearch/Hermes-3-Llama-3.1-70B Logistic Regression (l2) Ngrams-1 0 0 0 0.506
OpenAI/GPT-4o-2024-08-06 Logistic Regression (l2) Ngrams-1 0.707 0.081 0 0.83
BlackBeenie/llama-3-luminous-merged Logistic Regression (l2) Ngrams-1 0 0 0 0.372
LenguajeNaturalAI/leniachat-gemma-2b-v0 Logistic Regression (l2) Ngrams-1 0 0 0 0.119
migtissera/Tess-3-7B-SFT Logistic Regression (l2) Ngrams-1 0 0 0 0.296
awnr/Mistral-7B-v0.1-signtensors-3-over-8 Logistic Regression (l2) Ngrams-1 0 0 0 0.302
meta-llama/Meta-Llama-3.1-70B-Instruct Logistic Regression (l2) Ngrams-1 0.03 0 0 0.692
OpenAI/GPT-4o-mini Logistic Regression (l2) Ngrams-1 0.182 0 0 0.747
upstage/solar-pro-preview-instruct Logistic Regression (l2) Ngrams-1 0 0 0 0.547
Stark2008/LayleleFlamPi Logistic Regression (l2) Ngrams-1 0 0 0 0.32
NousResearch/Yarn-Llama-2-13b-128k Logistic Regression (l2) Ngrams-1 0 0 0 0.25
OpenAI/GPT-4o-2024-05-13 Logistic Regression (l2) Ngrams-1 0.677 0.01 0.01 0.822
AbacusResearch/Jallabi-34B Logistic Regression (l2) Ngrams-1 0 0 0 0.51
Stark2008/GutenLaserPi Logistic Regression (l2) Ngrams-1 0 0 0 0.333
argilla-warehouse/Llama-3.1-8B-MagPie-Ultra Logistic Regression (l2) Ngrams-1 0 0 0 0.313
jaredjoss/pythia-410m-roberta-lr 8e7-kl 01-steps 12000-rlhf-model Logistic Regression (l2) Ngrams-1 0 0 0 0.11
xxx777xxxASD/L3.1-ClaudeMaid-4x8B Logistic Regression (l2) Ngrams-1 0 0 0 0.356
Orenguteng/Llama-3.1-8B-Lexi-Uncensored-V2 XGBoost OpenAI 0 0 0 0.388
Orenguteng/Llama-3.1-8B-Lexi-Uncensored XGBoost OpenAI 0 0 0 0.386
MaziyarPanahi/calme-2.1-qwen2.5-72b XGBoost OpenAI 0.222 0 0 0.714
awnr/Mistral-7B-v0.1-signtensors-1-over-4 XGBoost OpenAI 0 0 0 0.245
wave-on-discord/qwent-7b XGBoost OpenAI 0 0 0 0.152
Aryanne/SuperHeart XGBoost OpenAI 0 0 0 0.413
Intel/neural-chat-7b-v3-2 XGBoost OpenAI 0 0 0 0.274
BlackBeenie/llama-3.1-8B-Galore-openassistant-guanaco XGBoost OpenAI 0 0 0 0.338
Intel/neural-chat-7b-v3-1 XGBoost OpenAI 0 0 0 0.277
ssmits/Qwen2.5-95B-Instruct XGBoost OpenAI 0.091 0.02 0 0.661
LenguajeNaturalAI/leniachat-qwen2-1.5B-v0 XGBoost OpenAI 0 0 0 0.2
migtissera/Tess-v2.5.2-Qwen2-72B XGBoost OpenAI 0.212 0.02 0 0.708
migtissera/Llama-3-70B-Synthia-v3.5 XGBoost OpenAI 0.152 0.01 0 0.642
migtissera/Trinity-2-Codestral-22B-v0.2 XGBoost OpenAI 0 0 0 0.331
Stark2008/VisFlamCat XGBoost OpenAI 0 0 0 0.33
migtissera/Trinity-2-Codestral-22B XGBoost OpenAI 0 0 0 0.32
awnr/Mistral-7B-v0.1-signtensors-7-over-16 XGBoost OpenAI 0 0 0 0.295
ymcki/gemma-2-2b-ORPO-jpn-it-abliterated-18-merge XGBoost OpenAI 0 0 0 0.247
apple/DCLM-7B XGBoost OpenAI 0 0 0 0.311
Intel/neural-chat-7b-v3-3 XGBoost OpenAI 0 0 0 0.287
migtissera/Tess-3-Mistral-Nemo-12B XGBoost OpenAI 0 0 0 0.287
migtissera/Llama-3-8B-Synthia-v3.5 XGBoost OpenAI 0 0 0 0.302
migtissera/Tess-v2.5-Phi-3-medium-128k-14B XGBoost OpenAI 0 0 0 0.4
awnr/Mistral-7B-v0.1-signtensors-5-over-16 XGBoost OpenAI 0 0 0 0.32
NousResearch/Hermes-3-Llama-3.1-70B XGBoost OpenAI 0 0 0 0.483
OpenAI/GPT-4o-2024-08-06 XGBoost OpenAI 0.788 0.051 0.01 0.838
BlackBeenie/llama-3-luminous-merged XGBoost OpenAI 0 0 0 0.378
LenguajeNaturalAI/leniachat-gemma-2b-v0 XGBoost OpenAI 0 0 0 0.121
migtissera/Tess-3-7B-SFT XGBoost OpenAI 0 0 0 0.311
awnr/Mistral-7B-v0.1-signtensors-3-over-8 XGBoost OpenAI 0 0 0 0.299
meta-llama/Meta-Llama-3.1-70B-Instruct XGBoost OpenAI 0.222 0.061 0.01 0.709
OpenAI/GPT-4o-mini XGBoost OpenAI 0.303 0 0 0.767
upstage/solar-pro-preview-instruct XGBoost OpenAI 0 0 0 0.543
Stark2008/LayleleFlamPi XGBoost OpenAI 0 0 0 0.32
NousResearch/Yarn-Llama-2-13b-128k XGBoost OpenAI 0 0 0 0.252
OpenAI/GPT-4o-2024-05-13 XGBoost OpenAI 0.747 0.121 0.02 0.836
AbacusResearch/Jallabi-34B XGBoost OpenAI 0 0 0 0.497
Stark2008/GutenLaserPi XGBoost OpenAI 0 0 0 0.334
argilla-warehouse/Llama-3.1-8B-MagPie-Ultra XGBoost OpenAI 0 0 0 0.312
jaredjoss/pythia-410m-roberta-lr_8e7-kl_01-steps_12000-rlhf-model XGBoost OpenAI 0 0 0 0.115
xxx777xxxASD/L3.1-ClaudeMaid-4x8B XGBoost OpenAI 0 0 0 0.353
Orenguteng/Llama-3.1-8B-Lexi-Uncensored-V2 XGBoost Word2Vec 0 0 0 0.372
Orenguteng/Llama-3.1-8B-Lexi-Uncensored-V2 XGBoost FastText 0 0 0 0.38
Orenguteng/Llama-3.1-8B-Lexi-Uncensored XGBoost Word2Vec 0 0 0 0.36
Orenguteng/Llama-3.1-8B-Lexi-Uncensored XGBoost FastText 0 0 0 0.367
MaziyarPanahi/calme-2.1-qwen2.5-72b XGBoost Word2Vec 0 0 0 0.692
MaziyarPanahi/calme-2.1-qwen2.5-72b XGBoost FastText 0.141 0.02 0 0.699
awnr/Mistral-7B-v0.1-signtensors-1-over-4 XGBoost Word2Vec 0 0 0 0.231
awnr/Mistral-7B-v0.1-signtensors-1-over-4 XGBoost FastText 0 0 0 0.232
wave-on-discord/qwent-7b XGBoost Word2Vec 0 0 0 0.154
wave-on-discord/qwent-7b XGBoost FastText 0 0 0 0.158
Aryanne/SuperHeart XGBoost Word2Vec 0 0 0 0.395
Aryanne/SuperHeart XGBoost FastText 0 0 0 0.393
Intel/neural-chat-7b-v3-2 XGBoost Word2Vec 0 0 0 0.279
Intel/neural-chat-7b-v3-2 XGBoost FastText 0 0 0 0.271
BlackBeenie/llama-3.1-8B-Galore-openassistant-guanaco XGBoost Word2Vec 0 0 0 0.351
BlackBeenie/llama-3.1-8B-Galore-openassistant-guanaco XGBoost FastText 0 0 0 0.338
Intel/neural-chat-7b-v3-1 XGBoost Word2Vec 0 0 0 0.258
Intel/neural-chat-7b-v3-1 XGBoost FastText 0 0 0 0.263
ssmits/Qwen2.5-95B-Instruct XGBoost Word2Vec 0.081 0 0 0.642
ssmits/Qwen2.5-95B-Instruct XGBoost FastText 0.02 0 0 0.639
LenguajeNaturalAI/leniachat-qwen2-1.5B-v0 XGBoost Word2Vec 0 0 0 0.208
LenguajeNaturalAI/leniachat-qwen2-1.5B-v0 XGBoost FastText 0 0 0 0.201
migtissera/Tess-v2.5.2-Qwen2-72B XGBoost Word2Vec 0.051 0.01 0 0.679
migtissera/Tess-v2.5.2-Qwen2-72B XGBoost FastText 0.03 0.01 0 0.682
migtissera/Llama-3-70B-Synthia-v3.5 XGBoost Word2Vec 0.03 0.01 0 0.618
migtissera/Llama-3-70B-Synthia-v3.5 XGBoost FastText 0.04 0 0 0.612
migtissera/Trinity-2-Codestral-22B-v0.2 XGBoost Word2Vec 0 0 0 0.321
migtissera/Trinity-2-Codestral-22B-v0.2 XGBoost FastText 0 0 0 0.335
Stark2008/VisFlamCat XGBoost Word2Vec 0 0 0 0.319
Stark2008/VisFlamCat XGBoost FastText 0 0 0 0.326
migtissera/Trinity-2-Codestral-22B XGBoost Word2Vec 0 0 0 0.339
migtissera/Trinity-2-Codestral-22B XGBoost FastText 0 0 0 0.339
awnr/Mistral-7B-v0.1-signtensors-7-over-16 XGBoost Word2Vec 0 0 0 0.3
awnr/Mistral-7B-v0.1-signtensors-7-over-16 XGBoost FastText 0 0 0 0.301
ymcki/gemma-2-2b-ORPO-jpn-it-abliterated-18-merge XGBoost Word2Vec 0 0 0 0.234
ymcki/gemma-2-2b-ORPO-jpn-it-abliterated-18-merge XGBoost FastText 0 0 0 0.243
apple/DCLM-7B XGBoost Word2Vec 0 0 0 0.315
apple/DCLM-7B XGBoost FastText 0 0 0 0.318
Intel/neural-chat-7b-v3-3 XGBoost Word2Vec 0 0 0 0.278
Intel/neural-chat-7b-v3-3 XGBoost FastText 0 0 0 0.275
migtissera/Tess-3-Mistral-Nemo-12B XGBoost Word2Vec 0 0 0 0.282
migtissera/Tess-3-Mistral-Nemo-12B XGBoost FastText 0 0 0 0.273
migtissera/Llama-3-8B-Synthia-v3.5 XGBoost Word2Vec 0 0 0 0.315
migtissera/Llama-3-8B-Synthia-v3.5 XGBoost FastText 0 0 0 0.3
migtissera/Tess-v2.5-Phi-3-medium-128k-14B XGBoost Word2Vec 0 0 0 0.406
migtissera/Tess-v2.5-Phi-3-medium-128k-14B XGBoost FastText 0 0 0 0.392
awnr/Mistral-7B-v0.1-signtensors-5-over-16 XGBoost Word2Vec 0 0 0 0.292
awnr/Mistral-7B-v0.1-signtensors-5-over-16 XGBoost FastText 0 0 0 0.308
NousResearch/Hermes-3-Llama-3.1-70B XGBoost Word2Vec 0 0 0 0.49
NousResearch/Hermes-3-Llama-3.1-70B XGBoost FastText 0 0 0 0.49
OpenAI/GPT-4o-2024-08-06 XGBoost Word2Vec 0.657 0.01 0.01 0.81
OpenAI/GPT-4o-2024-08-06 XGBoost FastText 0.677 0.01 0.01 0.819
BlackBeenie/llama-3-luminous-merged XGBoost Word2Vec 0 0 0 0.37
BlackBeenie/llama-3-luminous-merged XGBoost FastText 0 0 0 0.379
LenguajeNaturalAI/leniachat-gemma-2b-v0 XGBoost Word2Vec 0 0 0 0.126
LenguajeNaturalAI/leniachat-gemma-2b-v0 XGBoost FastText 0 0 0 0.121
migtissera/Tess-3-7B-SFT XGBoost Word2Vec 0 0 0 0.306
migtissera/Tess-3-7B-SFT XGBoost FastText 0 0 0 0.308
awnr/Mistral-7B-v0.1-signtensors-3-over-8 XGBoost Word2Vec 0 0 0 0.291
awnr/Mistral-7B-v0.1-signtensors-3-over-8 XGBoost FastText 0 0 0 0.303
meta-llama/Meta-Llama-3.1-70B-Instruct XGBoost Word2Vec 0.182 0.071 0 0.699
meta-llama/Meta-Llama-3.1-70B-Instruct XGBoost FastText 0.091 0 0 0.69
OpenAI/GPT-4o-mini XGBoost Word2Vec 0.03 0 0 0.722
OpenAI/GPT-4o-mini XGBoost FastText 0.101 0.01 0 0.74
upstage/solar-pro-preview-instruct XGBoost Word2Vec 0 0 0 0.543
upstage/solar-pro-preview-instruct XGBoost FastText 0 0 0 0.534
Stark2008/LayleleFlamPi XGBoost Word2Vec 0 0 0 0.318
Stark2008/LayleleFlamPi XGBoost FastText 0 0 0 0.314
NousResearch/Yarn-Llama-2-13b-128k XGBoost Word2Vec 0 0 0 0.237
NousResearch/Yarn-Llama-2-13b-128k XGBoost FastText 0 0 0 0.248
OpenAI/GPT-4o-2024-05-13 XGBoost Word2Vec 0.626 0.02 0 0.811
OpenAI/GPT-4o-2024-05-13 XGBoost FastText 0.646 0.01 0 0.81
AbacusResearch/Jallabi-34B XGBoost Word2Vec 0 0 0 0.505
AbacusResearch/Jallabi-34B XGBoost FastText 0 0 0 0.5
Stark2008/GutenLaserPi XGBoost Word2Vec 0 0 0 0.335
Stark2008/GutenLaserPi XGBoost FastText 0 0 0 0.32
argilla-warehouse/Llama-3.1-8B-MagPie-Ultra XGBoost Word2Vec 0 0 0 0.335
argilla-warehouse/Llama-3.1-8B-MagPie-Ultra XGBoost FastText 0 0 0 0.327
jaredjoss/pythia-410m-roberta-lr_8e7-kl_01-steps_12000-rlhf-model XGBoost Word2Vec 0 0 0 0.12
jaredjoss/pythia-410m-roberta-lr_8e7-kl_01-steps_12000-rlhf-model XGBoost FastText 0 0 0 0.13
xxx777xxxASD/L3.1-ClaudeMaid-4x8B XGBoost Word2Vec 0 0 0 0.358
xxx777xxxASD/L3.1-ClaudeMaid-4x8B XGBoost FastText 0 0 0 0.35
Orenguteng/Llama-3.1-8B-Lexi-Uncensored-V2 XGBoost Ngrams-1 0 0 0 0.378
Orenguteng/Llama-3.1-8B-Lexi-Uncensored XGBoost Ngrams-1 0 0 0 0.391
MaziyarPanahi/calme-2.1-qwen2.5-72b XGBoost Ngrams-1 0.091 0.02 0 0.696
awnr/Mistral-7B-v0.1-signtensors-1-over-4 XGBoost Ngrams-1 0 0 0 0.234
wave-on-discord/qwent-7b XGBoost Ngrams-1 0 0 0 0.162
Aryanne/SuperHeart XGBoost Ngrams-1 0 0 0 0.39
Intel/neural-chat-7b-v3-2 XGBoost Ngrams-1 0 0 0 0.253
BlackBeenie/llama-3.1-8B-Galore-openassistant-guanaco XGBoost Ngrams-1 0 0 0 0.307
Intel/neural-chat-7b-v3-1 XGBoost Ngrams-1 0 0 0 0.271
ssmits/Qwen2.5-95B-Instruct XGBoost Ngrams-1 0.101 0.01 0 0.651
LenguajeNaturalAI/leniachat-qwen2-1.5B-v0 XGBoost Ngrams-1 0 0 0 0.193
migtissera/Tess-v2.5.2-Qwen2-72B XGBoost Ngrams-1 0.121 0.03 0 0.691
migtissera/Llama-3-70B-Synthia-v3.5 XGBoost Ngrams-1 0.121 0 0 0.625
migtissera/Trinity-2-Codestral-22B-v0.2 XGBoost Ngrams-1 0 0 0 0.317
Stark2008/VisFlamCat XGBoost Ngrams-1 0 0 0 0.33
migtissera/Trinity-2-Codestral-22B XGBoost Ngrams-1 0 0 0 0.331
awnr/Mistral-7B-v0.1-signtensors-7-over-16 XGBoost Ngrams-1 0 0 0 0.313
ymcki/gemma-2-2b-ORPO-jpn-it-abliterated-18-merge XGBoost Ngrams-1 0 0 0 0.234
apple/DCLM-7B XGBoost Ngrams-1 0 0 0 0.32
Intel/neural-chat-7b-v3-3 XGBoost Ngrams-1 0 0 0 0.252
migtissera/Tess-3-Mistral-Nemo-12B XGBoost Ngrams-1 0 0 0 0.281
migtissera/Llama-3-8B-Synthia-v3.5 XGBoost Ngrams-1 0 0 0 0.314
migtissera/Tess-v2.5-Phi-3-medium-128k-14B XGBoost Ngrams-1 0 0 0 0.391
awnr/Mistral-7B-v0.1-signtensors-5-over-16 XGBoost Ngrams-1 0 0 0 0.314
NousResearch/Hermes-3-Llama-3.1-70B XGBoost Ngrams-1 0 0 0 0.492
OpenAI/GPT-4o-2024-08-06 XGBoost Ngrams-1 0.525 0 0 0.809
BlackBeenie/llama-3-luminous-merged XGBoost Ngrams-1 0 0 0 0.374
LenguajeNaturalAI/leniachat-gemma-2b-v0 XGBoost Ngrams-1 0 0 0 0.113
migtissera/Tess-3-7B-SFT XGBoost Ngrams-1 0 0 0 0.291
awnr/Mistral-7B-v0.1-signtensors-3-over-8 XGBoost Ngrams-1 0 0 0 0.292
meta-llama/Meta-Llama-3.1-70B-Instruct XGBoost Ngrams-1 0.152 0 0 0.69
OpenAI/GPT-4o-mini XGBoost Ngrams-1 0.04 0 0 0.72
upstage/solar-pro-preview-instruct XGBoost Ngrams-1 0 0 0 0.563
Stark2008/LayleleFlamPi XGBoost Ngrams-1 0 0 0 0.313
NousResearch/Yarn-Llama-2-13b-128k XGBoost Ngrams-1 0 0 0 0.257
OpenAI/GPT-4o-2024-05-13 XGBoost Ngrams-1 0.556 0 0 0.805
AbacusResearch/Jallabi-34B XGBoost Ngrams-1 0 0 0 0.495
Stark2008/GutenLaserPi XGBoost Ngrams-1 0 0 0 0.332
argilla-warehouse/Llama-3.1-8B-MagPie-Ultra XGBoost Ngrams-1 0 0 0 0.318
jaredjoss/pythia-410m-roberta-lr_8e7-kl_01-steps_12000-rlhf-model XGBoost Ngrams-1 0 0 0 0.118
xxx777xxxASD/L3.1-ClaudeMaid-4x8B XGBoost Ngrams-1 0 0 0 0.34

Out-of-distribution results

The table below shows the results of the considered LLM-assessor pairs for three minimum accuracy thresholds (0.8, 0.9, 0.95). The values are lower than in the in-distribution case, which highlights the difficulty of predicting performance out of distribution (OOD). Additionally, the top pairs differ from the in-distribution ones, suggesting that the best in-distribution pairs may not be the ones that generalise best.
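As a reading aid for the table columns, the following is a minimal sketch of how a PVR at a minimum accuracy threshold and an area under the ARC could be computed from an assessor's confidence scores and the LLM's per-instance correctness. It rests on assumptions that are not stated in this section: PVR is read as the largest fraction of instances that can be kept (not rejected) while the accuracy on the kept instances stays at or above the threshold, and ARC is read as the accuracy-rejection curve, with its area approximated by a plain average over coverage levels. The function names and the toy data are hypothetical, and this is not the PredictaBoard implementation.

```python
import numpy as np


def accuracy_rejection_curve(assessor_scores, llm_correct):
    """Accuracy on the k most confident instances, for k = 1..n.

    assessor_scores: predicted probability that the LLM answers correctly.
    llm_correct: 1 if the LLM was actually correct on the instance, else 0.
    """
    scores = np.asarray(assessor_scores, dtype=float)
    correct = np.asarray(llm_correct, dtype=float)
    # Keep the most confident instances first (stable sort for reproducibility).
    order = np.argsort(-scores, kind="stable")
    correct = correct[order]
    kept = np.arange(1, correct.size + 1)
    coverage = kept / correct.size         # fraction of instances not rejected
    accuracy = np.cumsum(correct) / kept   # accuracy on the kept instances
    return coverage, accuracy


def pvr(assessor_scores, llm_correct, min_accuracy):
    """Largest coverage whose accuracy meets min_accuracy (0 if none does)."""
    coverage, accuracy = accuracy_rejection_curve(assessor_scores, llm_correct)
    ok = accuracy >= min_accuracy
    return float(coverage[ok].max()) if ok.any() else 0.0


def area_under_arc(assessor_scores, llm_correct):
    """Assumed approximation: mean accuracy over all coverage levels."""
    _, accuracy = accuracy_rejection_curve(assessor_scores, llm_correct)
    return float(accuracy.mean())


# Toy data, invented for illustration only.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
toy_correct = [1, 1, 0, 1, 0]
for threshold in (0.8, 0.9, 0.95):
    print(threshold, pvr(toy_scores, toy_correct, threshold))
print("area under ARC:", area_under_arc(toy_scores, toy_correct))
```

Under this reading, a PVR of 0 at a given threshold means that no rejection level on the evaluated curve reaches that accuracy, which would explain why many rows contain zeros in the three PVR columns while still having a non-zero area under the ARC.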

LLM Predictive Method Features 0.8 PVR 0.9 PVR 0.95 PVR Area under ARC
Stark2008/VisFlamCat Logistic Regression (l1) Word2Vec 0 0 0 0.558
Stark2008/VisFlamCat Logistic Regression (l1) FastText 0.03 0 0 0.547
migtissera/Tess-v2.5-Phi-3-medium-128k-14B Logistic Regression (l1) Word2Vec 0.03 0 0 0.684
migtissera/Tess-v2.5-Phi-3-medium-128k-14B Logistic Regression (l1) FastText 0 0 0 0.624
MaziyarPanahi/calme-2.1-qwen2.5-72b Logistic Regression (l1) Word2Vec 0.02 0 0 0.753
MaziyarPanahi/calme-2.1-qwen2.5-72b Logistic Regression (l1) FastText 0.495 0 0 0.786
ssmits/Qwen2.5-95B-Instruct Logistic Regression (l1) Word2Vec 0 0 0 0.701
ssmits/Qwen2.5-95B-Instruct Logistic Regression (l1) FastText 0.03 0.01 0 0.745
BlackBeenie/llama-3-luminous-merged Logistic Regression (l1) Word2Vec 0 0 0 0.525
BlackBeenie/llama-3-luminous-merged Logistic Regression (l1) FastText 0 0 0 0.485
LenguajeNaturalAI/leniachat-gemma-2b-v0 Logistic Regression (l1) Word2Vec 0 0 0 0.332
LenguajeNaturalAI/leniachat-gemma-2b-v0 Logistic Regression (l1) FastText 0 0 0 0.294
Intel/neural-chat-7b-v3-1 Logistic Regression (l1) Word2Vec 0 0 0 0.558
Intel/neural-chat-7b-v3-1 Logistic Regression (l1) FastText 0 0 0 0.451
awnr/Mistral-7B-v0.1-signtensors-5-over-16 Logistic Regression (l1) Word2Vec 0 0 0 0.393
awnr/Mistral-7B-v0.1-signtensors-5-over-16 Logistic Regression (l1) FastText 0 0 0 0.341
awnr/Mistral-7B-v0.1-signtensors-7-over-16 Logistic Regression (l1) Word2Vec 0 0 0 0.43
awnr/Mistral-7B-v0.1-signtensors-7-over-16 Logistic Regression (l1) FastText 0 0 0 0.384
awnr/Mistral-7B-v0.1-signtensors-1-over-4 Logistic Regression (l1) Word2Vec 0 0 0 0.35
awnr/Mistral-7B-v0.1-signtensors-1-over-4 Logistic Regression (l1) FastText 0 0 0 0.318
migtissera/Tess-v2.5.2-Qwen2-72B Logistic Regression (l1) Word2Vec 0 0 0 0.646
migtissera/Tess-v2.5.2-Qwen2-72B Logistic Regression (l1) FastText 0.051 0 0 0.718
migtissera/Tess-3-Mistral-Nemo-12B Logistic Regression (l1) Word2Vec 0 0 0 0.46
migtissera/Tess-3-Mistral-Nemo-12B Logistic Regression (l1) FastText 0 0 0 0.377
Intel/neural-chat-7b-v3-2 Logistic Regression (l1) Word2Vec 0 0 0 0.557
Intel/neural-chat-7b-v3-2 Logistic Regression (l1) FastText 0 0 0 0.451
awnr/Mistral-7B-v0.1-signtensors-3-over-8 Logistic Regression (l1) Word2Vec 0 0 0 0.357
awnr/Mistral-7B-v0.1-signtensors-3-over-8 Logistic Regression (l1) FastText 0 0 0 0.357
Stark2008/GutenLaserPi Logistic Regression (l1) Word2Vec 0 0 0 0.56
Stark2008/GutenLaserPi Logistic Regression (l1) FastText 0.02 0 0 0.541
BlackBeenie/llama-3.1-8B-Galore-openassistant-guanaco Logistic Regression (l1) Word2Vec 0 0 0 0.562
BlackBeenie/llama-3.1-8B-Galore-openassistant-guanaco Logistic Regression (l1) FastText 0.04 0 0 0.515
NousResearch/Yarn-Llama-2-13b-128k Logistic Regression (l1) Word2Vec 0 0 0 0.377
NousResearch/Yarn-Llama-2-13b-128k Logistic Regression (l1) FastText 0 0 0 0.468
wave-on-discord/qwent-7b Logistic Regression (l1) Word2Vec 0 0 0 0.393
wave-on-discord/qwent-7b Logistic Regression (l1) FastText 0 0 0 0.433
migtissera/Trinity-2-Codestral-22B Logistic Regression (l1) Word2Vec 0 0 0 0.586
migtissera/Trinity-2-Codestral-22B Logistic Regression (l1) FastText 0 0 0 0.485
migtissera/Llama-3-70B-Synthia-v3.5 Logistic Regression (l1) Word2Vec 0 0 0 0.651
migtissera/Llama-3-70B-Synthia-v3.5 Logistic Regression (l1) FastText 0 0 0 0.703
meta-llama/Meta-Llama-3.1-70B-Instruct Logistic Regression (l1) Word2Vec 0 0 0 0.703
meta-llama/Meta-Llama-3.1-70B-Instruct Logistic Regression (l1) FastText 0.111 0.03 0 0.744
AbacusResearch/Jallabi-34B Logistic Regression (l1) Word2Vec 0 0 0 0.636
AbacusResearch/Jallabi-34B Logistic Regression (l1) FastText 0 0 0 0.601
migtissera/Llama-3-8B-Synthia-v3.5 Logistic Regression (l1) Word2Vec 0 0 0 0.526
migtissera/Llama-3-8B-Synthia-v3.5 Logistic Regression (l1) FastText 0 0 0 0.453
migtissera/Trinity-2-Codestral-22B-v0.2 Logistic Regression (l1) Word2Vec 0 0 0 0.577
migtissera/Trinity-2-Codestral-22B-v0.2 Logistic Regression (l1) FastText 0 0 0 0.605
Aryanne/SuperHeart Logistic Regression (l1) Word2Vec 0 0 0 0.54
Aryanne/SuperHeart Logistic Regression (l1) FastText 0 0 0 0.552
jaredjoss/pythia-410m-roberta-lr_8e7-kl_01-steps_12000-rlhf-model Logistic Regression (l1) Word2Vec 0 0 0 0.315
jaredjoss/pythia-410m-roberta-lr_8e7-kl_01-steps_12000-rlhf-model Logistic Regression (l1) FastText 0 0 0 0.334
upstage/solar-pro-preview-instruct Logistic Regression (l1) Word2Vec 0 0 0 0.682
upstage/solar-pro-preview-instruct Logistic Regression (l1) FastText 0 0 0 0.634
Orenguteng/Llama-3.1-8B-Lexi-Uncensored Logistic Regression (l1) Word2Vec 0 0 0 0.527
Orenguteng/Llama-3.1-8B-Lexi-Uncensored Logistic Regression (l1) FastText 0 0 0 0.471
LenguajeNaturalAI/leniachat-qwen2-1.5B-v0 Logistic Regression (l1) Word2Vec 0 0 0 0.358
LenguajeNaturalAI/leniachat-qwen2-1.5B-v0 Logistic Regression (l1) FastText 0 0 0 0.355
Intel/neural-chat-7b-v3-3 Logistic Regression (l1) Word2Vec 0 0 0 0.487
Intel/neural-chat-7b-v3-3 Logistic Regression (l1) FastText 0 0 0 0.534
apple/DCLM-7B Logistic Regression (l1) Word2Vec 0 0 0 0.43
apple/DCLM-7B Logistic Regression (l1) FastText 0 0 0 0.379
ymcki/gemma-2-2b-ORPO-jpn-it-abliterated-18-merge Logistic Regression (l1) Word2Vec 0 0 0 0.389
ymcki/gemma-2-2b-ORPO-jpn-it-abliterated-18-merge Logistic Regression (l1) FastText 0 0 0 0.386
Orenguteng/Llama-3.1-8B-Lexi-Uncensored-V2 Logistic Regression (l1) Word2Vec 0 0 0 0.527
Orenguteng/Llama-3.1-8B-Lexi-Uncensored-V2 Logistic Regression (l1) FastText 0 0 0 0.488
xxx777xxxASD/L3.1-ClaudeMaid-4x8B Logistic Regression (l1) Word2Vec 0 0 0 0.513
xxx777xxxASD/L3.1-ClaudeMaid-4x8B Logistic Regression (l1) FastText 0 0 0 0.476
NousResearch/Hermes-3-Llama-3.1-70B Logistic Regression (l1) Word2Vec 0.01 0 0 0.735
NousResearch/Hermes-3-Llama-3.1-70B Logistic Regression (l1) FastText 0.141 0 0 0.714
Stark2008/LayleleFlamPi Logistic Regression (l1) Word2Vec 0 0 0 0.561
Stark2008/LayleleFlamPi Logistic Regression (l1) FastText 0 0 0 0.472
migtissera/Tess-3-7B-SFT Logistic Regression (l1) Word2Vec 0 0 0 0.447
migtissera/Tess-3-7B-SFT Logistic Regression (l1) FastText 0 0 0 0.392
argilla-warehouse/Llama-3.1-8B-MagPie-Ultra Logistic Regression (l1) Word2Vec 0 0 0 0.439
argilla-warehouse/Llama-3.1-8B-MagPie-Ultra Logistic Regression (l1) FastText 0 0 0 0.477
Stark2008/VisFlamCat Logistic Regression (l1) Ngrams-1 0 0 0 0.566
migtissera/Tess-v2.5-Phi-3-medium-128k-14B Logistic Regression (l1) Ngrams-1 0 0 0 0.655
MaziyarPanahi/calme-2.1-qwen2.5-72b Logistic Regression (l1) Ngrams-1 0 0 0 0.749
ssmits/Qwen2.5-95B-Instruct Logistic Regression (l1) Ngrams-1 0.04 0 0 0.708
BlackBeenie/llama-3-luminous-merged Logistic Regression (l1) Ngrams-1 0 0 0 0.526
LenguajeNaturalAI/leniachat-gemma-2b-v0 Logistic Regression (l1) Ngrams-1 0 0 0 0.282
Intel/neural-chat-7b-v3-1 Logistic Regression (l1) Ngrams-1 0 0 0 0.536
awnr/Mistral-7B-v0.1-signtensors-5-over-16 Logistic Regression (l1) Ngrams-1 0 0 0 0.384
awnr/Mistral-7B-v0.1-signtensors-7-over-16 Logistic Regression (l1) Ngrams-1 0 0 0 0.502
awnr/Mistral-7B-v0.1-signtensors-1-over-4 Logistic Regression (l1) Ngrams-1 0 0 0 0.343
migtissera/Tess-v2.5.2-Qwen2-72B Logistic Regression (l1) Ngrams-1 0.01 0 0 0.699
migtissera/Tess-3-Mistral-Nemo-12B Logistic Regression (l1) Ngrams-1 0 0 0 0.511
Intel/neural-chat-7b-v3-2 Logistic Regression (l1) Ngrams-1 0 0 0 0.523
awnr/Mistral-7B-v0.1-signtensors-3-over-8 Logistic Regression (l1) Ngrams-1 0 0 0 0.433
Stark2008/GutenLaserPi Logistic Regression (l1) Ngrams-1 0 0 0 0.569
BlackBeenie/llama-3.1-8B-Galore-openassistant-guanaco Logistic Regression (l1) Ngrams-1 0 0 0 0.499
NousResearch/Yarn-Llama-2-13b-128k Logistic Regression (l1) Ngrams-1 0 0 0 0.382
wave-on-discord/qwent-7b Logistic Regression (l1) Ngrams-1 0 0 0 0.38
migtissera/Trinity-2-Codestral-22B Logistic Regression (l1) Ngrams-1 0 0 0 0.579
migtissera/Llama-3-70B-Synthia-v3.5 Logistic Regression (l1) Ngrams-1 0 0 0 0.667
meta-llama/Meta-Llama-3.1-70B-Instruct Logistic Regression (l1) Ngrams-1 0 0 0 0.684
AbacusResearch/Jallabi-34B Logistic Regression (l1) Ngrams-1 0 0 0 0.586
migtissera/Llama-3-8B-Synthia-v3.5 Logistic Regression (l1) Ngrams-1 0 0 0 0.495
migtissera/Trinity-2-Codestral-22B-v0.2 Logistic Regression (l1) Ngrams-1 0 0 0 0.569
Aryanne/SuperHeart Logistic Regression (l1) Ngrams-1 0 0 0 0.551
jaredjoss/pythia-410m-roberta-lr_8e7-kl_01-steps_12000-rlhf-model Logistic Regression (l1) Ngrams-1 0 0 0 0.296
upstage/solar-pro-preview-instruct Logistic Regression (l1) Ngrams-1 0 0 0 0.71
Orenguteng/Llama-3.1-8B-Lexi-Uncensored Logistic Regression (l1) Ngrams-1 0 0 0 0.51
LenguajeNaturalAI/leniachat-qwen2-1.5B-v0 Logistic Regression (l1) Ngrams-1 0 0 0 0.31
Intel/neural-chat-7b-v3-3 Logistic Regression (l1) Ngrams-1 0 0 0 0.514
apple/DCLM-7B Logistic Regression (l1) Ngrams-1 0 0 0 0.435
ymcki/gemma-2-2b-ORPO-jpn-it-abliterated-18-merge Logistic Regression (l1) Ngrams-1 0 0 0 0.381
Orenguteng/Llama-3.1-8B-Lexi-Uncensored-V2 Logistic Regression (l1) Ngrams-1 0 0 0 0.498
xxx777xxxASD/L3.1-ClaudeMaid-4x8B Logistic Regression (l1) Ngrams-1 0 0 0 0.527
NousResearch/Hermes-3-Llama-3.1-70B Logistic Regression (l1) Ngrams-1 0.04 0 0 0.72
Stark2008/LayleleFlamPi Logistic Regression (l1) Ngrams-1 0 0 0 0.568
migtissera/Tess-3-7B-SFT Logistic Regression (l1) Ngrams-1 0 0 0 0.444
argilla-warehouse/Llama-3.1-8B-MagPie-Ultra Logistic Regression (l1) Ngrams-1 0 0 0 0.473
NousResearch/Yarn-Llama-2-13b-128k Logistic Regression (l1) OpenAI 0 0 0 0.398
Orenguteng/Llama-3.1-8B-Lexi-Uncensored Logistic Regression (l1) OpenAI 0 0 0 0.524
apple/DCLM-7B Logistic Regression (l1) OpenAI 0 0 0 0.442
Intel/neural-chat-7b-v3-2 Logistic Regression (l1) OpenAI 0 0 0 0.439
Stark2008/VisFlamCat Logistic Regression (l1) OpenAI 0 0 0 0.546
migtissera/Llama-3-70B-Synthia-v3.5 Logistic Regression (l1) OpenAI 0.061 0 0 0.698
awnr/Mistral-7B-v0.1-signtensors-7-over-16 Logistic Regression (l1) OpenAI 0 0 0 0.449
Intel/neural-chat-7b-v3-3 Logistic Regression (l1) OpenAI 0 0 0 0.505
Aryanne/SuperHeart Logistic Regression (l1) OpenAI 0 0 0 0.559
BlackBeenie/llama-3-luminous-merged Logistic Regression (l1) OpenAI 0 0 0 0.541
AbacusResearch/Jallabi-34B Logistic Regression (l1) OpenAI 0 0 0 0.595
awnr/Mistral-7B-v0.1-signtensors-5-over-16 Logistic Regression (l1) OpenAI 0 0 0 0.41
awnr/Mistral-7B-v0.1-signtensors-1-over-4 Logistic Regression (l1) OpenAI 0 0 0 0.371
meta-llama/Meta-Llama-3.1-70B-Instruct Logistic Regression (l1) OpenAI 0 0 0 0.717
migtissera/Tess-3-7B-SFT Logistic Regression (l1) OpenAI 0 0 0 0.438
migtissera/Trinity-2-Codestral-22B-v0.2 Logistic Regression (l1) OpenAI 0 0 0 0.492
xxx777xxxASD/L3.1-ClaudeMaid-4x8B Logistic Regression (l1) OpenAI 0 0 0 0.518
argilla-warehouse/Llama-3.1-8B-MagPie-Ultra Logistic Regression (l1) OpenAI 0 0 0 0.457
MaziyarPanahi/calme-2.1-qwen2.5-72b Logistic Regression (l1) OpenAI 0.374 0 0 0.768
Stark2008/LayleleFlamPi Logistic Regression (l1) OpenAI 0 0 0 0.534
migtissera/Tess-3-Mistral-Nemo-12B Logistic Regression (l1) OpenAI 0 0 0 0.53
BlackBeenie/llama-3.1-8B-Galore-openassistant-guanaco Logistic Regression (l1) OpenAI 0 0 0 0.548
migtissera/Trinity-2-Codestral-22B Logistic Regression (l1) OpenAI 0 0 0 0.518
wave-on-discord/qwent-7b Logistic Regression (l1) OpenAI 0 0 0 0.405
Stark2008/GutenLaserPi Logistic Regression (l1) OpenAI 0 0 0 0.515
migtissera/Tess-v2.5.2-Qwen2-72B Logistic Regression (l1) OpenAI 0 0 0 0.701
NousResearch/Hermes-3-Llama-3.1-70B Logistic Regression (l1) OpenAI 0 0 0 0.663
LenguajeNaturalAI/leniachat-gemma-2b-v0 Logistic Regression (l1) OpenAI 0 0 0 0.302
LenguajeNaturalAI/leniachat-qwen2-1.5B-v0 Logistic Regression (l1) OpenAI 0 0 0 0.41
awnr/Mistral-7B-v0.1-signtensors-3-over-8 Logistic Regression (l1) OpenAI 0 0 0 0.439
jaredjoss/pythia-410m-roberta-lr_8e7-kl_01-steps_12000-rlhf-model Logistic Regression (l1) OpenAI 0 0 0 0.312
migtissera/Llama-3-8B-Synthia-v3.5 Logistic Regression (l1) OpenAI 0 0 0 0.51
Intel/neural-chat-7b-v3-1 Logistic Regression (l1) OpenAI 0 0 0 0.53
Orenguteng/Llama-3.1-8B-Lexi-Uncensored-V2 Logistic Regression (l1) OpenAI 0 0 0 0.485
ssmits/Qwen2.5-95B-Instruct Logistic Regression (l1) OpenAI 0.01 0 0 0.732
ymcki/gemma-2-2b-ORPO-jpn-it-abliterated-18-merge Logistic Regression (l1) OpenAI 0 0 0 0.392
migtissera/Tess-v2.5-Phi-3-medium-128k-14B Logistic Regression (l1) OpenAI 0 0 0 0.658
upstage/solar-pro-preview-instruct Logistic Regression (l1) OpenAI 0.03 0 0 0.726
Stark2008/VisFlamCat Logistic Regression (l2) Word2Vec 0 0 0 0.536
Stark2008/VisFlamCat Logistic Regression (l2) FastText 0.04 0 0 0.575
migtissera/Tess-v2.5-Phi-3-medium-128k-14B Logistic Regression (l2) Word2Vec 0.01 0 0 0.693
migtissera/Tess-v2.5-Phi-3-medium-128k-14B Logistic Regression (l2) FastText 0.04 0 0 0.641
MaziyarPanahi/calme-2.1-qwen2.5-72b Logistic Regression (l2) Word2Vec 0.03 0 0 0.741
MaziyarPanahi/calme-2.1-qwen2.5-72b Logistic Regression (l2) FastText 0.485 0.04 0 0.784
ssmits/Qwen2.5-95B-Instruct Logistic Regression (l2) Word2Vec 0 0 0 0.713
ssmits/Qwen2.5-95B-Instruct Logistic Regression (l2) FastText 0.01 0 0 0.748
BlackBeenie/llama-3-luminous-merged Logistic Regression (l2) Word2Vec 0 0 0 0.519
BlackBeenie/llama-3-luminous-merged Logistic Regression (l2) FastText 0 0 0 0.519
LenguajeNaturalAI/leniachat-gemma-2b-v0 Logistic Regression (l2) Word2Vec 0 0 0 0.322
LenguajeNaturalAI/leniachat-gemma-2b-v0 Logistic Regression (l2) FastText 0 0 0 0.27
Intel/neural-chat-7b-v3-1 Logistic Regression (l2) Word2Vec 0 0 0 0.548
Intel/neural-chat-7b-v3-1 Logistic Regression (l2) FastText 0 0 0 0.551
awnr/Mistral-7B-v0.1-signtensors-5-over-16 Logistic Regression (l2) Word2Vec 0 0 0 0.367
awnr/Mistral-7B-v0.1-signtensors-5-over-16 Logistic Regression (l2) FastText 0 0 0 0.369
awnr/Mistral-7B-v0.1-signtensors-7-over-16 Logistic Regression (l2) Word2Vec 0 0 0 0.398
awnr/Mistral-7B-v0.1-signtensors-7-over-16 Logistic Regression (l2) FastText 0 0 0 0.389
awnr/Mistral-7B-v0.1-signtensors-1-over-4 Logistic Regression (l2) Word2Vec 0 0 0 0.332
awnr/Mistral-7B-v0.1-signtensors-1-over-4 Logistic Regression (l2) FastText 0 0 0 0.325
migtissera/Tess-v2.5.2-Qwen2-72B Logistic Regression (l2) Word2Vec 0 0 0 0.636
migtissera/Tess-v2.5.2-Qwen2-72B Logistic Regression (l2) FastText 0.051 0 0 0.716
migtissera/Tess-3-Mistral-Nemo-12B Logistic Regression (l2) Word2Vec 0 0 0 0.496
migtissera/Tess-3-Mistral-Nemo-12B Logistic Regression (l2) FastText 0 0 0 0.417
Intel/neural-chat-7b-v3-2 Logistic Regression (l2) Word2Vec 0 0 0 0.512
Intel/neural-chat-7b-v3-2 Logistic Regression (l2) FastText 0 0 0 0.502
awnr/Mistral-7B-v0.1-signtensors-3-over-8 Logistic Regression (l2) Word2Vec 0 0 0 0.361
awnr/Mistral-7B-v0.1-signtensors-3-over-8 Logistic Regression (l2) FastText 0 0 0 0.373
Stark2008/GutenLaserPi Logistic Regression (l2) Word2Vec 0 0 0 0.552
Stark2008/GutenLaserPi Logistic Regression (l2) FastText 0 0 0 0.572
BlackBeenie/llama-3.1-8B-Galore-openassistant-guanaco Logistic Regression (l2) Word2Vec 0 0 0 0.566
BlackBeenie/llama-3.1-8B-Galore-openassistant-guanaco Logistic Regression (l2) FastText 0.04 0 0 0.512
NousResearch/Yarn-Llama-2-13b-128k Logistic Regression (l2) Word2Vec 0 0 0 0.356
NousResearch/Yarn-Llama-2-13b-128k Logistic Regression (l2) FastText 0 0 0 0.429
wave-on-discord/qwent-7b Logistic Regression (l2) Word2Vec 0 0 0 0.394
wave-on-discord/qwent-7b Logistic Regression (l2) FastText 0 0 0 0.4
migtissera/Trinity-2-Codestral-22B Logistic Regression (l2) Word2Vec 0.03 0.01 0 0.603
migtissera/Trinity-2-Codestral-22B Logistic Regression (l2) FastText 0.02 0 0 0.568
migtissera/Llama-3-70B-Synthia-v3.5 Logistic Regression (l2) Word2Vec 0.04 0 0 0.655
migtissera/Llama-3-70B-Synthia-v3.5 Logistic Regression (l2) FastText 0.01 0 0 0.702
meta-llama/Meta-Llama-3.1-70B-Instruct Logistic Regression (l2) Word2Vec 0 0 0 0.698
meta-llama/Meta-Llama-3.1-70B-Instruct Logistic Regression (l2) FastText 0.091 0.03 0 0.742
AbacusResearch/Jallabi-34B Logistic Regression (l2) Word2Vec 0 0 0 0.647
AbacusResearch/Jallabi-34B Logistic Regression (l2) FastText 0.061 0 0 0.63
migtissera/Llama-3-8B-Synthia-v3.5 Logistic Regression (l2) Word2Vec 0 0 0 0.521
migtissera/Llama-3-8B-Synthia-v3.5 Logistic Regression (l2) FastText 0 0 0 0.471
migtissera/Trinity-2-Codestral-22B-v0.2 Logistic Regression (l2) Word2Vec 0 0 0 0.599
migtissera/Trinity-2-Codestral-22B-v0.2 Logistic Regression (l2) FastText 0.061 0 0 0.62
Aryanne/SuperHeart Logistic Regression (l2) Word2Vec 0 0 0 0.528
Aryanne/SuperHeart Logistic Regression (l2) FastText 0 0 0 0.528
jaredjoss/pythia-410m-roberta-lr_8e7-kl_01-steps_12000-rlhf-model Logistic Regression (l2) Word2Vec 0 0 0 0.284
jaredjoss/pythia-410m-roberta-lr_8e7-kl_01-steps_12000-rlhf-model Logistic Regression (l2) FastText 0 0 0 0.328
upstage/solar-pro-preview-instruct Logistic Regression (l2) Word2Vec 0.01 0 0 0.706
upstage/solar-pro-preview-instruct Logistic Regression (l2) FastText 0.061 0 0 0.693
Orenguteng/Llama-3.1-8B-Lexi-Uncensored Logistic Regression (l2) Word2Vec 0 0 0 0.502
Orenguteng/Llama-3.1-8B-Lexi-Uncensored Logistic Regression (l2) FastText 0 0 0 0.481
LenguajeNaturalAI/leniachat-qwen2-1.5B-v0 Logistic Regression (l2) Word2Vec 0 0 0 0.335
LenguajeNaturalAI/leniachat-qwen2-1.5B-v0 Logistic Regression (l2) FastText 0 0 0 0.309
Intel/neural-chat-7b-v3-3 Logistic Regression (l2) Word2Vec 0 0 0 0.506
Intel/neural-chat-7b-v3-3 Logistic Regression (l2) FastText 0 0 0 0.547
apple/DCLM-7B Logistic Regression (l2) Word2Vec 0 0 0 0.429
apple/DCLM-7B Logistic Regression (l2) FastText 0 0 0 0.392
ymcki/gemma-2-2b-ORPO-jpn-it-abliterated-18-merge Logistic Regression (l2) Word2Vec 0 0 0 0.4
ymcki/gemma-2-2b-ORPO-jpn-it-abliterated-18-merge Logistic Regression (l2) FastText 0 0 0 0.403
Orenguteng/Llama-3.1-8B-Lexi-Uncensored-V2 Logistic Regression (l2) Word2Vec 0 0 0 0.521
Orenguteng/Llama-3.1-8B-Lexi-Uncensored-V2 Logistic Regression (l2) FastText 0.03 0 0 0.536
xxx777xxxASD/L3.1-ClaudeMaid-4x8B Logistic Regression (l2) Word2Vec 0 0 0 0.516
xxx777xxxASD/L3.1-ClaudeMaid-4x8B Logistic Regression (l2) FastText 0 0 0 0.504
NousResearch/Hermes-3-Llama-3.1-70B Logistic Regression (l2) Word2Vec 0.02 0 0 0.731
NousResearch/Hermes-3-Llama-3.1-70B Logistic Regression (l2) FastText 0.051 0 0 0.722
Stark2008/LayleleFlamPi Logistic Regression (l2) Word2Vec 0 0 0 0.511
Stark2008/LayleleFlamPi Logistic Regression (l2) FastText 0 0 0 0.573
migtissera/Tess-3-7B-SFT Logistic Regression (l2) Word2Vec 0 0 0 0.436
migtissera/Tess-3-7B-SFT Logistic Regression (l2) FastText 0 0 0 0.445
argilla-warehouse/Llama-3.1-8B-MagPie-Ultra Logistic Regression (l2) Word2Vec 0 0 0 0.418
argilla-warehouse/Llama-3.1-8B-MagPie-Ultra Logistic Regression (l2) FastText 0 0 0 0.469
Stark2008/VisFlamCat Logistic Regression (l2) Ngrams-1 0 0 0 0.498
migtissera/Tess-v2.5-Phi-3-medium-128k-14B Logistic Regression (l2) Ngrams-1 0 0 0 0.619
MaziyarPanahi/calme-2.1-qwen2.5-72b Logistic Regression (l2) Ngrams-1 0 0 0 0.752
ssmits/Qwen2.5-95B-Instruct Logistic Regression (l2) Ngrams-1 0 0 0 0.718
BlackBeenie/llama-3-luminous-merged Logistic Regression (l2) Ngrams-1 0 0 0 0.467
LenguajeNaturalAI/leniachat-gemma-2b-v0 Logistic Regression (l2) Ngrams-1 0 0 0 0.309
Intel/neural-chat-7b-v3-1 Logistic Regression (l2) Ngrams-1 0 0 0 0.47
awnr/Mistral-7B-v0.1-signtensors-5-over-16 Logistic Regression (l2) Ngrams-1 0 0 0 0.381
awnr/Mistral-7B-v0.1-signtensors-7-over-16 Logistic Regression (l2) Ngrams-1 0 0 0 0.45
awnr/Mistral-7B-v0.1-signtensors-1-over-4 Logistic Regression (l2) Ngrams-1 0 0 0 0.324
migtissera/Tess-v2.5.2-Qwen2-72B Logistic Regression (l2) Ngrams-1 0 0 0 0.687
migtissera/Tess-3-Mistral-Nemo-12B Logistic Regression (l2) Ngrams-1 0 0 0 0.457
Intel/neural-chat-7b-v3-2 Logistic Regression (l2) Ngrams-1 0 0 0 0.461
awnr/Mistral-7B-v0.1-signtensors-3-over-8 Logistic Regression (l2) Ngrams-1 0 0 0 0.418
Stark2008/GutenLaserPi Logistic Regression (l2) Ngrams-1 0 0 0 0.5
BlackBeenie/llama-3.1-8B-Galore-openassistant-guanaco Logistic Regression (l2) Ngrams-1 0 0 0 0.485
NousResearch/Yarn-Llama-2-13b-128k Logistic Regression (l2) Ngrams-1 0 0 0 0.383
wave-on-discord/qwent-7b Logistic Regression (l2) Ngrams-1 0 0 0 0.42
migtissera/Trinity-2-Codestral-22B Logistic Regression (l2) Ngrams-1 0 0 0 0.515
migtissera/Llama-3-70B-Synthia-v3.5 Logistic Regression (l2) Ngrams-1 0 0 0 0.669
meta-llama/Meta-Llama-3.1-70B-Instruct Logistic Regression (l2) Ngrams-1 0 0 0 0.724
AbacusResearch/Jallabi-34B Logistic Regression (l2) Ngrams-1 0 0 0 0.587
migtissera/Llama-3-8B-Synthia-v3.5 Logistic Regression (l2) Ngrams-1 0 0 0 0.474
migtissera/Trinity-2-Codestral-22B-v0.2 Logistic Regression (l2) Ngrams-1 0 0 0 0.522
Aryanne/SuperHeart Logistic Regression (l2) Ngrams-1 0 0 0 0.471
jaredjoss/pythia-410m-roberta-lr_8e7-kl_01-steps_12000-rlhf-model Logistic Regression (l2) Ngrams-1 0 0 0 0.31
upstage/solar-pro-preview-instruct Logistic Regression (l2) Ngrams-1 0 0 0 0.682
Orenguteng/Llama-3.1-8B-Lexi-Uncensored Logistic Regression (l2) Ngrams-1 0 0 0 0.474
LenguajeNaturalAI/leniachat-qwen2-1.5B-v0 Logistic Regression (l2) Ngrams-1 0 0 0 0.314
Intel/neural-chat-7b-v3-3 Logistic Regression (l2) Ngrams-1 0 0 0 0.482
apple/DCLM-7B Logistic Regression (l2) Ngrams-1 0 0 0 0.436
ymcki/gemma-2-2b-ORPO-jpn-it-abliterated-18-merge Logistic Regression (l2) Ngrams-1 0 0 0 0.382
Orenguteng/Llama-3.1-8B-Lexi-Uncensored-V2 Logistic Regression (l2) Ngrams-1 0 0 0 0.468
xxx777xxxASD/L3.1-ClaudeMaid-4x8B Logistic Regression (l2) Ngrams-1 0 0 0 0.427
NousResearch/Hermes-3-Llama-3.1-70B Logistic Regression (l2) Ngrams-1 0 0 0 0.632
Stark2008/LayleleFlamPi Logistic Regression (l2) Ngrams-1 0 0 0 0.483
migtissera/Tess-3-7B-SFT Logistic Regression (l2) Ngrams-1 0 0 0 0.434
argilla-warehouse/Llama-3.1-8B-MagPie-Ultra Logistic Regression (l2) Ngrams-1 0 0 0 0.427
NousResearch/Yarn-Llama-2-13b-128k Logistic Regression (l2) OpenAI 0 0 0 0.348
Orenguteng/Llama-3.1-8B-Lexi-Uncensored Logistic Regression (l2) OpenAI 0 0 0 0.513
apple/DCLM-7B Logistic Regression (l2) OpenAI 0 0 0 0.436
Intel/neural-chat-7b-v3-2 Logistic Regression (l2) OpenAI 0 0 0 0.452
Stark2008/VisFlamCat Logistic Regression (l2) OpenAI 0 0 0 0.52
migtissera/Llama-3-70B-Synthia-v3.5 Logistic Regression (l2) OpenAI 0.152 0 0 0.711
awnr/Mistral-7B-v0.1-signtensors-7-over-16 Logistic Regression (l2) OpenAI 0 0 0 0.432
Intel/neural-chat-7b-v3-3 Logistic Regression (l2) OpenAI 0 0 0 0.446
Aryanne/SuperHeart Logistic Regression (l2) OpenAI 0 0 0 0.558
BlackBeenie/llama-3-luminous-merged Logistic Regression (l2) OpenAI 0 0 0 0.552
AbacusResearch/Jallabi-34B Logistic Regression (l2) OpenAI 0 0 0 0.594
awnr/Mistral-7B-v0.1-signtensors-5-over-16 Logistic Regression (l2) OpenAI 0 0 0 0.384
awnr/Mistral-7B-v0.1-signtensors-1-over-4 Logistic Regression (l2) OpenAI 0 0 0 0.332
meta-llama/Meta-Llama-3.1-70B-Instruct Logistic Regression (l2) OpenAI 0.162 0 0 0.731
migtissera/Tess-3-7B-SFT Logistic Regression (l2) OpenAI 0 0 0 0.423
migtissera/Trinity-2-Codestral-22B-v0.2 Logistic Regression (l2) OpenAI 0 0 0 0.533
xxx777xxxASD/L3.1-ClaudeMaid-4x8B Logistic Regression (l2) OpenAI 0 0 0 0.51
argilla-warehouse/Llama-3.1-8B-MagPie-Ultra Logistic Regression (l2) OpenAI 0 0 0 0.492
MaziyarPanahi/calme-2.1-qwen2.5-72b Logistic Regression (l2) OpenAI 0.303 0 0 0.773
Stark2008/LayleleFlamPi Logistic Regression (l2) OpenAI 0 0 0 0.512
migtissera/Tess-3-Mistral-Nemo-12B Logistic Regression (l2) OpenAI 0 0 0 0.462
BlackBeenie/llama-3.1-8B-Galore-openassistant-guanaco Logistic Regression (l2) OpenAI 0 0 0 0.559
migtissera/Trinity-2-Codestral-22B Logistic Regression (l2) OpenAI 0 0 0 0.544
wave-on-discord/qwent-7b Logistic Regression (l2) OpenAI 0 0 0 0.418
Stark2008/GutenLaserPi Logistic Regression (l2) OpenAI 0 0 0 0.53
migtissera/Tess-v2.5.2-Qwen2-72B Logistic Regression (l2) OpenAI 0.242 0 0 0.739
NousResearch/Hermes-3-Llama-3.1-70B Logistic Regression (l2) OpenAI 0 0 0 0.659
LenguajeNaturalAI/leniachat-gemma-2b-v0 Logistic Regression (l2) OpenAI 0 0 0 0.304
LenguajeNaturalAI/leniachat-qwen2-1.5B-v0 Logistic Regression (l2) OpenAI 0 0 0 0.377
awnr/Mistral-7B-v0.1-signtensors-3-over-8 Logistic Regression (l2) OpenAI 0 0 0 0.428
jaredjoss/pythia-410m-roberta-lr 8e7-kl 01-steps 12000-rlhf-model Logistic Regression (l2) OpenAI 0 0 0 0.311
migtissera/Llama-3-8B-Synthia-v3.5 Logistic Regression (l2) OpenAI 0 0 0 0.498
Intel/neural-chat-7b-v3-1 Logistic Regression (l2) OpenAI 0 0 0 0.523
Orenguteng/Llama-3.1-8B-Lexi-Uncensored-V2 Logistic Regression (l2) OpenAI 0 0 0 0.542
ssmits/Qwen2.5-95B-Instruct Logistic Regression (l2) OpenAI 0 0 0 0.749
ymcki/gemma-2-2b-ORPO-jpn-it-abliterated-18-merge Logistic Regression (l2) OpenAI 0 0 0 0.394
migtissera/Tess-v2.5-Phi-3-medium-128k-14B Logistic Regression (l2) OpenAI 0 0 0 0.644
upstage/solar-pro-preview-instruct Logistic Regression (l2) OpenAI 0.03 0 0 0.724
Stark2008/VisFlamCat XGBoost Word2Vec 0 0 0 0.529
Stark2008/VisFlamCat XGBoost FastText 0 0 0 0.555
migtissera/Tess-v2.5-Phi-3-medium-128k-14B XGBoost Word2Vec 0 0 0 0.644
migtissera/Tess-v2.5-Phi-3-medium-128k-14B XGBoost FastText 0 0 0 0.671
MaziyarPanahi/calme-2.1-qwen2.5-72b XGBoost Word2Vec 0.03 0 0 0.737
MaziyarPanahi/calme-2.1-qwen2.5-72b XGBoost FastText 0.303 0 0 0.769
ssmits/Qwen2.5-95B-Instruct XGBoost Word2Vec 0 0 0 0.684
ssmits/Qwen2.5-95B-Instruct XGBoost FastText 0.091 0 0 0.743
BlackBeenie/llama-3-luminous-merged XGBoost Word2Vec 0 0 0 0.489
BlackBeenie/llama-3-luminous-merged XGBoost FastText 0 0 0 0.504
LenguajeNaturalAI/leniachat-gemma-2b-v0 XGBoost Word2Vec 0 0 0 0.301
LenguajeNaturalAI/leniachat-gemma-2b-v0 XGBoost FastText 0 0 0 0.313
Intel/neural-chat-7b-v3-1 XGBoost Word2Vec 0 0 0 0.52
Intel/neural-chat-7b-v3-1 XGBoost FastText 0 0 0 0.511
awnr/Mistral-7B-v0.1-signtensors-5-over-16 XGBoost Word2Vec 0 0 0 0.394
awnr/Mistral-7B-v0.1-signtensors-5-over-16 XGBoost FastText 0 0 0 0.4
awnr/Mistral-7B-v0.1-signtensors-7-over-16 XGBoost Word2Vec 0 0 0 0.444
awnr/Mistral-7B-v0.1-signtensors-7-over-16 XGBoost FastText 0 0 0 0.398
awnr/Mistral-7B-v0.1-signtensors-1-over-4 XGBoost Word2Vec 0 0 0 0.338
awnr/Mistral-7B-v0.1-signtensors-1-over-4 XGBoost FastText 0 0 0 0.326
migtissera/Tess-v2.5.2-Qwen2-72B XGBoost Word2Vec 0.01 0 0 0.654
migtissera/Tess-v2.5.2-Qwen2-72B XGBoost FastText 0 0 0 0.703
migtissera/Tess-3-Mistral-Nemo-12B XGBoost Word2Vec 0 0 0 0.502
migtissera/Tess-3-Mistral-Nemo-12B XGBoost FastText 0.01 0 0 0.531
Intel/neural-chat-7b-v3-2 XGBoost Word2Vec 0 0 0 0.543
Intel/neural-chat-7b-v3-2 XGBoost FastText 0 0 0 0.561
awnr/Mistral-7B-v0.1-signtensors-3-over-8 XGBoost Word2Vec 0 0 0 0.411
awnr/Mistral-7B-v0.1-signtensors-3-over-8 XGBoost FastText 0 0 0 0.395
Stark2008/GutenLaserPi XGBoost Word2Vec 0 0 0 0.529
Stark2008/GutenLaserPi XGBoost FastText 0 0 0 0.557
BlackBeenie/llama-3.1-8B-Galore-openassistant-guanaco XGBoost Word2Vec 0 0 0 0.568
BlackBeenie/llama-3.1-8B-Galore-openassistant-guanaco XGBoost FastText 0 0 0 0.575
NousResearch/Yarn-Llama-2-13b-128k XGBoost Word2Vec 0 0 0 0.428
NousResearch/Yarn-Llama-2-13b-128k XGBoost FastText 0 0 0 0.429
wave-on-discord/qwent-7b XGBoost Word2Vec 0 0 0 0.448
wave-on-discord/qwent-7b XGBoost FastText 0 0 0 0.44
migtissera/Trinity-2-Codestral-22B XGBoost Word2Vec 0 0 0 0.536
migtissera/Trinity-2-Codestral-22B XGBoost FastText 0 0 0 0.538
migtissera/Llama-3-70B-Synthia-v3.5 XGBoost Word2Vec 0 0 0 0.65
migtissera/Llama-3-70B-Synthia-v3.5 XGBoost FastText 0 0 0 0.677
meta-llama/Meta-Llama-3.1-70B-Instruct XGBoost Word2Vec 0 0 0 0.7
meta-llama/Meta-Llama-3.1-70B-Instruct XGBoost FastText 0 0 0 0.679
AbacusResearch/Jallabi-34B XGBoost Word2Vec 0 0 0 0.628
AbacusResearch/Jallabi-34B XGBoost FastText 0 0 0 0.619
migtissera/Llama-3-8B-Synthia-v3.5 XGBoost Word2Vec 0 0 0 0.502
migtissera/Llama-3-8B-Synthia-v3.5 XGBoost FastText 0 0 0 0.517
migtissera/Trinity-2-Codestral-22B-v0.2 XGBoost Word2Vec 0 0 0 0.585
migtissera/Trinity-2-Codestral-22B-v0.2 XGBoost FastText 0 0 0 0.584
Aryanne/SuperHeart XGBoost Word2Vec 0 0 0 0.52
Aryanne/SuperHeart XGBoost FastText 0 0 0 0.484
jaredjoss/pythia-410m-roberta-lr 8e7-kl 01-steps 12000-rlhf-model XGBoost Word2Vec 0 0 0 0.296
jaredjoss/pythia-410m-roberta-lr 8e7-kl 01-steps 12000-rlhf-model XGBoost FastText 0 0 0 0.326
upstage/solar-pro-preview-instruct XGBoost Word2Vec 0 0 0 0.677
upstage/solar-pro-preview-instruct XGBoost FastText 0 0 0 0.713
Orenguteng/Llama-3.1-8B-Lexi-Uncensored XGBoost Word2Vec 0 0 0 0.54
Orenguteng/Llama-3.1-8B-Lexi-Uncensored XGBoost FastText 0 0 0 0.486
LenguajeNaturalAI/leniachat-qwen2-1.5B-v0 XGBoost Word2Vec 0 0 0 0.343
LenguajeNaturalAI/leniachat-qwen2-1.5B-v0 XGBoost FastText 0 0 0 0.389
Intel/neural-chat-7b-v3-3 XGBoost Word2Vec 0 0 0 0.467
Intel/neural-chat-7b-v3-3 XGBoost FastText 0 0 0 0.547
apple/DCLM-7B XGBoost Word2Vec 0 0 0 0.431
apple/DCLM-7B XGBoost FastText 0 0 0 0.448
ymcki/gemma-2-2b-ORPO-jpn-it-abliterated-18-merge XGBoost Word2Vec 0 0 0 0.41
ymcki/gemma-2-2b-ORPO-jpn-it-abliterated-18-merge XGBoost FastText 0 0 0 0.38
Orenguteng/Llama-3.1-8B-Lexi-Uncensored-V2 XGBoost Word2Vec 0 0 0 0.55
Orenguteng/Llama-3.1-8B-Lexi-Uncensored-V2 XGBoost FastText 0 0 0 0.53
xxx777xxxASD/L3.1-ClaudeMaid-4x8B XGBoost Word2Vec 0 0 0 0.55
xxx777xxxASD/L3.1-ClaudeMaid-4x8B XGBoost FastText 0 0 0 0.506
NousResearch/Hermes-3-Llama-3.1-70B XGBoost Word2Vec 0.02 0 0 0.699
NousResearch/Hermes-3-Llama-3.1-70B XGBoost FastText 0 0 0 0.637
Stark2008/LayleleFlamPi XGBoost Word2Vec 0 0 0 0.527
Stark2008/LayleleFlamPi XGBoost FastText 0 0 0 0.545
migtissera/Tess-3-7B-SFT XGBoost Word2Vec 0 0 0 0.464
migtissera/Tess-3-7B-SFT XGBoost FastText 0 0 0 0.493
argilla-warehouse/Llama-3.1-8B-MagPie-Ultra XGBoost Word2Vec 0 0 0 0.462
argilla-warehouse/Llama-3.1-8B-MagPie-Ultra XGBoost FastText 0 0 0 0.448
Stark2008/VisFlamCat XGBoost Ngrams-1 0 0 0 0.558
migtissera/Tess-v2.5-Phi-3-medium-128k-14B XGBoost Ngrams-1 0 0 0 0.595
MaziyarPanahi/calme-2.1-qwen2.5-72b XGBoost Ngrams-1 0.03 0 0 0.756
ssmits/Qwen2.5-95B-Instruct XGBoost Ngrams-1 0 0 0 0.694
BlackBeenie/llama-3-luminous-merged XGBoost Ngrams-1 0 0 0 0.484
LenguajeNaturalAI/leniachat-gemma-2b-v0 XGBoost Ngrams-1 0 0 0 0.316
Intel/neural-chat-7b-v3-1 XGBoost Ngrams-1 0 0 0 0.535
awnr/Mistral-7B-v0.1-signtensors-5-over-16 XGBoost Ngrams-1 0 0 0 0.426
awnr/Mistral-7B-v0.1-signtensors-7-over-16 XGBoost Ngrams-1 0 0 0 0.441
awnr/Mistral-7B-v0.1-signtensors-1-over-4 XGBoost Ngrams-1 0 0 0 0.361
migtissera/Tess-v2.5.2-Qwen2-72B XGBoost Ngrams-1 0 0 0 0.673
migtissera/Tess-3-Mistral-Nemo-12B XGBoost Ngrams-1 0 0 0 0.5
Intel/neural-chat-7b-v3-2 XGBoost Ngrams-1 0 0 0 0.465
awnr/Mistral-7B-v0.1-signtensors-3-over-8 XGBoost Ngrams-1 0 0 0 0.412
Stark2008/GutenLaserPi XGBoost Ngrams-1 0 0 0 0.552
BlackBeenie/llama-3.1-8B-Galore-openassistant-guanaco XGBoost Ngrams-1 0 0 0 0.458
NousResearch/Yarn-Llama-2-13b-128k XGBoost Ngrams-1 0 0 0 0.336
wave-on-discord/qwent-7b XGBoost Ngrams-1 0 0 0 0.399
migtissera/Trinity-2-Codestral-22B XGBoost Ngrams-1 0 0 0 0.587
migtissera/Llama-3-70B-Synthia-v3.5 XGBoost Ngrams-1 0 0 0 0.657
meta-llama/Meta-Llama-3.1-70B-Instruct XGBoost Ngrams-1 0 0 0 0.679
AbacusResearch/Jallabi-34B XGBoost Ngrams-1 0 0 0 0.616
migtissera/Llama-3-8B-Synthia-v3.5 XGBoost Ngrams-1 0 0 0 0.473
migtissera/Trinity-2-Codestral-22B-v0.2 XGBoost Ngrams-1 0 0 0 0.574
Aryanne/SuperHeart XGBoost Ngrams-1 0 0 0 0.524
jaredjoss/pythia-410m-roberta-lr 8e7-kl 01-steps 12000-rlhf-model XGBoost Ngrams-1 0 0 0 0.291
upstage/solar-pro-preview-instruct XGBoost Ngrams-1 0 0 0 0.678
Orenguteng/Llama-3.1-8B-Lexi-Uncensored XGBoost Ngrams-1 0 0 0 0.525
LenguajeNaturalAI/leniachat-qwen2-1.5B-v0 XGBoost Ngrams-1 0 0 0 0.324
Intel/neural-chat-7b-v3-3 XGBoost Ngrams-1 0 0 0 0.464
apple/DCLM-7B XGBoost Ngrams-1 0 0 0 0.373
ymcki/gemma-2-2b-ORPO-jpn-it-abliterated-18-merge XGBoost Ngrams-1 0 0 0 0.364
Orenguteng/Llama-3.1-8B-Lexi-Uncensored-V2 XGBoost Ngrams-1 0 0 0 0.53
xxx777xxxASD/L3.1-ClaudeMaid-4x8B XGBoost Ngrams-1 0 0 0 0.47
NousResearch/Hermes-3-Llama-3.1-70B XGBoost Ngrams-1 0 0 0 0.639
Stark2008/LayleleFlamPi XGBoost Ngrams-1 0 0 0 0.54
migtissera/Tess-3-7B-SFT XGBoost Ngrams-1 0 0 0 0.469
argilla-warehouse/Llama-3.1-8B-MagPie-Ultra XGBoost Ngrams-1 0 0 0 0.413
NousResearch/Yarn-Llama-2-13b-128k XGBoost OpenAI 0 0 0 0.381
Orenguteng/Llama-3.1-8B-Lexi-Uncensored XGBoost OpenAI 0 0 0 0.494
apple/DCLM-7B XGBoost OpenAI 0 0 0 0.411
Intel/neural-chat-7b-v3-2 XGBoost OpenAI 0 0 0 0.487
Stark2008/VisFlamCat XGBoost OpenAI 0 0 0 0.495
migtissera/Llama-3-70B-Synthia-v3.5 XGBoost OpenAI 0 0 0 0.676
awnr/Mistral-7B-v0.1-signtensors-7-over-16 XGBoost OpenAI 0 0 0 0.42
Intel/neural-chat-7b-v3-3 XGBoost OpenAI 0 0 0 0.457
Aryanne/SuperHeart XGBoost OpenAI 0 0 0 0.548
BlackBeenie/llama-3-luminous-merged XGBoost OpenAI 0 0 0 0.521
AbacusResearch/Jallabi-34B XGBoost OpenAI 0 0 0 0.586
awnr/Mistral-7B-v0.1-signtensors-5-over-16 XGBoost OpenAI 0 0 0 0.404
awnr/Mistral-7B-v0.1-signtensors-1-over-4 XGBoost OpenAI 0 0 0 0.353
meta-llama/Meta-Llama-3.1-70B-Instruct XGBoost OpenAI 0.01 0 0 0.686
migtissera/Tess-3-7B-SFT XGBoost OpenAI 0 0 0 0.424
migtissera/Trinity-2-Codestral-22B-v0.2 XGBoost OpenAI 0 0 0 0.561
xxx777xxxASD/L3.1-ClaudeMaid-4x8B XGBoost OpenAI 0 0 0 0.521
argilla-warehouse/Llama-3.1-8B-MagPie-Ultra XGBoost OpenAI 0 0 0 0.472
MaziyarPanahi/calme-2.1-qwen2.5-72b XGBoost OpenAI 0 0 0 0.733
Stark2008/LayleleFlamPi XGBoost OpenAI 0 0 0 0.496
migtissera/Tess-3-Mistral-Nemo-12B XGBoost OpenAI 0 0 0 0.493
BlackBeenie/llama-3.1-8B-Galore-openassistant-guanaco XGBoost OpenAI 0 0 0 0.544
migtissera/Trinity-2-Codestral-22B XGBoost OpenAI 0 0 0 0.573
wave-on-discord/qwent-7b XGBoost OpenAI 0 0 0 0.416
Stark2008/GutenLaserPi XGBoost OpenAI 0 0 0 0.543
migtissera/Tess-v2.5.2-Qwen2-72B XGBoost OpenAI 0.01 0 0 0.706
NousResearch/Hermes-3-Llama-3.1-70B XGBoost OpenAI 0 0 0 0.659
LenguajeNaturalAI/leniachat-gemma-2b-v0 XGBoost OpenAI 0 0 0 0.319
LenguajeNaturalAI/leniachat-qwen2-1.5B-v0 XGBoost OpenAI 0 0 0 0.392
awnr/Mistral-7B-v0.1-signtensors-3-over-8 XGBoost OpenAI 0 0 0 0.45
jaredjoss/pythia-410m-roberta-lr 8e7-kl 01-steps 12000-rlhf-model XGBoost OpenAI 0 0 0 0.288
migtissera/Llama-3-8B-Synthia-v3.5 XGBoost OpenAI 0 0 0 0.476
Intel/neural-chat-7b-v3-1 XGBoost OpenAI 0 0 0 0.491
Orenguteng/Llama-3.1-8B-Lexi-Uncensored-V2 XGBoost OpenAI 0 0 0 0.522
ssmits/Qwen2.5-95B-Instruct XGBoost OpenAI 0 0 0 0.703
ymcki/gemma-2-2b-ORPO-jpn-it-abliterated-18-merge XGBoost OpenAI 0 0 0 0.386
migtissera/Tess-v2.5-Phi-3-medium-128k-14B XGBoost OpenAI 0 0 0 0.603
upstage/solar-pro-preview-instruct XGBoost OpenAI 0.091 0 0 0.733

Visualisation of some Accuracy-Rejection Curves

Comparison of the Accuracy-Rejection Curves (ARCs) of the different in-distribution assessors for the highest-accuracy LLM (OpenAI/GPT-4o-2024-08-06) on MMLU-Pro.
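To make the curves concrete, the following is a minimal sketch of how an accuracy-rejection curve can be computed from an assessor's per-instance confidence and the LLM's per-instance correctness. It is not the PredictaBoard implementation: the function names `accuracy_rejection_curve` and `largest_accepted_fraction`, the toy data, and the rule "reject the lowest-confidence instances first" are assumptions made for illustration, and the exact definitions used by the benchmark may differ.

```python
import numpy as np


def accuracy_rejection_curve(confidence, correct):
    """Hypothetical helper: accuracy of the LLM on the instances that are kept,
    as a function of the fraction of instances rejected.

    confidence: assessor's predicted probability that the LLM answers correctly.
    correct:    1 if the LLM actually answered correctly, else 0.
    Instances with the lowest assessor confidence are rejected first.
    """
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n = len(correct)

    # Order instances from most to least trusted by the assessor.
    order = np.argsort(-confidence, kind="stable")
    sorted_correct = correct[order]

    # Accuracy over the k most trusted instances, for k = 1..n.
    kept = np.arange(1, n + 1)
    kept_accuracy = np.cumsum(sorted_correct) / kept

    # Keeping k instances corresponds to rejecting a fraction 1 - k/n.
    rejection_rate = 1.0 - kept / n

    # Reverse so that the rejection rate increases from 0.
    return rejection_rate[::-1], kept_accuracy[::-1]


def largest_accepted_fraction(confidence, correct, min_accuracy):
    """Hypothetical helper: largest fraction of instances that can be kept
    while the accuracy on the kept instances is at least `min_accuracy`.
    Returns 0.0 if no point on the curve reaches that accuracy."""
    rejection_rate, accuracy = accuracy_rejection_curve(confidence, correct)
    meets_target = accuracy >= min_accuracy
    if not meets_target.any():
        return 0.0
    return float(1.0 - rejection_rate[meets_target].min())


# Toy data (made up for illustration only).
toy_confidence = [0.9, 0.8, 0.7, 0.6, 0.5]
toy_correct = [1, 1, 0, 1, 0]
rates, accs = accuracy_rejection_curve(toy_confidence, toy_correct)
```

Under these assumptions, a value of 0 from `largest_accepted_fraction` means the assessor cannot isolate any subset of instances on which the LLM reaches the required accuracy, which is one way to read a run of zeros at strict error tolerances.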

Overall, current LLMs are poorly predictable, particularly at low error thresholds. PredictaBoard makes predictability the key consideration, stimulating research into more predictable AI systems that can operate under strict error constraints, which is critical for high-stakes scenarios. Future research will explore even higher thresholds to further enhance AI safety.

BibTeX

@misc{pacchiardi2025predictaboardbenchmarkingllmscore,
  title={{PredictaBoard}: Benchmarking {LLM} Score Predictability},
  author={Lorenzo Pacchiardi and Konstantinos Voudouris and Ben Slater and Fernando Martínez-Plumed and José Hernández-Orallo and Lexin Zhou and Wout Schellaert},
  year={2025},
  eprint={2502.14445},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.14445},
}

Acknowledgements

This research project has benefitted from the Microsoft Accelerate Foundation Models Research (AFMR) grant program.