ABCD: All Biases Come Disguised

Dartmouth College
*Equal contribution
ICML 2026
Teaser image: four MCQ option cards labeled A, B, C, D wearing masks.

Standard MCQ evaluation lets LLMs cheat through label, position, and few-shot prompt artifacts. ABCD swaps distinct labels for uniform dashes and matches generated answers by semantic similarity — cutting accuracy variance with a single-pass, logit-free protocol.

Abstract

Multiple-choice question (MCQ) benchmarks are a standard way to measure LLMs' ability to reason and answer knowledge-based questions. Using a synthetic NonsenseQA benchmark, we observe that different LLMs exhibit varying degrees of label, position, and few-shot prompt bias: the model answers each question using the answer's position, the label in front of the answer, the distribution of correct answers in the few-shot prompt, or a combination of these cues.

We propose a simple bias-reduced evaluation protocol that replaces the labels of each question with uniform, unordered labels and prompts the LLM to respond with the full text of its chosen answer. Using a simple sentence similarity model to match the response to an option, we demonstrate improved robustness and lower standard deviation across answer permutations with a minimal drop in performance. This exposes the LLM's capabilities under reduced evaluation artifacts, without any help from the prompt examples or the option labels.

Across multiple benchmarks and models, this protocol substantially improves robustness to answer permutations, reducing mean accuracy variance with only a minimal decrease in mean model performance. Through ablation studies on various embedding models and similarity functions, we show that the method is more robust than standard evaluation protocols.

NonsenseQA

A diagnostic dataset of 1,000 questions built from random words with randomly assigned "correct" answers. Each answer is placed uniformly across the four positions, so no label, position, or distributional bias is built into the data — accuracy should sit at the 25% chance level.
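The construction described above can be sketched in a few lines. This is a minimal illustration, not the released generator: the word list, the question length, and the names `WORDS`, `make_question`, and `make_dataset` are assumptions made for the example, and the gold position is drawn uniformly at random rather than exactly balanced.

```python
import random
from collections import Counter

# Hypothetical vocabulary; the actual word list is not specified here.
WORDS = ["arms", "lantern", "gravel", "violet", "orbit", "saddle", "thimble",
         "marrow", "quartz", "ferry", "bramble", "cinder", "ledger", "pollen"]

def make_question(rng, n_options=4, question_len=6):
    """Build one nonsense item: random-word question, distinct random-word
    options, and a gold index drawn uniformly over the option positions."""
    question = " ".join(rng.choice(WORDS) for _ in range(question_len))
    question = question.capitalize() + "?"
    options = rng.sample(WORDS, n_options)   # sampled without replacement
    gold = rng.randrange(n_options)          # uniform over positions 0..n-1
    return {"question": question, "options": options, "gold": gold}

def make_dataset(n=1000, seed=0):
    """Generate n items from a seeded RNG so the dataset is reproducible."""
    rng = random.Random(seed)
    return [make_question(rng) for _ in range(n)]

dataset = make_dataset()
# How often each position holds the gold answer; roughly n/4 each.
position_counts = Counter(item["gold"] for item in dataset)
```

Because the gold index is independent of the question and option text, nothing in an individual item carries information about the correct answer, which is why chance-level accuracy is the expected outcome.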

A NonsenseQA example question composed of random words with four random-word options.

A NonsenseQA example. An example question from NonsenseQA with random answers and a golden answer chosen at random as "D. Arms". Models should score at chance (25%); under standard MCQ evaluation, many score above 90%.

Matched-and-Dashed (M&D)

Our proposed evaluation protocol. Three lightweight changes, one forward pass, no fine-tuning, no logits required:

  • Uniform labels: replace A / B / C / D with - / - / - / - to neutralize label bias.
  • Full-text generation: the model writes the answer it chooses, not a letter.
  • Semantic matching: map the generation to the closest candidate option via cosine similarity over sentence embeddings.
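
As an illustration of the first two changes, a prompt builder might look like the sketch below. The instruction wording and the helper name `format_dashed_prompt` are hypothetical; the exact prompt template used in the paper is not reproduced here.

```python
def format_dashed_prompt(question, options):
    """Render an MCQ with a uniform '-' label on every option and ask for
    the full answer text instead of a letter."""
    lines = [question]
    lines += [f"- {option}" for option in options]  # same label for every option
    # Assumed instruction wording, for illustration only.
    lines.append("Answer with the full text of the option you choose.")
    return "\n".join(lines)

prompt = format_dashed_prompt(
    "Lantern gravel orbit saddle?",
    ["Violet", "Ferry", "Cinder", "Arms"],
)
```

Since every option carries the same label, the label itself can no longer identify an answer; the model has to reproduce the option text.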
Pipeline figure showing the M&D protocol: LLM generates a full-text answer, a regex filter extracts the final sentence, and a sentence similarity model maps it to the most likely option.

The M&D pipeline. The LLM generates a full-text answer; a small regex extracts the final sentence; a sentence embedding model maps it to the closest candidate option by cosine similarity.
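
The extraction and matching steps can be sketched as follows. To keep the example dependency-free, `embed` is a bag-of-words stand-in for a real sentence embedding model, and the sentence-splitting regex is an assumption; neither is the component used in the paper. Only the overall shape (final sentence, embed, cosine similarity, argmax) follows the description above.

```python
import math
import re
from collections import Counter

def final_sentence(text):
    """Return the last non-empty sentence of a generation, splitting on
    whitespace that follows '.', '!' or '?'."""
    parts = [p.strip() for p in re.split(r"(?<=[.!?])\s+", text.strip())]
    parts = [p for p in parts if p]
    return parts[-1] if parts else ""

def embed(text):
    """Stand-in embedding: lowercase token counts (NOT a sentence encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[token] for token, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def match_option(generation, options):
    """Return the index of the option most similar to the final sentence."""
    query = embed(final_sentence(generation))
    scores = [cosine(query, embed(option)) for option in options]
    return max(range(len(options)), key=scores.__getitem__)

options = ["the nucleus", "the mitochondria", "the ribosome", "the cell wall"]
generation = "Let me think about the choices. The answer is the mitochondria."
chosen = match_option(generation, options)  # index 1
```

Swapping `embed` for a real sentence encoder changes only that one function; the matching logic stays the same.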

Results

Under standard letter-based MCQ evaluation, several LLMs score over 95% — not by reasoning, but by exploiting the few-shot answer distribution, label, and position cues. NonsenseQA exposes three behavioral categories: explicit bias models, implicit bias models, and models that cannot reliably leverage the bias.

How to read these plots. Throughout this section, boxes show median ± standard deviation across answer-moving attack permutations; whiskers show min/max; dots are the original permutation; stars are the SCORE robustness metric.
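
For readers who want to reproduce the summary statistics behind each box, the sketch below shows one plausible reading of an answer-moving attack (the gold option is moved to a chosen position while the other options keep their relative order) and the per-model statistics the plots display. The helper names and the accuracy values are hypothetical, and the use of the population standard deviation is an assumption.

```python
import statistics

def move_gold(options, gold, target):
    """Move the gold option to index `target`, keeping the remaining options
    in their original relative order. Returns (new_options, new_gold)."""
    rest = options[:gold] + options[gold + 1:]
    moved = rest[:target] + [options[gold]] + rest[target:]
    return moved, target

def summarize(accuracies):
    """Statistics shown per model: median, standard deviation, min, max."""
    return {
        "median": statistics.median(accuracies),
        "std": statistics.pstdev(accuracies),  # assumption: population std
        "min": min(accuracies),
        "max": max(accuracies),
    }

# Gold answer "d" (index 3) moved to the first position.
attacked, new_gold = move_gold(["a", "b", "c", "d"], gold=3, target=0)

# Hypothetical accuracies of one model, one value per attack permutation.
summary = summarize([0.5, 0.6, 0.7, 0.8])
```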

Box plots comparing M&D (yellow) and S&L (pink) across 13 models on NonsenseQA. S&L medians stretch from 40% to 95% on semantically meaningless inputs; M&D medians collapse toward the 25% chance line.

NonsenseQA. Under standard S&L (pink), several LLMs score >95% on questions made of random words. M&D (yellow) collapses the same models toward the 25% chance line — revealing that the high S&L scores were never reasoning, only bias exploitation.

Across 13 open-source LLMs (DeepSeek-R1, Qwen3, Llama-3.1, Gemma-3, Ministral-3, Nemotron, Phi-4, GPT-OSS) and five benchmarks (CommonsenseQA, ARC, GPQA, MMLU-Pro, INCLUDE), M&D cuts mean accuracy variance under answer-moving attacks by a factor of 3 (geometric mean) with only minimal performance loss.
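
To make the geometric-mean summary concrete, the sketch below aggregates per-model variance ratios. The variance values are made up for illustration and are not results from the paper; whether the reported figure aggregates over models, benchmarks, or both is also an assumption here.

```python
import math

def geometric_mean(values):
    """Geometric mean of positive values, computed in log space."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical accuracy variances for three models (NOT paper results).
variance_sl = [4.0, 9.0, 1.0]   # standard letter-based evaluation
variance_md = [1.0, 1.0, 1.0]   # M&D

# Per-model reduction factor: how many times smaller the variance is under M&D.
ratios = [sl / md for sl, md in zip(variance_sl, variance_md)]  # [4.0, 9.0, 1.0]
reduction = geometric_mean(ratios)  # cube root of 36, about 3.3
```

A geometric mean suits ratios because a 2× improvement and a 2× regression cancel exactly, which an arithmetic mean of ratios would not do.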

Box plots comparing M&D (purple) and S&L (orange) across 13 models on GPQA, showing reduced variance and tighter spreads under the proposed protocol.

GPQA. M&D reduces accuracy variance in 9 of 13 models on graduate-level GPQA questions, and shrinks the gap between original and attack-averaged accuracy in 8 of 13.

Box plots comparing M&D (purple) and S&L (orange) across 13 models on MMLU-Pro, showing reduced variance and tighter spreads under the proposed protocol.

MMLU-Pro. M&D (purple) reduces variance for 7 of 13 models, but original-permutation accuracy remains well above the attack median across both protocols. Our appendix attributes this to MMLU-Pro's 10-option structure and perplexity-based evidence of possible training-data contamination on the original ordering.

Results on CommonsenseQA, INCLUDE, and ARC.
Box plots comparing M&D (purple) and S&L (orange) across 13 models on CommonsenseQA, showing reduced variance and tighter spreads under the proposed protocol.

CommonsenseQA. M&D (purple) collapses the spread that S&L (orange) leaves wide open.

Box plots comparing M&D (purple) and S&L (orange) across 13 models on INCLUDE, showing reduced variance and tighter spreads under the proposed protocol.

INCLUDE. M&D (purple) collapses the spread that S&L (orange) leaves wide open.

Box plots comparing M&D (purple) and S&L (orange) across 13 models on ARC, showing reduced variance and tighter spreads under the proposed protocol.

ARC. M&D (purple) collapses the spread that S&L (orange) leaves wide open.


Cross-benchmark rank agreement improves for reasoning-heavy pairs (GPQA ↔ ARC, GPQA ↔ CSQA) and decreases for English ↔ multilingual pairs — differences previously masked by shared evaluation artifacts.

Kendall tau cross-benchmark rank-agreement difference τ_M&D − τ_S&L. A positive value indicates higher cross-benchmark rank agreement under M&D; a negative value indicates higher rank agreement under S&L.

Cross-benchmark rank agreement. Kendall tau difference τ_M&D − τ_S&L. Positive values indicate stronger agreement under M&D, negative under S&L. Reasoning-heavy pairs (GPQA ↔ ARC, GPQA ↔ CSQA) tighten under M&D once shared evaluation artifacts are removed.
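
The rank-agreement difference can be computed as in the sketch below. It uses a plain Kendall tau-a without tie correction, which may differ from the variant used in the paper, and all accuracy values are hypothetical.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall tau-a between two equal-length score lists (no tie correction):
    (concordant pairs - discordant pairs) / total pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical accuracies of four models on two benchmarks (NOT paper results).
md_bench_a = [0.60, 0.55, 0.50, 0.40]
md_bench_b = [0.70, 0.65, 0.50, 0.45]
sl_bench_a = [0.60, 0.62, 0.50, 0.40]   # first two models swap rank here
sl_bench_b = [0.70, 0.65, 0.50, 0.45]

tau_md = kendall_tau(md_bench_a, md_bench_b)   # 1.0: identical model ranking
tau_sl = kendall_tau(sl_bench_a, sl_bench_b)   # (5 - 1) / 6
delta = tau_md - tau_sl                        # positive: M&D agrees more
```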

Several recent works study evaluation artifacts in MCQ-based LLM benchmarking. We invite you to check them out as well.

BibTeX

@inproceedings{nowak2026abcd,
  author    = {Nowak, Mateusz and Cadet, Xavier and Chin, Peter},
  title     = {{ABCD}: All Biases Come Disguised},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  year      = {2026},
}