ABCD: All Biases Come Disguised

Multiple-choice question (MCQ) benchmarks have been a standard evaluation practice for measuring LLMs' ability to reason and answer knowledge-based questions. Through a synthetic NonsenseQA benchmark, we observe that different LLMs exhibit varying degrees of label-position-few-shot-prompt bias, where the model uses the answer position, the label in front of the answer, the distribution of correct answers present in the few-shot prompt, or a combination of all three to answer each question.

We propose a simple bias-reduced evaluation protocol that replaces the labels of each question with uniform, unordered labels and prompts the LLM to use the whole answer presented. With a simple sentence similarity model, we demonstrate improved robustness and lower standard deviation between different permutations of answers with a minimal drop in performance, exposing the LLM's capabilities under reduced evaluation artifacts — without any help from the prompt examples or the option labels.

Across multiple benchmarks and models, this protocol substantially improves robustness to answer permutations, reducing the variance of mean accuracy by 3× with only a minimal decrease in mean model performance. Through ablation studies on various embedding models and similarity functions, we show that the method is more robust than standard evaluation protocols.

NonsenseQA

A diagnostic dataset of 1,000 questions built from random words with randomly assigned "correct" answers. Each answer is placed uniformly across the four positions, so no label, position, or distributional bias is built into the data — accuracy should sit at the 25% chance level.
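The construction described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' generator: the word list `VOCAB`, the question length, and the function names are hypothetical, and "placed uniformly" is read here as an exact balance of 250 gold answers per position.

```python
import random
from collections import Counter

NUM_OPTIONS = 4

# Hypothetical word list; the real dataset's vocabulary is not given in the source.
VOCAB = ["arms", "lantern", "pebble", "velvet", "orbit", "thistle",
         "marble", "canyon", "ribbon", "anchor", "ember", "quill"]

def make_nonsense_question(vocab, rng, gold_index, question_len=6):
    """Build one question from random words with an arbitrary 'correct' option."""
    question = " ".join(rng.choices(vocab, k=question_len)).capitalize() + "?"
    options = rng.sample(vocab, k=NUM_OPTIONS)  # four distinct random words
    return {"question": question, "options": options, "gold_index": gold_index}

def make_dataset(vocab, n_questions=1000, seed=0):
    """Generate questions whose gold answers are balanced across positions."""
    rng = random.Random(seed)
    # Assumption: each position is gold exactly n_questions / 4 times.
    positions = [i % NUM_OPTIONS for i in range(n_questions)]
    rng.shuffle(positions)
    return [make_nonsense_question(vocab, rng, pos) for pos in positions]

dataset = make_dataset(VOCAB)
counts = Counter(item["gold_index"] for item in dataset)
```

Because the gold index is assigned independently of the words, any accuracy above 25% on such data has to come from labels, positions, or prompt statistics rather than from the content.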

A NonsenseQA example. An example question from NonsenseQA with random answers and a golden answer chosen at random as "D. Arms". Models should score at chance (25%); under standard MCQ evaluation, many score above 90%.

Our proposed evaluation protocol, Matched-and-Dashed (M&D). Three lightweight changes, one forward pass, no fine-tuning, no logits required:

Uniform labels: replace A / B / C / D with - / - / - / - to neutralize label bias.
Full-text generation: the model writes the answer it chooses, not a letter.
Semantic matching: map the generation to the closest candidate option via cosine similarity over sentence embeddings.
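To make the first two changes concrete, the sketch below contrasts a letter-labeled prompt with a dash-labeled one. The instruction wording and the function names `format_standard` and `format_dashed` are assumptions for illustration; the source does not give the exact prompt text.

```python
def format_standard(question, options):
    """Letter-labeled prompt: the model is asked to answer with a letter."""
    labels = "ABCD"
    lines = [f"{labels[i]}. {opt}" for i, opt in enumerate(options)]
    # Hypothetical instruction wording.
    return "\n".join([question, *lines, "Answer with the letter of the correct option."])

def format_dashed(question, options):
    """Dash-labeled prompt: every option carries the same uninformative label."""
    lines = [f"- {opt}" for opt in options]
    # Hypothetical instruction wording asking for the full answer text.
    return "\n".join([question, *lines,
                      "Answer by writing out the full text of the option you choose."])

example_options = ["Legs", "Lantern", "Pebble", "Arms"]
dashed_prompt = format_dashed("Which one fits?", example_options)
standard_prompt = format_standard("Which one fits?", example_options)
```

With identical labels on every option, the label itself carries no signal the model could exploit, and the model has to identify its choice by content.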


The M&D pipeline. The LLM generates a full-text answer; a small regex extracts the final sentence; a sentence embedding model maps it to the closest candidate option by cosine similarity.
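The extraction and matching steps of the pipeline might look roughly like the following. Two things here are stand-ins and not the authors' choices: the regex used to split sentences, and `toy_embed`, a bag-of-words vector used only so the example runs without downloading a model. In practice a real sentence embedding model would be passed in as `embed`.

```python
import math
import re
from collections import Counter

def last_sentence(generation):
    """Return the final non-empty sentence of a generation (hypothetical regex)."""
    parts = re.split(r"(?<=[.!?])\s+|\n+", generation.strip())
    parts = [p.strip() for p in parts if p.strip()]
    return parts[-1] if parts else ""

def toy_embed(text):
    """Stand-in embedding: lowercase bag-of-words counts, NOT a sentence model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def match_option(generation, options, embed=toy_embed):
    """Map a free-text generation to the index of the most similar option."""
    target = embed(last_sentence(generation))
    scores = [cosine(target, embed(opt)) for opt in options]
    return max(range(len(options)), key=scores.__getitem__)

options = ["Legs", "Lantern", "Pebble", "Arms"]
generation = "Let me think about this.\nThe answer is Arms."
predicted = match_option(generation, options)
```

Here `predicted` is 3, the index of "Arms". The prediction depends only on the text the model wrote, so moving options to different positions does not change how the answer is scored.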

Under standard letter-based MCQ evaluation, several LLMs score over 95% on NonsenseQA, not by reasoning, but by exploiting the few-shot answer distribution, label, and position cues. NonsenseQA exposes three behavioral categories: explicit bias models, implicit bias models, and models that cannot reliably leverage the bias.

How to read these plots. Throughout this section, boxes show median ± standard deviation across answer-moving attack permutations; whiskers show min/max; dots are the original permutation; stars are the SCORE robustness metric.

Box plots comparing M&D (yellow) and S&L (pink) across 13 models on NonsenseQA. S&L medians stretch from 40% to 95% on semantically meaningless inputs; M&D medians collapse toward the 25% chance line.

NonsenseQA. Under standard S&L (pink), several LLMs score >95% on questions made of random words. M&D (yellow) collapses the same models toward the 25% chance line — revealing that the high S&L scores were never reasoning, only bias exploitation.

Across 13 open-source LLMs (DeepSeek-R1, Qwen3, Llama-3.1, Gemma-3, Ministral-3, Nemotron, Phi-4, GPT-OSS) and five benchmarks (CommonsenseQA, ARC, GPQA, MMLU-Pro, INCLUDE), M&D reduces the variance of mean accuracy under answer-moving attacks by 3× in geometric mean, with only minimal performance loss.
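One way to read the variance-reduction figure is as a per-model ratio of accuracy variances, aggregated with a geometric mean. The sketch below assumes that reading; the exact aggregation used in the paper is not spelled out here, and the accuracy values are made-up numbers chosen so the arithmetic is easy to check, not results from the paper.

```python
from statistics import geometric_mean, pvariance

def variance_ratio(sl_accuracies, md_accuracies):
    """Variance of S&L accuracies across permutations divided by that of M&D."""
    return pvariance(sl_accuracies) / pvariance(md_accuracies)

# Made-up accuracies (in percentage points) per answer permutation, for two
# hypothetical models. These are illustrative only.
toy_runs = {
    "model_a": {"S&L": [60, 80], "M&D": [65, 75]},   # variances 100 vs 25
    "model_b": {"S&L": [50, 60], "M&D": [52, 62]},   # variances 25 vs 25
}

ratios = {name: variance_ratio(run["S&L"], run["M&D"]) for name, run in toy_runs.items()}
overall = geometric_mean(ratios.values())
```

A geometric mean suits ratios because a 4× reduction on one model and a 4× increase on another cancel to 1×, which an arithmetic mean would not reflect.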

Box plots comparing M&D (purple) and S&L (orange) across 13 models on GPQA, showing reduced variance and tighter spreads under the proposed protocol.

GPQA. M&D reduces accuracy variance in 9 of 13 models on graduate-level GPQA questions, and shrinks the gap between original and attack-averaged accuracy in 8 of 13.

Box plots comparing M&D (purple) and S&L (orange) across 13 models on MMLU-Pro, showing reduced variance and tighter spreads under the proposed protocol.

MMLU-Pro. M&D (purple) reduces variance for 7 of 13 models, but original-permutation accuracy remains well above the attack median across both protocols. Our appendix attributes this to MMLU-Pro's 10-option structure and perplexity-based evidence of possible training-data contamination on the original ordering.

Box plots comparing M&D (purple) and S&L (orange) across 13 models on CommonsenseQA, showing reduced variance and tighter spreads under the proposed protocol.

CommonsenseQA. M&D (purple) collapses the spread that S&L (orange) leaves wide open.

Box plots comparing M&D (purple) and S&L (orange) across 13 models on INCLUDE, showing reduced variance and tighter spreads under the proposed protocol.

INCLUDE. M&D (purple) collapses the spread that S&L (orange) leaves wide open.

Box plots comparing M&D (purple) and S&L (orange) across 13 models on ARC, showing reduced variance and tighter spreads under the proposed protocol.

ARC. M&D (purple) collapses the spread that S&L (orange) leaves wide open.

Cross-benchmark rank agreement improves for reasoning-heavy pairs (GPQA ↔ ARC, GPQA ↔ CSQA) and decreases for English ↔ multilingual pairs — differences previously masked by shared evaluation artifacts.

Kendall tau cross-benchmark rank-agreement difference τ_M&D − τ_S&L. A positive value indicates higher cross-benchmark rank agreement under M&D, while a negative value indicates higher rank agreement under S&L.

Cross-benchmark rank agreement. Kendall tau difference τ_M&D − τ_S&L. Positive values indicate stronger agreement under M&D, negative under S&L. Reasoning-heavy pairs (GPQA ↔ ARC, GPQA ↔ CSQA) tighten under M&D once shared evaluation artifacts are removed.
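The rank-agreement difference can be illustrated with a small hand-rolled Kendall tau. This sketch uses the tau-a variant without tie correction, which is an assumption (the source does not say which variant is used), and the per-model accuracies are invented for the example.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall tau-a between two equally long score lists (no tie correction)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Invented accuracies of four hypothetical models on two benchmarks.
sl_bench_1, sl_bench_2 = [80, 70, 60, 50], [80, 60, 70, 50]   # one swapped pair
md_bench_1, md_bench_2 = [70, 60, 50, 40], [75, 65, 55, 45]   # identical ranking

tau_sl = kendall_tau(sl_bench_1, sl_bench_2)
tau_md = kendall_tau(md_bench_1, md_bench_2)
tau_difference = tau_md - tau_sl   # positive: the benchmarks agree more under M&D
```

In this toy case the two benchmarks rank the models identically under M&D (τ = 1) but disagree on one pair under S&L (τ = 2/3), so the difference is positive.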

Several recent works study evaluation artifacts in MCQ-based LLM benchmarking. We invite you to check them out as well.

Zheng et al. (ICLR 2024) and Zhou et al. (LREC-COLING 2024) introduce answer-moving attacks and study label/position sensitivity in MCQ evaluation.
Pezeshkpour & Hruschka (NAACL 2024 Findings) quantify order sensitivity across diverse LLMs and prompt formats.
Balepur et al. (2025) and Chandak et al. (2025) argue for free-form answer generation and answer matching as alternatives to letter selection.
Nalbandyan et al. (NAACL 2025) introduce the SCORE robustness metric we adopt and supplement with a variance-ratio measurement.

BibTeX

@inproceedings{nowak2026abcd,
  author    = {Nowak, Mateusz and Cadet, Xavier and Chin, Peter},
  title     = {{ABCD}: All Biases Come Disguised},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  year      = {2026},
}

