Multiple-choice question (MCQ) benchmarks have been a standard
evaluation practice for measuring LLMs' ability to reason and
answer knowledge-based questions. Through a synthetic
NonsenseQA benchmark, we observe that different
LLMs exhibit varying degrees of label-position-few-shot-prompt
bias, where the model relies on the answer position, the label
in front of the answer, the distribution of correct answers
present in the few-shot prompt, or a combination of these to
answer each question.
We propose a simple bias-reduced evaluation protocol that
replaces the labels of each question with uniform, unordered
labels and prompts the LLM to answer with the whole answer text
rather than a label.
With a simple sentence similarity model, we demonstrate improved
robustness and lower standard deviation across different
permutations of answers with a minimal drop in performance,
exposing the LLM's capabilities under reduced evaluation
artifacts, without any help from the prompt examples or
the option labels.
Across multiple benchmarks and models, this protocol substantially
improves robustness to answer permutations, reducing mean
accuracy variance by 3× with only a minimal
decrease in mean model performance. Through ablation studies on
various embedding models and similarity functions, we show that
the method is more robust than standard evaluation methods.
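
As a rough illustration of the protocol described above, the following minimal sketch replaces option labels with a single uniform label, maps a free-text response back to an option by similarity, and measures accuracy over every ordering of the options. All names here (`build_prompt`, `match_option`, `permutation_accuracy`, `first_option_model`) are hypothetical, the prompt wording is an assumption, and the bag-of-words cosine is only a dependency-free stand-in for a sentence similarity model; it is not the implementation evaluated in this work.

```python
import itertools
import math
import re
from collections import Counter


def build_prompt(question, options, label="-"):
    """Format an MCQ with the same label on every option (no A/B/C/D)."""
    lines = [f"Question: {question}"]
    lines += [f"{label} {opt}" for opt in options]
    lines.append("Answer with the full text of the correct option.")
    return "\n".join(lines)


def bow_cosine(a, b):
    """Bag-of-words cosine similarity (assumed stand-in for a sentence embedding model)."""
    ca = Counter(re.findall(r"\w+", a.lower()))
    cb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_option(response, options, similarity=bow_cosine):
    """Return the index of the option most similar to the free-text response."""
    scores = [similarity(response, opt) for opt in options]
    return max(range(len(options)), key=scores.__getitem__)


def permutation_accuracy(items, answer_fn, similarity=bow_cosine):
    """Accuracy per ordering of the options.

    items: list of (question, options, correct_option_text); every item is
    assumed to have the same number of options.
    answer_fn: callable taking a prompt string and returning the model's text.
    """
    n_options = len(items[0][1])
    accuracies = []
    for perm in itertools.permutations(range(n_options)):
        correct = 0
        for question, options, gold in items:
            shuffled = [options[i] for i in perm]
            response = answer_fn(build_prompt(question, shuffled))
            predicted = shuffled[match_option(response, shuffled, similarity)]
            correct += predicted == gold
        accuracies.append(correct / len(items))
    return accuracies


def first_option_model(prompt):
    """Toy, fully position-biased 'model': echoes the first listed option."""
    return prompt.splitlines()[1][2:]
```

Passing `first_option_model` to `permutation_accuracy` gives accuracies that depend entirely on where the correct option happens to land, which is the kind of spread across permutations that the standard deviation in the protocol is meant to expose. In practice `answer_fn` would wrap an LLM call and `similarity` would wrap an embedding model.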