Presentation Information

[1Yin-A-06] Mitigating Self-Preference Bias in MLLM-as-a-Judge

〇Shuitsu Koyama1, Yuiga Wada1, Daichi Yashima1, Komei Sugiura1 (1. Keio University)

Keywords:

MLLM-as-a-Judge, self-preference bias, image captioning

Multimodal Large Language Models (MLLMs) are widely utilized to measure model performance, a practice referred to as MLLM-as-a-Judge. Although these judges may favor outputs from specific MLLMs (model-specific preference bias), the magnitude of such biases remains unclear.
Therefore, we introduce Philautia-Eval, which quantifies the degree of model-specific preference bias in MLLM-as-a-Judge.
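The Philautia-Eval formulation itself is not spelled out in this abstract. As a minimal sketch of the underlying idea, one could contrast how a judge scores its own captions against how the remaining judges score those same captions; the score matrix and the `self_preference_gap` helper below are illustrative assumptions, not the published metric.

```python
import numpy as np

def self_preference_gap(scores: np.ndarray) -> np.ndarray:
    """scores[j, g]: mean score that judge j assigns to captions from
    generator g, where judges and generators are the same set of MLLMs
    (as in SelfEval-Cap). Returns, per model, how much it over-rates its
    own captions relative to how the other judges rate those captions;
    positive values indicate self-preference."""
    n = scores.shape[0]
    gaps = np.empty(n)
    for j in range(n):
        own_score = scores[j, j]                  # judge j on its own captions
        peer_scores = np.delete(scores[:, j], j)  # other judges on j's captions
        gaps[j] = own_score - peer_scores.mean()
    return gaps
```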
Furthermore, we construct the SelfEval-Cap dataset, comprising 54k captions generated by 12 MLLMs and approximately 1.2M evaluation scores assigned to those captions by the same models.
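As a rough picture of what one record in such a dataset might contain, a hypothetical schema is sketched below; the field names are assumptions for illustration, not the published format.

```python
from dataclasses import dataclass

@dataclass
class JudgmentRecord:
    caption: str            # one of the 54k MLLM-generated captions
    generator: str          # which of the 12 MLLMs produced the caption
    judge: str              # which of the 12 MLLMs scored it
    score: float            # the evaluation score assigned by the judge
    reference_based: bool   # whether reference captions were given to the judge
```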
Our experiments show that representative MLLMs exhibit self-preference bias in both reference-based and reference-free settings.
Finally, we demonstrate that an ensemble of MLLM judges effectively mitigates the influence of these model-specific preference biases.
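The abstract does not specify how the ensemble is formed. One simple hedged reading is score averaging across judges, which dilutes any single judge's model-specific preference; the variant that drops self-judgments entirely is likewise an assumption, not necessarily the paper's exact scheme.

```python
import numpy as np

def ensemble_scores(scores: np.ndarray) -> np.ndarray:
    """scores[j, g] as above. Averaging over judges (axis 0) yields one
    ensemble score per generator; each judge's idiosyncratic preference
    for particular models then carries only 1/n_judges of its weight."""
    return scores.mean(axis=0)

def ensemble_scores_no_self(scores: np.ndarray) -> np.ndarray:
    """Variant that removes each judge's rating of its own outputs before
    averaging, eliminating the self-preference term altogether."""
    masked = scores.astype(float)      # float copy so NaN can be inserted
    np.fill_diagonal(masked, np.nan)   # drop judge j's score for generator j
    return np.nanmean(masked, axis=0)
```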