Presentation Information
[3Yin-A-10] LLM-jp Chatbot Arena: Building a Dialogue Evaluation Platform for Japanese LLMs
〇Hirokazu Kiyomaru1, Hiroaki Sugiyama2, Hiroshi Tokoyo3, Takahiro Kubo3, Naoaki Okazaki4,1 (1. National Institute of Informatics, 2. NTT, 3. Amazon Web Services Japan, 4. Institute of Science Tokyo)
Keywords: LLM, Evaluation, Japanese
We propose LLM-jp Chatbot Arena, a platform for evaluating the Japanese dialogue performance of large language models (LLMs). On the platform, two LLMs independently generate responses to the same user query, and the user votes for the response they judge to be better; aggregating these pairwise votes yields a relative ranking of the models. We operated the platform for approximately seven months with a total of ten models, collecting 5,330 context–response pairs and 1,498 votes. Evaluation based on these votes shows that multilingual LLMs such as gpt-oss, Qwen3, and Gemma3 achieve better Japanese dialogue performance than Japanese LLMs such as LLM-jp-3.1. We also show that these results correlate strongly with automatic LLM-based evaluations.
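The abstract does not specify how pairwise votes are aggregated into a ranking; Chatbot Arena-style platforms typically fit a Bradley-Terry model (or an equivalent Elo-style rating) to the votes. The following is a minimal, self-contained sketch of that aggregation step; the vote data and the resulting scores are illustrative, not the platform's actual records.

```python
from collections import defaultdict

def bradley_terry(votes, iters=100):
    """Fit Bradley-Terry strengths to pairwise votes via the MM algorithm.

    votes: list of (winner, loser) model-name pairs, one per user vote.
    Returns {model: strength}; a higher strength means the model is
    preferred more often in head-to-head comparisons.
    """
    wins = defaultdict(int)   # total wins per model
    pairs = defaultdict(int)  # comparison count per unordered model pair
    models = set()
    for winner, loser in votes:
        wins[winner] += 1
        pairs[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            # MM update: wins_i / sum_j n_ij / (p_i + p_j)
            denom = sum(
                pairs[frozenset((i, j))] / (strength[i] + strength[j])
                for j in models if j != i
            )
            updated[i] = wins[i] / denom if denom > 0 else strength[i]
        # Normalize so the average strength is 1 (the scale is arbitrary).
        norm = sum(updated.values()) / len(updated)
        strength = {m: s / norm for m, s in updated.items()}
    return strength

# Illustrative votes only; the real arena collected 1,498 of these.
votes = [
    ("gpt-oss", "LLM-jp-3.1"), ("gpt-oss", "Gemma3"),
    ("Qwen3", "LLM-jp-3.1"), ("Qwen3", "gpt-oss"),
    ("Gemma3", "Qwen3"), ("Gemma3", "LLM-jp-3.1"),
    ("LLM-jp-3.1", "Gemma3"),
]
for model, s in sorted(bradley_terry(votes).items(), key=lambda kv: -kv[1]):
    print(f"{model}\t{s:.3f}")
```

Real arena interfaces also record ties and "both bad" votes; this sketch omits them, but a fuller treatment would split or down-weight such outcomes.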
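The abstract likewise does not name the statistic behind the reported agreement with automatic evaluation; a standard check is a rank correlation between arena-derived scores and LLM-as-a-judge scores. A minimal sketch with made-up numbers:

```python
from scipy.stats import spearmanr

# Hypothetical scores; the paper's actual arena ratings and
# automatic judge scores would replace these.
arena = {"gpt-oss": 1.8, "Qwen3": 1.5, "Gemma3": 1.2, "LLM-jp-3.1": 0.7}
judge = {"gpt-oss": 8.9, "Qwen3": 8.4, "Gemma3": 8.1, "LLM-jp-3.1": 7.2}

models = sorted(arena)
rho, pval = spearmanr([arena[m] for m in models],
                      [judge[m] for m in models])
print(f"Spearman rho = {rho:.2f} (p = {pval:.3f})")
```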
