Presentation Information
[3Yin-A-10] LLM-jp Chatbot Arena: Building a Dialogue Evaluation Platform for Japanese LLMs
〇Hirokazu Kiyomaru1, Hiroaki Sugiyama2, Hiroshi Tokoyo3, Takahiro Kubo3, Naoaki Okazaki4,1 (1. National Institute of Informatics, 2. NTT, 3. Amazon Web Services Japan, 4. Institute of Science Tokyo)
Keywords: LLM, Evaluation, Japanese
We propose LLM-jp Chatbot Arena, a platform for evaluating the Japanese dialogue performance of large language models (LLMs). On the platform, two LLMs independently generate responses to the same user query, and the user votes for the response they judge to be better; aggregating these pairwise votes yields a relative ranking of the models. We operated the platform for approximately seven months with a total of ten models, collecting 5,330 context–response pairs and 1,498 votes. Evaluation based on these votes shows that multilingual LLMs such as gpt-oss, Qwen3, and Gemma3 achieve better Japanese dialogue performance than Japanese LLMs such as LLM-jp-3.1. We also show that these results correlate strongly with automatic LLM-based evaluations.
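The abstract does not specify how pairwise votes are aggregated into a ranking; Chatbot Arena-style platforms typically fit a Bradley-Terry model (or an equivalent Elo-style rating) to the votes. The following is a minimal, self-contained sketch of that aggregation step; the vote data and the resulting scores are illustrative, not the platform's actual records.

```python
from collections import defaultdict

def bradley_terry(votes, iters=100):
    """Fit Bradley-Terry strengths to pairwise votes via the MM algorithm.

    votes: list of (winner, loser) model-name pairs, one per user vote.
    Returns {model: strength}; a higher strength means the model is
    preferred more often in head-to-head comparisons.
    """
    wins = defaultdict(int)   # total wins per model
    pairs = defaultdict(int)  # comparison count per unordered model pair
    models = set()
    for winner, loser in votes:
        wins[winner] += 1
        pairs[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            # MM update: wins_i / sum_j n_ij / (p_i + p_j)
            denom = sum(
                pairs[frozenset((i, j))] / (strength[i] + strength[j])
                for j in models if j != i
            )
            updated[i] = wins[i] / denom if denom > 0 else strength[i]
        # Normalize so the average strength is 1 (the scale is arbitrary).
        norm = sum(updated.values()) / len(updated)
        strength = {m: s / norm for m, s in updated.items()}
    return strength

# Illustrative votes only; the real arena collected 1,498 of these.
votes = [
    ("gpt-oss", "LLM-jp-3.1"), ("gpt-oss", "Gemma3"),
    ("Qwen3", "LLM-jp-3.1"), ("Qwen3", "gpt-oss"),
    ("Gemma3", "Qwen3"), ("Gemma3", "LLM-jp-3.1"),
    ("LLM-jp-3.1", "Gemma3"),
]
for model, s in sorted(bradley_terry(votes).items(), key=lambda kv: -kv[1]):
    print(f"{model}\t{s:.3f}")
```

Real arena interfaces also record ties and "both bad" votes; this sketch omits them, but a fuller treatment would split or down-weight such outcomes.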
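The abstract likewise does not name the statistic behind the reported agreement with automatic evaluation; a standard check is a rank correlation between arena-derived scores and LLM-as-a-judge scores. A minimal sketch with made-up numbers:

```python
from scipy.stats import spearmanr

# Hypothetical scores; the paper's actual arena ratings and
# automatic judge scores would replace these.
arena = {"gpt-oss": 1.8, "Qwen3": 1.5, "Gemma3": 1.2, "LLM-jp-3.1": 0.7}
judge = {"gpt-oss": 8.9, "Qwen3": 8.4, "Gemma3": 8.1, "LLM-jp-3.1": 7.2}

models = sorted(arena)
rho, pval = spearmanr([arena[m] for m in models],
                      [judge[m] for m in models])
print(f"Spearman rho = {rho:.2f} (p = {pval:.3f})")
```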
