[4K4-GS-6b-04]Examination of Quality Evaluation Criteria for Chit-Chat in Dialogue Systems and Attempt to Build a Quality Evaluation System
〇Shuichi Hirukawa1, Yuya Goto1, Makoto Shiomi1, Shinji Shinjo1, Shigeto Yoshida1 (1. Sharp Corporation)
Keywords: LLM, dialogue system, Japanese language, LLM-as-a-judge, casual conversation
Developing truly user-oriented dialogue systems requires methods that can quantitatively evaluate "conversational preferability." However, no comprehensive evaluation methodology had previously been established for Japanese. In this study, we developed an automatic evaluation system that assesses conversational preferability from multiple perspectives. Through a literature survey, we extracted 29 factors influencing human perception of dialogue quality and classified them into 9 fundamental factors, 13 user-dependent factors, and 7 system-dependent factors. Focusing on the 9 fundamental factors, we conducted subjective evaluation experiments to analyze their sensitivity and importance. Based on these results, we designed an LLM-as-a-judge evaluation prompt for each of the 9 factors and devised a scoring method, yielding an automatic evaluation system. Applied to dialogue agent responses, the system's scores showed a strong correlation with human judgments.
In future work, we plan to incorporate user-dependent factors to realize more human-oriented dialogue systems.
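The per-factor LLM-as-a-judge scoring described above might be sketched as follows. This is a minimal illustration, not the paper's implementation: the factor names are placeholders (the abstract does not list the 9 fundamental factors), the prompt template is hypothetical, the judge is stubbed out, and the weighted mean is an assumed aggregation rule.

```python
from statistics import mean
from typing import Callable, Dict, Optional

# Placeholder identifiers for the paper's 9 fundamental factors
# (the actual factor names are not given in the abstract).
FACTORS = [f"factor_{i}" for i in range(1, 10)]


def build_judge_prompt(factor: str, context: str, response: str) -> str:
    """Compose a per-factor evaluation prompt for an LLM judge.

    The wording is a hypothetical template; the paper designs one such
    prompt per factor based on subjective-evaluation results.
    """
    return (
        f"You are evaluating a chit-chat response on the factor: {factor}.\n"
        f"Dialogue context:\n{context}\n"
        f"System response:\n{response}\n"
        "Rate the response from 1 (poor) to 5 (excellent). "
        "Reply with the number only."
    )


def judge_all_factors(
    context: str,
    response: str,
    llm_judge: Callable[[str], int],
) -> Dict[str, int]:
    """Query the judge LLM once per factor and collect integer scores.

    `llm_judge` is any callable mapping a prompt to a 1-5 score
    (e.g. an API call whose output is parsed to an int).
    """
    return {
        factor: llm_judge(build_judge_prompt(factor, context, response))
        for factor in FACTORS
    }


def aggregate_scores(
    scores: Dict[str, int],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Combine per-factor scores into one preferability score.

    A (weighted) mean is an assumed scoring rule; the paper's actual
    scoring method may differ.
    """
    if weights is None:
        return mean(scores.values())
    total = sum(weights.get(f, 1.0) for f in scores)
    return sum(s * weights.get(f, 1.0) for f, s in scores.items()) / total
```

A usage example with a stub judge: `aggregate_scores(judge_all_factors("Hello", "Hi there!", lambda prompt: 4))` returns the mean of nine identical scores, 4.0.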
