The 40th Annual Conference of the Japanese Society for Artificial Intelligence, 2026

Presentation Information

12:30 PM - 12:45 PM JST (3:30 AM - 3:45 AM UTC)

[5J2-OS-31a-03] Practical Considerations on Quality Evaluation Design for LLM Agent Systems: Insights from Applying Multiple Guidelines

〇Miho Ezawa¹ (1. CRESCO LTD.)

Keywords:

Large Language Model (LLM), AI Agent, Quality Evaluation, AI Governance, Software Engineering

Systems built on Large Language Models (LLMs) have evolved from simple question-answering chatbots into agent-based systems that autonomously execute multi-step processes while integrating external tools. In Japan, quality evaluation frameworks have been systematically established through guidelines such as the AI Business Operator Guidelines, the AISI AI Safety Evaluation Perspectives Guide, and the QA4AI Guidelines. However, practical methods for translating these guidelines into concrete evaluation designs are increasingly needed. This paper reports on evaluation design workshops conducted in the MLSE LLM Domain Application Working Group, using a restaurant ordering chatbot and a travel arrangement agent as case studies, and describes how evaluation perspectives from multiple guidelines were prioritized. Through these cases, we show that the evaluation perspectives to prioritize differ by system type and domain, highlight the importance of deliberately deciding what not to evaluate, and discuss practical implications for applying the guidelines in real-world evaluation practice.
