Presentation Information

[5J2-OS-31a-03] Practical Considerations on Quality Evaluation Design for LLM Agent Systems: Insights from Applying Multiple Guidelines

〇Miho Ezawa1 (1. CRESCO LTD.)

Keywords:

Large Language Model (LLM), AI Agent, Quality Evaluation, AI Governance, Software Engineering

Systems utilizing Large Language Models (LLMs) have evolved from simple question-answering chatbots to agent-based systems that autonomously execute multi-step processes while integrating external tools. In Japan, quality evaluation frameworks have been systematically established through guidelines such as the AI Business Operator Guidelines, the AISI AI Safety Evaluation Perspectives Guide, and the QA4AI Guidelines. Meanwhile, there is a growing need for practical methods to translate these guidelines into concrete evaluation designs. This paper reports on evaluation design workshops conducted in the MLSE LLM Domain Application Working Group, using a restaurant order chatbot and a travel arrangement agent as case studies, and describes how evaluation perspectives from multiple guidelines were prioritized. Through these cases, we demonstrate that the evaluation perspectives to be prioritized differ depending on the system type and domain, highlight the importance of deliberately deciding what not to evaluate, and discuss practical implications for applying guidelines in real-world evaluation practices.
