Presentation Information
[3Yin-A-52] Comparison of Accuracies of LLMs for Slot Extraction in a Medical Interview Training System
〇Naoki Sakaguchi1, Chee Siang Leow1, Hiromitsu Nishizaki1, Takehito Utsuro2, Junichi Hoshino2, Kentaro Takagaki1,3, Kenichi Kawabata1, Shoji Suzuki1 (1. University of Yamanashi, 2. University of Tsukuba, 3. Institute of Science Tokyo)
Keywords: Large Language Model, Clinical Interview Training, Slot Filling, Automated Dialogue Assessment
Automatic extraction of interview slots from medical dialogue is essential for providing quantitative feedback in clinical interview training systems. However, systematic benchmarks comparing slot extraction accuracy across large language models (LLMs), particularly locally deployable ones that address privacy and cost concerns, remain scarce. This study evaluates the slot extraction performance of several locally deployable LLMs against a GPT-4o baseline using 232 Japanese medical dialogue test cases covering five chief complaints. The test set was systematically constructed around seven categories of linguistic phenomena (expression variation, indirect responses, negation, etc.), and extraction validity was assessed via a two-stage pipeline combining rule-based guardrails with an LLM-as-judge mechanism. Gemma3:4b (4B parameters) achieved the highest match rate of 80.6%, outperforming GPT-4o (71.6%). Inter-rater reliability among three independent LLM evaluators (GPT-4o, Claude Sonnet 4.5, Gemini 2.5 Pro) yielded Fleiss' $\kappa = 0.665$ (substantial agreement). These results demonstrate that small, locally deployable LLMs can match or exceed cloud-based models for medical slot extraction, enabling privacy-preserving, cost-effective deployment.
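The two-stage validity check described above (rule-based guardrails followed by an LLM-as-judge) can be pictured with a minimal Python sketch. The slot names, the guardrail heuristic, and the `ask_llm` callable below are hypothetical illustrations, not the authors' implementation:

```python
import re

# Hypothetical slot schema for one chief complaint; the paper's actual
# slot definitions are not given here, so these names are illustrative.
SLOTS = {"onset", "location", "severity", "duration", "aggravating_factors"}

def rule_guardrail(slot: str, value: str, dialogue: str) -> bool:
    """Stage 1: cheap rule-based checks before any LLM judge is called."""
    if not value or slot not in SLOTS:
        return False
    # Reject values whose tokens never occur in the source dialogue
    # (a deliberately strict, illustrative anti-hallucination rule).
    tokens = re.findall(r"\w+", value)
    return bool(tokens) and all(t in dialogue for t in tokens)

def llm_judge(slot: str, value: str, dialogue: str, ask_llm) -> bool:
    """Stage 2: LLM-as-judge check; `ask_llm` is any callable that sends
    a prompt to a judge model and returns its text reply."""
    prompt = (
        f"Dialogue:\n{dialogue}\n\n"
        f"Does the value '{value}' correctly fill the slot '{slot}'? "
        "Answer only 'yes' or 'no'."
    )
    return ask_llm(prompt).strip().lower().startswith("yes")

def validate_extraction(extracted: dict, dialogue: str, ask_llm) -> dict:
    """Two-stage pipeline: guardrails filter first, the judge confirms."""
    return {
        slot: rule_guardrail(slot, value, dialogue)
              and llm_judge(slot, value, dialogue, ask_llm)
        for slot, value in extracted.items()
    }
```

Running the guardrail first keeps judge-model calls, and their cost, limited to values that already pass cheap surface checks.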
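The reported Fleiss' $\kappa$ generalizes chance-corrected agreement to more than two raters: with mean per-item agreement $\bar{P}$ and chance agreement $\bar{P}_e$, $\kappa = (\bar{P} - \bar{P}_e)/(1 - \bar{P}_e)$, and values in 0.61–0.80 are conventionally read as "substantial" (Landis and Koch). A self-contained sketch of the computation follows; the toy labels are illustrative, not the paper's data:

```python
from collections import Counter

def fleiss_kappa(ratings: list[list[str]]) -> float:
    """Fleiss' kappa for N items, each rated by the same number of raters.
    `ratings[i]` holds the labels given to item i, e.g.
    ["valid", "valid", "invalid"] from three LLM judges."""
    n_items = len(ratings)
    n_raters = len(ratings[0])

    # Accumulate per-item agreement P_i and per-category label counts.
    label_counts = Counter()
    p_bar = 0.0
    for row in ratings:
        counts = Counter(row)
        label_counts.update(counts)
        p_bar += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1))
    p_bar /= n_items

    # Chance agreement from the marginal label proportions.
    total = n_items * n_raters
    p_e = sum((c / total) ** 2 for c in label_counts.values())
    return (p_bar - p_e) / (1 - p_e)

# Toy check with three judges over four items (illustrative only);
# prints 0.625.
print(fleiss_kappa([
    ["valid", "valid", "valid"],
    ["valid", "valid", "invalid"],
    ["invalid", "invalid", "invalid"],
    ["valid", "valid", "valid"],
]))
```

Applied to three judges per item as in the study, a value of 0.665 falls in the substantial band, consistent with the abstract's reading.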
