Presentation Information
[3Yin-A-52] Comparison of Accuracies of LLMs for Slot Extraction in a Medical Interview Training System
〇Naoki Sakaguchi1, Chee Siang Leow1, Hiromitsu Nishizaki1, Takehito Utsuro2, Junichi Hoshino2, Kentaro Takagaki1,3, Kenichi Kawabata1, Shoji Suzuki1 (1. University of Yamanashi, 2. University of Tsukuba, 3. Institute of Science Tokyo)
Keywords: Large Language Model, Clinical Interview Training, Slot Filling, Automated Dialogue Assessment
Automatic extraction of interview slots from medical dialogue is essential for providing quantitative feedback in clinical interview training systems. However, systematic benchmarks comparing slot extraction accuracy across large language models (LLMs), particularly locally deployable ones that address privacy and cost concerns, remain scarce. This study evaluates the slot extraction performance of several locally deployable LLMs against a GPT-4o baseline using 232 Japanese medical dialogue test cases covering five chief complaints. The test set was systematically constructed around seven categories of linguistic phenomena (expression variation, indirect responses, negation, etc.), and extraction validity was assessed via a two-stage pipeline combining rule-based guardrails with an LLM-as-judge mechanism. Gemma3:4b (4B parameters) achieved the highest match rate of 80.6%, outperforming GPT-4o (71.6%). Inter-rater reliability among three independent LLM evaluators (GPT-4o, Claude Sonnet 4.5, Gemini 2.5 Pro) yielded Fleiss' $\kappa = 0.665$ (substantial agreement). These results demonstrate that small, locally deployable LLMs can match or exceed cloud-based models for medical slot extraction, enabling privacy-preserving, cost-effective deployment.
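The two-stage validity check described above (rule-based guardrails followed by an LLM-as-judge) can be pictured with a minimal Python sketch. The slot names, the guardrail heuristic, and the `ask_llm` callable below are hypothetical illustrations, not the authors' implementation:

```python
import re

# Hypothetical slot schema for one chief complaint; the paper's actual
# slot definitions are not given here, so these names are illustrative.
SLOTS = {"onset", "location", "severity", "duration", "aggravating_factors"}

def rule_guardrail(slot: str, value: str, dialogue: str) -> bool:
    """Stage 1: cheap rule-based checks before any LLM judge is called."""
    if not value or slot not in SLOTS:
        return False
    # Reject values whose tokens never occur in the source dialogue
    # (a deliberately strict, illustrative anti-hallucination rule).
    tokens = re.findall(r"\w+", value)
    return bool(tokens) and all(t in dialogue for t in tokens)

def llm_judge(slot: str, value: str, dialogue: str, ask_llm) -> bool:
    """Stage 2: LLM-as-judge check; `ask_llm` is any callable that sends
    a prompt to a judge model and returns its text reply."""
    prompt = (
        f"Dialogue:\n{dialogue}\n\n"
        f"Does the value '{value}' correctly fill the slot '{slot}'? "
        "Answer only 'yes' or 'no'."
    )
    return ask_llm(prompt).strip().lower().startswith("yes")

def validate_extraction(extracted: dict, dialogue: str, ask_llm) -> dict:
    """Two-stage pipeline: guardrails filter first, the judge confirms."""
    return {
        slot: rule_guardrail(slot, value, dialogue)
              and llm_judge(slot, value, dialogue, ask_llm)
        for slot, value in extracted.items()
    }
```

Running the guardrail first keeps judge-model calls, and their cost, limited to values that already pass cheap surface checks.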
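The reported Fleiss' $\kappa$ generalizes chance-corrected agreement to more than two raters: with mean per-item agreement $\bar{P}$ and chance agreement $\bar{P}_e$, $\kappa = (\bar{P} - \bar{P}_e)/(1 - \bar{P}_e)$, and values in 0.61–0.80 are conventionally read as "substantial" (Landis and Koch). A self-contained sketch of the computation follows; the toy labels are illustrative, not the paper's data:

```python
from collections import Counter

def fleiss_kappa(ratings: list[list[str]]) -> float:
    """Fleiss' kappa for N items, each rated by the same number of raters.
    `ratings[i]` holds the labels given to item i, e.g.
    ["valid", "valid", "invalid"] from three LLM judges."""
    n_items = len(ratings)
    n_raters = len(ratings[0])

    # Accumulate per-item agreement P_i and per-category label counts.
    label_counts = Counter()
    p_bar = 0.0
    for row in ratings:
        counts = Counter(row)
        label_counts.update(counts)
        p_bar += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1))
    p_bar /= n_items

    # Chance agreement from the marginal label proportions.
    total = n_items * n_raters
    p_e = sum((c / total) ** 2 for c in label_counts.values())
    return (p_bar - p_e) / (1 - p_e)

# Toy check with three judges over four items (illustrative only);
# prints 0.625.
print(fleiss_kappa([
    ["valid", "valid", "valid"],
    ["valid", "valid", "invalid"],
    ["invalid", "invalid", "invalid"],
    ["valid", "valid", "valid"],
]))
```

Applied to three judges per item as in the study, a value of 0.665 falls in the substantial band, consistent with the abstract's reading.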
