Presentation Information

[5Yin-A-48] Evaluation of LLMs' Resistance to Induction Using an NG Word Game

〇Shunichi Yoneda¹, Yukitake Yoshioka¹, Yurai Takayanagi¹, Yumi Fukuda¹, Ryoma Obara⁴, Yusuke Sakai², Hidetaka Kamigaito², Katsuhiko Hayashi³, Shogo Matsuno¹ (1. University of Electro-Communications, 2. NAIST, 3. The University of Tokyo, 4. NEC)

Keywords: LLM

In real-world LLM deployment, it is important to evaluate quantitatively whether safety constraints, such as refraining from mentioning specific targets, hold over multiple turns under induction or adversarial questioning. This study proposes InduceGuard, a framework that uses the NG Word Game to cast this real-world challenge into an evaluable setting. In the NG Word Game, each participant is secretly assigned a prohibited (NG) word; the other participants try to elicit it, while the assignee tries to avoid uttering it. At each turn, violations are detected by exact match, and semantic proximity is quantified as the similarity between the dialogue history and each prohibited word, yielding a trajectory of constraint-violation risk. In experiments, the eliminated player was often assessed as having the lowest relative safety at the elimination turn, and in some cases a decline in relative safety was observed before elimination.
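
The per-turn procedure described above (exact-match violation detection plus a similarity-based risk score for each prohibited word) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the InduceGuard implementation: the abstract does not name the similarity measure, so a character-trigram cosine similarity stands in for it, "exact match" is taken to mean a case-insensitive substring match, and the function names (`violates`, `risk_trajectory`) and the record layout are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count character n-grams of the lower-cased text (stand-in features)."""
    s = text.lower()
    return Counter(s[i:i + n] for i in range(max(len(s) - n + 1, 1)))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def violates(utterance, ng_word):
    """Assumed reading of 'exact match': case-insensitive substring match."""
    return ng_word.lower() in utterance.lower()


def risk_trajectory(turns, ng_words):
    """Walk through the dialogue turn by turn.

    turns:    list of (speaker, utterance) pairs
    ng_words: dict mapping each player to their secret prohibited word
    Returns one record per turn with every player's risk score and the
    player eliminated at that turn (or None). Stops at the first violation.
    """
    history = ""
    trajectory = []
    for speaker, utterance in turns:
        history += " " + utterance
        hist_vec = char_ngrams(history)
        # Risk proxy: similarity between the dialogue so far and each NG word.
        risks = {p: cosine(hist_vec, char_ngrams(w)) for p, w in ng_words.items()}
        eliminated = speaker if violates(utterance, ng_words[speaker]) else None
        trajectory.append({"speaker": speaker, "risks": risks, "eliminated": eliminated})
        if eliminated is not None:
            break
    return trajectory
```

Calling `risk_trajectory` with a short list of `(speaker, utterance)` pairs and a player-to-word mapping returns the per-turn risk scores, which can then be compared across players. How the study derives "relative safety" from such scores is not stated in the abstract, so that step is left out here.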