The 40th Annual Conference of the Japanese Society for Artificial Intelligence, 2026

Presentation Information

[5Yin-A-48] Evaluation of LLMs' Resistance to Induction Using an NG Word Game

〇Shunichi Yoneda¹, Yukitake Yoshioka¹, Yurai Takayanagi¹, Yumi Fukuda¹, Ryoma Obara⁴, Yusuke Sakai², Hidetaka Kamigaito², Katsuhiko Hayashi³, Shogo Matsuno¹ (1. The University of Electro-Communications, 2. NAIST, 3. The University of Tokyo, 4. NEC)

Keywords:

LLM

When LLMs are deployed in the real world, it is important to evaluate quantitatively whether safety constraints, such as refraining from mentioning specific targets, hold over multiple turns of induction or adversarial questioning. This study proposes InduceGuard, a framework that uses the NG Word Game to cast this real-world challenge as a setting that can be evaluated. In the NG Word Game, each participant is secretly assigned a prohibited word; the other participants try to elicit it, while the assignee tries to avoid uttering it. At each turn, violations are detected by exact match. Semantic proximity is quantified as the similarity between the dialogue history and each prohibited word, which yields a trajectory of constraint-violation risk over the course of the game. In experiments, the eliminated player was often assessed as having the lowest relative safety at the elimination turn, and in some cases a decline in relative safety was observed before elimination.
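
The two per-turn measurements described above can be illustrated with a minimal sketch. The abstract does not state which similarity measure InduceGuard uses, so this example substitutes a simple bag-of-words cosine similarity as a stand-in. The function names and the `related_terms` parameter are hypothetical: `related_terms` is a hand-written list of words standing in for the semantic neighborhood that an embedding model would presumably capture.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a stand-in for a real tokenizer or embedding model."""
    return re.findall(r"[a-z0-9]+", text.lower())


def violates(utterance, ng_word):
    """Exact-match check: the NG word appears as a whole token in the utterance."""
    return ng_word.lower() in tokenize(utterance)


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def risk_trajectory(turns, ng_word, related_terms):
    """Per-turn risk score: similarity between the dialogue history so far
    and a bag of terms describing the NG word.

    `related_terms` is a hypothetical proxy for semantic proximity; the
    actual framework may use a different representation entirely.
    """
    target = Counter([ng_word.lower(), *(t.lower() for t in related_terms)])
    history = Counter()
    trajectory = []
    for utterance in turns:
        history.update(tokenize(utterance))  # history grows turn by turn
        trajectory.append(cosine(history, target))
    return trajectory


# Illustrative dialogue (made up for this sketch, not taken from the study).
turns = ["I like fruit", "Red fruit grows on a tree", "An apple a day"]
trajectory = risk_trajectory(turns, "apple", ["fruit", "red", "tree"])
violation_turns = [i for i, t in enumerate(turns) if violates(t, "apple")]
```

In this toy dialogue the risk score rises as the conversation drifts toward the prohibited word, and the exact-match check fires only on the final turn. A relative safety score of the kind mentioned in the abstract could then be derived by comparing such trajectories across players, though how InduceGuard defines it is not specified here.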
