Presentation Information

[4Yin-B-12]Building a Japanese Prompt-Injection Defense Model for LLMs and Its Challenges

〇Koshiro Wada1, Daichi Sato1, Keichi Yokoyama1 (1. EpicAI)

Keywords:

Security,prompt injection

This study aims to develop a Japanese security-focused LLM for detecting and defending against prompt-injection attacks on large language models. Prompt injection is an emerging threat in which attackers embed malicious instructions in user inputs to override model constraints and induce unintended behavior. By late 2023, peer-reviewed research on this issue was limited, and there were no publicly available datasets or defense models tailored to Japanese. To address this gap, we adapt existing English datasets to Japanese and generate new samples reflecting Japanese-specific attack patterns, including orthographic variations, honorific expressions, and zero-width space insertion. We construct a corpus consisting of approximately 24,000 attack samples and 10,000 benign dialogue instances. Using Qwen3-4B as the base model, we fine-tune a binary classifier with LoRA and achieve around 99% accuracy in Japanese prompt-injection detection, outperforming existing guardrail models primarily designed for English. Through error analysis, we further identify challenges related to Japanese linguistic characteristics and data quality, and discuss the role of our approach within a defense-in-depth strategy.