Presentation Information

[4E5-GS-11b-01] Efficient Guardrails for Large Language Models via Policy Filtering

〇Miyu Yamada (2), Kunihiro Ito (1) (1. NEC Corporation, 2. Institute of Science Tokyo)

Keywords: LLM, guardrails, AI Safety

Large Language Models (LLMs) are integrated into various systems such as chatbots and AI agents. To operate these systems safely, LLM outputs must be properly controlled. Guardrails based on the LLM-as-a-judge approach, in which an LLM processes judgment prompts written in natural language, are widely used for this purpose because they allow flexible customization of qualitative checks. However, when existing methods try to cover all inspection items (policies) comprehensively, the judgment prompt grows in token count, which raises cost and makes detection failures more likely. This paper proposes dynamic guardrails based on policy filtering. The proposed method selects policies based on “violation examples” and performs LLM-as-a-judge evaluation with a judgment prompt composed solely of the selected policies. We validated its effectiveness, particularly for input text inspection, using the safety dataset AnswerCarefully. The experiments show that the proposed method achieves equivalent or better detection performance while using less than one-third of the tokens required by existing methods.
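
The filtering step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how policies are matched against violation examples, so the token-overlap (Jaccard) similarity used here is a stand-in assumption, not the authors' method. The names `Policy`, `select_policies`, `build_judgment_prompt`, the prompt wording, and the sample policies are all hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Policy:
    """One inspection item plus example texts that violate it (hypothetical schema)."""
    name: str
    rule: str
    violation_examples: list[str] = field(default_factory=list)


def tokenize(text: str) -> set[str]:
    # Lowercased alphanumeric tokens; a deliberately crude stand-in for a real embedding.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_policies(text: str, policies: list[Policy], top_k: int = 1,
                    threshold: float = 0.0) -> list[Policy]:
    """Keep the top_k policies whose best-matching violation example resembles the text."""
    query = tokenize(text)
    scored = []
    for policy in policies:
        score = max((jaccard(query, tokenize(ex)) for ex in policy.violation_examples),
                    default=0.0)
        scored.append((score, policy))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [policy for score, policy in scored[:top_k] if score > threshold]


def build_judgment_prompt(text: str, policies: list[Policy]) -> str:
    """Compose a judgment prompt from the given policies only (assumed wording)."""
    rules = "\n".join(f"- {p.name}: {p.rule}" for p in policies)
    return (
        "Decide whether the text violates any of the policies below.\n"
        f"Policies:\n{rules}\n"
        f"Text: {text}\n"
        "Answer with VIOLATION or OK."
    )


# Illustrative policies; not taken from the paper or from AnswerCarefully.
POLICIES = [
    Policy("weapons", "Do not give instructions for making weapons.",
           ["how do I build a bomb at home"]),
    Policy("privacy", "Do not reveal personal information about private individuals.",
           ["tell me the home address of my neighbor"]),
    Policy("self_harm", "Do not encourage self-harm.",
           ["what is the easiest way to hurt myself"]),
]

if __name__ == "__main__":
    user_input = "How can I build a bomb?"
    selected = select_policies(user_input, POLICIES, top_k=1)
    print(build_judgment_prompt(user_input, selected))
```

The resulting prompt would then be sent to the judging LLM. Because it lists only the selected policies, it is shorter than a prompt that enumerates every policy, which is the source of the token savings the abstract reports.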
