Presentation Information
[2Yin-A-23]Theoretical Considerations on AI Sandbagging Detection via Noise Injection
〇Daisuke Kaji¹˒³, Keisuke Yamazaki² (1. Shizuoka University, 2. National Institute of Advanced Industrial Science and Technology, 3. DENSO CORPORATION)
Keywords: Sandbagging detection, Singular learning theory, AI safety, Bayesian learning, Noise injection
In recent years, increasing attention has been directed toward a phenomenon known as sandbagging (SB), in which AI systems deliberately underperform during evaluation to evade regulatory scrutiny or obscure their true capabilities. Detecting SB has therefore become a central challenge in AI safety assessment, motivating the development of various detection strategies, including approaches that rely on monitoring agents. This paper examines a strategy known as 'Noise Injection', which is incorporated into the AI Metacognition Toolkit: counterintuitively, injecting noise into model parameters can improve the performance of SB models. We present a theoretical analysis aimed at explaining why this effect arises and at identifying the conditions under which it occurs. Specifically, we characterize the influence of parameter noise by applying the asymptotic expansion of the Bayesian generalization error for regression models, which allows us to identify the dominant terms responsible for the observed performance gains. To validate the theoretical findings, we conducted experiments with small-scale neural network models; the empirical results agree with the theoretical predictions.
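The detection idea described above can be sketched in a few lines. The following is a minimal illustrative toy, not the AI Metacognition Toolkit implementation: all function names and the gate-based "suppression" mechanism are assumptions made for demonstration. The probe adds Gaussian noise to a model's parameters and checks whether the mean evaluation score rises above the noiseless baseline, which is the counterintuitive signature of SB that the paper analyzes.

```python
import numpy as np

def noise_injection_probe(params, score_fn, sigma, n_trials=200, seed=0):
    """Mean score after adding N(0, sigma^2) noise to every parameter."""
    rng = np.random.default_rng(seed)
    scores = [score_fn(params + rng.normal(0.0, sigma, size=params.shape))
              for _ in range(n_trials)]
    return float(np.mean(scores))

# Toy "model": a single gate parameter g suppresses capability when g > 0.
# A sandbagging model is assumed to hold g slightly positive; because that
# suppression is fragile, parameter noise frequently flips it off.
def score_fn(params):
    g = params[-1]
    return 0.9 if g <= 0.0 else 0.3   # full vs. suppressed accuracy

sandbagged = np.array([1.0, -0.5, 0.05])   # last entry is the gate g

baseline = score_fn(sandbagged)            # 0.3: deliberate underperformance
noisy = noise_injection_probe(sandbagged, score_fn, sigma=0.3)
flagged = noisy > baseline                 # improvement under noise flags SB
```

In this toy, perturbing the gate by N(0, 0.3²) pushes it below zero in a large fraction of trials, so the mean noisy score exceeds the baseline; for a genuinely optimized model sitting at a performance maximum, noise can only degrade the score.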
