Presentation Information

[2Yin-B-47]Replacing the Human in the Loop: Automated Evaluation of LLM Mathematical Proof Generalization

〇Carolina Dias-Alexiou1, Edison Marrese-Taylor1,2, Hiroya Takamura2, Yutaka Matsuo1 (1. The University of Tokyo, 2. AIST)

Keywords:

Large Language Models, Mathematical Reasoning, Alignment

As we continue to extend their range of applications, the study of the generalization capabilities of large language models (LLMs) has become an important topic. Recent work has proposed studying these abilities in the context of mathematical reasoning, by asking models to reproduce proofs they have most likely seen during training, but with key symbols replaced. Since this evaluation relies heavily on human feedback, we study to what degree it can be performed automatically. To that end, we simulate the human evaluation proposed by previous work by interactively presenting the same questions to the LLMs as prompts. We analyze the degree of alignment between human and model answers via accuracy and correlation analyses, as well as via inter-annotator agreement metrics. We test a selection of models, including ones with reasoning capabilities, and show that LLMs currently have important limitations in this regard.
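The sketch below illustrates the kind of alignment analysis described above: comparing an LLM judge's answers against human annotations via accuracy, correlation, and an inter-annotator agreement metric (Cohen's kappa here). The binary accept/reject label scheme and the specific metric functions are illustrative assumptions, not necessarily the exact protocol used in the paper.

```python
# Hedged example: measuring human-vs-LLM alignment on proof judgments.
# Labels and metric choices are assumptions for illustration only.
from scipy.stats import pearsonr
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Hypothetical labels: 1 = proof judged correct, 0 = judged incorrect.
human_labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
llm_labels   = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

accuracy = accuracy_score(human_labels, llm_labels)   # raw agreement rate
kappa = cohen_kappa_score(human_labels, llm_labels)   # chance-corrected agreement
corr, p_value = pearsonr(human_labels, llm_labels)    # correlation of judgments

print(f"Accuracy vs. human: {accuracy:.2f}")
print(f"Cohen's kappa:      {kappa:.2f}")
print(f"Pearson r:          {corr:.2f} (p = {p_value:.3f})")
```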