Presentation Information

[2Yin-B-51] Evaluating the Pure Proof Generation Ability of Large Language Models: A Benchmark Based on Hilbert System for Propositional Logic

〇Risako Ando (1), Rinta Yamada (2), Sakiko Yamasaki (1), Ryota Akiyoshi (2), Koji Mineshima (1) (1. Keio University, 2. The University of Electro-Communications)

Keywords:

logical inference, theorem proving, large language model

We propose a new benchmark for systematically evaluating the pure proof-generation capability of large language models (LLMs) in propositional logic, without relying on external automated provers or mathematical libraries. While recent LLM systems achieve strong performance through integration with such tools, it remains insufficiently understood how well LLMs can construct formal proofs solely within a given deductive system. To address this gap, we design a proof-generation task based on the Hilbert system for propositional logic, a setting widely regarded as challenging for automated theorem proving. Using the proposed benchmark, we evaluate current reasoning-oriented LLMs and analyze the complexity of proofs they can generate. The correctness of generated proofs is mechanically checked with the proof assistant Lean. We further examine the effect of imposing proof-construction strategies that are relatively easy for humans, and compare how these strategies affect LLM performance. Our study characterizes LLMs' proof-generation behavior and identifies challenges toward accurate and reliable logical reasoning.
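To illustrate the kind of task involved, the following is a minimal Lean 4 sketch of a Hilbert-style system for the implicational fragment of propositional logic, with provability checked mechanically. All names here (`Form`, `Prov`, etc.) are illustrative assumptions; the benchmark's actual Lean encoding may differ.

```lean
-- Hypothetical encoding: formulas built from variables and implication.
inductive Form where
  | var : Nat → Form
  | imp : Form → Form → Form

infixr:60 " ⇒ " => Form.imp

-- Hilbert-style provability: axiom schemas K and S, plus modus ponens.
inductive Prov : Form → Prop where
  | ax_k (a b : Form)   : Prov (a ⇒ (b ⇒ a))
  | ax_s (a b c : Form) : Prov ((a ⇒ (b ⇒ c)) ⇒ ((a ⇒ b) ⇒ (a ⇒ c)))
  | mp {a b : Form}     : Prov (a ⇒ b) → Prov a → Prov b

-- The classic derivation of a ⇒ a: instantiate S and K,
-- then apply modus ponens twice.
theorem id_prov (a : Form) : Prov (a ⇒ a) :=
  Prov.mp
    (Prov.mp (Prov.ax_s a (a ⇒ a) a) (Prov.ax_k a (a ⇒ a)))
    (Prov.ax_k a a)
```

Even this simplest theorem requires two applications of modus ponens with carefully chosen axiom instantiations, which is one reason Hilbert systems are considered hard for proof search without the structural guidance that natural-deduction or sequent calculi provide.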