Presentation Information

[5O2-IS-5a-04]Large Language Models Do Not Solve Hallucinations in Materials Knowledge: A Preliminary Analysis

〇Yutong Duan1, Satoshi Kosugi1, Manabu Okumura1, Kotaro Funakoshi1 (1. Institute of Science Tokyo)
regular

Keywords:

Large Language Models, Materials Science, Numerical Hallucination

Identifier-conditioned numeric retrieval is a high-stakes use case for LLM-based scientific assistants, where outputs must be correct at the record level. We evaluate structured property retrieval from the Materials Project (MP) database under five controlled settings across three models: GPT-5-mini, Gemini-2.5-flash-lite, and Llama-4 Scout-17B-16E. LLM-only baselines fail completely and exhibit structured hallucination patterns, including near-zero/default-value collapse for band gap. Tool grounding substantially improves accuracy but remains brittle due to last-mile output and formatting errors; prompt-only mitigations are also unreliable. We propose Retrieve-Verify-Override (RVO), a lightweight correctness layer that deterministically verifies the model output against the retrieved MP record and overrides it on any mismatch or parse failure. RVO achieves perfect record-level accuracy across model families in our benchmark, indicating that reliable numeric retrieval requires deterministic enforcement beyond prompting.
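The override step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name, the numeric tolerance, and the comparison logic are all assumptions made for exposition.

```python
# Hypothetical sketch of a Retrieve-Verify-Override (RVO) correctness layer:
# parse the model's numeric answer, verify it against the value retrieved
# from the database record, and override on any mismatch or parse failure.

def rvo_override(model_output: str, record_value: float, tol: float = 1e-6) -> float:
    """Return a record-faithful numeric value.

    All names here are illustrative; the paper's RVO layer may differ.
    """
    try:
        parsed = float(model_output.strip())
    except ValueError:
        # Parse failure: fall back to the retrieved record value.
        return record_value
    if abs(parsed - record_value) > tol:
        # Mismatch with the retrieved record: deterministic override.
        return record_value
    # Verified: the model's output matches the record.
    return parsed

# Example: a model that collapses a band gap to 0.0 eV is corrected
# to the retrieved record value (here, a made-up 1.12 eV).
print(rvo_override("0.0", 1.12))   # overridden
print(rvo_override("1.12", 1.12))  # verified, kept
```

The key design point this sketch tries to convey is that correctness is enforced deterministically outside the model, so the final answer is record-faithful regardless of how the model formats or garbles its output.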