Presentation Information
[1Yin-B-59] Evaluating Physical Phenomenon Understanding by MLLMs in Real-World Environments: Analysis of Visual Perception and Physical Reasoning Using Slope-Sliding Tasks
〇Shido Abe1, Masaharu Yoshioka2,1 (1. School of Engineering, Hokkaido University, 2. Faculty of Information Science and Technology)
Keywords:
Multimodal LLM, Intuitive Physics, Explainability
Recent Multimodal Large Language Models (MLLMs) have shown remarkable success in solving formal physics problems, but their ability to grasp "intuitive physics" in real-world environments remains questionable.
To address this question, this study evaluates the physical reasoning capabilities of MLLMs using a newly constructed dataset of real-world "slope sliding" images, featuring diverse object materials and viewpoint variations.
We conducted comparative experiments between human subjects and state-of-the-art models, such as GPT-4o and Gemini-2.5-flash, to assess their ability to predict sliding behavior. The results reveal a critical performance gap.
Specifically, while MLLMs possess textbook knowledge, they fail to ground visual information in appropriate physical attributes, exhibiting severe vulnerability to viewpoint changes and geometric hallucinations. Even with Chain-of-Thought prompting, the models could not accurately estimate invisible parameters such as friction.
We find that current MLLMs fundamentally lack the 3D spatial reasoning required to bridge the gap between visual perception and physical laws.
