Presentation Information
[1Yin-B-59] Evaluating Physical Phenomenon Understanding by MLLMs in Real-World Environments: Analysis of Visual Perception and Physical Reasoning Using Slope-Sliding Tasks
〇Shido Abe1, Masaharu Yoshioka2,1 (1. School of Engineering, Hokkaido University, 2. Faculty of Information Science and Technology)
Keywords:
Multimodal LLM, Intuitive Physics, Explainability
Recent Multimodal Large Language Models (MLLMs) have shown remarkable success in solving formal physics problems, but their ability to grasp "intuitive physics" in real-world environments remains questionable.
To address this question, this study evaluates the physical reasoning capabilities of MLLMs using a newly constructed dataset of real-world "slope sliding" images, featuring diverse object materials and viewpoint variations.
We conducted comparative experiments between human subjects and state-of-the-art models, such as GPT-4o and Gemini-2.5-flash, to assess their ability to predict sliding behavior. The results reveal a critical performance gap.
Specifically, while MLLMs possess textbook knowledge, they fail to ground visual information in appropriate physical attributes, exhibiting severe vulnerability to viewpoint changes and geometric hallucinations. Even with Chain-of-Thought prompting, the models could not accurately estimate invisible parameters such as friction.
We find that current MLLMs fundamentally lack the 3D spatial reasoning required to bridge the gap between visual perception and physical laws.
