Presentation Information
[3Yin-A-02]Affordance-Aware Hierarchical Multimodal Retrieval-Augmented Generation for Mobile Manipulation
〇Ryosuke Korekata1,2,3, Quanting Xie3, Yonatan Bisk3, Komei Sugiura1,2 (1. Keio University, 2. Keio AI Research Center, 3. Carnegie Mellon University)
Keywords:
Domestic Service Robot, Open-Vocabulary Mobile Manipulation, Retrieval-Augmented Generation
In this work, we investigate open-vocabulary mobile manipulation, in which a robot must transport a wide variety of objects to appropriate receptacles according to free-form natural language instructions. This task is challenging because it demands both an understanding of visual semantics and reasoning about manipulation affordances. To address these challenges, we propose a zero-shot hierarchical multimodal retrieval-augmented generation framework that constructs an Affordance-Aware Embodied Memory from pre-explored images. The model retrieves candidate targets based on regional and visual semantic information and then reranks them using affordance scores, enabling the robot to select manipulation options that are likely to be executable in real-world environments. Our approach outperforms existing methods in retrieval performance in large-scale indoor environments, and real-world experiments show that it achieves a task success rate of 85%, surpassing prior approaches in overall task success.
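To make the retrieve-then-rerank idea concrete, the sketch below illustrates one plausible reading of the pipeline: region-level retrieval narrows the search over the embodied memory, image-level semantic similarity selects candidates, and affordance scores rerank them. This is a minimal illustration, not the authors' implementation; the MemoryEntry fields, cosine similarity, and the alpha-blended reranking score are all assumptions introduced here.

```python
# Minimal sketch of hierarchical retrieval with affordance-aware reranking.
# Assumed, not from the paper: MemoryEntry fields, cosine similarity,
# and the alpha blend of semantic similarity with affordance.
from dataclasses import dataclass

import numpy as np


@dataclass
class MemoryEntry:
    """One pre-explored image stored in the embodied memory."""
    image_id: str
    embedding: np.ndarray  # visual-semantic embedding (e.g., CLIP-style)
    region: str            # coarse regional label (e.g., "kitchen")
    affordance: float      # estimated manipulation executability, in [0, 1]


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def retrieve_and_rerank(
    instruction_emb: np.ndarray,
    memory: list[MemoryEntry],
    top_regions: int = 3,
    top_k: int = 10,
    alpha: float = 0.5,
) -> list[MemoryEntry]:
    # Stage 1 (regional): score each region by its mean image embedding
    # and keep only the most relevant regions.
    regions: dict[str, list[MemoryEntry]] = {}
    for entry in memory:
        regions.setdefault(entry.region, []).append(entry)
    region_scores = {
        r: cosine(instruction_emb, np.mean([e.embedding for e in es], axis=0))
        for r, es in regions.items()
    }
    kept = sorted(region_scores, key=region_scores.get, reverse=True)[:top_regions]

    # Stage 2 (visual): retrieve top-k candidate images within those regions.
    pool = [e for r in kept for e in regions[r]]
    candidates = sorted(
        pool, key=lambda e: cosine(instruction_emb, e.embedding), reverse=True
    )[:top_k]

    # Stage 3 (affordance): rerank candidates by blending semantic similarity
    # with the affordance score, so executable options rise to the top.
    return sorted(
        candidates,
        key=lambda e: alpha * cosine(instruction_emb, e.embedding)
        + (1 - alpha) * e.affordance,
        reverse=True,
    )
```

In this reading, the affordance term demotes candidates that match the instruction semantically but are unlikely to be physically manipulable; how the paper actually combines the two signals may differ.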
