Presentation Information
[3Yin-A-02]Affordance-Aware Hierarchical Multimodal Retrieval-Augmented Generation for Mobile Manipulation
〇Ryosuke Korekata1,2,3, Quanting Xie3, Yonatan Bisk3, Komei Sugiura1,2 (1. Keio University, 2. Keio AI Research Center, 3. Carnegie Mellon University)
Keywords:
Domestic Service Robot, Open-Vocabulary Mobile Manipulation, Retrieval-Augmented Generation
In this work, we investigate open-vocabulary mobile manipulation, in which a robot must transport a wide variety of objects to appropriate receptacles according to free-form natural language instructions. This task is challenging because it demands both an understanding of visual semantics and reasoning about manipulation affordances. To address these challenges, we propose a zero-shot hierarchical multimodal retrieval-augmented generation framework that constructs an Affordance-Aware Embodied Memory from pre-explored images. The model retrieves candidate targets based on regional and visual semantic information and then reranks them using affordance scores, enabling the robot to select manipulation options that are likely to be executable in real-world environments. Our approach outperforms existing methods in retrieval performance in large-scale indoor environments, and real-world experiments show that it achieves a task success rate of 85%, surpassing prior approaches in overall task success.
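To make the retrieve-then-rerank idea concrete, the sketch below illustrates one plausible reading of the pipeline: region-level retrieval narrows the search over the embodied memory, image-level semantic similarity selects candidates, and affordance scores rerank them. This is a minimal illustration, not the authors' implementation; the MemoryEntry fields, cosine similarity, and the alpha-blended reranking score are all assumptions introduced here.

```python
# Minimal sketch of hierarchical retrieval with affordance-aware reranking.
# Assumed, not from the paper: MemoryEntry fields, cosine similarity,
# and the alpha blend of semantic similarity with affordance.
from dataclasses import dataclass

import numpy as np


@dataclass
class MemoryEntry:
    """One pre-explored image stored in the embodied memory."""
    image_id: str
    embedding: np.ndarray  # visual-semantic embedding (e.g., CLIP-style)
    region: str            # coarse regional label (e.g., "kitchen")
    affordance: float      # estimated manipulation executability, in [0, 1]


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def retrieve_and_rerank(
    instruction_emb: np.ndarray,
    memory: list[MemoryEntry],
    top_regions: int = 3,
    top_k: int = 10,
    alpha: float = 0.5,
) -> list[MemoryEntry]:
    # Stage 1 (regional): score each region by its mean image embedding
    # and keep only the most relevant regions.
    regions: dict[str, list[MemoryEntry]] = {}
    for entry in memory:
        regions.setdefault(entry.region, []).append(entry)
    region_scores = {
        r: cosine(instruction_emb, np.mean([e.embedding for e in es], axis=0))
        for r, es in regions.items()
    }
    kept = sorted(region_scores, key=region_scores.get, reverse=True)[:top_regions]

    # Stage 2 (visual): retrieve top-k candidate images within those regions.
    pool = [e for r in kept for e in regions[r]]
    candidates = sorted(
        pool, key=lambda e: cosine(instruction_emb, e.embedding), reverse=True
    )[:top_k]

    # Stage 3 (affordance): rerank candidates by blending semantic similarity
    # with the affordance score, so executable options rise to the top.
    return sorted(
        candidates,
        key=lambda e: alpha * cosine(instruction_emb, e.embedding)
        + (1 - alpha) * e.affordance,
        reverse=True,
    )
```

In this reading, the affordance term demotes candidates that match the instruction semantically but are unlikely to be physically manipulable; how the paper actually combines the two signals may differ.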
