Presentation Information
[1Yin-B-18]Semantic-aware Masked Modeling of Image and Text for Text-based Person Re-identification
〇Sora Araki1, Shuhei Tarashima2 (1. Tohoku Univ., 2. NTT DOCOMO BUSINESS, Inc.)
Keywords:
Person Re-identification, Multimodal Learning, Computer Vision
Text-based person re-identification aims to retrieve images of a target person using only natural language descriptions as queries. Recently, bidirectional local matching approaches based on Masked Language Modeling and Masked Image Modeling have been proposed to strengthen local correspondences between images and text. However, because mask tokens are selected at random, these methods may allow masked tokens to be completed from same-modality information alone, and may mask tokens that are not essential for person identification. In this study, we introduce a masking strategy that prioritizes semantic regions and tokens important for identity discrimination, and we develop a framework called semantic-aware BiLMa (sBiLMa) to enhance multimodal representation learning for text-based person re-identification. Experimental results demonstrate that the proposed method outperforms existing approaches, confirming the effectiveness of the masking design.
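The core contrast the abstract draws, random masking versus masking weighted toward identity-relevant tokens, can be sketched as follows. This is a minimal illustration, not the authors' sBiLMa implementation: the importance scores are a stand-in for whatever semantic/identity signal (e.g., attention weights over person regions or attribute words) the actual method would use.

```python
import numpy as np

def random_mask(num_tokens, mask_ratio, rng):
    # Baseline: uniform random masking, as in standard MLM/MIM.
    k = max(1, int(num_tokens * mask_ratio))
    return set(rng.choice(num_tokens, size=k, replace=False).tolist())

def importance_weighted_mask(importance, mask_ratio, rng):
    # Hypothetical semantic-aware masking: sample mask positions in
    # proportion to an importance score per token. Tokens judged more
    # relevant to person identity are masked more often, encouraging
    # the model to recover them from the *other* modality rather than
    # from easy same-modality context.
    importance = np.asarray(importance, dtype=float)
    p = importance / importance.sum()
    k = max(1, int(len(importance) * mask_ratio))
    return set(rng.choice(len(importance), size=k, replace=False, p=p).tolist())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy importance scores for 10 text tokens: positions 2 and 7
    # stand for identity-critical attribute words (e.g., "red backpack").
    imp = [0.1, 0.1, 5.0, 0.1, 0.1, 0.1, 0.1, 5.0, 0.1, 0.1]
    print("random:   ", sorted(random_mask(10, 0.3, rng)))
    print("weighted: ", sorted(importance_weighted_mask(imp, 0.3, rng)))
```

Under the weighted scheme the high-importance positions are masked far more frequently than under uniform sampling, which is the behavior the proposed masking design aims for.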
