Presentation Information
[1Yin-B-18]Semantic-aware Masked Modeling of Image and Text for Text-based Person Re-identification
〇Sora Araki1, Shuhei Tarashima2 (1. Tohoku Univ., 2. NTT DOCOMO BUSINESS, Inc.)
Keywords:
Person Re-identification, Multimodal Learning, Computer Vision
Text-based person re-identification aims to retrieve images of a target person using only natural language descriptions as queries. Recently, bidirectional local matching approaches based on Masked Language Modeling and Masked Image Modeling have been proposed to strengthen local correspondences between images and text. However, because mask tokens are selected at random, these methods may allow masked tokens to be completed from same-modality information alone, and may mask tokens that are not essential for person identification. In this study, we introduce a masking strategy that prioritizes semantic regions and tokens important for identity discrimination, and we develop a framework called semantic-aware BiLMa (sBiLMa) to enhance multimodal representation learning for text-based person re-identification. Experimental results demonstrate that the proposed method outperforms existing approaches, confirming the effectiveness of the masking design.
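The core contrast the abstract draws, random masking versus masking weighted toward identity-relevant tokens, can be sketched as follows. This is a minimal illustration, not the authors' sBiLMa implementation: the importance scores are a stand-in for whatever semantic/identity signal (e.g., attention weights over person regions or attribute words) the actual method would use.

```python
import numpy as np

def random_mask(num_tokens, mask_ratio, rng):
    # Baseline: uniform random masking, as in standard MLM/MIM.
    k = max(1, int(num_tokens * mask_ratio))
    return set(rng.choice(num_tokens, size=k, replace=False).tolist())

def importance_weighted_mask(importance, mask_ratio, rng):
    # Hypothetical semantic-aware masking: sample mask positions in
    # proportion to an importance score per token. Tokens judged more
    # relevant to person identity are masked more often, encouraging
    # the model to recover them from the *other* modality rather than
    # from easy same-modality context.
    importance = np.asarray(importance, dtype=float)
    p = importance / importance.sum()
    k = max(1, int(len(importance) * mask_ratio))
    return set(rng.choice(len(importance), size=k, replace=False, p=p).tolist())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy importance scores for 10 text tokens: positions 2 and 7
    # stand for identity-critical attribute words (e.g., "red backpack").
    imp = [0.1, 0.1, 5.0, 0.1, 0.1, 0.1, 0.1, 5.0, 0.1, 0.1]
    print("random:   ", sorted(random_mask(10, 0.3, rng)))
    print("weighted: ", sorted(importance_weighted_mask(imp, 0.3, rng)))
```

Under the weighted scheme the high-importance positions are masked far more frequently than under uniform sampling, which is the behavior the proposed masking design aims for.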
