Presentation Information
[1Yin-B-40]Investigation of Geometric Structures and Entailment Regularizations Suitable for Vision-Language Embeddings
〇Ren Fujie1, Daiki Yoshikawa1, Takashi Matsubara1,2 (1. Hokkaido University, 2. AI Lab, CyberAgent, Inc.)
Keywords:
Vision-Language Representation Learning, similarity measure, hierarchy
Large-scale vision-language models such as CLIP achieve strong performance in classification and retrieval by mapping diverse concepts into embedding vectors. However, they provide no explicit mechanism for representing concept hierarchies or entailment relations. Although many approaches have been proposed, including choices of space curvature, similarity measures, and entailment constraints, systematic comparisons remain insufficient. In this study, we combined these design elements under identical experimental settings and compared their performance on classification, retrieval, and hierarchical classification tasks. Under our experimental settings, the configuration using an inner product with bias terms on a Euclidean space as the similarity measure, without entailment regularizations, consistently achieved strong and stable overall performance.
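The best-performing configuration can be illustrated with a minimal sketch. The function below assumes a parameterization with a learned scalar bias per embedding added to the Euclidean inner product; the exact form used in the study (e.g., a single shared bias versus per-modality biases) may differ.

```python
import numpy as np

def biased_inner_product(img_emb, txt_emb, img_bias, txt_bias):
    """Similarity on a Euclidean space: inner product plus scalar biases.

    Hypothetical parameterization: `img_bias` and `txt_bias` stand in for
    learned per-embedding bias terms; the paper's exact form may differ.
    """
    return float(img_emb @ txt_emb) + img_bias + txt_bias

# Toy example with 3-dimensional embeddings.
img = np.array([0.5, 0.5, 0.5])
txt = np.array([0.5, 0.5, 0.5])
score = biased_inner_product(img, txt, 0.1, -0.05)  # 0.75 + 0.05 = 0.8
```

Unlike a cosine similarity, this score is not scale-invariant, so embedding norms and bias terms can absorb information such as concept generality without requiring a curved (e.g., hyperbolic) embedding space.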
