Presentation Information
[1Yin-B-40]Investigation of Geometric Structures and Entailment Regularizations Suitable for Vision-Language Embeddings
〇Ren Fujie1, Daiki Yoshikawa1, Takashi Matsubara1,2 (1. Hokkaido University, 2. AI Lab, CyberAgent, Inc.)
Keywords:
Vision-Language Representation Learning, similarity measure, hierarchy
Large-scale vision-language models such as CLIP achieve strong performance in classification and retrieval by mapping diverse concepts into embedding vectors. However, they provide no explicit mechanism for representing concept hierarchies or entailment relations. Although many approaches have been proposed, including choices of space curvature, similarity measures, and entailment constraints, systematic comparisons remain insufficient. In this study, we combined these design elements under identical experimental settings and compared their performance on classification, retrieval, and hierarchical classification tasks. Under our experimental settings, the configuration using an inner product with bias terms on a Euclidean space as the similarity measure, without entailment regularizations, consistently achieved strong and stable overall performance.
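The best-performing configuration can be illustrated with a minimal sketch. The function below assumes a parameterization with a learned scalar bias per embedding added to the Euclidean inner product; the exact form used in the study (e.g., a single shared bias versus per-modality biases) may differ.

```python
import numpy as np

def biased_inner_product(img_emb, txt_emb, img_bias, txt_bias):
    """Similarity on a Euclidean space: inner product plus scalar biases.

    Hypothetical parameterization: `img_bias` and `txt_bias` stand in for
    learned per-embedding bias terms; the paper's exact form may differ.
    """
    return float(img_emb @ txt_emb) + img_bias + txt_bias

# Toy example with 3-dimensional embeddings.
img = np.array([0.5, 0.5, 0.5])
txt = np.array([0.5, 0.5, 0.5])
score = biased_inner_product(img, txt, 0.1, -0.05)  # 0.75 + 0.05 = 0.8
```

Unlike a cosine similarity, this score is not scale-invariant, so embedding norms and bias terms can absorb information such as concept generality without requiring a curved (e.g., hyperbolic) embedding space.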
