Presentation Information
[5L3-OS-6b-04] Semantic Manifolds Are Low-Dimensional, But Retrieval Is Not: Ranking Stability in Dense Embeddings
〇Noriyuki Yamamoto1 (1. GIG Intelligence Inc.)
Keywords:
semantic embedding, intrinsic dimension, manifold hypothesis, dense retrieval, retrieval-augmented generation
Embedding representations derived from large language models are widely used for semantic search and retrieval-augmented generation. Although they are often interpreted through the manifold hypothesis—that semantic meaning lies on a low-dimensional manifold—dimensionality reduction is known to degrade retrieval performance.
In this work, we show that this discrepancy arises from a difference between the geometric structure of semantic representations and the intrinsic requirements of retrieval tasks. By combining global dimensionality measures based on the participation ratio with local intrinsic dimension estimators such as TwoNN and Levina–Bickel MLE, we demonstrate that semantic freedom is governed by local geometric properties. We observe phase-transition-like behavior in ranking performance as the embedding dimension is reduced, and show that this phenomenon originates from insufficient resolution to discriminate semantically close items.
Our results provide a unified geometric explanation for why low-dimensional representations can preserve meaning while effective retrieval requires higher-dimensional embeddings, and offer insights for RAG system design.
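The two families of dimensionality measures named above can be illustrated on synthetic data. The sketch below is not the authors' code: it computes the participation ratio (a global measure derived from covariance eigenvalues) and a TwoNN-style local intrinsic-dimension estimate (the MLE based on ratios of first and second nearest-neighbour distances) for points drawn from an assumed 3-dimensional latent space embedded in 64 ambient dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

def participation_ratio(X):
    # Global measure: PR = (sum lam)^2 / (sum lam^2) over the
    # eigenvalues lam of the sample covariance matrix.
    lam = np.clip(np.linalg.eigvalsh(np.cov(X, rowvar=False)), 0.0, None)
    return lam.sum() ** 2 / (lam ** 2).sum()

def twonn_dimension(X):
    # TwoNN-style MLE: d = N / sum_i log(r2_i / r1_i), where r1_i and
    # r2_i are point i's first and second nearest-neighbour distances.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    D.sort(axis=1)          # D[:, 0] is the zero self-distance
    mu = D[:, 2] / D[:, 1]  # ratio of 2nd to 1st NN distance
    return len(X) / np.log(mu).sum()

# Synthetic stand-in for embeddings: 3 latent dims in 64 ambient dims.
Z = rng.standard_normal((500, 3))
A = rng.standard_normal((3, 64))
X = Z @ A + 0.01 * rng.standard_normal((500, 64))

print(f"participation ratio: {participation_ratio(X):.1f}")
print(f"TwoNN intrinsic dim: {twonn_dimension(X):.1f}")
```

Both estimates come out near the latent dimension 3 here; on real embedding models the abstract's claim is that these local estimates stay low even when retrieval needs many more coordinates.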
