Presentation Information

[1Yin-A-60]Bridging the Query-Document Gap: Question-Based Vector Indexing for Retrieval Augmented Generation (RAG)

〇Takehiko Yamaguchi1, Futoshi Iwama1, Mikio Takeuchi1, Mich Tatsubori1 (1. IBM Japan, Ltd.)

Keywords:

Retrieval Augmented Generation (RAG), Question-Based Indexing

Retrieval-Augmented Generation (RAG) systems rely on vector databases to retrieve relevant document chunks that supply out-of-model knowledge to Large Language Models (LLMs) at inference time. However, current approaches suffer from a pragmatic mismatch between user queries, which are typically phrased as questions, and document chunks, which are stored as declarative statements. This mismatch leads to poor retrieval performance and irrelevant context for generation. We propose a novel indexing method that generates possible questions for each document chunk and uses them as vector database keys. Our approach consists of four steps: 1) chunking documents into meaningful pieces, 2) using an LLM to generate questions that each chunk can answer, 3) embedding the questions into a vector space, and 4) indexing chunks with the question embeddings as keys. Experiments on the CLAP Natural Questions dataset are designed to demonstrate that the generated questions lie significantly closer to user queries in the embedding space than the original document chunks do, effectively bridging the pragmatic mismatch and improving retrieval metrics. Our method is particularly suited to offline document processing scenarios where documents are stored once and queried many times.
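The four-step pipeline described above can be sketched in a few lines. The sketch below is illustrative only: `embed` is a toy bag-of-words embedding standing in for a real sentence-embedding model, and `generate_questions` is a placeholder for the LLM question-generation step; both names and the tiny vocabulary are assumptions, not part of the proposed system.

```python
import math

def embed(text):
    # Toy bag-of-words embedding over a fixed vocabulary; a real system
    # would use a sentence-embedding model instead (assumption).
    vocab = ["rag", "retrieval", "question", "index", "chunk", "llm", "vector"]
    tokens = text.lower().split()
    vec = [tokens.count(w) for w in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def generate_questions(chunk):
    # Placeholder for the LLM call that proposes questions answerable by
    # the chunk (step 2); here a single template question is returned.
    return [f"question about {chunk}"]

def cosine(a, b):
    # Vectors are unit-normalized, so the dot product is cosine similarity.
    return sum(x * y for x, y in zip(a, b))

def build_index(chunks):
    # Steps 1-4: each chunk is keyed by the embeddings of its generated
    # questions rather than by its own embedding.
    index = []
    for chunk in chunks:
        for q in generate_questions(chunk):
            index.append((embed(q), chunk))
    return index

def retrieve(index, query, k=1):
    # At query time, the user question is matched against question keys,
    # which live on the same pragmatic side of the query-document gap.
    qv = embed(query)
    ranked = sorted(index, key=lambda entry: cosine(entry[0], qv), reverse=True)
    return [chunk for _, chunk in ranked[:k]]
```

Because the index keys are questions, a question-shaped user query compares like with like, which is the intuition behind the closer embedding-space placement reported in the abstract.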