Presentation Information

[2Yin-A-58] RAG-Boost: Retrieval-Augmented Generation Enhanced Speech Recognition in LLM-based Spoken Dialogue Systems

〇PENGCHENG WANG1, Sheng Li1, Takahiro Shinozaki1 (1. Institute of Science Tokyo)

Keywords:

ASR, RAG, LLM

In recent years, end-to-end speech foundation models (e.g., Whisper) have demonstrated strong performance in multilingual recognition and acoustic modeling. In addition, prior work has applied the powerful contextual reasoning ability of Large Language Models (LLMs) to Automatic Speech Recognition (ASR). However, combining the two still leads to semantic inconsistency and hallucination, particularly in cross-turn and domain-specific dialogues. To address this problem, we propose RAG-Boost, a retrieval-augmented framework for improving LLM-based ASR in complex dialogue scenarios. The framework alleviates hallucination by injecting external knowledge into the reasoning process as needed. Specifically, it uses the speech representation as a query to retrieve relevant evidence and domain terms from a database, and then fuses the retrieved information into the LLM decoding process to correct recognition errors. This design realizes context-aware, knowledge-grounded ASR decoding without modifying the underlying base models, and avoids the error propagation inherent in previous Retrieval-Augmented Generation (RAG) methods that retrieve based on ASR outputs.
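The retrieve-then-fuse step described above can be sketched roughly as follows. This is an illustrative toy, not the authors' implementation: the speech embedding is a placeholder vector, the "database" is a small in-memory table of hypothetical domain-term embeddings, and the fusion is shown as prompt construction for an LLM-based correction pass.

```python
import numpy as np

def retrieve(query_vec, term_vecs, terms, top_k=2):
    """Return the top-k domain terms whose embeddings have the highest
    cosine similarity to the speech-derived query vector."""
    q = query_vec / np.linalg.norm(query_vec)
    t = term_vecs / np.linalg.norm(term_vecs, axis=1, keepdims=True)
    scores = t @ q                      # cosine similarities
    idx = np.argsort(-scores)[:top_k]   # indices of best-matching terms
    return [terms[i] for i in idx]

def build_prompt(hypothesis, retrieved_terms):
    """Fuse retrieved evidence into the prompt an LLM would use to
    correct the first-pass ASR hypothesis (format is illustrative)."""
    context = ", ".join(retrieved_terms)
    return (f"Domain terms: {context}\n"
            f"ASR hypothesis: {hypothesis}\n"
            f"Corrected transcript:")

# Hypothetical 4-d embeddings standing in for real speech/text encoders.
terms = ["Whisper", "beamforming", "diarization"]
term_vecs = np.array([[1, 0, 0, 0],
                      [0, 1, 0, 0],
                      [0, 0, 1, 0]], dtype=float)
query = np.array([0.9, 0.1, 0.0, 0.0])  # placeholder speech embedding

top = retrieve(query, term_vecs, terms, top_k=1)
prompt = build_prompt("whisker model transcribes speech", top)
```

In this toy run, `retrieve` surfaces "Whisper" for the query, and the fused prompt gives the correcting LLM the domain evidence needed to fix the misrecognized "whisker" without any change to the underlying ASR model.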