Presentation Information

[4K5-GS-6c-03] Towards Online and Token-Level Direct Preference Optimization in Machine Translation

〇Yin Zhang1, Takehito Utsuro1, Masaaki Nagata2 (1. University of Tsukuba, 2. NTT Communication Science Laboratories)

Keywords:

Machine Translation, Reinforcement Learning

Direct Preference Optimization (DPO) has recently shown strong performance in aligning large language models with human preferences, but existing approaches are mostly applied offline and at the sequence level. This limits their ability to adapt to dynamic feedback and to capture fine-grained translation errors such as omissions, mistranslations, and local fluency issues. In this work, we propose an online, token-level DPO framework for machine translation. Our method extends standard DPO in two directions: (1) online optimization, where preference data are generated and incorporated during training, enabling continuous model improvement; and (2) token-level preference modeling, which assigns preferences at a finer granularity instead of treating each translation as a single unit. By integrating token-level preference signals into an online DPO pipeline, the model can better learn which local translation choices contribute to overall translation quality. We apply our approach to machine translation tasks and show that it improves adequacy compared with conventional sequence-level and offline DPO methods. Our results suggest that fine-grained, online preference optimization is a promising direction for building more reliable and adaptive machine translation systems.
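To make the two extensions concrete, the sketch below shows what a single update of such an online, token-level DPO loop might look like. It is a minimal sketch under our own assumptions, not the authors' implementation: the callable sample_and_score (which draws a preferred/dispreferred translation pair from the current policy and assigns per-token preference weights, e.g. from a quality-estimation metric), the model interface model(src_ids, tgt_in), and the weighting scheme are all hypothetical. Standard sequence-level DPO maximizes log sigma(beta * [log pi_theta(y_w|x)/pi_ref(y_w|x) - log pi_theta(y_l|x)/pi_ref(y_l|x)]); the sketch replaces each whole-sequence log-ratio with a weighted sum of per-token log-ratios.

    import torch
    import torch.nn.functional as F

    def token_logprobs(model, src_ids, tgt_in, tgt_out):
        # Per-token log-probabilities log pi(y_t | x, y_<t) for a seq2seq
        # model assumed to return logits of shape (batch, tgt_len, vocab).
        logits = model(src_ids, tgt_in)                      # assumed interface
        logp = F.log_softmax(logits, dim=-1)
        return logp.gather(-1, tgt_out.unsqueeze(-1)).squeeze(-1)  # (B, T)

    def online_token_dpo_step(policy, ref_policy, optimizer, src_ids,
                              sample_and_score, beta=0.1):
        # (1) Online: preference data are generated during training.
        #     sample_and_score is a hypothetical user-supplied callable that
        #     samples two translations from the *current* policy, decides
        #     which is preferred, and returns per-token weights w_c / w_r
        #     (e.g. higher weight on tokens flagged as omissions or
        #     mistranslations, lower weight elsewhere).
        (chosen_in, chosen_out, rej_in, rej_out,
         w_c, w_r) = sample_and_score(policy, src_ids)

        # (2) Token-level log-ratios against a frozen reference model.
        with torch.no_grad():
            ref_c = token_logprobs(ref_policy, src_ids, chosen_in, chosen_out)
            ref_r = token_logprobs(ref_policy, src_ids, rej_in, rej_out)
        pol_c = token_logprobs(policy, src_ids, chosen_in, chosen_out)
        pol_r = token_logprobs(policy, src_ids, rej_in, rej_out)

        # Weighted per-token sums replace the whole-sequence log-ratio of
        # standard sequence-level DPO.
        margin = beta * ((w_c * (pol_c - ref_c)).sum(-1)
                         - (w_r * (pol_r - ref_r)).sum(-1))
        loss = -F.logsigmoid(margin).mean()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

With uniform weights w_c = w_r = 1 the loss reduces to standard sequence-level DPO, so the token-level weighting is the only change to the objective itself; the online aspect lies in regenerating the preference pair from the current policy at every step rather than reading it from a fixed offline dataset.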