Presentation Information

[2F6-OS-19b-03] Verifying the Reduction of Reward Loss through Sequential LLM Selection with a Model Router

〇Shoichi Taguchi1, Hyakka Nakada1, Tatsunosuke Shimada1 (1. Recruit Co., Ltd.)

Keywords:

model router, bandit algorithm, LLM

Large Language Models (LLMs) are now integrated into many commercial services, where their outputs can significantly influence rewards such as user conversion. Maintaining high rewards requires continuously selecting the model best suited to the service. This is difficult for several reasons: the number of available models is large, their characteristics are diverse, performance evolves rapidly, End-of-Life (EOL) cycles are short, and it is hard to choose a selection method that mitigates reward loss. To address these issues, we developed a "Model Router" that uses bandit algorithms to pursue reward-optimal LLM selection. The mechanism preferentially and continuously selects models that yield high rewards from a pool of candidates. Experiments conducted in a production environment indicated that this approach has the potential to reduce reward loss compared with conventional selection processes that rely on A/B testing.
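
As an illustration of how a bandit-based router can favor high-reward models while still exploring the others, the following is a minimal sketch using Beta-Bernoulli Thompson sampling. The abstract does not state which bandit algorithm the Model Router uses, so Thompson sampling is an assumption here, as are the class name `ModelRouter`, the model names, the binary reward (1 for a conversion, 0 otherwise), and the simulated conversion rates. None of these values come from the reported experiments.

```python
import random


class ModelRouter:
    """Hypothetical router: Beta-Bernoulli Thompson sampling over candidate models."""

    def __init__(self, model_names, seed=None):
        self.rng = random.Random(seed)
        # Beta(1, 1) prior per model, stored as [alpha, beta].
        self.params = {name: [1, 1] for name in model_names}

    def select(self):
        # Sample a plausible conversion rate for each model from its posterior
        # and route the request to the model with the highest sample.
        draws = {
            name: self.rng.betavariate(alpha, beta)
            for name, (alpha, beta) in self.params.items()
        }
        return max(draws, key=draws.get)

    def update(self, name, reward):
        # reward is assumed binary: 1 (converted) or 0 (did not convert).
        if reward:
            self.params[name][0] += 1
        else:
            self.params[name][1] += 1


def simulate(true_rates, rounds=5000, seed=0):
    """Run the router against simulated (not production) conversion rates."""
    router = ModelRouter(list(true_rates), seed=seed)
    env = random.Random(seed + 1)
    counts = {name: 0 for name in true_rates}
    for _ in range(rounds):
        name = router.select()
        reward = 1 if env.random() < true_rates[name] else 0
        router.update(name, reward)
        counts[name] += 1
    return counts


if __name__ == "__main__":
    # Made-up conversion rates for three hypothetical models.
    print(simulate({"model_a": 0.05, "model_b": 0.10, "model_c": 0.30}))
```

In this kind of setup, traffic shifts toward the better-performing model as evidence accumulates, whereas a fixed-split A/B test keeps sending an equal share of requests to weaker models until the test ends. That difference is one plausible source of the reduced reward loss the abstract describes.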
