Presentation Information
[5M1-GS-2b-06] Evaluating SFT-free RL Post-training with GRPO for Japanese Large Language Models (Multi-objective Evaluation of R1-Zero-like Post-training Methods for Japanese LLMs)
〇Naoya Tsuji1 (1. KADOKAWA DWANGO Educational Institute S High School)
Keywords:
Reinforcement Learning, Post-Training, LLM
Supervised Fine-Tuning (SFT) of large language models (LLMs) can induce substantial parameter updates and may lead to output distribution distortion, reduced diversity, and catastrophic forgetting. We investigate SFT-free post-training using Group Relative Policy Optimization (GRPO), i.e., R1-Zero-like training, applied to llm-jp-3-13b, a Japanese-centric LLM pre-trained on approximately 2.1 trillion tokens. Our setup employs external reward model scores and multi-component reward shaping, without SFT initialization or SFT loss mixing. Since the reward model is itself trained on supervised data, our claim is limited to the omission of the SFT stage. While existing R1-Zero-like reproductions typically assume large-scale foundations such as Llama 3 (15T+ tokens) or Qwen 3 (36T tokens), this study examines whether R1-Zero-like training remains viable at a substantially smaller pre-training scale. On ELYZA-tasks-100, SFT achieves 2.66 ± 0.08 while R1-Zero-like training achieves 2.42 ± 0.10. However, R1-Zero-like training preserves the output distribution far better: the Jensen-Shannon divergence (JSD) is 0.1272/0.3456 (unigram/bigram) for BASE → R1-Zero versus 0.3342/0.6575 for BASE → SFT. Ablation analysis reveals that reward shaping combined with KL regularization is essential for maintaining output quality.
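To make the R1-Zero-like recipe concrete, the following minimal sketch shows how a multi-component shaped reward and GRPO's group-relative advantages might be combined with a KL penalty toward the frozen base model. The reward components, the weights `w_rm`, `w_fmt`, `w_len`, and the KL coefficient `beta` are illustrative assumptions, not the values used in this work.

```python
import numpy as np

def shaped_reward(rm_score, format_ok, length_penalty,
                  w_rm=1.0, w_fmt=0.2, w_len=0.1):
    """Combine an external reward-model score with auxiliary shaping terms.
    The components and weights are hypothetical, for illustration only."""
    return w_rm * rm_score + w_fmt * float(format_ok) - w_len * length_penalty

def grpo_advantages(rewards):
    """Group-relative advantages: standardize the shaped rewards of the
    completions sampled for the same prompt (GRPO's critic-free baseline)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

def grpo_token_loss(logp_policy, logp_ref, advantage, beta=0.04):
    """Per-token policy-gradient loss with a KL penalty toward the frozen
    reference (base) model, using the k3 KL estimator common in GRPO setups."""
    kl = np.exp(logp_ref - logp_policy) - (logp_ref - logp_policy) - 1.0
    return -(advantage * logp_policy) + beta * kl

# Example: four completions sampled for one prompt.
rewards = [shaped_reward(0.8, True, 0.1), shaped_reward(0.5, True, 0.3),
           shaped_reward(0.9, False, 0.2), shaped_reward(0.3, True, 0.0)]
print(grpo_advantages(rewards))
```

The distribution-preservation metric can be sketched in a similar spirit: the snippet below computes the base-2 Jensen-Shannon divergence between the unigram or bigram distributions of two sets of model outputs. Whitespace tokenization and the placeholder variables `base_outputs` / `tuned_outputs` are simplifying assumptions, not the paper's exact procedure.

```python
from collections import Counter
import numpy as np

def ngram_dist(texts, n):
    """Empirical n-gram distribution over a set of generated outputs
    (whitespace tokenization is a simplification)."""
    counts = Counter()
    for t in texts:
        toks = t.split()
        counts.update(zip(*(toks[i:] for i in range(n))))
    total = sum(counts.values()) or 1
    return {g: c / total for g, c in counts.items()}

def jsd(p, q):
    """Jensen-Shannon divergence (base 2) between two n-gram distributions."""
    keys = sorted(set(p) | set(q))
    P = np.array([p.get(k, 0.0) for k in keys])
    Q = np.array([q.get(k, 0.0) for k in keys])
    M = 0.5 * (P + Q)
    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))
    return 0.5 * kl(P, M) + 0.5 * kl(Q, M)

# base_outputs and tuned_outputs would be lists of generations on the same prompts.
# print(jsd(ngram_dist(base_outputs, 1), ngram_dist(tuned_outputs, 1)))  # unigram JSD
```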
