Presentation Information
[4Yin-B-44]OCR2Corpus: Structured Corpus Construction from OCR Text for Large Language Model Training
〇Takuro Fujii1, Masato Fujitake1 (1. Fast Accounting Co., Ltd.)
Keywords:
Corpus, OCR, Text Cleaning
OCR text obtained from scanned documents contains the long-form, structured text commonly found in specifications, vouchers, and technical reports, and thus serves as an important data source that complements web text for training large language models (LLMs). However, existing training corpora are predominantly web-centric, and OCR text from scanned documents has not been fully exploited due to recognition errors and unstable document structures. In this paper, we propose OCR2Corpus, a two-stage pipeline that transforms OCR text into a high-quality training corpus. The proposed method consists of (1) a preprocessing stage that removes OCR noise and converts the text into Markdown, and (2) a Refiner that standardizes Markdown documents in a document-type-agnostic manner. The Refiner is trained on paired data of clean Markdown and synthetically noised Markdown generated from Wikipedia articles. Our approach enables effective utilization of OCR text as a training resource for LLMs.
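The synthetic-noising step described above can be sketched as follows. This is a minimal illustration, not the authors' recipe: the specific corruption operations (character confusions, dropped heading markers) and the function names are assumptions, chosen only to show how (noisy, clean) Markdown pairs for training a Refiner could be generated from clean articles.

```python
import random

# Illustrative OCR-style character confusions (assumed, not from the paper)
CONFUSIONS = {"l": "1", "O": "0", "rn": "m"}

def add_ocr_noise(markdown: str, rng: random.Random) -> str:
    """Corrupt clean Markdown with OCR-like substitutions and lost markup."""
    noisy = markdown
    for src, dst in CONFUSIONS.items():
        if rng.random() < 0.5:      # apply each confusion stochastically
            noisy = noisy.replace(src, dst)
    # OCR often loses structural markers such as heading hashes
    if rng.random() < 0.5:
        noisy = noisy.replace("# ", "")
    return noisy

def make_pairs(clean_docs, seed=0):
    """Build (noisy, clean) pairs; the clean side is the Refiner's target."""
    rng = random.Random(seed)
    return [(add_ocr_noise(doc, rng), doc) for doc in clean_docs]
```

In this framing, the Refiner is a sequence-to-sequence model trained to map the noisy side of each pair back to the clean side, so that at inference time it can standardize real OCR-derived Markdown regardless of document type.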
