Presentation Information
[4Yin-B-44]OCR2Corpus: Structured Corpus Construction from OCR Text for Large Language Model Training
〇Takuro Fujii1, Masato Fujitake1 (1. Fast Accounting Co., Ltd.)
Keywords:
Corpus, OCR, Text Cleaning
OCR text obtained from scanned documents contains the long-form, structured text commonly found in specifications, vouchers, and technical reports, and thus serves as an important data source that complements web text for training large language models (LLMs). However, existing training corpora are predominantly web-centric, and OCR text from scanned documents has not been fully exploited due to recognition errors and unstable document structures. In this paper, we propose OCR2Corpus, a two-stage pipeline that transforms OCR text into a high-quality training corpus. The proposed method consists of (1) a preprocessing stage that removes OCR noise and converts the text into Markdown, and (2) a Refiner that standardizes Markdown documents in a document-type-agnostic manner. The Refiner is trained on paired data of clean Markdown and synthetically noised Markdown generated from Wikipedia articles. Our approach enables effective utilization of OCR text as a training resource for LLMs.
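The synthetic-noising step described above can be sketched as follows. This is a minimal illustration, not the authors' recipe: the specific corruption operations (character confusions, dropped heading markers) and the function names are assumptions, chosen only to show how (noisy, clean) Markdown pairs for training a Refiner could be generated from clean articles.

```python
import random

# Illustrative OCR-style character confusions (assumed, not from the paper)
CONFUSIONS = {"l": "1", "O": "0", "rn": "m"}

def add_ocr_noise(markdown: str, rng: random.Random) -> str:
    """Corrupt clean Markdown with OCR-like substitutions and lost markup."""
    noisy = markdown
    for src, dst in CONFUSIONS.items():
        if rng.random() < 0.5:      # apply each confusion stochastically
            noisy = noisy.replace(src, dst)
    # OCR often loses structural markers such as heading hashes
    if rng.random() < 0.5:
        noisy = noisy.replace("# ", "")
    return noisy

def make_pairs(clean_docs, seed=0):
    """Build (noisy, clean) pairs; the clean side is the Refiner's target."""
    rng = random.Random(seed)
    return [(add_ocr_noise(doc, rng), doc) for doc in clean_docs]
```

In this framing, the Refiner is a sequence-to-sequence model trained to map the noisy side of each pair back to the clean side, so that at inference time it can standardize real OCR-derived Markdown regardless of document type.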
