Presentation Information

[2F6-OS-19b-04] Design and Preliminary Study of an Evaluation Benchmark for Vision–Language Models in the Fashion Domain for Business Deployment

Sai HtaungKham1, Yuki Shimizu2, Shion Sakurai2, Hayato Tomita2, Hokuto Sasaki2, Sho Totsuka2, Aya Kubori2, Hina Morimoto2, Taiki Miyazono2, 〇Ryotaro Shimizu1 (1. ZOZO Research, 2. ZOZO, Inc.)

Keywords:

Vision-Language Models, Benchmark, Attribute Extraction

This paper designs a benchmark for evaluating, from a business perspective, how suitable Vision-Language Models (VLMs) are for practical deployment in the fashion domain. Existing VLM evaluations are biased toward general object and scene understanding. Performance on fashion-specific elements such as color, pattern, material, and style remains insufficiently verified, as does performance on information extraction from unstructured images, a task directly relevant to e-commerce operations. In this work, we divide input images into two streams, full-body outfit images and single-item images, define a set of tasks including attribute extraction and tagging, and propose a framework for evaluating multiple VLMs under identical conditions. Our preliminary experiments confirm that model strengths and weaknesses diverge markedly across tasks and that model-specific error patterns persist even under different prompts. These results demonstrate that use-case-specific model selection, prompt robustness verification, and continuous monitoring across model updates are essential for operational design aligned with required quality standards.
