The 40th Annual Conference of the Japanese Society for Artificial Intelligence, 2026

Presentation Information

6:15 PM - 6:30 PM JST (9:15 AM - 9:30 AM UTC)

[2F6-OS-19b-04] Design and Preliminary Study of an Evaluation Benchmark for Vision–Language Models in the Fashion Domain for Business Deployment

Sai HtaungKham¹, Yuki Shimizu², Shion Sakurai², Hayato Tomita², Hokuto Sasaki², Sho Totsuka², Aya Kubori², Hina Morimoto², Taiki Miyazono², 〇Ryotaro Shimizu¹ (1. ZOZO Research, 2. ZOZO, Inc.)

Keywords:

Vision-Language Models, Benchmark, Attribute Extraction

This paper designs an evaluation benchmark for assessing, from a business perspective, how suitable Vision-Language Models (VLMs) are for practical deployment in the fashion domain. Existing VLM evaluations are biased toward general object and scene understanding. Performance on fashion-specific elements such as color, pattern, material, and style remains insufficiently verified, as does performance on tasks that extract information from unstructured images and bear directly on e-commerce operations. In this work, we divide input images into two streams, full-body outfit images and single-item images, define a set of tasks including attribute extraction and tagging, and propose a framework for evaluating multiple VLMs under identical conditions. Our preliminary experiments confirm that model strengths and weaknesses diverge markedly across tasks and that model-specific error patterns persist even when the prompt is changed. These results show that use-case-specific model selection, prompt robustness verification, and continuous monitoring across model updates are essential for designing operations that meet required quality standards.
