The 40th Annual Conference of the Japanese Society for Artificial Intelligence, 2026

Presentation Information

[1Yin-A-10] Context-Aware Basketball Highlight Generation using Large Video-Language Models and Hierarchical Feature Extraction

〇Naoya Matsuo¹, Toshiaki Sota¹, Kaede Shindo², Yosuke Inoue² (1. IBM Japan Systems Engineering Co., Ltd., 2. IBM Japan Co., Ltd.)

Keywords: Multimodal Learning, Highlight Generation, Sports Analytics

Extracting highlights from long sports broadcasts still requires substantial manual effort. Large Multimodal Models (LMMs) can now analyze video directly, but on long-duration content they tend to omit scenes and return redundant clips. To address this, we present a hierarchical pipeline that combines YAMNet-based audio event detection with Gemini’s multimodal capabilities. Acoustic triggers such as cheering serve as temporal anchors for coarse segment selection, and the LMM then trims each selected segment. Our method achieved a 38% increase in highlight yield and a substantial reduction in duplication. It reached 50% recall against official highlights, compared with 40.6% for standalone LMM extraction, and was notably successful at detecting high-impact moments such as dunks. The results also show a minor loss of descriptive precision caused by context fragmentation. Overall, the study indicates that audio-visual hierarchies are effective for robust, large-scale video analytics.
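The coarse selection stage and the recall measure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that an audio event detector has already produced one cheering score per audio frame, and every name and default value in it (`coarse_segments`, `recall`, `hop_sec`, `threshold`, `pad_before`, `pad_after`) is hypothetical. The overlap-based matching rule in `recall` is likewise an assumption, since the abstract does not state how predicted clips were matched against official highlights. The LMM trimming stage is omitted because the abstract gives no prompt or API details.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)


def coarse_segments(
    cheer_scores: Sequence[float],
    hop_sec: float = 0.5,       # assumed spacing between score frames
    threshold: float = 0.5,     # assumed score cutoff for a "cheering" frame
    pad_before: float = 10.0,   # assumed context kept before the trigger
    pad_after: float = 3.0,     # assumed context kept after the trigger
) -> List[Segment]:
    """Turn per-frame cheering scores into padded, merged candidate windows."""
    windows: List[Segment] = []
    for i, score in enumerate(cheer_scores):
        if score >= threshold:
            t = i * hop_sec
            windows.append((max(0.0, t - pad_before), t + pad_after))

    # Windows are already ordered by start time, so overlapping or touching
    # windows can be merged in a single pass to avoid duplicate candidates.
    merged: List[Segment] = []
    for start, end in windows:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def recall(predicted: Sequence[Segment], reference: Sequence[Segment]) -> float:
    """Fraction of reference segments overlapped by at least one prediction."""
    if not reference:
        return 0.0
    hits = sum(
        any(p_start < r_end and r_start < p_end for p_start, p_end in predicted)
        for r_start, r_end in reference
    )
    return hits / len(reference)


if __name__ == "__main__":
    scores = [0.0, 0.0, 0.9, 0.8, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.7]
    candidates = coarse_segments(scores, hop_sec=1.0, pad_before=2.0, pad_after=1.0)
    print(candidates)
    print(recall(candidates, [(3.5, 5.0), (6.0, 7.0)]))
```

In a full pipeline, each candidate window would then be passed to the LMM for trimming and description. Merging windows before that step is one plausible way to limit the duplication that the abstract reports for standalone extraction.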
