Presentation Information
[3Yin-A-21] A Multi-Persona AI Music Critique System with Real-time Performance Analysis and RAG
〇Kiwamu Sato1 (1. Accenture Japan Ltd)
Keywords:
Music Critique Generation, Retrieval-Augmented Generation, Real-time Audio Analysis
We present Otoprism, a multi-persona AI music critique system that integrates real-time acoustic analysis with
retrieval-augmented generation (RAG) to support understanding of expressive performance during live music listening.
The system analyzes incoming audio on a client device, aggregates acoustic features over a sliding time
window, and converts them into natural-language descriptions across five categories—tempo, dynamics, timbre,
harmony, and expression—via a rule-based module called FeatureToNLP. These descriptions serve as queries to
retrieve relevant human critiques from a performance-criticism corpus; the retrieved critiques are injected into LLM prompts to
generate short persona-specific critiques at approximately 40-second intervals. As pilot validation, we compared
generated critiques for two contrasting performances of the same string-quartet piece and confirmed, across three
independent runs under fixed conditions, that the generated text consistently exhibits vocabulary differences aligned
with the intended expressive contrast.
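
The pipeline described above (sliding-window aggregation, rule-based description, retrieval, prompt construction) can be illustrated with a minimal Python sketch. This is not the Otoprism implementation: the class and function names, the feature names (`bpm`, `rms_db`), the thresholds, the 40-second window length, and the token-overlap retrieval are all assumptions made for illustration, and only two of the five categories are covered. The abstract does not specify how FeatureToNLP or the retriever work internally.

```python
import re
from collections import deque
from statistics import mean

WINDOW_SECONDS = 40.0  # assumed window length; the abstract only gives the critique interval


class SlidingWindow:
    """Keeps (timestamp, features) frames from the last `span` seconds."""

    def __init__(self, span=WINDOW_SECONDS):
        self.span = span
        self.frames = deque()

    def push(self, t, features):
        self.frames.append((t, features))
        # Drop frames that are older than `span` seconds relative to the newest one.
        while self.frames and t - self.frames[0][0] > self.span:
            self.frames.popleft()

    def aggregate(self):
        """Average each feature over the frames currently in the window."""
        keys = self.frames[0][1].keys()
        return {k: mean(f[k] for _, f in self.frames) for k in keys}


def feature_to_text(agg):
    """Hypothetical rule-based mapping; thresholds are illustrative only."""
    if agg["bpm"] >= 120:
        tempo = "fast tempo"
    elif agg["bpm"] < 80:
        tempo = "slow tempo"
    else:
        tempo = "moderate tempo"
    dynamics = "loud dynamics" if agg["rms_db"] >= -20 else "soft dynamics"
    return {"tempo": tempo, "dynamics": dynamics}


def tokens(text):
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query, corpus, k=1):
    """Stand-in retriever: rank critiques by word overlap with the query."""
    q = tokens(query)
    ranked = sorted(corpus, key=lambda doc: len(q & tokens(doc)), reverse=True)
    return ranked[:k]


def build_prompt(persona, descriptions, critiques):
    """Assemble a persona-specific prompt with retrieved critiques injected."""
    observed = "; ".join(f"{k}: {v}" for k, v in descriptions.items())
    refs = "\n".join(f"- {c}" for c in critiques)
    return (
        f"You are a music critic with the persona: {persona}.\n"
        f"Observed performance: {observed}\n"
        f"Reference critiques:\n{refs}\n"
        "Write a short critique of the current passage."
    )
```

A usage sketch under the same assumptions: push per-frame features into `SlidingWindow`, call `aggregate()` when a critique is due, pass the result through `feature_to_text`, join the descriptions into a query for `retrieve`, and send the string from `build_prompt` to an LLM once per persona. A real system would likely replace the word-overlap retriever with embedding-based search.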
