Presentation Information

[1Yin-A-46]Layer-wise Analysis of Prosodic and Phonemic Information Acquisition in Self-Supervised Speech Models

〇Masaru Tanibata1, Shun Takahashi1, Hiroki Ouchi1, Sakriani Sakti1 (1. Nara Institute of Science and Technology)

Keywords:

knowledge extraction

We investigate when and where self-supervised speech models acquire phonemic and prosodic information during pretraining. Using HuBERT-Base, we perform a unified, time-resolved, layer-wise probing study across checkpoints saved every 50k steps, covering both pretraining stages. From every Transformer layer, we extract frame-level representations and train lightweight MLP probes on top of the frozen encoder. Segmental information is evaluated by phoneme classification on TIMIT (38 classes), while suprasegmental information is assessed by prosody classification—pitch, energy, and tempo—on TextrolSpeech (three-level labels). Results reveal two distinct acquisition patterns. Pitch and energy are consistently decodable from lower layers throughout training, with relatively small variation across checkpoints. In contrast, phoneme and tempo exhibit a pronounced re-localization of information: the most informative layers shift from early-to-mid layers in stage 1 to higher layers after the transition to stage 2, accompanied by marked improvements in upper-layer accuracy. Moreover, phoneme probing performance in later layers continues to increase up to 650k steps, suggesting remaining learning capacity beyond this point. These findings clarify stage-wise dynamics in HuBERT representations and inform layer selection for downstream tasks.
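The probing recipe above—train a small MLP on frozen per-layer features and compare held-out classification accuracy across layers—can be sketched as follows. This is a minimal illustration, not the authors' code: the synthetic features stand in for HuBERT-Base hidden states (in the study these would come from the frozen encoder), and the probe architecture, hidden size, and optimizer are assumptions.

```python
# Layer-wise probing sketch: a lightweight MLP probe per layer, trained on
# frame-level features from a frozen encoder. Here the "layers" are synthetic
# stand-ins (in the paper: HuBERT-Base hidden states); only the probing logic
# is illustrated.
import numpy as np

rng = np.random.default_rng(0)

def train_mlp_probe(X, y, n_classes, hidden=32, lr=0.1, epochs=200):
    """One-hidden-layer MLP probe, trained with full-batch gradient descent
    on a softmax cross-entropy loss. The encoder stays frozen: only the
    probe's parameters are updated."""
    n, d = X.shape
    W1 = rng.normal(0, 0.1, (d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.1, (hidden, n_classes)); b2 = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]  # one-hot targets
    for _ in range(epochs):
        h = np.maximum(X @ W1 + b1, 0)                 # ReLU hidden layer
        logits = h @ W2 + b2
        p = np.exp(logits - logits.max(1, keepdims=True))
        p /= p.sum(1, keepdims=True)                   # softmax
        g = (p - Y) / n                                # dLoss/dlogits
        W2 -= lr * (h.T @ g); b2 -= lr * g.sum(0)
        gh = (g @ W2.T) * (h > 0)                      # backprop through ReLU
        W1 -= lr * (X.T @ gh); b1 -= lr * gh.sum(0)
    return W1, b1, W2, b2

def probe_accuracy(params, X, y):
    W1, b1, W2, b2 = params
    h = np.maximum(X @ W1 + b1, 0)
    return float(((h @ W2 + b2).argmax(1) == y).mean())

# Synthetic stand-in features: one "layer" carries class information
# (class-dependent means), the other does not.
y = rng.integers(0, 3, 600)
class_means = rng.normal(0, 1, (3, 16)) * 2
layers = {
    0: rng.normal(0, 1, (600, 16)),                  # no class signal
    2: rng.normal(0, 1, (600, 16)) + class_means[y], # class-informative
}

# Train a probe per layer, evaluate on held-out frames, compare layers.
acc = {}
for layer, X in layers.items():
    params = train_mlp_probe(X[:500], y[:500], n_classes=3)
    acc[layer] = probe_accuracy(params, X[500:], y[500:])
```

Comparing `acc` across layers localizes the information: the layer whose frozen features give the best probe accuracy is where that property is most linearly accessible, which is the quantity tracked across checkpoints in the study.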