Citation
Hu, Qiong and Azmi Murad, Masrah Azrifah and Azman, Azreen and Nasharuddin, Nurul Amelina
(2026)
Target-conditioned Triple-Path Consistency for distributional music emotion regression.
Knowledge-Based Systems, 336.
art. no. 115317.
pp. 1-14.
ISSN 0950-7051
Abstract
Music Emotion Recognition systems require nuanced representations that capture emotional mixtures, a task where discrete tags or two-dimensional valence–arousal coordinates often fall short. We present Triple-Path Consistency (TPC), a target-conditioned training framework for learning emotion distributions from audio. Our implementation, TPCNet, employs a compact CNN–BiLSTM front-end with cross-attention and an encoder–decoder backbone supporting three coordinated paths: a prediction path generating logits from audio features, a target path decoding ground-truth distributions into hierarchical feature anchors, and a consistency path that re-encodes these anchors to enforce multi-level alignment. This triangular consistency constraint ensures semantic coherence throughout the network without requiring external teachers. We optimize Kullback–Leibler divergence for distributional labels and compare it with Mean Squared Error for valence–arousal regression. Experiments on four benchmarks—S9k, CAL500, MTG-Jamendo, and PMEmo—demonstrate competitive or state-of-the-art performance in distributional shape agreement, as measured by Concordance Correlation Coefficient and Spearman correlation. These results are achieved with the TPC backbone adding only 0.26 million trainable parameters to a compact 19.39M-parameter system, enabling lightweight deployment. Statistical significance tests confirm that TPC's advantage lies in modeling structural integrity rather than point-wise accuracy. The results establish TPC as a practical framework for affect-aware multimedia systems.
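The abstract names two quantities without defining them: the Kullback-Leibler divergence used as the training objective for distributional labels, and the Concordance Correlation Coefficient used for evaluation. The following is a minimal pure-Python sketch of both, not the authors' implementation. The function names, the softmax normalisation of the prediction-path logits, and the example numbers are all illustrative assumptions.

```python
import math


def softmax(logits):
    """Turn raw logits into a probability distribution (assumed normalisation)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for two discrete distributions of equal length.

    Terms where the target probability is zero contribute nothing;
    predicted probabilities are floored at eps to avoid log(0).
    """
    if len(target) != len(predicted):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for t, p in zip(target, predicted):
        if t > 0.0:
            total += t * math.log(t / max(p, eps))
    return total


def concordance_cc(x, y):
    """Concordance Correlation Coefficient between two equal-length sequences.

    CCC = 2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2),
    using population (divide-by-n) moments.
    """
    if len(x) != len(y) or not x:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    denom = var_x + var_y + (mean_x - mean_y) ** 2
    if denom == 0.0:
        # Both sequences are the same constant; treating this as perfect
        # agreement is a convention chosen for this sketch.
        return 1.0
    return 2.0 * cov / denom


# Hypothetical example: a 4-class emotion distribution and made-up logits.
target = [0.5, 0.3, 0.15, 0.05]
predicted = softmax([2.0, 1.2, 0.4, -0.5])
loss = kl_divergence(target, predicted)
agreement = concordance_cc(target, predicted)
```

Unlike Pearson correlation, CCC also penalises shifts in mean and scale between prediction and target, which is why it is commonly read as an agreement measure rather than a pure association measure.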