Multimodal sentiment analysis is an emerging field of artificial intelligence. The predominant approaches have made notable progress by designing sophisticated fusion architectures that explore interactions between modalities. However, these works tend to apply a uniform optimization strategy to every modality, so that only sub-optimal unimodal representations are obtained for multimodal fusion. To address this issue, we propose a novel meta-learning based paradigm that retains the advantages of unimodal learning and further boosts the performance of multimodal fusion. Specifically, we introduce Adaptive Multimodal Meta-Learning (AMML), which meta-learns the unimodal networks and adapts them for multimodal inference. AMML can (1) obtain better-optimized unimodal representations via meta-training on unimodal tasks, adaptively adjusting the learning rate to assign a modality-specific optimization procedure to each modality; and (2) adapt the optimized unimodal representations for multimodal fusion via meta-testing on multimodal tasks. Because multimodal fusion often suffers from distributional mismatches between the features of different modalities, owing to the heterogeneous nature of the signals, we further apply a distribution transformation layer to the unimodal representations to regularize their distributions. In this way, the distribution gaps are reduced and a better fusion effect is achieved. Extensive experiments on two widely used datasets demonstrate that AMML achieves state-of-the-art performance.
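To make the paradigm concrete, the PyTorch-style sketch below illustrates one possible AMML-style training step under stated assumptions; it is not the authors' implementation. The unimodal encoders, prediction heads, fusion network, per-modality learnable learning rates (`lrs`), and the simple affine re-normalization used as the distribution transformation layer are all hypothetical placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DistributionTransform(nn.Module):
    """Hypothetical distribution transformation layer: standardizes unimodal
    features and applies a learnable affine map before fusion (a sketch only)."""
    def __init__(self, dim):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        # standardize per feature dimension, then re-scale and shift
        x = (x - x.mean(dim=0, keepdim=True)) / (x.std(dim=0, keepdim=True) + 1e-6)
        return self.gamma * x + self.beta


def amml_step(encoders, heads, transforms, fusion, lrs, batch, meta_opt):
    """One illustrative AMML-style step: meta-train each unimodal network on its
    own sentiment task with an adaptive (learnable) per-modality learning rate,
    then meta-test the adapted representations through multimodal fusion.
    `batch` = (inputs: dict of modality -> tensor, label)."""
    inputs, label = batch

    adapted = {}
    for m, x in inputs.items():
        # --- meta-training (inner loop): unimodal task, modality-specific lr ---
        feat = encoders[m](x)
        uni_loss = F.l1_loss(heads[m](feat), label)
        grads = torch.autograd.grad(uni_loss, list(encoders[m].parameters()),
                                    create_graph=True)
        # fast-weight update; lrs[m] is a learnable scalar for this modality
        adapted[m] = [p - lrs[m] * g for p, g in zip(encoders[m].parameters(), grads)]

    # --- meta-testing (outer loop): fuse adapted unimodal representations ---
    feats = []
    for m, x in inputs.items():
        names = [n for n, _ in encoders[m].named_parameters()]
        feat = torch.func.functional_call(encoders[m], dict(zip(names, adapted[m])), (x,))
        feats.append(transforms[m](feat))  # reduce distribution gaps before fusion
    multi_loss = F.l1_loss(fusion(torch.cat(feats, dim=-1)), label)

    # outer update flows through the inner step, so the per-modality learning
    # rates, encoders, transforms, and fusion network are all meta-optimized
    meta_opt.zero_grad()
    multi_loss.backward()
    meta_opt.step()
    return multi_loss.item()
```

In this reading of the abstract, the adaptive learning rates receive gradients from the multimodal (meta-test) loss because the inner unimodal update is kept in the computation graph, which is what lets each modality follow its own optimization procedure rather than a uniform one.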