
Musical Word Embedding for Music Tagging and Retrieval

2024 • 2 Citations
Seungheon Doh, Jongpil Lee, Dasaem Jeong
arXiv

TL;DR: The effectiveness of musical supervision varies by task: tag-level supervision improves tagging performance while track-level supervision enhances retrieval performance, suggesting that the choice of musical supervision in representation learning should be guided by the target task.

Abstract

Word embedding has become an essential means for text-based information retrieval. Typically, word embeddings are learned from large quantities of general, unstructured text data. In the music domain, however, general-purpose word embeddings may struggle with musical contexts or fail to recognize music-related entities such as artists and tracks. To address this issue, we propose a new approach called Musical Word Embedding (MWE), which is learned from various types of text spanning both everyday and music-related vocabulary. We integrate MWE into an audio-word joint representation framework for tagging and retrieving music, using words with different levels of musical specificity, such as tags, artists, and tracks. Through extensive experiments, we demonstrate that the effectiveness of musical supervision varies by task: tag-level supervision improves tagging performance, while track-level supervision enhances retrieval performance. This finding suggests that the choice of musical supervision in representation learning needs to be carefully considered based on the target task. To balance this trade-off, we propose multi-prototype training, which jointly uses words with different levels of musical specificity. We evaluate both the word embedding and the audio-word joint embedding on four tasks (tag rank prediction, music tagging, query-by-tag, and query-by-track) across two datasets (Million Song Dataset and MTG-Jamendo). The results show that the proposed MWE is more efficient and effective than conventional word embeddings on both in-domain and out-of-domain datasets.
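The first idea, a single embedding space trained on both everyday text and music-specific co-occurrences, can be illustrated with a minimal sketch. The corpus construction and the artist_* / track_* token formats below are assumptions made for illustration, not the authors' released pipeline, and gensim's skip-gram Word2Vec stands in for whatever embedding model the paper actually uses.

# Minimal sketch (not the authors' code): train one word embedding over a
# corpus mixing general text with music "sentences" built from tags,
# artist names, and track identifiers. Token formats are assumptions.
from gensim.models import Word2Vec

# General-domain sentences, e.g. tokenized encyclopedia text (toy examples).
general_sentences = [
    ["the", "band", "released", "a", "new", "album"],
    ["a", "guitar", "solo", "opened", "the", "show"],
]

# Music-specific sentences: tags, artist and track tokens co-occur, so the
# embedding learns music entities alongside everyday vocabulary.
music_sentences = [
    ["rock", "energetic", "artist_queen", "track_bohemian_rhapsody"],
    ["jazz", "mellow", "artist_miles_davis", "track_so_what"],
]

model = Word2Vec(
    sentences=general_sentences + music_sentences,
    vector_size=300,   # embedding dimensionality (illustrative choice)
    window=5,          # context window
    min_count=1,       # keep rare music entities in this toy corpus
    sg=1,              # skip-gram, a common choice for entity-heavy corpora
)

# A music-aware query: neighbours of a tag should include related tags,
# artists, and tracks, not only everyday words.
print(model.wv.most_similar("rock", topn=5))

The multi-prototype training can likewise be sketched: an audio encoder is pulled toward both the tag-level and the track-level word vectors of each clip, so that one model serves both tagging and retrieval. The encoder architecture, the margin-based loss, and the in-batch negative sampling below are all assumptions for illustration, not the paper's exact method.

# Hedged sketch of audio-word joint training with multi-prototype
# supervision: the audio embedding is trained to land near the (frozen)
# MWE vectors of the clip's tag and track words.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioEncoder(nn.Module):
    """Toy audio branch: mel-spectrogram in, MWE-sized embedding out."""
    def __init__(self, emb_dim=300):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, emb_dim),
        )
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def multi_prototype_loss(audio_emb, tag_vec, track_vec, margin=0.2):
    """Pull audio toward its tag-level AND track-level word vectors.

    Tag-level supervision favours tagging, track-level favours retrieval;
    training on both jointly is the multi-prototype compromise.
    """
    # Negatives: shift word vectors within the batch (illustrative choice).
    neg_tag = tag_vec.roll(shifts=1, dims=0)
    neg_track = track_vec.roll(shifts=1, dims=0)
    loss = 0.0
    for pos, neg in [(tag_vec, neg_tag), (track_vec, neg_track)]:
        pos_sim = F.cosine_similarity(audio_emb, pos)
        neg_sim = F.cosine_similarity(audio_emb, neg)
        loss = loss + F.relu(margin - pos_sim + neg_sim).mean()
    return loss

# One training step with random stand-ins for a batch of 8 clips.
enc = AudioEncoder()
mel = torch.randn(8, 1, 128, 96)                       # (batch, ch, mels, frames)
tag_vec = F.normalize(torch.randn(8, 300), dim=-1)     # frozen MWE tag vectors
track_vec = F.normalize(torch.randn(8, 300), dim=-1)   # frozen MWE track vectors
loss = multi_prototype_loss(enc(mel), tag_vec, track_vec)
loss.backward()

At query time the same space serves both directions: a tag word vector retrieves nearby audio embeddings (query-by-tag), and an audio embedding retrieves nearby track word vectors (query-by-track).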