Tree-Augmented Cross-Modal Encoding for Complex-Query Video Retrieval
A Tree-augmented Cross-modal Encoding method that jointly learns the linguistic structure of queries and the temporal representation of videos to facilitate video retrieval with complex queries, thereby achieving better video retrieval performance.
Abstract
The rapid growth of user-generated videos on the Internet has intensified the need for text-based video retrieval systems. Traditional methods mainly favor the concept-based paradigm for retrieval with simple queries, which is usually ineffective for complex queries that carry far richer semantics. Recently, the embedding-based paradigm has emerged as a popular alternative. It maps queries and videos into a shared embedding space where semantically similar texts and videos lie close to each other. Despite its simplicity, this paradigm forgoes the syntactic structure of text queries, making it suboptimal for modeling complex queries.
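The embedding-based paradigm described above can be sketched as follows. This is a minimal illustration, not the paper's actual model: the projection matrices `W_t` and `W_v`, the feature dimensions, and the randomly generated features are all hypothetical placeholders standing in for learned encoders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed feature sizes (hypothetical): text features, video features,
# and the shared embedding space they are both projected into.
d_text, d_video, d_common = 300, 512, 256
W_t = rng.standard_normal((d_text, d_common)) * 0.01   # text projection (would be learned)
W_v = rng.standard_normal((d_video, d_common)) * 0.01  # video projection (would be learned)

def embed(features, W):
    """Project features into the shared space and L2-normalize them."""
    z = features @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

# One query vector and a small gallery of candidate video vectors.
query = embed(rng.standard_normal(d_text), W_t)
videos = embed(rng.standard_normal((5, d_video)), W_v)

# Cosine similarity (dot product of unit vectors) scores each candidate;
# retrieval returns the videos ranked by this score. At training time, a
# ranking loss would pull matched text-video pairs together in this space.
scores = videos @ query
ranking = np.argsort(-scores)
print(ranking)
```

The key design point is that both modalities end up as vectors in the same space, so retrieval reduces to a nearest-neighbor search; the limitation noted in the abstract is that the text side here is a flat vector that ignores the query's syntactic structure.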