This work proposes training a visual sentiment analysis model by distilling the output of a text classifier applied to pseudo-text derived from patch images of object instances; the method underperforms the original SentiBank on the Twitter dataset but achieves higher accuracy on the EmotionROI dataset.
Sentiment analysis is essential in areas such as marketing and communication. In recent years, visual sentiment analysis, which estimates how people feel when looking at an image, has been widely studied along with the progress of machine learning. The main difficulty of sentiment analysis is bridging the affective gap when translating visual features into affective labels. SentiBank, a visual sentiment analysis framework, addressed this problem and achieved high accuracy in emotion prediction through a text-based approach that represents images as adjective-noun pairs. However, such texts may not sufficiently reflect the information in an image, such as multiple objects or the colors of the background. To reflect such information more directly, we propose training the visual sentiment analysis model by distilling the output of a text classifier that takes as input pseudo-text derived from patch images of object instances. Experiments show that our method underperforms the original SentiBank on the Twitter dataset, which contains more concrete images and highly accurate labels. However, our method achieves higher accuracy on the EmotionROI dataset, which is more abstract and has noisier labels than the Twitter dataset.
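As a rough illustration of this distillation setup, the sketch below shows a minimal PyTorch training step, assuming a frozen text-classifier teacher scoring pseudo-text built from detected object-instance labels and an ordinary image-classifier student. The architectures, the pseudo-text template, and the loss weighting here are illustrative assumptions, not the exact configuration described above.

```python
# Minimal sketch of distilling a text-classifier teacher into an image
# model. All components below are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

NUM_CLASSES = 2  # e.g., positive / negative sentiment

def build_pseudo_text(instance_labels):
    """Turn detected object-instance labels into a pseudo sentence
    (assumed template: labels joined by spaces)."""
    return " ".join(instance_labels)  # ["dog", "grass"] -> "dog grass"

class TextTeacher(nn.Module):
    """Toy bag-of-embeddings text classifier standing in for the
    (assumed pretrained, frozen) teacher."""
    def __init__(self, vocab_size=10000, embed_dim=128):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, embed_dim)
        self.fc = nn.Linear(embed_dim, NUM_CLASSES)

    def forward(self, token_ids, offsets):
        return self.fc(self.embed(token_ids, offsets))

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Standard soft-label distillation (KL divergence against the
    teacher) combined with hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * temperature ** 2
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Student sees the whole image; teacher sees only the pseudo-text.
student = resnet18(num_classes=NUM_CLASSES)
teacher = TextTeacher().eval()  # frozen during distillation

# One illustrative training step on dummy data.
images = torch.randn(4, 3, 224, 224)          # batch of input images
labels = torch.randint(0, NUM_CLASSES, (4,))  # sentiment labels
token_ids = torch.randint(0, 10000, (20,))    # tokenized pseudo-texts
offsets = torch.tensor([0, 5, 10, 15])        # one pseudo-text per image

with torch.no_grad():
    teacher_logits = teacher(token_ids, offsets)
student_logits = student(images)
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```

In this reading, the teacher's soft predictions carry sentiment cues extracted from the object instances, while the hard-label term keeps the student anchored to the dataset annotations; the balance between the two is a tunable assumption.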