
SSD: Towards Better Text-Image Consistency Metric in Text-to-Image Generation

2022 · 2 Citations
Zhaorui Tan, Zihan Ye, Qiufeng Wang
SSRN Electronic Journal


Abstract

Generating consistent and high-quality images from given texts is essential for visual-language understanding. Although impressive results have been achieved in generating high-quality images, text-image consistency is still a major concern in existing GAN-based methods. In particular, the most popular metric, R-precision, may not accurately reflect text-image consistency, often resulting in very misleading semantics in the generated images. Despite its significance, how to design a better text-image consistency metric surprisingly remains underexplored in the community. In this paper, we take a further step forward to develop a novel CLIP-based metric termed Semantic Similarity Distance (SSD), which is both theoretically grounded from a distributional viewpoint and empirically verified on benchmark datasets. Benefiting from the proposed metric, we further design the Parallel Deep Fusion Generative Adversarial Networks (PDF-GAN), which aims at improving text-image consistency by fusing semantic information at different granularities and capturing accurate semantics. Equipped with two novel plug-and-play components, the Hard-Negative Sentence Constructor and Semantic Projection, the proposed PDF-GAN can mitigate inconsistent semantics and bridge the text-image semantic gap. A series of experiments show that, as opposed to current state-of-the-art methods, our PDF-GAN leads to significantly better text-image consistency while maintaining decent image quality on the CUB and COCO datasets.
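The abstract contrasts R-precision's hard ranking decision with a distributional, embedding-based score, but does not define SSD itself. The sketch below is purely illustrative, with assumed names and formulas that are not the paper's definitions: it compares a rank-based check (R-precision at R=1) against a hypothetical distance computed from the first two moments of per-pair text-image embedding similarities, which can register partial semantic mismatch that a 0/1 ranking hides.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def r_precision_at_1(text_emb, image_emb, distractors):
    # R-precision with R=1: the image "matches" only if its true caption
    # ranks first among the candidates -- a hard 0/1 decision that can
    # hide partial semantic mismatch.
    candidates = [text_emb] + list(distractors)
    scores = [cosine(image_emb, t) for t in candidates]
    return float(np.argmax(scores) == 0)

def semantic_distance(text_embs, image_embs):
    # Hypothetical distributional score (NOT the paper's SSD definition):
    # compare the mean and spread of per-pair cosine similarities against
    # a perfect match (similarity 1, zero spread). Lower is better.
    sims = np.array([cosine(t, v) for t, v in zip(text_embs, image_embs)])
    return float((1.0 - sims.mean()) + sims.std())

rng = np.random.default_rng(0)
texts = rng.normal(size=(8, 16))                 # stand-in "text embeddings"
images = texts + 0.1 * rng.normal(size=(8, 16))  # well-aligned "image embeddings"
print(semantic_distance(texts, images))
```

In practice the embeddings would come from a pretrained encoder such as CLIP; the point of the sketch is only that a continuous, distribution-level distance is more informative than a per-sample top-1 ranking.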