Top Research Papers on Text to Image Generation
Delve into the exciting world of Text to Image Generation with our collection of top research papers. These papers cover innovative techniques and advancements in the field, providing essential insights for researchers and enthusiasts alike. Stay ahead of the curve by exploring the latest developments and trends shaping this intriguing area of study.
The paper argues that the current product-centered view of creativity falls short in the context of text-to-image generation, and provides a high-level summary of this online ecosystem drawing on Rhodes’ conceptual four P model of creativity.
Zero-Shot Text-to-Image Generation
1127 Citations · 2021 · Aditya Ramesh, Mikhail Pavlov, Gabriel Goh + 5 more
arXiv (Cornell University)
Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed dataset. These assumptions might involve complex architectures, auxiliary losses, or side information such as object part labels or segmentation masks supplied during training. We describe a simple approach for this task based on a transformer that autoregressively models the text and image tokens as a single stream of data. With sufficient data and scale, our approach is competitive with previous domain-specific models when evaluated in a zero-shot fashion.
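The "single stream" idea can be illustrated with a toy sketch (all token values and the tokenizers are hypothetical; the actual model uses a BPE text tokenizer and a discrete VAE for image tokens): text and image tokens share one index space and are trained with a single next-token objective.

```python
# Toy sketch of single-stream autoregressive modeling: text and image
# tokens are concatenated into one sequence, so one next-token objective
# covers both modalities. Token values here are hypothetical.

def make_single_stream(text_tokens, image_tokens, text_vocab_size):
    # Offset image tokens so both vocabularies share one index space.
    return text_tokens + [t + text_vocab_size for t in image_tokens]

def next_token_pairs(stream):
    # Autoregressive training pairs: predict token i from tokens 0..i-1.
    return [(stream[:i], stream[i]) for i in range(1, len(stream))]

stream = make_single_stream([5, 9], [0, 3, 1], text_vocab_size=100)
pairs = next_token_pairs(stream)
# stream mixes text ids (< 100) and offset image ids (>= 100)
```

At sampling time the same model is conditioned on text tokens alone and image tokens are generated one at a time from the shared vocabulary.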
Muse: Text-To-Image Generation via Masked Generative Transformers
119 Citations · 2023 · Hui‐Wen Chang, Han Zhang, Jarred Barber + 9 more
arXiv (Cornell University)
We present Muse, a text-to-image Transformer model that achieves state-of-the-art image generation performance while being significantly more efficient than diffusion or autoregressive models. Muse is trained on a masked modeling task in discrete token space: given the text embedding extracted from a pre-trained large language model (LLM), Muse is trained to predict randomly masked image tokens. Compared to pixel-space diffusion models, such as Imagen and DALL-E 2, Muse is significantly more efficient because it uses discrete tokens and requires fewer sampling iterations.
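The masked modeling objective can be sketched with a toy example (the mask id and token values are hypothetical; in Muse the tokens come from a learned image tokenizer and prediction is conditioned on the text embedding):

```python
import random

MASK = -1  # hypothetical mask id

def mask_tokens(image_tokens, mask_ratio, rng):
    # Replace a random subset of image tokens with MASK; the model is
    # trained to predict the original token at each masked position.
    n_mask = max(1, int(len(image_tokens) * mask_ratio))
    positions = rng.sample(range(len(image_tokens)), n_mask)
    masked = list(image_tokens)
    targets = {}
    for p in positions:
        targets[p] = masked[p]  # remember the ground-truth token
        masked[p] = MASK
    return masked, targets

rng = random.Random(0)
masked, targets = mask_tokens([7, 2, 9, 4], mask_ratio=0.5, rng=rng)
```

At inference, Muse-style models start from all-masked tokens and fill many positions in parallel over a few refinement steps, which is where the sampling efficiency comes from.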
A taxonomy of prompt modifiers for text-to-image generation
147 Citations · 2023 · Jonas Oppenlaender
Behaviour and Information Technology
A 3-month ethnographic study identifies six types of prompt modifiers used by practitioners in the online community, giving researchers a conceptual starting point for investigating the practice of text-to-image generation and helping practitioners of AI-generated art improve their images.
DE-FAKE: Detection and Attribution of Fake Images Generated by Text-to-Image Generation Models
119 Citations · 2023 · Zeyang Sha, Zheng Li, Ning Yu + 1 more
journal unavailable
A systematic study on the detection and attribution of fake images generated by text-to-image generation models, which shows that fake images generated by various models can be distinguished from real ones, as there exists a common artifact shared by fake images from different models.
CogView: Mastering Text-to-Image Generation via Transformers
382 Citations · 2021 · Ming Ding, Zhuoyi Yang, Wenyi Hong + 8 more
arXiv (Cornell University)
Text-to-Image generation in the general domain has long been an open problem, which requires both a powerful generative model and cross-modal understanding. We propose CogView, a 4-billion-parameter Transformer with VQ-VAE tokenizer to advance this problem. We also demonstrate the finetuning strategies for various downstream tasks, e.g. style learning, super-resolution, text-image ranking and fashion design, and methods to stabilize pretraining, e.g. eliminating NaN losses. CogView achieves the state-of-the-art FID on the blurred MS COCO dataset, outperforming previous GAN-based models.
Semantic Object Accuracy for Generative Text-to-Image Synthesis
151 Citations · 2020 · Tobias Hinz, Stefan Heinrich, Stefan Wermter
IEEE Transactions on Pattern Analysis and Machine Intelligence
Introduces a new model that explicitly models individual objects within an image, together with a new evaluation metric, Semantic Object Accuracy (SOA), that specifically evaluates images given an image caption; the model outperforms approaches that only capture global image characteristics.
Cross-Modal Contrastive Learning for Text-to-Image Generation
305 Citations · 2021 · Han Zhang, Jing Yu Koh, Jason Baldridge + 2 more
journal unavailable
The Cross-Modal Contrastive Generative Adversarial Network (XMC-GAN) addresses the challenge of text-to-image synthesis by maximizing the mutual information between image and text via multiple contrastive losses which capture inter-modality and intra-modality correspondences.
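One direction of such a contrastive objective can be sketched as an InfoNCE-style loss; the toy 2-d embeddings below are hypothetical stand-ins for the outputs of the paper's image and text encoders:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def contrastive_loss(image_embs, text_embs, temperature=0.1):
    # InfoNCE-style loss (image -> text direction): each image's matched
    # text is pulled together, all other texts in the batch pushed apart.
    n = len(image_embs)
    loss = 0.0
    for i in range(n):
        logits = [dot(image_embs[i], t) / temperature for t in text_embs]
        log_z = math.log(sum(math.exp(l) for l in logits))
        loss += -(logits[i] - log_z)  # cross-entropy with target index i
    return loss / n

# Matched image/text pairs should yield a lower loss than mismatched ones.
aligned = contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
shuffled = contrastive_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
```

XMC-GAN applies several such losses (image–sentence, image region–word, real–fake image) and sums them; the sketch shows only the shared mechanism.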
Towards Language-Free Training for Text-to-Image Generation
168 Citations · 2022 · Yufan Zhou, Ruiyi Zhang, Changyou Chen + 6 more
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
This is the first work to train text-to-image generation models without any text data; it leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model, so the requirement of text conditioning is seamlessly alleviated by generating text features from image features.
GIT: A Generative Image-to-text Transformer for Vision and Language
208 Citations · 2022 · Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu + 6 more
arXiv (Cornell University)
This paper designs and trains a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering, establishing a new state of the art on 12 challenging benchmarks by a large margin.
GLIGEN: Open-Set Grounded Text-to-Image Generation
417 Citations · 2023 · Yuheng Li, Haotian Liu, Qingyang Wu + 5 more
journal unavailable
A novel approach that builds upon and extends the functionality of existing pre-trained text-to-image diffusion models by enabling them to also be conditioned on grounding inputs, and achieves open-world grounded text2img generation with caption and bounding box condition inputs.
Text to Image Generation with Semantic-Spatial Aware GAN
163 Citations · 2022 · Wentong Liao, Kai Hu, Michael Ying Yang + 1 more
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
A novel framework, Semantic-Spatial Aware GAN, synthesizes images from input text by learning a semantic-adaptive transformation conditioned on the text to effectively fuse text and image features, and learns a semantic mask in a weakly supervised way.
GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis
136 Citations · 2023 · Ming Tao, Bing‐Kun Bao, Hao Tang + 1 more
journal unavailable
This work proposes Generative Adversarial CLIPs, namely GALIP, a CLIP-empowered generator that induces the visual concepts from CLIP through bridge features and prompts, and achieves comparable results to large pretrained autoregressive and diffusion models.
Hierarchical Text-Conditional Image Generation with CLIP Latents
2257 Citations · 2022 · Aditya Ramesh, Prafulla Dhariwal, Alex Nichol + 2 more
arXiv (Cornell University)
It is shown that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity, and the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion.
BARTScore: Evaluating Generated Text as Text Generation
318 Citations · 2021 · Weizhe Yuan, Graham Neubig, Pengfei Liu
arXiv (Cornell University)
A wide variety of NLP applications, such as machine translation, summarization, and dialog, involve text generation. One major challenge for these applications is how to evaluate whether such generated texts are actually fluent, accurate, or effective. In this work, we conceptualize the evaluation of generated text as a text generation problem, modeled using pre-trained sequence-to-sequence models. The general idea is that models trained to convert the generated text to/from a reference output or the source text will achieve higher scores when the generated text is better.
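The scoring idea reduces to averaging the log-probability a seq2seq model assigns to the target tokens. In a minimal sketch (the per-token probabilities below are hypothetical inputs; in BARTScore they come from BART conditioned on the source or reference):

```python
import math

def bartscore_like(token_probs):
    # BARTScore-style score: mean log p(y_t | y_<t, x) over target tokens.
    # Higher (closer to 0) means the model finds the text more likely.
    return sum(math.log(p) for p in token_probs) / len(token_probs)

# A fluent candidate gets uniformly high token probabilities; a disfluent
# one has low-probability tokens that drag the average log-prob down.
good = bartscore_like([0.9, 0.8, 0.85])
bad = bartscore_like([0.9, 0.1, 0.2])
```

Because the score is a length-normalized log-likelihood, candidates of different lengths remain comparable.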
TediGAN: Text-Guided Diverse Face Image Generation and Manipulation
326 Citations · 2021 · Weihao Xia, Yujiu Yang, Jing‐Hao Xue + 1 more
journal unavailable
This work proposes TediGAN, a novel framework for multi-modal image generation and manipulation with textual descriptions using a control mechanism based on style-mixing, and proposes the Multi-Modal CelebA-HQ, a large-scale dataset consisting of real face images and corresponding semantic segmentation map, sketch, and textual descriptions.
Design Guidelines for Prompt Engineering Text-to-Image Generative Models
500 Citations · 2022 · Vivian Liu, Lydia B. Chilton
CHI Conference on Human Factors in Computing Systems
A study exploring which prompt keywords and model hyperparameters help produce coherent outputs from text-to-image generative models; prompts are structured to include subject and style keywords, and the study investigates success and failure modes of these prompts.
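The subject-plus-style prompt structure the study examines can be captured in a small template helper (the function and example keywords are illustrative, not from the paper):

```python
def build_prompt(subject, style_keywords):
    # Compose a prompt as a subject phrase followed by style keywords,
    # the structure the guidelines recommend for coherent outputs.
    return f"{subject}, {', '.join(style_keywords)}"

prompt = build_prompt("a lighthouse at dusk", ["oil painting", "impressionist"])
```

Varying the style keywords while holding the subject fixed is one simple way to probe the success and failure modes the study describes.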
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
340 Citations · 2022 · Jiahui Yu, Yuanzhong Xu, Jing Yu Koh + 14 more
arXiv (Cornell University)
The Pathways Autoregressive Text-to-Image (Parti) model is presented, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge and explores and highlights limitations of the models.
CLIP-Mesh: Generating textured meshes from text using pretrained image-text models
172 Citations · 2022 · Nasir Mohammad Khalid, Tianhao Xie, Eugene Belilovsky + 1 more
journal unavailable
Presents a technique for zero-shot generation of a 3D model using only a target text prompt, along with a number of techniques using image augmentations and a pretrained prior that generates CLIP image embeddings given a text embedding.
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
461 Citations · 2022 · Rinon Gal, Yuval Alaluf, Yuval Atzmon + 4 more
arXiv (Cornell University)
This work uses only 3-5 images of a user-provided concept to represent it through new words in the embedding space of a frozen text-to-image model, and finds evidence that a single word embedding is sufficient for capturing unique and varied concepts.
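The optimization at the heart of textual inversion — learning one new word embedding by gradient descent while the generator stays frozen — can be sketched with a toy loss (the squared-distance objective and target vector below are hypothetical stand-ins for the frozen diffusion model's denoising loss):

```python
def inversion_step(embedding, grad_fn, lr):
    # One gradient-descent step on the new word's embedding; only this
    # vector is updated, mimicking the frozen-model setup.
    grad = grad_fn(embedding)
    return [e - lr * g for e, g in zip(embedding, grad)]

# Hypothetical 2-d "concept" the pseudo-word's embedding should reach,
# standing in for what the 3-5 example images define implicitly.
target = [0.6, -0.2]

def grad_fn(v):
    # Gradient of the toy loss ||v - target||^2.
    return [2 * (vi - ti) for vi, ti in zip(v, target)]

v = [0.0, 0.0]  # randomly/zero-initialized embedding for the new word
for _ in range(200):
    v = inversion_step(v, grad_fn, lr=0.1)
```

Once learned, the embedding is simply inserted wherever the pseudo-word appears in a prompt, which is why a single vector suffices to personalize generation.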