Top Research Papers on Text to Image Generation
Delve into the exciting world of Text to Image Generation with our collection of top research papers. These papers cover innovative techniques and advancements in the field, providing essential insights for researchers and enthusiasts alike. Stay ahead of the curve by exploring the latest developments and trends shaping this intriguing area of study.
The paper argues that the current product-centered view of creativity falls short in the context of text-to-image generation, and provides a high-level summary of this online ecosystem drawing on Rhodes’ conceptual four P model of creativity.
Zero-Shot Text-to-Image Generation
1127 Citations · 2021 · Aditya Ramesh, Mikhail Pavlov, Gabriel Goh + 5 more
arXiv (Cornell University)
Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed dataset. These assumptions might involve complex architectures, auxiliary losses, or side information such as object part labels or segmentation masks supplied during training. We describe a simple approach for this task based on a transformer that autoregressively models the text and image tokens as a single stream of data. With sufficient data and scale, our approach is competitive with previous domain-specific models when evaluated in a zero-shot fashion.
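The single-stream recipe described above can be sketched in a few lines: caption tokens and discrete image tokens are concatenated into one sequence and trained with an ordinary next-token objective. The vocabulary size, token values, and offset scheme below are illustrative assumptions, not the paper's actual tokenizer.

```python
# Sketch of autoregressive text-to-image modeling: text tokens and
# (VQ-compressed) image tokens become one stream for a transformer.

def build_stream(text_tokens, image_tokens, text_vocab=16384):
    # Offset the image tokens so the two vocabularies do not collide
    # in a shared embedding table (offset value is an assumption).
    return list(text_tokens) + [t + text_vocab for t in image_tokens]

def shift_for_next_token(stream):
    # Standard autoregressive setup: predict token i+1 from tokens <= i.
    return stream[:-1], stream[1:]

text = [5, 77, 301]          # hypothetical BPE-encoded caption
image = [12, 980, 4051, 7]   # hypothetical VQ codes for a tiny image grid

stream = build_stream(text, image)
inputs, targets = shift_for_next_token(stream)
```

At inference time, generation amounts to sampling image tokens one by one after the text prefix and decoding them back to pixels.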
Muse: Text-To-Image Generation via Masked Generative Transformers
119 Citations · 2023 · Hui‐Wen Chang, Han Zhang, Jarred Barber + 9 more
arXiv (Cornell University)
We present Muse, a text-to-image Transformer model that achieves state-of-the-art image generation performance while being significantly more efficient than diffusion or autoregressive models. Muse is trained on a masked modeling task in discrete token space: given the text embedding extracted from a pre-trained large language model (LLM), Muse is trained to predict randomly masked image tokens. Compared to pixel-space diffusion models, such as Imagen and DALL-E 2, Muse is significantly more efficient due to the use of discrete tokens and requiring fewer sampling iterations; compared to autoregressive models…
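The iterative parallel decoding that makes this efficient can be sketched as follows. This loosely follows the MaskGIT-style procedure Muse builds on: all image tokens start masked, and at each step the most confident predictions are kept while the rest are re-masked on a cosine schedule. The stub model and schedule details are illustrative assumptions, not Muse's actual configuration.

```python
import math

MASK = -1

def parallel_decode(num_tokens, predict, steps=4):
    # All image tokens start masked; each step the "model" predicts
    # every masked position and we commit the most confident ones.
    tokens = [MASK] * num_tokens
    for s in range(1, steps + 1):
        frac_masked = math.cos(math.pi / 2 * s / steps)  # fraction left masked
        preds = {i: predict(tokens, i)
                 for i, t in enumerate(tokens) if t == MASK}
        ranked = sorted(preds, key=lambda i: preds[i][1], reverse=True)
        n_keep = len(preds) - int(frac_masked * num_tokens)
        for i in ranked[:max(n_keep, 1)]:
            tokens[i] = preds[i][0]
    return tokens

def toy_predict(tokens, i):
    # Hypothetical model output for position i: (token id, confidence).
    return (i * 10, 1.0 / (i + 1))

out = parallel_decode(8, toy_predict)
```

A handful of such steps replaces the hundreds of sequential steps an autoregressive or diffusion sampler would need, which is the source of the efficiency claim.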
A taxonomy of prompt modifiers for text-to-image generation
147 Citations · 2023 · Jonas Oppenlaender
Behaviour and Information Technology
Based on a 3-month ethnographic study, six types of prompt modifiers used by practitioners in the online community are identified, providing researchers a conceptual starting point for investigating the practice of text-to-image generation and potentially helping practitioners of AI-generated art improve their images.
DE-FAKE: Detection and Attribution of Fake Images Generated by Text-to-Image Generation Models
119 Citations · 2023 · Zeyang Sha, Zheng Li, Ning Yu + 1 more
journal unavailable
A systematic study on the detection and attribution of fake images generated by text-to-image generation models, which shows that fake images generated by various models can be distinguished from real ones, as there exists a common artifact shared by fake images from different models.
CogView: Mastering Text-to-Image Generation via Transformers
382 Citations · 2021 · Ming Ding, Zhuoyi Yang, Wenyi Hong + 8 more
arXiv (Cornell University)
Text-to-Image generation in the general domain has long been an open problem, which requires both a powerful generative model and cross-modal understanding. We propose CogView, a 4-billion-parameter Transformer with VQ-VAE tokenizer to advance this problem. We also demonstrate the finetuning strategies for various downstream tasks, e.g. style learning, super-resolution, text-image ranking and fashion design, and methods to stabilize pretraining, e.g. eliminating NaN losses. CogView achieves the state-of-the-art FID on the blurred MS COCO dataset, outperforming previous GAN-based models…
Semantic Object Accuracy for Generative Text-to-Image Synthesis
151 Citations · 2020 · Tobias Hinz, Stefan Heinrich, Stefan Wermter
IEEE Transactions on Pattern Analysis and Machine Intelligence
A new model that explicitly models individual objects within an image and a new evaluation metric called Semantic Object Accuracy (SOA) that specifically evaluates images given an image caption are introduced that outperform models which only model global image characteristics.
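The SOA idea, checking whether a pre-trained object detector finds the objects a caption mentions in the generated image, can be sketched as below. The detector is stubbed out with precomputed label sets; a real implementation would run a detection network (the paper uses a YOLO-style detector) over the generated images.

```python
def semantic_object_accuracy(captions, detections, vocabulary):
    # For each caption, count the caption-mentioned objects that the
    # detector actually found in the corresponding generated image.
    hits, total = 0, 0
    for caption, detected in zip(captions, detections):
        mentioned = [w for w in vocabulary if w in caption.lower().split()]
        for obj in mentioned:
            total += 1
            hits += obj in detected   # bool adds as 0 or 1
    return hits / total if total else 0.0

captions = ["a dog next to a car", "a bird on a wire"]
detections = [{"dog"}, {"bird"}]   # hypothetical detector outputs
score = semantic_object_accuracy(captions, detections,
                                 ["dog", "car", "bird"])
```

Here the detector missed the car, so two of the three mentioned objects are recovered and the score is 2/3.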
Cross-Modal Contrastive Learning for Text-to-Image Generation
305 Citations · 2021 · Han Zhang, Jing Yu Koh, Jason Baldridge + 2 more
journal unavailable
The Cross-Modal Contrastive Generative Adversarial Network (XMC-GAN) addresses the challenge of text-to-image synthesis systems by maximizing the mutual information between image and text via multiple contrastive losses which capture inter-modality and intra-modality correspondences.
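One of these inter-modality losses can be sketched as a symmetric InfoNCE-style objective: within a batch, each image's own caption is the positive and every other caption is a negative, and symmetrically for text. The embeddings and temperature below are toy assumptions, not XMC-GAN's actual features.

```python
import math

def contrastive_loss(img, txt, temperature=0.1):
    # Symmetric image<->text contrastive loss over a batch of
    # embedding vectors; matched pairs sit on the diagonal.
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))

    n = len(img)
    logits = [[cos(img[i], txt[j]) / temperature for j in range(n)]
              for i in range(n)]

    def cross_entropy(rows):
        # -log softmax probability of the diagonal (matched) entry.
        total = 0.0
        for i, row in enumerate(rows):
            z = sum(math.exp(l) for l in row)
            total += -math.log(math.exp(row[i]) / z)
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    return (cross_entropy(logits) + cross_entropy(transposed)) / 2

emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
aligned = contrastive_loss(emb, emb)            # matched pairs
mismatched = contrastive_loss(emb, emb[::-1])   # shuffled captions
```

Matched pairs yield a much lower loss than shuffled ones, which is exactly the signal the generator is trained against.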
Towards Language-Free Training for Text-to-Image Generation
168 Citations · 2022 · Yufan Zhou, Ruiyi Zhang, Changyou Chen + 6 more
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
The first work to train text-to-image generation models without any text data, this approach leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model: the requirement of text conditioning is seamlessly alleviated by generating text features from image features.
GIT: A Generative Image-to-text Transformer for Vision and Language
208 Citations · 2022 · Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu + 6 more
arXiv (Cornell University)
This paper designs and trains a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering, and establishes new state of the art on 12 challenging benchmarks by a large margin.
GLIGEN: Open-Set Grounded Text-to-Image Generation
417 Citations · 2023 · Yuheng Li, Haotian Liu, Qingyang Wu + 5 more
journal unavailable
A novel approach that builds upon and extends the functionality of existing pre-trained text-to-image diffusion models by enabling them to also be conditioned on grounding inputs, and achieves open-world grounded text2img generation with caption and bounding box condition inputs.
Text to Image Generation with Semantic-Spatial Aware GAN
163 Citations · 2022 · Wentong Liao, Kai Hu, Michael Ying Yang + 1 more
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
A novel framework Semantic-Spatial Aware GAN for synthesizing images from input text that learns semantic-adaptive transformation conditioned on text to effectively fuse text features and image features and learns a semantic mask in a weakly-supervised way.
GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis
136 Citations · 2023 · Ming Tao, Bing‐Kun Bao, Hao Tang + 1 more
journal unavailable
This work proposes Generative Adversarial CLIPs, namely GALIP, a CLIP-empowered generator that induces the visual concepts from CLIP through bridge features and prompts, and achieves comparable results to large pretrained autoregressive and diffusion models.
Hierarchical Text-Conditional Image Generation with CLIP Latents
2257 Citations · 2022 · Aditya Ramesh, Prafulla Dhariwal, Alex Nichol + 2 more
arXiv (Cornell University)
It is shown that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity, and the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion.
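The two-stage structure behind this result (often called unCLIP) can be sketched with stand-ins: a prior maps the CLIP text embedding to an explicit CLIP image embedding, and a decoder turns that image embedding into pixels. All three components below are hypothetical toys, just illustrating the data flow.

```python
def generate(caption, clip_text_encode, prior, decoder):
    # Two-stage pipeline sketch: text embedding -> (prior) ->
    # explicit image embedding -> (decoder) -> image.
    z_text = clip_text_encode(caption)
    z_image = prior(z_text)            # the explicitly generated image representation
    return decoder(z_image)

# Toy stand-ins: the "encoder" maps words to lengths, the "prior" is
# a fixed linear map, and the "decoder" just reports its conditioning.
image = generate(
    "a corgi playing a trumpet",
    clip_text_encode=lambda s: [len(w) for w in s.split()],
    prior=lambda z: [2 * v for v in z],
    decoder=lambda z: {"conditioned_on": z},
)
```

Sampling several image embeddings from the prior for one caption is what gives the reported diversity gain with little loss of caption similarity.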
BARTScore: Evaluating Generated Text as Text Generation
318 Citations · 2021 · Weizhe Yuan, Graham Neubig, Pengfei Liu
arXiv (Cornell University)
A wide variety of NLP applications, such as machine translation, summarization, and dialog, involve text generation. One major challenge for these applications is how to evaluate whether such generated texts are actually fluent, accurate, or effective. In this work, we conceptualize the evaluation of generated text as a text generation problem, modeled using pre-trained sequence-to-sequence models. The general idea is that models trained to convert the generated text to/from a reference output or the source text will achieve higher scores when the generated text is better. …
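The scoring rule can be sketched as the average log-likelihood the generated text's tokens receive from a pre-trained seq2seq model conditioned on the source or reference. The per-token probabilities below are made-up stand-ins for the model's softmax outputs, not real BART scores.

```python
import math

def bart_score(target_tokens, token_probs):
    # BARTScore sketch: average log-probability of the target tokens
    # under a (stubbed) conditional sequence-to-sequence model.
    logprobs = [math.log(token_probs[t]) for t in target_tokens]
    return sum(logprobs) / len(logprobs)

# Hypothetical probabilities for a faithful vs. an unfaithful output;
# higher (less negative) score means better generated text.
faithful = bart_score(["the", "cat", "sat"],
                      {"the": 0.9, "cat": 0.8, "sat": 0.7})
hallucinated = bart_score(["the", "moon", "sang"],
                          {"the": 0.9, "moon": 0.05, "sang": 0.01})
```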
TediGAN: Text-Guided Diverse Face Image Generation and Manipulation
326 Citations · 2021 · Weihao Xia, Yujiu Yang, Jing‐Hao Xue + 1 more
journal unavailable
This work proposes TediGAN, a novel framework for multi-modal image generation and manipulation with textual descriptions using a control mechanism based on style-mixing, and proposes the Multi-Modal CelebA-HQ, a large-scale dataset consisting of real face images and corresponding semantic segmentation map, sketch, and textual descriptions.
Design Guidelines for Prompt Engineering Text-to-Image Generative Models
500 Citations · 2022 · Vivian Liu, Lydia B. Chilton
CHI Conference on Human Factors in Computing Systems
A study exploring what prompt keywords and model hyperparameters can help produce coherent outputs from text-to-image generative models, structured to include subject and style keywords and investigates success and failure modes of these prompts.
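The paper's structured-prompt setup (subject keywords combined with style keywords, swept as a grid) can be sketched as below; the template string and keyword lists are illustrative, not the study's exact materials.

```python
from itertools import product

def prompt_grid(subjects, styles, template="{subject}, {style}"):
    # Enumerate every subject x style combination so the outputs can
    # be generated and compared systematically.
    return [template.format(subject=s, style=st)
            for s, st in product(subjects, styles)]

prompts = prompt_grid(
    ["a lighthouse at dusk", "a city street in rain"],
    ["oil painting", "35mm photograph"],
)
```

Sweeping such a grid (optionally crossed with model hyperparameters like guidance scale or seed) is the experimental pattern the guidelines are derived from.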
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
340 Citations · 2022 · Jiahui Yu, Yuanzhong Xu, Jing Yu Koh + 14 more
arXiv (Cornell University)
The Pathways Autoregressive Text-to-Image (Parti) model is presented, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge and explores and highlights limitations of the models.
CLIP-Mesh: Generating textured meshes from text using pretrained image-text models
172 Citations · 2022 · Nasir Mohammad Khalid, Tianhao Xie, Eugene Belilovsky + 1 more
journal unavailable
This work presents a technique for zero-shot generation of a 3D model using only a target text prompt, together with a number of techniques based on image augmentations and the use of a pretrained prior that generates CLIP image embeddings given a text embedding.
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
461 Citations · 2022 · Rinon Gal, Yuval Alaluf, Yuval Atzmon + 4 more
arXiv (Cornell University)
This work uses only 3-5 images of a user-provided concept to represent it through new words in the embedding space of a frozen text-to-image model, and finds evidence that a single word embedding is sufficient for capturing unique and varied concepts.
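The optimization at the heart of textual inversion can be shown with a toy: the generator stays frozen, and only one new word embedding v* is trained so that the model's output for v* matches features of the concept images. The real method backpropagates through a frozen diffusion model; here the frozen "model" is a scalar map f(v) = 2v, chosen so the gradient is available in closed form.

```python
def textual_inversion(concept_features, steps=200, lr=0.1):
    # Toy sketch: optimize a single embedding v so that the frozen
    # model f(v) = 2*v reproduces the mean of the concept features.
    target = sum(concept_features) / len(concept_features)
    v = 0.0
    for _ in range(steps):
        grad = 2.0 * (2.0 * v - target)   # d/dv of 0.5 * (2v - target)^2
        v -= lr * grad                    # gradient descent on v only
    return v

# Hypothetical features extracted from 3 user-provided concept images.
v_star = textual_inversion([5.8, 6.1, 6.1])
```

The point the toy preserves is that the entire model stays fixed; all personalization lives in the one learned vector, which can then be dropped into any prompt.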
ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic
123 Citations · 2022 · Yoad Tewel, Yoav Shalev, Idan Schwartz + 1 more
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
This work repurposes text-to-image matching models to generate a descriptive text given an image at inference time, without any further training or tuning step, by combining the visual-semantic model with a large language model, benefiting from the knowledge in both web-scale models.
InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning
104 Citations · 2024 · Shi Jing, Wei Xiong, Zhe Lin + 1 more
journal unavailable
InstantBooth is an innovative approach leveraging existing text-to-image models for instantaneous text-guided image personalization, eliminating the need for test-time finetuning and boasting a 100-fold increase in generation speed.
Improving Factual Completeness and Consistency of Image-to-Text Radiology Report Generation
114 Citations · 2021 · Yasuhide Miura, Yuhao Zhang, Emily B. Tsai + 2 more
journal unavailable
This work introduces two new simple rewards to encourage the generation of factually complete and consistent radiology reports: one that encourages the system to generate radiology domain entities consistent with the reference, and one that uses natural language inference to encourage these entities to be described in inferentially consistent ways.
CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers
120 Citations · 2022 · Ming Ding, Wendi Zheng, Wenyi Hong + 1 more
arXiv (Cornell University)
This work pretrains a 6B-parameter transformer with a simple and flexible self-supervised task, the Cross-modal general language model (CogLM), and finetunes it for fast super-resolution in the new text-to-image system, CogView2.
Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators
295 Citations · 2023 · Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan + 4 more
journal unavailable
This paper proposes a low-cost approach (without any training or optimization) by leveraging the power of existing text-to-image synthesis methods, making them suitable for the video domain, and introduces a new task, zero-shot text-to-video generation.
Large-scale Text-to-Image Generation Models for Visual Artists’ Creative Works
138 Citations · 2023 · Hyung-Kwon Ko, Gwanmo Park, Hyeon Jeon + 3 more
journal unavailable
This work aims to understand how visual artists would adopt LTGMs (large-scale text-to-image generation models) to support their creative work, and provides four design guidelines that future researchers can refer to when building intelligent user interfaces using LTGMs.
DF-GAN: Deep Fusion Generative Adversarial Networks for Text-to-Image Synthesis
115 Citations · 2020 · Ming Tao, Hao Tang, Songsong Wu + 3 more
arXiv (Cornell University)
DF-GAN introduces a simplified text-to-image backbone that synthesizes high-quality images directly with a single pair of generator and discriminator; a novel regularization method called Matching-Aware zero-centered Gradient Penalty; and a novel fusion module that effectively exploits the semantics of text descriptions and deeply fuses text and image features during the generation process.
RiFeGAN: Rich Feature Generation for Text-to-Image Synthesis From Prior Knowledge
114 Citations · 2020 · Jun Cheng, Fuxiang Wu, Yanling Tian + 2 more
journal unavailable
RiFeGAN, a novel rich-feature-generating approach to text-to-image synthesis, enriches the given description from prior knowledge and exploits multi-caption attentional generative adversarial networks to synthesize images from the enriched features.
Using artificial intelligence in craft education: crafting with text-to-image generative models
206 Citations · 2023 · Henriikka Vartiainen, Matti Tedre
Digital Creativity
The results revealed that making with AI inspired teachers to consider the unique nature of crafts as well as the tensions and tradeoffs of adopting generative AI in craft practices.
Easily Accessible Text-to-Image Generation Amplifies Demographic Stereotypes at Large Scale
220 Citations · 2023 · Federico Bianchi, Pratyusha Kalluri, Esin Durmus + 7 more
journal unavailable
A broad range of ordinary prompts produce stereotypes, including prompts that simply mention traits, descriptors, occupations, or objects; the work demonstrates how mass deployment of text-to-image generation models leads to mass dissemination of stereotypes and their resulting harms.
DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation
1765 Citations · 2023 · Nataniel Ruiz, Yuanzhen Li, Varun Jampani + 3 more
journal unavailable
This work presents a new approach for “personalization” of text-to-image diffusion models, and applies it to several previously-unassailable tasks, including subject recontextualization, text-guided view synthesis, and artistic rendering, all while preserving the subject's key features.
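The training objective behind this personalization can be sketched as two terms: a reconstruction loss on the few subject images (captioned with a rare identifier, e.g. "a [V] dog") plus a class-specific prior-preservation loss on images the original frozen model generated for the plain class prompt, which counters overfitting and language drift. The scalar "errors" below are toy stand-ins for diffusion denoising losses.

```python
def dreambooth_objective(pred_subject, target_subject,
                         pred_prior, target_prior, lam=1.0):
    # Sketch of the DreamBooth loss: subject reconstruction term plus
    # a weighted prior-preservation term on class images generated by
    # the frozen pre-trained model.
    def mse(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    return mse(pred_subject, target_subject) + lam * mse(pred_prior, target_prior)

# Toy predictions/targets; lam balances fidelity to the subject
# against preserving the model's prior over the class.
loss = dreambooth_objective([0.2, 0.4], [0.0, 0.5],
                            [0.1, 0.1], [0.1, 0.3], lam=0.5)
```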
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
996 Citations · 2021 · Alex Nichol, Prafulla Dhariwal, Aditya Ramesh + 5 more
arXiv (Cornell University)
Diffusion models have recently been shown to generate high-quality synthetic images, especially when paired with a guidance technique to trade off diversity for fidelity. We explore diffusion models for the problem of text-conditional image synthesis and compare two different guidance strategies: CLIP guidance and classifier-free guidance. We find that the latter is preferred by human evaluators for both photorealism and caption similarity, and often produces photorealistic samples. Samples from a 3.5 billion parameter text-conditional diffusion model using classifier-free guidance are favored...
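Classifier-free guidance, the strategy human evaluators preferred here, is a one-line extrapolation at each sampling step: move from the unconditional noise prediction toward the text-conditional one. The vectors below are toy noise predictions, not outputs of a real diffusion model.

```python
def classifier_free_guidance(eps_cond, eps_uncond, w):
    # Per-element extrapolation: w = 0 is unconditional sampling,
    # w = 1 plain conditional sampling, and w > 1 trades diversity
    # for caption fidelity.
    return [u + w * (c - u) for c, u in zip(eps_cond, eps_uncond)]

eps_c = [1.0, 0.0]   # toy prediction with the caption
eps_u = [0.0, 0.0]   # toy prediction with an empty caption
guided = classifier_free_guidance(eps_c, eps_u, w=3.0)
```

In practice the two predictions come from the same network, trained with captions randomly dropped, so no separate classifier is needed.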
DiffusionDB: A Large-scale Prompt Gallery Dataset for Text-to-Image Generative Models
124 Citations · 2023 · Zijie J. Wang, Evan Montoya, David Munechika + 3 more
journal unavailable
Zijie J. Wang, Evan Montoya, David Munechika, Haoyang Yang, Benjamin Hoover, Duen Horng Chau. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023.
ELITE: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation
186 Citations · 2023 · Yuxiang Wei, Yabo Zhang, Zhilong Ji + 3 more
journal unavailable
This paper proposes a learning-based encoder, which consists of a global and a local mapping networks for fast and accurate customized text-to-image generation, and demonstrates that it enables high-fidelity inversion and more robust editability with a significantly faster encoding process.
DALL-EVAL: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models
126 Citations · 2023 · Jaemin Cho, Abhay Zala, Mohit Bansal
journal unavailable
It is demonstrated that recent text-to-image generation models learn specific biases about gender and skin tone from web image-text pairs; these findings can help guide future progress in improving such models' visual reasoning skills and in learning socially unbiased representations.
Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
421 Citations · 2023 · Jay Zhangjie Wu, Yixiao Ge, Xintao Wang + 7 more
journal unavailable
This work proposes a new T2V generation setting—One-Shot Video Tuning, where only one text-video pair is presented, and introduces Tune-A-Video, which involves a tailored spatio-temporal attention mechanism and an efficient one-shot tuning strategy.
Promptify: Text-to-Image Generation through Interactive Prompt Exploration with Large Language Models
127 Citations · 2023 · Stephen Brade, Bryan Wang, Maurício Sousa + 2 more
journal unavailable
This work presents Promptify, an interactive system that supports prompt exploration and refinement for text-to-image generative models, and utilizes a suggestion engine powered by large language models to help users quickly explore and craft diverse prompts.
Evaluation of Text Generation: A Survey
191 Citations · 2020 · Aslı Çelikyılmaz, Elizabeth Clark, Jianfeng Gao
arXiv (Cornell University)
This paper surveys evaluation methods of natural language generation (NLG) systems that have been developed in the last few years, with a focus on the evaluation of recently proposed NLG tasks and neural NLG models.
LAION-5B: An open large-scale dataset for training next generation image-text models
1035 Citations · 2022 · Christoph Schuhmann, Romain Beaumont, Richard Vencu + 13 more
arXiv (Cornell University)
This work presents LAION-5B - a dataset consisting of 5.85 billion CLIP-filtered image-text pairs, of which 2.32B contain English language, and shows successful replication and fine-tuning of foundational models like CLIP, GLIDE and Stable Diffusion using the dataset, and discusses further experiments enabled with an openly available dataset of this scale.
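The CLIP-filtering step that defines the dataset can be sketched as below: keep an (image, text) pair only if the cosine similarity of its CLIP embeddings clears a cutoff. The 2-D embeddings are toys, and the threshold value is illustrative of the kind of cutoff LAION applies, not the exact released setting.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def clip_filter(pairs, threshold=0.28):
    # Keep only (image_embedding, text_embedding) pairs whose CLIP
    # similarity clears the cutoff; low-similarity pairs are likely
    # mismatched captions and get discarded.
    return [p for p in pairs if cosine(p[0], p[1]) >= threshold]

matched = ([1.0, 0.0], [0.9, 0.1])      # caption fits the image
mismatched = ([1.0, 0.0], [-0.2, 1.0])  # caption does not
kept = clip_filter([matched, mismatched])
```

Applied at web scale, this simple predicate is what turns billions of raw crawled pairs into a usable training set.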