Top Research Papers on Text to Image Generation
Delve into the exciting world of Text to Image Generation with our collection of top research papers. These papers cover innovative techniques and advancements in the field, providing essential insights for researchers and enthusiasts alike. Stay ahead of the curve by exploring the latest developments and trends shaping this intriguing area of study.
The paper argues that the current product-centered view of creativity falls short in the context of text-to-image generation, and provides a high-level summary of this online ecosystem drawing on Rhodes’ conceptual four P model of creativity.
Zero-Shot Text-to-Image Generation
1127 Citations · 2021 · Aditya Ramesh, Mikhail Pavlov, Gabriel Goh + 5 more
arXiv (Cornell University)
Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed dataset. These assumptions might involve complex architectures, auxiliary losses, or side information such as object part labels or segmentation masks supplied during training. We describe a simple approach for this task based on a transformer that autoregressively models the text and image tokens as a single stream of data. With sufficient data and scale, our approach is competitive with previous domain-specific models when evaluated in a zero-shot fashion.
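The "single stream" idea can be illustrated with a toy sketch (all token values and the tokenizers are hypothetical; the actual model uses a BPE text tokenizer and a discrete VAE for image tokens): text and image tokens share one index space and are trained with a single next-token objective.

```python
# Toy sketch of single-stream autoregressive modeling: text and image
# tokens are concatenated into one sequence, so one next-token objective
# covers both modalities. Token values here are hypothetical.

def make_single_stream(text_tokens, image_tokens, text_vocab_size):
    # Offset image tokens so both vocabularies share one index space.
    return text_tokens + [t + text_vocab_size for t in image_tokens]

def next_token_pairs(stream):
    # Autoregressive training pairs: predict token i from tokens 0..i-1.
    return [(stream[:i], stream[i]) for i in range(1, len(stream))]

stream = make_single_stream([5, 9], [0, 3, 1], text_vocab_size=100)
pairs = next_token_pairs(stream)
# stream mixes text ids (< 100) and offset image ids (>= 100)
```

At sampling time the same model is conditioned on text tokens alone and image tokens are generated one at a time from the shared vocabulary.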
Muse: Text-To-Image Generation via Masked Generative Transformers
119 Citations · 2023 · Hui‐Wen Chang, Han Zhang, Jarred Barber + 9 more
arXiv (Cornell University)
We present Muse, a text-to-image Transformer model that achieves state-of-the-art image generation performance while being significantly more efficient than diffusion or autoregressive models. Muse is trained on a masked modeling task in discrete token space: given the text embedding extracted from a pre-trained large language model (LLM), Muse is trained to predict randomly masked image tokens. Compared to pixel-space diffusion models, such as Imagen and DALL-E 2, Muse is significantly more efficient because it uses discrete tokens and requires fewer sampling iterations.
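The masked modeling objective can be sketched with a toy example (the mask id and token values are hypothetical; in Muse the tokens come from a learned image tokenizer and prediction is conditioned on the text embedding):

```python
import random

MASK = -1  # hypothetical mask id

def mask_tokens(image_tokens, mask_ratio, rng):
    # Replace a random subset of image tokens with MASK; the model is
    # trained to predict the original token at each masked position.
    n_mask = max(1, int(len(image_tokens) * mask_ratio))
    positions = rng.sample(range(len(image_tokens)), n_mask)
    masked = list(image_tokens)
    targets = {}
    for p in positions:
        targets[p] = masked[p]  # remember the ground-truth token
        masked[p] = MASK
    return masked, targets

rng = random.Random(0)
masked, targets = mask_tokens([7, 2, 9, 4], mask_ratio=0.5, rng=rng)
```

At inference, Muse-style models start from all-masked tokens and fill many positions in parallel over a few refinement steps, which is where the sampling efficiency comes from.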
A taxonomy of prompt modifiers for text-to-image generation
147 Citations · 2023 · Jonas Oppenlaender
Behaviour and Information Technology
A 3-month ethnographic study identifies six types of prompt modifiers used by practitioners in the online community, giving researchers a conceptual starting point for investigating the practice of text-to-image generation and helping practitioners of AI-generated art improve their images.
DE-FAKE: Detection and Attribution of Fake Images Generated by Text-to-Image Generation Models
119 Citations · 2023 · Zeyang Sha, Zheng Li, Ning Yu + 1 more
journal unavailable
A systematic study on the detection and attribution of fake images generated by text-to-image generation models, which shows that fake images generated by various models can be distinguished from real ones, as there exists a common artifact shared by fake images from different models.
CogView: Mastering Text-to-Image Generation via Transformers
382 Citations · 2021 · Ming Ding, Zhuoyi Yang, Wenyi Hong + 8 more
arXiv (Cornell University)
Text-to-Image generation in the general domain has long been an open problem, which requires both a powerful generative model and cross-modal understanding. We propose CogView, a 4-billion-parameter Transformer with VQ-VAE tokenizer to advance this problem. We also demonstrate the finetuning strategies for various downstream tasks, e.g. style learning, super-resolution, text-image ranking and fashion design, and methods to stabilize pretraining, e.g. eliminating NaN losses. CogView achieves the state-of-the-art FID on the blurred MS COCO dataset, outperforming previous GAN-based models.
Semantic Object Accuracy for Generative Text-to-Image Synthesis
151 Citations · 2020 · Tobias Hinz, Stefan Heinrich, Stefan Wermter
IEEE Transactions on Pattern Analysis and Machine Intelligence
Introduces a new model that explicitly models individual objects within an image, together with a new evaluation metric, Semantic Object Accuracy (SOA), that specifically evaluates images given an image caption; the model outperforms approaches that only capture global image characteristics.
Cross-Modal Contrastive Learning for Text-to-Image Generation
305 Citations · 2021 · Han Zhang, Jing Yu Koh, Jason Baldridge + 2 more
journal unavailable
The Cross-Modal Contrastive Generative Adversarial Network (XMC-GAN) addresses the challenge of text-to-image synthesis by maximizing the mutual information between image and text via multiple contrastive losses which capture inter-modality and intra-modality correspondences.
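One direction of such a contrastive objective can be sketched as an InfoNCE-style loss; the toy 2-d embeddings below are hypothetical stand-ins for the outputs of the paper's image and text encoders:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def contrastive_loss(image_embs, text_embs, temperature=0.1):
    # InfoNCE-style loss (image -> text direction): each image's matched
    # text is pulled together, all other texts in the batch pushed apart.
    n = len(image_embs)
    loss = 0.0
    for i in range(n):
        logits = [dot(image_embs[i], t) / temperature for t in text_embs]
        log_z = math.log(sum(math.exp(l) for l in logits))
        loss += -(logits[i] - log_z)  # cross-entropy with target index i
    return loss / n

# Matched image/text pairs should yield a lower loss than mismatched ones.
aligned = contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
shuffled = contrastive_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
```

XMC-GAN applies several such losses (image–sentence, image region–word, real–fake image) and sums them; the sketch shows only the shared mechanism.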
Towards Language-Free Training for Text-to-Image Generation
168 Citations · 2022 · Yufan Zhou, Ruiyi Zhang, Changyou Chen + 6 more
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
This is the first work to train text-to-image generation models without any text data; it leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model, so the requirement of text conditioning is seamlessly alleviated by generating text features from image features.
GIT: A Generative Image-to-text Transformer for Vision and Language
208 Citations · 2022 · Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu + 6 more
arXiv (Cornell University)
This paper designs and trains a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering, establishing a new state of the art on 12 challenging benchmarks by a large margin.
GLIGEN: Open-Set Grounded Text-to-Image Generation
417 Citations · 2023 · Yuheng Li, Haotian Liu, Qingyang Wu + 5 more
journal unavailable
A novel approach that builds upon and extends the functionality of existing pre-trained text-to-image diffusion models by enabling them to also be conditioned on grounding inputs, and achieves open-world grounded text2img generation with caption and bounding box condition inputs.
Text to Image Generation with Semantic-Spatial Aware GAN
163 Citations · 2022 · Wentong Liao, Kai Hu, Michael Ying Yang + 1 more
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
A novel framework, Semantic-Spatial Aware GAN, synthesizes images from input text by learning a semantic-adaptive transformation conditioned on the text to effectively fuse text and image features, and learns a semantic mask in a weakly supervised way.
GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis
136 Citations · 2023 · Ming Tao, Bing‐Kun Bao, Hao Tang + 1 more
journal unavailable
This work proposes Generative Adversarial CLIPs, namely GALIP, a CLIP-empowered generator that induces the visual concepts from CLIP through bridge features and prompts, and achieves comparable results to large pretrained autoregressive and diffusion models.
Hierarchical Text-Conditional Image Generation with CLIP Latents
2257 Citations · 2022 · Aditya Ramesh, Prafulla Dhariwal, Alex Nichol + 2 more
arXiv (Cornell University)
It is shown that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity, and the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion.
BARTScore: Evaluating Generated Text as Text Generation
318 Citations · 2021 · Weizhe Yuan, Graham Neubig, Pengfei Liu
arXiv (Cornell University)
A wide variety of NLP applications, such as machine translation, summarization, and dialog, involve text generation. One major challenge for these applications is how to evaluate whether such generated texts are actually fluent, accurate, or effective. In this work, we conceptualize the evaluation of generated text as a text generation problem, modeled using pre-trained sequence-to-sequence models. The general idea is that models trained to convert the generated text to/from a reference output or the source text will achieve higher scores when the generated text is better.
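The scoring idea reduces to averaging the log-probability a seq2seq model assigns to the target tokens. In a minimal sketch (the per-token probabilities below are hypothetical inputs; in BARTScore they come from BART conditioned on the source or reference):

```python
import math

def bartscore_like(token_probs):
    # BARTScore-style score: mean log p(y_t | y_<t, x) over target tokens.
    # Higher (closer to 0) means the model finds the text more likely.
    return sum(math.log(p) for p in token_probs) / len(token_probs)

# A fluent candidate gets uniformly high token probabilities; a disfluent
# one has low-probability tokens that drag the average log-prob down.
good = bartscore_like([0.9, 0.8, 0.85])
bad = bartscore_like([0.9, 0.1, 0.2])
```

Because the score is a length-normalized log-likelihood, candidates of different lengths remain comparable.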
TediGAN: Text-Guided Diverse Face Image Generation and Manipulation
326 Citations · 2021 · Weihao Xia, Yujiu Yang, Jing‐Hao Xue + 1 more
journal unavailable
This work proposes TediGAN, a novel framework for multi-modal image generation and manipulation with textual descriptions using a control mechanism based on style-mixing, and proposes the Multi-Modal CelebA-HQ, a large-scale dataset consisting of real face images and corresponding semantic segmentation map, sketch, and textual descriptions.
Design Guidelines for Prompt Engineering Text-to-Image Generative Models
500 Citations · 2022 · Vivian Liu, Lydia B. Chilton
CHI Conference on Human Factors in Computing Systems
A study exploring which prompt keywords and model hyperparameters help produce coherent outputs from text-to-image generative models; prompts are structured to include subject and style keywords, and the study investigates success and failure modes of these prompts.
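The subject-plus-style prompt structure the study examines can be captured in a small template helper (the function and example keywords are illustrative, not from the paper):

```python
def build_prompt(subject, style_keywords):
    # Compose a prompt as a subject phrase followed by style keywords,
    # the structure the guidelines recommend for coherent outputs.
    return f"{subject}, {', '.join(style_keywords)}"

prompt = build_prompt("a lighthouse at dusk", ["oil painting", "impressionist"])
```

Varying the style keywords while holding the subject fixed is one simple way to probe the success and failure modes the study describes.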
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
340 Citations · 2022 · Jiahui Yu, Yuanzhong Xu, Jing Yu Koh + 14 more
arXiv (Cornell University)
The Pathways Autoregressive Text-to-Image (Parti) model is presented, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge and explores and highlights limitations of the models.
CLIP-Mesh: Generating textured meshes from text using pretrained image-text models
172 Citations · 2022 · Nasir Mohammad Khalid, Tianhao Xie, Eugene Belilovsky + 1 more
journal unavailable
Presents a technique for zero-shot generation of a 3D model using only a target text prompt, along with a number of techniques using image augmentations and a pretrained prior that generates CLIP image embeddings given a text embedding.
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
461 Citations · 2022 · Rinon Gal, Yuval Alaluf, Yuval Atzmon + 4 more
arXiv (Cornell University)
This work uses only 3-5 images of a user-provided concept to represent it through new words in the embedding space of a frozen text-to-image model, and finds evidence that a single word embedding is sufficient for capturing unique and varied concepts.
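The optimization at the heart of textual inversion — learning one new word embedding by gradient descent while the generator stays frozen — can be sketched with a toy loss (the squared-distance objective and target vector below are hypothetical stand-ins for the frozen diffusion model's denoising loss):

```python
def inversion_step(embedding, grad_fn, lr):
    # One gradient-descent step on the new word's embedding; only this
    # vector is updated, mimicking the frozen-model setup.
    grad = grad_fn(embedding)
    return [e - lr * g for e, g in zip(embedding, grad)]

# Hypothetical 2-d "concept" the pseudo-word's embedding should reach,
# standing in for what the 3-5 example images define implicitly.
target = [0.6, -0.2]

def grad_fn(v):
    # Gradient of the toy loss ||v - target||^2.
    return [2 * (vi - ti) for vi, ti in zip(v, target)]

v = [0.0, 0.0]  # randomly/zero-initialized embedding for the new word
for _ in range(200):
    v = inversion_step(v, grad_fn, lr=0.1)
```

Once learned, the embedding is simply inserted wherever the pseudo-word appears in a prompt, which is why a single vector suffices to personalize generation.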