Delve into the exciting world of Text to Image Generation with our collection of top research papers. These papers cover innovative techniques and advancements in the field, providing essential insights for researchers and enthusiasts alike. Stay ahead of the curve by exploring the latest developments and trends shaping this intriguing area of study.
Zhiqiu Lin, Deepak Pathak, Baiqi Li + 5 more
journal unavailable
The VQAScore is introduced, which uses a visual-question-answering (VQA) model to produce an alignment score by computing the probability of a "Yes" answer to a simple "Does this figure show '{text}'?" question, and though simpler than prior art, VQAScore computed with off-the-shelf models produces state-of-the-art results across many image-text alignment benchmarks.
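The scoring recipe is simple enough to sketch. The snippet below is a minimal, illustrative implementation of the idea, assuming an off-the-shelf BLIP VQA checkpoint from Hugging Face; the exact model, prompt template, and decoding details used in the paper may differ.

```python
# Minimal sketch of a VQAScore-style metric: the alignment score is the
# probability a VQA model assigns to "yes" for the question
# "Does this figure show '{text}'?". Checkpoint and decoding details are
# illustrative assumptions, not the authors' exact setup.
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

def vqa_score(image: Image.Image, text: str) -> float:
    question = f"Does this figure show '{text}'?"
    inputs = processor(image, question, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=1,
                             output_scores=True, return_dict_in_generate=True)
    probs = out.scores[0].softmax(dim=-1)  # distribution over the first answer token
    yes_id = processor.tokenizer("yes", add_special_tokens=False).input_ids[0]
    return probs[0, yes_id].item()

print(vqa_score(Image.open("generated.png"), "a red cube on top of a blue sphere"))
```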
Songwei Ge, Taesung Park, Jun-Yan Zhu + 1 more
2023 IEEE/CVF International Conference on Computer Vision (ICCV)
This work extracts each word’s attributes from rich text to enable local style control, explicit token reweighting, precise color rendering, and detailed region synthesis, and demonstrates that the method outperforms strong baselines with quantitative evaluations.
Over the last three years, the diffusion denoising approach with a score-based loss function has also become notable: many studies tackling the problem of image generation report state-of-the-art results.
J. Oppenlaender
Proceedings of the 25th International Academic Mindtrek Conference
The paper argues that the current product-centered view of creativity falls short in the context of text-to-image generation, and provides a high-level summary of this online ecosystem drawing on Rhodes’ conceptual four P model of creativity.
A. Ramesh, Mikhail Pavlov, Gabriel Goh + 5 more
ArXiv
This work describes a simple approach based on a transformer that autoregressively models the text and image tokens as a single stream of data that is competitive with previous domain-specific models when evaluated in a zero-shot fashion.
Huiwen Chang, Han Zhang, Jarred Barber + 9 more
ArXiv
We present Muse, a text-to-image Transformer model that achieves state-of-the-art image generation performance while being significantly more efficient than diffusion or autoregressive models. Muse is trained on a masked modeling task in discrete token space: given the text embedding extracted from a pre-trained large language model (LLM), Muse is trained to predict randomly masked image tokens. Compared to pixel-space diffusion models, such as Imagen and DALL-E 2, Muse is significantly more efficient due to the use of discrete tokens and requiring fewer sampling iterations; compared to autore...
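The masked-token objective described here lends itself to a compact sketch. The code below is a toy illustration of text-conditioned masked image-token modeling in the spirit of Muse, not the released model; vocabulary size, masking ratio, and dimensions are made-up placeholders.

```python
# Toy sketch of Muse-style training: randomly mask discrete image tokens and
# train a transformer, conditioned on a text embedding, to predict the masked
# tokens with cross-entropy. All sizes below are illustrative.
import torch
import torch.nn as nn

VOCAB, SEQ_LEN, DIM = 8192, 256, 512
MASK_ID = VOCAB                               # extra id reserved for [MASK]

class MaskedTokenTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB + 1, DIM)
        self.pos_emb = nn.Parameter(torch.zeros(SEQ_LEN, DIM))
        self.text_proj = nn.Linear(768, DIM)  # project the frozen text-model embedding
        layer = nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, image_tokens, text_embedding):
        x = self.tok_emb(image_tokens) + self.pos_emb
        ctx = self.text_proj(text_embedding).unsqueeze(1)  # prepend one text context token
        h = self.encoder(torch.cat([ctx, x], dim=1))[:, 1:]
        return self.head(h)                                # logits over image tokens

model = MaskedTokenTransformer()
tokens = torch.randint(0, VOCAB, (4, SEQ_LEN))   # stand-in VQ image tokens
text_emb = torch.randn(4, 768)                   # stand-in pre-trained text embedding
mask = torch.rand(4, SEQ_LEN) < 0.5              # random masking
logits = model(tokens.masked_fill(mask, MASK_ID), text_emb)
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])
loss.backward()
```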
Junyi Li, Wayne Xin Zhao, J. Nie + 1 more
ArXiv
This paper proposes RenderDiffusion, a novel diffusion approach for text generation via text-guided image generation that can achieve comparable or even better results than several baselines, including pretrained language models.
Junyi Li, Wayne Xin Zhao, J. Nie + 1 more
journal unavailable
GlyphDiffusion is proposed, a novel diffusion approach for text generation via text-guided image generation that utilizes a cascaded architecture (i.e., a base and a super-resolution diffusion model) to generate high-fidelity glyph images, conditioned on the input text.
This work develops Surgical Imagen, a diffusion-based generative model that generates photorealistic and activity-aligned surgical images from triplet-based textual prompts, and designs an instrument-based class balancing technique to counteract data imbalance and skewness, improving training convergence.
Y. Hao, Zewen Chi, Li Dong + 1 more
ArXiv
This work proposes prompt adaptation, a general framework that automatically adapts original user input to model-preferred prompts, and defines a reward function that encourages the policy to generate more aesthetically pleasing images while preserving the original user intentions.
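One plausible shape for such a reward, sketched below under our own assumptions (the paper's exact scorers and weights may differ), combines an aesthetic predictor with a CLIP-based relevance term that ties the generated image back to the original user prompt.

```python
# Sketch of a reward for prompt adaptation: prefer rewritten prompts whose
# images look better (aesthetic term) while staying faithful to the user's
# original intent (CLIP text-image similarity). The aesthetic scorer is a stand-in.
import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def relevance(original_prompt, image):
    """Cosine similarity between the original user prompt and the generated image."""
    inputs = proc(text=[original_prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = clip(**inputs)
    t = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    i = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    return (t * i).sum().item()

def reward(original_prompt, image, aesthetic_scorer, alpha=1.0, beta=1.0):
    """aesthetic_scorer: any learned aesthetic predictor mapping an image to a float."""
    return alpha * aesthetic_scorer(image) + beta * relevance(original_prompt, image)
```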
A. Ramesh, Prafulla Dhariwal, Alex Nichol + 2 more
ArXiv
It is shown that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity, and the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion.
Kihyuk Sohn, Nataniel Ruiz, Kimin Lee + 11 more
ArXiv
Pre-trained large text-to-image models synthesize impressive images with an appropriate use of text prompts. However, ambiguities inherent in natural language and out-of-distribution effects make it hard to synthesize image styles, that leverage a specific design pattern, texture or material. In this paper, we introduce StyleDrop, a method that enables the synthesis of images that faithfully follow a specific style using a text-to-image model. The proposed method is extremely versatile and captures nuances and details of a user-provided style, such as color schemes, shading, design patterns, a...
Yunji Kim, Jiyoung Lee, Jin-Hwa Kim + 2 more
2023 IEEE/CVF International Conference on Computer Vision (ICCV)
This work proposes DenseDiffusion, a training-free method that adapts a pre-trained text-to-image model to handle such dense captions while offering control over the scene layout, and develops an attention modulation method that guides objects to appear in specific regions according to layout guidance.
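A rough sketch of the attention-modulation idea is given below, under simplifying assumptions of our own (a single attention map, additive biases before the softmax); the actual method modulates the pre-trained model's cross-attention maps per layer and per timestep.

```python
# Layout-guided cross-attention modulation in the spirit of DenseDiffusion:
# pixels inside a region's mask are pushed to attend to that region's prompt
# tokens, and pixels outside are pushed away, before the softmax.
import torch

def modulate_attention(scores, region_masks, token_groups, strength=2.0):
    """
    scores:        (pixels, tokens) raw cross-attention logits
    region_masks:  list of (pixels,) boolean masks, one per described region
    token_groups:  list of index tensors, the prompt tokens describing each region
    """
    bias = torch.zeros_like(scores)
    for mask, tokens in zip(region_masks, token_groups):
        token_mask = torch.zeros(scores.shape[1], dtype=torch.bool)
        token_mask[tokens] = True
        bias[mask.unsqueeze(1) & token_mask.unsqueeze(0)] += strength
        bias[(~mask).unsqueeze(1) & token_mask.unsqueeze(0)] -= strength
    return (scores + bias).softmax(dim=-1)

# Example: a 64x64 latent attending to 77 prompt tokens, with one region
# (the top rows) bound to tokens 2 and 3 of the prompt.
scores = torch.randn(64 * 64, 77)
region = torch.zeros(64 * 64, dtype=torch.bool)
region[:1024] = True
attn = modulate_attention(scores, [region], [torch.tensor([2, 3])])
```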
Yuheng Li, Haotian Liu, Qingyang Wu + 5 more
2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
A novel approach that builds upon and extends the functionality of existing pre-trained text-to-image diffusion models by enabling them to also be conditioned on grounding inputs, and achieves open-world grounded text2img generation with caption and bounding box condition inputs.
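To make the grounding inputs concrete, the sketch below shows one assumed way a (phrase, bounding box) pair could be fused into a grounding token that added attention layers of a frozen diffusion model attend to; it illustrates the general shape only and is not the released GLIGEN code.

```python
# Illustrative grounding-token encoder: fuse a phrase embedding with normalized
# box coordinates so a pre-trained text-to-image model can also be conditioned
# on "what goes where". Dimensions and the MLP design are assumptions.
import torch
import torch.nn as nn

class GroundingTokenizer(nn.Module):
    def __init__(self, text_dim=768, token_dim=768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim + 4, token_dim),  # phrase embedding + (x0, y0, x1, y1)
            nn.SiLU(),
            nn.Linear(token_dim, token_dim),
        )

    def forward(self, phrase_embedding, box):
        return self.mlp(torch.cat([phrase_embedding, box], dim=-1))

tokenizer = GroundingTokenizer()
phrase_emb = torch.randn(1, 768)               # e.g. text encoder output for "a corgi"
box = torch.tensor([[0.1, 0.4, 0.5, 0.9]])     # normalized xyxy coordinates
grounding_token = tokenizer(phrase_emb, box)   # consumed by the inserted attention layers
```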
Yoad Tewel, Omri Kaduri, Rinon Gal + 4 more
ACM Trans. Graph.
This work presents ConsiStory, a training-free approach that enables consistent subject generation by sharing the internal activations of the pretrained model and introduces a subject-driven shared attention block and correspondence-based feature injection to promote subject consistency between images.
J. Oppenlaender, Johanna M. Silvennoinen, Ville Paananen + 1 more
Proceedings of the 26th International Academic Mindtrek Conference
It is found that while participants were aware of the risks and dangers associated with the technology, only a few participants considered the technology to be a personal risk, which shows that many people are still oblivious to the potential personal risks of generative artificial intelligence and the impending societal changes associated with this technology.
J. Oppenlaender, Aku Visuri, Ville Paananen + 2 more
ArXiv
The study found that participants were aware of the risks and dangers associated with the technology, but only a few participants considered the technology to be a risk to themselves, and those who had tried the technology rated its future importance lower than those who had not.
Zhengcong Fei, Mingyuan Fan, Li Zhu + 1 more
journal unavailable
This paper presents a progressive model for high-fidelity text-to-image generation that produces significantly better results compared with the previous VQ-AR method in FID score across a wide variety of categories and aspects.
Charlotte M. Bird, Eddie L. Ungless, Atoosa Kasirzadeh
Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society
This paper investigates the direct risks and harms associated with modern text-to-image generative models, such as DALL-E and Midjourney, through a comprehensive literature review, and identifies 22 distinct risk types.
Lukas Höllein, Aljaž Božič, Norman Müller + 5 more
2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
This paper proposes to integrate 3D volume-rendering and cross-frame-attention layers into each block of the existing U-Net network of the text-to-image model, and designs an autoregressive generation that renders more 3D-consistent images at any viewpoint.
Yufan Zhou, Ruiyi Zhang, Jiuxiang Gu + 1 more
2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
A customization assistant based on a pre-trained large language model and diffusion model is built, which can not only perform customized generation in a tuning-free manner, but also enable more user-friendly interactions: users can chat with the assistant and input ambiguous text or clear instructions.
Yufan Zhou, Bingchen Liu, Yizhe Zhu + 3 more
2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Corgi is based on the proposed shifted diffusion model, which achieves better image embedding generation from input text, and achieves new state-of-the-art results across different datasets on downstream language-free text-to-image generation tasks.
Reshma S
International Journal of Scientific Research in Engineering and Management
One of the approaches identified in this study is Cross-modal Semantic Matching Generative Adversarial Networks (CSM-GAN), which is used to increase semantic consistency between text descriptions and synthesised pictures for fine-grained text-to-image creation.
Due to the lack of datasets with paired text descriptions and artistic images, this work creates an end-to-end solution that can generate artistic images from text descriptions.
Jaemin Cho, Abhaysinh Zala, Mohit Bansal
ArXiv
This work proposes two novel interpretable/explainable visual programming frameworks for text-to-image (T2I) generation and evaluation and introduces VPEval, an interpretable and explainable evaluation framework for T2I generation based on visual programming.
Yuchen Yang, Bo Hui, Haolin Yuan + 2 more
2024 IEEE Symposium on Security and Privacy (SP)
This work proposes SneakyPrompt, the first automated attack framework to jailbreak text-to-image generative models such that they generate NSFW images even if safety filters are adopted, and outperforms existing text adversarial attacks when extended to jailbreak text-to-image generative models.
Aditi Singh
2023 4th International Conference on Artificial Intelligence, Robotics and Control (AIRC)
The survey provides an overview of the existing literature as well as an analysis of the approaches used in various studies, which covers data preprocessing techniques, neural network types, and evaluation metrics used in the field.
Jiayi Liao, Xu Chen, Qiang Fu + 5 more
journal unavailable
Inspired by the three-layer artwork theory, which identifies the critical factors of intent, object, and form during artistic creation, a framework of Text-to-Image generation for Abstract Concepts (TIAC) is proposed and shown to be effective in creating images that sufficiently express abstract concepts.
Chenshuang Zhang, Chaoning Zhang, Mengchun Zhang + 1 more
ArXiv
An organized review of pioneering methods and their improvements on text-to-image generation, and applications beyond image generation, such as text-guided generation for various modalities like videos, and text-guided image editing.
Tianjun Zhang, Yi Zhang, Vibhav Vineet + 2 more
ArXiv
Control-GPT is introduced to guide diffusion-based text-to-image pipelines with programmatic sketches generated by GPT-4, enhancing their instruction-following abilities; it establishes a new state of the art on spatial arrangement and object positioning and enhances users' control of object positions, sizes, etc.
Cheng Zhang, Xuanbai Chen, Siqi Chai + 4 more
2023 IEEE/CVF International Conference on Computer Vision (ICCV)
It is shown that, for some attributes, images can represent concepts more expressively than text, and a novel approach, ITI-GEN, is proposed that leverages readily available reference images for Inclusive Text-to-Image GENeration.
Nikita Srivatsan, Sofía Samaniego, Omar Florez + 1 more
ArXiv
This work presents an approach for generating alternative text (alt-text) descriptions for images shared on social media, specifically Twitter, and is, to the authors' knowledge, the first to incorporate textual information from the associated social media post into the prefix as well.
Jaewoong Lee, Sang-Sub Jang, Jaehyeong Jo + 5 more
2023 IEEE/CVF International Conference on Computer Vision (ICCV)
A learnable sampling model, Text-Conditioned Token Selection (TCTS), is proposed to select optimal tokens via localized supervision with text information, reducing the original inference time by more than 50% without modifying the original generative model.
Hao Li, Yang Zou, Ying Wang + 7 more
2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
This work empirically studies the scaling properties of diffusion-based T2I models by performing extensive and rigorous ablations on scaling both denoising backbones and the training set, including training scaled UNet and Transformer variants.
Zhengyuan Yang, Jianfeng Wang, Zhe Gan + 8 more
2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
The proposed model, dubbed as ReCo (Region-Controlled T2I), enables the region control for arbitrary objects described by open-ended regional texts rather than by object labels from a constrained category set, and can better control the object count, spatial relationship, and region attributes such as color/size, with the free-form regional description.
Thibault Castells, Hyoung-Kyu Song, Tairen Piao + 6 more
ArXiv
Through the thorough exploration of quantization, profiling, and on-device deployment, this work achieves rapid generation of photo-realistic, text-aligned images in just two steps, with latency under one second on resource-limited edge devices.
Leigang Qu, Haochuan Li, Tan Wang + 4 more
ArXiv
This work rethinks the relationship between text-to-image generation and retrieval, proposes a unified framework in the context of Multimodal Large Language Models (MLLMs), and introduces a generative retrieval method to perform retrieval in a training-free manner.
Jiao Sun, Deqing Fu, Yushi Hu + 8 more
ArXiv
DreamSync is introduced, a model-agnostic training algorithm by design that improves T2I models to be faithful to the text input and improves both the semantic alignment and aesthetic appeal of two diffusion-based T2I models.
Bowen Li, Philip H. S. Torr, Thomas Lukasiewicz
journal unavailable
Experimental results demonstrate that the proposed memory-driven semi-parametric approach to text-to-image generation produces more realistic images than purely parametric approaches, in terms of both visual fidelity and text-image semantic consistency.
Yuanzhi Zhu, Zhaohai Li, Tianwei Wang + 2 more
2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
The proposed CTIG-DM is able to produce image samples that simulate real-world complexity and diversity, and thus can boost the performance of existing text recognizers and shows its appealing potential in domain adaptation and generating images containing Out-Of-Vocabulary words.
Bingshuai Liu, Longyue Wang, Chenyang Lyu + 4 more
journal unavailable
A novel multi-modal metric that considers object-text alignment is proposed to filter the fine-tuning data in the target culture, which is then used to fine-tune a T2I model to improve cross-cultural generation.
A. Voynov, Q. Chu, D. Cohen-Or + 1 more
ArXiv
It is shown that XTI is more expressive and precise, and converges faster than the original Textual Inversion (TI) space, and the extended inversion method does not involve any noticeable trade-off between reconstruction and editability and induces more regular inversions.
Roy Ganz, Michael Elad
2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)
This work reveals the merits of CLIP with PAG (CLIPAG) in several vision-language generative tasks and shows that seamlessly integrating CLIPAG in a "plug-n-play" manner leads to substantial improvements in vision-language generative applications.
J. Oppenlaender
Behaviour & Information Technology
Six types of prompt modifiers used by practitioners in the online community, identified in a 3-month ethnographic study, provide researchers a conceptual starting point for investigating the practice of text-to-image generation, but may also help practitioners of AI-generated art improve their images.
Jingtao Zhan, Qingyao Ai, Yiqun Liu + 5 more
journal unavailable
Inspired by zero-shot machine translation techniques, PRIP innovatively uses the latent representation of a user-preferred image as an intermediary "pivot" between the user and system languages and can leverage abundant data for training.
P. Seshadri, Sameer Singh, Yanai Elazar
journal unavailable
This paper finds that the model appears to amplify gender-occupation biases found in the training data (LAION) considerably; however, it is discovered that the amplification can be largely attributed to discrepancies between training captions and model prompts.
Syed Sha Alam A., J. N., Mohamed Faiz Ali B. + 1 more
International Journal of Scientific Research in Engineering and Management
This paper presents Perceptual Image Compression, a text-to-image technology that will enable billions of individuals to produce beautiful works of art in a few seconds, and the open ecosystem that will grow up around it, as well as new models to really probe the limits of latent space.
Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu + 6 more
ArXiv
This paper designs and trains a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering, and establishes new state of the art on 12 challenging benchmarks by a large margin.
Youwei Liang, Junfeng He, Gang Li + 15 more
2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
It is shown that the predicted rich human feedback can be leveraged to improve image generation, for example, by selecting high-quality training data to finetune and improve the generative models, or by creating masks with predicted heatmaps to inpaint the problematic regions.
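As a small illustration of the inpainting use mentioned above, the sketch below thresholds a predicted problem-region heatmap into a binary mask and hands it to an off-the-shelf inpainting pipeline; the checkpoint name and threshold are assumptions, not the paper's setup.

```python
# Turn a predicted "problematic region" heatmap into a mask and repair that
# region with a standard diffusion inpainting pipeline.
import numpy as np
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting"  # illustrative checkpoint choice
)

def repair(image: Image.Image, heatmap: np.ndarray, prompt: str,
           threshold: float = 0.5) -> Image.Image:
    # heatmap: HxW array in [0, 1] scoring how problematic each pixel is
    mask = Image.fromarray((heatmap > threshold).astype(np.uint8) * 255)
    return pipe(prompt=prompt, image=image, mask_image=mask).images[0]
```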
Lianyu Pang, Jian Yin, Baoquan Zhao + 4 more
ArXiv
AttnDreamBooth is introduced, a novel approach that addresses the limitations of two primary techniques in text-to-image personalization by separately learning the embedding alignment, the attention map, and the subject identity in different training stages and introduces a cross-attention map regularization term to enhance the learning of the attention map.