Diffusion models have become a new generative paradigm for text generation. Considering the discrete nature of text, in this paper, we propose GlyphDiffusion, a novel diffusion approach for text generation via text-guided image generation. Our key idea is to render the target text as a glyph image containing visual language content. In this way, conditional text generation can be cast as a text-guided glyph image generation task, and it is then natural to apply continuous diffusion models to discrete texts. Specifically, we utilize a cascaded architecture (i.e., a base and a super-resolution diffusion model) to generate high-fidelity glyph images based on the input text. Finally, we design a text grounding module to transform and refine the visual language content from the generated glyph images into the final texts. In experiments over four conditional text generation tasks and two classes of metrics (i.e., quality and diversity), GlyphDiffusion achieves comparable or even better results than several baselines, including pretrained language models. Our model also makes significant improvements over the recent diffusion model.
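To make the key idea concrete, the following is a minimal sketch of rendering target text as a glyph image, using Pillow. The canvas size, font, and wrapping scheme are illustrative assumptions, not the paper's actual rendering configuration.

```python
from PIL import Image, ImageDraw, ImageFont


def render_glyph_image(text, size=(256, 256), font_path=None, font_size=16):
    """Render target text onto a fixed-size white canvas as black glyphs.

    Hypothetical helper: canvas size, font, and margins are assumptions
    for illustration only.
    """
    image = Image.new("RGB", size, color="white")
    draw = ImageDraw.Draw(image)
    font = (ImageFont.truetype(font_path, font_size)
            if font_path else ImageFont.load_default())

    # Simple greedy line wrapping so the whole target text fits on the canvas.
    words, lines, line = text.split(), [], ""
    for word in words:
        candidate = (line + " " + word).strip()
        if draw.textlength(candidate, font=font) <= size[0] - 10:
            line = candidate
        else:
            lines.append(line)
            line = word
    lines.append(line)

    # Draw each wrapped line top-to-bottom.
    y = 5
    for row in lines:
        draw.text((5, y), row, fill="black", font=font)
        y += font_size + 2
    return image


glyph = render_glyph_image("a glyph image containing visual language content")
glyph.save("glyph.png")
```

Images produced this way would serve as continuous-valued targets for the base diffusion model, with the super-resolution stage and text grounding module handled separately as described above.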