In this paper, we focus on the problem of generating realistic stylized fashion images on the human body from a given text description. Existing stylized fashion image generation methods mostly require additional style images as input and guidance; very few works generate images conditioned solely on textual descriptions, and the fashion images they produce usually lack stylization and authenticity. This problem poses two main challenges. First, the shape of the generated garment must vary with the text description (e.g., long sleeves, short sleeves, sleeveless) and with the body shape and posture. Second, it is difficult to generate realistic textures directly from text descriptions. Although texture information can be borrowed from auxiliary real clothing images, shape misalignment between those images and the target body can degrade performance. We address these challenges by proposing a two-stage stylized text-to-fashion image generation framework. In the first stage, we predict a segmentation map for the clothing image to be generated, based on the input text description and the original body shape and posture. In the second stage, we focus on rendering colors and textures. In particular, we investigate a novel spatial-style loss that enables the generator to learn the style distribution of text-related real images even when shapes and postures are misaligned. Extensive experiments demonstrate that our proposed method outperforms state-of-the-art methods both quantitatively and qualitatively.
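The abstract does not specify how the spatial-style loss is computed. One classic style objective that is likewise invariant to spatial misalignment is a Gram-matrix loss over encoder feature maps: the Gram matrix averages over all spatial positions, so two feature maps with the same texture statistics but different shapes or poses yield similar values. The sketch below is purely illustrative and is not the paper's actual loss; the `(C, H, W)` feature layout and the function names are assumptions.

```python
import numpy as np

def gram_matrix(features):
    # features: (C, H, W) feature map, e.g. from a pretrained encoder.
    # The Gram matrix captures channel-wise correlations averaged over
    # all spatial positions, discarding spatial layout entirely.
    C, H, W = features.shape
    F = features.reshape(C, H * W)
    return (F @ F.T) / (H * W)  # (C, C)

def style_loss(gen_features, ref_features):
    # Mean squared difference between Gram matrices. Because each Gram
    # matrix is position-agnostic, the loss tolerates shape/posture
    # misalignment between generated and reference images.
    g_gen = gram_matrix(gen_features)
    g_ref = gram_matrix(ref_features)
    return float(np.mean((g_gen - g_ref) ** 2))
```

Because the Gram matrix discards where features occur and keeps only how channels co-activate, permuting the spatial positions of a feature map leaves the loss essentially unchanged, which is the property the abstract's misalignment claim relies on.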