Leveraging GANs and Vision Transformers for Text-to-Image Synthesis

Authors

  • Haitham ALHAJI, Computer Science Department, College of Computer Science and Mathematics, University of Mosul, Nineveh, Iraq, https://orcid.org/0000-0001-7957-3781
  • Alaa Yaseen Taqa, Computer Science Department, College of Education for Pure Science, University of Mosul, Nineveh, Iraq

DOI:

https://doi.org/10.24996/ijs.2026.67.1.%25g

Keywords:

Text-to-Image Synthesis, Generative Adversarial Networks (GANs), Vision Transformers, Generative Models, GAN-based Image Generation, Transformer Models in Image Synthesis

Abstract

Recent advances in text-to-image (T2I) synthesis have made important headway, but generating high-quality images efficiently remains a challenge. We propose GANViT, a new generative adversarial vision transformer model designed for fast, efficient, high-quality T2I synthesis. GANViT addresses limitations of existing models, such as extensive training-data requirements, multi-phase pipelines that slow down synthesis, and the large parameter counts needed to reach adequate performance. GANViT consists mainly of a generator and a discriminator, each built on its own vision transformer (ViT). The generator's ViT employs a feature bridge for fine-tuning, enhancing its image-generation capability, while the discriminator's ViT interprets complex scenes through a feature extraction module followed by an assessment phase. GANViT demonstrates substantial improvements in image synthesis: it achieves a Frechet Inception Distance (FID) score of 5.01 on the Common Objects in Context (COCO) dataset and just above 10 on the Caltech-UCSD Birds (CUB) dataset, while using 33 times less training data, 4.5 times fewer parameters, and 23 times less processing time.
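
To make the architecture described above concrete, here is a minimal PyTorch sketch (ours, not the authors' code) that pairs a ViT-based generator with a ViT-based conditional discriminator. Every module name (including the FeatureBridge-style fusion layer), depth, and width below is an illustrative assumption, not a detail taken from the paper.

# Minimal sketch (not the authors' implementation) of a GAN whose generator
# and discriminator are each built on a ViT, as the abstract describes.
# All names and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

IMG, PATCH, DIM, TXT = 64, 8, 256, 128        # image size, patch size, model width, text-embedding size
N_PATCH = (IMG // PATCH) ** 2                 # number of non-overlapping patches (64)

def vit_encoder(depth: int) -> nn.TransformerEncoder:
    """Plain pre-norm transformer encoder shared by both networks."""
    layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=8, dim_feedforward=4 * DIM,
                                       batch_first=True, norm_first=True)
    return nn.TransformerEncoder(layer, num_layers=depth)

class Generator(nn.Module):
    """Noise + text embedding -> conditioned patch tokens -> RGB image."""
    def __init__(self):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(1, N_PATCH, DIM) * 0.02)  # learned patch queries with positions
        self.bridge = nn.Linear(TXT + DIM, DIM)  # "feature bridge": our guess at fusing text with noise
        self.vit = vit_encoder(depth=6)
        self.to_rgb = nn.Linear(DIM, 3 * PATCH * PATCH)

    def forward(self, z: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        b = z.size(0)
        cond = self.bridge(torch.cat([txt, z], dim=-1))        # (B, DIM) conditioning vector
        x = self.tokens.expand(b, -1, -1) + cond.unsqueeze(1)  # broadcast condition onto every patch token
        x = self.vit(x)
        patches = self.to_rgb(x)                               # (B, N_PATCH, 3*P*P)
        img = patches.view(b, IMG // PATCH, IMG // PATCH, 3, PATCH, PATCH)
        img = img.permute(0, 3, 1, 4, 2, 5).reshape(b, 3, IMG, IMG)
        return torch.tanh(img)

class Discriminator(nn.Module):
    """Image + text embedding -> ViT feature extraction -> real/fake score."""
    def __init__(self):
        super().__init__()
        self.patchify = nn.Conv2d(3, DIM, kernel_size=PATCH, stride=PATCH)  # patch embedding
        self.pos = nn.Parameter(torch.randn(1, N_PATCH + 1, DIM) * 0.02)
        self.txt_tok = nn.Linear(TXT, DIM)       # text enters as one extra token, so the score is conditional
        self.vit = vit_encoder(depth=4)
        self.head = nn.Linear(DIM, 1)

    def forward(self, img: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        x = self.patchify(img).flatten(2).transpose(1, 2)      # (B, N_PATCH, DIM)
        x = torch.cat([self.txt_tok(txt).unsqueeze(1), x], dim=1) + self.pos
        x = self.vit(x)
        return self.head(x[:, 0])                              # score read off the text token

# Smoke test: one forward pass with random noise and text embeddings.
g, d = Generator(), Discriminator()
z, txt = torch.randn(2, DIM), torch.randn(2, TXT)
fake = g(z, txt)
print(fake.shape, d(fake, txt).shape)  # torch.Size([2, 3, 64, 64]) torch.Size([2, 1])

In this sketch the text embedding conditions the generator through a single linear "bridge" and enters the discriminator as an extra token whose output position is read as the real/fake score; the paper's actual feature bridge and assessment modules are presumably more elaborate.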

Issue

Vol. 67 No. 1 (2026)

Section

Computer Science

How to Cite

[1]
H. Alhaji and A. Y. Taqa, “Leveraging GANs and Vision Transformers for Text-to-Image Synthesis”, Iraqi Journal of Science, vol. 67, no. 1, 2026, doi: 10.24996/ijs.2026.67.1.%g.