Leveraging GANs and Vision Transformers for Text-to-Image Synthesis

Authors

  • Haitham ALHAJI, Computer Science Department, College of Computer Science and Mathematics, University of Mosul, Nineveh, Iraq, https://orcid.org/0000-0001-7957-3781
  • Alaa Yaseen Taqa, Computer Science Department, College of Education for Pure Science, University of Mosul, Nineveh, Iraq

DOI:

https://doi.org/10.24996/ijs.2026.67.1.%25g

Keywords:

Text-to-Image Synthesis, Generative Adversarial Networks (GANs), Vision Transformers, Generative Models, GAN-based Image Generation, Transformer Models in Image Synthesis

Abstract

Recent advances in text-to-image (T2I) synthesis have made important headway, but generating high-quality images efficiently remains a challenge. We propose GANViT, a new generative adversarial vision transformer model designed for fast, efficient, high-quality T2I synthesis. GANViT addresses limitations of existing models, such as extensive training-data requirements, multi-phase pipelines that slow down synthesis, and the large parameter counts needed to reach adequate performance. GANViT consists mainly of a generator and a discriminator, each built on its own vision transformer (ViT). The generator's ViT employs a feature bridge for fine-tuning, enhancing its image-generation capability, while the discriminator's ViT interprets complex scenes through a feature extraction module followed by an assessment phase. GANViT demonstrates substantial improvements in image synthesis: it achieves a Frechet Inception Distance (FID) score of 5.01 on the Common Objects in Context (COCO) dataset and just above 10 on the Caltech-UCSD Birds (CUB) dataset, while using 33 times less training data, 4.5 times fewer parameters, and 23 times less processing time.
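
To make the architecture described above concrete, here is a minimal PyTorch sketch (ours, not the authors' code) that pairs a ViT-based generator with a ViT-based conditional discriminator. Every module name (including the FeatureBridge-style fusion layer), depth, and width below is an illustrative assumption, not a detail taken from the paper.

# Minimal sketch (not the authors' implementation) of a GAN whose generator
# and discriminator are each built on a ViT, as the abstract describes.
# All names and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

IMG, PATCH, DIM, TXT = 64, 8, 256, 128        # image size, patch size, model width, text-embedding size
N_PATCH = (IMG // PATCH) ** 2                 # number of non-overlapping patches (64)

def vit_encoder(depth: int) -> nn.TransformerEncoder:
    """Plain pre-norm transformer encoder shared by both networks."""
    layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=8, dim_feedforward=4 * DIM,
                                       batch_first=True, norm_first=True)
    return nn.TransformerEncoder(layer, num_layers=depth)

class Generator(nn.Module):
    """Noise + text embedding -> conditioned patch tokens -> RGB image."""
    def __init__(self):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(1, N_PATCH, DIM) * 0.02)  # learned patch queries with positions
        self.bridge = nn.Linear(TXT + DIM, DIM)  # "feature bridge": our guess at fusing text with noise
        self.vit = vit_encoder(depth=6)
        self.to_rgb = nn.Linear(DIM, 3 * PATCH * PATCH)

    def forward(self, z: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        b = z.size(0)
        cond = self.bridge(torch.cat([txt, z], dim=-1))        # (B, DIM) conditioning vector
        x = self.tokens.expand(b, -1, -1) + cond.unsqueeze(1)  # broadcast condition onto every patch token
        x = self.vit(x)
        patches = self.to_rgb(x)                               # (B, N_PATCH, 3*P*P)
        img = patches.view(b, IMG // PATCH, IMG // PATCH, 3, PATCH, PATCH)
        img = img.permute(0, 3, 1, 4, 2, 5).reshape(b, 3, IMG, IMG)
        return torch.tanh(img)

class Discriminator(nn.Module):
    """Image + text embedding -> ViT feature extraction -> real/fake score."""
    def __init__(self):
        super().__init__()
        self.patchify = nn.Conv2d(3, DIM, kernel_size=PATCH, stride=PATCH)  # patch embedding
        self.pos = nn.Parameter(torch.randn(1, N_PATCH + 1, DIM) * 0.02)
        self.txt_tok = nn.Linear(TXT, DIM)       # text enters as one extra token, so the score is conditional
        self.vit = vit_encoder(depth=4)
        self.head = nn.Linear(DIM, 1)

    def forward(self, img: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        x = self.patchify(img).flatten(2).transpose(1, 2)      # (B, N_PATCH, DIM)
        x = torch.cat([self.txt_tok(txt).unsqueeze(1), x], dim=1) + self.pos
        x = self.vit(x)
        return self.head(x[:, 0])                              # score read off the text token

# Smoke test: one forward pass with random noise and text embeddings.
g, d = Generator(), Discriminator()
z, txt = torch.randn(2, DIM), torch.randn(2, TXT)
fake = g(z, txt)
print(fake.shape, d(fake, txt).shape)  # torch.Size([2, 3, 64, 64]) torch.Size([2, 1])

In this sketch the text embedding conditions the generator through a single linear "bridge" and enters the discriminator as an extra token whose output position is read as the real/fake score; the paper's actual feature bridge and assessment modules are presumably more elaborate.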

Issue

Vol. 67 No. 1 (2026)

Section

Computer Science

How to Cite

[1]
H. Alhaji and A. Y. Taqa, “Leveraging GANs and Vision Transformers for Text-to-Image Synthesis”, Iraqi Journal of Science, vol. 67, no. 1, 2026, doi: 10.24996/ijs.2026.67.1.%g.