Leveraging GANs and Vision Transformers for Text-to-Image Synthesis
DOI: https://doi.org/10.24996/ijs.2026.67.1.41

Keywords: Text-to-Image Synthesis, Generative Adversarial Networks (GANs), Vision Transformers, Generative Models, GAN-based Image Generation, Transformer Models in Image Synthesis

Abstract
Recent advances in text-to-image (T2I) synthesis represent important headway, yet generating high-quality images efficiently remains a challenge. We propose GANViT, a new generative adversarial vision transformer model designed for fast, efficient, high-quality T2I synthesis. GANViT addresses limitations of existing models, such as extensive training-data requirements, multi-phase pipelines that slow down synthesis, and the large parameter counts needed to achieve adequate performance. GANViT consists mainly of a generator and a discriminator, each based on a vision transformer (ViT). The generator ViT uses a feature bridge for fine-tuning, enhancing its image-generation capability, while the discriminator ViT interprets complex scenes through a feature-extraction module and an assessment phase. GANViT demonstrates substantial improvements in image synthesis: it achieves a Frechet Inception Distance (FID) of 5.01 on the Common Objects in Context dataset and just over 10 on the Caltech-UCSD Birds dataset, with a 33-fold reduction in training data, 4.5 times fewer parameters, and 23 times faster processing.
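The adversarial structure the abstract describes, a ViT-based generator with a feature bridge facing a ViT-based discriminator with feature extraction and assessment stages, can be sketched at a very high level as below. This is a minimal, hypothetical illustration of the training-step shape only: every class, method, and the scalar scoring rule are illustrative assumptions, not the authors' implementation, and real ViT attention blocks are replaced with trivial arithmetic stand-ins.

```python
# Hypothetical sketch of the GANViT training structure described in the
# abstract. All names and internals here are illustrative assumptions;
# the ViT blocks are reduced to simple arithmetic placeholders.
import random

class ViTGenerator:
    """Maps a text embedding to image tokens via a 'feature bridge'."""
    def __init__(self, dim=8):
        # Feature bridge: a learned projection refining text features
        # before image-token synthesis (simplified to a weight vector).
        self.bridge = [random.uniform(-1.0, 1.0) for _ in range(dim)]

    def forward(self, text_embedding):
        # Fuse text features through the bridge; the result stands in
        # for the generated image tokens a real ViT decoder would emit.
        return [t * w for t, w in zip(text_embedding, self.bridge)]

class ViTDiscriminator:
    """Scores how well image tokens match the conditioning text."""
    def extract_features(self, image_tokens):
        # Feature-extraction module (simplified to mean pooling).
        return sum(image_tokens) / len(image_tokens)

    def assess(self, image_tokens, text_embedding):
        # Assessment phase: compare pooled image features against
        # pooled text features; output a realism/alignment score in (0, 1].
        img_feat = self.extract_features(image_tokens)
        txt_feat = sum(text_embedding) / len(text_embedding)
        return 1.0 / (1.0 + abs(img_feat - txt_feat))

def training_step(gen, disc, text_embedding):
    # Adversarial objective: the generator tries to maximize this score,
    # the discriminator to minimize it on fakes (the real-image branch
    # and gradient updates are omitted in this sketch).
    fake_tokens = gen.forward(text_embedding)
    return disc.assess(fake_tokens, text_embedding)

random.seed(0)
gen, disc = ViTGenerator(), ViTDiscriminator()
score = training_step(gen, disc, [0.5] * 8)
print(f"discriminator score on fake: {score:.3f}")
```

Because both networks are transformers end to end, a single forward pass replaces the multi-phase pipelines the abstract criticizes, which is consistent with the reported speed and parameter savings.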
Copyright (c) 2026 Iraqi Journal of Science

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.