Transformer-based language model
This is incorrect. Transformer-based language models process text with self-attention, a mechanism that computes how strongly each word in a sentence relates to every other word in the same sentence and uses those weights to capture context. They are mainly used for text generation and translation, and are not optimized for generating images.
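To make the self-attention description concrete, here is a minimal sketch in Python with NumPy. It assumes the common scaled dot-product form of attention with a single head. The function names (`softmax`, `self_attention`), the sizes, and the random matrices are hypothetical and chosen only for illustration. They stand in for learned parameters and real word embeddings.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # X holds one embedding vector per word: shape (n_words, d_model).
    Q = X @ W_q  # queries
    K = X @ W_k  # keys
    V = X @ W_v  # values
    # Relatedness score of every word with every other word in the sentence.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Each row becomes a set of weights that sums to 1.
    weights = softmax(scores, axis=-1)
    # Each output row is a weighted mix of all value vectors (the context).
    return weights @ V, weights

# Illustrative sizes and random stand-ins for embeddings and learned weights.
rng = np.random.default_rng(0)
n_words, d_model, d_head = 4, 8, 8
X = rng.normal(size=(n_words, d_model))
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))
W_v = rng.normal(size=(d_model, d_head))

context, weights = self_attention(X, W_q, W_k, W_v)
print(weights.shape)   # one row of weights per word
print(context.shape)   # one context vector per word
```

Row i of `weights` shows how much word i attends to each word in the sentence, which is the "how much each word is related to every other word" step described above. Nothing in this computation produces pixels; it only mixes word representations.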