In recent years, natural language processing models have dramatically improved their ability to learn general-purpose representations, which has resulted in significant performance gains for a wide range of natural language generation and natural language understanding tasks. In large part, this has been accomplished through pre-training language models on extensive unlabeled text corpora.
This pre-training formulation does not make assumptions about input signal modality, which can be language, vision, or audio, among others. Several recent papers have exploited this formulation to dramatically improve image generation results by pre-quantizing images into discrete integer codes (represented as natural numbers) and modeling them autoregressively (i.e., predicting sequences one token at a time). In these approaches, a convolutional neural network (CNN) is trained to encode an image into discrete tokens, each corresponding to a small patch of the image. A second-stage CNN or Transformer is then trained to model the distribution of the encoded latent variables. After training, the second stage can also be used to generate an image autoregressively. But while such models have achieved strong performance for image generation, few studies have evaluated the learned representation for downstream discriminative tasks (such as image classification).
In “Vector-Quantized Image Modeling with Improved VQGAN”, we propose a two-stage model that reconceives traditional image quantization techniques to yield improved performance on image generation and image understanding tasks. In the first stage, an image quantization model, called VQGAN, encodes an image into lower-dimensional discrete latent codes. Then a Transformer model is trained to model the quantized latent codes of an image. This approach, which we call Vector-quantized Image Modeling (VIM), can be used for both image generation and unsupervised image representation learning. We describe multiple improvements to the image quantizer and show that training a stronger image quantizer is a key component for improving both image generation and image understanding.
Vector-Quantized Image Modeling with ViT-VQGAN
One recent, commonly used model that quantizes images into integer tokens is the Vector-quantized Variational AutoEncoder (VQVAE), a CNN-based auto-encoder whose latent space is a matrix of discrete learnable variables, trained end-to-end. VQGAN is an improved version of this model that introduces an adversarial loss to promote high-quality reconstruction. VQGAN uses transformer-like elements in the form of non-local attention blocks, which allow it to capture distant interactions using fewer layers.
In our work, we propose taking this approach one step further by replacing both the CNN encoder and decoder with ViT. In addition, we introduce a linear projection from the output of the encoder to a low-dimensional latent variable space for lookup of the integer tokens. Specifically, we reduced the encoder output from a 768-dimension vector to a 32- or 8-dimension vector per code, which we found encourages the decoder to better utilize the token outputs, improving model capacity and efficiency.
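The projection-then-lookup step described above can be sketched as follows. This is a minimal NumPy illustration, not the actual ViT-VQGAN implementation: the function name `quantize`, the random `proj` matrix, the codebook size of 1024, and the use of plain squared Euclidean distance for the nearest-code search are all assumptions made for the example. Only the 768 and 32 dimensions come from the text.

```python
import numpy as np

def quantize(encoder_out, proj, codebook):
    """Project encoder outputs to a low-dimensional space, then look up codes.

    encoder_out: (num_patches, 768) encoder output, one row per patch
    proj:        (768, code_dim) linear projection (hypothetical weights)
    codebook:    (codebook_size, code_dim) code vectors (hypothetical)
    Returns the integer token per patch and the matching code vectors.
    """
    z = encoder_out @ proj  # (num_patches, code_dim)
    # Squared Euclidean distance between every projected vector and every code,
    # expanded as |z|^2 - 2 z.c + |c|^2 so it is a single matrix product.
    dist = (z ** 2).sum(axis=1, keepdims=True) - 2.0 * z @ codebook.T + (codebook ** 2).sum(axis=1)
    tokens = dist.argmin(axis=1)  # index of the nearest code for each patch
    return tokens, codebook[tokens]

rng = np.random.default_rng(0)
encoder_out = rng.normal(size=(1024, 768))          # stand-in for ViT encoder output
proj = rng.normal(size=(768, 32)) / np.sqrt(768.0)  # 768 -> 32, as in the text
codebook = rng.normal(size=(1024, 32))              # assumed codebook size
tokens, quantized = quantize(encoder_out, proj, codebook)
```

Searching in 32 (or 8) dimensions instead of 768 makes each lookup much cheaper, which is one plausible reading of why the lower-dimensional code space helps efficiency.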
With our trained ViT-VQGAN, images are encoded into discrete tokens represented by integers, each of which encompasses an 8×8 patch of the input image. Using these tokens, we train a decoder-only Transformer to predict a sequence of image tokens autoregressively. This two-stage model, VIM, is able to perform unconditioned image generation by simply sampling token-by-token from the output softmax distribution of the Transformer model.
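To make the token-by-token sampling concrete, here is a small sketch under stated assumptions. The function `toy_next_logits` is a hypothetical stand-in for the trained Transformer, and the vocabulary size of 16 is chosen only to keep the example small. The token count follows from the 8×8 patch size and the 256×256 input resolution used in our experiments.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()

def sample_tokens(next_logits, num_tokens, rng, prefix=()):
    """Sample num_tokens tokens one at a time, each conditioned on all earlier ones."""
    tokens = list(prefix)
    for _ in range(num_tokens):
        probs = softmax(next_logits(tokens))
        tokens.append(int(rng.choice(len(probs), p=probs)))
    return tokens[len(prefix):]

# One token per 8x8 patch of a 256x256 image gives a 32x32 grid of tokens.
image_size, patch_size = 256, 8
tokens_per_image = (image_size // patch_size) ** 2

vocab_size = 16  # hypothetical; a real codebook is far larger

def toy_next_logits(tokens):
    # Stand-in for the Transformer: logits that depend only on sequence length.
    return np.cos(np.arange(vocab_size) + len(tokens))

rng = np.random.default_rng(0)
sampled = sample_tokens(toy_next_logits, tokens_per_image, rng)
```

In the full pipeline, the sampled tokens would then be passed to the ViT-VQGAN decoder to produce pixels.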
VIM is also capable of performing class-conditioned generation, such as synthesizing a specific image of a given class (e.g., a dog or a cat). We extend the unconditional generation to class-conditioned generation by prepending a class-ID token before the image tokens during both training and sampling.
|Uncurated set of dog samples from class-conditioned image generation trained on ImageNet. Conditioned classes: Irish terrier, Norfolk terrier, Norwich terrier, Yorkshire terrier, wire-haired fox terrier, Lakeland terrier.|
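Prepending a class-ID token can be illustrated with a short sketch of how one training example might be assembled. The shared-vocabulary layout (class IDs placed after the image codes), the codebook size, and the helper names `class_token` and `make_training_example` are assumptions for this example and are not taken from the paper.

```python
import numpy as np

CODEBOOK_SIZE = 8192  # hypothetical number of image codes
NUM_CLASSES = 1000    # ImageNet-1k classes

def class_token(class_id):
    """Map a class ID to a token ID that cannot collide with an image code.

    Assumed layout: image codes use [0, CODEBOOK_SIZE), and class tokens
    use [CODEBOOK_SIZE, CODEBOOK_SIZE + NUM_CLASSES).
    """
    if not 0 <= class_id < NUM_CLASSES:
        raise ValueError(f"class_id out of range: {class_id}")
    return CODEBOOK_SIZE + class_id

def make_training_example(class_id, image_tokens):
    """Prepend the class token and build next-token-prediction inputs and targets."""
    seq = np.concatenate(([class_token(class_id)], image_tokens))
    inputs = seq[:-1]   # class token followed by all but the last image token
    targets = seq[1:]   # the image tokens themselves, shifted by one position
    return inputs, targets

image_tokens = np.array([5, 17, 42, 8, 99])  # toy stand-in for a 1024-token image
inputs, targets = make_training_example(3, image_tokens)
```

At sampling time the same class token would serve as the one-element prefix, so the first image token is already conditioned on the class.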
To test the image understanding capabilities of VIM, we also fine-tune a linear projection layer to perform ImageNet classification, a standard benchmark for measuring image understanding abilities. Similar to ImageGPT, we take a layer output at a specific block, average over the sequence of token features (frozen), and insert a softmax layer (learnable) projecting the averaged features to class logits. This allows us to capture intermediate features that provide more information useful for representation learning.
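The probing head described above amounts to average pooling followed by a single learnable linear layer and a softmax. The sketch below shows only the forward pass, with assumed shapes: the feature width of 64, the batch of 4, and the zero-initialized weights are illustrative choices, and `features` is a random stand-in for frozen Transformer block outputs.

```python
import numpy as np

def linear_probe_logits(block_features, weights, bias):
    """Average frozen token features over the sequence, then project to class logits.

    block_features: (batch, seq_len, dim) output of one Transformer block (frozen)
    weights, bias:  (dim, num_classes) and (num_classes,), the only learnable parts
    """
    pooled = block_features.mean(axis=1)  # (batch, dim)
    return pooled @ weights + bias        # (batch, num_classes)

def softmax(logits):
    """Row-wise numerically stable softmax."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 1024, 64))  # hypothetical frozen features
weights = np.zeros((64, 1000))             # zero-initialized probe, 1000 ImageNet classes
bias = np.zeros(1000)
probs = softmax(linear_probe_logits(features, weights, bias))
```

With zero-initialized weights the probe predicts a uniform distribution over classes; training would update only `weights` and `bias`, leaving the Transformer untouched.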
We train all ViT-VQGAN models with a training batch size of 256 distributed across 128 CloudTPUv4 cores. All models are trained with an input image resolution of 256×256. On top of the pre-learned ViT-VQGAN image quantizer, we train Transformer models for unconditional and class-conditioned image synthesis and compare with previous work.
We measure the performance of our proposed methods for class-conditioned image synthesis and unsupervised representation learning on the widely used ImageNet benchmark. In the table below we demonstrate the class-conditioned image synthesis performance measured by the Fréchet Inception Distance (FID). Compared to prior work, VIM improves the FID to 3.07 (lower is better), a relative improvement of 58.6% over the VQGAN model (FID 7.35). VIM also improves the capacity for image understanding, as indicated by the Inception Score (IS), which goes from 188.6 to 227.4, a 20.6% improvement relative to VQGAN.
|Fréchet Inception Distance (FID) comparison between different models for class-conditional image synthesis and Inception Score (IS) for image understanding, both on ImageNet with resolution 256×256. The acceptance rate shows results filtered by a ResNet-101 classification model, similar to the process in VQGAN.|
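For readers who want to check the relative improvements, the arithmetic can be sketched as below. The helper `relative_change` is a hypothetical name, and the inputs are the rounded FID and IS values quoted in the text. With those rounded values, the IS figure reproduces the quoted 20.6%, while the FID figure comes out near 58.2%, slightly below the quoted 58.6%, presumably because the quoted percentage was computed from slightly different underlying values.

```python
def relative_change(old, new):
    """Signed relative change from old to new, as a fraction of old."""
    return (new - old) / old

# FID: lower is better, so the improvement is the relative decrease.
fid_improvement = -relative_change(7.35, 3.07)
# IS: higher is better, so the improvement is the relative increase.
is_improvement = relative_change(188.6, 227.4)

print(f"FID relative improvement: {fid_improvement:.1%}")
print(f"IS relative improvement:  {is_improvement:.1%}")
```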
After training a generative model, we test the learned image representations by fine-tuning a linear layer to perform ImageNet classification, a standard benchmark for measuring image understanding abilities. Our model outperforms previous generative models on the image understanding task, improving classification accuracy through linear probing (i.e., training a single linear classification layer while keeping the rest of the model frozen) from 60.3% (iGPT-L) to 73.2%. These results showcase VIM’s strong generation results as well as its image representation learning abilities.
We propose Vector-quantized Image Modeling (VIM), which pretrains a Transformer to predict image tokens autoregressively, where discrete image tokens are produced from improved ViT-VQGAN image quantizers. With our proposed improvements on image quantization, we demonstrate superior results on both image generation and understanding. We hope our results can inspire future work towards more unified approaches for image generation and understanding.
We would like to thank Xin Li, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu for the preparation of the VIM paper. We thank Wei Han, Yuan Cao, Jiquan Ngiam, Vijay Vasudevan, Zhifeng Chen and Claire Cui for helpful discussions and feedback, and others on the Google Research and Brain Team for support throughout this project.