ImageFolder: Autoregressive Image Generation with Folded Tokens

Abstract

Image tokenizers are crucial for visual generative models, e.g., diffusion models (DMs) and autoregressive (AR) models, as they construct the latent representation for modeling. Increasing token length is a common approach to improving image reconstruction quality. However, tokenizers with longer token lengths are not guaranteed to achieve better generation quality: there is a trade-off between reconstruction and generation quality with respect to token length. In this paper, we investigate the impact of token length on both image reconstruction and generation and provide a flexible solution to this trade-off. We propose ImageFolder, a semantic tokenizer that provides spatially aligned image tokens that can be folded during autoregressive modeling to improve both generation efficiency and quality. To enhance representational capability without increasing token length, we leverage dual-branch product quantization to capture different contexts of images. Specifically, semantic regularization is introduced in one branch to encourage compact semantic information, while the other branch is designed to capture the remaining pixel-level details. Extensive experiments demonstrate superior image generation quality and shorter token length with the ImageFolder tokenizer.

Pipeline

ImageFolder uses vision transformers to encode and decode images. Given an image, two sets of K×K learnable tokens are used to extract spatially aligned low-resolution features from it. Product quantization is then applied to obtain a discrete image representation. Semantic regularization is applied in one of the quantizers to inject semantic constraints. The quantized tokens are concatenated and fed to the image decoder to reconstruct the image.
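The quantization step of the pipeline can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes one codebook per token set, uses a plain nearest-neighbor lookup, and leaves out the transformer encoder and decoder as well as the semantic regularization. All names and sizes (`nearest_code`, `product_quantize`, `K`, `DIM`, `CODEBOOK_SIZE`) are hypothetical and chosen only for illustration, and the features are random stand-ins for encoder outputs.

```python
import random


def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


def product_quantize(features_a, features_b, codebook_a, codebook_b):
    """Quantize two spatially aligned feature sets with separate codebooks.

    features_a / features_b: lists of K*K vectors, one per spatial position.
    Returns (indices, quantized): indices[p] is the pair (i_a, i_b) for
    position p, and quantized[p] is the concatenation of the two selected
    code vectors, i.e. what would be passed on to a decoder.
    """
    indices, quantized = [], []
    for fa, fb in zip(features_a, features_b):
        ia = nearest_code(fa, codebook_a)
        ib = nearest_code(fb, codebook_b)
        indices.append((ia, ib))
        quantized.append(codebook_a[ia] + codebook_b[ib])  # list concatenation
    return indices, quantized


# Hypothetical sizes, for illustration only.
K, DIM, CODEBOOK_SIZE = 4, 8, 16
rng = random.Random(0)


def rand_vec():
    return [rng.gauss(0.0, 1.0) for _ in range(DIM)]


codebook_a = [rand_vec() for _ in range(CODEBOOK_SIZE)]
codebook_b = [rand_vec() for _ in range(CODEBOOK_SIZE)]
# Random stand-ins for the two sets of K*K encoder features.
features_a = [rand_vec() for _ in range(K * K)]
features_b = [rand_vec() for _ in range(K * K)]

indices, quantized = product_quantize(features_a, features_b, codebook_a, codebook_b)
print(len(indices), len(quantized[0]))
```

In this sketch each of the K*K spatial positions carries a pair of indices, one per branch, and a concatenated vector of length 2*DIM. Because the two index maps share the same spatial layout, the pair at a position can be treated as a single unit, so the number of positions stays at K*K even though two codebooks are used.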

BibTeX

@misc{li2024imagefolderautoregressiveimagegeneration,
  title={ImageFolder: Autoregressive Image Generation with Folded Tokens},
  author={Xiang Li and Hao Chen and Kai Qiu and Jason Kuen and Jiuxiang Gu and Bhiksha Raj and Zhe Lin},
  year={2024},
  eprint={2410.01756},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2410.01756},
}

Xiang Li, Hao Chen, Kai Qiu, Jason Kuen, Jiuxiang Gu, Bhiksha Raj, Zhe Lin

Carnegie Mellon University, Adobe Research, MBZUAI
