MAR-3D: Progressive Masked Auto-regressor for High-Resolution 3D Generation

CVPR 2025 Highlight

Jinnan Chen¹, Lingting Zhu², Zeyu Hu³, Shengju Qian³, Yugang Chen³, Xin Wang³, Gim Hee Lee¹

¹National University of Singapore    ²The University of Hong Kong    ³LIGHTSPEED

Abstract

Recent advances in auto-regressive transformers have revolutionized generative modeling across domains, from language processing to visual generation. However, applying these advances to 3D generation presents three key challenges: the unordered nature of 3D data conflicts with sequential prediction paradigms, conventional vector quantization approaches incur substantial compression loss when applied to 3D meshes, and efficient strategies for scaling to higher resolutions are lacking. To address these limitations, we introduce MAR-3D, which integrates a pyramid variational autoencoder with a cascaded masked auto-regressive transformer (Cascaded MAR) for progressive latent token denoising. Our architecture employs random masking during training and auto-regressive denoising in random order during inference, naturally accommodating the unordered property of 3D latent tokens. Additionally, we propose a cascaded training strategy with condition augmentation that enables efficient up-scaling of the latent token resolution. Extensive experiments demonstrate that MAR-3D not only achieves superior performance and generalization capabilities compared to existing methods but also exhibits enhanced scaling properties over joint distribution modeling approaches such as diffusion transformers in 3D generation.

Method

Overview of MAR-3D: (a) Pyramid VAE: It processes learnable tokens through separate cross-attention layers, taking multi-resolution point clouds and normals as input to generate occupancy fields. (b) Cascaded MAR: Conditioned on image features extracted by CLIP and DINOv2, we employ a cascaded design: a MAR-LR model for generating low-resolution tokens, and a MAR-HR model for generating high-resolution tokens. The MAR architecture details are illustrated in the blue box. While MAR-LR and MAR-HR share the same architecture, they differ in their inputs: MAR-HR additionally requires low-resolution tokens as input (shown in the dashed box).

MAR-3D Method Diagram
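
The data flow described in the overview (random-order token generation, first at low resolution and then at high resolution conditioned on the low-resolution tokens) can be illustrated with a small sketch. This is not the authors' implementation: the cosine reveal schedule, the token counts (8 and 32), the scalar "tokens", the Gaussian form of condition augmentation, and the names `cosine_schedule`, `random_order_decode`, `augment_condition`, and `toy_predictor` are all assumptions made for illustration. In particular, `toy_predictor` is a hypothetical stand-in for the MAR transformer and its per-token denoising step.

```python
import math
import random


def cosine_schedule(num_tokens, num_steps):
    """Tokens to reveal at each step (assumed cosine schedule; sums to num_tokens)."""
    masked_after = [
        round(num_tokens * math.cos(math.pi / 2 * k / num_steps))
        for k in range(num_steps + 1)
    ]
    masked_after[0], masked_after[-1] = num_tokens, 0
    return [masked_after[k] - masked_after[k + 1] for k in range(num_steps)]


def random_order_decode(num_tokens, num_steps, predict_token, rng, condition):
    """Fill all token positions in a random order, a group of positions per step."""
    tokens = [None] * num_tokens      # None marks a still-masked position
    order = list(range(num_tokens))
    rng.shuffle(order)                # random generation order, no fixed scan
    start = 0
    for count in cosine_schedule(num_tokens, num_steps):
        context = list(tokens)        # snapshot: the group shares one context
        for pos in order[start:start + count]:
            tokens[pos] = predict_token(pos, context, condition)
        start += count
    return tokens


def augment_condition(lr_tokens, noise_std, rng):
    """Assumed form of condition augmentation: perturb LR tokens with Gaussian
    noise during HR training so the HR stage tolerates imperfect LR inputs."""
    return [t + rng.gauss(0.0, noise_std) for t in lr_tokens]


def toy_predictor(pos, context, condition):
    """Hypothetical stand-in for the transformer plus per-token denoiser."""
    known = sum(t is not None for t in context)
    bias = sum(condition) / len(condition)
    return bias + pos + 0.01 * known


rng = random.Random(0)
image_features = [0.5, 0.5]           # placeholder for CLIP / DINOv2 features

# Stage 1 (MAR-LR role): low-resolution tokens from image features only.
lr_tokens = random_order_decode(8, 4, toy_predictor, rng, image_features)

# Stage 2 (MAR-HR role): high-resolution tokens, additionally conditioned
# on the low-resolution tokens.
hr_tokens = random_order_decode(32, 8, toy_predictor, rng, image_features + lr_tokens)

# Training-time only (assumption): a noised copy of the LR tokens.
noisy_lr = augment_condition(lr_tokens, 0.1, rng)
```

Both stages reuse the same decoding loop and differ only in what is passed as the condition, which mirrors the statement that MAR-LR and MAR-HR share an architecture but differ in their inputs.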

VAE Reconstruction

[Figures: VAE reconstruction examples 1 to 8]

Image Condition Results

[Figures: input images 1 to 8]

Citation

@inproceedings{chen2025mar3d,
  title={MAR-3D: Progressive Masked Auto-regressor for High-Resolution 3D Generation},
  author={Chen, Jinnan and Zhu, Lingting and Hu, Zeyu and Qian, Shengju and Chen, Yugang and Wang, Xin and Lee, Gim Hee},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2025}
}