Learning Geometric-aware Properties in 2D Representation Using Lightweight CAD Models, or Zero Real 3D Pairs

VISTEC - Vidyasirimedhi Institute of Science and Technology
Rayong, Thailand

CVPR 2023


Cross-modal training using a 2D-3D paired dataset, e.g., multi-view images or scene scans, presents an effective way to enhance 2D scene understanding by introducing geometric and view-invariance priors into 2D features. However, the need for large-scale scene datasets limits the further improvement and scalability of such methods. This paper explores an alternative learning method that leverages a lightweight and publicly available type of 3D data, i.e., CAD models. We construct a 3D space with geometric-aware alignment, where similarity in this space reflects the geometric similarity of CAD models based on the Chamfer distance. The acquired geometric-aware properties are then induced into 2D features, which boosts performance on downstream tasks more effectively than existing RGB-CAD approaches. Our technique is not limited to paired RGB-CAD datasets; we propose an extension for learning such representations on pseudo pairs generated by existing CAD-based reconstruction methods. By training solely on pseudo pairs, we show substantial improvement over SOTA 2D pre-trained models using either a ResNet-50 or a ViT-B backbone. We also achieve results comparable to SOTA methods trained on scene scans on four 2D understanding tasks in NYUv2, SUNRGB-D, indoor ADE20k, and indoor/outdoor COCO, despite using real or pseudo-generated lightweight CAD models.

Figure: Our pre-training strategy. We learn 2D representations on a joint 2D-3D space from RGB-CAD pairs based on three loss functions. L_GEO focuses on learning CAD features from two similar CAD models mined from Chamfer distance, L_IMG focuses on learning visual differences between two image augmentations, and L_CROSS shares geometric awareness from CAD features to 2D representation.
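As a rough illustration of how three such terms could be combined, the sketch below uses a generic InfoNCE-style contrastive loss for each of them. This is an assumption made for illustration only: the page does not give the exact form of the losses, and the function names (`info_nce`, `total_loss`), the temperature value, and the equal weighting of the three terms are all hypothetical.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """Generic InfoNCE loss: row i of `anchors` should match row i of `positives`.

    Both inputs are (B, D) arrays of embeddings. This is a stand-in for the
    losses named in the caption, not the paper's actual formulation.
    """
    # L2-normalise so that dot products are cosine similarities.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                 # (B, B) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Matching pairs sit on the diagonal.
    return float(-np.mean(np.diag(log_prob)))

def total_loss(cad, cad_neighbor, img_view1, img_view2):
    """Sum of three contrastive terms (equal weights are an assumption)."""
    l_geo = info_nce(cad, cad_neighbor)      # CAD model vs. a Chamfer-similar CAD model
    l_img = info_nce(img_view1, img_view2)   # two augmentations of the same image
    l_cross = info_nce(img_view1, cad)       # image vs. its paired CAD model
    return l_geo + l_img + l_cross
```

Here `cad_neighbor` stands for the embedding of a CAD model mined as geometrically similar to `cad`; in practice all four inputs would come from trainable encoders rather than fixed arrays.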

Geometric-aware properties in 2D representations

We leverage CAD models to train a joint 2D-3D space such that images of objects with similar shapes, based on the Chamfer distance, are attracted to each other, while images with different shapes are separated. This results in a continuous geometric-aware space where the distance between two points reflects their geometric similarity, which could be utilized for downstream 2D object understanding tasks.
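To make the notion of shape similarity concrete, here is a minimal sketch of a symmetric Chamfer distance between two point clouds, along with a helper that picks the most similar shape from a collection. It assumes each CAD model has already been sampled into an (N, 3) array of surface points; the squared-distance variant, the function names, and the brute-force nearest-neighbour search are illustrative choices, not details taken from the paper.

```python
from typing import Sequence

import numpy as np

def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3).

    Uses squared Euclidean distances and averages over points; other
    conventions (unsquared, summed) exist and differ only in scaling.
    """
    # Pairwise squared distances via broadcasting, shape (N, M).
    diff = a[:, None, :] - b[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1)
    # Mean nearest-neighbour distance from a to b, plus from b to a.
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())

def most_similar(clouds: Sequence[np.ndarray], query: int) -> int:
    """Index of the point cloud closest to clouds[query], excluding itself."""
    dists = [
        np.inf if i == query else chamfer_distance(clouds[query], c)
        for i, c in enumerate(clouds)
    ]
    return int(np.argmin(dists))
```

A lower distance means more similar shapes, so pairs with a small Chamfer distance would be the ones attracted to each other in the joint space. The brute-force pairwise matrix costs O(N·M) memory, which is fine for a few thousand points per model but would need a spatial index for denser samplings.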

Qualitative semantic segmentation results

Semantic segmentation results obtained after fine-tuning on the NYUv2 dataset.
