Diff3F is a novel feature distiller that harnesses the expressive power of in-painting diffusion features and distills them to points on 3D surfaces. Here, the proposed features are employed for point-to-point shape correspondence between assets varying in shape, pose, species, and topology. We achieve this without any fine-tuning of the underlying diffusion models, and demonstrate results on untextured meshes, point clouds, and raw scans. Note that we show raw point-to-point correspondence, without any regularization or smoothing. Inputs are point clouds, non-manifold meshes, or 2-manifold meshes. The leftmost mesh is the source and all remaining 3D shapes are targets. Corresponding points are similarly colored.
We present Diff3F as a simple, robust, and class-agnostic feature descriptor that can be computed for untextured input shapes (meshes or point clouds).
Our method distills diffusion features from image foundational models onto input shapes. Specifically, we use the input shapes to produce depth and normal maps
as guidance for conditional image synthesis, and in the process produce (diffusion) features in 2D that we subsequently lift and aggregate on the original surface.
Our key observation is that even if the conditional image generations obtained from multi-view rendering of the input shapes are inconsistent,
the associated image features are robust and can be directly aggregated across views.
This produces semantic features on the input shapes, without requiring additional data or training.
We perform extensive experiments on multiple benchmarks (SHREC'19, SHREC'20, and TOSCA) and demonstrate that our features, being semantic instead of geometric, produce reliable correspondence across both isometrically and non-isometrically related shape families.
We present dense correspondence results on SHREC'07 [1] to showcase correspondence on shapes other than the commonly tested classes of humans and four-legged animals. We show diverse shapes encompassing isometric and highly non-isometric pairs to demonstrate the versatility of our method as a general-purpose feature descriptor. Corresponding points are similarly colored.
We decorate 3D points of a given shape, in any modality (point clouds or meshes), with rich semantic descriptors. Given the scarcity of 3D geometry data from which to learn such meaningful descriptors, we leverage foundational vision models trained on very large image datasets to obtain these features. This enables Diff3F to produce semantic descriptors in a zero-shot manner.
Method overview. We render a given shape without textures from multiple views, and the resulting renderings are in-painted by guiding ControlNet with geometric conditions; the generative features from ControlNet are fused with DINO features obtained from the textured rendering, and then unprojected onto the 3D surface. Note that the textured images obtained by conditioning ControlNet from different views can be inconsistent, but when aggregated they produce stable semantic descriptors. Please refer to our paper for more details.
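To make the aggregation step concrete, here is a minimal PyTorch sketch of how per-view 2D feature maps could be unprojected and averaged into per-vertex descriptors. The tensor shapes and the per-view projection callables are assumptions for illustration, not our released implementation.

```python
import torch
import torch.nn.functional as F

def aggregate_features(vertices, feature_maps, cameras, H, W):
    """Fuse per-view 2D feature maps into per-vertex descriptors.

    vertices:     (V, 3) surface points
    feature_maps: list of (C, H, W) feature tensors, one per rendered view
    cameras:      list of (hypothetical) callables mapping (V, 3) points to
                  ((V, 2) pixel coords, (V,) visibility mask)
    """
    V = vertices.shape[0]
    C = feature_maps[0].shape[0]
    accum = torch.zeros(V, C)
    counts = torch.zeros(V, 1)
    for feats, cam in zip(feature_maps, cameras):
        pix, visible = cam(vertices)                 # project to this view
        visible = visible.float()[:, None]           # (V, 1) mask
        u = pix[:, 0].round().clamp(0, W - 1).long()
        v = pix[:, 1].round().clamp(0, H - 1).long()
        sampled = feats[:, v, u].T                   # (V, C) features at projections
        accum += sampled * visible                   # only visible points contribute
        counts += visible
    desc = accum / counts.clamp(min=1.0)             # average across views
    return F.normalize(desc, dim=-1)                 # unit-length descriptors
```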
We showcase heatmaps to visualize our semantic descriptors. For a query point in the source (denoted by a red ball), semantically related points are highlighted in the target. The point of highest similarity in the target is indicated by a red ball.
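Such a heatmap amounts to cosine similarity in descriptor space. The following sketch, assuming precomputed descriptor tensors of shape (points, channels), returns both the per-point similarity map and the best-matching index:

```python
import torch
import torch.nn.functional as F

def similarity_heatmap(src_desc, tgt_desc, query_idx):
    """Cosine similarity between one source point's descriptor and
    every target point's descriptor.

    src_desc: (N, C) source descriptors; tgt_desc: (M, C) target descriptors
    """
    q = F.normalize(src_desc[query_idx], dim=-1)  # (C,) query descriptor
    t = F.normalize(tgt_desc, dim=-1)             # (M, C)
    sim = t @ q                                   # (M,) heatmap values
    return sim, sim.argmax()                      # heatmap and best match index
```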
K-means clustering can be directly applied to our Diff3F descriptors to extract part segments. Interestingly, we discover that the k-means centroids, extracted from one shape (e.g., human), can be used to segment another (e.g., cat), thanks to the semantic nature of our descriptors. This leads to corresponding part segmentation (arms of the human map to front legs of the cat, head maps to head, etc.) as seen in the figure below.
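A minimal sketch of this co-segmentation, assuming per-point descriptors as L2-normalized NumPy arrays (scikit-learn's KMeans uses Euclidean distance, which on normalized descriptors ranks points similarly to cosine similarity; the cluster count k is an illustrative choice):

```python
from sklearn.cluster import KMeans

def cosegment(src_desc, tgt_desc, k=6):
    """Cluster one shape's descriptors, then label a second shape's
    points with the same centroids to get corresponding part segments.

    src_desc: (N, C), tgt_desc: (M, C) L2-normalized descriptor arrays
    """
    km = KMeans(n_clusters=k, n_init=10).fit(src_desc)
    src_labels = km.labels_             # segments on the source shape
    tgt_labels = km.predict(tgt_desc)   # same centroids reused on the target
    return src_labels, tgt_labels
```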
Diff3F descriptors can be effortlessly plugged into existing geometry processing pipelines such as Functional Maps. We compare vanilla functional maps [5] using the Wave Kernel Signature [6] as descriptors against functional maps using our Diff3F descriptors. Because our descriptors are semantic, they enable Functional Maps to handle non-isometric deformations, cases where FMs with traditional geometric descriptors typically struggle. Our descriptors yield accurate correspondence in most cases, eliminating the need for the further refinement algorithms typically used in related works.
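As a sketch of how descriptors enter a functional map pipeline, the least-squares fit below estimates the map C from descriptor constraints in the spectral domain. The eigenbases and descriptor inputs are assumptions for illustration; a full pipeline would add the usual regularizers.

```python
import torch

def functional_map(phi_s, phi_t, desc_s, desc_t):
    """Fit a functional map C from descriptor-preservation constraints.

    phi_s, phi_t:   (N, k), (M, k) truncated Laplace-Beltrami eigenbases
    desc_s, desc_t: (N, C), (M, C) per-point Diff3F descriptors
    """
    A = torch.linalg.pinv(phi_s) @ desc_s   # (k, C) source spectral coefficients
    B = torch.linalg.pinv(phi_t) @ desc_t   # (k, C) target spectral coefficients
    # Solve C A = B in the least-squares sense: C = B A^+
    C = B @ torch.linalg.pinv(A)            # (k, k) functional map
    return C
```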
We compare our Diff3F against SOTA methods (DPC and SE-ORNet) for the task of point-to-point shape correspondence. Corresponding points are similarly colored. We show results with mesh rendering for the animal pair (top) and with point cloud rendering of our method for the human pair (bottom). While DPC and SE-ORNet both get confused by the different alignments of the human pair, resulting in laterally flipped predictions, ours, being a multi-view rendering-based method, is robust to rotation.
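The raw point-to-point correspondence shown throughout reduces to a nearest-neighbor search in descriptor space, with no regularization or smoothing; a minimal sketch under the same assumed descriptor shapes as above:

```python
import torch
import torch.nn.functional as F

def point_to_point(src_desc, tgt_desc):
    """Map every source point to the target point with the most
    similar descriptor (cosine similarity, no post-processing)."""
    s = F.normalize(src_desc, dim=-1)   # (N, C)
    t = F.normalize(tgt_desc, dim=-1)   # (M, C)
    return (s @ t.T).argmax(dim=1)      # (N,) matched target indices
```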
Each cell reports acc ↑ / err ↓.

| Dataset ↓ \ Method → | DPC [2] | SE-ORNet [3] | 3DCODED [4] | FM [5]+WKS [6] | Diff3F (ours) | Diff3F (ours)+FM [5] |
|---|---|---|---|---|---|---|
| TOSCA [7] | 30.79 / 3.74 | 33.25 / 4.32 | 0.5* / 19.2* | ✘ | 20.27 / 5.69 | ✘ |
| SHREC'19 [8] | 17.40 / 6.26 | 21.41 / 4.56 | 2.10 / 8.10 | 4.37 / 3.26 | 26.41 / 1.69 | 21.55 / 1.49 |
| SHREC'20 [9] | 31.08 / 2.13 | 31.70 / 1.00 | ✘ | 4.13 / 7.29 | 72.60 / 0.93 | 62.34 / 0.71 |
Each cell reports acc ↑ / err ↓.

| Train | Method | TOSCA [7] | SHREC'19 [8] | SHREC'20 [9] |
|---|---|---|---|---|
| SURREAL | DPC [2] | 29.30 / 5.25 | 17.40 / 6.26 | 31.08 / 2.13 |
| SURREAL | SE-ORNet [3] | 16.71 / 9.19 | 21.41 / 4.56 | 31.70 / 1.00 |
| SMAL | DPC [2] | 30.28 / 6.43 | 12.34 / 8.01 | 24.5* / 7.5* |
| SMAL | SE-ORNet [3] | 31.59 / 4.76 | 12.49 / 9.87 | 25.4* / 2.9* |
| Pretrained | Diff3F (ours) | 20.27 / 5.69 | 26.41 / 1.69 | 72.60 / 0.93 |
@InProceedings{Dutt_2024_CVPR,
author = {Dutt, Niladri Shekhar and Muralikrishnan, Sanjeev and Mitra, Niloy J.},
title = {Diffusion 3D Features (Diff3F): Decorating Untextured Shapes with Distilled Semantic Features},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2024},
pages = {4494-4504}
}