Animate3D: Animating Any 3D Model with Multi-view Video Diffusion

Yanqin Jiang¹*, Chaohui Yu²*, Chenjie Cao², Fan Wang², Weiming Hu¹, Jin Gao¹

¹CASIA
²DAMO Academy, Alibaba Group

NeurIPS 2024

Abstract

Recent advances in 4D generation mainly focus on generating 4D content by distilling pre-trained text or single-view image-conditioned models. It is inconvenient for them to take advantage of various off-the-shelf 3D assets with multi-view attributes, and their results suffer from spatiotemporal inconsistency owing to the inherent ambiguity in the supervision signals. In this work, we present Animate3D, a novel framework for animating any static 3D model. The core idea is two-fold: 1) We propose a novel multi-view video diffusion model (MV-VDM) conditioned on multi-view renderings of the static 3D object, which is trained on our presented large-scale multi-view video dataset (MV-Video). 2) Based on MV-VDM, we introduce a framework combining reconstruction and 4D Score Distillation Sampling (4D-SDS) to leverage the multi-view video diffusion priors for animating 3D objects. Specifically, for MV-VDM, we design a new spatiotemporal attention module to enhance spatial and temporal consistency by integrating 3D and video diffusion models. Additionally, we leverage the static 3D model's multi-view renderings as conditions to preserve its identity. For animating 3D models, an effective two-stage pipeline is proposed: we first reconstruct motions directly from generated multi-view videos, followed by the introduced 4D-SDS to refine both appearance and motion. Benefiting from accurate motion learning, we can achieve straightforward mesh animation. Qualitative and quantitative experiments demonstrate that Animate3D significantly outperforms previous approaches. Data, code, and models will be publicly released.

Video

The video is best viewed in 4K mode.

Animate Generated 3D Mesh

We animate 6 mesh assets. Models are generated using commercial 3D generation tools (Rodin Gen-1, Meshy, Tripo3D). Each model has multiple animations, and you can switch between them by clicking the thumbnails below the video. Click the thumbnail above the video to see the input 3D model. When you hover your mouse over the video, a full-screen button will appear in the bottom right corner. Click it to watch the video at 2048×1024 resolution.

Animate Reconstructed 3D Model

We animate 40 reconstructed 3D models. Some models have more than one animation result, and you can switch between different animations by clicking the thumbnails below the video. Click the thumbnail above the video to see the input 3D model. When you hover your mouse over the video, a full-screen button will appear in the bottom right corner. Click it to watch the video at 1024 resolution.

Animate Real-world 3D Scan

We animate 10 real-world 3D scans. Some models have more than one animation result, and you can switch between different animations by clicking the thumbnails below the video. Click the thumbnail above the video to see the input 3D model. When you hover your mouse over the video, a full-screen button will appear in the bottom right corner. Click it to watch the video at 1024 resolution.

Animate Generated 3D Model

We animate 10 generated models. Models are generated using commercial 3D generation tools (Rodin Gen-1, Meshy, Tripo3D). Some models have more than one animation result, and you can switch between different animations by clicking the thumbnails below the video. Click the thumbnail above the video to see the input 3D model. When you hover your mouse over the video, a full-screen button will appear in the bottom right corner. Click it to watch the video at 1024 resolution.

Ablation for 4D-SDS

Below, we compare our motion reconstruction results (left) with those obtained with 4D-SDS (right). Best viewed in full screen.

Training Data

Our training dataset, MV-Video, comprises 115K animations available under a public license, covering about 53K animated 3D objects in total and rendered into over 1.8M multi-view videos.
Notably, our training data is manually selected for high quality. It includes the highest-quality part of Objaverse (around 29K animated 3D objects), while the rest (around 24K animated 3D objects) was collected by ourselves.
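
As a quick sanity check on the figures above, the short sketch below recomputes the totals and derives two averages. All inputs are the rounded numbers quoted in this section, and the variable names are ours. The derived ratios are rough estimates only, since the page states "about" and "over" rather than exact counts and does not say how many views are rendered per animation.

```python
# Rounded figures quoted in the Training Data paragraph (approximate).
animations = 115_000          # animations in MV-Video
objects_objaverse = 29_000    # animated 3D objects taken from Objaverse
objects_collected = 24_000    # animated 3D objects collected by the authors
multiview_videos = 1_800_000  # lower bound ("over 1.8M")

# The two sources should add up to the quoted total of about 53K objects.
objects_total = objects_objaverse + objects_collected

# Derived averages; these are our own estimates, not numbers from the page.
animations_per_object = animations / objects_total
videos_per_animation = multiview_videos / animations

print(f"total animated objects: {objects_total}")
print(f"animations per object:  {animations_per_object:.2f}")
print(f"videos per animation:   {videos_per_animation:.1f}")
```

The two object counts sum to the quoted 53K, each object carries a little over two animations on average, and each animation corresponds to at least about 15.7 rendered videos.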

Relevant Works

[1] SC4D: Sparse-Controlled Video-to-4D Generation and Motion Transfer (ECCV 2024)
[2] STAG4D: Spatial-Temporal Anchored Generative 4D Gaussians (ECCV 2024)
[3] Consistent4D: Consistent 360° Dynamic Object Generation from Monocular Video (ICLR 2024)

Acknowledgements

Some 3D assets for animation are downloaded from Sketchfab under CC Attribution and CC Attribution-NonCommercial licenses. We would like to thank the creators for sharing great 3D assets.