Diffusion²: Dynamic 3D Content Generation via Score Composition of Orthogonal Diffusion Models

Zeyu Yang*, Zijie Pan*, Chun Gu, Li Zhang†
Fudan University
https://github.com/fudan-zvg/diffusion-square
*Equally contributed. †Li Zhang (lizhangfd@fudan.edu.cn) is the corresponding author with School of Data Science, Fudan University.

Abstract

Recent advancements in 3D generation are predominantly propelled by improvements in 3D-aware image diffusion models, which are pretrained on Internet-scale image data and fine-tuned on massive 3D data, offering the capability of producing highly consistent multi-view images. However, due to the scarcity of synchronized multi-view video data, it is impractical to adapt this paradigm to 4D generation directly. Despite that, the available video and 3D data are adequate for training video and multi-view diffusion models that can provide satisfactory dynamic and geometric priors, respectively. In this paper, we present Diffusion², a novel framework for dynamic 3D content creation that leverages the knowledge about geometric consistency and temporal smoothness from these models to directly sample dense multi-view and multi-frame images, which can be employed to optimize a continuous 4D representation. Specifically, we design a simple yet effective denoising strategy via score composition of video and multi-view diffusion models based on the probability structure of the images to be generated. Owing to the high parallelism of the image generation and the efficiency of the modern 4D reconstruction pipeline, our framework can generate 4D content within a few minutes. Furthermore, our method circumvents the reliance on 4D data, thereby having the potential to benefit from the scalability of the foundation video and multi-view diffusion models. Extensive experiments demonstrate the efficacy of our proposed framework and its capability to flexibly adapt to various types of prompts.

1 Introduction

Figure 1: Diffusion² is designed to generate dynamic content by creating a dense multi-frame multi-view image matrix in a highly parallel denoising diffusion process that combines the foundation video diffusion model and multi-view diffusion model. The generated images can be fed into an off-the-shelf 4D reconstruction pipeline to construct a full 4D representation.

Spurred by the advances from generative image models ho2020denoising ; song2020denoising ; song2020score ; karras2022elucidating ; zhang2023adding , automatic 3D content creation poole2022dreamfusion ; wang2024prolificdreamer ; tang2023dreamgaussian ; hong2023lrm has witnessed remarkable progress in terms of efficiency, fidelity, diversity, and controllability. Coupled with the breakthroughs in 4D representation yang2023deformable ; wu20234d , these advancements further foster substantial development in dynamic content (4D) generation singer2023text ; bahmani20234d ; jiang2023consistent4d ; zhao2023animate124 ; ren2023dreamgaussian4d ; gao2024gaussianflow , which holds significant value across a wide range of applications in animation, film, game, and MetaVerse.

Recently, 3D content generation has achieved considerable breakthroughs in efficiency. Some works liu2023zero ; liu2023syncdreamer ; shi2023mvdream ; wang2023imagedream ; tang2024mvdiffusion++ inject stereo knowledge into the image generation model, enabling these 3D-aware image generators to produce consistent multi-view images, thereby effectively stabilizing and accelerating the optimization. Other efforts hong2023lrm ; chen2024v3d ; voleti2024sv3d ; zuo2024videomv ; tang2024lgm attempt to directly generate 3D representations, such as triplane chen2022tensorf or Gaussian splatting kerbl3Dgaussians . However, the efficiency improvement from these works is largely data-driven wu2023omniobject3d ; reizenstein2021common ; yu2023mvimgnet ; deitke2023objaverse . Consequently, it is infeasible to adapt these approaches to 4D generation due to the scarcity of synchronized multi-view video data. Therefore, most existing optimization-based 4D generation works still suffer from slow and unstable optimization.

However, despite the paucity of 4D data, there are vast available monocular video data and static multi-view image data. Existing works have demonstrated that it is feasible to train diffusion-based generative models to learn the distribution of these two classes of data separately voleti2024sv3d ; liu2023syncdreamer ; tang2024mvdiffusion++ ; blattmann2023align ; blattmann2023stable . Considering that video diffusion model stores the prior of motion and temporal smoothness, and multi-view diffusion model has sound knowledge of geometrical consistency, combining the two generative models to generate 4D assets becomes a highly promising and appealing approach.

Therefore, in this paper we propose a novel 4D generation framework that combines the video diffusion model and the multi-view diffusion model to directly sample a multi-frame multi-view image array, imitating the photographing process of 4D content. To realize this combination, we assume that such an image matrix has a convenient structure: elements that share neither a row nor a column are conditionally independent of each other. Based on this property, we design a simple yet effective denoising strategy in which the estimated score is just a convex combination of the scores predicted by the two foundation diffusion models. Our formulation is easily adapted to various prompts, including a single image, a single-view video, and static 3D content. Unlike the existing optimization-based counterparts, our image generation is highly parallel. Combined with efficient modern 4D reconstruction methods, we can generate high-fidelity and diverse 4D assets within just several minutes. Besides, our approach can also potentially benefit from the further development of foundation diffusion models videoworldsimulators2024 .

Our contributions are threefold: (i) We present a novel 4D content generation framework that achieves zero-shot generation of realistic multi-view multi-frame image arrays, which can be integrated with an off-the-shelf modern 4D reconstruction pipeline to efficiently create 4D content. (ii) We identify the conditional independence that exists in the distribution of the elements composing the image arrays. Based on it, we design a simple yet effective joint denoising strategy that combines the video diffusion model and the multi-view diffusion model to directly sample multi-view multi-frame image arrays from their natural distribution. (iii) Systematic experiments demonstrate that our proposed method can achieve satisfactory results under different types of prompts, including a single image, a single-view video, and static 3D content.

2 Related work

3D generation

3D generation aims at creating static 3D content from different prompts such as text or images. Early efforts employed GAN-based approaches gao2022get3d ; schwarz2020graf . Recently, significant breakthroughs have been achieved alongside the emergence of diffusion models ho2020denoising in this domain. DreamFusion poole2022dreamfusion introduced score distillation sampling (SDS) to unleash the creativity in diffusion models. Although this approach has exhibited promising results, the original form of SDS encounters challenges such as mode collapse, multi-face Janus issues, and slow optimization. A series of subsequent works wang2024prolificdreamer ; shi2023mvdream ; wang2023imagedream ; pan2024enhancing ; tang2023dreamgaussian ; yi2023gaussiandreamer try to address these problems by modifying this mechanism. On the other hand, some studies nichol2022point ; jun2023shap ; hong2023lrm ; tang2024lgm ; wang2024crm have explored the direct generation of 3D representations using diffusion models. Another line of research liu2023zero ; liu2023syncdreamer ; long2023wonder3d ; chen2024v3d ; tang2024mvdiffusion++ focuses on generating dense multi-view images with sufficient 3D consistency by training or fine-tuning 2D diffusion models on 3D datasets to make them more suitable for 3D generation tasks. The generated images can be used for reconstruction to obtain textured meshes, point clouds, or implicit radiance fields. We also adopt this approach of directly generating consistent images for reconstruction. Unlike in the 3D setting, however, there is no large-scale synchronized multi-view video data. Therefore, we opt to combine geometrical consistency priors and video dynamic priors to generate images.

Video generation

Video generation and prediction is an active field that has gained increasing popularity. Recent diffusion model-based video generation methods have exhibited unprecedented levels of realism, diversity, and controllability. Notably, recent breakthroughs have demonstrated the scalability of video generation models combined with Transformers and their potential as physical world simulators. Video LDM blattmann2023align was among the first works to apply the latent diffusion framework rombach2022high to video generation. The subsequent work SVD blattmann2023stable followed its architecture and made effective improvements to the training recipe. W.A.L.T gupta2023photorealistic employed a transformer with window attention tailored for spatiotemporal generative modeling to generate high-resolution videos. VDT lu2023vdt introduced the video diffusion transformer to flexibly capture long-distance spatiotemporal context in videos and a spatial-temporal mask mechanism to uniformly handle different video generation tasks. The recently introduced SORA videoworldsimulators2024 demonstrated a remarkable capability to generate arbitrarily sized long videos with intuitively physical fidelity. Models trained on large-scale video data can generate videos with consistent geometry and realistic dynamics. Furthermore, video diffusion models can also be considered as effective 3D generators chen2024v3d ; han2024vfusion3d ; voleti2024sv3d to generate multi-view consistent images. Therefore, we build our method on this flourishing domain.

4D generation

Animating category-agnostic objects is a challenging problem that has been receiving a lot of attention from both academia and industry. Compared to 3D generation, 4D generation requires not only predicting consistent geometry but also generating realistic and diverse dynamics. Recent works on 4D generation can be classified into two main streams based on the type of input prompt. The first class of methods predicts 4D representations given a single image and text description as input. For instance, MAV3D singer2023text directly deploys SDS into 4D generation and proposes a three-stage training scheme to stably generate high-resolution videos. DreamGaussian4D ren2023dreamgaussian4d adopts a similar three-stage training scheme but switches the underlying 4D representation from Hexplane cao2023hexplane to deformable 3D Gaussians and replaces the third stage with the refinement of high-resolution UV texture maps, thereby achieving a significant efficiency improvement. 4DGen yin20234dgen first generates a set of candidate 3D assets according to the image or text prompt and then grounds the 4D generation on user-specified static 3D assets and monocular video sequences. 4D-fy bahmani20234d also adopts the idea of combining diffusion priors. Unlike our method, however, it combines them during SDS, resulting in an extremely slow generation process that typically takes several days. Another line of work predicts dynamic objects from a single-view video. Although the motion is largely dictated under this setting, generating a spatiotemporally consistent structure still entails considerable uncertainty due to the limited view. Therefore, Consistent4D jiang2023consistent4d first proposes to use a generation approach to address this task. Subsequently, Efficient4D pan2024fast mimics a photogrammetry-based 3D object capture pipeline by directly generating multi-frame multi-view captures of dynamic 3D objects through SyncDreamer-T and reconstructing 4D representations from them. Note that although Efficient4D shares the same SDS-free philosophy as ours, it possesses a strong architecture bias of the foundation diffusion model and is incapable of synthesizing novel dynamics. Compared to previous works, our framework can efficiently generate diverse dynamic 4D content, avoiding slow, unstable, and intricate multi-stage optimization, and has the potential to continuously benefit from the scalability of the underlying diffusion models.

3 Method

In this section, we present a novel framework designed for efficient and scalable generation of 4D content, which consists of two stages as depicted in Figure 2. In Section 3.1 (stage-1), we will discuss how to generate a dense multi-view multi-frame image array for reconstruction through a highly parallelizable denoising process by integrating the pretrained video diffusion model and multi-view diffusion model, and why it is feasible. In Section 3.2 (stage-2), we will briefly illustrate how to robustly reconstruct 4D content from the image matrix produced in the first stage.

3.1 Image matrix generation

In this stage, our goal is to generate dense multi-frame multi-view images for reconstruction, which can be denoted as a matrix of images

\mathcal{I}=\left\{I_{i,j}\in\mathbb{R}^{H\times W\times 3}\right\}_{i=1,j=1}^{V,F}=\begin{bmatrix}I_{1,1}&\cdots&I_{1,j}&\cdots&I_{1,F}\\ \vdots&\ddots&\vdots&\ddots&\vdots\\ I_{i,1}&\cdots&I_{i,j}&\cdots&I_{i,F}\\ \vdots&\ddots&\vdots&\ddots&\vdots\\ I_{V,1}&\cdots&I_{V,j}&\cdots&I_{V,F}\end{bmatrix},

(1)

where $V$ is the number of views, $F$ is the number of video frames, and $(H,W)$ is the size of images. We aim to construct a generative model that allows us to directly sample $\mathcal{I}\sim p(\mathcal{I})$ .

Let us first review existing diffusion-based generators for video and multi-view images, which sample realistic images through the following probability flow ODE:

d\mathrm{x}=-\dot{\sigma}(t)\sigma(t)\nabla_{\mathrm{x}}\log p\left(\mathrm{x};\sigma(t)\right)dt.

(2)

Here, $\mathrm{x}=\left\{I_{i}\in\mathbb{R}^{H\times W\times 3}\right\}_{i=1}^{N}$ is a series of images with $N$ frames or $N$ views, and $\nabla_{\mathrm{x}}\log p\left(\mathrm{x};\sigma\right)$ is the score function, which can be parameterized as $\nabla_{\mathrm{x}}\log p\left(\mathrm{x};\sigma\right)\approx\left(D_{\theta}(\mathrm{x};\sigma)-\mathrm{x}\right)/\sigma^{2}$ karras2022elucidating ; blattmann2023stable , where $D_{\theta}(\mathrm{x};\sigma)$ is a neural network trained via denoising score matching.
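To make the sampling procedure in Eq. (2) concrete, the following is a minimal sketch of an Euler integrator for the probability flow ODE. It assumes the noise schedule $\sigma(t)=t$ and the EDM parameterization of karras2022elucidating , in which the score is $(D_{\theta}(\mathrm{x};\sigma)-\mathrm{x})/\sigma^{2}$. The toy `gaussian_denoiser`, the function names, and the step grid are illustrative assumptions and stand in for a trained network; this is not the sampler used in the experiments.

```python
import numpy as np

def gaussian_denoiser(x, sigma, mu=0.0, sd=1.0):
    """Ideal denoiser E[x0 | x] for toy data x0 ~ N(mu, sd^2) (assumption)."""
    return mu + (sd**2 / (sd**2 + sigma**2)) * (x - mu)

def score_from_denoiser(x, sigma, denoiser):
    """EDM-style score estimate: (D(x; sigma) - x) / sigma^2."""
    return (denoiser(x, sigma) - x) / sigma**2

def euler_pf_ode(x, sigmas, denoiser):
    """Integrate dx = -sigma'(t) sigma(t) score dt, assuming sigma(t) = t."""
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        score = score_from_denoiser(x, s_cur, denoiser)
        # sigma'(t) = 1, so dx/dt = -sigma * score; the step size is negative.
        x = x + (s_next - s_cur) * (-s_cur * score)
    return x

# Decreasing noise levels that end at 0; the score is never evaluated at 0.
sigmas = np.linspace(10.0, 0.0, 1001)
# Two starting points, one standard deviation away from the noisy mean.
x_T = np.sqrt(1.0 + 10.0**2) * np.array([-1.0, 1.0])
x_0 = euler_pf_ode(x_T, sigmas, gaussian_denoiser)
```

For this toy Gaussian, the exact ODE solution maps a point one noisy standard deviation from the mean to a point one data standard deviation from the mean, so `x_0` should be close to `[-1, 1]` up to the Euler discretization error.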

We want to extend the above formulation to the sampling of $\mathcal{I}$ . The question is, how do we estimate the score function of the joint distribution of $V\times F$ images?

For simplicity, let $\mathcal{I}_{-i,j}\triangleq\{I_{i^{\prime},j}|1\leq i^{\prime}\leq V,i^{\prime}\neq i\}$, $\mathcal{I}_{i,-j}\triangleq\{I_{i,j^{\prime}}|1\leq j^{\prime}\leq F,j^{\prime}\neq j\}$ and $\mathcal{I}_{-i,-j}\triangleq\{I_{i^{\prime},j^{\prime}}|1\leq i^{\prime}\leq V,1\leq j^{\prime}\leq F,i^{\prime}\neq i,j^{\prime}\neq j\}$. We first make an assumption on the structure of $p(\mathcal{I})$.

Assumption 3.1.

Given any image $I_{i,j}$ , the underlying geometry $\mathcal{I}_{-i,j}$ and the dynamics $\mathcal{I}_{i,-j}$ are conditionally independent, i.e.,

p\left(\mathcal{I}_{-i,j},\mathcal{I}_{i,-j}|I_{i,j}\right)=p\left(\mathcal{I}_{-i,j}|I_{i,j}\right)p\left(\mathcal{I}_{i,-j}|I_{i,j}\right).

(3)

Assumption 3.1 implies that given the front view of a 3D object, its motion as seen from the front does not correlate with its appearance from the back, which aligns with our intuition. A natural corollary is that the mollified distribution derived by adding Gaussian noise to the data distribution still maintains conditional independence:

Corollary 3.1.

Denote $\hat{\mathcal{I}}$ as the noisy version of $\mathcal{I}$ , i.e.,

\hat{\mathcal{I}}=\{\hat{I}_{i,j}\in\mathbb{R}^{H\times W\times 3}\}_{i=1,j=1}^{V,F}\quad\text{with}~\hat{I}_{i,j}=\alpha I_{i,j}+\varepsilon_{i,j},

(4)

where $\alpha\in\mathbb{R}$ is a constant and $\varepsilon_{i,j}\in\mathbb{R}^{H\times W\times 3}$ are independent Gaussian noises. Then we have

p\left(\hat{\mathcal{I}}_{-i,j},\hat{\mathcal{I}}_{i,-j}|\hat{I}_{i,j}\right)=p\left(\hat{\mathcal{I}}_{-i,j}|\hat{I}_{i,j}\right)p\left(\hat{\mathcal{I}}_{i,-j}|\hat{I}_{i,j}\right).

(5)

This property allows us to sample the desired image matrix by progressively denoising from pure Gaussian noise through the combination of two estimated scores of its marginal distributions, which can be obtained from the pretrained video and multi-view diffusion models, respectively. From it, we derive our main theorem.

Theorem 3.1.

For $\mathrm{x}=\hat{I}_{i,j}$ , we have

\nabla_{\mathrm{x}}\log p\left(\hat{\mathcal{I}}\right)=\nabla_{\mathrm{x}}\log p\left(\hat{\mathcal{I}}_{\{1:V\},j}\right)+\nabla_{\mathrm{x}}\log p\left(\hat{\mathcal{I}}_{i,\{1:F\}}\right)-\nabla_{\mathrm{x}}\log p\left(\hat{I}_{i,j}\right).

(6)

Proof.

We first decompose $p\left(\hat{\mathcal{I}}\right)$ by

	$\displaystyle p\left(\hat{\mathcal{I}}\right)$	$\displaystyle=p\left(\hat{I}_{i,j},\hat{\mathcal{I}}_{-i,j},\hat{\mathcal{I}}_{i,-j},\hat{\mathcal{I}}_{-i,-j}\right)$		(7)
		$\displaystyle=p\left(\hat{I}_{i,j},\hat{\mathcal{I}}_{-i,-j}|\hat{\mathcal{I}}_{-i,j},\hat{\mathcal{I}}_{i,-j}\right)p\left(\hat{\mathcal{I}}_{-i,j},\hat{\mathcal{I}}_{i,-j}\right).$		(8)

Note that for any $\hat{I}_{i^{\prime},j^{\prime}}\in\hat{\mathcal{I}}_{-i,-j}$, $I_{i^{\prime},j^{\prime}}$ and $I_{i,j}$ are independent conditioned on $I_{i^{\prime},j}$ by Corollary 3.1, hence

p\left(\hat{I}_{i,j},\hat{\mathcal{I}}_{-i,-j}|\hat{\mathcal{I}}_{-i,j},\hat{\mathcal{I}}_{i,-j}\right)=p\left(\hat{\mathcal{I}}_{-i,-j}|\hat{\mathcal{I}}_{-i,j},\hat{\mathcal{I}}_{i,-j}\right)p\left(\hat{I}_{i,j}|\hat{\mathcal{I}}_{-i,j},\hat{\mathcal{I}}_{i,-j}\right).

(9)

Since $p\left(\hat{\mathcal{I}}_{-i,-j}|\hat{\mathcal{I}}_{-i,j},\hat{\mathcal{I}}_{i,-j}\right)$ does not contain $\hat{I}_{i,j}$, its derivative with respect to $\hat{I}_{i,j}$ is zero. Combining equation (8) and equation (9) and taking the derivative of $\log p\left(\hat{\mathcal{I}}\right)$ with respect to $\mathrm{x}$, we obtain

	$\displaystyle\nabla_{\mathrm{x}}\log p\left(\hat{\mathcal{I}}\right)$	$\displaystyle=\nabla_{\mathrm{x}}\log\left[p\left(\hat{I}_{i,j}|\hat{\mathcal{I}}_{-i,j},\hat{\mathcal{I}}_{i,-j}\right)p\left(\hat{\mathcal{I}}_{-i,j},\hat{\mathcal{I}}_{i,-j}\right)\right]$		(10)
		$\displaystyle=\nabla_{\mathrm{x}}\log p\left(\hat{I}_{i,j},\hat{\mathcal{I}}_{-i,j},\hat{\mathcal{I}}_{i,-j}\right).$		(11)

Finally, by further decomposing $p\left(\hat{I}_{i,j},\hat{\mathcal{I}}_{-i,j},\hat{\mathcal{I}}_{i,-j}\right)$ and directly applying Corollary 3.1, we obtain

	$\displaystyle\nabla_{\mathrm{x}}\log p\left(\hat{\mathcal{I}}\right)$	$\displaystyle=\nabla_{\mathrm{x}}\log\left[p\left(\hat{\mathcal{I}}_{-i,j},\hat{\mathcal{I}}_{i,-j}|\hat{I}_{i,j}\right)p\left(\hat{I}_{i,j}\right)\right]$		(12)
		$\displaystyle=\nabla_{\mathrm{x}}\log\left[p\left(\hat{\mathcal{I}}_{-i,j}|\hat{I}_{i,j}\right)p\left(\hat{\mathcal{I}}_{i,-j}|\hat{I}_{i,j}\right)p\left(\hat{I}_{i,j}\right)\right]$		(13)
		$\displaystyle=\nabla_{\mathrm{x}}\log\frac{p\left(\hat{\mathcal{I}}_{\{1:V\},j}\right)p\left(\hat{\mathcal{I}}_{i,\{1:F\}}\right)}{p\left(\hat{I}_{i,j}\right)}$		(14)
		$\displaystyle=\nabla_{\mathrm{x}}\log p\left(\hat{\mathcal{I}}_{\{1:V\},j}\right)+\nabla_{\mathrm{x}}\log p\left(\hat{\mathcal{I}}_{i,\{1:F\}}\right)-\nabla_{\mathrm{x}}\log p\left(\hat{I}_{i,j}\right).$		(15)

∎

Here $\nabla_{\mathrm{x}}\log p\left(\hat{\mathcal{I}}_{i,\{1:F\}}\right)$ and $\nabla_{\mathrm{x}}\log p\left(\hat{\mathcal{I}}_{\{1:V\},j}\right)$ are just the score functions of the video diffusion model and the multi-view diffusion model, respectively. We use their convex combination to estimate $\nabla_{\mathrm{x}}\log p\left(\hat{I}_{i,j}\right)$ as:

\nabla_{\mathrm{x}}\log p\left(\hat{I}_{i,j}\right)=(1-s)\nabla_{\mathrm{x}}\log p\left(\hat{\mathcal{I}}_{i,\{1:F\}}\right)+s\nabla_{\mathrm{x}}\log p\left(\hat{\mathcal{I}}_{\{1:V\},j}\right).

(16)
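Substituting the estimate in Eq. (16) into Eq. (6) shows why the resulting joint score is itself a convex combination of the two model scores. The shorthand $S_{\text{mv}}$ and $S_{\text{vid}}$ below is introduced only for readability and is an assumption of this worked step, not notation used elsewhere in the paper.

```latex
\begin{aligned}
S_{\text{mv}} &\triangleq \nabla_{\mathrm{x}}\log p\left(\hat{\mathcal{I}}_{\{1:V\},j}\right),
\qquad
S_{\text{vid}} \triangleq \nabla_{\mathrm{x}}\log p\left(\hat{\mathcal{I}}_{i,\{1:F\}}\right),\\
\nabla_{\mathrm{x}}\log p\left(\hat{\mathcal{I}}\right)
&= S_{\text{mv}} + S_{\text{vid}} - \left[(1-s)\,S_{\text{vid}} + s\,S_{\text{mv}}\right]\\
&= (1-s)\,S_{\text{mv}} + s\,S_{\text{vid}}.
\end{aligned}
```

Read this way, $s=0$ uses only the multi-view score and $s=1$ uses only the video score, with intermediate values blending the two.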

In practice, we employ a logistic schedule to vary $s$ over the denoising steps. Given the current denoising step $i$ (here $i$ indexes denoising steps, not views) and the total number of steps $N$, we set $s=1-\frac{1}{1+e^{k(i/N-s_{0})}}$. This function has a sigmoidal curve: it is relatively flat away from the midpoint $s_{0}$ and changes sharply near it, with the steepness controlled by $k$. This schedule decouples the generation of consistent geometry and temporally smooth appearance to some extent.
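The logistic schedule can be sketched in a few lines. The values `k=10.0` and `s0=0.5` below are placeholders chosen for illustration, since the text does not state the values used, and the function name is hypothetical.

```python
import math

def logistic_weight(step, total_steps, k=10.0, s0=0.5):
    """s = 1 - 1 / (1 + exp(k * (step / total_steps - s0))).

    k (steepness) and s0 (midpoint) are illustrative defaults,
    not values reported in the paper.
    """
    return 1.0 - 1.0 / (1.0 + math.exp(k * (step / total_steps - s0)))

# Weights over a 50-step schedule, from step 0 to step 50 inclusive.
weights = [logistic_weight(i, 50) for i in range(51)]
```

With a positive `k`, the weight rises monotonically with the step index and equals 0.5 exactly at `step / total_steps == s0`; a larger `k` makes the transition around the midpoint sharper.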

Sampling in latent space

For convenience, the above theorem assumes that the sampled object is an image in the original RGB space. However, modern high-resolution diffusion models typically generate images encoded into a latent space by a VQ-VAE van2017neural . The legitimacy of the aforementioned derivation requires that the multi-view generator and the video generator share the same latent space. Although this requirement is not met by most current instantiations of them, we believe that it will be increasingly satisfied by more multi-view generation models in the future. As pointed out by SVD blattmann2023stable ; chen2024v3d ; videoworldsimulators2024 ; li2024sora , video generation models trained on large-scale video datasets have learned strong stereo knowledge and can thus provide a better pretraining for fine-tuning multi-view diffusion models than those trained solely on image data. Moreover, the latent encoder is usually frozen during fine-tuning on multi-view images.

Generation with various input conditions

Note that the formulation described above is based on unconditional generation. In practice, however, we are more interested in controllable generation, so we extend the aforementioned process to conditional generation. Consider the augmented matrix $\mathcal{I}_{\text{aug}}$ defined by

\mathcal{I}_{\text{aug}}=\begin{bmatrix}I_{0,0}&\mathcal{I}_{0,\{1:V\}}\\ \mathcal{I}_{\{1:F\},0}&\mathcal{I}\end{bmatrix},

(17)

where $I_{0,0}$ is the input image, and $\mathcal{I}_{0,\{1:V\}}$ and $\mathcal{I}_{\{1:F\},0}$ are the first row and first column of $\mathcal{I}_{\text{aug}}$, which we need to create first as the conditions for the subsequent generation of the full matrix. We now describe how to obtain them from various conditions. For convenience, we denote $\mathbf{V}$ as the multi-view diffusion model and $\mathbf{F}$ as the video diffusion model.

• Single image. Given $I_{0,0}$ as input, we use $\mathbf{V}$ to generate $\mathcal{I}_{0,\{1:V\}}$, which dictates the geometry of the generated 4D assets, and use $\mathbf{F}$ to generate $\mathcal{I}_{\{1:F\},0}$, which endows the static image with dynamics.
• Single-view video. Given $\mathcal{I}_{\{0:F\},0}$ as input, we use the last frame $I_{0,0}$ as the condition of $\mathbf{V}$ to generate $\mathcal{I}_{0,\{1:V\}}$.
• Static 3D model. Similarly, given $\mathcal{I}_{0,\{0:V\}}$ as input, we use the front view $I_{0,0}$ as the condition of $\mathbf{F}$ to generate $\mathcal{I}_{\{1:F\},0}$.
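The three cases above differ only in which foundation model must be run to fill in the missing first row or column of $\mathcal{I}_{\text{aug}}$. A minimal sketch of that dispatch follows; the function name and the prompt-type strings are hypothetical labels introduced for illustration.

```python
def plan_conditions(prompt_type):
    """Return which foundation models must run to build the conditions.

    "multi_view" refers to the multi-view model V and "video" to the
    video model F; the keys and prompt-type names are illustrative.
    """
    plans = {
        # Only I_{0,0} is given: generate both the views and the frames.
        "single_image": {"multi_view": True, "video": True},
        # The frames are given: only the views must be generated.
        "single_view_video": {"multi_view": True, "video": False},
        # The views are given: only the frames must be generated.
        "static_3d": {"multi_view": False, "video": True},
    }
    if prompt_type not in plans:
        raise ValueError(f"unknown prompt type: {prompt_type!r}")
    return plans[prompt_type]
```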

Assumption 3.1 ensures that the geometry $\mathcal{I}_{0,\{1:V\}}$ and the motion $\mathcal{I}_{\{1:F\},0}$ can safely be generated independently. Additionally, there is no computational or data dependency between these two generation processes, allowing their total time cost to be reduced to that of a single reverse diffusion process. We then denoise the remaining part of $\mathcal{I}_{\text{aug}}$ from pure Gaussian noise. In each step, we run the score estimators for each row and column conditioned on $\mathcal{I}_{0,\{1:V\}}$ and $\mathcal{I}_{\{1:F\},0}$, and combine their results as in Theorem 3.1 to update the noisy latent. Since the score estimation for each row and column can also be parallelized, the time cost of each step can be reduced to that of a single diffusion step. Therefore, with sufficient GPU memory, the total time spent on the process illustrated in Figure 2(ii) remains the same as that for generating a single video.
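The per-step score composition can be sketched as follows, assuming the noisy image matrix is stored as an array of shape `(V, F, D)` with `D` flattened pixels or latents per image. The callables `video_score` and `multiview_score` are hypothetical stand-ins for the two pretrained models, and conditioning on the first row and column is omitted for brevity.

```python
import numpy as np

def composed_score(x, sigma, s, video_score, multiview_score):
    """Combine per-view video scores and per-frame multi-view scores.

    x: array of shape (V, F, D). video_score maps an (F, D) sequence to an
    (F, D) score; multiview_score maps a (V, D) set to a (V, D) score.
    Both callables are placeholders for the pretrained models.
    """
    V, F, _ = x.shape
    # Video model: one independent call per view, over that view's F frames.
    s_vid = np.stack([video_score(x[i], sigma) for i in range(V)], axis=0)
    # Multi-view model: one independent call per frame, over its V views.
    s_mv = np.stack([multiview_score(x[:, j], sigma) for j in range(F)], axis=1)
    # Eq. (16): convex-combination estimate of the single-image score.
    s_single = (1.0 - s) * s_vid + s * s_mv
    # Eq. (6): joint score for every entry of the image matrix.
    return s_mv + s_vid - s_single

# Toy usage with linear placeholder scores.
x = np.arange(24, dtype=float).reshape(2, 3, 4)
out = composed_score(x, 1.0, 0.25,
                     lambda seq, sig: -seq,
                     lambda views, sig: -2.0 * views)
```

The two list comprehensions have no dependency on each other or across iterations, which is what allows the row and column evaluations to be batched and run in parallel.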

3.2 Robust reconstruction

4D representation

Given generated synchronized multi-view videos conditioned on any type of prompt, numerous methods can be employed to reconstruct 4D assets. Among these candidates, we adopt 4D Gaussian Splatting due to its superior fitting capabilities and efficient optimization given dense multi-view supervision.

Optimization

Although the images generated in the first stage already have intuitively satisfactory spatiotemporal consistency, the performance limitation of the foundational multi-view generation components still makes it difficult to achieve precise pixel-level matching across different views and frames. Therefore, we follow chen2024v3d and optimize a combination of the perceptual loss $\mathcal{L}_{lpips}$ zhang2018unreasonable and D-SSIM $\mathcal{L}_{ssim}$ wang2004image while omitting the L1 loss. In addition, we weight each term with a confidence score, so the total objective is defined as $\mathcal{L}_{total}=\lambda_{lpips}\mathcal{C}_{lpips}\mathcal{L}_{lpips}+\lambda_{ssim}\mathcal{C}_{ssim}\mathcal{L}_{ssim}$, where $\mathcal{C}_{ssim}$ is just the SSIM between ground truth and rendered images and $\mathcal{C}_{lpips}$ is defined as $1-\mathcal{L}_{lpips}$.
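As a worked instance of the objective, the sketch below evaluates $\mathcal{L}_{total}$ from scalar LPIPS and SSIM values. It assumes D-SSIM is defined as $1-\text{SSIM}$, which the text does not specify, and the default weights of 1.0 are placeholders; whether the confidence terms are applied per image or per pixel, and whether gradients flow through them, is likewise left open here.

```python
def total_loss(lpips_value, ssim_value, lambda_lpips=1.0, lambda_ssim=1.0):
    """Confidence-weighted LPIPS + D-SSIM objective on scalar inputs.

    Assumes D-SSIM = 1 - SSIM; the lambda defaults are illustrative.
    """
    l_lpips = lpips_value
    l_ssim = 1.0 - ssim_value      # assumed D-SSIM definition
    c_lpips = 1.0 - l_lpips        # confidence for the perceptual term
    c_ssim = ssim_value            # confidence for the structural term
    return lambda_lpips * c_lpips * l_lpips + lambda_ssim * c_ssim * l_ssim
```

For example, with LPIPS 0.2 and SSIM 0.9 the two terms are 0.8 × 0.2 and 0.9 × 0.1, so poorly matched image pairs (high LPIPS, low SSIM) receive smaller confidence weights.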

4 Experiments

4.1 Implementation details

In the first stage, we use Stable Video Diffusion blattmann2023stable as our foundation video diffusion model, predicting 25 frames each time according to the image prompt. SV3D$^{p}$ voleti2024sv3d is chosen as the foundation multi-view diffusion model. For simplicity, we only generate orbital videos that have 21 uniformly spaced azimuths and a fixed elevation, and we manually filter out side views, as these views typically contain thin structures that pose challenges for the video generation model and the subsequent reconstruction process. By default, we set the number of sampling steps to 50 for both generative models. In the reconstruction stage, we optimize 4D Gaussian Splatting for 5k iterations without bells and whistles. The image size is set to (576 $\times$ 576) in both stages.
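The settings stated above can be collected in a small configuration object. The class and field names are hypothetical and only the numeric values come from the text; the entry count is computed before any manual filtering of side views.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Diffusion2Config:
    """Hyperparameters as stated in the text; field names are illustrative."""
    num_frames: int = 25           # frames predicted by the video model
    num_azimuths: int = 21         # uniformly spaced orbital views
    sampling_steps: int = 50       # denoising steps for both models
    recon_iterations: int = 5000   # 4D Gaussian Splatting iterations
    image_size: tuple = (576, 576)

    def matrix_entries(self):
        """Number of images in the full view-by-frame matrix."""
        return self.num_azimuths * self.num_frames

cfg = Diffusion2Config()
```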

4.2 4D generation from single image

Table 1: User study on image-to-4D generation. For each criterion, we report the proportion of users who preferred each method, together with the generation time.

Method	Details	Geometry	Temporal smoothness	Overall model quality	Generation time
Animate124	11.3%	31.0%	16.0%	18.0%	9h
DreamGaussian4D	27.7%	24.3%	48.0%	25.7%	12m
Ours	61.0%	44.7%	36.0%	56.3%	10m

In Figure 3 (a), we show the results generated by the proposed method and compare them with other alternatives. It can be observed that our concise pipeline is capable of generating 4D assets of comparable quality to those produced by state-of-the-art SDS-based methods with sophisticated multi-stage optimization. Furthermore, the parallel evaluation of a large 2D generative model gives our method a potential efficiency advantage. We also conducted a user study (see the appendix for details), the results of which are reported in Table 1. They suggest that our method garnered the highest human preference in multi-view consistency, detail, and overall model quality.

4.3 4D generation from single view video

Table 2: Quantitative comparisons on video-to-4D generation.

Method	Type	CLIP Similarity $\uparrow$	Generation time $\downarrow$
Consistent4D	Optimization-based	0.87	2h
4DGen	Optimization-based	0.89	2h10m
Efficient4D	Photogrammetry-based	0.92	14m
Ours	Photogrammetry-based	0.94	10m

Generating 4D dynamic objects from a fixed-view video is a practical task first introduced in jiang2023consistent4d. Compared to generation from a single image, this task additionally constrains the object's motion. Our proposed framework can be easily adapted to this task, as detailed in Section 3.1. We perform both quantitative and qualitative comparisons with other counterparts under this setting. The qualitative results are shown in Figure 3 (b). It can be seen that our method slightly reduces the over-saturated appearance. For quantitative evaluation, we report the CLIP similarity between generated views and ground-truth images to indicate the overall semantic consistency and the recognizability of the generated images. The quantitative metrics in Table 2 also favor our method.
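As a rough illustration of how such a CLIP similarity score is typically aggregated, the sketch below averages the cosine similarity between paired image embeddings. It assumes the embeddings have already been extracted by a CLIP image encoder (not shown), and the function names are hypothetical; the exact evaluation protocol is not described in this section.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_clip_similarity(generated_embeds, ground_truth_embeds):
    """Average cosine similarity over paired (generated, ground-truth) views.

    Assumption: both lists hold precomputed CLIP image embeddings,
    aligned so that index i refers to the same view and frame.
    """
    sims = [cosine_similarity(g, t)
            for g, t in zip(generated_embeds, ground_truth_embeds)]
    return sum(sims) / len(sims)


# Toy 2-D vectors standing in for real embeddings.
score = mean_clip_similarity([[1.0, 0.0], [1.0, 0.0]],
                             [[1.0, 0.0], [0.0, 1.0]])
```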

4.4 4D generation from static 3D content

Naturally, Diffusion² can also animate static 3D models into dynamic objects as described in Section 3.1, which has substantial practical implications. From Figure 4 we can observe that our method is capable of endowing 3D models with diverse and realistic dynamics while maintaining satisfactory temporal and geometrical consistency.

4.5 Ablation studies

Since previous methods have explored dynamic content reconstruction well, we mainly focus on ablating the key design choices in the image matrix generation stage. In Figure 5 (a), we adjust the two parameters that control the curve of the logistic schedule. The results reveal that, when only the video prior is used, the generated images fail to stay consistent with the other views in perspective, geometry, and detail. For example, in the second row and second column of Figure 5 (a), the dress still sags evenly down both sides of the body, unlike in the reference view, where the dress flutters to the left side of the body. This indicates that without the guidance of a geometry prior, the dynamics of each view are totally independent of each other. The other extreme is setting $s$ to 1, that is, using only the multi-view prior, which can no longer guarantee the consistency of details between different frames of invisible views. Comparing the last column of the first two rows of Figure 5 (a) reveals a remarkable change in the shape of the tail of the hair. We therefore adopt a compromise and set $s$ to 0.5, i.e., the weight of the multi-view score starts to drop drastically halfway through the denoising process. In most cases, this choice achieves smooth temporal transitions as well as geometrical consistency. In addition, the speed at which the weight decreases, which is controlled by another parameter $k$, may also affect the quality of the generated images. Therefore, we test image generation under different $k$ given $s=0.5$. It can be seen that when $k$ is relatively small, the generated images may exhibit ghosting effects. On the other hand, when $k$ is too large, it may weaken the temporal smoothing effect brought by the video diffusion model in the early stages of denoising. Consequently, we adopt $s=0.5,k=20$ as the default setting. Finally, we also examine the impact of different noise schedules, as shown in Figure 5 (b). It can be observed that all other choices lead to ghosting artifacts.
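The roles of $s$ and $k$ discussed above can be illustrated with a minimal sketch. The exact form of the schedule is not restated in this section, so the logistic function below, the normalized `progress` variable, the convex combination in `compose_scores`, and all function names are assumptions chosen only to reproduce the described behavior: a multi-view weight that stays high early and drops sharply around the fraction $s$ of the denoising process, with $k$ controlling how abrupt the drop is.

```python
import math


def multiview_weight(progress, s=0.5, k=20.0):
    """Hypothetical logistic weight of the multi-view score.

    progress: fraction of the denoising process completed, in [0, 1]
              (0 = pure noise, 1 = final image).
    s:        fraction at which the weight crosses 0.5.
    k:        steepness of the transition around s.
    """
    return 1.0 / (1.0 + math.exp(k * (progress - s)))


def compose_scores(score_multiview, score_video, progress, s=0.5, k=20.0):
    """Blend two score estimates element-wise (assumed convex combination)."""
    w = multiview_weight(progress, s, k)
    return [w * a + (1.0 - w) * b
            for a, b in zip(score_multiview, score_video)]


# With the default s = 0.5, k = 20 the multi-view prior dominates early
# and the video prior dominates late in the denoising process.
early = multiview_weight(0.0)   # close to 1
middle = multiview_weight(0.5)  # exactly 0.5
late = multiview_weight(1.0)    # close to 0
```

In this sketch, a small $k$ makes the two priors overlap for many steps, while a very large $k$ turns the schedule into a hard switch that leaves the video prior almost no influence before the switching point, which matches the trade-off described in the ablation.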

5 Conclusion

In this work, we present a novel 4D content generation framework, dubbed Diffusion², which efficiently generates dense, consistent multi-view multi-frame image arrays with high parallelism and then feeds them into a 4D reconstruction pipeline to create a full 4D representation. Our key assumption is that elements $\mathcal{I}_{i,j}$ and $\mathcal{I}_{k,l}$ ($i\neq k,j\neq l$) in the multi-view multi-frame image array $\mathcal{I}$ are conditionally independent given $\mathcal{I}_{i,l}$ or $\mathcal{I}_{k,j}$. This aligns with our intuition: past or future motion and the appearance in other views are decoupled to a large extent. Based on this assumption, we prove that we can directly sample synchronized multi-view videos $\mathcal{I}$ in a denoising process by combining pretrained video diffusion models and multi-view diffusion models. Experimental results show that the proposed framework can flexibly adapt to various types of prompts. We hope that our work can inspire future research on unleashing and combining the geometrical and dynamic priors from foundation 3D and video diffusion models.

References

(1) Sherwin Bahmani, Ivan Skorokhodov, Victor Rong, Gordon Wetzstein, Leonidas Guibas, Peter Wonka, Sergey Tulyakov, Jeong Joon Park, Andrea Tagliasacchi, and David B Lindell. 4d-fy: Text-to-4d generation using hybrid score distillation sampling. In CVPR, 2024.
(2) Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint, 2023.
(3) Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In CVPR, 2023.
(4) Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. https://openai.com/research/video-generation-models-as-world-simulators, 2024.
(5) Ang Cao and Justin Johnson. Hexplane: A fast representation for dynamic scenes. In CVPR, 2023.
(6) Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. Tensorf: Tensorial radiance fields. In ECCV, 2022.
(7) Zilong Chen, Yikai Wang, Feng Wang, Zhengyi Wang, and Huaping Liu. V3d: Video diffusion models are effective 3d generators. arXiv preprint, 2024.
(8) Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In CVPR, 2023.
(9) Jun Gao, Tianchang Shen, Zian Wang, Wenzheng Chen, Kangxue Yin, Daiqing Li, Or Litany, Zan Gojcic, and Sanja Fidler. Get3d: A generative model of high quality 3d textured shapes learned from images. In NeurIPS, 2022.
(10) Quankai Gao, Qiangeng Xu, Zhe Cao, Ben Mildenhall, Wenchao Ma, Le Chen, Danhang Tang, and Ulrich Neumann. Gaussianflow: Splatting gaussian dynamics for 4d content creation. arXiv preprint, 2024.
(11) Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, and José Lezama. Photorealistic video generation with diffusion models. arXiv preprint, 2023.
(12) Junlin Han, Filippos Kokkinos, and Philip Torr. Vfusion3d: Learning scalable 3d generative models from video diffusion models. arXiv preprint, 2024.
(13) Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020.
(14) Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. Lrm: Large reconstruction model for single image to 3d. In ICLR, 2024.
(15) Yanqin Jiang, Li Zhang, Jin Gao, Weimin Hu, and Yao Yao. Consistent4d: Consistent 360° dynamic object generation from monocular video. In ICLR, 2024.
(16) Heewoo Jun and Alex Nichol. Shap-e: Generating conditional 3d implicit functions. arXiv preprint, 2023.
(17) Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In NeurIPS, 2022.
(18) Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. In ACM TOG, 2023.
(19) Xuanyi Li, Daquan Zhou, Chenxu Zhang, Shaodong Wei, Qibin Hou, and Ming-Ming Cheng. Sora generates videos with stunning geometrical consistency. arXiv preprint, 2024.
(20) Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object. In ICCV, 2023.
(21) Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, and Wenping Wang. Syncdreamer: Generating multiview-consistent images from a single-view image. In ICLR, 2024.
(22) Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, et al. Wonder3d: Single image to 3d using cross-domain diffusion. arXiv preprint, 2023.
(23) Haoyu Lu, Guoxing Yang, Nanyi Fei, Yuqi Huo, Zhiwu Lu, Ping Luo, and Mingyu Ding. Vdt: General-purpose video diffusion transformers via mask modeling. In ICLR, 2024.
(24) Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. Point-e: A system for generating 3d point clouds from complex prompts. arXiv preprint, 2022.
(25) Zijie Pan, Jiachen Lu, Xiatian Zhu, and Li Zhang. Enhancing high-resolution 3d generation through pixel-wise gradient clipping. In ICLR, 2024.
(26) Zijie Pan, Zeyu Yang, Xiatian Zhu, and Li Zhang. Fast dynamic 3d object generation from a single-view video. arXiv preprint, 2024.
(27) Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. In ICLR, 2023.
(28) Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. In ICCV, 2021.
(29) Jiawei Ren, Liang Pan, Jiaxiang Tang, Chi Zhang, Ang Cao, Gang Zeng, and Ziwei Liu. Dreamgaussian4d: Generative 4d gaussian splatting. arXiv preprint, 2023.
(30) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
(31) Katja Schwarz, Yiyi Liao, Michael Niemeyer, and Andreas Geiger. Graf: Generative radiance fields for 3d-aware image synthesis. In NeurIPS, 2020.
(32) Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. Mvdream: Multi-view diffusion for 3d generation. arXiv preprint, 2023.
(33) Uriel Singer, Shelly Sheynin, Adam Polyak, Oron Ashual, Iurii Makarov, Filippos Kokkinos, Naman Goyal, Andrea Vedaldi, Devi Parikh, Justin Johnson, et al. Text-to-4d dynamic scene generation. arXiv preprint, 2023.
(34) Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In ICLR, 2021.
(35) Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In ICLR, 2021.
(36) Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang, Gang Zeng, and Ziwei Liu. Lgm: Large multi-view gaussian model for high-resolution 3d content creation. arXiv preprint, 2024.
(37) Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. In ICLR, 2024.
(38) Shitao Tang, Jiacheng Chen, Dilin Wang, Chengzhou Tang, Fuyang Zhang, Yuchen Fan, Vikas Chandra, Yasutaka Furukawa, and Rakesh Ranjan. Mvdiffusion++: A dense high-resolution multi-view diffusion model for single or sparse-view 3d object reconstruction. arXiv preprint, 2024.
(39) Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In NeurIPS, 2017.
(40) Vikram Voleti, Chun-Han Yao, Mark Boss, Adam Letts, David Pankratz, Dmitry Tochilkin, Christian Laforte, Robin Rombach, and Varun Jampani. Sv3d: Novel multi-view synthesis and 3d generation from a single image using latent video diffusion. arXiv preprint, 2024.
(41) Peng Wang and Yichun Shi. Imagedream: Image-prompt multi-view diffusion for 3d generation. arXiv preprint, 2023.
(42) Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. In NeurIPS, 2023.
(43) Zhengyi Wang, Yikai Wang, Yifei Chen, Chendong Xiang, Shuo Chen, Dajiang Yu, Chongxuan Li, Hang Su, and Jun Zhu. Crm: Single image to 3d textured mesh with convolutional reconstruction model. arXiv preprint, 2024.
(44) Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. In IEEE TIP, 2004.
(45) Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4d gaussian splatting for real-time dynamic scene rendering. In CVPR, 2024.
(46) Tong Wu, Jiarui Zhang, Xiao Fu, Yuxin Wang, Jiawei Ren, Liang Pan, Wayne Wu, Lei Yang, Jiaqi Wang, Chen Qian, Dahua Lin, and Ziwei Liu. Omniobject3d: Large-vocabulary 3d object dataset for realistic perception, reconstruction and generation. In CVPR, 2023.
(47) Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, and Xiaogang Jin. Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction. In CVPR, 2024.
(48) Taoran Yi, Jiemin Fang, Guanjun Wu, Lingxi Xie, Xiaopeng Zhang, Wenyu Liu, Qi Tian, and Xinggang Wang. Gaussiandreamer: Fast generation from text to 3d gaussian splatting with point cloud priors. In CVPR, 2024.
(49) Yuyang Yin, Dejia Xu, Zhangyang Wang, Yao Zhao, and Yunchao Wei. 4dgen: Grounded 4d content generation with spatial-temporal consistency. arXiv preprint, 2023.
(50) Xianggang Yu, Mutian Xu, Yidan Zhang, Haolin Liu, Chongjie Ye, Yushuang Wu, Zizheng Yan, Chenming Zhu, Zhangyang Xiong, Tianyou Liang, et al. Mvimgnet: A large-scale dataset of multi-view images. In CVPR, 2023.
(51) Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In ICCV, 2023.
(52) Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
(53) Yuyang Zhao, Zhiwen Yan, Enze Xie, Lanqing Hong, Zhenguo Li, and Gim Hee Lee. Animate124: Animating one image to 4d dynamic scene. arXiv preprint, 2023.
(54) Qi Zuo, Xiaodong Gu, Lingteng Qiu, Yuan Dong, Zhengyi Zhao, Weihao Yuan, Rui Peng, Siyu Zhu, Zilong Dong, Liefeng Bo, et al. Videomv: Consistent multi-view generation based on large video generative model. arXiv preprint, 2024.
