Diffusion²: Dynamic 3D Content Generation via Score Composition of Orthogonal Diffusion Models

Zeyu Yang*  Zijie Pan*  Chun Gu  Li Zhang†
Fudan University
https://github.com/fudan-zvg/diffusion-square
* Equally contributed. † Li Zhang (lizhangfd@fudan.edu.cn) is the corresponding author with School of Data Science, Fudan University.
Abstract

Recent advancements in 3D generation are predominantly propelled by improvements in 3D-aware image diffusion models, which are pretrained on Internet-scale image data and fine-tuned on massive 3D data, offering the capability of producing highly consistent multi-view images. However, due to the scarcity of synchronized multi-view video data, it is impractical to adapt this paradigm to 4D generation directly. Despite that, the available video and 3D data are adequate for training video and multi-view diffusion models that can provide satisfactory dynamic and geometric priors, respectively. In this paper, we present Diffusion², a novel framework for dynamic 3D content creation that leverages the knowledge about geometric consistency and temporal smoothness from these models to directly sample dense multi-view and multi-frame images, which can be employed to optimize a continuous 4D representation. Specifically, we design a simple yet effective denoising strategy via score composition of video and multi-view diffusion models based on the probability structure of the images to be generated. Owing to the high parallelism of the image generation and the efficiency of the modern 4D reconstruction pipeline, our framework can generate 4D content within a few minutes. Furthermore, our method circumvents the reliance on 4D data, thereby having the potential to benefit from the scalability of the foundation video and multi-view diffusion models. Extensive experiments demonstrate the efficacy of our proposed framework and its capability to flexibly adapt to various types of prompts.

1 Introduction

Figure 1: Diffusion² is designed to generate dynamic content by creating a dense multi-frame multi-view image matrix in a highly parallel denoising diffusion process that combines the foundation video diffusion model and multi-view diffusion model. The generated images can be used to construct a full 4D representation by being fed into an off-the-shelf 4D reconstruction pipeline.

Spurred by the advances in generative image models ho2020denoising ; song2020denoising ; song2020score ; karras2022elucidating ; zhang2023adding , automatic 3D content creation poole2022dreamfusion ; wang2024prolificdreamer ; tang2023dreamgaussian ; hong2023lrm has witnessed remarkable progress in terms of efficiency, fidelity, diversity, and controllability. Coupled with the breakthroughs in 4D representation yang2023deformable ; wu20234d , these advancements further foster substantial development in dynamic content (4D) generation singer2023text ; bahmani20234d ; jiang2023consistent4d ; zhao2023animate124 ; ren2023dreamgaussian4d ; gao2024gaussianflow , which holds significant value across a wide range of applications in animation, film, games, and the metaverse.

Recently, 3D content generation has achieved considerable breakthroughs in efficiency. Some works liu2023zero ; liu2023syncdreamer ; shi2023mvdream ; wang2023imagedream ; tang2024mvdiffusion++ inject stereo knowledge into the image generation model, enabling these 3D-aware image generators to produce consistent multi-view images, thereby effectively stabilizing and accelerating the optimization. Other efforts hong2023lrm ; chen2024v3d ; voleti2024sv3d ; zuo2024videomv ; tang2024lgm attempt to directly generate 3D representations, such as triplane chen2022tensorf or Gaussian splatting kerbl3Dgaussians . However, the efficiency improvement from these works is largely data-driven wu2023omniobject3d ; reizenstein2021common ; yu2023mvimgnet ; deitke2023objaverse . Consequently, it is infeasible to adapt these approaches to 4D generation due to the scarcity of synchronized multi-view video data. Therefore, most existing optimization-based 4D generation works still suffer from slow and unstable optimization.

However, despite the paucity of 4D data, vast amounts of monocular video data and static multi-view image data are available. Existing works have demonstrated that it is feasible to train diffusion-based generative models to learn the distribution of these two classes of data separately voleti2024sv3d ; liu2023syncdreamer ; tang2024mvdiffusion++ ; blattmann2023align ; blattmann2023stable . Considering that a video diffusion model stores the prior of motion and temporal smoothness, and a multi-view diffusion model has sound knowledge of geometric consistency, combining the two generative models to generate 4D assets becomes a highly promising and appealing approach.

Therefore, in this paper we propose a novel 4D generation framework, which combines a video diffusion model and a multi-view diffusion model to directly sample a multi-frame multi-view image array, imitating the photographing process of 4D content. To demonstrate how we realize this combination, we assume that such an image matrix has a nice structure: elements that are in neither the same row nor the same column are conditionally independent of each other. Based on this property, we design a simple yet effective denoising strategy in which the estimated score is just the convex combination of the scores predicted by the two foundation diffusion models. Our formulation is easy to adapt to various prompts, including a single image, a single-view video, and static 3D content. Unlike the existing optimization-based counterparts, our image generation is highly parallel. Combined with efficient modern 4D reconstruction methods, we can generate high-fidelity and diverse 4D assets within just several minutes. Besides, our approach can also potentially benefit from the further development of foundation diffusion models videoworldsimulators2024 .

Our contributions are threefold: (i) We present a novel 4D content generation framework that achieves zero-shot generation of realistic multi-view multi-frame image arrays, which can be integrated with an off-the-shelf modern 4D reconstruction pipeline to efficiently create 4D content. (ii) We identify the conditional independence that exists in the distribution of the elements composing the image arrays. Based on it, we design a simple yet effective joint denoising strategy that combines a video diffusion model and a multi-view diffusion model to directly sample multi-view multi-frame image arrays from their natural distribution. (iii) Systematic experiments demonstrate that our proposed method can achieve satisfactory results under different types of prompts, including a single image, a single-view video, and static 3D content.

2 Related work

3D generation

3D generation aims at creating static 3D content from different prompts like text or image. Early efforts employed GAN-based approaches gao2022get3d ; schwarz2020graf . Recently, significant breakthroughs have been achieved alongside the emergence of diffusion models ho2020denoising in this domain. DreamFusion poole2022dreamfusion introduced score distillation sampling (SDS) to unleash the creativity in diffusion models. Although this approach has exhibited promising results, the original form of SDS encounters challenges such as mode collapse, multi-face Janus issues, and slow optimization. A series of subsequent works wang2024prolificdreamer ; shi2023mvdream ; wang2023imagedream ; pan2024enhancing ; tang2023dreamgaussian ; yi2023gaussiandreamer try to address these problems by modifying this mechanism. On the other hand, some studies nichol2022point ; jun2023shap ; hong2023lrm ; tang2024lgm ; wang2024crm have explored the direct generation of 3D representations using diffusion models. Another line of research liu2023zero ; liu2023syncdreamer ; long2023wonder3d ; chen2024v3d ; tang2024mvdiffusion++ focuses on generating dense multi-view images with sufficient 3D consistency by training or fine-tuning 2D diffusion models on 3D datasets to make them more suitable for 3D generation tasks. The generated images can be used for reconstruction to obtain textured meshes, point clouds, or implicit radiance fields. We also adopt this approach of directly generating consistent images for reconstruction. Unlike in the 3D setting, however, no large-scale synchronized multi-view video data are available. Therefore, we opt to combine geometric consistency priors and video dynamic priors to generate images.

Video generation

Video generation and prediction is an active field that has gained increasing popularity. Recent diffusion model-based video generation methods have exhibited unprecedented levels of realism, diversity, and controllability. Notably, recent breakthroughs have demonstrated the scalability of video generation models combined with Transformers and their potential as physical world simulators. Video LDM blattmann2023align was among the first works to apply the latent diffusion framework rombach2022high to video generation. The subsequent work SVD blattmann2023stable followed its architecture and made effective improvements to the training recipe. W.A.L.T gupta2023photorealistic employed a transformer with window attention tailored for spatiotemporal generative modeling to generate high-resolution videos. VDT lu2023vdt introduced the video diffusion transformer to flexibly capture long-distance spatiotemporal context in videos and a spatial-temporal mask mechanism to uniformly handle different video generation tasks. The recently introduced SORA videoworldsimulators2024 demonstrated a remarkable capability to generate arbitrarily sized long videos with intuitively physical fidelity. Models trained on large-scale video data can generate videos with consistent geometry and realistic dynamics. Furthermore, video diffusion models can also be considered as effective 3D generators chen2024v3d ; han2024vfusion3d ; voleti2024sv3d to generate multi-view consistent images. Therefore, we build our method on this flourishing domain.

4D generation

Animating category-agnostic objects is a challenging problem that has received considerable attention from both academia and industry. Compared to 3D generation, 4D generation requires not only predicting consistent geometry but also generating realistic and diverse dynamics. Recent works on 4D generation can be classified into two main streams based on the type of input prompt. The first class of methods predicts 4D representations given a single image and text description as input. For instance, MAV3D singer2023text directly applies SDS to 4D generation and proposes a three-stage training scheme to stably generate high-resolution videos. DreamGaussian4D ren2023dreamgaussian4d adopts a similar three-stage training scheme but switches the underlying 4D representation from Hexplane cao2023hexplane to deformable 3D Gaussians and replaces the third stage with the refinement of high-resolution UV texture maps, thereby achieving a significant efficiency improvement. 4DGen yin20234dgen first generates a set of candidate 3D assets according to the image or text prompt and then grounds the 4D generation on user-specified static 3D assets and monocular video sequences. 4D-fy bahmani20234d also adopts the idea of combining diffusion priors. Unlike our method, however, it combines them during SDS, resulting in an extremely slow generation process that typically takes several days. Another line of work predicts dynamic objects from single-view videos. Although the motion is largely dictated under this setting, generating a spatiotemporally consistent structure still entails considerable uncertainty due to the limited view. Consistent4D jiang2023consistent4d is the first to propose a generative approach to address this task.
Subsequently, Efficient4D pan2024fast mimics a photogrammetry-based 3D object capture pipeline by directly generating multi-frame multi-view captures of dynamic 3D objects through SyncDreamer-T and reconstructing 4D representations with them. Note that although Efficient4D shares the same SDS-free philosophy as ours, it possesses a strong architecture bias of the foundation diffusion model and is incapable of synthesizing novel dynamics. Compared to previous works, our framework can efficiently generate diverse dynamic 4D content, avoiding the slow, unstable, and intricate multi-stage optimization, and has the potential to continuously benefit from the scalability of the underlying diffusion model.

3 Method

Figure 2: The overall pipeline of Diffusion². (i) Given a reference image, Diffusion² first independently generates the animation under the reference view (denoted $\mathcal{I}_{0,1:F}$) and the multi-view images at the reference time (denoted $\mathcal{I}_{1:V,0}$) as the condition for the subsequent generation of the full matrix, denoted $\mathcal{I}$. Depending on the type of the given prompt, the condition images $\mathcal{I}_{1:V,0}$ or $\mathcal{I}_{0,1:F}$ can be specified by users. (ii) Then, Diffusion² directly samples a dense multi-frame multi-view image array by blending the estimated scores from pretrained video and multi-view diffusion models in the reverse-time SDE. (iii) The generated image arrays are employed as supervision to optimize a continuous 4D content representation.

In this section, we present a novel framework designed for efficient and scalable generation of 4D content, which consists of two stages as depicted in Figure 2. In Section 3.1 (stage-1), we will discuss how to generate a dense multi-view multi-frame image array for reconstruction through a highly parallelizable denoising process by integrating the pretrained video diffusion model and multi-view diffusion model, and why it is feasible. In Section 3.2 (stage-2), we will briefly illustrate how to robustly reconstruct 4D content from the image matrix produced in the first stage.
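As a rough illustration of the stage-1 idea, the sketch below blends two denoisers over a V × F grid. It is not the authors' implementation: `video_denoiser` and `multiview_denoiser` are hypothetical stand-ins for the pretrained models, each grid cell is a single float rather than an image, and the blending weight `s` is an assumed hyperparameter. Because each score is an affine function of the corresponding denoised output at a fixed noise level, a convex combination of denoised outputs corresponds to the same convex combination of scores.

```python
from typing import List

Grid = List[List[float]]  # grid[v][f]: toy scalar "image" at view v, frame f


def _pull_to_mean(values: List[float], sigma: float) -> List[float]:
    """Toy denoiser: move each value toward the sequence mean, more so at high noise."""
    mean = sum(values) / len(values)
    w = sigma**2 / (1.0 + sigma**2)
    return [(1.0 - w) * x + w * mean for x in values]


def video_denoiser(frames: List[float], sigma: float) -> List[float]:
    """Hypothetical stand-in for a video diffusion model (one view, all frames)."""
    return _pull_to_mean(frames, sigma)


def multiview_denoiser(views: List[float], sigma: float) -> List[float]:
    """Hypothetical stand-in for a multi-view diffusion model (one frame, all views)."""
    return _pull_to_mean(views, sigma)


def blended_denoise(grid: Grid, sigma: float, s: float) -> Grid:
    """Convex combination of row-wise (video) and column-wise (multi-view) estimates."""
    V, F = len(grid), len(grid[0])
    # One video-model call per view; the V calls are independent of each other.
    by_row = [video_denoiser(grid[v], sigma) for v in range(V)]
    # One multi-view-model call per frame; the F calls are independent of each other.
    by_col = [multiview_denoiser([grid[v][f] for v in range(V)], sigma) for f in range(F)]
    return [
        [s * by_row[v][f] + (1.0 - s) * by_col[f][v] for f in range(F)]
        for v in range(V)
    ]
```

Setting `s = 1.0` recovers the video-only estimate and `s = 0.0` the multi-view-only estimate; the row and column calls do not depend on one another, which is what makes the procedure easy to parallelize.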

3.1 Image matrix generation

In this stage, our goal is to generate dense multi-frame multi-view images for reconstruction, which can be denoted as a matrix of images

\[
\mathcal{I}=\left\{I_{i,j}\in\mathbb{R}^{H\times W\times 3}\right\}_{i=1,j=1}^{V,F}=
\begin{bmatrix}
I_{1,1} & \cdots & I_{1,j} & \cdots & I_{1,F}\\
\vdots & \ddots & \vdots & \ddots & \vdots\\
I_{i,1} & \cdots & I_{i,j} & \cdots & I_{i,F}\\
\vdots & \ddots & \vdots & \ddots & \vdots\\
I_{V,1} & \cdots & I_{V,j} & \cdots & I_{V,F}
\end{bmatrix},
\tag{1}
\]

where $V$ is the number of views, $F$ is the number of video frames, and $(H,W)$ is the size of images. We aim to construct a generative model that allows us to directly sample $\mathcal{I}\sim p(\mathcal{I})$.

Let us first briefly review existing diffusion-based generators for video and multi-view images, which can be used to sample realistic images through the following probability flow ODE:

\[
d\mathrm{x}=-\dot{\sigma}(t)\,\sigma(t)\,\nabla_{\mathrm{x}}\log p\left(\mathrm{x};\sigma(t)\right)dt.
\tag{2}
\]

Here, $\mathrm{x}=\left\{I_{i}\in\mathbb{R}^{H\times W\times 3}\right\}_{i=1}^{N}$ is a series of images with $N$ frames or $N$ views, and $\nabla_{\mathrm{x}}\log p\left(\mathrm{x};\sigma\right)$ is the score function, which can be parameterized as $\nabla_{\mathrm{x}}\log p\left(\mathrm{x};\sigma\right)\approx\left(D_{\theta}(\mathrm{x};\sigma)-\mathrm{x}\right)/\sigma^{2}$ karras2022elucidating ; blattmann2023stable , where $D_{\theta}(\mathrm{x};\sigma)$ is a neural network trained via denoising score matching.
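To make the role of Eq. (2) concrete, here is a minimal sketch that integrates the probability flow ODE with Euler steps on a one-dimensional toy problem. The assumptions are loud ones: the clean data are scalars drawn from N(0, `data_std`²), so the ideal denoiser is known in closed form and stands in for the learned network; the score uses the EDM-style parameterization (D(x; σ) − x)/σ²; the schedule is σ(t) = t on a geometric grid; and all parameter values are illustrative.

```python
import math
import random


def ideal_denoiser(x: float, sigma: float, data_std: float) -> float:
    """Exact posterior mean E[x0 | x] when clean data are N(0, data_std^2)."""
    return x * data_std**2 / (data_std**2 + sigma**2)


def score(x: float, sigma: float, data_std: float) -> float:
    """Score of the noisy marginal p(x; sigma), written as (D(x; sigma) - x) / sigma^2."""
    return (ideal_denoiser(x, sigma, data_std) - x) / sigma**2


def euler_probability_flow(
    x: float,
    data_std: float,
    sigma_max: float = 80.0,
    sigma_min: float = 1e-3,
    steps: int = 500,
) -> float:
    """Integrate dx = -sigma_dot(t) * sigma(t) * score dt from sigma_max to sigma_min."""
    sigma_dot = 1.0  # because sigma(t) = t in this sketch
    # Geometric grid of noise levels, from sigma_max down to sigma_min.
    sigmas = [sigma_max * (sigma_min / sigma_max) ** (k / steps) for k in range(steps + 1)]
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        dx_dt = -sigma_dot * s_cur * score(x, s_cur, data_std)
        x += dx_dt * (s_next - s_cur)  # Euler step; the time increment is negative
    return x


def sample(rng: random.Random, data_std: float = 2.0, sigma_max: float = 80.0) -> float:
    """Draw x from p(x; sigma_max) and transport it to (approximately) the data distribution."""
    x_init = rng.gauss(0.0, math.sqrt(data_std**2 + sigma_max**2))
    return euler_probability_flow(x_init, data_std, sigma_max=sigma_max)
```

For this Gaussian toy the ODE has the closed-form solution x(t) = x(T) · sqrt((s² + t²)/(s² + T²)) with s = `data_std`, so the Euler result can be compared against it; real samplers typically use higher-order solvers and a learned denoiser.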

We want to extend the above formulation to the sampling of $\mathcal{I}$. The question is, how do we estimate the score function of the joint distribution of $V\times F$ images?

For simplicity, let $\mathcal{I}_{-i,j}\triangleq\{I_{i',j}\mid 1\leq i'\leq V,\ i'\neq i\}$, $\mathcal{I}_{i,-j}\triangleq\{I_{i,j'}\mid 1\leq j'\leq F,\ j'\neq j\}$ and $\mathcal{I}_{-i,-j}\triangleq\{I_{i',j'}\mid 1\leq i'\leq V,\ 1\leq j'\leq F,\ i'\neq i,\ j'\neq j\}$. We first make an assumption on the structure of $p(\mathcal{I})$.
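The three index sets just defined can be enumerated directly, which may help when reading the derivation that follows. The helper below is purely illustrative (the function name and dictionary keys are not from the paper) and uses 1-based indices to match the text.

```python
from typing import Dict, List, Tuple

Index = Tuple[int, int]  # (view i, frame j), 1-based as in the text


def index_sets(V: int, F: int, i: int, j: int) -> Dict[str, List[Index]]:
    """Enumerate the index sets of I_{-i,j}, I_{i,-j} and I_{-i,-j} for a V x F matrix."""
    views = range(1, V + 1)
    frames = range(1, F + 1)
    return {
        # Same frame j, every other view: the rest of column j.
        "I_{-i,j}": [(ip, j) for ip in views if ip != i],
        # Same view i, every other frame: the rest of row i.
        "I_{i,-j}": [(i, jp) for jp in frames if jp != j],
        # Neither the same view nor the same frame.
        "I_{-i,-j}": [(ip, jp) for ip in views for jp in frames if ip != i and jp != j],
    }
```

Together with the element $I_{i,j}$ itself, the three sets partition the whole matrix: their sizes are V − 1, F − 1 and (V − 1)(F − 1).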

Assumption 3.1.

Given any image $I_{i,j}$, the underlying geometry $\mathcal{I}_{-i,j}$ and the dynamics $\mathcal{I}_{i,-j}$ are conditionally independent, i.e.,

\[
p\left(\mathcal{I}_{-i,j},\mathcal{I}_{i,-j}\mid I_{i,j}\right)=p\left(\mathcal{I}_{-i,j}\mid I_{i,j}\right)p\left(\mathcal{I}_{i,-j}\mid I_{i,j}\right).
\tag{3}
\]

Assumption 3.1 implies that, given the front view of a 3D object, its motion as seen from the front does not correlate with its appearance from the back, which aligns with our intuition. A natural corollary is that the mollified distribution derived by adding Gaussian noise to the data distribution still maintains conditional independence:

Corollary 3.1.

Denote $\hat{\mathcal{I}}$ as the noisy version of $\mathcal{I}$, i.e.,

\[
\hat{\mathcal{I}}=\{\hat{I}_{i,j}\in\mathbb{R}^{H\times W\times 3}\}_{i=1,j=1}^{V,F}\quad\text{with}~\hat{I}_{i,j}=\alpha I_{i,j}+\varepsilon_{i,j},
\tag{4}
\]

where $\alpha\in\mathbb{R}$ is a constant and $\varepsilon_{i,j}\in\mathbb{R}^{H\times W\times 3}$ are independent Gaussian noises. Then we have

\[
p\left(\hat{\mathcal{I}}_{-i,j},\hat{\mathcal{I}}_{i,-j}\mid\hat{I}_{i,j}\right)=p\left(\hat{\mathcal{I}}_{-i,j}\mid\hat{I}_{i,j}\right)p\left(\hat{\mathcal{I}}_{i,-j}\mid\hat{I}_{i,j}\right).
\tag{5}
\]

This property allows us to sample the desired image matrix by progressively denoising from pure Gaussian noise, combining two estimated scores of its marginal distributions, which can be obtained from the pretrained video and multi-view diffusion models, respectively. This leads to our main theorem.
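The score identity stated next as Theorem 3.1 can be sanity-checked on a scalar toy model. The sketch below is only an analogue under strong assumptions: there is one target variable `x`, one "other view" `a` and one "other frame" `b`; conditional independence of `a` and `b` given `x` is built in by construction; there is no off-row, off-column block; and all constants are arbitrary. The joint log-density is written through its factorization, while the two pairwise marginals are computed independently from their covariance matrices, and derivatives are taken by central finite differences.

```python
import math
from typing import Callable, Tuple

ALPHA, BETA = 0.8, -0.5   # a = ALPHA * x + noise, b = BETA * x + noise (illustrative)
VAR_A, VAR_B = 0.3, 0.7   # noise variances of a and b (illustrative)


def log_normal(u: float, mean: float, var: float) -> float:
    """Log-density of a univariate Gaussian."""
    return -0.5 * (u - mean) ** 2 / var - 0.5 * math.log(2.0 * math.pi * var)


def log_bivariate_normal(u: float, v: float, var_u: float, var_v: float, cov: float) -> float:
    """Log-density of a zero-mean bivariate Gaussian with the given covariance entries."""
    det = var_u * var_v - cov**2
    quad = (var_v * u**2 - 2.0 * cov * u * v + var_u * v**2) / det
    return -0.5 * quad - 0.5 * math.log((2.0 * math.pi) ** 2 * det)


def log_joint(a: float, x: float, b: float) -> float:
    """log p(a, x, b) with x ~ N(0, 1) and a, b conditionally independent given x."""
    return (
        log_normal(x, 0.0, 1.0)
        + log_normal(a, ALPHA * x, VAR_A)
        + log_normal(b, BETA * x, VAR_B)
    )


def log_column(a: float, x: float) -> float:
    """Marginal log p(a, x), computed from its 2x2 covariance."""
    return log_bivariate_normal(a, x, ALPHA**2 + VAR_A, 1.0, ALPHA)


def log_row(x: float, b: float) -> float:
    """Marginal log p(x, b), computed from its 2x2 covariance."""
    return log_bivariate_normal(x, b, 1.0, BETA**2 + VAR_B, BETA)


def d_dx(f: Callable[[float], float], x: float, eps: float = 1e-5) -> float:
    """Central finite-difference derivative."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


def joint_and_composed_score(a: float, x: float, b: float) -> Tuple[float, float]:
    """Score of the joint w.r.t. x, and the composition column + row - single."""
    joint = d_dx(lambda t: log_joint(a, t, b), x)
    composed = (
        d_dx(lambda t: log_column(a, t), x)
        + d_dx(lambda t: log_row(t, b), x)
        - d_dx(lambda t: log_normal(t, 0.0, 1.0), x)
    )
    return joint, composed
```

Calling `joint_and_composed_score(1.2, -0.4, 0.9)` should return two numerically matching values. This checks only the algebra of the composition under exact conditional independence; it says nothing about how well the assumption holds for real images.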

Theorem 3.1.

For $\mathrm{x}=\hat{I}_{i,j}$, we have

\[
\nabla_{\mathrm{x}}\log p\left(\hat{\mathcal{I}}\right)=\nabla_{\mathrm{x}}\log p\left(\hat{\mathcal{I}}_{\{1:V\},j}\right)+\nabla_{\mathrm{x}}\log p\left(\hat{\mathcal{I}}_{i,\{1:F\}}\right)-\nabla_{\mathrm{x}}\log p\left(\hat{I}_{i,j}\right).
\tag{6}
\]
Proof.

We first decompose $p\left(\hat{\mathcal{I}}\right)$ by

\begin{align}
p\left(\hat{\mathcal{I}}\right) &= p\left(\hat{I}_{i,j},\hat{\mathcal{I}}_{-i,j},\hat{\mathcal{I}}_{i,-j},\hat{\mathcal{I}}_{-i,-j}\right) \tag{7}\\
&= p\left(\hat{I}_{i,j},\hat{\mathcal{I}}_{-i,-j}\mid\hat{\mathcal{I}}_{-i,j},\hat{\mathcal{I}}_{i,-j}\right)p\left(\hat{\mathcal{I}}_{-i,j},\hat{\mathcal{I}}_{i,-j}\right). \tag{8}
\end{align}

Note that for any $\hat{I}_{i',j'} \in \hat{\mathcal{I}}_{-i,-j}$, $I_{i',j'}$ and $I_{i,j}$ are independent conditioned on $I_{i',j}$ by Corollary 3.1, hence

$$
p\left(\hat{I}_{i,j}, \hat{\mathcal{I}}_{-i,-j} \mid \hat{\mathcal{I}}_{-i,j}, \hat{\mathcal{I}}_{i,-j}\right) = p\left(\hat{\mathcal{I}}_{-i,-j} \mid \hat{\mathcal{I}}_{-i,j}, \hat{\mathcal{I}}_{i,-j}\right) p\left(\hat{I}_{i,j} \mid \hat{\mathcal{I}}_{-i,j}, \hat{\mathcal{I}}_{i,-j}\right). \tag{9}
$$

Since $p\left(\hat{\mathcal{I}}_{-i,-j} \mid \hat{\mathcal{I}}_{-i,j}, \hat{\mathcal{I}}_{i,-j}\right)$ does not contain $I_{i,j}$, its derivative with respect to $I_{i,j}$ is zero. Combining equation (8) with equation (9) and taking the derivative of $\log p\left(\hat{\mathcal{I}}\right)$ with respect to $\mathrm{x}$, we obtain

$$
\begin{aligned}
\nabla_{\mathrm{x}} \log p\left(\hat{\mathcal{I}}\right) &= \nabla_{\mathrm{x}} \log \left[ p\left(\hat{I}_{i,j} \mid \hat{\mathcal{I}}_{-i,j}, \hat{\mathcal{I}}_{i,-j}\right) p\left(\hat{\mathcal{I}}_{-i,j}, \hat{\mathcal{I}}_{i,-j}\right) \right] && (10) \\
&= \nabla_{\mathrm{x}} \log p\left(\hat{I}_{i,j}, \hat{\mathcal{I}}_{-i,j}, \hat{\mathcal{I}}_{i,-j}\right). && (11)
\end{aligned}
$$

Finally, by further decomposing $p\left(\hat{I}_{i,j}, \hat{\mathcal{I}}_{-i,j}, \hat{\mathcal{I}}_{i,-j}\right)$ and directly applying Corollary 3.1, we obtain

$$
\begin{aligned}
\nabla_{\mathrm{x}} \log p\left(\hat{\mathcal{I}}\right) &= \nabla_{\mathrm{x}} \log \left[ p\left(\hat{\mathcal{I}}_{-i,j}, \hat{\mathcal{I}}_{i,-j} \mid \hat{I}_{i,j}\right) p\left(\hat{I}_{i,j}\right) \right] && (12) \\
&= \nabla_{\mathrm{x}} \log \left[ p\left(\hat{\mathcal{I}}_{-i,j} \mid \hat{I}_{i,j}\right) p\left(\hat{\mathcal{I}}_{i,-j} \mid \hat{I}_{i,j}\right) p\left(\hat{I}_{i,j}\right) \right] && (13) \\
&= \nabla_{\mathrm{x}} \log \frac{p\left(\hat{\mathcal{I}}_{\{1:V\},j}\right) p\left(\hat{\mathcal{I}}_{i,\{1:F\}}\right)}{p\left(\hat{I}_{i,j}\right)} && (14) \\
&= \nabla_{\mathrm{x}} \log p\left(\hat{\mathcal{I}}_{\{1:V\},j}\right) + \nabla_{\mathrm{x}} \log p\left(\hat{\mathcal{I}}_{i,\{1:F\}}\right) - \nabla_{\mathrm{x}} \log p\left(\hat{I}_{i,j}\right). && (15)
\end{aligned}
$$

Here $\nabla_{\mathrm{x}} \log p\left(\hat{\mathcal{I}}_{i,\{1:F\}}\right)$ and $\nabla_{\mathrm{x}} \log p\left(\hat{\mathcal{I}}_{\{1:V\},j}\right)$ are exactly the score functions of the video diffusion model and the multi-view diffusion model, respectively. We use their convex combination to estimate $\nabla_{\mathrm{x}} \log p\left(\hat{I}_{i,j}\right)$ as:

$$
\nabla_{\mathrm{x}} \log p\left(\hat{I}_{i,j}\right) = (1-s) \nabla_{\mathrm{x}} \log p\left(\hat{\mathcal{I}}_{i,\{1:F\}}\right) + s \nabla_{\mathrm{x}} \log p\left(\hat{\mathcal{I}}_{\{1:V\},j}\right). \tag{16}
$$
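As a worked step that the text leaves implicit, the estimate in equation (16) can be substituted into equation (15). This is plain algebra on the two equations, taking (16) at face value as an equality; no additional modelling assumption is introduced.

```latex
\begin{aligned}
\nabla_{\mathrm{x}} \log p\left(\hat{\mathcal{I}}\right)
&= \nabla_{\mathrm{x}} \log p\left(\hat{\mathcal{I}}_{\{1:V\},j}\right)
 + \nabla_{\mathrm{x}} \log p\left(\hat{\mathcal{I}}_{i,\{1:F\}}\right) \\
&\quad - \left[ (1-s) \nabla_{\mathrm{x}} \log p\left(\hat{\mathcal{I}}_{i,\{1:F\}}\right)
 + s \nabla_{\mathrm{x}} \log p\left(\hat{\mathcal{I}}_{\{1:V\},j}\right) \right] \\
&= (1-s) \nabla_{\mathrm{x}} \log p\left(\hat{\mathcal{I}}_{\{1:V\},j}\right)
 + s \nabla_{\mathrm{x}} \log p\left(\hat{\mathcal{I}}_{i,\{1:F\}}\right).
\end{aligned}
```

Under this substitution the composed score is itself a convex combination of the two model scores, with weight $s$ on the video score and $1-s$ on the multi-view score.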

In practice, we employ a logistic schedule to adjust $s$ over the denoising steps. Given the current denoising step $i$ and the total number of steps $N$, we set $s = 1 - \frac{1}{1 + e^{k(i/N - s_0)}}$. This function has a sigmoidal curve, which is relatively flat at the extremes away from the midpoint $s_0$ and changes sharply near it, with the steepness controlled by $k$. This schedule decouples the generation of consistent geometry and temporally smooth appearance to some extent.
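The schedule can be sketched in a few lines of Python. The function name `logistic_scale` is hypothetical, and the default values 0.5 and 20 are the defaults reported in the ablation study, read here as $s_0$ and $k$. The direction in which the step index $i$ is counted is not fixed by the formula and is left to the caller.

```python
import math


def logistic_scale(step: int, total_steps: int, s0: float = 0.5, k: float = 20.0) -> float:
    """Return s = 1 - 1 / (1 + exp(k * (step / total_steps - s0)))."""
    z = k * (step / total_steps - s0)
    # 1 - 1 / (1 + e^z) equals the logistic sigmoid of z; the branch below
    # evaluates it without overflowing math.exp for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Scale values over a 50-step run (illustrative only).
schedule = [logistic_scale(i, 50) for i in range(51)]
```

With these defaults the scale equals 0.5 at the midpoint $i/N = s_0$, stays below 0.001 at $i = 0$, and exceeds 0.999 at $i = N$; a larger `k` makes the transition around the midpoint sharper.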

Sampling in latent space

For convenience, the above theorem assumes that the sampled object is an image in the original RGB space. However, modern high-resolution diffusion models typically generate images encoded into a latent space by VQVAE van2017neural . The validity of the aforementioned derivation requires that the multi-view generator and the video generator share the same latent space. Although this requirement is not met by most current instantiations of them, we believe that it will be increasingly satisfied by future multi-view generation models. This is because, as pointed out by SVD blattmann2023stable ; chen2024v3d ; videoworldsimulators2024 ; li2024sora , video generation models trained on large-scale video datasets have learned strong stereo knowledge and can thus provide better pretraining for fine-tuning multi-view diffusion models than models trained solely on image data. Moreover, the latent encoder is usually frozen during fine-tuning on multi-view images.

Generation with various input conditions

Note that the formulation described above is based on unconditional generation. In practice, however, we are more interested in controllable generation, so we now extend the aforementioned process to conditional generation. Consider the augmented matrix $\mathcal{I}_{\text{aug}}$ defined by

$$
\mathcal{I}_{\text{aug}} = \begin{bmatrix} I_{0,0} & \mathcal{I}_{0,\{1:V\}} \\ \mathcal{I}_{\{1:F\},0} & \mathcal{I} \end{bmatrix}, \tag{17}
$$

where $I_{0,0}$ is the input image, and $\mathcal{I}_{0,\{1:V\}}$ and $\mathcal{I}_{\{1:F\},0}$ are the first row and column of $\mathcal{I}_{\text{aug}}$, which we need to create first as the condition for the subsequent generation of the full matrix. We next describe how to obtain them from various conditions. For convenience, we denote the multi-view diffusion model by $\mathbf{V}$ and the video diffusion model by $\mathbf{F}$.

  • Single image. Given $I_{0,0}$ as input, we use $\mathbf{V}$ to generate $\mathcal{I}_{0,\{1:V\}}$, which dictates the geometry of the generated 4D assets, and use $\mathbf{F}$ to generate $\mathcal{I}_{\{1:F\},0}$, which endows the static image with dynamics.

  • Single-view video. Given $\mathcal{I}_{\{0:F\},0}$ as input, we use the last frame $I_{0,0}$ as the condition of $\mathbf{V}$ to generate $\mathcal{I}_{0,\{1:V\}}$.

  • Static 3D model. Similarly, given $\mathcal{I}_{0,\{0:V\}}$ as input, we use the front view $I_{0,0}$ as the condition of $\mathbf{F}$ to generate $\mathcal{I}_{\{1:F\},0}$.

Assumption 3.1 ensures that it is safe to generate the geometry $\mathcal{I}_{0,\{1:V\}}$ and the motion $\mathcal{I}_{\{1:F\},0}$ independently. Additionally, there is no computational or data dependency between these two generation processes, so their total time cost can be reduced to that of a single reverse diffusion process. We then denoise the remaining part of $\mathcal{I}_{\text{aug}}$ from pure Gaussian noise. In each step, we run the score estimators for each row and column conditioned on $\mathcal{I}_{0,\{1:V\}}$ and $\mathcal{I}_{\{1:F\},0}$, and combine their results as in Theorem 3.1 to update the noisy latent. Since the score estimation for each row and column can also be parallelized, the time cost of each step can be reduced to that of running a single diffusion step. Therefore, with sufficient GPU memory, the total time spent on the process illustrated in Figure 2(ii) remains the same as that for generating a single video.
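The per-step combination of row and column scores might be sketched as follows. This is a toy illustration rather than the released implementation: `composed_score`, `video_score`, and `multiview_score` are hypothetical names, the two callables stand in for the pretrained score estimators, conditioning on the reference row and column is omitted, and the array layout `x[view, frame]` follows the indexing of equations (14)-(16).

```python
import numpy as np


def composed_score(x, s, video_score, multiview_score):
    """Combine per-view video scores and per-frame multi-view scores.

    x has shape (V, F, ...): x[i, j] is the noisy image of view i at frame j.
    video_score maps an (F, ...) array to an (F, ...) array of scores.
    multiview_score maps a (V, ...) array to a (V, ...) array of scores.
    """
    num_views, num_frames = x.shape[:2]
    # One video-model call per view; each call sees all frames of that view.
    vid = np.stack([video_score(x[i]) for i in range(num_views)], axis=0)
    # One multi-view-model call per frame; each call sees all views of that frame.
    mv = np.stack([multiview_score(x[:, j]) for j in range(num_frames)], axis=1)
    # Equation (16): convex combination as the single-image score estimate.
    single = (1.0 - s) * vid + s * mv
    # Equation (15): sum of both sequence scores minus the single-image score.
    return mv + vid - single


# Toy stand-ins (assumption): a standard-normal prior, whose score is -x.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4, 2))  # 3 views, 4 frames, 2-dimensional "images"
score = composed_score(x, s=0.3, video_score=lambda r: -r, multiview_score=lambda c: -c)
```

The calls inside each list comprehension do not depend on one another, which is what allows them to be batched or run in parallel in a real pipeline.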

3.2 Robust reconstruction

4D representation

Given the generated synchronized multi-view videos, conditioned on any type of prompt, numerous methods can be employed to reconstruct 4D assets. Among these candidates, we adopt 4D Gaussian Splatting due to its superior fitting capabilities and efficient optimization under dense multi-view supervision.

Optimization

Although the images generated in the first stage already have intuitively satisfactory spatiotemporal consistency, the performance limitations of the foundation multi-view generation components still make it difficult to achieve precise pixel-level matching across different views and frames. Therefore, we follow chen2024v3d and optimize a combination of the perceptual loss $\mathcal{L}_{lpips}$ zhang2018unreasonable and D-SSIM $\mathcal{L}_{ssim}$ wang2004image while omitting the L1 loss. In addition, we weight each term with a confidence score, so the total objective is defined as $\mathcal{L}_{total} = \lambda_{lpips} \mathcal{C}_{lpips} \mathcal{L}_{lpips} + \lambda_{ssim} \mathcal{C}_{ssim} \mathcal{L}_{ssim}$, where $\mathcal{C}_{ssim}$ is simply the SSIM between the ground-truth and rendered images and $\mathcal{C}_{lpips}$ is defined as $1 - \mathcal{L}_{lpips}$.
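The arithmetic of this objective can be illustrated with a short sketch that takes precomputed LPIPS and SSIM values as inputs. The function name and the unit default weights are hypothetical, and D-SSIM is assumed here to be 1 - SSIM, which the text does not state. Whether gradients flow through the confidence terms is also not specified and is outside this sketch.

```python
def confidence_weighted_loss(lpips_value: float, ssim_value: float,
                             lambda_lpips: float = 1.0, lambda_ssim: float = 1.0) -> float:
    """Evaluate L_total = lambda_lpips * C_lpips * L_lpips + lambda_ssim * C_ssim * L_ssim."""
    loss_lpips = lpips_value
    loss_ssim = 1.0 - ssim_value   # assumption: D-SSIM taken as 1 - SSIM
    conf_lpips = 1.0 - loss_lpips  # C_lpips = 1 - L_lpips
    conf_ssim = ssim_value         # C_ssim = SSIM(ground truth, render)
    return lambda_lpips * conf_lpips * loss_lpips + lambda_ssim * conf_ssim * loss_ssim


# Example: LPIPS 0.2 and SSIM 0.9 give 0.8 * 0.2 + 0.9 * 0.1.
example = confidence_weighted_loss(0.2, 0.9)
```

Because each term is multiplied by its own confidence, image pairs that agree poorly (high LPIPS or low SSIM) contribute less than they would under an unweighted sum.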

4 Experiments

4.1 Implementation details

In the first stage, we use Stable Video Diffusion blattmann2023stable as our foundation video diffusion model, predicting 25 frames each time according to the image prompt. SV3D$^p$ voleti2024sv3d is chosen as the foundation multi-view diffusion model. For simplicity, we only generate orbital videos that have 21 uniformly spaced azimuths and a fixed elevation, and we manually filter out side views, as these views typically contain thin structures that pose challenges for the video generation model and the subsequent reconstruction process. By default, we set the number of sampling steps to 50 for both generative models. In the reconstruction stage, we optimize 4D Gaussian Splatting for 5k iterations without bells and whistles. The image size is set to 576 $\times$ 576 in both stages.

Figure 3: Qualitative comparisons on (a) image-to-4D generation and (b) video-to-4D generation.

4.2 4D generation from single image

Table 1: User study on image-to-4D generation. The proportions of different methods that best match user preferences under three criteria are reported.
| Method | Details | Geometry | Temporal smoothness | Overall model quality | Generation time |
| --- | --- | --- | --- | --- | --- |
| Animate124 | 11.3% | 31.0% | 16.0% | 18.0% | 9h |
| DreamGaussian4D | 27.7% | 24.3% | 48.0% | 25.7% | 12m |
| Ours | 61.0% | 44.7% | 36.0% | 56.3% | 10m |

In Figure 3 (a), we show the results generated by the proposed method and compare them with other alternatives. It can be observed that our concise pipeline is capable of generating 4D assets of comparable quality to those produced by state-of-the-art SDS-based methods with sophisticated multi-stage optimization. Furthermore, the parallel evaluation of a large 2D generative model can provide a potential efficiency advantage for our method. We also conducted a user study (see the appendix for details), the results of which are reported in Table 1. They suggest that our method garnered the highest human preference in multi-view consistency, detail, and overall model quality.

4.3 4D generation from single view video

Table 2: Quantitative comparisons on video-to-4D generation.
| Method | Type | CLIP similarity ↑ | Generation time ↓ |
| --- | --- | --- | --- |
| Consistent4D | Optimization-based | 0.87 | 2h |
| 4DGen | Optimization-based | 0.89 | 2h10m |
| Efficient4D | Photogrammetry-based | 0.92 | 14m |
| Ours | Photogrammetry-based | 0.94 | 10m |

Generating 4D dynamic objects from fixed-view video is a practical task first introduced in jiang2023consistent4d . Compared to generation from a single image, this task additionally constrains the object's motion. Our proposed framework can be easily adapted to this task, as detailed in Section 3.1. We perform both quantitative and qualitative comparisons with other counterparts under this setting. The qualitative results are shown in Figure 3 (b). It can be seen that our method slightly reduces the over-saturated appearance. For quantitative evaluation, we report the CLIP similarity between generated views and ground-truth images to indicate the overall semantic consistency and recognizability of the generated images. As shown in Table 2, our method attains the highest CLIP similarity and the shortest generation time among the compared methods.
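For readers unfamiliar with the metric, the sketch below shows one common way such a score is computed, assuming that CLIP similarity means the mean cosine similarity between paired embeddings of generated and ground-truth views. The exact protocol used for Table 2 is not given here, the function name is hypothetical, and the embeddings are taken as precomputed arrays; no CLIP model is loaded.

```python
import numpy as np


def clip_similarity(generated_emb, reference_emb) -> float:
    """Mean cosine similarity between paired rows of two (N, D) embedding arrays."""
    gen = np.asarray(generated_emb, dtype=np.float64)
    ref = np.asarray(reference_emb, dtype=np.float64)
    # Normalize each embedding to unit length so the dot product is the cosine.
    gen = gen / np.linalg.norm(gen, axis=-1, keepdims=True)
    ref = ref / np.linalg.norm(ref, axis=-1, keepdims=True)
    return float(np.mean(np.sum(gen * ref, axis=-1)))


# Toy embeddings (illustrative only): parallel pairs score 1, orthogonal pairs 0.
same_direction = clip_similarity([[1.0, 0.0], [0.0, 2.0]], [[3.0, 0.0], [0.0, 1.0]])
orthogonal = clip_similarity([[1.0, 0.0]], [[0.0, 1.0]])
```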

4.4 4D generation from static 3D content

Figure 4: Synthesized images from static 3D models.

Naturally, Diffusion2 can also animate static 3D models into dynamic objects as described in Section 3.1, which has substantial practical implications. From Figure 4 we can observe that our method is capable of endowing 3D models with diverse and realistic dynamics while maintaining satisfactory temporal and geometrical consistency.

4.5 Ablation studies

Figure 5: Ablation studies. (a) The parameters controlling the logistic schedule. (b) Different types of scale schedules. Best viewed with zoom-in.

Since previous methods have explored dynamic content reconstruction well, we mainly focus on ablating the key design choices in the image matrix generation stage. In Figure 5 (a), we adjust the two parameters that control the curve of the logistic schedule. The results reveal that when only the video prior is used, the generated images fail to ensure consistency in perspective, geometry, and detail with other views. For example, in the second row and second column of Figure 5 (a), the dress still sags evenly down both sides of the body, unlike in the reference view, where the dress flutters to the left side of the body. This indicates that without the guidance of a geometry prior, the dynamics of each view will be totally independent of each other. The other extreme is setting $s_0$ to 1, that is, using only the multi-view prior, which can no longer guarantee the consistency of details between different frames of invisible views. Comparing the last column of the first two rows of Figure 5 (a) shows a remarkable change in the shape of the tail of the hair. We therefore adopt a compromise and set $s_0$ to 0.5, i.e., we start to drastically reduce the weight of the multi-view score halfway through the denoising process. In most cases, this choice achieves smooth temporal transitions as well as geometrical consistency. In addition, the speed at which $s$ decreases, which is controlled by another parameter $k$, may also affect the quality of the generated images. Therefore, we test image generation under different $k$ given $s_0 = 0.5$. It can be seen that when $k$ is relatively small, the generated images may exhibit ghosting effects. On the other hand, when $k$ is too large, it may weaken the temporal smoothing effect brought by the video diffusion model in the early stages of denoising. Consequently, we adopt $s_0 = 0.5, k = 20$ as the default setting. Finally, we also examine the impact of different scale schedules, as shown in Figure 5 (b). It can be observed that all other choices lead to ghosting artifacts.

5 Conclusion

In this work, we present a novel 4D content generation framework, dubbed Diffusion2, which efficiently generates dense, consistent multi-view multi-frame image arrays with high parallelism and then feeds them into a 4D reconstruction pipeline to create a full 4D representation. Our key assumption is that elements $\mathcal{I}_{i,j}$ and $\mathcal{I}_{k,l}$ ($i \neq k, j \neq l$) in the multi-view multi-frame image array $\mathcal{I}$ are conditionally independent given $\mathcal{I}_{i,l}$ or $\mathcal{I}_{k,j}$. This aligns with our intuition: past or future motion and the appearance in other views are decoupled to a large extent. Based on this assumption, we prove that we can directly sample synchronized multi-view videos $\mathcal{I}$ in a denoising process by combining pretrained video diffusion models and multi-view diffusion models. Experimental results show that the proposed framework can flexibly adapt to various types of prompts. We hope that our work can inspire future research on unleashing and combining the geometrical and dynamic priors from foundation 3D and video diffusion models.

References

  • (1) Sherwin Bahmani, Ivan Skorokhodov, Victor Rong, Gordon Wetzstein, Leonidas Guibas, Peter Wonka, Sergey Tulyakov, Jeong Joon Park, Andrea Tagliasacchi, and David B Lindell. 4d-fy: Text-to-4d generation using hybrid score distillation sampling. In CVPR, 2024.
  • (2) Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint, 2023.
  • (3) Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In CVPR, 2023.
  • (4) Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. https://openai.com/research/video-generation-models-as-world-simulators, 2024.
  • (5) Ang Cao and Justin Johnson. Hexplane: A fast representation for dynamic scenes. In CVPR, 2023.
  • (6) Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. Tensorf: Tensorial radiance fields. In ECCV, 2022.
  • (7) Zilong Chen, Yikai Wang, Feng Wang, Zhengyi Wang, and Huaping Liu. V3d: Video diffusion models are effective 3d generators. arXiv preprint, 2024.
  • (8) Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In CVPR, 2023.
  • (9) Jun Gao, Tianchang Shen, Zian Wang, Wenzheng Chen, Kangxue Yin, Daiqing Li, Or Litany, Zan Gojcic, and Sanja Fidler. Get3d: A generative model of high quality 3d textured shapes learned from images. In NeurIPS, 2022.
  • (10) Quankai Gao, Qiangeng Xu, Zhe Cao, Ben Mildenhall, Wenchao Ma, Le Chen, Danhang Tang, and Ulrich Neumann. Gaussianflow: Splatting gaussian dynamics for 4d content creation. arXiv preprint, 2024.
  • (11) Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, and José Lezama. Photorealistic video generation with diffusion models. arXiv preprint, 2023.
  • (12) Junlin Han, Filippos Kokkinos, and Philip Torr. Vfusion3d: Learning scalable 3d generative models from video diffusion models. arXiv preprint, 2024.
  • (13) Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020.
  • (14) Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. Lrm: Large reconstruction model for single image to 3d. In ICLR, 2024.
  • (15) Yanqin Jiang, Li Zhang, Jin Gao, Weimin Hu, and Yao Yao. Consistent4d: Consistent 360° dynamic object generation from monocular video. In ICLR, 2024.
  • (16) Heewoo Jun and Alex Nichol. Shap-e: Generating conditional 3d implicit functions. arXiv preprint, 2023.
  • (17) Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In NeurIPS, 2022.
  • (18) Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. In ACM TOG, 2023.
  • (19) Xuanyi Li, Daquan Zhou, Chenxu Zhang, Shaodong Wei, Qibin Hou, and Ming-Ming Cheng. Sora generates videos with stunning geometrical consistency. arXiv preprint, 2024.
  • (20) Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object. In ICCV, 2023.
  • (21) Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, and Wenping Wang. Syncdreamer: Generating multiview-consistent images from a single-view image. In ICLR, 2024.
  • (22) Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, et al. Wonder3d: Single image to 3d using cross-domain diffusion. arXiv preprint, 2023.
  • (23) Haoyu Lu, Guoxing Yang, Nanyi Fei, Yuqi Huo, Zhiwu Lu, Ping Luo, and Mingyu Ding. Vdt: General-purpose video diffusion transformers via mask modeling. In ICLR, 2024.
  • (24) Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. Point-e: A system for generating 3d point clouds from complex prompts. arXiv preprint, 2022.
  • (25) Zijie Pan, Jiachen Lu, Xiatian Zhu, and Li Zhang. Enhancing high-resolution 3d generation through pixel-wise gradient clipping. In ICLR, 2024.
  • (26) Zijie Pan, Zeyu Yang, Xiatian Zhu, and Li Zhang. Fast dynamic 3d object generation from a single-view video. arXiv preprint, 2024.
  • (27) Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. In ICLR, 2023.
  • (28) Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. In ICCV, 2021.
  • (29) Jiawei Ren, Liang Pan, Jiaxiang Tang, Chi Zhang, Ang Cao, Gang Zeng, and Ziwei Liu. Dreamgaussian4d: Generative 4d gaussian splatting. arXiv preprint, 2023.
  • (30) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
  • (31) Katja Schwarz, Yiyi Liao, Michael Niemeyer, and Andreas Geiger. Graf: Generative radiance fields for 3d-aware image synthesis. In NeurIPS, 2020.
  • (32) Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. Mvdream: Multi-view diffusion for 3d generation. arXiv preprint, 2023.
  • (33) Uriel Singer, Shelly Sheynin, Adam Polyak, Oron Ashual, Iurii Makarov, Filippos Kokkinos, Naman Goyal, Andrea Vedaldi, Devi Parikh, Justin Johnson, et al. Text-to-4d dynamic scene generation. arXiv preprint, 2023.
  • (34) Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In ICLR, 2021.
  • (35) Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In ICLR, 2021.
  • (36) Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang, Gang Zeng, and Ziwei Liu. Lgm: Large multi-view gaussian model for high-resolution 3d content creation. arXiv preprint, 2024.
  • (37) Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. In ICLR, 2024.
  • (38) Shitao Tang, Jiacheng Chen, Dilin Wang, Chengzhou Tang, Fuyang Zhang, Yuchen Fan, Vikas Chandra, Yasutaka Furukawa, and Rakesh Ranjan. Mvdiffusion++: A dense high-resolution multi-view diffusion model for single or sparse-view 3d object reconstruction. arXiv preprint, 2024.
  • (39) Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In NeurIPS, 2017.
  • (40) Vikram Voleti, Chun-Han Yao, Mark Boss, Adam Letts, David Pankratz, Dmitry Tochilkin, Christian Laforte, Robin Rombach, and Varun Jampani. Sv3d: Novel multi-view synthesis and 3d generation from a single image using latent video diffusion. arXiv preprint, 2024.
  • (41) Peng Wang and Yichun Shi. Imagedream: Image-prompt multi-view diffusion for 3d generation. arXiv preprint, 2023.
  • (42) Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. In NeurIPS, 2023.
  • (43) Zhengyi Wang, Yikai Wang, Yifei Chen, Chendong Xiang, Shuo Chen, Dajiang Yu, Chongxuan Li, Hang Su, and Jun Zhu. Crm: Single image to 3d textured mesh with convolutional reconstruction model. arXiv preprint, 2024.
  • (44) Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. In IEEE TIP, 2004.
  • (45) Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4d gaussian splatting for real-time dynamic scene rendering. In CVPR, 2024.
  • (46) Tong Wu, Jiarui Zhang, Xiao Fu, Yuxin Wang, Liang Pan, Jiawei Ren, Wayne Wu, Lei Yang, Jiaqi Wang, Chen Qian, Dahua Lin, and Ziwei Liu. Omniobject3d: Large-vocabulary 3d object dataset for realistic perception, reconstruction and generation. In CVPR, 2023.
  • (47) Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, and Xiaogang Jin. Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction. In CVPR, 2024.
  • (48) Taoran Yi, Jiemin Fang, Guanjun Wu, Lingxi Xie, Xiaopeng Zhang, Wenyu Liu, Qi Tian, and Xinggang Wang. Gaussiandreamer: Fast generation from text to 3d gaussian splatting with point cloud priors. In CVPR, 2024.
  • (49) Yuyang Yin, Dejia Xu, Zhangyang Wang, Yao Zhao, and Yunchao Wei. 4dgen: Grounded 4d content generation with spatial-temporal consistency. arXiv preprint, 2023.
  • (50) Xianggang Yu, Mutian Xu, Yidan Zhang, Haolin Liu, Chongjie Ye, Yushuang Wu, Zizheng Yan, Chenming Zhu, Zhangyang Xiong, Tianyou Liang, et al. Mvimgnet: A large-scale dataset of multi-view images. In CVPR, 2023.
  • (51) Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In ICCV, 2023.
  • (52) Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
  • (53) Yuyang Zhao, Zhiwen Yan, Enze Xie, Lanqing Hong, Zhenguo Li, and Gim Hee Lee. Animate124: Animating one image to 4d dynamic scene. arXiv preprint, 2023.
  • (54) Qi Zuo, Xiaodong Gu, Lingteng Qiu, Yuan Dong, Zhengyi Zhao, Weihao Yuan, Rui Peng, Siyu Zhu, Zilong Dong, Liefeng Bo, et al. Videomv: Consistent multi-view generation based on large video generative model. arXiv preprint, 2024.