X-Ray: A Sequential 3D Representation
for Generation

Tao Hu 1   Wenhang Ge 2   Yuyang Zhao 1*   Gim Hee Lee 1

1 National University of Singapore
2 Hong Kong University of Science and Technology (Guangzhou)

taohu@nus.edu.sg
* These authors contributed equally to this work.

Abstract

In this paper, we introduce X-Ray, an approach to 3D generation built on a new sequential representation. It draws inspiration from the penetrating capability of X-ray scans to capture both the external and internal features of objects. Central to our method is ray casting from the camera’s viewpoint, which records the geometric and textural details of every surface that each ray intersects. This process condenses complete objects or scenes into a multi-frame format, much like a video, so that the 3D representation is composed solely of critical surface information. To highlight the practicality and adaptability of the X-Ray representation, we showcase its utility in synthesizing 3D objects with a network architecture akin to that used in video diffusion models. The results show that our representation improves both the accuracy and efficiency of 3D synthesis, opening new directions for ongoing research and practical applications in the field. The project page can be found at https://tau-yihouxiang.github.io/projects/X-Ray/X-Ray.html.

Figure 1: Comparison between the proposed X-Ray and rendering-based 3D synthesis [35]. The latter usually focuses on the visible surfaces within the camera view, while ours senses both visible and invisible surfaces and can thus generate 3D objects with both outside and inside shape and appearance.

1 Introduction

In 3D synthesis research, efficiency, high resolution, and generalization are three critical objectives for a representation. Their significance stems from the ever-expanding array of applications reliant on 3D technology, ranging from virtual reality and augmented reality to computer-aided design and beyond. Previous approaches to 3D representation, such as meshes, point clouds, voxels, Neural Radiance Fields (NeRF) [25, 33, 39, 20, 12], and 3D Gaussian Splatting [17], each possess unique strengths but also face common challenges in balancing lightweight design and generalization. Meshes, while widely used, can become exceedingly complex when detailing high-resolution objects, and they are constrained by their topology and limited generalization. Point clouds offer a more flexible capture of object geometry but lack connectivity information and consistent feature extraction [9, 30]. Voxels simplify spatial reasoning at the cost of immense storage needs as resolution increases. NeRF and 3D Gaussian Splatting present an impressive leap in rendering photorealistic scenes through deep learning; however, they are designed to fit a single scene from multi-view images, which makes them a poor match for generative models.

The limitations of existing representations for 3D synthesis lie in their struggle to encapsulate the full spectrum of 3D object attributes in a form that is both comprehensive and computationally manageable. A pivotal aspect often overlooked is the primary importance of surface information in the perception of and interaction with 3D environments. Surfaces define the visible boundaries of objects, and capturing their properties accurately is essential for realistic rendering. As shown in Fig. 1, depth is an efficient representation for describing the visible surfaces within the camera’s field of view, yet the question remains how to observe the invisible surfaces. This insight guides our proposal of the X-Ray representation, a novel approach inspired by the medical imaging technique, which penetrates the object and efficiently captures and stores all surface information.

Our X-Ray records the geometric (depth and normal) and textural attributes (color) along all the intersected surfaces through ray casting. This method compresses 3D data into a multi-layered surface representation, organized into slim grid voxels, which significantly reduces the data footprint while preserving essential detail. Moreover, the X-Ray data structure’s compatibility with video formats opens up novel pathways for leveraging video diffusion models in 3D synthesis. By treating X-Ray representations as sequences of frames, we harness the power of video diffusion models to generate 3D objects from images and text. This approach not only yields high-quality results but also benefits from the advanced capabilities and efficiency of video processing techniques.

We demonstrate the advantages of the X-Ray representation through comprehensive experiments, showcasing its superiority in lightweight design, efficiency, and general applicability to various synthesis tasks. Our findings reveal that X-Ray not only achieves a remarkable balance between detail and manageability but also facilitates a significant leap forward in the speed and quality of 3D synthesis, positioning it as a groundbreaking solution to the longstanding challenges in the field.

The main contributions of the paper can be summarized as follows:

  1. We introduce X-Ray, a novel 3D representation that captures surface details through ray casting, significantly reducing data volume while maintaining detail.

  2. We demonstrate X-Ray’s compatibility with video diffusion models, enabling efficient synthesis of 3D objects from images and text.

  3. We showcase X-Ray’s superior performance in terms of speed and quality in 3D synthesis, setting a new benchmark for lightweight and generalizable 3D modeling.

Figure 2: Samples of our proposed X-Ray representation. The first row displays the raw 3D objects to be converted. The second row illustrates the images rendered from random camera perspectives. Subsequent rows show the hit $\mathbf{H}$, depth $\mathbf{D}$, normal $\mathbf{N}$, and color $\mathbf{C}$ maps of X-Ray from the $1^{st}$ to the $8^{th}$ layer. Note that the number of layers in an X-Ray is not fixed; it varies with the complexity of the 3D object.

2 Related Work

2.1 Representation for 3D Models

In the digital world, handling 3D data (which includes depth) is much more complex and resource-intensive than dealing with 1D data (like text and voice) and 2D data (like images). This complexity makes it crucial to find effective ways to organize, process, and display 3D information. Traditional methods for representing 3D data include meshes, which are good for creating detailed visuals but hard to generalize; point clouds, which are simple and useful for capturing real-world scenes but lack the consistent and dense structure needed for 3D creation; 3D Gaussian Splatting [17], which smooths point cloud data into continuous surfaces but requires an additional initial point cloud as the shape, making it less flexible for 3D synthesis; and voxels, which are excellent for detailed volumetric data but require substantial computing resources. Multi-Plane Images [24, 36] try to extend the depth concept to multiple layers at fixed distances, but they can only describe the visible surfaces facing the camera.

Recent advancements in 3D representation have primarily concentrated on point-level details and implicit functions, encompassing techniques such as Occupancy, Signed Distance Fields (SDF) [37, 8], Triplanes [7, 13], and Neural Radiance Fields (NeRF) [25]. These methods have significantly advanced the capabilities of modeling and rendering. Occupancy models map the location of any 3D point to its probability of being inside or outside an object, offering a probabilistic approach to shape definition. SDFs [37, 15] refine this concept by quantifying the distance from any given point to the nearest surface, enhancing the precision of surface representations. Triplanes [4] provide a more efficient route to 3D representation by employing intersecting 2D planes, albeit at the expense of some detail. NeRFs [25] produce remarkably realistic renderings from a limited number of viewpoints, although they require extensive computational effort. Despite these advancements, implicit function-based models often face challenges in feature extraction and generalization, leading to a fallback on voxel-based representations at lower resolutions, which impedes high-resolution generalization. To summarize, representations that focus on surfaces are more efficient, and representations organized on a grid are easier to generalize. Therefore, capturing all surface attributes and organizing them in a dense but lightweight data structure makes our X-Ray a more efficient and generalizable representation.

2.2 Generative Models for 3D Generation

State-of-the-art 3D generative models can be divided mainly into two types: diffusion-based [22, 26, 27] and rendering-based [11, 10, 35].

Diffusion-based generative models have emerged as powerful tools for 3D generation, leveraging stochastic diffusion processes to gradually transition from noise to structured 3D shapes. Models such as Point-Diffusion [22], DiT-3D [26], and Point-E [27] have demonstrated a remarkable ability to generate high-quality 3D point clouds. They operate by iteratively refining a random noise distribution into a coherent structure that resembles the target 3D shape, capturing complex geometries and surface details with high fidelity. The strength of these models lies in their capacity to model the distribution of 3D points in a continuous space, allowing for the generation of 3D objects with nuanced variations and detailed textures. However, only sparse point clouds can be generated because of the limitations of point-based networks [21, 26, 14]. Rendering-based generative models, on the other hand, focus on the visualization aspect of 3D generation, transforming abstract 3D representations into detailed and photorealistic images or videos. Models such as LRM [11], Open-LRM [10], LRM [34], DMV3D [38], and TripoSR [35] employ advanced rendering techniques to achieve this. These models integrate traditional 3D modeling methods with neural rendering to create visuals that are not only realistic but also adjustable in real time for various lighting conditions, perspectives, and scenes. However, rendering-based models are optimized only for the visible surfaces of an object, making it hard to synthesize invisible or interior surfaces. In response to these challenges, our approach uses a video diffusion model as the foundation for generating 3D X-Rays, tailored to detect all surfaces. This strategy benefits from the strengths of existing video diffusion models while addressing the limitation of rendering-based techniques, offering a more comprehensive solution for 3D generation that is sensitive to both the inside and outside parts of objects.

3 X-Ray Representation

This section details the encoding and decoding processes that convert between traditional 3D mesh formats and our X-Ray representation. The encoding process converts a 3D mesh into our proposed sequential representation, and the decoding process converts the X-Ray back into a 3D mesh.

3.1 Encoding

Given an image from any camera view, we apply the ray casting algorithm to encode a 3D object mesh into the proposed X-Ray representation. This algorithm plays a crucial role in both computer graphics and computational geometry, where it is used for scene rendering, visibility determination, and geometric queries. It emits a ray from the camera center into the environment and assesses how this ray interacts with the target 3D objects. For each ray $r$ that intersects $L$ sequential layers of faces in the mesh, we record the 3D attributes of those faces: depth (distance to the camera center) $\mathbf{D}_r = (\mathbf{d}_1, \mathbf{d}_2, ..., \mathbf{d}_L)$, normal $\mathbf{N}_r = (\mathbf{n}_1, \mathbf{n}_2, ..., \mathbf{n}_L)$, and color $\mathbf{C}_r = (\mathbf{c}_1, \mathbf{c}_2, ..., \mathbf{c}_L)$. In addition, we use Hit $\mathbf{H}_r = (\mathbf{h}_1, \mathbf{h}_2, ..., \mathbf{h}_L) \in \{0, 1\}$ to indicate whether there is a surface at each layer. We denote $\mathbf{X} \in \mathbf{R}^{H \times W \times L \times 8}$ as the final representation; the ray $\mathbf{X}_{ij}$ at coordinate $[i, j]$ can then be represented by:

$$\mathbf{X}_{ij} = \mathbf{X}_{r} = \{\mathbf{H}_{r}, \mathbf{D}_{r}, \mathbf{N}_{r}, \mathbf{C}_{r}\}. \tag{1}$$

Note that $X[i, j, k] = 0$ when there is no surface at the $k^{th}$ layer at coordinate $[i, j]$. Encoded X-Ray samples are illustrated in Fig. 2. Through the encoding process, we can transform any mesh into a sequential representation of variable length, like a sequence of video frames, which facilitates subsequent synthesis.
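
The per-ray encoding can be illustrated with a minimal sketch. It is not the authors' implementation: analytic spheres stand in for mesh faces so the example stays self-contained, and the function names `ray_sphere_hits` and `encode_ray` as well as the channel order `[hit, depth, normal, color]` are assumptions made for illustration.

```python
import math

def ray_sphere_hits(origin, direction, center, radius):
    """Return all distances t > 0 where origin + t * direction meets the sphere."""
    oc = [o - c for o, c in zip(origin, center)]
    b = 2.0 * sum(d * v for d, v in zip(direction, oc))
    c = sum(v * v for v in oc) - radius * radius
    disc = b * b - 4.0 * c  # direction is assumed to be unit length, so a = 1
    if disc < 0.0:
        return []
    root = math.sqrt(disc)
    return [t for t in ((-b - root) / 2.0, (-b + root) / 2.0) if t > 1e-9]

def encode_ray(origin, direction, spheres, num_layers):
    """Encode one ray into num_layers entries of [hit, depth, nx, ny, nz, r, g, b]."""
    surfaces = []
    for center, radius, color in spheres:
        for t in ray_sphere_hits(origin, direction, center, radius):
            point = [o + t * d for o, d in zip(origin, direction)]
            normal = [(p - c) / radius for p, c in zip(point, center)]
            surfaces.append((t, normal, color))
    surfaces.sort(key=lambda s: s[0])  # front-to-back order along the ray
    layers = []
    for k in range(num_layers):
        if k < len(surfaces):
            t, normal, color = surfaces[k]
            layers.append([1.0, t, *normal, *color])
        else:
            layers.append([0.0] * 8)  # X[i, j, k] = 0 where no surface exists
    return layers
```

Casting a ray through a hollow ball (two concentric spheres) yields four hit layers, two entering and two leaving surfaces, followed by all-zero layers. Repeating this for every pixel $[i, j]$ would fill the full $H \times W \times L \times 8$ array.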

3.2 Decoding

The decoding process converts the X-Ray back into a 3D mesh. To achieve this, we first convert the video-like X-Ray into a point cloud and then use the traditional Screened Poisson algorithm [16] to transform the point cloud into a 3D mesh.

X-Ray $\rightarrow$ Point Cloud. The transformation from X-Ray to point cloud is straightforward. The output point cloud $\mathbf{P}_{\mathbf{r}} = \{\mathbf{P}_{\mathbf{x}}, \mathbf{P}_{\mathbf{n}}, \mathbf{P}_{\mathbf{c}}\}$ for ray $r$ includes location $\mathbf{P}_{\mathbf{x}}$, color $\mathbf{P}_{\mathbf{c}}$, and normal $\mathbf{P}_{\mathbf{n}}$, and is defined by the equation

$$\mathbf{P}_{\mathbf{x}} = \mathbf{r}_{o} + \mathbf{D} \cdot \mathbf{r}_{d}, \quad \mathbf{P}_{\mathbf{n}} = \mathbf{N}, \quad \mathbf{P}_{\mathbf{c}} = \mathbf{C}, \quad \text{when} \quad \mathbf{H} = 1. \tag{2}$$

Here, $\mathbf{r}_{o}$ and $\mathbf{r}_{d}$ denote the origin and direction of the camera rays, respectively. After processing all camera rays, we obtain a complete point cloud that includes the location, normal, and color attributes of the 3D object.
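
Eq. (2) translates almost directly into code. The following sketch handles a single ray; the function name `xray_to_points`, the 8-value entry layout `[hit, depth, nx, ny, nz, r, g, b]`, and the 0.5 hit threshold are assumptions for illustration and are not specified in the paper.

```python
def xray_to_points(ray_origin, ray_direction, layers):
    """Apply Eq. (2) to one ray: keep entries with hit = 1 and lift them to 3D."""
    points = []
    for hit, depth, nx, ny, nz, r, g, b in layers:
        if hit < 0.5:  # threshold is an assumption for real-valued network outputs
            continue
        # P_x = r_o + D * r_d
        position = tuple(o + depth * d for o, d in zip(ray_origin, ray_direction))
        points.append({"x": position, "n": (nx, ny, nz), "c": (r, g, b)})
    return points
```

Running this over all $H \times W$ rays and concatenating the results would give the oriented, colored point cloud that the Poisson step consumes.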

Point Cloud $\rightarrow$ Mesh. The Screened Poisson algorithm [16] for converting point clouds into 3D meshes leverages the mathematical principles of the Poisson equation. The core idea is to solve a variation of the Poisson equation to interpolate a smooth surface that fits the input point cloud. The Poisson equation is a partial differential equation of the form $\nabla^{2}\phi = f$, where $\nabla^{2}$ denotes the Laplace operator (the divergence of the gradient of a function), $\phi$ is the potential field to be solved, and $f$ is a scalar function representing the divergence of the vector field derived from the input point cloud. In the context of point cloud to 3D mesh conversion, the algorithm first uses the given normals to define a vector field that suggests the orientation of the surface at each point. The divergence of this vector field serves as the function $f$ in the Poisson equation.
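
The step from normals to the Poisson equation can be spelled out as a short derivation. This is a sketch of the standard (unscreened) formulation; the symbol $\vec{V}$ for the vector field built from the normals is introduced here for illustration. The field $\phi$ is chosen so that its gradient matches $\vec{V}$ in the least-squares sense, and the Euler-Lagrange equation of that objective is exactly the Poisson equation above. The screened variant [16] additionally penalizes deviation of the solution at the input points.

```latex
\begin{aligned}
\phi^{*} &= \arg\min_{\phi} \int \left\| \nabla \phi(p) - \vec{V}(p) \right\|^{2} \, dp \\
&\Longrightarrow \quad \nabla^{2} \phi = \nabla \cdot \vec{V} = f
\end{aligned}
```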

Encoding-Decoding Reconstruction Error. Following the generation of X-Ray from text or images, we reconstruct the 3D object through the decoding process. This phase introduces a reconstruction error that varies with the number of layers $L$ and the frame resolution $(H, W)$. To address this, we conduct an experiment in Sec. 5.2 that analyzes these variables to identify optimal values. Our goal is to achieve a balance where all pertinent information is preserved while maintaining a lightweight model.

Figure 3: Overview of our proposed generative model for the X-Ray 3D representation. First, a diffusion model at the core of our approach transforms random Gaussian noise into a low-resolution X-Ray representation, conditioned on either an input image or an image generated by an image diffusion model. Next, a 3D spatial-temporal upsampler enhances these low-resolution X-Rays to high resolution. Finally, the high-resolution X-Rays are decoded into 3D meshes through point cloud transformation and the Screened Poisson algorithm.

4 X-Ray for 3D Generation

The primary objective of introducing a new 3D representation is to facilitate the generation of 3D structures from textual or visual inputs. The challenge lies in accurately predicting the surfaces hidden behind the first visible surface when only a single image is available. To overcome this, we use a diffusion model for X-Ray synthesis. Because the proposed X-Ray is organized in a video-like format, we adopt advanced video diffusion models as the backbone for high-resolution X-Ray synthesis. Notable models in this domain include Stable Video Diffusion (SVD) [3], VideoFusion [23], and the state-of-the-art Sora. To train efficiently, we first train a low-resolution X-Ray diffusion model that synthesizes X-Rays from either text or images. Subsequently, we employ an upsampler to enhance these synthesized X-Rays to high resolution. This two-step approach makes training more manageable and efficient while gradually improving the quality of the output.

4.1 X-Ray Diffusion Model

Diffusion models [31] are generative models that transform a random noise distribution into a data distribution through a reverse process, counteracting a forward process that incrementally adds Gaussian noise to the data. The forward process is a Markov chain described by $x_{t} = \sqrt{\alpha_{t}}\, x_{t-1} + \sqrt{1 - \alpha_{t}}\, \epsilon$, where $x_{t}$ represents the data at step $t$, $\alpha_{t}$ controls the noise level, and $\epsilon \sim \mathcal{N}(0, I)$ is sampled noise. The reverse process, aimed at reconstructing the original data from noise, is modeled by a neural network that predicts the noise added at each step or directly denoises the data, following $x_{t-1} = \frac{1}{\sqrt{\alpha_{t}}}\left(x_{t} - \frac{1 - \alpha_{t}}{\sqrt{1 - \alpha_{t}^{2}}}\, \epsilon_{\theta}(x_{t}, t)\right)$, with $\epsilon_{\theta}(x_{t}, t)$ being the predicted noise. Training optimizes the network to minimize the difference between the sampled noise and the predicted noise, effectively learning to invert the noise addition process, as in the following equation:

$$L_{dm} = \mathbb{E}_{x,\, \epsilon \sim \mathcal{N}(0, 1),\, t}\left[\left\| \epsilon - \epsilon_{\theta}(x_{t}, t) \right\|^{2}\right], \tag{3}$$

where $t$ is uniformly sampled from the set $\{1, ..., T\}$.
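
The forward step and the training objective of Eq. (3) can be sketched in a few lines of plain Python. This is an illustrative toy on flat lists of numbers, not the training code; the names `forward_step` and `noise_prediction_loss` are hypothetical, and a real model would supply the predicted noise $\epsilon_{\theta}(x_{t}, t)$ from a network.

```python
import math
import random

def forward_step(x_prev, alpha_t, rng):
    """One noising step: x_t = sqrt(alpha_t) * x_{t-1} + sqrt(1 - alpha_t) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x_prev]
    x_t = [math.sqrt(alpha_t) * x + math.sqrt(1.0 - alpha_t) * e
           for x, e in zip(x_prev, eps)]
    return x_t, eps

def noise_prediction_loss(eps, eps_pred):
    """Squared L2 distance between sampled and predicted noise, as in Eq. (3)."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))
```

With `alpha_t` close to 1 the step adds very little noise, and a perfect noise predictor drives the loss to zero.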

Diffusion Model for X-Ray. A prevalent technique in diffusion models is to operate in a latent space, which typically requires a Vector Quantized-Variational AutoEncoder (VQ-VAE) [6] to compress the data first. This poses a significant challenge for our X-Ray application: no suitable off-the-shelf latent model exists, so a bespoke VQ-VAE would have to be developed from scratch, which would increase our training burden.

Another promising approach for efficiently training high-resolution generators is the cascaded synthesis pipeline. As illustrated by Imagen [32], DeepFloyd IF [2], and Stable Cascade [29], this method progressively trains the diffusion model or upsampler from lower to higher resolutions. Considering our limited computing resources, we have chosen to implement this cascaded up-sampling strategy. This technique facilitates a more gradual and controlled improvement of X-Ray image quality, offering a more flexible and efficient alternative to traditional latent space diffusion models.

Specifically, we adopt the 3D U-Net architecture from Stable Video Diffusion [3] as our diffusion model for generating low-resolution X-Rays, with modifications limited to the input and output channels. This model employs spatial-temporal attention to alternately extract features from the 2D frame space and the 1D time sequence, which strengthens its ability to process and interpret the different layers of the X-Ray. In our setting, the layer index plays the role of time, so this design handles the sequential structure of X-Ray data, which is crucial for achieving high-quality diffusion results.

4.2 X-Ray Upsampler

The diffusion model above is confined to generating low-resolution X-Rays from either text or images. The subsequent phase enhances these low-resolution X-Rays to a higher resolution. We explore two primary methods: point cloud up-sampling and video up-sampling. Given that the X-Rays already provide a coarse representation of shape and appearance, converting this data into a point cloud with color and normal is straightforward, as detailed in Sec. 3.2. However, a point cloud is too unstructured for dense prediction, and conventional point cloud up-sampling techniques [30, 40, 1, 19] often simply increase the number of points, which may not be effective for up-sampling attributes such as texture and color.

To streamline our process and ensure consistency throughout our pipeline, we opt for a video up-sampling model. This model is adapted from the spatial-temporal VAE decoder of Stable Video Diffusion (SVD) [3] and is trained from scratch to upsample synthesized X-Ray frames by a factor of 4 while maintaining the original number of layers $L$. The decoder performs attention independently at both the frame level and the layer level. This dual-level attention not only enhances the resolution but also improves the overall quality of the images, making video up-sampling a more cohesive and effective solution for high-resolution X-Ray generation.

Loss. The loss function for the Upsampler differs notably from that of the diffusion model. While the diffusion model loss typically addresses volumetric or textural aspects, the Upsampler loss concentrates specifically on the surface area accuracy, reflecting the critical importance of maintaining high fidelity in the enhanced images. The specific loss function we use for the Upsampler is detailed in the equation below:

$$\mathbf{L}_{up} = \left\| \mathbf{X}_{gt}[\mathbf{H}_{gt}] - \mathbf{X}_{up}[\mathbf{H}_{gt}] \right\|^{2} + \left\| \mathbf{H}_{gt} - \mathbf{H}_{up} \right\|^{2} \tag{4}$$

Here, $\mathbf{X}_{gt}[\mathbf{H}_{gt}]$ represents the ground-truth high-resolution X-Ray at hit surfaces, $\mathbf{X}_{up}[\mathbf{H}_{gt}]$ denotes the Upsampler’s output at hit surfaces, and $\mathbf{H}_{gt}$ and $\mathbf{H}_{up}$ denote the ground-truth and upsampled Hit, respectively. Each term is the squared Euclidean distance between the two corresponding arrays, quantifying the pixel-wise discrepancy in surface details and in surface occupancy. This loss ensures that the upsampling process preserves essential surface features, thereby improving the quality and utility of the resulting high-resolution X-Ray.
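
A minimal sketch of Eq. (4) on flat lists of per-pixel entries may make the masking explicit. The name `upsampler_loss` and the entry layout `[hit, depth, nx, ny, nz, r, g, b]` are hypothetical, and restricting the masked term to the seven attribute channels (leaving the hit channel to the second term) is an assumption of this sketch.

```python
def upsampler_loss(x_gt, x_up):
    """Eq. (4) on flat lists of 8-value entries [hit, depth, nx, ny, nz, r, g, b]."""
    attribute_term = 0.0
    hit_term = 0.0
    for gt, up in zip(x_gt, x_up):
        hit_term += (gt[0] - up[0]) ** 2  # ||H_gt - H_up||^2 over every entry
        if gt[0] == 1.0:  # attributes are compared only where the ground truth has a surface
            attribute_term += sum((a - b) ** 2 for a, b in zip(gt[1:], up[1:]))
    return attribute_term + hit_term
```

Because empty entries are masked out of the first term, arbitrary depth, normal, or color values predicted there are not penalized; only a wrong hit prediction is.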

5 Experiments

5.1 Dataset and Implementation

Dataset. We conduct our experiments using a curated subset of the Objaverse dataset [5], from which we have removed entries with missing textures and inadequate prompts as outlined in [34]. This subset consists of more than 60,000 3D objects. For each object, we select 4 random camera views, covering azimuth angles from -180 to 180 degrees and elevation angles from -45 to 45 degrees, with the camera distance to the object center fixed at 1.5. The images are then rendered using Blender, and the corresponding X-Rays are generated through the ray casting algorithm provided by the trimesh library. Through these processes, we create over 240,000 image and X-Ray pairs to train the generative model.
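
The view sampling described above could look like the following sketch. The paper does not state the sampling distribution or the axis convention, so uniform sampling in angle, a z-up frame, an object centered at the origin, and the function names are all assumptions.

```python
import math
import random

def sample_camera_position(rng, distance=1.5):
    """Sample a camera on a sphere around the object center (assumed to be the origin)."""
    azimuth = math.radians(rng.uniform(-180.0, 180.0))
    elevation = math.radians(rng.uniform(-45.0, 45.0))
    x = distance * math.cos(elevation) * math.cos(azimuth)
    y = distance * math.cos(elevation) * math.sin(azimuth)
    z = distance * math.sin(elevation)  # z-up convention is an assumption
    return (x, y, z)

def sample_views(rng, count=4):
    """Draw `count` camera positions for one object (4 views per object in the paper)."""
    return [sample_camera_position(rng) for _ in range(count)]
```

With more than 60,000 objects and 4 views each, this procedure accounts for the stated total of over 240,000 pairs.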

Implementation Detail. Our X-Ray diffusion model closely follows the Spatial-Temporal UNet architecture used in Stable Video Diffusion (SVD) [3], with a minor adaptation: it is configured to synthesize 8 channels (1 hit channel, 1 depth channel, and 6 normal and color channels) instead of the 4 channels in the original network. Given the substantial gap between the X-Ray and video domains, we train our model from scratch. The training is conducted on 8 NVIDIA A100 GPU servers for a week, using the AdamW optimizer with the learning rate held at 0.0001. Since different X-Rays have different numbers of layers, we pad or cut them to the same 8 layers for better batching and training, and the frame of each layer has dimensions of $64 \times 64$. For the up-sampling model, the number of layers $L$ is still 8, while the resolution of each frame is increased to $256 \times 256$, enhancing detail and clarity in the upscaled X-Ray.
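
The pad-or-cut step can be sketched as follows. The function name `pad_or_cut_layers` is hypothetical, each layer is treated as a flat list of values for simplicity, and padding with all-zero layers is an assumption that matches the convention $X[i, j, k] = 0$ for missing surfaces.

```python
def pad_or_cut_layers(frames, target_layers=8):
    """Return exactly target_layers layers: drop extras, append all-zero layers."""
    frame_size = len(frames[0]) if frames else 0
    fixed = [list(f) for f in frames[:target_layers]]  # cut layers beyond the target
    fixed += [[0.0] * frame_size for _ in range(target_layers - len(fixed))]
    return fixed
```

Cutting discards the deepest surfaces of very complex objects, which is one source of the reconstruction error analyzed in Sec. 5.2.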

5.2 Analysis of Encoding-Decoding Reconstruction Error

Due to the finite number of layers $L$ and the finite resolution $H, W$ of each frame, a slight reconstruction error is inevitable when encoding 3D meshes into the X-Ray format and decoding them back into 3D meshes. To quantitatively assess this error, we conducted an experiment evaluating the Chamfer Distance (CD) between the original (ground-truth) meshes and their reconstructed counterparts across various settings. We varied the number of layers from $1$ to $16$ and adjusted the frame resolution through a set of predefined values: $32$, $64$, $128$, $256$, $512$, and $1024$. The outcomes, illustrating the impact of layer count and resolution on reconstruction accuracy, are presented in Figure 4.
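
For reference, a brute-force Chamfer Distance between two point sets can be written as below. Conventions differ across papers (squared versus unsquared distances, sum versus mean), and the paper does not state which one it uses; this sketch assumes unsquared distances averaged in each direction, applied to points sampled from the two meshes.

```python
import math

def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer Distance: mean nearest-neighbor distance in both directions."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(points_a, points_b) + one_way(points_b, points_a)
```

The double loop costs $O(|A| \cdot |B|)$, which is fine for a small illustration; large point sets would normally use a spatial index such as a k-d tree.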

(a) Reconstruction error versus frame height ($H$) or width ($W$), with $L=16$.
(b) Reconstruction error versus number of layers ($L$), with $H=W=256$.
Figure 4: Reconstruction error after the encoding and decoding process.
Figure 5: Visualization of Image-to-3D Generation from X-Ray.
Figure 6: Visualization of Text-to-3D Generation from X-Ray.

5.3 3D Generation from Image or Text

For image-to-3D mesh generation, we simply concatenate the latent representation of the input image with the low-resolution X-Ray and train both the diffusion model and the upsampler. This lets the model use the spatial information already present in the image to produce more accurate 3D meshes.
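One way to picture this concatenation is sketched below. The text does not specify the axis or the latent shape, so the sketch assumes the image latent is repeated for every X-Ray layer and appended along the channel axis, as is common for image-conditioned video diffusion models; the 4-channel latent and the name `concat_image_condition` are hypothetical.

```python
import numpy as np

def concat_image_condition(xray, image_latent):
    """Attach an image latent (C_img, H, W) to every layer of an X-Ray (L, C, H, W).

    The latent is repeated across layers and concatenated on the channel
    axis (an assumed conditioning scheme, not confirmed by the text).
    """
    num_layers = xray.shape[0]
    cond = np.broadcast_to(image_latent[None], (num_layers, *image_latent.shape))
    return np.concatenate([xray, cond], axis=1)

# Low-resolution X-Ray: 8 layers, 8 channels, 64x64 frames.
xray = np.zeros((8, 8, 64, 64), dtype=np.float32)
# Hypothetical 4-channel image latent at the same spatial resolution.
image_latent = np.ones((4, 64, 64), dtype=np.float32)

model_input = concat_image_condition(xray, image_latent)
```

Under these assumptions the network input has 12 channels per layer: 8 from the X-Ray and 4 from the image latent.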

For text-to-3D mesh generation, rather than developing a new text-conditioned diffusion model, we rely on established diffusion models that are already proficient at image synthesis from textual descriptions. A model such as Stable Diffusion [31], Stable Cascade [29], or DiT [28] is used to generate an image from the input text. We then apply an image segmentation tool, specifically the Segment Anything Model [18], to remove the background, and the resulting foreground image is passed to the image-to-3D pipeline described above. This approach avoids training a new text-conditioned model from scratch and instead uses pre-trained models for the text-to-image step, which simplifies the generation of 3D meshes from text. The results of Image-to-3D and Text-to-3D generation are shown in Fig. 5 and Fig. 6.

6 Conclusion

In this work, we introduced the X-Ray representation, a novel approach to representing 3D objects. Unlike a traditional depth map, which captures only the visible surfaces, our X-Ray representation covers both visible and hidden surfaces within the camera's field of view. We demonstrated the feasibility of the X-Ray approach for 3D generation tasks, including image-to-3D and text-to-3D. Moreover, we showed that the underlying generator for X-Ray shares a foundational similarity with existing video diffusion models, allowing us to leverage their inherent advantages. The empirical results show strong performance, underscoring the potential of the proposed method. However, our study uses the Stable Video Diffusion (SVD) [3] pipeline as the primary framework for generating and upsampling X-Rays, which presents certain limitations. A significant concern is that the X-Rays consist of multiple sequential layers, each with distinct characteristics. Additionally, the later layers of an X-Ray tend to be sparse, which can compromise the quality and detail of the generated results. Addressing these issues will be a priority in future research. Exploring network architectures better suited to the sparsity and sequential nature of X-Ray layers will be crucial for improving the fidelity and utility of the generated results. Furthermore, we aim to investigate additional applications of the X-Ray representation, broadening its utility in 3D modeling and beyond.

References

  • Akhtar et al. [2022] Anique Akhtar, Zhu Li, Geert Van Der Auwera, Li Li, and Jianle Chen. Pu-dense: Sparse tensor-based point cloud geometry upsampling. IEEE Trans. Image Process., 2022.
  • at StabilityAI [2023] DeepFloyd Lab at StabilityAI. DeepFloyd IF: a novel state-of-the-art open-source text-to-image model with a high degree of photorealism and language understanding. https://www.deepfloyd.ai/deepfloyd-if, 2023. Retrieved on 2023-11-08.
  • Blattmann et al. [2023] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, and Robin Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets. CoRR, 2023.
  • Chan et al. [2022] Eric R. Chan, Connor Z. Lin, Matthew A. Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas Guibas, Jonathan Tremblay, Sameh Khamis, Tero Karras, and Gordon Wetzstein. Efficient geometry-aware 3D generative adversarial networks. In CVPR, 2022.
  • Deitke et al. [2022] Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. arXiv preprint arXiv:2212.08051, 2022.
  • Esser et al. [2020] Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis, 2020.
  • Gao et al. [2022] Jun Gao, Tianchang Shen, Zian Wang, Wenzheng Chen, Kangxue Yin, Daiqing Li, Or Litany, Zan Gojcic, and Sanja Fidler. Get3d: A generative model of high quality 3d textured shapes learned from images. In Advances In Neural Information Processing Systems, 2022.
  • Ge et al. [2023] Wenhang Ge, Tao Hu, Haoyu Zhao, Shu Liu, and Ying-Cong Chen. Ref-neus: Ambiguity-reduced neural implicit surface learning for multi-view reconstruction with reflection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4251–4260, 2023.
  • Gwak et al. [2020] JunYoung Gwak, Christopher B Choy, and Silvio Savarese. Generative sparse detection networks for 3d single-shot object detection. In European conference on computer vision, 2020.
  • He and Wang [2023] Zexin He and Tengfei Wang. Openlrm: Open-source large reconstruction models. https://github.com/3DTopia/OpenLRM, 2023.
  • Hong et al. [2023] Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. LRM: large reconstruction model for single image to 3d. CoRR, abs/2311.04400, 2023.
  • Hu et al. [2022] Tao Hu, Shu Liu, Yilun Chen, Tiancheng Shen, and Jiaya Jia. Efficientnerf: Efficient neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12902–12911, 2022.
  • Hu et al. [2023a] Tao Hu, Xiaogang Xu, Ruihang Chu, and Jiaya Jia. Trivol: Point cloud rendering via triple volumes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20732–20741, 2023a.
  • Hu et al. [2023b] Tao Hu, Xiaogang Xu, Shu Liu, and Jiaya Jia. Point2pix: Photo-realistic point cloud rendering via neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8349–8358, 2023b.
  • Hui et al. [2024] Ka-Hei Hui, Aditya Sanghi, Arianna Rampini, Kamal Rahimi Malekshan, Zhengzhe Liu, Hooman Shayani, and Chi-Wing Fu. Make-a-shape: a ten-million-scale 3d shape model. CoRR, 2024.
  • Kazhdan and Hoppe [2013] Michael Kazhdan and Hugues Hoppe. Screened poisson surface reconstruction. ACM Trans. Graph., 2013.
  • Kerbl et al. [2023] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), 2023.
  • Kirillov et al. [2023] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything. arXiv:2304.02643, 2023.
  • Li et al. [2021] Ruihui Li, Xianzhi Li, Pheng-Ann Heng, and Chi-Wing Fu. Point cloud upsampling via disentangled refinement. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • Liu et al. [2020] Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural sparse voxel fields. NeurIPS, 2020.
  • Liu et al. [2019] Zhijian Liu, Haotian Tang, Yujun Lin, and Song Han. Point-voxel CNN for efficient 3d deep learning. In NeurIPS, 2019.
  • Luo and Hu [2021] Shitong Luo and Wei Hu. Diffusion probabilistic models for 3d point cloud generation. In CVPR, 2021.
  • Luo et al. [2023] Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao, Jingren Zhou, and Tieniu Tan. Videofusion: Decomposed diffusion models for high-quality video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  • Mildenhall et al. [2019] Ben Mildenhall, Pratul P. Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng, and Abhishek Kar. Local light field fusion: Practical view synthesis with prescriptive sampling guidelines. ACM Transactions on Graphics (TOG), 2019.
  • Mildenhall et al. [2020] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
  • Mo et al. [2023] Shentong Mo, Enze Xie, Yue Wu, Junsong Chen, Matthias Nießner, and Zhenguo Li. Fast training of diffusion transformer with extreme masking for 3d point clouds generation. arXiv preprint arXiv: 2312.07231, 2023.
  • Nichol et al. [2022] Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. Point-e: A system for generating 3d point clouds from complex prompts. CoRR, abs/2212.08751, 2022.
  • Peebles and Xie [2023] William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, 2023.
  • Pernias et al. [2023] Pablo Pernias, Dominic Rampas, Mats L. Richter, Christopher J. Pal, and Marc Aubreville. Wuerstchen: An efficient architecture for large-scale text-to-image diffusion models, 2023.
  • Qi et al. [2017] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In NeurIPS, 2017.
  • Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, 2022.
  • Saharia et al. [2022] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Seyed Kamyar Seyed Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In NeurIPS, 2022.
  • Fridovich-Keil et al. [2022] Sara Fridovich-Keil, Alex Yu, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. In CVPR, 2022.
  • Tang et al. [2024] Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang, Gang Zeng, and Ziwei Liu. Lgm: Large multi-view gaussian model for high-resolution 3d content creation. arXiv preprint arXiv:2402.05054, 2024.
  • Tochilkin et al. [2024] Dmitry Tochilkin, David Pankratz, Zexiang Liu, Zixuan Huang, Adam Letts, Yangguang Li, Ding Liang, Christian Laforte, Varun Jampani, and Yan-Pei Cao. Triposr: Fast 3d object reconstruction from a single image. arXiv preprint arXiv:2403.02151, 2024.
  • Tucker and Snavely [2020] Richard Tucker and Noah Snavely. Single-view view synthesis with multiplane images. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • Wang et al. [2021] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. NeurIPS, 2021.
  • Xu et al. [2023] Yinghao Xu, Hao Tan, Fujun Luan, Sai Bi, Peng Wang, Jiahao Li, Zifan Shi, Kalyan Sunkavalli, Gordon Wetzstein, Zexiang Xu, and Kai Zhang. DMV3D: denoising multi-view diffusion using 3d large reconstruction model. CoRR, abs/2311.09217, 2023.
  • Yu et al. [2021] Alex Yu, Ruilong Li, Matthew Tancik, Hao Li, Ren Ng, and Angjoo Kanazawa. PlenOctrees for real-time rendering of neural radiance fields. In ICCV, 2021.
  • Yu et al. [2018] Lequan Yu, Xianzhi Li, Chi-Wing Fu, Daniel Cohen-Or, and Pheng-Ann Heng. Pu-net: Point cloud upsampling network. In CVPR, 2018.