Automatically generating high-quality real world 3D scenes is of enormous interest for applications such as virtual reality and robotics simulation. Towards this goal, we introduce NeuralField-LDM, a generative model capable of synthesizing complex 3D environments. We leverage Latent Diffusion Models that have been successfully utilized for efficient high-quality 2D content creation. We first train a scene auto-encoder to express a set of image and pose pairs as a neural field, represented as density and feature voxel grids that can be projected to produce novel views of the scene. To further compress this representation, we train a latent-autoencoder that maps the voxel grids to a set of latent representations. A hierarchical diffusion model is then fit to the latents to complete the scene generation pipeline. We achieve a substantial improvement over existing state-of-the-art scene generation models. Additionally, we show how NeuralField-LDM can be used for a variety of 3D content creation applications, including conditional scene generation, scene inpainting and scene style manipulation.
Overview
Fig.1 NeuralField-LDM
Overview of NeuralField-LDM. We first encode RGB images with camera poses into a neural field represented by density and feature voxel grids. We compress the neural field into smaller latent spaces and fit a hierarchical latent diffusion model on the latent space. Sampled latents can then be decoded into a neural field that can be rendered from a given viewpoint.
Scene Auto-Encoder
Fig.2 Scene Auto-Encoder
The goal of the scene auto-encoder is to obtain a 3D representation of the scene from input images by learning to reconstruct them. Each input image is processed with a 2D CNN, then lifted to 3D and merged into the shared voxel grids.
The scene encoder is a 2D CNN that processes each RGB image separately, producing a $(D + C) \times H_e \times W_e$ tensor per image, where $H_e$ and $W_e$ are smaller than the input image's height and width. A discrete frustum of size $D \times H_e \times W_e$ is then defined for each image using its camera pose. The $d$-th of the first $D$ output channels at pixel $(h, w)$ is taken as the density $\sigma(d, h, w)$ of the frustum entry at $(d, h, w)$. The occupancy weight of each frustum entry is computed from the densities as

$$o(d, h, w) = \exp\!\Big(-\sum_{d' < d} \sigma(d', h, w)\,\delta\Big)\Big(1 - \exp\big(-\sigma(d, h, w)\,\delta\big)\Big),$$

where $\delta$ is the distance between consecutive depths in the frustum. Using the occupancy weights, the last $C$ channels of the CNN's output are placed into the frustum to get

$$\mathbf{v}(d, h, w) = o(d, h, w)\,\mathbf{e}(h, w),$$

where $\mathbf{e}(h, w)$ denotes the $C$-channel feature vector at pixel $(h, w)$, which is scaled by $o(d, h, w)$ for the frustum entry at depth $d$. After the frustum of each view is constructed, the frustums are transformed into world coordinates and fused into a shared 3D neural field containing both a density voxel grid and a feature voxel grid. For each voxel, the density and feature are defined as the mean of the corresponding frustum entries. To render a view, 2D features are extracted by using the camera pose to project the voxel grids onto the image plane, and they are then fed into a CNN decoder that produces the output image.
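As a concrete illustration, the minimal PyTorch sketch below lifts one image's CNN output into its feature frustum using the occupancy weights above. The function name, the depth-bin count, and the clamp used to keep densities non-negative are assumptions made for the example, not the paper's exact configuration.

```python
import torch

def lift_to_frustum(cnn_out: torch.Tensor, num_depth_bins: int = 64,
                    delta: float = 0.5) -> torch.Tensor:
    """Lift a (D + C, H_e, W_e) CNN output into a (C, D, H_e, W_e) feature frustum.

    The first D channels are per-depth densities sigma(d, h, w); the last C
    channels are a per-pixel feature vector e(h, w) shared across depths.
    """
    sigma = cnn_out[:num_depth_bins].clamp(min=0)   # (D, H_e, W_e), keep densities non-negative
    feats = cnn_out[num_depth_bins:]                # (C, H_e, W_e) per-pixel features

    # Occupancy weights along each pixel ray:
    # o(d) = exp(-sum_{d'<d} sigma(d') * delta) * (1 - exp(-sigma(d) * delta))
    alpha = 1.0 - torch.exp(-sigma * delta)                         # probability of stopping in bin d
    trans = torch.exp(-torch.cumsum(sigma * delta, dim=0))          # transmittance after each bin
    trans = torch.cat([torch.ones_like(trans[:1]), trans[:-1]], 0)  # shift -> transmittance before bin d
    weights = trans * alpha                                         # (D, H_e, W_e)

    # Scale the shared per-pixel feature by the occupancy weight of every depth bin.
    return weights.unsqueeze(0) * feats.unsqueeze(1)                # (C, D, H_e, W_e)
```

Each view's frustum is then warped into world coordinates with its camera pose and averaged into the shared density and feature voxel grids, as described above.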
The scene auto-encoding pipeline is trained with an image reconstruction loss and a depth supervision loss.
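A hedged sketch of how these two terms might be combined, assuming an L1 image term, a masked L1 depth term over pixels with valid ground truth, and an illustrative weighting that is not specified here:

```python
import torch.nn.functional as F

def scene_autoencoder_loss(pred_rgb, gt_rgb, pred_depth, gt_depth, depth_weight=0.1):
    """Image reconstruction loss plus a depth supervision term (weighting is an assumption)."""
    recon_loss = F.l1_loss(pred_rgb, gt_rgb)                    # image reconstruction
    valid = gt_depth > 0                                        # ignore pixels without depth labels
    depth_loss = F.l1_loss(pred_depth[valid], gt_depth[valid])  # depth supervision
    return recon_loss + depth_weight * depth_loss
```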
Latent Voxel Auto-Encoder
The density and feature voxel grids are concatenated along the channel dimension and then encoded into a hierarchy of three latents: a 1D global latent $g$, a 3D coarse latent $c$, and a 2D fine latent $f$. The intuition for this design is that $g$ represents global properties of the scene, such as the time of day, $c$ represents the coarse 3D scene structure, and $f$ is a 2D tensor with the same horizontal size as $c$ that adds further detail for each location from a bird's-eye-view (BEV) perspective. The camera trajectory information is concatenated to $g$ and is also learned to be sampled.
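To make the hierarchy concrete, the snippet below instantiates the three latents with illustrative shapes; the channel counts and resolutions are assumptions chosen only for readability, not the dimensions used in the paper.

```python
import torch

batch = 1
g = torch.randn(batch, 128)           # 1D global latent: scene-level properties (e.g. time of day)
c = torch.randn(batch, 8, 32, 32, 8)  # 3D coarse latent: (B, C, X, Y, Z) rough scene structure
f = torch.randn(batch, 4, 32, 32)     # 2D fine latent: same horizontal (BEV) extent as c
```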
The encoder is a 2D CNN that takes the voxel grids as input, with the vertical axis concatenated along the channel dimension. The decoder uses conditional group normalization layers with the global latent $g$ as the conditioning variable. The latent auto-encoder (LAE) is trained with a voxel reconstruction loss and an image reconstruction loss while the scene auto-encoder is kept fixed.
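Below is a minimal sketch of a conditional group normalization layer of the kind described, in which the global latent predicts a per-channel scale and shift; the layer sizes and the `1 + scale` modulation are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ConditionalGroupNorm(nn.Module):
    """Group normalization modulated by the global latent g (sizes are illustrative)."""
    def __init__(self, num_channels: int, g_dim: int, num_groups: int = 8):
        super().__init__()
        self.norm = nn.GroupNorm(num_groups, num_channels, affine=False)
        self.to_scale_shift = nn.Linear(g_dim, 2 * num_channels)

    def forward(self, x: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
        scale, shift = self.to_scale_shift(g).chunk(2, dim=-1)
        # Broadcast the per-channel modulation over all spatial dimensions of x.
        while scale.dim() < x.dim():
            scale, shift = scale.unsqueeze(-1), shift.unsqueeze(-1)
        return self.norm(x) * (1 + scale) + shift
```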
Hierarchical Latent Diffusion Models
Fig.3 Latent Diffusion Models
Given the latent variables $g$, $c$, $f$ that represent a voxel-based scene representation, the generative model based on DDPM is defined as $p(g, c, f) = p(g)\,p(c \mid g)\,p(f \mid g, c)$. $g$ conditions the coarse and fine diffusion models through conditional group normalization layers, and $c$ conditions the fine diffusion model by being interpolated and concatenated to its input.
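This factorization implies simple ancestral sampling: draw the global latent first, then the coarse latent conditioned on it, and finally the fine latent conditioned on both. The sketch below assumes hypothetical sampler objects with a `sample` method, each running its own reverse diffusion chain; it is not the paper's actual interface.

```python
import torch

@torch.no_grad()
def sample_scene(global_dm, coarse_dm, fine_dm, batch_size=1):
    """Ancestral sampling from p(g, c, f) = p(g) p(c | g) p(f | g, c)."""
    g = global_dm.sample(batch_size)                     # 1D global latent
    c = coarse_dm.sample(batch_size, cond_g=g)           # 3D coarse latent, conditioned on g
    f = fine_dm.sample(batch_size, cond_g=g, cond_c=c)   # c is interpolated and concatenated to the input
    return g, c, f
```

The sampled latents are then decoded by the latent auto-encoder's decoder into density and feature voxel grids, which can be rendered as described in the overview.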
Post-Optimizing Generated Neural Fields
The initially generated voxels can be further optimized by rendering viewpoints from the scene and applying a score distillation sampling (SDS) loss to each rendered image independently.
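A hedged sketch of such a post-optimization loop, assuming hypothetical `render_fn`, `sds_grad_fn`, and `pose_sampler` helpers for volume-rendering a viewpoint, computing the score distillation gradient of an image under a pretrained diffusion prior, and sampling camera poses:

```python
import torch

def post_optimize_voxels(voxels, render_fn, sds_grad_fn, pose_sampler,
                         num_steps=500, lr=1e-2):
    """Refine generated voxel grids with a score-distillation (SDS) style loss."""
    voxels = voxels.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([voxels], lr=lr)
    for _ in range(num_steps):
        image = render_fn(voxels, pose_sampler())  # render one viewpoint from the current voxels
        grad = sds_grad_fn(image)                  # SDS gradient w.r.t. the rendered image
        optimizer.zero_grad()
        image.backward(gradient=grad)              # push the per-image gradient back into the voxels
        optimizer.step()
    return voxels.detach()
```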