Zero-1-to-3: Zero-shot One Image to 3D Object
Abstract
We introduce Zero-1-to-3, a framework for changing the camera viewpoint of an object given just a single RGB image. To perform novel view synthesis in this under-constrained setting, we capitalize on the geometric priors that large-scale diffusion models learn about natural images. Our conditional diffusion model uses a synthetic dataset to learn controls of the relative camera viewpoint, which allow new images to be generated of the same object under a specified camera transformation. Even though it is trained on a synthetic dataset, our model retains a strong zero-shot generalization ability to out-of-distribution datasets as well as in-the-wild images, including impressionist paintings. Our viewpoint-conditioned diffusion approach can further be used for the task of 3D reconstruction from a single image. Qualitative and quantitative experiments show that our method significantly outperforms state-of-the-art single-view 3D reconstruction and novel view synthesis models by leveraging Internet-scale pre-training.
Overview

Given a single RGB image $x \in \mathbb{R}^{H \times W \times 3}$ of an object, the goal is to synthesize an image of the object from a different camera viewpoint. Let $R \in \mathbb{R}^{3 \times 3}$ and $T \in \mathbb{R}^{3}$ denote the relative camera rotation and translation of the desired viewpoint. The task is to learn a model $f$ that synthesizes a new image under this camera transformation:

$$\hat{x}_{R,T} = f(x, R, T),$$

where $\hat{x}_{R,T}$ is the synthesized image, which should be perceptually similar to the true but unobserved novel view $x_{R,T}$.
Learning to Control Camera Viewpoint
Since diffusion models have been trained on internet-scale data, their support of the natural image distribution likely covers most viewpoints for most objects, but these viewpoints cannot be controlled in the pre-trained models. Zero123 aims to teach the model a mechanism for controlling the camera extrinsics with which a photo is captured, thereby unlocking the ability to perform novel view synthesis. To this end, given a dataset of paired images and their relative camera extrinsics $\{(x, x_{R,T}, R, T)\}$, a pre-trained latent diffusion model is fine-tuned to learn controls over the camera parameters without destroying the rest of its representation, by minimizing

$$\min_\theta \; \mathbb{E}_{z \sim \mathcal{E}(x),\, t,\, \epsilon \sim \mathcal{N}(0,1)} \left\| \epsilon - \epsilon_\theta\!\left(z_t, t, c(x, R, T)\right) \right\|_2^2,$$

where $\epsilon_\theta$ is the denoiser, $z_t$ the noised latent at timestep $t$, and $c(x, R, T)$ the embedding of the input view and the relative camera extrinsics.
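As a concrete illustration, here is a minimal PyTorch sketch of this fine-tuning objective. The module names (`vae`, `unet`, `embed_pose`, `scheduler`) are hypothetical stand-ins rather than the released implementation; the point is only the structure of the loss $\|\epsilon - \epsilon_\theta(z_t, t, c(x, R, T))\|_2^2$.

```python
# Minimal sketch of the view-conditioned fine-tuning objective.
# `vae`, `unet`, `embed_pose`, and `scheduler` are hypothetical stand-ins.
import torch
import torch.nn.functional as F

def training_step(vae, unet, embed_pose, scheduler, x, x_target, R, T):
    """One denoising loss on a (input view, target view, pose) triple.

    vae        -- frozen latent encoder E
    unet       -- denoiser eps_theta being fine-tuned
    embed_pose -- builds the conditioning c(x, R, T)
    scheduler  -- implements the forward noising q(z_t | z_0)
    """
    # Encode the target view x_{R,T} into latent space: z ~ E(x_{R,T}).
    z0 = vae.encode(x_target)

    # Sample a timestep and Gaussian noise, then noise the latent.
    t = torch.randint(0, scheduler.num_train_timesteps,
                      (z0.shape[0],), device=z0.device)
    eps = torch.randn_like(z0)
    z_t = scheduler.add_noise(z0, eps, t)

    # Condition on the input view and the relative camera extrinsics.
    cond = embed_pose(x, R, T)

    # || eps - eps_theta(z_t, t, c(x, R, T)) ||_2^2
    eps_pred = unet(z_t, t, cond)
    return F.mse_loss(eps_pred, eps)
```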
View-Conditioned Diffusion
3D reconstruction from a single image requires both low-level perception (depth, shading, texture, etc.) and high-level understanding (type, function, structure, etc.). Therefore, a hybrid conditioning mechanism is adopted. On one stream, a CLIP embedding of the input image is concatenated with $(R, T)$ to form a "posed CLIP" embedding $c(x, R, T)$, which conditions the denoising U-Net via cross-attention and provides high-level semantic information about the input image. On the other stream, the input image is channel-concatenated with the image being denoised, helping the model preserve the identity and fine details of the object being synthesized. To enable classifier-free guidance, the input image and the posed CLIP embedding are randomly set to a null vector during training, and the conditional information is scaled during inference.
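The hybrid conditioning above can be sketched as follows. This is an illustrative approximation with hypothetical names (`clip_image_encoder`, `make_conditioning`, a 4-D `pose` vector), not the authors' code; it shows the posed CLIP embedding, the random nulling used for classifier-free guidance, and the channel-concatenation stream.

```python
# Illustrative sketch of the hybrid conditioning; all names are hypothetical.
import torch

def make_conditioning(clip_image_encoder, x, pose, z_t, x_latent, p_uncond=0.05):
    """Build both conditioning streams for one training batch.

    clip_image_encoder -- frozen CLIP image encoder
    pose               -- relative viewpoint vector, shape (B, 4)
    z_t                -- noisy latent being denoised, shape (B, C, H, W)
    x_latent           -- encoded input view, same shape as z_t
    """
    # Stream 1: "posed CLIP" embedding c(x, R, T) = [CLIP(x); pose],
    # later applied via cross-attention (high-level semantics).
    clip_emb = clip_image_encoder(x)                    # (B, D)
    posed_clip = torch.cat([clip_emb, pose], dim=-1)    # (B, D + 4)

    # Classifier-free guidance training: randomly null the conditioning.
    drop = torch.rand(posed_clip.shape[0], device=posed_clip.device) < p_uncond
    posed_clip = torch.where(drop[:, None],
                             torch.zeros_like(posed_clip), posed_clip)
    x_latent = torch.where(drop[:, None, None, None],
                           torch.zeros_like(x_latent), x_latent)

    # Stream 2: channel-concatenate the input view's latent with z_t
    # (low-level identity and detail).
    unet_input = torch.cat([z_t, x_latent], dim=1)      # (B, 2C, H, W)
    return unet_input, posed_clip
```

At inference, scaling the conditional information takes the standard classifier-free form $\hat{\epsilon} = \epsilon_\theta(\varnothing) + s\,(\epsilon_\theta(c) - \epsilon_\theta(\varnothing))$ with guidance scale $s$.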

A spherical coordinate system is used to represent camera locations and their relative transformations. During training, when two images of an object are sampled from different viewpoints, let their camera locations be $(\theta_1, \phi_1, r_1)$ and $(\theta_2, \phi_2, r_2)$, where $\theta$, $\phi$, and $r$ denote the polar angle, azimuth angle, and radius. The relative camera transformation that conditions the model is then $(\theta_2 - \theta_1, \phi_2 - \phi_1, r_2 - r_1)$.
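A small sketch of this encoding follows. The sin/cos wrapping of the azimuth difference is an assumed implementation detail (the azimuth is periodic, so a raw difference is discontinuous at $0/2\pi$), and the function name `relative_pose` is hypothetical.

```python
# Sketch of the relative-viewpoint encoding; the sin/cos wrapping of the
# azimuth difference is an assumed detail to handle its periodicity.
import math
import torch

def relative_pose(theta1, phi1, r1, theta2, phi2, r2):
    """Encode the camera transformation from view 1 to view 2.

    (theta, phi, r) are the polar angle, azimuth angle, and radius of a
    camera position in spherical coordinates.
    """
    d_theta = theta2 - theta1
    d_phi = phi2 - phi1          # periodic in 2*pi, hence sin/cos below
    d_r = r2 - r1
    return torch.tensor([d_theta, math.sin(d_phi), math.cos(d_phi), d_r])
```

For example, `relative_pose(math.pi / 3, 0.0, 1.5, math.pi / 3, math.pi / 2, 1.5)` encodes a 90° azimuth rotation at fixed elevation and radius.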