| Student: | Y. Wei |
| --- | --- |
| Timeline: | September 2021 - 7 September 2025 |
3D scene understanding concerns developing computer algorithms that analyze or synthesize objects as they exist in the 3D physical world, enabling a machine (e.g., a robot) to interact with its 3D surroundings. With advances in data-driven algorithms, especially those based on deep learning, 3D scene understanding has had a tremendous impact on a wide range of applications, including urban planning, autonomous driving, augmented reality, and virtual reality. However, collecting sufficient 3D training data for deep learning-based approaches is costly. For instance, 3D point clouds can be acquired by laser scanning techniques (e.g., LiDAR), but LiDAR sensors are more expensive than regular cameras, and their limited resolution may yield unsatisfactory point clouds. In addition, Structure-from-Motion (SfM) techniques can reconstruct 3D models from multi-view images, but feature matching becomes difficult when viewpoints are far apart and local appearance varies considerably. Therefore, this research addresses these challenges from a generative perspective, aiming at the efficient production of 3D data at both the object level and the scene level.
Generative AI techniques, which have attracted remarkable attention from both academia and industry, offer a promising avenue for producing 3D data from more user-friendly modalities, such as single-image-to-3D and text-to-3D. Most existing methods can only handle simple object-centric 3D scenes, focusing on the 3D shape (possibly with texture) of a single object (e.g., an airplane). However, these methods sometimes fail to control the generated content, leading to results that are inconsistent with the conditional inputs. Furthermore, open questions remain that hinder their uptake in practical applications. Taking geo-information production as an example, real-world buildings are far more complex than the standard object categories that prior computer vision work has commonly focused on. Moreover, inferring the 3D shapes and spatial relationships of multiple objects is a long-standing objective on the way towards realistic 3D scenes. Our research explores general-purpose 3D scene generation by leveraging generative models such as VAEs, GANs, normalizing flows, and diffusion models, with a particular focus on the development and applicability of generative models in geo-information production. Specifically, this research starts with object-level 3D point cloud generation from a single image, considering cross-category generalization for common object categories as well as intra-category specificity for complex objects. It covers diverse scenes, including indoor scenes composed of furniture and urban scenes composed of buildings. After addressing controllable generation of multi-object 3D scenes, our long-term goal is a unified framework that can generate and manipulate 3D scenes from multi-modal conditional inputs.
This Ph.D. project is supervised by Prof. George Vosselman and Dr. Michael Yang. For single-image-to-3D point cloud generation, we present FlowGAN [1], a new method that inherits the ability of flow-based generative models to generate an arbitrary number of points, while improving the detailed 3D structures of the generated point clouds through an adversarial training strategy. As shown in the left part of the figure, our single-flow-based method generates point clouds with cleaner global shapes and finer details while reducing computational cost during inference, compared to the recent flow-based methods DPF-Nets and MixNFs, which rely on multiple flow models. Our method also generalizes well: 3D shapes can be reconstructed from real images even though the models are trained on synthetic images. More recently, we proposed BuilDiff [2], a diffusion-based method that generates 3D point clouds of buildings from single general-view images in a coarse-to-fine manner. Guided by an image auto-encoder, a base diffusion model coarsely captures the overall structure of a building, and an up-sampler diffusion model then derives a higher-resolution point cloud, as shown in the right part of the figure, where the reference and predicted point clouds are colored orange and blue, respectively. We believe this work can bridge rapidly developing generative modeling techniques and the pressing problem of 3D building generation.
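To make the coarse-to-fine, image-conditioned diffusion idea more concrete, the following is a minimal PyTorch sketch. It is an illustrative assumption rather than the BuilDiff implementation: the `PointDenoiser` network, the conditioning scheme, the noise schedule, and the point counts are simplified placeholders, and the up-sampler shown here does not actually consume the coarse cloud as the real method would.

```python
import torch
import torch.nn as nn


class PointDenoiser(nn.Module):
    """Toy per-point denoiser: predicts noise from noisy points, a timestep, and an image embedding."""

    def __init__(self, cond_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + 1 + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, x, t, cond):
        # x: (B, N, 3) noisy points, t: (B,) timesteps, cond: (B, cond_dim) image embedding
        B, N, _ = x.shape
        t_feat = (t.float() / 1000.0).view(B, 1, 1).expand(B, N, 1)
        c_feat = cond.unsqueeze(1).expand(B, N, cond.shape[-1])
        return self.mlp(torch.cat([x, t_feat, c_feat], dim=-1))


@torch.no_grad()
def ddpm_sample(model, cond, num_points, steps=100):
    """Standard DDPM ancestral sampling over point coordinates, conditioned on `cond`."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(cond.shape[0], num_points, 3)  # start from pure Gaussian noise
    for t in reversed(range(steps)):
        eps = model(x, torch.full((cond.shape[0],), t), cond)
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x


# Coarse-to-fine generation conditioned on a single image embedding
# (e.g., produced by an image auto-encoder; random here for illustration).
image_embedding = torch.randn(1, 128)
base_model, upsampler = PointDenoiser(), PointDenoiser()

coarse = ddpm_sample(base_model, image_embedding, num_points=1024)  # overall building structure
# A real up-sampler would additionally condition on the coarse cloud; omitted here for brevity.
dense = ddpm_sample(upsampler, image_embedding, num_points=4096)    # higher-resolution point cloud
print(coarse.shape, dense.shape)  # torch.Size([1, 1024, 3]) torch.Size([1, 4096, 3])
```

The property the sketch highlights is that both stages share the same conditional denoising-diffusion sampling loop and differ only in resolution and conditioning.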
References:
[1] Wei, Y., Vosselman, G., & Yang, M. Y. (2022). Flow-based GAN for 3D Point Cloud Generation from a Single Image. In 33rd British Machine Vision Conference (BMVC 2022), London, UK, November 21-24, 2022. BMVA Press.
[2] Wei, Y., Vosselman, G., & Yang, M. Y. (2023). BuilDiff: 3D Building Shape Generation using Single-Image Conditional Point Cloud Diffusion Models. In IEEE/CVF International Conference on Computer Vision (ICCV) Workshops (pp. 2910-2919). IEEE.