VR3Dense: Voxel Representation Learning for 3D Object Detection and Monocular Dense Depth Reconstruction

Shubham Shrivastava

Ford Greenfield Labs, Palo Alto

sshriva5@ford.com

Abstract

3D object detection and dense depth estimation are among the most vital tasks in autonomous driving. Multiple sensor modalities can jointly contribute towards better robot perception, and to that end, we introduce a method for jointly training 3D object detection and monocular dense depth reconstruction neural networks. During inference, it takes a LiDAR point-cloud and a single RGB image as inputs and produces object pose predictions as well as a densely reconstructed depth map. The LiDAR point-cloud is converted into a set of voxels, and its features are extracted using 3D convolution layers, from which we regress object pose parameters. Corresponding RGB image features are extracted using another 2D convolutional neural network. We further use these combined features to predict a dense depth map. While our object detection network is trained in a supervised manner, the depth prediction network is trained with both self-supervised and supervised loss functions. We also introduce a loss function, the edge-preserving smooth loss, and show that it results in better depth estimation compared to the edge-aware smooth loss function frequently used in depth prediction works.

How does it work?


We combine two streams of perception tasks, LiDAR point-cloud based 3D object detection and monocular dense depth estimation, and introduce a method for jointly training the two neural networks. We represent the input point-cloud as a set of non-cubic voxels, each encoding the density of points contained within, and then use a set of 3D convolution layers to extract spatial features, which are concatenated with the latent vector of the RGB-image-to-depth estimation network. We pose 3D object detection as a regression problem and regress object pose parameters, along with their confidences and class probabilities, from the extracted spatial features using a set of fully-connected layers. The dense depth estimation network, on the other hand, is an hourglass architecture with skip connections, built using 2D convolution layers. We call our method VR3Dense; it requires a point-cloud, left and right stereo images, and object pose labels for training. During inference, we only require a LiDAR point-cloud and the corresponding left camera image as inputs, and we predict object poses along with a pixel-wise dense depth map as outputs. We train and test our method on the KITTI 3D object detection dataset.
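As a concrete illustration of the voxelization step, the sketch below builds a density voxel grid from a raw point-cloud. The function name, grid resolution, and axis ranges here are illustrative assumptions, not the values used by VR3Dense; the point is simply that each axis is discretized independently (hence non-cubic voxels) and that each voxel stores the density of points it contains.

```python
import numpy as np

def voxelize_density(points, x_range=(0.0, 70.0), y_range=(-25.0, 25.0),
                     z_range=(-2.5, 1.5), grid=(256, 256, 16)):
    """Encode an (N, 3) LiDAR point-cloud as a dense voxel grid of point densities.

    Ranges and grid resolution are illustrative; each axis is discretized
    independently, so the resulting voxels are non-cubic.
    """
    nx, ny, nz = grid
    # Keep only points inside the region of interest.
    mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]) &
            (points[:, 2] >= z_range[0]) & (points[:, 2] < z_range[1]))
    pts = points[mask]

    # Map each point to its voxel index along every axis.
    ix = ((pts[:, 0] - x_range[0]) / (x_range[1] - x_range[0]) * nx).astype(int)
    iy = ((pts[:, 1] - y_range[0]) / (y_range[1] - y_range[0]) * ny).astype(int)
    iz = ((pts[:, 2] - z_range[0]) / (z_range[1] - z_range[0]) * nz).astype(int)

    # Count points per voxel and normalize to a [0, 1] density.
    volume = np.zeros(grid, dtype=np.float32)
    np.add.at(volume, (ix, iy, iz), 1.0)
    if volume.max() > 0:
        volume /= volume.max()
    return volume  # shape (nx, ny, nz), ready for a 3D CNN
```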

The figure below summarizes our technical approach for joint 3D object detection and monocular depth estimation. To learn a spatial representation from LiDAR point-clouds, we first encode the points into a collection of non-cubic voxels, with each voxel encoding the density of points within the occupied volume. This voxelized point-cloud is then passed to a 3D convolutional neural network for feature extraction. At the same time, the corresponding left stereo camera image is passed through a U-Net-like encoder-decoder architecture with residual blocks to produce a dense depth map. The encoder of this network encodes the RGB image into a latent vector, to which the encoded voxelized point-cloud features are concatenated before being fed to the decoder. We show both qualitatively and quantitatively that this results in better depth estimation. We further use fully-connected layers to extract object pose parameters, along with class probabilities, from the encoded voxel representation.
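The sketch below shows how the two branches could be wired together in PyTorch. The class name, layer counts, channel widths, and the adaptive pooling are assumptions made for illustration only; the residual blocks, skip connections, and exact heads of the actual network are omitted. It is meant to show the data flow: voxel features feed both the detection head and the depth decoder, where they are concatenated with the image latent.

```python
import torch
import torch.nn as nn

class VR3DenseSketch(nn.Module):
    """Illustrative sketch of the joint architecture (sizes are guesses,
    not the ones used in the paper)."""

    def __init__(self, n_outputs=128, latent_dim=512):
        super().__init__()
        # 3D CNN over the density voxel grid.
        self.voxel_encoder = nn.Sequential(
            nn.Conv3d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),            # -> (B, 64)
        )
        # Fully-connected head regressing flattened pose, confidence,
        # and class outputs (output size is an assumption).
        self.detection_head = nn.Sequential(
            nn.Linear(64, 256), nn.ReLU(),
            nn.Linear(256, n_outputs),
        )
        # 2D encoder/decoder for monocular depth (residual blocks omitted).
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, latent_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.depth_decoder = nn.Sequential(
            nn.Conv2d(latent_dim + 64, 64, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=8, mode='bilinear', align_corners=False),
            nn.Conv2d(64, 1, 3, padding=1), nn.Sigmoid(),     # normalized depth
        )

    def forward(self, voxels, image):
        voxel_feat = self.voxel_encoder(voxels)               # (B, 64)
        detections = self.detection_head(voxel_feat)          # pose regression
        latent = self.image_encoder(image)                    # (B, C, H/8, W/8)
        # Broadcast the voxel feature over the spatial latent map and concatenate.
        b, _, h, w = latent.shape
        fused = torch.cat(
            [latent, voxel_feat.view(b, -1, 1, 1).expand(-1, -1, h, w)], dim=1)
        depth = self.depth_decoder(fused)                     # dense depth map
        return detections, depth
```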

Figure - VR3Dense jointly learns 3D object detection from a LiDAR point-cloud and dense depth estimation from the corresponding monocular camera frame. 3D object detection is trained with supervision from ground-truth labels, while monocular depth estimation is trained in a semi-supervised way: self-supervision from stereo image reconstruction and supervision from sparse LiDAR points projected onto the image plane. The learned feature vector from the voxel representation is concatenated with the latent vector of the depth estimation network, which allows the depth decoder to take advantage of spatial features extracted from LiDAR points.
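The semi-supervised depth objective described in the caption can be sketched as below. This is a hedged approximation, not the paper's exact formulation: an L1 photometric loss from warping the right stereo image into the left view using the predicted depth, plus an L1 loss on pixels where projected LiDAR depth is available. Loss weights and additional terms, such as the edge-preserving smoothness loss mentioned in the abstract, are omitted; the focal length and baseline would come from the KITTI calibration.

```python
import torch
import torch.nn.functional as F

def depth_losses(pred_depth, left_img, right_img, sparse_lidar_depth,
                 focal, baseline):
    """Sketch of the semi-supervised depth objective: photometric
    reconstruction of the left image from the right one, plus an L1 term on
    pixels with projected LiDAR depth. All tensors are (B, C, H, W)."""
    b, _, h, w = pred_depth.shape

    # Disparity from predicted depth (rectified stereo: d = f * B / z).
    disparity = focal * baseline / pred_depth.clamp(min=1e-3)     # pixels
    disp_norm = 2.0 * disparity / w                               # grid units

    # Sampling grid that warps the right image toward the left view.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing='ij')
    grid = torch.stack([xs, ys], dim=-1).unsqueeze(0).repeat(b, 1, 1, 1)
    grid = grid.to(pred_depth)
    grid[..., 0] = grid[..., 0] - disp_norm.squeeze(1)
    warped_left = F.grid_sample(right_img, grid, align_corners=True)

    # Self-supervised photometric reconstruction loss.
    loss_photo = (warped_left - left_img).abs().mean()

    # Supervised loss on sparse LiDAR pixels only.
    valid = sparse_lidar_depth > 0
    loss_lidar = (pred_depth - sparse_lidar_depth)[valid].abs().mean()

    return loss_photo + loss_lidar
```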

Qualitative results

Figure - VR3Dense tested on the KITTI raw dataset. Left: Date: 2011-09-26, Sequence: 0009; Right: Date: 2011-09-26, Sequence: 0104

Figure - RGB unprojection with predicted depth for the KITTI raw dataset - Date: 2011-09-26, Sequence: 0009

Citation

@misc{shrivastava2021vr3dense,
      title={VR3Dense: Voxel Representation Learning for 3D Object Detection and Monocular Dense Depth Reconstruction},
      author={Shubham Shrivastava},
      year={2021},
      eprint={2104.05932},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}