We combine two streams of perception tasks, LiDAR point-cloud based 3D object detection and monocular dense depth estimation, and introduce a method for jointly training the corresponding neural networks. We represent the input point-cloud as a set of non-cubic voxels, each encoding the density of the points it contains, and then use a set of 3D convolution layers to extract spatial features, which are concatenated with the latent vector of the RGB-image-to-depth estimation network. We pose 3D object detection as a regression problem, regressing object pose parameters along with their confidence and class probabilities from the extracted spatial features using a set of fully-connected layers. The dense depth estimation network, on the other hand, is an hourglass architecture with skip connections, built from 2D convolution layers. We call our method VR3Dense; it requires a point-cloud, left and right stereo images, and object pose labels for training. During inference, we require only a LiDAR point-cloud and the corresponding left camera image as inputs, and we predict object poses along with a pixel-wise dense depth map as outputs. We train and evaluate our method on the KITTI 3D object detection dataset.
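A minimal sketch of this voxelization step is shown below, assuming an illustrative point-cloud range, grid resolution, and density normalization (the function name `voxelize_density` and all numeric values are placeholders, not our exact settings):

```python
import numpy as np

def voxelize_density(points, x_range=(0.0, 70.0), y_range=(-40.0, 40.0),
                     z_range=(-2.5, 1.0), grid=(256, 256, 16)):
    """Encode a point-cloud (N, 3+) as a voxel grid where each voxel stores
    the (normalized) count of points falling inside it. Because the cell
    size differs per axis, the voxels are non-cubic."""
    nx, ny, nz = grid
    # Keep only points inside the region of interest.
    mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]) &
            (points[:, 2] >= z_range[0]) & (points[:, 2] < z_range[1]))
    pts = points[mask]
    # Map metric coordinates to integer voxel indices.
    ix = ((pts[:, 0] - x_range[0]) / (x_range[1] - x_range[0]) * nx).astype(int)
    iy = ((pts[:, 1] - y_range[0]) / (y_range[1] - y_range[0]) * ny).astype(int)
    iz = ((pts[:, 2] - z_range[0]) / (z_range[1] - z_range[0]) * nz).astype(int)
    voxels = np.zeros(grid, dtype=np.float32)
    np.add.at(voxels, (ix, iy, iz), 1.0)   # accumulate per-voxel point counts
    return voxels / max(float(voxels.max()), 1.0)  # density in [0, 1]

# Example: synthetic points spanning the region of interest.
pts = np.random.rand(100000, 3) * [70.0, 80.0, 3.5] + [0.0, -40.0, -2.5]
grid = voxelize_density(pts)   # shape (256, 256, 16)
```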
The figure below summarizes our technical approach for joint 3D object detection and monocular depth estimation. To learn a spatial representation from LiDAR point-clouds, we first encode the points into a collection of non-cubic voxels, with each voxel encoding the density of points within its occupied volume. This voxelized point-cloud is then passed to a 3D convolutional neural network for feature extraction. In parallel, the corresponding left stereo camera image is passed through a U-Net-like encoder-decoder architecture with residual blocks to produce a dense depth map. The encoder of this network encodes the RGB image into a latent vector, to which the encoded voxelized point-cloud features are concatenated before being fed to the decoder. We show qualitatively and quantitatively that this fusion results in better depth estimation. We further use fully-connected layers to extract object pose parameters along with class probabilities from the encoded voxel representation.
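A minimal PyTorch sketch of this fusion point and the pose-regression head follows; all layer sizes, the 7-parameter pose layout (x, y, z, l, w, h, yaw), and the names `VoxelFeatureNet` and `JointHeads` are assumptions for illustration, and the RGB encoder and hourglass depth decoder are stubbed out:

```python
import torch
import torch.nn as nn

class VoxelFeatureNet(nn.Module):
    """3D-convolutional feature extractor over the voxel density grid."""
    def __init__(self, out_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(32, out_dim), nn.ReLU())

    def forward(self, voxels):           # voxels: (B, 1, X, Y, Z)
        return self.net(voxels)          # -> (B, out_dim)

class JointHeads(nn.Module):
    """Fusion point: voxel features drive the detection head and are also
    concatenated with the RGB latent before the depth decoder."""
    def __init__(self, latent_dim=512, voxel_dim=512, num_classes=3, pose_dim=7):
        super().__init__()
        self.voxel_net = VoxelFeatureNet(voxel_dim)
        # Fully-connected layers regressing pose parameters, a confidence
        # score, and class logits from the extracted spatial features.
        self.pose_head = nn.Sequential(
            nn.Linear(voxel_dim, 256), nn.ReLU(),
            nn.Linear(256, pose_dim + 1 + num_classes))
        # Stand-in for the first stage of the depth decoder, which consumes
        # the concatenated (image latent + voxel feature) vector.
        self.decoder_input = nn.Linear(latent_dim + voxel_dim, latent_dim)

    def forward(self, voxels, image_latent):
        feat = self.voxel_net(voxels)              # (B, voxel_dim)
        detections = self.pose_head(feat)          # pose + confidence + class logits
        fused = self.decoder_input(torch.cat([image_latent, feat], dim=1))
        return detections, fused                   # `fused` feeds the depth decoder

# Example shapes: batch of 2, a 256x256x16 density grid, a 512-d image latent.
model = JointHeads()
dets, fused = model(torch.rand(2, 1, 256, 256, 16), torch.rand(2, 512))
```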