CubifAE-3D: Monocular Camera Space Cubification on Autonomous Vehicles for Auto-Encoder based 3D Object Detection

Shubham Shrivastava and Punarjay Chakravarty

Ford Greenfield Labs, Palo Alto



We introduce a method for 3D object detection from a single monocular image. Depth data is used to pre-train an RGB-to-Depth Auto-Encoder (AE). At inference, the AE's encoder generates a latent embedding from the RGB image, and a 3D Object Detector (3DOD) CNN trained on this embedding regresses the parameters of 3D object poses. We show that the AE can be pre-trained once using paired RGB and depth images from simulation data, after which only the 3DOD network needs to be trained on real data comprising RGB images and 3D object pose labels (without requiring dense depth). Our 3DOD network relies on a particular cubification of the 3D space around the camera, where each cuboid is tasked with predicting N object poses along with their class and confidence values. The AE pre-training and this division of the 3D space around the camera into cuboids give our method its name: CubifAE-3D. We demonstrate results for monocular 3D object detection on the Virtual KITTI 2 and KITTI datasets for Autonomous Vehicle (AV) perception.

How does it work?

Our method of performing 3D object detection relies on first learning the latent space embeddings for per-pixel RGB-to-depth predictions in an image. This is achieved by training an auto-encoder to predict the dense depth map from a single RGB image. Once trained, the decoder is detached, and the latent space embedding is fed to our 3DOD network.

By training the auto-encoder first, we force its latent space to learn a compact RGB-to-depth embedding. A model that operates on these encodings can therefore relate the structures present in the RGB image to their real-world depth. We then cubify the monocular camera space and train our 3DOD model to detect object poses. At test time, only an RGB image is needed to detect object poses. An additional classifier network with a small number of parameters then classifies all detected objects at once: the object crops are resized, stacked, and fed to the classifier model. We do this instead of predicting classes directly as part of each object's output vector from the 3DOD network, because it reduces the number of parameters in the fully connected layers and hence the inference time. Finally, we apply non-max suppression: among object pose predictions that overlap with high IoU, only the one with the highest confidence is retained.
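The non-max suppression step can be illustrated with a minimal sketch. The text does not say how IoU is computed or which threshold is used, so the birds-eye-view, axis-aligned IoU (yaw ignored) and the `iou_threshold=0.5` default below are assumptions, and the `Detection` type is a hypothetical container introduced only for this example.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    x: float           # lateral centre in the camera frame (m)
    z: float           # depth centre in the camera frame (m)
    w: float           # extent along x (m)
    l: float           # extent along z (m)
    confidence: float  # detector confidence


def bev_iou(a: Detection, b: Detection) -> float:
    """IoU of two axis-aligned birds-eye-view rectangles (assumption: yaw ignored)."""
    overlap_x = min(a.x + a.w / 2, b.x + b.w / 2) - max(a.x - a.w / 2, b.x - b.w / 2)
    overlap_z = min(a.z + a.l / 2, b.z + b.l / 2) - max(a.z - a.l / 2, b.z - b.l / 2)
    inter = max(0.0, overlap_x) * max(0.0, overlap_z)
    union = a.w * a.l + b.w * b.l - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.5):
    """Greedy NMS: visit detections by descending confidence and keep one
    only if it does not overlap an already kept detection too strongly."""
    kept = []
    for det in sorted(detections, key=lambda d: d.confidence, reverse=True):
        if all(bev_iou(det, k) <= iou_threshold for k in kept):
            kept.append(det)
    return kept


detections = [
    Detection(x=0.0, z=10.0, w=2.0, l=4.0, confidence=0.9),
    Detection(x=0.2, z=10.1, w=2.0, l=4.0, confidence=0.6),  # near-duplicate of the first
    Detection(x=5.0, z=30.0, w=2.0, l=4.0, confidence=0.7),
]
kept = non_max_suppression(detections)
```

Here the second detection overlaps the first heavily and has lower confidence, so only the first and third survive.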

Detailed model architecture of CubifAE-3D. The RGB-to-depth auto-encoder (top branch) is first trained in a supervised way with a combination of MSE and edge-aware smoothing losses. Once trained, the decoder is detached, the encoder weights are frozen, and the encoder output is fed to the 3DOD model (middle branch), which is trained to regress the parameters of object poses. A 2D bounding box is obtained for each object by projecting its detected 3D bounding box onto the camera image plane; the corresponding image region is cropped, resized to 64x64, and fed to the classifier model (bottom branch) along with the normalized whl vector (width, height, length) for class prediction. The dimensions indicated correspond to the output tensor of each block.
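The projection of a 3D bounding box to a 2D crop region could look roughly like the sketch below. It assumes a pinhole camera with hypothetical intrinsics (`fx`, `fy`, `cx`, `cy`), a camera frame with x right, y down, and z forward, a box described by its geometric centre, and rotation about the y axis only. None of these conventions is stated in the text, and the function name is illustrative.

```python
import math
from itertools import product


def _clamp(value, upper):
    """Clamp a pixel coordinate to the range [0, upper]."""
    return min(max(value, 0.0), upper)


def project_box_to_2d(center, dims, yaw, fx, fy, cx, cy, img_w, img_h):
    """Return (u_min, v_min, u_max, v_max) enclosing the projected 3D box,
    or None if every corner lies behind the camera."""
    x, y, z = center
    w, h, l = dims
    c, s = math.cos(yaw), math.sin(yaw)
    us, vs = [], []
    for sx, sy, sz in product((-0.5, 0.5), repeat=3):
        # Corner in the object frame: length along x, height along y, width along z.
        ox, oy, oz = sx * l, sy * h, sz * w
        # Rotate about the y axis by yaw, then translate to the box centre.
        px = x + c * ox + s * oz
        py = y + oy
        pz = z - s * ox + c * oz
        if pz <= 0:
            continue  # corner is behind the camera and cannot be projected
        us.append(fx * px / pz + cx)
        vs.append(fy * py / pz + cy)
    if not us:
        return None
    return (_clamp(min(us), img_w - 1), _clamp(min(vs), img_h - 1),
            _clamp(max(us), img_w - 1), _clamp(max(vs), img_h - 1))


# Hypothetical intrinsics and a 2 m cube centred 10 m in front of the camera.
box = project_box_to_2d(center=(0.0, 0.0, 10.0), dims=(2.0, 2.0, 2.0), yaw=0.0,
                        fx=100.0, fy=100.0, cx=50.0, cy=50.0, img_w=100, img_h=100)
```

The resulting rectangle is the region that would then be cropped and resized to 64x64 before classification.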

We prepare training labels for the 3DOD network in a way that allows each part of the network to be responsible only for detecting objects within a certain physical space relative to the ego-vehicle camera. We cubify the 3D region-of-interest (ROI) of the ego-camera into a 3-dimensional grid. This 3D grid is of size 4x4xM: the camera image plane is divided into 4x4 regions along the (x,y) dimensions of the camera coordinate frame, and the z axis is further quantized into M cuboids for each of these regions. Each cuboid in this 4x4xM grid is responsible for predicting up to N objects, in increasing order of z (depth) from the center of the ego-camera.
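As an illustration of this label preparation, the sketch below assigns object centres to cuboids and keeps up to N objects per cuboid, ordered by depth. It assumes a box-shaped metric ROI that is split uniformly along each axis; the ROI bounds and the values of M and N are hypothetical, and the function name `cubify` is introduced only for this example.

```python
from collections import defaultdict


def cubify(objects, x_range, y_range, z_max, m, n, grid=4):
    """Map object centres (x, y, z) to cuboid indices (i, j, k) and keep,
    for each cuboid, at most n objects sorted by increasing depth z."""
    cells = defaultdict(list)
    for obj in objects:
        x, y, z = obj
        inside = (x_range[0] <= x < x_range[1]
                  and y_range[0] <= y < y_range[1]
                  and 0.0 <= z < z_max)
        if not inside:
            continue  # object lies outside the region of interest
        i = int((x - x_range[0]) / (x_range[1] - x_range[0]) * grid)
        j = int((y - y_range[0]) / (y_range[1] - y_range[0]) * grid)
        k = int(z / z_max * m)
        cells[(i, j, k)].append(obj)
    return {idx: sorted(objs, key=lambda o: o[2])[:n] for idx, objs in cells.items()}


# Hypothetical ROI: x in [-40, 40) m, y in [-4, 4) m, z in [0, 100) m, with M=10 and N=2.
objects = [
    (-19.0, 0.5, 18.0),
    (-18.0, 0.2, 12.0),
    (-17.0, 0.1, 15.0),   # three objects fall in the same cuboid; the farthest is dropped
    (10.0, 1.0, 70.0),
    (100.0, 0.0, 10.0),   # outside the ROI, ignored
]
labels = cubify(objects, x_range=(-40.0, 40.0), y_range=(-4.0, 4.0), z_max=100.0, m=10, n=2)
```

Sorting within each cuboid gives every output slot a consistent meaning (nearest object first), which is one plausible reading of "in increasing order of z".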

Cubification of the camera space: The perception region of interest is divided into a 4x4xM grid (4x4 in the x and y directions aligned with the camera image plane, where each grid cell has M cuboids stacked on it in the z-direction). Each cuboid is responsible for predicting up to N object poses. The object coordinates and dimensions are then normalized between 0 and 1 in accordance with a prior that is computed from data statistics.
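The normalization step might be sketched as follows. The text does not specify which statistics form the prior, so this example assumes a simple per-component minimum and maximum taken over the training labels; the function names and sample values are hypothetical.

```python
def fit_prior(samples):
    """Compute a per-component (min, max) prior from a list of equal-length tuples."""
    return [(min(column), max(column)) for column in zip(*samples)]


def normalize(values, prior):
    """Map each component to [0, 1] using the prior, clipping values outside it."""
    out = []
    for value, (lo, hi) in zip(values, prior):
        t = (value - lo) / (hi - lo) if hi > lo else 0.0
        out.append(min(max(t, 0.0), 1.0))
    return out


def denormalize(values, prior):
    """Invert normalize() for values that were inside the prior range."""
    return [lo + t * (hi - lo) for t, (lo, hi) in zip(values, prior)]


# Hypothetical (w, h, l) training labels in metres.
train_dims = [(1.5, 1.4, 3.5), (2.0, 1.6, 4.5), (2.5, 2.0, 5.5)]
prior = fit_prior(train_dims)
normalized = normalize((2.0, 1.7, 4.5), prior)
restored = denormalize(normalized, prior)
```

With targets in [0, 1], the network output can be bounded (for example by a sigmoid) and mapped back to metric units with `denormalize` at inference time.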

Qualitative results

Qualitative results on the KITTI dataset. The top part of each image shows the bounding boxes obtained as 2D projections of the objects' 3D poses (red: car, yellow: truck, green: van, blue: pedestrian, cyan: tram, cyclist, and others). The bottom part shows a birds-eye view of the object poses, with the ego-vehicle positioned at the center of the red circle drawn on the left and pointing towards the right of the image.


