How does it work?
Detailed model architecture of CubifAE-3D. The RGB-to-depth auto-encoder (top branch) is first trained in a supervised manner with a combination of MSE and edge-aware smoothing losses. Once trained, the decoder is detached, the encoder weights are frozen, and the encoder output is fed to the 3DOD model (middle branch), which is trained to regress the parameters of object poses. A 2D bounding box for each object is obtained by projecting its detected 3D bounding box onto the camera image plane; the corresponding image crop is resized to 64x64 and fed to the classifier model (bottom branch), along with the normalized whl vector, for class prediction. The dimensions indicated correspond to the output tensor of each block.
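The 3D-to-2D projection step in the caption above can be sketched as follows. This is a minimal illustration with the standard pinhole camera model, not the paper's implementation: the intrinsics matrix and box values are hypothetical, and object orientation (yaw) is omitted for brevity.

```python
import numpy as np

def box3d_corners(center, whl):
    """Return the 8 corners of an axis-aligned 3D box in the camera frame.

    center: (x, y, z) box centroid; whl: (w, h, l) box dimensions.
    A yaw rotation would be applied here in a full implementation.
    """
    w, h, l = whl
    # Offsets of the 8 corners from the centroid.
    offsets = np.array([[sx * w / 2, sy * h / 2, sz * l / 2]
                        for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
    return np.asarray(center) + offsets

def project_to_2d(corners, K):
    """Project 3D points (camera frame) to pixels and take the enclosing 2D box."""
    uvw = (K @ corners.T).T          # (8, 3) homogeneous pixel coordinates
    uv = uvw[:, :2] / uvw[:, 2:3]    # perspective divide by depth
    # The enclosing 2D box is the min/max of the projected corners.
    return uv.min(axis=0), uv.max(axis=0)

# Hypothetical intrinsics (KITTI-like values) and a hypothetical detected box.
K = np.array([[721.5, 0.0, 609.6],
              [0.0, 721.5, 172.9],
              [0.0, 0.0, 1.0]])
tl, br = project_to_2d(box3d_corners((2.0, 1.0, 15.0), (1.8, 1.5, 4.0)), K)
```

The resulting top-left/bottom-right pixel coordinates define the crop that would be resized to 64x64 for the classifier branch.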
We prepare training labels for the 3DOD network such that each part of the network is responsible only for detecting objects within a certain physical space relative to the ego-vehicle camera. We cubify the 3D region-of-interest (ROI) of the ego-camera into a 3-dimensional grid. This grid is of size 4xM: the camera image plane is divided into 4 regions along the (x,y) dimensions of the camera coordinate frame, and the z axis is further quantized into M cuboids for each of these 4 regions. Each cuboid in this 4xM grid is responsible for predicting up to N objects, in increasing order of z (depth) from the center of the ego-camera.
Cubification of the camera space: The perception region of interest is divided into a 4x4xM grid (4x4 along the x and y directions aligned with the camera image plane, with M cuboids stacked on each cell in the z-direction). Each cuboid is responsible for predicting up to N object poses. Object coordinates and dimensions are then normalized between 0 and 1 according to a prior computed from the data statistics.
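The label construction described above can be sketched as follows. This is an illustrative assumption-laden sketch, not the paper's code: the ROI extents, grid sizes, and the fixed divisor standing in for the data-driven normalization prior are all hypothetical placeholders.

```python
import numpy as np

# Hypothetical ROI extents and grid sizes (the paper's values are data-dependent).
X_RANGE, Y_RANGE, Z_RANGE = (-20.0, 20.0), (-2.0, 6.0), (0.0, 80.0)
GX, GY, M, N = 2, 2, 4, 3   # GX*GY image-plane regions, M depth cuboids, up to N objects each

def cuboid_index(x, y, z):
    """Map a camera-frame centroid to its (ix, iy, iz) cuboid in the grid."""
    def bucket(v, lo, hi, n):
        return min(int((v - lo) / (hi - lo) * n), n - 1)
    return (bucket(x, *X_RANGE, GX), bucket(y, *Y_RANGE, GY), bucket(z, *Z_RANGE, M))

def build_labels(objects):
    """objects: list of dicts with 'center' (x, y, z) and 'whl' (w, h, l).

    Each cuboid's N slots are filled in increasing order of z, matching the
    per-cuboid responsibility described above. Coordinates are normalized to
    [0, 1] against the ROI extents, and dimensions against a fixed divisor
    (a stand-in for the prior computed from data statistics).
    """
    # Label tensor: per cuboid, N slots of (x, y, z, w, h, l) + 1 occupancy flag.
    labels = np.zeros((GX, GY, M, N, 7), dtype=np.float32)
    slots = np.zeros((GX, GY, M), dtype=int)
    for obj in sorted(objects, key=lambda o: o['center'][2]):  # nearest first
        ix, iy, iz = cuboid_index(*obj['center'])
        s = slots[ix, iy, iz]
        if s >= N:
            continue  # cuboid already holds its N objects
        x, y, z = obj['center']
        norm = [(x - X_RANGE[0]) / (X_RANGE[1] - X_RANGE[0]),
                (y - Y_RANGE[0]) / (Y_RANGE[1] - Y_RANGE[0]),
                (z - Z_RANGE[0]) / (Z_RANGE[1] - Z_RANGE[0])]
        labels[ix, iy, iz, s] = norm + [d / 10.0 for d in obj['whl']] + [1.0]
        slots[ix, iy, iz] += 1
    return labels
```

Sorting by z before filling slots ensures that slot order within each cuboid reflects increasing depth, so each output slot of the 3DOD network has a consistent meaning across training samples.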
Qualitative results on the KITTI dataset. The top part of each image shows the 2D bounding boxes obtained by projecting the detected 3D poses onto the image plane (red: car, yellow: truck, green: van, blue: pedestrian, cyan: tram, cyclist, and others). The bottom part shows a bird's-eye view of the object poses, with the ego-vehicle positioned at the center of the red circle on the left, pointing towards the right of the image.