Machine learning

Perception is one of the most significant systems in autonomous vehicles/robots. It allows the vehicle to perceive the 360-degree environment around it. The perception system helps the vehicle understand its surroundings, such as the drivable area, lanes, vehicles, pedestrians, road boundaries, traffic signs, and traffic lights. This system uses multiple sensors such as LiDAR, RADAR, and camera to perceive the world and detect objects. This page explores various subsystems of perception and shows some methods of ground-plane segmentation, object detection, lane detection, and object classification.

Since the advent of machine learning, it has come to be regarded as the go-to tool for virtually every perception task. Several machine learning solutions have been proposed over the past decade for problems such as object detection, lane detection, 3D scene representation, and even trajectory prediction.

[Monocular 3D object detection]

CubifAE-3D: Monocular Camera Space Cubification on Autonomous Vehicles for Auto-Encoder based 3D Object Detection

Authors: Shubham Shrivastava and Punarjay Chakravarty

We introduce a method for 3D object detection using a single monocular image. Starting from a synthetic dataset, we pre-train an RGB-to-Depth Auto-Encoder (AE). The embedding learnt from this AE is then used to train a 3D Object Detector (3DOD) CNN which is used to regress the parameters of 3D object poses after the encoder from the AE generates a latent embedding from the RGB image. We show that we can pre-train the AE using paired RGB and depth images from simulation data once and subsequently only train the 3DOD network using real data, comprising RGB images and 3D object pose labels (without the requirement of dense depth). Our 3DOD network utilizes a particular 'cubification' of 3D space around the camera, where each cuboid is tasked with predicting N object poses, along with their class and confidence values. The AE pre-training and this method of dividing the 3D space around the camera into cuboids give our method its name - CubifAE-3D. We demonstrate results for monocular 3D object detection in the Autonomous Vehicle (AV) use-case with the Virtual KITTI 2 and the KITTI datasets.
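As a rough illustration of this two-stage pipeline, the hedged PyTorch sketch below pre-trains a toy RGB-to-depth auto-encoder and attaches a small head that predicts N object poses (with class and confidence) for each cuboid of a cubified grid. All layer sizes, the grid shape, and the pose parameterization are placeholder assumptions for illustration, not the architecture from the paper.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions -- the actual CubifAE-3D settings are described in the paper.
LATENT_CH = 128           # channels of the AE latent embedding
GRID = (4, 4, 4)          # cuboids cubifying the 3D space in front of the camera
N_OBJ = 2                 # object poses predicted per cuboid
POSE_DIM = 7              # e.g. (x, y, z, l, w, h, yaw)
N_CLASSES = 3

class RGBToDepthAE(nn.Module):
    """Toy RGB-to-depth auto-encoder; only the encoder's latent embedding is reused later."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, LATENT_CH, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(LATENT_CH, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),  # predicted depth map
        )

    def forward(self, rgb):
        z = self.encoder(rgb)
        return self.decoder(z), z

class CuboidHead(nn.Module):
    """Regresses N_OBJ (pose, class, confidence) tuples for every cuboid in the grid."""
    def __init__(self):
        super().__init__()
        out_per_cuboid = N_OBJ * (POSE_DIM + N_CLASSES + 1)
        self.net = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(LATENT_CH, 256), nn.ReLU(),
            nn.Linear(256, GRID[0] * GRID[1] * GRID[2] * out_per_cuboid),
        )

    def forward(self, z):
        out = self.net(z)
        return out.view(-1, *GRID, N_OBJ, POSE_DIM + N_CLASSES + 1)

ae = RGBToDepthAE()                  # step 1: pre-train on synthetic paired RGB/depth
head = CuboidHead()                  # step 2: train on real RGB images + 3D pose labels
rgb = torch.randn(1, 3, 64, 64)
depth_pred, latent = ae(rgb)
poses = head(latent)                 # (1, 4, 4, 4, N_OBJ, pose + class + confidence)
```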

[Domain adaptation]

Deflating Dataset Bias Using Synthetic Data Augmentation

Authors: Nikita Jaipuria, Xianling Zhang, Rohan Bhasin, Mayar Arafa, Punarjay Chakravarty, Shubham Shrivastava, Sagar Manglani, and Vidya N. Murali

Deep Learning has seen an unprecedented increase in vision applications since the publication of large-scale object recognition datasets and the introduction of scalable compute hardware. State-of-the-art methods for most vision tasks for Autonomous Vehicles (AVs) rely on supervised learning and often fail to generalize to domain shifts and/or outliers. Dataset diversity is thus key to successful real-world deployment. No matter the size of the dataset, capturing the long tails of the distribution pertaining to task-specific environmental factors is impractical. The goal of this paper is to investigate the use of targeted synthetic data augmentation - combining the benefits of gaming engine simulations and sim2real style transfer techniques - for filling gaps in real datasets for vision tasks. Empirical studies on three different computer vision tasks of practical use to AVs - parking slot detection, lane detection, and monocular depth estimation - consistently show that having synthetic data in the training mix provides a significant boost in cross-dataset generalization performance as compared to training on real data only, for the same size of the training set.
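The comparison hinges on keeping the total training-set size fixed while varying the synthetic fraction. The snippet below is a minimal, hypothetical sketch of how such a mix could be assembled; the sampling scheme, fractions, and function name are illustrative and not taken from the paper.

```python
import random

def build_training_mix(real_samples, synthetic_samples, synth_fraction, total_size, seed=0):
    """Build a fixed-size training set in which `synth_fraction` of the samples are
    synthetic (e.g. gaming-engine renders passed through a sim2real style-transfer model)
    and the rest are real. Illustrative sketch, not the paper's exact splits."""
    rng = random.Random(seed)
    n_synth = int(round(synth_fraction * total_size))
    n_real = total_size - n_synth
    mix = rng.sample(real_samples, n_real) + rng.sample(synthetic_samples, n_synth)
    rng.shuffle(mix)
    return mix

# e.g. train identical models on 0%, 25%, and 50% synthetic mixes of the same total size
# and compare their cross-dataset generalization.
```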

[3D Vision learning]

VR3Dense: Voxel Representation Learning for 3D Object Detection and Monocular Dense Depth Reconstruction

Author: Shubham Shrivastava

3D object detection and dense depth estimation are among the most vital tasks in autonomous driving. Multiple sensor modalities can jointly contribute towards better robot perception, and to that end, we introduce a method for jointly training 3D object detection and monocular dense depth reconstruction neural networks. It takes as inputs a LiDAR point-cloud and a single RGB image during inference, and produces object pose predictions as well as a densely reconstructed depth map. The LiDAR point-cloud is converted into a set of voxels, and its features are extracted using 3D convolution layers, from which we regress object pose parameters. Corresponding RGB image features are extracted using another 2D convolutional neural network. We further use these combined features to predict a dense depth map. While our object detection is trained in a supervised manner, the depth prediction network is trained with both self-supervised and supervised loss functions. We also introduce a loss function, the edge-preserving smooth loss, and show that it results in better depth estimation compared to the edge-aware smooth loss function frequently used in depth prediction works.
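The hedged PyTorch sketch below illustrates the general shape of such a pipeline: a LiDAR point cloud is scattered into an occupancy voxel grid and processed with 3D convolutions, the RGB image is processed with 2D convolutions, pose parameters are regressed from the LiDAR features, and the concatenated features drive a coarse depth head. The grid resolution, extents, layer sizes, and heads are placeholder assumptions; the actual VR3Dense architecture is described in the paper.

```python
import torch
import torch.nn as nn

def voxelize(points, grid=(32, 32, 8), extent=((-40., 40.), (-40., 40.), (-2., 4.))):
    """Scatter an (N, 3) LiDAR point cloud into a binary occupancy grid.
    Resolution and extents are illustrative, not the paper's settings."""
    vox = torch.zeros(1, *grid)
    idx = []
    for d in range(3):
        lo, hi = extent[d]
        idx.append(((points[:, d] - lo) / (hi - lo) * grid[d]).long().clamp(0, grid[d] - 1))
    vox[0, idx[0], idx[1], idx[2]] = 1.0
    return vox  # (1, X, Y, Z)

class VoxelBranch(nn.Module):
    """3D-convolutional feature extractor over the voxelized point cloud."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
        )
    def forward(self, vox):
        return self.net(vox)          # (B, 32) LiDAR feature vector

class ImageBranch(nn.Module):
    """2D-convolutional feature extractor over the RGB image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
    def forward(self, img):
        return self.net(img)          # (B, 32) RGB feature vector

# Toy heads: pose regression from LiDAR features, dense depth from the fused features.
pose_head = nn.Linear(32, 7)                        # single (x, y, z, l, w, h, yaw), simplified
depth_head = nn.Linear(64, 64 * 64)                 # reshaped into a coarse depth map

points = torch.rand(1000, 3) * 20                   # fake LiDAR sweep
vox = voxelize(points).unsqueeze(0)                 # (B, 1, X, Y, Z)
img = torch.randn(1, 3, 64, 64)
f_lidar, f_rgb = VoxelBranch()(vox), ImageBranch()(img)
poses = pose_head(f_lidar)
depth = depth_head(torch.cat([f_lidar, f_rgb], dim=1)).view(1, 1, 64, 64)
```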

Monocular Depth Estimation

Monocular depth estimation is one of the most vital tasks in autonomous driving and robotics. It is a field of computer vision research that has recently been getting a lot of attention in 3D scene understanding. Obtaining a pixel-wise depth map provides a plethora of information about the scene and helps machines understand a dense representation of the environment. A major deterrent to obtaining dense depth, though, is the scarcity of available public datasets and the difficulty of generating ground-truth labels. Most available dense depth estimation datasets, such as NYU Depth Dataset V2, use RGB-D cameras for indoor settings, and rely on classical computer vision techniques such as stereo matching, LiDAR point super-resolution, or LiDAR points projected onto the image plane for outdoor environments.

This project demonstrates the development of a monocular depth estimation network adapted from "High Quality Monocular Depth Estimation via Transfer Learning".
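As a rough sketch of the transfer-learning idea, the snippet below wires a DenseNet-169 backbone (the encoder used in the referenced paper) to a small convolutional decoder that regresses per-pixel depth at half the input resolution. The decoder layers are placeholders, and the skip connections of the original architecture are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class DepthNet(nn.Module):
    """Minimal transfer-learning depth network: a classification backbone as the
    encoder plus a lightweight decoder that regresses per-pixel depth."""
    def __init__(self):
        super().__init__()
        backbone = models.densenet169()           # load ImageNet-pretrained weights in practice
        self.encoder = backbone.features          # outputs (B, 1664, H/32, W/32)
        self.decoder = nn.Sequential(
            nn.Conv2d(1664, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 3, padding=1),
        )

    def forward(self, rgb):
        depth = self.decoder(self.encoder(rgb))
        # upsample to half the input resolution, as is common in depth-estimation work
        return F.interpolate(depth, scale_factor=16, mode="bilinear", align_corners=False)

net = DepthNet()
depth = net(torch.randn(1, 3, 224, 224))          # -> (1, 1, 112, 112) predicted depth
```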

[Meta learning]

Meta-Regularization by Enforcing Mutual-Exclusiveness

Authors: Shubham Shrivastava, Edwin Pan, Pankaj Rajak (In no particular order)

Meta-learning models have two objectives. First, they need to be able to make predictions over a range of task distributions while utilizing only a small amount of training data. Second, they need to adapt to novel, unseen tasks at meta-test time, again using only a small amount of training data from that task. It is the second objective where meta-learning models fail for non-mutually-exclusive tasks due to task overfitting. Given that guaranteeing mutually exclusive tasks is often difficult, there is a significant need for regularization methods that can help reduce the impact of task-memorization in meta-learning. For example, in the case of N-way, K-shot classification problems, tasks become non-mutually exclusive when the labels associated with each task are fixed. Under this design, the model will simply memorize the class labels of all the training tasks, and thus will fail to recognize a new task (class) at meta-test time. A directly observable consequence of this memorization is that the meta-learning model simply ignores the task-specific training data in favor of directly classifying based on the test-data input. In our work, we propose a regularization technique for meta-learning models that gives the model designer more control over the information flow during meta-training. Our method consists of a regularization function constructed by maximizing the distance between task-summary statistics (in the case of black-box models) or task-specific network parameters (in the case of optimization-based models) during meta-training. Our proposed regularization function shows an accuracy boost of ∼36% on the Omniglot dataset for 5-way, 1-shot classification using the black-box method and for 20-way, 1-shot classification using the optimization-based method.
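A minimal sketch of the regularization idea is shown below: given per-task summary statistics (or flattened task-specific parameters) from a meta-batch, it penalizes small pairwise distances so that maximizing their separation becomes part of the meta-training objective. The distance measure, weighting, and function name are assumptions for illustration; the exact formulation is in the paper.

```python
import torch

def mutual_exclusiveness_penalty(task_summaries):
    """Encourage per-task summary statistics (or task-specific adapted parameters,
    flattened into vectors) within a meta-batch to stay far apart by penalizing
    small pairwise distances. Illustrative sketch only."""
    t = torch.stack([s.flatten() for s in task_summaries])     # (num_tasks, D)
    dists = torch.cdist(t, t, p=2)                              # pairwise L2 distances
    num_tasks = t.shape[0]
    off_diag = dists[~torch.eye(num_tasks, dtype=torch.bool)]   # drop self-distances
    # Maximizing distances == minimizing their negative mean; add this term to the meta-loss.
    return -off_diag.mean()

# meta_loss = task_loss + lambda_reg * mutual_exclusiveness_penalty(task_summaries)
```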

Model-Agnostic Meta-Learning (MAML) and Prototypical Networks

Model-Agnostic Meta-Learning (MAML)

During the meta-training phase, MAML operates in two loops: an inner loop and an outer loop. In the inner loop, MAML computes gradient updates using examples from each task and calculates the loss on test examples from the same task using the updated model parameters. In the outer loop, MAML aggregates the per-task post-update losses and performs a meta-gradient update on the original model parameters. At meta-test time, MAML computes new model parameters based on a few examples from an unseen class and uses the new model parameters to predict the label of a test example from the same unseen class.
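The following hedged sketch shows one MAML meta-training step in PyTorch, assuming each task arrives as a (support_x, support_y, query_x, query_y) tuple and adaptation uses a single inner gradient step. It uses torch.func.functional_call to evaluate the model with adapted parameters; learning rates and the one-step adaptation are illustrative simplifications.

```python
import torch

def maml_meta_step(model, tasks, loss_fn, meta_optimizer, inner_lr=0.01):
    """One MAML meta-training step: per-task inner-loop adaptation on the support set,
    then an outer-loop meta-gradient update from the aggregated query losses."""
    meta_loss = 0.0
    for support_x, support_y, query_x, query_y in tasks:
        # Inner loop: adapt the parameters on the task's support set.
        params = dict(model.named_parameters())
        support_loss = loss_fn(
            torch.func.functional_call(model, params, (support_x,)), support_y)
        # create_graph=True keeps second-order terms for the meta-gradient.
        grads = torch.autograd.grad(support_loss, list(params.values()), create_graph=True)
        adapted = {name: p - inner_lr * g for (name, p), g in zip(params.items(), grads)}
        # Post-update loss on the task's query set with the adapted parameters.
        meta_loss = meta_loss + loss_fn(
            torch.func.functional_call(model, adapted, (query_x,)), query_y)
    # Outer loop: meta-gradient update of the original model parameters.
    meta_optimizer.zero_grad()
    meta_loss.backward()
    meta_optimizer.step()
    return meta_loss.detach()
```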

Prototypical Networks

A Prototypical Network is a non-parametric meta-learning algorithm. Its basic idea resembles nearest-neighbor classification against class prototypes: it computes the prototype of each class using a set of support examples and then calculates the distance between the query example and each prototype. The query example is assigned the label of the prototype it is closest to.
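A minimal sketch of this procedure, assuming an arbitrary embedding network and an N-way episode, follows; negative squared Euclidean distances to the prototypes serve as logits, as in the original Prototypical Networks formulation.

```python
import torch
import torch.nn.functional as F

def prototypical_predict(embed, support_x, support_y, query_x, n_way):
    """Classify queries by the nearest class prototype. `embed` is any embedding
    network; prototypes are mean embeddings of each class's support examples."""
    z_support = embed(support_x)                                   # (n_support, D)
    z_query = embed(query_x)                                       # (n_query, D)
    prototypes = torch.stack(
        [z_support[support_y == c].mean(dim=0) for c in range(n_way)])   # (n_way, D)
    dists = torch.cdist(z_query, prototypes) ** 2                  # (n_query, n_way)
    log_p = F.log_softmax(-dists, dim=1)                           # softmax over negative distances
    return log_p.argmax(dim=1), log_p                              # predicted classes, log-probabilities
```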

[2D object detection]

DAC-DC: Divide and Conquer for Detection and Classification

This implementation is inspired by YOLO for performing 2D object detection and tracking. It is an anchor-based method of object detection that uses grid quantization to detect objects locally. The results shown here are for the Virtual KITTI dataset across multiple weather conditions and camera positions.
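To make the grid-quantization idea concrete, the hedged sketch below encodes ground-truth boxes into a YOLO-style target tensor in which each grid cell owns the boxes whose centers fall inside it. The grid size, image size, class count, and single-box-per-cell simplification are illustrative assumptions, not DAC-DC's actual configuration.

```python
import torch

def encode_targets(boxes, labels, grid=(12, 39), img_size=(375, 1242), n_classes=3):
    """Encode ground-truth boxes (x1, y1, x2, y2) into a grid-quantized target tensor:
    each box is assigned to the cell containing its center, which stores
    (objectness, cx, cy, w, h, one-hot class), all normalized to the image size."""
    gh, gw = grid
    ih, iw = img_size
    target = torch.zeros(gh, gw, 5 + n_classes)
    for (x1, y1, x2, y2), cls in zip(boxes, labels):
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        col = min(int(cx / iw * gw), gw - 1)          # grid cell containing the box center
        row = min(int(cy / ih * gh), gh - 1)
        target[row, col, 0] = 1.0                     # objectness / confidence
        target[row, col, 1:5] = torch.tensor(
            [cx / iw, cy / ih, (x2 - x1) / iw, (y2 - y1) / ih])
        target[row, col, 5 + cls] = 1.0               # one-hot class
    return target  # (grid_h, grid_w, 5 + n_classes)

# Example: one car-like box in a Virtual KITTI-sized image.
t = encode_targets(boxes=[(500, 180, 620, 260)], labels=[0])
```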