Machine learning

Perception is one of the most significant systems in autonomous vehicles and robots. It allows the vehicle to perceive the 360-degree environment around it. The perception system helps the vehicle understand its surroundings, including the drivable area, lanes, vehicles, pedestrians, road boundaries, traffic signs, and traffic lights. It uses multiple sensors such as LiDAR, radar, and cameras to perceive the world and detect objects. This page explores various subsystems of perception and shows some methods of ground-plane segmentation, object detection, lane detection, and object classification.

Since the advent of machine learning, it has become the dominant tool for perception tasks. Several machine learning solutions have been proposed in the past decade for problems such as object detection, lane detection, 3D scene representation, and even trajectory prediction.

[3D VISION LEARNING]

VR3Dense: Voxel Representation Learning for 3D Object Detection and Monocular Dense Depth Reconstruction

 Author: Shubham Shrivastava

3D object detection and dense depth estimation are among the most vital tasks in autonomous driving. Multiple sensor modalities can jointly contribute to better robot perception, and to that end, we introduce a method for jointly training 3D object detection and monocular dense depth reconstruction neural networks. At inference, the network takes a LiDAR point cloud and a single RGB image as inputs and produces object pose predictions as well as a densely reconstructed depth map. The LiDAR point cloud is converted into a set of voxels, and its features are extracted using 3D convolution layers, from which we regress object pose parameters. Corresponding RGB image features are extracted using another 2D convolutional neural network. We further use these combined features to predict a dense depth map. While our object detection is trained in a supervised manner, the depth prediction network is trained with both self-supervised and supervised loss functions. We also introduce a loss function, the edge-preserving smooth loss, and show that it results in better depth estimation than the edge-aware smooth loss function frequently used in depth prediction works.
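
For reference, the sketch below (PyTorch) shows the standard edge-aware smoothness loss that the abstract cites as the commonly used baseline; the paper's edge-preserving smooth loss is a modification of this idea and is not reproduced here.

```python
import torch


def edge_aware_smoothness(depth: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
    """depth: (B, 1, H, W) predicted depth, image: (B, 3, H, W) RGB input."""
    # First-order gradients of the depth map.
    d_dx = torch.abs(depth[:, :, :, :-1] - depth[:, :, :, 1:])
    d_dy = torch.abs(depth[:, :, :-1, :] - depth[:, :, 1:, :])

    # Image gradients (averaged over channels) down-weight smoothing at image edges.
    i_dx = torch.mean(torch.abs(image[:, :, :, :-1] - image[:, :, :, 1:]), dim=1, keepdim=True)
    i_dy = torch.mean(torch.abs(image[:, :, :-1, :] - image[:, :, 1:, :]), dim=1, keepdim=True)

    d_dx = d_dx * torch.exp(-i_dx)
    d_dy = d_dy * torch.exp(-i_dy)
    return d_dx.mean() + d_dy.mean()
```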

CubifAE-3D: Monocular Camera Space Cubification on Autonomous Vehicles for Auto-Encoder based 3D Object Detection

 Authors: Shubham Shrivastava and Punarjay Chakravarty

We introduce a method for 3D object detection using a single monocular image. Starting from a synthetic dataset, we pre-train an RGB-to-Depth Auto-Encoder (AE). The embedding learnt by this AE is then used to train a 3D Object Detector (3DOD) CNN, which regresses the parameters of 3D object poses from the latent embedding the AE's encoder generates from the RGB image. We show that we can pre-train the AE using paired RGB and depth images from simulation data once and subsequently train only the 3DOD network using real data, comprising RGB images and 3D object pose labels (without requiring dense depth). Our 3DOD network utilizes a particular 'cubification' of 3D space around the camera, where each cuboid is tasked with predicting N object poses, along with their class and confidence values. The AE pre-training and this method of dividing the 3D space around the camera into cuboids give our method its name: CubifAE-3D. We demonstrate results for monocular 3D object detection in the Autonomous Vehicle (AV) use-case with the Virtual KITTI 2 and KITTI datasets.
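
The snippet below is a hedged sketch of the 'cubification' idea: 3D space around the camera is split into a fixed grid of cuboids, and each cuboid regresses N candidate object poses plus class and confidence scores. The grid size, N, and the pose parameterization here are illustrative assumptions, not the paper's exact values.

```python
import torch

GRID = (4, 4, 4)        # cuboids along (x, y, z) in front of the camera (assumed)
N_PER_CUBOID = 2        # candidate objects per cuboid (assumed)
POSE_DIMS = 7           # e.g. (x, y, z, w, h, l, yaw)
NUM_CLASSES = 3

num_cuboids = GRID[0] * GRID[1] * GRID[2]
out_dims = N_PER_CUBOID * (POSE_DIMS + 1 + NUM_CLASSES)   # pose + confidence + class scores

# The 3DOD head maps the AE's latent embedding to per-cuboid predictions.
latent = torch.randn(1, 512)                               # embedding from the RGB encoder
head = torch.nn.Linear(512, num_cuboids * out_dims)
preds = head(latent).view(1, num_cuboids, N_PER_CUBOID, POSE_DIMS + 1 + NUM_CLASSES)
print(preds.shape)  # torch.Size([1, 64, 2, 11])
```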

Monocular Depth Estimation

Monocular depth estimation is one of the most vital tasks in autonomous driving and robotics. It is a field of computer vision research that has recently been getting a lot of attention in 3D scene understanding. A pixel-wise depth map provides a wealth of information about the scene and helps machines build a dense representation of the environment. A major deterrent to obtaining dense depth, though, is the scarcity of public datasets and the difficulty of generating ground-truth labels. Most available dense depth estimation datasets, like NYU Depth Dataset V2, use RGB-D cameras in indoor settings, and rely on classical computer vision techniques such as stereo matching, LiDAR point super-resolution, or LiDAR points projected onto the image plane for outdoor environments.

This project demonstrates the development of a monocular depth estimation network adapted from "High Quality Monocular Depth Estimation via Transfer Learning".
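
The sketch below outlines the transfer-learning recipe from the referenced paper: a pretrained DenseNet-169 encoder feeding a lightweight upsampling decoder that regresses per-pixel depth. The decoder widths are illustrative, and the skip connections used in the full model are omitted for brevity.

```python
import torch
import torch.nn as nn
from torchvision import models


class MonoDepthNet(nn.Module):
    def __init__(self):
        super().__init__()
        # DenseNet-169 backbone pretrained on ImageNet (downloads weights on first use).
        self.encoder = models.densenet169(weights="IMAGENET1K_V1").features  # 1664-ch output
        self.decoder = nn.Sequential(
            nn.Conv2d(1664, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(256, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 3, padding=1),
        )

    def forward(self, rgb):                      # rgb: (B, 3, H, W)
        return self.decoder(self.encoder(rgb))   # depth at 1/8 input resolution here


depth = MonoDepthNet()(torch.randn(1, 3, 192, 640))
print(depth.shape)  # torch.Size([1, 1, 24, 80])
```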

[TRAJECTORY PREDICTION]

Propagating State Uncertainty Through Trajectory Forecasting

 Authors: Boris Ivanovic, Yifeng (Richard) Lin, Shubham Shrivastava, Punarjay Chakravarty, Marco Pavone

Uncertainty pervades the modern robotic autonomy stack, with nearly every component (e.g., sensors, detection, classification, tracking, behavior prediction) producing continuous or discrete probability distributions. Trajectory forecasting, in particular, is surrounded by uncertainty as its inputs are produced by (noisy) upstream perception and its outputs are predictions that are often probabilistic for use in downstream planning. However, most trajectory forecasting methods do not account for upstream uncertainty, instead taking only the most-likely values. As a result, perceptual uncertainties are not propagated through forecasting and predictions are frequently overconfident. To address this, we present a novel method for incorporating perceptual state uncertainty in trajectory forecasting, a key component of which is a new statistical distance-based loss function which encourages predicting uncertainties that better match upstream perception. We evaluate our approach both in illustrative simulations and on large-scale, real-world data, demonstrating its efficacy in propagating perceptual state uncertainty through prediction and producing more calibrated predictions.
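
Below is a hedged illustration of a statistical-distance loss between an upstream perception state distribution and a forecasted distribution. The paper defines its own distance, which is not reproduced here; a symmetric KL divergence between diagonal Gaussians is used purely as a stand-in for the general idea.

```python
import torch
from torch.distributions import Normal, kl_divergence


def distance_loss(pred_mean, pred_std, percep_mean, percep_std):
    """All tensors have shape (B, T, D): batch, timesteps, state dims; stds must be positive."""
    pred = Normal(pred_mean, pred_std)        # forecasted state distribution
    percep = Normal(percep_mean, percep_std)  # uncertainty reported by upstream perception
    # Symmetric KL encourages the forecast's uncertainty to match perception's.
    return 0.5 * (kl_divergence(pred, percep) + kl_divergence(percep, pred)).mean()
```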

[GENERATIVE AI]

Deflating Dataset Bias Using Synthetic Data Augmentation

 Authors: Nikita Jaipuria, Xianling Zhang, Rohan Bhasin, Mayar Arafa, Punarjay Chakravarty, Shubham Shrivastava, Sagar Manglani, and Vidya N. Murali

Deep Learning has seen an unprecedented increase in vision applications since the publication of large-scale object recognition datasets and the introduction of scalable compute hardware. State-of-the-art methods for most vision tasks for Autonomous Vehicles (AVs) rely on supervised learning and often fail to generalize to domain shifts and/or outliers. Dataset diversity is thus key to successful real-world deployment. No matter how big the dataset, capturing long tails of the distribution pertaining to task-specific environmental factors is impractical. The goal of this paper is to investigate the use of targeted synthetic data augmentation - combining the benefits of gaming engine simulations and sim2real style transfer techniques - for filling gaps in real datasets for vision tasks. Empirical studies on three different computer vision tasks of practical use to AVs - parking slot detection, lane detection, and monocular depth estimation - consistently show that having synthetic data in the training mix provides a significant boost in cross-dataset generalization performance as compared to training on real data only, for the same size of the training set.
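
A minimal sketch of the "training mix" idea follows: real and synthetic (sim2real-translated) samples are combined into one training set of fixed size. The dummy datasets and mixing ratio are placeholders, not the paper's setup.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Dummy stand-ins for a real dataset and a sim2real-translated synthetic dataset.
real_ds = TensorDataset(torch.randn(800, 3, 64, 64), torch.randint(0, 2, (800,)))
synthetic_ds = TensorDataset(torch.randn(200, 3, 64, 64), torch.randint(0, 2, (200,)))

# Keep the total training-set size fixed while swapping part of it for synthetic data.
mix = ConcatDataset([real_ds, synthetic_ds])
loader = DataLoader(mix, batch_size=32, shuffle=True)
```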

Pose-Conditioned Image Generation

 Authors: Shubham Shrivastava and Amir Ziai

Autonomous driving applications require detection and localization of objects within a few centimeters. Deep learning models that can achieve this level of accuracy require millions of training samples. Manually annotating data at this scale is highly cumbersome and expensive. Furthermore, annotating object poses in 3D is not possible from a single image without additional sensors or multi-frame non-linear optimization. In this work we present PoseGen, a methodology for generating a realistic image of an object given a desired pose, appearance, and background. We do this in an unsupervised way by conditioning the generation process on various attributes. We also release a dataset of paired objects and silhouette masks (the TeslaPose dataset), which we hope will help the research community further tackle this problem.
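
The sketch below is a hedged illustration of pose-conditioned generation: a generator conditioned on a desired object pose, an appearance code, and a background embedding. Layer sizes, the conditioning scheme (simple concatenation), and output resolution are illustrative assumptions, not the PoseGen architecture.

```python
import torch
import torch.nn as nn


class ConditionalGenerator(nn.Module):
    def __init__(self, noise_dim=64, pose_dim=6, appear_dim=32, bg_dim=32):
        super().__init__()
        in_dim = noise_dim + pose_dim + appear_dim + bg_dim
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128 * 8 * 8), nn.ReLU(inplace=True),
            nn.Unflatten(1, (128, 8, 8)),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, noise, pose, appearance, background):
        # Condition by concatenating all attribute vectors with the noise vector.
        return self.net(torch.cat([noise, pose, appearance, background], dim=1))


img = ConditionalGenerator()(torch.randn(1, 64), torch.randn(1, 6),
                             torch.randn(1, 32), torch.randn(1, 32))
print(img.shape)  # torch.Size([1, 3, 32, 32])
```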


[NATURAL LANGUAGE PROCESSING]

QAGAN: Adversarial Approach To Learning Domain Invariant Language Features

 Authors: Shubham Shrivastava and Kaiyue Wang

Training models that are robust to data domain shift has gained increasing interest both in academia and industry. Question answering, one of the canonical problems in Natural Language Processing (NLP) research, has seen much success with the advent of large transformer models. However, existing approaches mostly work under the assumption that data is drawn from the same distribution during training and testing, which is unrealistic and does not scale in the wild.

In this paper, we explore an adversarial training approach to learning domain-invariant features so that language models can generalize well to out-of-domain datasets. We also inspect various other ways to boost model performance, including data augmentation by paraphrasing sentences, conditioning the end-of-answer-span prediction on the start word, and a carefully designed annealing function. Our initial results show that, in combination with these methods, we achieve a 15.2% improvement in EM score and a 5.6% boost in F1 score on out-of-domain validation datasets over the baseline. We also dissect our model outputs and visualize the model hidden states by projecting them onto a lower-dimensional space, and find that our adversarial training approach indeed encourages the model to learn domain-invariant embeddings and brings them closer together in the embedding space.
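
Below is a hedged sketch of adversarial domain-invariant feature learning using a gradient reversal layer and a domain discriminator. QAGAN's exact adversarial objective may differ, but the general mechanism is the same: the encoder is trained so that the discriminator cannot tell which domain a hidden representation came from.

```python
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None   # flip the gradient flowing back into the encoder


hidden = torch.randn(8, 768, requires_grad=True)   # e.g. pooled hidden states from a QA encoder
domain_labels = torch.randint(0, 2, (8,))          # 0 = in-domain, 1 = out-of-domain

discriminator = nn.Sequential(nn.Linear(768, 128), nn.ReLU(), nn.Linear(128, 2))
logits = discriminator(GradReverse.apply(hidden, 1.0))
adv_loss = nn.functional.cross_entropy(logits, domain_labels)
adv_loss.backward()   # discriminator learns to classify domains; the encoder learns to fool it
```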

[VISION-BASED LOCALIZATION]

Improving Worst Case Visual Localization Coverage via Place-specific Sub-selection in Multi-camera Systems

 Authors: Stephen Hausler, Ming Xu, Sourav Garg, Punarjay Chakravarty, Shubham Shrivastava, Ankit Vora, Michael Milford

6-DoF visual localization systems utilize principled approaches rooted in 3D geometry to perform accurate camera pose estimation of images against a map. Current techniques use hierarchical pipelines and learned 2D feature extractors to improve scalability and increase performance. However, despite gains in typical recall@0.25m metrics, these systems still have limited utility for real-world applications like autonomous vehicles because of their 'worst' areas of performance: the locations where they provide insufficient recall at a required error tolerance. Here we investigate the utility of 'place-specific configurations', where a map is segmented into a number of places, each with its own configuration for modulating the pose estimation step, in this case selecting a camera within a multi-camera system. On the Ford AV benchmark dataset, we demonstrate substantially improved worst-case localization performance compared to off-the-shelf pipelines, minimizing the percentage of the dataset that has low recall at a certain error tolerance, as well as improved overall localization performance. Our proposed approach is particularly applicable to the crowdsharing model of autonomous vehicle deployment, where a fleet of AVs regularly traverses a known route.
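
A minimal sketch of the place-specific configuration idea follows: the map is segmented into places, and each place stores which camera of the multi-camera rig should feed the pose estimation step. Place IDs, camera names, and the fallback choice are illustrative placeholders.

```python
# Learned offline: for each map segment, the camera that yields the best recall there.
place_to_camera = {
    "place_00": "front_left",
    "place_01": "rear",
    "place_02": "front_right",
}


def select_camera(coarse_place: str) -> str:
    """coarse_place comes from a place-recognition / retrieval stage."""
    return place_to_camera.get(coarse_place, "front_left")   # default camera as fallback


# The selected camera's image is then passed to the usual 6-DoF pose solver.
print(select_camera("place_01"))   # -> "rear"
```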

[META LEARNING]

Meta-Regularization by Enforcing Mutual-Exclusiveness

 Authors: Shubham Shrivastava, Edwin Pan, Pankaj Rajak (In no particular order)

Meta-learning models have two objectives. First, they need to be able to make predictions over a range of task distributions while utilizing only a small amount of training data. Second, they need to adapt to novel, unseen tasks at meta-test time, again using only a small amount of training data from that task. It is the second objective where meta-learning models fail for non-mutually-exclusive tasks, due to task overfitting. Given that guaranteeing mutually exclusive tasks is often difficult, there is a significant need for regularization methods that can reduce the impact of task memorization in meta-learning. For example, in the case of N-way, K-shot classification problems, tasks become non-mutually exclusive when the labels associated with each task are fixed. Under this design, the model simply memorizes the class labels of all the training tasks and thus fails to recognize a new task (class) at meta-test time. A directly observable consequence of this memorization is that the meta-learning model ignores the task-specific training data in favor of classifying directly from the test-data input. In our work, we propose a regularization technique for meta-learning models that gives the model designer more control over the information flow during meta-training. Our method consists of a regularization function constructed by maximizing the distance between task-summary statistics in the case of black-box models, and between task-specific network parameters in the case of optimization-based models, during meta-training. Our proposed regularization function shows an accuracy boost of ~36% on the Omniglot dataset for 5-way, 1-shot classification using the black-box method and for the 20-way, 1-shot classification problem using the optimization-based method.
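
The sketch below illustrates the regularization idea: push task-summary statistics (or task-specific parameters) away from each other during meta-training by penalizing small pairwise distances. The exact distance and weighting used in the paper are not reproduced; negative mean pairwise Euclidean distance is shown as one plausible choice.

```python
import torch


def mutual_exclusiveness_reg(task_summaries: torch.Tensor) -> torch.Tensor:
    """task_summaries: (num_tasks, D), one summary vector per task in the meta-batch."""
    dists = torch.cdist(task_summaries, task_summaries)      # (T, T) pairwise distances
    t = task_summaries.shape[0]
    off_diag = dists[~torch.eye(t, dtype=torch.bool)]        # drop self-distances
    return -off_diag.mean()   # minimizing this maximizes separation between tasks


meta_loss = torch.tensor(1.0)   # placeholder for the usual meta-learning objective
total_loss = meta_loss + 0.1 * mutual_exclusiveness_reg(torch.randn(4, 16))
```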

Model-Agnostic Meta-Learning (MAML) and Prototypical Networks

Model-Agnostic Meta-Learning (MAML)

During the meta-training phase, MAML operates in two loops: an inner loop and an outer loop. In the inner loop, MAML computes gradient updates using examples from each task and calculates the loss on test examples from the same task using the updated model parameters. In the outer loop, MAML aggregates the per-task post-update losses and performs a meta-gradient update on the original model parameters. At meta-test time, MAML computes new model parameters based on a few examples from an unseen class and uses the new model parameters to predict the label of a test example from the same unseen class.
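
A compact sketch of the two loops follows, using torch.func.functional_call so the inner-loop update stays differentiable for the outer (meta) update. The task sampler and the model are placeholders; the loop structure is the point.

```python
import torch
import torch.nn as nn
from torch.func import functional_call

model = nn.Linear(4, 2)
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
inner_lr = 0.4

def task_batch():  # placeholder sampler: (support_x, support_y, query_x, query_y)
    return torch.randn(5, 4), torch.randint(0, 2, (5,)), torch.randn(5, 4), torch.randint(0, 2, (5,))

meta_opt.zero_grad()
outer_loss = 0.0
for _ in range(4):  # tasks in the meta-batch
    sx, sy, qx, qy = task_batch()
    params = {n: p for n, p in model.named_parameters()}

    # Inner loop: one gradient step on the task's support set.
    support_loss = nn.functional.cross_entropy(functional_call(model, params, (sx,)), sy)
    grads = torch.autograd.grad(support_loss, list(params.values()), create_graph=True)
    adapted = {n: p - inner_lr * g for (n, p), g in zip(params.items(), grads)}

    # Outer loop: evaluate the adapted parameters on the task's query set.
    outer_loss = outer_loss + nn.functional.cross_entropy(functional_call(model, adapted, (qx,)), qy)

(outer_loss / 4).backward()   # meta-gradient w.r.t. the original parameters
meta_opt.step()
```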

Prototypical Networks

A Prototypical Network is a non-parametric meta-learning algorithm. The basic idea resembles nearest-neighbor classification against class prototypes. It computes the prototype of each class using a set of support examples and then calculates the distance between the query example and each prototype. The query example is classified according to the label of the prototype it is closest to.
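
A minimal sketch of one Prototypical Network episode: class prototypes are the mean embeddings of the support examples, and queries are classified by (negative squared) Euclidean distance to each prototype. The embedding network and episode sizes are placeholders.

```python
import torch
import torch.nn as nn

embed = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 32))

n_way, k_shot, n_query = 5, 1, 3
support = torch.randn(n_way, k_shot, 28 * 28)           # (classes, shots, features)
query = torch.randn(n_way * n_query, 28 * 28)
query_labels = torch.arange(n_way).repeat_interleave(n_query)

prototypes = embed(support).mean(dim=1)                 # (n_way, 32): one prototype per class
logits = -torch.cdist(embed(query), prototypes) ** 2    # closer prototype -> larger logit
loss = nn.functional.cross_entropy(logits, query_labels)
pred = logits.argmax(dim=1)                             # nearest-prototype classification
```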

[2D OBJECT DETECTION]

DAC-DC: Divide and Conquer for Detection and Classification

This implementation is inspired by YOLO and performs 2D object detection and tracking. It is an anchor-based method of object detection that uses grid quantization to detect objects locally. The results shown here are for the Virtual KITTI dataset across multiple weather conditions and camera positions.
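
Below is a hedged sketch of the grid-quantization idea: the image is divided into an S x S grid, and each cell predicts B boxes (center offset, size, objectness) plus class scores for objects whose centers fall inside it. S, B, the number of classes, and the box parameterization are illustrative, not the repository's exact configuration.

```python
import torch

S, B, C = 13, 2, 8                        # grid size, boxes per cell, classes (assumed)
raw = torch.randn(1, S, S, B * 5 + C)     # output of the detection head for one image

boxes = raw[..., : B * 5].view(1, S, S, B, 5)
xy = torch.sigmoid(boxes[..., :2])        # box center offset within its grid cell
wh = boxes[..., 2:4].exp()                # width/height relative to an anchor (assumed)
objectness = torch.sigmoid(boxes[..., 4])
class_probs = torch.softmax(raw[..., B * 5:], dim=-1)   # one class distribution per cell
```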