Portfolio

Shubham Shrivastava

Head of Machine Learning @ Kodiak Robotics

IEEE Senior Member

Shubham Shrivastava leads all AI and ML development at Kodiak Robotics, driving advancements in generative AI, vision-language models (VLMs), large foundation models, and spatio-temporal multimodal models. He architected GigaFusionNet, a core technology that fuses data from cameras, lidars, and radars across time and space to deliver a real-time, 3D understanding of the environment for safe and scalable autonomy. He also developed scalable auto-labeling pipelines and an automated end-to-end AI flywheel that continuously enhances model performance through self-sustaining data cycles.

Under his leadership, Kodiak became the first in the industry to deploy multiple driverless trucks operating 24/7, autonomously delivering commercial payloads and setting a new benchmark for large-scale autonomous logistics.

Shubham has published papers at top-tier conferences such as CVPR, ICCV, ECCV, ICRA, and IROS, and holds over 20 patents in the field.

Previously, he led the perception team at Ford Autonomy, building state-of-the-art models to enable 360-degree vehicle perception. He holds advanced degrees in Computer Science with a specialization in AI from Stanford University.

Kodiak Robotics, Mountain View, CA

Head of Machine Learning

September 2023-Present

Led all AI and ML initiatives at Kodiak Robotics, driving both strategic and hands-on development of core technologies for autonomous trucking.

Built and deployed GigaFusionNet, a scalable spatio-temporal multimodal model that processes sensor data (cameras, lidars, and radars) across time and space, providing real-time 3D perception to support complex decision-making.
Developed modular AI architectures, including foundation models, vision-language models (VLMs), and generative AI solutions, ensuring adaptability and scalability in diverse real-world driving scenarios.
Operationalized scalable auto-labeling systems and an automated end-to-end AI flywheel, enabling continuous and autonomous model improvement through data-driven learning loops.
Led the deployment of the industry's first 24/7 driverless trucks, achieving large-scale autonomous operations with a focus on commercial logistics.
Designed and implemented Kodiak’s Modular Cognitive Architecture (MCA) to enable redundancy and fault tolerance, ensuring no single point of failure and supporting high safety and performance standards.
Focused on verifiable AI by developing robust, scalable perception and decision-making pipelines that can be continuously validated and improved.
Delivered industry-leading innovations that resulted in safer, more reliable autonomous systems, positioning Kodiak as a pioneer in scalable autonomous logistics.

Ford Greenfield Labs, Palo Alto, CA

September 2019 - September 2023

Technical Expert and Lead - 3D Perception

September 2022 - September 2023

Led a team of talented machine learning and robotics engineers towards building vision-centric 3D perception solutions at Ford Autonomy. My work included building an end-to-end 3D Perception stack for Ford L2+ vehicles on the road, and the development of a flexible and scalable machine-learning framework for all ML tasks within Ford.

Sr. Research Scientist - Machine Learning and Computer Vision

September 2019 - September 2022

My research includes computer vision and advanced machine learning methods, including convolutional neural networks, generative adversarial networks, variational autoencoders, and 3D perception, with significant emphasis on object detection, semantics learning, 3D scene understanding, multi-view geometry, and visual odometry.

In addition to researching and developing novel methodology, I built the complete cloud-based MLOps pipeline including intelligent data sourcing, auto-annotation, training, model optimization, and deployment (TensorRT C++) for putting our prediction engine in production.
Two major projects for which I built end-to-end perception pipelines are (1) Ford Autonomous Shuttles and (2) Ford Factory of the Future - Infrastructure-based autonomous vehicle marshaling through assembly plants.

[Topics of Research]

Monocular RGB camera and LiDAR-based 3D object detection, Classification, and Tracking in both indoor and outdoor environments.
Unsupervised and Semi-Supervised Object 6-DoF Pose Estimation to reduce the cost of manual data annotation from millions of dollars down to zero.
Multi-headed multi-task neural networks for scene understanding incorporating Sim2Real methods for zero-cost training of the networks.
Generative Adversarial Networks for realistic image generation with semantic and cycle consistency from simulation data to fill the gap between both worlds.
A combination of computer vision and traditional methods including non-linear optimization for object pose estimation.
Automated global localization of multiple spatially distributed sensors within the infrastructure to a common coordinate frame using an autonomous robot.
Perception system for localizing objects of various classes to within 10 centimeters and sub-degree orientation accuracy within Ford’s Factory of the Future.

Renesas Electronics America, Inc.

Applications Engineer, Perception R&D - ADAS and Autonomous Driving

March 2017 - September 2019

I worked as part of a very small team to build the ADAS & Autonomous Driving perception reference platform “Perception Quick Start”, which includes end-to-end solutions for camera- and LiDAR-based road feature and object detection.

Developed a complete lane detection pipeline from scratch. The pipeline includes lane pixel extraction using a combination of classical computer vision methods and deep learning, lane detection, polyfit, noise suppression, lane tracking, lane smoothing, confidence computation, lane extrapolation, lane departure warning, lane offset, lane curvature, and lane types.
Developed a C-based computer vision library for basic image processing functions such as image read/write, Hough transforms, edge detection (Canny, horizontal, vertical), colorspace conversions, and image filtering (sharpen, Gaussian smooth, Sobel, emboss, edge). Created an advanced math library for functions such as least-squares polyfit and matrix operations.
Stereo Camera calibration, rectification, disparity map generation, flat road free-space estimation, Object Detection using V-Disparity and 3D density-based clustering, 3D point cloud rendering along with 3D bounding-box, and depth perception.
Developed dynamic image ROI stabilization module for correcting rotation and translation using angular pose/velocity data by computing and applying homography at run-time. Developed a general-purpose positioning driver for bringing in GNSS/IMU data into the perception stack.
Optimized embedded implementation of algorithms for parallel computing on R-Car SoC HW Accelerators.
Developed the complete V2V solution from scratch for Renesas’ V2X Platform, including CAN Framework, GPS/INS driver, GPS+IMU fusion for localization, Concise Path History computation, Path Prediction, CSV and KML logging module, 360-degree lane-level Target Classification, Basic Safety Applications, and an HMI using Qt for displaying Warnings, Vehicle Tracking, Maps, and Debug Information.
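
The polyfit and lane-curvature steps mentioned in the lane detection pipeline above can be illustrated with a small sketch. This is not the production C implementation; it is a minimal Python illustration that assumes lane pixels are given as (y, x) pairs and that a second-order polynomial x = a*y^2 + b*y + c is fitted. The function names `polyfit2` and `curvature_radius` are hypothetical.

```python
def polyfit2(ys, xs):
    """Least-squares fit of x = a*y^2 + b*y + c via the normal equations."""
    # s[k] = sum of y^k, t[k] = sum of x * y^k
    s = [sum(y ** k for y in ys) for k in range(5)]
    t = [sum(x * y ** k for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented 3x3 system for the unknowns (a, b, c).
    m = [
        [s[4], s[3], s[2], t[2]],
        [s[3], s[2], s[1], t[1]],
        [s[2], s[1], s[0], t[0]],
    ]
    n = 3
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [p - f * q for p, q in zip(m[r], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(n))


def curvature_radius(a, b, y):
    """Radius of curvature of x = a*y^2 + b*y + c evaluated at y."""
    if a == 0:
        return float("inf")  # a straight lane has no finite radius
    return (1.0 + (2.0 * a * y + b) ** 2) ** 1.5 / abs(2.0 * a)


# Usage: synthetic lane points generated from a known polynomial.
ys = [0.0, 1.0, 2.0, 3.0, 4.0]
xs = [0.5 * y * y - 1.0 * y + 2.0 for y in ys]
a, b, c = polyfit2(ys, xs)
radius = curvature_radius(a, b, 1.0)
```

Fitting x as a function of y (rather than y as a function of x) is a common choice for lanes because they run mostly vertically in the image, so the fit stays single-valued.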

Changan US Research & Development Center, Inc.

Intelligent Vehicle Engineer (Connected Autonomous Vehicle Research Group)

August 2016 - March 2017

Worked within Changan's Connected and Autonomous Vehicle research team to design and develop various vehicle safety models for 360-degree target classification and warning notifications with and without line-of-sight requirements.

DELPHI (now known as APTIV)

Embedded Software Engineer

May 2016 - August 2016

Worked with application teams, forward systems algorithm group, and controller design groups to define the functionality, develop algorithms, and implement them in accordance with the V-Model Software Development Life Cycle.

BlackBerry QNX

Software Development Intern (Board Support Package)

January 2016 - May 2016

Developed the BSP (Board Support Package) for custom hardware with i.MX6 Solo Processor and several peripherals. Worked towards low-level board bring-ups and provided support for the following peripherals.

Support for RAM file system to manipulate files during runtime.
Support for SPI NOR Flash and Parallel NOR Flash mounted as a filesystem at startup.
Support for removable storage (SD, microSD, USB Flash), including auto-detection and auto-mounting on attachment.
Support for USB OTG to be used for the Console Service, Mass-Storage Device, and USB-to-Ethernet Adapter attachments.
Added new features to the QNX OS for the BSP including auto-detection and switching between device stack to provide console service, host stack to provide auto-mounting of mass storage devices, and host stack to provide networking with USB-Ethernet Adapter based on the type of attachment.

The University of Texas at Arlington Research Institute

Research Intern

August 2015 - December 2015

Designed and Developed the control GUI for a prosthetic system used to help rehabilitate post-stroke patients. It used an Arduino controller for adaptive adjustment of the air bubble pressure at the desired psi value for various points on the leg. Also designed a GUI which allows the user to enter the desired psi values for each air bubble and simultaneously measure current bubble pressure and display it on the GUI in real-time.

Used two Arduino UNO boards: one for sending signals to the 32 solenoids controlling airflow from the Alicat Mass Flow Controller into the respective air bubbles, and one for receiving sense signals from the 32 corresponding air pressure sensors. Signals were also sent to the solenoids to deflate the air bubbles when required.

Indian Institute of Science (IISc)

Trainee Engineer

January 2014 - May 2014

Designed and developed a two-dimensional plotter (Smart XY Plotter) at the Mechatronics Lab, IISc, capable of plotting any 2D image using a pen, which was controlled by means of a solenoid and two stepper motors (responsible for x, y, and z directional movement).

The control system was governed by an ARM processor (STM32F4 Discovery Board) to plot images whose features were extracted using MATLAB.
Developed the GUI in MATLAB which allows a user to either upload the image of their choice or select any other plot (Arbitrary interpolated curve, texts, shapes).
Used two timers for controlling and synchronizing the parallel movement of X and Y motors to provide any curve of any desired slope.
The solenoid setup was brought back to its initial position after every plot. Limit switches were used to detect its arrival at the desired reset position.

[YouTube Link]

Keynotes and Media Coverage

The Brave Technologist - Podcast

AI in Action - Podcast

In my talk at Auto AI 2024, I delved into the transformative power of early fusion of multiple modalities and introduced GigaFusionNet, our cutting-edge spatio-temporal multimodal fusion architecture at Kodiak. This innovation is designed to enhance perception capabilities by integrating diverse data streams seamlessly.

I also unveiled our Modular Cognitive Architecture (MCA), a robust approach to our autonomy stack. MCA emphasizes redundancy, end-to-end learnability, interpretability, generalizability, and cost-effective validation, setting a new standard for autonomous driving systems.

Lastly, I emphasized the critical role of vision-language models (VLMs) in autonomous driving, exploring the complexities of data distribution and their impact on performance.

Auto AI 2024 - Keynote

IROS 2023 - Panel Discussion

I participated in a panel debate with Davide Scaramuzza, Sebastian Scherer, Ayoung Kim, Michael Mangan, and Punarjay Chakravarty at IROS 2023.

IROS 2023 - Keynote

I gave a keynote at IROS 2023 workshop: “It’s what you see, not where you are! : Localization through Perception Lens”. In my talk, I shed light on how KodiakDriver is trailblazing in the industry with its innovative approaches. Notably, its design emphasizes localizing akin to humans, setting a new standard for autonomous systems.

Auto AI 2023 - Keynote

Keynote talk @ Auto.AI USA 2023, delving into the dynamic world of vision-centric perception algorithms, breakthroughs, and bridging the academia-industry gap. 🌍✨ A key message to the research community: It's not just about building models that can do more, but rather empowering them to do "More with Less."

Panel discussion on Advanced Computer Vision Use Cases at Ai4 2023.

we.CONECT interview on the landscape of vision-centric perception for autonomous vehicles.

[Keynote Talk @In.Cabin Sensing Europe] From the outside to the inside – Rethinking the implications of autonomous driving for the communication with the driver
[Panel Discussion @In.Cabin Sensing Europe] On the way to smart cabin – How to find the balance between in-cabin sensing, communication, privacy and user experience?

Education

Stanford University

Graduate Program - Artificial Intelligence

GPA: 4+/4.0

Courses in Machine Learning, Meta-Learning, Multi-Task Learning, Deep Generative Models, Natural Language Processing, Computer Vision, and 3D Reconstruction

August 2020 - December 2022

The University of Texas at Arlington

Master of Science in Electrical Engineering

GPA: 4.0/4.0

August 2014 - August 2016

Visvesvaraya Technological University

Bachelor of Engineering in Electronics and Communication Engineering

GPA: 4.0/4.0, First Class with Distinction, Aggregate Percentage: 86%

August 2014 - August 2016

Papers and Publications

ICCV 2023 Workshop

DatasetEquity: Are All Samples Created Equal? In The Quest For Equity Within Datasets

Shubham Shrivastava, Xianling Zhang, Sushruth Nagesh, Armin Parchami

Data imbalance is a well-known issue in the field of machine learning, attributable to the cost of data collection, the difficulty of labeling, and the geographical distribution of the data. In computer vision, bias in data distribution caused by image appearance remains highly unexplored. Compared to categorical distributions using class labels, image appearance reveals complex relationships between objects beyond what class labels provide. Clustering deep perceptual features extracted from raw pixels gives a richer representation of the data. This paper presents a novel method for addressing data imbalance in machine learning. The method computes sample likelihoods based on image appearance using deep perceptual embeddings and clustering. It then uses these likelihoods to weigh samples differently during training with a proposed Generalized Focal Loss function. This loss can be easily integrated with deep learning algorithms. Experiments validate the method's effectiveness across autonomous driving vision datasets including KITTI and nuScenes. The loss function improves state-of-the-art 3D object detection methods, achieving over 200% AP gains on under-represented classes (Cyclist) in the KITTI dataset. The results demonstrate the method is generalizable, complements existing techniques, and is particularly beneficial for smaller datasets and rare classes. Code is available at: https://github.com/towardsautonomy/DatasetEquity
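
The idea of weighting samples by how likely their appearance is can be sketched in a few lines. This is only an illustration under stated assumptions, not the paper's Generalized Focal Loss: it assumes each sample already carries a cluster id obtained from clustering its perceptual embedding, takes the likelihood of a sample to be the relative size of its cluster, and uses a focal-style weight (1 - p)^gamma normalized to mean 1. The function names and the exact weighting form are hypothetical; the released code at the URL above is the authoritative reference.

```python
from collections import Counter


def sample_likelihoods(cluster_ids):
    """Likelihood of each sample = relative size of its appearance cluster."""
    counts = Counter(cluster_ids)
    n = len(cluster_ids)
    return [counts[c] / n for c in cluster_ids]


def likelihood_weights(likelihoods, gamma=2.0):
    """Focal-style weights: rare (low-likelihood) samples get larger weights.

    Assumed form (1 - p)^gamma, rescaled so the weights average to 1.
    """
    raw = [(1.0 - p) ** gamma for p in likelihoods]
    mean = sum(raw) / len(raw)
    if mean == 0:  # every sample is in one cluster: fall back to uniform
        return [1.0] * len(raw)
    return [w / mean for w in raw]


def weighted_loss(per_sample_losses, weights):
    """Mean of per-sample losses after applying the sample weights."""
    return sum(l * w for l, w in zip(per_sample_losses, weights)) / len(weights)


# Usage: three samples share a common appearance cluster, one is rare.
cluster_ids = [0, 0, 0, 1]
p = sample_likelihoods(cluster_ids)
w = likelihood_weights(p)
loss = weighted_loss([1.0, 1.0, 1.0, 1.0], w)
```

With gamma set to 0 every weight becomes 1 and the loss reduces to the ordinary unweighted mean, which makes the weighting easy to ablate.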

ICCV 2023 Workshop

Ref-DVGO: Reflection-Aware Direct Voxel Grid Optimization for an Improved Quality-Efficiency Trade-Off in Reflective Scene Reconstruction

Georgios Kouros, Minye Wu, Shubham Shrivastava, Sushruth Nagesh, Punarjay Chakravarty, Tinne Tuytelaars

Neural Radiance Fields (NeRFs) have revolutionized the field of novel view synthesis, demonstrating remarkable performance. However, the modeling and rendering of reflective objects remain challenging problems. Recent methods have shown significant improvements over the baselines in handling reflective scenes, albeit at the expense of efficiency. In this work, we aim to strike a balance between efficiency and quality. To this end, we investigate an implicit-explicit approach based on conventional volume rendering to enhance the reconstruction quality and accelerate the training and rendering processes. We adopt an efficient density-based grid representation and reparameterize the reflected radiance in our pipeline. Our proposed reflection-aware approach achieves a competitive quality-efficiency trade-off compared to competing methods. Based on our experimental results, we propose and discuss hypotheses regarding the factors influencing the results of density-based methods for reconstructing reflective objects. The source code is available [here].

ICRA 2022

Propagating State Uncertainty Through Trajectory Forecasting

Boris Ivanovic, Yifeng Lin, Shubham Shrivastava, Punarjay Chakravarty, Marco Pavone

Uncertainty pervades through the modern robotic autonomy stack, with nearly every component (e.g., sensors, detection, classification, tracking, behavior prediction) producing continuous or discrete probabilistic distributions. Trajectory forecasting, in particular, is surrounded by uncertainty as its inputs are produced by (noisy) upstream perception and its outputs are predictions that are often probabilistic for use in downstream planning. However, most trajectory forecasting methods do not account for upstream uncertainty, instead taking only the most-likely values. As a result, perceptual uncertainties are not propagated through forecasting and predictions are frequently overconfident. To address this, we present a novel method for incorporating perceptual state uncertainty in trajectory forecasting, a key component of which is a new statistical distance-based loss function which encourages predicting uncertainties that better match upstream perception. We evaluate our approach both in illustrative simulations and on large-scale, real-world data, demonstrating its efficacy in propagating perceptual state uncertainty through prediction and producing more calibrated predictions.

BMVC 2022

Category-Level Pose Retrieval with Contrastive Features Learnt with Occlusion Augmentation

Georgios Kouros, Shubham Shrivastava, Cédric Picron, Sushruth Nagesh, Punarjay Chakravarty, Tinne Tuytelaars

Pose estimation is usually tackled as either a bin classification problem or as a regression problem. In both cases, the idea is to directly predict the pose of an object. This is a non-trivial task because of appearance variations of similar poses and similarities between different poses. Instead, we follow the key idea that it is easier to compare two poses than to estimate them. Render-and-compare approaches have been employed to that end, however, they tend to be unstable, computationally expensive, and slow for real-time applications. We propose doing category-level pose estimation by learning an alignment metric using a contrastive loss with a dynamic margin and a continuous pose-label space. For efficient inference, we use a simple real-time image retrieval scheme with a reference set of renderings projected to an embedding space. To achieve robustness to real-world conditions, we employ synthetic occlusions, bounding box perturbations, and appearance augmentations. Our approach achieves state-of-the-art performance on PASCAL3D and OccludedPASCAL3D, as well as high-quality results on KITTI3D.

IROS 2023

DisPlacing Objects: Improving Dynamic Vehicle Detection via Visual Place Recognition under Adverse Conditions

Stephen Hausler, Sourav Garg, Punarjay Chakravarty, Shubham Shrivastava, Ankit Vora, Michael Milford

Can knowing where you are assist in perceiving objects in your surroundings, especially under adverse weather and lighting conditions? In this work we investigate whether a prior map can be leveraged to aid in the detection of dynamic objects in a scene without the need for a 3D map or pixel-level map-query correspondences. We contribute an algorithm which refines an initial set of candidate object detections and produces a refined subset of highly accurate detections using a prior map. We begin by using visual place recognition (VPR) to retrieve a reference map image for a given query image, then use a binary classification neural network that compares the query and mapping image regions to validate the query detection. Once our classification network is trained, on approximately 1000 query-map image pairs, it is able to improve the performance of vehicle detection when combined with an existing off-the-shelf vehicle detector. We demonstrate our approach using standard datasets across two cities (Oxford and Zurich) under different settings of train-test separation of map-query traverse pairs. We further emphasize the performance gains of our approach against alternative design choices and show that VPR suffices for the task, eliminating the need for precise ground truth localization.

IROS 2023

Locking On: Leveraging Dynamic Vehicle-Imposed Motion Constraints to Improve Visual Localization

Stephen Hausler, Sourav Garg, Punarjay Chakravarty, Shubham Shrivastava, Ankit Vora, Michael Milford

Most 6-DoF localization and SLAM systems use static landmarks but ignore dynamic objects because they cannot be usefully incorporated into a typical pipeline. Where dynamic objects have been incorporated, typical approaches have attempted relatively sophisticated identification and localization of these objects, limiting their robustness or general utility. In this research, we propose a middle ground, demonstrated in the context of autonomous vehicles, using dynamic vehicles to provide limited pose constraint information in a 6-DoF frame-by-frame PnP-RANSAC localization pipeline. We refine initial pose estimates with a motion model and propose a method for calculating the predicted quality of future pose estimates, triggered based on whether or not the autonomous vehicle's motion is constrained by the relative frame-to-frame location of dynamic vehicles in the environment. Our approach detects and identifies suitable dynamic vehicles to define these pose constraints to modify a pose filter, resulting in improved recall across a range of localization tolerances from 0.25m to 5m, compared to a state-of-the-art baseline single image PnP method and its vanilla pose filtering. Our constraint detection system is active for approximately 35% of the time on the Ford AV dataset and localization is particularly improved when the constraint detection is active.

IJCAI 2021 AI for Autonomous Driving Workshop

VR3Dense: Voxel Representation Learning for 3D Object Detection and Monocular Dense Depth Reconstruction

Shubham Shrivastava

3D object detection and dense depth estimation are among the most vital tasks in autonomous driving. Multiple sensor modalities can jointly contribute towards better robot perception, and to that end, we introduce a method for jointly training 3D object detection and monocular dense depth reconstruction neural networks. It takes as inputs a LiDAR point cloud and a single RGB image during inference and produces object pose predictions as well as a densely reconstructed depth map. The LiDAR point cloud is converted into a set of voxels, and its features are extracted using 3D convolution layers, from which we regress object pose parameters. Corresponding RGB image features are extracted using a separate 2D convolutional neural network. We further use these combined features to predict a dense depth map. While our object detection is trained in a supervised manner, the depth prediction network is trained with both self-supervised and supervised loss functions. We also introduce a loss function, edge-preserving smooth loss, and show that this results in better depth estimation compared to the edge-aware smooth loss function frequently used in depth prediction works.
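
For context, the edge-aware smooth loss that the abstract uses as its point of comparison can be sketched as follows. This shows a simplified form of that commonly used baseline, not the edge-preserving loss proposed in the paper: depth gradients are penalized, but the penalty is down-weighted by exp(-|image gradient|) so that depth is allowed to change across image edges. The sketch assumes a grayscale image and plain nested lists; the function name is hypothetical.

```python
import math


def edge_aware_smoothness(depth, image):
    """Mean of |depth gradient| * exp(-|image gradient|) over neighbor pairs.

    depth and image are equally sized 2D lists; image holds grayscale values.
    """
    h, w = len(depth), len(depth[0])
    total, count = 0.0, 0
    for v in range(h):
        for u in range(w):
            if u + 1 < w:  # horizontal neighbor
                d_grad = abs(depth[v][u + 1] - depth[v][u])
                i_grad = abs(image[v][u + 1] - image[v][u])
                total += d_grad * math.exp(-i_grad)
                count += 1
            if v + 1 < h:  # vertical neighbor
                d_grad = abs(depth[v + 1][u] - depth[v][u])
                i_grad = abs(image[v + 1][u] - image[v][u])
                total += d_grad * math.exp(-i_grad)
                count += 1
    return total / count if count else 0.0


# Usage: the same depth step is cheap when it coincides with an image edge
# and expensive when the image is flat.
depth = [[1, 1, 5, 5], [1, 1, 5, 5]]
image_with_edge = [[0, 0, 10, 10], [0, 0, 10, 10]]
image_flat = [[0, 0, 0, 0], [0, 0, 0, 0]]
loss_edge = edge_aware_smoothness(depth, image_with_edge)
loss_flat = edge_aware_smoothness(depth, image_flat)
```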

IROS 2022 + RA-L

Improving Worst Case Visual Localization Coverage via Place-specific Sub-selection in Multi-camera Systems

Stephen Hausler, Ming Xu, Sourav Garg, Punarjay Chakravarty, Shubham Shrivastava, Ankit Vora, Michael Milford

6-DoF visual localization systems utilize principled approaches rooted in 3D geometry to perform accurate camera pose estimation of images to a map. Current techniques use hierarchical pipelines and learned 2D feature extractors to improve scalability and increase performance. However, despite gains in typical recall@0.25m type metrics, these systems still have limited utility for real-world applications like autonomous vehicles because of their 'worst' areas of performance - the locations where they provide insufficient recall at a certain required error tolerance. Here we investigate the utility of using 'place specific configurations', where a map is segmented into a number of places, each with its own configuration for modulating the pose estimation step, in this case selecting a camera within a multi-camera system. On the Ford AV benchmark dataset, we demonstrate substantially improved worst-case localization performance compared to using off-the-shelf pipelines - minimizing the percentage of the dataset which has low recall at a certain error tolerance, as well as improved overall localization performance. Our proposed approach is particularly applicable to the crowdsharing model of autonomous vehicle deployment, where a fleet of AVs are regularly traversing a known route.

CVPR 2020 Workshop

Deflating Dataset Bias Using Synthetic Data Augmentation

Nikita Jaipuria, Xianling Zhang, Rohan Bhasin, Mayar Arafa, Punarjay Chakravarty, Shubham Shrivastava, Sagar Manglani, Vidya N. Murali

Deep Learning has seen an unprecedented increase in vision applications since the publication of large-scale object recognition datasets and the introduction of scalable compute hardware. State-of-the-art methods for most vision tasks for Autonomous Vehicles (AVs) rely on supervised learning and often fail to generalize to domain shifts and/or outliers. Dataset diversity is thus key to successful real-world deployment. No matter how big the size of the dataset, capturing long tails of the distribution pertaining to task-specific environmental factors is impractical. The goal of this paper is to investigate the use of targeted synthetic data augmentation - combining the benefits of gaming engine simulations and sim2real style transfer techniques - for filling gaps in real datasets for vision tasks. Empirical studies on three different computer vision tasks of practical use to AVs - parking slot detection, lane detection, and monocular depth estimation - consistently show that having synthetic data in the training mix provides a significant boost in cross-dataset generalization performance as compared to training on real data only, for the same size of the training set.

CubifAE-3D: Monocular Camera Space Cubification for Auto-Encoder based 3D Object Detection

Shubham Shrivastava and Punarjay Chakravarty

We introduce a method for 3D object detection using a single monocular image. Starting from a synthetic dataset, we pre-train an RGB-to-Depth Auto-Encoder (AE). The embedding learnt from this AE is then used to train a 3D Object Detector (3DOD) CNN which is used to regress the parameters of 3D object poses after the encoder from the AE generates a latent embedding from the RGB image. We show that we can pre-train the AE using paired RGB and depth images from simulation data once and subsequently only train the 3DOD network using real data, comprising RGB images and 3D object pose labels (without the requirement of dense depth). Our 3DOD network utilizes a particular 'cubification' of 3D space around the camera, where each cuboid is tasked with predicting N object poses, along with their class and confidence values. The AE pre-training and this method of dividing the 3D space around the camera into cuboids give our method its name - CubifAE-3D. We demonstrate results for monocular 3D object detection in the Autonomous Vehicle (AV) use-case with the Virtual KITTI 2 and the KITTI datasets.

QAGAN: Adversarial Approach To Learning Domain Invariant Language Features

Shubham Shrivastava and Kaiyue Wang

Training models that are robust to data domain shift has gained increasing interest in both academia and industry. Question answering, one of the typical problems in Natural Language Processing (NLP) research, has seen much success with the advent of large transformer models. However, existing approaches mostly work under the assumption that data is drawn from the same distribution during training and testing, which is unrealistic and non-scalable in the wild. In this paper, we explore an adversarial training approach towards learning domain-invariant features so that language models can generalize well to out-of-domain datasets. We also inspect various other ways to boost our model performance, including data augmentation by paraphrasing sentences, conditioning end-of-answer-span prediction on the start word, and a carefully designed annealing function. Our initial results show that in combination with these methods, we are able to achieve a 15.2% improvement in EM score and a 5.6% boost in F1 score on out-of-domain validation datasets over the baseline. We also dissect our model outputs and visualize the model hidden states by projecting them onto a lower-dimensional space, and discover that our specific adversarial training approach indeed encourages the model to learn domain-invariant embeddings and bring them closer in the multi-dimensional space.

PoseGen: Pose-Conditioned Image Generation

Shubham Shrivastava and Amir Ziai

Autonomous driving applications require detection and localization of objects to within a few centimeters. Deep learning models that can achieve this level of accuracy require millions of training data samples. Manually annotating data at this scale is highly cumbersome and expensive. Furthermore, annotating object poses in 3D is not possible from just a single image without either additional sensors or multi-frame non-linear optimization. In this work we present PoseGen, a methodology for generating a realistic image of an object given a desired pose, appearance, and background. We do this in an unsupervised way by conditioning the generation process on various attributes. We also release a dataset of paired objects and silhouette masks, the TeslaPose dataset, which we hope will help the research community further tackle this problem.

Meta-Regularization by Enforcing Mutual-Exclusiveness

Shubham Shrivastava, Edwin Pan, and Pankaj Rajak

Meta-learning models have two objectives. First, they need to be able to make predictions over a range of task distributions while utilizing only a small amount of training data. Second, they also need to adapt to novel, unseen tasks at meta-test time, again using only a small amount of training data from that task. It is the second objective where meta-learning models fail for non-mutually exclusive tasks due to task overfitting. Given that guaranteeing mutually exclusive tasks is often difficult, there is a significant need for regularization methods that can help reduce the impact of task-memorization in meta-learning. For example, in the case of N-way, K-shot classification problems, tasks become non-mutually exclusive when the labels associated with each task are fixed. Under this design, the model will simply memorize the class labels of all the training tasks, and thus will fail to recognize a new task (class) at meta-test time. A direct observable consequence of this memorization is that the meta-learning model simply ignores the task-specific training data in favor of directly classifying based on the test data input. In our work, we propose a regularization technique for meta-learning models that gives the model designer more control over the information flow during meta-training. Our method consists of a regularization function that is constructed by maximizing the distance between task-summary statistics, in the case of black-box models, and task-specific network parameters in the case of optimization-based models during meta-training. Our proposed regularization function shows an accuracy boost of ∼36% on the Omniglot dataset for 5-way, 1-shot classification using the black-box method and for the 20-way, 1-shot classification problem using optimization-based methods.

An A* Curriculum Approach to Reinforcement Learning for RGBD Indoor Robot Navigation

Kaushik Balakrishnan, Punarjay Chakravarty, Shubham Shrivastava

Training robots to navigate diverse environments is a challenging problem as it involves the confluence of several different perception tasks such as mapping and localization, followed by optimal path-planning and control. Recently released photo-realistic simulators such as Habitat allow for the training of networks that output control actions directly from perception: agents use Deep Reinforcement Learning (DRL) to regress directly from the camera image to a control output in an end-to-end fashion. This is data-inefficient and can take several days to train on a GPU. Our paper tries to overcome this problem by separating the training of the perception and control neural nets and increasing the path complexity gradually using a curriculum approach. Specifically, a pre-trained twin Variational AutoEncoder (VAE) is used to compress RGBD (RGB & depth) sensing from an environment into a latent embedding, which is then used to train a DRL-based control policy. A*, a traditional path planner, is used as a guide for the policy, and the distance between start and target locations is incrementally increased along the A* route as training progresses. We demonstrate the efficacy of the proposed approach, both in terms of increased performance and decreased training times, for the PointNav task in the Habitat simulation environment. This strategy of improving the training of direct-perception based DRL navigation policies is expected to hasten the deployment of robots of particular interest to industry, such as co-bots on the factory floor and last-mile delivery robots.
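The curriculum described above, where the target moves progressively farther along the A* route, can be sketched on a toy occupancy grid. This is only an illustration under stated assumptions: the 4-connected grid, the Manhattan heuristic, and the helper names `astar` and `curriculum_targets` are hypothetical and are not taken from the paper, which works in the Habitat simulator.

```python
import heapq


def astar(grid, start, goal):
    """4-connected A* on a grid of 0 (free) and 1 (blocked) cells.

    Returns the list of cells from start to goal, or None if unreachable.
    """
    def heuristic(cell):
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0, start)]
    came_from = {start: None}
    cost = {start: 0}
    while open_heap:
        _, g, current = heapq.heappop(open_heap)
        if current == goal:
            path = []
            while current is not None:
                path.append(current)
                current = came_from[current]
            return path[::-1]
        if g > cost[current]:
            continue  # stale heap entry
        r, c = current
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]) and grid[nr][nc] == 0:
                new_g = g + 1
                if new_g < cost.get((nr, nc), float("inf")):
                    cost[(nr, nc)] = new_g
                    came_from[(nr, nc)] = current
                    heapq.heappush(open_heap, (new_g + heuristic((nr, nc)), new_g, (nr, nc)))
    return None


def curriculum_targets(path, num_stages):
    """Pick one target per stage, each farther along the A* path."""
    last = len(path) - 1
    targets = []
    for stage in range(1, num_stages + 1):
        index = max(1, round(last * stage / num_stages))
        targets.append(path[min(index, last)])
    return targets


if __name__ == "__main__":
    grid = [[0, 0, 0],
            [1, 1, 0],
            [0, 0, 0]]
    route = astar(grid, (0, 0), (2, 0))
    print(route)
    print(curriculum_targets(route, 3))
```

In a training loop, each stage's target would serve as the episode goal for the DRL policy until performance is adequate, after which the next, more distant target is used.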

Computer Vision Conference (CVC) 2019 Conference Paper

Stereo Vision Based Object Detection Using V-Disparity and 3D Density-Based Clustering

Shubham Shrivastava

In recent years, autonomous driving has inexorably progressed from the domain of science fiction to reality. For a self-driving car, it is of utmost importance that it knows its surroundings. Several sensors such as RADARs, LiDARs, and cameras have been used to sense the environment and decide on the next course of action. Object detection is of great significance in autonomous driving, wherein the self-driving car needs to identify the objects around it and take the necessary actions to avoid a collision. Several perception-based methods, such as classical Computer Vision techniques and Convolutional Neural Networks (CNNs), exist today that detect and classify objects. This paper discusses an object detection technique based on Stereo Vision. One challenge in this process is to eliminate regions of the image that are insignificant for detection, such as unoccupied road and buildings far ahead. This paper proposes a method to first remove such regions using V-Disparity and then detect objects using 3D density-based clustering. Results given in this paper show that the proposed system can detect objects on the road accurately and robustly.
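A V-disparity map is a per-row histogram of disparity values, in which a planar road typically shows up as a dominant profile. The sketch below builds that histogram and, as a deliberately crude stand-in for the road-removal step, drops pixels that match the most frequent disparity in their row. This simplification is an assumption made for illustration; the abstract does not specify how the paper extracts the road profile, and the function names and `tolerance` parameter are hypothetical. Integer disparities are assumed.

```python
def v_disparity(disparity, max_disparity):
    """Return a rows x (max_disparity + 1) table of disparity counts per image row."""
    hist = [[0] * (max_disparity + 1) for _ in disparity]
    for v, row in enumerate(disparity):
        for d in row:
            if 0 <= d <= max_disparity:
                hist[v][d] += 1
    return hist


def remove_dominant_profile(disparity, max_disparity, tolerance=0):
    """Mask of pixels to keep: True where the disparity differs from the
    row's most frequent disparity by more than `tolerance`."""
    hist = v_disparity(disparity, max_disparity)
    keep = []
    for v, row in enumerate(disparity):
        dominant = max(range(max_disparity + 1), key=lambda d: hist[v][d])
        keep.append([abs(d - dominant) > tolerance for d in row])
    return keep


if __name__ == "__main__":
    disparity_map = [[1, 1, 1, 3],
                     [2, 2, 5, 2]]
    print(v_disparity(disparity_map, 5))
    print(remove_dominant_profile(disparity_map, 5))
```

The pixels that survive the mask would then be back-projected to 3D points and grouped by a density-based clustering step, which is not shown here.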

S-BEV: Semantic Birds-Eye View Representation for Weather and Lighting Invariant 3-DoF Localization

Mokshith Voodarla, Shubham Shrivastava, Sagar Manglani, Ankit Vora, Siddharth Agarwal, Punarjay Chakravarty

We describe a light-weight, weather and lighting invariant, Semantic Bird's Eye View (S-BEV) signature for vision-based vehicle re-localization. A topological map of S-BEV signatures is created during the first traversal of the route and is then used for coarse localization in subsequent route traversals. A fine-grained localizer is then trained to output the global 3-DoF pose of the vehicle using its S-BEV and its coarse localization. We conduct experiments on the vKITTI2 virtual dataset and show the potential of the S-BEV to be robust to weather and lighting. We also demonstrate results with 2 vehicles on a 22 km long highway route in the Ford AV dataset.

Sim2Real for Self-Supervised Monocular Depth and Segmentation

Nithin Raghavan, Punarjay Chakravarty, Shubham Shrivastava

Image-based learning methods for autonomous vehicle perception tasks require large quantities of labelled, real data in order to train properly without overfitting, and such data can be incredibly costly to obtain. While leveraging the power of simulated data can potentially aid in mitigating these costs, networks trained in the simulation domain usually fail to perform adequately when applied to images in the real domain. Recent advances in domain adaptation have indicated that a shared latent space assumption can help to bridge the gap between the simulation and real domains, allowing the transference of the predictive capabilities of a network from the simulation domain to the real domain. We demonstrate that a twin VAE-based architecture with a shared latent space and auxiliary decoders is able to bridge the sim2real gap without requiring any paired, ground-truth data in the real domain. Using only paired, ground-truth data in the simulation domain, this architecture has the potential to generate perception outputs such as depth and segmentation maps. We compare this method to networks trained in a supervised manner to indicate the merit of these results.

Chapter in the Book: Connected Vehicles

V2V Vehicle Safety Communication

Shubham Shrivastava

The National Highway Traffic Safety Administration (NHTSA) has been interested in vehicle-to-vehicle (V2V) communication as the next step in addressing growing rates of fatalities from vehicle-related crashes. Today’s crash avoidance technologies depend on on-board sensors like camera and radar to provide awareness input to the safety applications. These applications warn the driver of imminent danger or sometimes even act on the driver’s behalf. However, even technologies like those cannot “predict” a crash that might happen because of a vehicle which is not very close or not in the line of sight of the host vehicle. A technology that can “see” through another vehicle or obstacles like buildings and predict a danger can fill these gaps and reduce crashes drastically. V2V communications can give vehicles the ability to talk to each other and therefore see around corners and through obstacles over a longer distance than current on-board sensors. It is estimated that V2X communications can address up to 80% of the unimpaired crashes [1]. By means of a Notice of Proposed Rulemaking (NPRM), NHTSA is working towards standardization of V2V communications and potentially mandating the broadcast of vehicle data (e.g. GPS coordinates, speed, acceleration) over DSRC through V2V.

Patents

[11107228] S Shrivastava. “Realistic Image Perspective Transformation Using Neural Networks”. A system based on a deep neural network to synthesize multiple realistic perspectives of an image.
[20230419539] S Nagesh, S Shrivastava, P Chakravarty. “Vehicle Pose Management”. Pose estimation through unsupervised landmark estimation.
[11189049] P Chakravarty, and S Shrivastava. “Vehicle Neural Network Perception and Localization”. Using Map-Perception Disagreement for Robust Perception and Localization with Generative Models.
[11348278] P Chakravarty, S Shrivastava, G Pandey, and X Wong. “Object Detection”. Automatic Calibration of Automobile Cameras – In the Factory & On The Road.
[11482007] S Shrivastava. “Event-Based Vehicle Pose Estimation Using Monochromatic Imaging”. Event prediction derived 9-DOF Vehicle Pose Estimation in Garage-Like Space using Monocular Cameras.
[11562571] N Raghavan, S Shrivastava, and P Chakravarty. “Vehicle Neural Network”. Zero-Cost Training of Perception Tasks using a Sim-to-Real Architecture with Auxiliary Decoding.
[11619727] S Manglani, P Chakravarty, and S Shrivastava. “Determining Multi-Degree-Of-Freedom Pose For Sensor Calibration”. A robotic calibration device and a method of calculating a global multi-degree of freedom (MDF) pose of an array of cameras affixed to a structure.
[11670088] M Voodarla, P Chakravarty, and S Shrivastava. “Vehicle Neural Network Localization”. Semantic Birds-Eye View Representation for Weather and Lighting Invariant 3 DoF Localization.
[20230186587] S Shrivastava, P Chakravarty, G Pandey. “Three-Dimensional Object Detection”. A method of complete 3D Scene Understanding including Dynamic and Static Object Detection and Tracking from a single RGB camera image using an end-to-end Neural Network.
[11887317] B Ivanovic, Y Lin, S Shrivastava, P Chakravarty, M Pavone. “Object Trajectory Forecasting”. A method of agent trajectory prediction by propagating estimated state uncertainty through object perception.
[20230025152] S Shrivastava, G Pandey, and P Chakravarty. “Object Pose Estimation”. An end-to-end self-supervised method of 4-DoF vehicle pose estimation through a 3D rendering engine.
[20230097584] P Chakravarty, and S Shrivastava. “Object Pose Estimation”. Automated stitching of multiple-camera feeds without requiring overlaps for unsupervised object pose estimation using 3D model rendering.
[11710254] S Shrivastava, P Chakravarty, and G Pandey. “Neural Network Object Detection”. Multi-Camera assisted Semi-Supervised Monocular 3D Object Detection.
[20230025152] S Shrivastava, G Pandey, and P Chakravarty. “Object Pose Estimation”. Unsupervised end-to-end pose estimation through differentiable rendering.
[US20240087332A1] C Picron, T Tuytelaars, P Chakravarty, and S Shrivastava. “Object Detection with Images”. FFDet, Fast-Converging, Feature-based Two-Stage Object-Detector.
[US20240054673A1] S Shrivastava, B Ghadge, and P Chakravarty. “Vehicle Pose Management”. 6-DoF vehicle pose estimation using static monocular cameras utilizing 2D bounding-box and ray casting.
[20230252667] M Xu, S Garg, M Milford, P Chakravarty, and S Shrivastava. “Vehicle Localization”. A solution to the visual localization problem along repeated routes using automatic place-specific hashing of parameters.
[20230267640] P Chakravarty, S Mishra, A Parchami, G Pandey, and S Shrivastava. “Pose Estimation”. 6-DoF pose estimation of objects as viewed from a static fisheye camera utilizing geometric approach and neural networks trained on synthetic data.
[11827203] P Chakravarty, and S Shrivastava. “Multi-Degree-Of-Freedom Pose For Vehicle Navigation”. A weakly-supervised method of 6-DoF pose estimation for known objects by means of keypoints, visual tracking, and non-linear optimization.
[20230136871] P Chakravarty, S Shrivastava, B Ghadge, A Parchami, G Pandey. “Automated Camera Pose Estimation using Traffic Monitoring”. Automated perception node localization in an infrastructure system by monitoring traffic scenes.
[20220214692] P Chakravarty, K Balakrishnan, and S Shrivastava. “Vision-Based Navigation By Coupling Deep Reinforcement Learning And A Path Planning Algorithm”. Robot Navigation Using Vision Embeddings and A* for Improved Training of Deep-Reinforcement Learning Policies.

Peer Reviews

[6 Papers] The IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR) 2024
[2 Papers] The IEEE International Conference on Computer Vision (ICCV) 2023
The IEEE Robotics and Automation Letters (RA-L) 2023
The IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2023
The IEEE International Conference on Robotics and Automation (ICRA) 2023
The IEEE Robotics and Automation Letters (RA-L) 2022
[2 Papers] The IEEE International Conference on Robotics and Automation (ICRA) 2022
[2 Papers] The IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2022
[2 Papers] ASME Journal of Autonomous Vehicles and Systems 2021
