Traditional computer vision

Structure-from-motion

Structure from Motion (SfM) is inspired by our ability to learn about the 3D structure of the surrounding environment by moving through it. Given a sequence of images, SfM simultaneously estimates both the 3D structure of the scene and the path the camera took.

SfM can be solved using either the Factorization Method or the Algebraic Method. The Factorization Method has two main drawbacks: (1) it assumes that all 3D points are visible in every camera, and (2) it assumes affine cameras, so the solution carries an affine ambiguity in addition to the usual similarity ambiguity.

In the Algebraic Method, we assume a projective camera and solve for the camera poses and 3D points, refining them with non-linear optimization (bundle adjustment). We first compute the Essential Matrix and use it to derive four possible (R, T) pairs for each pair of cameras. For each candidate (R, T), we compute a linear estimate of the 3D points, which is then refined by bundle adjustment, minimizing the reprojection error between pairwise cameras with a non-linear solver (Gauss-Newton). The triangulated points are also used to select the correct (R, T) by enforcing that all points lie in front of both camera planes.
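As a minimal sketch of the E-to-(R, T) decomposition step, here is the standard textbook construction rather than the repository's code; E is assumed to have been estimated already from point correspondences:

    import numpy as np

    def decompose_essential(E):
        """Return the four candidate (R, t) pairs encoded by an essential matrix."""
        U, _, Vt = np.linalg.svd(E)
        # Enforce proper rotations (det = +1) on both orthogonal factors.
        if np.linalg.det(U) < 0:
            U = -U
        if np.linalg.det(Vt) < 0:
            Vt = -Vt
        W = np.array([[0, -1, 0],
                      [1,  0, 0],
                      [0,  0, 1]])
        R1, R2 = U @ W @ Vt, U @ W.T @ Vt
        t = U[:, 2]                     # translation known only up to sign and scale
        return [(R1, t), (R1, -t), (R2, t), (R2, -t)]

    # The correct pair is the one for which triangulated points have positive
    # depth in both cameras (the cheirality check described above).

In practice, OpenCV's cv2.recoverPose performs this decomposition together with the cheirality check in a single call.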

Visit the GitHub repository for the implementation of both methods along with bundle adjustment.

Space carving of voxels for building 3D models

RGB Image and Corresponding Silhouette

3D Model built with a series of image silhouettes from various viewpoints.

Contours of an object seen from various viewpoints provide a great deal of information about its 3D geometry. Exploiting this requires computing the object silhouette in each image plane and then enforcing consistency across the multiple views.

We start from an initial volume of voxels and carve away those that do not lie within the visual cone defined by the object silhouette in image space. Doing this for each camera and intersecting all the visual cones eventually yields the visual hull.

If enough viewpoints are available and the corresponding camera poses are known, this visual hull is a good 3D representation of the object.
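A minimal sketch of this carving loop (the projection matrices, silhouettes, grid bounds, and resolution below are assumptions for illustration, not the project's actual setup):

    import numpy as np

    def carve(P_list, sil_list, grid_min=-1.0, grid_max=1.0, n=64):
        # Initial solid block of voxel centers.
        axis = np.linspace(grid_min, grid_max, n)
        X, Y, Z = np.meshgrid(axis, axis, axis, indexing="ij")
        voxels = np.c_[X.ravel(), Y.ravel(), Z.ravel()]
        keep = np.ones(len(voxels), dtype=bool)
        hom = np.c_[voxels, np.ones(len(voxels))]          # homogeneous coordinates

        for P, sil in zip(P_list, sil_list):
            uvw = hom @ P.T                                # project voxel centers
            u = (uvw[:, 0] / uvw[:, 2]).round().astype(int)
            v = (uvw[:, 1] / uvw[:, 2]).round().astype(int)
            h, w = sil.shape
            inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
            # Carve voxels that fall outside the image or outside the silhouette.
            on_sil = np.zeros(len(voxels), dtype=bool)
            on_sil[inside] = sil[v[inside], u[inside]] > 0
            keep &= on_sil
        return voxels[keep]                                # surviving visual-hull voxels

    # Hypothetical usage with a single camera looking down the z-axis:
    K = np.array([[100.0, 0, 64], [0, 100.0, 64], [0, 0, 1]])
    P = K @ np.c_[np.eye(3), [0, 0, 3.0]]                  # camera 3 units away
    silhouette = np.zeros((128, 128), dtype=np.uint8)
    silhouette[32:96, 32:96] = 1                           # square silhouette
    hull = carve([P], [silhouette])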

LiDAR-based Ground-Plane Segmentation and Object Detection

A point cloud is a set of data points in 3D space representing LiDAR laser rays reflected by objects; each point is the location where a transmitted ray interacted with the environment. The ground plane can be segmented by fitting planes to the point cloud and selecting the one with the maximum number of inliers, which can be done with the well-known RANSAC (RANdom SAmple Consensus) algorithm.

RANSAC first picks a few points at random from the point cloud and fits a plane through them. It then counts inliers by computing the perpendicular distance from every other point to the plane; a point whose distance falls within a chosen threshold is added to the inlier list. This process is repeated many times, and the fit with the most inliers is taken as the ground-plane estimate. A plane can be fit through points using either SVD or least squares.
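A minimal sketch of this plane fit on a synthetic cloud (the point cloud, iteration count, and threshold below are placeholders; a real cloud would come from the LiDAR driver):

    import numpy as np

    rng = np.random.default_rng(1)
    # Synthetic cloud: a flat, slightly noisy ground plus some elevated "object" points.
    ground = np.c_[rng.uniform(-10, 10, (500, 2)), 0.02 * rng.normal(size=500)]
    objects = rng.uniform([-2, -2, 0.5], [2, 2, 2.5], (150, 3))
    cloud = np.vstack([ground, objects])

    def ransac_plane(points, n_iters=100, threshold=0.1):
        best_inliers = np.array([], dtype=int)
        for _ in range(n_iters):
            # 1. Sample 3 points and fit a candidate plane through them.
            p1, p2, p3 = points[rng.choice(len(points), 3, replace=False)]
            normal = np.cross(p2 - p1, p3 - p1)
            norm = np.linalg.norm(normal)
            if norm < 1e-9:                       # degenerate (collinear) sample
                continue
            normal /= norm
            d = -normal @ p1
            # 2. Count inliers: points within `threshold` of the candidate plane.
            dist = np.abs(points @ normal + d)
            inliers = np.flatnonzero(dist < threshold)
            if len(inliers) > len(best_inliers):
                best_inliers = inliers
        return best_inliers

    ground_idx = ransac_plane(cloud)
    non_ground = np.delete(cloud, ground_idx, axis=0)
    print(len(ground_idx), "ground points,", len(non_ground), "remaining")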

Once the ground plane has been segmented out, we can examine the remaining points to find objects (if any) within the point cloud. To do so, we exploit the fact that points belonging to the same object lie close together: clusters of points correspond to objects such as vehicles, pedestrians, road boundary walls, buildings, and trees. We can further filter the clusters using constraints such as bounding-box length, width, and height to differentiate between classes. A 3D bounding box defines an object's extent in terms of x_min, x_max, y_min, y_max, z_min, and z_max.

Clustering requires finding all points close to a seed point. With a brute-force approach that computes distances between every pair of points, the cost grows quadratically and quickly becomes prohibitive. If the points are stored in a k-d tree, the neighbour search becomes much cheaper. This project uses Euclidean clustering, which relies on k-d tree search to group points that lie close together.
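A minimal Euclidean-clustering sketch built on scipy's k-d tree (the input array, cluster tolerance, and minimum cluster size are illustrative assumptions; the project's own implementation lives in TAPL):

    import numpy as np
    from scipy.spatial import cKDTree

    def euclidean_clusters(points, tol=0.5, min_size=10):
        tree = cKDTree(points)                    # k-d tree for fast radius queries
        visited = np.zeros(len(points), dtype=bool)
        clusters = []
        for seed in range(len(points)):
            if visited[seed]:
                continue
            # Grow a cluster from the seed by flood-filling its radius neighbours.
            queue, cluster = [seed], []
            visited[seed] = True
            while queue:
                idx = queue.pop()
                cluster.append(idx)
                for nbr in tree.query_ball_point(points[idx], r=tol):
                    if not visited[nbr]:
                        visited[nbr] = True
                        queue.append(nbr)
            if len(cluster) >= min_size:
                clusters.append(np.array(cluster))
        return clusters

    # Example: given an Nx3 array of non-ground points (e.g. the `non_ground`
    # array from the RANSAC sketch above), each returned index array is one
    # object candidate, which can then be filtered by its bounding-box size.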

This has been implemented as a part of TAPL - more details and implementation [here].

fundamental matrix computation

Epipolar lines for Set1

Epipolar lines for Set1

Epipolar lines for Set2

Estimating the Fundamental Matrix from point correspondences between two images is a key component of multi-view geometry problems. Point projections in the two camera planes are related by this 3x3 matrix: the projection of a 3D point in one camera plane corresponds to a line (the epipolar line) in the other camera plane. It allows us to triangulate 3D points from multiple views and subsequently recover the 3D geometry.

One well-known algorithm for computing the fundamental matrix is the 8-point algorithm. As the name suggests, it requires at least 8 point correspondences between the images and solves for the fundamental matrix with linear least squares via SVD. Because the points are in pixel space, some entries of the linear system [ Ax = 0 ] are orders of magnitude larger than others; the system is poorly conditioned, and the average distance between a pixel and its epipolar line can be on the order of ~10 px. This is resolved by a variant, the normalized 8-point algorithm, which translates the points so their centroid sits at the origin and scales them so their mean squared distance from the origin is 2.
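A minimal sketch of the normalized 8-point algorithm (pts1 and pts2 are assumed to be Nx2 arrays of matched pixel coordinates from a feature matcher, with N >= 8):

    import numpy as np

    def normalize(pts):
        # Translate points to the origin and scale so the RMS distance is sqrt(2).
        centroid = pts.mean(axis=0)
        centered = pts - centroid
        scale = np.sqrt(2.0) / np.sqrt((centered ** 2).sum(axis=1).mean())
        T = np.array([[scale, 0, -scale * centroid[0]],
                      [0, scale, -scale * centroid[1]],
                      [0, 0, 1]])
        return (T @ np.c_[pts, np.ones(len(pts))].T).T, T

    def fundamental_8point(pts1, pts2):
        x1, T1 = normalize(pts1)
        x2, T2 = normalize(pts2)
        # Each correspondence x2^T F x1 = 0 gives one row of A in A f = 0.
        A = np.column_stack([
            x2[:, 0] * x1[:, 0], x2[:, 0] * x1[:, 1], x2[:, 0],
            x2[:, 1] * x1[:, 0], x2[:, 1] * x1[:, 1], x2[:, 1],
            x1[:, 0], x1[:, 1], np.ones(len(x1))])
        F = np.linalg.svd(A)[2][-1].reshape(3, 3)
        # Enforce rank 2, then undo the normalization.
        U, S, Vt = np.linalg.svd(F)
        F = U @ np.diag([S[0], S[1], 0]) @ Vt
        F = T2.T @ F @ T1
        return F / F[2, 2]

    # OpenCV's cv2.findFundamentalMat with the cv2.FM_8POINT flag provides an
    # off-the-shelf implementation of the same idea.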

stereo rectification

Epipolar lines for Set1

Epipolar lines for Set1

Rectified Stereo Images with parallel and aligned epipolar lines

Stereo Rectification is the process of transforming two given images so that their epipolar lines become aligned and parallel to the horizontal image axis. This requires generating two homography matrices, H1 and H2, for the left and right images respectively.

The first step is to compute a homography, H2, that moves the right-image epipole to infinity along the horizontal axis, making its epipolar lines parallel. Next, a homography, H1, is computed that transforms the left image so that its epipolar lines line up with those of the transformed right image.

Given both homography matrices, rectified images can be computed using inverse pixel mapping.
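A minimal sketch of this step using OpenCV's uncalibrated rectification (the images, matched points, and fundamental matrix are assumed to be available from the previous section; this is not necessarily how the original implementation is structured):

    import cv2
    import numpy as np

    def rectify_pair(img_left, img_right, pts1, pts2, F):
        """pts1, pts2: Nx2 float arrays of matched points; F: 3x3 fundamental matrix."""
        h, w = img_left.shape[:2]
        # H1, H2 send the epipoles to infinity and align the epipolar lines.
        ok, H1, H2 = cv2.stereoRectifyUncalibrated(pts1, pts2, F, (w, h))
        # warpPerspective applies each homography with inverse pixel mapping.
        rect_left = cv2.warpPerspective(img_left, H1, (w, h))
        rect_right = cv2.warpPerspective(img_right, H2, (w, h))
        return rect_left, rect_right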

camera intrinsic calibration from a single image

Estimating the camera intrinsic parameters is essential for any computer vision task. Usually this entails using a calibration rig of known dimensions containing a simple pattern such as a checkerboard. With at least 6 correspondences we obtain 12 constraints, since each correspondence provides two constraints (one in x and one in y). This lets us solve for the combined intrinsic-extrinsic projection matrix, a 3x4 matrix with 11 degrees of freedom.

An easier but less accurate way of estimating the intrinsic matrix is to make use of orthogonal planes and our knowledge of the world. Vanishing points in the image plane are the 2D projections of 3D points at infinity, and given the 2D coordinates of a vanishing point and the camera intrinsic matrix K, one can compute the corresponding 3D ray direction. Each pair of vanishing points arising from orthogonal directions therefore yields one constraint, so a minimum of 3 vanishing points gives 3 constraints on the intrinsic matrix. By assuming (1) square pixels and (2) zero skew, we reduce the intrinsics to 3 degrees of freedom and can estimate the camera parameters.
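A minimal numeric sketch of this constraint under the square-pixel, zero-skew assumption (the intrinsics and vanishing points are synthesized here for illustration; in practice the vanishing points come from intersections of detected parallel lines):

    import numpy as np

    # Synthesize three vanishing points from a known K and three mutually
    # orthogonal 3D directions, so the recovered intrinsics can be checked.
    K_true = np.array([[800.0, 0.0, 320.0],
                       [0.0, 800.0, 240.0],
                       [0.0,   0.0,   1.0]])
    Rm, _ = np.linalg.qr(np.random.default_rng(0).normal(size=(3, 3)))
    v1, v2, v3 = (K_true @ Rm[:, i] for i in range(3))     # homogeneous vanishing points

    # With square pixels and zero skew, omega = (K K^T)^{-1} has the form
    # [[w1, 0, w2], [0, w1, w3], [w2, w3, w4]], and each orthogonal pair
    # (vi, vj) contributes one linear constraint vi^T omega vj = 0.
    def constraint(vi, vj):
        return [vi[0]*vj[0] + vi[1]*vj[1],    # multiplies w1
                vi[0]*vj[2] + vi[2]*vj[0],    # multiplies w2
                vi[1]*vj[2] + vi[2]*vj[1],    # multiplies w3
                vi[2]*vj[2]]                  # multiplies w4

    A = np.array([constraint(v1, v2), constraint(v1, v3), constraint(v2, v3)])
    w1, w2, w3, w4 = np.linalg.svd(A)[2][-1]              # null space of A (up to scale)
    omega = np.array([[w1, 0, w2], [0, w1, w3], [w2, w3, w4]])
    if omega[0, 0] < 0:                                   # fix sign so omega is positive definite
        omega = -omega

    # omega = K^{-T} K^{-1}; a Cholesky factorization recovers K.
    L = np.linalg.cholesky(omega)                         # omega = L @ L.T
    K = np.linalg.inv(L.T)
    K /= K[2, 2]
    print(np.round(K, 2))                                 # should match K_true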

Lane Detection and Tracking using Birds-Eye view

In this project, the goal is to identify lane boundaries using traditional computer vision techniques and to compute the curvature of each lane and the vehicle's offset from the lane center; a minimal sketch of the birds-eye warp and lane fit follows the list below. Specifically:

  • Compute the camera calibration matrix and distortion coefficients given a set of chessboard images.

  • Apply a distortion correction to raw images.

  • Use color transforms, gradients, etc., to create a thresholded binary image.

  • Apply a perspective transform to rectify binary image ("birds-eye view").

  • Detect lane pixels and fit to find the lane boundary.

  • Determine the curvature of the lane and vehicle position with respect to the center of the lane.

  • Warp the detected lane boundaries back onto the original image.

  • Output visual display of the lane boundaries and numerical estimation of lane curvature and vehicle position.
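A rough sketch of the warp-and-fit steps (the source/destination points and the synthetic binary image below are placeholders, not the project's calibrated values):

    import cv2
    import numpy as np

    # Synthetic thresholded binary image with a gently curving "lane line".
    binary = np.zeros((720, 1280), dtype=np.uint8)
    ys = np.arange(460, 680)
    xs = (640 + 0.001 * (ys - 460) ** 2).astype(int)
    binary[ys, xs] = 255

    # Perspective transform from a road trapezoid to a birds-eye rectangle.
    src = np.float32([[580, 460], [700, 460], [1040, 680], [260, 680]])
    dst = np.float32([[300, 0], [980, 0], [980, 720], [300, 720]])
    M = cv2.getPerspectiveTransform(src, dst)
    Minv = cv2.getPerspectiveTransform(dst, src)   # to warp detections back later
    warped = cv2.warpPerspective(binary, M, (1280, 720))

    # Fit a second-order polynomial x = a*y^2 + b*y + c through the lane pixels.
    rows, cols = np.nonzero(warped)
    a, b, _ = np.polyfit(rows, cols, 2)

    # Radius of curvature evaluated at the bottom of the image (pixel units here;
    # in practice, convert with meters-per-pixel scale factors).
    y_eval = 719
    curvature = ((1 + (2 * a * y_eval + b) ** 2) ** 1.5) / abs(2 * a)
    print("curvature (px):", curvature)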

homography matrix estimation and inverse pixel mapping

Source Image which needs to be warped.

Warped Image using the Homography Matrix

The homography matrix (H) relates pixel coordinates in a source image to pixel coordinates in a destination image. It is a 3x3 matrix that transforms a source coordinate into the destination coordinate system. To find this matrix, we build up linear equations relating the two coordinate systems. Since each point correspondence gives us two constraints (x and y), we require a minimum of 4 corresponding points. The resulting overdetermined system of linear equations, Ax = 0, can then be solved using SVD subject to ||x|| = 1, which fixes the arbitrary scale of H. Because H is only defined up to scale, it has 8 degrees of freedom, and it is conventionally normalized so that its last element equals 1.
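A minimal DLT sketch under these assumptions (the example correspondences are made up for illustration; see the repository for the actual implementation):

    import numpy as np

    def estimate_homography(src_pts, dst_pts):
        """src_pts, dst_pts: Nx2 arrays of corresponding pixel coordinates, N >= 4."""
        A = []
        for (x, y), (u, v) in zip(src_pts, dst_pts):
            # Each correspondence contributes two rows of A in A h = 0.
            A.append([-x, -y, -1,  0,  0,  0, u * x, u * y, u])
            A.append([ 0,  0,  0, -x, -y, -1, v * x, v * y, v])
        # The solution is the unit-norm right singular vector with the
        # smallest singular value, i.e. the null vector of A.
        h = np.linalg.svd(np.array(A, dtype=float))[2][-1]
        H = h.reshape(3, 3)
        return H / H[2, 2]            # fix the free scale so H[2, 2] = 1

    # Example: the four corners of a unit square mapped to a quadrilateral.
    src = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
    dst = np.array([[10, 12], [200, 20], [190, 210], [15, 220]], dtype=float)
    H = estimate_homography(src, dst)
    p = H @ [1, 1, 1]
    print(np.round(p / p[2], 1))      # approximately [190, 210, 1]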

Please visit the GitHub repository for code implementation of this homography matrix estimation given a set of source and destination points.