# Depth Sensing using Semi-global Block Matching

In order to judge depth, humans use the difference between the left and right eye images. It is quite intuitive that objects that are nearer to the eyes generate a larger misalignment between images than objects that are further away. This misalignment, called visual disparity provides substantial depth cues which help in framing an understanding of the world around us and guide us how to respond to it.
Yet, while it is very easy to perceive three-dimensional structure using biological systems, it is a herculean task to imitate this behavior in a computer vision systems. One method called Stereo matching creates a three-dimensional reconstruction of the scene using an image set formed by images taken from different viewpoints. This job involves searching matching pixels between images and then transforming the measured disparity into 3D depth map using the scene and camera geometry information. The required output is a dense depth map which indicates correct depths for each pixel in the input images. To understand, how points in a scene are imaged at different orientations and positions based on their distances from the viewer, we need to first understand the geometry of the stereo matching.

### Epipolar Geometry

The primary objective of the stereo matching is to match a given pixel in one image with its corresponding pixel in the other image. At first, this task seems to require an exploration through the entire image. However, this exploration is reduced to a single line through epipolar constraint specific to the stereo systems. As shown in the above figure, a 3D point $p$ is viewed from 2 cameras with the corresponding projection at point $x_0$ and $x_1$. $c_0$ and $c_0$ are the center of the left and right camera respectively. Any point on the line $c_0p$ is always projected at point $x_0$ on left image. But this can lead to projection at different points on right image. In general, a point $x_0$ on left image projects onto a line segment in the right image called the epipolar line ($l_1$). the projection of the left camera center $c_0$ in the right image called the epipole $e_1$, and on the other end by the vanishing point of the viewing ray from the left camera to $p$. $px_0x_1$ form the epipolar plane. An object imaged on the epipolar line in the left image can only be imaged on the corresponding epipolar line in the right image, and thus the search for correspondences is reduced to a line. This is called epipolar constraint.

In the case of calibrated cameras whose relative position is represented by a rotation $R$ and translation $t$, we may express the epipolar constraint more formally. A pixel $x_0$ in the left image is then mapped onto the right image at location $x_1$ by the transformation.

$x_1 = Rx_0 + t$

Given that the rays $x_0$ and $x_1$ intersect at the point p, the vectors describing these rays, namely $x_1$ and $Rx_O$, and the vector connecting the two camera centers $c_1 - c_0 = t$ must be coplanar. Hence, the triple product is equal to zero.
$x_1.(t \times Rx_0) = 0$
Thus the epipolar constraint can be written as: $x_1^TEx_0 = 0$. Here $E$ is the essential matrix given by:
$E = \left[\begin{array}{ccc} 0 & -t_z & t_y \\ t_z & 0 & -t_x \\ -t_y & t_x & 0 \end{array} \right]R$

The essential matrix E maps a point $x_0$ in the left image to a line $l_1 = Ex_0$ in the right image.

In the case of uncalibrated cameras, a similar quantity to the essential matrix, known as the fundamental matrix, may also be computed from seven or more point matches. The fundamental matrix captures both the intrinsic and extrinsic parameters of the system of cameras but serves the same function of mapping points in the left image to lines in the right image. Even though the epipolar geometry constrains the search for potential correspondences along epipolar lines, the arbitrary orientations of these lines make it inconvenient for algorithms to compare pixels. To overcome this issue, the input images are commonly rectified so that the epipolar lines reduce to corresponding horizontal scanlines. Rectification is achieved by first rotating both cameras such that their optical axes are perpendicular to the line joining the two camera centers, or the baseline. Next, the vertical y-axis of the camera is rotated to become perpendicular to the baseline. Finally, the images are rescaled, if necessary, to compensate for differences in focal lengths.

After rectification, the camera geometry is transformed into a canonical form. By considering similar triangles in the ray diagram, we can derive a simple inverse relationship between depth $Z$ and disparity $d$ between corresponding pixels in the two images.

$d = \frac{Bf}{Z}$

In this equation, f is the focal length(measured in pixels), and B is the baseline. The disparity d describes the difference in location between corresponding pixels in the left and right images i.e. $(x_0,y_0)$ and $(x_1,y_1) = (x_0 + d,y_0)$. By estimating the disparity for every pixel in a reference image, we obtain a disparity map $d(x, y)$, from which the object depth can then be computed.

### Finding Correspondence

Over the years, numerous algorithms have been proposed to find correspondences in
a set of images. These may be broadly classified into two categories.
A) Finding Sparse Correspondence: In this equation, $f$ is the focal length(measured in pixels), and $B$ is the baseline. The disparity $d$ describes the difference in location between corresponding pixels in the left and right images i.e. $(x_0,y_0)$ and $(x_1,y_1) = (x_0 + d,y_0)$. By estimating the disparity for every pixel in a reference image, we obtain a disparity map $d(x, y)$, from which the object depth can then be computed.

B) Finding Dense Correspondence:  This method seeks to find a dense set of correspondences between two or more images, since recovering a smooth and detailed depth map is more useful for 3D modeling and rendering applications. Based on the taxonomy of dense correspondence algorithms proposed by Scharstein and Szeliski, such techniques typically involve the calculation and aggregation of matching costs, from which disparity values may then be computed and refined. Dense stereo techniques can further be divided into two main strategies: local and global. Local approaches determine the correspondence of a point by selecting the candidate point along the epipolar lines that minimizes a cost function. To reduce matching ambiguity, the matching costs are aggregated over a support window rather than point-wise computation. On the other hand, global methods generally do not aggregate the matching costs, but instead, rely on explicit smoothness assumptions to ensure accuracy. The objective of global stereo algorithms is to find the disparity assignment which minimizes an energy function that includes both data (cost) and smoothness terms. In this study, we deal primarily with the problem of finding dense correspondences.

### Similarity Measures

Regardless of the type of stereo algorithm used, a means of determining the similarity between pixels in different images is key to disambiguating potential matches and finding correspondences. In order to find where a given pixel $x = (x, y)$ in the left image $I_l$ appears in the right image $I_r$, we use a fixed support window sampled at locations $x_i = (x_i, y_i)$ around the pixel and measure the degree of similarity with $I_r$ at different disparities. A basic measure is the sum-of-squared-differences (SSD) function, given by:
$C_{SSD}(x,d) = \sum_{i}[I_r(x_i + d) - I_l(x_i)]^2$

### Semi-global Block Matching

Semi-global Block Matching (SGBM) successfully concatenates concepts of global and local stereo methods. SGBM is based on the concept of pixel-wise matching of mutual information and a global 2D smoothness constraint is approximated by merging many 1D constraints. The algorithm considers pairs of stereo images with known extrinsic and intrinsic parameters obtained from camera calibration. It presents a very good trade-off between accuracy and run-time and is resilient against radiometric differences and insensitive to the choice of parameters. The core algorithm can be divided into four distinct processing steps, namely, Pixel-wise Cost Calculation, Aggregation of Costs, Disparity computation, Disparity Refinement.

A) Pixel-wise cost calculation: The pixel-wise matching cost is computed for a base image pixel $p$ from its intensity $I_{bp}$ at the guesses correspondence $I_{mq}$ with $q = e_{bm}(p,d)$ of the match image. The function $e_{bm}(p,d)$ represents the epipolar line in the match image for pixel p of the base image with the line parameter or disparity $d$. For rectified images, with the match image on the right of the base image, $e_{bm}(p,d) = [p_x - d,p_y]^{T}$ with $d$ as disparity. The size and shape of the area is an important ingredient that is contemplated for matching. The resilience of matching is increased with large areas. However, the inferred supposition of unchanging disparity inside the area fails at discontinuities, leading to blurred object boundaries and fine structures. This supposition is discarded in this algorithm and the Pixel-wise cost is calculated using the measure of Birchfield and Tomasi\cite{similarity}, which is insensitive to sampling. The matching cost $C_{BT}(p,d)$ is computed as the minimum absolute difference of pixel values at p and in the range of half a pixel in each direction along the epipolar line. To compute the matching cost, first we define $\hat{I_R}$ as the linearly interpolated function between the sample points of the right scanline, then how well the intensity at $x_L$ fits into the linearly interpolated region surrounding $x_R$ is measured. That is, we define the following quantity:
$\bar{d}(x_L,x_R,I_L,I_R) = \min\limits_{x_R-\frac{1}{2}\le x\le x_R+\frac{1}{2}}\left|I_L(x_L) - \hat{I_R}(x)\right|$
Defining $\hat{I_L}$ similarly, we obtain a symmetric quantity:
$\bar{d}(x_R,x_L,I_R,I_L) = \min\limits_{x_L-\frac{1}{2}\le x\le x_L+\frac{1}{2}}|I_L(x) - \hat{I_R}(x_R)|$
the matching cost is defined as minimum of these two quantities:
$d(x_L,x_R) = \min{\{\bar{d}(x_L,x_R,I_L,I_R),\bar{d}(x_R,x_L,I_R,I_L) \}}$

B) Aggregation of Costs: Pixel-wise cost calculation is generally prone to errors and incorrect matches can easily have a lower cost than correct ones, due to noise, and so forth. Therefore, an additional constraint that penalizes change of neighboring disparities is used in order to support smoothness. The pixel-wise cost and the smoothness constraints are expressed in terms of energy $E(D)$:

$E(D) = \sum_{p}{\left(C(p,D_p) + \sum_{q\epsilon N_p}P_1T[|D_p - D_q| = 1] + \sum_{d\epsilon N_p}P_2T[|D_p - D_q| > 1] \right)}$
The first term is the sum of all pixel matching costs for the disparities of $D$. The second term adds a constant penalty $P_1$ for all pixels ($q$) in the neighborhood ($N_p$) of $p$, for which the disparity changes by small amount. The third term adds a larger constant penalty $P_2$, for all larger disparity changes. Using a lower penalty for small changes permits an adaptation to slanted or curved surfaces. The constant penalty for all larger changes preserves discontinuities. Discontinuities are often visible as intensity changes. This is exploited by adapting $P_2$ to the intensity gradient, i.e., $P_2 = \frac{{P_2'}}{|I_{bp} - I_{bq}|}$ for neighbouring pixels pixels $p$ and $q$ in the base image $I_b$.

The problem of stereo matching can now be formulated as finding the disparity image $D$ that minimizes the energy $E(D)$. Unfortunately, such a global minimization is NP-complete for many discontinuity preserving energies. In contrast, the minimization along individual image rows, can be performed efficiently in polynomial time using Dynamic Programming. The aggregated (smoothed) cost $S(p,d)$ for a pixel $p$ and disparity $d$ is calculated by summing the costs of all 1D minimum cost paths that end in $p$ at disparity $d$, as shown in Figure below.

These paths through disparity space are projected as straight lines into the base image but as non-straight lines into the corresponding match image, according to disparity changes along the paths. The cost $L'_r(p,d)$ along a path traversed in the direction $r$ of the pixel $p$ at disparity $d$ is defined recursively as
$L'_r(p,d) = C_{BT}(p,d) + \min{(L'_r(p - r,d), L'_r(p - r,d-1) + P_1, L'_r(p - r,d+1) + P_1, \min\limits_i{{L'_r(p - r,i)} + P_2)}}$
The pixel-wise matching cost is $C_{BT}$. The remainder of the equation adds the lowest cost of the previous pixel $(p - r)$ of the path, including the appropriate penalty for discontinuities. The values of latex $L'$ permanently increase along the path, which may lead to very large values. However it can be modified by subtracting the minimum path cost of the previous pixel from the whole term:
$L'_r(p,d) = C_{BT}(p,d) + \min{(L'_r(p - r,d), L'_r(p - r,d-1) + P_1, L'_r(p - r,d+1) + P_1, \min\limits_i{{L'_r(p - r,i)} + P_2)}} - \min\limits_k{L'_r(p - r,i)}$
The number of paths must be at least eight and should be less than 16 for providing a good coverage of the 2D image. In the latter case, paths that are not horizontal, vertical, or diagonal are implemented by going one step horizontal or vertical followed by one step diagonally:
$S(\textbf{p},d) = \sum_r{L_r(\textbf{p},d)}$
The upper limit for $S$ is determined as $S \le 16(C_{max} + P_2$, for 16 paths. An efficient implementation would pre-calculate the pixel-wise matching costs $C(p,d)$, down scaled to 11-bit integer values, i.e. $C_{max} \le 2^{11}$, by a factor $s$ if necessary. Scaling to 11-bit guarantees that the aggregated costs in subsequent calculations do not exceed the 16-bit limit. All costs are stored in a 16-bit array $C[]$ of size $W \times H \times D$. Thus, $C[p,d] = sC(p,d)$. A second 16-bit integer array $S[]$ of the same size is used for storing the aggregated cost values. The array is initialized by 0 values. The calculation starts for each direction $r$ at all pixels $b$ of the image border with $L_r{b,d} = C[b,d]$. The path is traversed in forward direction. For each visited pixel $p$ along the path, the costs $L_r(p,d)$ are added to the values $S[b,d]$ for all disparities $d$. The calculation of $L_r(p,d)$ requires O(D) steps at each pixel, since the minimum cost of the previous pixel is constant for all disparities of a pixel and can be pre-calculated. Each pixel is visited exactly 16 times, which results in a total complexity of O(W,H,D).

C) Disparity Computation: The disparity image $D_b$ that corresponds to the base image $I_b$ is determined as in local stereo methods by selecting for each pixel $p$ the disparity $d$ that corresponds to the minimum cost, i.e. $\min\limits_d{S[p,d]}$. For sub-pixel estimation, a quadratic curve is fitted through the neighboring costs, i.e. at the next higher and lower disparity, and the position of the minimum is calculated. Using a quadratic curve is theoretically justified only for correlation using the sum of squared differences. However, it is used as an approximation due to the simplicity of calculation. This supports fast computation. The disparity image $D_m$ that corresponds to the match image $I_m$ can be determined from the same costs by traversing the epipolar line that corresponds to the pixel $q$ of the match image. Again, the disparity $d$ is selected, which corresponds to the minimum cost, that is, $\min\limits_d{S[e_{mb}(q,d),d]}$. However, the cost aggregation step does not treat the base and match images symmetrically. Slightly better results can be expected, if $D_m$ is calculated from scratch by performing pixel-wise matching and aggregation again but with $I_m$ as the base and $I_b$ as match image. It depends on the application whether or not an increased run-time is acceptable for slightly better object borders. Outliers are filtered from $D_b$ and $D_m$ using a median filter with a small window ($3 \times 3$). The calculation of $D_b$, as well as $D_m$, permits the determination of occlusions and false matches by performing a consistency check. Each disparity of $D_b$ is compared to its corresponding disparity of $D_m$. The disparity is set to invalid $D_{inv}$ if both differ:
$D_p = \begin{cases} D_{bp} \text{ if} |D_{bp} - D_{mq} \le 1|\\ D_{inv} \text{ otherwise} \end{cases}$
$q = e_{bm}(\textbf{p},D_{bp})$
The consistency check enforces the uniqueness constraint, by permitting one to one mappings only. The disparity computation and consistency check require visiting each pixel at each disparity a constant number of times. Thus, the complexity of this step is again O(WHD).

D) Disparity Refinement: The resulting disparity image can still contain certain kinds of errors. Furthermore, there are generally areas of invalid values that need to be recovered. Both can be handled by post-processing of the disparity image. The post-processing may include steps like removal of peaks, intensity consistent disparity selection and discontinuity preserving interpolation.

### Setup

For the setup 2 cameras (both Logitech C270) are used. For the camera calibration, lots of pictures of the chessboard were taken. It’s important that the chessboard is visible for both cameras simultaneously, and optimally it should be photographed from several different angles and positions. For camera calibration and 3D reconstruction, OpenCV library was used.

Once calibration is done its necessary to make sure that both the cameras capture frame simultaneously otherwise it would be a problem when objects are moving. To solve this problem, multi-threaded programming has been employed. After turning on the camera, the job of reading a frame is given to a thread for each camera. Since running several threads is similar to running several programmes concurrently, captured data almost looks synchronized. Now the captured images are undistorted and rectified, followed by cost calculation, aggregation, disparity computation and disparity refinement.

The following code shows the Python implementation of the process.

import sys
import cv2
import threading
from DataQueue import DataQueue
from FrameQueue import FrameQueue
from stereovision.calibration import StereoCalibration
import time
import os
import threading
import numpy as np
from matplotlib import pyplot as plt
import time
import scipy.misc

right = cv2.VideoCapture(1) # Right Camera
left = cv2.VideoCapture(0) # Left Camera
calibration = StereoCalibration()

# Camera Parmeters
roi1 = np.load('roi1.npy')
roi2 = np.load('roi2.npy')
xx1 = max(roi1[0], roi2[0])
yy1 = max(roi1[1], roi2[1])
xx2 = min(roi1[2], roi2[2])
yy2 = min(roi1[3], roi2[3])
Q = np.load('Q.npy')

# Camera Threads
class MyThread1(threading.Thread):
def __init__(self, threadID, name, cam):
threading.Thread.__init__(self)
self.cam = cam
self.id = threadID
self.name = name
self.queue = FrameQueue()
self.dataAvailable = 0
def run(self):
ret, frame = self.cam.read()
self.queue.put(frame)

# Semi-global Block Matching Algorithm
bm1 = cv2.StereoSGBM_create(minDisparity = 0,
numDisparities = 192,
blockSize = 5,
P1 = 600,
P2 = 2400,
disp12MaxDiff = 2,
preFilterCap = 63,
uniquenessRatio = 15,
speckleWindowSize = 200,
speckleRange = 32, mode = 5)

# Combined Red Mask for left and right camera images
def red_mask_or(left,right):
'''
left: Left Camera Image
right: Right Camera Image
'''
param1=[0, 0, 30]
param2=[30, 30, 190]
kernel = np.ones((2,2),np.uint8)
lowerLed = np.array(param1)    # Convert the parameters into a form that OpenCV can understand
upperLed = np.array(param2)
mask_left = cv2.inRange(left, lowerLed, upperLed)
mask_left = cv2.erode(mask_left,kernel,iterations = 1)
mask_left = cv2.dilate(mask_left,kernel,iterations = 2)
mask_right = cv2.inRange(right, lowerLed, upperLed)
mask_right = cv2.erode(mask_right,kernel,iterations = 1)
mask_right = cv2.dilate(mask_right,kernel,iterations = 2)
mask_or=np.bitwise_or(mask_left,mask_right)
return(mask_or)

# Disparity Map Calculation
def computeDisparity():
images = []
thread1 = MyThread1(1, 'left', left) # Left Camera Thread
thread2 = MyThread1(0, 'right', right) # Right Camera Thread

thread1.start()
thread2.start()
thread1.join()
thread2.join()

rgb_left = thread1.queue.tryGet()
rgb_right = thread2.queue.tryGet()

or_image=red_mask_or(rgb_left,rgb_right)  # Combined Red Mask

rgb_left = scipy.misc.imrotate(rgb_left,180)
images.append(rgb_right)
images.append(rgb_left)
rectified_pair = calibration.rectify(images)  # Rectification
disparity = bm1.compute(rectified_pair[0][xx1:xx2,yy1:yy2],rectified_pair[1][xx1:xx2,yy1:yy2])#.astype(np.float32) / 16.0<span id="mce_SELREST_start" style="overflow:hidden;line-height:0;"></span>
return(disparity,or_image)


### Results

The following two images display the original BGR images captured by the left and right camera respectively.

The images are then rectified and undistorted resulting in following two images.

Finally, the disparity map is generated as shown below.

In disparity map, nearby objects are depicted in red and faraway objects in blue. As the distance increases, the color gets transformed from red to green and then to blue.

The following slideshow shows the disparity generation for another set of images.

This slideshow requires JavaScript.