Ivan Kirigin
{ Computer Vision Specialist }

{ Research} Advised by Sanjiv Singh at the Robotics Institute, Carnegie Mellon University

{Tracking in X,Y,Theta}

Tracking in X, Y, Theta
Derivation of a 3-Dimensional Kanade-Lucas-Tomasi tracker

Begin with an image I to be aligned to image J, indexed by , the pixel locations and degree to which the image is rotated about the image center.

Given a warping of image I by an adjustment of each pixel location d, a good error function for the alignment is the following:

where w(X) is a weight assigned to each location. Here, it is assigned to be 1 everywhere, and will be neglected. To minimize this error to find an optimal alignment, we can take the derivative and set to 0, but first take the Taylor expansion of an image in this representation to the first term about:

In this case, expand I(X + d) about X:

Put this in the error term:

Take the derivative with respect to d:

Set to zero, and separate terms:

Label these:


Z is a 3x3 matrix:

Each term in Z is the sum of the appropriate gradients over all pixels. This yields a clear system of equations:

This can be solved by inverting
Z:

This is a formulation for an algorithm to find the warping needed to align two images. This alignment is usually done for many feature templates in an image. There are a few particulars in implementation:

• Because of the use of the Taylor series, the displacement is very small, so this must be iterative.

• Though iterative, the derivatives of the target needn’t be recomputed, because ‘moving’ the template
I over J doesn’t change. The residual J(X) - I(X) does change, but is much faster to compute. This makes the algorithm real time.

• Derivatives with respect to must be weighted by the pixel location in the template. If derivatives along image axes
(u, v) are considered changes over a small distance (such as a pixel and perhaps weighted by a Gaussian kernel to reduce noise), the distance over which a derivative in is to be measured varies by the distance of that point to the point of rotation.

• Stopping conditions are important. In my implementation there were three:

o Reaching a maximum number of iterations,
o
Z has 0 determinant, meaning it is not invertible (usually caused by texture-less locations in the image)
o Very small residual
J(X) - I(X)

• Success conditions on whether a particular feature template is tracked well are more difficult. Here are three possibilities:

o Features which aligned with large final residuals could be rejected
o Features which do not converge should be rejected

o In a shape-from-motion (SFM) framework, the geometric consistency of the features could be incorporated. This is a solution to the Simultaneous Localization and Mapping (SLAM) problem, where the tracker is the mechanism for data association.

Initial locations for features should be chosen such that the algorithm is well conditioned, i.e. the area around the feature has large gradients in all dimensions used. These features are termed Harris features and are described in detail in the paper “Good Features to Track” by Shi & Tomasi.

• Feature reacquisition, i.e. correlating currently seen features with features found in the past, is very difficult, but essential to solve the SLAM problem. In such a framework, features could be tested for correlation by checking a residual between features which are estimated to belong to the same point in the world. This is fairly unreliable, and a better framework to solve this (posed as an object recognition algorithm but is very appropriate here) has been developed by David Lowe, a Shift Invariant Feature Detector (SIFT).

The following is a very simple demonstration of the tracker, on a synthetic sequence to emphasize that a lower dimensional tracker would fail, especially at greater motion. The 5 features are correctly tracked throughout the entire sequence.


View a video sequence: wheel_10fps.mov

Back to top

Professional
{ Research }
{ I-Kbot }