RHEX is a robotic hexapod. It has six compliant (bendable) legs, each powered by a single motor. This allows for a wide variety of gaits, which give RHEX the versatility to traverse a large range of environments, from stairs to mud. The simplicity of the design also makes it very robust. More information about RHEX can be found at RHEX's website, www.rhex.net (unfortunately, this website is down).
While RHEX can reliably traverse many terrains, it does so without control. To control RHEX, the environment must be engineered. If bright pink balls are placed in its environment, RHEX could succeed in simultaneous localization and mapping (SLAM). Positive results have also been generated by combining camera and inertial sensors for tasks such as line following. To see an example, view this clip: Running.mov
The problem with these approaches is that they rely on an engineered environment, which is rarely available in the real world.
I have been working with an image sequence that does not assume an engineered environment. Using inertial sensors and a camera with a very wide-angle field of view, I have generated some positive results.
Here is a look at the challenge I am up against when trying to develop perception for RHEX:
This video is a good example of the extreme motion RHEX undergoes. The sequence looks warped because it was taken with a wide-angle lens. The lens can view 180 degrees of the environment, giving excellent situational awareness. This makes the extreme motion easier to handle, because most things in the image remain visible from frame to frame, yet the sequence is still quite challenging.
An initial result came from applying traditional tracking methods to the sequence:
This tracking result is rather poor. Many features are lost, and many others are not tracked properly. The following result adds image stabilization about the optical axis, using an inertial measurement device:
This result is significantly better. Qualitatively, the image sequence is notably easier to follow. Features are retained longer and mis-tracked less often. The histograms below demonstrate this quantitatively: they compare the number of features tracked between images, but not the number of mis-tracks. Note the dramatic increase in the mean number of features tracked when stabilizing with inertial sensors.
Histogram: KLT, no stabilization (axis: number of features tracked, of 50)
Histogram: KLT with stabilization (axis: number of features tracked, of 50)
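As a rough illustration of what stabilization about the optical axis means, the sketch below rotates pixel coordinates about the image center by the negative of a roll angle. It is only a sketch under stated assumptions: the function names, the sign convention for roll, and the choice of the image center as the rotation center are mine, and the wide-angle lens distortion is ignored.

```python
import math

def stabilize_points(points, roll_rad, center):
    """Rotate pixel coordinates by -roll_rad about `center` to cancel camera roll.

    Assumptions (not from the original work): roll_rad comes from an inertial
    sensor, and the optical axis passes through `center`.
    """
    cx, cy = center
    c, s = math.cos(-roll_rad), math.sin(-roll_rad)
    out = []
    for x, y in points:
        dx, dy = x - cx, y - cy
        out.append((cx + c * dx - s * dy, cy + s * dx + c * dy))
    return out

def apply_roll(points, roll_rad, center):
    """Simulate the image motion a roll of roll_rad would cause (inverse of the above)."""
    return stabilize_points(points, -roll_rad, center)
```

Applying `apply_roll` and then `stabilize_points` with the same angle returns the original coordinates, which is the property a tracker benefits from: the dominant rotation is removed before features are compared.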
Yet this is still not good enough. Even if 80% of the features are tracked correctly, 20% are lost between each pair of images. In addition, mis-tracks are hard to detect, because the tracker settles on a local maximum that does not stand out from a correct track. This is especially bad considering that one major application of this tracking information is shape-from-motion, which is highly sensitive to mis-tracked features and usually requires long-lasting features. An alternative method is needed.
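To see why an 80% retention rate is a problem for long-lasting features, here is a small back-of-the-envelope calculation. It assumes, purely for illustration, that each feature is lost independently with the same probability at every frame; the function names are hypothetical.

```python
def surviving_fraction(per_frame_retention, n_frames):
    """Fraction of the initial features still tracked after n_frames transitions,
    assuming independent, identical loss at every frame (an idealization)."""
    return per_frame_retention ** n_frames

def expected_lifetime(per_frame_retention):
    """Mean number of frame-to-frame transitions a feature survives
    under the same geometric model."""
    return per_frame_retention / (1.0 - per_frame_retention)

# With 80% retention, roughly 11% of features remain after 10 frames,
# and a feature survives about 4 transitions on average.
fraction_after_10 = surviving_fraction(0.8, 10)
mean_life = expected_lifetime(0.8)
```

Under this model, of 50 initial features only about 5 would still be tracked after 10 frames, which is far too short-lived for shape-from-motion.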
Scale Invariant Feature Transform (SIFT) features are high-dimensional local image features whose values do not change with rotation, changes in brightness, or changes in scale. This means that the data association problem between features seen before and those currently seen becomes a k-nearest-neighbor problem. The framework is suitable for a fairly robust solution to the general object recognition problem.
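The nearest-neighbor view of data association can be sketched as follows. This is a brute-force illustration, not the implementation used here: descriptors are plain tuples of numbers, the function names are hypothetical, and the ratio check against the second-nearest neighbor is an assumed way of rejecting ambiguous matches.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def match_descriptors(desc_a, desc_b, ratio=0.8):
    """Return (i, j) pairs where desc_b[j] is the nearest neighbor of desc_a[i].

    A match is kept only if the nearest distance is below `ratio` times the
    second-nearest distance (assumed ambiguity filter; `ratio` is illustrative).
    """
    matches = []
    for i, a in enumerate(desc_a):
        dists = sorted((euclidean(a, b), j) for j, b in enumerate(desc_b))
        if len(dists) < 2:
            continue  # need two candidates to judge ambiguity
        (d1, j1), (d2, _) = dists[0], dists[1]
        if d1 < ratio * d2:
            matches.append((i, j1))
    return matches
```

A descriptor that is equally close to two candidates is dropped rather than guessed, which keeps the match set small but clean.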
Here, it is adapted to act as a robust tracker, where features are extracted from each pair of consecutive images and correlated to find matches. In the following sequence, the feature tracking is significantly better than with traditional tracking methods. Furthermore, the few errors that exist do not come from local maxima in a gradient descent search, so they are not near the correct solution. This means that filtering the solution, for instance by using RANSAC, is far easier.
In the video below, the left image sequence has not been stabilized, so the dominant motion is the roll. The output from the tracker could be used to stabilize the sequence, in place of the more expensive inertial measurement devices that are not saturated by such large motion. On the right, the image sequence has already been stabilized, so the derived information could be something higher level, like the gait of RHEX or the turning direction. Either of these could be incorporated into a larger shape-from-motion framework, which is highly sensitive to the quality of the tracking data.
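Because wrong matches land far from the correct solution, a simple consensus filter separates them easily. The sketch below is a minimal RANSAC loop under a deliberately simplified assumption: the motion between the two images is modeled as a pure 2D translation, which is not the motion model of the real sequence. The function name, threshold, and iteration count are illustrative.

```python
import random

def ransac_translation(matches, threshold=3.0, iterations=100, seed=0):
    """Filter matches with RANSAC, assuming a pure 2D translation (simplification).

    matches: list of ((x0, y0), (x1, y1)) point pairs between two images.
    Returns (best_shift, inlier_indices).
    """
    rng = random.Random(seed)
    best_shift, best_inliers = None, []
    for _ in range(iterations):
        # One match is enough to hypothesize a translation.
        (x0, y0), (x1, y1) = rng.choice(matches)
        dx, dy = x1 - x0, y1 - y0
        # Count the matches that agree with this hypothesis.
        inliers = [
            k for k, ((ax, ay), (bx, by)) in enumerate(matches)
            if (bx - ax - dx) ** 2 + (by - ay - dy) ** 2 <= threshold ** 2
        ]
        if len(inliers) > len(best_inliers):
            best_shift, best_inliers = (dx, dy), inliers
    return best_shift, best_inliers
```

The same loop structure carries over to richer motion models; only the hypothesis step and the residual change.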
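One way the tracker output could be turned into a roll estimate is a least-squares fit of a single rotation angle to the matched features. The sketch below assumes the motion between frames is a pure rotation about the image center, and it ignores translation and lens distortion; these are my simplifications, and the function name is hypothetical.

```python
import math

def estimate_roll(matches, center):
    """Least-squares rotation angle (radians) about `center` mapping the first
    point of each match onto the second.

    Assumes pure rotation about the image center (a simplification).
    """
    cx, cy = center
    sum_dot = 0.0
    sum_cross = 0.0
    for (x0, y0), (x1, y1) in matches:
        px, py = x0 - cx, y0 - cy
        qx, qy = x1 - cx, y1 - cy
        sum_dot += px * qx + py * qy      # cosine-weighted term
        sum_cross += px * qy - py * qx    # sine-weighted term
    return math.atan2(sum_cross, sum_dot)
```

The recovered angle could then drive the same kind of stabilization that the inertial sensor provides.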
An alternative to this metric estimation of motion and the environment is to forgo all metric estimates and simply focus on a task to be accomplished. The SIFT tracking framework can be applied to a visual servoing problem, where the goal is to seek some larger component of the image. In the image sequence below, a seed template is given on the left, and the estimate for the location of this template is in the red box on the right. Basically, SIFT features are extracted from the template and from the image, and are correlated to find matching features, which yields an estimate for the location of the template in the image. If a correlation is unsuccessful, the template on the left updates to the last estimated location of the template in the last image where tracking was successful, and another comparison is done. The template is initially masked to ensure that no background is included, which could potentially cause the tracker to 'focus' on the background and track it instead. This masking can be done automatically by using dense wide-baseline stereo on two templates to segment the foreground and mask out the background.
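The step from matched features to a template location can be sketched like this. It is a simplified illustration, not the method used here: it assumes the template appears in the image shifted but not rotated or scaled, takes the median offset of the matches as the estimate, and reports failure when too few matches are found. The function name and the `min_matches` value are hypothetical.

```python
def locate_template(matches, template_size, min_matches=3):
    """Estimate the template's bounding box in the image from feature matches.

    matches: list of ((tx, ty), (ix, iy)) template-to-image feature pairs.
    template_size: (width, height) of the template in pixels.
    Returns (left, top, width, height), or None if tracking is judged to fail.
    Assumes a pure shift between template and image (a simplification).
    """
    if len(matches) < min_matches:
        return None  # caller would refresh the template and compare again
    offsets_x = sorted(ix - tx for (tx, _), (ix, _) in matches)
    offsets_y = sorted(iy - ty for (_, ty), (_, iy) in matches)
    mid = len(matches) // 2  # median (upper median for even counts)
    width, height = template_size
    return (offsets_x[mid], offsets_y[mid], width, height)
```

A `None` result corresponds to the unsuccessful correlation described above, at which point the template would be replaced by the last successfully tracked region.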
Note that despite the extreme image motion, warping because of the wide angle field of view, changes in 3D perspective, changes in lighting, and changes in the background, the tracking is still successful. Furthermore, if the search in the image is limited to a sub-window based on the last known location updated by inertial measurements, this can run in real time on a modern CPU.
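The sub-window idea can be sketched as follows: predict where the last known box has moved using the roll change reported by the inertial sensor, then search only a padded region around that prediction. The rotation-about-image-center model, the margin, and the function name are assumptions made for illustration.

```python
import math

def search_window(last_box, roll_delta_rad, image_size, margin=20):
    """Return (x0, y0, x1, y1), the region of the image to search.

    last_box: (left, top, width, height) of the last known template location.
    roll_delta_rad: roll change since that frame, from inertial measurements
                    (assumed to rotate the image about its center).
    """
    left, top, w, h = last_box
    img_w, img_h = image_size
    cx, cy = img_w / 2.0, img_h / 2.0
    # Rotate the box center about the image center by the measured roll change.
    bx, by = left + w / 2.0 - cx, top + h / 2.0 - cy
    c, s = math.cos(roll_delta_rad), math.sin(roll_delta_rad)
    px, py = cx + c * bx - s * by, cy + s * bx + c * by
    # Half the box diagonal covers the box at any orientation; add a margin.
    half = math.hypot(w, h) / 2.0 + margin
    x0 = max(0, int(math.floor(px - half)))
    y0 = max(0, int(math.floor(py - half)))
    x1 = min(img_w, int(math.ceil(px + half)))
    y1 = min(img_h, int(math.ceil(py + half)))
    return (x0, y0, x1, y1)
```

Features would then be extracted only inside this window, which is what keeps the per-frame cost low.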