Wednesday, September 28, 2011
Title: A Coarse-to-fine Approach for Fast Deformable Object Detection
Authors: Marco Pedersoli, Andrea Vedaldi, Jordi González
Abstract
We present a method that can dramatically accelerate object detection with part-based models. The method is based on the observation that the cost of detection is likely to be dominated by the cost of matching each part to the image, and not by the cost of computing the optimal configuration of the parts, as commonly assumed. Accelerating detection therefore requires minimizing the number of part-to-image comparisons. To this end we propose a multiple-resolution hierarchical part-based model and a corresponding coarse-to-fine inference procedure that recursively eliminates unpromising part placements from the search space. We evaluate our method extensively on the PASCAL VOC and INRIA datasets, demonstrating a very large increase in detection speed with little degradation in accuracy.
Paper Link
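To make the coarse-to-fine idea concrete, here is a minimal Python sketch (not the authors' implementation) of the pruning step the abstract describes: a cheap, low-resolution root filter is scored everywhere, only the most promising placements survive, and the expensive higher-resolution part filters are evaluated just there. The filter shapes, the 2x resolution step, the pruning parameters, and the omission of deformation costs are all simplifying assumptions.

```python
import numpy as np

def correlate(feature_map, filt):
    """Dense cross-correlation of a filter with a feature map (valid placements)."""
    H, W = feature_map.shape
    h, w = filt.shape
    scores = np.empty((H - h + 1, W - w + 1))
    for y in range(scores.shape[0]):
        for x in range(scores.shape[1]):
            scores[y, x] = np.sum(feature_map[y:y + h, x:x + w] * filt)
    return scores

def coarse_to_fine_detect(coarse_map, fine_map, root_filter, part_filters,
                          prune_thresh=0.0, keep_frac=0.1):
    """Score the cheap coarse (root) filter everywhere, keep only the most
    promising placements, and evaluate the expensive part filters only there.
    `fine_map` is assumed to be at twice the resolution of `coarse_map`, and
    `part_filters` is a list of (filter, (dy, dx)) anchor offsets in fine cells."""
    root_scores = correlate(coarse_map, root_filter)

    # Prune: keep at most the top `keep_frac` fraction of coarse placements
    # that also exceed the absolute threshold.
    cutoff = max(prune_thresh, np.quantile(root_scores, 1.0 - keep_frac))
    candidates = np.argwhere(root_scores >= cutoff)

    detections = []
    for cy, cx in candidates:
        score = root_scores[cy, cx]
        for filt, (dy, dx) in part_filters:
            py, px = 2 * cy + dy, 2 * cx + dx          # coarse cell -> fine cell
            h, w = filt.shape
            if py + h <= fine_map.shape[0] and px + w <= fine_map.shape[1]:
                score += np.sum(fine_map[py:py + h, px:px + w] * filt)
        detections.append(((cy, cx), score))
    return sorted(detections, key=lambda d: -d[1])
```

The point of the pruning step is that the part filters, which dominate the matching cost, are only ever applied at the small set of coarse placements that survive the cutoff.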
Wednesday, September 21, 2011
Lab Meeting September 22nd, 2011 (Jimmy): Vector Field SLAM
Title: Vector Field SLAM
Authors: Jens-Steffen Gutmann, Gabriel Brisson, Ethan Eade, Philip Fong and Mario Munich
In: ICRA 2010
Abstract
Localization in unknown environments using low-cost sensors remains a challenge. This paper presents a new localization approach that learns the spatial variation of an observed continuous signal. We model the signal as a piecewise linear function and estimate its parameters using a simultaneous localization and mapping (SLAM) approach. We apply our framework to a sensor measuring bearing to active beacons where measurements are systematically distorted due to occlusion and signal reflections of walls and other objects present in the environment. Experimental results from running GraphSLAM and EKF-SLAM on manually collected sensor measurements as well as on data recorded on a vacuum-cleaner robot validate our model.
[pdf]
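The core of the measurement model is the piecewise-linear representation of the continuous signal. As a rough illustration (not the authors' code), the Python sketch below predicts the expected signal at a robot position by bilinear interpolation of the signal values stored at the four surrounding grid nodes; the grid layout, cell size, and variable names are assumptions.

```python
import numpy as np

def bilinear_measurement(pose_xy, node_values, cell_size):
    """Predict the signal at the robot position as a bilinear (piecewise-linear)
    interpolation of the signal values stored at the four surrounding grid nodes.
    `node_values` maps integer grid coordinates (i, j) to a signal vector."""
    gx, gy = pose_xy[0] / cell_size, pose_xy[1] / cell_size
    i, j = int(np.floor(gx)), int(np.floor(gy))
    u, v = gx - i, gy - j                               # position inside the cell
    weights = {(i, j):         (1 - u) * (1 - v),
               (i + 1, j):     u * (1 - v),
               (i, j + 1):     (1 - u) * v,
               (i + 1, j + 1): u * v}
    predicted = sum(w * np.asarray(node_values[node]) for node, w in weights.items())
    return predicted, weights
```

In an EKF-SLAM or GraphSLAM formulation, those grid-node signal values become part of the estimated state, and the interpolation weights returned above serve directly as the Jacobian entries of the (linear) measurement with respect to them.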
Sunday, September 18, 2011
Lab Meeting September 22nd, 2011 (Jim): Learning the semantics of object–action relations by observation
Title: Learning the semantics of object–action relations by observation
Author: Eren Erdal Aksoy, Alexey Abramov, Johannes Dörr, Kejun Ning, Babette Dellen, and Florentin Wörgötter
The International Journal of Robotics Research 2011;30 1229-1249
Abstract:
Recognizing manipulations performed by a human and the transfer and execution of this by a robot is a difficult problem. We address this in the current study by introducing a novel representation of the relations between objects at decisive time points during a manipulation. Thereby, we encode the essential changes in a visual scenery in a condensed way such that a robot can recognize and learn a manipulation without prior object knowledge. To achieve this we continuously track image segments in the video and construct a dynamic graph sequence. Topological transitions of those graphs occur whenever a spatial relation between some segments has changed in a discontinuous way and these moments are stored in a transition matrix called the semantic event chain (SEC). We demonstrate that these time points are highly descriptive for distinguishing between different manipulations. Employing simple sub-string search algorithms, SECs can be compared and type-similar manipulations can be recognized with high confidence. As the approach is generic, statistical learning can be used to find the archetypal SEC of a given manipulation class. ...
http://ijr.sagepub.com/content/30/10/1229.full.pdf+html
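To illustrate how semantic event chains (SECs) can be compared with simple sub-string search, here is a small Python sketch. The relation encoding (0/1/2), the brute-force row correspondence, and the similarity score are illustrative assumptions rather than the exact measure used in the paper, and the search is only practical for small toy SECs.

```python
from itertools import permutations

def row_similarity(row_a, row_b):
    """Similarity of two relation sequences: best fraction of positions matched
    when the shorter sequence is slid over the longer one as a sub-string."""
    short, long_ = (row_a, row_b) if len(row_a) <= len(row_b) else (row_b, row_a)
    best = 0
    for start in range(len(long_) - len(short) + 1):
        matches = sum(s == l for s, l in zip(short, long_[start:start + len(short)]))
        best = max(best, matches)
    return best / max(len(short), 1)

def sec_similarity(sec_a, sec_b):
    """Compare two semantic event chains (lists of relation sequences, one per
    object pair) by trying every row correspondence and keeping the best one."""
    if len(sec_a) > len(sec_b):
        sec_a, sec_b = sec_b, sec_a
    best = 0.0
    for perm in permutations(range(len(sec_b)), len(sec_a)):
        score = sum(row_similarity(sec_a[i], sec_b[j]) for i, j in enumerate(perm))
        best = max(best, score / len(sec_a))
    return best

# Toy example: relation codes per object pair at decisive time points
# (0 = not touching, 1 = touching, 2 = overlapping).
push_a = [[0, 1, 1, 0], [0, 0, 1, 1]]
push_b = [[0, 0, 1, 1, 0], [0, 0, 0, 1, 1]]
print(sec_similarity(push_a, push_b))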
Friday, September 09, 2011
Lab Meeting September 9th, 2011 (Steven): Learning Generic Invariances in Object Recognition: Translation and Scale
Title: Learning Generic Invariances in Object Recognition: Translation and Scale
Authors: Joel Z Leibo, Jim Mutch, Lorenzo Rosasco, Shimon Ullman, and Tomaso Poggio
Abstract:
Invariance to various transformations is key to object recognition but existing definitions of invariance are somewhat confusing while discussions of invariance are often confused. In this report, we provide an operational definition of invariance by formally defining perceptual tasks as classification problems. The definition should be appropriate for physiology, psychophysics and computational modeling.
For any specific object, invariance can be trivially “learned” by memorizing a sufficient number of example images of the transformed object. While our formal definition of invariance also covers such cases, this report focuses instead on invariance from very few images and mostly on invariances from one example. Image-plane invariances – such as translation, rotation and scaling – can be computed from a single image for any object. They are called generic since in principle they can be hardwired or learned (during development) for any object.
In this perspective, we characterize the invariance range of a class of feedforward architectures for visual recognition that mimic the hierarchical organization of the ventral stream.
We show that this class of models achieves essentially perfect translation and scaling invariance for novel images. In this architecture a new image is represented in terms of weights of “templates” (e.g. “centers” or “basis functions”) at each level in the hierarchy. Such a representation inherits the invariance of each template, which is implemented through replication of the corresponding “simple” units across positions or scales and their “association” in a “complex” unit. We show simulations on real images that characterize the type and number of templates needed to support the invariant recognition of novel objects. We find that 1) the templates need not be visually similar to the target objects and that 2) a very small number of them is sufficient for good recognition.
These somewhat surprising empirical results have intriguing implications for the learning of invariant recognition during the development of a biological organism, such as a human baby. In particular, we conjecture that invariance to translation and scale may be learned by the association – through temporal contiguity – of a small number of primal templates, that is, patches extracted from the images of an object moving on the retina across positions and scales. The number of templates can later be augmented by bootstrapping mechanisms using the correspondence provided by the primal templates – without the need for temporal contiguity.
Link
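The “simple unit / complex unit” construction described above can be illustrated in a few lines of Python: a simple unit replicates one template across every position (normalized correlation), and a complex unit max-pools those responses, so the resulting signature does not change when the object translates. This is only a toy sketch of the general idea, not the model used in the report; the template sizes and the toy test are arbitrary choices.

```python
import numpy as np

def simple_unit_responses(image, template):
    """'Simple' units: normalized correlation of one template replicated
    across every image position (valid placements only)."""
    h, w = template.shape
    t = (template - template.mean()) / (template.std() + 1e-8)
    H, W = image.shape
    out = np.empty((H - h + 1, W - w + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            patch = image[y:y + h, x:x + w]
            p = (patch - patch.mean()) / (patch.std() + 1e-8)
            out[y, x] = np.mean(p * t)
    return out

def complex_unit_signature(image, templates):
    """'Complex' units: max-pool each template's responses over all positions.
    The resulting vector is unchanged when the object translates in the image."""
    return np.array([simple_unit_responses(image, t).max() for t in templates])

# Toy check: an object's signature is (nearly) the same after translation.
rng = np.random.default_rng(0)
templates = [rng.standard_normal((8, 8)) for _ in range(5)]
obj = rng.standard_normal((16, 16))
scene_a = np.zeros((64, 64)); scene_a[5:21, 5:21] = obj
scene_b = np.zeros((64, 64)); scene_b[30:46, 40:56] = obj
print(np.allclose(complex_unit_signature(scene_a, templates),
                  complex_unit_signature(scene_b, templates), atol=1e-6))
```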
Thursday, September 08, 2011
Lab Meeting September 9th, 2011 (Chih Chung): Identification and Representation of Homotopy (RSS 2011 Best paper)
Title: Identification and Representation of Homotopy Classes of Trajectories for Search-based Path Planning in 3D
Authors: Subhrajit Bhattacharya, Maxim Likhachev and Vijay Kumar
Abstract: There are many applications in motion planning where it is important to consider and distinguish between different homotopy classes of trajectories. Two trajectories are homotopic if one trajectory can be continuously deformed into another without passing through an obstacle, and a homotopy class is a collection of homotopic trajectories. In this paper we consider the problem of robot exploration and planning in three-dimensional configuration spaces to (a) identify and classify different homotopy classes; and (b) plan trajectories constrained to certain homotopy classes or avoiding specified homotopy classes. In previous work [1] we have solved this problem for two-dimensional, static environments using the Cauchy Integral Theorem in concert with graph search techniques. The robot workspace is mapped to the complex plane and obstacles are poles in this plane. The Residue Theorem allows the use of integration along the path to distinguish between trajectories in different homotopy classes. However, this idea is fundamentally limited to two dimensions. In this work we develop new techniques to solve the same problem, but in three dimensions, using theorems from electromagnetism. The Biot-Savart law lets us design an appropriate vector field, the line integral of which, using the integral form of Ampère's Law, encodes information about homotopy classes in three dimensions. Skeletons of obstacles in the robot world are extracted and are modeled by current-carrying conductors. We describe the development of a practical graph-search based planning tool with theoretical guarantees by combining integration theory with search techniques, and illustrate it with examples in three-dimensional spaces such as two-dimensional, dynamic environments and three-dimensional static environments.
link
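As a rough numerical illustration of the idea (not the authors' planner), the Python sketch below computes a homotopy-class signature for a 3-D trajectory: each obstacle skeleton is treated as a closed current-carrying loop, the Biot-Savart field it induces is evaluated by summing over small conductor elements, and that field is integrated along the discretized path. With the chosen normalization, Ampère's law implies that two paths with the same endpoints get signature components that differ by (approximately) an integer for every skeleton they enclose differently; the discretization parameters and the closed-loop skeleton assumption are illustrative choices.

```python
import numpy as np

def biot_savart_field(point, skeleton, n_sub=200):
    """Field (up to a constant factor) at `point` induced by a closed
    current-carrying skeleton curve, summed over small conductor elements."""
    pts = np.asarray(skeleton, dtype=float)
    B = np.zeros(3)
    for a, b in zip(pts, np.vstack([pts[1:], pts[:1]])):   # closed-loop segments
        for s in np.linspace(0.0, 1.0, n_sub, endpoint=False):
            mid = a + (s + 0.5 / n_sub) * (b - a)           # element midpoint
            dl = (b - a) / n_sub                            # element direction
            r = point - mid
            B += np.cross(dl, r) / (np.linalg.norm(r) ** 3 + 1e-12)
    return B / (4.0 * np.pi)

def h_signature(path, skeletons, n_sub=200):
    """Line integral of each skeleton's field along a discretized trajectory.
    Trajectories in different homotopy classes get different signatures."""
    path = np.asarray(path, dtype=float)
    total = np.zeros(len(skeletons))
    for k, skel in enumerate(skeletons):
        for p, q in zip(path[:-1], path[1:]):
            mid = 0.5 * (p + q)
            total[k] += np.dot(biot_savart_field(mid, skel, n_sub), q - p)
    return total
```

In a search-based planner, such a signature can be appended to each graph vertex so that states reached via non-homotopic partial paths are kept distinct.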
Thursday, September 01, 2011
Lab Meeting September 2nd, 2011 (David): Multiclass Multimodal Detection and Tracking in Urban Environments
Title: Multiclass Multimodal Detection and Tracking in Urban Environments
Author: Luciano Spinello, Rudolph Triebel and Roland Siegwart
Abstract:
This paper presents a novel approach to detect and track people and cars based on the combined information retrieved from a camera and a laser range scanner. Laser data points are classified by using boosted Conditional Random Fields (CRF), while the image-based detector uses an extension of the Implicit Shape Model (ISM), which learns a codebook of local descriptors from a set of hand-labeled images and uses them to vote for centers of detected objects. Our extensions to ISM include the learning of object parts and template masks to obtain more distinctive votes for the particular object classes. The detections from both sensors are then fused and the objects are tracked using a Kalman filter with multiple motion models. Experiments conducted in real-world urban scenarios demonstrate the effectiveness of our approach.
Link:
IJRR copy
local copy
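The tracking stage uses a Kalman filter with multiple motion models on the fused detections. The authors' exact formulation is not reproduced here; the Python sketch below only illustrates the general idea of running a small bank of linear motion models in parallel (constant velocity and near-static) and re-weighting them by how well each explains the incoming 2-D position detections. The two chosen models and all noise parameters are arbitrary assumptions.

```python
import numpy as np

class MultiModelKalmanTrack:
    """Minimal track running two linear motion models in parallel and
    re-weighting them by the likelihood of each fused 2-D detection."""

    def __init__(self, xy, dt=0.1):
        # State [x, y, vx, vy], one copy per motion model.
        self.x = [np.array([xy[0], xy[1], 0.0, 0.0]) for _ in range(2)]
        self.P = [np.eye(4) for _ in range(2)]
        self.w = np.array([0.5, 0.5])                      # model probabilities
        F_cv = np.eye(4); F_cv[0, 2] = F_cv[1, 3] = dt     # constant velocity
        F_st = np.eye(4); F_st[2, 2] = F_st[3, 3] = 0.0    # near-static
        self.F = [F_cv, F_st]
        self.Q = [np.eye(4) * 0.05, np.eye(4) * 0.01]      # process noise
        self.H = np.zeros((2, 4)); self.H[0, 0] = self.H[1, 1] = 1.0
        self.R = np.eye(2) * 0.1                           # measurement noise

    def step(self, z):
        """One predict/update cycle with a fused detection z = [x, y]."""
        lik = np.zeros(2)
        for m in range(2):
            # Predict with this motion model.
            self.x[m] = self.F[m] @ self.x[m]
            self.P[m] = self.F[m] @ self.P[m] @ self.F[m].T + self.Q[m]
            # Kalman update with the detection.
            S = self.H @ self.P[m] @ self.H.T + self.R
            K = self.P[m] @ self.H.T @ np.linalg.inv(S)
            innov = z - self.H @ self.x[m]
            self.x[m] = self.x[m] + K @ innov
            self.P[m] = (np.eye(4) - K @ self.H) @ self.P[m]
            # Gaussian likelihood of the detection under this model.
            lik[m] = (np.exp(-0.5 * innov @ np.linalg.inv(S) @ innov)
                      / np.sqrt((2 * np.pi) ** 2 * np.linalg.det(S)))
        self.w = self.w * lik
        self.w /= self.w.sum() + 1e-300
        # Report the probability-weighted state estimate.
        return sum(w * x for w, x in zip(self.w, self.x))
```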