Presented by: Hank Lin
From: Proc. of the International Conference on Machine Learning (ICML'12), Edinburgh, Scotland, 2012.
Abstract:
Scene parsing, or semantic segmentation, consists in la-
beling each pixel in an image with the category of the object
it belongs to. It is a challenging task that involves the simul-
taneous detection, segmentation and recognition of all the
objects in the image.
The scene parsing method proposed here starts by com-
puting a tree of segments from a graph of pixel dissimilari-
ties. Simultaneously, a set of dense feature vectors is com-
puted which encodes regions of multiple sizes centered on
each pixel. The feature extractor is a multiscale convolu-
tional network trained from raw pixels. The feature vec-
tors associated with the segments covered by each node in
the tree are aggregated and fed to a classifier which pro-
duces an estimate of the distribution of object categories
contained in the segment. A subset of tree nodes that cover
the image are then selected so as to maximize the aver-
age “purity” of the class distributions, hence maximizing
the overall likelihood that each segment will contain a sin-
gle object. The convolutional network feature extractor is
trained end-to-end from raw pixels, alleviating the need for
engineered features. After training, the system is parameter
free.
The system yields record accuracies on the Stanford
Background Dataset (8 classes), the Sift Flow Dataset (33
classes) and the Barcelona Dataset (170 classes) while
being an order of magnitude faster than competing ap-
proaches, producing a 320 × 240 image labeling in less
than 1 second.
No comments:
Post a Comment
Note: Only a member of this blog may post a comment.