Thursday, February 18, 2010

CMU PhD Thesis proposal: Data-driven Scene Parsing With the Visual Memex

Data-driven Scene Parsing With the Visual Memex

Tomasz Malisiewicz
Carnegie Mellon University

February 18, 2010, 4:00 p.m., NSH 3305

Abstract: This proposal is concerned with the problem of image understanding. Given a single static image, the goal is to explain the entire image by recognizing all of the objects depicted in it. We formulate image understanding as image parsing -- breaking the image up into semantically meaningful regions and recognizing the object embedded in each region. We strive for a dense understanding, leaving no portion of the image unexplained. While most approaches to scene understanding formulate the problem as recognizing abstract object categories (asking of each object: “what is this?”), we use a data-driven model of recognition more akin to memory (asking instead: “what is this like?”). We present an exemplar-based framework for reasoning about objects and their relationships in images, dubbed the Visual Memex. The Visual Memex is a non-parametric, graph-based model of objects which encodes two types of object relationships: visual similarity between object exemplars, and 2D spatial context between objects in a single image. We use a region-based representation of exemplar objects, which has been shown to be superior to the popular rectangular-window approach for a wide array of things and stuff found in natural scenes. During training, we learn a set of per-exemplar similarity functions and formulate recognition as association between regions automatically extracted from the input image and exemplar regions in the Visual Memex. We use bottom-up image segmentation, mid-level reasoning about segment relationships, and spatial relationships between exemplars in the Visual Memex as complementary sources of object hypotheses. I propose an iterative image parsing framework which builds an interpretation of an input image by repeatedly conditioning on the current (partial) interpretation and generating novel segment hypotheses from low-level, mid-level, and high-level cues. I also propose an evaluation of the system with respect to both recognition and segmentation on real-world scenes from LabelMe. A sketch of the core ideas follows below.
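The abstract packs two concrete ideas: a graph over object exemplars with two kinds of edges, and a recognition step that associates query regions with exemplars via per-exemplar similarity functions inside an iterative parsing loop. Here is a minimal Python sketch of those two pieces under my own assumptions: every class and function name, the weighted squared-distance form, and the fixed iteration count are illustrative placeholders, not code or notation from the thesis.

```python
# A hypothetical sketch of the Visual Memex graph and an iterative parsing
# loop. All names and the distance form are illustrative assumptions; the
# thesis learns per-exemplar similarity functions from region-based features.
from dataclasses import dataclass, field

@dataclass
class Exemplar:
    """One object exemplar: a labeled region from a training image."""
    label: str               # e.g. "car", "road"
    features: list[float]    # appearance descriptor (placeholder)

@dataclass
class VisualMemex:
    """Non-parametric graph over exemplars with two edge types:
    visual-similarity edges and 2D spatial-context edges."""
    exemplars: list[Exemplar] = field(default_factory=list)
    similarity_edges: set[tuple[int, int]] = field(default_factory=set)
    context_edges: set[tuple[int, int]] = field(default_factory=set)

    def associate(self, region_features, weights):
        """Score a query region against every exemplar, each with its own
        learned similarity function (here: a weighted squared distance)."""
        scores = []
        for i, ex in enumerate(self.exemplars):
            w = weights[i]  # per-exemplar learned weights (assumed given)
            d = sum(wj * (fj - gj) ** 2
                    for wj, fj, gj in zip(w, region_features, ex.features))
            scores.append((i, -d))  # higher score = better match
        return max(scores, key=lambda s: s[1])

def parse_image(regions, memex, weights, n_iters=3):
    """Iterative parsing sketch: condition on the current (partial)
    interpretation and re-associate each region until labels settle.
    A real system would also generate new segment hypotheses from
    low-, mid-, and high-level cues at each pass; omitted here."""
    interpretation = {}
    for _ in range(n_iters):
        for rid, feats in regions.items():
            best, _score = memex.associate(feats, weights)
            interpretation[rid] = memex.exemplars[best].label
    return interpretation

if __name__ == "__main__":
    # Toy usage with two exemplars and two query regions.
    memex = VisualMemex(exemplars=[Exemplar("car", [1.0, 0.0]),
                                   Exemplar("road", [0.0, 1.0])])
    weights = [[1.0, 1.0], [1.0, 1.0]]
    regions = {0: [0.9, 0.1], 1: [0.1, 0.8]}
    print(parse_image(regions, memex, weights))  # {0: 'car', 1: 'road'}
```

The sketch deliberately leaves the spatial-context edges unused in the loop; wiring them in (rescoring each region against the labels of its neighbors in the current interpretation) is exactly the conditioning step the proposal describes.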

Thesis Committee
Alexei A. Efros, Chair
Martial Hebert
Takeo Kanade
Pietro Perona, California Institute of Technology
