Friday, March 13, 2009

CMU talk: Beyond Nouns and Verbs: Learning Visually Grounded Stories of Images and Videos using Language and Vision (Abhinav Gupta)

VASC Seminar
Monday, March 16, 2009

Beyond Nouns and Verbs: Learning Visually Grounded Stories of Images and Videos using Language and Vision
Abhinav Gupta
University of Maryland, College Park

Abstract:
In this talk, I will present our recent work on exploring the synergy between language and vision for learning visually grounded contextual structures. Our work departs from the traditional view of visual and contextual learning, in which individual detectors and relationships are learned separately; instead, we focus on the simultaneous learning of visual appearance and contextual models from richly annotated but weakly labeled datasets. In the first part of the talk, I will show how rich annotations can be used to constrain the learning of visually grounded models of nouns, prepositions, and comparative adjectives from weakly labeled data. I will also show how visually grounded models of prepositions and comparative adjectives can serve as contextual models for scene analysis.
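To make the idea of a preposition acting as a contextual constraint concrete, here is a minimal toy sketch (not the speaker's actual algorithm). It assumes hypothetical bounding-box regions, a hand-coded "above" predicate, and a caption-level annotation that names the nouns and one relationship but not which region each noun refers to; the relation is used to pick the most consistent noun-to-region assignment.

# Toy illustration (hypothetical names and data, not the speaker's method):
# using a grounded "above" relation to disambiguate noun-to-region assignment.
from itertools import permutations

# Hypothetical image regions: bounding box as (x, y_top, width, height).
regions = {"region_1": (50, 20, 100, 60),    # higher in the image
           "region_2": (40, 200, 300, 80)}   # lower in the image

def above(box_a, box_b):
    # Crude grounded model of the preposition "above": A ends before B starts vertically.
    return box_a[1] + box_a[3] < box_b[1]

# Weak caption-level annotation: nouns plus one relationship, no region labels.
nouns = ["sky", "road"]
relation = ("above", "sky", "road")

# Choose the assignment of nouns to regions that best satisfies the relation.
best = max(permutations(regions), key=lambda perm: above(
    regions[perm[nouns.index(relation[1])]],
    regions[perm[nouns.index(relation[2])]]))
print(dict(zip(nouns, best)))  # {'sky': 'region_1', 'road': 'region_2'}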

In the second part, I will present storyline models for the interpretation of videos. Storyline models go beyond pairwise contextual models and capture higher-order constraints by allowing only a small, finite number of possible action sequences (stories). Visual inference with storyline models involves inferring the "plot" of the video (its sequence of actions) and recognizing the individual activities in that plot. I will also present an iterative approach for learning visually grounded storyline models from videos and the linguistic information provided in their captions.
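As a rough illustration of the inference step described above, the following toy sketch (illustrative only, with made-up stories and detector scores, not the actual storyline model) restricts a video to a finite set of allowed action sequences and picks the plot that best matches per-clip action likelihoods.

# Toy sketch: plot inference over a finite set of allowed action sequences.
import math

# A storyline model permits only a few plots; anything else is disallowed.
stories = [("pitch", "hit", "run"),
           ("pitch", "miss", "catch")]

# Hypothetical per-clip likelihoods P(action | clip) from independent detectors.
clip_scores = [{"pitch": 0.7, "hit": 0.2, "miss": 0.1},
               {"pitch": 0.1, "hit": 0.3, "miss": 0.6},
               {"hit": 0.1, "run": 0.2, "catch": 0.7}]

def story_log_score(story):
    # Log-probability of labeling clip t with story[t]; unseen actions get ~zero mass.
    return sum(math.log(scores.get(action, 1e-9))
               for action, scores in zip(story, clip_scores))

# Inference: choose the plot (and hence the per-clip activity labels) with the best score.
best_story = max(stories, key=story_log_score)
print(best_story)  # ('pitch', 'miss', 'catch')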

Bio:
Abhinav Gupta is a doctoral candidate in the Department of Computer Science at the University of Maryland, College Park. He received an MS in Computer Science from the University of Maryland in 2007 and a BTech in Computer Science and Engineering from the Indian Institute of Technology, Kanpur, in 2004. His research focuses on visually grounded semantic models and on how language and vision can be exploited to learn such models. His other research interests include combining multiple cues, probabilistic graphical models, human body tracking, and camera networks. He is a recipient of the University of Maryland Dean's Fellowship for excellence in research.
