Robot Perception and Learning: NTU talk: Human Action Recognition Using Bag of Video Words

Title: Human Action Recognition Using Bag of Video Words
Speaker: Dr. Mubarak Shah, Agere Chair Professor of Computer Science, University of Central Florida
Time: 4:00pm, Dec 24 (Thu), 2009
Place: Room 210, CSIE building

Abstract:

The traditional approach to video analysis involves detecting objects, tracking them from frame to frame, and finally analyzing the tracks for human action recognition. However, in some videos of complex scenes it is not possible to reliably detect and track objects. There has therefore been a lot of recent interest in computer vision in the bag of video words approach, which bypasses the object detection and tracking steps. In the bag of video words approach, an action is described by a distribution of spatiotemporal cuboids (3D interest points).
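As a rough illustration of the representation described above, the following sketch turns the cuboid descriptors of one video into a normalized histogram of video words. It assumes that descriptors have already been extracted and that a codebook of cluster centers is available (for example from k-means); the function name, the toy codebook, and the 2-D descriptors are hypothetical and not taken from the talk.

```python
import numpy as np

def bag_of_video_words(descriptors, codebook):
    """Return a normalized histogram of video-word counts for one video.

    descriptors: (n_cuboids, dim) array of cuboid descriptors (assumed given).
    codebook:    (n_words, dim) array of video-word centers (assumed given).
    """
    # Squared Euclidean distance from every descriptor to every center.
    diffs = descriptors[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)
    # Assign each cuboid to its nearest video word.
    words = np.argmin(dists, axis=1)
    # Count how often each word occurs, then normalize to a distribution.
    counts = np.bincount(words, minlength=codebook.shape[0]).astype(float)
    return counts / counts.sum()

# Toy data: two video words and four cuboid descriptors.
codebook = np.array([[0.0, 0.0], [10.0, 10.0]])
descriptors = np.array([[0.1, -0.2], [9.5, 10.3], [0.3, 0.1], [-0.1, 0.0]])
hist = bag_of_video_words(descriptors, codebook)
```

The histogram discards where and when each cuboid occurred, which is why no object detection or tracking is needed.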

In this talk, I will first describe a method to automatically discover the optimal number of video-word clusters by utilizing Maximization of Mutual Information (MMI). Unlike the k-means algorithm, which is typically used to cluster spatiotemporal cuboids into video words based on their appearance similarity, MMI clustering further groups the video words so that semantically similar video words, e.g. words corresponding to the same part of the body during an action, fall in the same cluster.
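The abstract does not spell out the clustering procedure, so the following is only a minimal sketch of the underlying idea, assuming a table of co-occurrence counts between video words (rows) and action classes (columns). It computes the mutual information between words and actions and shows that merging two words with the same class distribution loses no information, while merging dissimilar words does. The helper names and the toy counts are hypothetical.

```python
import numpy as np

def mutual_information(counts):
    """Mutual information (in nats) between rows and columns of a count table."""
    joint = counts / counts.sum()
    px = joint.sum(axis=1, keepdims=True)   # marginal over video words
    py = joint.sum(axis=0, keepdims=True)   # marginal over action classes
    nz = joint > 0                          # skip zero cells to avoid log(0)
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))

def merge_rows(counts, i, j):
    """Merge video word j into video word i (requires i < j)."""
    merged = np.delete(counts, j, axis=0)
    merged[i] += counts[j]
    return merged

# Rows: three video words; columns: two action classes (toy counts).
counts = np.array([[8.0, 2.0],
                   [4.0, 1.0],
                   [1.0, 9.0]])

mi_full = mutual_information(counts)
mi_merge_similar = mutual_information(merge_rows(counts, 0, 1))
mi_merge_dissimilar = mutual_information(merge_rows(counts, 0, 2))
```

A greedy scheme built on this would repeatedly merge the pair with the smallest information loss and stop once the loss becomes large, which is one way a cluster count could be chosen automatically. That stopping rule is an assumption here, not a statement of the speaker's method.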

The above method for human action recognition uses only one kind of feature, spatiotemporal cuboids. However, a representation based on a single feature is not sufficient to capture imaging variations (viewpoint, illumination, etc.) and attributes of individuals (size, age, gender, etc.).

Next, I will present a method which uses two types of features: i) a quantized vocabulary of local spatiotemporal (ST) volumes (or cuboids), and ii) a quantized vocabulary of spin-images. To optimally combine these features, we treat different features and videos as nodes in a graph, where weighted edges between the nodes represent the strength of the relationship between entities. The graph is then embedded into a k-dimensional space subject to the criterion that similar nodes have Euclidean coordinates which are closer to each other. This is achieved by converting this constraint into a minimization problem whose solution is given by the eigenvectors of the graph Laplacian matrix. This procedure is known as Fiedler Embedding.
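The embedding step can be sketched as follows, assuming a symmetric, non-negative weight matrix for a connected graph and the unnormalized Laplacian L = D - W. The function name and the toy graph are hypothetical, and the method presented in the talk may scale or normalize the eigenvectors differently.

```python
import numpy as np

def laplacian_embedding(weights, k):
    """Embed graph nodes in k dimensions using Laplacian eigenvectors.

    weights: symmetric (n, n) matrix of non-negative edge weights.
    Assumes the graph is connected, so only the first eigenvector is trivial.
    """
    degree = np.diag(weights.sum(axis=1))
    laplacian = degree - weights
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(laplacian)
    # Drop the constant eigenvector (eigenvalue 0) and keep the next k.
    return eigvecs[:, 1:k + 1]

# Toy graph: nodes 0-2 and nodes 3-5 form two tight groups,
# joined by a single weak edge between nodes 2 and 3.
weights = np.zeros((6, 6))
for group in ([0, 1, 2], [3, 4, 5]):
    for a in group:
        for b in group:
            if a != b:
                weights[a, b] = 1.0
weights[2, 3] = weights[3, 2] = 0.01

coords = laplacian_embedding(weights, k=1)
```

In this toy graph the single coordinate takes one sign on the first group and the opposite sign on the second, so strongly connected nodes end up close together in the embedded space.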

Short Biography:
Dr. Mubarak Shah, Agere Chair Professor of Computer Science, is the founding director of the Computer Vision Lab at UCF. He is a co-author of three books, Motion-Based Recognition (1997), Video Registration (2003), and Automated Multi-Camera Surveillance: Algorithms and Practice (2008), all published by Springer. He has published ten book chapters, seventy-five journal papers, and one hundred seventy conference papers on topics related to visual surveillance, tracking, human activity and action recognition, object detection and categorization, shape from shading, geo-registration, photorealistic synthesis, visual crowd analysis, biomedical imaging, etc.

Dr. Shah is a fellow of the IEEE, IAPR, and SPIE. In 2006, he was awarded the Pegasus Professor award, the highest award at UCF, given to a faculty member who has made a significant impact on the university, has made an extraordinary contribution to the university community, and has demonstrated excellence in teaching, research, and service. He is a Distinguished ACM Speaker. He was an IEEE Distinguished Visitor speaker for 1997-2000, and received the IEEE Outstanding Engineering Educator Award in 1997. He received the Harris Corporation's Engineering Achievement Award in 1999; the TOKTEN awards from UNDP in 1995, 1997, and 2000; Teaching Incentive Program awards in 1995 and 2003; the Research Incentive Award in 2003; Millionaires' Club awards in 2005 and 2006; the University Distinguished Researcher award in 2007; the SANA award in 2007; and an honorable mention for the ICCV 2005 Where Am I? Challenge Problem. He was also nominated for the best paper award at the ACM Multimedia Conference in 2005. He is an editor of the international book series on Video Computing, editor-in-chief of the journal Machine Vision and Applications, and an associate editor of ACM Computing Surveys. He was an associate editor of the IEEE Transactions on PAMI, and a guest editor of the special issue of the International Journal of Computer Vision on Video Computing. He was a program co-chair of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2008.

Thursday, December 24, 2009