Tuesday, February 27, 2007

CMU VASC talk: Clustering and Classification via Lossy Data Compression

Clustering and Classification via Lossy Data Compression

Yi Ma, UIUC
Monday, Feb 26

For many problems in computer vision, image processing, and pattern recognition, we need to process and analyze massive amounts of high-dimensional mixed data such as images and gene expression data. By mixed data, we mean that the given data set consists of multiple heterogeneous subsets (which have different geometric or statistical characteristics), but each subset can be modeled or represented more easily than the data set as a whole.

In this talk, we address two fundamental questions: how to cluster and how to classify such high-dimensional mixed data. We contend that both the (unsupervised) clustering and (supervised) classification problems can be cast as a lossy data compression problem and solved efficiently within a unified mathematical framework. In theory, this approach offers some distinct advantages over conventional methods for clustering and classification, especially in dealing with several difficult issues that often arise in practice: regularization of degenerate distributions, selection of models with different complexities, and rejection of outliers.
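
To make the compression view concrete, here is a minimal sketch of how a grouping of the data could be scored by the number of bits needed to code it up to a distortion eps. The abstract gives no formulas, so everything in the block is an assumption made for illustration: the Gaussian-style coding-length estimate, the zero-mean data, the membership cost, and the names `coding_length` and `segmented_length`. It is not necessarily the method presented in the talk.

```python
import math


def log2_det_spd(a):
    """Return log2(det(a)) for a symmetric positive-definite matrix (Cholesky)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    total = 0.0
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(s)
                total += 2.0 * math.log2(low[i][j])
            else:
                low[i][j] = s / low[j][j]
    return total


def coding_length(points, eps):
    """Assumed estimate of the bits needed to code zero-mean points up to
    distortion eps: (m + n) / 2 * log2 det(I + n / (eps^2 m) * W W^T),
    where m is the number of points and n their dimension."""
    m, n = len(points), len(points[0])
    scale = n / (eps ** 2 * m)
    a = [[(1.0 if i == j else 0.0) + scale * sum(p[i] * p[j] for p in points)
          for j in range(n)] for i in range(n)]
    return 0.5 * (m + n) * log2_det_spd(a)


def segmented_length(groups, eps):
    """Total bits for a grouping: each group is coded on its own, plus an
    assumed cost of -log2(group fraction) bits per point for its label."""
    total_points = sum(len(g) for g in groups)
    bits = 0.0
    for g in groups:
        bits += coding_length(g, eps)
        bits += len(g) * -math.log2(len(g) / total_points)
    return bits


# Toy mixed data: two one-dimensional subsets embedded in the plane.
line_x = [(float(t), 0.0) for t in range(-5, 6)]
line_y = [(0.0, float(t)) for t in range(-5, 6)]

together = segmented_length([line_x + line_y], eps=0.1)
separate = segmented_length([line_x, line_y], eps=0.1)
print(f"one group: {together:.1f} bits, two groups: {separate:.1f} bits")
```

On this toy data the two-group coding is shorter than the one-group coding, which is the sense in which a good clustering is one that compresses the data well. The added identity matrix also keeps the determinant well defined for degenerate (low-rank) subsets.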

Our work establishes a strong connection between information theory, especially rate-distortion theory, and data clustering and classification, and it leads to extremely simple but effective algorithms. We will demonstrate the success of these algorithms on a few popular but difficult problems, including but not limited to natural image segmentation, microarray data clustering, and handwritten digit and face recognition.
