Saturday, September 09, 2006

CMU ML talks: the ICML 2006 Conference Review Session

CMU ML lunch talks

1. Support Vector Decomposition Machine
by F Pereira and G Gordon

In machine learning problems with tens of thousands of features and only dozens or hundreds of independent training examples, dimensionality reduction is essential for good learning performance. In previous work, many researchers have treated the learning problem in two separate phases: first use an algorithm such as singular value decomposition to reduce the dimensionality of the data set, and then use a classification algorithm such as naive Bayes or support vector machines to learn a classifier. We demonstrate that it is possible to combine the two goals of dimensionality reduction and classification into a single learning objective, and present a novel and efficient algorithm which optimizes this objective directly. We present experimental results in fMRI analysis which show that we can achieve better learning performance and lower-dimensional representations than two-phase approaches can.
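
To make the "single objective" idea concrete, here is a rough sketch of how one might couple a low-rank factorization X ≈ UV with a linear classifier on the latent coordinates U, minimizing one weighted sum of reconstruction and classification losses. This is my own toy illustration of the general idea, not the SVDM algorithm from the paper; the rank k, weight alpha, and step size are made-up hyperparameters.

```python
# Toy illustration only: jointly optimize a low-rank reconstruction of X and a
# logistic classifier on the latent rows of U, instead of doing SVD first and
# classification second. Not the authors' SVDM formulation.
import numpy as np

def joint_fit(X, y, k=10, alpha=1.0, lr=1e-3, iters=500, seed=0):
    """X: (n, d) data; y: (n,) labels in {-1, +1}; k: latent dimension."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    U = rng.normal(scale=0.1, size=(n, k))   # latent coordinates of each example
    V = rng.normal(scale=0.1, size=(k, d))   # latent basis
    w = np.zeros(k)                          # classifier weights on U

    for _ in range(iters):
        R = U @ V - X                        # reconstruction residual
        margins = y * (U @ w)
        s = -y / (1.0 + np.exp(margins))     # logistic-loss derivative w.r.t. margin

        # Gradients of the single joint objective:
        #   ||X - U V||_F^2 + alpha * sum_i log(1 + exp(-y_i u_i . w))
        grad_U = 2 * R @ V.T + alpha * np.outer(s, w)
        grad_V = 2 * U.T @ R
        grad_w = alpha * (U.T @ s)
        U -= lr * grad_U
        V -= lr * grad_V
        w -= lr * grad_w
    return U, V, w
```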


2. Inference with the Universum
by Jason Weston, Ronan Collobert, Fabian Sinz, Leon Bottou and Vladimir Vapnik

In this paper we study a new framework, introduced by Vapnik (1998), that offers an alternative capacity concept to the large margin approach. In the particular case of binary classification, we are given a set of labeled examples and a collection of "non-examples" that do not belong to either class of interest. This collection, called the Universum, allows one to encode prior knowledge by representing meaningful concepts in the same domain as the problem at hand. We describe an algorithm to leverage the Universum by maximizing the number of observed contradictions, and show experimentally that this approach delivers accuracy improvements over using labeled data alone.
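
The flavor of the "maximize contradictions" idea: labeled points get the usual hinge penalty for violating the margin, while Universum points are penalized for lying far from the decision boundary, so the classifier is pushed to stay undecided on the non-examples. Below is a minimal linear sketch of that kind of loss; this is my own paraphrase rather than the exact formulation in the paper, and C, C_u and eps are placeholder hyperparameters.

```python
# Sketch of a Universum-style loss for a linear model f(x) = w.x + b:
# hinge loss on labeled examples plus an epsilon-insensitive penalty that
# keeps Universum points close to the decision boundary f(x) = 0.
import numpy as np

def universum_loss(w, b, X, y, X_univ, C=1.0, C_u=1.0, eps=0.1):
    """X, y: labeled data with y in {-1, +1}; X_univ: Universum 'non-examples'."""
    f_lab = X @ w + b
    f_univ = X_univ @ w + b
    hinge = np.maximum(0.0, 1.0 - y * f_lab).sum()           # margin violations
    near_zero = np.maximum(0.0, np.abs(f_univ) - eps).sum()  # Universum far from boundary
    return 0.5 * w @ w + C * hinge + C_u * near_zero
```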


3. Bayesian Multi-Population Haplotype Inference via a Hierarchical Dirichlet Process Mixture
by E.P. Xing, K. Sohn, M.I. Jordan and Y.W. Teh

Uncovering the haplotypes of single nucleotide polymorphisms and their population demography is essential for many biological and medical applications. Methods for haplotype inference developed thus far, including methods based on coalescence, finite and infinite mixtures, and maximal parsimony, ignore the underlying population structure in the genotype data. As noted by Pritchard (2001), different populations can share a certain portion of their genetic ancestors, as well as have their own genetic components through migration and diversification. In this paper, we address the problem of multi-population haplotype inference. We capture cross-population structure using a nonparametric Bayesian prior known as the hierarchical Dirichlet process (HDP) (Teh et al., 2006), conjoining this prior with a recently developed Bayesian methodology for haplotype phasing known as DP-Haplotyper (Xing et al., 2004). We also develop an efficient sampling algorithm for the HDP based on a two-level nested Pólya urn scheme. We show that our model outperforms extant algorithms on both simulated and real biological data.
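
For intuition about the two-level urn, here is a tiny simulation of the "Chinese restaurant franchise" view of the HDP prior: each population has its own urn over local clusters, and new clusters draw their identity from a shared global urn, which is what lets populations share ancestral haplotypes. This only draws from the prior; the paper's sampler of course also conditions on the observed genotypes, and the alpha/gamma values here are arbitrary concentration parameters.

```python
# Prior-only simulation of the two-level urn (Chinese restaurant franchise)
# behind an HDP: per-population urns over local clusters, with new clusters
# labeled by draws from one shared global urn. Illustration, not the paper's
# inference algorithm.
import random

def hdp_prior_draws(group_sizes, alpha=1.0, gamma=1.0, seed=0):
    rng = random.Random(seed)
    global_tables = []                    # number of tables serving each global "dish"
    assignments = []
    for n in group_sizes:                 # one restaurant per population
        local_counts = []                 # customers at each local table
        table_dish = []                   # global dish served at each local table
        labels = []
        for _ in range(n):
            # Level 1: join an existing table with prob ~ its count, or open a
            # new table with prob ~ alpha.
            t = rng.choices(range(len(local_counts) + 1),
                            weights=local_counts + [alpha])[0]
            if t == len(local_counts):    # opened a new table
                # Level 2: the new table picks a dish from the shared global urn.
                k = rng.choices(range(len(global_tables) + 1),
                                weights=global_tables + [gamma])[0]
                if k == len(global_tables):
                    global_tables.append(0)
                global_tables[k] += 1
                local_counts.append(0)
                table_dish.append(k)
            local_counts[t] += 1
            labels.append(table_dish[t])
        assignments.append(labels)
    return assignments                    # dish ids shared across populations

# Example: two populations of 20 individuals drawing from a shared ancestor pool.
# print(hdp_prior_draws([20, 20]))
```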
