Friday, November 09, 2007

[ML Lunch] Nati Srebro (Thursday, 11/08/07 at 5pm in NSH 3002)

Does a large data-set mean more, or less, work?

In devising methods for optimization problems associated with learning tasks, and in studying the runtime of these methods, we usually think of the runtime as increasing with the data set size. However, from a learning performance perspective, having more data available should not mean we need to spend more time optimizing. At the extreme, we can always ignore some of the data if it makes optimization difficult. But perhaps having more data available can actually allow us to spend less time optimizing?


I will discuss two types of behavior:

(1) a phase-transition behavior, where a computationally intractable problem becomes tractable at the cost of excess information. I will demonstrate this through a detailed study of informational and computational limits in clustering.

(2) the scaling of the computational cost of training, e.g., support vector machines (SVMs). I will argue that the computational cost should scale down with data set size, and up with the "hardness" of the decision problem. In particular, I will describe a simple training procedure, achieving state-of-the-art performance on large data sets, whose runtime does not increase with data set size.
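For concreteness, here is a minimal sketch of one procedure with this flavor: plain stochastic subgradient descent on the regularized hinge-loss (SVM) objective. The function name sgd_svm and the parameter values below are illustrative assumptions, not details taken from the talk; the point is only that each update touches a single randomly sampled example, so the runtime for a fixed number of steps does not depend on how many examples are available.

import numpy as np

def sgd_svm(X, y, lam=0.01, n_iters=10000, seed=0):
    # Stochastic subgradient descent for a linear SVM
    # (regularized hinge loss). Each step looks at one randomly
    # chosen training example, so the cost of a fixed number of
    # steps does not grow with the number of training points.
    #
    # X: (n, d) feature matrix; y: (n,) labels in {-1, +1}.
    # lam, n_iters, and seed are illustrative choices, not
    # values from the talk.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, n_iters + 1):
        i = rng.integers(n)        # draw one example uniformly
        eta = 1.0 / (lam * t)      # decaying step size
        # subgradient step on lam/2 * ||w||^2 + hinge(y_i, x_i . w)
        if y[i] * X[i].dot(w) < 1:
            w = (1 - eta * lam) * w + eta * y[i] * X[i]
        else:
            w = (1 - eta * lam) * w
    return w

With n_iters held fixed, training on a million examples costs no more than training on a thousand; more data simply gives the sampler a richer pool to draw from.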
