Sunday, October 01, 2006

[ML Lunch] Monday 10/02

Hi,

We'll be having a KDD 2006 conference review
session this Monday 10/02.
Hanghang Tong and Jimeng Sun will lead the
session with the following
papers:

1. Hanghang Tong on

Center-Piece Subgraphs: Problem Definition and
Fast Solutions.
by Hanghang Tong and Christos Faloutsos

2. Jimeng Sun on

Beyond Streams and Graphs: Dynamic Tensor
Analysis.
by Jimeng Sun, Yufei Tao, Christos Faloutsos


Venue: NSH 1507
Date : Monday, October 02
Time : 12:00 noon

And of course, thanks to MLD, lunch will be
served.


For schedules, links to papers et al, please see
the web page:

http://www.cs.cmu.edu/~learning

We are on the lookout for speakers for the
semester, so do send any of
us an email if you'd like to give a talk.


Your ML Lunch organizing committee,

Edoardo Airoldi (eairoldi@cs.cmu.edu)
Anna Goldenberg (anya@cs.cmu.edu)
Leonid Kontorovich (lkontor@cs.cmu.edu)
Andreas Krause (krausea@cs.cmu.edu)
Jure Leskovec (jure@cs.cmu.edu)
Pradeep Ravikumar (pradeepr@cs.cmu.edu)


=======================================================================
Abstracts:

1. Center-Piece Subgraphs: Problem Definition and
Fast Solutions
by Hanghang Tong and Christos Faloutsos

Given $\QN$ nodes in a social network (say,
authorship network), how can
we find the node/author that is the center-piece,
and has direct or
indirect connections to all, or most of them? For
example, this node
could be the common advisor, or someone who
started the research area
that the $\QN$ nodes belong to. Isomorphic
scenarios appear in law
enforcement (find the master-mind criminal,
connected to all current
suspects), gene regulatory networks (find the
protein that participates
in pathways with all or most of the given $\QN$
proteins), viral
marketing and many more. Connection subgraphs is
an important first
step, handling the case of $\QN$=2 query nodes.
Then, the connection
subgraph algorithm finds the $b$ intermediate
nodes, that provide a good
connection between the two query nodes. Here we
generalize the challenge
in multiple dimensions: First, we allow more than
two query nodes.
Second, we allow a whole family of queries,
ranging from 'OR' to 'AND',
with 'softAND' in-between. Finally, we design and
compare a fast
approximation, and study the quality/speed
trade-off. We also present
experiments on the DBLP dataset. The experiments
confirm that our
proposed method naturally deals with multi-source
queries and that the
resulting subgraphs agree with our intuition.
Wall-clock timing results
on the DBLP dataset show that our proposed
approximation achieve good
accuracy for about $6:1$ speedup.


2. Beyond Streams and Graphs: Dynamic Tensor
Analysis
by Jimeng Sun, Yufei Tao, Christos Faloutsos

How do we find patterns in author-keyword
associations, evolving over
time? Or in DataCubes, with
product-branch-customer sales information?
Matrix decompositions, like principal component
analysis (PCA) and
variants, are invaluable tools for mining,
dimensionality reduction,
feature selection, rule identification in
numerous settings like
streaming data, text, graphs, social networks and
many more.However,
they have only two orders, like author and
keyword, in the above
example. We propose to envision such higher order
data as tensors, and
tap the vast literature on the topic. However,
these methods do not
necessarily scale up, let alone operate on
semi-infinite streams. Thus,
we introduce the dynamic tensor analysis (DTA)
method, and its variants.
DTA provides a compact summary for high-order and
high-dimensional data,
and it also reveals the hidden correlations.
Algorithmically,we designed
DTA very carefully so that it is (a) scalable,
(b) space efficient (it
does not need to store the past) and (c) fully
automatic with no need
for user defined parameters. Moreover, we propose
STA, a streaming
tensor analysis method, which provides a fast,
streaming approximation
to DTA. We implemented all our methods, and
applied them in two real
settings, namely, anomaly detection and multi-way
latent semantic
indexing. We used two real, large datasets, one
on network flow data
(100GB over 1 month) and one from DBLP (200MB
over 25 years). Our
experiments show that our methods are fast,
accurate and that they find
interesting patterns and outliers on the real
datasets.

No comments: