Andrew McCallum
October 10, 2005, 4:15PM
http://graphics.stanford.edu/ba-colloquium/
Abstract
Although information extraction and data mining appear together in many applications, their interface in most current systems would better be described as serial juxtaposition than as tight integration. Information extraction populates slots in a database by identifying relevant subsequences of text, but is usually not aware of the emerging patterns and regularities in the database. Data mining methods begin from a populated database, and are often unaware of where the data came from, or its inherent uncertainties. The result is that the accuracy of both suffers, and accurate mining of complex text sources has been beyond reach.
In this talk I will describe work in probabilistic models that perform joint inference across multiple components of an information processing pipeline in order to avoid the brittle accumulation of errors. After briefly introducing conditional random fields, I will describe recent work in information extraction leveraging factorial state representations, object deduplication, and transfer learning, as well as scalable methods of inference and learning.
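The difference between a serial pipeline and joint inference can be illustrated with a toy sketch. Everything here is hypothetical and not taken from the talk: the two stages (labeling a mention, then deciding whether to merge it with an existing database record), the probabilities, and the function names are all assumptions chosen for illustration. The pipeline commits to the best label first and only then chooses a merge decision, while joint decoding scores both choices together.

```python
# Toy illustration only: all probabilities and names below are made up.

# Stage 1 (extraction): probability of each label for a text mention.
P_LABEL = {"PERSON": 0.55, "ORG": 0.45}

# Stage 2 (deduplication): probability of each merge decision given the label.
P_MERGE = {
    "PERSON": {"merge": 0.4, "separate": 0.6},
    "ORG": {"merge": 0.9, "separate": 0.1},
}


def pipeline_decode(p_label, p_merge):
    """Commit to the best label first, then pick the best merge decision."""
    label = max(p_label, key=p_label.get)
    decision = max(p_merge[label], key=p_merge[label].get)
    return label, decision


def joint_decode(p_label, p_merge):
    """Pick the (label, decision) pair with the highest joint probability."""
    candidates = [(a, b) for a in p_label for b in p_merge[a]]
    return max(candidates, key=lambda ab: p_label[ab[0]] * p_merge[ab[0]][ab[1]])


def joint_prob(pair, p_label, p_merge):
    """Joint probability of a (label, decision) pair."""
    label, decision = pair
    return p_label[label] * p_merge[label][decision]


if __name__ == "__main__":
    for name, decode in [("pipeline", pipeline_decode), ("joint", joint_decode)]:
        pair = decode(P_LABEL, P_MERGE)
        print(name, pair, round(joint_prob(pair, P_LABEL, P_MERGE), 3))
```

In this made-up case the pipeline's early commitment to the slightly more likely label locks it into a pair with lower joint probability than the one joint decoding finds. Exhaustive enumeration is only feasible at toy scale; the models described in the talk rely on structured inference methods instead.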
I will then describe two methods of integrating textual data into a particular type of data mining: social network analysis. The Author-Recipient-Topic (ART) model performs summarization and question routing from large quantities of email or other message data by discovering clusters of words associated with topics, as well as role similarity among entities based on those topics. The Group-Topic (GT) model captures relational data along with accompanying text by discovering how entities fall into groups, capturing the different coalitions that arise depending on the topic at hand. I will demonstrate this on several decades of voting records in the U.N. and U.S. Senate.
If there is time, I will also give a demo of the new research paper search engine we are creating at UMass.
Joint work with colleagues at UMass: Charles Sutton, Chris Pal, Ben Wellner, Michael Hay, Xuerui Wang, Natasha Mohanty, and Andres Corrada.
Andrew McCallum is an Associate Professor at the University of Massachusetts, Amherst. He was previously Vice President of Research and Development at WhizBang Labs, a company that used machine learning for information extraction from the Web. In the late 1990s he was a Research Scientist and Coordinator at Justsystem Pittsburgh Research Center, where he spearheaded the creation of CORA, an early research paper search engine that used machine learning for spidering, extraction, classification, and citation analysis. He was a post-doctoral fellow at Carnegie Mellon University after receiving his PhD from the University of Rochester in 1995. He is an action editor for the Journal of Machine Learning Research. For the past ten years, McCallum has been active in research on statistical machine learning applied to text, especially information extraction, document classification, clustering, finite state models, semi-supervised learning, and social network analysis.