Speaker David Blei
Affiliation Columbia University
Host Paul Bennett
Date recorded 12 August 2014
Probabilistic topic models provide a suite of tools for analyzing large document collections. Topic modeling algorithms discover the latent themes that underlie the documents and identify how each document exhibits those themes. Topic modeling can be used to help explore, summarize, and form predictions about documents. Topic modeling ideas have been adapted to many domains, including images, music, networks, genomics, and neuroscience.
Traditional topic modeling algorithms analyze a document collection and estimate its latent thematic structure. However, many collections contain an additional type of data: how people use the documents. For example, readers click on articles in a newspaper website, scientists place articles in their personal libraries, and lawmakers vote on a collection of bills. Behavior data is essential both for making predictions about users (such as for a recommendation system) and for understanding how a collection and its users are organized.
In this talk, I will review the basics of topic modeling and describe our recent research on collaborative topic models, models that simultaneously analyze a collection of texts and its corresponding user behavior. We studied collaborative topic models on 80,000 scientists' libraries from Mendeley and 100,000 users' click data from the arXiv. Collaborative topic models enable interpretable recommendation systems, capturing scientists' preferences and pointing them to articles of interest. Further, these models can organize the articles according to the discovered patterns of readership. For example, we can identify articles that are important within a field and articles that transcend disciplinary boundaries.
More broadly, topic modeling is a case study in the large field of applied probabilistic modeling. Finally, I will survey some recent advances in this field. I will show how modern probabilistic modeling gives data scientists a rich language for expressing statistical assumptions and scalable algorithms for uncovering hidden patterns in massive data.
©2014 Microsoft Corporation. All rights reserved.