Presented By: Department of Statistics
Oral Prelim: Mikhail Yurochkin, New algorithms for Topic Modeling
Topic modeling is a class of exploratory algorithms applied to text, image, audio, or video data. The goal is to find latent topics that summarize huge collections of data and make them easier to understand. We develop two new topic modeling algorithms and design a new mathematical formulation of the problem.
In the first part, we describe a novel geometric view of the problem and develop a fast and efficient algorithm based on k-means to capture the geometric structure. We demonstrate the performance of the algorithm and compare it to established techniques on simulated data.
In the second part, we take the common probabilistic formulation of the model and address the inference inefficiencies of currently used algorithms. We design a new inference procedure based on Metropolis-Hastings and suggest a new method of proposing candidates for high-dimensional probability vectors via the Generalized Beta distribution. We also consider the supervised setting, where documents have class labels, and generalize our algorithm to this case. Performance is evaluated with several simulation studies and a political blogs data set in which each document is labeled either liberal or conservative.
In the third part, we propose an EM algorithm based on a closed-form posterior approximation with Carlson's multiple hypergeometric functions.