Presented By: Department of Statistics

Statistics Department Seminar Series: Florentina Bunea, Professor, Department of Statistics and Data Science, Cornell University

"Optimal estimation of topic distributions in topic models with applications to Wasserstein document-distance calculations"

The focus of this talk is on the estimation of high-dimensional, discrete, possibly sparse, mixture models in the context of topic models. The data consist of p-dimensional multinomial count vectors, corresponding to p words in a given dictionary, across n independent samples, the documents in a corpus. In topic models, the p × n expected word-frequency matrix is assumed to factorize as the product of a p × K word-topic matrix A and a K × n topic-document matrix T. Since the columns of both matrices represent (conditional) probability vectors, the columns of A are viewed as p-dimensional mixture components common to all documents, while the columns of T, the topic distributions, are viewed as the K-dimensional mixture weights that are document specific and are allowed to be sparse.
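The generative structure described above can be sketched in a few lines of NumPy. The dimensions, Dirichlet parameters, and document length below are illustrative assumptions, not values from the talk; the point is only how the factorization A T produces the multinomial counts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not from the talk):
# p words, K topics, n documents, N words per document.
p, K, n, N = 50, 3, 20, 200

# Word-topic matrix A: each column is a distribution over the p words.
A = rng.dirichlet(np.ones(p), size=K).T          # shape (p, K)
# Topic-document matrix T: each column is a (possibly sparse) topic distribution;
# a small Dirichlet parameter pushes the weights toward sparsity.
T = rng.dirichlet(np.ones(K) * 0.3, size=n).T    # shape (K, n)

# Expected word-frequency matrix factorizes as A @ T;
# the observed counts are multinomial draws from its columns.
Pi = A @ T                                       # shape (p, n)
X = np.stack([rng.multinomial(N, Pi[:, j]) for j in range(n)], axis=1)
```

Each column of Pi sums to one because the columns of A and T do, so each document's counts are a valid p-dimensional multinomial draw.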

The main interest is to provide sharp, finite sample, l1-norm convergence rates for estimators of the possibly sparse mixture weights T when A is either known or unknown. For known A, we suggest MLE estimation of T. Despite the widespread application of these models, and the simplicity of the method, the analysis is, surprisingly, still open, owing in part to the fact that T is typically on the boundary of its domain. Our non-standard analysis of the MLE not only establishes its l1 convergence rate, but also reveals a remarkable property: the MLE, with no extra regularization, can be exactly sparse and contain the true zero pattern of T. We further show that the MLE is both minimax optimal and adaptive to the unknown sparsity in a large class of sparse topic distributions. When A is unknown, we estimate T by optimizing the likelihood function corresponding to a plug-in, generic, estimator of A. For any estimator that satisfies carefully detailed conditions for proximity to A, we show that the resulting estimator of T retains the properties established for the MLE. Our theoretical results allow the ambient dimensions K and p to grow with the sample sizes.
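For known A, the MLE of a single document's topic weights maximizes the multinomial log-likelihood over the probability simplex. A standard way to compute it is the EM-style multiplicative update for mixture proportions, sketched below; this is a generic numerical recipe, not necessarily the algorithm analyzed in the talk, and the function name is our own.

```python
import numpy as np

def mle_topic_weights(x, A, n_iter=500, tol=1e-10):
    """EM-style multiplicative updates for the weights t maximizing
    sum_i x_i * log((A t)_i) over the probability simplex.

    x: (p,) word counts for one document; A: (p, K) word-topic matrix.
    A generic sketch of MLE computation for mixture weights; not
    necessarily the exact procedure from the talk.
    """
    p, K = A.shape
    N = x.sum()
    t = np.full(K, 1.0 / K)          # uniform starting point in the simplex
    for _ in range(n_iter):
        m = A @ t                    # current fitted word probabilities
        # Guard against division by zero for words with zero fitted mass.
        ratio = np.divide(x, m, out=np.zeros_like(m, dtype=float), where=m > 0)
        t_new = t * (A.T @ ratio) / N    # multiplicative EM update; stays in simplex
        if np.abs(t_new - t).sum() < tol:
            return t_new
        t = t_new
    return t
```

Note that the update preserves the simplex constraint exactly at every step, and, consistent with the boundary phenomenon mentioned above, iterates can be driven arbitrarily close to zero on unused topics without any explicit regularization.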

Our main application is to the estimation of 1-Wasserstein distances between document generating distributions. We propose, estimate, and analyze new 1-Wasserstein distances between alternative probabilistic document representations, at the word and topic level, respectively, and derive finite sample bounds on the estimated distances. For word-level document distances, we contrast our rates with existing rates for the 1-Wasserstein distance between standard empirical frequency estimates. The effectiveness of the proposed 1-Wasserstein distances is illustrated by an analysis of an IMDb movie reviews data set.
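The 1-Wasserstein (earth mover's) distance between two discrete distributions with a given ground cost can be computed as a small linear program. The sketch below is a generic optimal-transport solver, assuming SciPy's `linprog`; in the setting above one would plug estimated word- or topic-level document distributions into `mu` and `nu`.

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein1(mu, nu, cost):
    """1-Wasserstein distance between discrete distributions mu (length p)
    and nu (length q), given a (p, q) ground cost matrix, solved as the
    optimal-transport linear program. Generic sketch, not the talk's
    specific estimator.
    """
    p, q = len(mu), len(nu)
    c = cost.reshape(-1)                     # transport plan flattened row-major
    A_eq = np.zeros((p + q, p * q))
    for i in range(p):                       # mass shipped out of source i equals mu[i]
        A_eq[i, i * q:(i + 1) * q] = 1.0
    for j in range(q):                       # mass shipped into target j equals nu[j]
        A_eq[p + j, j::q] = 1.0
    b_eq = np.concatenate([mu, nu])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun
```

The LP formulation scales poorly in the vocabulary size p, which is one practical motivation for the topic-level (K-dimensional) document representations discussed in the talk.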


Florentina Bunea is a Professor in the Department of Statistics and Data Science at Cornell University. Her research is broadly centered on statistical machine learning theory and high-dimensional statistical inference.

https://stat.cornell.edu/people/faculty/florentina-bunea