Presented By: Department of Statistics
Statistics Department Seminar Series: Irina Gaynanova, Associate Professor, Biostatistics, Statistics (by courtesy), University of Michigan
"Representation learning for multi-view data integration"
Abstract: Multi-view data, where different data types are collected from the same samples, are increasingly prevalent due to advances in omics and wearable technologies. For instance, The Cancer Genome Atlas provides omics data from multiple platforms, while affordable digital technologies enable the collection of multiple types of high-frequency wearable signals (e.g., continuous glucose monitoring (CGM), actigraphy) alongside tabular clinical characteristics. Integrating these multi-view data has the potential to enhance scientific insights but also presents significant analytic challenges. In this talk, I will focus on one critical problem in multi-view representation learning: distinguishing between joint and individual signal subspaces in noisy, high-dimensional data. I will present our recent work, where we characterize the conditions under which these subspaces can be reliably identified, based on an analysis of spectrum perturbations of the product of projection matrices. We develop an easy-to-use, scalable estimation algorithm based on these insights, which employs the rotational bootstrap and random matrix theory to partition the observed spectrum into joint, individual, and noise subspaces. I will illustrate this method using multi-omics data from colorectal cancer patients and a nutrigenomic study of mice. Towards the end of the talk, I will broaden the discussion to the unique challenges of high-frequency wearable data, where a distributional representation is more attractive than a matrix representation of derived features. I will briefly highlight some recent work in this area and conclude by outlining open problems and future research directions for multi-view representation learning.
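For readers unfamiliar with the idea of separating joint from individual subspaces, the following is a minimal, noise-free sketch of the underlying linear algebra, not the speaker's algorithm. It relies on a standard fact: the nonzero eigenvalues of the product of two projection matrices are the squared cosines of the principal angles between the corresponding subspaces, so a shared (joint) direction shows up as a value of 1 and orthogonal (individual) directions as 0. The function name, the synthetic data, and the choice of ranks are all assumptions made for illustration; the noisy, high-dimensional case that the talk addresses (rotational bootstrap, random matrix theory) is not attempted here.

```python
import numpy as np


def principal_angle_cosines(X1, X2, r1, r2):
    """Cosines of the principal angles between the leading r1- and
    r2-dimensional column subspaces of two views X1 and X2.

    These are the singular values of U1.T @ U2, where U1 and U2 are
    orthonormal bases; their squares are the nonzero eigenvalues of
    the product of the two projection matrices.
    """
    U1 = np.linalg.svd(X1, full_matrices=False)[0][:, :r1]
    U2 = np.linalg.svd(X2, full_matrices=False)[0][:, :r2]
    return np.linalg.svd(U1.T @ U2, compute_uv=False)


# Hypothetical synthetic example: n samples measured in two views.
rng = np.random.default_rng(0)
n = 200

# Three mutually orthonormal score vectors: one joint, one per view.
Q, _ = np.linalg.qr(rng.standard_normal((n, 3)))
joint, ind1, ind2 = Q[:, [0]], Q[:, [1]], Q[:, [2]]

# Each view has rank 2: the joint direction plus its own individual one.
X1 = np.hstack([joint, ind1]) @ rng.standard_normal((2, 30))
X2 = np.hstack([joint, ind2]) @ rng.standard_normal((2, 40))

cosines = principal_angle_cosines(X1, X2, r1=2, r2=2)
print(np.round(cosines, 6))  # one value near 1 (joint), one near 0 (individual)
```

In this clean setting the spectrum splits exactly into a value near 1 and a value near 0. With noise the values spread out between those extremes, which is why a principled rule for partitioning the observed spectrum is needed.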