Presented By: Department of Statistics Dissertation Defenses
Modeling Structure in Unstructured Data: Statistical and Causal Perspectives
Kevin Christian Wibisono
Modern machine learning systems are trained on massive amounts of unstructured data such as text, images, and sequences. Despite the apparent lack of explicit structure, these systems exhibit remarkable abilities to learn patterns, perform reasoning, and support decision-making. This paradox raises a central question: what structure do these models recover from unstructured data, and how can we understand and use it?
This dissertation investigates how language models (i) represent structure through their architectures, (ii) learn structure from unstructured data, and (iii) enable us to leverage this learned structure for principled causal inference with unstructured data.
The first part develops a statistical perspective on attention mechanisms, the core building block of modern language models. We show that attention can be interpreted as an adaptive mixture-of-experts model. This interpretation lets us extend attention to general exponential-family data, making it capable of modeling complex, heterogeneous data beyond text. In turn, this perspective reframes attention as a statistical model, explaining how it captures complex dependencies and latent structure, with guarantees on identifiability and generalization.
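To make the mixture-of-experts reading concrete, here is a minimal numpy sketch (illustrative only, not code from the dissertation; all names are hypothetical). It writes single-head softmax attention so that each output is a convex combination of per-token value vectors, with the softmax scores acting as input-dependent gating weights over "experts":

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_as_moe(X, Wq, Wk, Wv):
    """Single-head attention, read as an adaptive mixture of experts.

    Each value vector V[j] plays the role of an "expert"; the softmax
    scores form input-dependent gating weights that mix the experts.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    gates = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # each row sums to 1
    return gates @ V  # convex combination of expert outputs per token

rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = attention_as_moe(X, Wq, Wk, Wv)
print(out.shape)  # (5, 8): one mixed "expert" output per token
```

Under this reading, the exponential-family extension amounts to changing the distributional model attached to the experts while keeping the same adaptive gating, which is what allows the mechanism to handle data beyond text.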
The second part examines how such structure arises from unstructured training data. We show that many in-context learning behaviors can emerge directly from co-occurrence patterns in unstructured text, linking modern models to classical co-occurrence tools such as latent factor models. At the same time, we identify the limits of this mechanism: positional structure becomes essential for more complex reasoning tasks. We further demonstrate that training data composition plays a critical role in shaping model behavior and alignment, with example difficulty acting as a key factor.
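As a toy illustration of the classical co-occurrence tools referenced above (a hypothetical corpus and variable names, not the dissertation's experiments), the sketch below builds a word-word co-occurrence matrix and factors it with a truncated SVD, a standard latent factor model. Words with similar co-occurrence profiles land near each other in the latent space, the kind of structure the abstract argues can support in-context-learning-like behavior:

```python
import numpy as np

# Hypothetical toy corpus: count co-occurrence within a +/-1 word window.
corpus = [
    "cats chase mice", "dogs chase cats",
    "mice fear cats", "dogs fear nothing",
]
vocab = sorted({w for s in corpus for w in s.split()})
idx = {w: i for i, w in enumerate(vocab)}

C = np.zeros((len(vocab), len(vocab)))
for s in corpus:
    toks = s.split()
    for i, w in enumerate(toks):
        for j in (i - 1, i + 1):
            if 0 <= j < len(toks):
                C[idx[w], idx[toks[j]]] += 1

# Classical latent factor model: truncated SVD of the co-occurrence matrix.
U, S, Vt = np.linalg.svd(C)
k = 2
emb = U[:, :k] * S[:k]  # k-dimensional word embeddings

def nearest(word):
    """Words whose co-occurrence profiles are closest in the latent space."""
    v = emb[idx[word]]
    sims = emb @ v / (np.linalg.norm(emb, axis=1) * np.linalg.norm(v) + 1e-9)
    return [vocab[i] for i in np.argsort(-sims)[:3]]

print(nearest("cats"))
```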
The final part studies how learned representations in language models can be leveraged for causal inference in high-dimensional, unstructured settings. Our approach identifies causal variables directly within the representation space, enabling well-defined estimation of causal effects when treatments or outcomes are themselves unstructured. In particular, we isolate representation directions corresponding to the most causally influential treatment components and the most salient treatment-induced outcome variations.
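One simple way such directions could be isolated, sketched below under strong simplifying assumptions (synthetic linear data; an illustrative partial-least-squares-style computation, not the dissertation's estimator), is to take the leading singular vectors of the cross-covariance between treatment and outcome representations:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_t, d_y = 500, 16, 12

# Hypothetical setup: treatment embeddings T; a single latent direction
# of T drives the outcome embeddings Y, plus noise.
T = rng.normal(size=(n, d_t))
beta = rng.normal(size=d_t)
beta /= np.linalg.norm(beta)          # causal treatment direction
load = rng.normal(size=d_y)
load /= np.linalg.norm(load)          # induced outcome direction
Y = np.outer(T @ beta, load) + 0.5 * rng.normal(size=(n, d_y))

# PLS-style first component: leading singular vectors of the
# cross-covariance between centered treatment and outcome embeddings.
Tc, Yc = T - T.mean(0), Y - Y.mean(0)
U, S, Vt = np.linalg.svd(Tc.T @ Yc / n)
t_dir, y_dir = U[:, 0], Vt[0]

print(abs(t_dir @ beta))  # near 1: recovers the influential treatment direction
print(abs(y_dir @ load))  # near 1: recovers the induced outcome variation
```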
Together, these results provide a unified perspective on how modern machine learning systems extract structure from unstructured data, and how that structure can be harnessed for rigorous statistical and causal analysis.