
Presented By: Department of Statistics Dissertation Defenses

Modeling Structure in Unstructured Data: Statistical and Causal Perspectives

Kevin Christian Wibisono

Modern machine learning systems are trained on massive amounts of unstructured data such as text, images, and sequences. Despite the apparent lack of explicit structure, they exhibit remarkable abilities to learn patterns, perform reasoning, and support decision-making. This paradox raises a central question: what structure do these models recover from unstructured data, and how can we understand and use it?

This dissertation investigates how language models (i) represent structure through their architectures, (ii) learn structure from unstructured data, and (iii) enable us to leverage this learned structure for principled causal inference with unstructured data.

The first part develops a statistical perspective on attention mechanisms, the core building block of modern language models. We show that attention can be interpreted as an adaptive mixture-of-experts model. This interpretation enables us to extend attention to model general exponential-family-distributed data, making it capable of modeling complex, heterogeneous data beyond text. In turn, this perspective reframes attention as a statistical model, explaining how it captures complex dependencies and latent structure, with guarantees on identifiability and generalization.
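To make the mixture-of-experts reading concrete, here is a minimal NumPy sketch (an illustration of the general idea, not the dissertation's formulation): the softmax over query-key similarities acts as a gating distribution, and each value vector plays the role of an expert whose outputs are mixed.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_as_mixture(q, K, V):
    # Gating weights: a probability distribution over "experts",
    # adapted to the query via query-key similarity.
    gates = softmax(q @ K.T)          # shape (n_keys,)
    # Each value vector is an expert's output; attention returns
    # their gate-weighted mixture.
    return gates @ V                  # shape (d_v,)

rng = np.random.default_rng(0)
q = rng.normal(size=4)                # one query
K = rng.normal(size=(5, 4))           # 5 keys
V = rng.normal(size=(5, 3))           # 5 value ("expert") vectors
out = attention_as_mixture(q, K, V)
```

Because the gates are nonnegative and sum to one, the output is a convex combination of the value vectors, which is exactly the structure of a mixture model with input-dependent mixing weights.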

The second part examines how such structure arises from unstructured training data. We show that many in-context learning behaviors can emerge directly from co-occurrence patterns in unstructured text, linking modern models to classical co-occurrence modeling tools like latent factor modeling. At the same time, we identify the limits of this mechanism: positional structure becomes essential for more complex reasoning tasks. We further demonstrate that training data composition plays a critical role in shaping model behavior and alignment, with example difficulty acting as a key factor.
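The link to classical co-occurrence modeling can be illustrated with a toy sketch (a hypothetical example, not drawn from the dissertation): a latent factor model fit to within-sentence co-occurrence counts already places related words close together in embedding space.

```python
import numpy as np

# Toy corpus: two "topics" whose words co-occur within sentences.
sentences = [
    ["cat", "dog", "pet"],
    ["dog", "pet", "cat"],
    ["stock", "bond", "market"],
    ["market", "stock", "bond"],
]
vocab = sorted({w for s in sentences for w in s})
idx = {w: i for i, w in enumerate(vocab)}

# Symmetric word-by-word co-occurrence counts within each sentence.
C = np.zeros((len(vocab), len(vocab)))
for s in sentences:
    for a in s:
        for b in s:
            if a != b:
                C[idx[a], idx[b]] += 1

# Classical latent factor model: low-rank embeddings via truncated SVD.
U, S, _ = np.linalg.svd(C)
emb = U[:, :2] * S[:2]                # 2-dimensional word embeddings

def sim(a, b):
    # Cosine similarity between two word embeddings.
    u, v = emb[idx[a]], emb[idx[b]]
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
```

Words from the same topic end up with nearly identical embeddings while cross-topic pairs are nearly orthogonal, so co-occurrence statistics alone recover the latent grouping.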

The final part studies how learned representations in language models can be leveraged for causal inference in high-dimensional, unstructured settings. Our approach identifies causal variables directly within the representation space, enabling well-defined estimation of causal effects when treatments or outcomes are themselves unstructured. In particular, we isolate representation directions corresponding to the most causally influential treatment components and the most salient treatment-induced outcome variations.
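One simple way to isolate a treatment direction in representation space can be sketched as follows (a hypothetical difference-in-means illustration on synthetic data, not the dissertation's estimator): if treated and control texts differ along a latent direction, the normalized difference of group-mean representations recovers it.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16

# Synthetic setup: representations of control vs. treated texts,
# where treatment shifts embeddings along one latent direction.
true_dir = np.zeros(d)
true_dir[0] = 1.0
control = rng.normal(size=(200, d))
treated = rng.normal(size=(200, d)) + 3.0 * true_dir

# Difference in group means, normalized, as a candidate
# "treatment direction" in representation space.
delta = treated.mean(axis=0) - control.mean(axis=0)
direction = delta / np.linalg.norm(delta)

# Alignment with the ground-truth direction (1.0 = perfect).
alignment = abs(direction @ true_dir)
```

Projecting representations onto such a direction reduces a high-dimensional, unstructured treatment to a low-dimensional causal variable on which effects can be estimated.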

Together, these results provide a unified perspective on how modern machine learning systems extract structure from unstructured data, and how that structure can be harnessed for rigorous statistical and causal analysis.
