BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//UM//UM*Events//EN
CALSCALE:GREGORIAN
BEGIN:VTIMEZONE
TZID:America/Detroit
TZURL:http://tzurl.org/zoneinfo/America/Detroit
X-LIC-LOCATION:America/Detroit
BEGIN:DAYLIGHT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
TZNAME:EDT
DTSTART:20070311T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
TZNAME:EST
DTSTART:20071104T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20260403T113109Z
DTSTART;TZID=America/Detroit:20260429T090000
DTEND;TZID=America/Detroit:20260429T110000
SUMMARY:Lecture / Discussion: Principled Evaluation of Large Language Models: A Statistical Perspective
DESCRIPTION:The rapid progress of large language models has outpaced the development of principled methodologies for their evaluation. This dissertation draws on ideas from psychometrics and statistics to build rigorous\, efficient\, and interpretable evaluation frameworks for modern AI systems. In this talk\, I focus on three contributions that address complementary challenges in LLM evaluation.\n\nFirst\, I present PromptEval\, a method that confronts the problem of prompt sensitivity — the phenomenon whereby minor rephrasing of benchmark questions can substantially alter measured model performance. By combining Item Response Theory with matrix completion\, PromptEval efficiently approximates the full distribution of model performance across hundreds of prompt variations while requiring less than 5% of the total evaluations\, replacing arbitrary single-prompt assessments with statistically robust characterizations of model behavior.\n\nSecond\, I introduce skill-based scaling laws that model LLM performance through latent capabilities such as reasoning and instruction-following. Inspired by factor analysis\, this approach exploits the correlation structure among benchmark tasks to produce scaling predictions that are both more accurate and more interpretable than existing laws\, which typically focus on aggregate validation loss and fail to generalize across model families.\n\nThird\, I present Bridge\, a unified statistical framework that explicitly connects LLM-as-a-Judge evaluations to human assessments. 
 Bridge models the systematic discrepancies between human and LLM judgments through a latent preference score and a linear transformation of divergence-capturing covariates\, enabling principled recalibration of automated scores and formal statistical testing for human–LLM gaps.\n\nTogether\, these contributions advance a vision of AI evaluation as a scientific discipline in its own right — one that demands the same statistical care we expect from the systems being evaluated.
UID:147383-21900951@events.umich.edu
URL:https://events.umich.edu/event/147383
CLASS:PUBLIC
STATUS:CONFIRMED
CATEGORIES:Dissertation
LOCATION:West Hall - 470
END:VEVENT
END:VCALENDAR