Presented By: Department of Statistics
Statistics Department Seminar Series: Fred Feinberg, Handleman Professor of Management, Ross School of Business, Professor of Statistics (by courtesy), Department of Statistics, University of Michigan
"Harmonizing Discord: A Bayesian Model of Multi-Rater Agreement with Nonignorable Recusals (and a surprise empirical application!)"
Abstract: Expert adjudications are ubiquitous in high-stakes decision-making, from grant reviews and academic hiring to elite evaluations in the arts and athletics. In these settings, panels of judges score candidates across sequential stages, and these scores are aggregated into a consensus ranking. Standard practice employs arithmetic averaging, often supplemented with ad hoc "corrections" for outliers or scale differences. However, such approaches suffer from three core statistical problems: (1) Scale Heterogeneity, where judges exhibit varying levels of discrimination and range restriction; (2) Information Loss, where the longitudinal "trajectory" of a candidate is sidestepped in favor of stage-specific snapshots; and (3) Nonignorable Missingness, where conflict-of-interest (COI) recusals can introduce systematic bias.
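As a toy numeric illustration of problem (1) (the judges, candidates, and scores below are invented, not from the application), arithmetic averaging lets a wide-scale judge outvote a range-restricted one even though each casts exactly one ordinal preference:

    # Toy illustration of scale heterogeneity (hypothetical scores).
    scores = {
        "judge_A": {"X": 8.9, "Y": 9.0},  # range-restricted: prefers Y, barely
        "judge_B": {"X": 9.5, "Y": 3.0},  # wide-range: prefers X, emphatically
    }

    # Arithmetic averaging lets judge_B's scale usage decide the outcome ...
    avg = {c: sum(s[c] for s in scores.values()) / len(scores) for c in ("X", "Y")}
    print(avg)  # {'X': 9.2, 'Y': 6.0} -> X "wins" on averages

    # ... yet ordinally the panel is split 1-1; only the width of judge_B's
    # scale, not any extra information, breaks the tie.

A rank-based treatment sees a 1-1 split here; the cardinal average manufactures a decisive winner out of scale usage alone.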
We develop a hierarchical Bayesian framework that addresses these issues simultaneously. First, we treat observed scores as generators of ordinal tie-blocks, bypassing the "cardinality fallacy" and modeling the probability of the observed ranks. Second, we link sequential rounds via a fusion model with LKJ correlation priors, allowing the model to borrow strength across the tournament while regularizing the latent covariance. Third, we introduce a novel Informative Missing Data Likelihood (MDL) that treats COI recusals as a form of informative censoring. When judges abstain from rating their own students or collaborators, standard approaches invoke a "Missing Completely at Random" (MCAR) assumption. Our MDL instead retains recused candidates in the "risk set" as censored alternatives, correcting for the bias in win probabilities that arises when high-caliber competitors are systematically excluded from a judge’s denominator. The model combines a Plackett–Luce formulation for tied data (implemented via elementary symmetric polynomials) with judge-specific discrimination parameters that automatically downweight poorly calibrated raters; the full posterior can be sampled efficiently via Hamiltonian Monte Carlo, allowing full uncertainty quantification for downstream estimands.
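To fix ideas, here is a minimal sketch of the central likelihood ingredient in notation of my own choosing (the authors' exact parameterization may differ): with latent worth \theta_i for candidate i and discrimination \alpha_j for judge j, the Plackett–Luce probability that candidate i heads a remaining risk set R is

    P_j(i \mid R) \;=\; \frac{\exp\{\alpha_j \theta_i\}}{\sum_{k \in R} \exp\{\alpha_j \theta_k\}},

and a judge's full ranking multiplies such terms as R shrinks. A recused candidate c then stays in each R, contributing to the denominator while its own position is treated as censored; tie-blocks are scored by aggregating the \exp\{\alpha_j \theta_k\} terms through elementary symmetric polynomials rather than enumerating all within-block orders.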
We apply this framework to a high-stakes international competition — to be revealed during the talk! — featuring dozens of candidates, multiple rounds, and nearly 20 elite judges. Analysis suggests that the standard scoring method and the MDL-augmented model produce distinctly different results: they disagree on the winner and posterior advancement probabilities, driven almost entirely by the differential treatment of collaborator-based recusals. Sensitivity analysis reveals that these outcomes are largely contingent on the assumed missing data mechanism. By making these untestable assumptions explicit, we provide a more transparent and principled foundation for high-stakes adjudication in grant panels, hiring committees, and both athletic and artistic judging.
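To make the recusal mechanism concrete, here is a minimal Python sketch with invented worths (nothing below is taken from the competition data): dropping a recused front-runner from a judge's risk set, as an MCAR-style analysis effectively does, inflates the remaining candidates' Plackett–Luce win probabilities, whereas retaining the recused candidate as a censored alternative keeps the denominator faithful to the full field.

    import math

    # Hypothetical latent worths; "C" is a strong competitor recused from
    # one judge's slate due to a conflict of interest.
    theta = {"A": 1.0, "B": 0.8, "C": 1.4, "D": 0.2}

    def pl_win_prob(target, risk_set, theta):
        """Plackett-Luce probability that `target` ranks first in `risk_set`."""
        w = {k: math.exp(theta[k]) for k in risk_set}
        return w[target] / sum(w.values())

    # MCAR-style treatment: the recused candidate is simply dropped.
    p_drop = pl_win_prob("A", ["A", "B", "D"], theta)        # ~0.44

    # Censoring-style treatment: C stays in the risk set as a censored
    # alternative, so the denominator still reflects the full field.
    p_censor = pl_win_prob("A", ["A", "B", "C", "D"], theta) # ~0.27

The size of the gap depends on the assumed worth of the recused candidate, which is exactly the untestable quantity the sensitivity analysis varies.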