When / Where
All occurrences of this event have passed.
This listing is displayed for historical purposes.

Presented By: Department of Statistics Dissertation Defenses

Statistical Estimation and Inference for Large-Scale Categorical Data

Chengcheng Li

Abstract:
Categorical data have become increasingly ubiquitous in the modern big data era. In this dissertation, we propose novel statistical learning and inference methods for large-scale categorical data, with a special focus on latent variable models and their applications to psychometrics. In psychometric assessments, the subjects' underlying aptitude often cannot be fully captured by raw scores because item difficulties differ. Latent variable models are widely used to capture this unobserved proficiency. This dissertation studies two types of latent variable models with categorical responses. The first type, commonly known as cognitive diagnosis models (CDMs), assumes multiple discrete latent traits and forms a special family of discrete latent variable models. The second type, commonly known as item response theory (IRT) models, assumes a continuous latent score. Although both have been widely applied in large-scale assessments for diagnostic purposes, many challenges remain for efficient learning and statistical inference. This dissertation studies four important problems that arise in these contexts.

The first part develops novel algorithms to estimate large latent Q-matrices in CDMs. The Q-matrix plays an important role in CDMs: it specifies the dependence between items and subjects' latent attributes. Accurate knowledge of the Q-matrix is critical for cognitive diagnosis, item categorization, and assessment design. In practice, however, many assessments do not provide a Q-matrix or lack an accurate Q-matrix specification. Existing methods do not scale with the size of the Q-matrix, even though large Q-matrices are prevalent. We propose a penalized likelihood approach, with computational complexity growing linearly in the size of the Q-matrix, to learn large Q-matrices from observational data. The estimation consistency and the robustness of the proposed method across various CDMs are also established.
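To make the role of the Q-matrix concrete, the following is a minimal sketch using the DINA model, one widely used CDM. The abstract does not say which CDMs are studied or how the penalized likelihood is built, so the 3-item, 2-attribute Q-matrix and the slip and guess values below are hypothetical and purely illustrative.

```python
# Hypothetical Q-matrix: Q[j][k] = 1 means item j requires latent attribute k.
Q = [
    [1, 0],  # item 0 requires attribute 0 only
    [0, 1],  # item 1 requires attribute 1 only
    [1, 1],  # item 2 requires both attributes
]


def ideal_response(alpha, q_row):
    """DINA ideal response: 1 only if the subject masters every required attribute."""
    return int(all(a >= q for a, q in zip(alpha, q_row)))


def dina_prob(alpha, q_row, slip, guess):
    """P(correct answer): 1 - slip if all required attributes are mastered, else guess."""
    return 1.0 - slip if ideal_response(alpha, q_row) else guess


# A subject who has mastered attribute 0 but not attribute 1.
alpha = [1, 0]
probs = [dina_prob(alpha, q_row, slip=0.1, guess=0.2) for q_row in Q]
```

Here the subject is likely to answer only item 0 correctly, because items 1 and 2 both require the unmastered attribute. Estimating the Q-matrix means recovering the 0/1 pattern in `Q` from response data alone.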

The second part develops learning and inference methods for a unidimensional IRT model, the Rasch model, under the missing data setting. Data missingness is prevalent in large-scale assessments; examples include the SAT and GRE, where responses are combined from multiple tests administered year-round from a large item pool. Direct inference for comparing subjects' latent scores under the missing data setting remains an open and challenging problem in the literature. In this part, we obtain point estimators for the latent scores and derive their asymptotic distributions under a flexible missing-entry design in double asymptotic settings, where both the number of subjects and the number of items grow. We show that our estimator is statistically efficient and optimal, which is among the first such results in the binary matrix completion literature.
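As a rough illustration of the setting, the sketch below writes out the standard Rasch response probability and a log-likelihood that sums over observed entries only. It is not the dissertation's estimator, which the abstract does not specify; the function names and the use of `None` to mark a missing response are assumptions made for this example.

```python
import math


def rasch_prob(theta, b):
    """Rasch model: P(correct) = exp(theta - b) / (1 + exp(theta - b))."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def observed_loglik(responses, thetas, bs):
    """Log-likelihood over observed entries; responses[i][j] is 1, 0, or None (missing)."""
    total = 0.0
    for i, row in enumerate(responses):
        for j, x in enumerate(row):
            if x is None:
                continue  # missing entries contribute nothing
            p = rasch_prob(thetas[i], bs[j])
            total += math.log(p) if x == 1 else math.log(1.0 - p)
    return total


# Hypothetical data: 2 subjects, 3 items, each subject sees only 2 items.
responses = [
    [1, None, 0],
    [None, 1, 1],
]
loglik = observed_loglik(responses, thetas=[0.5, 1.0], bs=[0.0, 0.5, 1.0])
```

Because each subject answers a different subset of items, raw scores are not directly comparable, which is why inference on the latent scores `thetas` is the quantity of interest.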

The third part concerns measurement biases in IRT models. Novel estimation and inference procedures are developed for biases introduced by measurement non-invariant items under the differential item functioning (DIF) framework. Existing methods either require knowledge of anchor items, i.e., DIF-free items, or adopt regularization to ensure model identifiability, in which case easy inference is not permitted. We propose a novel minimal L1 condition for simultaneous DIF detection and model identification. It does not require any knowledge of anchor items and permits easy inference in both binary-group and multiple-group settings.
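The identifiability issue can be seen in a simple two-group Rasch-type DIF model. The notation below is an assumption made for illustration and is not taken from the dissertation: $x_i \in \{0, 1\}$ is the group indicator, $b_j$ the item difficulty, $\gamma_j$ the DIF effect of item $j$, and $\mu$ the mean shift of the latent score in the focal group. The final line is only one plausible reading of a "minimal L1" condition, not a statement of the author's definition.

```latex
% Response model with group-specific item effects
\[
  \operatorname{logit} P(Y_{ij} = 1 \mid \theta_i, x_i)
    = \theta_i - b_j + \gamma_j x_i,
  \qquad
  \theta_i = \mu x_i + \varepsilon_i .
\]

% Only the sums \mu + \gamma_j enter the model, so for any constant c the map
\[
  (\mu, \gamma_1, \dots, \gamma_J)
    \;\mapsto\;
  (\mu + c, \gamma_1 - c, \dots, \gamma_J - c)
\]
% leaves the distribution of the responses unchanged.

% Anchor items resolve this by fixing \gamma_j = 0 for known items j.
% An L1-type alternative (assumed form) picks the representative with smallest L1 norm:
\[
  \hat{c} = \arg\min_{c} \sum_{j=1}^{J} \lvert \gamma_j - c \rvert .
\]
```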

The fourth part considers privacy issues in releasing tabular (categorical) data to the public. We recommend an optimal mechanism, in which data utility is maximized subject to a privacy constraint, under the differential privacy (DP) framework. Common user practices, including merging related cells and integrating multiple data sources, are considered. Valid inference procedures are developed for the resulting DP-protected data.
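For readers unfamiliar with DP releases of tabular data, the sketch below uses the standard Laplace mechanism as a baseline. It is not the optimal mechanism recommended in the dissertation, which the abstract does not describe. The counts, the value of `epsilon`, and the function names are hypothetical, and a sensitivity of 1 assumes neighboring datasets differ by adding or removing one record.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatize_counts(counts, epsilon, rng, sensitivity=1.0):
    """Add independent Laplace noise with scale sensitivity / epsilon to each cell."""
    scale = sensitivity / epsilon
    return [c + laplace_noise(scale, rng) for c in counts]


rng = random.Random(0)  # fixed seed so the example is reproducible
counts = [120, 45, 30, 5]  # hypothetical cell counts of a one-way table
noisy = privatize_counts(counts, epsilon=1.0, rng=rng)

# A user merging the last two cells works only with the released noisy values,
# so the merge is post-processing and does not weaken the privacy guarantee.
merged = noisy[2] + noisy[3]
```

Merging adds the noise of both cells together, so the merged count is noisier than a single cell. This is the kind of user practice that a valid inference procedure has to account for.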
