Chair: Jason Corso
In person in FRB 2300 and on Zoom:
https://umich.zoom.us/j/95963594618
Passcode: HAZY
Abstract:
While deep learning problems are often motivated as enabling technologies for human-computer interaction---a support robot, for example, must align natural language referents and sensor readings to operate in a human world---the assumptions underlying these works make them poorly suited to real-world human interaction. Specifically, evaluation typically assumes that humans are oracles that provide semantically correct and unambiguous information, and that all such information is equally useful. While this is enforced in controlled experiments via carefully curated datasets, models operating in the wild will need to compensate for the fact that humans are hazy oracles that may provide information that is incorrect, ambiguous, or misaligned with the features learned by the model. For example, given a choice of three mugs, a robot would not be able to satisfy a request to retrieve "the mug," but would be able to retrieve "the orange mug."
A natural question follows: how can we use deep learning models trained under the oracle assumption with hazy humans? We answer this question via a method we call deferred inference, which allows models trained via supervised learning to solicit and integrate additional information from the human when necessary. Deferred inference begins with a method for determining whether the model should defer inference and wait for additional human-provided information. Past work has generally simplified this into one of two questions: is the human-provided information correct, or is the output correct? We find that both approaches are insufficient due to the complex relationship between human inputs, sensor readings, and deep models: low-quality human-provided information may not cause error, while high-quality human-provided information may not correct it. To address this misalignment between input and output error, we introduce Dual-loss Additional Error Regression, or DAER, a method that successfully locates instances where a new human input can reduce error.
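To make the loop concrete, a minimal sketch of deferred inference is given below. It assumes hypothetical placeholders for the model, a DAER-style deferral score (daer_score), a threshold (tau), and a request_new_input helper; none of these names come from the talk itself, and this is an illustration of the general idea rather than the authors' implementation.

# Hedged sketch of a deferred-inference loop. All names (model, daer_score,
# request_new_input, tau) are hypothetical placeholders, not the authors' API.

def deferred_inference(model, daer_score, sensor_input, human_input,
                       request_new_input, tau=0.5, max_deferrals=2):
    """Run the model; if the predicted additional error is high, defer and
    solicit another human input before re-running inference."""
    prediction = model(sensor_input, human_input)
    for _ in range(max_deferrals):
        # daer_score estimates how much error a fresh human input could remove.
        if daer_score(sensor_input, human_input, prediction) < tau:
            break  # confident enough: accept the current prediction
        human_input = request_new_input()            # e.g., a new referring expression
        prediction = model(sensor_input, human_input)
    return prediction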
Although an effective deferral function is necessary to optimize the trade-off between human effort and error, we must additionally consider that the deferral response is itself subject to the effects of hazy oracles. For this reason, we must consider not only how to find error caused by human input but also how to integrate deferral responses and measure the performance of the human-model team. To this end, we introduce aggregation functions that integrate information across multiple inferences and a novel evaluation framework that measures the trade-off between error and additional human effort. Through this evaluation, we show that we can reduce error by up to 48% at a reasonable level of human effort, without any changes to training or architecture.
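A rough sketch of this kind of aggregation and error-versus-effort evaluation might look as follows, assuming hypothetical arrays of per-example deferral scores, errors before deferral, and errors after deferral; again, this is a sketch of the general idea, not the authors' code.

import numpy as np

# Hedged sketch: confidence-weighted aggregation across multiple inferences and
# an error-vs-effort curve traced by sweeping the deferral threshold.

def aggregate(predictions, confidences):
    """Combine per-inference outputs, weighted by model confidence."""
    weights = np.asarray(confidences) / np.sum(confidences)
    return np.average(np.asarray(predictions), axis=0, weights=weights)

def error_effort_curve(scores, errors_initial, errors_after_deferral, thresholds):
    """For each threshold, defer the examples whose score exceeds it and report
    (fraction deferred, mean error) so the trade-off can be plotted."""
    curve = []
    for tau in thresholds:
        defer = scores > tau
        error = np.where(defer, errors_after_deferral, errors_initial).mean()
        curve.append((defer.mean(), error))
    return curve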
Last, we consider how shifting from a dataset-based evaluation to interaction with an individual human affects deferred inference. Specifically, whereas crowdsourced datasets work well for rapid implementation and evaluation of deferral and aggregation functions, they do not accurately model human-computer interaction: the mechanisms used to procure high-quality data shift the input distribution, and the failure to track individual annotators makes the tacit assumptions that all humans are the same and that inputs do not change over time or deferral depth. Through a human-centered experiment, we find that these assumptions do not hold: an ideal deferral function must be calibrated to a specific user, users learn the model over time, and the deferral response is likely to be of lower quality than the initial query. Despite this mismatch with crowdsourced evaluation, we find that our proposed deferral and aggregation functions can still reduce error in practice.