Presented By: Michigan Robotics
Language-Driven Video Understanding
Robotics PhD Defense, Luowei Zhou
Abstract: Video understanding has advanced quite a long way in the past decade, accomplishing tasks including low-level segmentation and tracking that study objects as pixel-level segments or bounding boxes to more high-level activity recognition or classification tasks that classify a video scene to a categorical action label. Despite the progress that has been made, much of this work remains a proxy for an eventual task or application that requires a holistic view of the video, such as objects, actions, attributes, and other semantic components. In this defense, we argue that language could deliver the required holistic representation. It plays a significant role in video understanding by allowing machines to communicate with humans and to understand our requests, as shown in tasks such as text-to-video search engine, voice-guided robot manipulation, to name a few. Our language-driven video understanding focuses on two specific problems: video description and visual grounding. What marks our viewpoint different from prior literature is twofold. First, we propose a bottom-up structured learning scheme by decomposing long video into individual procedure steps and represent each one with a description. Second, we propose to have both explicit (i.e., supervised) and implicit (i.e., weakly-supervised and self-supervised) grounding between words and visual concepts which enables interpretable modeling of the two spaces.
Related Links
Explore Similar Events
-
Loading Similar Events...