Presented By: Michigan Robotics

Multimodal Fusion and Temporal Reasoning for Intelligent Robot Perception

PhD Defense, Jingyu Song

Split illustration contrasting two robotic perception scenarios: on the left, an underwater ROV with a camera surveys fish and seafloor objects in murky water; on the right, a self-driving car uses camera, LiDAR, and radar to detect pedestrians and vehicles on a rainy city street. A stylized brain with neural network nodes and intertwined loops sits at the center, linking the two domains.
Committee chair: Katie Skinner

Abstract:
Reliable autonomy for field robots depends on perception systems that can operate under difficult sensing conditions. In real-world environments, robot perception is often degraded by low-texture visual patterns, environmental disturbances, adverse weather, occlusions, and sensor failures. This dissertation develops multimodal fusion and temporal reasoning methods that improve the robustness, scalability, and accuracy of robot perception across challenging environments.

The first part of this thesis addresses state estimation and dense mapping for underwater robots, where wave disturbances and low-texture scenes often cause vision-based localization to fail. We introduce TURTLMap, a real-time localization and dense mapping framework for low-cost underwater robots. TURTLMap fuses Doppler velocity log (DVL), inertial, and pressure measurements for robust localization, while using stereo depth to construct dense 3D maps. Real-world experiments in a water tank environment, evaluated with underwater motion capture and a reference 3D structure, demonstrate accurate robot tracking and mapping under low-texture and wave-disturbed conditions.
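
To make the sensing setup concrete, the following is a minimal, illustrative sketch of loosely coupled underwater dead reckoning in the spirit of TURTLMap's DVL-inertial-pressure fusion. The function names, the per-step constant orientation, and the direct depth overwrite from pressure are simplifying assumptions for illustration, not the actual TURTLMap implementation.

    import numpy as np

    WATER_DENSITY = 1000.0   # kg/m^3 (fresh water, assumed)
    GRAVITY = 9.81           # m/s^2

    def depth_from_pressure(p_pa, p_surface_pa=101_325.0):
        """Convert absolute pressure (Pa) to depth below the surface (m)."""
        return (p_pa - p_surface_pa) / (WATER_DENSITY * GRAVITY)

    def propagate(position_w, R_wb, v_dvl_b, pressure_pa, dt):
        """One dead-reckoning step (illustrative only).

        position_w  : (3,) position in the world frame
        R_wb        : (3, 3) body-to-world rotation from the IMU/AHRS
        v_dvl_b     : (3,) body-frame velocity measured by the DVL
        pressure_pa : absolute pressure measurement in Pa
        dt          : time step in seconds
        """
        # Rotate the DVL velocity into the world frame and integrate.
        v_w = R_wb @ v_dvl_b
        position_w = position_w + v_w * dt
        # Pressure gives an absolute, drift-free depth; use it directly for z.
        position_w[2] = -depth_from_pressure(pressure_pa)
        return position_w

    # Example: one 0.1 s step with a small forward velocity at ~1 m depth.
    pos = propagate(np.zeros(3), np.eye(3), np.array([0.2, 0.0, 0.0]),
                    pressure_pa=111_135.0, dt=0.1)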

The second part studies adaptive multimodal fusion for autonomous vehicle perception. We introduce LiRaFusion, a LiDAR-radar fusion network that combines joint feature encoding with adaptive feature weighting to better exploit the complementary strengths of LiDAR and radar. Experiments on large-scale 3D object detection benchmarks show that this design improves detection performance over existing fusion methods. Building on this direction, we develop CRKD, a cross-modality knowledge distillation framework that transfers knowledge from a high-performing LiDAR-camera teacher to a scalable camera-radar student. CRKD achieves state-of-the-art camera-radar object detection performance and offers a practical pathway for using high-quality sensor data from test fleets to improve cost-effective sensing configurations on consumer vehicles.
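
As a concrete illustration of the adaptive fusion idea, below is a minimal PyTorch sketch of per-location gated fusion of LiDAR and radar BEV feature maps, together with a simple feature-imitation distillation loss in the spirit of CRKD. The channel sizes, the single-convolution gate, and the plain MSE distillation objective are illustrative assumptions, not the published LiRaFusion or CRKD architectures.

    import torch
    import torch.nn as nn

    class GatedBEVFusion(nn.Module):
        """Fuse LiDAR and radar BEV features with learned per-pixel gates."""
        def __init__(self, lidar_ch=128, radar_ch=64, out_ch=128):
            super().__init__()
            # Project both modalities to a common channel width.
            self.lidar_proj = nn.Conv2d(lidar_ch, out_ch, kernel_size=1)
            self.radar_proj = nn.Conv2d(radar_ch, out_ch, kernel_size=1)
            # Predict a per-pixel, per-modality gate from the joint features.
            self.gate = nn.Sequential(
                nn.Conv2d(2 * out_ch, 2, kernel_size=3, padding=1),
                nn.Sigmoid(),
            )

        def forward(self, lidar_bev, radar_bev):
            f_l = self.lidar_proj(lidar_bev)   # (B, C, H, W)
            f_r = self.radar_proj(radar_bev)   # (B, C, H, W)
            g = self.gate(torch.cat([f_l, f_r], dim=1))
            # Weight each modality by its learned gate and sum.
            return g[:, 0:1] * f_l + g[:, 1:2] * f_r

    def feature_distillation_loss(student_bev, teacher_bev):
        """Feature-imitation loss: pull the camera-radar student's BEV
        features toward the (frozen) LiDAR-camera teacher's features."""
        return nn.functional.mse_loss(student_bev, teacher_bev.detach())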

The third part explores temporal reasoning for road scene understanding. We introduce MemFusionMap, a memory-based framework for online vectorized HD map construction that improves temporal fusion by combining current bird’s-eye-view (BEV) features with multiple working-memory features. MemFusionMap further maintains a temporal overlap heatmap, which provides a spatiotemporal cue for how historical observations overlap with the current field of view and helps the model reason over memory more adaptively. Together, these designs improve map construction under challenging and complex road conditions, including occlusion and dynamic scene changes, while preserving efficient runtime and compatibility with multiple perception models.
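
The following is a minimal PyTorch sketch of temporal BEV fusion with a working memory and an overlap heatmap channel, in the spirit of MemFusionMap. The memory length, the way the heatmap is accumulated, and the single-convolution fusion are illustrative assumptions; the published model aligns and reasons over memory in a more sophisticated way.

    from collections import deque

    import torch
    import torch.nn as nn

    class TemporalMemoryFusion(nn.Module):
        def __init__(self, bev_ch=128, memory_len=4):
            super().__init__()
            self.memory_len = memory_len
            self.memory = deque(maxlen=memory_len)    # past fused BEV features
            self.overlap = deque(maxlen=memory_len)   # past field-of-view masks
            in_ch = bev_ch * (memory_len + 1) + 1     # +1 for the heatmap
            self.fuse = nn.Conv2d(in_ch, bev_ch, kernel_size=3, padding=1)

        def forward(self, bev, fov_mask):
            # bev: (B, C, H, W) current BEV features; fov_mask: (B, 1, H, W)
            # binary mask of the current field of view. Frames are assumed to
            # be ego-aligned already; a real system warps memory by ego motion.
            mems = list(self.memory)
            while len(mems) < self.memory_len:        # zero-pad at cold start
                mems.append(torch.zeros_like(bev))
            # Overlap heatmap: how many stored observations cover each cell.
            if self.overlap:
                heat = torch.stack(list(self.overlap), dim=0).sum(dim=0)
            else:
                heat = torch.zeros_like(fov_mask)
            fused = self.fuse(torch.cat([bev, *mems, heat], dim=1))
            self.memory.append(fused.detach())
            self.overlap.append(fov_mask.detach())
            return fused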

Finally, this thesis develops CRISP, a spatiotemporal camera-radar pretraining framework for autonomous driving. CRISP learns transferable BEV representations by forecasting future LiDAR point clouds from historical camera and radar observations, using LiDAR as privileged supervision only during pretraining. At deployment, the model operates using camera-radar inputs alone. Experiments on real-world benchmarks show that CRISP improves long-horizon point cloud forecasting and transfers effectively to downstream tasks including 3D object detection, tracking, online mapping, motion forecasting, future occupancy prediction, and planning.
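
To illustrate the pretraining objective at a high level, here is a minimal sketch in which camera-radar BEV features are decoded into future occupancy logits and supervised with voxelized future LiDAR sweeps, which serve only as privileged labels. The grid size, the single-convolution decoder, and the binary cross-entropy loss are illustrative assumptions, not the actual CRISP formulation.

    import torch
    import torch.nn as nn

    class OccupancyForecastHead(nn.Module):
        """Decode BEV features into one occupancy logit map per future step."""
        def __init__(self, bev_ch=128, horizon=3):
            super().__init__()
            self.decoder = nn.Conv2d(bev_ch, horizon, kernel_size=3, padding=1)

        def forward(self, bev):               # (B, C, H, W)
            return self.decoder(bev)          # (B, T, H, W) logits

    def voxelize_lidar(points, grid=(200, 200), extent=50.0):
        """Turn one LiDAR sweep (N, 3) into a binary BEV occupancy grid."""
        occ = torch.zeros(grid)
        xy = points[:, :2]
        keep = (xy.abs() < extent).all(dim=1)
        idx = ((xy[keep] + extent) / (2 * extent) * torch.tensor(grid)).long()
        idx = idx.clamp(0, grid[0] - 1)
        occ[idx[:, 0], idx[:, 1]] = 1.0
        return occ

    def pretraining_loss(logits, future_sweeps):
        """logits: (B, T, H, W) predicted occupancy logits.
        future_sweeps: length-T list; element t is a length-B list of (N, 3)
        LiDAR point clouds observed t steps into the future."""
        target = torch.stack([
            torch.stack([voxelize_lidar(p) for p in batch_t])   # (B, H, W)
            for batch_t in future_sweeps
        ], dim=1)                                                # (B, T, H, W)
        return nn.functional.binary_cross_entropy_with_logits(logits, target)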

Together, these contributions show how multimodal sensing, cross-modality knowledge transfer, temporal memory, and predictive pretraining can make robot perception more reliable under practical sensing constraints. The resulting methods improve localization, mapping, perception, prediction, and planning across challenging underwater and autonomous driving environments.
