Open-World Embodied AI
STaR: Scalable Task-Conditioned Retrieval for Long-Horizon Multimodal Robot Memory
Mingfeng Yuan, Hao Zhang, Mahan Mohammadi, Runhao Li, Jinjun Shan, Steven L. Waslander
IEEE Robotics and Automation Letters (RA-L), 2026.
Our framework begins with memory construction: the robot records RGB and posed depth data to build a multimodal memory composed of three complementary databases (a video caption DB, a 3D primitive DB, and a visual keyframe DB) that together form OmniMem. The memory then supports user queries and reasoning: given a text or multimodal query, an agentic planner (an MLLM) retrieves task-relevant memories through an Information Bottleneck, performs contextual reasoning, and outputs a structured answer (location, time, or description).
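To make the query flow concrete, here is a minimal Python sketch of retrieving from the three databases and filtering to task-relevant memories. All names here (MemoryEntry, retrieve, answer_query) and the top-k relevance filter standing in for the Information Bottleneck are illustrative assumptions, not the paper's actual interface.

    from dataclasses import dataclass

    @dataclass
    class MemoryEntry:
        source: str   # which database: "caption", "primitive", or "keyframe"
        content: str  # text caption, 3D primitive summary, or keyframe reference
        score: float  # relevance of this entry to the current query

    def retrieve(query: str, databases: dict[str, list[MemoryEntry]],
                 top_k: int = 5) -> list[MemoryEntry]:
        # Pool candidates from all three databases, then keep only the top-k
        # most relevant entries -- a simple stand-in for the Information
        # Bottleneck step that compresses memory to what the task needs.
        candidates = [entry for db in databases.values() for entry in db]
        candidates.sort(key=lambda entry: entry.score, reverse=True)
        return candidates[:top_k]

    def answer_query(query: str, databases: dict[str, list[MemoryEntry]]) -> dict:
        # In the paper, an agentic MLLM planner reasons over the retrieved
        # context; here a stub returns the structured-answer skeleton.
        context = retrieve(query, databases)
        return {"query": query,
                "evidence": [entry.content for entry in context],
                "answer": {"location": None, "time": None, "description": None}}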
OpenNav: Open-World Navigation with Multimodal Large Language Models
Mingfeng Yuan, Letian Wang, Steven L. Waslander
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2025.
Given a free-form language instruction and sensor observations, OpenNav generates a dense sequence of instruction-following, scene-compliant robot waypoints in a zero-shot manner for open-world navigation, handling open-set objects and open-set instructions without relying on in-context examples or pre-trained skills.
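The sketch below illustrates the zero-shot instruction-to-waypoints idea: prompt an MLLM with the instruction plus a scene description and parse back a dense 2D waypoint sequence. The function names, prompt format, and JSON output convention are assumptions for illustration, not OpenNav's actual interface, and the model call is replaced by a canned response.

    import json

    def build_prompt(instruction: str, scene_description: str) -> str:
        # Zero-shot: no in-context examples, only the task and the scene.
        return (f"Instruction: {instruction}\n"
                f"Scene: {scene_description}\n"
                "Output a JSON list of [x, y] waypoints (in meters) that "
                "follows the instruction while staying in traversable space.")

    def parse_waypoints(mllm_output: str) -> list[tuple[float, float]]:
        # Expect something like '[[0.0, 0.0], [0.5, 0.2], ...]' from the model.
        return [tuple(point) for point in json.loads(mllm_output)]

    # Example usage with a canned model response (no real MLLM call here):
    prompt = build_prompt("go past the red chair to the open door",
                          "red chair at (1.0, 0.5); open door at (3.0, 0.0)")
    fake_response = "[[0.0, 0.0], [1.0, 0.8], [2.0, 0.4], [3.0, 0.0]]"
    waypoints = parse_waypoints(fake_response)
    print(waypoints)  # [(0.0, 0.0), (1.0, 0.8), (2.0, 0.4), (3.0, 0.0)]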

