Uni-LaViRA: Language-Vision-Robot Actions Translation for Unified Embodied Navigation

Abstract (EN)

Embodied navigation requires an agent to map language and visual observations to a stream of spatial actions that drive a real robot through environments it has never seen. The dominant approach has been to scale vision-language-action (VLA) foundation models on ever-larger collections of robot trajectories. This paper argues that, for navigation specifically, generality can be obtained structurally, not only through data scale. The underlying decision structure of navigation reduces to a single Language-Vision-Robot Actions Translation: the language action emits a semantic-level directional command, and the vision action emits a pixel-level visual target. Both outputs lie inside the natural output manifold of pretrained multimodal large language models (MLLMs), so the task can be reasoned about by an agent rather than learned from robot data. We therefore present Uni-LaViRA, a unified agentic architecture that extends this insight, in a zero-shot manner, to four task families (VLN-CE, ObjectNav, EQA, and Aerial-VLN) and to four heterogeneous real robots (a wheeled robot, a quadruped, a humanoid, and a self-built UAV). Two agent-loop mechanisms make this unification practical. TODO List Memory (TDM) rewrites a structured checklist of pending sub-goals at every step, reciting the unfinished items back into the agent's most recent attention window. Second Chance Backtrack (SCB) rolls the robot back to the pre-error state and conditions the agent's next plan on the failed sub-trajectory, turning single-pass navigation into a self-correcting process. With zero training effort, Uni-LaViRA reaches a 60.7% success rate (SR) on VLN-CE R2R, 51.3% on VLN-CE RxR, 77.7% on HM3D-v2, 60.0% on HM3D-OVON, 54.7% on MP3D-EQA, and 40.0% on OpenUAV, matching or even surpassing recent training-based navigation foundation models that consume millions of samples and thousands of GPU-hours.
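
To make the two agent-loop mechanisms concrete, the following is a minimal Python sketch of how TDM and SCB could be organized. It is an illustration under stated assumptions, not the authors' implementation: the class names `TodoListMemory` and `BacktrackState`, the function `build_prompt`, the 2D pose tuples, and the prompt layout are all hypothetical, and the abstract does not specify how errors are detected or how the robot is physically rolled back.

```python
from __future__ import annotations

from dataclasses import dataclass, field

Pose = tuple  # hypothetical (x, y) pose; the real state representation is not given


@dataclass
class TodoListMemory:
    """Hypothetical TDM: a checklist of sub-goals re-rendered at every step."""
    pending: list[str]
    done: list[str] = field(default_factory=list)

    def complete(self, item: str) -> None:
        # Move a finished sub-goal from the pending list to the done list.
        self.pending.remove(item)
        self.done.append(item)

    def recite(self) -> str:
        # Render the checklist as text; unfinished items come last.
        lines = [f"[x] {g}" for g in self.done] + [f"[ ] {g}" for g in self.pending]
        return "TODO:\n" + "\n".join(lines)


@dataclass
class BacktrackState:
    """Hypothetical SCB bookkeeping: a checkpoint plus failed sub-trajectories."""
    checkpoint: Pose
    trajectory: list[Pose] = field(default_factory=list)
    failed_attempts: list[list[Pose]] = field(default_factory=list)

    def record(self, pose: Pose) -> None:
        # Log each pose visited since the last checkpoint.
        self.trajectory.append(pose)

    def backtrack(self) -> Pose:
        # Keep the failed sub-trajectory and return the pose to roll back to.
        self.failed_attempts.append(self.trajectory)
        self.trajectory = []
        return self.checkpoint


def build_prompt(instruction: str, memory: TodoListMemory, scb: BacktrackState) -> str:
    # Failed attempts condition the next plan; the checklist is appended last
    # so the unfinished items sit in the most recent part of the context.
    parts = [f"Instruction: {instruction}"]
    for i, traj in enumerate(scb.failed_attempts, start=1):
        parts.append(f"Failed attempt {i}: {traj}")
    parts.append(memory.recite())
    return "\n".join(parts)


# Usage with made-up sub-goals and poses.
memory = TodoListMemory(pending=["exit the bedroom", "turn left", "stop at the sofa"])
memory.complete("exit the bedroom")
scb = BacktrackState(checkpoint=(0.0, 0.0))
scb.record((1.0, 0.0))
scb.record((1.0, 1.0))
rollback_pose = scb.backtrack()
prompt = build_prompt("Leave the bedroom and stop at the sofa.", memory, scb)
```

In this sketch the checklist is placed at the end of the prompt on purpose, mirroring the abstract's description of reciting unfinished items into the most recent attention window; the failed sub-trajectory is kept as plain text so that the next planning call can take it into account.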

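Abstract (translated from ZH)

Embodied navigation requires an agent to map language and visual observations to a sequence of spatial actions that drive a real robot through environments it has never seen. The mainstream approach is to scale vision-language-action foundation models on increasingly large robot trajectory datasets. This paper argues that, for the specific task of navigation, generality can be achieved not only through data scale but also structurally. The underlying decision structure of navigation can be reduced to a single Language-Vision-Robot Actions Translation: the language action emits a semantic-level directional command, and the vision action emits a pixel-level visual target. Both outputs lie within the natural output manifold of pretrained multimodal large language models, so the task can be reasoned about by an agent rather than learned from robot data. To this end, we propose Uni-LaViRA, a unified agentic architecture that extends the same insight, in a zero-shot manner, to four task families (VLN-CE, ObjectNav, EQA, and Aerial-VLN) and to four heterogeneous real robots (wheeled, quadruped, humanoid, and a self-built UAV). Two agent-loop mechanisms make this unification practical: TODO List Memory (TDM) rewrites a structured checklist of pending sub-goals at every step and reads the unfinished items back into the agent's most recent attention window; Second Chance Backtrack (SCB) rolls the robot back to the pre-error state and conditions the agent's next plan on the failed sub-trajectory, turning single-pass navigation into a self-correcting process. With zero training effort, Uni-LaViRA achieves a success rate of 60.7% on VLN-CE R2R, 51.3% on VLN-CE RxR, 77.7% on HM3D-v2, 60.0% on HM3D-OVON, 54.7% on MP3D-EQA, and 40.0% on OpenUAV, matching or even surpassing recent training-based navigation foundation models that consume millions of samples and thousands of GPU-hours.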
