E-TTS: A New Embodied Test-Time Scaling Framework for Robotic Manipulation

Abstract (EN)

A few recent works have made early attempts to study test-time scaling (TTS) for embodied tasks, but two major challenges remain unsolved: (1) reasoning can effectively improve policy performance, yet how to scale it at test time has seldom been studied; (2) embodied tasks are inherently long-horizon and sequential, so scaling actions from the current observation alone is inadequate because it ignores historical context. To address these challenges, we introduce E-TTS, a modular, plug-and-play Embodied Test-Time Scaling framework that unifies reasoning and action scaling for robotic manipulation via history-aware iterative refinement with vision-language verifiers. To support joint reasoning-action scaling, E-TTS samples and scores reasoning and actions jointly, in a pairwise manner. To make better use of historical information, E-TTS stores historical context in a history buffer, which the reasoning and action verifiers then use to evaluate the sampled candidates. Unlike conventional open-loop TTS methods, E-TTS introduces feedback generation into the sampling process to form a closed-loop iterative refinement mechanism, improving both inference efficiency and environmental adaptability. Each component is an independent, composable module, so the framework can be configured flexibly according to task requirements. To evaluate the framework, we conduct experiments across 4 benchmarks, 6 environments, 3 embodiments, and 4 base vision-language-action models. The results show that, without additional expert data collection or retraining, E-TTS consistently improves performance, achieving up to a 33.14% increase in simulation and 26.62% in real-world scenarios.
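
The closed-loop refinement described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract gives no algorithmic details, so every name below (`Candidate`, `HistoryBuffer`, `refine_step`, and the toy sampler, verifiers, and feedback function) is hypothetical. The sketch also assumes that "pairwise" means each sampled (reasoning, action) pair is scored together, that the two verifier scores are simply summed, and that feedback is a hint passed back to the sampler. A 1-D number stands in for a robot action, and simple functions stand in for the vision-language-action model and the vision-language verifiers.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Candidate:
    reasoning: str  # sampled reasoning text
    action: float   # toy 1-D action standing in for a robot command


@dataclass
class HistoryBuffer:
    """Hypothetical bounded store of previously selected candidates."""
    max_len: int = 8
    items: List[Candidate] = field(default_factory=list)

    def add(self, cand: Candidate) -> None:
        self.items.append(cand)
        if len(self.items) > self.max_len:
            self.items.pop(0)  # drop the oldest entry


Sampler = Callable[[float, HistoryBuffer, Optional[float]], Candidate]
Verifier = Callable[[Candidate, HistoryBuffer], float]
Feedback = Callable[[Candidate, float], float]


def refine_step(sample: Sampler, score_reasoning: Verifier,
                score_action: Verifier, make_feedback: Feedback,
                history: HistoryBuffer, observation: float,
                n_samples: int = 4, n_rounds: int = 3,
                accept: float = 0.0) -> Tuple[Candidate, float]:
    """One decision step: sample pairs, score them, refine with feedback."""
    feedback: Optional[float] = None
    best: Optional[Candidate] = None
    best_score = float("-inf")
    for _ in range(n_rounds):
        for _ in range(n_samples):
            cand = sample(observation, history, feedback)
            # Assumption: the joint score is the sum of both verifier scores.
            score = score_reasoning(cand, history) + score_action(cand, history)
            if score > best_score:
                best, best_score = cand, score
        if best_score >= accept:
            break  # good enough, stop refining early
        # Closed loop: feedback on the best candidate guides the next round.
        feedback = make_feedback(best, best_score)
    history.add(best)
    return best, best_score


# Toy stand-ins for the base policy and the verifiers (all hypothetical).
TARGET = 1.0


def make_toy_sampler(seed: int) -> Sampler:
    rng = random.Random(seed)

    def sample(obs: float, history: HistoryBuffer,
               feedback: Optional[float]) -> Candidate:
        center = obs if feedback is None else feedback
        action = center + rng.uniform(-0.5, 0.5)
        return Candidate(reasoning=f"move near {center:.2f}", action=action)

    return sample


def toy_reasoning_score(cand: Candidate, history: HistoryBuffer) -> float:
    return 0.0  # a real verifier would judge the reasoning against history


def toy_action_score(cand: Candidate, history: HistoryBuffer) -> float:
    return -abs(cand.action - TARGET)  # closer to the target is better


def toy_feedback(best: Candidate, score: float) -> float:
    return best.action  # hint: sample around the current best action
```

Calling `refine_step(make_toy_sampler(0), toy_reasoning_score, toy_action_score, toy_feedback, HistoryBuffer(), 0.0)` runs one decision step. Because the best candidate is carried across rounds, extra refinement rounds can only keep or improve the selected score in this toy setting; whether the same holds for the real framework depends on details the abstract does not give.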

