Not All Actions Are Equal: Rethinking Conditioning for Dexterous World Model

Abstract (EN)

Recent advances in action-conditioned world models show promising progress in modeling complex interactions and forecasting future states under diverse action sequences. While progress in these models is often driven by stronger visual representations and larger model capacity, action conditioning itself remains underexplored. Most existing approaches compress the entire action sequence into a single representation, which works well for low-DoF control but becomes less reliable in high-DoF scenarios. We observe that high-DoF dexterous actions are inherently heterogeneous: their components span multiple orders of magnitude, so large-scale motions coexist with subtle but important signals. When these components are aggregated uniformly, optimization becomes imbalanced across action components, which hinders the modeling of fine-grained effects and reduces action fidelity. We therefore propose DexAC-WM, which treats action conditioning as a structured process rather than global compression. DexAC preserves dimension-level semantics via action tokenization and aligns action signals with visual dynamics through local refinement and global modulation. To address the limited high-level semantic grounding in existing world models, we further introduce a semantic branch that provides rich object-scene priors, which enables the world model to capture dynamic visual details while supporting high-DoF action-conditioned video prediction. Experiments on EgoDex and EgoVerse show that combining the semantic branch with DexAC significantly improves FID, FVD, and PCK, demonstrating gains in visual-temporal realism and action-following consistency. We further verify that DexAC extends to other backbones, showing the scalability of our structured action-conditioning design. These results suggest that scaling world models to high-DoF control requires both structured action modeling and semantic grounding.
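
The imbalance argument can be illustrated with a toy example. The sketch below is not the DexAC implementation; the dimension names, scales, and helper functions are hypothetical, chosen only to show why a single pooled summary can hide a small but meaningful change that per-dimension tokens keep visible.

```python
# Toy illustration only: a hypothetical 4-D action with mixed scales.
# Dimension names and scales are assumptions, not taken from the paper.
DIM_NAMES = ["wrist_x", "wrist_y", "thumb_flex", "index_flex"]
DIM_SCALES = [100.0, 100.0, 0.1, 0.1]  # assumed typical magnitude per dimension


def pooled_summary(action):
    """Stand-in for global compression: one scalar for the whole action."""
    return sum(action) / len(action)


def tokenize(action, scales=DIM_SCALES):
    """One token per dimension: (dimension index, value rescaled by its own scale)."""
    return [(i, a / s) for i, (a, s) in enumerate(zip(action, scales))]


def relative_change(before, after):
    """Absolute change relative to the starting value."""
    return abs(after - before) / abs(before)


base = [50.0, -20.0, 0.05, 0.02]
# Same gross wrist motion, but the thumb flexion doubles (a subtle signal).
nudged = [50.0, -20.0, 0.10, 0.02]

pooled_delta = relative_change(pooled_summary(base), pooled_summary(nudged))
thumb_index = DIM_NAMES.index("thumb_flex")
token_delta = relative_change(
    tokenize(base)[thumb_index][1], tokenize(nudged)[thumb_index][1]
)

print(f"pooled summary relative change: {pooled_delta:.4%}")
print(f"thumb token relative change:    {token_delta:.0%}")
```

In this toy setting the pooled scalar moves by well under one percent while the thumb token doubles, which mirrors the abstract's point that uniform aggregation lets large-scale components dominate. A learned tokenizer would replace the fixed rescaling used here.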

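Abstract (ZH, translated)

In recent years, action-conditioned world models have made notable progress in modeling complex interactions and predicting future states under diverse action sequences. Although these models typically rely on stronger visual representations and model capacity, the action-conditioning mechanism itself remains underexplored. Most existing methods compress the full action sequence into a single representation, which works well in low-DoF control but becomes markedly less reliable in high-DoF scenarios. We find that high-DoF dexterous actions are inherently heterogeneous, with magnitudes spanning several orders of magnitude: large-scale motions coexist with weak but critical signals. Under uniform aggregation, optimization becomes imbalanced across action components, which hinders the modeling of fine-grained effects and degrades action fidelity. We therefore propose DexAC-WM, which treats action conditioning as a structured process rather than global compression. DexAC preserves dimension-level semantics through action tokenization and aligns action signals with visual dynamics through local refinement and global modulation. To address the lack of high-level semantic grounding in existing world models, we further introduce a semantic branch that provides rich object-scene priors, enabling the world model to capture dynamic visual details while supporting high-DoF action-conditioned video prediction. Experiments on EgoDex and EgoVerse show that combining the semantic branch with DexAC significantly improves FID, FVD, and PCK, with gains in both visual-temporal realism and action-following consistency. We further verify that DexAC extends to other backbones, demonstrating the scalability of the structured action-conditioning design. These results indicate that scaling world models to high-DoF control requires both structured action modeling and semantic grounding.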
