A Generalization Theory for JEPA-Based World Models

Abstract (EN)

Joint Embedding Predictive Architectures (JEPAs) have recently emerged as a promising paradigm for world modeling: they learn predictive dynamics in a latent space rather than generating future observations at the input level. Despite this empirical success, JEPA-based world models remain poorly understood theoretically. In this paper, we develop the first generalization theory for JEPA-based world models. We formulate JEPA pretraining as a conditional spectral graph learning problem and show that the JEPA objective is equivalent to a low-rank factorization of an action-conditioned co-occurrence matrix. Building on this characterization, we connect JEPA pretraining error to downstream planning regret, which yields a finite-sample generalization bound for JEPA-based world models. Our analysis reveals an inherent trade-off between approximation error and sample error as the latent dimension varies, and it offers theoretical insight into the advantages and limitations of latent predictive models relative to input-level predictive approaches.
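
As a rough illustration of the low-rank factorization view, the sketch below builds a toy action-conditioned co-occurrence matrix and measures how well its best rank-k approximation fits as the latent dimension k grows. Everything here is an assumption made for illustration: the abstract does not specify how the matrix is constructed or normalized, so the row layout (one row per state-action pair, one column per next state), the random counts, and the helper name `truncated_svd_error` are hypothetical. The sketch covers only the approximation side of the trade-off; it does not model sample error.

```python
import numpy as np


def truncated_svd_error(matrix, rank):
    """Frobenius error of the best rank-`rank` approximation of `matrix`."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # Keep the top `rank` singular triplets (Eckart-Young optimal truncation).
    approx = (u[:, :rank] * s[:rank]) @ vt[:rank, :]
    return float(np.linalg.norm(matrix - approx))


rng = np.random.default_rng(0)
n_states, n_actions = 6, 2

# Hypothetical toy data: rows index (state, action) pairs, columns index
# next states. Real co-occurrence statistics would come from trajectories.
counts = rng.integers(1, 10, size=(n_states * n_actions, n_states)).astype(float)
cooccurrence = counts / counts.sum()  # normalize counts to a joint distribution

# Approximation error for each latent dimension k = 1, ..., n_states.
errors = [truncated_svd_error(cooccurrence, k) for k in range(1, n_states + 1)]
for k, err in enumerate(errors, start=1):
    print(f"latent dimension {k}: approximation error {err:.4f}")
```

The approximation error can only shrink as k increases and vanishes once k reaches the rank of the matrix. A larger k also means more parameters to estimate from finite data, which is the intuition behind the sample-error term that grows with the latent dimension.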
