Does intelligence require the ability to reason about phenomena beyond direct experience? It is natural to suspect that some complex thought cannot be captured through language alone. Of particular concern to this work is whether visualising counterfactual events can complement language as a mechanism for complex thought. We ask whether LLMs can be trained to utilise such visualisation mechanisms in a way that benefits their reasoning abilities. Motivated by this question, we propose Einstein World Models (EWMs). EWMs are a blueprint for LLM-based reasoning systems that place visual-temporal rollouts inside the reasoning trace, allowing them to reason in ways that text alone may not support well. In an EWM, the LLM calls a world-module (not to be confused with a world model) to produce short rollouts of the scenes under consideration. The returned rollout is treated not as the answer, but as an inspectable hypothesis that can support later reasoning. EWMs extend the capability of LLMs for tool calling (such as web search or code execution) into the domain of visual thought experiments.