Reasoning over long sequences of observations and actions is essential for many robotic tasks. Yet, learning effective long-context policies from demonstrations remains challenging. As context length increases, training becomes prohibitively expensive because memory demands grow rapidly, and policy performance often degrades due to spurious correlations. Recent methods typically sidestep these issues by truncating the context, discarding information that may be critical for subsequent decisions. In this paper, we propose an alternative approach that explicitly regularizes the retention of information from past observations. At the core of our method is Past-Token Prediction (PTP), an auxiliary task in which the policy learns to predict past action tokens alongside future ones. This simple regularizer significantly strengthens temporal action dependencies, which are commonly lost in recent policies. In particular, we find that the benefit of PTP primarily emerges in the policy head rather than the visual encoder. Building on this observation, we introduce a multi-stage training strategy: first pre-train the visual encoder with short contexts, then fine-tune the policy head using cached long-context embeddings. This approach preserves the benefits of PTP while greatly reducing memory and computational overhead. Beyond training, we further leverage PTP as a self-verification mechanism, enabling the policy to search at test time for action predictions that are consistent with past actions. Experiments across seven simulated and four real-world tasks demonstrate that our method improves the performance of long-context policies by 3× and accelerates policy training by more than 10×.
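To make the PTP objective concrete, below is a minimal sketch of how such an auxiliary loss could look. The `policy_head` interface, tensor shapes, choice of MSE, and the `lambda_past` weight are illustrative assumptions, not the exact implementation.

```python
# Hedged sketch of a Past-Token Prediction (PTP) style loss.
# Assumed interface: the policy head maps a window of observation
# embeddings to predicted past and future action tokens.
import torch.nn.functional as F

def ptp_loss(policy_head, obs_embeddings, past_actions, future_actions,
             lambda_past=1.0):
    """obs_embeddings: (B, T, D)        context of visual features
    past_actions:   (B, T_past, A)    actions already executed in the context
    future_actions: (B, T_future, A)  expert actions to imitate
    """
    pred_past, pred_future = policy_head(obs_embeddings)
    loss_future = F.mse_loss(pred_future, future_actions)  # standard imitation term
    loss_past = F.mse_loss(pred_past, past_actions)        # PTP auxiliary term
    return loss_future + lambda_past * loss_past
```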
One major challenge in long-context imitation learning is causal confusion, where policies latch onto spurious correlations in the input context that do not truly influence expert behavior. This issue worsens with longer contexts, leading to overfitting during training and poor generalization at deployment. A classic example is copycat behavior, where the model simply mimics past actions without understanding their relationship to observations. However, our findings reveal a different trend: modern policies tend to under-utilize, rather than over-rely on, temporal action dependencies.
Building on our analysis, we introduce a simple yet effective method for long-context imitation learning. At its core is Past-Token Prediction (PTP), an auxiliary objective that tasks the policy with predicting both past and future actions, encouraging the model to better capture action dependencies over time. To scale PTP efficiently, we propose a multi-stage training recipe: freeze a short-horizon encoder, cache its visual features, and train the policy head on these cached embeddings. At test time, we extend PTP into a self-verification mechanism: the policy samples multiple candidate action sequences and selects the one that best reconstructs the actions already executed.
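The sketch below illustrates, at a high level, the second training stage on cached embeddings and the test-time self-verification step. It builds on the `ptp_loss` sketch above; `cached_loader`, the stochastic `policy_head`, and `num_candidates` are assumptions made for illustration rather than the paper's exact API.

```python
# Hedged sketch: stage-2 training on cached long-context embeddings,
# followed by test-time self-verification via past-token reconstruction.
import torch
import torch.nn.functional as F

def train_policy_head(policy_head, cached_loader, optimizer, epochs=10):
    """Stage 2: the visual encoder is frozen, so each batch already contains
    cached observation embeddings over the long context window."""
    for _ in range(epochs):
        for obs_emb, past_act, future_act in cached_loader:
            loss = ptp_loss(policy_head, obs_emb, past_act, future_act)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

@torch.no_grad()
def self_verify(policy_head, obs_embeddings, executed_actions, num_candidates=8):
    """Sample several candidates from a stochastic policy head and keep the
    future actions whose predicted past tokens best match the actions that
    were actually executed."""
    best_future, best_err = None, float("inf")
    for _ in range(num_candidates):
        pred_past, pred_future = policy_head(obs_embeddings)
        err = F.mse_loss(pred_past, executed_actions).item()
        if err < best_err:
            best_future, best_err = pred_future, err
    return best_future
```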
We evaluate our method on seven simulation tasks, including five from the RoboMimic benchmark and two newly designed long-horizon tasks that require historical context. Our approach significantly outperforms short-context and long-context baselines, especially in challenging manipulation scenarios.
We evaluate our method on four real-world tasks, each of which requires historical context for task completion.
Our method consistently outperforms both short-context and long-context baselines across all four tasks.
@article{ptp2025,
  title={Learning Long-Context Robot Policies via Past-Token Prediction},
  year={2025},
}