WMPO: World Model-based Policy Optimization for Vision-Language-Action Models

Fangqi Zhu1, 2, Zhengyang Yan1, Zicong Hong1, Quanxin Shou1, Xiao Ma2*, Song Guo1*
1Hong Kong University of Science and Technology, 2ByteDance Seed
*Corresponding Authors
Introduction

Figure 1. Three different VLA training paradigms: (a) Imitation learning learns from human demonstrations but cannot learn from failures or self-correct; (b) Real-world RL improves the policy through direct interaction but suffers from high sampling costs, making on-policy RL difficult; (c) WMPO pretrains a world model on large-scale robotic trajectories and fine-tunes it with limited policy behavior data, enabling sample-efficient on-policy RL for VLA without real-world interaction.

Abstract

Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation, but their reliance on expert demonstrations limits their ability to learn from failures and perform self-corrections. Reinforcement learning (RL) addresses these limitations through self-improving interaction with the physical environment, but suffers from high sample complexity on real robots. We introduce World-Model-based Policy Optimization (WMPO), a principled framework for on-policy VLA RL without interacting with the real environment. In contrast to widely used latent world models, WMPO focuses on pixel-based predictions that align the "imagined" trajectories with the VLA features pretrained on web-scale images. Crucially, WMPO enables the policy to perform on-policy GRPO, which provides stronger performance than the often-used off-policy methods. Extensive experiments in both simulation and real-robot settings demonstrate that WMPO (i) substantially improves sample efficiency, (ii) achieves stronger overall performance, (iii) exhibits emergent behaviors such as self-correction, and (iv) demonstrates robust generalization and lifelong learning capabilities.

Methodology

WMPO starts from an initial state \(s_0\). The overall training procedure consists of three components: (1) Imagined Trajectory Generation, where the policy model \(\pi_{\theta_{\text{old}}}\) and the world model \(p_\phi\) interact alternately to generate a full imagined trajectory; (2) Trajectory Sampling, where multiple trajectories are sampled and scored by the reward model \(R_\psi\); and (3) Policy Update, where the policy parameters \(\theta\) are optimized via Equation 4. This process repeats iteratively throughout training.
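
Below is a minimal Python sketch of this loop. The clipped, group-relative surrogate stands in for the update referenced as Equation 4; the policy, world_model, and reward_model interfaces (sample, predict, score, log_prob) are illustrative assumptions rather than the authors' implementation.

import torch

def rollout_in_world_model(policy, world_model, s0, horizon):
    """Roll out one imagined trajectory by alternating the policy and the world model."""
    states, actions, logps = [s0], [], []
    s = s0
    for _ in range(horizon):
        a, logp = policy.sample(s)       # assumed: action (chunk) and its log-probability
        s = world_model.predict(s, a)    # assumed: pixel-space prediction of the next observation
        states.append(s)
        actions.append(a)
        logps.append(logp)
    return states, actions, torch.stack(logps)

def wmpo_update(policy, policy_old, world_model, reward_model,
                init_states, group_size=8, horizon=64, clip_eps=0.2):
    """One WMPO iteration: imagine, score, and update with a clipped surrogate loss."""
    losses = []
    for s0 in init_states:
        # (1) + (2): sample a group of imagined trajectories from the same initial state
        group = [rollout_in_world_model(policy_old, world_model, s0, horizon)
                 for _ in range(group_size)]
        rewards = torch.tensor([reward_model.score(st, ac) for st, ac, _ in group])
        advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)  # group-relative
        # (3): clipped policy-gradient step against the behavior policy pi_theta_old
        for (states, actions, old_logps), adv in zip(group, advantages):
            new_logps = policy.log_prob(states[:-1], actions)  # assumed per-step log-probs
            ratio = torch.exp(new_logps - old_logps.detach())
            surrogate = torch.min(ratio * adv,
                                  torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)
            losses.append(-surrogate.mean())
    return torch.stack(losses).mean()  # minimize this with respect to the policy parameters theta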


Figure 2. Illustration of Our Proposed World Model-based Policy Optimization (WMPO)

Results and Analysis

Comparison with GRPO and DPO

We compare WMPO with two established policy optimization algorithms, GRPO and DPO, both widely used for optimizing large language models. To ensure fairness, all methods are allocated the same real rollout budget \(P\) (i.e., the number of full real trajectories available for policy optimization). We consider both online and offline baselines: GRPO is implemented in an online setting, where the policy is updated directly from trajectories collected in the environment; DPO is implemented in an offline setting, where the base policy serves as the reference and trajectory pairs (success vs. failure) are constructed for optimization with the standard DPO loss. Unlike GRPO, which discards trajectories after each update, DPO can repeatedly reuse collected data, but it cannot update the policy online as WMPO does. Performance is reported as the task success rate (%).
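
For concreteness, the sketch below spells out the two baseline objectives as described here: GRPO's group-relative advantage normalization and the standard DPO loss on success/failure trajectory pairs with the frozen base policy as the reference. Function names and signatures are illustrative, not taken from the paper's code.

import torch
import torch.nn.functional as F

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: normalize rewards within a group of rollouts
    collected from the same initial state (online GRPO baseline)."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def dpo_loss(logp_success, logp_failure, ref_logp_success, ref_logp_failure, beta=0.1):
    """Standard DPO loss on a (success, failure) trajectory pair, regularized
    toward the frozen base policy used as the reference (offline DPO baseline)."""
    margin = beta * ((logp_success - ref_logp_success)
                     - (logp_failure - ref_logp_failure))
    return -F.logsigmoid(margin).mean()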

Table 1. Comparison of policy optimization methods across four manipulation tasks in the MimicGen simulation benchmark. Results show that WMPO consistently outperforms both GRPO and DPO baselines under different rollout budgets. As the rollout budget increases from \(128\) to \(1280\), WMPO continues to exhibit substantial improvements, highlighting both its data efficiency and scalability.

Emergent Behavior of WMPO

To better understand the source of WMPO's strong performance, we conduct a visual comparison of its test-time behavior against the base policy. We identify two emergent behaviors unique to our method: (1) the WMPO policy learns to self-correct, recovering from near-failure states; and (2) the WMPO policy executes tasks more efficiently, as it rarely becomes "stuck" in suboptimal states.

Figure 3. Behavior analysis of the Square task (inserting the square into the stick) shows that, compared with the base policy, WMPO demonstrates the ability to self-correct.

Figure 5. Relative average trajectory length of successful trials across different policies (Base Policy = \(100\%\)).

Generalization to Novel Tasks

We evaluate the generalization ability of WMPO across three novel disruption scenarios (Figure 4), which systematically probe robustness to spatial, background, and texture shifts. As shown in Table 2, WMPO consistently achieves the best performance across all disruption types. DPO attains modest improvements over the base policy in the in-distribution setting, but its performance degrades significantly under background and texture changes, suggesting reliance on spurious visual cues rather than transferable manipulation skills. GRPO performs similarly to the base policy, and both fall short of WMPO across all disruption scenarios. In contrast, WMPO, trained entirely in the world model, captures more generalizable strategies and maintains reliable performance across spatial, background, and texture variations.

Figure 4. (a) For the Square task, we vary the stick’s position from fixed to a random position inside a rectangle. (b) For the StackThree task, we substitute the tabletop background with a gray background. (c) For the ThreePieceAssembly task, we substitute the red base with a dark wooden base.

Table 2. We evaluate each policy in its corresponding disruption scenario and report the success rate (%).

Lifelong Learning

We demonstrate that WMPO can continuously improve the performance of a VLA policy by iteratively collecting real trajectories from the environment. Specifically, we collect \(P=128\) real trajectories, run WMPO to optimize the policy, and then use the updated policy to collect another \(P\) real trajectories, repeating this cycle. We apply the same setting to the DPO baseline. As an imitation-learning reference trained with more expert demonstrations, we also train the base policy on \(300\), \(428\), and \(556\) expert trajectories. Note that the base policy requires human-collected trajectories, whereas WMPO relies only on trajectories collected by the policy itself, making it more scalable. The results on the StackThree task, shown in Figure 6, demonstrate that WMPO achieves stable and substantial improvements over both baselines, whereas DPO fails to improve iteratively due to unstable training.
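
A compact sketch of this collect-and-optimize cycle is given below; the collect, finetune_wm, and run_wmpo callables are hypothetical placeholders for the corresponding stages and are assumed, not taken from the paper.

from typing import Callable

def lifelong_learning(policy, world_model, collect: Callable, finetune_wm: Callable,
                      run_wmpo: Callable, rounds: int = 3, P: int = 128):
    """Alternate real-world data collection with in-imagination policy optimization."""
    for _ in range(rounds):
        real_trajs = collect(policy, num=P)                 # P on-policy real rollouts (no expert data)
        world_model = finetune_wm(world_model, real_trajs)  # adapt the world model to current behavior
        policy = run_wmpo(policy, world_model)              # optimize the policy inside the world model
    return policy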

Figure 6. Lifelong learning results of WMPO and baselines.

Real-world Experiments

We validate WMPO on a challenging real-world manipulation task, "Insert the square into the stick", where the clearance between the square and the stick is only 5 mm. Using the Cobot Mobile ALOHA platform, we collect \(200\) high-quality expert demonstrations to fine-tune the OpenVLA-OFT model as the base policy. We then deploy this policy to collect an additional \(128\) trajectories, which are used to further fine-tune the world model and optimize the policy within it. For comparison, we also train an offline DPO policy using the same dataset. All models are evaluated under identical experimental conditions, and we report the average success rate over \(30\) trials. The base policy, DPO, and WMPO achieve success rates of \(53\%\), \(60\%\), and \(70\%\), respectively, demonstrating the effectiveness of WMPO on real robots.


BibTeX

@article{WMPO2025,
  title={WMPO: World Model-based Policy Optimization for Vision-Language-Action Models},
  author={Zhu, Fangqi and Yan, Zhengyang and Hong, Zicong and Shou, Quanxin and Ma, Xiao and Guo, Song},
  journal={arXiv preprint},
  year={2025},
  url={https://WM-PO.github.io}
}