Correctness Before Corrections: The vLLM V0 to V1 Migration
Upgrading PipelineRL's inference engine from vLLM V0 to V1 required fixing backend behavior before training dynamics matched the V0 reference.

The recent migration of PipelineRL's inference engine from vLLM V0 to V1 has highlighted the importance of correctness in reinforcement learning (RL) systems. As the engine responsible for rollout generation, vLLM provides the token logprobs the trainer uses to compute policy ratios, KL, clip rate, entropy, and reward. Any discrepancy in logprob computation alters training dynamics and introduces a train-inference mismatch that has to be eliminated before the curves can be trusted.
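To make that dependency concrete, here is a minimal sketch, not PipelineRL's actual trainer code, of how rollout logprobs feed the PPO-style quantities listed above; the tensor names and the clip threshold are illustrative assumptions.

```python
import torch

def ppo_diagnostics(trainer_logprobs, rollout_logprobs, advantages, clip_eps=0.2):
    """Illustrative PPO-style quantities computed from per-token logprobs.

    trainer_logprobs: logprobs of the sampled tokens under the current trainer policy
    rollout_logprobs: logprobs of the same tokens reported by the inference engine
    advantages:       per-token advantage estimates
    """
    # Policy ratio: exp(log pi_trainer - log pi_rollout).
    ratio = torch.exp(trainer_logprobs - rollout_logprobs)

    # Clip rate: fraction of tokens whose ratio falls outside the PPO clip range.
    clip_rate = ((ratio < 1 - clip_eps) | (ratio > 1 + clip_eps)).float().mean()

    # Approximate per-token KL between rollout and trainer policies (k1 estimator).
    approx_kl = (rollout_logprobs - trainer_logprobs).mean()

    # Clipped surrogate loss, shown only to place the ratio in context.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    loss = -torch.min(unclipped, clipped).mean()

    return {"ratio": ratio.mean(), "clip_rate": clip_rate, "approx_kl": approx_kl, "loss": loss}
```

If the engine-side logprobs are computed from a different distribution or numerical path than the trainer assumes, the ratio, clip rate, and KL all shift even when the policy itself has not changed.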
The vLLM V1 upgrade was a substantial rewrite of the V0 engine, and the migration target was deliberately narrow. The reference run used vLLM 0.8.5, while the V1 runs used vLLM 0.18.1. Initially, the V1 run showed a clear problem, with trainer-side logprobs and reward diverging from the V0 reference early in training.
The same pattern appeared in trainer metrics, with clip rate being the easiest signal to read. To diagnose the issue, possible causes were separated into three layers: what logprobs the engine reports, how the inference path is configured and computed, and the RL objective itself. Initially, the suspicion fell on the third layer, but useful diagnosis came from treating the first two as backend behavior problems and ruling them out first.
The first issue was semantic: vLLM V1 returns logprobs from raw model outputs by default, before logits post-processing, whereas PipelineRL expected logprobs from the processed distribution the sampler actually draws from. Fixing this removed the obvious mean offset in rollout logprobs, but training curves still showed a gap relative to the known-good reference.
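The semantic difference can be sketched in a few lines of illustrative torch code (not vLLM internals); temperature scaling stands in here for the full logits post-processing chain.

```python
import torch
import torch.nn.functional as F

def raw_vs_processed_logprobs(logits, token_ids, temperature=1.0):
    """Illustrate the gap between 'raw' and 'processed' logprobs for sampled tokens.

    logits:    [seq_len, vocab] raw model outputs
    token_ids: [seq_len] tokens actually sampled
    """
    # Raw logprobs: log-softmax over the unmodified model outputs.
    raw = F.log_softmax(logits, dim=-1)

    # Processed logprobs: log-softmax over the logits the sampler actually used,
    # here represented by temperature scaling alone.
    processed = F.log_softmax(logits / temperature, dim=-1)

    raw_lp = raw.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
    proc_lp = processed.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)

    # Whenever temperature != 1.0 (or other processors apply), these differ,
    # which shows up as a mean offset in the rollout logprobs.
    return raw_lp, proc_lp
```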
The next issues were in the inference path, starting with V1 runtime defaults. Making those choices explicit and disabling prefix caching removed another V1-only degree of freedom from the parity comparison. Weight synchronization also had to follow the online-RL update model, so rollouts are generated with the latest trainer weights. The final piece of parity was matching the numerical path used to compute logits.
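As an illustration of the kind of configuration that was pinned down, here is a hedged sketch of an explicit engine and sampling setup; flag defaults vary across vLLM versions and the model path is a placeholder, so this is not PipelineRL's actual configuration.

```python
from vllm import LLM, SamplingParams

# Pin engine behavior explicitly instead of relying on V1 defaults.
llm = LLM(
    model="path/to/policy",        # placeholder model path
    enable_prefix_caching=False,   # remove a V1-only degree of freedom from the parity run
    dtype="bfloat16",              # match the trainer's compute dtype for the backbone
)

# Make sampling choices explicit so the sampled distribution is the one
# the trainer assumes when it recomputes logprobs.
params = SamplingParams(
    temperature=1.0,
    top_p=1.0,
    top_k=-1,      # disabled
    logprobs=1,    # return per-token logprobs for the sampled tokens
)
```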
The trainer used an fp32 lm_head for the final projection, and the rollout backend had to match that behavior. A related issue was reported in the MiniMax-M1 technical report, where an RL run showed a training/inference token-probability mismatch traced to the LM output head. The corrected V1 run tracked the V0 reference, while the initial V1 attempt produced a clearly different reward curve.
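A minimal sketch of what matching that numerical path looks like (illustrative torch code, not PipelineRL's or vLLM's actual implementation): the hidden states are projected through the LM head in fp32 before the log-softmax, on both the trainer and rollout sides.

```python
import torch
import torch.nn.functional as F

def fp32_head_logprobs(hidden_states, lm_head_weight, token_ids):
    """Compute sampled-token logprobs with the final projection in fp32.

    hidden_states:  [seq_len, hidden] activations in bf16/fp16
    lm_head_weight: [vocab, hidden] LM head weight, possibly in bf16/fp16
    token_ids:      [seq_len] sampled tokens
    """
    # Upcast both operands so the matmul and softmax run in fp32,
    # matching the trainer-side numerical path.
    logits = hidden_states.float() @ lm_head_weight.float().t()
    logprobs = F.log_softmax(logits, dim=-1)
    return logprobs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
```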
The negative results matter because they rule out common explanations for the divergence. Objective-side corrections, such as truncated importance sampling, can be useful tools, but in this case the first problem was inference correctness. The main lesson from this migration is to fix backend correctness first, then add corrections for whatever mismatch remains.
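For contrast, here is a sketch of the objective-side correction mentioned above, truncated importance sampling, with an assumed truncation constant: it caps the importance ratio rather than repairing its source, which is why it cannot substitute for backend correctness.

```python
import torch

def truncated_is_weights(trainer_logprobs, rollout_logprobs, c=2.0):
    """Truncated importance-sampling weights: min(pi_trainer / pi_rollout, c).

    The truncation constant c is an illustrative assumption.
    """
    ratio = torch.exp(trainer_logprobs - rollout_logprobs)
    return torch.clamp(ratio, max=c)
```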
With inference parity restored, the next improvement is the usual async/off-policy cleanup. The current objective can still improve, but separating the questions of inference correctness and objective-side corrections is crucial for interpretable training curves.
Source: Hugging Face