Prime Intellect Releases prime-rl 0.6.0 to Train Trillion-Parameter MoE Models on Agentic RL Workloads

Prime Intellect has released prime-rl version 0.6.0. The framework targets reinforcement learning on trillion-parameter Mixture-of-Experts (MoE) models. It focuses on heavy agentic workloads, such as long-horizon software-engineering tasks.
The research team trained GLM-5 on software-engineering (SWE) tasks at sequence lengths of up to 131k tokens. With a batch size of 256 rollouts, step times stayed under five minutes. The run used only 28 H200 nodes.
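To put those figures in proportion, here is a rough back-of-the-envelope calculation. It is only a sketch: the article gives the rollout count and node count, while the exact token count behind "131k" and the number of GPUs per node are assumptions made here for illustration.

```python
# Figures stated in the article.
ROLLOUTS_PER_STEP = 256
NODES = 28

# Assumptions, not from the article.
MAX_SEQ_LEN = 131_072   # reading "131k" as 2**17 tokens
GPUS_PER_NODE = 8       # a common H200 node layout

# Total accelerators, if every node has GPUS_PER_NODE GPUs.
total_gpus = NODES * GPUS_PER_NODE

# Ceiling on tokens in one batch, reached only if every rollout hits the maximum length.
max_tokens_per_step = ROLLOUTS_PER_STEP * MAX_SEQ_LEN

print(total_gpus, max_tokens_per_step)
```

Under these assumptions the run would span 224 GPUs, and a single step could hold at most about 33.5 million tokens. Real batches are likely smaller, since most rollouts finish before the length limit.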
prime-rl is an open framework for asynchronous reinforcement learning. It post-trains large open-source models on agentic tasks. Version 0.6.0 extends this to trillion-parameter MoE scale.
The example model in the announcement is zai-org/GLM-5.1. The optimizations also apply to other large MoE models. Examples include moonshotai/Kimi-K2.7-Code and nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16.
A full GLM-5.1 run starts with one command on a Slurm cluster.
Agentic tasks have long-tail outliers. Some coding rollouts run for hours. Waiting for them before each policy update would idle GPUs.
Asynchronous RL avoids this. The trainer and inference systems are disaggregated. They run and scale independently. The inference policy updates as soon as the optimizer step finishes.
There is one synchronization point: the policy update. prime-rl pushes new weights as soon as they exist. Already-dispatched rollouts keep their active prefix cache. So a single rollout may mix tokens from several policy versions.
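The effect of pushing weights without waiting for in-flight rollouts can be illustrated with a toy simulation. This is a minimal sketch, not prime-rl code: the Rollout class, the simulate function, and the tick-based clock are all hypothetical, and each "token" only records which policy version produced it.

```python
from dataclasses import dataclass, field


@dataclass
class Rollout:
    """Toy in-flight rollout; each token remembers the policy version that produced it."""
    target_len: int
    token_versions: list = field(default_factory=list)

    @property
    def done(self) -> bool:
        return len(self.token_versions) >= self.target_len


def simulate(rollout_lengths, ticks_per_update):
    """Decode one token per unfinished rollout per tick.

    Every `ticks_per_update` ticks the trainer finishes a step and the
    policy version is bumped, without waiting for unfinished rollouts.
    """
    rollouts = [Rollout(n) for n in rollout_lengths]
    policy_version = 0
    tick = 0
    while not all(r.done for r in rollouts):
        for r in rollouts:
            if not r.done:
                r.token_versions.append(policy_version)
        tick += 1
        if tick % ticks_per_update == 0:
            policy_version += 1  # new weights go live immediately
    return rollouts


# One short rollout and one long-tail rollout.
rollouts = simulate(rollout_lengths=[2, 9], ticks_per_update=3)
versions_seen = [sorted(set(r.token_versions)) for r in rollouts]
print(versions_seen)
```

In this toy run the short rollout finishes entirely under version 0, while the long rollout contains tokens from versions 0, 1, and 2. That is the mixed-version behavior described above.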
New rollouts behave differently. A KV-cache salt forces them to repopulate their own KV cache, even when prefixes match. Requests generated under a policy that is too old are dropped. The max_off_policy_steps value controls that threshold.
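Both mechanisms can be sketched in a few lines. This is an illustration under stated assumptions, not the prime-rl implementation: the Request class, cache_key, and keep_fresh are hypothetical names, the salt is assumed to be the policy version, and staleness is assumed to be measured as the gap between the current version and the version a request started under. Only the max_off_policy_steps name comes from the article.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    prompt_tokens: tuple
    policy_version: int  # weights version when the rollout was dispatched


def cache_key(request: Request) -> str:
    """Hypothetical prefix-cache key; the policy version acts as the salt."""
    payload = f"{request.policy_version}|" + ",".join(str(t) for t in request.prompt_tokens)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def keep_fresh(requests, current_version, max_off_policy_steps):
    """Drop requests whose policy lags the current one by more than the threshold (assumed rule)."""
    return [
        r for r in requests
        if current_version - r.policy_version <= max_off_policy_steps
    ]


old = Request(prompt_tokens=(1, 2, 3), policy_version=4)
new = Request(prompt_tokens=(1, 2, 3), policy_version=7)

# Identical prompts, different salts: the cached prefix is not shared.
same_prefix_shares_cache = cache_key(old) == cache_key(new)

# With the current policy at version 8 and a threshold of 2, only `new` survives.
kept = keep_fresh([old, new], current_version=8, max_off_policy_steps=2)
```

Because the salt changes the key, a matching prefix from an older policy is never reused by a new rollout, and the staleness filter bounds how far off-policy any surviving request can be.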
Inference is usually the throughput bottleneck in an RL system. prime-rl optimizes for throughput, while keeping latency bounded.
FP8 inference: Lower precision speeds up prefill and decode. prime-rl uses FP8 with DeepEP and DeepGEMM kernels.
Wide Expert Parallelism: Wide EP spreads experts across 32 or more GPUs. It pairs with a large data-parallel size, for example 32. Each GPU holds its own subset of experts and serves as an endpoint. Synchronization happens per layer, through dispatch and combine operations.
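The dispatch and combine pattern can be shown with a toy, single-process example. This is a sketch only: real systems perform these steps as all-to-all communication across GPUs, whereas here plain lists stand in for ranks. The function names, top-1 routing, contiguous expert sharding, and the toy expert function are all assumptions made for illustration.

```python
def owner_rank(expert_id: int, experts_per_rank: int) -> int:
    """Rank that holds a given expert when experts are sharded contiguously."""
    return expert_id // experts_per_rank


def dispatch(tokens, routes, num_ranks, experts_per_rank):
    """Send each token to the rank owning its routed expert (stand-in for an all-to-all)."""
    inbox = [[] for _ in range(num_ranks)]
    for index, (value, expert_id) in enumerate(zip(tokens, routes)):
        inbox[owner_rank(expert_id, experts_per_rank)].append((index, expert_id, value))
    return inbox


def run_local_experts(rank_inbox):
    """Toy expert: expert k multiplies its input by (k + 1)."""
    return [(index, value * (expert_id + 1)) for index, expert_id, value in rank_inbox]


def combine(per_rank_outputs, num_tokens):
    """Return expert outputs to their original token positions."""
    output = [0.0] * num_tokens
    for rank_outputs in per_rank_outputs:
        for index, value in rank_outputs:
            output[index] = value
    return output


tokens = [1.0, 2.0, 3.0, 4.0]
routes = [0, 3, 1, 2]  # top-1 expert chosen for each token
inbox = dispatch(tokens, routes, num_ranks=2, experts_per_rank=2)
layer_output = combine([run_local_experts(items) for items in inbox], len(tokens))
print(layer_output)
```

Each MoE layer repeats this round trip, which is why the synchronization cost is paid per layer rather than once per forward pass.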
Prefill and Decode Disaggregation: Some model-environment pairs hit a 4:1 prefill-to-decode token ratio. On shared workers, that prefill load would inflate end-to-end latency, which reduces the benefits of PipelineRL. P/D disaggregation separates prefill and decode workers. Long tool outputs then stop throttling decode workers.
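A toy queueing model shows why a long prefill hurts decode latency on a shared worker. This is a deliberately simplified sketch: the job costs are made-up units, all jobs are assumed to arrive at time zero, and the cost of transferring the KV cache from a prefill worker to a decode worker is ignored.

```python
def completion_times(jobs):
    """Run (kind, cost) jobs in FIFO order on one worker and return each finish time."""
    clock = 0.0
    finished = []
    for kind, cost in jobs:
        clock += cost
        finished.append((kind, clock))
    return finished


def worst_decode_finish(jobs):
    """Latest finish time among decode jobs."""
    return max(t for kind, t in completion_times(jobs) if kind == "decode")


# A long tool output arrives as a prefill job between cheap decode steps.
jobs = [("decode", 1.0), ("prefill", 40.0), ("decode", 1.0), ("decode", 1.0)]

# Shared worker: decode steps queue behind the prefill.
shared = worst_decode_finish(jobs)

# Disaggregated: the decode worker only ever sees decode jobs.
decode_only = worst_decode_finish([job for job in jobs if job[0] == "decode"])

print(shared, decode_only)
```

In this toy setup the last decode step finishes at time 43 on the shared worker but at time 3 on a dedicated decode worker. The gap grows with the length of the prefill.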
Source: MarkTechPost