AI Research

NVIDIA Releases Polar, a Token-Faithful Rollout Framework for GRPO Training Across Codex, Claude Code, and Qwen Code

AI News Desk

MarkTechPost

May 27, 2026

4 min read

Reinforcement learning for language agents is growing more complex.

Framework for GRPO Training Across Codex, Claude Code, and Qwen Code">

Reinforcement learning for language agents is growing more complex. Agents now manage multi-turn tool use, long-running contexts, and multi-agent orchestration. The main engineering challenge is connecting existing agent software to training pipelines without breaking how those tools work.

NVIDIA’s research team introduced Polar , a rollout framework that lets researchers run reinforcement learning over any agent harness without modifying that harness.

An ‘agent harness’ is a tool like Codex CLI, Claude Code, Qwen Code, or Pi. These harnesses manage system prompts, tool formatting, context engineering, and how the agent submits patches. These details directly affect agent behavior at evaluation time.

Traditional RL infrastructure requires harness logic to be rewritten behind a framework-owned environment API — typically env.init() , env.step() , env.reset() in the OpenAI Gym style. Every new harness requires new integration code. That integration can also lose execution details specific to the native harness path.

Polar’s key observation is that every LLM-based agent must call a model. That model API boundary is a common interface outside the agent itself. Instead of integrating inside the harness, Polar places a proxy at that boundary.

For each incoming model request, the gateway proxy performs four steps:

For streaming requests, Polar obtains a non-streaming upstream response and emits a synthetic provider-shaped stream. This preserves compatibility with harnesses that expect server-sent events while ensuring complete token capture.

The only required change to an existing harness is pointing its model base URL at the gateway.

The rollout server accepts a TaskRequest and expands it into num_samples independent sessions. Each session carries a session ID, task ID, timeout budget, runtime specification, agent specification, trajectory builder, evaluator, and callback URL. The server dispatches sessions to gateway nodes and accepts callbacks when sessions complete.

Gateway nodes own the lifecycle of each session — starting the runtime, running the harness, building trajectories, evaluating output, and teardown. The gateway also hosts the proxy endpoint for that session’s model calls, keeping completion capture tied to the session registry.

Source: MarkTechPost