How to build custom reasoning agents with a fraction of the compute
A new training paradigm called Reinforcement Learning with Verifiable Rewards with Self-Distillation (RLSD) allows enterprises to build custom reasoning models at a lower cost.

Training AI reasoning models demands resources that most enterprise teams do not have. Engineering teams are often forced to choose between distilling knowledge from large, expensive models or relying on reinforcement learning techniques that provide sparse feedback. Researchers at JD.com and several academic institutions recently introduced a new training paradigm that sidesteps this dilemma.
The technique, called Reinforcement Learning with Verifiable Rewards with Self-Distillation (RLSD), combines the reliable performance tracking of reinforcement learning with the granular feedback of self-distillation. Experiments indicate that models trained with RLSD outperform those built on classic distillation and reinforcement learning algorithms.
The standard method for training reasoning models is Reinforcement Learning with Verifiable Rewards (RLVR). In this paradigm, the model learns through trial and error, guided by a final outcome from its environment.
An automated verifier checks if the model's answer is right or wrong, providing a binary reward, such as a 0 or 1. RLVR suffers from sparse and uniform feedback. 'Standard GRPO has a signal density problem,' Chenxu Yang, co-author of the paper, told VentureBeat.
'A multi-thousand-token reasoning trace gets a single binary reward, and every token inside that trace receives identical credit, whether it's a pivotal logical step or a throwaway phrase.'
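In code, the pattern Yang describes looks roughly like the sketch below: a verifier returns a single 0-or-1 reward per rollout, and a GRPO-style update normalizes that reward within a group and copies it to every token of the trace. The verifier, reward values, and trace lengths here are illustrative, not taken from the paper.

```python
# Minimal sketch (not the paper's code) of the RLVR / GRPO credit pattern:
# one binary verifier reward per rollout, normalized within a group,
# then broadcast unchanged to every token of the trace.
import numpy as np

def verify(answer: str, gold: str) -> float:
    """Toy verifier: 1.0 if the final answer matches the reference, else 0.0."""
    return float(answer.strip() == gold.strip())

def grpo_token_advantages(group_rewards: list[float], trace_lengths: list[int]):
    """Group-normalized advantages, copied uniformly across each trace's tokens."""
    r = np.array(group_rewards, dtype=np.float64)
    adv = (r - r.mean()) / (r.std() + 1e-6)        # one scalar per rollout
    # every token in a rollout gets identical credit, pivotal step or filler
    return [np.full(n, a) for a, n in zip(adv, trace_lengths)]

# Example: four rollouts for the same prompt, only one verified correct.
rewards = [verify(ans, "42") for ans in ["41", "42", "forty", "41"]]
advantages = grpo_token_advantages(rewards, trace_lengths=[900, 1200, 800, 1100])
print(rewards)            # [0.0, 1.0, 0.0, 0.0]
print(advantages[1][:5])  # the same value repeated for all 1,200 tokens
```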
On-Policy Distillation (OPD) takes a different approach. Instead of waiting for a final outcome, developers pair a smaller student model with a larger, more capable teacher model. For each training example, the student compares its response to that of the teacher, token by token. This provides the student with granular feedback on the entire reasoning chain and response-generation process. However, deploying and running a separate, massive teacher model alongside the student throughout training incurs significant computational overhead. 'You have to keep a larger teacher model resident throughout training, which roughly doubles your GPU footprint,' Yang said.
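A rough sketch of that per-token signal, using assumed tensor shapes and a reverse-KL formulation rather than OPD's exact objective, shows why the teacher has to stay loaded: it must score every token the student generates.

```python
# Illustrative sketch (assumed shapes and loss, not OPD reference code): the
# student samples its own trace, then a larger teacher scores every token of
# that trace, yielding a dense per-token distillation signal -- at the cost of
# keeping the teacher resident on GPU for the whole run.
import torch
import torch.nn.functional as F

def per_token_distill_loss(student_logits: torch.Tensor,
                           teacher_logits: torch.Tensor) -> torch.Tensor:
    """Reverse KL(student || teacher) at every position of the student's rollout.

    Both tensors: [batch, seq_len, vocab]; returns one loss value per token.
    """
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    # sum over the vocabulary -> a scalar signal for each token, not per trace
    return (s_logp.exp() * (s_logp - t_logp)).sum(dim=-1)

# Toy shapes: 2 rollouts, 6 tokens, vocabulary of 50.
student = torch.randn(2, 6, 50, requires_grad=True)
teacher = torch.randn(2, 6, 50)           # in practice: a much larger frozen model
token_losses = per_token_distill_loss(student, teacher)
print(token_losses.shape)                 # torch.Size([2, 6]) -- dense, per-token credit
token_losses.mean().backward()
```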
The researchers behind RLSD realized that the signals governing how a model updates its parameters have fundamentally asymmetric requirements. They identified that the signal dictating the direction of the update (i.e., whether to reinforce or penalize a behavior) can be sparse, but must be perfectly reliable, because pointing the model in the wrong direction damages its reasoning policy. On the other hand, the signal dictating the magnitude of the update (i.e., how much relative credit or blame a specific step deserves) benefits from being extremely dense, enabling fine-grained, step-by-step corrections. RLSD builds on this principle by decoupling the update direction from the update magnitude.
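The paper's exact objective is not spelled out here, but one way to read the decoupling is sketched below: a hypothetical helper uses the sparse verifier outcome only to set the sign of the update, while a dense per-token score (standing in for the self-distillation signal) sets each token's relative weight.

```python
# A rough sketch of the decoupling idea described above -- not RLSD's actual
# objective. The sparse-but-reliable verifier outcome fixes only the *sign* of
# the update for a rollout; a dense per-token score (a placeholder for the
# self-distillation signal) decides how much each token moves.
import numpy as np

def decoupled_token_credit(verifier_reward: float, token_scores: np.ndarray) -> np.ndarray:
    """Combine a trace-level binary reward with a dense per-token magnitude signal."""
    direction = 1.0 if verifier_reward > 0 else -1.0                 # sparse, reliable: reinforce or penalize
    magnitude = token_scores / (np.abs(token_scores).sum() + 1e-6)   # dense, relative credit per token
    return direction * magnitude                                     # per-token update weights

# Toy trace of 6 tokens: a "pivotal" step gets a large score, filler gets small ones.
scores = np.array([0.1, 0.05, 2.0, 0.1, 0.05, 0.2])
print(decoupled_token_credit(verifier_reward=1.0, token_scores=scores))
print(decoupled_token_credit(verifier_reward=0.0, token_scores=scores))
```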
To test RLSD, the researchers trained the open-weight Qwen3-VL-8B vision-language model and evaluated it on several visual reasoning benchmarks.