Physical Address
304 North Cardinal St.
Dorchester Center, MA 02124


Training AI reasoning models requires resources that most enterprise teams do not have. Engineering teams are often forced to choose between distilling knowledge from large, expensive models or relying on reinforcement learning techniques that provide sparse feedback.
Researchers at JD.com and several academic institutions have recently introduced a new learning paradigm that circumvents this dilemma. The technique, called Reinforcement Learning with Verifiable Rewards with Self-distillation (RLSD), combines the reliable performance tracking of reinforcement learning with the granular feedback of self-distillation.
Experiments show that models trained with RLSD outperform those built with classical distillation and reinforcement learning algorithms. For enterprise teams, this approach lowers the technical and financial barriers to building custom reasoning models aligned with specific business logic.
The standard method for training reasoning models is Reinforcement Learning with Verifiable Rewards (RLVR). In this paradigm, the model learns through trial and error, guided by the final outcome reported by its environment. An automatic verifier checks whether the model’s answer is correct or incorrect and returns a binary reward of 0 or 1.
RLVR suffers from sparse and uniform feedback. “Standard GRPO has a signal density problem,” Chenxu Yang, co-author of the paper, told VentureBeat. “A reasoning trace with multiple tokens receives a single binary reward, and every token in that trace receives identical credit, whether it is a basic logic step or a throwaway phrase.” As a result, the model never learns which intermediate steps led to its success or failure.
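The signal density problem can be made concrete with a small sketch. The code below assumes a GRPO-style setup in which binary rewards are normalized within a group of sampled responses; the function names and the toy rewards are illustrative and are not taken from the paper or from any specific framework.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def broadcast_to_tokens(advantage, num_tokens):
    """Every token in a response inherits the same sequence-level advantage."""
    return [advantage] * num_tokens

# Four sampled responses to one prompt: two verified correct, two incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
token_counts = [5, 3, 4, 6]  # hypothetical response lengths

advantages = group_advantages(rewards)
per_token = [broadcast_to_tokens(a, n) for a, n in zip(advantages, token_counts)]

for reward, credits in zip(rewards, per_token):
    print(reward, credits)
```

Within each response, every entry of the per-token list is identical, which is the uniform credit Yang describes: a decisive deduction and a filler phrase receive the same number.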
On-Policy Distillation (OPD) takes a different approach. Instead of waiting for a final result, developers pair a smaller student model with a larger, more capable teacher model. For each training example, the student’s output is compared with the teacher’s token by token. This gives the student detailed feedback on the entire chain of reasoning and the process of generating an answer.
Deploying and running a separate, massive teacher model alongside the student throughout training incurs huge computational costs. “You have to maintain a larger teacher model during training, which roughly doubles your GPU footprint,” Yang said. Additionally, the teacher and student models must share the same vocabulary structure, which Yang says “quietly rules out most of the cross-architecture, cross-modality, or multilingual settings that enterprises actually manage.”
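To illustrate what token-by-token feedback looks like in OPD, here is a minimal sketch that computes a per-position divergence between student and teacher next-token distributions. It assumes tiny, hand-written distributions over a three-token vocabulary and uses the KL divergence from student to teacher as the signal; real systems operate on model logits, and the exact divergence used varies by implementation.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def per_token_distillation_signal(student_dists, teacher_dists):
    """One divergence value per generated position (dense feedback)."""
    if len(student_dists) != len(teacher_dists):
        raise ValueError("student and teacher must cover the same positions")
    return [kl_divergence(s, t) for s, t in zip(student_dists, teacher_dists)]

# Hypothetical next-token distributions at two positions, vocabulary size 3.
student = [[0.7, 0.2, 0.1], [0.4, 0.4, 0.2]]
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]

signal = per_token_distillation_signal(student, teacher)
print(signal)  # first position agrees with the teacher, second does not
```

Note that the comparison only makes sense when both distributions are indexed by the same vocabulary, which mirrors the shared-vocabulary constraint described above.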
On-Policy Self-Distillation (OPSD) emerged as a solution designed to overcome the shortcomings of the other two approaches. In OPSD, the same model plays the role of both the student and the teacher.
During training, the student receives a standard prompt, while the teacher receives privileged information, such as a verified step-by-step answer key. This well-informed teacher version of the model then evaluates the student version, providing token-by-token feedback as the student attempts to solve the problem using only the standard prompt.
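The student/teacher split in OPSD comes down to what each copy of the model is allowed to see. The sketch below shows one way the two prompts might be built; the `Example` class and the prompt templates are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

@dataclass
class Example:
    question: str
    reference_solution: str  # privileged information, hidden from the student

def student_prompt(ex: Example) -> str:
    """Standard prompt: the question only."""
    return f"Question: {ex.question}\nAnswer:"

def teacher_prompt(ex: Example) -> str:
    """Privileged prompt: the same question plus the verified solution."""
    return (
        f"Reference solution: {ex.reference_solution}\n"
        f"Question: {ex.question}\nAnswer:"
    )

ex = Example(
    question="A box holds 12 pens and 5 are removed. How many remain?",
    reference_solution="12 - 5 = 7, so 7 pens remain.",
)

print(student_prompt(ex))
print(teacher_prompt(ex))
```

The same weights score the student’s tokens under both prompts; only the context differs.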
OPSD seems like the perfect compromise for the corporate budget. It provides the detailed, token-by-token guidance of OPD. And because it eliminates the need for an external teacher model, it operates with the high computational efficiency and low cost of RLVR, requiring only one additional forward pass for the teacher.
However, the researchers found that OPSD suffers from a phenomenon called “privileged information leakage.”
“The target is structurally misplaced,” Yang said. “There is an irreducible gap in mutual information that the student can never close. . . . When self-distillation is set up as distribution matching, the student is asked to mimic the full distribution of the teacher’s output in a privileged context.”
Since the teacher evaluates the student based on a hidden answer key, the learning objective forces the student model to learn the teacher’s exact phrasing or steps instead of the underlying reasoning logic. As a result, the student model begins to hallucinate references to a hidden solution that it will not have access to in real-world deployment.
In practice, OPSD models show a rapid spike in performance early in training, but their reasoning abilities soon plateau and progressively deteriorate over time.
The researchers behind RLSD realized that the signals governing how the model updates its parameters have fundamentally asymmetric requirements. They found that the signal dictating the direction of the update (i.e., whether to reinforce or punish a behavior) can be sparse, but must be completely reliable, because pointing the model in the wrong direction breaks its reasoning policy.
On the other hand, the signal determining the magnitude of the update (i.e., how much relative credit or blame a particular step deserves) benefits from being extremely dense, which allows fine step-by-step adjustments.
RLSD builds on this principle by separating the update direction from the update magnitude. The framework lets the verifiable environment feedback from RLVR strictly determine the direction of learning. The model receives positive reinforcement only if the final answer is objectively correct.
The teacher is stripped of its power to dictate what the model should generate. Instead, the teacher’s token-by-token evaluation is redirected to determine the magnitude of the update. It simply distributes the overall credit or blame across the individual steps of the model’s reasoning path.
This changes how the model learns compared to the classical OPSD paradigm. In standard OPSD, the learning objective acts as behavioral cloning, where the model is forced to directly copy the teacher’s exact wording and phrasing. This causes the student to hallucinate and leak references to information it does not have.
Instead of forcing the model to copy a hidden solution, RLSD uses the teacher as a natural and practically free source of per-token credit information.
“The intuition: we don’t teach the model to reason like the teacher,” Yang said. “We tell the model, along the path it has chosen, which of its own tokens are actually doing the work. The model’s exploration distribution remains its own. Only the credit distribution is sharpened.”
If a particular deduction strongly supports the correct result, it receives a higher score. If it is just a useless filler word, it gets a baseline score. RLSD eliminates the need to train complex auxiliary reward networks, manually annotate data step by step, or maintain massive external teacher models.
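The separation of direction and magnitude can be sketched in a few lines. The code below is an illustration of the idea under stated assumptions, not the formula from the paper: it assumes the per-token weight comes from a softmax over the gap between teacher and student log-probabilities, rescaled so the weights average to 1. The function names, the `temperature` parameter, and the toy log-probabilities are all hypothetical.

```python
import math

def token_weights(student_logps, teacher_logps, sign, temperature=1.0):
    """Positive per-token weights with mean 1, derived from the teacher's view.

    For a correct response (sign=+1), tokens the privileged teacher finds more
    likely than the student get more credit. For an incorrect one (sign=-1),
    tokens the teacher finds less likely get more blame.
    """
    gaps = [sign * (t - s) / temperature
            for s, t in zip(student_logps, teacher_logps)]
    top = max(gaps)
    exps = [math.exp(g - top) for g in gaps]  # numerically stable softmax
    total = sum(exps)
    return [len(gaps) * e / total for e in exps]

def token_advantages(seq_advantage, student_logps, teacher_logps):
    """Direction comes from the verifier; the teacher only sets magnitudes."""
    if seq_advantage == 0:
        return [0.0] * len(student_logps)
    sign = 1.0 if seq_advantage > 0 else -1.0
    weights = token_weights(student_logps, teacher_logps, sign)
    return [seq_advantage * w for w in weights]

# Hypothetical log-probs of the student's own three tokens under each prompt.
student_logps = [-1.0, -2.0, -0.5]
teacher_logps = [-1.0, -0.5, -0.5]

adv_correct = token_advantages(1.0, student_logps, teacher_logps)
adv_wrong = token_advantages(-1.0, student_logps, teacher_logps)
print(adv_correct)
print(adv_wrong)
```

Because the weights are always positive, the teacher can never flip the sign set by the verifier. It can only concentrate credit or blame on particular tokens, which is the property the researchers describe.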
To test RLSD, the researchers trained the open vision-language model Qwen3-VL-8B and evaluated it on several visual reasoning benchmarks. These include MMMU for multidisciplinary college-level questions, MathVista, MathVision, WeMath, and ZeroBench, a stress-test benchmark expressly designed to be nearly impossible for current frontier models.
They compared the RLSD model with the base model without post-training, standard RLVR via the GRPO algorithm, standard OPSD, and a hybrid combination of the two.
RLSD outperformed every other method, achieving the highest average accuracy of 56.18% across all five benchmarks. It beat the base model by 4.69% and standard RLVR by 2.32%. The gains were most pronounced on complex mathematical reasoning tasks, where RLSD outperformed standard RLVR by 3.91% on the MathVision benchmark.
Besides accuracy, the framework offers large gains in efficiency. “Specifically, RLSD at 200 training steps already beats GRPO trained at 400 steps, so roughly a 2x convergence speedup,” Yang said. “From a cost perspective, the only additional cost beyond the normal GRPO pipeline is one extra forward pass per response to get teacher logits. Compared to generating a rollout… it’s basically free.”
Unlike OPSD, where performance rises and then completely crashes due to information leakage, RLSD maintains long-term learning stability and reaches a higher performance ceiling than standard methods.
Qualitative findings highlight how the model changes its learning behavior. For example, in a complex visual counting task, standard RLVR looks only at the correct final answer and gives the entire paragraph of reasoning tokens the same reward. RLSD surgically applied rewards to the specific subtraction steps that solved the problem while actively reducing the weight of generic filler text such as “Looking at the image, I see…”.
In another example, the model made an incorrect mathematical inference based on a bar chart. Instead of marking the entire response as a failure, RLSD concentrated the most severe penalty on the exact point where the model misread a relationship from the chart. It remained neutral on the rest of the logical setup, acknowledging that the initial framing was valid.
This is especially important for messy, real-world enterprise use cases. If a model makes a mistake analyzing a 50-page quarterly earnings report, developers don’t want it to discard its entire analytical framework. They just want to correct the particular assumption that was wrong. RLSD allows the model to learn exactly which logical steps are valuable and which are flawed, token by token. Because RLSD does this by reusing the model itself as the teacher, it gives models granular reasoning capabilities while keeping training costs reasonable.
For data engineers and AI orchestration teams, integrating RLSD is straightforward but requires the right setup. The most critical requirement is a verifiable reward signal, such as code compilers, math checkers, SQL execution, or schema validators. “Tasks without verifiable reward (open dialogue, brand-voice writing) belong in preference-based pipelines,” Yang said.
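As a rough idea of what a verifiable reward signal looks like in code, the sketch below shows two toy verifiers that return a binary reward: an exact math check and a check that structured output contains required fields. Both functions and their names are illustrative assumptions; production verifiers for SQL, code, or full schema validation would be considerably more involved.

```python
import json
from fractions import Fraction

def math_reward(model_answer: str, reference: str) -> int:
    """Return 1 if the two strings denote the same rational number, else 0."""
    try:
        same = Fraction(model_answer.strip()) == Fraction(reference.strip())
    except (ValueError, ZeroDivisionError):
        return 0  # unparseable answers earn no reward
    return int(same)

def json_keys_reward(model_output: str, required_keys: set) -> int:
    """Return 1 if the output is a JSON object containing every required key."""
    try:
        parsed = json.loads(model_output)
    except json.JSONDecodeError:
        return 0
    return int(isinstance(parsed, dict) and required_keys.issubset(parsed))

print(math_reward("0.5", "1/2"))
print(math_reward("n/a", "4"))
print(json_keys_reward('{"id": 1, "total": 2.5}', {"id", "total"}))
print(json_keys_reward("[1, 2]", {"id"}))
```

The important property is that the check is automatic and objective. Tasks where no such function can be written are the ones Yang assigns to preference-based pipelines.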
However, RLSD is very flexible about the privileged information it requires. While OPSD structurally requires full intermediate reasoning traces, forcing enterprises to either pay annotators or distill from a frontier model, RLSD does not.
“If you have complete verified reasoning traces, great, RLSD will use them,” Yang said. “If all you have is the ground-truth final answer, that works, too. . . . OPSD doesn’t have that flexibility.”
Integrating the technique into existing open-source multimodal RL frameworks such as veRL or EasyR1 is straightforward. According to Yang, it requires no rewrite of the framework and sits directly on top of the standard stack. The code change amounts to dozens of lines that adjust the GRPO objective and keep the teacher synchronized with the student.
Looking ahead, RLSD offers a powerful way for enterprises to maximize their existing internal assets.
“The proprietary data that enterprises keep within their perimeter (compliance manuals, internal documentation, historical tickets, audited code snippets) is essentially free privileged information,” Yang concluded. “RLSD allows enterprises to feed this kind of data right into a privileged context, which sharpens the signal for training smaller models without needing an external teacher and without sending anything outside the network.”