Research Synthesized from 1 source

RL After Flawed SFT Locks In Model Defects

Key Points

• RL amplifies SFT defects rather than correcting them
• SFT embeds flawed patterns that RL reinforces
• Agentic systems propagate compounding hallucinations
• Audit SFT damage before applying reinforcement learning
• Commercial pressure pushes teams to skip critical diagnostics

References (1)

[1] Warning: rushing RL after SFT may train broken multimodal LLMs. 量子位 QbitAI.

The most dangerous assumption in AI development right now may be that reinforcement learning (RL) fixes what supervised fine-tuning (SFT) broke. A technical analysis from 量子位 QbitAI [1] describes a troubling pattern: teams deploying AI agents are building on foundations riddled with hidden defects, and the standard playbook for "improving" models may be actively making things worse.

The core finding inverts conventional wisdom. Practitioners have treated SFT and RL as sequential improvements: train the base model, fine-tune it, then reinforce it. But this framing misses a critical interaction. When SFT embeds flawed patterns in a multimodal model, subsequent RL training doesn't correct those errors. It optimizes for them. The model becomes more efficient at reproducing exactly what it learned during fine-tuning, including the hallucination-prone behaviors teams thought they were eliminating.

The mechanism unfolds in stages. SFT exposes the model to curated demonstration data, teaching it to emulate patterns in that dataset. If those demonstrations contain hidden biases, gaps, or outright errors, the model absorbs them. RL then samples from the model's current behavior and reinforces whatever the reward signal scores well. A reward that cannot detect the flawed patterns leaves them in place, and because the model has already internalized them, it ends up performing them more confidently. There's no "unlearning" phase, only amplification of what came before.
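The amplification step can be illustrated with a toy simulation. This is a minimal sketch under loud assumptions, not the method from the source analysis: the policy picks among three hypothetical behaviors ("flawed", "grounded", "abstain"), the post-SFT probabilities are invented, and the reward is assumed to be blind to the defect, scoring flawed and grounded answers identically. The update is exact policy-gradient ascent on softmax logits, so the run is deterministic.

```python
import math

# Hypothetical behaviors and numbers, chosen only for illustration.
BEHAVIORS = ("flawed", "grounded", "abstain")
SFT_POLICY = [0.5, 0.3, 0.2]     # assumed policy after a flawed SFT run
BLIND_REWARD = [1.0, 1.0, 0.0]   # assumed reward that cannot see the defect


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_step(logits, rewards, lr):
    """One exact gradient-ascent step on expected reward.

    For a softmax policy, d E[r] / d logit_i = p_i * (r_i - E[r]).
    """
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))
    return [t + lr * p * (r - expected)
            for t, p, r in zip(logits, probs, rewards)]


def simulate_rl(initial_probs, rewards, steps=200, lr=0.5):
    """Return the policy's probabilities before and after every RL step."""
    logits = [math.log(p) for p in initial_probs]
    history = [softmax(logits)]
    for _ in range(steps):
        logits = policy_gradient_step(logits, rewards, lr)
        history.append(softmax(logits))
    return history


if __name__ == "__main__":
    history = simulate_rl(SFT_POLICY, BLIND_REWARD)
    for name, before, after in zip(BEHAVIORS, history[0], history[-1]):
        print(f"{name:9s} before={before:.3f} after={after:.3f}")
```

In this toy setting the flawed behavior starts out dominant, so each update raises its logit faster than the grounded one even though both earn the same reward. The flawed share grows at every step and nothing in the loop pushes it back down. Real RL pipelines are far more complex, so the sketch shows the direction of the effect, not its size.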

This matters enormously for agentic systems. Modern AI deployments string multiple model capabilities together: vision inputs trigger text generation, tool-use decisions depend on retrieval results, and multi-step reasoning chains build each step on the output of the previous one. When the model at the foundation of such a chain has SFT-induced defects, RL doesn't isolate those defects, and the chain carries them into every downstream decision. A model that hallucinates occasionally in testing becomes a system that hallucinates systematically in production.
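A back-of-the-envelope calculation shows why occasional errors become systematic in a chain. The sketch below assumes, purely for illustration, that every step fails independently with the same probability and that any single failure corrupts the final result. The function name and the example rates are hypothetical and do not come from the source analysis.

```python
def chain_failure_rate(step_error_rate, n_steps):
    """Probability that at least one of n_steps independent steps fails.

    Assumes identical, independent per-step error rates, which is a
    simplification: real agent steps are usually correlated.
    """
    if not 0.0 <= step_error_rate <= 1.0:
        raise ValueError("step_error_rate must be between 0 and 1")
    if n_steps < 0:
        raise ValueError("n_steps must be non-negative")
    return 1.0 - (1.0 - step_error_rate) ** n_steps


if __name__ == "__main__":
    # Hypothetical 2% per-step hallucination rate across growing chains.
    for n in (1, 5, 20, 50):
        rate = chain_failure_rate(0.02, n)
        print(f"{n:2d} steps -> {rate:.1%} chance of a corrupted run")
```

Under these assumptions, a 2% per-step error rate over a 20-step chain corrupts roughly one run in three. If RL raises the per-step rate instead of lowering it, the end-to-end figure climbs even faster.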

The 量子位 analysis offers practical guidance: teams should audit and remediate SFT-induced "injuries" before applying RL. Identify what the fine-tuning phase actually taught the model, not just what it was supposed to teach. Test for failure modes that stem from the fine-tuning data, not just the base model. Only then should teams proceed to reinforcement learning, and even then with clear visibility into what behaviors RL is actually reinforcing.
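One way to make that audit concrete is a regression gate that compares the base and SFT checkpoints on the same probe set before any RL run starts. The following is a minimal sketch, not a procedure described in the source: `ProbeResult`, `sft_regressions`, `ready_for_rl`, the category names, and the 2% threshold are all hypothetical, and the pass/fail labels are assumed to come from whatever evaluation harness the team already uses.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    """Outcome of one probe prompt on both checkpoints (hypothetical schema)."""
    category: str
    base_failed: bool
    sft_failed: bool


def sft_regressions(results, max_increase=0.02):
    """Map each category to its failure-rate increase from base to SFT.

    Only categories whose increase exceeds max_increase are returned.
    """
    counts = defaultdict(lambda: [0, 0, 0])  # total, base failures, SFT failures
    for r in results:
        entry = counts[r.category]
        entry[0] += 1
        entry[1] += r.base_failed
        entry[2] += r.sft_failed
    regressions = {}
    for category, (total, base_fail, sft_fail) in counts.items():
        increase = (sft_fail - base_fail) / total
        if increase > max_increase:
            regressions[category] = increase
    return regressions


def ready_for_rl(results, max_increase=0.02):
    """Gate: proceed to RL only if SFT introduced no flagged regression."""
    return not sft_regressions(results, max_increase)


def make_probes(category, total, base_failures, sft_failures):
    """Build synthetic probe results for the demo below."""
    return [ProbeResult(category, i < base_failures, i < sft_failures)
            for i in range(total)]


if __name__ == "__main__":
    # Synthetic data: SFT made counting worse and left chart reading unchanged.
    probes = (make_probes("object_counting", 10, 1, 4)
              + make_probes("chart_reading", 10, 2, 2))
    print("regressions:", sft_regressions(probes))
    print("ready for RL:", ready_for_rl(probes))
```

Comparing against the base checkpoint, not an absolute target, separates failures that fine-tuning introduced from those the base model already had, which is the distinction the guidance above draws.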

The industry trend runs opposite to this advice. Commercial pressure pushes teams to ship agentic products quickly, treating RL as a box-checking exercise rather than a carefully sequenced remediation. The gap between "works in demo" and "works reliably in production" widens precisely because teams skip the diagnostic work. SFT defects compound silently until they surface in ways that are difficult to reverse.

The stakes extend beyond individual model failures. As agentic systems take on consequential tasks such as medical advice, legal analysis, and autonomous navigation, the cost of amplified hallucinations grows. A model that confidently generates plausible-but-wrong information, and that RL has trained to generate it more reliably, becomes a liability that scales with deployment breadth.

This research is a warning, not a prohibition. RL remains a powerful tool. But teams treating it as a universal fix for fine-tuning problems are operating on flawed premises. The question isn't whether to use RL after SFT. It's whether the foundation it will reinforce is sound enough to justify the investment.