Direct Preference Optimization Beyond Chatbots
A new study reveals that Direct Preference Optimization (DPO) can significantly reduce text degeneration in specialized structured OCR models, achieving an average reduction of 59.4% across five model families.

In April, researchers released DharmaOCR, a specialized structured OCR model available on Hugging Face, along with a paper detailing its methodology and benchmark results. The model demonstrated superior quality and cost efficiency compared to other vision-language models. One key finding was the issue of text degeneration, where models produce repetition loops instead of accurate transcriptions.
The study found that vanilla degeneration rates ranged from below 1% to above 33% across open-source model families, and that supervised fine-tuning (SFT) reduced these rates only partially.

The researchers discovered that adding a second training stage, Direct Preference Optimization (DPO), significantly reduced text degeneration in every model family tested. DPO optimizes the model on a preference signal in which correct transcriptions are the chosen responses and degeneration loops are the rejected ones. The study reported an average reduction of 59.4% in text degeneration, with the best case showing an 87.6% reduction.
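
To make the chosen/rejected signal concrete, here is a minimal sketch of the standard DPO loss for a single preference pair, written in plain Python. It is not taken from the DharmaOCR paper: the function names, the default `beta`, and the log-probability values in the usage lines are all illustrative assumptions.

```python
import math


def sequence_logprob(token_logprobs):
    """Sum per-token log-probabilities into one score for the whole output."""
    return sum(token_logprobs)


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Every argument except beta is a sequence-level log-probability:
    policy_* comes from the model being trained, ref_* from a frozen
    reference model (typically the SFT checkpoint).
    """
    # How much more (or less) likely each output is under the policy
    # than under the reference model.
    chosen_margin = policy_chosen - ref_chosen
    rejected_margin = policy_rejected - ref_rejected
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) computed in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Made-up per-token log-probabilities, for illustration only.
chosen = sequence_logprob([-0.2, -0.1, -0.3])    # a correct transcription
rejected = sequence_logprob([-0.1, -0.1, -0.1])  # a repetition loop
loss = dpo_loss(chosen, rejected, ref_chosen=chosen, ref_rejected=rejected)
```

When the policy and the reference agree, as in the usage lines, the loss equals log 2. It falls as the policy assigns relatively more probability to the chosen transcription and less to the rejected loop. Because the loss depends on whole-sequence scores, a repetition loop is penalized as a unit.
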
The results suggest that SFT hits a ceiling in reducing degeneration, whereas DPO can explicitly penalize degeneration loops.

The study explored why SFT is limited in reducing degeneration, conjecturing that the cause may be loss granularity. SFT trains token by token, evaluating each prediction in isolation, whereas DPO inverts this logic by considering the full output. The researchers proposed that text degeneration is a self-reinforcing repetition loop that autoregressive models cannot exit without external intervention.
They suggested that DPO addresses this by training the model to move away from its own failure modes.

The DharmaOCR pipeline used the SFT model's own degenerate outputs as rejected examples, constructing a DPO training set without requiring human preference labels. The pipeline found its preference signal in the range of outputs the SFT model itself generated, using an automated LLM judge to score candidate responses. The study demonstrated that this approach can reduce text degeneration across various model families and architectures.

The results showed that DPO lowered degeneration rates in every model family, with consistent decreases across different architectures and parameter scales.
The study concluded that SFT and DPO address different failure dimensions, and that DPO can be a valuable addition to structured generation pipelines. The researchers suggested that the approach can be applied to other domains and failure modes, as long as the failures are categorically identifiable, scoreable, and sufficiently numerous.
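
The pair-construction idea described above, with the SFT model's own degenerate outputs as rejected examples and judge-scored candidates as chosen ones, can be sketched roughly as follows. This is not the DharmaOCR implementation. The `Candidate` type, the repeated n-gram heuristic in `is_degenerate`, its thresholds, and the `judge_score` field are hypothetical stand-ins for whatever detector and LLM judge a real pipeline would use.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    judge_score: float  # hypothetical score from an automated judge; higher is better


def is_degenerate(text, ngram=4, max_repeats=3):
    """Heuristic loop detector (an assumption, not the paper's method).

    Returns True if any word n-gram occurs more than max_repeats times.
    """
    words = text.split()
    counts = {}
    for i in range(len(words) - ngram + 1):
        key = tuple(words[i:i + ngram])
        counts[key] = counts.get(key, 0) + 1
        if counts[key] > max_repeats:
            return True
    return False


def build_preference_pair(prompt, candidates):
    """Build one preference record from several sampled outputs for a prompt.

    Returns None when the samples contain no clean output or no loop,
    since such a prompt carries no preference signal of this kind.
    """
    clean = [c for c in candidates if not is_degenerate(c.text)]
    loops = [c for c in candidates if is_degenerate(c.text)]
    if not clean or not loops:
        return None
    chosen = max(clean, key=lambda c: c.judge_score)  # best-scored clean output
    rejected = loops[0]                               # any degenerate output
    return {"prompt": prompt, "chosen": chosen.text, "rejected": rejected.text}
```

The two helper functions mirror two of the conditions the researchers name: the detector makes the failure categorically identifiable, and the judge score makes candidates scoreable. A word-level n-gram check is deliberately crude and would misfire on documents that legitimately repeat content, such as tables with identical rows, so a real pipeline would need a detector tuned to its data.
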
Source: Hugging Face