Direct Preference Optimization Beyond Chatbots

June 3, 2026, 12:55

Key Summary

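Team article, published June 3, 2026, by Erick Lachmann (ErickvL) and Pimenta de Freitas Cardoso (GabrielPimenta99), Dharma-AI. Using rejection pairs from your model's own failures. In April, we released DharmaOCR, our specialized structured OCR model (available on Hugging Face), together with a paper detailing its methodology and a benchmark demonstrating its superior quality and cost efficiency. The paper benchmarked leading vision-language model families, both open-source and commercial, on a structured document extraction task: OCR on Brazilian Portuguese text. One of the reported metrics is the text degeneration rate: how often a model produces a repetition loop instead of a correct transcription.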

AI-compiled article text

Team Article, published June 3, 2026, by Erick Lachmann (ErickvL) and Pimenta de Freitas Cardoso (GabrielPimenta99), Dharma-AI

Using Rejection Pairs From Your Model's Own Failures

In April, we released DharmaOCR, our specialized structured OCR model (available on Hugging Face), along with a paper detailing the methodology behind it and a benchmark demonstrating its superior quality and cost efficiency. The paper benchmarked leading vision-language model families - both open-source and commercial - on a structured document extraction task: OCR on Brazilian Portuguese text. Among the reported metrics was text degeneration rate: the frequency with which a model produces a repetition loop instead of a transcription.

Across the tested open-source families, vanilla degeneration rates ranged from below 1% to above 33%. Supervised fine-tuning (SFT) reduced those rates for most models - but rarely to production-acceptable levels. The pattern points to a structural limitation: SFT optimizes for correct outputs, but does not explicitly penalize degeneration. There appears to be a ceiling on how much task-focused fine-tuning alone can reduce this failure mode (Text Degeneration Article).

A second training stage - applied after SFT, on the same documents, using the same model - reduced text degeneration in every family tested. No exceptions. Average reduction: 59.4%. Best case: 87.6%.

Figure 1: DPO reduced degeneration relative to SFT in every family tested - average reduction of 59.4%, peak of 87.6% (Nanonets-OCR2-3B: 1.61% to 0.20%). The direction is invariant; only the magnitude varies.

That second stage was Direct Preference Optimization (DPO). Almost all published DPO applications target chat alignment - models trained on human judgments about helpfulness or harmlessness (for example, Rafailov et al., 2023).
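Before moving on, the degeneration-rate metric and the reduction figures quoted above can be made concrete with a minimal Python sketch. The article does not state how a repetition loop is detected, so `is_degenerate` and its thresholds (`max_ngram`, `min_repeats`) are hypothetical stand-ins, and the sample outputs are made up. Only the 1.61% and 0.20% rates come from the text.

```python
def is_degenerate(text, max_ngram=8, min_repeats=5):
    """Hypothetical loop check: does the output end in one n-gram repeated back to back?"""
    tokens = text.split()
    for n in range(1, max_ngram + 1):
        needed = n * min_repeats
        if len(tokens) < needed:
            break  # larger n-grams would need even more tokens
        tail = tokens[-needed:]
        unit = tail[:n]
        if all(tail[i:i + n] == unit for i in range(0, needed, n)):
            return True
    return False


def degeneration_rate(outputs):
    """Percentage of outputs flagged as repetition loops."""
    if not outputs:
        return 0.0
    return 100.0 * sum(is_degenerate(o) for o in outputs) / len(outputs)


def relative_reduction(before_pct, after_pct):
    """Relative drop, in percent, from one degeneration rate to another."""
    return 100.0 * (before_pct - after_pct) / before_pct


# Made-up model outputs, for illustration only.
outputs = [
    "Nome: Maria Silva CPF: 000.000.000-00",
    "Total: 10 10 10 10 10 10 10 10",
    "Data de emissão: 01/02/2024",
    "Endereço: Rua das Flores, 12",
]
print(degeneration_rate(outputs))                # 25.0
print(round(relative_reduction(1.61, 0.20), 1))  # 87.6
```

The second printed value reproduces the best-case figure: the reductions are relative to the SFT rate, not differences in percentage points.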
OCR carries none of that subjectivity: the task is objective, and there is no conversational context. There is, however, a clear preference signal. A correct transcription is chosen; a degeneration loop is rejected. DharmaOCR used that binary to construct a DPO training set, testing the technique not for alignment, but as a direct mitigation tool for a specific failure mode. The training signal came from the model itself - specifically from the outputs it produced when it failed.

How a failure mode becomes a training signal is a structural question about the failure, not the model.

The Loop Survives Fine-Tuning

Why SFT has a ceiling on degeneration is still an open question - but the leading conjecture points to loss granularity. SFT trains token by token: each prediction is evaluated in isolation, and a repetition loop is never penalized as a completion-level failure. DPO inverts that logic. The training signal is the full output - chosen or rejected - which means a degenerated completion can be explicitly labeled as the wrong outcome, not just a sequence of locally probable tokens.

When a training objective maximizes the likelihood of observed sequences, it concentrates probability mass in the regions of distribution space those sequences occupy. A model that enters one of those high-probability attractor regions during inference assigns elevated probability to the same token at the next step - which increases the probability further, which sustains the loop until the sequence hits the maximum token limit. Text degeneration is the output of this geometry: a self-reinforcing repetition loop that an autoregressive model cannot exit without external intervention (Holtzman et al., 2020).

It is not purely a decoding artifact. The attractor involves the training objective, the learned distribution, and how probability mass concentrates during inference - a systems-level failure rather than a failure localized to any single component.
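The contrast between a token-level and a completion-level signal can be sketched with the pairwise DPO objective from Rafailov et al. (2023), which scores whole completions by their summed log-probability under the policy and a frozen reference model. The following is an illustration under stated assumptions, not the DharmaOCR training code: the token log-probabilities are made up and `beta=0.1` is an arbitrary choice.

```python
import math


def sequence_logprob(token_logprobs):
    """Completion-level score: the sum of per-token log-probabilities."""
    return sum(token_logprobs)


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair, given sequence log-probs."""
    chosen_margin = policy_chosen - ref_chosen
    rejected_margin = policy_rejected - ref_rejected
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)), written as a numerically stable softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Made-up numbers: every token of the loop is locally probable, so the
# reference (SFT) model scores the loop higher than the transcription.
ref_chosen = sequence_logprob([-0.9, -1.1, -0.8])    # correct transcription
ref_rejected = sequence_logprob([-0.1, -0.1, -0.1])  # repetition loop

# At the start of DPO the policy equals the reference: loss is log(2).
loss_start = dpo_loss(ref_chosen, ref_rejected, ref_chosen, ref_rejected)

# A policy that raised the transcription and lowered the loop gets a lower loss.
loss_after = dpo_loss(-2.0, -3.0, ref_chosen, ref_rejected)

print(loss_start, loss_after)
```

The loss depends only on how the two completions move relative to the reference, so the loop is penalized as a whole output even though each of its tokens looks likely in isolation.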
The geometry of this failure is visible at the token level.

Figure 2: When a token dominates its own conditional distribution, every sampling step deepens the attractor. The decoder samples from this geometry; it does not determine it.

Inference-layer interventions - repetition penalties, temperature adjustments, early-abort logic - operate on the sampling step. They contain the symptom without touching the distribution that produces it. The attractor persists.

Supervised fine-tuning moves the distribution closer to the task domain. For a structured generation pipeline, this means training on domain-specific documents, in the target language, with the required output format. The model gains fluency with longer sequences, constrained syntax, domain vocabulary.

What SFT does not do is attack degeneration directly. Its objective - maximizing the likelihood of observed sequences - has no term that penalizes repetition loops. The failure mode is simply outside the scope of what the training signal optimizes for.

One model family in the DharmaOCR benchmark showed an unexpected pattern: vanilla degeneration rate of 0.60%, rising to 3.23% after SFT, before a subsequent DPO stage brought it to 1.41%. It is a single data point - an exception, not a rule - and it would be overstating the evidence to treat it as proof of a mechanism. What it does illustrate is that SFT does not reliably reduce degeneration. Capability and degeneration resistance can move independently.

The distinction matters structurally. SFT and DPO are not interchangeable training stages performing the same operation at different intensities. SFT closes the distance between the model's prior distribution and the task domain. What it does not do is target degeneration as an objective - its effect on the failure mode is incidental, and the benchmark results show it is not consistent.
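The point made above about inference-layer interventions can be illustrated with one common form of repetition penalty, which rescales the logits of already generated tokens just before sampling. This sketch assumes that form; the logits, the token history, and the penalty value of 1.3 are made up for illustration.

```python
import math


def softmax(logits):
    """Convert logits to probabilities, subtracting the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def apply_repetition_penalty(logits, generated_ids, penalty=1.3):
    """Return a penalized copy: seen tokens get positive logits divided, negative multiplied."""
    adjusted = list(logits)
    for tok in set(generated_ids):
        if adjusted[tok] > 0:
            adjusted[tok] /= penalty
        else:
            adjusted[tok] *= penalty
    return adjusted


# Made-up next-token logits in which token 0 dominates its own distribution.
model_logits = [6.0, 1.0, 0.5, -1.0]
history = [0, 0, 0]  # the loop so far: token 0 repeated three times

p_raw = softmax(model_logits)
p_pen = softmax(apply_repetition_penalty(model_logits, history))

print(p_raw[0], p_pen[0])
```

In this toy case the penalty lowers the probability of the repeated token, but that token still dominates the distribution, and `model_logits` itself is unchanged: the adjustment lives in the decoding step, not in what the model learned.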
The attractor that produces degeneration is not a problem with the model's proximity to the task - it is a problem with the shape of the distribution space the model now occupies. Addressing that geometry requires a training signal built specifically to point the model away from its own failure modes.

For a structured, non-conversational task with no human preference labels and no conventional "helpful versus harmful" distinction, constructing that signal is a design decision.

The Design Decision: Degenerate Outputs as Rejection Pairs

The DharmaOCR pipeline's contribution to DPO methodology is specific: it used the SFT model's own degenerate outputs as the rejected examples - not as noise to remove, but as the negative training signal the optimization needed.

DPO requires preference pairs: a chosen output and a rejected output for the same input, with a quality difference clear enough for the optimization to learn from. In chat alignment, human annotators produce those judgments - rating responses as more or less helpful, accurate, or safe. Structured generation tasks have no equivalent annotation source. An OCR pipeline either produces a correct transcription or it does not. Quality differences exist, but they are not produced by human preference rankings - they are produced by the task's own criteria for correctness.

The DharmaOCR pipeline identified a preference signal that structured generation tasks already produce: the range of outputs the SFT model generates at inference. A model capable of performing a structured task is also capable of failing at it in characteristic ways. Those failures - outputs that enter the degeneration attractor - are not noise to filter. They are the most informative negative signal available.

The paper implemented this on 23,726 training documents, generating multiple candidate responses per document with the SFT model and scoring each with an automated LLM judge. The pipeline is shown in Figure 3.
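The pairing idea described above can be sketched in a few lines of Python. The text does not give the exact selection rule, so this is one plausible reading and not the paper's implementation: the field names (`prompt`, `candidates`, `score`, `degenerate`), the judge scores, and the choice to skip documents without both a clean and a degenerate candidate are all assumptions made for illustration.

```python
def build_preference_pairs(samples):
    """Pair the best clean candidate (chosen) with a degenerate one (rejected)."""
    pairs = []
    for sample in samples:
        clean = [c for c in sample["candidates"] if not c["degenerate"]]
        loops = [c for c in sample["candidates"] if c["degenerate"]]
        if not clean or not loops:
            continue  # assumption: no pair without both kinds of output
        chosen = max(clean, key=lambda c: c["score"])
        rejected = min(loops, key=lambda c: c["score"])
        pairs.append({
            "prompt": sample["prompt"],
            "chosen": chosen["text"],
            "rejected": rejected["text"],
        })
    return pairs


# Made-up candidates; "score" stands in for an automated judge's rating.
samples = [
    {"prompt": "doc-001", "candidates": [
        {"text": "Nome: Maria Silva", "score": 0.92, "degenerate": False},
        {"text": "Nome: Maria Maria Maria Maria Maria", "score": 0.05, "degenerate": True},
        {"text": "Nome: Maria Silv", "score": 0.71, "degenerate": False},
    ]},
    {"prompt": "doc-002", "candidates": [
        {"text": "Total: R$ 10,00", "score": 0.88, "degenerate": False},
    ]},
]

pairs = build_preference_pairs(samples)
print(pairs)
```

The degenerate candidate is kept and labeled as the rejected side of the pair, which is the opposite of filtering it out as low-quality noise.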
Figure 3: The critical design decision is not in the pipeline's structure - it is in what the pipeline preserved: outputs displaying text degeneration were deliberately labeled as rejected examples, not filtered out as low-quality noise.

The conventional response when degene

Original source: Hugging Face Blog
