← Back ◬ AI & Machine Learning May 13, 2026

Few-Shot Truly Benign DPO Attack for Jailbreaking LLMs

arXiv Security Archived May 13, 2026 ✓ Full text saved

arXiv:2605.10998v1 Announce Type: new Abstract: Fine-tuning APIs make frontier LLMs easy to customize, but they can also weaken safety alignment during fine-tuning. While prior work shows that benign supervised fine-tuning (SFT) can reduce refusal behavior, deployed fine-tuning pipelines increasingly support preference-based objectives, whose safety risks remain less understood. We show that Direct Preference Optimization (DPO) introduces a stronger and harder-to-audit failure mode. We propose a

Full text archived locally

✦ AI Summary · Claude Sonnet

Computer Science > Cryptography and Security [Submitted on 9 May 2026] Few-Shot Truly Benign DPO Attack for Jailbreaking LLMs Sangyeon Yoon, Wonje Jeung, Yoonjun Cho, Dongjae Jeon, Albert No Fine-tuning APIs make frontier LLMs easy to customize, but they can also weaken safety alignment during fine-tuning. While prior work shows that benign supervised fine-tuning (SFT) can reduce refusal behavior, deployed fine-tuning pipelines increasingly support preference-based objectives, whose safety risks remain less understood. We show that Direct Preference Optimization (DPO) introduces a stronger and harder-to-audit failure mode. We propose a truly benign DPO attack using only 10 harmless preference pairs, the minimum data scale accepted by OpenAI's fine-tuning service. Each pair contains a benign prompt, a normal helpful answer as the preferred response, and a refusal as the dispreferred response. Unlike prior benign fine-tuning attacks, our data exhibits no suspicious behavior: it is practically indistinguishable from the fine-tuning request of a legitimate user seeking to reduce over-refusal, making harmful intent almost impossible to infer from the request alone. Nevertheless, because DPO directly optimizes the model to prefer helpful answers over refusals, this seemingly benign objective broadly suppresses refusal behavior and transfers to harmful prompts outside the fine-tuning data. Across OpenAI models supporting DPO fine-tuning, our attack achieves attack success rates of 59.13% on GPT-4o, 70.20% on GPT-4.1, 54.80% on GPT-4.1-mini, and 81.73% on GPT-4.1-nano, at costs of only $1.7, $1.7, $0.3, and $0.1. Moreover, on open-weight models that do not impose minimum data requirements, we find that this effect can emerge from even a single benign preference pair. Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI) Cite as: arXiv:2605.10998 [cs.CR] (or arXiv:2605.10998v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2605.10998 Focus to learn more Submission history From: Sangyeon Yoon [view email] [v1] Sat, 9 May 2026 15:52:29 UTC (2,453 KB) Access Paper: HTML (experimental) view license Current browse context: cs.CR < prev | next > new | recent | 2026-05 Change to browse by: cs cs.AI References & Citations NASA ADS Google Scholar Semantic Scholar Export BibTeX Citation Bookmark Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Demos Related Papers About arXivLabs Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?)

💬 Team Notes