Optimus: A Robust Defense Framework for Mitigating Toxicity while Fine-Tuning Conversational AI

Source: arXiv Security. Archived Apr 03, 2026.

Authors: Aravind Cheruvu, Shravya Kanchi, Sifat Muhammad Abdullah, Nicholas Kong, Daphne Yao, Murtuza Jadliwala, Bimal Viswanath

Submitted on 8 Jul 2025 (v1); last revised 2 Apr 2026 (this version, v2)

Abstract: Customizing Large Language Models (LLMs) on untrusted datasets poses severe risks of injecting toxic behaviors. In this work, we introduce Optimus, a novel defense framework designed to mitigate fine-tuning harms while preserving conversational utility. Unlike existing defenses that rely heavily on precise toxicity detection or restrictive filtering, Optimus addresses the critical challenge of ensuring robust mitigation even when toxicity classifiers are imperfect or biased. Optimus integrates a training-free toxicity classification scheme that repurposes the safety alignment of commodity LLMs, and employs a dual-strategy alignment process combining synthetic "healing data" with Direct Preference Optimization (DPO) to efficiently steer models toward safety. Extensive evaluations demonstrate that Optimus mitigates toxicity even when relying on extremely biased classifiers (with up to 85% degradation in Recall). Optimus outperforms the state-of-the-art defense StarDSS and exhibits strong resilience against adaptive adversarial and jailbreak attacks. Our source code and datasets are available at this https URL

Comments: Accepted at ACM CODASPY 2026

Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Cite as: arXiv:2507.05660 [cs.CR] (or arXiv:2507.05660v2 [cs.CR] for this version)

DOI: https://doi.org/10.48550/arXiv.2507.05660

Submission history (from: Aravind Cheruvu)
[v1] Tue, 8 Jul 2025 04:40:09 UTC (1,269 KB)
[v2] Thu, 2 Apr 2026 16:59:25 UTC (704 KB)
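The abstract names Direct Preference Optimization (DPO) as one half of the alignment process but does not spell out the objective. As a rough illustration only, the sketch below computes the standard DPO loss for a single preference pair in plain Python. Treating a safe "healing" response as the preferred answer and a toxic response as the rejected one is an assumption made for this example, as are the function name dpo_loss, the beta value, and the log-probability numbers; none of them are taken from the Optimus paper or its released code.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # hypothetical value, not from the paper
) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are sequence log-probabilities under the model being tuned
    (policy) and under a frozen reference model.
    """
    # Implicit rewards: how far the policy has moved from the reference.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return softplus(-margin)


if __name__ == "__main__":
    # Assumed pairing for illustration: chosen = safe "healing" reply,
    # rejected = toxic reply. The numbers are made up.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

When the policy matches the reference model the loss equals log 2. It falls as the policy assigns relatively more probability to the preferred response than the reference does, and rises when the policy drifts toward the rejected one.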