← Back ◬ AI & Machine Learning May 27, 2026

BAIT: Boundary-Guided Disclosure Escalation via Self-Conditioned Reasoning

arXiv Security Archived May 27, 2026 ✓ Full text saved

arXiv:2605.27110v1 Announce Type: new Abstract: In this work, we propose BAIT (Boundary-Aware Iterative Trap), a three-step jailbreak framework that approaches malicious goals through internal disclosure. BAIT first asks the model to identify the protection boundary, then requires it to refine that boundary, and finally requests a detailed example. By expanding each step upon the model's previous responses, BAIT turns the model's own reasoning and consistency tendency into a disclosure pathway.

Full text archived locally

✦ AI Summary · Claude Sonnet

Computer Science > Cryptography and Security [Submitted on 26 May 2026] BAIT: Boundary-Guided Disclosure Escalation via Self-Conditioned Reasoning Xuan Luo, Yue Wang, Geng Tu, Jing Li, Ruifeng Xu In this work, we propose BAIT (Boundary-Aware Iterative Trap), a three-step jailbreak framework that approaches malicious goals through internal disclosure. BAIT first asks the model to identify the protection boundary, then requires it to refine that boundary, and finally requests a detailed example. By expanding each step upon the model's previous responses, BAIT turns the model's own reasoning and consistency tendency into a disclosure pathway. Experiments on AdvBench, JailbreakBench, AIR-Bench, and SORRY-Bench demonstrate that BAIT consistently achieves strong attack success rates across top-tier large language models, significantly advancing conventional jailbreak baselines. Further analysis reveals that: 1) prevention-oriented framing significantly outperforms direct knowledge request; 2) the refinement step plays a critical role in disclosure escalation; and 3) the first two steps have a certain chance of eliciting harmful content while triggering little filtering. Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL) Cite as: arXiv:2605.27110 [cs.CR] (or arXiv:2605.27110v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2605.27110 Focus to learn more Submission history From: Xuan Luo [view email] [v1] Tue, 26 May 2026 14:51:13 UTC (7,891 KB) Access Paper: HTML (experimental) view license Current browse context: cs.CR < prev | next > new | recent | 2026-05 Change to browse by: cs cs.CL References & Citations NASA ADS Google Scholar Semantic Scholar Export BibTeX Citation Bookmark Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Demos Related Papers About arXivLabs Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?)

💬 Team Notes