PRISM: Robust VLM Alignment with Principled Reasoning for Integrated Safety in Multimodality
arXiv SecurityArchived Apr 03, 2026✓ Full text saved
arXiv:2508.18649v2 Announce Type: replace Abstract: Safeguarding vision-language models (VLMs) is a critical challenge, as existing methods often suffer from over-defense, which harms utility, or rely on shallow alignment, failing to detect complex threats that require deep reasoning. To this end, we introduc PRISM (Principled Reasoning for Integrated Safety in Multimodality), a System 2-like framework that aligns VLMs through a structured four-stage reasoning process explicitly designed to hand
Full text archived locally
✦ AI Summary· Claude Sonnet
Computer Science > Cryptography and Security
[Submitted on 26 Aug 2025 (v1), last revised 2 Apr 2026 (this version, v2)]
PRISM: Robust VLM Alignment with Principled Reasoning for Integrated Safety in Multimodality
Nanxi Li, Zhengyue Zhao, G. Edward Suh, Marco Pavone, Chaowei Xiao
Safeguarding vision-language models (VLMs) is a critical challenge, as existing methods often suffer from over-defense, which harms utility, or rely on shallow alignment, failing to detect complex threats that require deep reasoning. To this end, we introduc PRISM (Principled Reasoning for Integrated Safety in Multimodality), a System 2-like framework that aligns VLMs through a structured four-stage reasoning process explicitly designed to handle three distinct categories of multimodal safety violations. Our framework consists of two key components: a structured reasoning pipeline that analyzes each violation category in dedicated stages, and PRISM-DPO, generated via Monte Carlo Tree Search (MCTS) to refine reasoning quality through Direct Preference Optimization. Comprehensive evaluations show that PRISM substantially reduces attack success rates on JailbreakV-28K and VLBreak, improves robustness against adaptive attacks, and generalizes to out-of-distribution multi-image threats, while better preserving model utility on benign multimodal benchmarks. Our code, data, and model weights available at this https URL.
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Cite as: arXiv:2508.18649 [cs.CR]
(or arXiv:2508.18649v2 [cs.CR] for this version)
https://doi.org/10.48550/arXiv.2508.18649
Focus to learn more
Submission history
From: Nanxi Li [view email]
[v1] Tue, 26 Aug 2025 03:45:19 UTC (3,808 KB)
[v2] Thu, 2 Apr 2026 03:30:48 UTC (3,794 KB)
Access Paper:
HTML (experimental)
view license
Current browse context:
cs.CR
< prev | next >
new | recent | 2025-08
Change to browse by:
cs
cs.AI
References & Citations
NASA ADS
Google Scholar
Semantic Scholar
Export BibTeX Citation
Bookmark
Bibliographic Tools
Bibliographic and Citation Tools
Bibliographic Explorer Toggle
Bibliographic Explorer (What is the Explorer?)
Connected Papers Toggle
Connected Papers (What is Connected Papers?)
Litmaps Toggle
Litmaps (What is Litmaps?)
scite.ai Toggle
scite Smart Citations (What are Smart Citations?)
Code, Data, Media
Demos
Related Papers
About arXivLabs
Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?)