RTD-Guard: A Black-Box Textual Adversarial Detection Framework via Replacement Token Detection
[Submitted on 13 Mar 2026]
He Zhu, Yanshu Li, Wen Liu, Haitian Yang
Textual adversarial attacks pose a serious security threat to Natural Language Processing (NLP) systems by introducing imperceptible perturbations that mislead deep learning models. While adversarial example detection offers a lightweight alternative to robust training, existing methods typically rely on prior knowledge of attacks, white-box access to the victim model, or numerous queries, which severely limits their practical deployment. This paper introduces RTD-Guard, a novel black-box framework for detecting textual adversarial examples. Our key insight is that word-substitution perturbations in adversarial attacks closely resemble the "replaced tokens" that a Replaced Token Detection (RTD) discriminator is pre-trained to identify. Leveraging this, RTD-Guard employs an off-the-shelf RTD discriminator, without fine-tuning, to localize suspicious tokens, masks them, and detects adversarial examples by observing the prediction confidence shift of the victim model before and after intervention. The entire process requires no adversarial data, model tuning, or internal model access, and uses only two black-box queries. Comprehensive experiments on multiple benchmark datasets demonstrate that RTD-Guard effectively detects adversarial texts generated by diverse state-of-the-art attack methods. It surpasses existing detection baselines across multiple metrics, offering a highly efficient, practical, and resource-light defense mechanism that is particularly suited for real-world deployment in resource-constrained or privacy-sensitive environments.
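The detection loop described in the abstract (localize suspicious tokens, mask them, compare the victim's confidence before and after) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `[MASK]` placeholder, the `top_k` masking budget, and the `threshold` decision rule are all hypothetical, and the RTD discriminator and victim model are replaced by toy stand-ins so the example runs without any model weights.

```python
import math
from typing import Callable, List, Sequence

MASK = "[MASK]"  # assumed placeholder; the abstract does not name the mask token


def confidence_shift(
    tokens: Sequence[str],
    replaced_prob: Callable[[Sequence[str]], List[float]],
    victim_probs: Callable[[str], List[float]],
    top_k: int = 1,  # assumed masking budget, not taken from the paper
) -> float:
    """Drop in the victim's confidence for its original label after masking."""
    # Black-box query 1: class probabilities for the unmodified input.
    original = victim_probs(" ".join(tokens))
    label = max(range(len(original)), key=lambda c: original[c])

    # Localize suspicious tokens with the RTD discriminator (no victim access).
    scores = replaced_prob(tokens)
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    suspicious = set(ranked[:top_k])

    # Intervention: mask the tokens that look most like replacements.
    masked = [MASK if i in suspicious else tok for i, tok in enumerate(tokens)]

    # Black-box query 2: class probabilities for the masked input.
    after = victim_probs(" ".join(masked))
    return original[label] - after[label]


def is_adversarial(tokens, replaced_prob, victim_probs, threshold=0.3, top_k=1):
    """Flag the input when masking causes a large confidence drop.

    The threshold value is an illustrative assumption."""
    shift = confidence_shift(tokens, replaced_prob, victim_probs, top_k)
    return shift > threshold


# --- Toy stand-ins (hypothetical; real use would wrap actual models) ---
ODD_TOKENS = {"glorious"}                      # tokens the toy discriminator flags
WEIGHTS = {"boring": -1.0, "glorious": 3.0}    # toy sentiment weights


def toy_replaced_prob(tokens: Sequence[str]) -> List[float]:
    """Pretend per-token probability that the token was replaced."""
    return [0.9 if tok in ODD_TOKENS else 0.1 for tok in tokens]


def toy_victim_probs(text: str) -> List[float]:
    """Pretend binary sentiment classifier returning [p_negative, p_positive]."""
    score = sum(WEIGHTS.get(tok, 0.0) for tok in text.split())
    p_pos = 1.0 / (1.0 + math.exp(-score))
    return [1.0 - p_pos, p_pos]


clean = "the plot was boring and slow".split()
perturbed = "the plot was boring but glorious".split()

print(is_adversarial(clean, toy_replaced_prob, toy_victim_probs))
print(is_adversarial(perturbed, toy_replaced_prob, toy_victim_probs))
```

In the toy setup, masking an ordinary token in the clean sentence leaves the victim's confidence unchanged, while masking the flagged substitution in the perturbed sentence removes the evidence the prediction depended on. Each call issues exactly two victim queries, matching the query budget stated in the abstract.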
Comments: 15 pages, 4 figures
Subjects: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Cite as: arXiv:2603.12582 [cs.CL]
(or arXiv:2603.12582v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2603.12582
Submission history
From: He Zhu
[v1] Fri, 13 Mar 2026 02:30:56 UTC (139 KB)