
DuCodeMark: Dual-Purpose Code Dataset Watermarking via Style-Aware Watermark-Poison Design

Source: arXiv Security. Archived Apr 14, 2026.


Computer Science > Cryptography and Security
[Submitted on 12 Apr 2026]

Authors: Yuchen Chen, Yuan Xiao, Chunrong Fang, Zhenyu Chen, Baowen Xu

Abstract: The proliferation of large language models for code (CodeLMs) and open-source contributions has heightened concerns over unauthorized use of source code datasets. While watermarking provides a viable protection mechanism by embedding ownership signals, existing methods rely on detectable trigger-target patterns and are limited to source-code tasks, overlooking other scenarios such as decompilation tasks. In this paper, we propose DuCodeMark, a stealthy and robust dual-purpose watermarking method for code datasets that generalizes across both source-code tasks and decompilation tasks. DuCodeMark parses each code sample into an abstract syntax tree (AST), applies language-specific style transformations to construct stealthy trigger-target pairs, and injects repressible poisoned features into a subset of return-typed samples to enhance robustness against watermark removal or evasion. These features remain inactive during normal training but are activated upon watermark removal, degrading model performance. For verification, DuCodeMark employs a black-box method based on the independent-samples t-test. We conduct a comprehensive evaluation of DuCodeMark across 72 settings spanning two code tasks, two programming languages, three CodeLMs, and six decoding temperatures. The results demonstrate that it consistently achieves strong verifiability (p < 0.05), high stealthiness (suspicious rate ≤ 0.36), robustness against both watermark and poisoning attacks (recall ≤ 0.57), and a substantial drop in model performance upon watermark removal (Pass@1 drops by 28.6%), underscoring its practicality and resilience.
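The abstract names the independent-samples t-test as the verification step but does not say which quantities are compared. The sketch below is one possible reading, not the authors' procedure: it assumes the verifier scores a suspect model's outputs on triggered and on clean prompts (the score lists and the names independent_t_test, triggered_scores and clean_scores are hypothetical) and tests whether the two groups differ at the 0.05 level. It uses the pooled-variance (Student) form of the test and only the Python standard library.

```python
import math
from statistics import mean, variance


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df, steps=4000):
    """Two-sided p-value, integrating the density over [0, |t|] with Simpson's rule."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3  # P(0 <= T <= |t|)
    return max(0.0, 1.0 - 2.0 * central)


def independent_t_test(a, b):
    """Pooled-variance independent-samples t-test; returns (t, p)."""
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    pooled = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / n1 + 1 / n2))
    return t, two_sided_p(t, df)


# Hypothetical scores: fraction of generations showing the target style
# for prompts with the trigger style versus prompts without it.
triggered_scores = [0.82, 0.79, 0.88, 0.91, 0.76, 0.85]
clean_scores = [0.12, 0.20, 0.09, 0.15, 0.18, 0.11]

t_stat, p_value = independent_t_test(triggered_scores, clean_scores)
watermark_detected = p_value < 0.05  # threshold taken from the abstract
print(f"t = {t_stat:.2f}, p = {p_value:.3g}, detected = {watermark_detected}")
```

Because the test needs only model outputs, it matches the black-box setting the abstract describes; how prompts are built and how outputs are scored would have to follow the paper itself.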
Comments: Accepted to the 34th ACM International Conference on the Foundations of Software Engineering (FSE 2026)
Subjects: Cryptography and Security (cs.CR)
Cite as: arXiv:2604.10611 [cs.CR] (or arXiv:2604.10611v1 [cs.CR] for this version)
DOI: https://doi.org/10.48550/arXiv.2604.10611
Submission history: From Yuchen Chen. [v1] Sun, 12 Apr 2026 12:32:38 UTC (657 KB)