From Compression to Accountability: Harmless Copyright Protection for Dataset Distillation
[Submitted on 13 May 2026]
Yan Liang, Ziyuan Yang, Mengyu Sun, Joey Tianyi Zhou, Yi Zhang
Large-scale datasets have been a key driving force behind the rapid progress of deep learning, but their storage, computational, and energy costs have become increasingly prohibitive. Dataset distillation (DD) mitigates this problem by synthesizing compact yet informative datasets, thereby enabling efficient model training and storage. However, the ease of copying and distributing distilled datasets introduces serious risks of copyright infringement and data leakage. Existing protection methods are primarily designed for raw datasets rather than distilled datasets, and typically rely on backdoor-triggered malicious behaviors, which may raise security concerns. In this paper, we observe that deep neural networks tend to memorize subpopulation distributions during training, resulting in a systematic prediction bias, where models perform better on samples aligned with memorized subpopulations. Motivated by this observation, we propose SubPopMark, a harmless subpopulation-driven protection framework for distilled datasets. SubPopMark consists of two stages. First, the Copyright Verification Marker (CVM) optimization stage injects a class-consistent subpopulation bias while preserving the original optimization trajectory. Second, the User-Specific Tracing Marker (USTM) optimization stage further introduces user-distinguishable perturbations into the CVM-augmented data. To enable black-box verification and tracing, we construct a reference behavior bank by collecting model outputs over carefully designed test sets that cover both standard and subpopulation-shifted data distributions. The provenance of a suspicious model is then inferred by comparing its output behavior signature with the bank and identifying the most consistent reference behavior pattern.
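The abstract does not give implementation details, but the verification step it describes (comparing a suspicious model's output behavior signature against a reference behavior bank) can be sketched. Below is a minimal illustration, assuming the signature is the mean class-probability vector over each probe set and that matching uses cosine similarity; every name here (behavior_signature, match_provenance, the probe sets, the toy models) is a hypothetical stand-in, not the paper's actual method.

```python
# Minimal sketch of black-box provenance verification via a reference
# behavior bank. The signature construction and the cosine-similarity
# matching rule are illustrative assumptions; the actual SubPopMark
# procedure is not specified in the abstract.
import numpy as np

def behavior_signature(predict_proba, probe_sets):
    """Concatenate the model's mean class-probability vector over each
    probe set (standard and subpopulation-shifted) into one signature."""
    parts = [predict_proba(x).mean(axis=0) for x in probe_sets]
    return np.concatenate(parts)

def match_provenance(suspect_sig, behavior_bank):
    """Return the bank entry whose reference signature is most
    consistent (highest cosine similarity) with the suspect model's."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = {uid: cosine(suspect_sig, ref) for uid, ref in behavior_bank.items()}
    return max(scores, key=scores.get), scores

# --- Toy usage ---------------------------------------------------------
# Stand-ins for trained classifiers whose outputs carry a user-specific
# bias toward one class (mimicking a USTM-style behavioral fingerprint).
rng = np.random.default_rng(0)

def make_model(class_bias):
    def predict_proba(x):
        logits = rng.normal(size=(len(x), 10)) + class_bias
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)
    return predict_proba

# Two probe sets: one "standard", one "subpopulation-shifted" (contents
# are random placeholders here; only their role matters for the sketch).
probe_sets = [rng.normal(size=(64, 8)), rng.normal(size=(64, 8))]

bank = {
    "user_A": behavior_signature(make_model(2.0 * np.eye(10)[0]), probe_sets),
    "user_B": behavior_signature(make_model(2.0 * np.eye(10)[3]), probe_sets),
}
suspect_sig = behavior_signature(make_model(2.0 * np.eye(10)[3]), probe_sets)
best_match, _ = match_provenance(suspect_sig, bank)
print("Most consistent reference behavior:", best_match)  # expect user_B
```

In the paper's setting, the probe sets would cover both standard and subpopulation-shifted distributions, so that the biases induced by the CVM and USTM stages show up in the signature and distinguish one user's model from another's.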
Subjects: Cryptography and Security (cs.CR)
Cite as: arXiv:2605.12942 [cs.CR]
(or arXiv:2605.12942v1 [cs.CR] for this version)
https://doi.org/10.48550/arXiv.2605.12942
Submission history
From: Yan Liang
[v1] Wed, 13 May 2026 03:23:35 UTC (6,195 KB)