From Compression to Accountability: Harmless Copyright Protection for Dataset Distillation
[Submitted on 13 May 2026]
Yan Liang, Ziyuan Yang, Mengyu Sun, Joey Tianyi Zhou, Yi Zhang
Large-scale datasets have been a key driving force behind the rapid progress of deep learning, but their storage, computational, and energy costs have become increasingly prohibitive. Dataset distillation (DD) mitigates this problem by synthesizing compact yet informative datasets, thereby enabling efficient model training and storage. However, the ease of copying and distributing distilled datasets introduces serious risks of copyright infringement and data leakage. Existing protection methods are primarily designed for raw datasets rather than distilled datasets, and typically rely on backdoor-triggered malicious behaviors, which may raise security concerns. In this paper, we observe that deep neural networks tend to memorize subpopulation distributions during training, resulting in a systematic prediction bias, where models perform better on samples aligned with memorized subpopulations. Motivated by this observation, we propose SubPopMark, a harmless subpopulation-driven protection framework for distilled datasets. SubPopMark consists of two stages. First, the Copyright Verification Marker (CVM) optimization stage injects a class-consistent subpopulation bias while preserving the original optimization trajectory. Second, the User-Specific Tracing Marker (USTM) optimization stage further introduces user-distinguishable perturbations into the CVM-augmented data. To enable black-box verification and tracing, we construct a reference behavior bank by collecting model outputs over carefully designed test sets that cover both standard and subpopulation-shifted data distributions. The provenance of a suspicious model is then inferred by comparing its output behavior signature with the bank and identifying the most consistent reference behavior pattern.
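The abstract does not give implementation details, but the verification step it describes (comparing a suspicious model's output behavior signature against a reference behavior bank) can be sketched. Below is a minimal illustration, assuming the signature is the mean class-probability vector over each probe set and that matching uses cosine similarity; every name here (behavior_signature, match_provenance, the probe sets, the toy models) is a hypothetical stand-in, not the paper's actual method.

```python
# Minimal sketch of black-box provenance verification via a reference
# behavior bank. The signature construction and the cosine-similarity
# matching rule are illustrative assumptions; the actual SubPopMark
# procedure is not specified in the abstract.
import numpy as np

def behavior_signature(predict_proba, probe_sets):
    """Concatenate the model's mean class-probability vector over each
    probe set (standard and subpopulation-shifted) into one signature."""
    parts = [predict_proba(x).mean(axis=0) for x in probe_sets]
    return np.concatenate(parts)

def match_provenance(suspect_sig, behavior_bank):
    """Return the bank entry whose reference signature is most
    consistent (highest cosine similarity) with the suspect model's."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = {uid: cosine(suspect_sig, ref) for uid, ref in behavior_bank.items()}
    return max(scores, key=scores.get), scores

# --- Toy usage ---------------------------------------------------------
# Stand-ins for trained classifiers whose outputs carry a user-specific
# bias toward one class (mimicking a USTM-style behavioral fingerprint).
rng = np.random.default_rng(0)

def make_model(class_bias):
    def predict_proba(x):
        logits = rng.normal(size=(len(x), 10)) + class_bias
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)
    return predict_proba

# Two probe sets: one "standard", one "subpopulation-shifted" (contents
# are random placeholders here; only their role matters for the sketch).
probe_sets = [rng.normal(size=(64, 8)), rng.normal(size=(64, 8))]

bank = {
    "user_A": behavior_signature(make_model(2.0 * np.eye(10)[0]), probe_sets),
    "user_B": behavior_signature(make_model(2.0 * np.eye(10)[3]), probe_sets),
}
suspect_sig = behavior_signature(make_model(2.0 * np.eye(10)[3]), probe_sets)
best_match, _ = match_provenance(suspect_sig, bank)
print("Most consistent reference behavior:", best_match)  # expect user_B
```

In the paper's setting, the probe sets would cover both standard and subpopulation-shifted distributions, so that the biases induced by the CVM and USTM stages show up in the signature and distinguish one user's model from another's.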
Subjects: Cryptography and Security (cs.CR)
Cite as: arXiv:2605.12942 [cs.CR]
(or arXiv:2605.12942v1 [cs.CR] for this version)
https://doi.org/10.48550/arXiv.2605.12942
Submission history
From: Yan Liang
[v1] Wed, 13 May 2026 03:23:35 UTC (6,195 KB)