Why Training-Free Token Reduction Collapses: The Inherent Instability of Pairwise Scoring Signals
arXiv AIArchived Apr 21, 2026✓ Full text saved
arXiv:2604.16745v1 Announce Type: new Abstract: Training-free token reduction methods for Vision Transformers (ToMe, ToFu, PiToMe, and MCTF) employ different scoring mechanisms, yet they share a closely matched cliff-like collapse at high compression. This paper explains \emph{why}. We develop a diagnostic framework with two tools, ranking consistency $\rho_s$ and off-diagonal correlation $\rho_\text{off}$, that decomposes the collapse into (1)a signal-agnostic error amplifier inherent to layer-
Full text archived locally
✦ AI Summary· Claude Sonnet
Computer Science > Artificial Intelligence
[Submitted on 17 Apr 2026]
Why Training-Free Token Reduction Collapses: The Inherent Instability of Pairwise Scoring Signals
Yang Shanglin
Training-free token reduction methods for Vision Transformers (ToMe, ToFu, PiToMe, and MCTF) employ different scoring mechanisms, yet they share a closely matched cliff-like collapse at high compression. This paper explains \emph{why}. We develop a diagnostic framework with two tools, ranking consistency \rho_s and off-diagonal correlation \rho_\text{off}, that decomposes the collapse into (1)a signal-agnostic error amplifier inherent to layer-wise reduction, predicting convex Pareto curves and r_{\text{crit}} \propto 1/L; and (2)shared reliance on \emph{pairwise} similarity signals whose ranking consistency degrades from \rho_s{=}0.88 to 0.27 in deep layers. Pairwise rankings are inherently unstable (O(N_p^2) joint perturbations) while unary signals enjoy greater stability (O(N_p) perturbations, CLT). From three design principles derived from this diagnosis, we construct CATIS as a constructive validation: unary signals raise the trigger threshold, triage suppresses the gain. On ViT-Large at 63% FLOPs reduction, CATIS retains 96.9% of vanilla accuracy (81.0%) on ImageNet-1K where all baselines collapse to 43--65%.
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2604.16745 [cs.AI]
(or arXiv:2604.16745v1 [cs.AI] for this version)
https://doi.org/10.48550/arXiv.2604.16745
Focus to learn more
Submission history
From: Shanglin Yang [view email]
[v1] Fri, 17 Apr 2026 23:26:27 UTC (10,369 KB)
Access Paper:
HTML (experimental)
view license
Current browse context:
cs.AI
< prev | next >
new | recent | 2026-04
Change to browse by:
cs
cs.CV
References & Citations
NASA ADS
Google Scholar
Semantic Scholar
Export BibTeX Citation
Bookmark
Bibliographic Tools
Bibliographic and Citation Tools
Bibliographic Explorer Toggle
Bibliographic Explorer (What is the Explorer?)
Connected Papers Toggle
Connected Papers (What is Connected Papers?)
Litmaps Toggle
Litmaps (What is Litmaps?)
scite.ai Toggle
scite Smart Citations (What are Smart Citations?)
Code, Data, Media
Demos
Related Papers
About arXivLabs
Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?)