
Cross-Modal Knowledge Distillation without Paired Data: Theoretical Foundation and Algorithm

arXiv AI · Archived Jun 10, 2026

[Submitted on 9 Jun 2026]

Trong Khiem Tran, Anh Duc Chu, Quang Hung Pham, Phi Le Nguyen, Trong Nghia Hoang

Abstract: Cross-modal knowledge distillation (CMKD) studies how a (large) teacher model trained on one type of data (e.g., images) can guide a (smaller) student model building on another type of data (e.g., text/audio). Existing CMKD methods often require paired multi-modal data with aligned semantics, but obtaining such paired data is often costly and impractical. To mitigate this limitation, we develop a new CMKD framework for the more challenging setting where paired data are unavailable. In particular, we establish a cross-modal distributional relationship between teacher and student models, which reveals two fundamental quantities governing effective distillation: feature alignment and label alignment. These quantities characterize semantic discrepancy between modalities at the levels of representation and prediction distributions, respectively. Motivated by this insight, we propose a principled framework, with theoretical guarantees, that enables effective cross-modal knowledge distillation by aligning distributions rather than individual samples. Extensive experiments across a wide range of multimodal benchmarks show that our framework is highly effective in both unpaired and paired data settings, improving significantly over prior work.
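To make "aligning distributions rather than individual samples" concrete, here is a minimal sketch in Python. It is not the paper's algorithm, which the abstract does not spell out. As a stand-in for feature alignment it uses a maximum mean discrepancy (MMD) estimate between two unpaired batches of features, and as a stand-in for label alignment it uses a KL divergence between batch-averaged predictions. The function names, the RBF kernel, and the bandwidth value are all assumptions made for illustration.

```python
import numpy as np


def rbf_kernel(x, y, bandwidth):
    """RBF kernel matrix between the rows of x and the rows of y."""
    # Pairwise squared Euclidean distances, clipped at 0 against rounding error.
    sq_dists = (
        np.sum(x**2, axis=1)[:, None]
        + np.sum(y**2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def mmd2(x, y, bandwidth=4.0):
    """Biased estimate of squared MMD between two sample sets.

    x and y may have different numbers of rows: no sample in x is
    matched to any particular sample in y (illustrative stand-in for
    a feature-alignment term, not the paper's objective).
    """
    k_xx = rbf_kernel(x, x, bandwidth)
    k_yy = rbf_kernel(y, y, bandwidth)
    k_xy = rbf_kernel(x, y, bandwidth)
    return float(k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean())


def label_gap(p_teacher, p_student, eps=1e-12):
    """KL divergence between batch-averaged class distributions.

    Each input is an (n_samples, n_classes) array of probabilities
    (illustrative stand-in for a label-alignment term).
    """
    p = p_teacher.mean(axis=0)
    q = p_student.mean(axis=0)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))


# Usage with synthetic, unpaired batches of different sizes.
rng = np.random.default_rng(0)
teacher_feats = rng.normal(size=(64, 8))
student_close = rng.normal(size=(48, 8))
student_far = rng.normal(loc=2.0, size=(48, 8))

print("MMD^2, similar distributions:", mmd2(teacher_feats, student_close))
print("MMD^2, shifted distribution: ", mmd2(teacher_feats, student_far))
```

Both quantities depend only on batch-level statistics, so the two batches need no sample-to-sample correspondence and can differ in size. In an actual training loop such terms would be computed on learned features in a differentiable framework and minimized with respect to the student's parameters.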
Subjects: Artificial Intelligence (cs.AI)
Cite as: arXiv:2606.10504 [cs.AI] (or arXiv:2606.10504v1 [cs.AI] for this version)
DOI: https://doi.org/10.48550/arXiv.2606.10504
Submission history: From Khiem Tran Trong. [v1] Tue, 9 Jun 2026 07:29:12 UTC (1,693 KB)
