← Back ◬ AI & Machine Learning Jun 16, 2026

Semantic Integrity Failures in Document-to-LLM Supply Chains

arXiv Security Archived Jun 16, 2026 ✓ Full text saved

arXiv:2606.15020v1 Announce Type: new Abstract: Document-to-LLM applications typically read uploaded PDFs by first translating them into text through a hidden extraction layer that users cannot observe or audit. We show that this layer enables split-view PDFs: one document can have two semantic views before model reasoning. By mining specification-permitted or implementation-tolerated representation gaps at the PDF render/extract boundary, we instantiate 25 extraction gaps (EG) in which extracto

Full text archived locally

✦ AI Summary · Claude Sonnet

Computer Science > Cryptography and Security [Submitted on 12 Jun 2026] Semantic Integrity Failures in Document-to-LLM Supply Chains Side Liu, Jiang Ming Document-to-LLM applications typically read uploaded PDFs by first translating them into text through a hidden extraction layer that users cannot observe or audit. We show that this layer enables split-view PDFs: one document can have two semantic views before model reasoning. By mining specification-permitted or implementation-tolerated representation gaps at the PDF render/extract boundary, we instantiate 25 extraction gaps (EG) in which extractors return attacker-controlled or extractor-dependent text while the rendered page shows benign or different content. The gaps form four families: semantic overrides, hidden semantic injection, reading-order splits, and font-decoding splits, and 14 gaps have no exact path/mechanism-level match in prior PDF-to-LLM attacks. We evaluate these gaps on 16 PDF processing stacks and 7 commercial LLM services. Each gap causes render-extract divergence on at least one stack. Under a gap-level exposure criterion, every evaluated service exposes at least one gap, with 12/25 to 21/25 exposed gaps. Exposure is driven mainly by the ingestion stack -- not model identity alone. We further show that tested safety filters cover only selected hidden-text constructions. To support triage, we develop a static screening scanner whose rules trigger on all 25 benchmark gaps, and discuss dual-view consistency as a longer-term defense direction. Subjects: Cryptography and Security (cs.CR) Cite as: arXiv:2606.15020 [cs.CR] (or arXiv:2606.15020v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2606.15020 Focus to learn more Submission history From: Side Liu [view email] [v1] Fri, 12 Jun 2026 23:30:56 UTC (292 KB) Access Paper: HTML (experimental) view license Current browse context: cs.CR < prev | next > new | recent | 2026-06 Change to browse by: cs References & Citations NASA ADS Google Scholar Semantic Scholar Export BibTeX Citation Bookmark Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Demos Related Papers About arXivLabs Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?)

💬 Team Notes