
The Serialized Bridge: Understanding and Recovering LLM Serving Performance under Blackwell GPU Confidential Computing

arXiv Security Archived Jun 24, 2026


[Submitted on 22 Jun 2026]

Authors: Hang Yin, Kevin Wang

Abstract: GPU Confidential Computing (GPU-CC) now preserves GPU-local performance: on NVIDIA B300, BF16 matmul runs at 0.998x of non-confidential performance. Yet LLM serving under Intel TDX plus GPU-CC still loses 13-27% of throughput, and KV-cache restore latency can more than double. This paper studies that gap on two Blackwell platforms, RTX Pro 6000 and B300 HGX, and identifies its dominant cause: the confidential VM-GPU bridge, not GPU compute. We find that GPU-CC turns host/device movement into a serialized, high-setup-cost channel. Secure copies do not gain CUDA-stream concurrency within a context, asynchronous transfers block at the runtime boundary, and small crossings pay a fixed toll. This violates the assumptions of modern inference runtimes, where DMA is expected to be cheap, concurrent, and asynchronous. In vLLM dense decode, the gap closes around 44x-slower small alloc-and-copy operations; targeted patches reject alternative explanations. A scheduling flag recovers 57% of the gap, while a worker-thread drain recovers up to 92% in qualified high-concurrency runs. The same bridge model explains a +131% KV-restore penalty and a 34x model-load slowdown. Blackwell also changes the confidential tenancy unit. We qualify confidential multi-GPU NVSwitch tenants on B300, including 510 GB/s NVLink P2P inside a CVM and concurrent isolated tenants, and identify the remaining fabric-attestation gap for production confidential AI platforms.
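The "fixed toll" on small crossings can be illustrated with a simple latency-plus-bandwidth cost model. The sketch below is not taken from the paper: the function names and every numeric parameter (setup costs, bandwidth, copy sizes) are hypothetical placeholders chosen only to show the shape of the effect. It assumes each host/device copy costs a fixed setup time plus a size-proportional streaming time, and that copies on a serialized channel do not overlap.

```python
# Illustrative cost model only. All names and numbers are hypothetical
# placeholders, not measurements or code from the paper.

def transfer_time_us(nbytes: int, setup_us: float, bandwidth_gb_s: float) -> float:
    """Time for one host/device crossing: fixed setup cost plus streaming time."""
    bytes_per_us = bandwidth_gb_s * 1e3  # 1 GB/s = 1e9 B/s = 1e3 B/us
    return setup_us + nbytes / bytes_per_us


def serialized_total_us(sizes, setup_us: float, bandwidth_gb_s: float) -> float:
    """Total time when copies cannot overlap: the sum of the per-copy times."""
    return sum(transfer_time_us(n, setup_us, bandwidth_gb_s) for n in sizes)


def slowdown(nbytes: int, cheap_setup_us: float, costly_setup_us: float,
             bandwidth_gb_s: float) -> float:
    """Ratio of copy time with a costly setup to copy time with a cheap setup."""
    return (transfer_time_us(nbytes, costly_setup_us, bandwidth_gb_s)
            / transfer_time_us(nbytes, cheap_setup_us, bandwidth_gb_s))


if __name__ == "__main__":
    BW = 10.0                  # assumed bandwidth in GB/s (placeholder)
    CHEAP, COSTLY = 1.0, 20.0  # assumed setup costs in microseconds (placeholders)

    # Small copies are dominated by the setup cost; large copies by bandwidth.
    print("64 B slowdown:  ", slowdown(64, CHEAP, COSTLY, BW))
    print("1 GiB slowdown: ", slowdown(1 << 30, CHEAP, COSTLY, BW))

    # Coalescing many small copies into one pays the setup cost once.
    many_small = serialized_total_us([64] * 1000, COSTLY, BW)
    one_large = transfer_time_us(64 * 1000, COSTLY, BW)
    print("1000 x 64 B separately (us):", many_small)
    print("1 x 64000 B coalesced (us): ", one_large)
```

Under these assumptions, a large copy sees almost no slowdown when the setup cost rises, while a small copy slows down by nearly the ratio of the two setup costs. That is consistent with the abstract's contrast between near-parity for GPU-local work and large penalties for small transfers, although the paper's actual measurements and mitigations may follow a different model.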
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Cryptography and Security (cs.CR); Performance (cs.PF)
Cite as: arXiv:2606.23969 [cs.DC] (or arXiv:2606.23969v1 [cs.DC] for this version)
DOI: https://doi.org/10.48550/arXiv.2606.23969
Submission history: From: Hang Yin. [v1] Mon, 22 Jun 2026 21:48:53 UTC (63 KB)
