OTRO: Oblivious Tokenization Path with Square-Root ORAM
arXiv SecurityArchived Jun 17, 2026✓ Full text saved
arXiv:2606.17358v1 Announce Type: new Abstract: The CPU-side large language model (LLM) tokenizer is a critical security gap in LLM serving through a confidential computing stack with CPU and GPU trusted execution environments (TEEs). Tokenizers converts the prompts through table-driven lookups, and the resulting memory access patterns are a powerful source of side-channel leakage. Recent work demonstrates end-to-end recovery of user prompts from tokenizer access pattern on production Intel TDX.
Full text archived locally
✦ AI Summary· Claude Sonnet
Computer Science > Cryptography and Security
[Submitted on 15 Jun 2026]
OTRO: Oblivious Tokenization Path with Square-Root ORAM
Jonghyun Lee, Yongqin Wang, Rachit Rajat, Daniel Wong, Mengyuan Li, Murali Annavaram
The CPU-side large language model (LLM) tokenizer is a critical security gap in LLM serving through a confidential computing stack with CPU and GPU trusted execution environments (TEEs). Tokenizers converts the prompts through table-driven lookups, and the resulting memory access patterns are a powerful source of side-channel leakage. Recent work demonstrates end-to-end recovery of user prompts from tokenizer access pattern on production Intel TDX. However, a drop-in use of the popular tree-based Oblivious RAMs (e.g., PathORAM) to prevent access-pattern leakage introduces \sim13\times tokenizer slowdown, resulting in 10-58% higher time-to-first-token (TTFT).
In this paper, we present OTRO, an efficient, oblivious tokenization path tailored to latency-critical LLM serving. OTRO relies on square-root ORAM for fast single-access lookups, but avoids its prohibitive O(N\log^2N) rebuild cost every \sqrt{N} accesses through three key innovations. First, OTRO provides a pool of replicated square-root ORAM instances that utilize the read-only nature of tokenizer table. Second, an epoch-based rotation policy decouples accesses from rebuilds and pads each epoch with dummy accesses to its boundaries, minimizing observable information. Lastly, chunked KV-cache-aware tokenization further overlaps rebuilds with GPU prefill and minimizes the instance count. Implemented as modules in HuggingFace Tokenizers and nano-vLLM, running within a TDX-enabled CVM with an NVIDIA H100 GPU, OTRO limits TTFT overhead to at most 4.5%, keeps tokenizer-induced latency under 10\% of total TTFT, and adds less than 0.5 GB of memory overhead while reducing the tokenizer's observable leakage across various model families and sizes.
Subjects: Cryptography and Security (cs.CR)
Cite as: arXiv:2606.17358 [cs.CR]
(or arXiv:2606.17358v1 [cs.CR] for this version)
https://doi.org/10.48550/arXiv.2606.17358
Focus to learn more
Submission history
From: Jonghyun Lee [view email]
[v1] Mon, 15 Jun 2026 23:21:32 UTC (1,166 KB)
Access Paper:
HTML (experimental)
view license
Current browse context:
cs.CR
< prev | next >
new | recent | 2026-06
Change to browse by:
cs
References & Citations
NASA ADS
Google Scholar
Semantic Scholar
Export BibTeX Citation
Bookmark
Bibliographic Tools
Bibliographic and Citation Tools
Bibliographic Explorer Toggle
Bibliographic Explorer (What is the Explorer?)
Connected Papers Toggle
Connected Papers (What is Connected Papers?)
Litmaps Toggle
Litmaps (What is Litmaps?)
scite.ai Toggle
scite Smart Citations (What are Smart Citations?)
Code, Data, Media
Demos
Related Papers
About arXivLabs
Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?)