
Low-Cost Multi-Precision Systolic Arrays for Accelerating FHE NTTs on AI ASICs

arXiv Security · Archived Jun 19, 2026




[Submitted on 18 Jun 2026]

George Alexakis, Dimitrios Schoinianakis, Giorgos Dimitrakopoulos

Fully Homomorphic Encryption (FHE) ensures robust data privacy but suffers from prohibitive computational overhead. Accelerating FHE on AI hardware like Tensor Processing Units (TPUs) is promising, yet fundamentally limited by a precision mismatch: TPUs are optimized for 8-bit arithmetic, whereas FHE and its critical components, such as the Number Theoretic Transform (NTT), demand high precision. Current approaches bridge this gap using matrix decomposition to execute NTT computations on low-precision matrix engines. However, reconstructing the full-precision results requires shift-and-add accumulation that does not match the dataflow of matrix multiplication. This forces full-precision reconstruction to be offloaded from the matrix engines to vector processors, which disrupts the matrix multiplication dataflow and creates a significant performance bottleneck.

To resolve this limitation, we propose a minimally modified multi-precision systolic array that performs full-precision output reconstruction natively within the array, in sync with low-precision matrix multiplication under a uniform dataflow. Synthesized at 7 nm with OpenROAD, our design incurs negligible hardware overhead. Cycle-accurate simulations using SCALE-Sim demonstrate that natively executing NTTs on the proposed architecture achieves a speedup of at least 1.33x for transform sizes 2^12 to 2^16 on 128x128 matrix engines, successfully enabling standard AI hardware to support high-precision FHE acceleration.
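The abstract describes executing an NTT on a low-precision matrix engine by decomposing operands, then rebuilding the full-precision result with shift-and-add accumulation. The following is a minimal sketch of that arithmetic identity only, not the authors' hardware design or their decomposition scheme. The prime `Q = 7681`, the toy size `N = 8`, the two-limb split, and every function name are assumptions chosen for illustration; the plain Python `matvec` merely stands in for a matrix engine with 8-bit operands.

```python
# Illustrative sketch: all parameters and helper names below are assumptions,
# not taken from the paper.

Q = 7681        # small NTT-friendly prime (7680 is divisible by 8), assumed for the demo
N = 8           # toy transform size; the paper targets 2^12 to 2^16
LIMB_BITS = 8   # width of the low-precision operands
NUM_LIMBS = 2   # every residue mod 7681 is below 2^16, so two 8-bit limbs suffice


def find_root_of_unity(n, q):
    """Return a primitive n-th root of unity modulo prime q (n a power of two dividing q - 1)."""
    for g in range(2, q):
        w = pow(g, (q - 1) // n, q)
        # For n a power of two, w has order exactly n iff w^(n/2) == -1 mod q.
        if pow(w, n // 2, q) == q - 1:
            return w
    raise ValueError("no primitive root of unity found")


def to_limbs(v):
    """Split a non-negative integer below 2^16 into little-endian 8-bit limbs."""
    mask = (1 << LIMB_BITS) - 1
    return [(v >> (LIMB_BITS * i)) & mask for i in range(NUM_LIMBS)]


def matvec(mat, vec):
    """Integer matrix-vector product, standing in for the low-precision matrix engine."""
    return [sum(a * b for a, b in zip(row, vec)) for row in mat]


def ntt_via_limbs(x, q=Q):
    """NTT as a matrix-vector product computed from 8-bit limb planes (inputs in [0, q))."""
    n = len(x)
    w = find_root_of_unity(n, q)
    twiddles = [[pow(w, i * j, q) for j in range(n)] for i in range(n)]

    # Decompose the twiddle matrix and the input vector into 8-bit limb planes.
    w_planes = [[[to_limbs(t)[k] for t in row] for row in twiddles]
                for k in range(NUM_LIMBS)]
    x_planes = [[to_limbs(v)[k] for v in x] for k in range(NUM_LIMBS)]

    # One low-precision product per pair of limb planes, then shift-and-add:
    # W x = sum over (a, b) of 2^(8 (a + b)) * (W_a x_b).
    acc = [0] * n
    for a in range(NUM_LIMBS):
        for b in range(NUM_LIMBS):
            partial = matvec(w_planes[a], x_planes[b])
            shift = LIMB_BITS * (a + b)
            acc = [s + (p << shift) for s, p in zip(acc, partial)]
    return [s % q for s in acc]


def ntt_reference(x, q=Q):
    """Direct full-precision NTT used to check the limb-based version."""
    n = len(x)
    w = find_root_of_unity(n, q)
    return [sum(x[j] * pow(w, i * j, q) for j in range(n)) % q for i in range(n)]
```

Calling `ntt_via_limbs(x)` on any length-8 list of residues should agree with `ntt_reference(x)`. The shift-and-add loop in this sketch is the step that, according to the abstract, existing approaches offload to vector processors and the proposed array performs natively.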
Subjects: Cryptography and Security (cs.CR)
Cite as: arXiv:2606.19866 [cs.CR] (or arXiv:2606.19866v1 [cs.CR] for this version)
https://doi.org/10.48550/arXiv.2606.19866
Submission history
From: George Alexakis
[v1] Thu, 18 Jun 2026 07:26:13 UTC (9,486 KB)
Article Info
Source: arXiv Security
Category: AI & Machine Learning
Published: Jun 19, 2026
Archived: Jun 19, 2026