AI & Machine Learning · Apr 21, 2026

Introspection Adapters: Training LLMs to Report Their Learned Behaviors

arXiv AI · Archived Apr 21, 2026




Computer Science > Artificial Intelligence

[Submitted on 18 Apr 2026]

Introspection Adapters: Training LLMs to Report Their Learned Behaviors

Keshav Shenoy, Li Yang, Abhay Sheshadri, Sören Mindermann, Jack Lindsey, Sam Marks, Rowan Wang

Abstract: When model developers or users fine-tune an LLM, this can induce behaviors that are unexpected, deliberately harmful, or hard to detect. It would be far easier to audit LLMs if they could simply describe their behaviors in natural language. Here, we study a scalable approach to rapidly identify learned behaviors of many LLMs derived from a shared base LLM. Given a model M, our method works by fine-tuning models M_i from M with implanted behaviors b_i; the (M_i, b_i) pairs serve as labeled training data. We then train an introspection adapter (IA): a single LoRA adapter jointly trained across the finetunes M_i to cause them to verbalize their implanted behaviors. We find that this IA induces self-description of learned behaviors even in finetunes of M that were trained in very different ways from the M_i. For example, IAs generalize to AuditBench, achieving state-of-the-art at identifying explicitly hidden concerning behaviors. IAs can also be used to detect encrypted finetuning API attacks. They scale favorably with model size and training data diversity. Overall, our results suggest that IAs are a scalable, effective, and practically useful approach to auditing fine-tuned LLMs.
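The core mechanic described in the abstract — one shared low-rank adapter trained jointly across many fine-tunes of the same base model, so that each adapted model steers toward a description of its own implanted behavior — can be illustrated with a toy numerical sketch. Everything below (linear "models", probe vectors, random verbalization targets, the dimensions) is a hypothetical stand-in for illustration, not the paper's actual training setup:

```python
import numpy as np

# Toy sketch of joint introspection-adapter (IA) training: a SINGLE
# low-rank (LoRA-style) adapter A @ B is trained across several
# fine-tunes W + deltas[i] of a shared base model W, so that each
# adapted model maps its probe input toward a target standing in for
# a verbalization of its implanted behavior b_i. All names and
# dimensions here are illustrative assumptions.

rng = np.random.default_rng(0)
d, r, n = 8, 4, 4          # hidden dim, adapter rank, number of fine-tunes

W = rng.normal(size=(d, d))                                  # shared base weights
deltas = [0.1 * rng.normal(size=(d, d)) for _ in range(n)]   # fine-tune i = W + deltas[i]
probes = [rng.normal(size=d) for _ in range(n)]              # probe input per fine-tune
targets = [rng.normal(size=d) for _ in range(n)]             # toy verbalization targets

A = 0.1 * rng.normal(size=(d, r))                            # the one shared adapter: A @ B
B = 0.1 * rng.normal(size=(r, d))

def mean_loss():
    # Squared error of each adapted fine-tune against its target, averaged.
    return float(np.mean([
        0.5 * np.sum(((W + D + A @ B) @ x - t) ** 2)
        for D, x, t in zip(deltas, probes, targets)
    ]))

loss_before = mean_loss()
lr = 1e-3
for _ in range(3000):
    gA, gB = np.zeros_like(A), np.zeros_like(B)
    for D, x, t in zip(deltas, probes, targets):
        err = (W + D + A @ B) @ x - t        # residual for fine-tune i
        gA += np.outer(err, B @ x)           # d(loss)/dA
        gB += np.outer(A.T @ err, x)         # d(loss)/dB
    A -= lr * gA / n
    B -= lr * gB / n
loss_after = mean_loss()                     # one adapter, fit jointly to all fine-tunes
```

The point of the sketch is the data layout, not the optimizer: the (M_i, b_i) pairs supply supervision, while the adapter parameters are shared, which is what lets a single IA generalize to fine-tunes it never saw during training.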
Subjects: Artificial Intelligence (cs.AI)
Cite as: arXiv:2604.16812 [cs.AI] (or arXiv:2604.16812v1 [cs.AI] for this version)
DOI: https://doi.org/10.48550/arXiv.2604.16812
Submission history: From: Keshav Shenoy. [v1] Sat, 18 Apr 2026 03:50:00 UTC (5,608 KB)
Article Info
Source: arXiv AI
Category: AI & Machine Learning
Published: Apr 21, 2026
Archived: Apr 21, 2026
Full Text: Saved locally