Moltbook Moderation: Uncovering Hidden Intent Through Multi-Turn Dialogue
Computer Science > Artificial Intelligence
[Submitted on 13 May 2026]
Ali Al-Lawati, Nafis Tripto, Abolfazl Ansari, Jason Lucas, Suhang Wang, Dongwon Lee
The emergence of multi-agent systems introduces novel moderation challenges that extend beyond content filtering. Agents with malicious intent may contribute harmful content that appears benign to evade content-based moderation, while compromising the system through exploitative and malicious behavior manifested across their overall interaction patterns within the community. To address this, we introduce Bot-Mod (Bot-Moderation), a moderation framework that grounds detection in agent intent rather than traditional content-level signals. Bot-Mod identifies the underlying intent by engaging the target agent in a multi-turn exchange guided by Gibbs-based sampling over candidate intent hypotheses, progressively narrowing the space of plausible agent objectives to identify the underlying behavior. To evaluate our approach, we construct a dataset derived from Moltbook that encompasses diverse benign and malicious behaviors based on actual community structures, posts, and comments. Results demonstrate that Bot-Mod reliably identifies agent intent across a range of adversarial configurations while maintaining a low false positive rate on benign behaviors. This work advances the foundation for scalable, intent-aware moderation of agents in open multi-agent environments.
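The abstract describes narrowing a space of candidate intent hypotheses over a multi-turn exchange. The paper's actual Gibbs-based procedure is not given here, so the following is only an illustrative sketch of the general idea: maintain a belief over candidate intents and re-weight it after each probe response. All names, the hypothesis set, and the toy likelihood function are assumptions for illustration, not the authors' method.

```python
# Illustrative sketch of intent-hypothesis narrowing via sequential
# belief updates. The real Bot-Mod framework uses Gibbs-based sampling
# over intent hypotheses (see the paper); the hypothesis labels and
# likelihood model below are hypothetical stand-ins.

HYPOTHESES = ["benign", "spam", "scam", "astroturfing"]

def likelihood(response_signal: float, hypothesis: str) -> float:
    """Toy likelihood of an observed probe response under a hypothesis.
    In a real system this score would come from analyzing the target
    agent's reply, not from a fixed lookup table."""
    bias = {"benign": 0.2, "spam": 0.5, "scam": 0.8, "astroturfing": 0.6}
    return max(1e-6, 1.0 - abs(response_signal - bias[hypothesis]))

def update_beliefs(beliefs: dict, response_signal: float) -> dict:
    """One dialogue turn: re-weight each intent hypothesis by how well
    it explains the latest response, then renormalize."""
    weighted = {h: p * likelihood(response_signal, h) for h, p in beliefs.items()}
    total = sum(weighted.values())
    return {h: w / total for h, w in weighted.items()}

# Start from a uniform belief and narrow it over three simulated turns.
beliefs = {h: 1.0 / len(HYPOTHESES) for h in HYPOTHESES}
for signal in [0.75, 0.80, 0.85]:  # simulated probe-response scores
    beliefs = update_beliefs(beliefs, signal)

print(max(beliefs, key=beliefs.get))  # most plausible intent after 3 turns
```

Each update concentrates probability mass on the hypotheses that best explain the agent's observed replies, which is the same "progressively narrowing" dynamic the abstract attributes to the multi-turn exchange.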
Subjects: Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Cite as: arXiv:2605.12856 [cs.AI]
(or arXiv:2605.12856v1 [cs.AI] for this version)
https://doi.org/10.48550/arXiv.2605.12856
Submission history
From: Ali Al Lawati [view email]
[v1] Wed, 13 May 2026 01:04:16 UTC (1,209 KB)