AI & Machine Learning · May 28, 2026

AgenticVBench: Can AI Agents Complete Real-World Post-Production Tasks?

arXiv Security · Archived May 28, 2026




Zongheng Cao, Yi Zheng, Rui Song, Xinyu Hu
[Submitted on 26 May 2026]

Video production workflows offer a rich and demanding arena for evaluating multimodal AI agents: they require composite capabilities across text, image, audio, and video understanding, along with long-horizon planning and tool use. To this end, we introduce AgenticVBench, a benchmark of 100 agentic tasks across 4 task families spanning the real-world post-production workflow, constructed from real production workflows contributed by 20 industry experts averaging 6 years of professional experience. Tasks are paired with evaluation specifications that combine programmatic verifiers and expert rubrics. We evaluate frontier vision-language models (VLMs) with both vendor-native and open-source harnesses. The best evaluated agent stack barely crosses 30%, far below human expert performance on the same tasks. We further find that the choice of harness substantially affects model behavior, including scores, tool-use patterns, and failure modes. AgenticVBench provides a foundation for diagnosing and improving both models and harnesses for agentic video production. Benchmark website: this https URL.

Comments: 22 pages, 6 figures.
Subjects: Cryptography and Security (cs.CR); Multimedia (cs.MM)
Cite as: arXiv:2605.27705 [cs.CR] (or arXiv:2605.27705v1 [cs.CR] for this version)
https://doi.org/10.48550/arXiv.2605.27705
Submission history: from Xinyu Hu; [v1] Tue, 26 May 2026 21:27:16 UTC (1,218 KB)
