AgenticVBench: Can AI Agents Complete Real-World Post-Production Tasks?
arXiv Security. Archived May 28, 2026.
arXiv:2605.27705v1 Announce Type: new
[Submitted on 26 May 2026]
Zongheng Cao, Yi Zheng, Rui Song, Xinyu Hu
Video production workflows offer a rich and demanding arena for evaluating multimodal AI agents: they require composite capabilities across text, image, audio, and video understanding, along with long-horizon planning and tool use. To this end, we introduce AgenticVBench, a benchmark of 100 agentic tasks across 4 task families spanning the real-world post-production workflow, constructed from real production workflows contributed by 20 industry experts averaging 6 years of professional experience. Tasks are paired with evaluation specifications that combine programmatic verifiers and expert rubrics. We evaluate frontier vision-language models (VLMs) with both vendor-native and open-source harnesses. The best evaluated agent stack barely crosses 30%, far below human expert performance on the same tasks. We further find that the choice of harness substantially affects model behavior, including scores, tool-use patterns, and failure modes. AgenticVBench provides a foundation for diagnosing and improving both models and harnesses for agentic video production. Benchmark website: this https URL.
Comments: 22 pages, 6 figures. Benchmark website: this https URL
Subjects: Cryptography and Security (cs.CR); Multimedia (cs.MM)
Cite as: arXiv:2605.27705 [cs.CR]
(or arXiv:2605.27705v1 [cs.CR] for this version)
https://doi.org/10.48550/arXiv.2605.27705
Submission history
From: Xinyu Hu
[v1] Tue, 26 May 2026 21:27:16 UTC (1,218 KB)