
VQQA: An Agentic Approach for Video Evaluation and Quality Improvement

arXiv AI, archived Mar 16, 2026

arXiv:2603.12310v1 Announce Type: cross



[Submitted on 12 Mar 2026]

Yiwen Song, Tomas Pfister, Yale Song

Despite rapid advancements in video generation models, aligning their outputs with complex user intent remains challenging. Existing test-time optimization methods are typically either computationally expensive or require white-box access to model internals. To address this, we present VQQA (Video Quality Question Answering), a unified, multi-agent framework generalizable across diverse input modalities and video generation tasks. By dynamically generating visual questions and using the resulting Vision-Language Model (VLM) critiques as semantic gradients, VQQA replaces traditional, passive evaluation metrics with human-interpretable, actionable feedback. This enables a highly efficient, closed-loop prompt optimization process via a black-box natural language interface. Extensive experiments demonstrate that VQQA effectively isolates and resolves visual artifacts, substantially improving generation quality in just a few refinement steps. Applicable to both text-to-video (T2V) and image-to-video (I2V) tasks, our method achieves absolute improvements of +11.57% on T2V-CompBench and +8.43% on VBench2 over vanilla generation, significantly outperforming state-of-the-art stochastic search and prompt optimization techniques.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Cite as: arXiv:2603.12310 [cs.CV] (or arXiv:2603.12310v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2603.12310

Submission history: From: Yiwen Song. [v1] Thu, 12 Mar 2026 18:00:00 UTC (13,269 KB)
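The abstract describes a closed loop: generate a video, pose visual questions about it, treat the VLM critiques as feedback, and rewrite the prompt through a black-box interface. The sketch below is one possible reading of that loop, not the authors' implementation. Every name in it (`Critique`, `refine_prompt`, the four callables, the 0 to 1 score scale, the stopping rule) is a hypothetical placeholder, and the toy stand-ins replace the real video generator and VLM with simple string checks so the loop can run on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Critique:
    question: str   # visual question posed about the video
    score: float    # 0.0 (fails) to 1.0 (passes); this scale is an assumption
    feedback: str   # natural-language note on what to fix


def refine_prompt(
    intent: str,
    generate: Callable[[str], str],                  # prompt -> video (black box)
    ask: Callable[[str, str], List[str]],            # (intent, video) -> questions
    critique: Callable[[str, str], Critique],        # (video, question) -> critique
    rewrite: Callable[[str, List[Critique]], str],   # (prompt, critiques) -> new prompt
    max_steps: int = 3,
    target: float = 1.0,
) -> Tuple[str, float]:
    """Hypothetical closed-loop prompt refinement; returns the best prompt seen."""
    prompt, best_prompt, best_score = intent, intent, -1.0
    for _ in range(max_steps):
        video = generate(prompt)
        critiques = [critique(video, q) for q in ask(intent, video)]
        # Mean critique score; guard against an empty question list.
        score = sum(c.score for c in critiques) / max(len(critiques), 1)
        if score > best_score:
            best_prompt, best_score = prompt, score
        if score >= target:
            break
        # The critiques act as the update signal for the next prompt.
        prompt = rewrite(prompt, critiques)
    return best_prompt, best_score


# Toy stand-ins (assumptions): the "video" is just the lowercased prompt text.
REQUIRED = ["red ball", "blue cube", "wooden table"]


def toy_generate(prompt: str) -> str:
    return prompt.lower()


def toy_ask(intent: str, video: str) -> List[str]:
    return list(REQUIRED)


def toy_critique(video: str, question: str) -> Critique:
    ok = question in video
    return Critique(question, 1.0 if ok else 0.0, "" if ok else f"add {question}")


def toy_rewrite(prompt: str, critiques: List[Critique]) -> str:
    # Address one failing critique per step to mimic gradual refinement.
    failing = [c for c in critiques if c.score < 1.0]
    return f"{prompt}, {failing[0].question}" if failing else prompt


best_prompt, best_score = refine_prompt(
    "a red ball", toy_generate, toy_ask, toy_critique, toy_rewrite
)
print(best_prompt, best_score)
```

In this toy run the prompt gains one missing element per step and reaches the target score on the third generation. A real setup would replace the four callables with calls to a video model and a VLM, neither of which this sketch specifies.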
