ManiBench: A Benchmark for Testing Visual-Logic Drift and Syntactic Hallucinations in Manim Code Generation
[Submitted on 24 Feb 2026]
Nabin Oli
Traditional benchmarks like HumanEval and MBPP test logic and syntax effectively, but fail when code must produce dynamic, pedagogical visuals. We introduce ManiBench, a specialized benchmark evaluating LLM performance in generating Manim CE code, where temporal fidelity and version-aware API correctness are critical. ManiBench targets two key failure modes: Syntactic Hallucinations (valid Python referencing non-existent or deprecated Manim APIs) and Visual-Logic Drift (generated visuals diverging from intended mathematical logic through timing errors or missing causal relationships). The benchmark comprises 150-200 problems across five difficulty levels spanning calculus, linear algebra, probability, topology, and AI, grounded in analysis of 3Blue1Brown's ManimGL source (53,000 lines, 143 scene classes). Evaluation uses a four-tier framework measuring Executability, Version-Conflict Error Rate, Alignment Score, and Coverage Score. An open-source framework automates evaluation across multiple models and prompting strategies. Code, data, and the benchmark suite are available at this https URL, and the dataset is hosted at this https URL.
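To make the Syntactic Hallucination failure mode concrete, the sketch below statically scans generated code for names that exist in ManimGL or older Manim releases but not in current Manim CE. It is an illustration only, not the paper's evaluation framework: the `LEGACY_NAMES` table, the helper `find_version_conflicts`, and the per-sample rate in `version_conflict_rate` are all assumptions introduced here, and the abstract does not specify how Version-Conflict Error Rate is actually computed.

```python
import ast

# Illustrative denylist (an assumption, not taken from the paper): names from
# ManimGL or older Manim releases, mapped to their Manim CE replacements.
LEGACY_NAMES = {
    "ShowCreation": "Create",
    "TexMobject": "MathTex",
    "TextMobject": "Tex",
}


def find_version_conflicts(source: str) -> list[tuple[int, str, str]]:
    """Return (line, legacy_name, suggested_replacement) for each flagged name."""
    conflicts = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            name = node.id
        elif isinstance(node, ast.Attribute):
            name = node.attr
        else:
            continue
        if name in LEGACY_NAMES:
            conflicts.append((node.lineno, name, LEGACY_NAMES[name]))
    return sorted(conflicts)


def version_conflict_rate(samples: list[str]) -> float:
    """Hypothetical metric: fraction of samples with at least one flagged name."""
    if not samples:
        return 0.0
    flagged = sum(1 for sample in samples if find_version_conflicts(sample))
    return flagged / len(samples)


# A generated scene that is valid Python but mixes legacy names into CE code.
GENERATED = '''
from manim import *

class Demo(Scene):
    def construct(self):
        label = TexMobject("x^2")
        circle = Circle()
        self.play(ShowCreation(circle))
'''

conflicts = find_version_conflicts(GENERATED)
for line, legacy, replacement in conflicts:
    print(f"line {line}: {legacy} -> use {replacement}")
```

The scan only parses the source, so it runs without Manim installed. It cannot detect Visual-Logic Drift, which depends on what the rendered animation shows rather than on which names appear in the code.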
Subjects: Artificial Intelligence (cs.AI)
Cite as: arXiv:2603.13251 [cs.AI]
(or arXiv:2603.13251v1 [cs.AI] for this version)
https://doi.org/10.48550/arXiv.2603.13251
Submission history
From: Nabin Oli
[v1] Tue, 24 Feb 2026 13:05:56 UTC (58 KB)