AgentDish directory

benchmark

Accepted listings with this tag.

Listing	Category	Score	Trend	Checked
#51 → 0 ReactBench ReactBench is a benchmark for evaluating coding agents on realistic React work, with published scores, cost comparisons, and example tasks focused on production-grade frontend issues.	Developer Tools / Code Assistant	89	→ 0	45 hours ago
#163 ↓ -3 CAD-Bench An open benchmark and leaderboard for AI CAD agents, with 308 prompts across 20 categories and layered scoring for geometry, engineering, manufacturability, and cognition.	Research / Knowledge Work	88	↓ -3	69 days ago
#270 ↑ +2 The Banana Test A visual benchmark that asks AI coding agents to generate a single-file Three.js animation of a banana plant’s full life cycle, then compares the live results side by side.	AI Tools / Benchmarking	86	↑ +2	10 days ago
#539 ↓ -6 DeepSWE DeepSWE is a benchmark for measuring frontier coding agents on original, long-horizon software engineering tasks. The page shows a leaderboard, methodology overview, task examples, and a full blog explaining the benchmark design and results.	Developer Tools / AI Benchmarking	84	↓ -6	50 days ago
#663 ↓ -3 AI Agent Benchmark: API Bug Detection | KushoAI A black-box benchmark report on how AI-generated tests detect functional bugs in live APIs across 20 scenarios and 7 systems.	Developer Tools / Code Assistant	83	↓ -3	43 days ago
#815 ↑ +2 MiniMax M3 vs. GLM 5.2: Codegen comparison across autonomous coding tasks A workbench report comparing MiniMax M3 and GLM 5.2 on autonomous coding tasks, with scored results, latency and cost data, task-type breakdowns, and examples of where each model performed better.	Developer Tools / Code Assistant	81	↑ +2	27 days ago
#930 ↑ +6 Module decomposition cut agent token use 32% on follow-up feature additions A Topos case study showing how a guided refactor of a synthetic healthcare claims engine reduced token usage, wall time, and estimated cost for later Gemini feature sessions.	Developer Tools / Code Assistant	78	↑ +6	21 days ago
#1060 ↓ -1 What a Verification Loop Adds to a Coding Agent: A First Look An in-depth Medium post from IronBee about how a verification loop affects AI coding agents, using Web-Bench and comparing DeepSeek with Claude Opus on a real web app task.	Developer Tools / Code Assistant	74	↓ -1	10 days ago
#1094 ↑ +1 WifeBench A playful benchmark dashboard that ranks LLMs based on one person's 10-question scoring process.	Writing / Copywriting	73	↑ +1	12 days ago