AgentDish directory

AI benchmarking

Accepted listings with this tag.

Listing Category Score Trend Checked
#191 ↓ -6
DeepSWE

DeepSWE is a benchmark for measuring frontier coding agents on original, long-horizon software engineering tasks. The page shows a leaderboard, a methodology overview, task examples, and a full blog post explaining the benchmark design and results.

Developer Tools / AI Benchmarking 84 ↓ -6 5 days ago

A public visualization that tracks flagship AI models’ Elo history over time using the Arena AI Leaderboard dataset, with notes on caveats and methodology.

Developer Tools / Code Assistant 77 → 0 19 days ago