AgentDish directory
evaluation
Accepted listings with this tag.
| Listing | Category | Score | Trend | Checked |
|---|---|---|---|---|
| #56 CAD-Bench: An open benchmark and leaderboard for AI CAD agents, with 308 prompts across 20 categories and layered scoring for geometry, engineering, manufacturability, and cognition. | Research / Knowledge Work | 88 | ↓ -3 | 24 days ago |
| #61 agent-skills-eval: A TypeScript CLI and SDK for testing whether Agent Skills improve model outputs by running with-skill vs baseline evaluations and generating reports. | Developer Tools / AI Evaluation | 88 | ↓ -3 | 26 days ago |
| #152 Alignment Whack-a-Mole: A research code repository for studying how fine-tuning can trigger verbatim recall of copyrighted books in large language models. It includes preprocessing, fine-tuning, generation, and memorization-evaluation scripts, with setup notes and example data. | Research / Copywriting | 86 | ↑ +222 | 28 days ago |
| #180 DeepSWE: A benchmark for measuring frontier coding agents on original, long-horizon software engineering tasks. The page shows a leaderboard, methodology overview, task examples, and a full blog explaining the benchmark design and results. | Developer Tools / AI Benchmarking | 84 | ↓ -6 | 5 days ago |
| A GitHub example that audits LangChain’s RAG quickstart with retrieval-quality metrics, flags off-topic and out-of-distribution queries, and surfaces ranking and calibration issues with charts and results files. | Developer Tool / RAG Evaluation | 78 | ↑ +4 | 27 days ago |
| A GitHub research project documenting a long-form, multi-model analysis of LLM behavior across Claude, Gemini, ChatGPT, and Grok. The repo includes an executive summary, screenplay, technical white paper, and archive of logs and chat records. | AI Research / LLM Evaluation & Analysis | 75 | → 0 | 7 days ago |