AgentDish directory

evaluation

Accepted listings with this tag.

Listing Category Score Trend Checked
#56 ↓ -3
CAD-Bench

An open benchmark and leaderboard for AI CAD agents, with 308 prompts across 20 categories and layered scoring for geometry, engineering, manufacturability, and cognition.

Research / Knowledge Work 88 ↓ -3 24 days ago
#61 ↓ -3
agent-skills-eval

A TypeScript CLI and SDK for testing whether Agent Skills improve model outputs by running with-skill vs baseline evaluations and generating reports.

Developer Tools / AI Evaluation 88 ↓ -3 26 days ago
#152 ↑ +222
Alignment Whack-a-Mole

A research code repository for studying how fine-tuning can trigger verbatim recall of copyrighted books in large language models. It includes preprocessing, fine-tuning, generation, and memorization-evaluation scripts, with setup notes and example data.

Research / Copywriting 86 ↑ +222 28 days ago
#180 ↓ -6
DeepSWE

DeepSWE is a benchmark for measuring frontier coding agents on original, long-horizon software engineering tasks. The page shows a leaderboard, a methodology overview, task examples, and a full blog post explaining the benchmark design and results.

Developer Tools / AI Benchmarking 84 ↓ -6 5 days ago

A GitHub example that audits LangChain’s RAG quickstart with retrieval-quality metrics, flags off-topic and out-of-distribution queries, and surfaces ranking and calibration issues with charts and results files.

Developer Tool / RAG Evaluation 78 ↑ +4 27 days ago

A GitHub research project documenting a long-form, multi-model analysis of LLM behavior across Claude, Gemini, ChatGPT, and Grok. The repo includes an executive summary, screenplay, technical white paper, and archive of logs and chat records.

AI Research / LLM Evaluation & Analysis 75 → 0 7 days ago