AgentDish directory
benchmarking
Accepted listings with this tag.
| Listing | Category | Score | Trend | Checked | |
|---|---|---|---|---|---|
| #22 trycua/cua. Open-source infrastructure for computer-use agents, with sandboxes, SDKs, benchmarks, and desktop automation tooling for macOS, Linux, Windows, and Android. The repo also includes Cua Driver, CuaBot, Cua-Bench, and Lume for VM management. | Developer Tools / AI Agent Infrastructure | 90 | ↓ -3 | 27 days ago | Details |
| #46 LLMRequirements.com. An interactive guide for choosing local AI hardware and matching open-weights LLMs to specific builds. It offers budget-based build recommendations, a hardware picker, model-to-hardware compatibility, and a state-of-the-local-AI snapshot. | Developer Tools / AI / ML Infrastructure | 88 | ↓ -3 | 9 days ago | Details |
| #61 agent-skills-eval. A TypeScript CLI and SDK for testing whether Agent Skills improve model outputs by running with-skill vs baseline evaluations and generating reports. | Developer Tools / AI Evaluation | 88 | ↓ -3 | 26 days ago | Details |
| #90 SQLite-Columnar. A loadable SQLite extension that adds column-oriented storage and analytics for fast local OLAP-style queries, with benchmark data and build instructions. | Developer Tools / Databases & Storage | 87 | ↓ -4 | 20 days ago | Details |
| A research article from Applied Compute on how agentic, tool-using workloads differ from traditional LLM benchmarks, with production observations, workload profiles, and an open-source harness for replaying traces. | Research / Knowledge Work | 87 | ↓ -34 | 27 days ago | Details |
| #246 YourMemory. A persistent memory layer for AI agents, built as a standard MCP server with local setup, dashboard, and benchmark claims against other memory tools. | Developer Tools / AI Memory / MCP | 84 | ↓ -4 | 27 days ago | Details |
| arXiv paper describing QUEST, an open family of deep research agents from 2B to 35B parameters, plus a synthetic-task training recipe and released models, data, and scripts. | Research / AI Agents | 83 | ↓ -3 | 7 days ago | Details |
| #369 Bonsai 1.7B: Apple Silicon Optimized Build. An Apple Silicon–optimized inference build of Bonsai 1.7B with custom Metal kernels, benchmark results, quick-start instructions, and a bundled OpenAI-compatible server. | Developer Tools / Code Assistant | 79 | ↓ -48 | 27 days ago | Details |
| #371 Verified RAG: every sentence checked. A blog post about verifiable RAG that benchmarks open-source NLI verifiers against Claude on RAGTruth and describes a Python library for sentence-level citation and claim verification. | AI / RAG / Verification & Hallucination Detection | 78 | ↑ +6 | 15 hours ago | Details |
| A blog post from Augment Code comparing its coding agent, Auggie, against Claude Code on Opus 4.7. It presents benchmark results, token usage, cost comparisons, and an explanation of the Context Engine and Prism router. | Developer Tools / AI Coding Assistant | 77 | → 0 | 15 days ago | Details |
| A GitHub research project documenting a long-form, multi-model analysis of LLM behavior across Claude, Gemini, ChatGPT, and Grok. The repo includes an executive summary, screenplay, technical white paper, and archive of logs and chat records. | AI Research / LLM Evaluation & Analysis | 75 | → 0 | 7 days ago | Details |
| A Superconductor blog post showing how background coding agents were used to reproduce, diagnose, and fix a Rails memory leak using derailed_benchmarks, with a reusable Agent Skill workflow included. | Developer Tools / Code Assistant | 74 | ↓ -1 | 5 days ago | Details |