AgentDish directory

benchmarking

Accepted listings with this tag.

Listing Category Score Trend Checked
#22 ↓ -3
trycua/cua

Open-source infrastructure for computer-use agents, with sandboxes, SDKs, benchmarks, and desktop automation tooling for macOS, Linux, Windows, and Android. The repo also includes Cua Driver, CuaBot, Cua-Bench, and Lume for VM management.

Developer Tools / AI Agent Infrastructure 90 ↓ -3 27 days ago
#46 ↓ -3
LLMRequirements.com

An interactive guide for choosing local AI hardware and matching open-weights LLMs to specific builds. It offers budget-based build recommendations, a hardware picker, model-to-hardware compatibility, and a state-of-the-local-AI snapshot.

Developer Tools / AI / ML Infrastructure 88 ↓ -3 9 days ago
#61 ↓ -3
agent-skills-eval

A TypeScript CLI and SDK for testing whether Agent Skills improve model outputs by running with-skill vs baseline evaluations and generating reports.

Developer Tools / AI Evaluation 88 ↓ -3 26 days ago
#90 ↓ -4
SQLite-Columnar

A loadable SQLite extension that adds column-oriented storage and analytics for fast local OLAP-style queries, with benchmark data and build instructions.

Developer Tools / Databases & Storage 87 ↓ -4 20 days ago

A research article from Applied Compute on how agentic, tool-using workloads differ from traditional LLM benchmarks, with production observations, workload profiles, and an open-source harness for replaying traces.

Research / Knowledge Work 87 ↓ -34 27 days ago
#246 ↓ -4
YourMemory

A persistent memory layer for AI agents, built as a standard MCP server with local setup, dashboard, and benchmark claims against other memory tools.

Developer Tools / AI Memory / MCP 84 ↓ -4 27 days ago

An arXiv paper describing QUEST, an open family of deep research agents from 2B to 35B parameters, plus a synthetic-task training recipe and released models, data, and scripts.

Research / AI Agents 83 ↓ -3 7 days ago

An Apple Silicon–optimized inference build of Bonsai 1.7B with custom Metal kernels, benchmark results, quick-start instructions, and a bundled OpenAI-compatible server.

Developer Tools / Code Assistant 79 ↓ -48 27 days ago

A blog post about verifiable RAG that benchmarks open-source NLI verifiers against Claude on RAGTruth and describes a Python library for sentence-level citation and claim verification.

AI / RAG / Verification & Hallucination Detection 78 ↑ +6 15 hours ago

A blog post from Augment Code comparing its coding agent, Auggie, against Claude Code on Opus 4.7. It presents benchmark results, token usage, cost comparisons, and an explanation of the Context Engine and Prism router.

Developer Tools / AI Coding Assistant 77 → 0 15 days ago

A GitHub research project documenting a long-form, multi-model analysis of LLM behavior across Claude, Gemini, ChatGPT, and Grok. The repo includes an executive summary, screenplay, technical white paper, and archive of logs and chat records.

AI Research / LLM Evaluation & Analysis 75 → 0 7 days ago

A Superconductor blog post showing how background coding agents were used to reproduce, diagnose, and fix a Rails memory leak using derailed_benchmarks, with a reusable Agent Skill workflow included.

Developer Tools / Code Assistant 74 ↓ -1 5 days ago