AgentDish directory

benchmarking

Accepted listings with this tag.

Listing	Category	Score	Trend	Checked
#8 ↓ -3 SigMap SigMap is a deterministic grounding layer for AI code work. It generates a signature-and-evidence map, helps pick relevant files, validates coverage, and judges whether an AI answer is grounded in the repo.	Developer Tools / AI Code Assistance	91	↓ -3	12 days ago	Details
#48 ↓ -3 trycua/cua Open-source infrastructure for computer-use agents, with sandboxes, SDKs, benchmarks, and desktop automation tooling for macOS, Linux, Windows, and Android. The repo also includes Cua Driver, CuaBot, Cua-Bench, and Lume for VM management.	Developer Tools / AI Agent Infrastructure	90	↓ -3	72 days ago	Details
#153 ↓ -3 LLMRequirements.com An interactive guide for choosing local AI hardware and matching open-weights LLMs to specific builds. It offers budget-based build recommendations, a hardware picker, model-to-hardware compatibility, and a state-of-the-local-AI snapshot.	Developer Tools / AI / ML Infrastructure	88	↓ -3	54 days ago	Details
#168 ↓ -3 agent-skills-eval A TypeScript CLI and SDK for testing whether Agent Skills improve model outputs by running with-skill vs baseline evaluations and generating reports.	Developer Tools / AI Evaluation	88	↓ -3	71 days ago	Details
#237 ↓ -4 SQLite-Columnar A loadable SQLite extension that adds column-oriented storage and analytics for fast local OLAP-style queries, with benchmark data and build instructions.	Developer Tools / Databases & Storage	87	↓ -4	65 days ago	Details
#243 ↓ -74 Benchmarking Inference Engines on Agentic Workloads A research article from Applied Compute on how agentic, tool-using workloads differ from traditional LLM benchmarks, with production observations, workload profiles, and an open-source harness for replaying traces.	Research / Knowledge Work	87	↓ -74	72 days ago	Details
#266 ↑ +2 Atelier Open-source runtime for coding agents that sits underneath Claude Code to reduce tool calls, shorten context, and track savings. The page includes installation steps, a local savings check, and benchmark results comparing cost and performance.	Developer Tools / AI Development	86	↑ +2	8 days ago	Details
#270 ↑ +2 The Banana Test A visual benchmark that asks AI coding agents to generate a single-file Three.js animation of a banana plant’s full life cycle, then compares the live results side by side.	AI Tools / Benchmarking	86	↑ +2	10 days ago	Details
#490 ↓ -6 Caplets Caplets is a developer tool that wraps MCP servers into smaller capability-based surfaces for coding agents. The site explains the workflow, setup, example capabilities like OSV, GitHub, and Sourcegraph, and includes benchmark results plus docs links.	Developer Tools / Code Assistant	84	↓ -6	23 days ago	Details
#605 ↓ -4 YourMemory A persistent memory layer for AI agents, built as a standard MCP server with local setup, dashboard, and benchmark claims against other memory tools.	Developer Tools / AI Memory / MCP	84	↓ -4	73 days ago	Details
#666 ↓ -3 QUEST: Training Frontier Deep Research Agents with Fully Synthetic Tasks arXiv paper describing QUEST, an open family of deep research agents from 2B to 35B parameters, plus a synthetic-task training recipe and released models, data, and scripts.	Research / AI Agents	83	↓ -3	52 days ago	Details
#741 ↓ -2 clawmark A local Rust CLI for A/B testing two CLAUDE.md files against a fixed SWE-bench Lite smoke set, with doctor, run, and report commands.	Developer Tools / AI Benchmarking	82	↓ -2	29 days ago	Details
#812 ↑ +2 PSI KV Governor Reference implementation for using Linux Pressure Stall Information to trim an LLM KV cache under memory pressure. The repo includes requirements, basic usage commands, a simulator, a llama.cpp runner, and benchmark scripts with example results.	Developer Tools / AI Infrastructure	81	↑ +2	19 days ago	Details
#889 ↑ +2 HoprLabs A Python CLI and research toolkit for simulating AI training math before model training. It estimates memory, training time, token budget, config risks, benchmark speed, and reliability, with optional native Rust and C backends.	Developer Tool / AI Research Toolkit	79	↑ +2	21 days ago	Details
#916 ↓ -123 Bonsai 1.7B: Apple Silicon Optimized Build An Apple Silicon–optimized inference build of Bonsai 1.7B with custom Metal kernels, benchmark results, quick-start instructions, and a bundled OpenAI-compatible server.	Developer Tools / Code Assistant	79	↓ -123	72 days ago	Details
#947 ↑ +6 Verified RAG: every sentence checked A blog post about verifiable RAG that benchmarks open-source NLI verifiers against Claude on RAGTruth and describes a Python library for sentence-level citation and claim verification.	AI / RAG / Verification & Hallucination Detection	78	↑ +6	45 days ago	Details
#1009 → 0 Auggie vs Claude Code cost and quality benchmark A blog post from Augment Code comparing its coding agent, Auggie, against Claude Code on Opus 4.7. It presents benchmark results, token usage, cost comparisons, and an explanation of the Context Engine and Prism router.	Developer Tools / AI Coding Assistant	77	→ 0	60 days ago	Details
#1054 → 0 A 400-hour forensic audit of LLMs using multi-model context saturation A GitHub research project documenting a long-form, multi-model analysis of LLM behavior across Claude, Gemini, ChatGPT, and Grok. The repo includes an executive summary, screenplay, technical white paper, and archive of logs and chat records.	AI Research / LLM Evaluation & Analysis	75	→ 0	52 days ago	Details
#1068 ↓ -1 Evaluate Your Agentic Tooling A blog post describing an evaluation harness for comparing agentic coding tools and prompts across realistic SWE tasks, with token-cost results and model-specific behavior notes.	Developer Tools / Code Assistant	74	↓ -1	33 days ago	Details
#1071 ↓ -1 LEVI LEVI is a harness-first evolutionary framework for code and prompt optimization. It focuses on reducing LLM cost with diversity-preserving search, role-aware model routing, and a proxy benchmark, and presents comparative results against several existing systems.	Developer Tools / Code Assistant	74	↓ -1	39 days ago	Details
#1079 ↓ -1 Teaching coding agents to debug Rails memory issues with derailed_benchmarks A Superconductor blog post showing how background coding agents were used to reproduce, diagnose, and fix a Rails memory leak using derailed_benchmarks, with a reusable Agent Skill workflow included.	Developer Tools / Code Assistant	74	↓ -1	50 days ago	Details