AgentDish directory

llm-inference

Accepted listings with this tag.

Listing	Category	Score	Trend	Checked
#151 ↓ -3 tiny-vllm Open-source C++ and CUDA LLM inference engine inspired by vLLM, with a teaching-focused course that walks through model serving, batching, KV cache, and attention kernels.	Developer Tools / AI Inference / LLM Serving	88	↓ -3	48 days ago	Details
#243 ↓ -74 Benchmarking Inference Engines on Agentic Workloads A research article from Applied Compute on how agentic, tool-using workloads differ from traditional LLM benchmarks, with production observations, workload profiles, and an open-source harness for replaying traces.	Research / Knowledge Work	87	↓ -74	73 days ago	Details
#312 ↑ +2 ZSE v2.0.0 A pure-Python LLM inference engine and server with CUDA/HIP/Metal code generation, OpenAI-compatible API support, built-in RAG, and multi-GPU backend support.	Developer Tools / AI / ML Infrastructure	86	↑ +2	45 days ago	Details
#382 ↓ -3 axiom Bootable Rust no_std kernel built as an inference substrate for LLMs, with tensor-native memory allocation, layer-boundary scheduling, and streaming-focused runtime primitives.	Developer Tools / AI Infrastructure	85	↓ -3	15 days ago	Details
#430 ↓ -6 LLM Inspector An open-source CLI for inspecting live LLM inference processes on NVIDIA GPUs. It breaks down VRAM usage by component, shows runtime and model details, and projects memory savings from quantization strategies.	Developer Tools / LLM Inference Observability	84	↓ -6	just now	Details
#842 ↓ -1 Jamesob's guide to running SOTA LLMs locally A GitHub repository guide to running state-of-the-art LLMs on local hardware, with concrete build notes, cost breakdowns, and runnable configs for large models.	AI Infrastructure / Local LLMs	80	↓ -1	14 days ago	Details
#974 ↓ -187 Achieving 3X speedups on Google TPUs with diffusion-style speculative decoding Google Developers Blog post about integrating DFlash, a diffusion-style speculative decoding framework, into the vLLM TPU ecosystem to improve LLM serving speed on TPU v5p.	Developer Tools / Code Assistant	78	↓ -187	72 days ago	Details
#1045 → 0 DSpark: Speculative decoding accelerates LLM inference A DeepSeek paper about DSpark, a full-stack codebase for training and evaluating speculative decoding algorithms to speed up LLM inference.	Developer Tool / AI Infrastructure	75	→ 0	19 days ago	Details
#1119 ↓ -29 vLLM-Compile A public slide deck about vLLM-compile, a project focused on bringing compiler optimizations to LLM inference and speeding up torch.compile for vLLM workflows.	Developer Tools / Code Assistant	72	↓ -29	73 days ago	Details