Benchmarks

Lean models vs. base Qwen3 on agent-specific tasks, measured by LeanBench v1.

Agent Task Performance

Tool-calling accuracy, JSON validity, and multi-step orchestration.

| Model | Tool Call % | JSON Valid % | Multi-Step % | Speed (tok/s) | VRAM |
|---|---|---|---|---|---|
| Qwen3-8B (base, Q5_K_M) | 72% | 81% | 65% | 45 | 7.8 GB |
| Lean-Agent-8B (Q8_0) | 94% | 97% | 88% | 38 | 8.2 GB |
| Qwen3-14B (base, Q4_K_M) | 79% | 85% | 71% | 28 | 10.8 GB |
| Lean-Agent-14B (Q8_0) | 96% | 98% | 91% | 22 | 14.8 GB |
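The headline percentages are simple pass rates over the test suite. A minimal sketch of that aggregation, assuming hypothetical per-case pass/fail records with category labels (the names here are illustrative, not LeanBench internals):

```python
from dataclasses import dataclass

@dataclass
class CaseResult:
    category: str   # e.g. "tool_call", "json", "multi_step" -- hypothetical labels
    passed: bool

def pass_rates(results: list[CaseResult]) -> dict[str, float]:
    """Aggregate per-case pass/fail records into percentage scores per category."""
    totals: dict[str, int] = {}
    passes: dict[str, int] = {}
    for r in results:
        totals[r.category] = totals.get(r.category, 0) + 1
        passes[r.category] = passes.get(r.category, 0) + int(r.passed)
    return {c: 100.0 * passes[c] / totals[c] for c in totals}

results = [CaseResult("tool_call", True), CaseResult("tool_call", False),
           CaseResult("json", True)]
print(pass_rates(results))  # {'tool_call': 50.0, 'json': 100.0}
```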

Academic Benchmarks

Standard evals confirming no catastrophic forgetting from distillation. Run on pre-quantization merged weights.

| Model | MMLU | HumanEval | GSM8K | ARC-Challenge | HellaSwag |
|---|---|---|---|---|---|
| Qwen3-8B (base) | 72.1 | 62.3 | 79.4 | 63.1 | 81.2 |
| Lean-Agent-8B | 71.8 | 68.1 | 78.9 | 62.8 | 80.9 |
| Qwen3-14B (base) | 76.3 | 67.2 | 84.1 | 68.5 | 85.0 |
| Lean-Agent-14B | 75.9 | 72.5 | 83.7 | 68.1 | 84.6 |

Academic benchmarks are run on merged F16 weights (pre-quantization) so they are the same across all quant variants.
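The "no catastrophic forgetting" claim can be checked directly from the 8B rows above; a quick script computing the base-to-Lean delta per benchmark (scores copied from the table):

```python
# Scores from the academic benchmark table (8B rows).
base_8b = {"MMLU": 72.1, "HumanEval": 62.3, "GSM8K": 79.4,
           "ARC-Challenge": 63.1, "HellaSwag": 81.2}
lean_8b = {"MMLU": 71.8, "HumanEval": 68.1, "GSM8K": 78.9,
           "ARC-Challenge": 62.8, "HellaSwag": 80.9}

# Positive delta = Lean model improved over base; negative = regression.
deltas = {k: round(lean_8b[k] - base_8b[k], 1) for k in base_8b}
print(deltas)
# {'MMLU': -0.3, 'HumanEval': 5.8, 'GSM8K': -0.5, 'ARC-Challenge': -0.3, 'HellaSwag': -0.3}
```

Every regression stays within 0.5 points, while HumanEval actually improves, consistent with the distillation targeting agentic and code-adjacent behavior.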

Methodology

LeanBench v1 tests three core agent capabilities: tool calling format compliance and parameter extraction (200+ test cases), structured JSON/XML output generation (100+ test cases), and multi-step agentic reasoning with task decomposition (100+ test cases).
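The structured-output metric, for instance, reduces to parsing the model's raw output and checking for required keys. A minimal sketch, assuming a hypothetical per-case `required_keys` list (not the actual LeanBench harness):

```python
import json

def is_valid_json_output(raw: str, required_keys: list[str]) -> bool:
    """Return True if raw parses as a JSON object containing every required key."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and all(k in obj for k in required_keys)

print(is_valid_json_output('{"tool": "search", "query": "llama.cpp"}',
                           ["tool", "query"]))  # True
print(is_valid_json_output('{"tool": "search",', ["tool"]))  # False (truncated)
```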

Academic benchmarks are run via lm-evaluation-harness on the merged (pre-quantization) model weights to establish a clean baseline, then re-run on quantized GGUF variants to measure quantization impact.

Hardware: all inference benchmarks run locally via llama.cpp on a consumer GPU. Reported speeds exclude prompt processing and measure generation throughput only.
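To make the generation-only convention concrete, here is how such a figure could be derived from timestamps (a sketch with hypothetical names, not llama.cpp's own reporting code):

```python
def generation_speed(n_generated_tokens: int, gen_start: float, gen_end: float) -> float:
    """Tokens per second over the generation phase only; prompt processing
    happens before gen_start and is excluded by construction."""
    return n_generated_tokens / (gen_end - gen_start)

# Example: 380 tokens generated over 10 seconds of wall-clock generation time.
print(round(generation_speed(380, 0.0, 10.0), 1))  # 38.0
```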

All results are reproducible. Raw eval outputs will be published alongside each model release.