# Benchmarks
Lean models vs. base Qwen3 on agent-specific tasks, measured by LeanBench v1.
## Agent Task Performance
Tool calling accuracy, JSON validity, and multi-step orchestration.
| Model | Tool Call % | JSON Valid % | Multi-Step % | Speed (tok/s) | VRAM |
|---|---|---|---|---|---|
| Qwen3-8B (base Q5_K_M) | 72% | 81% | 65% | 45 | 7.8 GB |
| Lean-Agent-8B (Q8_0) | 94% | 97% | 88% | 38 | 8.2 GB |
| Qwen3-14B (base Q4_K_M) | 79% | 85% | 71% | 28 | 10.8 GB |
| Lean-Agent-14B (Q8_0) | 96% | 98% | 91% | 22 | 14.8 GB |
## Academic Benchmarks
Standard evals confirming no catastrophic forgetting from distillation. Run on pre-quantization merged weights.
| Model | MMLU | HumanEval | GSM8K | ARC-Challenge | HellaSwag |
|---|---|---|---|---|---|
| Qwen3-8B (base) | 72.1 | 62.3 | 79.4 | 63.1 | 81.2 |
| Lean-Agent-8B | 71.8 | 68.1 | 78.9 | 62.8 | 80.9 |
| Qwen3-14B (base) | 76.3 | 67.2 | 84.1 | 68.5 | 85.0 |
| Lean-Agent-14B | 75.9 | 72.5 | 83.7 | 68.1 | 84.6 |
The academic benchmark scores above are measured on merged F16 weights (pre-quantization), so they apply equally to every quant variant of a given model.
## Methodology
LeanBench v1 tests three core agent capabilities: tool calling format compliance and parameter extraction (200+ test cases), structured JSON/XML output generation (100+ test cases), and multi-step agentic reasoning with task decomposition (100+ test cases).
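A tool-call compliance check of the kind described above can be sketched as follows. The expected call shape (a JSON object with `name` and `arguments` fields) and the helper name are illustrative assumptions, not LeanBench's actual format:

```python
import json

def check_tool_call(raw: str, allowed_tools: set[str]) -> bool:
    """Validate one model emission: parseable JSON, a known tool
    name, and a dict of arguments. Shape is a hypothetical example,
    not the LeanBench schema."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(call, dict)
        and call.get("name") in allowed_tools
        and isinstance(call.get("arguments"), dict)
    )

tools = {"search", "calculator"}
# Well-formed call with a known tool passes:
print(check_tool_call('{"name": "search", "arguments": {"q": "llama.cpp"}}', tools))  # True
# Truncated JSON or an unknown tool fails:
print(check_tool_call('{"name": "search", "arguments": ', tools))  # False
```

Format compliance is scored per emission, so the Tool Call % column is simply the fraction of test cases for which a check like this passes.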
Academic benchmarks are run via `lm-evaluation-harness` on the merged (pre-quantization) model weights to establish a clean baseline, then re-run on quantized GGUF variants to measure quantization impact.
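Assuming the standard `lm-evaluation-harness` CLI, a baseline run over the five reported tasks might look like this (the model id and batch size are placeholders; exact flags vary by harness version, and recent versions require an extra confirmation flag to execute HumanEval's generated code):

```shell
# Evaluate merged F16 weights on the five tasks reported above.
# "org/lean-agent-8b-merged" is a placeholder model id.
lm_eval --model hf \
  --model_args pretrained=org/lean-agent-8b-merged,dtype=float16 \
  --tasks mmlu,humaneval,gsm8k,arc_challenge,hellaswag \
  --batch_size 8
```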
Hardware: all inference benchmarks run locally via llama.cpp on a consumer GPU. Reported speeds measure token generation only, excluding prompt processing.
All results are reproducible. Raw eval outputs will be published alongside each model release.