Benchmarks
Offloading performance and model quality, measured on real hardware.
Offloading Performance
Token generation rates with full expert offloading on an RTX 3090 (24 GB VRAM, 64 GB RAM, NVMe SSD).
| Model | Quant | Prefill | Decode | VRAM Cache Hit |
|---|---|---|---|---|
| lean-agent-35b | Q4_K_M | 10-15 tok/s | 6.7-7.6 tok/s | 93.1% |
| lean-agent-122b | Q4_K_M | - | 2.3 tok/s | - |
| lean-think-398b | Q4_K_M | Testing in progress | - | - |
All measurements were taken on a single RTX 3090 with profile-guided preloading and speculative router prefetch enabled. Model load time: <1 s. Preload throughput: 5.9 GB/s. CopyEngine prefetch hit rate: 81.3%.
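The prefill and decode columns separate prompt processing from token generation. Below is a minimal sketch of how that split can be measured; `Engine` and its `prefill`, `decode_step`, `tokenize`, and `eos_token` members are hypothetical placeholders, not the lean-engine or lean bench API.

```python
import time

def bench(engine, prompt: str, max_new_tokens: int = 256):
    """Measure prefill and decode throughput separately (hypothetical engine API)."""
    t0 = time.perf_counter()
    state = engine.prefill(prompt)           # process the whole prompt once
    t1 = time.perf_counter()
    generated = 0
    for _ in range(max_new_tokens):
        token = engine.decode_step(state)    # emit one token per step
        generated += 1
        if token == engine.eos_token:
            break
    t2 = time.perf_counter()
    prefill_tps = len(engine.tokenize(prompt)) / (t1 - t0)  # prompt tokens / prefill time
    decode_tps = generated / (t2 - t1)                      # new tokens / decode time
    return prefill_tps, decode_tps
```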
Engine Features
Performance infrastructure built into the runtime.
| Metric | Result | Details |
|---|---|---|
| VRAM cache hit rate | 93% | LRU cache with profile-guided preloading |
| Speculative prefetch hit rate | 81% | Router predicts next-layer experts ahead of computation |
| Expert preload throughput | 5.9 GB/s | Async I/O via background thread pool |
| Model load time | 0.83 s | Core weights into VRAM; experts lazy-loaded via mmap |
| Output parity vs llama.cpp | Bit-identical | Cross-validated on 10 diverse prompts, same GGUF weights |
| Multi-GPU | Pipeline parallelism | Layers split across GPUs; output bit-identical to single-GPU |
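To illustrate the mechanism behind the first two rows, here is a minimal sketch of an LRU expert cache with profile-guided preloading and the hit/miss accounting that yields a cache hit rate. All names are hypothetical placeholders, not the lean-engine implementation.

```python
from collections import OrderedDict

class ExpertCache:
    def __init__(self, capacity: int, load_expert):
        self.capacity = capacity          # max experts resident in VRAM
        self.load_expert = load_expert    # callback: fetch weights from RAM/NVMe
        self.cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def preload(self, profile: dict):
        """Profile-guided preloading: warm the cache with the experts
        activated most often in a prior profiling run."""
        hottest = sorted(profile, key=profile.get, reverse=True)[: self.capacity]
        for expert_id in hottest:
            self.cache[expert_id] = self.load_expert(expert_id)

    def get(self, expert_id):
        if expert_id in self.cache:
            self.hits += 1
            self.cache.move_to_end(expert_id)    # mark most-recently-used
            return self.cache[expert_id]
        self.misses += 1
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)       # evict least-recently-used
        weights = self.load_expert(expert_id)
        self.cache[expert_id] = weights
        return weights

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```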
Hardware Reference Configurations
All benchmarks run on local hardware. No cloud GPUs.
| Tier | VRAM | RAM | NVMe | Target Models |
|---|---|---|---|---|
| Minimal | 12 GB | 16 GB | 1.8 TB | lean-agent-35b, lean-coder-80b |
| Prosumer | 24 GB | 32 GB | 1.8 TB | lean-agent-122b |
| Enthusiast | 48 GB | 64 GB | 1.8 TB | lean-reason-397b, lean-think-398b |
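These tiers work because only a fraction of each model must stay resident in VRAM. A back-of-the-envelope sizing sketch, assuming Q4_K_M averages roughly 4.8 bits per weight (a common rule of thumb, not a spec; real GGUF files also carry metadata and some non-quantized tensors):

```python
def q4km_gb(params_billions: float, bits_per_weight: float = 4.8) -> float:
    """Approximate on-disk size of a Q4_K_M model in gigabytes (rough estimate)."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for name, params in [("lean-agent-35b", 35), ("lean-agent-122b", 122), ("lean-think-398b", 398)]:
    # Each total exceeds its tier's VRAM budget, hence expert offload to RAM/NVMe.
    print(f"{name}: ~{q4km_gb(params):.0f} GB")
```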
Model Quality Benchmarks
Standard evals to confirm no capability regression from the lmpack pipeline.
Results coming soon.
MMLU, HumanEval, BFCL, IFEval, GSM8K, and MATH, run via lm-evaluation-harness against lean serve.
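Until those numbers land, such a run can be reproduced against any OpenAI-compatible endpoint with the harness's Python API. A minimal sketch, assuming lean serve is listening on localhost:8080 and exposes /v1/completions; the model name, port, and task list are placeholders, and some tasks may additionally need tokenizer settings in model_args:

```python
import lm_eval

# Evaluate an OpenAI-compatible endpoint with lm-evaluation-harness.
# base_url and model name below are assumptions; point them at your server.
results = lm_eval.simple_evaluate(
    model="local-completions",
    model_args=(
        "model=lean-agent-35b,"
        "base_url=http://localhost:8080/v1/completions"
    ),
    tasks=["gsm8k", "ifeval"],
)
print(results["results"])
```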
Methodology
Offloading benchmarks measure tok/s, VRAM cache hit rate, prefetch hit rate, and expert preload throughput. All results come from lean bench.
Cross-validation compares lean-engine output against llama.cpp on the same GGUF weights with greedy decoding. 4 of 10 prompts match token-for-token; the other 6 diverge only in free-form thinking text, due to expected floating-point precision differences.
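The comparison itself is straightforward to reproduce. A sketch, where `generate_lean` and `generate_llamacpp` are hypothetical wrappers that drive each engine at temperature 0 and return token-id lists:

```python
def first_divergence(generate_lean, generate_llamacpp, prompt: str, max_tokens: int = 512):
    """Return the index of the first token where two greedy decodes disagree,
    or None if they match token-for-token (hypothetical engine wrappers)."""
    a = generate_lean(prompt, max_tokens)
    b = generate_llamacpp(prompt, max_tokens)
    for i, (ta, tb) in enumerate(zip(a, b)):
        if ta != tb:
            return i
    # Same prefix but different lengths counts as divergence at the shorter end.
    return None if len(a) == len(b) else min(len(a), len(b))
```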
Quality benchmarks will be run via lm-evaluation-harness against the lean serve OpenAI-compatible API. Models must match base-model scores before release.
Hardware: All benchmarks run locally on reference configurations. No cloud GPUs. Results are reproducible.