Benchmarks

Lean models vs. base Qwen3 on agent-specific tasks, measured by LeanBench v1.

Agent Task Performance

Tool-calling accuracy, JSON validity, and multi-step orchestration.

| Model | Tool Call % | JSON Valid % | Multi-Step % | Speed (tok/s) | VRAM |
|---|---|---|---|---|---|
| Qwen3-8B (base, Q5_K_M) | 72% | 81% | 65% | 45 | 7.8 GB |
| Lean-Agent-8B (Q8_0) | 94% | 97% | 88% | 38 | 8.2 GB |
| Qwen3-14B (base, Q4_K_M) | 79% | 85% | 71% | 28 | 10.8 GB |
| Lean-Agent-14B (Q8_0) | 96% | 98% | 91% | 22 | 14.8 GB |
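The headline percentages are simple pass rates over the test suite. A minimal sketch of that aggregation, assuming hypothetical per-case pass/fail records with category labels (the names here are illustrative, not LeanBench internals):

```python
from dataclasses import dataclass

@dataclass
class CaseResult:
    category: str   # e.g. "tool_call", "json", "multi_step" -- hypothetical labels
    passed: bool

def pass_rates(results: list[CaseResult]) -> dict[str, float]:
    """Aggregate per-case pass/fail records into percentage scores per category."""
    totals: dict[str, int] = {}
    passes: dict[str, int] = {}
    for r in results:
        totals[r.category] = totals.get(r.category, 0) + 1
        passes[r.category] = passes.get(r.category, 0) + int(r.passed)
    return {c: 100.0 * passes[c] / totals[c] for c in totals}

results = [CaseResult("tool_call", True), CaseResult("tool_call", False),
           CaseResult("json", True)]
print(pass_rates(results))  # {'tool_call': 50.0, 'json': 100.0}
```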

Academic Benchmarks

Standard evals confirming no catastrophic forgetting from distillation. Run on pre-quantization merged weights.

| Model | MMLU | HumanEval | GSM8K | ARC-Challenge | HellaSwag |
|---|---|---|---|---|---|
| Qwen3-8B (base) | 72.1 | 62.3 | 79.4 | 63.1 | 81.2 |
| Lean-Agent-8B | 71.8 | 68.1 | 78.9 | 62.8 | 80.9 |
| Qwen3-14B (base) | 76.3 | 67.2 | 84.1 | 68.5 | 85.0 |
| Lean-Agent-14B | 75.9 | 72.5 | 83.7 | 68.1 | 84.6 |

Academic benchmarks are run on merged F16 weights (pre-quantization) so they are the same across all quant variants.
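The "no catastrophic forgetting" claim can be checked directly from the 8B rows above; a quick script computing the base-to-Lean delta per benchmark (scores copied from the table):

```python
# Scores from the academic benchmark table (8B rows).
base_8b = {"MMLU": 72.1, "HumanEval": 62.3, "GSM8K": 79.4,
           "ARC-Challenge": 63.1, "HellaSwag": 81.2}
lean_8b = {"MMLU": 71.8, "HumanEval": 68.1, "GSM8K": 78.9,
           "ARC-Challenge": 62.8, "HellaSwag": 80.9}

# Positive delta = Lean model improved over base; negative = regression.
deltas = {k: round(lean_8b[k] - base_8b[k], 1) for k in base_8b}
print(deltas)
# {'MMLU': -0.3, 'HumanEval': 5.8, 'GSM8K': -0.5, 'ARC-Challenge': -0.3, 'HellaSwag': -0.3}
```

Every regression stays within 0.5 points, while HumanEval actually improves, consistent with the distillation targeting agentic and code-adjacent behavior.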

Methodology

LeanBench v1 tests three core agent capabilities: tool calling format compliance and parameter extraction (200+ test cases), structured JSON/XML output generation (100+ test cases), and multi-step agentic reasoning with task decomposition (100+ test cases).
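The structured-output metric, for instance, reduces to parsing the model's raw output and checking for required keys. A minimal sketch, assuming a hypothetical per-case `required_keys` list (not the actual LeanBench harness):

```python
import json

def is_valid_json_output(raw: str, required_keys: list[str]) -> bool:
    """Return True if raw parses as a JSON object containing every required key."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and all(k in obj for k in required_keys)

print(is_valid_json_output('{"tool": "search", "query": "llama.cpp"}',
                           ["tool", "query"]))  # True
print(is_valid_json_output('{"tool": "search",', ["tool"]))  # False (truncated)
```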

Academic benchmarks are run via lm-evaluation-harness on the merged (pre-quantization) model weights to establish a clean baseline, then re-run on quantized GGUF variants to measure quantization impact.

Hardware: all inference benchmarks run locally via llama.cpp on a consumer GPU. Reported speeds exclude prompt processing and measure generation throughput only.
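To make the generation-only convention concrete, here is how such a figure could be derived from timestamps (a sketch with hypothetical names, not llama.cpp's own reporting code):

```python
def generation_speed(n_generated_tokens: int, gen_start: float, gen_end: float) -> float:
    """Tokens per second over the generation phase only; prompt processing
    happens before gen_start and is excluded by construction."""
    return n_generated_tokens / (gen_end - gen_start)

# Example: 380 tokens generated over 10 seconds of wall-clock generation time.
print(round(generation_speed(380, 0.0, 10.0), 1))  # 38.0
```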

All results are reproducible. Raw eval outputs will be published alongside each model release.