Benchmarks

Offloading performance and model quality, measured on real hardware.

Offloading Performance

Token generation rates with full expert offloading on an RTX 3090 (24 GB VRAM, 64 GB RAM, NVMe SSD).

Model             Quant     Prefill        Decode           VRAM Cache Hit
lean-agent-35b    Q4_K_M    10-15 tok/s    6.7-7.6 tok/s    93.1%
lean-agent-122b   Q4_K_M    -              2.3 tok/s        -
lean-think-398b   Q4_K_M    testing in progress

All measurements on a single RTX 3090 with profile-guided preloading and speculative router prefetch enabled. Model load <1s. Preload throughput: 5.9 GB/s. CopyEngine prefetch hit rate: 81.3%.
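For intuition on how the cache hit rate and preload bandwidth combine into per-token cost, the back-of-envelope sketch below estimates the expert bytes fetched per decoded token. The hit rate and throughput come from the table above; the expert size, experts-per-token, and layer count are hypothetical placeholders, not measured lean-engine values.

```python
# Back-of-envelope model of how cache misses affect decode latency.
# Hit rate and preload bandwidth come from the table above; the expert
# size and counts below are HYPOTHETICAL placeholders.

hit_rate = 0.931          # VRAM cache hit rate (measured, table above)
preload_gbps = 5.9        # expert preload throughput in GB/s (measured)

experts_per_token = 8     # HYPOTHETICAL: active experts per MoE layer per token
moe_layers = 48           # HYPOTHETICAL: number of MoE layers
expert_bytes = 6e6        # HYPOTHETICAL: size of one quantized expert (~6 MB)

# Expected bytes that must be fetched from RAM/NVMe per decoded token.
miss_rate = 1.0 - hit_rate
bytes_per_token = miss_rate * experts_per_token * moe_layers * expert_bytes

# Extra time per token spent moving experts, if misses are served at
# preload bandwidth.
transfer_s = bytes_per_token / (preload_gbps * 1e9)

print(f"fetched per token: {bytes_per_token / 1e6:.1f} MB")
print(f"transfer overhead: {transfer_s * 1e3:.1f} ms/token")
```

With these placeholder values the transfer overhead lands in the tens of milliseconds per token, which is why the cache hit rate dominates decode speed once experts no longer fit in VRAM.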

Engine Features

Performance infrastructure built into the runtime.

Result          Feature                          Detail
93%             VRAM cache hit rate              LRU cache with profile-guided preloading (sketched below)
81%             Speculative prefetch hit rate    Router predicts next-layer experts ahead of computation
5.9 GB/s        Expert preload throughput        Async I/O via background thread pool
0.83s           Model load time                  Core weights into VRAM, experts lazy-loaded via mmap
Bit-identical   Output vs llama.cpp              Cross-validated on 10 diverse prompts, same GGUF weights
Multi-GPU       Pipeline parallelism             Layers split across GPUs, output bit-identical to single-GPU
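A minimal sketch of the caching and prefetch idea behind the first two rows, assuming a hypothetical ExpertCache type and a load_expert callable; this is illustrative only, not lean-engine internals.

```python
# Minimal sketch of an expert LRU cache with speculative prefetch.
# ExpertCache, load_expert, and the prefetch policy are assumptions,
# not the lean-engine implementation.

from collections import OrderedDict
from concurrent.futures import ThreadPoolExecutor

class ExpertCache:
    def __init__(self, capacity, load_expert):
        self.capacity = capacity            # max experts resident in VRAM
        self.load_expert = load_expert      # callable: expert_id -> weights
        self.cache = OrderedDict()          # expert_id -> weights, LRU order
        self.pool = ThreadPoolExecutor(max_workers=4)  # async preload threads
        self.pending = {}                   # expert_id -> in-flight Future
        self.hits = self.misses = 0

    def _insert(self, expert_id, weights):
        self.cache[expert_id] = weights
        self.cache.move_to_end(expert_id)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used

    def prefetch(self, predicted_ids):
        """Speculatively load experts the router predicts for the next layer."""
        for eid in predicted_ids:
            if eid not in self.cache and eid not in self.pending:
                self.pending[eid] = self.pool.submit(self.load_expert, eid)

    def get(self, expert_id):
        """Fetch an expert, counting hits against resident + prefetched experts."""
        if expert_id in self.cache:
            self.hits += 1
            self.cache.move_to_end(expert_id)
            return self.cache[expert_id]
        if expert_id in self.pending:        # prefetch already in flight
            self.hits += 1
            weights = self.pending.pop(expert_id).result()
        else:                                # cold miss: synchronous load
            self.misses += 1
            weights = self.load_expert(expert_id)
        self._insert(expert_id, weights)
        return weights
```

In this sketch the reported hit rate would be hits / (hits + misses), with a prefetched expert that arrives before it is needed counted as a hit; how lean-engine accounts for in-flight prefetches is an assumption here.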

Hardware Reference Configurations

All benchmarks run on local hardware. No cloud GPUs.

Tier         VRAM     RAM      NVMe      Target Models
Minimal      12 GB    16 GB    1.8 TB    lean-agent-35b, lean-coder-80b
Prosumer     24 GB    32 GB    1.8 TB    lean-agent-122b
Enthusiast   48 GB    64 GB    1.8 TB    lean-reason-397b, lean-think-398b
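To see why every tier relies on expert offloading, the sketch below estimates quantized model sizes from parameter counts inferred from the model names, using a rough ~4.8 bits per weight for Q4_K_M; the exact lmpack output sizes will differ.

```python
# Rough sizing check for the hardware tiers above. Q4_K_M averages roughly
# 4.8 bits per weight; parameter counts are inferred from the model names,
# so treat these as approximations, not exact lmpack file sizes.

def q4_k_m_size_gb(params_billion, bits_per_weight=4.8):
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for name, params_b in [("lean-agent-35b", 35),
                       ("lean-agent-122b", 122),
                       ("lean-think-398b", 398)]:
    print(f"{name}: ~{q4_k_m_size_gb(params_b):.0f} GB at Q4_K_M")
```

Even the 35b model's quantized weights (~21 GB by this estimate) exceed the Minimal tier's 12 GB of VRAM, so on every tier the bulk of the experts must live in RAM or on NVMe and be streamed in on demand.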

Model Quality Benchmarks

Standard evals to confirm no capability regression from the lmpack pipeline.

Results coming soon.

MMLU, HumanEval, BFCL, IFEval, GSM8K, and MATH, run via lm-evaluation-harness against lean serve.

Methodology

Offloading benchmarks measure tokens per second (prefill and decode), VRAM cache hit rate, prefetch hit rate, and expert preload throughput. All results are produced by lean bench.
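The sketch below shows how those metrics can be derived from raw counters; the field names are illustrative and are not the lean bench output format.

```python
# Illustrative derivation of the reported metrics from raw run counters.
# Field names are assumptions, not the lean bench schema.

from dataclasses import dataclass

@dataclass
class RunCounters:
    prompt_tokens: int
    generated_tokens: int
    prefill_seconds: float
    decode_seconds: float
    cache_hits: int
    cache_misses: int
    prefetch_hits: int
    prefetch_issued: int
    preloaded_bytes: int
    preload_seconds: float

def report(c: RunCounters) -> dict:
    return {
        "prefill_tok_s": c.prompt_tokens / c.prefill_seconds,
        "decode_tok_s": c.generated_tokens / c.decode_seconds,
        "vram_cache_hit": c.cache_hits / (c.cache_hits + c.cache_misses),
        "prefetch_hit": c.prefetch_hits / c.prefetch_issued,
        "preload_gb_s": c.preloaded_bytes / 1e9 / c.preload_seconds,
    }
```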

Cross-validation compares lean-engine output against llama.cpp on the same GGUF weights with greedy decoding. 4 of 10 prompts match token-for-token; the remaining 6 diverge only in free-form thinking text, consistent with expected floating-point precision differences.
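The comparison can be reproduced in spirit by querying two OpenAI-compatible endpoints with greedy decoding and diffing the completions, as in the sketch below; the ports, model name, and text-level diff are assumptions, not the exact cross-validation harness.

```python
# Hedged sketch of the cross-validation idea: run the same prompt greedily
# against two OpenAI-compatible endpoints (lean serve and llama.cpp's
# llama-server) and compare the completions. Ports and the served model
# name are assumptions.

from openai import OpenAI

PROMPTS = ["Explain what a mixture-of-experts layer is in one paragraph."]

lean = OpenAI(base_url="http://localhost:8080/v1", api_key="none")   # assumed port
llama = OpenAI(base_url="http://localhost:8081/v1", api_key="none")  # assumed port

def greedy(client, prompt):
    resp = client.completions.create(
        model="lean-agent-35b",   # assumed served model name on both servers
        prompt=prompt,
        max_tokens=256,
        temperature=0.0,          # greedy decoding
    )
    return resp.choices[0].text

for p in PROMPTS:
    a, b = greedy(lean, p), greedy(llama, p)
    print("match" if a == b else "diverged", "|", p)
```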

Quality benchmarks will be run via lm-evaluation-harness against the lean serve OpenAI-compatible API. Models must match their base models' scores before release.
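A hedged example of such a run, shelling out to the lm-evaluation-harness CLI with its local-completions backend; the endpoint URL, served model name, and task subset are assumptions, and additional model_args (for example a tokenizer) may be required depending on the harness version.

```python
# Sketch of a quality run against an OpenAI-compatible endpoint via
# lm-evaluation-harness. URL, model name, and tasks are assumptions;
# flags may differ across lm-eval versions.

import subprocess

cmd = [
    "lm_eval",
    "--model", "local-completions",                   # OpenAI-compatible backend
    "--model_args",
    "model=lean-agent-35b,"                           # assumed served model name
    "base_url=http://localhost:8080/v1/completions",  # assumed lean serve endpoint
    "--tasks", "mmlu,gsm8k,ifeval",                   # subset of the planned evals
    "--batch_size", "1",
]
subprocess.run(cmd, check=True)
```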

Hardware: All benchmarks run locally on reference configurations. No cloud GPUs. Results are reproducible.