An inference runtime that runs massive open-weight MoE models on consumer GPUs by intelligently offloading experts across VRAM, RAM, and SSD. From Qwen3.5 to Arcee Trinity - 35B to 398B parameters.
Run 398B parameter MoE models on consumer GPUs. Only active experts load into VRAM - cold experts live in RAM and SSD, with speculative prefetching that hits 81% of the time.
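To illustrate the idea behind speculative prefetching, here is a minimal sketch using a bigram predictor ("which experts tend to fire after expert e?"). The class and method names are illustrative, not the runtime's real API, and the real predictor is presumably more sophisticated than a successor counter.

```python
# Illustrative sketch of speculative expert prefetching: predict the
# experts likely to fire next so their weights can be loaded into VRAM
# asynchronously while the current layer computes.
from collections import Counter, defaultdict

class Prefetcher:
    def __init__(self, top_k=4):
        self.top_k = top_k                   # how many experts to prefetch
        self.follows = defaultdict(Counter)  # expert -> Counter of successors
        self.prev = None

    def record(self, expert_id):
        """Observe that expert_id was activated for the current token."""
        if self.prev is not None:
            self.follows[self.prev][expert_id] += 1
        self.prev = expert_id

    def predict(self):
        """Return the experts most likely to fire next."""
        if self.prev is None:
            return []
        ranked = self.follows[self.prev].most_common(self.top_k)
        return [expert for expert, _ in ranked]

p = Prefetcher(top_k=2)
for e in [3, 7, 3, 7, 3, 9, 3]:
    p.record(e)
print(p.predict())  # after expert 3, predicts [7, 9]: 7 followed 3 twice, 9 once
```

A hit rate like the quoted 81% means most expert loads are already in flight before the router asks for them, hiding NVMe latency behind compute.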
.lmpack Model Format
File-per-expert packaging enables mmap-based memory management. The OS kernel handles caching automatically - hot experts stay in RAM, cold experts page in from NVMe.
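The mechanism can be sketched in a few lines. The file layout and names below are illustrative stand-ins for a .lmpack pack, but the mmap behavior shown is exactly what the kernel provides: a zero-copy view whose pages fault in from disk on first touch and stay in the page cache while hot.

```python
# Illustrative sketch: one file per expert, opened via mmap so the OS
# page cache (not the application) decides what stays resident in RAM.
import mmap, os, tempfile

# Stand-in for a single expert file inside a .lmpack directory.
pack_dir = tempfile.mkdtemp()
expert_path = os.path.join(pack_dir, "layer00.expert042.bin")
with open(expert_path, "wb") as f:
    f.write(b"\x00" * 4096)  # fake weight bytes, one page

with open(expert_path, "rb") as f:
    weights = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    first_page = weights[:16]  # reading faults the page in from NVMe
    print(len(weights))        # file is mapped, not copied, into memory
    weights.close()
```

Because each expert is its own file, evicting a cold expert costs nothing: the kernel simply drops its clean pages, and the next access re-reads from NVMe.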
Flash attention, multi-GPU pipeline parallelism, and async I/O. 93% VRAM cache hit rate. Output validated bit-identical against llama.cpp. OpenAI-compatible API included.
$ curl -sSf https://leanmodels.ai/install.sh | sh
$ lean pull lean-agent-35b
$ lean run lean-agent-35b

Single binary, 15 MB. No Python, no Docker, no cloud dependency.
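Since the API is OpenAI-compatible, any standard chat-completions client should work. The sketch below builds a request with only the standard library; the host, port, and path are assumptions based on the OpenAI convention, and the actual bind address of a local lean server may differ.

```python
# Illustrative sketch of calling the bundled OpenAI-compatible API.
# The URL below is an assumption (OpenAI-style default), not documented.
import json
import urllib.request

payload = json.dumps({
    "model": "lean-agent-35b",
    "messages": [{"role": "user", "content": "Hello!"}],
}).encode()

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",  # assumed local address
    data=payload,
    headers={"Content-Type": "application/json"},
)

# With `lean run lean-agent-35b` serving locally, sending the request
# returns a standard chat-completion body:
# resp = json.load(urllib.request.urlopen(req))
# print(resp["choices"][0]["message"]["content"])
```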
Three-tier memory hierarchy: VRAM → RAM → NVMe SSD
| Tier | VRAM | RAM | NVMe | Models |
|---|---|---|---|---|
| Minimal | 12 GB | 16 GB (32 GB for lean-coder-80b) | 1.8 TB | lean-agent-35b, lean-coder-80b |
| Prosumer | 24 GB | 32 GB (64 GB recommended) | 1.8 TB | lean-agent-122b |
| Enthusiast | 48 GB | 64 GB (128 GB recommended) | 1.8 TB | lean-reason-397b, lean-think-398b |
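The hierarchy in the table above behaves like a promotion/demotion cache. A minimal sketch, assuming LRU eviction in the top two tiers (the real runtime's policy is not specified here); capacities are in expert slots rather than bytes, and all names are illustrative:

```python
# Illustrative three-tier expert cache: VRAM (hot) -> RAM (warm) -> NVMe (cold).
from collections import OrderedDict

class TieredCache:
    def __init__(self, vram_slots, ram_slots):
        self.vram = OrderedDict()  # expert_id -> weights, LRU order
        self.ram = OrderedDict()
        self.vram_slots = vram_slots
        self.ram_slots = ram_slots

    def _read_nvme(self, expert_id):
        return f"weights[{expert_id}]"  # stand-in for an mmap'd expert file

    def get(self, expert_id):
        if expert_id in self.vram:            # VRAM hit
            self.vram.move_to_end(expert_id)
            return self.vram[expert_id]
        if expert_id in self.ram:             # RAM hit: promote to VRAM
            w = self.ram.pop(expert_id)
        else:                                 # miss: page in from NVMe
            w = self._read_nvme(expert_id)
        self.vram[expert_id] = w
        if len(self.vram) > self.vram_slots:  # demote coldest expert to RAM
            evicted, ew = self.vram.popitem(last=False)
            self.ram[evicted] = ew
            if len(self.ram) > self.ram_slots:
                self.ram.popitem(last=False)  # drop; NVMe copy remains
        return w

cache = TieredCache(vram_slots=2, ram_slots=2)
for e in [1, 2, 3, 1]:
    cache.get(e)
print(sorted(cache.vram), sorted(cache.ram))  # [1, 3] [2]
```

Note that dropping from RAM loses nothing: the expert's file is still on NVMe, so the worst case is one extra SSD read.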
Frontier open-weight MoE models rival proprietary ones, yet they activate only a fraction of their parameters per token. With expert offloading, a 242 GB model needs roughly 48 GB of VRAM for its active experts and cache, not 242 GB for the full weights. The barrier is fitting them in memory - that's an engineering problem, and we solve it.
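The arithmetic behind that claim can be sketched with a back-of-envelope calculation. The expert count, routing top-k, and expert share below are hypothetical round numbers, not the specs of any lean model:

```python
# Why a large MoE needs far less VRAM than its on-disk size:
# only top-k of the experts in each layer are active per token.
total_gb     = 242   # full weights on disk
n_experts    = 64    # experts per MoE layer (assumed)
top_k        = 4     # experts routed per token (assumed)
expert_share = 0.90  # fraction of weights living in experts (assumed)

dense_gb  = total_gb * (1 - expert_share)  # attention, embeddings, router
active_gb = total_gb * expert_share * top_k / n_experts
print(round(dense_gb + active_gb, 1))  # GB resident per token -> 37.8
```

The resident working set under these assumptions is well under the 48 GB tier; the remaining VRAM headroom goes to the KV cache and to keeping recently used and prefetched experts warm.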