An inference runtime that runs massive open-weight MoE models on consumer GPUs by intelligently offloading experts across VRAM, RAM, and SSD.
Run 397B-parameter MoE models on a single GPU. Only active experts load into VRAM — cold experts live in RAM or on SSD and are loaded on demand via mmap.
.lmpack Model Format

File-per-expert packaging enables mmap-based memory management. The OS kernel handles caching automatically — hot experts stay in RAM, cold experts page in from NVMe.
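The core trick is that mapping a file does not read it: pages are only pulled off disk when touched, and the kernel's page cache keeps recently used pages in RAM. A minimal sketch in Python — the file path and on-disk layout here are assumptions for illustration, not the real .lmpack spec:

```python
import mmap
import os

def load_expert(path):
    """Map one expert's weight file read-only; no data is read until touched.

    The OS page cache decides which pages stay resident in RAM and which
    remain on NVMe, so "loading" a cold expert is just a map call.
    (Illustrative sketch -- not the actual lean runtime API.)
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        # ACCESS_READ gives a portable read-only mapping.
        mm = mmap.mmap(fd, size, access=mmap.ACCESS_READ)
    finally:
        os.close(fd)  # the mapping remains valid after the fd is closed
    return mm
```

Because each expert lives in its own file, evicting a cold expert costs nothing explicit — the kernel simply reclaims its pages under memory pressure.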
From 35B to 397B total parameters. None fit on a single GPU — that's the point. The runtime handles the memory hierarchy so you don't have to.
$ curl -sSf https://leanmodels.ai/install.sh | sh
$ lean pull lean-agent-35b
$ lean run lean-agent-35b

Single binary. No Python, no Docker, no cloud dependency.
Three-tier memory hierarchy: VRAM → RAM → NVMe SSD
| Tier | VRAM | RAM | NVMe | Max Model |
|---|---|---|---|---|
| Minimal | 12 GB | 32 GB | 1 TB | lean-agent-35b (35B) |
| Prosumer | 24 GB | 64 GB | 2 TB | lean-agent-122b (122B) |
| Enthusiast | 48 GB | 128 GB | 4 TB | lean-reason-397b (397B) |
Frontier open-weight MoE models rival proprietary ones, yet activate only a fraction of their parameters per token. The barrier to frontier-level local performance is fitting them in memory. That's an engineering problem — and we solve it.
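To make "a fraction of their parameters per token" concrete, here is the arithmetic with purely hypothetical numbers (the expert counts and shared-weight fraction below are assumptions for illustration, not the configuration of any model listed above):

```python
# Hypothetical MoE configuration -- all three knobs are illustrative assumptions.
total_params = 397e9
experts_per_layer = 64    # assumed total experts per MoE layer
active_per_token = 8      # assumed top-k routing
shared_fraction = 0.10    # assumed share of dense weights (attention, embeddings)

expert_params = total_params * (1 - shared_fraction)
active = (total_params * shared_fraction
          + expert_params * active_per_token / experts_per_layer)
print(f"~{active / 1e9:.0f}B active parameters per token")
```

Under these assumptions only about a fifth of the weights are touched per token — which is why the active set can fit in VRAM while the rest sleeps in RAM and on NVMe.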