Run Frontier-Scale AI
on Your Hardware

An inference runtime that runs massive open-weight MoE models on consumer GPUs by intelligently offloading experts across VRAM, RAM, and SSD.

Get Started View Models

Expert Offloading Engine

Run 397B-parameter MoE models on a single GPU. Only active experts load into VRAM; cold experts live in RAM and on SSD, loaded on demand via mmap.
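As a sketch of the idea (all names here are hypothetical, not the runtime's actual API), demand-loading cold experts through mmap into a small bounded resident cache looks roughly like this:

```python
import mmap
import os
from collections import OrderedDict

class ExpertCache:
    """Illustrative sketch: a bounded "VRAM" tier of resident experts,
    backed by one file per expert on disk. Cold experts are paged in
    via mmap on first use; least-recently-used experts are evicted."""

    def __init__(self, expert_dir, capacity=4):
        self.expert_dir = expert_dir
        self.capacity = capacity          # how many experts fit in the hot tier
        self.resident = OrderedDict()     # expert_id -> weights, in LRU order

    def get(self, expert_id):
        if expert_id in self.resident:    # hot path: already resident
            self.resident.move_to_end(expert_id)
            return self.resident[expert_id]
        # Cold path: page the expert in from its file via mmap.
        path = os.path.join(self.expert_dir, f"expert_{expert_id}.bin")
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
                weights = bytes(m)        # copy into the resident tier
        self.resident[expert_id] = weights
        if len(self.resident) > self.capacity:
            self.resident.popitem(last=False)  # evict least recently used
        return weights
```

The real engine places weights in GPU memory rather than Python bytes, but the cache discipline is the same: a small hot set, with misses served from the file-backed tiers below.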

.lmpack Model Format

File-per-expert packaging enables mmap-based memory management. The OS kernel handles caching automatically — hot experts stay in RAM, cold experts page in from NVMe.
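To illustrate the layout (the actual .lmpack format is not specified here, so the file names and index schema below are assumptions), a file-per-expert package can be written as one weight file per expert plus a small index:

```python
import json
import os

def pack_experts(model_dir, experts):
    """Illustrative file-per-expert packaging: each expert's weights get
    their own file, so the OS page cache can keep hot experts in RAM
    while cold ones stay on disk. `experts` maps expert_id -> bytes."""
    os.makedirs(model_dir, exist_ok=True)
    index = {}
    for expert_id, weights in experts.items():
        name = f"expert_{expert_id}.bin"
        with open(os.path.join(model_dir, name), "wb") as f:
            f.write(weights)
        index[expert_id] = {"file": name, "bytes": len(weights)}
    # A small manifest lets the runtime locate experts without scanning.
    with open(os.path.join(model_dir, "index.json"), "w") as f:
        json.dump(index, f)
    return index
```

Because each expert is a separate file, mmap-ing one expert never forces its neighbors into memory; eviction and readahead are left entirely to the kernel.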

Models Too Large for Your GPU

From 35B to 397B total parameters. None fit on a single GPU — that's the point. The runtime handles the memory hierarchy so you don't have to.

Quick Start

$ curl -sSf https://leanmodels.ai/install.sh | sh
$ lean pull lean-agent-35b
$ lean run lean-agent-35b

Single binary. No Python, no Docker, no cloud dependency.

Runs on Consumer Hardware

Three-tier memory hierarchy: VRAM → RAM → NVMe SSD

Tier        VRAM    RAM     NVMe    Max Model
Minimal     12 GB   32 GB   1 TB    lean-agent-35b (35B)
Prosumer    24 GB   64 GB   2 TB    lean-agent-122b (122B)
Enthusiast  48 GB   128 GB  4 TB    lean-reason-397b (397B)
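For rough sizing (assuming 4-bit quantized weights and about 20% overhead for activations and KV cache; these are illustrative back-of-envelope numbers, not the runtime's measured requirements), the VRAM footprint of a model's active parameter set can be estimated:

```python
def vram_needed_gb(active_params_b, bits_per_weight=4, overhead=1.2):
    """Rough VRAM estimate for the active expert set.
    active_params_b: parameters activated per token, in billions.
    Assumes 4-bit weights and a 20% overhead factor (illustrative only)."""
    weight_bytes = active_params_b * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 2**30
```

Because MoE models activate only a fraction of their total parameters per token, this active-set estimate, not the total parameter count, is what has to fit in each tier's VRAM budget.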

The intelligence is already in open-weight models

Frontier MoE models rival proprietary ones, but they activate only a fraction of their parameters per token. The barrier to frontier-level local performance is fitting them in memory. That's an engineering problem, and we solve it.