Run Frontier-Scale AI on Your Hardware

An inference runtime that runs massive open-weight MoE models on consumer GPUs by intelligently offloading experts across VRAM, RAM, and SSD. Supports models from Qwen3.5 to Arcee Trinity, from 35B to 398B parameters.


Expert Offloading Engine

Run 398B-parameter MoE models on consumer GPUs. Only active experts load into VRAM; cold experts live in RAM and SSD, with speculative prefetching that achieves an 81% hit rate.
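The runtime's actual eviction and prediction logic isn't shown on this page, but the core idea can be sketched as an LRU cache of experts plus a predictor-driven prefetch step. All names, the capacity, and the prediction interface below are illustrative, not the real implementation:

```python
from collections import OrderedDict

class ExpertCache:
    """Toy sketch of VRAM-resident expert caching with speculative prefetch.
    The capacity and the source of 'predicted_ids' are illustrative."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = OrderedDict()  # expert_id -> weights, in LRU order
        self.hits = self.misses = 0

    def fetch(self, expert_id, load_fn):
        """Return an expert's weights, loading from RAM/SSD on a miss."""
        if expert_id in self.resident:
            self.hits += 1
            self.resident.move_to_end(expert_id)   # mark most-recently-used
        else:
            self.misses += 1
            self._admit(expert_id, load_fn(expert_id))
        return self.resident[expert_id]

    def prefetch(self, predicted_ids, load_fn):
        """Speculatively pull experts predicted for upcoming tokens."""
        for eid in predicted_ids:
            if eid not in self.resident:
                self._admit(eid, load_fn(eid))

    def _admit(self, expert_id, weights):
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)      # evict least-recently-used
        self.resident[expert_id] = weights
```

A prefetch that lands before the router actually selects the expert turns what would have been a slow SSD read into a cache hit, which is where the quoted hit rate comes from.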

.lmpack Model Format

File-per-expert packaging enables mmap-based memory management. The OS kernel handles caching automatically: hot experts stay in RAM, cold experts page in from NVMe.
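With one file per expert, "loading" an expert is just memory-mapping its file and letting the kernel's page cache do the tiering. A minimal sketch, assuming a flat one-weights-file-per-expert layout (the real .lmpack structure may differ):

```python
import mmap
import os

def map_expert(path):
    """Memory-map one expert's weight file read-only.

    Pages fault in from NVMe on first access; the kernel's page cache
    keeps hot pages in RAM and drops cold ones under memory pressure,
    so untouched experts cost no physical memory.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        return mmap.mmap(fd, size, access=mmap.ACCESS_READ)
    finally:
        os.close(fd)  # the mapping remains valid after closing the fd
```

The design choice here is that the runtime never writes its own cache-eviction code for the RAM tier: `mmap` delegates that to the same LRU-ish page reclaim the kernel already does well.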

Built for Performance

Flash attention, multi-GPU pipeline parallelism, and async I/O. 93% VRAM cache hit rate. Output validated bit-identical against llama.cpp. OpenAI-compatible API included.
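Because the server speaks the OpenAI wire format, any existing OpenAI client should work against it once `lean run` is serving. A sketch using only the standard library; the port and base path are assumptions, not documented values:

```python
import json
from urllib import request

def chat_request(prompt, model="lean-agent-35b",
                 base="http://localhost:8080/v1"):
    """Build an OpenAI-style chat-completions request for the local server.
    The base URL (host, port, /v1 prefix) is a guess for illustration."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return request.Request(
        f"{base}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# With the server running locally:
# with request.urlopen(chat_request("Hello")) as resp:
#     reply = json.load(resp)["choices"][0]["message"]["content"]
```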

Quick Start

$ curl -sSf https://leanmodels.ai/install.sh | sh
$ lean pull lean-agent-35b
$ lean run lean-agent-35b

Single binary, 15 MB. No Python, no Docker, no cloud dependency.

Runs on Consumer Hardware

Three-tier memory hierarchy: VRAM → RAM → NVMe SSD

| Tier       | VRAM  | RAM                               | NVMe   | Models                           |
|------------|-------|-----------------------------------|--------|----------------------------------|
| Minimal    | 12 GB | 16 GB (32 GB for lean-coder-80b)  | 1.8 TB | lean-agent-35b, lean-coder-80b   |
| Prosumer   | 24 GB | 32 GB (64 GB recommended)         | 1.8 TB | lean-agent-122b                  |
| Enthusiast | 48 GB | 64 GB (128 GB recommended)        | 1.8 TB | lean-reason-397b, lean-think-398b |

The intelligence is already in open-weight models

Frontier MoE models rival proprietary ones, but they only activate a fraction of their parameters per token. A 242 GB model needs 48 GB of VRAM, not 242 GB. The barrier is fitting them in memory; that's an engineering problem, and we solve it.
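The gap between total size and required VRAM follows from the routing math: only the dense layers plus the experts actually routed to need to be resident. A back-of-envelope sketch, where the expert count, active-experts-per-token, and dense fraction are illustrative numbers rather than the specs of any particular model:

```python
def vram_working_set(total_gb, n_experts, active_per_token, dense_frac=0.15):
    """Rough per-token working set for an MoE model.

    Splits the weights into an always-resident dense part (attention,
    embeddings, router) and an expert part of which only the routed
    fraction is needed per token. All fractions here are illustrative.
    """
    expert_gb = total_gb * (1 - dense_frac)            # weight mass in experts
    active_gb = expert_gb * active_per_token / n_experts
    return total_gb * dense_frac + active_gb

# e.g. a 242 GB model with 128 experts, 8 active per token:
# 0.15 * 242 + 0.85 * 242 * 8 / 128  ->  roughly 49 GB resident
```

In practice the VRAM cache holds more than one token's experts to keep hit rates high, so this is a lower bound on a comfortable working set, not an exact requirement; it just shows why the 48 GB figure is the right order of magnitude for a 242 GB model.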