Models

MoE models from 35B to 398B parameters. Run models larger than your VRAM - expert offloading handles the rest.

Featured Model

lean-think-398b

Arcee Trinity-Large-Thinking - 398B parameters, ~13B active per token. Chain-of-thought reasoning with agentic RL post-training. Apache 2.0 license.

Paid · New

Q4_K_M download: 241.9 GB
Min VRAM: 48 GB
Active per token: ~13B
Architecture: afmoe

A 242 GB model running on 48 GB VRAM. 256 experts per MoE layer, 4 active + 1 shared per token. Interleaved sliding window + global attention. The model you can't run without expert offloading.

lean-agent-35b

General-purpose agent - tool calling, structured output, multi-step reasoning

Free

Total params: 35B
Active per token: 3B
Base model: Qwen3.5-35B-A3B
Architecture: GDN hybrid MoE
Min VRAM: 12 GB

GGUF download sizes

Q3_K_M: 16.3 GB
Q4_K_M: 21.4 GB
Q5_K_M: 25.0 GB
Q6_K: 30.0 GB
Q8_0: 36.9 GB

The entry point. A 21 GB model (Q4_K_M) that runs on 12 GB VRAM - expert offloading handles the rest. The Qwen3.5 GDN hybrid architecture outperforms last-generation models 7× its size. 6.7-7.6 tok/s decode on an RTX 3090.

$ lean pull lean-agent-35b

lean-coder-80b

Code generation - debugging, refactoring, code review

Free

Total params: 80B
Active per token: 3B
Base model: Qwen3-Coder-Next
Architecture: MoE (512 experts)
Min VRAM: 12 GB

GGUF download sizes

Q3_K_M: 36.7 GB
Q4_K_M: 48.7 GB
Q5_K_M: 57.0 GB
Q6_K: 65.8 GB
Q8_0: 84.8 GB

Code-specialized. 80B total with 512 experts, only 3B active per token. A 48.7 GB model (Q4_K_M) that runs on 12 GB VRAM. Tuned for code generation, debugging, and software engineering workflows.

$ lean pull lean-coder-80b

lean-agent-122b

Advanced agent - complex orchestration, long-context workflows

Paid

Total params: 122B
Active per token: 10B
Base model: Qwen3.5-122B-A10B
Architecture: GDN hybrid MoE
Min VRAM: 24 GB

GGUF download sizes

Q3_K_M: 56.6 GB
Q4_K_M: 75.0 GB
Q5_K_M: 87.8 GB
Q6_K: 105.7 GB
Q8_0: 129.9 GB

A 75 GB model (Q4_K_M) that runs on 24 GB VRAM. 256 experts per layer with 10B active per token - massive knowledge base with efficient per-token compute. 2.3 tok/s decode on an RTX 3090.

$ lean pull lean-agent-122b

lean-reason-397b

Frontier-scale - deep reasoning, complex analysis, research

Paid

Total params: 397B
Active per token: 17B
Base model: Qwen3.5-397B-A17B
Architecture: GDN hybrid MoE
Min VRAM: 48 GB

GGUF download sizes

Q3_K_M: 177.4 GB
Q4_K_M: 244.1 GB
Q5_K_M: 293.7 GB
Q6_K: 326.6 GB
Q8_0: 421.5 GB

Frontier-scale reasoning. 397B total parameters with 17B active per token deliver state-of-the-art capability that runs entirely on your hardware. A 244 GB model (Q4_K_M) that runs on 48 GB VRAM.

$ lean pull lean-reason-397b

lean-think-398b

Extended reasoning - chain-of-thought, agentic tasks, deep analysis

Paid · New

Total params: 398B
Active per token: ~13B
Base model: Arcee Trinity-Large-Thinking
Architecture: afmoe (SWA + global)
Min VRAM: 48 GB

GGUF download sizes

Q3_K_M: 181.4 GB
Q4_K_M: 241.9 GB
Q5_K_M: 283.6 GB
Q6_K: 343.2 GB
Q8_0: 423.7 GB

A 242 GB model (Q4_K_M) that runs on 48 GB VRAM. 256 experts per MoE layer with 4 active + 1 shared per token. Interleaved sliding window + global attention architecture. Chain-of-thought reasoning with agentic RL post-training. Apache 2.0 license. This is the model you simply can't run without expert offloading.
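
A quick back-of-envelope, using only the figures above, shows why the per-token working set fits in 48 GB. This is an illustrative Python calculation, not a memory spec; the real hot set also includes attention weights, embeddings, and KV cache.

total_params    = 398e9
file_bytes      = 241.9e9                     # Q4_K_M download size
bytes_per_param = file_bytes / total_params   # ≈ 0.61 bytes (~4.9 bits) per weight
active_params   = 13e9                        # ~13B routed + shared params per token
print(f"{active_params * bytes_per_param / 1e9:.1f} GB")  # ≈ 7.9 GB of expert weights touched per token

Roughly 8 GB of expert weights are touched per token, which is why 48 GB of VRAM is enough once everything else is paged in on demand.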

$ lean pull lean-think-398b

How offloading works

MoE models activate only a fraction of their parameters per token. The lean runtime keeps that hot path in VRAM and transparently pages in the rest from RAM and NVMe as needed.
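
In pseudocode, the idea looks roughly like the sketch below. It is Python for readability only (lean ships as a single native binary), and names such as ExpertCache and fetch_from_storage are illustrative, not lean's API.

from collections import OrderedDict

class ExpertCache:
    """Keep the most recently used experts resident in VRAM (simple LRU)."""
    def __init__(self, vram_budget_bytes, expert_bytes):
        self.capacity = vram_budget_bytes // expert_bytes   # how many experts fit in VRAM
        self.resident = OrderedDict()                       # expert_id -> weights

    def get(self, expert_id, fetch_from_storage):
        if expert_id in self.resident:                      # hot path: already in VRAM
            self.resident.move_to_end(expert_id)
            return self.resident[expert_id]
        weights = fetch_from_storage(expert_id)             # miss: page in from RAM/NVMe
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)               # evict the least recently used expert
        self.resident[expert_id] = weights
        return weights

def forward_moe_layer(cache, routed_expert_ids, fetch):
    # The router picks only a handful of experts per token; only those need to be resident.
    return [cache.get(eid, fetch) for eid in routed_expert_ids]

The hit rate on this cache is what drives decode speed; the prefetching and preloading described below exist to raise it.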

The .lmpack format is designed for this workload. Combined with speculative prefetching and profile-guided preloading, it delivers interactive speeds on hardware that would otherwise be far too small.
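
Profile-guided preloading can be sketched the same way (again illustrative, not lean's internals): record which experts a representative prompt set routes to, then pin the hottest ones in VRAM before serving.

from collections import Counter

def preload_plan(routing_trace, vram_budget_bytes, expert_bytes):
    """routing_trace: expert ids observed during a profiling run."""
    slots = vram_budget_bytes // expert_bytes           # how many experts can be pinned
    profile = Counter(routing_trace)
    return [eid for eid, _ in profile.most_common(slots)]

trace = [3, 7, 3, 42, 7, 3, 191, 7, 3]                  # tiny example trace
print(preload_plan(trace, 8 * 2**30, 64 * 2**20))       # hottest experts first: [3, 7, 42, 191]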

Get started

$ curl -sSf https://leanmodels.ai/install.sh | sh
$ lean pull lean-agent-35b
$ lean run lean-agent-35b

Single binary, 15 MB. No Python, no Docker, no cloud dependency.