Models

MoE models from 35B to 398B parameters. Run models larger than your VRAM - expert offloading handles the rest.

Featured Model

lean-think-398b

Arcee Trinity-Large-Thinking - 398B parameters, ~13B active per token. Chain-of-thought reasoning with agentic RL post-training. Apache 2.0 license.

Paid · New

Q4_K_M download: 241.9 GB
Min VRAM: 48 GB
Active per token: ~13B
Architecture: afmoe

A 242 GB model running on 48 GB VRAM. 256 experts per MoE layer, 4 active + 1 shared per token. Interleaved sliding window + global attention. The model you can't run without expert offloading.

lean-agent-35b

General-purpose agent - tool calling, structured output, multi-step reasoning

Free

Total params: 35B
Active per token: 3B
Base model: Qwen3.5-35B-A3B
Architecture: GDN hybrid MoE
Min VRAM: 12 GB

GGUF download sizes

Q3_K_M: 16.3 GB
Q4_K_M: 21.4 GB
Q5_K_M: 25.0 GB
Q6_K: 30.0 GB
Q8_0: 36.9 GB

The entry point. A 21 GB model (Q4_K_M) that runs on 12 GB VRAM - expert offloading handles the rest. The Qwen3.5 GDN hybrid architecture outperforms last-generation models 7× its size. 6.7-7.6 tok/s decode on an RTX 3090.

$ lean pull lean-agent-35b

lean-coder-80b

Code generation - debugging, refactoring, code review

Free

Total params: 80B
Active per token: 3B
Base model: Qwen3-Coder-Next
Architecture: MoE (512 experts)
Min VRAM: 12 GB

GGUF download sizes

Q3_K_M: 36.7 GB
Q4_K_M: 48.7 GB
Q5_K_M: 57.0 GB
Q6_K: 65.8 GB
Q8_0: 84.8 GB

Code-specialized. 80B total with 512 experts, only 3B active per token. A 48.7 GB model (Q4_K_M) that runs on 12 GB VRAM. Tuned for code generation, debugging, and software engineering workflows.

$ lean pull lean-coder-80b

lean-agent-122b

Advanced agent - complex orchestration, long-context workflows

Paid

Total params: 122B
Active per token: 10B
Base model: Qwen3.5-122B-A10B
Architecture: GDN hybrid MoE
Min VRAM: 24 GB

GGUF download sizes

Q3_K_M: 56.6 GB
Q4_K_M: 75.0 GB
Q5_K_M: 87.8 GB
Q6_K: 105.7 GB
Q8_0: 129.9 GB

A 75 GB model (Q4_K_M) that runs on 24 GB VRAM. 256 experts per layer with 10B active per token - massive knowledge base with efficient per-token compute. 2.3 tok/s decode on an RTX 3090.

$ lean pull lean-agent-122b

lean-reason-397b

Frontier-scale - deep reasoning, complex analysis, research

Paid

Total params: 397B
Active per token: 17B
Base model: Qwen3.5-397B-A17B
Architecture: GDN hybrid MoE
Min VRAM: 48 GB

GGUF download sizes

Q3_K_M: 177.4 GB
Q4_K_M: 244.1 GB
Q5_K_M: 293.7 GB
Q6_K: 326.6 GB
Q8_0: 421.5 GB

Frontier-scale reasoning. 397B total parameters with 17B active per token deliver state-of-the-art capability that runs entirely on your hardware. A 244 GB model (Q4_K_M) that runs on 48 GB VRAM.

$ lean pull lean-reason-397b

lean-think-398b

Extended reasoning - chain-of-thought, agentic tasks, deep analysis

Paid · New

Total params: 398B
Active per token: ~13B
Base model: Arcee Trinity-Large-Thinking
Architecture: afmoe (SWA + global)
Min VRAM: 48 GB

GGUF download sizes

Q3_K_M: 181.4 GB
Q4_K_M: 241.9 GB
Q5_K_M: 283.6 GB
Q6_K: 343.2 GB
Q8_0: 423.7 GB

A 242 GB model (Q4_K_M) that runs on 48 GB VRAM. 256 experts per MoE layer with 4 active + 1 shared per token. Interleaved sliding window + global attention architecture. Chain-of-thought reasoning with agentic RL post-training. Apache 2.0 license. This is the model you simply can't run without expert offloading.
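
A quick back-of-envelope, using only the figures above, shows why the per-token working set fits in 48 GB. This is an illustrative Python calculation, not a memory spec; the real hot set also includes attention weights, embeddings, and KV cache.

total_params    = 398e9
file_bytes      = 241.9e9                     # Q4_K_M download size
bytes_per_param = file_bytes / total_params   # ≈ 0.61 bytes (~4.9 bits) per weight
active_params   = 13e9                        # ~13B routed + shared params per token
print(f"{active_params * bytes_per_param / 1e9:.1f} GB")  # ≈ 7.9 GB of expert weights touched per token

Roughly 8 GB of expert weights are touched per token, which is why 48 GB of VRAM is enough once everything else is paged in on demand.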

$ lean pull lean-think-398b

How offloading works

MoE models activate only a fraction of their parameters per token. The lean runtime keeps that hot path in VRAM and transparently pages in the rest from RAM and NVMe as needed.
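
In pseudocode, the idea looks roughly like the sketch below. It is Python for readability only (lean ships as a single native binary), and names such as ExpertCache and fetch_from_storage are illustrative, not lean's API.

from collections import OrderedDict

class ExpertCache:
    """Keep the most recently used experts resident in VRAM (simple LRU)."""
    def __init__(self, vram_budget_bytes, expert_bytes):
        self.capacity = vram_budget_bytes // expert_bytes   # how many experts fit in VRAM
        self.resident = OrderedDict()                       # expert_id -> weights

    def get(self, expert_id, fetch_from_storage):
        if expert_id in self.resident:                      # hot path: already in VRAM
            self.resident.move_to_end(expert_id)
            return self.resident[expert_id]
        weights = fetch_from_storage(expert_id)             # miss: page in from RAM/NVMe
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)               # evict the least recently used expert
        self.resident[expert_id] = weights
        return weights

def forward_moe_layer(cache, routed_expert_ids, fetch):
    # The router picks only a handful of experts per token; only those need to be resident.
    return [cache.get(eid, fetch) for eid in routed_expert_ids]

The hit rate on this cache is what drives decode speed; the prefetching and preloading described below exist to raise it.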

The .lmpack format is designed for this workload. Combined with speculative prefetching and profile-guided preloading, it delivers interactive speeds on hardware that would otherwise be far too small.
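
Profile-guided preloading can be sketched the same way (again illustrative, not lean's internals): record which experts a representative prompt set routes to, then pin the hottest ones in VRAM before serving.

from collections import Counter

def preload_plan(routing_trace, vram_budget_bytes, expert_bytes):
    """routing_trace: expert ids observed during a profiling run."""
    slots = vram_budget_bytes // expert_bytes           # how many experts can be pinned
    profile = Counter(routing_trace)
    return [eid for eid, _ in profile.most_common(slots)]

trace = [3, 7, 3, 42, 7, 3, 191, 7, 3]                  # tiny example trace
print(preload_plan(trace, 8 * 2**30, 64 * 2**20))       # hottest experts first: [3, 7, 42, 191]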

Get started

$ curl -sSf https://leanmodels.ai/install.sh | sh
$ lean pull lean-agent-35b
$ lean run lean-agent-35b

Single binary, 15 MB. No Python, no Docker, no cloud dependency.