## Why go local?
- Air-gapped or privacy-sensitive codebases.
- Experimentation with distilled or quantized checkpoints.
- Predictable hardware spend instead of per-token cloud bills.
## Common serving stacks

### Ollama
The fastest path for individuals: pull a compatible GGUF or packaged model, expose an OpenAI-compatible route, then select the Ollama provider inside DeepSeek TUI. Output quality tracks whichever checkpoint you actually downloaded, not the flagship cloud model.
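As a minimal sketch of what "an OpenAI-compatible route" means in practice, the snippet below builds a chat-completion request against Ollama's default local port. The base URL is Ollama's documented default; the model tag and prompt are assumptions, so substitute whatever `ollama list` reports on your machine.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"   # Ollama's default OpenAI-compatible route
MODEL = "deepseek-r1:7b"                 # hypothetical tag: use one you actually pulled


def build_chat_request(prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for the local server."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer ollama",  # dummy key; local stacks often ignore it
        },
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the request (requires a running Ollama server) and return the reply."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=30) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Because the request shape is plain OpenAI JSON, the same sketch works unchanged against any other OpenAI-compatible local server by swapping `BASE_URL`.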
### vLLM
Popular for GPU clusters needing throughput and batched inference. Point DeepSeek TUI at your vLLM base URL after confirming tool-call compatibility for your chosen checkpoint.
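"Confirming tool-call compatibility" can be sketched as a one-shot probe: send a request that advertises a trivial tool and check whether the response carries an OpenAI-style `tool_calls` entry. The base URL (vLLM's default port), model name, and `echo` tool below are all illustrative assumptions.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"    # vLLM's default port; adjust to your deploy
MODEL = "deepseek-coder-7b"              # hypothetical model name

# Trivial tool whose schema any tool-capable checkpoint should handle.
ECHO_TOOL = {
    "type": "function",
    "function": {
        "name": "echo",
        "description": "Echo the input string back.",
        "parameters": {
            "type": "object",
            "properties": {"text": {"type": "string"}},
            "required": ["text"],
        },
    },
}


def build_tool_probe() -> urllib.request.Request:
    """Request that should elicit a tool call from a tool-capable checkpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": "Call echo with text='ping'."}],
        "tools": [ECHO_TOOL],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def made_tool_call(response: dict) -> bool:
    """True if the first choice contains an OpenAI-style tool_calls entry."""
    message = response["choices"][0]["message"]
    return bool(message.get("tool_calls"))
```

If `made_tool_call` is false on the parsed response, the checkpoint (or its chat template) likely cannot drive agentic features, and DeepSeek TUI should be limited to plain chat against it.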
### SGLang
Another high-performance option, geared toward structured generation workloads. It is a natural fit when you already run SGLang for other services and want a single inference plane.
## Quantization tradeoffs
GGUF, AWQ, GPTQ, and similar formats shrink memory footprints but can degrade reasoning quality, especially on math-heavy or multi-hop coding tasks. Treat quantization as a knob: smaller quantizations for exploration, larger formats when accuracy matters.
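The footprint side of that knob is simple arithmetic: weight storage is roughly parameters times bits per weight, divided by eight. The back-of-envelope estimator below illustrates this; real GGUF/AWQ/GPTQ files add metadata, KV cache, and mixed-precision layers, so treat these numbers as lower bounds, not exact sizes.

```python
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB: params * bits / 8 bytes per weight."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 2**30


# A 7B-class model at common precisions: FP16 needs roughly four times
# the memory of a 4-bit quantization of the same checkpoint.
for label, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4)]:
    print(f"7B @ {label}: ~{weight_gb(7, bits):.1f} GiB")
```

Running the same arithmetic against a candidate checkpoint before downloading it tells you immediately whether it can fit in your VRAM at all.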
## Hardware reality check
Large MoE checkpoints may list hundreds of billions of parameters yet only activate a subset per token. Still, VRAM and RAM requirements bite quickly. Start with distilled sizes (7B–14B class) before attempting hero checkpoints unless you already operate a GPU fleet.
## Connecting DeepSeek TUI
- Serve an OpenAI-compatible endpoint reachable from your workstation.
- Note the base URL, model name, and API key policy (some local stacks use dummy keys).
- Configure DeepSeek TUI via its Configuration screen, setting provider-specific environment variables as documented upstream.
- Smoke-test with Plan mode before granting Agent privileges.
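Before even opening DeepSeek TUI, the first three steps can be pre-flighted with a small script that confirms the base URL answers the OpenAI-compatible `/models` route and lists the model you intend to use. The base URL, dummy key, and model tag below are assumptions for an Ollama-style local deployment; adjust them for yours.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"   # assumed local endpoint
API_KEY = "local-dummy-key"              # many local stacks accept any string
MODEL = "deepseek-r1:7b"                 # hypothetical tag you plan to configure


def list_models(base_url: str = BASE_URL) -> list[str]:
    """Return model ids reported by the OpenAI-compatible /models route."""
    req = urllib.request.Request(
        f"{base_url}/models",
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return [m["id"] for m in json.load(resp)["data"]]


def model_available(models: list[str], wanted: str = MODEL) -> bool:
    """True if the model you intend to configure is actually being served."""
    return wanted in models
```

If `list_models()` raises or `model_available(...)` is false, fix the serving stack first; only then is a Plan-mode smoke test in DeepSeek TUI meaningful.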