The AI landscape has shifted. Two years ago, building with large language models meant choosing between a handful of proprietary APIs. Today, open-weight models — Llama 3, Mistral, Command R, Qwen, and Gemma — offer genuine alternatives that you can host, fine-tune, and inspect on your own terms.
For product teams, this is not just an ideological question. It is an architecture decision with real implications for cost, latency, privacy, and long-term control.
// What changed
Open-weight models have closed the quality gap faster than most predicted. Llama 3 70B comes within striking distance of GPT-4 on many benchmarks. Mistral's mixture-of-experts architecture delivers strong results at a fraction of the compute. Command R excels at retrieval-augmented generation out of the box.
More importantly, the tooling caught up. Frameworks like vLLM, llama.cpp, and Ollama make self-hosting practical, not just possible. Quantization techniques (GPTQ, AWQ, GGUF) let teams run capable models on surprisingly modest hardware.
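A back-of-the-envelope memory estimate shows why quantization matters. This sketch counts weights only, ignoring KV cache and runtime overhead, and the 4.5 bits-per-weight figure is an approximation for GPTQ/AWQ-style 4-bit schemes:

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory needed to hold model weights alone, in decimal GB."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# A 70B-parameter model at fp16 vs. 4-bit quantization.
fp16 = weight_memory_gb(70, 16)   # ~140 GB: multi-GPU territory
int4 = weight_memory_gb(70, 4.5)  # ~39 GB: fits on a single 48 GB card
print(f"fp16: {fp16:.0f} GB, 4-bit: {int4:.1f} GB")
```

The roughly 3-4x reduction is what moves a 70B model from a multi-GPU cluster onto a single workstation card.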
// When self-hosting makes sense
- Data sensitivity: Healthcare, legal, and financial applications where data cannot leave your infrastructure.
- Latency requirements: Real-time features where API round-trips add unacceptable delay — autocomplete, inline suggestions, live classification.
- Cost at scale: High-volume workloads where per-token API pricing becomes the dominant line item. Self-hosted inference can cut costs by 5-10x at sufficient throughput.
- Fine-tuning needs: Domain-specific tasks where a smaller, tuned model outperforms a general-purpose giant. LoRA adapters make this accessible with limited GPU hours.
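A toy breakeven calculation makes the cost-at-scale point concrete. Every number here is an illustrative assumption, not a real price quote:

```python
def monthly_api_cost(tokens_per_month: float, usd_per_million_tokens: float) -> float:
    """API spend under simple per-token pricing."""
    return tokens_per_month / 1e6 * usd_per_million_tokens

def monthly_selfhost_cost(gpu_hourly_usd: float, gpus: int = 1) -> float:
    """Self-hosted spend: rented GPUs running around the clock for 30 days."""
    return gpu_hourly_usd * gpus * 24 * 30

# Assumed figures: $5 per million tokens via API vs. one GPU at $2/hour.
tokens = 2e9  # 2B tokens/month
api = monthly_api_cost(tokens, 5.0)   # $10,000/month
hosted = monthly_selfhost_cost(2.0)   # $1,440/month
print(f"API: ${api:,.0f}/mo, self-hosted: ${hosted:,.0f}/mo")
```

Under these assumed prices, self-hosting comes out roughly 7x cheaper, in line with the 5-10x range above. The catch is that the GPU cost is fixed: at low volume the ratio inverts, which is the breakeven argument in the next section.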
// When to stay on APIs
Self-hosting adds operational burden: GPU provisioning, model versioning, monitoring, and failover. If your team is small and your usage is moderate, the managed API path is almost always the right call. The breakeven point depends on volume, but for most early-stage products, it is higher than you think.
Frontier capabilities — the hardest reasoning, longest context windows, multimodal understanding — still favor proprietary models. Open-weight models are catching up, but for tasks that demand the absolute best output quality, APIs remain the pragmatic choice.
// Our take
The best teams are not choosing one or the other. They are building routing layers that direct traffic to the right model for the task — small, fast open-weight models for classification and extraction; proprietary APIs for complex generation. The tooling to do this cleanly exists today. The question is not open vs. closed. It is knowing which tool fits which job.
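A minimal routing layer can be sketched in a few lines. The model names and the task-type predicate below are placeholders; a real system would substitute its own models and a proper classifier or rule set:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Route:
    name: str                        # model identifier (placeholder names)
    handles: Callable[[str], bool]   # predicate: does this route accept the task?

# Hypothetical routing table: a small open-weight model for classification
# and extraction, a proprietary API as the fallback for complex generation.
ROUTES = [
    Route("local-llama3-8b", lambda task: task in {"classify", "extract"}),
    Route("proprietary-frontier-api", lambda task: True),  # catch-all
]

def pick_model(task_type: str) -> str:
    """Return the first route whose predicate accepts the task type."""
    for route in ROUTES:
        if route.handles(task_type):
            return route.name
    raise ValueError(f"no route for task: {task_type}")

print(pick_model("classify"))  # local-llama3-8b
print(pick_model("generate"))  # proprietary-frontier-api
```

First-match-wins ordering keeps the cheap model as the default for tasks it is known to handle, with the expensive API absorbing everything else.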