On this page
- TL;DR
- Why using one model is a rookie move
- The model-to-task map
- Three approaches to actually doing the routing
- Approach 1: Manual (human-in-the-loop)
- Approach 2: Rules-based routing
- Approach 3: Model-picks-model (classifier routing)
- The cost impact is massive
- Model cascading (the advanced move)
- Multi-model within one conversation
- What about quality drops when switching mid-thread?
- A recommended default setup for developers
- Where to go next
- The summary
TL;DR
- Different tasks have different difficulty and latency profiles. One model for everything leaves massive cost and quality on the table.
- A good routing strategy assigns the cheapest model that can still do the job well — and escalates only when quality demands it.
- Common routes: Haiku / Flash for classification, Sonnet / GPT-4o for general chat, Opus / o3 for hard reasoning, Groq / Cerebras for low-latency UI.
- Routing can be manual (a switcher in your UI), rules-based (keyword or task-type triggers), or model-picks-model (a small cheap model decides which expensive model to call).
- BYOK workspaces like NovaKit let you switch models mid-conversation or set per-task defaults.
Why using one model is a rookie move
Here's what a typical "power user" does in one day with AI:
- 9:00 — asks a syntax question ("JS destructuring with defaults?"). Needs: answer in under 2 seconds.
- 10:30 — asks the model to refactor a 12-file TypeScript module. Needs: careful, long-context, correct-first-try reasoning.
- 14:00 — classifies 400 customer support tickets into 8 buckets. Needs: low cost per call.
- 16:00 — has the model draft a long internal memo. Needs: good writing voice.
- 18:00 — runs an overnight agent that scrapes, summarizes, and tags 500 articles. Needs: cheap bulk inference.
No single model optimizes for all of those. Running everything on Claude Opus 4 burns money on the easy stuff. Running everything on GPT-4o-mini gives you shoddy refactors. Running everything on Gemini Flash means your syntax questions still take 3 seconds.
The obvious move: route each task to a different model. This is what "multi-model" means.
The model-to-task map
A practical starting point. Mix and match based on your provider access.
| Task | Best fit | Rationale |
|---|---|---|
| Simple classification, labeling | GPT-4o-mini / Haiku 3.5 / Flash | Near-free, fast, accurate enough |
| Interactive chat (short turns) | Groq Llama 3.3 / GPT-4o | Speed matters, quality is fine |
| General coding | Claude Sonnet 4.6 / GPT-4o | Sweet spot of quality and cost |
| Hard multi-file coding / refactors | Claude Opus 4 | Best-in-class, worth the price |
| Long-document analysis (< 500k tokens) | Claude Sonnet 4.6 + cache | Cheap with caching, high quality |
| Huge-document analysis (> 500k) | Gemini 2.5 Pro | 1-2M context is unmatched |
| Fast draft / brainstorming | GPT-4o / Grok | Low latency, creative |
| Bulk batch processing | Gemini 2.0 Flash / DeepSeek V3 | Cheapest frontier-quality options |
| Hardest reasoning (math, proofs) | o3 / GPT-5 | When you need chain-of-thought strength |
| Voice / realtime | GPT-4o Realtime or dedicated voice models | Specialized |
There's nothing sacred about these assignments — they'll shift as prices and capabilities change. The principle is: pick by task type, not by brand loyalty.
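As a sketch, the map above can be encoded as a simple lookup table. The model identifiers here are illustrative shorthands, not exact provider API strings:

```python
# Task-to-model map as a lookup table. Model names are illustrative;
# substitute the exact model strings your providers expose.
MODEL_MAP = {
    "classification": "gpt-4o-mini",
    "interactive_chat": "llama-3.3-70b",   # e.g. served via Groq
    "general_coding": "claude-sonnet-4.6",
    "hard_coding": "claude-opus-4",
    "long_doc": "claude-sonnet-4.6",       # with prompt caching
    "huge_doc": "gemini-2.5-pro",
    "brainstorm": "gpt-4o",
    "batch": "gemini-2.0-flash",
    "hard_reasoning": "o3",
}

def pick_model(task_type: str, default: str = "claude-sonnet-4.6") -> str:
    """Return the mapped model, falling back to the workhorse default."""
    return MODEL_MAP.get(task_type, default)
```

The fallback default matters as much as the map: unknown task types should land on your general-purpose workhorse, not error out.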
Three approaches to actually doing the routing
Approach 1: Manual (human-in-the-loop)
Simplest and underrated. You pick the model. No automation, no config, no latency cost.
Implementation: A keyboard shortcut or dropdown in your chat UI that switches model mid-conversation. NovaKit uses ⌘K to swap model without losing context.
Pros: Zero setup. You know why you picked each model. No routing bugs. Cons: Doesn't scale to product use or automated pipelines.
Best for: Individual users. Most people should start here. Manual routing for personal use + observing your own patterns teaches you the rest.
Approach 2: Rules-based routing
You write explicit rules that map request properties to models.
Example rules:
```
if request.tokens > 150_000:              use gemini-2.5-pro
if request.type == "code_refactor":       use claude-opus-4
if request.user_tier == "free":           use gemini-2.0-flash
if "deep analysis" in request.prompt:     use o3
else:                                     use claude-sonnet-4.6
```
Pros: Deterministic, easy to debug, predictable cost. Cons: Rules get brittle as you add cases. Doesn't adapt to new tasks.
Implementation: LiteLLM has rule-based routing built in. You can also write it yourself in about 30 lines.
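For illustration, a self-contained version of that 30-line router might look like this. The `Request` shape and rule set mirror the examples above and are assumptions, not a fixed API:

```python
from dataclasses import dataclass

@dataclass
class Request:
    prompt: str
    tokens: int = 0
    type: str = "chat"
    user_tier: str = "paid"

# Ordered rules: the first predicate that matches wins.
RULES = [
    (lambda r: r.tokens > 150_000,          "gemini-2.5-pro"),
    (lambda r: r.type == "code_refactor",   "claude-opus-4"),
    (lambda r: r.user_tier == "free",       "gemini-2.0-flash"),
    (lambda r: "deep analysis" in r.prompt, "o3"),
]

def route(request: Request, default: str = "claude-sonnet-4.6") -> str:
    for predicate, model in RULES:
        if predicate(request):
            return model
    return default
```

Keeping the rules as an ordered list makes precedence explicit, which is exactly what gets confusing once a rules file grows past a dozen entries.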
Approach 3: Model-picks-model (classifier routing)
You use a tiny, cheap model to classify the incoming request, then dispatch to the appropriate big model.
Flow:
- User sends request.
- Tiny model (GPT-4o-mini or Haiku) classifies: "This is a simple syntax question."
- Router dispatches to the fast, cheap model.
- Response returns.
Pros: Adaptive. Handles novel inputs. Can improve over time with feedback. Cons: Adds latency (~100-300 ms), adds another point of failure, and is harder to debug.
When to use: Production products with many users doing many kinds of tasks. Worth it at scale.
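A runnable sketch of the shape of this router. Here `classify()` is stubbed with a keyword heuristic so the example runs offline; a production version would replace it with one cheap LLM call (e.g. GPT-4o-mini) constrained to emit exactly one of these labels:

```python
# Model-picks-model routing: a tiny classifier chooses the big model.
LABEL_TO_MODEL = {
    "simple_question": "gpt-4o-mini",
    "coding": "claude-sonnet-4.6",
    "hard_reasoning": "claude-opus-4",
}

def classify(prompt: str) -> str:
    # Stub. In production: one cheap LLM call that must return
    # one of the labels in LABEL_TO_MODEL.
    lowered = prompt.lower()
    if any(w in lowered for w in ("prove", "derive", "refactor")):
        return "hard_reasoning"
    if "def " in lowered or "function" in lowered:
        return "coding"
    return "simple_question"

def dispatch(prompt: str) -> str:
    label = classify(prompt)
    return LABEL_TO_MODEL.get(label, "claude-sonnet-4.6")
```

Note the fallback in `dispatch`: when the classifier emits something outside the label set (LLM classifiers do this), you want a safe default, not an exception.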
The cost impact is massive
Let's model a realistic product workload of 100k requests/month. Assume this mix:
- 50% are short classification or labeling tasks (200 in / 50 out).
- 30% are general chat (500 in / 400 out).
- 15% are coding/analysis tasks (3k in / 800 out).
- 5% are hard reasoning / long context (20k in / 2k out).
Scenario A: Claude Opus 4 for everything
- Classification: 10M tokens in × $15/M + 2.5M out × $75/M ≈ $150 + $188 = ~$338
- Chat: 15M in + 12M out ≈ $225 + $900 = ~$1,125
- Coding: 45M in + 12M out ≈ $675 + $900 = ~$1,575
- Hard reasoning: 100M in + 10M out ≈ $1,500 + $750 = ~$2,250
Total: ~$5,300/month.
Scenario B: Smart routing
- Classification on GPT-4o-mini ($0.15/M in, $0.60/M out): ≈ $3
- Chat on Claude Sonnet 4.6 ($3/M in, $15/M out): $45 + $180 = ~$225
- Coding on Claude Sonnet 4.6: $135 + $180 = ~$315
- Hard reasoning on Claude Opus 4: $1,500 + $750 = ~$2,250
Total: ~$2,800/month.
Savings: ~47% (~$2,500/month) with no quality loss on the important work. The expensive reasoning tasks still get the expensive model; the cheap tasks stop subsidizing Anthropic's margins.
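The arithmetic above can be checked in a few lines of Python. The prices are assumptions (Opus 4 at $15/$75, Sonnet at $3/$15, GPT-4o-mini at $0.15/$0.60 per million input/output tokens); swap in your providers' current rates and your own workload mix:

```python
# Back-of-envelope cost model for the 100k-requests/month workload.
# Prices are assumed $/1M tokens (input, output).
PRICES = {
    "opus":   (15.0, 75.0),
    "sonnet": (3.0, 15.0),
    "mini":   (0.15, 0.60),
}
# task: (requests, input tokens/request, output tokens/request)
WORKLOAD = {
    "classification": (50_000, 200, 50),
    "chat":           (30_000, 500, 400),
    "coding":         (15_000, 3_000, 800),
    "reasoning":      (5_000, 20_000, 2_000),
}

def cost(model: str, reqs: int, tin: int, tout: int) -> float:
    price_in, price_out = PRICES[model]
    return reqs * (tin * price_in + tout * price_out) / 1_000_000

# Scenario A: Opus for everything.
scenario_a = sum(cost("opus", *w) for w in WORKLOAD.values())

# Scenario B: route each task to the cheapest adequate model.
ROUTING = {"classification": "mini", "chat": "sonnet",
           "coding": "sonnet", "reasoning": "opus"}
scenario_b = sum(cost(m, *WORKLOAD[t]) for t, m in ROUTING.items())

print(round(scenario_a), round(scenario_b))  # roughly 5288 and 2793
```

The useful part is not the exact totals but the shape of the result: almost all of the routed budget sits in the 5% of hard-reasoning traffic, which is exactly where you want to spend it.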
Model cascading (the advanced move)
Cascading = try the cheap model first, escalate only if it fails a quality check.
Flow:
- Send request to Claude Sonnet 4.6.
- A confidence check runs (is the answer plausible? Does it compile? Does it match a schema?).
- If it passes: return.
- If it fails: retry on Claude Opus 4 and return that.
Benefits: You only pay for Opus when Sonnet isn't enough. On paper, this could cut costs 70% while preserving quality.
Challenges: Requires a good "quality check." For code, a type-check + test run works. For creative writing, it's harder — you might need a judge model.
Use when: High-volume, well-defined output schemas (API responses, code, structured data). Avoid when: Free-form creative output where "quality" is subjective.
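A minimal cascade sketch, assuming a structured-output quality check (does the answer parse as JSON?) and a placeholder `call_model` you would wire to your actual provider client:

```python
import json

def call_model(model: str, prompt: str) -> str:
    # Placeholder: wire up your real provider client here.
    raise NotImplementedError

def passes_check(output: str) -> bool:
    # Quality gate for schema-bound tasks: valid JSON or bust.
    try:
        json.loads(output)
        return True
    except ValueError:
        return False

def cascade(prompt: str, cheap: str = "claude-sonnet-4.6",
            strong: str = "claude-opus-4", call=call_model) -> str:
    answer = call(cheap, prompt)
    if passes_check(answer):
        return answer              # paid cheap-model prices only
    return call(strong, prompt)    # escalate: pay for the strong model
```

Passing `call` as a parameter keeps the cascade testable without real API keys, and the same pattern extends to a type-check or test run as the gate for code tasks.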
Multi-model within one conversation
This is where a BYOK workspace shines. Example workflow:
- User asks a quick syntax question. → GPT-4o answers in 1.5 seconds.
- User switches to "Claude Opus 4" and says "okay, now refactor this 8-file module." → Full context is preserved.
- User switches to "Gemini 2.5 Pro" and pastes the whole 200-page docs. → Model reads everything.
- User switches back to "Claude Sonnet 4.6" for cheap follow-up questions.
All in the same thread. All BYOK. All cost-tracked.
This is the workflow that's impossible on ChatGPT Plus or Claude Pro. It's the single biggest argument for BYOK + multi-provider access.
What about quality drops when switching mid-thread?
A legitimate concern: does switching models mid-conversation cause weird handoffs?
In practice, no — mostly. Each message is self-contained given the conversation history. The new model re-reads the whole thread and responds. You might notice tone or style shifts (especially Claude → GPT → Claude), but factual continuity is fine.
Caveat: Some workflows break if models have different context windows. If you've filled GPT-4o's 128k context and then switch to a 32k model, you'll truncate. Use one of the long-context models (Gemini 2.5 Pro, Claude 200k) as your "big thread" default.
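A small guard like this can catch the truncation case before you switch. The window sizes are illustrative; check your providers' documentation for the real limits:

```python
# Rough pre-switch check: does the conversation history fit the
# target model's context window? Sizes below are assumptions.
CONTEXT_WINDOWS = {
    "gpt-4o": 128_000,
    "claude-sonnet-4.6": 200_000,
    "gemini-2.5-pro": 1_000_000,
}

def fits(history_tokens: int, model: str, reserve: int = 4_000) -> bool:
    """Leave `reserve` tokens of headroom for the next reply."""
    limit = CONTEXT_WINDOWS.get(model, 32_000)  # pessimistic default
    return history_tokens + reserve <= limit
```

A pessimistic default for unknown models is deliberate: it is better to warn unnecessarily than to silently drop the top of a long thread.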
A recommended default setup for developers
Here's a starter routing policy that works well in practice:
- Default for everything: Claude Sonnet 4.6 (the new "workhorse").
- Quick questions / fast feel: GPT-4o via keyboard shortcut.
- Multi-file / hard problems: Claude Opus 4 via keyboard shortcut.
- "Read this whole document" tasks: Gemini 2.5 Pro.
- Classification / extraction pipelines: GPT-4o-mini via API.
- Anything that feels like agent work: Claude Opus 4.
Tune based on your own observed usage. Most people rebalance once a month as they see their patterns.
Where to go next
- Compare current model prices and capabilities
- Estimate cost of a new workflow
- Pick a model for a specific task
- Read the RAG vs. long-context guide — related decision
The summary
- One model ≠ optimal. Routing is the highest-leverage cost optimization in AI.
- Manual routing works great for individuals. Rules-based works for small products. Classifier routing scales to large products.
- Cascading (cheap-first-then-escalate) is the advanced move for high-volume, schema-bound tasks.
- BYOK + multi-model is the only way to do this without multiplying your tooling.
If you're still on a single-provider subscription, you're not just overpaying — you're getting worse results on tasks that want a different model.
Multi-model, one workspace. NovaKit supports 13+ providers — Claude, GPT, Gemini, Llama, DeepSeek, Mistral — BYOK with keyboard-shortcut model switching and per-message cost tracking.