TL;DR
- Groq — ~270-800 tokens/sec depending on model (Llama 3.3 70B at the low end, Llama 3.1 8B at the top). Reliable, widely supported, great free tier. The default fast-inference choice.
- Cerebras — Up to 1,800 tokens/sec on Llama 3.3 70B. Fastest on the market. Smaller model catalog, newer ecosystem.
- Together AI — 100-200 tokens/sec typically. Largest model catalog, dedicated endpoints, fine-tuning platform. The "do everything" option.
- SambaNova and DeepInfra also compete here — noted where relevant.
- Speed matters for voice apps, streaming UX, agent loops, and any workflow where latency compounds. For async work, quality > speed.
Why this matters
The gap between "AI that feels snappy" and "AI that feels slow" comes down to two numbers: first token in under 500ms, then 100+ tokens/sec of sustained output. Below those thresholds, users wait. Above them, conversation flows.
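Those two numbers combine into perceived response time, which is easy to sketch (the figures below are illustrative, not measurements):

```python
def response_time(ttft_ms: float, tokens: int, tokens_per_sec: float) -> float:
    """Total wall-clock seconds for a streamed reply: time-to-first-token
    plus generation time for the tokens that follow."""
    return ttft_ms / 1000 + tokens / tokens_per_sec

# A 300-token reply on a fast provider vs. a slow one:
fast = response_time(ttft_ms=300, tokens=300, tokens_per_sec=300)  # 1.3s
slow = response_time(ttft_ms=900, tokens=300, tokens_per_sec=60)   # 5.9s
print(f"fast: {fast:.1f}s, slow: {slow:.1f}s")
```

Note that for short replies, time-to-first-token dominates; for long ones, sustained tokens/sec does.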
Groq, Cerebras, SambaNova, Together, and DeepInfra are all betting on the same proposition: open-source models served fast and cheap beat slow, expensive closed models for the right use cases. They compete on speed, price, model selection, and reliability.
As of April 2026, here's the honest state of each.
Groq
The pitch
Groq runs open-source models on custom LPU (Language Processing Unit) hardware. Tokens come out fast — typically 3-10x faster than GPU-based inference.
Real performance numbers
- Llama 3.3 70B: ~270-320 tokens/sec
- Llama 3.1 8B: ~600-800 tokens/sec
- Mixtral 8x7B: ~500 tokens/sec
- DeepSeek R1 Distill: ~180 tokens/sec
- Time-to-first-token: typically 200-400ms
Pricing
- Llama 3.3 70B: $0.59/M input, $0.79/M output
- Llama 3.1 8B: $0.05/M input, $0.08/M output
Notably, the free tier is generous enough to prototype seriously: you can build and test an app before paying anything.
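At those list prices, per-request cost is simple arithmetic (prices are per million tokens):

```python
def request_cost(in_tokens: int, out_tokens: int,
                 in_price_per_m: float, out_price_per_m: float) -> float:
    """Dollar cost of one request given per-million-token list prices."""
    return in_tokens / 1e6 * in_price_per_m + out_tokens / 1e6 * out_price_per_m

# Llama 3.3 70B on Groq at the prices quoted above:
cost = request_cost(2_000, 500, in_price_per_m=0.59, out_price_per_m=0.79)
print(f"${cost:.6f}")  # about $0.0016 for a 2K-in / 500-out request
```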
Strengths
- The most mature fast-inference provider. Rate limits are predictable, uptime is strong.
- Wide model selection among open-source: Llama family, Mixtral, Gemma, DeepSeek distills.
- OpenAI-compatible API. Any BYOK client can point at Groq without a special adapter.
- Active ecosystem. Hugging Face Playground, LangChain support, documented everywhere.
Weaknesses
- Limited to open models. No Claude, no GPT, no Gemini.
- Rate limits on free tier. Enough for development; you'll need paid for real traffic.
- Some newer models arrive late. Not always first to host the latest Llama release.
Use Groq when
- You want fast Llama/Mixtral at a reasonable price.
- You're building interactive chat, voice, or streaming UX.
- You need OpenAI-compatible endpoints for a BYOK workflow.
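Because the endpoint is OpenAI-compatible, any generic chat-completions client works. A minimal sketch of the request shape, assuming Groq's documented OpenAI-compatible base URL and an illustrative model id (verify both against Groq's current docs):

```python
def build_chat_request(api_key: str, model: str, user_message: str,
                       base_url: str = "https://api.groq.com/openai/v1"):
    """Assemble a standard OpenAI-style chat-completions request.
    Send it with any HTTP client, or point the official openai SDK
    at base_url instead of constructing it by hand."""
    url = f"{base_url}/chat/completions"
    headers = {"Authorization": f"Bearer {api_key}",
               "Content-Type": "application/json"}
    payload = {"model": model,
               "messages": [{"role": "user", "content": user_message}],
               "stream": True}  # streaming is what makes the speed visible
    return url, headers, payload

url, headers, payload = build_chat_request(
    "YOUR_API_KEY", "llama-3.3-70b-versatile", "hello")
```

Swapping providers means changing only `base_url`, the key, and the model id; the payload shape stays the same.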
Cerebras
The pitch
Cerebras runs models on wafer-scale engine hardware — a single chip the size of a dinner plate. Claimed inference speeds of 1,800+ tokens/sec on Llama 3.3 70B, dwarfing everything else.
Real performance numbers
- Llama 3.3 70B: 1,600-2,000 tokens/sec (the highest in the industry)
- Llama 3.1 8B: 2,500+ tokens/sec
- Time-to-first-token: 150-300ms
Pricing
Pricing as of April 2026 is similar to Groq ($0.60-0.85/M for 70B). Free tier exists for testing.
Strengths
- The fastest on the market, full stop. No one touches Cerebras on tokens/sec.
- Time-to-first-token is competitive with Groq and dramatically better than GPU-based inference.
- Low latency compounds: in agent loops with many model calls, Cerebras can be 2-3x faster end-to-end than Groq.
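The compounding effect is easy to model. With hypothetical but plausible per-call numbers (these are illustrative, not benchmarks), 20 sequential calls look like:

```python
def loop_time(calls: int, ttft_s: float, out_tokens: int, tps: float) -> float:
    """End-to-end seconds for sequential model calls (tool time ignored)."""
    return calls * (ttft_s + out_tokens / tps)

groq     = loop_time(20, ttft_s=0.30, out_tokens=200, tps=300)   # ~19.3s
cerebras = loop_time(20, ttft_s=0.25, out_tokens=200, tps=1800)  # ~7.2s, about 2.7x faster
```

Note that once tokens/sec is very high, time-to-first-token becomes the floor: pushing tps further stops helping.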
Weaknesses
- Smaller model catalog than Groq or Together — mostly Llama and a few others.
- Less mature ecosystem. Integrations and client libraries are improving but lag Groq.
- Lower rate limits on free / entry tiers.
- Documentation is thinner. Troubleshooting is slower.
Use Cerebras when
- Speed is the whole point: voice apps, real-time agents, streaming demos.
- You're on Llama 3.3 70B or Llama 3.1 8B specifically.
- You've outgrown Groq's throughput and need the next tier.
Together AI
The pitch
Together is the "hyperscaler for open-source AI." Largest model catalog, dedicated endpoints, fine-tuning platform, strong enterprise features.
Real performance numbers
- Llama 3.3 70B Turbo: ~100-180 tokens/sec
- Llama 3.1 405B Turbo: ~40-80 tokens/sec
- DeepSeek V3: ~80-140 tokens/sec
- Qwen 2.5 72B: ~90-150 tokens/sec
- Time-to-first-token: typically 500-900ms (slower than Groq/Cerebras)
Pricing
- Llama 3.3 70B Turbo: $0.88/M (input and output)
- Llama 3.1 405B Turbo: $3.50/M (input and output)
- DeepSeek V3: $1.25/M (input and output)
Not the cheapest, but reliable.
Strengths
- Biggest open-source catalog. Llama, Qwen, DeepSeek, Mistral, Yi, Gemma, specialized coding models.
- Dedicated endpoints. Reserve GPU capacity for predictable latency and pricing at scale.
- Fine-tuning platform. Fine-tune and host your custom model on the same infra.
- Llama 405B access. Together hosts the biggest open models — few competitors can say this.
- Strong enterprise features. SLAs, dedicated support, compliance.
Weaknesses
- Slower than Groq/Cerebras for interactive work.
- More expensive than Groq for equivalent 70B throughput.
- Less impressive demos — won't wow you with pure speed.
Use Together when
- You need a model nobody else hosts (405B, specialized coder, rare community fine-tune).
- You need dedicated endpoints for SLAs.
- You're going to fine-tune and serve.
- Reliability matters more than raw tokens/sec.
The honorable mentions
SambaNova
Competing directly with Cerebras for fastest-inference crown. Claims comparable speeds on Llama 3.3 70B. Smaller market presence as of early 2026 but technically very strong. Worth watching.
DeepInfra
Budget-tier fast inference. Cheaper than the above but with lower SLA quality. Good for experimentation; use with eyes open for production.
Fireworks AI
Balanced play — decent speed, good model catalog, strong fine-tuning. Competes mostly with Together on "breadth over speed." Growing fast.
Benchmarks that matter vs. benchmarks that don't
Fast-inference marketing is full of tokens/sec numbers. Not all of them translate to user experience.
What actually matters
- Time-to-first-token (TTFT). For chat UX, this is more important than raw speed. Target: under 500ms.
- Sustained tokens/sec during output. Determines how fast the response feels once it starts.
- Latency consistency (p99). A fast p50 with a slow p99 is worse than slower-but-consistent: users remember the worst request, not the median.
- Cost per completed request. Speed is useless if it's 5x the price.
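Checking consistency rather than averages on your own traffic takes a few lines (the latency samples here are made up):

```python
import statistics

# Per-request latencies in ms, with one bad tail request hiding in the batch.
latencies_ms = [210, 220, 215, 230, 225, 218, 222, 227, 219, 2400]

p50 = statistics.median(latencies_ms)
p99 = statistics.quantiles(latencies_ms, n=100, method="inclusive")[98]
print(f"p50={p50:.0f}ms p99={p99:.0f}ms")  # the median hides the 2.4s outlier; p99 exposes it
```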
What matters less than you'd think
- Peak tokens/sec on short prompts. Marketing loves this. Real apps rarely hit it.
- Benchmark-exact scores. Public benchmarks are gamed; your own real prompts matter more.
- Single-model leadership. "Fastest at Llama 3.3 70B" is great if you use Llama 3.3 70B. If you need a newer model or a different architecture, that lead evaporates.
Side-by-side (common use cases)
Use case: Real-time voice app
Speed is critical. User hears the model reply as it streams.
Winner: Cerebras, followed by Groq. Together is too slow for smooth voice.
Use case: Interactive chat UX
User is reading as the model generates. Under 200 tokens/sec feels slow; over 300 feels snappy.
Winner: Groq is the safe default. Cerebras if you need even more.
Use case: Background agent loop (20 tool calls)
Agent does many sequential model calls. Each call's latency adds up.
Winner: Cerebras for minimum total runtime. Groq close second.
Use case: Code generation for an IDE
Speed matters (autocomplete), but quality matters more.
Winner: Groq + a DeepSeek R1 distill, or fall back to Claude Sonnet 4.6 for harder problems.
Use case: Bulk document processing (async)
Speed doesn't matter — throughput + cost matter.
Winner: Together AI or Fireworks for cheapness and parallelism. DeepSeek off-peak for absolute cost floor.
Use case: Need 405B or a rare fine-tune
Winner: Together, definitively.
The BYOK angle
All three providers offer OpenAI-compatible endpoints. That means any BYOK client (NovaKit, OpenRouter, your own app) can use them alongside Claude, GPT, Gemini, and the rest. The usual pattern:
- Default to Claude Sonnet 4.6 or GPT-4o for quality.
- Swap to Groq Llama 3.3 70B when speed matters or cost is tight.
- Swap to Cerebras for the demo or voice moment that needs it fastest.
- Use Together when no one else hosts the model you need.
Multi-provider is the story. The fast-inference providers don't replace closed models; they complement them.
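That multi-provider pattern can be as simple as a routing table; the provider names and model ids below are illustrative stand-ins, not exact API identifiers:

```python
# Map a workload to (provider, model). All of these speak the same
# OpenAI-style chat API, so only the base URL and key change per provider.
ROUTES = {
    "quality": ("anthropic", "claude-sonnet-4.6"),
    "fast":    ("groq",      "llama-3.3-70b"),
    "fastest": ("cerebras",  "llama-3.3-70b"),
    "rare":    ("together",  "llama-3.1-405b"),
}

def route(workload: str) -> tuple[str, str]:
    """Pick a provider/model pair, defaulting to the quality tier."""
    return ROUTES.get(workload, ROUTES["quality"])
```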
Reliability notes
From real-world use in 2026:
- Groq: Mature. Occasional rate-limit pressure during viral demo moments, but generally reliable.
- Cerebras: Newer, growing. Occasional capacity issues when something goes viral. Improving fast.
- Together: Most enterprise-grade of the three. Highest uptime in my experience.
- SambaNova: Less data — fewer deployments to observe.
- DeepInfra: Budget provider; occasional latency spikes. Use with monitoring.
For production workloads, have a fallback. "Primary = Groq, fallback = Together" is a common pattern.
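The primary/fallback pattern, sketched with stand-in callables in place of real API clients:

```python
def with_fallback(primary, fallback, prompt: str) -> str:
    """Try the primary provider; on any error, retry once on the fallback."""
    try:
        return primary(prompt)
    except Exception:
        return fallback(prompt)

# Stand-ins: primary = Groq, fallback = Together.
def groq_call(prompt):     raise TimeoutError("rate limited")  # simulated outage
def together_call(prompt): return f"answer to: {prompt}"

print(with_fallback(groq_call, together_call, "hello"))  # falls back to Together
```

A production version would also log which provider served each request, so you notice when the fallback becomes the de facto primary.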
Pricing-per-token is not the whole picture
When comparing providers, look at:
- Rate limits (tokens per minute, requests per minute, concurrent requests).
- Discounts at volume (enterprise contracts can be 30-50% off list).
- Dedicated endpoints (Together, Fireworks offer reserved capacity).
- Free-tier generosity (Groq's is best for prototyping).
- Geographic latency (where are their data centers relative to your users?).
For an individual user on BYOK, list pricing is what you pay. For products at scale, negotiate.
What I actually use
Honest picks for different workflows:
- Chat UX (primary): Groq Llama 3.3 70B for everyday chat; switch to Claude Sonnet 4.6 when quality matters.
- Agent loops: Cerebras when I'm demoing; Groq for day-to-day.
- Experimental / rare models: Together.
- Absolute cost floor: DeepSeek V3 off-peak + Together for hosting.
Where to go from here
- Current prices across all providers
- Pick a model for a specific task
- Open-source AI model comparison
- How to think about multi-model routing
The summary
- Groq = the sensible default. Fast, cheap, broad ecosystem.
- Cerebras = fastest. When speed is the product.
- Together = the catalog. When you need something specific or at scale.
- Match the provider to the workload. Use more than one.
- Tokens/sec is fun marketing. Time-to-first-token + consistency + cost is what actually ships products.
Use Groq, Cerebras, and Together alongside Claude, GPT, and Gemini — all in one BYOK workspace with NovaKit. Keys stay local, cost tracked per message, models swapped by keyboard shortcut.