The cheapest enterprise-grade frontier model in 2026 is Claude Haiku 4 at $0.25 per million input tokens and $1.25 per million output tokens. The most expensive is OpenAI o1 at $15 per million input and $60 per million output. Claude Sonnet 4 lands at $3 input and $15 output, GPT-4o at $2.50 input and $10 output, Gemini 2.0 Pro at $1.25 input and $5 output, Llama 3.1 405B hosted on Bedrock at $5.32 input and $16 output, Mistral Large 2 at $3 input and $9 output, and Cohere Command R+ at $2.50 input and $10 output. Cached input and batch API discounts of 50 to 90 percent change the realised cost materially on workloads with the right shape. This comparison covers the production-ready frontier and mid-tier models that show up in enterprise procurement evaluations.
Primary frontier model per-token comparison
The published per-token pricing across the frontier model families as of Q2 2026 is summarised below. The pricing reflects standard real-time API access on the vendor's direct endpoint or the primary cloud channel (Bedrock for Llama and Claude on AWS, Vertex for Gemini and Claude on GCP).
| Model | Vendor | Input per 1M tokens | Output per 1M tokens | Context window |
|---|---|---|---|---|
| Claude Opus 4 | Anthropic | $15.00 | $75.00 | 500K |
| Claude Sonnet 4 | Anthropic | $3.00 | $15.00 | 500K |
| Claude Haiku 4 | Anthropic | $0.25 | $1.25 | 200K |
| GPT-4o | OpenAI | $2.50 | $10.00 | 128K |
| GPT-4o mini | OpenAI | $0.15 | $0.60 | 128K |
| GPT-4.1 | OpenAI | $2.00 | $8.00 | 1M |
| GPT-4.1 mini | OpenAI | $0.40 | $1.60 | 1M |
| o1 (reasoning) | OpenAI | $15.00 | $60.00 | 200K |
| o3-mini (reasoning) | OpenAI | $1.10 | $4.40 | 200K |
| Gemini 2.0 Pro | $1.25 | $5.00 | 2M | |
| Gemini 2.0 Flash | $0.075 | $0.30 | 1M | |
| Gemini 2.0 Flash-Lite | $0.0375 | $0.15 | 1M | |
| Llama 3.1 405B (Bedrock) | Meta / AWS | $5.32 | $16.00 | 128K |
| Llama 3.1 70B (Bedrock) | Meta / AWS | $0.99 | $0.99 | 128K |
| Mistral Large 2 | Mistral | $3.00 | $9.00 | 128K |
| Mistral Small | Mistral | $0.20 | $0.60 | 128K |
| Cohere Command R+ | Cohere | $2.50 | $10.00 | 128K |
| Cohere Command R | Cohere | $0.15 | $0.60 | 128K |
Cached input and Batch API discount structures
The published per-token rates are starting points. Most enterprise workloads at scale qualify for cached input pricing (workloads with stable system prompts) or Batch API pricing (workloads that do not require real-time response). The discount structures vary materially by vendor.
| Vendor | Cached input discount | Batch API discount | Notes |
|---|---|---|---|
| Anthropic | 90% off (1-hour cache) | 50% off | Cache write costs 125% of standard for 5-minute cache |
| OpenAI | 50% off (5-minute cache) | 50% off | Implicit caching; explicit caching also available |
| Google Gemini | 75% off (context caching) | 50% off | Per-token cache storage fee applies |
| AWS Bedrock (Claude, Llama) | Per-model (matches direct) | 50% off via Batch Inference | Bedrock Provisioned Throughput separately priced |
| Mistral | Not generally available | 50% off | Self-hosted gives full caching control |
| Cohere | Limited | 50% off via batch jobs | Per-tenant negotiation |
The cached input effect in practice: A retrieval-augmented chat application with a stable 30,000-token system prompt and 500 token user prompts. Without caching, every request bills 30,500 input tokens at $3 per million on Claude Sonnet 4 ($0.0915 per request). With 90 percent caching discount on the 30,000-token system prompt, each request bills 3,000 input cache tokens plus 500 standard input tokens (about $0.011 per request). The cached approach delivers 88 percent cost reduction on the input side. For chat applications running at 100,000 requests per day, that compounds to $300,000 per year in saved spend on a single workload.
Provisioned Throughput and reserved capacity
For sustained high-throughput workloads, the per-token billing model is not the cheapest option. Vendor-side reserved capacity (Azure OpenAI PTU, AWS Bedrock Provisioned Throughput, Anthropic Capacity Reservations, Google Vertex Provisioned Throughput) typically delivers 25 to 45 percent cost reduction at sustained throughput above 100 to 300 tokens per second.
| Reserved capacity SKU | Pricing model | Breakeven point versus per-token |
|---|---|---|
| Azure OpenAI Provisioned Throughput Unit | Per-PTU per month, reserved capacity | ~150 tokens/sec sustained |
| AWS Bedrock Provisioned Throughput | Per-model-unit per hour | ~200 tokens/sec sustained |
| Anthropic Capacity Reservation | Per-tenant monthly commit | Negotiated per use case |
| Google Vertex Provisioned Throughput | Per-GSU per month | ~100 tokens/sec sustained |
The decision is driven by workload pattern. Steady-state workloads (chat assistants, RAG applications, classification pipelines) at sustained high throughput favour reserved capacity. Bursty or low-throughput workloads (occasional analysis, periodic batch jobs) favour per-token. For mixed workloads the right pattern is reserved capacity for the steady-state baseline plus per-token for burst, which most vendor reserved offerings support.
Cost comparison by workload class
The right model for a workload is rarely the cheapest model. The right model is the cheapest model that meets the capability threshold for that workload. The table below shows representative cost across the four primary workload classes at meaningful enterprise scale, with the typical model choice for each.
| Workload | Volume assumption | Recommended model | Monthly cost |
|---|---|---|---|
| Enterprise chat assistant (RAG) | 5M user requests / month, 8K tokens avg | Claude Sonnet 4 with caching | $48K to $72K |
| Document analysis (legal review) | 250K documents / month, 50K tokens avg | Claude Opus 4 or Gemini 2.0 Pro | $94K to $187K |
| Code generation (developer assistant) | 800K completions / month, 4K tokens avg | Claude Sonnet 4 or GPT-4o | $28K to $35K |
| Classification / extraction at scale | 200M items / month, 1K tokens avg | Claude Haiku 4 or Gemini Flash | $8K to $32K |
| Multi-step reasoning (analysis) | 50K tasks / month, 20K tokens avg | o1 or Claude Opus 4 extended thinking | $40K to $115K |
The hidden cost layers
The published per-token rates do not capture three cost layers that matter at enterprise scale. The cost model that omits these layers under-states three-year TCO by 15 to 35 percent on most workloads.
Embedding generation. Most enterprise AI workloads include a retrieval layer that converts customer documents into vector embeddings. Embedding generation costs are usually below 5 percent of total spend but show up as a discrete line item that surprises buyers who modelled only generation cost. OpenAI text-embedding-3-large runs at $0.13 per million tokens. Voyage AI (favoured for Claude integrations) runs at $0.18 per million tokens for voyage-3.
Image and multimodal input tokens. Image inputs to multimodal models are tokenised at vendor-specific rates. A standard image to GPT-4o is approximately 765 to 1,105 tokens depending on resolution. Heavy image workloads (document OCR, chart analysis, screenshot processing) can dominate the token spend.
Egress, observability, and orchestration. The infrastructure layer around the model (LangChain or LlamaIndex orchestration, vector database, observability tooling, prompt management, evaluation tooling) usually adds 15 to 30 percent to the model spend at enterprise scale.
The output-token bias in cost modelling: Output tokens cost 4x to 5x input tokens on most frontier models. Workloads that look cheap on input-token counts can be expensive on output. The mitigation is to constrain output length explicitly in prompts and to use the most concise model that meets the quality threshold for the workload. A response that takes 2,000 output tokens at $15 per million on Claude Sonnet 4 costs $0.030. The same response generated by Claude Haiku 4 at $1.25 per million costs $0.0025. For high-volume workloads where Haiku 4 quality is sufficient, the 12x cost reduction is the largest single optimisation available in 2026.
Volume discount bands and enterprise contracts
At enterprise scale (above $50,000 per month in API spend), all the frontier vendors entertain custom commercial terms. The discount realisation tracks roughly with monthly committed spend.
| Monthly committed spend | Typical discount band | Notes |
|---|---|---|
| $50K to $100K | 0 to 8 percent | Pay-as-you-go, no commit |
| $100K to $500K | 8 to 18 percent | 12-month commit with quarterly true-up |
| $500K to $2M | 15 to 25 percent | Multi-year commit, custom SLAs |
| $2M+ | 20 to 35 percent | Capacity reservation + bespoke terms |
Cloud channel arbitrage
Claude is available on Anthropic direct, AWS Bedrock, and Google Vertex. GPT-4 is available on OpenAI direct and Azure OpenAI. Llama is available on Bedrock, Vertex, and Azure AI Studio. Gemini is Google-only. Mistral is direct, Bedrock, Vertex, and Azure AI Studio.
The token economics are usually identical across channels (the model vendor does not pass discount through cloud sub-channels), but the commercial accounting changes materially. For customers with material AWS EDP commit, Claude via Bedrock burns the EDP balance. For customers with Microsoft MACC, Azure OpenAI burns the MACC. For customers with Google Cloud commit, Vertex burns the commit. The channel arbitrage is purely a commercial structure decision; capability is unaffected.
The full hyperscaler cloud cost framework lives in our AWS EDP pillar, Azure MACC analysis, and GCP enterprise agreement guide. For Bedrock specifically see AWS Bedrock pricing 2026.
Picking the right model for the workload
The right model decision in 2026 looks like this. For broad-population chat with RAG, Claude Sonnet 4 or GPT-4o with caching is the cost-efficient choice. For long-document analysis, Claude Opus 4 or Gemini 2.0 Pro. For multi-step reasoning, OpenAI o1 or Claude Opus 4 extended thinking. For multimodal, GPT-4o or Gemini 2.0 Flash. For high-volume classification and extraction, Claude Haiku 4 or Gemini Flash-Lite. For sovereign deployment, Llama 3.1 or Mistral self-hosted.
The full vendor selection framework lives in our enterprise AI vendor selection framework. For per-vendor deep dives see ChatGPT Enterprise pricing 2026, Claude Enterprise pricing 2026, OpenAI Enterprise pricing, Gemini Enterprise, and Microsoft 365 Copilot pricing 2026. For procurement counsel see AI procurement advisory, cloud contract negotiation, and software licensing advisory.