Enterprise LLM Cost Comparison 2026: Per-Token Pricing

The cheapest model in this comparison is Gemini 2.0 Flash-Lite at $0.0375 per million input tokens and $0.15 per million output tokens; among the Anthropic models the low end is Claude Haiku 4 at $0.25 input and $1.25 output. The most expensive is Claude Opus 4 at $15 per million input and $75 per million output, followed by OpenAI o1 at $15 input and $60 output. Claude Sonnet 4 lands at $3 input and $15 output, GPT-4o at $2.50 input and $10 output, Gemini 2.0 Pro at $1.25 input and $5 output, Llama 3.1 405B hosted on Bedrock at $5.32 input and $16 output, Mistral Large 2 at $3 input and $9 output, and Cohere Command R+ at $2.50 input and $10 output. Cached input and batch API discounts of 50 to 90 percent change the realised cost materially on workloads with the right shape. This comparison covers the production-ready frontier and mid-tier models that show up in enterprise procurement evaluations.

Primary frontier model per-token comparison

The published per-token pricing across the frontier model families as of Q2 2026 is summarised below. The pricing reflects standard real-time API access on the vendor's direct endpoint or the primary cloud channel (Bedrock for Llama and Claude on AWS, Vertex for Gemini and Claude on GCP).

Model	Vendor	Input per 1M tokens	Output per 1M tokens	Context window
Claude Opus 4	Anthropic	$15.00	$75.00	500K
Claude Sonnet 4	Anthropic	$3.00	$15.00	500K
Claude Haiku 4	Anthropic	$0.25	$1.25	200K
GPT-4o	OpenAI	$2.50	$10.00	128K
GPT-4o mini	OpenAI	$0.15	$0.60	128K
GPT-4.1	OpenAI	$2.00	$8.00	1M
GPT-4.1 mini	OpenAI	$0.40	$1.60	1M
o1 (reasoning)	OpenAI	$15.00	$60.00	200K
o3-mini (reasoning)	OpenAI	$1.10	$4.40	200K
Gemini 2.0 Pro	Google	$1.25	$5.00	2M
Gemini 2.0 Flash	Google	$0.075	$0.30	1M
Gemini 2.0 Flash-Lite	Google	$0.0375	$0.15	1M
Llama 3.1 405B (Bedrock)	Meta / AWS	$5.32	$16.00	128K
Llama 3.1 70B (Bedrock)	Meta / AWS	$0.99	$0.99	128K
Mistral Large 2	Mistral	$3.00	$9.00	128K
Mistral Small	Mistral	$0.20	$0.60	128K
Cohere Command R+	Cohere	$2.50	$10.00	128K
Cohere Command R	Cohere	$0.15	$0.60	128K
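
To compare models on a concrete request rather than on headline rates, the table can be turned into a small lookup. The sketch below is illustrative only: the prices are copied from a subset of the rows above, while the function name and the request shape (8,000 input tokens, 500 output tokens) are assumptions chosen for the example.

```python
# USD per 1M tokens as (input, output), copied from a subset of the table above.
PRICES = {
    "Claude Sonnet 4": (3.00, 15.00),
    "Claude Haiku 4": (0.25, 1.25),
    "GPT-4o": (2.50, 10.00),
    "Gemini 2.0 Pro": (1.25, 5.00),
    "Gemini 2.0 Flash-Lite": (0.0375, 0.15),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the list-price cost in USD of a single request."""
    input_rate, output_rate = PRICES[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Hypothetical request shape: 8,000 input tokens and 500 output tokens.
for model in PRICES:
    print(f"{model:24s} ${request_cost(model, 8_000, 500):.5f}")
```

The output covers list prices only; caching, batch, and contract discounts are not applied.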

Cached input and Batch API discount structures

The published per-token rates are starting points. Most enterprise workloads at scale qualify for cached input pricing (workloads with stable system prompts) or Batch API pricing (workloads that do not require real-time response). The discount structures vary materially by vendor.

Vendor	Cached input discount	Batch API discount	Notes
Anthropic	90% off (1-hour cache)	50% off	Cache write costs 125% of standard for 5-minute cache
OpenAI	50% off (5-minute cache)	50% off	Implicit caching; explicit caching also available
Google Gemini	75% off (context caching)	50% off	Per-token cache storage fee applies
AWS Bedrock (Claude, Llama)	Per-model (matches direct)	50% off via Batch Inference	Bedrock Provisioned Throughput separately priced
Mistral	Not generally available	50% off	Self-hosted gives full caching control
Cohere	Limited	50% off via batch jobs	Per-tenant negotiation

The cached input effect in practice: Consider a retrieval-augmented chat application with a stable 30,000-token system prompt and 500-token user prompts. Without caching, every request bills 30,500 input tokens at $3 per million on Claude Sonnet 4 ($0.0915 per request). With a 90 percent caching discount on the 30,000-token system prompt, each request bills the cached tokens at one tenth of the standard rate (equivalent to 3,000 standard input tokens) plus 500 standard input tokens, or about $0.0105 per request before cache write charges. The cached approach delivers an 88 percent cost reduction on the input side. For a chat application running at 100,000 requests per day, the saving of roughly $0.081 per request comes to about $8,100 per day, or close to $3 million per year, on a single workload.
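
The per-request arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch: the function name is hypothetical, and cache write charges and cache expiry are deliberately ignored, so real bills will differ somewhat.

```python
def input_cost_per_request(
    system_tokens: int,
    user_tokens: int,
    rate_per_million: float,
    cache_discount: float = 0.0,
) -> float:
    """Input-side cost in USD of one request.

    Assumes the whole system prompt is served from cache on every request
    and ignores cache write charges (a simplification).
    """
    cached_rate = rate_per_million * (1.0 - cache_discount)
    return (system_tokens * cached_rate + user_tokens * rate_per_million) / 1_000_000


# Figures from the example: 30,000-token system prompt, 500-token user prompt,
# $3 per million input tokens, 90 percent cached input discount.
uncached = input_cost_per_request(30_000, 500, 3.00)
cached = input_cost_per_request(30_000, 500, 3.00, cache_discount=0.90)
reduction = 1.0 - cached / uncached

print(f"uncached ${uncached:.4f}, cached ${cached:.4f}, reduction {reduction:.1%}")
```

Changing `cache_discount` to 0.50 or 0.75 gives the equivalent figures for the other discount levels listed in the table.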

Provisioned Throughput and reserved capacity

For sustained high-throughput workloads, the per-token billing model is not the cheapest option. Vendor-side reserved capacity (Azure OpenAI PTU, AWS Bedrock Provisioned Throughput, Anthropic Capacity Reservations, Google Vertex Provisioned Throughput) typically delivers 25 to 45 percent cost reduction at sustained throughput above 100 to 300 tokens per second.

Reserved capacity SKU	Pricing model	Breakeven point versus per-token
Azure OpenAI Provisioned Throughput Unit	Per-PTU per month, reserved capacity	~150 tokens/sec sustained
AWS Bedrock Provisioned Throughput	Per-model-unit per hour	~200 tokens/sec sustained
Anthropic Capacity Reservation	Per-tenant monthly commit	Negotiated per use case
Google Vertex Provisioned Throughput	Per-GSU per month	~100 tokens/sec sustained

The decision is driven by workload pattern. Steady-state workloads (chat assistants, RAG applications, classification pipelines) at sustained high throughput favour reserved capacity. Bursty or low-throughput workloads (occasional analysis, periodic batch jobs) favour per-token. For mixed workloads the right pattern is reserved capacity for the steady-state baseline plus per-token for burst, which most vendor reserved offerings support.
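
The breakeven logic can be sketched as a comparison between a flat monthly fee and the per-token bill at a given sustained throughput. Everything numeric in the example call is hypothetical: the reserved fee, the blended per-token rate, and the 30-day month are assumptions, not vendor prices.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumes a 30-day month


def per_token_monthly_cost(tokens_per_sec: float, blended_rate_per_million: float) -> float:
    """Monthly USD cost of a sustained throughput billed per token."""
    tokens = tokens_per_sec * SECONDS_PER_MONTH
    return tokens * blended_rate_per_million / 1_000_000


def breakeven_tokens_per_sec(reserved_monthly_fee: float, blended_rate_per_million: float) -> float:
    """Sustained throughput at which per-token billing equals the reserved fee."""
    return reserved_monthly_fee * 1_000_000 / (blended_rate_per_million * SECONDS_PER_MONTH)


def cheaper_option(tokens_per_sec: float, reserved_monthly_fee: float,
                   blended_rate_per_million: float) -> str:
    """Return which billing model is cheaper at the given sustained throughput."""
    on_demand = per_token_monthly_cost(tokens_per_sec, blended_rate_per_million)
    return "reserved" if on_demand > reserved_monthly_fee else "per-token"


# Hypothetical inputs: $2,000 per month reserved fee, $5 per million blended rate.
print(f"breakeven: {breakeven_tokens_per_sec(2_000, 5.0):.0f} tokens/sec")
print(cheaper_option(300, 2_000, 5.0))
print(cheaper_option(50, 2_000, 5.0))
```

For a mixed workload, the same functions can be applied to the steady-state baseline alone, with burst traffic priced separately at the per-token rate.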

Cost comparison by workload class

The right model for a workload is rarely the cheapest model. The right model is the cheapest model that meets the capability threshold for that workload. The table below shows representative cost across the five primary workload classes at meaningful enterprise scale, with the typical model choice for each.

Workload	Volume assumption	Recommended model	Monthly cost
Enterprise chat assistant (RAG)	5M user requests / month, 8K tokens avg	Claude Sonnet 4 with caching	$48K to $72K
Document analysis (legal review)	250K documents / month, 50K tokens avg	Claude Opus 4 or Gemini 2.0 Pro	$94K to $187K
Code generation (developer assistant)	800K completions / month, 4K tokens avg	Claude Sonnet 4 or GPT-4o	$28K to $35K
Classification / extraction at scale	200M items / month, 1K tokens avg	Claude Haiku 4 or Gemini Flash	$8K to $32K
Multi-step reasoning (analysis)	50K tasks / month, 20K tokens avg	o1 or Claude Opus 4 extended thinking	$40K to $115K
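
A first-pass monthly estimate for any row can be built from request volume, average tokens per request, and the per-token rates. The sketch below is an illustration under stated assumptions: the table gives only a total token average, so the share of tokens that are output is a hypothetical parameter, and the result is a list-price figure that will not necessarily match the ranges above, which may reflect caching, batch, or contract discounts.

```python
def monthly_cost(
    requests: int,
    avg_tokens: int,
    output_share: float,
    input_rate: float,
    output_rate: float,
) -> float:
    """List-price monthly cost in USD.

    output_share is the assumed fraction of tokens that are output tokens;
    rates are USD per 1M tokens.
    """
    total_tokens = requests * avg_tokens
    output_tokens = total_tokens * output_share
    input_tokens = total_tokens - output_tokens
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Code generation row: 800K completions per month at 4K tokens on Claude Sonnet 4
# ($3 input, $15 output), with a hypothetical 50 percent output share.
estimate = monthly_cost(800_000, 4_000, 0.5, 3.00, 15.00)
print(f"${estimate:,.0f} per month")
```

Because output tokens carry the higher rate, the estimate is most sensitive to `output_share`, which is worth measuring from production logs before committing to a budget.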

The hidden cost layers

The published per-token rates do not capture three cost layers that matter at enterprise scale. The cost model that omits these layers under-states three-year TCO by 15 to 35 percent on most workloads.

Embedding generation. Most enterprise AI workloads include a retrieval layer that converts customer documents into vector embeddings. Embedding generation costs are usually below 5 percent of total spend but show up as a discrete line item that surprises buyers who modelled only generation cost. OpenAI text-embedding-3-large runs at $0.13 per million tokens. Voyage AI (favoured for Claude integrations) runs at $0.18 per million tokens for voyage-3.

Image and multimodal input tokens. Image inputs to multimodal models are tokenised at vendor-specific rates. A standard image to GPT-4o is approximately 765 to 1,105 tokens depending on resolution. Heavy image workloads (document OCR, chart analysis, screenshot processing) can dominate the token spend.

Egress, observability, and orchestration. The infrastructure layer around the model (LangChain or LlamaIndex orchestration, vector database, observability tooling, prompt management, evaluation tooling) usually adds 15 to 30 percent to the model spend at enterprise scale.
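
The three layers can be folded into a rough cost model alongside the generation spend. This is a sketch only: the function names, the corpus size, the image volume, and the base model spend are hypothetical inputs, while the $0.13 embedding rate, the 765-token image figure, and the 15 to 30 percent overhead range come from the text above.

```python
def embedding_cost(corpus_tokens: int, rate_per_million: float = 0.13) -> float:
    """USD cost to embed a corpus once (default rate: text-embedding-3-large)."""
    return corpus_tokens * rate_per_million / 1_000_000


def image_input_cost(images: int, tokens_per_image: int, input_rate: float) -> float:
    """USD cost of image inputs billed as input tokens."""
    return images * tokens_per_image * input_rate / 1_000_000


def with_infrastructure_overhead(model_spend: float, overhead: float) -> float:
    """Model spend plus the surrounding infrastructure layer."""
    return model_spend * (1.0 + overhead)


# Hypothetical volumes: a 1B-token corpus and 1M images at 765 tokens each on GPT-4o.
print(f"embeddings: ${embedding_cost(1_000_000_000):,.2f}")
print(f"images:     ${image_input_cost(1_000_000, 765, 2.50):,.2f}")

# Hypothetical $100,000 model spend with the 15 to 30 percent overhead range.
low = with_infrastructure_overhead(100_000, 0.15)
high = with_infrastructure_overhead(100_000, 0.30)
print(f"with infrastructure: ${low:,.0f} to ${high:,.0f}")
```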

The output-token bias in cost modelling: Output tokens cost 4x to 5x input tokens on most frontier models. Workloads that look cheap on input-token counts can be expensive on output. The mitigation is to constrain output length explicitly in prompts and to use the most concise model that meets the quality threshold for the workload. A response that takes 2,000 output tokens at $15 per million on Claude Sonnet 4 costs $0.030. The same response generated by Claude Haiku 4 at $1.25 per million costs $0.0025. For high-volume workloads where Haiku 4 quality is sufficient, the 12x cost reduction is the largest single optimisation available in 2026.
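
To see how much of a request's cost sits on the output side, it helps to split the bill. In the sketch below the 2,000-token response and the rates come from the example above; the 8,000-token input and the function name are assumptions added for illustration.

```python
from typing import Tuple


def cost_split(
    input_tokens: int,
    output_tokens: int,
    input_rate: float,
    output_rate: float,
) -> Tuple[float, float, float]:
    """Return (input cost, output cost, output share of total) for one request."""
    input_cost = input_tokens * input_rate / 1_000_000
    output_cost = output_tokens * output_rate / 1_000_000
    return input_cost, output_cost, output_cost / (input_cost + output_cost)


# Hypothetical 8,000-token input with the 2,000-token response from the example.
sonnet_in, sonnet_out, sonnet_share = cost_split(8_000, 2_000, 3.00, 15.00)
haiku_in, haiku_out, haiku_share = cost_split(8_000, 2_000, 0.25, 1.25)

print(f"Sonnet 4: output ${sonnet_out:.4f}, {sonnet_share:.0%} of request cost")
print(f"Haiku 4:  output ${haiku_out:.4f}, {haiku_share:.0%} of request cost")
print(f"output cost ratio: {sonnet_out / haiku_out:.0f}x")
```

With these inputs the output tokens are a fifth of the token count but more than half of the cost, which is why capping response length is usually the first lever to pull.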

Volume discount bands and enterprise contracts

At enterprise scale (above $50,000 per month in API spend), all the frontier vendors entertain custom commercial terms. The discount realisation tracks roughly with monthly committed spend.

Monthly committed spend	Typical discount band	Notes
$50K to $100K	0 to 8 percent	Pay-as-you-go, no commit
$100K to $500K	8 to 18 percent	12-month commit with quarterly true-up
$500K to $2M	15 to 25 percent	Multi-year commit, custom SLAs
$2M+	20 to 35 percent	Capacity reservation + bespoke terms
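
For budgeting, the bands can be applied as a simple lookup. The sketch below encodes the table as written; treating each band's lower bound as inclusive and returning no band below $50,000 are assumptions, since the table does not specify either, and actual discounts are negotiated.

```python
from typing import Optional, Tuple

# (lower bound of monthly spend in USD, low discount, high discount), from the table above.
BANDS = [
    (2_000_000, 0.20, 0.35),
    (500_000, 0.15, 0.25),
    (100_000, 0.08, 0.18),
    (50_000, 0.00, 0.08),
]


def discount_band(monthly_spend: float) -> Optional[Tuple[float, float]]:
    """Return the (low, high) typical discount, or None below the first band."""
    for lower_bound, low, high in BANDS:
        if monthly_spend >= lower_bound:
            return low, high
    return None


def net_spend_range(monthly_spend: float) -> Tuple[float, float]:
    """Return (best case, worst case) monthly spend after the typical discount."""
    band = discount_band(monthly_spend)
    if band is None:
        return monthly_spend, monthly_spend
    low, high = band
    return monthly_spend * (1.0 - high), monthly_spend * (1.0 - low)


best, worst = net_spend_range(1_000_000)
print(f"$1M list spend nets to ${best:,.0f} to ${worst:,.0f} per month")
```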

Cloud channel arbitrage

Claude is available on Anthropic direct, AWS Bedrock, and Google Vertex. GPT-4 is available on OpenAI direct and Azure OpenAI. Llama is available on Bedrock, Vertex, and Azure AI Studio. Gemini is Google-only. Mistral is direct, Bedrock, Vertex, and Azure AI Studio.

The token economics are usually identical across channels (the model vendor does not pass discount through cloud sub-channels), but the commercial accounting changes materially. For customers with material AWS EDP commit, Claude via Bedrock burns the EDP balance. For customers with Microsoft MACC, Azure OpenAI burns the MACC. For customers with Google Cloud commit, Vertex burns the commit. The channel arbitrage is purely a commercial structure decision; capability is unaffected.

The full hyperscaler cloud cost framework lives in our AWS EDP pillar, Azure MACC analysis, and GCP enterprise agreement guide. For Bedrock specifically see AWS Bedrock pricing 2026.

Picking the right model for the workload

The right model decision in 2026 looks like this. For broad-population chat with RAG, Claude Sonnet 4 or GPT-4o with caching is the cost-efficient choice. For long-document analysis, Claude Opus 4 or Gemini 2.0 Pro. For multi-step reasoning, OpenAI o1 or Claude Opus 4 extended thinking. For multimodal, GPT-4o or Gemini 2.0 Flash. For high-volume classification and extraction, Claude Haiku 4 or Gemini Flash-Lite. For sovereign deployment, Llama 3.1 or Mistral self-hosted.

The full vendor selection framework lives in our enterprise AI vendor selection framework. For per-vendor deep dives see ChatGPT Enterprise pricing 2026, Claude Enterprise pricing 2026, OpenAI Enterprise pricing, Gemini Enterprise, and Microsoft 365 Copilot pricing 2026. For procurement counsel see AI procurement advisory, cloud contract negotiation, and software licensing advisory.
