
Enterprise LLM Cost Comparison 2026

The per-token pricing reference across GPT-4, Claude 4, Gemini 2.0, Llama 3.1, Mistral Large, and Cohere Command for 2026. Cached input discounts, Batch API rates, Provisioned Throughput economics, and the right model choice by workload class.

Updated December 2025 · 2,400-Word Guide · AI / LLM

Among the models compared here, 2026 list prices span more than two orders of magnitude. The cheapest is Gemini 2.0 Flash-Lite at $0.0375 per million input tokens and $0.15 per million output tokens. The most expensive is Claude Opus 4 at $15 per million input and $75 per million output, with OpenAI o1 close behind at $15 input and $60 output. In between, Claude Haiku 4 lands at $0.25 input and $1.25 output, Claude Sonnet 4 at $3 input and $15 output, GPT-4o at $2.50 input and $10 output, Gemini 2.0 Pro at $1.25 input and $5 output, Llama 3.1 405B hosted on Bedrock at $5.32 input and $16 output, Mistral Large 2 at $3 input and $9 output, and Cohere Command R+ at $2.50 input and $10 output. Cached input and batch API discounts of 50 to 90 percent change the realised cost materially on workloads with the right shape. This comparison covers the production-ready frontier and mid-tier models that show up in enterprise procurement evaluations.

Primary frontier model per-token comparison

The published per-token pricing across the frontier model families as of Q2 2026 is summarised below. The pricing reflects standard real-time API access on the vendor's direct endpoint or the primary cloud channel (Bedrock for Llama and Claude on AWS, Vertex for Gemini and Claude on GCP).

| Model | Vendor | Input per 1M tokens | Output per 1M tokens | Context window |
| --- | --- | --- | --- | --- |
| Claude Opus 4 | Anthropic | $15.00 | $75.00 | 500K |
| Claude Sonnet 4 | Anthropic | $3.00 | $15.00 | 500K |
| Claude Haiku 4 | Anthropic | $0.25 | $1.25 | 200K |
| GPT-4o | OpenAI | $2.50 | $10.00 | 128K |
| GPT-4o mini | OpenAI | $0.15 | $0.60 | 128K |
| GPT-4.1 | OpenAI | $2.00 | $8.00 | 1M |
| GPT-4.1 mini | OpenAI | $0.40 | $1.60 | 1M |
| o1 (reasoning) | OpenAI | $15.00 | $60.00 | 200K |
| o3-mini (reasoning) | OpenAI | $1.10 | $4.40 | 200K |
| Gemini 2.0 Pro | Google | $1.25 | $5.00 | 2M |
| Gemini 2.0 Flash | Google | $0.075 | $0.30 | 1M |
| Gemini 2.0 Flash-Lite | Google | $0.0375 | $0.15 | 1M |
| Llama 3.1 405B (Bedrock) | Meta / AWS | $5.32 | $16.00 | 128K |
| Llama 3.1 70B (Bedrock) | Meta / AWS | $0.99 | $0.99 | 128K |
| Mistral Large 2 | Mistral | $3.00 | $9.00 | 128K |
| Mistral Small | Mistral | $0.20 | $0.60 | 128K |
| Cohere Command R+ | Cohere | $2.50 | $10.00 | 128K |
| Cohere Command R | Cohere | $0.15 | $0.60 | 128K |
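
To turn the per-million rates into a per-request figure, a minimal Python sketch is shown here. The prices are copied from the table above for four of the models; the request shape (4,000 input tokens, 500 output tokens) and the function name `request_cost` are illustrative assumptions, and no caching or batch discount is applied.

```python
# Prices in USD per 1M tokens as (input, output), copied from the table above.
PRICES = {
    "Claude Sonnet 4": (3.00, 15.00),
    "GPT-4o": (2.50, 10.00),
    "Gemini 2.0 Pro": (1.25, 5.00),
    "Claude Haiku 4": (0.25, 1.25),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at list price (no caching or batch discount)."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    # Hypothetical request shape: 4,000 input tokens and 500 output tokens.
    for model in PRICES:
        print(f"{model}: ${request_cost(model, 4_000, 500):.4f} per request")
```

Multiplying the per-request figure by monthly request volume gives a first-pass list-price budget, before any of the discounts discussed in the following sections.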

Cached input and Batch API discount structures

The published per-token rates are starting points. Most enterprise workloads at scale qualify for cached input pricing (workloads with stable system prompts) or Batch API pricing (workloads that do not require real-time response). The discount structures vary materially by vendor.

| Vendor | Cached input discount | Batch API discount | Notes |
| --- | --- | --- | --- |
| Anthropic | 90% off (1-hour cache) | 50% off | Cache write costs 125% of standard for 5-minute cache |
| OpenAI | 50% off (5-minute cache) | 50% off | Implicit caching; explicit caching also available |
| Google Gemini | 75% off (context caching) | 50% off | Per-token cache storage fee applies |
| AWS Bedrock (Claude, Llama) | Per-model (matches direct) | 50% off via Batch Inference | Bedrock Provisioned Throughput separately priced |
| Mistral | Not generally available | 50% off | Self-hosted gives full caching control |
| Cohere | Limited | 50% off via batch jobs | Per-tenant negotiation |

The cached input effect in practice: consider a retrieval-augmented chat application with a stable 30,000-token system prompt and 500-token user prompts. Without caching, every request bills 30,500 input tokens at $3 per million on Claude Sonnet 4 ($0.0915 per request). With a 90 percent caching discount on the 30,000-token system prompt, each request bills the equivalent of 3,000 standard input tokens for the cached prompt plus 500 standard input tokens (about $0.011 per request). The cached approach delivers an 88 percent cost reduction on the input side, before cache-write surcharges. For a chat application running at 100,000 requests per day, that is roughly $8,100 per day, or close to $3 million per year, in saved spend on a single workload.
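
The arithmetic in this example can be reproduced with a short Python sketch. The function names are illustrative, and the model is deliberately simplified: it assumes every request hits the cache and ignores the cache-write surcharge noted in the table above, so real savings will be somewhat lower.

```python
def input_cost_per_request(cached_tokens: int, fresh_tokens: int,
                           price_per_million: float, cache_discount: float) -> float:
    """Input-side cost in USD; cached tokens bill at (1 - cache_discount) of list price."""
    billed_tokens = cached_tokens * (1.0 - cache_discount) + fresh_tokens
    return billed_tokens * price_per_million / 1_000_000


def annual_saving(uncached_usd: float, cached_usd: float,
                  requests_per_day: int, days: int = 365) -> float:
    """Yearly saving in USD, assuming constant daily volume and a 100% cache hit rate."""
    return (uncached_usd - cached_usd) * requests_per_day * days


if __name__ == "__main__":
    # Figures from the example: 30,000-token system prompt, 500-token user prompt,
    # Claude Sonnet 4 input at $3 per million, 90 percent cached-input discount.
    uncached = input_cost_per_request(30_000, 500, 3.00, cache_discount=0.0)
    cached = input_cost_per_request(30_000, 500, 3.00, cache_discount=0.90)
    print(f"uncached: ${uncached:.4f}  cached: ${cached:.4f}")
    print(f"input-side reduction: {1 - cached / uncached:.1%}")
    print(f"annual saving at 100,000 requests/day: ${annual_saving(uncached, cached, 100_000):,.0f}")
```

Lowering the assumed hit rate, or adding a write cost for each cache refresh, is the first adjustment to make when adapting this to a real workload.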

Provisioned Throughput and reserved capacity

For sustained high-throughput workloads, the per-token billing model is not the cheapest option. Vendor-side reserved capacity (Azure OpenAI PTU, AWS Bedrock Provisioned Throughput, Anthropic Capacity Reservations, Google Vertex Provisioned Throughput) typically delivers 25 to 45 percent cost reduction at sustained throughput above 100 to 300 tokens per second.

| Reserved capacity SKU | Pricing model | Breakeven point versus per-token |
| --- | --- | --- |
| Azure OpenAI Provisioned Throughput Unit | Per-PTU per month, reserved capacity | ~150 tokens/sec sustained |
| AWS Bedrock Provisioned Throughput | Per-model-unit per hour | ~200 tokens/sec sustained |
| Anthropic Capacity Reservation | Per-tenant monthly commit | Negotiated per use case |
| Google Vertex Provisioned Throughput | Per-GSU per month | ~100 tokens/sec sustained |

The decision is driven by workload pattern. Steady-state workloads (chat assistants, RAG applications, classification pipelines) at sustained high throughput favour reserved capacity. Bursty or low-throughput workloads (occasional analysis, periodic batch jobs) favour per-token. For mixed workloads the right pattern is reserved capacity for the steady-state baseline plus per-token for burst, which most vendor reserved offerings support.
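
The breakeven logic can be sketched in Python as follows. The reserved fee and blended per-token price used in the example are placeholder numbers chosen to make the arithmetic easy to follow, not vendor quotes, and the 30-day month is an assumption.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumes a 30-day month


def breakeven_tokens_per_second(reserved_monthly_usd: float,
                                blended_price_per_million: float) -> float:
    """Sustained throughput at which per-token billing equals the reserved monthly fee."""
    breakeven_tokens_per_month = reserved_monthly_usd / blended_price_per_million * 1_000_000
    return breakeven_tokens_per_month / SECONDS_PER_MONTH


def cheaper_option(sustained_tps: float, reserved_monthly_usd: float,
                   blended_price_per_million: float) -> str:
    """Return 'reserved' or 'per-token' for a steady workload at the given throughput."""
    per_token_monthly = sustained_tps * SECONDS_PER_MONTH * blended_price_per_million / 1_000_000
    return "reserved" if per_token_monthly > reserved_monthly_usd else "per-token"


if __name__ == "__main__":
    # Placeholder inputs: $2,592 per month reserved fee, $5 blended price per 1M tokens.
    print(breakeven_tokens_per_second(2_592, 5.00))
    print(cheaper_option(300, 2_592, 5.00))
    print(cheaper_option(50, 2_592, 5.00))
```

For a mixed workload, the same comparison applies to the steady-state baseline only, with burst traffic left on per-token billing.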

Cost comparison by workload class

The right model for a workload is rarely the cheapest model. The right model is the cheapest model that meets the capability threshold for that workload. The table below shows representative cost across the five primary workload classes at meaningful enterprise scale, with the typical model choice for each.

| Workload | Volume assumption | Recommended model | Monthly cost |
| --- | --- | --- | --- |
| Enterprise chat assistant (RAG) | 5M user requests / month, 8K tokens avg | Claude Sonnet 4 with caching | $48K to $72K |
| Document analysis (legal review) | 250K documents / month, 50K tokens avg | Claude Opus 4 or Gemini 2.0 Pro | $94K to $187K |
| Code generation (developer assistant) | 800K completions / month, 4K tokens avg | Claude Sonnet 4 or GPT-4o | $28K to $35K |
| Classification / extraction at scale | 200M items / month, 1K tokens avg | Claude Haiku 4 or Gemini Flash | $8K to $32K |
| Multi-step reasoning (analysis) | 50K tasks / month, 20K tokens avg | o1 or Claude Opus 4 extended thinking | $40K to $115K |
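
The selection rule, the cheapest model that clears the capability threshold, can be sketched in Python. The candidate tiers, quality scores, and per-request costs below are invented placeholders; in practice the quality score would come from an internal evaluation on the workload's own data.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    quality: float           # placeholder score from an internal evaluation, 0 to 1
    cost_per_request: float  # USD, from your own cost model


def pick_model(candidates: Sequence[Candidate], min_quality: float) -> Optional[Candidate]:
    """Return the cheapest candidate whose quality meets the threshold, or None."""
    eligible = [c for c in candidates if c.quality >= min_quality]
    if not eligible:
        return None
    return min(eligible, key=lambda c: c.cost_per_request)


# Hypothetical tiers, not measurements of any real model.
CANDIDATES = [
    Candidate("large-tier", quality=0.95, cost_per_request=0.0900),
    Candidate("mid-tier", quality=0.88, cost_per_request=0.0200),
    Candidate("small-tier", quality=0.74, cost_per_request=0.0020),
]

if __name__ == "__main__":
    for threshold in (0.70, 0.85, 0.90):
        choice = pick_model(CANDIDATES, threshold)
        print(threshold, choice.name if choice else "no model qualifies")
```

The threshold is set per workload, which is why the same organisation can reasonably end up with a different model for each row of the table.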

The hidden cost layers

The published per-token rates do not capture three cost layers that matter at enterprise scale. A cost model that omits these layers understates three-year TCO by 15 to 35 percent on most workloads.

Embedding generation. Most enterprise AI workloads include a retrieval layer that converts customer documents into vector embeddings. Embedding generation costs are usually below 5 percent of total spend but show up as a discrete line item that surprises buyers who modelled only generation cost. OpenAI text-embedding-3-large runs at $0.13 per million tokens. Voyage AI (favoured for Claude integrations) runs at $0.18 per million tokens for voyage-3.

Image and multimodal input tokens. Image inputs to multimodal models are tokenised at vendor-specific rates. A standard image to GPT-4o is approximately 765 to 1,105 tokens depending on resolution. Heavy image workloads (document OCR, chart analysis, screenshot processing) can dominate the token spend.

Egress, observability, and orchestration. The infrastructure layer around the model (LangChain or LlamaIndex orchestration, vector database, observability tooling, prompt management, evaluation tooling) usually adds 15 to 30 percent to the model spend at enterprise scale.
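
A rough way to add two of these layers to a budget is sketched below in Python. The embedding rate is the text-embedding-3-large figure quoted above and the overhead range is the 15 to 30 percent from the text; the corpus size and monthly model spend in the example are hypothetical.

```python
from typing import Tuple


def embedding_cost(corpus_tokens: int, price_per_million: float = 0.13) -> float:
    """One-off cost in USD to embed a corpus at the given per-million-token rate."""
    return corpus_tokens * price_per_million / 1_000_000


def infrastructure_overhead(monthly_model_spend: float,
                            low: float = 0.15, high: float = 0.30) -> Tuple[float, float]:
    """Low and high monthly USD added by orchestration, vector database and tooling."""
    return monthly_model_spend * low, monthly_model_spend * high


if __name__ == "__main__":
    # Hypothetical inputs: a 2-billion-token corpus and $60,000 per month of model spend.
    print(f"embedding pass: ${embedding_cost(2_000_000_000):,.0f}")
    low, high = infrastructure_overhead(60_000)
    print(f"infrastructure layer: ${low:,.0f} to ${high:,.0f} per month")
```

Image tokens are left out of this sketch because the per-image token count depends on the vendor and the resolution.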

The output-token bias in cost modelling: Output tokens cost 4x to 5x input tokens on most frontier models. Workloads that look cheap on input-token counts can be expensive on output. The mitigation is to constrain output length explicitly in prompts and to use the most concise model that meets the quality threshold for the workload. A response that takes 2,000 output tokens at $15 per million on Claude Sonnet 4 costs $0.030. The same response generated by Claude Haiku 4 at $1.25 per million costs $0.0025. For high-volume workloads where Haiku 4 quality is sufficient, the 12x cost reduction is the largest single optimisation available in 2026.
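
The output-side arithmetic can be checked with a few lines of Python. The prices are the Claude Sonnet 4 and Claude Haiku 4 rates from the pricing table; the 4,000-input, 1,000-output request used for the share calculation is an illustrative assumption.

```python
def output_cost(output_tokens: int, output_price_per_million: float) -> float:
    """Output-side cost in USD for one response."""
    return output_tokens * output_price_per_million / 1_000_000


def output_share(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Fraction of a request's list-price cost that comes from output tokens."""
    input_usd = input_tokens * input_price / 1_000_000
    output_usd = output_cost(output_tokens, output_price)
    return output_usd / (input_usd + output_usd)


if __name__ == "__main__":
    sonnet = output_cost(2_000, 15.00)  # 2,000 output tokens on Claude Sonnet 4
    haiku = output_cost(2_000, 1.25)    # the same response on Claude Haiku 4
    print(f"Sonnet 4: ${sonnet:.4f}  Haiku 4: ${haiku:.4f}  ratio: {sonnet / haiku:.0f}x")
    # Hypothetical request: 4,000 input and 1,000 output tokens on Claude Sonnet 4.
    print(f"output share of cost: {output_share(4_000, 1_000, 3.00, 15.00):.1%}")
```

In the hypothetical request, output tokens are a fifth of the token count but more than half of the cost, which is the bias the paragraph describes.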

Volume discount bands and enterprise contracts

At enterprise scale (above $50,000 per month in API spend), all the frontier vendors entertain custom commercial terms. The discount realisation tracks roughly with monthly committed spend.

| Monthly committed spend | Typical discount band | Notes |
| --- | --- | --- |
| $50K to $100K | 0 to 8 percent | Pay-as-you-go, no commit |
| $100K to $500K | 8 to 18 percent | 12-month commit with quarterly true-up |
| $500K to $2M | 15 to 25 percent | Multi-year commit, custom SLAs |
| $2M+ | 20 to 35 percent | Capacity reservation + bespoke terms |
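
For budgeting, the bands above can be encoded as a simple lookup. This Python sketch assumes each band's lower bound is inclusive and returns nothing below $50,000 per month, since the table does not cover that range; the function names are illustrative.

```python
from typing import Optional, Tuple

# (band floor in USD per month, (low %, high %)) taken from the table above.
BANDS = [
    (50_000, (0, 8)),
    (100_000, (8, 18)),
    (500_000, (15, 25)),
    (2_000_000, (20, 35)),
]


def discount_band(monthly_spend_usd: float) -> Optional[Tuple[int, int]]:
    """Typical discount range in percent, or None below the first band."""
    band = None
    for floor, pct_range in BANDS:
        if monthly_spend_usd >= floor:  # assumption: lower bounds are inclusive
            band = pct_range
    return band


def net_spend_range(monthly_spend_usd: float) -> Optional[Tuple[float, float]]:
    """Monthly spend after the high-end and low-end discount, as (best case, worst case)."""
    band = discount_band(monthly_spend_usd)
    if band is None:
        return None
    low_pct, high_pct = band
    return (monthly_spend_usd * (1 - high_pct / 100),
            monthly_spend_usd * (1 - low_pct / 100))


if __name__ == "__main__":
    print(discount_band(750_000))
    print(net_spend_range(750_000))
```

The bands are typical ranges rather than a price list, so the output is a planning envelope, not a quote.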

Cloud channel arbitrage

Claude is available on Anthropic direct, AWS Bedrock, and Google Vertex. GPT-4 is available on OpenAI direct and Azure OpenAI. Llama is available on Bedrock, Vertex, and Azure AI Studio. Gemini is Google-only. Mistral is direct, Bedrock, Vertex, and Azure AI Studio.

The token economics are usually identical across channels (the model vendors do not pass discounts through cloud sub-channels), but the commercial accounting changes materially. For customers with a material AWS EDP commit, Claude via Bedrock burns down the EDP balance. For customers with a Microsoft MACC, Azure OpenAI burns down the MACC. For customers with a Google Cloud commit, Vertex burns down the commit. The channel arbitrage is purely a commercial structure decision; capability is unaffected.

The full hyperscaler cloud cost framework lives in our AWS EDP pillar, Azure MACC analysis, and GCP enterprise agreement guide. For Bedrock specifically see AWS Bedrock pricing 2026.

Picking the right model for the workload

The right model decision in 2026 looks like this. For broad-population chat with RAG, Claude Sonnet 4 or GPT-4o with caching is the cost-efficient choice. For long-document analysis, Claude Opus 4 or Gemini 2.0 Pro. For multi-step reasoning, OpenAI o1 or Claude Opus 4 extended thinking. For multimodal, GPT-4o or Gemini 2.0 Flash. For high-volume classification and extraction, Claude Haiku 4 or Gemini Flash-Lite. For sovereign deployment, Llama 3.1 or Mistral self-hosted.

The full vendor selection framework lives in our enterprise AI vendor selection framework. For per-vendor deep dives see ChatGPT Enterprise pricing 2026, Claude Enterprise pricing 2026, OpenAI Enterprise pricing, Gemini Enterprise, and Microsoft 365 Copilot pricing 2026. For procurement counsel see AI procurement advisory, cloud contract negotiation, and software licensing advisory.
