Vertex AI charges $0.000125 to $0.005 per 1,000 input characters for Gemini text models, with provisioned throughput priced at $42 to $158 per GSU-hour depending on model and region. A 2,000-seat enterprise rollout on Gemini 2.0 Pro typically lands at $240,000 to $660,000 in annual run-rate cost, with the variation driven by retrieval-augmented context size and image inputs. Vertex AI bills per-character rather than per-token, which makes direct comparison to Bedrock and Azure OpenAI deceptive. This page documents the 2026 Vertex AI price list, the per-character to per-token conversion, and the cost-control levers that work.
Gemini text model pricing
The pricing table below shows Google's Vertex AI public list pricing for Gemini text generation as of Q1 2026, in us-central1. Other regions vary 2 to 5 percent.
| Model | Input per 1K chars | Output per 1K chars | Context window |
|---|---|---|---|
| Gemini 2.0 Pro | $0.00125 | $0.005 | 2M tokens |
| Gemini 2.0 Flash | $0.0000375 | $0.00015 | 1M tokens |
| Gemini 2.0 Flash-Lite | $0.0000125 | $0.00005 | 1M tokens |
| Gemini 1.5 Pro | $0.00125 (≤128K), $0.0025 (>128K) | $0.005 / $0.01 | 2M tokens |
| Gemini 1.5 Flash | $0.0000375 (≤128K), $0.000075 | $0.00015 / $0.0003 | 1M tokens |
| Gemini 1.5 Flash-8B | $0.00001875 | $0.000075 | 1M tokens |
| Imagen 3 (image generation) | $0.04 per image | n/a | 1024x1024 standard |
| Veo 2 (video generation) | $0.50 per second | n/a | 8 second max default |
Character-based pricing makes per-query cost depend on language. English averages 4.1 characters per token. Spanish averages 4.4. German averages 5.1. Japanese averages 1.3 (because Japanese characters carry more information per character than Latin-script characters). A Japanese-language Gemini deployment costs 3x less per token than the equivalent English deployment.
Comparing Vertex to Bedrock and Azure OpenAI: Vertex's $0.00125 per 1,000 characters for Gemini 2.0 Pro converts to approximately $0.0051 per 1,000 tokens (using the 4.1 character-per-token English ratio). Claude 3.5 Sonnet on Bedrock is $0.003 per 1,000 input tokens. GPT-4o on Azure is $0.0025 per 1,000 input tokens. Gemini 2.0 Pro is therefore the most expensive of the three on input. On output, Gemini at $0.005 per 1,000 characters equals $0.0205 per 1,000 tokens, versus Claude at $0.015 and GPT-4o at $0.01. Pricing comparison must convert characters to tokens before drawing conclusions.
Provisioned throughput on Vertex AI
Vertex offers Provisioned Throughput as the reserved-capacity alternative to pay-per-character. Capacity is measured in Generative AI Service Units (GSUs). One GSU delivers a model-specific token-per-second throughput floor at fixed price.
| Model | Per GSU per hour | Throughput floor per GSU |
|---|---|---|
| Gemini 2.0 Pro | $158 | 1,800 input tokens/sec, 200 output tokens/sec |
| Gemini 2.0 Flash | $42 | 9,000 input tokens/sec, 1,000 output tokens/sec |
| Gemini 1.5 Pro | $120 | 1,500 input tokens/sec, 180 output tokens/sec |
| Gemini 1.5 Flash | $36 | 8,000 input tokens/sec, 800 output tokens/sec |
The break-even between on-demand and provisioned throughput is highly model-specific. For Gemini 2.0 Flash, one GSU at $42 per hour delivers 9,000 input plus 1,000 output tokens per second sustained. That equals 32.4 million input tokens per hour and 3.6 million output tokens per hour at the floor. On-demand pricing for the same throughput is $4.96 per hour. Provisioned throughput on Flash is therefore 8x more expensive than on-demand unless utilisation approaches 100 percent of the floor. Provisioned is the wrong call for Flash workloads.
For Gemini 2.0 Pro, one GSU at $158 per hour delivers 6.48M input plus 720K output tokens per hour at the floor. On-demand for the same is $19.30 per hour. Provisioned is 8.2x more expensive on Pro at the floor, but Pro workloads more often hit the floor capacity (and benefit from the latency floor), making the math closer. The break-even is roughly 12 to 15 percent of provisioned capacity sustained.
Provisioned throughput is the right answer when latency matters. The on-demand tier shares capacity across all Vertex customers and can throttle during high-demand periods. Provisioned guarantees the floor throughput regardless of regional demand.
Grounding, context caching, and tuning
Vertex offers four cost-control features that materially reduce token consumption.
Context caching lets enterprises cache large recurring prompts (knowledge base, system prompt, few-shot examples) and pay $0.000125 per 1,000 cached input characters (75 percent off standard input pricing) plus a $0.000125 per 1,000 character per hour storage fee. For a knowledge-base prompt of 100,000 tokens reused 10,000 times per day, context caching reduces cost by 73 percent compared to retransmitting the same context every query.
Grounding with Google Search adds web-grounded responses at $35 per 1,000 grounded queries, on top of the base Gemini token cost. Grounding with Vertex AI Search (the enterprise RAG product) costs $4 per 1,000 search queries plus the underlying token cost. The Google Search grounding option introduces external content; the Vertex AI Search option uses only the customer's indexed corpus.
Supervised fine-tuning lists at $0.0008 per 1,000 training characters (training cost) plus the standard inference cost on the tuned model. A 50,000-example tuning run averaging 4,000 characters per example costs approximately $160 to train. The fine-tuned model is then served at the same per-character rate as the base model, with a one-time hosting fee of $1.20 per hour per model copy.
Distillation, generally available in 2025, trains a smaller model (Flash) to mimic the responses of a larger model (Pro) using a teacher-student approach. Distillation cost is $0.0008 per 1,000 training characters. The resulting distilled model serves at the smaller model's price point, capturing 60 to 80 percent of the larger model's quality at 10 to 20 percent of the per-query cost.
Agent Builder and Generative AI Studio
Vertex AI Agent Builder lets developers compose grounded agents using Gemini, Vertex AI Search, and tool-calling. The Agent Builder feature is free of incremental fee. Costs are the underlying Gemini tokens, Vertex AI Search queries ($4 per 1,000), and any tool invocations (Cloud Functions, Cloud Run) the agent calls.
Generative AI Studio is the no-code build environment for Vertex AI. Studio itself is free. Prompts run from Studio consume tokens at standard rates. The Studio is the right environment for prompt iteration; production should move to the Vertex AI Python SDK or REST API for cost predictability.
Five cost-control patterns that work
Five patterns reliably reduce Vertex AI spend by 30 to 55 percent on enterprise deployments.
First, model cascade. Route the 75 to 85 percent of traffic that does not need Pro to Flash or Flash-Lite. Reserve Pro for the queries that need long context, complex reasoning, or multi-modal grounding. Cascade routing typically reduces token cost by 60 to 70 percent on customer-support and knowledge-base workloads.
Second, context caching. For any prompt with a recurring context segment over 32,000 characters, context caching pays back within 1.4 minutes of repeated use. Most production RAG pipelines should cache the knowledge context aggressively.
Third, distillation of frequent prompt patterns. Identify the top 20 prompt templates by volume. Distil each to a Flash-class model. Route those 20 templates to the distilled model; route the long-tail to base Gemini. Distillation effort pays back in two to four weeks of production traffic.
Fourth, output cap. Set max-output-tokens to the realistic upper bound for the prompt, not the default 8,192. Most production prompts produce 300 to 800 tokens. The cap eliminates the runaway-generation cost case.
Fifth, batch prediction. For non-real-time workloads (overnight summarisation, batch classification), Vertex Batch Prediction is priced 50 percent off on-demand on supported models. Batch jobs run on Google-managed schedule with completion within the agreed SLA.
The Gemini bill-shock pattern: A customer-support RAG chatbot serving 80,000 queries per day at 30,000 input characters (RAG context) and 1,200 output characters on Gemini 2.0 Pro lists at $13,500 per month before optimisation. The same workload on Gemini 2.0 Flash with context caching enabled lists at $1,275 per month. Cascade routing (75 percent Flash, 25 percent Pro for complex queries) with caching lands at $4,378 per month. Optimisation alone delivered 67 percent saving with no quality regression on the routed traffic.
Vertex AI contractual position
Vertex AI is a strategic Google Cloud line. Google account teams have explicit Vertex growth quotas, and Vertex consumption counts toward CASC (Customer Annual Spend Commitment) commit. For CASC customers, Vertex tokens burn down commit at full list, with the CASC discount tier applied to the bill.
Three negotiation moves work in 2026. First, request a Vertex-specific discount sleeve inside the CASC, separating Vertex from general infrastructure spend. Google account teams have authority to grant 10 to 20 percent Vertex-specific discount when Vertex is the strategic spend, on top of the CASC base discount.
Second, secure a model price-lock for the CASC term. Vertex prices have moved downward as the LLM market competes; the price-lock protects against upward repricing while still capturing beneficial price reductions.
Third, negotiate the data residency, customer-content training opt-out, and indemnity terms. Google's Vertex AI terms include strong customer-content protections by default, but the formal indemnity for IP claims on Gemini outputs is bounded. Review the indemnity language against the use case before signing. See our AI contract clauses guide for the comparative matrix across Google, OpenAI, Anthropic, and Microsoft.
For the broader Google Cloud commercial framework, see GCP Enterprise Agreement and CUD vs Flex CUD. For multi-vendor AI procurement that compares Vertex against Bedrock, Azure OpenAI, and direct Anthropic, see our AI procurement guide and GPT vs Claude vs Gemini comparison. The Google Cloud vendor hub aggregates the cluster. Engagement starts at AI procurement advisory.