AI & Machine Learning · Apr 02, 2026

New ways to balance cost and reliability in the Gemini API

Google AI · Archived Apr 02, 2026

Google is introducing two new inference tiers to the Gemini API, Flex and Priority, to balance cost and latency.

Apr 02, 2026

Introducing Flex and Priority inference: advanced controls for developers to optimize costs and reliability through a single, unified interface.

Lucia Loher, Product Manager, Gemini API
Hussein Hassan Harrirou, Engineering, Gemini API

Today, we are adding two new service tiers to the Gemini API: Flex and Priority. These new options give you granular control over cost and reliability through a single, unified interface.

As AI evolves from simple chat into complex, autonomous agents, developers typically have to manage two distinct types of logic:

- Background tasks: High-volume workflows like data enrichment or "thinking" processes that don't need instant responses.
- Interactive tasks: User-facing features like chatbots and copilots where high reliability is needed.

Until now, supporting both meant splitting your architecture between standard synchronous serving and the asynchronous Batch API. Flex and Priority help to bridge this gap. You can now route background jobs to Flex and interactive jobs to Priority, both using standard synchronous endpoints. This eliminates the complexity of async job management while giving you the economic and performance benefits of specialized tiers.

Flex Inference: scale innovation for 50% less

Flex Inference is our new cost-optimized tier, designed for latency-tolerant workloads without the overhead of batch processing.

- 50% price savings: Pay half the price of the Standard API by downgrading the criticality of your requests, which makes them less reliable and adds latency.
- Synchronous simplicity: Unlike the Batch API, Flex is a synchronous interface. You use the same familiar endpoints without managing input/output files or polling for job completion.
- Ideal use cases: Background CRM updates, large-scale research simulations, and agentic workflows where the model "browses" or "thinks" in the background.
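As a rough illustration of the 50% price savings, the sketch below compares the estimated cost of a background job on Standard and on Flex. The per-token rate and the token volume are made-up placeholders, not real Gemini pricing, and the function names are hypothetical. Only the 0.5 multiplier comes from the article.

```python
# Hypothetical numbers for illustration only; consult the official pricing page for real rates.
STANDARD_PRICE_PER_MILLION_TOKENS = 1.00  # assumed USD rate, not an actual Gemini price
FLEX_DISCOUNT = 0.50  # the article describes Flex as half the Standard price


def estimate_cost(total_tokens: int, price_per_million: float) -> float:
    """Return the estimated cost in USD for a given number of tokens."""
    return total_tokens / 1_000_000 * price_per_million


def flex_price(standard_price_per_million: float) -> float:
    """Apply the 50% Flex discount to a Standard per-million-token price."""
    return standard_price_per_million * FLEX_DISCOUNT


tokens = 40_000_000  # made-up volume, e.g. a nightly data-enrichment job
standard_cost = estimate_cost(tokens, STANDARD_PRICE_PER_MILLION_TOKENS)
flex_cost = estimate_cost(tokens, flex_price(STANDARD_PRICE_PER_MILLION_TOKENS))
print(f"Standard: ${standard_cost:.2f}, Flex: ${flex_cost:.2f}")
```

The saving has to be weighed against the lower reliability and added latency noted above, so an estimate like this is mainly useful for latency-tolerant workloads.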
Get started by setting the service_tier parameter in your request:

    from google import genai

    # Assumes the google-genai SDK, with the API key read from the environment
    client = genai.Client()

    # Using the Flex tier for background summarization
    response = client.models.generate_content(
        model="gemini-3-flash-preview",
        contents="Summarize this massive transcript...",
        config={"service_tier": "flex"},
    )

    # Check which tier actually served your request
    print(response.sdk_http_response.headers.get("x-gemini-service-tier"))

Flex will be available to all paid tiers and supports GenerateContent and Interactions API requests.

Priority Inference: Highest reliability for critical apps

The new Priority Inference tier offers our highest level of assurance at a premium price point. This helps to ensure your most important traffic is not preempted, even during peak platform usage.

- Highest criticality: Priority requests get the highest criticality, leading to higher reliability even during peak load.
- Graceful downgrade: If your traffic exceeds your Priority limits, overflow requests are automatically served at the Standard tier instead of failing. This keeps your application online and helps to ensure business continuity.
- Transparent response: The API response indicates which tier served your request, giving you full visibility into your performance and billing.
- Ideal use cases: Real-time customer support bots, live content moderation pipelines, and time-sensitive requests.

To use Priority Inference, set the service_tier parameter accordingly:

    # Using the Priority tier for a critical alert
    # (reuses the client created in the Flex example)
    response = client.models.generate_content(
        model="gemini-3-flash-preview",
        contents="Triage this incoming security alert immediately.",
        config={"service_tier": "priority"},
    )

Priority inference will be available to users with Tier 2 / 3 paid projects across the GenerateContent API and Interactions API endpoints.

Visit the Gemini API documentation to see the full pricing breakdown and start optimizing your production tiers today.
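To make the graceful downgrade and transparent response points concrete, here is a minimal sketch that compares the tier you requested with the tier reported in the x-gemini-service-tier response header. The helper names served_tier and was_downgraded are hypothetical, and the assumption that the header value is a lowercase tier name such as "standard" is not confirmed by the article.

```python
from typing import Mapping, Optional

TIER_HEADER = "x-gemini-service-tier"  # header name taken from the article's Flex example


def served_tier(headers: Mapping[str, str]) -> Optional[str]:
    """Return the tier reported by the response headers, or None if absent."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    value = normalized.get(TIER_HEADER)
    return value.strip().lower() if value else None


def was_downgraded(requested: str, headers: Mapping[str, str]) -> bool:
    """True when a tier was reported and it differs from the one requested."""
    served = served_tier(headers)
    return served is not None and served != requested.lower()


# Simulated headers; a real call would pass response.sdk_http_response.headers.
# The value format ("standard", "priority") is an assumption.
overflow_headers = {"X-Gemini-Service-Tier": "standard"}
on_tier_headers = {"x-gemini-service-tier": "priority"}
print(was_downgraded("priority", overflow_headers))
print(was_downgraded("priority", on_tier_headers))
```

A check like this could feed a metric or log line, so that sustained overflow to the Standard tier is noticed rather than passing silently.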
To see it in action, check out the cookbook for runnable code examples.
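The routing idea from the article (background jobs to Flex, interactive jobs to Priority) could be wrapped in a small helper like the one below. The Workload enum and build_config function are hypothetical names introduced for illustration. Only the service_tier key and the "flex" and "priority" values come from the article's examples.

```python
from enum import Enum


class Workload(Enum):
    """Hypothetical workload categories mirroring the article's two task types."""

    BACKGROUND = "background"    # latency-tolerant, high-volume jobs
    INTERACTIVE = "interactive"  # user-facing requests


# Mapping described in the article: background -> Flex, interactive -> Priority.
_TIER_BY_WORKLOAD = {
    Workload.BACKGROUND: "flex",
    Workload.INTERACTIVE: "priority",
}


def build_config(workload: Workload) -> dict:
    """Return a request config dict that selects the tier for this workload."""
    return {"service_tier": _TIER_BY_WORKLOAD[workload]}


print(build_config(Workload.BACKGROUND))
print(build_config(Workload.INTERACTIVE))
```

The returned dict has the same shape as the config argument in the article's generate_content calls, so the tier choice stays in one place instead of being repeated at each call site.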