Claude AI

Claude API Batch and Model Routing: 50% Off Tokens and Smart Cost Control

Claude Batch API cuts all token costs 50%. Model routing saves 40-80% by matching tasks to the cheapest capable model. Stack both and cached reads cost 5% of standard price.

The Claude Batch API cuts all token costs 50% for non-real-time workloads. Model routing saves another 40-80% per request by sending classification tasks to Haiku instead of Sonnet. Stack both with prompt caching and cached batch reads drop to 5% of standard price. This is Part 2 of the Claude API cost reduction series.

You'll need:

  • anthropic SDK (pip install anthropic or npm install @anthropic-ai/sdk)
  • An API key from console.anthropic.com
  • Familiarity with basic messages.create calls

Advanced developer workstation with multiple monitors showing source code and a colorful illuminated keyboard


Batch API: 50% off everything, no code changes required

In June 2026, Anthropic's pricing documentation confirmed the Message Batches API delivers exactly 50% off all input and output tokens across every model. Haiku 4.5 batch: $0.50/$2.50 per million tokens. Sonnet 4.6 batch: $1.50/$7.50. The tradeoff is latency. Batch jobs return within 24 hours rather than in real time.

If your workload doesn't need live responses (nightly report generation, bulk document classification, pre-generating summaries, queued jobs), the Batch API is the simplest 50% discount you'll find.

python
import anthropic
import time

client = anthropic.Anthropic()

def run_batch(prompts: list[str], model: str = "claude-haiku-4-5") -> list[str]:
    """Submit a batch request. 50% cheaper than individual calls."""

    batch = client.messages.batches.create(
        requests=[
            {
                "custom_id": f"item-{i}",
                "params": {
                    "model": model,
                    "max_tokens": 512,
                    "messages": [{"role": "user", "content": prompt}]
                }
            }
            for i, prompt in enumerate(prompts)
        ]
    )

    print(f"Batch submitted: {batch.id} — status: {batch.processing_status}")

    # Poll until complete (usually well under 24 hours)
    while True:
        batch = client.messages.batches.retrieve(batch.id)
        if batch.processing_status == "ended":
            break
        print("Still processing — checking again in 60s")
        time.sleep(60)

    # Collect results in original order
    results = {}
    for result in client.messages.batches.results(batch.id):
        if result.result.type == "succeeded":
            results[result.custom_id] = result.result.message.content[0].text
        else:
            results[result.custom_id] = None

    return [results.get(f"item-{i}") for i in range(len(prompts))]

Real cost example: Classifying 10,000 support tickets with Haiku 4.5 at standard rates: roughly $1.00 in input tokens. Same job via Batch API: $0.50. At that volume, the savings compound fast.

Try This

The Batch API supports tool use, vision inputs, and prompt caching. Batch and caching discounts are explicitly stackable per Anthropic's pricing docs. More on stacking at the end of this post.


Model routing: match the task to the cheapest model that handles it

In May 2026, CloudZero's analysis of Claude API pricing estimated a 40-60% blended cost reduction from routing requests to the cheapest capable model. The price spread is significant: Haiku 4.5 at $1.00/MTok input vs Opus 4.7 at $5.00/MTok is a 5x difference before output costs.

Claude API Standard Pricing per Million Tokens (June 2026)Input prices: Haiku $1, Sonnet $3, Opus $5. Output prices: Haiku $5, Sonnet $15, Opus $25. Output is the dominant cost driver. Source: Anthropic pricing docs, June 2026.Claude API Pricing per Million Tokens, Standard (June 2026)InputOutputHaiku 4.5Sonnet 4.6Opus 4.7$1$5$3$15$5$25Source: Anthropic Claude API pricing docs, platform.claude.com/docs/en/about-claude/pricing, June 2026

The routing rule: Haiku handles anything that doesn't need reasoning. Sonnet handles most code and analysis. Opus handles complex multi-step reasoning only.

python
from anthropic import Anthropic

client = Anthropic()

def get_model(task: str) -> str:
    """Return the cheapest model that reliably handles this task."""
    ROUTING = {
        # Haiku: fast, cheap — good for classification, extraction, short rewrites
        "classify":  "claude-haiku-4-5",
        "extract":   "claude-haiku-4-5",
        "summarize": "claude-haiku-4-5",
        "translate": "claude-haiku-4-5",

        # Sonnet: balanced — handles most code and analysis well
        "code":    "claude-sonnet-4-6",
        "debug":   "claude-sonnet-4-6",
        "analyze": "claude-sonnet-4-6",
        "draft":   "claude-sonnet-4-6",

        # Opus: complex reasoning, architecture, multi-step plans only
        "reason":    "claude-opus-4-7",
        "architect": "claude-opus-4-7",
    }
    return ROUTING.get(task, "claude-sonnet-4-6")

def routed_call(message: str, task: str) -> str:
    model = get_model(task)
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": message}]
    )
    return response.content[0].text

# Classifying 10,000 tickets: Haiku at $1/MTok vs Opus at $5/MTok = 5x savings
result = routed_call("Classify this support ticket: ...", task="classify")
Watch Out

Opus 4.7 tokenizer note: As of June 2026, Opus 4.7 uses a tokenizer that consumes up to 35% more tokens on the same text compared to older Claude models (Anthropic pricing docs, June 2026). If you migrated existing prompts from a previous model without re-measuring token counts, your costs may have increased unexpectedly. Always re-run token counts after any model change.


Stacking everything together

In June 2026, Anthropic's pricing docs confirm that Batch API and prompt caching discounts are stackable. Cache-read tokens via Batch API cost 0.1x (caching) × 0.5x (batch) = 0.05x standard price. That's 95% off.

Claude API Cost Reduction by TechniqueStacking prompt caching with Batch API achieves 95 percent cost reduction on cached input tokens. Sources: Anthropic pricing docs June 2026, CloudZero May 2026.Cost Reduction by TechniqueCache + Batch stackedPrompt caching aloneModel routing (avg)Batch API alone95% off90% off40–80% off50% offSources: Anthropic pricing docs (June 2026), CloudZero blog (May 2026)

Here's a complete optimized client that applies all techniques (caching, relocation, routing, and batch) in one place:

python
import anthropic
from typing import Literal

client = anthropic.Anthropic()

# Static system prompt — keep ALL dynamic content out of this string
# Must exceed minimum token threshold: 1,024 (Sonnet), 4,096 (Haiku/Opus)
STATIC_SYSTEM = """You are a senior software engineer assistant.
[...your full static instructions here...]
"""

TaskType = Literal["classify", "extract", "summarize", "code", "debug", "analyze", "reason"]

MODEL_ROUTING: dict[TaskType, str] = {
    "classify":  "claude-haiku-4-5",
    "extract":   "claude-haiku-4-5",
    "summarize": "claude-haiku-4-5",
    "code":      "claude-sonnet-4-6",
    "debug":     "claude-sonnet-4-6",
    "analyze":   "claude-sonnet-4-6",
    "reason":    "claude-opus-4-7",
}

def call_optimized(
    user_message: str,
    task: TaskType = "analyze",
    context: str | None = None,
) -> str:
    """
    Full optimization stack:
    - Prompt caching on static system (90% off cache reads)
    - Dynamic context in user turn, not system (keeps cache hit rate high)
    - Model routing to cheapest capable model
    """
    model = MODEL_ROUTING.get(task, "claude-sonnet-4-6")

    # Dynamic context goes in user turn — never in the cached system prefix
    content = f"{context}\n\n{user_message}" if context else user_message

    response = client.messages.create(
        model=model,
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": STATIC_SYSTEM,
            "cache_control": {"type": "ephemeral"}
        }],
        messages=[{"role": "user", "content": content}]
    )

    usage = response.usage
    cache_hit = usage.cache_read_input_tokens > 0
    print(f"Model: {model} | Cache hit: {cache_hit}")

    return response.content[0].text

Real-world impact: CloudZero's May 2026 analysis modeled a RAG app with a 50,000-token system prompt at 500 requests/day. Standard cost: ~$1,050/month. With prompt caching applied: $150-300/month. Add model routing for the 60% of requests that don't need Sonnet, and that number drops further.


Frequently Asked Questions

Can I use the Batch API with tool use and function calling?

Yes. The Batch API supports tool use, vision inputs, and prompt caching. Per Anthropic's June 2026 pricing docs, Batch API and caching discounts stack: cached batch reads cost 0.05x standard input price (95% off). Tool use token overhead (497-804 additional tokens depending on mode) applies as normal.

Will Haiku produce acceptable quality for real production tasks?

For classification, extraction, translation, and short summarization: yes. For multi-step code generation, architecture decisions, or nuanced analysis: no. Run 50 real samples through both Haiku and your current model, score the outputs, and measure the quality gap. In most production systems, 40-60% of requests are simple enough for Haiku with no meaningful quality loss.

How do I confirm the stacked discounts are actually applying?

Check response.usage for both cache_read_input_tokens (should be non-zero) and confirm you're using the batch endpoint by checking batch.processing_status. For the stacked path, the effective cost on cached input tokens should be 0.05x your standard rate. Build a cost monitoring function that logs cache_read_input_tokens * 0.1 * model_rate per request.


Batch API and model routing are the second layer of Claude API cost reduction. Prompt caching alone can get you to 90% off. Add routing and batching on top of that, and the math gets compelling fast.

If you haven't implemented prompt caching yet, start there. Part 1 of this series covers it in detail.


Sources