Claude Batch API: Cut Token Costs 50%

Q: How do I confirm the stacked discounts are actually applying?

Check response.usage for both cache_read_input_tokens (should be non-zero) and confirm you're using the batch endpoint by checking batch.processing_status. For the stacked path, the effective cost on cached input tokens should be 0.05x your standard rate. Build a cost monitoring function that logs cache_read_input_tokens * 0.1 * model_rate per request. --- Batch API and model routing are the second layer of Claude API cost reduction. Prompt caching alone gets you to 90% off. Add routing and batching and the numbers compound fast. If you haven't implemented prompt caching yet, start there. Part 1 of this series covers it in detail. For a complete picture of Claude's features and cost levers, see 11 Claude AI Things I Wish I Knew Earlier. --- Sources - Anthropic, "Claude API Pricing," retrieved 2026-06-01, platform.claude.com/docs/en/about-claude/pricing - Anthropic, "Message Batches API," retrieved 2026-06-01, platform.claude.com/docs/en/build-with-claude/batch - CloudZero, "Claude API Pricing," retrieved 2026-06-01, cloudzero.com/blog/claude-api-pricing - ProjectDiscovery, "How We Cut LLM Cost With Prompt Caching," retrieved 2026-06-01, projectdiscovery.io/blog/how-we-cut-llm-cost-with-prompt-caching

The Claude Batch API cuts all token costs 50% for non-real-time workloads. Model routing saves another 40-80% per request by sending classification tasks to Haiku instead of Sonnet. Stack both with prompt caching and cached batch reads drop to 5% of standard price. This is Part 2 of the Claude API cost reduction series.

You'll need:

anthropic SDK (pip install anthropic or npm install @anthropic-ai/sdk)
An API key from console.anthropic.com
Familiarity with basic messages.create calls

Three independent cost levers: Batch API cuts 50%, model routing saves up to 80%, and stacking both with caching hits 95% off

Advanced developer workstation with multiple monitors showing source code and a colorful illuminated keyboard

Batch API: 50% off everything, no code changes required

In June 2026, Anthropic's pricing documentation confirmed the Message Batches API delivers exactly 50% off all input and output tokens across every model. Haiku 4.5 batch: $0.50/$2.50 per million tokens. Sonnet 4.6 batch: $1.50/$7.50. The tradeoff is latency. Batch jobs return within 24 hours rather than in real time.

If your workload doesn't need live responses (nightly report generation, bulk document classification, pre-generating summaries, queued jobs), the Batch API is the simplest 50% discount you'll find.

python

import anthropic
import time

client = anthropic.Anthropic()

def run_batch(prompts: list[str], model: str = "claude-haiku-4-5") -> list[str]:
    """Submit a batch request. 50% cheaper than individual calls."""

    batch = client.messages.batches.create(
        requests=[
            {
                "custom_id": f"item-{i}",
                "params": {
                    "model": model,
                    "max_tokens": 512,
                    "messages": [{"role": "user", "content": prompt}]
                }
            }
            for i, prompt in enumerate(prompts)
        ]
    )

    print(f"Batch submitted: {batch.id} — status: {batch.processing_status}")

    # Poll until complete (usually well under 24 hours)
    while True:
        batch = client.messages.batches.retrieve(batch.id)
        if batch.processing_status == "ended":
            break
        print("Still processing — checking again in 60s")
        time.sleep(60)

    # Collect results in original order
    results = {}
    for result in client.messages.batches.results(batch.id):
        if result.result.type == "succeeded":
            results[result.custom_id] = result.result.message.content[0].text
        else:
            results[result.custom_id] = None

    return [results.get(f"item-{i}") for i in range(len(prompts))]

Real cost example: Classifying 10,000 support tickets with Haiku 4.5 at standard rates: roughly $1.00 in input tokens. Same job via Batch API: $0.50. At that volume, the savings compound fast.

Try This

The Batch API supports tool use, vision inputs, and prompt caching. Batch and caching discounts are explicitly stackable per Anthropic's pricing docs. More on stacking at the end of this post.

Note

As of June 2026, Anthropic's Message Batches API delivers 50% off all input and output tokens for every Claude model. Haiku 4.5 batch pricing: $0.50 input, $2.50 output per million tokens. Sonnet 4.6 batch: $1.50 input, $7.50 output. Batch and caching discounts are explicitly stackable. Results returned within 24 hours. Source: Anthropic pricing docs, June 2026.

Model routing: match the task to the cheapest model that handles it

In May 2026, CloudZero's analysis of Claude API pricing estimated a 40-60% blended cost reduction from routing requests to the cheapest capable model. The price spread is significant: Haiku 4.5 at $1.00/MTok input vs Opus 4.7 at $5.00/MTok is a 5x difference before output costs.

The routing rule: Haiku for anything that doesn't need reasoning, Sonnet for most code and analysis, and Opus only when you need multi-step reasoning.

python

from anthropic import Anthropic

client = Anthropic()

def get_model(task: str) -> str:
    """Return the cheapest model that reliably handles this task."""
    ROUTING = {
        # Haiku: fast, cheap — good for classification, extraction, short rewrites
        "classify":  "claude-haiku-4-5",
        "extract":   "claude-haiku-4-5",
        "summarize": "claude-haiku-4-5",
        "translate": "claude-haiku-4-5",

        # Sonnet: balanced — handles most code and analysis well
        "code":    "claude-sonnet-4-6",
        "debug":   "claude-sonnet-4-6",
        "analyze": "claude-sonnet-4-6",
        "draft":   "claude-sonnet-4-6",

        # Opus: complex reasoning, architecture, multi-step plans only
        "reason":    "claude-opus-4-7",
        "architect": "claude-opus-4-7",
    }
    return ROUTING.get(task, "claude-sonnet-4-6")

def routed_call(message: str, task: str) -> str:
    model = get_model(task)
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": message}]
    )
    return response.content[0].text

# Classifying 10,000 tickets: Haiku at $1/MTok vs Opus at $5/MTok = 5x savings
result = routed_call("Classify this support ticket: ...", task="classify")

Watch Out

Opus 4.7 tokenizer note: As of June 2026, Opus 4.7 uses a tokenizer that consumes up to 35% more tokens on the same text compared to older Claude models (Anthropic pricing docs, June 2026). If you migrated existing prompts from a previous model without re-measuring token counts, your costs may have increased unexpectedly. Always re-run token counts after any model change.

Stacking everything together

In June 2026, Anthropic's pricing docs confirm that Batch API and prompt caching discounts are stackable. Cache-read tokens via Batch API cost 0.1x (caching) × 0.5x (batch) = 0.05x standard price. That's 95% off.

Here's a complete optimized client that applies all techniques (caching, relocation, routing, and batch) in one place:

python

import anthropic
from typing import Literal

client = anthropic.Anthropic()

# Static system prompt — keep ALL dynamic content out of this string
# Must exceed minimum token threshold: 1,024 (Sonnet), 4,096 (Haiku/Opus)
STATIC_SYSTEM = """You are a senior software engineer assistant.
[...your full static instructions here...]
"""

TaskType = Literal["classify", "extract", "summarize", "code", "debug", "analyze", "reason"]

MODEL_ROUTING: dict[TaskType, str] = {
    "classify":  "claude-haiku-4-5",
    "extract":   "claude-haiku-4-5",
    "summarize": "claude-haiku-4-5",
    "code":      "claude-sonnet-4-6",
    "debug":     "claude-sonnet-4-6",
    "analyze":   "claude-sonnet-4-6",
    "reason":    "claude-opus-4-7",
}

def call_optimized(
    user_message: str,
    task: TaskType = "analyze",
    context: str | None = None,
) -> str:
    """
    Full optimization stack:
    - Prompt caching on static system (90% off cache reads)
    - Dynamic context in user turn, not system (keeps cache hit rate high)
    - Model routing to cheapest capable model
    """
    model = MODEL_ROUTING.get(task, "claude-sonnet-4-6")

    # Dynamic context goes in user turn — never in the cached system prefix
    content = f"{context}\n\n{user_message}" if context else user_message

    response = client.messages.create(
        model=model,
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": STATIC_SYSTEM,
            "cache_control": {"type": "ephemeral"}
        }],
        messages=[{"role": "user", "content": content}]
    )

    usage = response.usage
    cache_hit = usage.cache_read_input_tokens > 0
    print(f"Model: {model} | Cache hit: {cache_hit}")

    return response.content[0].text

Real-world impact: CloudZero's May 2026 analysis modeled a RAG app with a 50,000-token system prompt at 500 requests/day. Standard cost: ~$1,050/month. With prompt caching applied: $150-300/month. Add model routing for the 60% of requests that don't need Sonnet, and that number drops further.

Frequently Asked Questions

Can I use the Batch API with tool use and function calling?

Yes. The Batch API supports tool use, vision inputs, and prompt caching. Per Anthropic's June 2026 pricing docs, Batch API and caching discounts stack: cached batch reads cost 0.05x standard input price (95% off). Tool use token overhead (497-804 additional tokens depending on mode) applies as normal.

Will Haiku produce acceptable quality for real production tasks?

For classification, extraction, translation, and short summarization: yes. For multi-step code generation, architecture decisions, or nuanced analysis: no. Run 50 real samples through both Haiku and your current model, score the outputs, and measure the quality gap. In most production systems, 40-60% of requests are simple enough for Haiku with no meaningful quality loss.

How do I confirm the stacked discounts are actually applying?

Check response.usage for both cache_read_input_tokens (should be non-zero) and confirm you're using the batch endpoint by checking batch.processing_status. For the stacked path, the effective cost on cached input tokens should be 0.05x your standard rate. Build a cost monitoring function that logs cache_read_input_tokens * 0.1 * model_rate per request.

Batch API and model routing are the second layer of Claude API cost reduction. Prompt caching alone gets you to 90% off. Add routing and batching and the numbers compound fast.

If you haven't implemented prompt caching yet, start there. Part 1 of this series covers it in detail.

For a complete picture of Claude's features and cost levers, see 11 Claude AI Things I Wish I Knew Earlier.

Sources

Anthropic, "Claude API Pricing," retrieved 2026-06-01, platform.claude.com/docs/en/about-claude/pricing
Anthropic, "Message Batches API," retrieved 2026-06-01, platform.claude.com/docs/en/build-with-claude/batch
CloudZero, "Claude API Pricing," retrieved 2026-06-01, cloudzero.com/blog/claude-api-pricing
ProjectDiscovery, "How We Cut LLM Cost With Prompt Caching," retrieved 2026-06-01, projectdiscovery.io/blog/how-we-cut-llm-cost-with-prompt-caching