The Claude Batch API cuts all token costs by 50% for non-real-time workloads. Model routing saves another 40-80% per request by sending classification tasks to Haiku instead of Sonnet. Stack both with prompt caching, and cached batch reads drop to 5% of standard price. This is Part 2 of the Claude API cost reduction series.
You'll need:
- The anthropic SDK (pip install anthropic or npm install @anthropic-ai/sdk)
- An API key from console.anthropic.com
- Familiarity with basic messages.create calls
Batch API: 50% off everything, no code changes required
In June 2026, Anthropic's pricing documentation confirmed the Message Batches API delivers exactly 50% off all input and output tokens across every model. Haiku 4.5 batch: $0.50/$2.50 per million tokens. Sonnet 4.6 batch: $1.50/$7.50. The tradeoff is latency. Batch jobs return within 24 hours rather than in real time.
If your workload doesn't need live responses (nightly report generation, bulk document classification, pre-generating summaries, queued jobs), the Batch API is the simplest 50% discount you'll find.
import anthropic
import time

client = anthropic.Anthropic()

def run_batch(prompts: list[str], model: str = "claude-haiku-4-5") -> list[str | None]:
    """Submit a batch request. 50% cheaper than individual calls."""
    batch = client.messages.batches.create(
        requests=[
            {
                "custom_id": f"item-{i}",
                "params": {
                    "model": model,
                    "max_tokens": 512,
                    "messages": [{"role": "user", "content": prompt}]
                }
            }
            for i, prompt in enumerate(prompts)
        ]
    )
    print(f"Batch submitted: {batch.id}, status: {batch.processing_status}")

    # Poll until complete (usually well under 24 hours)
    while True:
        batch = client.messages.batches.retrieve(batch.id)
        if batch.processing_status == "ended":
            break
        print("Still processing, checking again in 60s")
        time.sleep(60)

    # Collect results keyed by custom_id; failed requests map to None
    results = {}
    for result in client.messages.batches.results(batch.id):
        if result.result.type == "succeeded":
            results[result.custom_id] = result.result.message.content[0].text
        else:
            results[result.custom_id] = None

    # Return results in the original prompt order
    return [results.get(f"item-{i}") for i in range(len(prompts))]
Real cost example: Classifying 10,000 support tickets with Haiku 4.5 at standard rates costs roughly $1.00 in input tokens, which implies about 100 input tokens per ticket (1M tokens at $1.00/MTok). The same job via the Batch API costs $0.50. The saving scales linearly with volume.
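The arithmetic behind that example can be sketched as follows. The 100-tokens-per-ticket figure is an assumption chosen so the total matches the $1.00 quoted above, the function name is illustrative, and output tokens are left out for simplicity.

```python
def input_cost_usd(n_requests: int, tokens_per_request: int,
                   rate_per_mtok: float, batch: bool = False) -> float:
    """Estimate input-token cost; the batch path applies the 50% discount."""
    total_tokens = n_requests * tokens_per_request
    cost = total_tokens / 1_000_000 * rate_per_mtok
    return cost * 0.5 if batch else cost

# Assumed workload: 10,000 tickets at ~100 input tokens each, Haiku 4.5 input rate
standard = input_cost_usd(10_000, 100, rate_per_mtok=1.00)
batched = input_cost_usd(10_000, 100, rate_per_mtok=1.00, batch=True)
print(f"standard: ${standard:.2f}, batch: ${batched:.2f}")
```

Swapping in your own request count and average token length gives a quick estimate before committing to a batch pipeline.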
The Batch API supports tool use, vision inputs, and prompt caching. Batch and caching discounts are explicitly stackable per Anthropic's pricing docs. More on stacking at the end of this post.
Model routing: match the task to the cheapest model that handles it
In May 2026, CloudZero's analysis of Claude API pricing estimated a 40-60% blended cost reduction from routing requests to the cheapest capable model. The price spread is significant: Haiku 4.5 at $1.00/MTok input vs Opus 4.7 at $5.00/MTok is a 5x difference before output costs.
The routing rule: Haiku handles anything that doesn't need reasoning. Sonnet handles most code and analysis. Opus handles complex multi-step reasoning only.
from anthropic import Anthropic

client = Anthropic()

def get_model(task: str) -> str:
    """Return the cheapest model that reliably handles this task."""
    ROUTING = {
        # Haiku: fast and cheap; good for classification, extraction, short rewrites
        "classify": "claude-haiku-4-5",
        "extract": "claude-haiku-4-5",
        "summarize": "claude-haiku-4-5",
        "translate": "claude-haiku-4-5",
        # Sonnet: balanced; handles most code and analysis well
        "code": "claude-sonnet-4-6",
        "debug": "claude-sonnet-4-6",
        "analyze": "claude-sonnet-4-6",
        "draft": "claude-sonnet-4-6",
        # Opus: complex reasoning, architecture, multi-step plans only
        "reason": "claude-opus-4-7",
        "architect": "claude-opus-4-7",
    }
    # Unknown task types fall back to Sonnet
    return ROUTING.get(task, "claude-sonnet-4-6")

def routed_call(message: str, task: str) -> str:
    model = get_model(task)
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": message}]
    )
    return response.content[0].text

# Classifying 10,000 tickets: Haiku at $1/MTok vs Opus at $5/MTok = 5x savings
result = routed_call("Classify this support ticket: ...", task="classify")
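To estimate what a routing table like this saves, a rough blended-rate calculation helps. The sketch below uses the Haiku and Opus input prices quoted above, plus a Sonnet input price of $3.00/MTok inferred from the $1.50 batch price being half of standard. The traffic mix is a hypothetical example expressed as shares of input-token volume, not measured data.

```python
INPUT_RATE_PER_MTOK = {
    "claude-haiku-4-5": 1.00,
    "claude-sonnet-4-6": 3.00,  # inferred: 2 x the $1.50 batch price
    "claude-opus-4-7": 5.00,
}

def blended_rate(mix: dict[str, float]) -> float:
    """Weighted-average input price for a traffic mix (shares must sum to 1)."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(INPUT_RATE_PER_MTOK[m] * share for m, share in mix.items())

# Hypothetical mix: 60% of token volume is simple enough for Haiku
mix = {"claude-haiku-4-5": 0.6, "claude-sonnet-4-6": 0.4}
baseline = INPUT_RATE_PER_MTOK["claude-sonnet-4-6"]  # everything on Sonnet
routed = blended_rate(mix)
saving = 1 - routed / baseline
print(f"blended: ${routed:.2f}/MTok, saving vs all-Sonnet: {saving:.0%}")
```

The result depends heavily on the baseline: routing away from Opus saves more per token than routing away from Sonnet, so it is worth plugging in your actual mix.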
Opus 4.7 tokenizer note: As of June 2026, Opus 4.7 uses a tokenizer that consumes up to 35% more tokens on the same text compared to older Claude models (Anthropic pricing docs, June 2026). If you migrated existing prompts from a previous model without re-measuring token counts, your costs may have increased unexpectedly. Always re-run token counts after any model change.
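One way to re-measure is to count tokens for the same prompt under the old and new model before switching. The sketch below takes the counting function as a parameter so it runs offline. Wiring it to the SDK's client.messages.count_tokens call, as shown in the comment, is an assumption about your setup, and the helper name and sample counts are made up for illustration.

```python
from typing import Callable

def token_inflation(count_tokens: Callable[[str, str], int],
                    prompt: str, old_model: str, new_model: str) -> float:
    """Return the new/old token-count ratio for one prompt."""
    old = count_tokens(old_model, prompt)
    new = count_tokens(new_model, prompt)
    if old == 0:
        raise ValueError("old model reported zero tokens")
    return new / old

# With the real SDK, the counter could look like this (requires an API key):
#   client = anthropic.Anthropic()
#   def count_tokens(model, text):
#       return client.messages.count_tokens(
#           model=model, messages=[{"role": "user", "content": text}]
#       ).input_tokens

# Offline stand-in with made-up counts, just to show the call shape
fake_counts = {"old-model": 1000, "new-model": 1350}
ratio = token_inflation(lambda model, _text: fake_counts[model],
                        "example prompt", "old-model", "new-model")
print(f"token ratio: {ratio:.2f}")
```

Multiplying this ratio by the per-token price change gives the effective cost change for that prompt.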
Stacking everything together
As of June 2026, Anthropic's pricing docs confirm that Batch API and prompt caching discounts are stackable. Cache-read tokens via Batch API cost 0.1x (caching) × 0.5x (batch) = 0.05x standard price. That's 95% off.
Here's an optimized client that applies caching, relocation of dynamic context, and routing in one place. Batch submission follows the run_batch pattern shown earlier:
import anthropic
from typing import Literal

client = anthropic.Anthropic()

# Static system prompt: keep ALL dynamic content out of this string
# Must exceed minimum token threshold: 1,024 (Sonnet), 4,096 (Haiku/Opus)
STATIC_SYSTEM = """You are a senior software engineer assistant.
[...your full static instructions here...]
"""

TaskType = Literal["classify", "extract", "summarize", "code", "debug", "analyze", "reason"]

MODEL_ROUTING: dict[TaskType, str] = {
    "classify": "claude-haiku-4-5",
    "extract": "claude-haiku-4-5",
    "summarize": "claude-haiku-4-5",
    "code": "claude-sonnet-4-6",
    "debug": "claude-sonnet-4-6",
    "analyze": "claude-sonnet-4-6",
    "reason": "claude-opus-4-7",
}

def call_optimized(
    user_message: str,
    task: TaskType = "analyze",
    context: str | None = None,
) -> str:
    """
    Full optimization stack:
    - Prompt caching on static system (90% off cache reads)
    - Dynamic context in user turn, not system (keeps cache hit rate high)
    - Model routing to cheapest capable model
    """
    model = MODEL_ROUTING.get(task, "claude-sonnet-4-6")

    # Dynamic context goes in user turn, never in the cached system prefix
    content = f"{context}\n\n{user_message}" if context else user_message

    response = client.messages.create(
        model=model,
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": STATIC_SYSTEM,
            "cache_control": {"type": "ephemeral"}
        }],
        messages=[{"role": "user", "content": content}]
    )

    usage = response.usage
    # cache_read_input_tokens can be None, so treat a missing value as zero
    cache_hit = (usage.cache_read_input_tokens or 0) > 0
    print(f"Model: {model} | Cache hit: {cache_hit}")
    return response.content[0].text
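The client above covers caching and routing for live calls. For the batch path, one possible approach is to build request entries that carry the same cached system block, then pass them to client.messages.batches.create as in the run_batch example earlier. The helper name, placeholder system text, and two-entry routing table below are illustrative, and the sketch only constructs the payload; it does not call the API.

```python
SYSTEM_TEXT = "You are a senior software engineer assistant."  # placeholder text

ROUTING = {"classify": "claude-haiku-4-5", "analyze": "claude-sonnet-4-6"}

def build_batch_requests(items: list[tuple[str, str]],
                         system_text: str = SYSTEM_TEXT) -> list[dict]:
    """Turn (task, message) pairs into Message Batches request entries."""
    requests = []
    for i, (task, message) in enumerate(items):
        requests.append({
            "custom_id": f"item-{i}",
            "params": {
                "model": ROUTING.get(task, "claude-sonnet-4-6"),
                "max_tokens": 512,
                # Same static, cacheable system block on every request
                "system": [{
                    "type": "text",
                    "text": system_text,
                    "cache_control": {"type": "ephemeral"},
                }],
                "messages": [{"role": "user", "content": message}],
            },
        })
    return requests

reqs = build_batch_requests([("classify", "Ticket: login fails"),
                             ("analyze", "Why did latency spike?")])
print([r["params"]["model"] for r in reqs])
# To submit: client.messages.batches.create(requests=reqs)
```

Two caveats apply: a real system prompt has to exceed the minimum cacheable length noted above, and cache hits inside a batch are not guaranteed because requests may be processed concurrently, so the stacked rate is best treated as a ceiling on savings.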
Real-world impact: CloudZero's May 2026 analysis modeled a RAG app with a 50,000-token system prompt at 500 requests/day. Standard cost: ~$1,050/month. With prompt caching applied: $150-300/month. Add model routing for the 60% of requests that don't need Sonnet, and that number drops further.
Frequently Asked Questions
Can I use the Batch API with tool use and function calling?
Yes. The Batch API supports tool use, vision inputs, and prompt caching. Per Anthropic's June 2026 pricing docs, Batch API and caching discounts stack: cached batch reads cost 0.05x standard input price (95% off). Tool use token overhead (497-804 additional tokens depending on mode) applies as normal.
Will Haiku produce acceptable quality for real production tasks?
For classification, extraction, translation, and short summarization: yes. For multi-step code generation, architecture decisions, or nuanced analysis: no. Run 50 real samples through both Haiku and your current model, score the outputs, and measure the quality gap. In most production systems, 40-60% of requests are simple enough for Haiku with no meaningful quality loss.
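A side-by-side check like that can be sketched as a small harness. Everything here is hypothetical scaffolding: the two model callables would wrap real API calls in practice, and the exact-match scorer only suits tasks with a single correct label.

```python
from statistics import mean
from typing import Callable

def quality_gap(samples: list[str],
                cheap_model: Callable[[str], str],
                current_model: Callable[[str], str],
                score: Callable[[str, str], float]) -> float:
    """Mean score of the current model minus mean score of the cheap one."""
    cheap_scores = [score(s, cheap_model(s)) for s in samples]
    current_scores = [score(s, current_model(s)) for s in samples]
    return mean(current_scores) - mean(cheap_scores)

# Stand-in labels, models, and scorer so the sketch runs offline
expected = {"refund please": "billing", "app crashes": "bug"}

def cheap(sample: str) -> str:
    return "billing"            # stand-in that gets one of two samples wrong

def current(sample: str) -> str:
    return expected[sample]     # stand-in that is always right

def exact_match(sample: str, output: str) -> float:
    return 1.0 if output == expected[sample] else 0.0

gap = quality_gap(list(expected), cheap, current, exact_match)
print(f"quality gap: {gap:.2f}")
```

A gap near zero on your own samples suggests the task is safe to route to the cheaper model; a large gap means it should stay where it is.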
How do I confirm the stacked discounts are actually applying?
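Check response.usage for a non-zero cache_read_input_tokens, and confirm the request went through the batch endpoint: only requests submitted via client.messages.batches.create produce a batch object with a processing_status. For batch results, the usage block is on result.result.message.usage. For the stacked path, the effective cost on cached input tokens should be 0.05x your standard rate. Build a cost monitoring function that logs cache_read_input_tokens * 0.1 * model_rate per request, multiplied by a further 0.5 for batch requests.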
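A minimal version of such a monitoring function might look like the sketch below. It assumes input_tokens in the usage block counts only the uncached portion of the prompt, and it ignores cache-write charges and output tokens; the function name and sample numbers are illustrative.

```python
def estimate_input_cost_usd(input_tokens: int,
                            cache_read_tokens: int,
                            rate_per_mtok: float,
                            batch: bool = False) -> float:
    """Estimate input cost from usage counts.

    Uncached input is billed at the standard rate and cache reads at 0.1x;
    the batch path halves both. Cache writes are not modeled here.
    """
    batch_mult = 0.5 if batch else 1.0
    uncached = input_tokens * rate_per_mtok
    cached = cache_read_tokens * rate_per_mtok * 0.1
    return (uncached + cached) * batch_mult / 1_000_000

# Example: 200 uncached tokens plus a 50,000-token cached prefix at $1.00/MTok
live = estimate_input_cost_usd(200, 50_000, rate_per_mtok=1.00)
stacked = estimate_input_cost_usd(200, 50_000, rate_per_mtok=1.00, batch=True)
print(f"live: ${live:.6f}  batch: ${stacked:.6f}")
```

Logging this estimate next to the raw usage fields makes it easy to spot requests where the cache missed and the full rate applied.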
Batch API and model routing are the second layer of Claude API cost reduction. Prompt caching alone can get you to 90% off cached input reads. Add routing and batching on top of that, and the math gets compelling fast.
If you haven't implemented prompt caching yet, start there. Part 1 of this series covers it in detail.
Sources
- Anthropic, "Claude API Pricing," retrieved 2026-06-01, platform.claude.com/docs/en/about-claude/pricing
- Anthropic, "Message Batches API," retrieved 2026-06-01, platform.claude.com/docs/en/build-with-claude/batch
- CloudZero, "Claude API Pricing," retrieved 2026-06-01, cloudzero.com/blog/claude-api-pricing
- ProjectDiscovery, "How We Cut LLM Cost With Prompt Caching," retrieved 2026-06-01, projectdiscovery.io/blog/how-we-cut-llm-cost-with-prompt-caching