Prompt caching cuts Claude API input token costs by up to 90% on the cached portion of repeated system prompts. One cache_control key enables it, but most implementations quietly fail because of a single structural mistake that keeps cache hit rates near zero. This post covers the correct setup and the fix.
You'll need:
- The anthropic SDK (pip install anthropic or npm install @anthropic-ai/sdk)
- An API key from console.anthropic.com
As of June 2026, Anthropic's pricing page shows Claude Sonnet 4.6 at $3.00/MTok input. That's the number we're reducing.
How does Claude prompt caching work?
In June 2026, Anthropic's prompt caching documentation confirmed that cache-read tokens cost 0.1x the base input price, a 90% reduction. Cache-write tokens cost 1.25x base (5-minute TTL) or 2x base (1-hour TTL). With a 5-minute TTL, caching breaks even after a single read. Every call after that is 90% cheaper on the cached portion.
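The break-even claim can be checked with a few lines of arithmetic. The sketch below uses only the multipliers quoted above; the function name relative_cost is made up for illustration, and it assumes every call after the first arrives before the cache entry expires, which real traffic may not guarantee.

```python
# Price multipliers relative to the base input price, as quoted above.
CACHE_WRITE_5M = 1.25  # cache write, 5-minute TTL
CACHE_WRITE_1H = 2.0   # cache write, 1-hour TTL
CACHE_READ = 0.1       # cache read


def relative_cost(calls: int, write_multiplier: float, cached: bool = True) -> float:
    """Cost of sending the same prefix `calls` times, in units of one uncached send.

    Idealized: the first call writes the cache and every later call reads it.
    """
    if calls <= 0:
        return 0.0
    if not cached:
        return float(calls)
    return write_multiplier + CACHE_READ * (calls - 1)


if __name__ == "__main__":
    for n in (1, 2, 3, 10):
        print(
            n,
            round(relative_cost(n, CACHE_WRITE_5M), 2),
            round(relative_cost(n, CACHE_WRITE_1H), 2),
            relative_cost(n, 0.0, cached=False),
        )
```

Under these assumptions, two calls with the 5-minute TTL cost 1.35 units against 2.0 uncached, so one read already pays for the write. With the 1-hour TTL, two calls cost 2.1 units against 2.0, so it takes a second read (2.2 against 3.0) to come out ahead.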
The mechanics: Anthropic stores a hash of your system prompt on their infrastructure. On subsequent calls with the same prefix, they serve it from cache instead of reprocessing it. You pay 1/10th the normal rate for those tokens.
One cache_control key is all it takes to opt in.
Python:
import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = """You are a senior software engineer specializing in Python and TypeScript.
You write clean, well-documented code. You follow existing project conventions.
You explain tradeoffs before implementing. You never use placeholder comments.
[Keep this prompt static: any change busts the cache.
Must be 1,024+ tokens for Sonnet, 4,096+ for Haiku/Opus.]
"""

def call_claude(user_message: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"}  # this is all it takes
        }],
        messages=[{"role": "user", "content": user_message}]
    )
    usage = response.usage
    print(f"Cache write: {usage.cache_creation_input_tokens}")
    print(f"Cache read: {usage.cache_read_input_tokens}")  # should be > 0 after first call
    return response.content[0].text
TypeScript:
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const SYSTEM_PROMPT = `You are a senior software engineer...
[...your static instructions: keep dynamic content out of here...]`;

async function callClaude(userMessage: string): Promise<string> {
  const response = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    system: [{
      type: "text",
      text: SYSTEM_PROMPT,
      cache_control: { type: "ephemeral" }
    }],
    messages: [{ role: "user", content: userMessage }]
  });
  console.log("Cache read tokens:", response.usage.cache_read_input_tokens);
  return (response.content[0] as Anthropic.TextBlock).text;
}
After the first call, cache_read_input_tokens should be non-zero on every subsequent call. If it stays at 0, your prompt is either too short or, more likely, you're making a structural mistake covered in the next section.
Minimum token thresholds: 1,024 tokens for Sonnet 4.6/4.5 and Opus 4.1, and 4,096 tokens for Haiku 4.5 and Opus 4.7/4.6/4.5. Prompts shorter than the threshold are silently not cached: no error is thrown, and cache_read_input_tokens simply stays at 0. This is one of the most common reasons caching appears broken.
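Because a too-short prompt fails silently, it can help to check the length up front. The sketch below simply encodes the thresholds listed above in a lookup table. The dictionary keys are illustrative labels rather than API model IDs, is_cacheable is a hypothetical helper, and the token count is assumed to come from wherever you already measure it, for example the usage fields of an earlier response.

```python
# Minimum cacheable prefix lengths, copied from the thresholds listed above.
# Keys are illustrative labels, not API model IDs.
MIN_CACHEABLE_TOKENS = {
    "sonnet-4.6": 1024,
    "sonnet-4.5": 1024,
    "opus-4.1": 1024,
    "haiku-4.5": 4096,
    "opus-4.7": 4096,
    "opus-4.6": 4096,
    "opus-4.5": 4096,
}


def is_cacheable(prefix_tokens: int, model_label: str) -> bool:
    """Return True if a prefix of this many tokens meets the model's minimum."""
    return prefix_tokens >= MIN_CACHEABLE_TOKENS[model_label]


if __name__ == "__main__":
    # A 2,000-token system prompt clears the Sonnet minimum but not the Haiku one.
    print(is_cacheable(2000, "sonnet-4.6"))
    print(is_cacheable(2000, "haiku-4.5"))
```

Running a check like this at startup turns a permanent 0 in cache_read_input_tokens into an explicit failure you can act on.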
Why most caching implementations underperform
In April 2026, ProjectDiscovery's engineering blog published a breakdown of their LLM cost reduction effort. They started with a 7% cache hit rate. After one architectural change, they reached 84%. The fix wasn't more caching. It was moving things that didn't belong in the cache.
The problem: any variable content inside your cached prefix resets the cache hash for that request.
Here's what breaks caching silently:
# WRONG: dynamic variables are inside the cached prefix
def call_claude_broken(message: str, user_id: str, project: str) -> str:
    system = [{
        "type": "text",
        "text": f"""You are a senior engineer.
Project: {project}
User: {user_id}
[...2,000 tokens of static instructions...]""",
        "cache_control": {"type": "ephemeral"}
    }]
    # Every request has project and user_id baked in, so the hash is unique every time.
    # Cache hit rate: ~0%
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=system,
        messages=[{"role": "user", "content": message}]
    )
    return response.content[0].text
Every time project or user_id changes, the entire system prompt string changes. Different string, different hash, no cache hit.
The fix is one structural change: dynamic context moves to the user turn, not the system prompt.
# RIGHT: static prefix is cached; dynamic context goes in the user message
STATIC_SYSTEM = """You are a senior engineer.
[...2,000 tokens of static instructions; this string never changes...]"""

def call_claude_fixed(message: str, user_id: str, project: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": STATIC_SYSTEM,
            "cache_control": {"type": "ephemeral"}  # same hash every time
        }],
        messages=[{
            "role": "user",
            "content": f"Project: {project}\nUser: {user_id}\n\n{message}"
            # dynamic context lives here, outside the cache boundary
        }]
    )
    return response.content[0].text
ProjectDiscovery's system handled 20,000-token-per-agent YAML prompts. After the relocation fix, cache hit rate climbed from 7% to 84%. Overall LLM costs dropped 59%, peaking at 66% during their best 10-day window. 9.8 billion tokens served from cache.
The practical audit: search your system prompt for any f-string interpolation, .format() calls, or string concatenation. Anything that changes per request needs to move out.
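One way to run that audit against real traffic, rather than by reading code, is to fingerprint the system prompt text your application actually sends and count how many distinct values appear. The sketch below is a local check only: the SHA-256 fingerprint is an assumption chosen for illustration, not the hash Anthropic computes, and both function names are hypothetical.

```python
import hashlib
from collections import Counter


def prefix_fingerprint(system_text: str) -> str:
    """Short local fingerprint of a system prompt (not Anthropic's internal hash)."""
    return hashlib.sha256(system_text.encode("utf-8")).hexdigest()[:12]


def audit_prefixes(system_texts: list[str]) -> Counter:
    """Count how often each distinct system prompt was sent."""
    return Counter(prefix_fingerprint(text) for text in system_texts)


if __name__ == "__main__":
    static_part = "You are a senior engineer."
    # Broken pattern: per-request values are interpolated into the prefix.
    broken = [f"{static_part}\nUser: {uid}" for uid in ("u1", "u2", "u3")]
    # Fixed pattern: the prefix is identical on every request.
    fixed = [static_part for _ in range(3)]
    print(len(audit_prefixes(broken)), "distinct prefixes (broken)")
    print(len(audit_prefixes(fixed)), "distinct prefix (fixed)")
```

If the count of distinct fingerprints grows with the number of requests, something dynamic is still inside the prefix. A healthy setup shows one fingerprint per prompt version.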
Frequently Asked Questions
Does prompt caching work with streaming?
Yes. Add cache_control to your system prompt as shown above, then pass stream=True to messages.create. Caching and streaming work independently. Cache usage data appears in the stream's final message_delta event under usage.
How do I confirm my prompts are actually being cached?
Check response.usage.cache_read_input_tokens on every response. If it stays at 0 across multiple calls with the same system prompt, caching is not working. Log the raw usage object for 10+ consecutive calls. Common causes: prompt below minimum token threshold, or dynamic content inside the cached prefix.
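For logging across many calls, a small aggregate is easier to watch than raw usage objects. The sketch below computes a token-weighted hit rate from the three usage fields, assuming input_tokens counts only the tokens that were neither read from nor written to the cache. That definition of hit rate is one reasonable choice rather than an official metric, and cache_hit_rate is a hypothetical helper. SimpleNamespace objects stand in for real response.usage values so the example runs without an API key.

```python
from types import SimpleNamespace


def cache_hit_rate(usages) -> float:
    """Share of all prompt tokens that were served from cache."""
    usages = list(usages)  # allow generators; we iterate more than once
    read = sum(getattr(u, "cache_read_input_tokens", 0) or 0 for u in usages)
    written = sum(getattr(u, "cache_creation_input_tokens", 0) or 0 for u in usages)
    uncached = sum(getattr(u, "input_tokens", 0) or 0 for u in usages)
    total = read + written + uncached
    return read / total if total else 0.0


if __name__ == "__main__":
    # Stand-ins for response.usage: first call writes the cache, second reads it.
    sample = [
        SimpleNamespace(cache_creation_input_tokens=2000, cache_read_input_tokens=0, input_tokens=50),
        SimpleNamespace(cache_creation_input_tokens=0, cache_read_input_tokens=2000, input_tokens=60),
    ]
    print(f"hit rate: {cache_hit_rate(sample):.1%}")
```

In a real application you would append response.usage to a list after each call and pass that list in. A rate that stays at zero after the first few calls points to one of the two causes above.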
What is the minimum prompt length for caching?
Per Anthropic's June 2026 caching docs: 1,024 tokens for Sonnet 4.6/4.5 and Opus 4.1, and 4,096 tokens for Haiku 4.5 and Opus 4.7/4.6/4.5. Below the threshold, the API silently skips caching with no error or warning.
Prompt caching is the highest-leverage cost change available in the Claude API. One cache_control key, one structural audit to move dynamic content out of the prefix, and the savings compound on every request.
For the next layer of savings (50% off all tokens with the Batch API and a 40-80% blended cost reduction through model routing), see Claude API Batch and Model Routing: 50% Off Tokens and Smart Cost Control.