Claude Prompt Caching: Cut API Input Costs 90% With Two Code Changes

Prompt caching cuts the cost of Claude API input tokens by 90% on the cached portion of a prompt, typically a repeated system prompt. One cache_control key enables it, but most implementations quietly fail because of a single structural mistake that keeps cache hit rates near zero. This post covers the correct setup and the fix.

You'll need:

  • anthropic SDK (pip install anthropic or npm install @anthropic-ai/sdk)
  • An API key from console.anthropic.com

As of June 2026, Anthropic's pricing page shows Claude Sonnet 4.6 at $3.00/MTok input. That's the number we're reducing.


How does Claude prompt caching work?

As of June 2026, Anthropic's prompt caching documentation lists cache-read tokens at 0.1x the base input price, a 90% reduction. Cache-write tokens cost 1.25x base (5-minute TTL) or 2x base (1-hour TTL). With a 5-minute TTL, caching breaks even after a single read: one write plus one read costs 1.35x base, versus 2x for two uncached calls. With a 1-hour TTL, it takes two reads (2.2x versus 3x). Every call after that is 90% cheaper on the cached portion.
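To put numbers on that, here is a minimal sketch of the arithmetic using the multipliers quoted above and the $3.00/MTok Sonnet 4.6 price from this post. The function name and the assumption that every call after the first is a cache hit are mine, not Anthropic's; real traffic with gaps longer than the TTL will pay for more than one write.

```python
from typing import Optional

BASE_INPUT_PER_MTOK = 3.00                   # Sonnet 4.6 input price quoted in this post
CACHE_READ_MULT = 0.1                        # cache reads cost 0.1x base
CACHE_WRITE_MULT = {"5m": 1.25, "1h": 2.0}   # cache writes, by TTL


def prefix_cost(prefix_tokens: int, calls: int, ttl: Optional[str] = None) -> float:
    """Dollar cost of sending the same prefix `calls` times (calls >= 1).

    ttl=None means no caching. With caching, the first call pays the write
    price and every later call is assumed to be a cache hit.
    """
    per_token = BASE_INPUT_PER_MTOK / 1_000_000
    if ttl is None:
        return prefix_tokens * calls * per_token
    write = prefix_tokens * CACHE_WRITE_MULT[ttl] * per_token
    reads = prefix_tokens * CACHE_READ_MULT * per_token * (calls - 1)
    return write + reads


if __name__ == "__main__":
    for ttl in (None, "5m", "1h"):
        print(ttl, round(prefix_cost(2_000, 100, ttl), 4))
```

For a 2,000-token prefix sent 100 times, that works out to about $0.60 uncached and about $0.067 with a 5-minute TTL. The saving on the prefix is a little under 90% because the first call pays the write premium.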

The mechanics: Anthropic caches the processed prompt prefix (everything up to the cache_control breakpoint) and keys it by a hash of that exact content. On subsequent calls with an identical prefix, they serve it from cache instead of reprocessing it. You pay 1/10th the normal rate for those tokens.

One cache_control key is all it takes to opt in.

Python:

python
import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = """You are a senior software engineer specializing in Python and TypeScript.
You write clean, well-documented code. You follow existing project conventions.
You explain tradeoffs before implementing. You never use placeholder comments.

[Keep this prompt static: any change busts the cache.
Must be 1,024+ tokens for Sonnet 4.6/4.5 and Opus 4.1,
4,096+ for Haiku 4.5 and Opus 4.7/4.6/4.5.]
"""

def call_claude(user_message: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"}  # this is all it takes
        }],
        messages=[{"role": "user", "content": user_message}]
    )

    usage = response.usage
    print(f"Cache write: {usage.cache_creation_input_tokens}")
    print(f"Cache read:  {usage.cache_read_input_tokens}")  # should be > 0 after first call

    return response.content[0].text

TypeScript:

typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const SYSTEM_PROMPT = `You are a senior software engineer...
[...your static instructions — keep dynamic content out of here...]`;

async function callClaude(userMessage: string): Promise<string> {
  const response = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    system: [{
      type: "text",
      text: SYSTEM_PROMPT,
      cache_control: { type: "ephemeral" }
    }],
    messages: [{ role: "user", content: userMessage }]
  });

  console.log("Cache read tokens:", response.usage.cache_read_input_tokens);
  return (response.content[0] as Anthropic.TextBlock).text;
}

After the first call, cache_read_input_tokens should be non-zero on every subsequent call made within the cache TTL. If it stays at 0, your prompt is either too short or, more likely, you're making the structural mistake covered in the next section.

Watch Out

Minimum token thresholds: 1,024 tokens for Sonnet 4.6/4.5 and Opus 4.1. 4,096 tokens for Haiku 4.5 and Opus 4.7/4.6/4.5. Prompts shorter than the threshold are silently not cached, with no error thrown, just a permanent 0 in cache_read_input_tokens. This is the most common reason caching appears broken.
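Because the API gives no signal here, it can help to fail loudly on your side. The sketch below hard-codes the thresholds listed in this post (check them against the current docs before relying on them) and takes the prompt's token count as a plain number. The table name and is_cacheable are hypothetical helpers, not part of the SDK. One way to get the count is the SDK's token-counting call, which also counts message tokens, so treat it as an upper bound for the system prompt alone.

```python
# Minimum cacheable prompt lengths as listed in this post.
# Keys are human-readable family labels, not API model IDs.
MIN_CACHEABLE_TOKENS = {
    "Sonnet 4.6": 1024,
    "Sonnet 4.5": 1024,
    "Opus 4.1": 1024,
    "Haiku 4.5": 4096,
    "Opus 4.7": 4096,
    "Opus 4.6": 4096,
    "Opus 4.5": 4096,
}


def is_cacheable(model_family: str, prompt_tokens: int) -> bool:
    """True if a prompt of this length meets the model's caching minimum."""
    return prompt_tokens >= MIN_CACHEABLE_TOKENS[model_family]


if __name__ == "__main__":
    # A 2,000-token prompt caches on Sonnet 4.6 but not on Haiku 4.5.
    print(is_cacheable("Sonnet 4.6", 2_000))
    print(is_cacheable("Haiku 4.5", 2_000))
```

Running a check like this once at startup turns a silent zero into an explicit failure.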


Why most caching implementations underperform

In April 2026, ProjectDiscovery's engineering blog published a breakdown of their LLM cost reduction effort. They started with a 7% cache hit rate. After one architectural change, they reached 84%. The fix wasn't more caching. It was moving things that didn't belong in the cache.

The problem: the cache only matches an identical prefix, so any per-request variable inside the cached prefix produces a different hash and a cache miss.

Here's what breaks caching silently:

python
# WRONG: dynamic variables are inside the cached prefix
def call_claude_broken(message: str, user_id: str, project: str) -> str:
    system = [{
        "type": "text",
        "text": f"""You are a senior engineer.
Project: {project}
User: {user_id}

[...2,000 tokens of static instructions...]""",
        "cache_control": {"type": "ephemeral"}
    }]
    # project and user_id are baked into the prefix, so every
    # project/user combination gets its own hash. Cache hit rate: ~0%
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=system,
        messages=[{"role": "user", "content": message}]
    )
    return response.content[0].text

Every time project or user_id changes, the entire system prompt string changes. Different string, different hash, no cache hit.
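You can see the effect locally with a toy fingerprint. This is only an illustration: SHA-256 over the prompt text stands in for whatever Anthropic computes internally, and fingerprint and broken_prompt are hypothetical helpers.

```python
import hashlib

STATIC_SYSTEM = "You are a senior engineer.\n[...static instructions...]"


def fingerprint(text: str) -> str:
    """Stand-in for a cache key: a short SHA-256 digest of the prompt text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def broken_prompt(project: str, user_id: str) -> str:
    """System prompt with per-request values interpolated into it."""
    return (
        "You are a senior engineer.\n"
        f"Project: {project}\nUser: {user_id}\n"
        "[...static instructions...]"
    )


if __name__ == "__main__":
    # Two users on the same project produce two different fingerprints.
    print(fingerprint(broken_prompt("billing", "u1")))
    print(fingerprint(broken_prompt("billing", "u2")))
    # The static prompt produces the same fingerprint every time.
    print(fingerprint(STATIC_SYSTEM), fingerprint(STATIC_SYSTEM))
```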

The fix is one structural change: dynamic context moves to the user turn, not the system prompt.

python
# RIGHT — static prefix is cached; dynamic context goes in the user message
STATIC_SYSTEM = """You are a senior engineer.

[...2,000 tokens of static instructions — this string never changes...]"""

def call_claude_fixed(message: str, user_id: str, project: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": STATIC_SYSTEM,
            "cache_control": {"type": "ephemeral"}  # same hash every time
        }],
        messages=[{
            "role": "user",
            "content": f"Project: {project}\nUser: {user_id}\n\n{message}"
            # dynamic context lives here — outside the cache boundary
        }]
    )
    return response.content[0].text

ProjectDiscovery's system handled 20,000-token-per-agent YAML prompts. After the relocation fix, cache hit rate climbed from 7% to 84%. Overall LLM costs dropped 59%, peaking at 66% during their best 10-day window. In total, 9.8 billion tokens were served from cache.

The practical audit: search the code that builds your system prompt for any f-string interpolation, .format() calls, or string concatenation. Anything that changes per request needs to move out of the cached prefix and into the user turn.


Frequently Asked Questions

Does prompt caching work with streaming?

Yes. Add cache_control to your system prompt as shown above, then pass stream=True to messages.create. Caching and streaming work independently. Cache usage data arrives in the stream's usage fields: the message_start event carries the input-side counts, and the SDKs' streaming helpers expose them on the final accumulated message.
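As a sketch, here is one way to stream and still read the cache counters in Python. Note that it uses the SDK's messages.stream() helper rather than stream=True, because the helper assembles a final message whose usage can be read in one place. The function name and the returned dict shape are my own choices; client is assumed to be an anthropic.Anthropic() instance.

```python
def stream_with_cache_stats(client, system_prompt: str, user_message: str) -> dict:
    """Stream a reply to stdout, then return the cache counters for the call."""
    with client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{"role": "user", "content": user_message}],
    ) as stream:
        # Print text chunks as they arrive.
        for text in stream.text_stream:
            print(text, end="", flush=True)
        # The helper accumulates events into a complete message.
        usage = stream.get_final_message().usage

    return {
        "cache_write": usage.cache_creation_input_tokens,
        "cache_read": usage.cache_read_input_tokens,
    }
```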

How do I confirm my prompts are actually being cached?

Check response.usage.cache_read_input_tokens on every response. If it stays at 0 across multiple calls with the same system prompt, caching is not working. Log the raw usage object for 10+ consecutive calls. Common causes: a prompt below the minimum token threshold, dynamic content inside the cached prefix, or calls spaced further apart than the cache TTL.
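A small accumulator makes that logging easier to read. The sketch below defines hit rate as the share of all input tokens that were cache reads, which is my working definition and not necessarily how any particular team measures it. CacheStats is a hypothetical helper, and it expects objects shaped like response.usage.

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    """Running totals of input tokens across calls."""
    read: int = 0       # tokens served from cache
    written: int = 0    # tokens written to cache
    uncached: int = 0   # tokens billed at the normal input rate

    def record(self, usage) -> None:
        """Add one usage object; missing or None fields count as 0."""
        self.read += getattr(usage, "cache_read_input_tokens", 0) or 0
        self.written += getattr(usage, "cache_creation_input_tokens", 0) or 0
        self.uncached += getattr(usage, "input_tokens", 0) or 0

    @property
    def hit_rate(self) -> float:
        """Share of all input tokens that were cache reads."""
        total = self.read + self.written + self.uncached
        return self.read / total if total else 0.0
```

Call stats.record(response.usage) after each request and print stats.hit_rate every few calls. A value that stays at 0.0 points to one of the causes listed above.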

What is the minimum prompt length for caching?

Per Anthropic's June 2026 caching docs: 1,024 tokens for Sonnet 4.6/4.5 and Opus 4.1, and 4,096 tokens for Haiku 4.5 and Opus 4.7/4.6/4.5. Below the threshold, the API silently skips caching with no error or warning.


Prompt caching is one of the highest-leverage cost changes available in the Claude API. One cache_control key, one structural audit to move dynamic content out of the prefix, and the savings repeat on every request.

For the next layer of savings (50% off all tokens with the Batch API and a 40-80% blended cost reduction through model routing), see Claude API Batch and Model Routing: 50% Off Tokens and Smart Cost Control.