Claude Code Is Burning Your Money in the Background. Here’s How to Stop It.

I was three weeks into using Claude Code when I got my first Anthropic bill. I knew it wouldn’t be cheap, but still — I actually said something out loud when I saw the number. Not gonna lie, I thought I had set something up wrong. So I went and checked the usage page, and what I found was kind of depressing. A huge chunk of my tokens had nothing to do with actual coding work. It was background stuff. Context loading. Repeated re-reads of files I hadn’t even touched. Just… waste.

That’s when someone in a Discord server mentioned the Caveman plugin. I’d never heard of it. Took me maybe 20 minutes to set up and another week to fully understand what it was doing. But my token burn dropped noticeably after that — I’d say maybe 30 to 40 percent less on the background consumption side, though it depends on how you use Claude Code.

So I want to break this down. What’s actually costing you money, how Claude Code handles context, and what the Caveman plugin does about it. I’ll try to explain it in a way that makes sense even if you’re not an API expert.

What You’re Actually Paying For

First, the basics. Anthropic charges by the token. A token is roughly 3–4 characters of text, or about 0.75 words. So 1,000 tokens is close to 750 words. Every time Claude Code sends something to the API — your code, the conversation history, instructions, file contents — you’re paying for all of it, both what goes in and what comes back out.

Input tokens are cheaper than output tokens. As of now, claude-sonnet-4 (the model Claude Code defaults to) costs $3 per million input tokens and $15 per million output. So output is 5x more expensive. Keep that in mind — it matters for how Caveman helps.
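To make those prices concrete, here is a minimal sketch of the per-call arithmetic. It uses only the two rates quoted above and ignores caching entirely; the function name and the example token counts are made up for illustration.

```python
# Prices quoted in the article, in USD per million tokens.
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Uncached list-price cost of a single API call, in USD."""
    total = input_tokens * INPUT_PRICE_PER_MTOK + output_tokens * OUTPUT_PRICE_PER_MTOK
    return total / 1_000_000


# Hypothetical call: a 40,000-token context and a 1,000-token reply.
print(f"${request_cost(40_000, 1_000):.3f}")  # $0.135
```

Notice that in this example the input side is still most of the bill, even though each output token costs 5x more. Context size is what dominates once a session gets long.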

Here’s the part that surprised me when I first looked at it. Claude Code doesn’t just send your question and wait. It maintains a running context window. Every message in a session includes all the previous messages, plus whatever files it has loaded, plus system instructions. So if you’re 30 messages into a session and the context is 40,000 tokens, every single new message you send is also sending those 40,000 tokens of history. Again. And again.

This is how you can run up 500,000 input tokens in an afternoon of what felt like light work.
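A rough model of why this adds up so fast, assuming (purely for illustration) a 2,000-token starting context and 1,000 new tokens of history per turn. Real sessions are lumpier than this, and the numbers are not measurements.

```python
def session_input_tokens(messages: int, base_context: int, tokens_per_turn: int) -> int:
    """Total input tokens sent over a session when the full history is resent each turn."""
    total = 0
    context = base_context
    for _ in range(messages):
        context += tokens_per_turn  # this turn's text joins the history
        total += context            # and the whole context goes out again
    return total


# 30 messages, starting from 2,000 tokens, growing by 1,000 per turn (assumed values).
print(session_input_tokens(30, 2_000, 1_000))  # 525000
```

The context itself only reaches 32,000 tokens in this model, but because it is resent every turn, the running total passes half a million.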

Prompt Caching Helps — But Has Limits

Anthropic does offer prompt caching, which helps with this. If the beginning of your context stays the same across multiple API calls, Anthropic caches that prefix and charges a much lower rate for the cached portion — around $0.30 per million instead of $3. So 10x cheaper.

Claude Code uses this. The problem is that the cache only applies to the stable prefix of your context. As soon as something inside that prefix changes (a file gets modified, say), everything from the changed point onward goes out at full price again. New messages are appended after the prefix, so they are not cached the first time they are sent. On a project where files are changing constantly, the cache hit rate drops. And there’s also a cache expiry of around 5 minutes, so if you’re slow between messages, you’re back to full price.

The other thing is that even cached tokens aren’t free. And output tokens are always full price. So if Claude is writing a lot of code, explaining a lot of stuff, the output cost adds up fast regardless of caching.
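Here is a small sketch of how the cache hit rate changes the input bill, using the $3 and $0.30 rates mentioned above. It is a simplification: it treats every token as either a cache read or a normal input token and ignores any surcharge for writing to the cache. The function name and the fractions are illustrative.

```python
BASE_INPUT_PER_MTOK = 3.00    # USD per million uncached input tokens
CACHED_INPUT_PER_MTOK = 0.30  # USD per million cached input tokens


def input_cost(input_tokens: int, cached_fraction: float) -> float:
    """Input cost in USD when a given fraction of tokens is read from cache."""
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    return (cached * CACHED_INPUT_PER_MTOK + fresh * BASE_INPUT_PER_MTOK) / 1_000_000


for fraction in (0.0, 0.5, 0.9):
    print(f"{fraction:.0%} cached: ${input_cost(1_000_000, fraction):.2f}")
```

Even at a 90 percent hit rate, the uncached tenth still accounts for roughly half of the input cost, which is why the uncached tail matters so much.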

What Caveman Actually Does

The Caveman plugin for Claude Code (the repository is at github.com/botmechanic/caveman, last updated a few weeks ago with a fix for the MCP context bleed issue that was causing problems in version 0.3.1) works by aggressively managing what goes into the context window.

The basic idea is: Claude Code, by default, is kind of greedy about context. It wants to have everything available. So it loads files preemptively, it keeps long conversation history, and it lets the context grow until it hits the model’s limit and then summarizes — which itself costs tokens.

Caveman changes this in a few ways. I’ll try to explain each one in plain terms.

Context pruning. Instead of keeping the full conversation history, Caveman trims older messages that it judges to be low-value. It keeps the most recent messages, keeps any message where you explicitly referenced a file, and drops the filler stuff — confirmations, one-line acknowledgments, intermediate thinking steps Claude printed out. You can configure the aggressiveness of this trimming in the config file.
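To give a feel for the idea, here is a toy version of that kind of pruning. This is not Caveman's code. The Message type, the "short message means filler" heuristic, and both thresholds are assumptions made up for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str
    text: str
    references_file: bool = False  # hypothetical flag: the message names a file


def prune(history: list[Message], keep_recent: int = 6, filler_max_chars: int = 40) -> list[Message]:
    """Keep recent messages and file references; drop short filler from older turns."""
    cutoff = max(len(history) - keep_recent, 0)
    kept = []
    for index, message in enumerate(history):
        is_recent = index >= cutoff
        is_filler = len(message.text) <= filler_max_chars
        if is_recent or message.references_file or not is_filler:
            kept.append(message)
    return kept


history = [
    Message("user", "Look at src/parser.ts", references_file=True),
    Message("assistant", "Okay."),
    Message("user", "Thanks!"),
    Message("user", "Rename the variable."),
]
print([m.text for m in prune(history, keep_recent=1)])
# ['Look at src/parser.ts', 'Rename the variable.']
```

A length check is a crude stand-in for "low-value". A short message can still carry something important, which is exactly the failure mode to watch for with more aggressive settings.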

Lazy file loading. By default, Claude Code will sometimes pull in the contents of files you mentioned earlier in a session, even if you haven’t asked about them in a while. Caveman switches this to on-demand — files only get included in context when they’re explicitly requested in the current turn. This alone made a big difference for me because I work on projects with maybe 15–20 relevant files, and not all of them need to be in context every single time.
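The on-demand idea can be sketched in a few lines. Again, this is an illustration and not how Caveman is implemented: the function name is invented, and matching file paths by plain substring is a naive assumption.

```python
def files_for_turn(user_message: str, known_files: dict[str, str]) -> dict[str, str]:
    """Return only the files whose path appears in the current message."""
    return {path: body for path, body in known_files.items() if path in user_message}


# Hypothetical project files already seen earlier in the session.
known_files = {
    "src/api.ts": "export function fetchUser() { /* ... */ }",
    "src/db.ts": "export function connect() { /* ... */ }",
}

selected = files_for_turn("Fix the retry logic in src/api.ts", known_files)
print(list(selected))  # ['src/api.ts']
```

Everything not named in the current turn stays out of the request, so its tokens are not paid for again.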

Output length hints. This one is subtle. Caveman adds a soft instruction to the system prompt telling Claude to be concise when the task doesn’t need a detailed explanation. So if you ask it to rename a variable, it won’t write four paragraphs about why the new name is better. It’ll just do it. This directly cuts output tokens, which are your most expensive cost.

Session summarization control. When a session gets too long, Claude Code will auto-summarize the context to keep going. That summarization call itself burns tokens. Caveman lets you set a hard token budget per session and will warn you before you hit the auto-summarize threshold, so you can manually start a fresh session instead of paying for the summarization.

A Real Comparison (From My Own Usage)

I can’t give you a super scientific comparison because I wasn’t tracking things carefully before I installed Caveman. But I do have my Anthropic billing history, and I did roughly similar work in both periods.

The week before Caveman: I was mostly doing refactoring work on a TypeScript project. About 4–5 hours of active Claude Code usage across 5 days. My input token count for that week was around 2.1 million, output was around 380,000. Total cost was roughly $8.

The week after setting up Caveman (with context pruning on medium, lazy loading on, and output hints on): similar workload, similar hours. Input tokens dropped to about 1.3 million. Output dropped to around 260,000. Cost was around $5.

That’s not a dramatic drop, but it’s consistent. Over a month, that’s maybe $10–15 saved. If you’re a heavier user or you’re running Claude Code for a team, the math scales up.
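For anyone who wants to check the arithmetic on those two weeks, here is a quick sketch. The token counts are the ones reported above. The list-price figures assume no caching at all, so they are a ceiling and not what was actually billed. The billed totals being lower would be consistent with part of the input going out at the cached rate, though that is an inference and not something the billing history confirms.

```python
def list_price(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD at $3 / $15 per million tokens, with no caching."""
    return (input_tokens * 3 + output_tokens * 15) / 1_000_000


def percent_drop(before: int, after: int) -> float:
    """Relative reduction from before to after, in percent."""
    return (before - after) / before * 100


before = (2_100_000, 380_000)  # input, output tokens the week before
after = (1_300_000, 260_000)   # input, output tokens the week after

print(list_price(*before))                       # 12.0
print(list_price(*after))                        # 7.8
print(round(percent_drop(before[0], after[0])))  # 38
print(round(percent_drop(before[1], after[1])))  # 32
```

So the drop works out to about 38 percent on input tokens and 32 percent on output tokens.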

The place where I noticed the biggest change was in long sessions. Before Caveman, if I had a session running for two hours on the same feature, the input token count per message would just keep climbing. By hour two I was sending something like 8,000–10,000 tokens with every message just in context overhead. With Caveman, that levels off earlier because of the pruning.

How to Set It Up

This part is pretty simple but the documentation has some gaps — took me longer than it should have because the README doesn’t mention one step.

You need Node.js 18 or later. Check with node --version. Then:

npm install -g @botmechanic/caveman
caveman init

The init command creates a .caveman.json file in your home directory. Open that file. The defaults are conservative — context pruning is set to "off" and lazy loading is "off." You need to change these manually.

The config I’m using:

{
  "contextPruning": "medium",
  "lazyFileLoading": true,
  "outputHints": true,
  "sessionBudget": 80000,
  "budgetWarningAt": 70000
}

The sessionBudget is the hard cap, in tokens, on total context size, and budgetWarningAt is the point at which Caveman warns you that you're getting close. 80,000 is a reasonable number: it's well within the Claude Sonnet context window and still enough for big tasks. The thing the README doesn't mention: you also need to run caveman link after init, otherwise it doesn't actually hook into Claude Code. I spent 30 minutes wondering why nothing had changed before I figured this out.

Restart Claude Code after running caveman link. That's it.
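If you want to sanity-check the two budget numbers before relying on them, a small script like this works. It assumes budgetWarningAt is the warning threshold and sessionBudget is the cap, which is how the config above reads. The budget_status function is a hypothetical helper, not part of Caveman.

```python
import json

# The same values as the config shown above.
CONFIG_TEXT = """
{
  "contextPruning": "medium",
  "lazyFileLoading": true,
  "outputHints": true,
  "sessionBudget": 80000,
  "budgetWarningAt": 70000
}
"""


def budget_status(config: dict, context_tokens: int) -> str:
    """Classify a context size against the warning threshold and the cap."""
    budget = config["sessionBudget"]
    warn_at = config["budgetWarningAt"]
    if warn_at >= budget:
        raise ValueError("budgetWarningAt should be below sessionBudget")
    if context_tokens >= budget:
        return "over budget"
    if context_tokens >= warn_at:
        return "warning"
    return "ok"


config = json.loads(CONFIG_TEXT)
print(budget_status(config, 45_000))  # ok
print(budget_status(config, 72_000))  # warning
print(budget_status(config, 81_000))  # over budget
```

The check on the two values is there because a warning threshold at or above the cap would never fire in time to be useful.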

What Caveman Doesn’t Fix

I want to be honest here because I’ve seen a few blog posts making it sound like Caveman is some magic solution. It’s not.

If you’re writing a lot of code (asking Claude to generate full files or rewrite large functions, for example), your output costs are going to be high regardless. Caveman’s output hints reduce unnecessary verbosity, but it can’t make Claude generate 500 lines of TypeScript with fewer tokens. The code is the code.

Also, the context pruning is not perfect. In one case I was working on a bug and Caveman pruned an earlier message that contained a key piece of context about how a function was supposed to behave. Claude then started giving me wrong suggestions because it had lost that context. I had to scroll back, re-paste the relevant part manually, and continue. This doesn’t happen often, but it happens. The “medium” setting is safer than “aggressive” for complex, long-running debugging sessions.

And honestly, the biggest way to reduce your Claude Code costs is still just being deliberate about sessions. Start a new session when you move to a new task. Don’t let sessions run for three hours on five different problems. Keep your questions focused. Caveman helps with the background waste, but your habits matter more than any plugin.

A Note on the Caching Math

I want to come back to prompt caching because it’s actually kind of interesting and I didn’t fully understand it at first.

Anthropic’s prompt caching works on a prefix basis. The first time you send a request with a long context, that prefix gets cached for 5 minutes. The next request that starts with the same prefix gets the cached rate. If your context grows between requests — which it always does, because each new message gets appended — then the new part isn’t cached, only the stable beginning.

So in a typical Claude Code session, you might have something like: system prompt (cached), file contents you loaded at the start (cached), then conversation history which is growing with each message (partially cached). The tail of the conversation is never cached. And the tail is exactly where your active, expensive back-and-forth is happening.
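The prefix behavior is easier to see with a toy model. This sketch treats a request as a list of (content, token count) blocks and counts the tokens in the leading blocks that are identical to the previous request. The real cache is more involved than this, and the block labels and sizes here are invented.

```python
Block = tuple[str, int]  # (content label, token count)


def cached_prefix_tokens(previous: list[Block], current: list[Block]) -> int:
    """Tokens in the leading run of blocks that both requests share."""
    total = 0
    for old, new in zip(previous, current):
        if old != new:
            break  # everything from the first difference onward is uncached
        total += new[1]
    return total


first = [("system prompt", 3_000), ("api.ts v1", 5_000), ("message 1", 400)]
appended = first + [("message 2", 600)]
edited = [("system prompt", 3_000), ("api.ts v2", 5_000), ("message 1", 400), ("message 2", 600)]

print(cached_prefix_tokens(first, appended))  # 8400
print(cached_prefix_tokens(first, edited))    # 3000
```

Appending a message leaves the whole earlier context reusable. Editing a file that sits early in the context throws away everything after it, including history that did not change at all.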

Caveman helps here because by keeping the conversation history shorter, it keeps the proportion of your context that IS cached higher. If the tail is shorter, a bigger percentage of your total tokens hits the cache rate. This is actually the mathematical reason why the input token savings are real even though it might not seem obvious at first.

The 5-minute cache expiry also means that if you’re slow (say you’re reading Claude’s output carefully, or you stepped away for a coffee), you lose the cache benefit and pay full price for the next message. Honestly, this part still doesn’t fully make sense to me in terms of how Anthropic decides what to cache and when, but the practical result is: work in shorter, focused sessions and you’ll see better cache hit rates.
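Here is a small sketch of how gaps between messages decide whether the cache is still warm. It assumes a 5-minute lifetime that restarts every time a request is sent, and the timestamps are made up.

```python
CACHE_TTL_SECONDS = 5 * 60  # assumed cache lifetime


def cache_hits(send_times: list[float]) -> list[bool]:
    """For each request time (in seconds), report whether the cache was still warm."""
    hits = []
    last_use = None
    for sent_at in send_times:
        warm = last_use is not None and sent_at - last_use <= CACHE_TTL_SECONDS
        hits.append(warm)
        last_use = sent_at  # each request writes or refreshes the cache entry
    return hits


# Three quick messages, a coffee break of almost 8 minutes, then one more.
print(cache_hits([0, 120, 240, 700, 760]))
# [False, True, True, False, True]
```

The first message after the break pays full price, and the one right after it is cheap again because the break message rebuilt the cache.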

Longer Sessions and the Diminishing Returns Problem

Here’s something I figured out after a while. The cost per message in a long Claude Code session doesn’t stay flat. It grows. Because context grows.

Say you start a session and your first few messages are each about 2,000 tokens of context. By message 20, you’re at maybe 15,000 tokens of context. By message 50, maybe 40,000. So message 50 is costing 20x more in input tokens than message 1, even if you’re asking a similar-sized question.

This is a real problem for people who treat Claude Code like a persistent coding assistant they keep open all day. The session that starts at 9am and runs until 5pm is going to have wildly expensive messages by the afternoon, even if each individual question is simple.

The practical advice: treat sessions as tasks, not as days. Finish a feature, close the session, open a new one. Yeah, you lose context — but that’s sort of the point. You don’t need Claude to remember everything you talked about this morning when you’re now working on a completely different module.
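To put a number on that advice, here is a simple comparison of one long session against the same number of messages split across shorter ones. It assumes context starts at 2,000 tokens and grows by 800 tokens per message, which lands near the 40,000-token figure at message 50. Both values are illustrative, not measured.

```python
def session_tokens(messages: int, base: int = 2_000, growth: int = 800) -> int:
    """Total input tokens for a session where context grows linearly per message."""
    return sum(base + growth * i for i in range(messages))


one_long_session = session_tokens(50)
five_short_sessions = 5 * session_tokens(10)

print(one_long_session)     # 1080000
print(five_short_sessions)  # 280000
```

Same 50 messages, roughly a quarter of the input tokens. The model ignores whatever you have to re-explain at the start of each fresh session, so the real gap will be smaller, but the direction holds.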

Caveman’s session budget warning is useful here because it makes this visible. When it tells you “you’re at 70,000 tokens,” that’s a good trigger to ask yourself whether you actually need everything that’s accumulated, or whether this is a good point to wrap up and start fresh.

Is It Worth Setting Up?

For most people who use Claude Code more than casually, yes. The setup takes maybe 20 minutes. The savings are real, not huge, but consistent. And the session budget warnings alone have made me more aware of when I’m burning through context unnecessarily.

If you’re a light user — maybe an hour of Claude Code a week — probably don’t bother. The savings won’t be meaningful.

If you’re using it for several hours a week, or if you’re running it for a team or in any kind of automated workflow, set it up. The lazy file loading and the output hints add up over time. And understanding what’s actually happening with your token usage makes you a smarter user of the tool overall, Caveman or not.

The Anthropic billing dashboard is actually pretty good. It breaks down your usage by model and date. Worth spending 10 minutes looking at it before you do anything else. You might find, like I did, that the waste is more visible than you expected.