Yet another day, yet another Anthropic announcement. April 16, 2026, and there are actually two stories here. One is about Claude Opus 4.7, which just dropped and is available to everyone. The other is about a model called Claude Mythos Preview, sitting quietly in the background, more capable, and which you cannot have. Not yet.
Let me explain both, as if to a five-year-old.

Quick note first: if you’re a regular Claude user who doesn’t code or build apps, the upgrade already happened automatically. You’re using 4.7 right now, whether you noticed or not. For developers using the API, switch to claude-opus-4-7. Price is the same: $5 per million input tokens, $25 per million output. But there's a tokenizer change that might affect your bill at the end of the month, which I'll get to.
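For quick back-of-the-envelope budgeting at those rates, the per-request math is simple. This is my own sketch (function name and defaults are mine), not an official calculator, and it ignores caching discounts:

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_per_mtok: float = 5.00,
                     output_per_mtok: float = 25.00) -> float:
    """Estimate the cost of one API call at Opus 4.7 list prices."""
    return (input_tokens * input_per_mtok
            + output_tokens * output_per_mtok) / 1_000_000

# A 10k-token prompt with a 2k-token reply:
# 10_000 * 5/1e6 + 2_000 * 25/1e6 = 0.05 + 0.05 = $0.10
```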
What actually improved in Opus 4.7
The biggest story is coding. Cursor ran their own internal benchmark of 93 real software engineering tasks, and Opus 4.7 solved 70% of them. Opus 4.6 sat at 58%. More interesting, 4.7 solved four tasks that neither 4.6 nor Sonnet 4.6 could do at all. Not “did slightly better”: it flat-out solved problems the previous generation couldn’t handle.
Rakuten’s engineering team reported something even more dramatic on their own internal test: 3x more production task resolutions compared to Opus 4.6. That number sounds high, but multiple companies reported double-digit gains, so the direction is clear.
What seems to be driving this is how the model handles verification. Vercel engineers noticed 4.7 does something like a proof-check on systems code before writing anything. It thinks through its own logic first. Previous models would sometimes produce code that looked right but had subtle issues: off-by-one errors, wrong variable scope, race conditions. This one catches more of that before it happens. Warp (the AI terminal app) specifically mentioned it fixed a tricky concurrency bug that Opus 4.6 couldn’t crack at all. Concurrency bugs require careful forward reasoning to spot. Good sign.
Also: Opus 4.7 is more opinionated. Replit’s president said it “pushes back during technical discussions to help me make better decisions.” Previous Claude models were sometimes too agreeable: you’d suggest a suboptimal approach and they’d just go along with it. 4.7 will tell you when something has a problem. Some people will find this annoying. I think it’s how these tools should work.
The vision upgrade
Old Claude models handled images up to roughly 750 pixels on the long edge. Opus 4.7 accepts up to 2,576 pixels, about 3.75 megapixels. More than three times the old limit.
One company, XBOW, builds autonomous penetration testing tools that rely on reading dense screenshots. Their visual-acuity benchmark went from 54.5% to 98.5% just from this resolution change. That’s not a marginal improvement. That’s a use case that basically didn’t work before now working reliably.
For regular users it means better reading of complex diagrams, technical screenshots, scanned documents. Life sciences companies using Claude for patent workflows, like Solve Intelligence, mentioned being able to read chemical structures directly from images rather than describing them in words. That alone probably saves hours per week.
One catch: higher-resolution images cost more tokens. If you’re sending images via the API and don’t actually need the extra detail, you can downsample before sending. There’s no toggle in the API for this. You have to handle it yourself.
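The downsample-before-sending step is just a resize so the long edge stays under a target. A minimal sketch of the dimension math (the helper name and the round-to-fit behavior are mine; the 2,576 figure comes from the announcement):

```python
def fit_long_edge(width: int, height: int,
                  max_long_edge: int = 2576) -> tuple[int, int]:
    """Return (w, h) scaled so the longer side is at most max_long_edge.

    Use this to downsample images client-side before an API call when
    you don't need the extra detail (and don't want to pay for the
    extra image tokens).
    """
    long_edge = max(width, height)
    if long_edge <= max_long_edge:
        return width, height  # already small enough; send as-is
    scale = max_long_edge / long_edge
    return round(width * scale), round(height * scale)

# A 4000x3000 photo would go out at 2576x1932.
```

You'd feed the result to whatever image library you use (e.g. Pillow's `Image.resize`) before base64-encoding the upload.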
Instruction following is almost too good now
Here’s something that caught me off guard in the release notes. Anthropic added an actual warning: prompts written for Opus 4.6 might behave unexpectedly with 4.7 because the new model follows instructions so literally.
Where 4.6 used to interpret ambiguous instructions loosely or skip parts it thought were irrelevant, 4.7 takes everything literally. If your prompt has vague wording, 4.7 picks one interpretation and sticks with it, with no common-sense smoothing at the edges. If you have instructions that older models mostly ignored when they weren’t relevant, 4.7 will follow them anyway and produce weird results.
I struggled with exactly this during an earlier upgrade (in my case, Opus 4.5 to 4.6). We had instructions like “be concise unless the question requires depth,” and the new model had a completely different idea of what “requires depth” meant. Everything came out too long. It was a real headache. So when Anthropic explicitly warns about this in the release notes, I take it seriously.
If you’re moving production systems to 4.7: test your prompts first. Don’t assume behavior carries over.
The tokenizer change
New tokenizer in Opus 4.7. Same input text now maps to 1.0–1.35x more tokens depending on content type. Code-heavy inputs tend toward 1.0x; natural language with unusual formatting or multilingual content goes higher.
And on top of that, 4.7 thinks more at higher effort levels, especially in later turns of agentic workflows. More thinking means more output tokens.
Anthropic’s own testing showed cost-per-task went down on their internal coding benchmark because you get more done per session. But that’s their benchmark, not yours. They say to measure the actual difference on real traffic before assuming it’s net positive. I’d do that. I’ve been burned by “it’s better overall, trust us” from model providers before.
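Bounding the input-side exposure is straightforward arithmetic on last month's volume. A sketch applying the announced 1.0–1.35x range (function name is mine; this deliberately ignores the output-side growth from extra thinking, which you can only measure on real traffic):

```python
def projected_input_cost(current_input_tokens: int,
                         price_per_mtok: float = 5.00,
                         multiplier_range=(1.0, 1.35)) -> tuple[float, float]:
    """Bound next month's input spend under the new tokenizer.

    Applies the announced 1.0-1.35x token-count range to last
    month's input volume at list price.
    """
    lo, hi = multiplier_range
    cost = current_input_tokens * price_per_mtok / 1_000_000
    return cost * lo, cost * hi

# 200M input tokens/month: $1,000 today -> $1,000-$1,350 after migration.
```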
There’s also a new effort level: xhigh. Sits between high and max. Anthropic made xhigh the default in Claude Code for all plans. If you use Claude Code and suddenly sessions are slower or token counts are higher, that is why. You can dial it back to high if you want.
For most people doing short interactive tasks, none of this matters much. For anyone running long agentic workflows, measure first.
Memory and long-running work
This one got less attention in the announcement but I think it matters quite a bit for people doing multi-session work. Opus 4.7 is better at using file system-based memory: it can remember important notes across sessions and actually use them when starting new tasks.
If you’ve ever had to paste the same project background at the start of every conversation, you know why this matters. Or if you’ve watched an agent completely forget context it “knew” two sessions ago. That context-loss problem is genuinely frustrating.
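The pattern itself is nothing exotic. A minimal sketch of what file-based memory looks like from the agent's side (the file name, format, and helper names are all my invention; the real tool-use plumbing would sit around this):

```python
from pathlib import Path

NOTES = Path("project_memory.md")  # hypothetical per-project notes file

def save_note(note: str) -> None:
    """Append a fact worth remembering so later sessions can reload it."""
    with NOTES.open("a", encoding="utf-8") as f:
        f.write(f"- {note}\n")

def load_context() -> str:
    """Read accumulated notes to prepend to a new session's first prompt."""
    return NOTES.read_text(encoding="utf-8") if NOTES.exists() else ""
```

The claim about 4.7 is that, given tools like these, it both writes more useful notes and actually consults them on a fresh start, instead of you re-pasting the background by hand.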
Notion’s AI team mentioned that 4.7 is “the first model to pass our implicit-need tests,” meaning it picks up on what you need without you explicitly saying it. That aligns with better memory and context handling. The model is apparently paying more attention to the full history of a task, not just the most recent message.
I haven’t tested this in my own setup yet, so I can’t personally confirm how well it works (I don’t have enough credits to do so). But multiple testers mentioned it independently, which is a decent signal.
Claude Mythos Preview — the model they’re not releasing
Last week, Anthropic announced Project Glasswing. Part of that announcement introduced a model called Claude Mythos Preview. It’s a general-purpose frontier model they built, tested, and then decided not to release publicly.
Why? Because it’s exceptionally good at finding security vulnerabilities.
Anthropic says Mythos Preview found thousands of zero-day vulnerabilities in every major operating system and every major web browser. It found a 27-year-old bug in OpenBSD, which has a reputation as one of the most security-hardened operating systems around, used for firewalls and critical infrastructure. That vulnerability let an attacker remotely crash any machine just by connecting to it. It also found a 16-year-old bug in FFmpeg (the video encoding library that almost every piece of software uses), in a line of code that automated testing tools had hit five million times without catching it. And it chained together multiple Linux kernel vulnerabilities to go from normal user access to full machine control, autonomously, no human steering required.
Anthropic reported all of these to the relevant maintainers. The patched ones are now public on their Frontier Red Team blog. For the rest, they published cryptographic hashes of the details — a record of discovery — and will release specifics once fixes are in place.
This is why Opus 4.7 is not Anthropic’s most capable model. It’s the safer version: a test bed for the safety guardrails they’ll eventually need before releasing Mythos-class capabilities to everyone. The logic makes sense. But it also means what you can access right now is a deliberately limited version of what they’ve built.
Companies in Project Glasswing — AWS, Google, Microsoft, Cisco, CrowdStrike, JPMorganChase, Palo Alto Networks, Nvidia, Apple, Broadcom, and the Linux Foundation — are using Mythos Preview for defensive security work. Anthropic committed $100 million in usage credits for this effort, plus $4 million in donations to open-source security organizations. Security professionals who need access for legitimate work can apply to the Cyber Verification Program.
We’ve covered the Mythos model in detail; check the pinned article below.
How the benchmarks actually compare

Some numbers side by side, because this is where the gap becomes concrete.
SWE-bench Verified, which tests real-world software engineering: Opus 4.6 was at 80.8%, Mythos Preview is at 93.9%. Opus 4.7 improves on 4.6 but the exact score isn’t published.
SWE-bench Pro, harder version: Mythos at 77.8%, Opus 4.6 at 53.4%. Opus 4.7 sits between the two.
Humanity’s Last Exam — the hardest academic benchmark built so far for AI models — Mythos without tools hits 56.8%, Opus 4.6 was at 40%. With tools, Mythos gets 64.7%.
GPQA Diamond, graduate-level science questions: Mythos at 94.6%, Opus 4.6 at 91.3%.
On cybersecurity specifically, CyberGym benchmark: Opus 4.6 scored 73.8%, Mythos Preview scores 83.1%. Opus 4.7’s score here is intentionally lower than 4.6, because Anthropic reduced those capabilities during training. So in this area, the publicly released model is less capable than its predecessor — by design.
For what Opus 4.7 specifically improved: document reasoning (21% fewer errors than Opus 4.6 on Databricks’ OfficeQA Pro), finance agent work (top score on GDPval-AA, an independent evaluation run by Artificial Analysis), and vision. The vision resolution change is unique to 4.7 — Mythos’s vision specs aren’t published separately.
The gap to Mythos is biggest in advanced reasoning, cybersecurity, and complex multi-step coding chains. For everyday tasks — document analysis, coding assistance, writing, research — 4.7 vs 4.6 is a real and meaningful improvement.
Safety and alignment — the actual numbers
Anthropic ran automated behavioral audits on both models. Mythos Preview is the best-aligned model they’ve built — lowest rates of deception, sycophancy, and cooperating with misuse on their measurements.
Opus 4.7 is a modest improvement over 4.6 on honesty and on resisting prompt injection attacks where a malicious document tries to make the model do something the user didn’t intend. Good progress.
But there’s one regression they admitted openly: 4.7 gives slightly more detailed harm-reduction advice on controlled substances than 4.6 did. They said it in the announcement: they know about it and didn’t fix it for this release.
The cyber safeguards in 4.7 work by automatically detecting and blocking requests that look like prohibited cybersecurity use. From how Anthropic describes it, these are detection layers sitting on top of the model. Whether they hold up when people actively try to work around them is unknown; gathering that data from real-world deployment is basically the whole point of releasing 4.7 first. It’s a live experiment, and Anthropic is being reasonably honest about that.
They’re publishing a report within 90 days, by mid-July 2026, covering what vulnerabilities were found through Glasswing, what was fixed, and what they learned about deploying safeguards at scale. That report will be worth reading.
New features for developers and Claude Code users
A few other things launched alongside 4.7 that are worth knowing about.
In Claude Code, there’s a new /ultrareview command. You run it and it does a dedicated review session on your code changes: reads through everything, flags bugs, spots design issues that a careful reviewer would catch. Anthropic is giving Pro and Max Claude Code users three free ultrareviews to try it. I'm interested in this one specifically for the kind of deep, stateful issues that a model would normally miss in a quick pass, like logic errors that only appear in edge cases. Whether it actually catches those or just flags style issues remains to be seen.
They also extended “auto mode” to Max users. This is a permissions option where Claude makes decisions on your behalf during longer tasks, rather than stopping to ask every time. So you can run a long agentic task and come back when it’s done rather than babysitting it. The risk is it also means Claude might take actions you didn’t specifically approve. Anthropic describes it as “fewer interruptions but with less risk than skipping all permissions” — which is sort of the least reassuring way to frame it, but the idea is sound.
For API developers, there are task budgets in public beta now. You give Claude a token budget for a long run, and it can prioritize its work across that budget. If you’ve ever had Claude spend 80% of its budget on the first 20% of a task and then run out before finishing, this is meant to solve that. This part still feels early to me (the description is vague), but it’s a real problem that needs solving for production agentic systems, so it’s good that they’re working on it.
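In the meantime, the problem is solvable client-side. A sketch of the shape of it (class name, reserve heuristic, and threshold are all mine, not the beta API; the idea is just to hold back enough budget for the agent to finish cleanly):

```python
class TokenBudget:
    """Client-side task budget: track spend across agent steps and
    signal when the agent should stop exploring and wrap up."""

    def __init__(self, total: int, reserve_fraction: float = 0.2):
        self.total = total
        self.spent = 0
        # Hold back a slice so there's room left to write the final answer.
        self.reserve = int(total * reserve_fraction)

    def record(self, tokens_used: int) -> None:
        self.spent += tokens_used

    @property
    def should_wrap_up(self) -> bool:
        return self.spent >= self.total - self.reserve
```

An agent loop would check `should_wrap_up` after each tool call and switch from exploring to finishing once it trips, which is roughly the behavior the beta feature promises to move server-side.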
Should you upgrade from 4.6?
For coding and agentic work: yes. Cursor, CodeRabbit, Replit, Warp, Notion, Factory, and Rakuten all saw double-digit improvements. The gains are real enough across enough different companies that there’s no question about the direction.
For document analysis, finance, and research: also a clear yes based on the data.
The things to manage on your end: measure your actual token costs before and after if you run high-volume workloads, test your existing prompts for behavior changes from stricter instruction following, and re-tune anything with ambiguous wording. None of these are reasons not to upgrade; they’re just things to handle so you don’t get surprised.
For everyone using claude.ai in the browser: nothing to do. The upgrade is already there.
The migration guide in Anthropic’s docs has specific advice on tuning effort levels to manage token usage. Worth reading if you run anything at scale. And when the Glasswing report comes out in July, that’ll be the real signal of what Mythos Preview is actually capable of and whether any of this is moving as fast as Anthropic seems to think it is.