GPT-5.4 Review: Features, Benchmarks & Pricing 2026

OpenAI dropped GPT-5.4 on March 5, 2026, and the AI world didn’t exactly yawn. Seven months after GPT-5 launched in August 2025, here was its fourth incremental — and arguably most consequential — revision: a unified model that absorbed the frontier coding muscle of GPT-5.3-Codex, introduced native computer-use capabilities, and stretched its memory to a full million tokens. That’s roughly 750,000 words processed in a single pass. Not quite the complete works of Shakespeare, but close enough to make lawyers, engineers, and financial analysts sit up straight.

Then, just six weeks after that release, OpenAI went further. On April 14, 2026, the company launched GPT-5.4-Cyber — a fine-tuned variant specifically built for defensive cybersecurity work, with deliberately lowered refusal boundaries for vetted security professionals. The timing wasn’t accidental. Anthropic had unveiled Claude Mythos, its own cybersecurity-focused frontier model, just one week earlier under Project Glasswing. The race is openly competitive now, and neither company is pretending otherwise.

So here’s what actually matters if you’re trying to decide where to put your trust, your budget, or your enterprise workflows: what does GPT-5.4 genuinely do well? Where does Anthropic still have the edge? And is the cybersecurity push the beginning of something transformative, or a well-timed PR move dressed up as technical progress?


What GPT-5.4 Actually Does Well

Let’s start with the obvious win. Unified architecture is a bigger deal than it sounds. Before GPT-5.4, developers had to juggle separate model decisions: use GPT-5.3-Codex for coding-heavy tasks, use GPT-5.2 Thinking for deeper reasoning, consider something else for computer-use workflows. GPT-5.4 rolls all of that into a single system. One model, configurable reasoning depth — from a fast “none” setting to a thorough “xhigh” — and it handles the switching logic for you.
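
For developers, that single-model switching shows up as one API call with a per-request reasoning setting. Here is a minimal sketch using OpenAI's Python SDK and the Responses API; the "gpt-5.4" model ID and the "none"/"xhigh" effort values are taken from this review rather than from confirmed SDK documentation, so treat them as assumptions.

```python
from openai import OpenAI

client = OpenAI()

# Same model for both calls; only the reasoning depth changes per request.
# "gpt-5.4", "none", and "xhigh" are the names used in this review (assumptions).
quick = client.responses.create(
    model="gpt-5.4",
    reasoning={"effort": "none"},   # fast path: no extended deliberation
    input="Summarize this incident report in three bullet points.",
)

deep = client.responses.create(
    model="gpt-5.4",
    reasoning={"effort": "xhigh"},  # maximum deliberation for multi-step problems
    input="Trace the root cause of the failing integration tests in the attached logs.",
)

print(quick.output_text)
print(deep.output_text)
```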

The benchmark numbers back up the confidence. According to OpenAI’s March 2026 launch data, GPT-5.4 scores 83% on GDPval, a test spanning 44 professional categories including law, medicine, and finance. On BigLaw Bench specifically, it hit 91%. On OSWorld — which measures how well the model can actually operate a desktop computer, navigate applications, fill forms, and complete browser tasks — it scored 75%, surpassing the human expert baseline of 72.4%. That’s not a demo trick. It means the model can, in principle, complete real digital workflows without step-by-step hand-holding.

Tool search is the other standout. In large agentic workflows, one of the silent cost-killers was stuffing thousands of tokens of tool descriptions into every request. GPT-5.4 introduces a lazy-loading system: it receives a lightweight index of available tools and only fetches full definitions when it actually decides to use one. In a 250-task evaluation across 36 MCP servers, this reduced token usage by 47% without any accuracy loss. For teams running large-scale agentic pipelines, that number translates directly into money.
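
The pattern itself is easy to picture. Below is a rough illustration of lazy tool loading in Python; it is not OpenAI's implementation, and the ToolIndex class and fetch_definition callback are hypothetical names used only to make the idea concrete.

```python
from typing import Callable

class ToolIndex:
    """Holds one-line tool summaries; full JSON-schema definitions load on demand."""

    def __init__(self, summaries: dict[str, str], fetch_definition: Callable[[str], dict]):
        self.summaries = summaries      # e.g. {"search_tickets": "Query the ticket DB"}
        self._fetch = fetch_definition  # callable that loads a full schema when needed
        self._cache: dict[str, dict] = {}

    def prompt_stub(self) -> str:
        # Only about one line per tool travels with each request, not the full schema.
        return "\n".join(f"- {name}: {desc}" for name, desc in self.summaries.items())

    def resolve(self, name: str) -> dict:
        # Full definition is fetched (and cached) only when the model picks the tool.
        if name not in self._cache:
            self._cache[name] = self._fetch(name)
        return self._cache[name]


# Usage: the lightweight stub rides along with every request; full schemas are
# pulled only for the one or two tools the model actually decides to call.
index = ToolIndex(
    {"search_tickets": "Query the ticket database", "send_email": "Send an email"},
    fetch_definition=lambda name: {"name": name, "parameters": {"type": "object"}},
)
print(index.prompt_stub())
print(index.resolve("search_tickets"))
```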

The 1.05 million token context window in the API version is similarly well-executed. One independent reviewer ran a 500-page legal discovery document alongside 200 pages of case law — and the model held coherence through to the end. Not every model does that. Google’s Gemini 3.1 Pro offers a 2-million-token window, but reportedly loses coherence around the 800K-token mark. GPT-5.4’s smaller window, more reliably used, is arguably the better practical choice.


GPT-5.4-Cyber: The Cybersecurity Play

This is where things get genuinely interesting, and where the Anthropic rivalry sharpens into focus.

GPT-5.4-Cyber is not a new model from scratch. It’s a fine-tuned version of GPT-5.4 with one key difference: it lowers the refusal threshold for legitimate security work. Earlier GPT-5 versions frustrated security professionals because the model often refused to engage with dual-use queries — the kind that sound dangerous in isolation but are routine in penetration testing or malware analysis. GPT-5.4-Cyber is specifically trained to distinguish between those contexts.

The headline capability is binary reverse engineering: the model can analyze compiled software for malware potential and vulnerability patterns without needing access to the original source code. That’s a meaningful capability for security teams that regularly encounter software they don’t own and can’t decompile themselves. Access is tiered, starting with a few hundred verified users through OpenAI’s Trusted Access for Cyber (TAC) program, with plans to expand to thousands of security teams as verification scales up.

Anthropic’s Mythos went a different route. Rather than adapting an existing model, Mythos was reportedly built from the ground up as a cybersecurity-first system — and independently discovered thousands of zero-day vulnerabilities across critical infrastructure. That’s a different kind of claim. OpenAI’s Codex Security, in contrast, has helped fix over 3,000 critical and high-severity vulnerabilities since its recent launch. Both numbers sound impressive; they’re measuring different things.

The honest comparison: Mythos appears to have more raw offensive-capability depth, which is why Anthropic is keeping it behind closed doors under a heavily controlled preview. GPT-5.4-Cyber is more accessible, more controlled, and arguably more responsibly deployed right now. Whether “more accessible” is a feature or a limitation depends entirely on your job title.


Knowledge, Reasoning, and Where the Gap Is Closing

GPT-5.4 sits at #1 out of 106 models on BenchLM’s knowledge and understanding category with a score of 97.6. That’s not a narrow lead. On GDPval across professional domains, it’s matching or beating industry-level humans on a range of knowledge-work tasks. For professionals who use AI to draft legal documents, analyze financial models, or synthesize research literature, this is where GPT-5.4 arguably delivers the most immediate, practical value.

The reasoning configuration is clever, too. Five levels — none, low, medium, high, and xhigh — let you dial up compute and cost only when the task actually demands it. An xhigh reasoning query will cost significantly more and take longer, but for complex multi-step problems, it thinks its way through edge cases that lower settings miss. It’s not the kind of thing you’d use for a casual summary. But for a nuanced legal analysis or a debugging session across a sprawling codebase, it matters.
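
In practice, teams tend to wrap that dial in a small routing policy so the expensive settings are only paid for when a task class warrants them. A minimal sketch, assuming the five effort names described above; the task categories here are illustrative, not part of any API.

```python
# Illustrative routing heuristic, not an OpenAI feature: pick a reasoning effort
# per request so xhigh cost and latency apply only where the task demands them.
EFFORT_BY_TASK = {
    "summary": "none",          # casual summaries: no deliberation needed
    "classification": "low",
    "code_review": "high",
    "legal_analysis": "xhigh",  # effort names assumed from this review
}

def pick_effort(task_type: str, default: str = "medium") -> str:
    return EFFORT_BY_TASK.get(task_type, default)

print(pick_effort("summary"))         # -> "none"
print(pick_effort("legal_analysis"))  # -> "xhigh"
```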

Here’s where I’ll give Anthropic its due, though. On SWE-bench Verified, the gold-standard coding benchmark, Claude Opus 4.6 still leads at 80.8% versus GPT-5.4’s ~80%. That’s a narrow margin, but it’s consistent. On multi-file refactoring and complex software architecture tasks, early data from Mythos suggests the gap might widen further in Anthropic’s favor. For teams where coding accuracy is the primary criterion, Claude has not been dethroned.

Web research is another area where Anthropic holds ground. On BrowseComp, Claude Opus 4.6 scores 84% versus GPT-5.4’s lower result. For workflows involving deep research across scattered, dynamic web sources, that difference shows up in the quality and completeness of the output.


Pricing: The Part Nobody Loves

Straightforward performance comparisons get complicated the moment you look at the bill. GPT-5.4 standard is priced at $2.50 per million input tokens and $15–$20 per million output tokens, depending on the configuration. That’s higher than average for frontier models — the median sits around $1.40 input and $8.25 output.

The GPT-5.4 Pro tier takes it further: $30 per million input and $180 per million output. That’s not a typo. For enterprises running high-volume pipelines, the math gets painful quickly, particularly since prompts exceeding 272K tokens are billed at 2x the input rate. You could be using the 1-million-token window and paying double for the privilege of going past the lower threshold. It’s a useful capability with a real sting attached.

To put that in concrete terms: a legal team running 500 document-review sessions per week using 400K-token prompts would see input costs balloon compared to the same workflow kept under 272K tokens. It’s not a dealbreaker for large enterprises, but for growing teams scaling agentic workflows, this pricing cliff needs to be planned around — not discovered on the first billing cycle.
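
A quick back-of-the-envelope calculation makes the cliff visible. This sketch assumes, per the wording above, that the 2x multiplier applies to the entire input of any prompt over 272K tokens rather than just the excess; if OpenAI bills only the overage, the gap shrinks but does not disappear.

```python
# Input-cost estimate for the legal-team scenario above, using the rates quoted
# in this review. Assumption: the whole prompt is billed at 2x once it crosses 272K.
INPUT_RATE = 2.50 / 1_000_000   # USD per input token, GPT-5.4 standard
LONG_PROMPT_MULTIPLIER = 2.0    # applied when a prompt exceeds the threshold
THRESHOLD = 272_000

def weekly_input_cost(sessions: int, tokens_per_prompt: int) -> float:
    multiplier = LONG_PROMPT_MULTIPLIER if tokens_per_prompt > THRESHOLD else 1.0
    return sessions * tokens_per_prompt * INPUT_RATE * multiplier

print(weekly_input_cost(500, 400_000))  # 400K-token prompts: ~$1,000/week in input alone
print(weekly_input_cost(500, 250_000))  # same session count under the cliff: ~$312/week
```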

There’s also a usage limit worth noting for ChatGPT Plus users: 80 messages per 3 hours in Thinking mode. That’s enough for casual professional use, but anyone running intensive research sessions will hit the wall and find it genuinely constraining. Pro tier ($200/month) removes that cap, but that’s a jump most individual contributors won’t make.

By comparison, Gemini 3.1 Pro from Google comes in at $2 input and $12 output — significantly cheaper for similar frontier-tier performance. Anthropic’s Claude Opus 4.6 output costs are reportedly higher than GPT-5.4 standard, but the coding performance justifies the premium for certain teams. Claude also has a slight tokenizer efficiency advantage. The recently launched Claude Opus 4.7 specifically improves on advanced software engineering tasks, making Anthropic’s coding value-per-dollar argument stronger.

GPT-5.4 Mini — a lighter version of the same architecture — runs at roughly $0.40 input and $1.60 output, with 94% of the standard model’s coding performance on SWE-bench. That’s a remarkable number. For high-volume tasks where you don’t need xhigh reasoning depth — content generation, routine document classification, straightforward Q&A over structured data — Mini may well be the smarter economic choice. It’s the option OpenAI doesn’t promote loudly because it cannibalizes revenue from the flagship tier, but it’s genuinely worth evaluating.

The broader pricing story of 2026 is one of convergence at the frontier paired with divergence in cost strategy. GPT-5.4 is expensive and highly capable. Gemini 3.1 Pro is affordable and nearly as capable. Claude occupies a trust-and-coding premium. None of these positions is wrong — they reflect genuinely different bets on what enterprise customers actually value.


Where GPT-5.4 Genuinely Falls Short

It’s worth pausing here, because a lot of GPT-5.4 coverage reads like a press release. There are real weaknesses, and they matter depending on what you’re building.

Multimodal performance is a relative blind spot. On BenchLM’s multimodal and grounded tasks category, GPT-5.4 ranks 15th — well behind its positions in other categories. For teams building applications that deeply integrate image understanding, document vision, or grounded spatial reasoning, this is a notable gap. Gemini 3.1 Pro scores 94.3% on GPQA Diamond versus GPT-5.4’s 92.8%, and leads on abstract reasoning benchmarks too. If your workflow involves interpreting charts, processing multi-modal documents, or analyzing visual data, this is an area worth testing carefully before committing.

Verbosity is a real problem at high reasoning effort. At xhigh reasoning, GPT-5.4 generated 120 million tokens during the Artificial Analysis Intelligence Index evaluation — compared to an average of 35 million for comparable models. That’s not just expensive; it’s a signal that the model is reasoning more exhaustively than many tasks require. For nuanced multi-step problems, exhaustive reasoning is good. For straightforward professional queries, it’s an unnecessary tax. The time to first token at xhigh is reportedly 205.54 seconds. Over three minutes before you see a single word of output. Production systems with real-time requirements simply can’t accommodate that.

Refusal behavior remains inconsistent. Even with GPT-5.4-Cyber lowering the threshold for security work, the standard GPT-5.4 has drawn criticism for refusing legitimate professional queries. One reviewer documented the model declining to analyze public court records — already published by the court — citing privacy concerns. Getting around it required framing the request as a law school teaching exercise. That’s not an edge case. Security researchers, legal professionals, and compliance teams regularly hit this friction, and it slows real work. Anthropic’s Claude models have earned a better reputation here: the refusals are more contextually calibrated, and less likely to block clearly professional requests based on surface-level keyword patterns.

Rapid iteration pace creates churn. GPT-5.3 Instant launched on March 3, 2026. GPT-5.4 dropped two days later. For teams that had just evaluated, tested, and deployed against GPT-5.3, that’s a disruptive cycle. GPT-5.2 Thinking is being retired June 5, 2026 — giving teams about three months to migrate production workflows. That’s not unreasonable, but when the cadence of new releases is this fast, “staying current” becomes its own operational cost. Compare this to Anthropic’s more deliberate versioning, where Opus 4.7 was recently released with detailed migration guidance and a clearly communicated tokenizer change that could affect costs.

Trust and transparency remain open questions. Some enterprise security teams have quietly moved sensitive workflows from GPT-5.4 to Claude following concerns about data handling and military contract discussions earlier this year. This isn’t purely a technical critique. For organizations handling genuinely sensitive IP or operating in regulated industries, perception and governance matter as much as benchmark scores.


GPT-5.4 vs. Anthropic: An Honest Scorecard

The temptation with these comparisons is to declare a winner and move on. That would be a disservice to how complex the actual trade-offs are.

GPT-5.4 wins on: knowledge work depth, computer-use automation, agentic tool efficiency, professional document tasks, and the overall breadth of capabilities in a single model. The GDPval score of 83% across 44 professions is genuinely impressive, and the native computer-use capabilities are the most polished in the market right now. If you need an AI that can independently operate a computer to complete complex, multi-step digital workflows, no other model currently comes close.

The tool search efficiency gain deserves more credit than it typically gets in reviews. In a real agentic pipeline — one where your model is orchestrating dozens of tools, APIs, and connectors — the 47% token reduction OpenAI measured isn’t just a cost improvement. It’s a complexity reduction. Fewer tokens in the pipeline means fewer chances for the model to lose track of what it was doing. That matters when you’re running agentic tasks that span minutes, not seconds.

Anthropic wins on: coding precision, long-context web research, security-first model design (Mythos, specifically), nuanced refusal handling, and — increasingly — enterprise trust. Claude Opus 4.7, released April 16, 2026, specifically narrows the gap on advanced software engineering. Anthropic’s comparison data shows Opus 4.7 outperforming both Google’s Gemini 3.1 Pro and GPT-5.4 in several coding categories, while Mythos remains the benchmark everything else is chasing on the security side.

Leaked benchmark comparisons between Mythos and GPT-5.4 Pro show Mythos ahead on coding and reasoning tasks for long-context inputs, and at a lower cost per token. That’s a double advantage — better performance and better economics — though it applies only to organizations with access, and access is still tightly controlled. For the vast majority of teams, that matchup is theoretical for now.

What the current landscape actually reflects is that benchmark convergence is the real story of 2026. GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro are all scoring within 2 to 3 percentage points of each other across most evaluations. At some point, developer experience, pricing, trust, and ecosystem integration start to matter more than whether one model scores 80.8% or 80% on SWE-bench. The raw intelligence race is still running, but the practical differentiation for most enterprise teams is shifting toward softer factors. Which brings us back to the question of which edges you can live with.


The Cybersecurity Race Nobody Should Downplay

The broader context here deserves attention. Both OpenAI and Anthropic are now building models that can find and potentially exploit software vulnerabilities — and both companies are framing this as a defensive play. The TAC program requires identity verification. Project Glasswing operates under strict controls. These are genuine safeguards, not theater.

That said, some security experts are already pointing out that many vulnerabilities identified by AI tools aren’t novel or immediately exploitable. The speed at which these models surface issues is what’s rattling governments and enterprise security teams — not necessarily the sophistication of what they find. GPT-5.4-Cyber contributing to over 3,000 fixed critical and high vulnerabilities through Codex Security is genuinely useful work. If it helps a mid-sized security firm triage vulnerability backlogs 40% faster, that’s real organizational value.

The structural difference between how OpenAI and Anthropic have approached cybersecurity is worth noting, because the two companies made fundamentally different product decisions. GPT-5.4-Cyber is GPT-5.4 fine-tuned for security use. Anthropic’s Mythos was reportedly built from scratch with cybersecurity capability as a core design goal, not an add-on. One approach produces faster deployment; the other produces deeper native capability. OpenAI chose the former, and their controlled access ramp looks appropriately cautious given that.

U.S. government agencies are notably absent from the initial GPT-5.4-Cyber rollout. OpenAI has confirmed it’s in discussions, pending internal governance and safety review. That’s a gap Anthropic hasn’t publicly addressed for Mythos either, but the eventual government market for verified cybersecurity AI is enormous — federal agencies, defense contractors, and intelligence-adjacent organizations all have genuine need for these capabilities and the budget to pay for them. Whichever company gets that relationship right first will have a durable institutional advantage.

OpenAI’s “Trusted Access” framing — verify identity, expand access incrementally, monitor usage — is the right approach in principle. Democratizing access while restricting the most sensitive tiers to vetted professionals is more defensible than keeping everything locked down or throwing everything open. It doesn’t eliminate risks. It means they’re being actively managed rather than ignored.

The conversation about who gets access to what, and under what conditions, is going to define the cybersecurity AI story for the next two to three years. GPT-5.4-Cyber is a milestone in that story, not the conclusion.


What to Make of All This

GPT-5.4 is an exceptional model. That’s not a compliment handed out lightly — it holds up when you move past the benchmark headlines and into the actual use cases. The computer-use capabilities work in production. The context window handles real-world document loads without losing coherence. The knowledge performance across professional domains is genuinely class-leading. For enterprises building agentic workflows that span multiple tools, applications, and long documents, this is the most capable single accessible model currently on the market.

But “most capable accessible model” and “the right model for your team” aren’t the same sentence. If coding accuracy is your primary criterion, Anthropic still has the slight edge — and that edge may widen as Mythos opens up further. If cost matters more than cutting-edge performance, Gemini 3.1 Pro offers an alternative at a considerably lower price point. If you’re building security tools and want the deepest possible capability, you’re probably still waiting for Mythos to come out of preview.

Here’s the thing about the 2026 AI market that doesn’t get said enough: the top three or four frontier models are now close enough in raw performance that your deployment decision probably shouldn’t be made primarily on benchmark scores. Latency, pricing, refusal behavior, developer tooling, data governance policies, and support quality are increasingly the actual differentiators. On those dimensions, GPT-5.4 wins some and loses some, just like its competitors.

One concrete suggestion: if your team is evaluating GPT-5.4, don’t run it on vendor-selected demo tasks. Test it on the specific tasks where your current solution most often fails. That’s where the real signal is. A model that scores 83% on GDPval across 44 professions might still struggle on your specific domain’s edge cases — legal documents in a niche jurisdiction, codebases with unusual architectural patterns, financial models with non-standard structures. The benchmark tells you what’s possible. Your actual test tells you what’s practical.
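
A minimal harness for that kind of evaluation fits in a few dozen lines. The sketch below assumes a JSONL file of your own hard cases and a naive string-match grader; the "gpt-5.4" model ID and the file format are illustrative assumptions, and most teams would swap the grader for a domain-specific rubric or an LLM-as-judge step.

```python
import json
from openai import OpenAI

client = OpenAI()

def grade(expected: str, actual: str) -> bool:
    # Naive containment check; replace with your own rubric for real evaluations.
    return expected.lower() in actual.lower()

def run_eval(task_path: str, model: str = "gpt-5.4") -> float:
    # Each line: {"prompt": "...", "expected": "..."} — a format assumed for this sketch.
    with open(task_path) as f:
        tasks = [json.loads(line) for line in f]
    passed = 0
    for task in tasks:
        resp = client.responses.create(model=model, input=task["prompt"])
        passed += grade(task["expected"], resp.output_text)
    return passed / len(tasks)

# Example: a JSONL file of the edge cases your current solution gets wrong.
# print(run_eval("hard_cases.jsonl"))
```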

The competitive pressure between OpenAI and Anthropic is, on balance, good for everyone who uses these tools. GPT-5.4’s release accelerated Anthropic’s communication around Mythos. Mythos’s early benchmarks are pushing OpenAI to tighten GPT-5.4’s coding and security offerings. Neither company can afford to coast. That’s a better market than one where a single dominant model sets the terms unchallenged.

Test both. Look at the bill carefully. And pay close attention to what happens when Mythos becomes broadly available — because that’s the moment this particular scorecard gets rewritten.
