Vectara Hallucination Leaderboard: Claude, GPT, Gemini Compared

So a few weeks ago I was using one of the flagship models to help me fix a Python error. I was getting a ModuleNotFoundError and the model told me to install a package called requests-oauth2client with a specific version number, then import a class called OAuth2ClientCredentialsAuth from it. The explanation was detailed. The code looked clean. I ran pip install and tried it.

The class did not exist. I went to the actual package docs — the class the model described was not there, had never been there. I went back and told the model this. It apologized, gave me a slightly different class name, same package. Also did not exist. Third attempt, it confidently told me the first answer was correct and I was probably importing wrong.

I spent 30 minutes on this before I just went to Stack Overflow like it was 2018.

And this is the thing about AI hallucinations that really gets me. The model doesn’t slow down when it’s wrong. It doesn’t add a “hmm, I’m not sure about this.” It gets more specific. More confident. More helpful-sounding. The more wrong it is, the more it sounds like it knows exactly what it’s talking about.

Wait, What Even Is a Hallucination?

When AI people use the word “hallucination,” they don’t mean the model is seeing things in a visual sense. They mean the model says something that sounds completely plausible but is simply false. A fake study. A person who doesn’t exist. A law that was never passed. A quote that was never said.

Why does this happen? These models don’t actually “know” things the way you and I know things. They predict. Given everything they’ve seen, what word should come next? What sentence sounds right here? When the model has a gap in its knowledge, instead of saying “I don’t know,” it fills the gap with whatever sounds most plausible. And it does this very, very confidently.
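You can see why that design produces confident nonsense with a toy sketch. The probabilities and class names below are invented for illustration — the point is only that a next-token picker selects the *most likely* continuation, and “I’m not sure” is rarely the most likely one:

```python
# Toy next-token picker. A real model scores tens of thousands of tokens;
# here we fake a tiny distribution over possible continuations of
# "from requests_oauth2client import ...". All numbers are made up.
next_token_probs = {
    "OAuth2ClientCredentialsAuth": 0.41,  # plausible-sounding, may not exist
    "ClientCredentialsAuth": 0.33,        # also plausible-sounding
    "I'm not sure that exists": 0.02,     # honest, but low-probability text
}

# The decoder has no separate notion of "true" -- only "likely".
best = max(next_token_probs, key=next_token_probs.get)
print(best)  # OAuth2ClientCredentialsAuth
```

The honest continuation loses not because the model “decided” to lie, but because hedging text is simply rarer in the training data than a confident-looking identifier.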

Humans do this too, up to a point: we fill gaps with assumptions and hypotheses, and sometimes it’s a bet or a gamble. But in a tool we rely on professionally, that habit can’t just be shrugged off.

MIT published research in January 2025 that found something uncomfortable: when AI models hallucinate, they tend to use more confident language than when they are actually correct — about 34% more likely to use phrases like “definitely” and “certainly.” So the model is most sure of itself exactly when it’s most wrong. That’s not a small bug. That’s a design problem.

The Leaderboard Nobody Wants to Talk About

There’s a thing called the Vectara HHEM leaderboard. It tracks hallucination rates across all the major models — basically how often they make up or distort information when summarizing real documents. The results are messy, and they depend a lot on which version of the test you’re looking at. But even on the easier version of the benchmark, the numbers are not great.

On the Vectara grounded summarization benchmark, Google’s Gemini models sit at the top, with Gemini-2.0-Flash at 0.7%. GPT-4o sits at 1.5%. Claude models range from 4.4% for Sonnet to 10.1% for Opus.

Yes, you read that right: Opus recently entered the leaderboard at roughly 10%. Which really makes me wonder whether we’re reaching a saturation point.

Okay, 10% doesn’t sound catastrophic until you think about what that actually means. If every tenth summary your AI produces contains a made-up or distorted fact, and you’re using that AI across thousands of documents per day, the errors pile up fast.
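The arithmetic is worth spelling out. The volume figure below is an assumption I picked for illustration, not something from the Vectara data:

```python
# Back-of-envelope impact of a 10% hallucination rate.
# docs_per_day is an assumed workload, not a benchmark figure.
rate = 0.10          # fraction of summaries containing a fabricated fact
docs_per_day = 5000  # assumed daily document volume for a mid-size team

expected_bad = rate * docs_per_day     # expected flawed summaries per day
p_ten_clean = (1 - rate) ** 10         # chance ten in a row are all clean

print(expected_bad)           # 500.0
print(round(p_ten_clean, 3))  # 0.349
```

At that assumed volume, you’d expect hundreds of flawed summaries every single day, and even a run of just ten documents has only about a one-in-three chance of being error-free.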

And those numbers? That was the easy test. Vectara later updated their benchmark to use longer, more complicated documents — the kind you’d actually encounter in law firms, hospitals, and finance teams. On that harder benchmark, reasoning models like GPT-5, Claude Sonnet 4.5, and Grok-4 all exceeded 10% hallucination rates. The models that are marketed as the most capable, the “reasoning” ones, performed worse on harder real-world content.

That surprised a lot of people.

The explanation from researchers is kind of wild when you think about it: reasoning models invest computational effort into “thinking through” answers, which sometimes leads them to overthink and deviate from source material rather than just sticking to what’s in the document. So the smarter the model tries to be, the more it wanders off into its own imagination. I kind of understand that actually. I do the same thing sometimes.

Bigger Picture: The Scaling Problem

Here’s where the conversation gets bigger and honestly a bit unsettling.

For the last five or six years, the way you made AI better was basically the same every time: make the model larger, feed it more data, throw more compute at it. This approach — called scaling laws — worked incredibly well. GPT-3 to GPT-4 was a huge jump. Things got better reliably. CEOs started making bold predictions. The whole industry bet enormous amounts of money on the assumption that this curve would just keep going up.

But the consensus inside labs is growing that simply adding more data and compute will not create the “all-knowing digital gods” once promised — TechCrunch’s words, not mine.

A survey released in March 2025 found that 76% of AI researchers now believe the gains from scaling have plateaued — and that a fundamentally different approach is needed.

That is a lot of researchers saying, basically: the thing we’ve been doing? It’s running out of road.

OpenAI, Google, and Anthropic have all reportedly been experiencing diminishing returns despite massive investments in computing power and data, according to Bloomberg and The Information. And the next idea people jumped to — giving models more “thinking time” at inference — has its own ceiling. More than two-thirds of the improved performance of reasoning models came from giving them more time to think. But there aren’t enough chips in the world to keep scaling thinking time indefinitely, at least not until far more, and faster, hardware gets built.

So we already used that trick. It’s not a lever we can keep pulling.

The Money Problem

This is where I actually get a bit worried. Not in a doomsday way, but in the way you get worried when you see someone confidently building on a foundation that might have cracks.

The numbers are staggering. Companies, governments, banks, hospitals — everyone is deploying AI as fast as they can. The assumption underneath all of it is that these models are good enough, and getting better fast enough, to be trusted with real decisions.

A Deloitte survey found that 47% of enterprise AI users made at least one major decision based on hallucinated content in 2024. The financial cost of AI hallucination-driven errors reached $67.4 billion globally in 2024–2025. That’s not a rounding error. And knowledge workers now spend an average of 4.3 hours per week just verifying AI outputs — which is extra time you were supposed to be saving by using AI in the first place.

So we have a situation where models are hallucinating at meaningful rates, the improvement curve is flattening out, and enormous capital — and increasingly consequential decisions — depend on all of this working reliably. And it doesn’t, not completely.

A BBC and European Broadcasting Union study evaluated 3,000+ AI responses to news questions across 18 countries. Forty-five percent of responses had at least one significant problem. Eighty-one percent had at least one mistake of any kind. And crucially, refusal rates were only 0.5% — these systems almost never say “I don’t know.”

That last number is the one that bothers me. If the model just admitted it wasn’t sure more often, humans could compensate. But it doesn’t. It keeps going, confidently, into the wrong answer.

And by some counts, nearly 90% of YC startups from the last few batches look, one way or another, like GPT wrappers. Almost every new startup idea I see depends on AI somewhere in the stack, which is concerning.

So Are We Stuck?

Not exactly. But we are at one of those moments where the next step isn’t obvious, and that’s uncomfortable for an industry that’s been printing confident roadmaps.

The scaling era had a beautiful simplicity to it: spend more money, get smarter AI. That’s easy to pitch to investors. The next era is messier. Some labs are experimenting with retrieval-augmented generation, where you essentially hook the model up to real databases instead of relying on what it memorized. That helps a lot with certain types of errors. Some are trying to bake in more “I don’t know” behavior. Anthropic actually found some success with this — Claude models tend to acknowledge uncertainty rather than hallucinate in some evaluations, often stating “I don’t have enough information” rather than inventing an answer. That’s the behavior you actually want.

But this stuff is harder to scale than just buying more GPUs. And it doesn’t have the same clean story: “we doubled compute, benchmarks went up.” It’s patchwork. It’s messy. It requires knowing exactly which kinds of errors you’re trying to fix.

A senior staff member at an AI company recently said this whole experience updated them toward longer timelines to AGI — because one possible path to rapid gains they had been counting on just got ruled out.

Some people will tell you this is fine, the next breakthrough is around the corner. Maybe. Yann LeCun has been saying for years that LLMs have fundamental limits and we need a completely different approach. He might be right. Or the “test-time compute” people might crack something new. I genuinely don’t know.

What I do know is that the story being sold publicly — smarter every month, trust it with more things, the returns keep coming — is not matching what the hallucination leaderboards and the research surveys are actually showing.

What Should You Do With This Information?

If you’re building something with AI, or your company is deploying it: don’t turn off your brain. The models are useful. They’re genuinely useful, and they do save time on the right tasks. But treat them like a very fast intern who is occasionally confidently wrong, not like a database that you can just query and trust.

Verify important outputs. Don’t deploy AI in situations where a wrong answer causes real harm without a human in the loop. And when a vendor tells you their model “never hallucinates now,” stay skeptical anyway. That citation you’re about to rely on? Check it.
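For code suggestions specifically, the check can be cheap and mechanical: before trusting the rest of an answer, confirm the imported name actually exists. A minimal stdlib-only sketch (the names being tested are just examples, including one invented class of the kind that burned me in the opening anecdote):

```python
import importlib

def symbol_exists(module_name: str, symbol: str) -> bool:
    """Return True only if the module imports and exposes the symbol."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, symbol)

# A real name passes; a plausible-sounding invented one does not.
print(symbol_exists("json", "dumps"))           # True
print(symbol_exists("json", "JSONAuthClient"))  # False
```

Thirty seconds of this in a REPL would have saved me the thirty minutes I spent arguing with the model about a class that never existed.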

Because right now, the AI industry’s biggest unresolved problem is not a lack of capability. It’s a lack of honesty. Not about what the models can do. But about what they don’t know.
