Japanese AI Model Fugu By Sakana, Claims to Beat Mythos and Fable?

Japanese AI Model Fugu By Sakana, Claims to Beat Mythos and Fable?

Sakana AI launched something on June 22, 2026, and the AI community had a lot of opinions about it very fast. The company released Fugu and Fugu Ultra, their new multi-agent system, and within a day people were both impressed by the benchmark numbers and annoyed by the wait times. Ethan Mollick, who teaches at Wharton and does some of the most-watched AI testing on X, said his shader tests took around 30 minutes on Fugu Ultra. One Hacker News user was blunter: “For $200/month you get less than 3 hours of use per week, the API is extremely slow, and the output quality in my tests is nowhere near Fable.”

But on paper, the numbers are hard to ignore. Fugu Ultra scored 73.7 on SWE-Bench Pro, which is the benchmark everyone uses for real software engineering work. Claude Opus 4.8 scored 69.2 on the same test. GPT-5.5 scored 58.6. Gemini 3.1 Pro came in at 54.2. On TerminalBench 2.1, Fugu Ultra hit 82.1 while Opus 4.8 was at 74.6 and GPT-5.5 at 78.2.

Sakana is also claiming this is on par with Anthropic’s Mythos Preview and Fable 5, specifically on benchmarks like GPQA-Diamond, CharXiv Reasoning, and TerminalBench. Which is a big claim. Those two models aren’t even publicly accessible right now because of export controls. So you can’t directly verify by running them head-to-head. That’s part of what makes this whole story complicated.

Where Sakana Came From

The company started in 2023 in Tokyo. David Ha founded it, he was a researcher at Google Brain before this and also spent time at Stability AI. Llion Jones is the CTO and co-founded it with him. Jones is one of the co-authors of “Attention Is All You Need,” the 2017 paper that basically started the whole modern transformer era. The third co-founder is Ren Ito, who has a background in Japan’s Ministry of Foreign Affairs and also worked at Mercari. That’s an unusual mix for an AI startup.

The name “Sakana” means fish in Japanese. It’s about swarm behavior, basically the idea that a school of fish moves as one intelligent unit without any single fish being in charge. That’s the core philosophy behind everything they build. Not one giant brain, but many smaller things working together.

Their first funding round was a $30 million seed in early 2024 from Lux Capital and Khosla Ventures. Then in September 2024 they raised $214 million in a Series A at a $1.5 billion valuation, making them Japan’s first AI unicorn. That Series A included Mitsubishi UFJ, SMBC, Mizuho, Itochu, KDDI, Nomura, Nvidia. In November 2025, they closed a $135 million Series B at a $2.65 billion valuation. So right now they’ve raised about $379 million total and have around 199 employees.

Japan isn’t exactly known for having massive GPU clusters or huge computing budgets. That’s kind of the whole point. Ha has said in interviews that the chance of competing with OpenAI or Google on raw scale, from Japan, is basically low. So they didn’t try. Instead they went in a completely different direction.

The Idea Behind the Approach

Early on, Sakana’s research was mostly about evolutionary algorithms and collective intelligence. In January 2024 they published something called Evolutionary Model Merge, where they used evolutionary algorithms to combine multiple open-source models and create new ones. The idea was you could breed models together the same way evolution selects useful traits. It needed much less compute than training from scratch and it actually worked, producing some strong models for Japanese language tasks.

Then came The AI Scientist, which I think is still probably their most interesting research project. It’s an agentic system that can generate research ideas, run experiments, write the paper, and then do the peer review. A paper produced by AI Scientist v2 actually passed peer review at a workshop at a top AI conference. And then Sakana published the underlying research in Nature in March 2026, which is not something that usually happens with AI papers from startups.

They also did work on the Darwin Godel Machine, which is even more out there. It’s an AI that can write, test, and improve variants of its own code. The whole recursive self-improvement thing. In May 2025 they created ALE-Agent, which placed 21st out of 1,000 human participants in a live AtCoder Heuristic Competition for solving hard optimization problems. That’s a real competition with real people.

So this company isn’t just doing LLM product stuff. They’re doing genuinely weird research that goes in different directions than most labs.

What Fugu Actually Is

Fugu isn’t a model in the normal sense. It’s better to think of it as a system that knows how to use other models. The core is a 7 billion parameter model trained via reinforcement learning. That model, called the RL Conductor, doesn’t generate your answer directly. When you send a task through the API, the Conductor figures out what kind of task it is and then delegates pieces of it to other models in its pool. GPT-5, Gemini 3.1 Pro, Claude Opus 4.8, others. It coordinates them, verifies outputs, and synthesizes everything into a final answer.

The technical foundation comes from two papers at ICLR 2026. One is TRINITY, about an evolved LLM coordinator. The other is about the Conductor, which learns to orchestrate agents in natural language using reinforcement learning. Both were peer-reviewed and accepted at the conference, which matters because it means this isn’t just Sakana’s marketing claims. Other researchers looked at it.

What makes it different from simple routing tools like Not Diamond or RouteLLM is how it works internally. Those tools look at your prompt and decide which single model to send it to, then forward it. Fugu actually breaks the task into sub-parts, assigns each part to a different model in a Thinker, Worker, Verifier structure, lets them check each other’s work, and then puts it all together. It can even call itself recursively. So it reads its own output, sees where it went wrong, and starts corrective sub-workflows before giving you the final answer.

From outside, it looks like one model. You point your client at the Fugu endpoint with your API key and that’s it. It’s OpenAI-compatible, so if you’re already using GPT-5.5 or Opus 4.8 via the OpenAI SDK, switching to Fugu is mostly just changing the endpoint.

The Benchmark Numbers and Why They’re Complicated

The numbers I mentioned earlier are Sakana’s own published results. That’s important to say. SWE-Bench Pro, TerminalBench 2.1, GPQA-Diamond, CharXiv Reasoning, these are all real benchmarks and 73.7 on SWE-Bench Pro is genuinely a strong score. But as DataCamp noted in their coverage, the baseline scores for other models are also provider-reported, meaning everyone is measuring themselves differently.

And there’s a bigger thing. Fable 5 and Mythos Preview aren’t in Fugu’s agent pool. They can’t be, because access to them was suspended under US export controls starting June 12, 2026. Sakana isn’t comparing against those models by running them. They’re comparing against published benchmark numbers that Anthropic reported. So when they say Fugu Ultra is “on par with Mythos,” they mean: based on what Anthropic published about Mythos on these specific benchmarks, Fugu Ultra scores similarly. Not that they ran a head-to-head test.

Some people think this is fine. The claim is still meaningful. Some think it’s a bit misleading. I think it’s somewhere in between, but it’s worth knowing before you take the headline at face value.

Also, the standard Fugu model sometimes beats Fugu Ultra on specific tasks. On SciCode, tau-cubed Banking, and Long Context Reasoning, the cheaper Fugu model actually did better in Sakana’s own data. More orchestration is not always better. Long documents in particular seem to get worse when work keeps getting handed between agents because you lose coherence. A Classmethod engineer who tested it early reported that on one coding task, Fugu Ultra used 26,404 orchestration tokens, about 8.8 times the visible output, and took 4.5 minutes. Base Fugu did equivalent work in 55 seconds.

The Latency and Cost Problem

This is where things get messy in practice. The $20 per month entry tier runs out fast. The $200 per month tier gives you less than 3 hours of Fugu Ultra per week according to early users. The pay-as-you-go pricing for Fugu Ultra is $5 per million input tokens and $30 per million output tokens, which is at the expensive end compared to running single frontier models directly. And you’re also paying the hidden cost of orchestration tokens that you don’t control and can’t easily predict in advance.

For simple tasks, you’re paying Fugu Ultra prices for work that a cheaper model handles fine. Some people on Hacker News pointed out that running DeepSeek V4 on OpenRouter gets you decent quality at roughly one-tenth the cost for tasks that don’t really need deep multi-agent coordination.

The system isn’t available in the EU or EEA yet while they work on GDPR compliance. And there’s some unease in the community about Sakana’s military contracts. They won an award at a competition run jointly by the US and Japanese defense ministries, which some developers feel is worth knowing about before using the service.

That said, the real-world cases where Fugu seems to actually shine are cybersecurity analysis, paper reproduction, patent investigations, and Kaggle-level research tasks. One software engineer reported finding architecture bugs that single models missed. Another said Fugu maintained consistent persona and scope across a long security assessment. Those are real wins. But they’re the kind of wins you see on long, hard, multi-step work. Not on everyday coding tasks.

Why Japan Needed This

There’s a bigger context that makes Fugu more interesting than just another model launch. Japan’s Digital Minister had been warning earlier this year that without faster domestic AI development, the country risked becoming what he called an “AI colony.” Depending on models from the US or China for critical infrastructure is a sovereignty problem, and Fugu launched exactly ten days after Fable 5 and Mythos Preview got pulled from international access.

Sakana’s argument is that by learning to orchestrate a flexible pool of models, you’re less vulnerable to any single provider cutting off access. If one model disappears from the pool, the system routes around it. That logic makes sense, though there’s a fair counterargument: Fugu still depends heavily on US-based APIs from OpenAI and Anthropic. So it’s not fully independent, it’s just less concentrated. That’s different from sovereign.

But the direction of travel is real. Sakana also has their Namazu LLMs for Japanese language specifically, and they’ve been working with Japanese enterprises including MUFG on bank-specific AI systems. The Japanese government gave them access to national supercomputing resources back in 2024. They won a joint Japan-US defense competition. This is a company that the Japanese state is treating as a national asset, not just another startup.

What Comes Next

Sakana launched Marlin in June 2026, about two weeks before Fugu. Marlin is an autonomous research agent aimed at enterprise users that can run for up to eight hours on a task. They’re also doing work through their RSI Lab on recursive self-improvement, building toward AI systems that can improve their own architecture without human intervention. The AI Scientist project is ongoing.

The Fugu pool will update over time. Sakana says when a new frontier model becomes publicly available, they expect to spend about two weeks training and evaluating updated Fugu models before rolling them in. So theoretically the whole system improves as the underlying models improve, without users having to change anything on their end.

I think what’s most interesting about Sakana isn’t any single benchmark number. It’s that they have a coherent research philosophy that actually produces real products. Most labs either do deep research and publish papers, or they ship products that are mostly wrappers around someone else’s research. Sakana has been doing both for three years now, from a country that most people didn’t expect to produce a frontier AI company. That’s worth paying attention to, even if Fugu Ultra is currently a bit slow and a bit expensive for what you get out of it.

Whether learned model orchestration becomes a real category or stays a niche approach is still genuinely unsettled. The architecture is sound in theory. The execution still has rough edges. Both of those things can be true at the same time.

Post a Comment

Previous Post Next Post