Sakana Fugu Ultra Benchmarks: A Practical Deep Dive into the Numbers (2026)
Fugu Ultra claims top scores on SWE-Bench Pro, LiveCodeBench, and more. Here's what the numbers mean and how to evaluate them critically.

You’ve seen the headlines: Fugu Ultra beats Claude Opus 4.8 and GPT‑5.5 on engineering benchmarks. But all those numbers come from Sakana itself — no independent replication yet.
💡 Fugu Ultra is a multi-agent orchestration system, not a single LLM. Its benchmark scores reflect coordination quality, not raw model capability. Treat them as strong signals, not final verdicts.
What Is Sakana Fugu Ultra?
Fugu Ultra is a learned orchestration system that coordinates a pool of frontier models behind one OpenAI‑compatible API. It has no own monolithic LLM — the product is the orchestrator. Two variants exist:
Announced June 22, 2026, it is a multi‑agent system marketed “as a model.”

Flat illustration of a pipeline of AI agents (colored cubes) passing data through a central coordinator, with
API Pricing (per million tokens)
| Metric | Cost |
|---|---|
| Input tokens | $5 |
| Output tokens | $30 |
| Cached input (≤272K context) | $0.50 – $1.00 |
| Context >272K (input/output) | $10 / $45 |
These prices are for the API. SaaS subscriptions ($20/$100/$200 per month) are also available as wrappers.
Fugu Ultra Benchmarks: The Numbers
All scores below are self‑reported by Sakana as of June 2026. No independent lab has replicated them yet.
SWE‑Bench Pro (Software Engineering)
Fugu Ultra solves 73.7% of real‑world code‑fix tasks, outperforming every single‑model competitor. However, Anthropic’s Fable 5 (unavailable due to export restrictions) scores ~80.0%.
LiveCodeBench v6 (Interactive Coding)
Fugu Ultra scores ~93.2, ahead of Claude Fable 5 (89.8) and GPT‑5.4, Gemini 3.1 Pro.
GPQA‑Diamond (Advanced Science QA)
Accuracy: 95.5% — slightly above Mythos Preview (94.6%) and well above older models.
TerminalBench 2.1 (CLI / Tool‑Use)
Fugu Ultra: 82.1, Fugu: 80.2, Claude Opus 4.8: 74.6, GPT‑5.5: 78.2. The orchestrator shines in complex shell environments.
Humanity’s Last Exam (HLE)
Fugu Ultra: 50.0, Claude Opus 4.8: 49.8. Essentially tied, but Sakana uses this to argue “at least not worse.”
MMLU & HumanEval (Alternative Table)
Some reviews report: MMLU 87.3% vs Fable 5’s 85.1%; HumanEval 92.4% vs 89.7%. These differ slightly from other sources — always check the exact benchmark variant.
How to Read These Benchmarks Like a Pro
1. Distinguish Self‑Reported vs Independent
Every major number here is from Sakana’s own table. Until third parties confirm, treat them as strong marketing claims, not definitive rankings.
Best practice: Always write “according to Sakana’s self‑reported benchmarks” and avoid absolute statements like “Fugu Ultra is the best.”
2. Specify Benchmark Version
Confusion easily arises:
Always include the exact name and version in comparisons.
3. Separate Engineering vs General Knowledge
Fugu Ultra dominates coding and tool‑use (SWE‑Bench Pro, LiveCodeBench, TerminalBench). On general knowledge (HLE, MMLU) it is roughly on par with top models, not dramatically ahead. Pick the right benchmark for your use case.
4. Remember the Architecture
High scores do not come from a single super‑model. They come from clever orchestration of multiple models + tools. That means latency and cost can be higher than a single LLM. For production, evaluate total cost of ownership (TCO).
💡 Fugu Ultra’s benchmark results are a measure of system coordination, not raw model intelligence. Always pair benchmark data with your own domain‑specific tests.
Common Mistakes to Avoid
How to Evaluate Fugu Ultra for Your Stack
FAQ
Is Fugu Ultra really better than Claude Opus 4.8?
According to Sakana’s self‑reported benchmarks, yes on SWE‑Bench Pro (73.7% vs 69.2%) and TerminalBench 2.1 (82.1 vs 74.6). But these results need independent replication. On general‑knowledge tests like HLE, the gap is negligible.
Can I use Fugu Ultra for everyday automation?
Only if you need top‑tier quality for complex tasks and can tolerate higher latency and cost. For daily use, the standard Fugu tier ($5/$30 per million tokens) might be more practical.
Why are there different numbers for SWE‑Bench?
Some sources report 78.9% on “SWE‑Bench” (without “Pro”). That’s a different, likely easier dataset. Always check the exact benchmark name and version.
Start Building with the Right Tools
Fugu Ultra’s benchmarks are impressive — but numbers alone don’t guarantee results in your workflow. The best way to evaluate any AI system is to test it on your own tasks. That’s why we built MakeBox AI: to give automation engineers a transparent, practical way to compare models and orchestrate workflows.
Want to talk through your setup?
Get in touch and I'll walk through what would actually move the needle for your shop or business.