MakeBox AI
← Back to Blog
AutomationJune 29, 20267 min read

Sakana Fugu Ultra Benchmarks: A Practical Deep Dive into the Numbers (2026)

Sakana Fugu UltraAI benchmarksmulti-agent systemssoftware engineeringLLM orchestrationcode generationreasoning testsAI evaluation

Fugu Ultra claims top scores on SWE-Bench Pro, LiveCodeBench, and more. Here's what the numbers mean and how to evaluate them critically.

makebox.ai / blog / sakana-fugu-ultra-benchmarks
Sakana Fugu Ultra Benchmarks: A Practical Deep Dive into the Numbers (2026)

You’ve seen the headlines: Fugu Ultra beats Claude Opus 4.8 and GPT‑5.5 on engineering benchmarks. But all those numbers come from Sakana itself — no independent replication yet.

💡 Fugu Ultra is a multi-agent orchestration system, not a single LLM. Its benchmark scores reflect coordination quality, not raw model capability. Treat them as strong signals, not final verdicts.

What Is Sakana Fugu Ultra?

Fugu Ultra is a learned orchestration system that coordinates a pool of frontier models behind one OpenAI‑compatible API. It has no own monolithic LLM — the product is the orchestrator. Two variants exist:

  • Fugu – for everyday, latency‑sensitive tasks
  • Fugu Ultra – quality‑first, for complex multi‑step problems like engineering, advanced code, and scientific analysis
  • Announced June 22, 2026, it is a multi‑agent system marketed “as a model.”

    Flat illustration of a pipeline of AI agents (colored cubes) passing data through a central coordinator, with

    Flat illustration of a pipeline of AI agents (colored cubes) passing data through a central coordinator, with

    API Pricing (per million tokens)

    MetricCost
    Input tokens$5
    Output tokens$30
    Cached input (≤272K context)$0.50 – $1.00
    Context >272K (input/output)$10 / $45

    These prices are for the API. SaaS subscriptions ($20/$100/$200 per month) are also available as wrappers.

    Fugu Ultra Benchmarks: The Numbers

    All scores below are self‑reported by Sakana as of June 2026. No independent lab has replicated them yet.

    73.7%SWE‑Bench Pro (Fugu Ultra)
    69.2%Claude Opus 4.8
    58.6%GPT‑5.5
    54.2%Gemini 3.1 Pro

    SWE‑Bench Pro (Software Engineering)

    Fugu Ultra solves 73.7% of real‑world code‑fix tasks, outperforming every single‑model competitor. However, Anthropic’s Fable 5 (unavailable due to export restrictions) scores ~80.0%.

    LiveCodeBench v6 (Interactive Coding)

    Fugu Ultra scores ~93.2, ahead of Claude Fable 5 (89.8) and GPT‑5.4, Gemini 3.1 Pro.

    GPQA‑Diamond (Advanced Science QA)

    Accuracy: 95.5% — slightly above Mythos Preview (94.6%) and well above older models.

    TerminalBench 2.1 (CLI / Tool‑Use)

    Fugu Ultra: 82.1, Fugu: 80.2, Claude Opus 4.8: 74.6, GPT‑5.5: 78.2. The orchestrator shines in complex shell environments.

    Humanity’s Last Exam (HLE)

    Fugu Ultra: 50.0, Claude Opus 4.8: 49.8. Essentially tied, but Sakana uses this to argue “at least not worse.”

    MMLU & HumanEval (Alternative Table)

    Some reviews report: MMLU 87.3% vs Fable 5’s 85.1%; HumanEval 92.4% vs 89.7%. These differ slightly from other sources — always check the exact benchmark variant.

    How to Read These Benchmarks Like a Pro

    1. Distinguish Self‑Reported vs Independent

    Every major number here is from Sakana’s own table. Until third parties confirm, treat them as strong marketing claims, not definitive rankings.

    Best practice: Always write “according to Sakana’s self‑reported benchmarks” and avoid absolute statements like “Fugu Ultra is the best.”

    2. Specify Benchmark Version

    Confusion easily arises:

  • SWE‑Bench vs SWE‑Bench Pro (different task sets)
  • LiveCodeBench v6 vs older versions
  • TerminalBench 2.1
  • Always include the exact name and version in comparisons.

    3. Separate Engineering vs General Knowledge

    Fugu Ultra dominates coding and tool‑use (SWE‑Bench Pro, LiveCodeBench, TerminalBench). On general knowledge (HLE, MMLU) it is roughly on par with top models, not dramatically ahead. Pick the right benchmark for your use case.

    4. Remember the Architecture

    High scores do not come from a single super‑model. They come from clever orchestration of multiple models + tools. That means latency and cost can be higher than a single LLM. For production, evaluate total cost of ownership (TCO).

    💡 Fugu Ultra’s benchmark results are a measure of system coordination, not raw model intelligence. Always pair benchmark data with your own domain‑specific tests.

    Common Mistakes to Avoid

  • Calling it a “model” – It’s an orchestration system. Comparing 1:1 with Claude Opus 4.8 is like comparing a chef’s team to a single cook.
  • Trusting self‑reported numbers blindly – Many articles ignore the “self‑reported” caveat. Be the one who doesn’t.
  • Mixing benchmark variants – 78.9% on “SWE‑Bench” is not the same as 73.7% on “SWE‑Bench Pro.” Cite exactly.
  • Ignoring cost and latency – Ultra is quality‑first; for real‑time apps, the cheaper Fugu tier may be better.
  • Overclaiming against Fable 5 – Those comparisons are indirect, using Anthropic’s published numbers, not a head‑to‑head test.
  • How to Evaluate Fugu Ultra for Your Stack

  • 1.Define your tasks – Engineering? General QA? Research?
  • 2.Choose relevant benchmarks – SWE‑Bench Pro for code, GPQA‑D for science.
  • 3.Run your own mini‑benchmark – Pick 10–20 real tasks from your codebase or domain.
  • 4.Measure cost and latency – Compare against your current model (e.g., GPT‑5.5 or Claude Opus 4.8).
  • 5.Treat Sakana’s numbers as a starting point – Validate before committing.
  • FAQ

    Is Fugu Ultra really better than Claude Opus 4.8?

    According to Sakana’s self‑reported benchmarks, yes on SWE‑Bench Pro (73.7% vs 69.2%) and TerminalBench 2.1 (82.1 vs 74.6). But these results need independent replication. On general‑knowledge tests like HLE, the gap is negligible.

    Can I use Fugu Ultra for everyday automation?

    Only if you need top‑tier quality for complex tasks and can tolerate higher latency and cost. For daily use, the standard Fugu tier ($5/$30 per million tokens) might be more practical.

    Why are there different numbers for SWE‑Bench?

    Some sources report 78.9% on “SWE‑Bench” (without “Pro”). That’s a different, likely easier dataset. Always check the exact benchmark name and version.

    Start Building with the Right Tools

    Fugu Ultra’s benchmarks are impressive — but numbers alone don’t guarantee results in your workflow. The best way to evaluate any AI system is to test it on your own tasks. That’s why we built MakeBox AI: to give automation engineers a transparent, practical way to compare models and orchestrate workflows.

    Want to talk through your setup?

    Get in touch and I'll walk through what would actually move the needle for your shop or business.