The 41-Day Cycle

Anthropic shipped Claude Opus 4.8 on Thursday, May 28, forty-one days after Opus 4.7 landed on April 16. That is the fastest Opus cycle the company has ever run. Opus 4.6 to 4.7 took roughly seven months, and the previous accelerations in the line were measured in months, not weeks. The pricing did not change: $5 per million input tokens and $25 per million output tokens, the same as 4.7, with a Fast Mode now running at 2.5× speed at one-third the cost of the prior generation. Anthropic also raised $65 billion at a $965 billion valuation on the same day. The combination (fastest-ever cycle, same pricing, larger fast-mode discount, fresh capital at unprecedented scale) is the closest thing to a 2025-style cadence the frontier has produced this calendar year, and it is exactly the test H2 was set up to detect.

The independent measurements published over the last week broadly confirm Anthropic’s own benchmark claims while sharpening the qualitative picture. Artificial Analysis posted Opus 4.8 at 61.4 on its Intelligence Index, the first time since OpenAI’s April launch that a Claude model sat at the top of that ranking. GPT-5.5 had been at 60.2; Opus 4.7 was 57.3. On Artificial Analysis’s GDPval-AA economic-value evaluation, Opus 4.8 reached 1,890 Elo, up 137 points from Opus 4.7 (1,753) and 121 points ahead of GPT-5.5 (1,769). The 121-point gap implies roughly a 67% head-to-head win rate against GPT-5.5 on the kind of multi-step knowledge-work tasks the benchmark is designed to score. Browserbase’s Miguel Gonzalez reported 84% on Online-Mind2Web, characterised as a meaningful jump over both 4.7 and GPT-5.5. Harvey reported Opus 4.8 as the first model to break 10% overall on its Legal Agent Benchmark at the all-pass standard, which requires every sub-task in a multi-step legal workflow to be completed correctly. Cursor’s evaluation noted more efficient tool calling: fewer steps for the same intelligence. The Every.to Vibe Check team wrote that they had moved autonomous workflows from GPT-5.5 to Opus 4.8 and described it as the “best model tested for writing and knowledge work.” Simon Willison endorsed Anthropic’s own description of the release as “a modest but tangible improvement,” characterising the framing as “refreshing.”
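The step from an Elo gap to a win rate can be checked with a short sketch. It assumes GDPval-AA ratings sit on the conventional 400-point logistic Elo scale, which the write-ups quoted here do not confirm; the function name is illustrative.

```python
def elo_win_probability(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo logistic curve.

    Assumption: a 400-point scale, as in conventional Elo. The benchmark's
    exact rating model is not stated in the article.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Ratings as quoted in the text.
opus_48, opus_47, gpt_55 = 1890, 1753, 1769

vs_gpt = elo_win_probability(opus_48, gpt_55)    # 121-point gap
vs_prev = elo_win_probability(opus_48, opus_47)  # 137-point gap

print(f"Opus 4.8 vs GPT-5.5:  {vs_gpt:.1%}")
print(f"Opus 4.8 vs Opus 4.7: {vs_prev:.1%}")
```

Under that assumption the 121-point gap over GPT-5.5 comes out at about 67%, and the 137-point gap over Opus 4.7 at about 69%, so the quoted 67% figure reads most naturally as the GPT-5.5 comparison.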

The “modest but tangible” framing is doing a lot of work in the H2 reading of this release, and it has to be held against the cadence number. The within-family benchmark deltas are incremental. SWE-Bench Verified moved 87.6% → 88.6% — one percentage point. SWE-Bench Pro moved 64.3% → 69.2% — call it five. OSWorld is up about half a point against a restated baseline (Anthropic transparently flagged that the prior 4.7 number was revised up to 82.8% after a zoom-tool bug fix, which collapses the 4.7-to-4.8 delta on that benchmark). Terminal-Bench 2.1 still goes to GPT-5.5 at 78.2% versus Opus 4.8 at 74.6%. The benchmark Anthropic chose to lead the announcement with is honesty — four times fewer unflagged code errors than 4.7 — which is a reliability metric, not a capability metric per se. Read narrowly, this is exactly the picture H2 was set up to predict: an incumbent frontier lab shipping a polish-and-reliability release inside an established family at a moment when the broad benchmarks are saturating.

Read against the H2 window discipline, the picture is different. The hypothesis text is specifically about whether the 2025-style innovation leap is repeating: whether the 3-to-6 month rate of frontier movement is decelerating relative to the preceding 12 months. On the SWE-Bench Verified frontier, calendar 2025 ran roughly +24 percentage points across twelve months, about 2 points a month. The last three months (March through May 2026) ran from ~82% to 88.6% on Anthropic’s leaderboard, about 2.2 points a month, and to a Mythos preview at 93.9% in restricted access, about 4 points a month. That is at minimum the same 2 pp/month rate and arguably above it. The Artificial Analysis Intelligence Index moved 57.3 → 61.4 from Opus 4.7 to Opus 4.8 across forty-one days (four points on an index where GPT-5.5 sat at 60.2), and the #1-spot reordering is itself a leaderboard event the preceding 12 months did not produce more than three or four times across all vendors combined. The rate has not visibly slowed. The cadence has materially accelerated: Opus 4.7 in mid-April and Opus 4.8 in late May arrived about six weeks apart, after the roughly seven-month gap that followed Opus 4.6 (late 2025, with the Thinking variant). That is the densest Anthropic frontier release schedule on record.
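The rate comparison above is simple arithmetic, sketched here using only the figures quoted in the text. The ~82% starting point is approximate in the source, so the recent rates are rough; the helper name is illustrative.

```python
def pp_per_month(start_pct: float, end_pct: float, months: float) -> float:
    """Average movement in percentage points per month."""
    return (end_pct - start_pct) / months


# Calendar 2025: roughly +24 pp across twelve months, as quoted.
baseline_2025 = 24.0 / 12

# March through May 2026, from the approximate ~82% starting point.
recent_public = pp_per_month(82.0, 88.6, 3)       # Opus 4.8, generally available
recent_with_mythos = pp_per_month(82.0, 93.9, 3)  # Mythos preview, restricted access

# Intelligence Index: 57.3 -> 61.4 across forty-one days, scaled to 30 days.
index_per_30_days = (61.4 - 57.3) / 41 * 30

print(f"2025 baseline:      {baseline_2025:.1f} pp/month")
print(f"Recent (public):    {recent_public:.1f} pp/month")
print(f"Recent (w/ Mythos): {recent_with_mythos:.1f} pp/month")
print(f"Index movement:     {index_per_30_days:.1f} points per 30 days")
```

On these inputs the public frontier sits slightly above the 2025 baseline, and counting the restricted Mythos preview roughly doubles it. A three-month window with an approximate starting value is noisy, so the comparison supports "not visibly slower" more firmly than "faster."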

The Mythos overhang is the most consequential piece of evidence and, for now, the least decisive. Anthropic has confirmed that a Mythos-class model (the one whose tentative preview previously scored 93.9% on SWE-Bench Verified and 100% on Cybench and which the company decided not to ship) is being prepared for general availability “in the coming weeks,” held back on cybersecurity safeguards rather than capability. The sentence “the constraint is loosening” appears in multiple independent write-ups of the announcement. If Mythos GA produces an independently measured SWE-Bench Verified score in the low-90% range when it ships, that is a step-change in the broad-capability frontier larger than anything calendar 2026 has produced so far. If it ships substantially below the preview number, or ships with material restrictions on what tasks are allowed, that is a partial verify on the ceiling thesis. Either outcome is decisive in a way Opus 4.8 alone is not.

H2 50% → 50%. Falsify, weight 2 (cadence acceleration from ~7-month Opus cycles to 41 days is the clearest counter-signal to “2025-style innovation leap is not repeating” this cycle; Artificial Analysis Index #1 spot reorders for the first time since April; GDPval-AA +137 Elo crosses the 1,800 threshold; Mythos-class GA being prepared at independently-verified-territory levels held back on safety not capability). Verify, weight 2 (within-family benchmark deltas are incremental and Anthropic itself led with the “modest but tangible” framing endorsed by Willison; SWE-Bench Verified +1.0pp and OSWorld near-flat on restated baseline reads as polish-cycle rather than capability leap; Terminal-Bench 2.1 still goes to GPT-5.5 with no Claude-side breakaway; the headline benchmark Anthropic chose was honesty, a reliability metric). The deterministic score holds at 50% because the cadence-meta-signal and the within-family-incremental-signal genuinely cancel inside one release; the qualitative read is that the next three to six weeks resolve the ambiguity depending on whether Mythos GA ships at the preview level or below it, and whether OpenAI or DeepMind responds inside thirty days.
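The scoring rule behind the deterministic score is not spelled out here. As a purely hypothetical illustration of why equal weights leave it unchanged, the sketch below uses a linear net-weight rule; the rule itself, the 5-point step, and all names are assumptions, not the method actually used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    direction: str  # "verify" or "falsify"
    weight: int


def update_score(prior_pct: float, signals: list[Signal], step_pct: float = 5.0) -> float:
    """Hypothetical rule: each net unit of weight moves the score by step_pct.

    Verify signals push the score up, falsify signals push it down, and the
    result is clamped to the 0-100 range.
    """
    net = 0
    for s in signals:
        if s.direction == "verify":
            net += s.weight
        elif s.direction == "falsify":
            net -= s.weight
        else:
            raise ValueError(f"unknown direction: {s.direction!r}")
    return max(0.0, min(100.0, prior_pct + step_pct * net))


# This cycle: one falsify signal and one verify signal, both at weight 2.
signals = [Signal("falsify", 2), Signal("verify", 2)]
new_score = update_score(50.0, signals)
print(f"H2: 50% -> {new_score:.0f}%")
```

Under any rule of this symmetric form, a weight-2 falsify and a weight-2 verify net to zero, which matches the 50% → 50% hold regardless of the step size chosen.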

What to watch: vals.ai’s independent SWE-Bench Verified score for Opus 4.8 when posted, against the 82–83% Opus 4.7 independent number from last month; the Mythos GA event when it lands, particularly whether the SWE-Bench Verified score is independently re-measured at the 93.9% preview level; whether GPT-5.6 or Gemini 3.5 Pro lands inside thirty days as competitive response to the Opus #1 Intelligence Index spot, and at what cadence; whether the Cursor and Codex usage telemetry shows a measurable Claude-share recovery over the next thirty days; and whether the $65 billion Anthropic raise produces commensurate compute-spend disclosure inside the next quarter. The window the hypothesis is built around is short enough that resolution lands in weeks, not quarters.

Sources: Anthropic — Claude Opus 4.8 announcement; Vellum — Claude Opus 4.8 benchmarks explained; LLM-Stats — Opus 4.8 release benchmarks and Intelligence Index; VentureBeat — fast mode 3× cheaper, near-Mythos alignment; Every.to — Vibe Check Opus 4.8; TechCrunch — Opus 4.8 with dynamic workflows; Technology.org — 41-day cycle context; SiliconAngle — $65B at $965B valuation; R&D World — comparison vs Mythos and GPT-5.5.