The Proof That Didn't Move the Score

Nine mathematicians signed a 19-page companion paper on Wednesday confirming that an internal OpenAI reasoning model — general-purpose, not math-specialised — produced a counterexample to Erdős’s 1946 planar unit-distance conjecture. The list of names is the story: Noga Alon (Princeton), W.T. Gowers (Fields medalist, Cambridge/Collège de France), Will Sawin (Princeton), Daniel Litt, Arul Shankar, Jacob Tsimerman, Victor Wang, Melanie Matchett Wood, and Thomas Bloom — the same Bloom who maintains erdosproblems.com and called OpenAI’s previous math announcement “a dramatic misrepresentation” because the model had retrieved existing solutions from the literature. Sawin’s separate paper the same day gives an explicit improvement exponent δ = 0.014, derived through class field towers and Golod-Shafarevich theory — techniques from algebraic number theory that had not previously been brought to bear on what was thought to be a purely geometric problem. Gowers’ written verdict: “a milestone in AI mathematics… I would have recommended acceptance to the Annals of Mathematics without any hesitation.”

The mechanical details matter. OpenAI’s own writeup is explicit that the model was a “new general-purpose reasoning model,” not math-specific; the proof file was generated in one shot, with exposition refined through a second pass via Codex; the model was not scaffolded to search proof strategies and was not targeted at this problem. Bloom’s involvement is the strongest validation signal. A mathematician whose explicit role is policing claimed AI math results against the published literature is a co-author of the paper that confirms this one is original. That is not noise.

It is, however, an event that has to be read against its cadence rather than as a singularity. Frontier labs have been landing peer-verified narrow autonomous-discovery results in mathematics every three to six months for the past eighteen months. DeepMind’s AlphaProof + AlphaGeometry 2 took silver at the 2024 IMO using Lean-verified formal proofs and had its methodology written up in Nature in November 2025. The 2025 IMO produced two gold-medal performances: DeepMind’s Gemini Deep Think framework and an OpenAI model. AlphaEvolve, published on arXiv in November 2025, tested an evolutionary-search system across 67 problems in analysis, combinatorics, and geometry and improved on existing bounds in several. FunSearch produced new constructions for the cap set problem; PatternBoost disproved a 30-year-old conjecture in extremal graph theory. The Erdős unit-distance result is the most editorially celebrated entry in the stream, and Bloom’s signature does close the standard literature-retrieval objection more cleanly than most prior events, but it sits inside a roughly steady release rate, not above it. A single event consistent with an existing cadence is weak evidence either way on whether the underlying rate is accelerating or decelerating, which is the question H2 is built around.

The broader benchmarks have not visibly slowed either. SWE-Bench Verified, the workhorse independent test of real GitHub bug-fixing, was clustered around 65% at the top of the leaderboard in Q1 2025. The April 2026 wave (GPT-5.5, Claude Opus 4.7, DeepSeek V4 Pro Max, Kimi K2.6) pushed the frontier to 88.7% on vendor-reported numbers and roughly 82–83% on independent re-measurement at vals.ai; Claude Mythos preview from Anthropic is currently at 93.9% on at least one tracker. The frontier has moved roughly twenty percentage points in sixteen months, or one to two percentage points per month at the top, and the most recent quarter contributed several of the biggest jumps. The tight clustering at the top that is often cited as evidence of a plateau is largely a consequence of contamination concerns about the highest SWE-Bench Verified scores, which is why SWE-Bench Pro is being adopted as the successor benchmark. Frontier models score in the 55–65% band on Pro, exactly the lower base you’d expect to leave room for further improvement. The same lab releases that made the leaderboard tight on Verified are not visibly running out of room on Pro.
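The per-month figure can be checked with a few lines of arithmetic. The sketch below is illustrative only: it takes the 65% Q1 2025 baseline, the sixteen-month window, and the three endpoint scores from the paragraph above. The 82.5% value is an assumed midpoint of the reported 82–83% range, and the function and variable names are hypothetical.

```python
def monthly_gain(start_pct: float, end_pct: float, months: int) -> float:
    """Average percentage-point gain per month across the window."""
    return (end_pct - start_pct) / months


BASELINE_PCT = 65.0   # approximate top-of-leaderboard score, Q1 2025
WINDOW_MONTHS = 16    # window length used in the text

# Endpoint scores quoted in the text; 82.5 is an assumed midpoint of 82-83%.
endpoints = {
    "vendor-reported": 88.7,
    "independent (vals.ai)": 82.5,
    "Claude Mythos preview": 93.9,
}

rates = {
    label: monthly_gain(BASELINE_PCT, score, WINDOW_MONTHS)
    for label, score in endpoints.items()
}

for label, rate in rates.items():
    print(f"{label}: {rate:.2f} pp/month")
```

On these inputs all three rates fall between one and two percentage points per month, which matches the range quoted in the text. The choice of endpoint changes the rate by well under a point per month.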

The LMArena story is more ambiguous. The top tier in mid-May 2026 — GPT-5.4, Claude Opus 4.6 Thinking, Gemini 3.1 Pro, Grok 4, DeepSeek V3.2 — sits between 1,450 and 1,561 Elo, and the methodology was sharpened in November 2025 with the Arena Expert track to surface harder prompts, which compresses scores at the top mechanically. New frontier entries (GPT-5.2 in December, ERNIE 5.0 in January, Olmo 3.1 in January) have continued to land. The cluster is tight, but the open-source-to-proprietary gap widened back from a low of four Elo points in early 2025 as the proprietary labs pushed forward in the second half. Whether this is a plateau or a methodology-and-saturation artefact is genuinely undecidable on Elo data alone over a six-month window.

So what does the Erdős result actually update on the ceiling hypothesis?

Not much, and the honest score reflects that. The narrow capability axis is producing peer-verified results at roughly the same rate it was in calendar 2025, so the Erdős event is consistent with the existing trend rather than evidence the trend is accelerating. The broad capability axis — measured on independent benchmarks where new frontier models continue to ship one-to-two percentage point gains per month and where the contamination-cleaner successor benchmarks have left meaningful headroom — is also consistent with continued progress rather than a clear deceleration. Both signals favour “no clear inflection visible in the May 2026 data” over either side of the ceiling thesis, and the inaugural H2 evidence items should weight accordingly.

H2 50% → 50%. Falsify, weight 1 (one more peer-verified narrow autonomous-discovery result, fully credentialed by an unusually authoritative panel of human verifiers, inside an existing eighteen-month release cadence; the cleanest of the genre to date but not a break in the rate). Verify, weight 1 (LMArena top cluster has stayed in the 1,450–1,561 band over the last six months and the methodology was sharpened mid-window, both of which are consistent with plateau but neither dispositive). The deterministic score does not move on this evidence: two weight-1 items in opposing directions cancel exactly, which is the correct answer when the underlying event is fully consistent with the existing cadence and the broader-benchmark data over the relevant window is genuinely ambiguous. The qualitative lean is very slightly toward falsify, picking up the SWE-Bench Verified one-to-two-percentage-point monthly frontier movement through Q2 2026, but that lean comes from the broad axis, not from this specific event. The release tracker entry remains “delivers” because the proof itself was verified; its relevance to H2 is the weak part.
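The cancellation can be made concrete with a small sketch. The tracker's actual scoring rule is not stated here, so this shows only the one property the paragraph relies on: equal weights in opposite directions net to zero. The `EvidenceItem` type, the `net_weight` function, and the sign convention are all hypothetical.

```python
from dataclasses import dataclass

# Assumed sign convention: verify counts positive, falsify negative.
SIGN = {"verify": 1, "falsify": -1}


@dataclass(frozen=True)
class EvidenceItem:
    direction: str  # "verify" or "falsify"
    weight: int
    note: str


def net_weight(items: list[EvidenceItem]) -> int:
    """Sum of signed weights; zero means the items cancel."""
    return sum(SIGN[item.direction] * item.weight for item in items)


items = [
    EvidenceItem("falsify", 1, "Erdős unit-distance result, peer-verified"),
    EvidenceItem("verify", 1, "LMArena top cluster within 1,450-1,561 Elo"),
]

net = net_weight(items)
print(f"net signed weight: {net}")
```

With one weight-1 item on each side the net is zero, so any rule that moves the score only as a function of net signed weight leaves H2 at 50%.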

What to watch over the next quarter, against the same window discipline: whether SWE-Bench Pro frontier scores move by more than the historical one-to-two percentage points per month (would push H2 toward falsification); whether the LMArena top-tier cluster widens (one model breaks away: falsify) or compresses further with no new entrants (verify); whether the next two peer-verified narrow autonomous-discovery events arrive on the existing three-to-six-month cadence, land closer together (acceleration: falsify), or stretch out (deceleration: verify); and whether a non-OpenAI lab reproduces this class of result on an open mathematical problem with comparable peer review. The window the hypothesis is built around is short enough that the next two quarters of data will matter more than any single result this week.

Sources: OpenAI announcement; Companion arXiv paper 2605.20695 (Alon et al.); Sawin’s explicit-bound paper, arXiv 2605.20579; DeepMind AlphaProof IMO post; AlphaEvolve, arXiv 2511.02864; AlphaProof Nature publication coverage; vals.ai SWE-Bench Verified independent leaderboard; LocalAI Master SWE-Bench history; LMArena leaderboard; LMArena changelog; BenchLM Elo history.