brianletort.ai

The Model Pulse

Issue 11 · Week 27 of 2026.

/Weekly read/~6 min read/Public sources only

The Big Read

Washington entered the release pipeline: GPT-5.6 launched gated to ~20 approved partners, Fable 5 came back, and Sonnet 5 split sticker price from cost-per-task.

The thesis this issue defends

Within five days the US government both gated a new frontier launch and un-gated a suspended one. On June 26, OpenAI previewed the GPT-5.6 family (Sol, Terra, Luna) to roughly 20 government-approved partner organizations, the first US frontier model launched under a government-managed access list. Commerce then withdrew the Anthropic export-control order on June 30, and Claude Fable 5 returned globally on July 1. The procurement implication is direct: frontier availability is now partly a regulatory variable, so contracts and architectures must assume any flagship can be gated at launch or suspended after it, and multi-model routing is availability insurance, not just cost optimization. The week's biggest GA release, Claude Sonnet 5 (June 30, 1M context, 85.2% SWE-bench Verified, default for Claude Free/Pro), delivered the second structural lesson: at max effort its token appetite pushes measured cost to $2.29 per task at standard pricing, ~15% above Opus 4.8, despite a far lower per-token price. Sticker price and cost-per-task have measurably diverged. Buyers should route by measured cost-per-completed-task with effort level as an explicit parameter, and re-run that math before Sonnet 5's introductory pricing lapses on August 31. Meituan's MIT-licensed LongCat-2.0 rounds out the week by adding demand-proven open-weight pressure, along with a claimed NVIDIA-free trillion-scale training run, to the cost lane.
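
The gap between sticker price and cost-per-task is plain arithmetic. The sketch below is a minimal illustration, assuming the $2.29 figure and the ~15% premium come from the same measurement. The function names are illustrative, and the token counts in the second half are hypothetical placeholders rather than measured values.

```python
# Figures quoted in this issue.
SONNET5_MAX_COST_PER_TASK = 2.29   # USD per task, measured at max effort
PREMIUM_OVER_OPUS = 0.15           # "~15% above Opus 4.8"
PROMO_PRICE = (2.00, 10.00)        # USD per 1M input / output tokens, through Aug 31
STANDARD_PRICE = (3.00, 15.00)     # USD per 1M input / output tokens, from Sep 1


def implied_baseline(cost: float, premium: float) -> float:
    """Back out the baseline cost from a cost quoted as 'premium above baseline'."""
    return cost / (1.0 + premium)


def cost_per_task(input_tokens: int, output_tokens: int,
                  price: tuple[float, float]) -> float:
    """Cost in USD of one task, given token usage and per-1M-token prices."""
    price_in, price_out = price
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


opus_cost = implied_baseline(SONNET5_MAX_COST_PER_TASK, PREMIUM_OVER_OPUS)
print(f"Implied Opus 4.8 cost per task: ${opus_cost:.2f}")

# Hypothetical token usage for one task (placeholder values, not measurements).
tokens_in, tokens_out = 200_000, 120_000
promo = cost_per_task(tokens_in, tokens_out, PROMO_PRICE)
standard = cost_per_task(tokens_in, tokens_out, STANDARD_PRICE)
print(f"Promo: ${promo:.2f}  Standard: ${standard:.2f}  Ratio: {standard / promo:.2f}x")
```

Because the input and output prices both rise by the same factor, the per-task cost rises 1.5x for any token mix when the promo ends, assuming token usage stays the same.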

Tree delta

What changed in the tree.

5 models added, 1 updated.

Five rows added: claude-sonnet-5 and gpt-5-6-sol extend the closed reasoning canopy, longcat-2-0 lands a 1.6T open MoE, leanstral-1-5 opens a formal-verification specialist node, and nemotron-labs-twotower extends the diffusion-retrofit lineage. One row updated: claude-fable-5 now records the July 1 restoration.

Added (5)

  • claude-sonnet-5
  • gpt-5-6-sol
  • longcat-2-0
  • leanstral-1-5
  • nemotron-labs-twotower

Updated (1)

  • claude-fable-5

gpt-5-6-sol carries status gated (preview limited to ~20 government-approved partners, no independent benchmarks). The unverified Pulpie family (Feyn Inc.) was excluded as below the significance and sourcing threshold.

Frontier movements

Flagship-class releases.

3 releases this period.

Vendor-stated frontier capability. The releases that reset the closed-source ceiling.

  • /Anthropic/Reasoning/Agentic

    Claude Sonnet 5

    The week's biggest GA release: 1M context, 85.2% SWE-bench Verified, effort dials, at introductory $2/$10 per M tokens

    Sonnet 5 gives architects near-Opus agentic quality at a mid-tier sticker price — independent testing shows it beating Opus 4.8 on agentic knowledge-work benchmarks — but at max effort it costs ~15% more per completed task than Opus 4.8. Route by measured cost-per-task, not per-token price, and model post-promo economics (standard $3/$15 from September 1) before committing volume.

    Anthropic; Artificial Analysis

  • /OpenAI/Specialist/Reasoning

    GPT-5.6 family (Sol / Terra / Luna)

    Next OpenAI flagship generation previewed — but launched to only ~20 government-approved partner organizations

    Marked specialist because it is gated: this is the first US frontier launch under a government-managed access list, so procurement teams should treat release gating as a live availability risk for every future frontier upgrade and keep multi-vendor fallbacks tested. Terra's claimed GPT-5.5-level capability at half the price is the pricing lever to model ahead of GA.

    OpenAI

  • /Anthropic/Frontier/Reasoning

    Claude Fable 5

    Restored globally July 1 after Commerce withdrew the June 12 export-control order; back atop the available boards

    The strongest model on the public leaderboards is purchasable again, resetting the practical-leader question W26 settled in Opus 4.8's favor — but the 19-day forced outage is now a demonstrated failure mode, so any single-model dependency on a frontier flagship needs a tested fallback path. Note the fine print: after July 7, Fable 5 moves to usage credits even on paid plans, so treat it as a premium metered resource in budgets.

    Anthropic; VentureBeat
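
The "tested fallback path" recommended above can be as small as an ordered chain of backends. This is a minimal sketch, not any vendor's SDK: `ModelUnavailable`, `complete_with_fallback`, and the two stub backends are hypothetical names standing in for real client calls.

```python
from typing import Callable, Sequence


class ModelUnavailable(Exception):
    """Raised by a backend when its model is gated, suspended, or unreachable."""


# A backend is any callable that takes a prompt and returns a completion string.
Backend = Callable[[str], str]


def complete_with_fallback(prompt: str,
                           chain: Sequence[tuple[str, Backend]]) -> tuple[str, str]:
    """Try each (name, backend) in order; return (name, completion) from the first success."""
    failures = []
    for name, backend in chain:
        try:
            return name, backend(prompt)
        except ModelUnavailable as exc:
            failures.append(f"{name}: {exc}")
    raise RuntimeError("all models unavailable: " + "; ".join(failures))


# Stub backends standing in for real vendor clients (hypothetical).
def fable_5(prompt: str) -> str:
    raise ModelUnavailable("suspended")  # simulates a forced outage


def opus_4_8(prompt: str) -> str:
    return f"[opus-4.8] {prompt}"


used, text = complete_with_fallback("summarize the incident", [
    ("claude-fable-5", fable_5),
    ("claude-opus-4-8", opus_4_8),
])
print(used, "->", text)
```

Running the chain against a deliberately failing primary, as the stub does here, is one way to keep the fallback tested rather than merely configured.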

Open weights

Open-frontier and open-source drops.

3 releases this period.

Open-weights releases that change procurement options. Pull these into pilot when score parity meets license parity.

  • /Meituan/Open frontier/MoE

    LongCat-2.0

    OpenRouter's stealth chart-leader 'Owl Alpha' unmasked: a 1.6T MoE with native 1M context under MIT, trained on 50K+ domestic Chinese ASICs

    A near-frontier agentic-coding MoE with two months of real developer demand joins GLM-5.2 and DeepSeek V4 in the open cost-pressure lane — benchmark it wherever unit economics beat absolute frontier quality, but verify actual weight availability before treating it as self-hostable. The claimed NVIDIA-free 35T-token training run, if it holds, weakens the assumption that export controls throttle Chinese frontier training; factor that into sovereignty and supply planning.

    Hugging Face model card; VentureBeat; SiliconANGLE

  • /Mistral AI/Specialist/MoE

    Leanstral 1.5

    Apache 2.0 Lean 4 proof-engineering agent (119B MoE / 6.5B active) saturates miniF2F and solves 587/672 PutnamBench problems at ~$4 each

    The strongest evidence yet that narrow verticalized agent models can beat frontier brute force on economics — ~75x cheaper per problem than Seed-Prover 1.5 high — with a practical hook for engineering leaders: Mistral reports 5 previously unknown bugs found across 57 open-source repositories via formal code verification. Self-host the Apache weights for anything durable; the free labs API endpoint retires September 30.

    Mistral AI; MarkTechPost

  • /NVIDIA/Specialist/MoE

    Nemotron-Labs-TwoTower

    Open block-diffusion model on a frozen AR backbone: 2.42x wall-clock throughput at 98.7% of autoregressive quality

    TwoTower shows diffusion decoding can be retrofitted onto an existing pretrained AR checkpoint for ~2.1T training tokens instead of a 25T re-pretrain — a cheap serving-throughput upgrade path any lab with a strong AR model can copy, which compounds W26's serving-economics story. One checkpoint supports diffusion, mock-AR, and standard AR decoding, so platform teams can A/B the pattern with low switching risk once vLLM-class support lands.

    arXiv 2606.26493; MarkTechPost

Architecture watch

Patterns to track.

4 patterns reshaping the canopy.

Architectural patterns that crossed multiple vendors this period. Each pattern lists exemplar releases and what it changes for deployment, cost, or capability.

  • Effort dials decouple sticker price from cost-per-task

    Claude Sonnet 5 · DeepSeek V4 · GPT-5.5

    Every major vendor now ships user-controllable reasoning-effort settings, and this week produced the clearest data point yet that token-hungry effort scaling inverts price intuition: Sonnet 5 at max effort costs $2.29 per Intelligence Index task at standard pricing — more per completed task than the nominally pricier Opus 4.8. Evaluation harnesses must price the task, not the token, and effort level must become an explicit routing parameter in every model gateway.

    Artificial Analysis

  • Diffusion decoding retrofitted onto frozen AR backbones

    Nemotron-Labs-TwoTower · DiffusionGemma

    Two vendors in a month have shipped open diffusion LLMs built on top of existing autoregressive checkpoints rather than trained from scratch — TwoTower needed only ~2.1T training tokens against its backbone's 25T. Diffusion is emerging as a cheap post-hoc throughput upgrade (2.42x wall-clock at 98.7% quality) rather than a rival pretraining paradigm; platform teams should watch for this reaching production serving stacks via vLLM-class support.

    arXiv 2606.26493

  • 1M-context sparse attention is the new open-weight table stakes

    DeepSeek V4 · LongCat-2.0 · GLM-5.2

    The Chinese open-weight labs have converged on the same recipe: sparse/compressed attention variants that make 1M-token context economically routine. The DeepSeek V4 technical report published this window puts hard numbers on it — roughly 90% KV-cache reduction and 27% of V3.2's inference FLOPs at 1M tokens — which directly attacks the HBM-scarcity constraint W26 flagged. Long-context agentic workloads are becoming an open-weight strength, not a closed-model premium, so re-benchmark long-context routing assumptions.

    DeepSeek V4 technical report (arXiv 2606.19348)

  • Training and serving substrates diversify away from NVIDIA

    LongCat-2.0 · GPT-5.6 Sol on Cerebras · Groq inference cloud

    In one week, a claimed frontier-scale non-NVIDIA training run (LongCat-2.0 on 50K+ domestic Chinese ASICs) and a flagship non-NVIDIA serving deal (GPT-5.6 Sol on Cerebras at up to 750 tok/s, planned July) both landed. Treat the LongCat training claim as vendor-asserted, but the direction is consistent: buyers should start asking where a model was trained and where it can be served as part of sovereignty and capacity planning.

    SiliconANGLE; OpenAI
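
One way to make effort level an explicit routing parameter, as the first pattern recommends, is to treat each (model, effort) pair as its own route with its own measured cost. This is a sketch under stated assumptions: the `Route` type and `cheapest_route` function are hypothetical, the Opus 4.8 cost is derived from the ~15% gap quoted in this issue, and the medium-effort row is an invented placeholder.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    model: str
    effort: str            # effort is part of the route, not a global default
    quality: float         # score on whatever eval the team trusts
    cost_per_task: float   # measured USD per completed task


def cheapest_route(routes: list[Route], min_quality: float) -> Optional[Route]:
    """Return the lowest cost-per-task route that meets the quality floor, or None."""
    eligible = [r for r in routes if r.quality >= min_quality]
    return min(eligible, key=lambda r: r.cost_per_task, default=None)


routes = [
    # Index score and cost per task as quoted in this issue.
    Route("claude-sonnet-5", "max", quality=53, cost_per_task=2.29),
    # Index score from this issue; cost derived as 2.29 / 1.15 (assumption).
    Route("claude-opus-4-8", "max", quality=56, cost_per_task=1.99),
    # Hypothetical lower-effort setting; both numbers are placeholders.
    Route("claude-sonnet-5", "medium", quality=50, cost_per_task=0.90),
]

print(cheapest_route(routes, min_quality=52))
print(cheapest_route(routes, min_quality=48))
```

With a quality floor of 52 the nominally pricier Opus 4.8 wins on cost per task, and lowering the floor to 48 flips the choice to the placeholder medium-effort route. Per-token price never enters the decision.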

Benchmark moves

Where the leaderboard moved.

4 benchmarks shifted.

Benchmark deltas that change a procurement read. Scores reflect public leaderboards or vendor model cards as of publication.

  • Artificial Analysis Intelligence Index (v4.1)

    Sonnet 5 debuted at 53 (max effort), +6 over Sonnet 4.6 and tying GPT-5.5 (high); Fable 5's July 1 reinstatement puts the index leader back on sale

    • Claude Fable 5 (max): 60 (64.9 at June launch under earlier eval set)
    • Claude Opus 4.8 (max): 56
    • GPT-5.5 (xhigh): 55
    • Claude Opus 4.7 (max): 54
    • Claude Sonnet 5 (max): 53 (new this week)

    Artificial Analysis

  • SWE-bench Pro (vendor-scaffold aggregate)

    Sonnet 5 entered at 63.2% (+5.1pt over Sonnet 4.6, with 85.2% on Verified); Fable 5's 80.0% score re-entered the available-model conversation over the prior ~69% ceiling

    • Claude Mythos 5: 80.3%, restricted (Glasswing only)
    • Claude Fable 5: 80.0%, available again 2026-07-01
    • Claude Opus 4.8: 69.2%, prior available leader
    • Claude Sonnet 5: 63.2%, new this week
    • GLM-5.2: 62.1%, open-weights leader

    Anthropic system card via Vellum; benchlm.ai (updated 2026-07-03); vendor-scaffold numbers, not Scale-standardized

  • LMArena

    claude-sonnet-5-thinking added to the Code, Text, Search, Vision, and Document leaderboards on July 2; claude-fable-5 holds #1 overall while Elo settles

    • claude-fable-5: #1 overall (1653 in one view; 1509 in another scoring window)
    • GLM-5.2 (max): ~1584, top open entry
    • claude-sonnet-5-thinking: ~1551 early snapshot; votes still accumulating

    Arena leaderboard changelog

  • PutnamBench / miniF2F (formal math)

    Leanstral 1.5 reset the open state of the art: miniF2F saturated at 100%, PutnamBench 587/672 — edging Seed-Prover 1.5 high by 7 problems at ~1/75th the per-problem cost

    • Leanstral 1.5, miniF2F: 100% (val+test)
    • Leanstral 1.5, PutnamBench: 587/672 at ~$4/problem
    • Leanstral 1.5, FATE-H / FATE-X: 87% / 34% (SOTA)
    • Leanstral 1.5, FLTEval pass@8: 43.2 vs Opus 4.6's 39.6

    Mistral AI (vendor-reported; independent re-runs pending)

Tier scorecard

Who leads, who pushes.

6 tiers · leaders as of Jul 4, 2026.

A snapshot of leader-vs-challenger by tier. Useful for procurement shortlists when matching workload to model class. Pair with the benchmark moves above for the underlying scores.

  • Closed frontier

    Leader: Claude Fable 5

    Challenger: Claude Opus 4.8

    Fable 5's July 1 restoration makes the board leader purchasable again — but it is metered to usage credits after July 7 and its 19-day outage proved gating risk, so keep Opus 4.8 as the tested fallback.

  • Open frontier

    Leader: GLM-5.2

    Challenger: LongCat-2.0

    GLM-5.2 keeps the permissive open lead on scores; LongCat-2.0 arrives with real OpenRouter demand evidence but staged weights — verify availability before piloting self-host.

  • Reasoning

    Leader: Claude Fable 5

    Challenger: GPT-5.5

    The AA Intelligence Index leader is available again; GPT-5.5 remains the strongest generally-available OpenAI entry while GPT-5.6 sits in gated preview.

  • Coding

    Leader: Claude Fable 5

    Challenger: Claude Sonnet 5

    Fable 5's 80.0% SWE-bench Pro is back on sale; Sonnet 5 (63.2% Pro, 85.2% Verified) is the cost-tier challenger — priced per task, not per token.

  • Multimodal

    Leader: Gemini 3.5 Flash

    Challenger: MiniMax-M3

    No W27 multimodal reset; Google's media line churned (Veo 2.0/3.0 shut down June 30 as cheap successors shipped), which is a migration-budget signal rather than a leadership change.

  • Edge / small

    Leader: Mellum2

    Challenger: North Mini Code

    No edge leadership change in-window; TwoTower's diffusion retrofit (2.42x throughput on a 30B-class backbone) is the efficiency pattern to watch for this tier.

Vendor signals

Pricing, gating, deprecation.

6 non-release signals worth tracking.

The non-release moves that shift vendor risk — pricing, deprecations, gating decisions, license changes — with a one-line procurement read.

  • /OpenAI + Anthropic (US Commerce)

    Within five days the US government gated a new frontier launch (GPT-5.6, ~20 approved partners) and un-gated a suspended one (Fable 5 restored July 1)

    Frontier model availability is now partially a regulatory variable. Contracts and architectures should assume any flagship can be gated at launch or suspended post-launch; multi-model routing is availability insurance, not just cost optimization.

    OpenAI; Anthropic

  • /Anthropic

    Sonnet 5 launched at introductory $2/$10 per M tokens through August 31 (then $3/$15); Fable 5 moves to usage credits even on paid plans after July 7

    Budget owners should model post-promo Sonnet 5 economics now — the independent cost-per-task data already uses standard pricing — and treat Fable 5 as a premium metered resource, not a plan entitlement.

    Anthropic

  • /DeepSeek

    Legacy deepseek-chat / deepseek-reasoner aliases go fully dark July 24 15:59 UTC; they currently silently route to deepseek-v4-flash

    Anyone with DeepSeek integrations has a three-week migration deadline, and reasoner traffic wanting Pro-tier quality must explicitly move to deepseek-v4-pro. Open-weight vendors deprecate as aggressively as closed ones — budget for it.

    DeepSeek API docs

  • /Google

    Veo 2.0/3.0 video models shut down June 30; same day, Nano Banana 2 Lite hit GA and Gemini Omni Flash entered public preview at $0.10/second of 720p video

    Google's media-model line is churning fast — aggressive price points on the way in, one-year-class deprecations on the way out. Media pipelines built on Gemini need version-migration budgets as a standing line item.

    Google Gemini API changelog

  • /xAI

    Voice Agent Builder beta: no-code speech-to-speech agents on Grok Voice at $0.05/min audio + $0.01/min telephony — roughly 1/3 to 1/5 of ElevenLabs/Vapi-class pricing

    Model vendors keep moving up-stack into the agent-platform layer, putting voice-AI middleware margins under direct attack from a model owner. Re-price any voice-agent build-vs-buy decision, and treat vendor-run voice benchmark claims as marketing.

    xAI

  • /Anthropic

    Claude Science beta shipped — a multi-agent scientific workbench on existing Claude models, with up to 50 funded projects at $30K credits each (applications through July 15)

    Anthropic is monetizing workflow ownership rather than raw capability, consistent with the W26 read that model adoption is turning into workflow operations. Expect vertical workbenches, not just model upgrades, to be the competitive surface in H2 2026.

    Anthropic; TechCrunch
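
For the DeepSeek alias shutdown above, the migration amounts to replacing each legacy alias with an explicit model id before the deadline. This is a minimal sketch: the helper names are hypothetical, and the mapping assumes chat traffic should keep its current `deepseek-v4-flash` behavior while reasoner traffic is deliberately moved to `deepseek-v4-pro`.

```python
from datetime import datetime, timezone

# Shutdown time quoted in this issue: July 24, 2026, 15:59 UTC.
ALIAS_SHUTDOWN = datetime(2026, 7, 24, 15, 59, tzinfo=timezone.utc)

# Explicit replacements. Both legacy aliases currently resolve to
# deepseek-v4-flash; routing reasoner traffic to deepseek-v4-pro is a
# deliberate choice for workloads that want Pro-tier quality (assumption).
ALIAS_REPLACEMENTS = {
    "deepseek-chat": "deepseek-v4-flash",
    "deepseek-reasoner": "deepseek-v4-pro",
}


def migrate_model_id(model: str) -> str:
    """Return the explicit model id for a legacy alias; pass other ids through."""
    return ALIAS_REPLACEMENTS.get(model, model)


def days_until_shutdown(now: datetime) -> int:
    """Whole days remaining before the aliases go dark (negative once past)."""
    return (ALIAS_SHUTDOWN - now).days


print(migrate_model_id("deepseek-reasoner"))
print(days_until_shutdown(datetime(2026, 7, 4, tzinfo=timezone.utc)))
```

Applying `migrate_model_id` at the gateway makes the flash-versus-pro decision visible in configuration instead of leaving it to silent alias routing.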

Watchlist

On the radar next.

7 catalysts to watch, starting July 2026.

Specific model-side catalysts in the next 7–30 days that would change the read materially. Watching these tells us whether the canopy is widening or thinning.

  • July 2026

    GPT-5.6 general availability

    Terra's claimed GPT-5.5-level capability at half the price is the biggest potential closed-model price reset of Q3; also watch whether the government preview process becomes a repeatable framework for every frontier launch.

  • July 2026

    Gemini 3.5 Pro GA

    A 2M-context flagship GA (slipped from June) would contest the closed-frontier ordering for the first time since May; it is already on LMArena and Antigravity for some users.

  • July 7, 2026

    Fable 5 usage-credit transition

    The included-allowance window ends; watch adoption and routing behavior once Fable 5 is purely metered, plus completeness of Bedrock/Vertex/Foundry re-enablement.

  • July 24, 2026

    DeepSeek legacy alias shutdown

    deepseek-chat / deepseek-reasoner go dark at 15:59 UTC; expect a burst of migration issues and possible V4 usage-share data.

  • Days to weeks

    LongCat-2.0 full weight availability and independent benchmarks

    The HF repo said weights were coming soon at announcement (INT8/FP8 uploads observed since); independent evals will test both the near-frontier claim and the domestic-ASIC training story.

  • By August 31, 2026

    Sonnet 5 promo pricing expiry

    Standard $3/$15 pricing plus its measured token appetite could make Sonnet 5 more expensive per task than Opus 4.8 in production; re-run cost-per-task math before the promo lapses.

  • July 2026

    GPT-5.6 Sol on Cerebras

    Up to 750 tok/s serving for select customers would be the first datapoint for frontier closed-model serving on non-NVIDIA silicon at scale.

Edits this issue

  • Five tree rows added (claude-sonnet-5, gpt-5-6-sol, longcat-2-0, leanstral-1-5, nemotron-labs-twotower) and claude-fable-5 updated to record the July 1 restoration.
  • Market architecture refreshed: five matching model cards, a July 2026 model-portfolio position read, and new W27 sources added.
  • Excluded as unverifiable: the Pulpie family (Feyn Inc.) — aggregator mentions only, no primary source (grade 1-2).

About The Model Pulse

A weekly read on the software side of the AI stack. Anchored to the LLM Evolutionary Tree, which the brief annotates each week. The cross-stack flywheel (capital, hardware, networking) is covered in The AI Stack Weekly.

Authorship and sources

Compiled from public model cards, vendor blogs, leaderboards, and official lab announcements. Written by Brian Letort. Independent analysis. Not investment guidance.