Anthropic Model Benchmark Analysis · June 2026
Opus delivers 94% of Fable's insight quality at 40% of the cost for business analyses
272 outputs · 4 models · 4 companies · 17 modules each · blind dual-judge (Anthropic + OpenAI), rank correlation 1.00 · every output published, unedited
SeanPropApp pressure-tests a business proposition (a company, a product idea, or an investment thesis) through a fixed 17-module structured analysis: market sizing, customer and jobs-to-be-done, competition, moat, unit economics, go-to-market, and an executive synthesis. You can ground it in your own documents and context; without them it works from public information, as these demos do. Because the quality of that analysis depends heavily on the underlying AI model, we benchmarked it: the same 17-module analysis, on four company use cases, across four models, blind-scored by two independent judges, one Anthropic and one OpenAI. These are the same models you choose when you run your own analysis in the app.
The four analyses show a range of SeanPropApp use cases: Tesla (New Market for Robotics: Elder Care), Pendo (Hypothetical M&A: LaunchDarkly), DocuSign (New Product Idea: AI Contract Negotiation Workspace), and Cursor (Whole Company Analysis).
New here? Read the 17-module methodology this benchmark scores.
Executive summary
Opus by default; Fable when the decision is big.
We ran SeanPropApp's 17-module proposition analysis on four companies across four models and blind-scored all 272 outputs with two independent judges (one Anthropic, one OpenAI). Claude Fable 5 set the quality ceiling at 8.6 / 10, but Opus 4.8 reached 94% of that quality at 40% of the cost ($2.78 vs $6.89), making Opus the rational default and, since Fable 5's access was suspended on 2026-06-12, the strongest model available today. The advantage of premium models concentrates in judgment-heavy work (executive synthesis, market sizing, competitive and moat analysis) and is real depth, not length: insight per 1,000 words nearly doubles up the ladder. Both judges produced the identical ranking, blind (rank correlation 1.00).
At the other end of the range, Haiku performed poorly on these business critical-thinking tasks, scoring lowest on every criterion, with insight density roughly half that of the top tiers (4.7 versus 8.7): more text, less analysis. The few dollars it saves are a false economy when the output is meant to shape a real strategy, product, or investment call. For a quick throwaway pass it is fine; for a decision you will bet on, it is the wrong tool.
Results
Composite, cost, and every criterion
Opus 4.8: 94% of the ceiling at 40% of the cost, and the top model available today.
Fable 5: set the measured ceiling; worth the premium when the call warrants it.
Insight per 1k words nearly doubles up the ladder: depth, not length.
An Anthropic and an OpenAI judge produced the identical ranking, blind.
Bars are proportional on a zero-based 0-10 scale and sorted by composite score, highest first; each bar's score sits at its right end and its cost per analysis follows. Opus 4.8 (green) is the value pick: the most quality available per dollar, and the top model available today.
| Tier (model / blend) | Composite | Quality | Insight | Relevance | Compliance | Cost / Analysis | Insight Density | Words / Analysis | Latency |
|---|---|---|---|---|---|---|---|---|---|
| Fable 5 | 8.6 | 8.7 | 8.6 | 8.8 | 8.3 | $6.89 | 8.7 | 16,678 | 1,155s |
| Opus 4.8 | 8.1 | 8.1 | 8.0 | 8.3 | 8.0 | $2.78 | 8.7 | 15,597 | 784s |
| Sonnet 4.6 | 7.6 | 7.7 | 7.3 | 7.9 | 7.3 | $2.72 | 7.4 | 16,809 | 2,190s |
| Haiku 4.5 | 5.9 | 6.1 | 5.5 | 6.4 | 5.6 | $0.84 | 4.7 | 19,569 | 1,108s |
Criteria. Quality: structure, rigor, and craft. Insight: original, non-obvious judgments a reader can act on. Relevance: grounded in this company's actual situation. Compliance: instruction adherence, no fabricated data, no harness artifacts. Composite is the mean of the four. Insight density is insight score per 1,000 words. The 0-10 score cells are shaded on a shared high-to-low scale (dark is strong, pale is weak); the cost, word, and latency columns are not. Costs are methodology token cost at provider list prices as of the run dates; see Appendix B.
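The composite and the two headline ratios can be re-derived from the table. The following is a minimal sketch that assumes only the rounded table cells, so a recomputed composite can differ from the published one by up to about 0.05; the variable names are illustrative and not part of the study's tooling.

```python
from statistics import mean

# Criterion scores (quality, insight, relevance, compliance) and cost per
# analysis in USD, copied from the rounded cells of the results table.
TIERS = {
    "Fable 5": {"scores": (8.7, 8.6, 8.8, 8.3), "cost": 6.89},
    "Opus 4.8": {"scores": (8.1, 8.0, 8.3, 8.0), "cost": 2.78},
    "Sonnet 4.6": {"scores": (7.7, 7.3, 7.9, 7.3), "cost": 2.72},
    "Haiku 4.5": {"scores": (6.1, 5.5, 6.4, 5.6), "cost": 0.84},
}

# Composite is the mean of the four criterion scores.
composites = {name: mean(tier["scores"]) for name, tier in TIERS.items()}

# Headline ratios: Opus relative to Fable, on quality and on cost.
quality_ratio = composites["Opus 4.8"] / composites["Fable 5"]
cost_ratio = TIERS["Opus 4.8"]["cost"] / TIERS["Fable 5"]["cost"]

for name, value in composites.items():
    print(f"{name}: composite {value:.2f}")
print(f"Opus vs Fable: {quality_ratio:.0%} of the quality at {cost_ratio:.0%} of the cost")
```

Because the inputs are already rounded to one decimal, the recomputed Sonnet composite lands on the boundary between 7.5 and 7.6; the published 7.6 presumably comes from unrounded scores.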
Deeper analysis · by module
Where the model choice matters most
Composite score per methodology module, averaged across the four companies, in published-analysis flow: the Executive Summary first, then the Research, Proposition, and Business Model groups (the same order, and the same module labels, you see when you run an analysis in the app). The advantage of premium models concentrates in narrative synthesis and judgment-heavy strategy: executive summaries (+3.7 Fable over Haiku), pitches (+3.4), gap and moat analysis (+3.3). It is narrowest on mechanical modules like customer quotes (+1.7). Dark cells are strong; pale cells are weak.
| Phase | Module | Fable 5 | Opus 4.8 | Sonnet 4.6 | Haiku 4.5 |
|---|---|---|---|---|---|
| Exec | Executive Summary | 8.8 | 8.0 | 7.1 | 5.1 |
| Research | Initial Framing | 8.1 | 8.2 | 7.5 | 6.1 |
| Research | Market Sizing & TAM | 8.8 | 7.9 | 6.6 | 5.7 |
| Research | Customer Profile | 8.6 | 8.3 | 7.6 | 5.6 |
| Research | Jobs To Be Done | 8.8 | 8.2 | 8.0 | 6.4 |
| Research | Competitive Field | 8.6 | 8.1 | 7.9 | 5.8 |
| Proposition | Positioning | 8.7 | 8.3 | 8.0 | 6.2 |
| Proposition | Elevator Pitches | 8.8 | 7.6 | 6.8 | 5.4 |
| Proposition | Customer Quotes | 8.6 | 8.1 | 8.3 | 6.8 |
| Proposition | Future Press Release | 8.6 | 8.1 | 7.6 | 5.5 |
| Proposition | Discovery Plan | 8.9 | 8.3 | 7.6 | 6.5 |
| Proposition | Gap Analysis | 8.8 | 7.9 | 7.5 | 5.4 |
| Business Model | Value Stack | 8.2 | 8.2 | 7.7 | 6.0 |
| Business Model | Moat Deep Dive | 8.8 | 8.2 | 7.8 | 5.5 |
| Business Model | Unit Economics | 8.1 | 8.0 | 7.0 | 6.2 |
| Business Model | Top Questions | 8.4 | 7.9 | 7.7 | 6.1 |
| Business Model | Additional Ideas | 8.9 | 8.4 | 7.7 | 5.6 |
The narrow left column groups the modules into the analysis's three phases (Research, Proposition, Business Model), with the Executive Summary on its own at the top. The model gap is widest on the narrative, judgment-heavy modules (executive synthesis, pitches, gap and moat analysis) and narrowest on mechanical modules like customer quotes. For what each module does, see the 17-module methodology.
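To see where the gap is widest and narrowest, the per-module differences can be computed directly from the table. The sketch below is illustrative only: it stores scores as integer tenths (88 means 8.8) so that subtraction is exact, and it borrows the roughly 0.3-point tie threshold from the Limitations section. Because it works from rounded cells, a difference can land 0.1 away from a figure computed on unrounded scores.

```python
# Composite per module as (Fable 5, Opus 4.8, Sonnet 4.6, Haiku 4.5),
# in integer tenths of a point, copied from the module table.
MODULES = {
    "Executive Summary": (88, 80, 71, 51),
    "Initial Framing": (81, 82, 75, 61),
    "Market Sizing & TAM": (88, 79, 66, 57),
    "Customer Profile": (86, 83, 76, 56),
    "Jobs To Be Done": (88, 82, 80, 64),
    "Competitive Field": (86, 81, 79, 58),
    "Positioning": (87, 83, 80, 62),
    "Elevator Pitches": (88, 76, 68, 54),
    "Customer Quotes": (86, 81, 83, 68),
    "Future Press Release": (86, 81, 76, 55),
    "Discovery Plan": (89, 83, 76, 65),
    "Gap Analysis": (88, 79, 75, 54),
    "Value Stack": (82, 82, 77, 60),
    "Moat Deep Dive": (88, 82, 78, 55),
    "Unit Economics": (81, 80, 70, 62),
    "Top Questions": (84, 79, 77, 61),
    "Additional Ideas": (89, 84, 77, 56),
}
TIE_THRESHOLD = 3  # 0.3 points, per the Limitations section

# Fable-over-Haiku gap per module, still in tenths.
gap = {name: s[0] - s[3] for name, s in MODULES.items()}
ranked = sorted(gap, key=gap.get, reverse=True)
widest, narrowest = ranked[0], ranked[-1]

# Modules where Fable and Opus differ by less than the tie threshold.
fable_opus_ties = [
    name for name, s in MODULES.items() if abs(s[0] - s[1]) < TIE_THRESHOLD
]

print(f"Widest gap: {widest} (+{gap[widest] / 10:.1f})")
print(f"Narrowest gap: {narrowest} (+{gap[narrowest] / 10:.1f})")
print("Fable/Opus ties:", ", ".join(fable_opus_ties))
```

On these rounded figures, Fable and Opus are within the tie threshold on three modules, which is consistent with the small overall quality gap between the two tiers.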
What the four criteria reveal
Every tier scores highest on relevance and lowest on insight or compliance. Staying on topic is essentially solved at every price point (even Haiku reaches 6.4 on relevance); what money buys is depth of analysis and faithfulness to instructions. Insight has the widest spread of any criterion, 5.5 to 8.6, which makes it the single best discriminator between models. Compliance lags everywhere, including at the top: it is Fable's lowest score (8.3) and its smallest gain over Opus (+0.3). A better model buys reasoning more than obedience, so output-format guardrails stay necessary at every price point.
Depth, not length
Word count runs roughly inversely to quality. Haiku wrote the longest analyses in the study (19,569 words) and scored the worst; Fable wrote 15% fewer words and scored 2.7 composite points higher. Insight density, insight points per 1,000 words, nearly doubles up the table, from 4.7 (Haiku) to 8.7 (Fable). Cheaper models do not produce less analysis; they produce more text containing less analysis. One nuance: Fable is not denser than Opus. Their densities are identical to one decimal place (8.7 each); Fable's higher insight total comes from sustaining Opus-level density across about 7% more material, not from tighter writing.
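The density figures can be approximated from the results table. The sketch below rests on an assumption that the page does not state explicitly: that "insight points" means the insight score summed over the 17 modules of one analysis. Under that reading, the rounded table values land within about 0.1 of the published densities; treat it as a plausibility check, not as the study's actual formula.

```python
MODULES_PER_ANALYSIS = 17

# (mean insight score, words per analysis, published insight density),
# copied from the results table.
TIERS = {
    "Fable 5": (8.6, 16_678, 8.7),
    "Opus 4.8": (8.0, 15_597, 8.7),
    "Sonnet 4.6": (7.3, 16_809, 7.4),
    "Haiku 4.5": (5.5, 19_569, 4.7),
}


def insight_density(mean_insight, words, modules=MODULES_PER_ANALYSIS):
    """Insight points per 1,000 words.

    ASSUMPTION: total insight points = mean insight score x number of modules.
    """
    total_insight = mean_insight * modules
    return total_insight / (words / 1000)


densities = {
    name: insight_density(insight, words)
    for name, (insight, words, _published) in TIERS.items()
}

# Fable's extra material relative to Opus, as a fraction of Opus's word count.
extra_material = TIERS["Fable 5"][1] / TIERS["Opus 4.8"][1] - 1

for name, value in densities.items():
    print(f"{name}: about {value:.1f} insight points per 1,000 words")
print(f"Fable wrote about {extra_material:.0%} more words than Opus")
```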
What an extra dollar buys
Composite points gained per extra dollar versus the next-cheaper model. One step is nearly free and dominates the ladder: moving from Sonnet to Opus buys +0.5 for six cents. The Opus-to-Fable step buys another +0.5 but costs four more dollars, so it earns its premium only when the decision is big enough to warrant the ceiling.
| Model Step Up Gains | Δ Composite | Δ Cost | Pts per extra $1 |
|---|---|---|---|
| Haiku 4.5 to Sonnet 4.6 | +1.7 | +$1.88 | 0.9 |
| Sonnet 4.6 to Opus 4.8 | +0.5 | +$0.06 | 9.0 |
| Opus 4.8 to Fable 5 | +0.5 | +$4.11 | 0.1 |
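The step-up arithmetic is a simple ratio of deltas between adjacent tiers. The sketch below recomputes it from the rounded composite and cost figures in the results table. With rounded inputs the Sonnet-to-Opus step comes out near 8.3 rather than 9.0; the published value presumably uses unrounded composites, and with a denominator of only six cents a small rounding difference moves the ratio noticeably.

```python
# (tier, composite, cost per analysis in USD), cheapest first,
# copied from the rounded cells of the results table.
LADDER = [
    ("Haiku 4.5", 5.9, 0.84),
    ("Sonnet 4.6", 7.6, 2.72),
    ("Opus 4.8", 8.1, 2.78),
    ("Fable 5", 8.6, 6.89),
]

steps = []
# Pair each tier with the next more expensive one.
for (low_name, low_score, low_cost), (high_name, high_score, high_cost) in zip(
    LADDER, LADDER[1:]
):
    delta_score = high_score - low_score
    delta_cost = high_cost - low_cost
    steps.append(
        {
            "step": f"{low_name} to {high_name}",
            "delta_score": delta_score,
            "delta_cost": delta_cost,
            "points_per_dollar": delta_score / delta_cost,
        }
    )

for s in steps:
    print(
        f"{s['step']}: +{s['delta_score']:.1f} for +${s['delta_cost']:.2f} "
        f"({s['points_per_dollar']:.1f} pts per extra $1)"
    )
```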
What the gap looks like in practice
Two pairs of verbatim Tesla Optimus excerpts, each pair from the same module and the same company but from different tiers. The first pair shows the gap at its widest; the second shows where the extra spend buys nothing.
"SOM: est $200M-$2B over 12-24 months (contingent on FDA clarity and pilot proof). Year 1 base case ARR (50 facilities, 80 households): est $78M. Conservative scenario: $39M. Optimistic scenario: $159M. […] Implied valuation at $78M ARR: est $600M-1.2B."
"SOM (12-24 months): est $0-50M; the planning number is zero revenue. […] Revenue ramp: Year 1 (2028) est $0-50M. All speculative; kill criteria are conversion and incident rates, not revenue."
"Critically, the one Power that would be durable as hardware commoditizes, Cornered Resource (an owned home-safety certification and insurance product), scores 1 because it does not yet exist. Tesla's current posture is 'strong on the layers that will erode, absent on the layer that will not.'"
"As scoped today, this proposition is structurally undefensible; the entire strategic case rests on converting manufacturing scale into a trust-layer Power (Process Power plus Branding) before hardware and autonomy commoditize."
Why these results can be trusted
The two judges, one Anthropic and one OpenAI, produced identical tier rankings (Spearman rank correlation 1.00). On Fable specifically the result is unusually well corroborated: the Anthropic judge scored it 8.60 and gpt-5.5 scored it 8.61, and 8.61 was gpt-5.5's highest mark for any configuration. A competing vendor's model, judging blind, independently rated Fable the best configuration in the study, which removes the obvious objection that an Anthropic-judged benchmark flatters an Anthropic flagship. The judges differed only in harshness toward weak outputs, never in ranking: the better the output, the less it mattered who graded it.
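For readers unfamiliar with the statistic, the sketch below shows how a Spearman rank correlation of 1.00 arises when two judges order the tiers identically while differing in harshness. Only the Fable pair (8.60 and 8.61) is taken from this page; every other score in the example is a hypothetical placeholder chosen for illustration, not study data.

```python
def ranks(values):
    """Rank values from highest (rank 1) to lowest; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(a, b):
    """Spearman rank correlation for two equal-length lists without ties."""
    n = len(a)
    d_squared = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Order: Fable 5, Opus 4.8, Sonnet 4.6, Haiku 4.5.
# HYPOTHETICAL except for the first entry of each list (8.60 and 8.61).
judge_anthropic = [8.60, 8.3, 7.9, 6.4]
judge_openai = [8.61, 7.9, 7.3, 5.4]

rho = spearman(judge_anthropic, judge_openai)
print(f"Spearman rank correlation: {rho:.2f}")
```

Rank correlation ignores the size of the score differences, which is why two judges can disagree on harshness and still correlate perfectly on ranking.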
Conclusions
Spend for the best on the decisions that matter
On high-stakes work, the price gap between tiers is negligible. Strategy and proposition refinement are high-leverage, make-or-break activities. Set the cost of a single analysis against what the decision actually consumes: leadership and employee time, payroll, software licenses, external research, consulting, and go-to-market spend, frequently hundreds of thousands to millions of dollars. Against that, the difference between a $2.78 and a $6.89 analysis disappears. The rational move is to buy the best model your access allows for any decision that matters, and reserve the cheap tiers for low-stakes drafting. Opus is the value default; Fable, where available, is worth its premium when the single decision is big enough to warrant the ceiling. Haiku is not worth the savings for business-critical strategic decisions.
Draft framing for review: the numbers above are from the study, and the interpretation is editorial.
Appendix A
Every analysis, by tier and company
All 16 full analyses (four companies across four models) are published unedited in the standard SeanPropApp viewer, each shown with its blended composite score and cost. Browse and open any of them from the Results Matrix, or compare a single proposition module by module across all four models in the Tesla deep dive.
Appendix B
Method, criteria, and scope
Blind, dual-judge, re-randomized. Each module's four outputs (one per model) were anonymized, re-shuffled for every module, and scored 0-10 on quality, insight, relevance, and compliance by two independent judges from different labs: a Claude model and gpt-5.5. The published score is the mean of the two judges' composites. Rank correlation between judges was 1.00 (identical ordering); the mean per-model difference was 0.39, tightest at the top of the table.
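As a rough illustration of the blinding step, the sketch below shuffles one module's outputs, swaps model names for neutral labels, and averages two judges' composites into a published score. The function names, the label scheme, and the placeholder texts are assumptions made for this example; the study's actual harness is not described at this level of detail.

```python
import random


def blind_outputs(outputs_by_model, rng):
    """Shuffle one module's outputs and replace model names with neutral labels.

    Returns the anonymized (label, text) pairs shown to the judges and a
    private key mapping each label back to its model.
    """
    items = list(outputs_by_model.items())
    rng.shuffle(items)  # called once per module, so the order changes each time
    labels = [f"Output {chr(ord('A') + i)}" for i in range(len(items))]
    anonymized = [(label, text) for label, (_model, text) in zip(labels, items)]
    key = {label: model for label, (model, _text) in zip(labels, items)}
    return anonymized, key


def published_score(judge_a_composite, judge_b_composite):
    """Published score: the mean of the two judges' composites."""
    return (judge_a_composite + judge_b_composite) / 2


# Placeholder texts stand in for the real module outputs.
outputs = {
    "Fable 5": "<module output 1>",
    "Opus 4.8": "<module output 2>",
    "Sonnet 4.6": "<module output 3>",
    "Haiku 4.5": "<module output 4>",
}
anonymized, key = blind_outputs(outputs, random.Random(0))

for label, _text in anonymized:
    print(label)  # judges see only the neutral label and the text
print(published_score(8.60, 8.61))
```

Keeping the label-to-model key separate from what the judges see is what makes the scoring blind; the key is consulted only after both judges have scored.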
What we measured, and what we did not
This benchmark tests one thing: structured business-proposition analysis (17 strategy modules, four companies). It is a domain benchmark, not a general capability ranking. Fable 5's largest advances are in domains we did not test (advanced coding, agentic and tool use, long-context technical reasoning, math); this study is silent on those. The takeaway is scoped accordingly: for this kind of analysis, the quality gap to Opus is small and the cost gap is large.
When, and at what cost
Haiku 4.5 and Sonnet 4.6 ran on 2026-05-28, Opus 4.8 on 2026-05-29, and Fable 5 from 2026-06-10 to 2026-06-11. All 272 outputs were scored on 2026-06-11, the day before Fable's 2026-06-12 suspension. Costs are the methodology's token cost per full 17-module analysis at provider list prices as of the run dates, with a uniform, roughly 2x thinking-token inflation applied across all tiers. Ratios between tiers are therefore sound, while absolute dollar figures should be read as approximate. Methodology snapshot: Production v2.1.0. This page is a snapshot of that moment; model versions, capabilities, and prices all move.
Limitations
One run per configuration, four companies, one methodology version. Per-module figures average only four data points per tier, so module-level deltas under about 0.3 should be treated as ties. Judges agree perfectly on ranking but differ in harshness; the OpenAI judge scored low-end outputs harder. AI judges, however independent, are not a human expert panel.