Beta v1.6.4 | Methodology v2.1.0

Anthropic Model Benchmark Analysis · June 2026

Opus delivers 94% of Fable's insight quality at 40% of the cost for business analyses

272 outputs · 4 models · 4 companies · 17 modules each · blind dual-judge (Anthropic + OpenAI), rank correlation 1.00 · every output published, unedited

SeanPropApp pressure-tests a business proposition (a company, a product idea, or an investment thesis) through a fixed 17-module structured analysis: market sizing, customer and jobs-to-be-done, competition, moat, unit economics, go-to-market, and an executive synthesis. You can ground it in your own documents and context; without them it works from public information, as these demos do. Because the quality of that analysis depends heavily on the underlying AI model, we benchmarked it: the same 17-module analysis, on four company use cases, across four models, blind-scored by two independent judges, one Anthropic and one OpenAI. These are the same models you choose when you run your own analysis in the app.

The four analyses show a range of SeanPropApp use cases: Tesla (New Market: Robotics Elder Care), Pendo (Hypothetical M&A: LaunchDarkly), DocuSign (New Product Idea: AI Contract Negotiation Workspace), and Cursor (Whole Company Analysis).

On June 12, 2026, Anthropic suspended access to Claude Fable 5 to comply with a US government directive. We completed this benchmark before access was suspended; access may be restored in future. The results below stand on the evidence as run.

New here? Read the 17-module methodology this benchmark scores.


Executive summary

Opus by default; Fable when the decision is big.

We ran SeanPropApp's 17-module proposition analysis on four companies across four models and blind-scored all 272 outputs with two independent judges (one Anthropic, one OpenAI). Claude Fable 5 set the quality ceiling at 8.6 / 10, but Opus 4.8 reached 94% of that quality at 40% of the cost ($2.78 vs $6.89), making Opus the rational default and, since Fable 5's access was suspended on 2026-06-12, the strongest model available today. The advantage of premium models concentrates in judgment-heavy work (executive synthesis, market sizing, competitive and moat analysis) and is real depth, not length: insight per 1,000 words nearly doubles up the ladder. Both judges produced the identical ranking, blind (rank correlation 1.00).
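The two headline ratios follow directly from the composite scores and per-analysis costs quoted above. A minimal Python sketch of that arithmetic (the variable names are illustrative and not part of SeanPropApp):

```python
# Headline ratios from the published composites and per-analysis costs.
fable_score, opus_score = 8.6, 8.1   # composite score, 0-10
fable_cost, opus_cost = 6.89, 2.78   # USD per full 17-module analysis

quality_ratio = opus_score / fable_score  # share of Fable's quality that Opus reaches
cost_ratio = opus_cost / fable_cost       # share of Fable's cost that Opus incurs

print(f"quality: {quality_ratio:.0%}, cost: {cost_ratio:.0%}")
```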

At the other end of the range, Haiku performed poorly on these business critical-thinking tasks, scoring lowest on every criterion, with insight density roughly half that of the top tiers: more text, less analysis. The few dollars it saves are a false economy when the output is meant to shape a real strategy, product, or investment call. For a quick throwaway pass it is fine; for a decision you will bet on, it is the wrong tool.

Results

Composite, cost, and every criterion

Chart 1 · Composite score and cost per full analysis (composite 0-10)
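Fable 5 · composite 8.6 · $6.89 per analysis
Opus 4.8 · composite 8.1 · $2.78 per analysis
Sonnet 4.6 · composite 7.6 · $2.72 per analysis
Haiku 4.5 · composite 5.9 · $0.84 per analysis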

Bars are proportional on a zero-based 0-10 scale and sorted by composite score, highest first; each bar's score sits at its right end and its cost per analysis follows. Opus 4.8 (green) is the value pick: the most quality available per dollar, and the top model available today.

Table 1 · Results by tier (blended dual-judge; composite = mean of the four criteria)
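| Tier (model / blend) | Composite | Quality | Insight | Relevance | Compliance | Cost / Analysis | Insight Density | Words / Analysis | Latency |
|---|---|---|---|---|---|---|---|---|---|
| Fable 5 | 8.6 | 8.7 | 8.6 | 8.8 | 8.3 | $6.89 | 8.7 | 16,678 | 1,155s |
| Opus 4.8 | 8.1 | 8.1 | 8.0 | 8.3 | 8.0 | $2.78 | 8.7 | 15,597 | 784s |
| Sonnet 4.6 | 7.6 | 7.7 | 7.3 | 7.9 | 7.3 | $2.72 | 7.4 | 16,809 | 2,190s |
| Haiku 4.5 | 5.9 | 6.1 | 5.5 | 6.4 | 5.6 | $0.84 | 4.7 | 19,569 | 1,108s |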

Criteria. Quality: structure, rigor, and craft. Insight: original, non-obvious judgments a reader can act on. Relevance: grounded in this company's actual situation. Compliance: instruction adherence, no fabricated data, no harness artifacts. Composite is the mean of the four. Insight density is insight score per 1,000 words. The 0-10 score cells are shaded on a shared high-to-low scale (dark is strong, pale is weak); the cost, word, and latency columns are not. Costs are methodology token cost at provider list prices as of the run dates; see Appendix B.
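As a sanity check on the definition above, the composite can be recomputed as the plain mean of the four criterion scores. This is a sketch using the rounded figures from Table 1, so a recomputed mean may differ from the published composite by up to one rounding step; the tolerance below is an assumption made for that reason.

```python
from statistics import mean

# Criterion scores from Table 1: (quality, insight, relevance, compliance).
criteria = {
    "Fable 5":    (8.7, 8.6, 8.8, 8.3),
    "Opus 4.8":   (8.1, 8.0, 8.3, 8.0),
    "Sonnet 4.6": (7.7, 7.3, 7.9, 7.3),
    "Haiku 4.5":  (6.1, 5.5, 6.4, 5.6),
}
published = {"Fable 5": 8.6, "Opus 4.8": 8.1, "Sonnet 4.6": 7.6, "Haiku 4.5": 5.9}

TOLERANCE = 0.06  # assumed slack: inputs and outputs are both rounded to one decimal

for tier, scores in criteria.items():
    composite = mean(scores)
    agrees = abs(composite - published[tier]) <= TOLERANCE
    print(f"{tier}: mean of criteria = {composite:.2f}, published = {published[tier]}, agrees = {agrees}")
```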


Deeper analysis · by module

Where the model choice matters most

Composite score per methodology module, averaged across the four companies, in published-analysis flow: the Executive Summary first, then the Research, Proposition, and Business Model groups (the same order, and the same module labels, you see when you run an analysis in the app). The advantage of premium models concentrates in narrative synthesis and judgment-heavy strategy: executive summaries (+3.7 Fable over Haiku), pitches (+3.4), gap and moat analysis (+3.3). The gap is narrowest on mechanical modules like customer quotes (+1.7). Dark cells are strong; pale cells are weak.

Table 2 · Composite score by module and tier (0-10), in published-analysis flow
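| Group | Module | Fable 5 | Opus 4.8 | Sonnet 4.6 | Haiku 4.5 |
|---|---|---|---|---|---|
| Exec | Executive Summary | 8.8 | 8.0 | 7.1 | 5.1 |
| Research | Initial Framing | 8.1 | 8.2 | 7.5 | 6.1 |
| Research | Market Sizing & TAM | 8.8 | 7.9 | 6.6 | 5.7 |
| Research | Customer Profile | 8.6 | 8.3 | 7.6 | 5.6 |
| Research | Jobs To Be Done | 8.8 | 8.2 | 8.0 | 6.4 |
| Research | Competitive Field | 8.6 | 8.1 | 7.9 | 5.8 |
| Proposition | Positioning | 8.7 | 8.3 | 8.0 | 6.2 |
| Proposition | Elevator Pitches | 8.8 | 7.6 | 6.8 | 5.4 |
| Proposition | Customer Quotes | 8.6 | 8.1 | 8.3 | 6.8 |
| Proposition | Future Press Release | 8.6 | 8.1 | 7.6 | 5.5 |
| Proposition | Discovery Plan | 8.9 | 8.3 | 7.6 | 6.5 |
| Proposition | Gap Analysis | 8.8 | 7.9 | 7.5 | 5.4 |
| Business Model | Value Stack | 8.2 | 8.2 | 7.7 | 6.0 |
| Business Model | Moat Deep Dive | 8.8 | 8.2 | 7.8 | 5.5 |
| Business Model | Unit Economics | 8.1 | 8.0 | 7.0 | 6.2 |
| Business Model | Top Questions | 8.4 | 7.9 | 7.7 | 6.1 |
| Business Model | Additional Ideas | 8.9 | 8.4 | 7.7 | 5.6 |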

The narrow left column groups the modules into the analysis's three phases (Research, Proposition, Business Model), with the Executive Summary on its own at the top. The model gap is widest on the narrative, judgment-heavy modules (executive synthesis, pitches, gap and moat analysis) and narrowest on mechanical modules like customer quotes. For what each module does, see the 17-module methodology.
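The Fable-over-Haiku gap per module can be recomputed from Table 2 and ranked, as in the sketch below. It uses the rounded table values, so a recomputed gap can differ by 0.1 from a figure quoted in the text (customer quotes comes out at +1.8 here); the likely reason, which is an assumption, is that the quoted deltas were taken from unrounded scores.

```python
# (Fable 5, Haiku 4.5) composite scores per module, copied from Table 2.
module_scores = {
    "Executive Summary": (8.8, 5.1), "Initial Framing": (8.1, 6.1),
    "Market Sizing & TAM": (8.8, 5.7), "Customer Profile": (8.6, 5.6),
    "Jobs To Be Done": (8.8, 6.4), "Competitive Field": (8.6, 5.8),
    "Positioning": (8.7, 6.2), "Elevator Pitches": (8.8, 5.4),
    "Customer Quotes": (8.6, 6.8), "Future Press Release": (8.6, 5.5),
    "Discovery Plan": (8.9, 6.5), "Gap Analysis": (8.8, 5.4),
    "Value Stack": (8.2, 6.0), "Moat Deep Dive": (8.8, 5.5),
    "Unit Economics": (8.1, 6.2), "Top Questions": (8.4, 6.1),
    "Additional Ideas": (8.9, 5.6),
}

# Work in integer tenths so the differences carry no floating-point noise.
gaps = {m: round(f * 10) - round(h * 10) for m, (f, h) in module_scores.items()}

# Widest gap first.
ranked = sorted(gaps.items(), key=lambda item: item[1], reverse=True)
for module, tenths in ranked:
    print(f"{module:22s} +{tenths / 10:.1f}")
```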

What the four criteria reveal

Every tier scores highest on relevance and lowest on insight or compliance. Staying on topic is the easiest criterion at every price point (even Haiku's best score, 6.4, is on relevance); what money buys is depth of analysis and faithfulness to instructions. Insight has the widest spread of any criterion, 5.5 to 8.6, which makes it the single best discriminator between models. Compliance lags everywhere, including at the top: it is Fable's lowest score (8.3) and its smallest gain over Opus. A better model buys more reasoning than obedience, so output-format guardrails stay necessary at every price point.

Depth, not length

Word count does not track quality; at the extremes it runs inversely. Haiku wrote the longest analyses in the study (19,569 words) and scored the worst; Fable wrote 15% fewer words and scored 2.7 points higher. Insight density, insight points per 1,000 words, nearly doubles up the table, from 4.7 (Haiku) to 8.7 (Fable). Cheaper models do not produce less analysis; they produce more text containing less analysis. One nuance: Fable is not denser than Opus. Their densities are identical to one decimal place (8.7 each); Fable's higher insight total comes from sustaining Opus-level density across about 7% more material, not from tighter writing.
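The relative density claims can be checked from Table 1 with a short sketch. The raw quotient (insight score divided by thousands of words) lands on a different scale from the published Insight Density column, which appears to be rescaled in a way the page does not state, so this example compares tiers by ratio only and makes no attempt to reproduce the published column.

```python
# (insight score, words per analysis) from Table 1.
tiers = {
    "Fable 5":    (8.6, 16_678),
    "Opus 4.8":   (8.0, 15_597),
    "Sonnet 4.6": (7.3, 16_809),
    "Haiku 4.5":  (5.5, 19_569),
}

# Raw density: insight points per 1,000 words (unscaled).
raw = {tier: insight / (words / 1000) for tier, (insight, words) in tiers.items()}
densest = max(raw.values())
for tier, density in raw.items():
    print(f"{tier}: {density:.3f} raw, {density / densest:.0%} of the densest tier")

fable_words, opus_words, haiku_words = 16_678, 15_597, 19_569
fewer_than_haiku = 1 - fable_words / haiku_words  # Fable's word savings vs Haiku
more_than_opus = fable_words / opus_words - 1     # Fable's extra material vs Opus
print(f"Fable wrote {fewer_than_haiku:.0%} fewer words than Haiku")
print(f"Fable wrote {more_than_opus:.0%} more words than Opus")
```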

What an extra dollar buys

Composite points gained per extra dollar versus the next-cheaper model. One step is nearly free and dominates the ladder: moving from Sonnet to Opus buys +0.5 for six cents. The Opus-to-Fable step buys another +0.5 but costs four more dollars, so it earns its premium only when the decision is big enough to warrant the ceiling.

Table 3 · Marginal value of each model step up (composite points per extra $1)
| Model Step Up | Δ Composite | Δ Cost | Pts per extra $1 |
|---|---|---|---|
| Haiku 4.5 to Sonnet 4.6 | +1.7 | +$1.88 | 0.9 |
| Sonnet 4.6 to Opus 4.8 | +0.5 | +$0.06 | 9.0 |
| Opus 4.8 to Fable 5 | +0.5 | +$4.11 | 0.1 |
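The marginal-value column is the composite gain divided by the extra cost of each step. A minimal sketch from the rounded composites and costs follows; with rounded inputs the Sonnet-to-Opus step comes out near 8.3 instead of the published 9.0, presumably (an assumption) because the published figure used unrounded composites. The conclusion is unchanged: that step is far cheaper per point than the others.

```python
# (tier, composite, cost per analysis in USD), cheapest first, from Table 1.
ladder = [
    ("Haiku 4.5", 5.9, 0.84),
    ("Sonnet 4.6", 7.6, 2.72),
    ("Opus 4.8", 8.1, 2.78),
    ("Fable 5", 8.6, 6.89),
]

steps = []
# Pair each tier with the next one up the ladder.
for (low, low_score, low_cost), (high, high_score, high_cost) in zip(ladder, ladder[1:]):
    delta_score = high_score - low_score
    delta_cost = high_cost - low_cost
    steps.append((f"{low} to {high}", delta_score, delta_cost, delta_score / delta_cost))

for label, delta_score, delta_cost, per_dollar in steps:
    print(f"{label}: +{delta_score:.1f} pts for +${delta_cost:.2f} = {per_dollar:.1f} pts per extra $1")
```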

What the gap looks like in practice

Two pairs of verbatim Tesla Optimus excerpts. Within each pair the module and company are the same; only the tier differs. The first pair shows the gap at its widest; the second shows where the extra spend buys nothing.

Tesla Optimus · Executive Summary · revenue forecast · same module, same company

Haiku 4.5 · 5.1
"SOM: est $200M-$2B over 12-24 months (contingent on FDA clarity and pilot proof). Year 1 base case ARR (50 facilities, 80 households): est $78M. Conservative scenario: $39M. Optimistic scenario: $159M. […] Implied valuation at $78M ARR: est $600M-1.2B."
$0.84 per analysis

Fable 5 · 8.8
"SOM (12-24 months): est $0-50M; the planning number is zero revenue. […] Revenue ramp: Year 1 (2028) est $0-50M. All speculative; kill criteria are conversion and incident rates, not revenue."
$6.89 per analysis

Both paragraphs forecast revenue for a robot that does not yet ship. One invents a $78M base case; one commits to a planning number of zero and names the evidence that would change it. That is the difference between a 5.1 and an 8.8.

Tesla Optimus · Moat Analysis · defensibility read · same module, same company

Opus 4.8 · 8.6
"Critically, the one Power that would be durable as hardware commoditizes, Cornered Resource (an owned home-safety certification and insurance product), scores 1 because it does not yet exist. Tesla's current posture is 'strong on the layers that will erode, absent on the layer that will not.'"
$2.78 per analysis

Fable 5 · 8.5
"As scoped today, this proposition is structurally undefensible; the entire strategic case rests on converting manufacturing scale into a trust-layer Power (Process Power plus Branding) before hardware and autonomy commoditize."
$6.89 per analysis

On this module the judges scored Opus 8.6 and Fable 8.5: a coin flip, at a $4.11 price difference. On some modules the extra spend buys nothing; the by-module table above shows you which.

Why these results can be trusted

The two judges, one Anthropic and one OpenAI, produced identical tier rankings (Spearman rank correlation 1.00). On Fable specifically the result is unusually well corroborated: the Anthropic judge scored it 8.60 and gpt-5.5 scored it 8.61, and 8.61 was gpt-5.5's highest mark for any configuration. A competing vendor's model, judging blind, independently rated Fable the best configuration in the study, which removes the obvious objection that an Anthropic-judged benchmark flatters an Anthropic flagship. The judges differed only in harshness toward weak outputs, never in ranking: the better the output, the less it mattered who graded it.


Conclusions

Spend for the best on the decisions that matter

On high-stakes work, the price gap between tiers is negligible. Strategy and proposition refinement are high-leverage, make-or-break activities. Set the cost of a single analysis against what the decision actually consumes: leadership and employee time, payroll, software licenses, external research, consulting, and go-to-market spend, frequently hundreds of thousands to millions of dollars. Against that, the roughly $4 difference between a $2.78 and a $6.89 analysis disappears. The rational move is to buy the best model your access allows for any decision that matters, and reserve the cheap tiers for low-stakes drafting. Opus is the value default; Fable, where available, is worth its premium when the single decision is big enough to warrant the ceiling. Haiku is not worth the savings for business-critical strategic decisions.

Draft framing for review; the numbers above are from the study, the interpretation is editorial.


Appendix A

Every analysis, by tier and company

All 16 full analyses (four companies across four models) are published unedited in the standard SeanPropApp viewer, each shown with its blended composite score and cost. Browse and open any of them from the Results Matrix, or compare a single proposition module by module across all four models in the Tesla deep dive.


Appendix B

Method, criteria, and scope

Blind, dual-judge, re-randomized. Each module's four outputs (one per model) were anonymized, re-shuffled for every module, and scored 0-10 on quality, insight, relevance, and compliance by two independent judges from different labs: a Claude model and gpt-5.5. The published score is the mean of the two judges' composites. Rank correlation between judges was 1.00 (identical ordering); the mean per-model difference between the two judges' scores was 0.39, tightest at the top of the table.
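For readers who want the blending and agreement statistics spelled out, here is a minimal sketch. The per-judge scores below are invented placeholders for four anonymized tiers, not study data (only the blended composites are published on this page), and the function names are illustrative. Spearman's rho is computed with the no-ties formula 1 - 6Σd² / (n(n² - 1)).

```python
def ranks(scores):
    """Rank items 1..n by descending score (assumes no tied scores)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


def spearman(a, b):
    """Spearman rank correlation via the no-ties shortcut formula."""
    rank_a, rank_b = ranks(a), ranks(b)
    n = len(a)
    d_squared = sum((rank_a[k] - rank_b[k]) ** 2 for k in a)
    return 1 - 6 * d_squared / (n * (n * n - 1))


# HYPOTHETICAL per-judge composites (placeholders, not the study's scores).
judge_1 = {"tier_a": 8.5, "tier_b": 8.0, "tier_c": 7.4, "tier_d": 6.2}
judge_2 = {"tier_a": 8.4, "tier_b": 7.9, "tier_c": 7.0, "tier_d": 5.4}

# Published-style score: the mean of the two judges' composites.
blended = {k: (judge_1[k] + judge_2[k]) / 2 for k in judge_1}
rho = spearman(judge_1, judge_2)
mean_gap = sum(abs(judge_1[k] - judge_2[k]) for k in judge_1) / len(judge_1)

print(f"rank correlation: {rho:.2f}, mean per-tier judge difference: {mean_gap:.2f}")
```

In this toy data the second judge is harsher on the weaker tiers yet orders all four identically, which is the pattern the page describes: rank correlation stays at 1.00 even though the absolute scores differ.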

What we measured, and what we did not

This benchmark tests one thing: structured business-proposition analysis (17 strategy modules, four companies). It is a domain benchmark, not a general capability ranking. Fable 5's largest advances are in domains we did not test (advanced coding, agentic and tool use, long-context technical reasoning, math); this study is silent on those. The takeaway is scoped accordingly: for this kind of analysis, the quality gap to Opus is small and the cost gap is large.

When, and at what cost

Haiku 4.5 and Sonnet 4.6 ran on 2026-05-28; Opus 4.8 on 2026-05-29; and Fable 5 across 2026-06-10 to 06-11. All 272 outputs were scored on 2026-06-11, the day before Fable's 2026-06-12 suspension. Costs are the methodology's token cost per full 17-module analysis at provider list prices as of the run dates, with a uniform roughly 2x thinking-token inflation applied across all tiers, so ratios between tiers are sound and absolute dollar figures should be read as approximate. Methodology snapshot: Production v2.1.0. This page is a snapshot of that moment; model versions, capabilities, and prices all move.

Limitations

One run per configuration, four companies, one methodology version. Per-module figures average only four data points per tier, so module-level deltas under about 0.3 should be treated as ties. Judges agree perfectly on ranking but differ in harshness; the OpenAI judge scored low-end outputs harder. AI judges, however independent, are not a human expert panel.
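Applying that tie rule to the Fable and Opus columns of Table 2 shows which modules are too close to call. This is a sketch under one assumption: "under about 0.3" is read as a strict threshold on the rounded table values, so a gap of exactly 0.3 does not count as a tie.

```python
TIE_THRESHOLD_TENTHS = 3  # assumed strict cut-off: gaps under 0.3 are ties

# (Fable 5, Opus 4.8) composite scores per module, copied from Table 2.
fable_vs_opus = {
    "Executive Summary": (8.8, 8.0), "Initial Framing": (8.1, 8.2),
    "Market Sizing & TAM": (8.8, 7.9), "Customer Profile": (8.6, 8.3),
    "Jobs To Be Done": (8.8, 8.2), "Competitive Field": (8.6, 8.1),
    "Positioning": (8.7, 8.3), "Elevator Pitches": (8.8, 7.6),
    "Customer Quotes": (8.6, 8.1), "Future Press Release": (8.6, 8.1),
    "Discovery Plan": (8.9, 8.3), "Gap Analysis": (8.8, 7.9),
    "Value Stack": (8.2, 8.2), "Moat Deep Dive": (8.8, 8.2),
    "Unit Economics": (8.1, 8.0), "Top Questions": (8.4, 7.9),
    "Additional Ideas": (8.9, 8.4),
}

# Compare in integer tenths to avoid floating-point noise at the threshold.
ties = [
    module
    for module, (fable, opus) in fable_vs_opus.items()
    if abs(round(fable * 10) - round(opus * 10)) < TIE_THRESHOLD_TENTHS
]
print("Modules where Fable and Opus are tied:", ties)
```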

Run this analysis on your own proposition.
Bring your own Claude or ChatGPT. Your content never leaves your browser.