SeanPropApp benchmarks

What long-form AI analysis actually costs, and what you give up to make it cheaper.

Two controlled studies on chained, multi-step LLM work: what each model tier is worth per dollar, and how far you can compress carried context before the thinking degrades. Every output, score and correction is published in full; details are below.

Study 01 · Model tiers

The strongest model is not the rational default.

Claude Opus 4.8 reached 94% of the quality ceiling (the score of the top model, Claude Fable 5) at 40% of its cost. The premium tier wins on judgment-heavy work, and the gap is depth, not length.

Four models, four companies, blind-scored by two independent judges that produced the identical ranking (rank correlation 1.00).
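A rank correlation of 1.00 means the two judges put the models in exactly the same order, even if their raw scores differ. The sketch below illustrates this with a Spearman-style calculation. The page does not name the statistic used, so the choice of Spearman is an assumption, and the model names and judge scores here are hypothetical placeholders, not the study's data.

```python
def ranks(scores):
    """Map each name to its rank (1 = highest score). Assumes no tied scores."""
    order = sorted(scores, key=scores.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(order)}


def spearman_rho(scores_a, scores_b):
    """Spearman rank correlation for two score tables without ties."""
    rank_a, rank_b = ranks(scores_a), ranks(scores_b)
    n = len(rank_a)
    squared_diffs = sum((rank_a[name] - rank_b[name]) ** 2 for name in rank_a)
    return 1 - 6 * squared_diffs / (n * (n * n - 1))


# Hypothetical scores: the judges disagree on values but agree on the order.
judge_a = {"model_w": 8.7, "model_x": 8.0, "model_y": 7.4, "model_z": 5.9}
judge_b = {"model_w": 8.5, "model_x": 8.2, "model_y": 7.7, "model_z": 5.8}

rho = spearman_rho(judge_a, judge_b)
print(rho)  # 1.0, because every rank difference is zero
```

With only four items, a perfect rank correlation is easier to reach than with a long list, so the per-module verdicts in the open data are the more demanding check.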

Claude Fable 5: 8.60 · $6.89
Claude Opus 4.8: 8.10 · $2.78
Claude Sonnet 4.6: 7.56 · $2.72
Claude Haiku 4.5: 5.89 · $0.84
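The "94% of the quality ceiling at 40% of the cost" figure can be checked from the four score and cost pairs above. This is a minimal sketch, assuming the ceiling is simply the top-scoring model and that the dollar figures are comparable across models; the function name is illustrative and is not taken from the published analysis script.

```python
# Score and cost pairs as published in the Study 01 summary.
results = {
    "Claude Fable 5": (8.60, 6.89),
    "Claude Opus 4.8": (8.10, 2.78),
    "Claude Sonnet 4.6": (7.56, 2.72),
    "Claude Haiku 4.5": (5.89, 0.84),
}


def relative_to_ceiling(table):
    """Express each model's score and cost as a share of the top-scoring model's."""
    top_score, top_cost = max(table.values(), key=lambda pair: pair[0])
    return {
        name: (score / top_score, cost / top_cost)
        for name, (score, cost) in table.items()
    }


shares = relative_to_ceiling(results)
for name, (quality_share, cost_share) in shares.items():
    print(f"{name}: {quality_share:.0%} of top score at {cost_share:.0%} of top cost")
```

For Claude Opus 4.8 this gives 8.10 / 8.60 ≈ 0.94 and 2.78 / 6.89 ≈ 0.40, matching the headline numbers.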

Read the full study

Study 02 · Token efficiency

You can cut up to a third of the tokens. The thinking survives.

Compressing carried-forward context saved 15 to 31% of tokens with no measurable loss of insight. Writing quality is where the cost shows up, and only on some strategies and some models.

The same 18-module workload, four context strategies, every claim at 95% confidence.

Core-Kept: no measured loss
Dependency-Aware: quality −0.20
Compact All: inconclusive
Claude Sonnet 5: loses quality on all 3

Read the full study

One finding neither study set out to look for

Run the identical analysis three times, on the same model version, with the same inputs, and you do not get the same quality of thinking. The chart below is the spread of those repeat runs. The strongest model is also the steadiest, by nearly three times.

How much the same model varies between identical runs

Standard deviation of the Insight score (judged 0 to 10) on the first module, across 48 repeat runs per model. The first module is the fair test: it receives almost no carried context, so all four context strategies are effectively the same condition there, and what is left is the model's own run-to-run variation. Lower is better. A high value means the single analysis you actually commission may land well below what the model averages.
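As a rough illustration of the measure described above, the sketch below computes the standard deviation of repeat-run Insight scores for two models. The scores are hypothetical placeholders, not values from the study, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, since the page does not say whether the sample or population form was used.

```python
from statistics import mean, stdev


def run_to_run_spread(scores):
    """Sample standard deviation of one model's scores across repeat runs."""
    return stdev(scores)


# Hypothetical first-module Insight scores (0 to 10) from repeat runs.
steady_model = [8.0, 8.5, 8.0, 8.5, 8.0, 8.5]
erratic_model = [6.0, 9.5, 7.0, 9.0, 6.5, 9.0]

for label, scores in [("steady", steady_model), ("erratic", erratic_model)]:
    spread = run_to_run_spread(scores)
    # A single commissioned run can plausibly land about one spread below the mean.
    print(f"{label}: mean {mean(scores):.2f}, spread {spread:.2f}, "
          f"mean minus one spread {mean(scores) - spread:.2f}")
```

Two models with similar averages can therefore carry very different risk for a buyer who only runs the analysis once.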

Ordered strongest model first. Moving from Claude Sonnet 4.6 to Sonnet 5 raised mean quality and also raised the spread, from 1.38 to 1.67: newer, better on average, less repeatable. One version step is not a law, which is why the fixed inputs are published for re-running against future models.

Open data

Every module output, judge verdict, token count and drift audit behind these studies is published: 8,065 files across four model versions, three replicate runs each.

github.com/seanomich/seanpropapp-benchmarks · corrections recorded in the open · the analysis script reproduces the published figures byte for byte
