Essay · measured data

My agents' most confident claims were their least reliable

We re-scored 58 production agent claims objectively after merge. Hold-rate falls as stated confidence rises — and collapses in the top bin, exactly where you'd most want to trust it.

Every number below is re-derivable from a committed ledger. Bins, curves, generator: report/calibration-curves

The setup

For months, every change our coding agents shipped carried an improvement receipt: what changed, the shell command that proves it, and a stated confidence. After a human merges, an independent scorer re-runs the receipt's own measure command. Pass and the claim held; fail and the miss is recorded against the agent's calibration, forever. No author override, no self-scoring — the rules are an open spec, and they were stress-tested by our own agents attacking the ledger.

That leaves a dataset almost nobody has: (stated confidence, objectively measured outcome) pairs for real agent work in a real repository. 58 measured claims. Here is the curve.

The curve inverts

100% 50% 0% 50% · n=2 86% · n=29 83% · n=23 33% · n=3 0.80–0.85 0.85–0.90 0.90–0.95 0.95–1.00 stated confidence → measured hold-rate
Hold-rate by stated-confidence bin, 58 objectively measured claims. Perfect calibration would rise left to right. Ours falls — then collapses.
Stated confidencenHeldHold-rate
0.80–0.852150%
0.85–0.90292586.2%
0.90–0.95231982.6%
0.95–1.003133.3%

A calibrated agent's curve rises with confidence: claims stated at 0.95 should hold about 95% of the time. Ours goes the other way. In the band where the agents were most certain — the claims a reviewer is most tempted to wave through — two out of three failed objective re-execution.

Small-n honesty: the top bin is n=3. That is a hypothesis-sized sample, not a finding-sized one, and we report it with its n attached rather than hiding it in an aggregate. The 0.85–0.95 range (n=52) is the load-bearing data; the collapse at 0.95+ is the pattern we're now watching accumulate — and asking others to test against their own ledgers.

Why would confidence invert?

Neural networks are systematically overconfident — that's been documented since Guo et al. (2017). But overconfidence alone predicts a curve that's too flat, not one that inverts. Two things in our data point at something more specific:

The second point has an immediate policy payoff: measure-command complexity is readable before merge. You don't have to wait for the ledger to tell you a claim was bad — a receipt with a sprawling measure block and 0.97 confidence is the highest-risk artifact in the queue, and your review policy can say so.

What this means if you run agents

Most teams reviewing agent work use stated confidence as a triage signal — skim the confident ones, scrutinize the hedged ones. Our data says that heuristic is not just weak but backwards at the top of the range. Three changes follow:

Run it on your own agents

The machinery is open source and needs no server: receipts are files in your repo, the ledger is a JSONL, the scorer is a CLI, the gate is a GitHub Action.

pip install signalbrain — spec, scorer, and the full reproducible analysis at github.com/whitestone1121-web/signalbrain. If your curve inverts too — or doesn't — post your bins. n=58 is where this stops being one deployment's anecdote.

Get the tooling