Essay · measured data

My agents' most confident claims were their least reliable

We re-scored 58 production agent claims objectively after merge. Hold-rate falls as stated confidence rises — and collapses in the top bin, exactly where you'd most want to trust it.

Every number below is re-derivable from a committed ledger. Bins, curves, generator: report/calibration-curves

The setup

For months, every change our coding agents shipped carried an improvement receipt: what changed, the shell command that proves it, and a stated confidence. After a human merges, an independent scorer re-runs the receipt's own measure command. Pass and the claim held; fail and the miss is recorded against the agent's calibration, forever. No author override, no self-scoring — the rules are an open spec, and they were stress-tested by our own agents attacking the ledger.
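To make the loop concrete, here is a minimal sketch of a receipt and an independent scorer. The field names (`claim`, `measure`, `confidence`), the `score` function, and the ledger entry shape are hypothetical illustrations, not the published spec; the only thing taken from the setup above is the rule that a claim holds when its own measure command passes on re-execution.

```python
import json
import subprocess
import sys
from dataclasses import dataclass, asdict


@dataclass
class Receipt:
    # Hypothetical field names, chosen for illustration only.
    claim: str          # what changed
    measure: list[str]  # shell command(s) that prove it
    confidence: float   # stated confidence in [0, 1]


def score(receipt: Receipt, timeout_s: int = 300) -> dict:
    """Re-run the receipt's own measure commands.

    Assumption: the claim holds only if there is at least one command
    and every command exits 0 within the timeout.
    """
    held = bool(receipt.measure)
    for cmd in receipt.measure:
        try:
            result = subprocess.run(
                cmd, shell=True, capture_output=True, timeout=timeout_s
            )
            ok = result.returncode == 0
        except subprocess.TimeoutExpired:
            ok = False
        if not ok:
            held = False
            break
    return {**asdict(receipt), "held": held}


def append_to_ledger(path: str, entry: dict) -> None:
    """Append one scored claim as a single JSON line (append-only ledger)."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")


# Demo: a trivially true measure command, run with the current interpreter.
demo = Receipt(
    claim="arithmetic still works",
    measure=[f'"{sys.executable}" -c "assert 1 + 1 == 2"'],
    confidence=0.90,
)
print(score(demo)["held"])
```

The scorer never reads the agent's prose, only the exit codes, which is what keeps the outcome independent of the author.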

That leaves a dataset almost nobody has: (stated confidence, objectively measured outcome) pairs for real agent work in a real repository. 58 measured claims. Here is the curve.

The curve inverts

Hold-rate by stated-confidence bin, 58 objectively measured claims. Perfect calibration would rise left to right. Above 0.85, where nearly all the data sits, ours falls, then collapses.

Stated confidence	n	Held	Hold-rate
0.80–0.85	2	1	50%
0.85–0.90	29	25	86.2%
0.90–0.95	23	19	82.6%
0.95–1.00	3	1	33.3%

A calibrated agent's curve rises with confidence: claims stated at 0.95 should hold about 95% of the time. Ours goes the other way. In the band where the agents were most certain — the claims a reviewer is most tempted to wave through — two out of three failed objective re-execution.
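For readers who want to reproduce the table from their own (stated confidence, held) pairs, the binning can be sketched as below. The bin edges match the table; treating each bin as half-open with the top bin closed at 1.00 is our assumption here, and the sample pairs are synthesized at bin midpoints from the table's counts because the individual confidences are not listed in this essay.

```python
from collections import defaultdict

BIN_EDGES = [0.80, 0.85, 0.90, 0.95, 1.00]


def bin_label(conf: float):
    """Return the bin label for a stated confidence, or None if out of range.

    Assumption: bins are [lo, hi), except the top bin which includes 1.00.
    """
    for lo, hi in zip(BIN_EDGES, BIN_EDGES[1:]):
        is_top = hi == BIN_EDGES[-1]
        if lo <= conf < hi or (is_top and conf == hi):
            return f"{lo:.2f}-{hi:.2f}"
    return None


def hold_rates(pairs):
    """Map bin label -> (n, held, hold_rate) from (confidence, held) pairs."""
    counts = defaultdict(lambda: [0, 0])  # label -> [n, held]
    for conf, held in pairs:
        label = bin_label(conf)
        if label is None:
            continue  # outside the binned range
        counts[label][0] += 1
        counts[label][1] += int(held)
    return {label: (n, h, h / n) for label, (n, h) in sorted(counts.items())}


def synth(midpoint, n, held):
    """Synthesize n pairs at a bin midpoint, `held` of which passed."""
    return [(midpoint, True)] * held + [(midpoint, False)] * (n - held)


# Counts taken from the table above; midpoints are placeholders.
pairs = (
    synth(0.825, 2, 1)
    + synth(0.875, 29, 25)
    + synth(0.925, 23, 19)
    + synth(0.975, 3, 1)
)

for label, (n, h, rate) in hold_rates(pairs).items():
    print(f"{label}\t{n}\t{h}\t{rate:.1%}")
```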

Small-n honesty: the top bin is n=3. That is a hypothesis-sized sample, not a finding-sized one, and we report it with its n attached rather than hiding it in an aggregate. The 0.85–0.95 range (n=52) is the load-bearing data; the collapse at 0.95+ is the pattern we're now watching accumulate — and asking others to test against their own ledgers.
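One way to put a number on "hypothesis-sized" is a confidence interval on each bin's hold-rate. The sketch below uses the Wilson score interval, which behaves sensibly at very small n; choosing Wilson and a 95% level (z = 1.96) is our assumption for illustration, not part of the original analysis.

```python
import math


def wilson_interval(held: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion held/n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = held / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Top bin from the table: 1 held out of 3.
lo, hi = wilson_interval(1, 3)
print(f"0.95-1.00: {lo:.1%} to {hi:.1%}")

# The 0.85-0.95 range from the table: 25 + 19 held out of 29 + 23.
lo, hi = wilson_interval(44, 52)
print(f"0.85-0.95: {lo:.1%} to {hi:.1%}")
```

For the top bin the interval runs from roughly 6% to 79%, which is why the collapse is reported as a pattern to watch rather than a result.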

Why would confidence invert?

Neural networks are systematically overconfident — that's been documented since Guo et al. (2017). But overconfidence alone predicts a curve that's too flat, not one that inverts. Two things in our data point at something more specific:

Extreme confidence correlates with wanting the claim accepted, not with evidence quality. The 0.95+ claims in our ledger are also where measure commands get vaguest. The agent isn't reporting the strength of its verification — it's performing certainty at exactly the moments it has the least to show.
Complexity predicts failure, and confident claims are complex claims. Receipts whose measure is a single command hold 92.9% (n=28). Two or more commands: 73.9% (n=23). Claims where the receipt went missing entirely: 33.3% (n=6). The more elaborate the story, the less likely it survives re-execution.

The second point has an immediate policy payoff: measure-command complexity is readable before merge. You don't have to wait for the ledger to tell you a claim was bad — a receipt with a sprawling measure block and 0.97 confidence is the highest-risk artifact in the queue, and your review policy can say so.
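A pre-merge check along these lines could look like the sketch below. The tier names, the 0.95 confidence threshold (borrowed from the top bin), and the two-command cutoff (borrowed from the single-versus-multiple split above) are illustrative assumptions, not our production policy.

```python
def review_tier(confidence: float, measure_commands: list[str],
                high_conf: float = 0.95) -> str:
    """Classify a receipt for review using only pre-merge information.

    Hypothetical tiers, from most to least scrutiny:
    "block", "scrutinize-first", "full-review", "standard".
    """
    commands = [c for c in measure_commands if c.strip()]
    if not commands:
        return "block"  # no measure command: nothing can be re-run
    complex_measure = len(commands) >= 2
    very_confident = confidence >= high_conf
    if complex_measure and very_confident:
        return "scrutinize-first"  # sprawling measure block plus high confidence
    if complex_measure or very_confident:
        return "full-review"
    return "standard"


print(review_tier(0.97, ["make test", "./bench.sh", "grep -c PASS out.log"]))
print(review_tier(0.88, ["pytest -q"]))
```

The point of the design is that every input is available when the receipt is filed, so the queue can be ordered before any ledger outcome exists.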

What this means if you run agents

Most teams reviewing agent work use stated confidence as a triage signal — skim the confident ones, scrutinize the hedged ones. Our data says that heuristic is not just weak but backwards at the top of the range. Three changes follow:

Never consume self-reported confidence raw. Map it through the agent's own measured track record, per change-class. An agent whose "0.95" empirically means "0.6" should be gated like a 0.6.
Make claims executable. A confidence number attached to prose is unfalsifiable. Attached to a shell command that gets re-run after merge, it becomes a calibration datapoint — and the agent knows it will.
Give calibration consequences. Our agents' curves are the input to an autonomy gate: ten held high-confidence claims in a class earn auto-merge eligibility there; misses revoke it. Once bravado costs standing, stated confidence starts meaning something.
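The first and third changes can be sketched together. Everything below is a simplified illustration under stated assumptions: `recalibrate` replaces a stated confidence with the measured hold-rate of its bin and refuses to answer when the bin has too little data, and `auto_merge_eligible` reads "ten held, misses revoke" as a streak that any miss resets. The `min_n` default and the 0.90 cutoff for "high-confidence" are hypothetical; the bin counts are the ones from the table above.

```python
# (lo, hi, n, held) per bin, for one agent and one change-class.
BINS = [
    (0.80, 0.85, 2, 1),
    (0.85, 0.90, 29, 25),
    (0.90, 0.95, 23, 19),
    (0.95, 1.00, 3, 1),
]


def recalibrate(stated: float, bins=BINS, min_n: int = 10):
    """Map stated confidence to the measured hold-rate of its bin.

    Returns None when the bin is missing or has fewer than min_n claims,
    meaning "no track record yet, do not trust the number at all".
    """
    for lo, hi, n, held in bins:
        if lo <= stated < hi or (stated == hi == 1.0):
            return held / n if n >= min_n else None
    return None


def auto_merge_eligible(history, high_conf: float = 0.90,
                        required: int = 10) -> bool:
    """history: chronological (stated_confidence, held) pairs for one class.

    Assumption: any miss resets the streak; only held claims stated at or
    above high_conf count toward it.
    """
    streak = 0
    for conf, held in history:
        if not held:
            streak = 0
        elif conf >= high_conf:
            streak += 1
    return streak >= required


print(recalibrate(0.88))            # measured rate for the 0.85-0.90 bin
print(recalibrate(0.97))            # None: n=3 is below min_n
print(recalibrate(0.97, min_n=1))   # the raw 1-in-3 rate, if you insist

earned = [(0.92, True)] * 10
print(auto_merge_eligible(earned))
print(auto_merge_eligible(earned + [(0.96, False)]))
```

Returning `None` for thin bins is deliberate: a gate that falls back to human review is safer than one that treats three datapoints as a rate.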

Run it on your own agents

The machinery is open source and needs no server: receipts are files in your repo, the ledger is a JSONL, the scorer is a CLI, the gate is a GitHub Action.

pip install signalbrain — spec, scorer, and the full reproducible analysis at github.com/whitestone1121-web/signalbrain. If your curve inverts too — or doesn't — post your bins. n=58 is where this stops being one deployment's anecdote.
