Essay · measured data
My agents' most confident claims were their least reliable
We re-scored 58 production agent claims objectively after merge. Above 0.85, hold-rate falls as stated confidence rises, and it collapses in the top bin, exactly where you'd most want to trust it.
Every number below is re-derivable from a committed ledger. Bins, curves, generator: report/calibration-curves
The setup
For months, every change our coding agents shipped carried an improvement receipt: what changed, the shell command that proves it, and a stated confidence. After a human merges, an independent scorer re-runs the receipt's own measure command. Pass and the claim held; fail and the miss is recorded against the agent's calibration, forever. No author override, no self-scoring — the rules are an open spec, and they were stress-tested by our own agents attacking the ledger.
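A minimal sketch of that loop, to make the moving parts concrete. The `Receipt` fields and the `score` function are hypothetical names chosen for illustration, not the open spec's schema or the scorer's real interface, and treating exit code 0 as "held" is an assumption.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class Receipt:
    """Hypothetical receipt shape; field names are illustrative, not the spec's."""
    claim: str          # what changed
    measure: str        # the shell command that proves it
    confidence: float   # stated confidence in [0, 1]


def score(receipt: Receipt, timeout_s: int = 300) -> dict:
    """Re-run the receipt's own measure command and record whether it held.

    Assumption: exit code 0 means the claim held, anything else is a miss.
    """
    try:
        result = subprocess.run(
            receipt.measure, shell=True, capture_output=True, timeout=timeout_s
        )
        held = result.returncode == 0
    except subprocess.TimeoutExpired:
        held = False  # assumption: a measure that never finishes counts as a miss
    return {"claim": receipt.claim, "confidence": receipt.confidence, "held": held}
```

The point of the shape is that the author supplies the command but never the verdict: the verdict comes only from re-execution.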
That leaves a dataset almost nobody has: (stated confidence, objectively measured outcome) pairs for real agent work in a real repository. 58 measured claims. Here is the curve.
The curve inverts
| Stated confidence | n | Held | Hold-rate |
|---|---|---|---|
| 0.80–0.85 | 2 | 1 | 50% |
| 0.85–0.90 | 29 | 25 | 86.2% |
| 0.90–0.95 | 23 | 19 | 82.6% |
| 0.95–1.00 | 3 | 1 | 33.3% |
A calibrated agent's curve rises with confidence: claims stated at 0.95 should hold about 95% of the time. Ours goes the other way. In the band where the agents were most certain — the claims a reviewer is most tempted to wave through — two out of three failed objective re-execution.
Small-n honesty: the top bin is n=3. That is a hypothesis-sized sample, not a finding-sized one, and we report it with its n attached rather than hiding it in an aggregate. The 0.85–0.95 range (n=52) is the load-bearing data; the collapse at 0.95+ is the pattern we're now watching accumulate — and asking others to test against their own ledgers.
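One way to put a number on "hypothesis-sized" is an interval around each bin's hold-rate. The sketch below uses the Wilson score interval, which is our choice for illustration rather than anything the ledger tooling is known to compute; the counts are copied from the table above.

```python
import math

# (bin label, n, held) copied from the table above
BINS = [
    ("0.80-0.85", 2, 1),
    ("0.85-0.90", 29, 25),
    ("0.90-0.95", 23, 19),
    ("0.95-1.00", 3, 1),
]


def wilson_interval(held: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = held / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


for label, n, held in BINS:
    lo, hi = wilson_interval(held, n)
    print(f"{label}  n={n:<2}  hold-rate={held / n:6.1%}  95% interval=[{lo:.0%}, {hi:.0%}]")
```

For the top bin (1 of 3) the interval runs from roughly 6% to 79%, wide enough to overlap the intervals of the two middle bins. That is what a pattern still accumulating looks like, as opposed to a settled finding.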
Why would confidence invert?
Neural networks are systematically overconfident — that's been documented since Guo et al. (2017). But overconfidence alone predicts a curve that's too flat, not one that inverts. Two things in our data point at something more specific:
- Extreme confidence correlates with wanting the claim accepted, not with evidence quality. The 0.95+ claims in our ledger are also where measure commands get vaguest. The agent isn't reporting the strength of its verification — it's performing certainty at exactly the moments it has the least to show.
- Complexity predicts failure, and confident claims are complex claims. Receipts whose measure is a single command hold 92.9% (n=28). Two or more commands: 73.9% (n=23). Claims where the receipt went missing entirely: 33.3% (n=6). The more elaborate the story, the less likely it survives re-execution.
The second point has an immediate policy payoff: measure-command complexity is readable before merge. You don't have to wait for the ledger to tell you a claim was bad — a receipt with a sprawling measure block and 0.97 confidence is the highest-risk artifact in the queue, and your review policy can say so.
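As a sketch of what such a policy could look like before merge: the function names, tier labels, and thresholds below are hypothetical and not fitted to the ledger, and counting non-empty lines is only a crude proxy for "number of commands".

```python
def count_measure_commands(measure_block: str) -> int:
    """Count non-empty, non-comment lines as a crude proxy for command count."""
    return sum(
        1
        for line in measure_block.splitlines()
        if line.strip() and not line.strip().startswith("#")
    )


def review_tier(confidence: float, measure_block: str) -> str:
    """Hypothetical triage rule; the thresholds are illustrative, not fitted."""
    commands = count_measure_commands(measure_block)
    if commands == 0:
        return "block"         # no executable measure: nothing to re-run
    if commands >= 2 and confidence >= 0.95:
        return "full-review"   # sprawling measure plus extreme confidence
    if commands >= 2:
        return "close-review"
    return "standard"
```

For example, `review_tier(0.97, "pytest -q\nruff check .")` lands in the strictest tier, while a single-command receipt at 0.88 gets standard review. A real policy would tune the cut-offs against its own ledger.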
What this means if you run agents
Most teams reviewing agent work use stated confidence as a triage signal — skim the confident ones, scrutinize the hedged ones. Our data says that heuristic is not just weak but backwards at the top of the range. Three changes follow:
- Never consume self-reported confidence raw. Map it through the agent's own measured track record, per change-class. An agent whose "0.95" empirically means "0.6" should be gated like a 0.6.
- Make claims executable. A confidence number attached to prose is unfalsifiable. Attached to a shell command that gets re-run after merge, it becomes a calibration datapoint, and the agent knows the re-run is coming.
- Give calibration consequences. Our agents' curves are the input to an autonomy gate: ten held high-confidence claims in a class earn auto-merge eligibility there; misses revoke it. Once bravado costs standing, stated confidence starts meaning something.
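The first and third changes can be sketched together. Everything here is an assumption for illustration: the ledger row shape `(change_class, stated_confidence, held)`, the bin edges, the 0.90 cut-off for "high-confidence", the minimum sample size, and the reading of "misses revoke it" as a reset of a consecutive streak.

```python
from collections import defaultdict

BIN_EDGES = [0.80, 0.85, 0.90, 0.95]  # lower edges; edge handling is assumed


def bin_of(confidence: float) -> float:
    """Return the lower edge of the bin a stated confidence falls into."""
    eligible = [edge for edge in BIN_EDGES if confidence >= edge]
    return max(eligible) if eligible else 0.0


def track_record(ledger):
    """Map (change_class, bin) -> (held, n) from (class, confidence, held) rows."""
    counts = defaultdict(lambda: [0, 0])
    for change_class, confidence, held in ledger:
        cell = counts[(change_class, bin_of(confidence))]
        cell[0] += int(held)
        cell[1] += 1
    return {key: tuple(value) for key, value in counts.items()}


def effective_confidence(change_class, stated, record, min_n=10):
    """Measured hold-rate for this class and bin, or None if too little data."""
    held, n = record.get((change_class, bin_of(stated)), (0, 0))
    return held / n if n >= min_n else None


def auto_merge_eligible(ledger, change_class, threshold=0.90, streak_needed=10):
    """Assumed gate: ten consecutive held high-confidence claims; a miss resets."""
    streak = 0
    for cls, confidence, held in ledger:  # ledger assumed to be in time order
        if cls != change_class or confidence < threshold:
            continue
        streak = streak + 1 if held else 0
    return streak >= streak_needed
```

With a record in which a class's 0.95+ claims held 6 times out of 10, `effective_confidence` returns 0.6 for a stated 0.97, which is the number a gate should act on. Returning `None` below `min_n` keeps a thin bin from being treated as a measured track record.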
Run it on your own agents
The machinery is open source and needs no server: receipts are files in your repo, the ledger is a JSONL, the scorer is a CLI, the gate is a GitHub Action.
pip install signalbrain. The spec, scorer, and the full reproducible analysis are at github.com/whitestone1121-web/signalbrain.
If your curve inverts too, or doesn't, post your bins. n=58 is where this stops being one deployment's anecdote.