Blog

Measured findings from running the trust layer in production — calibration data, caught overclaims, incident forensics. Every number is re-derivable from git; when honest accounting makes our numbers worse, we publish the worse numbers.

2026-07-04 · Measured data

My agents' most confident claims were their least reliable

58 objectively re-scored agent claims. Hold-rate falls as stated confidence rises — 86% at 0.85–0.90, 83% at 0.90–0.95, 33% above 0.95. The calibration curve inverts exactly where you'd most want to trust it, and measure-command complexity turns out to be a pre-merge risk signal.

Read the essay →