AI did the work. Calma checks it.
Calma re-runs your AI's work, rebuilds the numbers it reported, and tells you — in one word — whether to trust them.
A wrong number looks exactly like a right one.
An AI tested a trading strategy and reported a spectacular result. The report looked perfect. Re-running the work on data the AI never saw told a different story.
Nobody re-checks these numbers. They get believed, shipped, and spent on.
Catches what reviewers miss.
117 labeled results — honest and tampered — built on UCI benchmark datasets, scikit-learn ground truth, and published real-world cases. Calma versus trusting the report, and versus asking Claude to judge the same data.
It doesn't read the work. It runs it.
Point Calma at the work and say what was claimed. Everything happens on your machine, in one command.
Run it again
Your AI's work re-executes in a sandbox, from scratch, with nothing taken on faith.
Rebuild the number
The result is recomputed from the raw output files — never copied from the AI's report.
Get the verdict
Code compares rebuilt against reported and gives one clear answer. Nobody — including the AI — can argue it into passing.
Simple to use. Hard to fool.
A verdict nobody can argue with — including the AI that did the work
Every number and the verdict itself come from deterministic code, and the ledger re-derives each verdict byte-for-byte from its recorded inputs. A persuasive model — or a motivated author — cannot argue, edit, or charm their way to a pass.
It proves its own sandbox before trusting it
Before any run, Calma plants a fake secret and tries to steal it — and tries to reach the network — under its own sandbox. Only when every attempt fails does it claim isolation; a machine that can't provide it is stamped honestly. Nothing is uploaded: your code and data never leave the machine.
Deterministic to the bit
Same inputs, same number, on any machine. The recompute runs on correctly-rounded kernels with Calma's own deterministic math — no GPU noise, no platform library drift, anywhere in the path.
Calibrated tolerance budgets
A claim is only refuted when the gap is statistically distinguishable: the budget comes from the claim's own reported precision — “$4.2M” is a ±$50k claim — plus the metric's sampling error and a measured noise floor.
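The “$4.2M is a ±$50k claim” arithmetic can be sketched as follows. This is a minimal illustration, not Calma's implementation: the function names are hypothetical, and summing the three budget terms is an assumption drawn from the word “plus” above.

```python
from decimal import Decimal


def implied_half_width(number_text: str, scale: float = 1.0) -> float:
    """Half of the last reported digit's place value, times a unit scale.

    "4.2" reports one decimal place, so the half-width is 0.05 units.
    """
    exponent = Decimal(number_text).as_tuple().exponent  # "4.2" -> -1
    return 0.5 * (10.0 ** exponent) * scale


def tolerance_budget(number_text: str, scale: float = 1.0,
                     sampling_error: float = 0.0,
                     noise_floor: float = 0.0) -> float:
    """Assumed combination rule: a simple sum of the three terms."""
    return implied_half_width(number_text, scale) + sampling_error + noise_floor


def is_distinguishable(claimed: float, recomputed: float, budget: float) -> bool:
    """The gap only counts when it exceeds the budget."""
    return abs(claimed - recomputed) > budget


# "$4.2M": digits "4.2", unit scale of one million -> about 50,000 dollars.
precision = implied_half_width("4.2", scale=1_000_000)

# Illustrative sampling error and noise floor, invented for this example.
budget = tolerance_budget("4.2", scale=1_000_000,
                          sampling_error=20_000, noise_floor=1_000)
```

Under these made-up inputs a recomputed $4.15M sits inside the budget, while $3.9M falls well outside it.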
Honesty guards
REFUTED is structurally blocked on an ambiguous column binding, a failed re-run, flaky outputs, or uncontrolled randomness. It degrades to can't-confirm with a “fix:” line naming the exact step that unblocks it: a caveat over a false alarm, every time.
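One way to picture the guard logic is the sketch below. It is an assumption-laden illustration: the `Evidence` type, the `CONFIRMED` and `CANT_CONFIRM` labels, and the blocker strings are hypothetical, and only the rule that a blocked refutation degrades to can't-confirm comes from the text above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Evidence:
    claimed: float
    recomputed: Optional[float]   # None when the re-run produced no number
    budget: float                 # tolerance for the claimed-vs-recomputed gap
    blockers: list[str] = field(default_factory=list)  # e.g. "pin the random seed"


def verdict(e: Evidence) -> str:
    """Return a verdict string; REFUTED is only reachable with no blockers."""
    if e.recomputed is None:
        return "CANT_CONFIRM fix: make the re-run succeed"
    if abs(e.claimed - e.recomputed) <= e.budget:
        return "CONFIRMED"
    if e.blockers:
        # A gap exists, but something makes it untrustworthy: degrade.
        return "CANT_CONFIRM fix: " + e.blockers[0]
    return "REFUTED"
```

The ordering is the design point: a gap outside the budget is necessary for REFUTED but never sufficient while any blocker is present.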
Plain-English claims
“p95 latency 120ms.” “pass@5 0.62.” “monthly CAGR 23.9%.” The number, the metric, and even the convention — which k, which period, Welch or pooled — are parsed straight from the words.
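As a rough illustration of pulling the number, metric, and convention out of a sentence, here is a small regex-based sketch. It is not Calma's parser: the `Claim` type, the supported units, and the patterns are hypothetical and cover only the three example claims above.

```python
import re
from dataclasses import dataclass


@dataclass
class Claim:
    metric: str        # the words before the number
    value: float
    unit: str          # "ms", "s", "%", or "" when no unit is given
    convention: dict   # e.g. {"k": 5} or {"period": "monthly"}


# A trailing number with an optional unit, anchored at the end of the claim.
_NUMBER = re.compile(r"(-?\d+(?:\.\d+)?)\s*(ms|%|s)?\s*$")


def parse_claim(text: str) -> Claim:
    text = text.strip().rstrip(".")
    match = _NUMBER.search(text)
    if match is None:
        raise ValueError(f"no trailing number in claim: {text!r}")
    head = text[:match.start()].strip()

    convention = {}
    if (k := re.search(r"pass@(\d+)", head)):
        convention["k"] = int(k.group(1))
    if (p := re.search(r"\bp(\d{1,2})\b", head)):
        convention["percentile"] = int(p.group(1))
    if (per := re.search(r"\b(daily|monthly|annual)\b", head, re.IGNORECASE)):
        convention["period"] = per.group(1).lower()

    return Claim(metric=head, value=float(match.group(1)),
                 unit=match.group(2) or "", convention=convention)


claims = [parse_claim(t) for t in
          ("p95 latency 120ms.", "pass@5 0.62.", "monthly CAGR 23.9%.")]
```

A real parser would need far more than this, such as a convention like Welch versus pooled, but the shape of the output is the point: one value, one metric, and the conventions made explicit.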
Auto-drafted, graded contracts
Calma scans the output files, works out which column holds the metric, and double-checks that guess independently before it's allowed to matter. Only an independently-verified reading can ever refute. Pin everything with one small config file when you want it explicit.
Signed, forensic attestation
Every run leaves a signed report your counterparty can check with tools already on their machine — stock OpenSSH, zero installs, fully offline. An optional trusted timestamp makes the date provable in years, not promises. One command re-derives every verdict and can replay the whole run. (For the engineers: DSSE/in-toto, Sigstore-compatible, RFC 3161.)
Built for agent loops & CI
Verifications are cached by the content hash of code, data, and claim — unchanged work answers instantly. Agents branch on machine-readable verdicts mid-task; the GitHub Action gates CI only when a claim actually breaks.
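Caching by content hash can be sketched in a few lines. This is a minimal illustration under assumptions: SHA-256, the length-prefixed layout, and the in-memory dictionary are choices made for the example, not a description of Calma's cache.

```python
import hashlib
from typing import Callable


def content_key(code: bytes, data: bytes, claim: str) -> str:
    """Hash code, data, and claim together into one cache key."""
    digest = hashlib.sha256()
    for part in (code, data, claim.encode("utf-8")):
        # Length prefix keeps the boundaries between parts unambiguous.
        digest.update(len(part).to_bytes(8, "big"))
        digest.update(part)
    return digest.hexdigest()


_cache: dict[str, str] = {}


def verify_cached(code: bytes, data: bytes, claim: str,
                  verify: Callable[[bytes, bytes, str], str]) -> str:
    """Run `verify` only when this exact code, data, and claim are new."""
    key = content_key(code, data, claim)
    if key not in _cache:
        _cache[key] = verify(code, data, claim)
    return _cache[key]
```

Changing a single byte of the code, the data, or the claim produces a different key, so only genuinely unchanged work is answered from the cache.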
Any language, black box
Python, R, Julia, C++, Rust — the program runs as a sealed box and Calma rebuilds the number in its own layer. No instrumentation, no SDK, no changes to the code under test.
Every catch leaves a record
A break produces a shareable teardown — claimed X, recomputed Y, here's the reproduction — and every verification appends to a per-project history. The track record compounds; it can't be retconned.
A recipe is how Calma rebuilds one kind of number — a Sortino ratio, a p95 latency, a pass@1, a Fisher exact p, a WER — from the raw output files. Every one is validated against the published reference implementation (scikit-learn, SciPy, NumPy, numpy-financial, statsmodels) across 385 pinned reference vectors before it ships, and runs deterministically: same inputs, same number, to the bit. New recipes are compiled, not improvised: drafted offline, admitted by a deterministic gate, frozen under a content hash.
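To make the idea concrete, a recipe for p95 latency might look like the sketch below. It is an illustration only: the function names, the `latency_ms` column, and the choice of linear interpolation between order statistics (the convention NumPy's percentile uses by default) are assumptions, not Calma's actual recipe.

```python
import csv
import io
import math


def percentile_linear(values, q: float) -> float:
    """Percentile by linear interpolation between the two nearest ranks."""
    xs = sorted(values)
    if not xs:
        raise ValueError("no values to summarize")
    pos = (len(xs) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def p95_from_csv(text: str, column: str = "latency_ms") -> float:
    """Rebuild p95 from raw rows, never from a reported summary."""
    rows = csv.DictReader(io.StringIO(text))
    return percentile_linear([float(row[column]) for row in rows], 95)


# Raw output with one latency per request: 1 ms through 101 ms.
raw = "latency_ms\n" + "\n".join(str(v) for v in range(1, 102))
p95 = p95_from_csv(raw)
```

Validating such a recipe would mean checking values like this against the reference implementation on fixed inputs, which is the role the pinned reference vectors play above.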
Browse all 120
Three ways people use Calma.
Catch the mistake before your users do
Your agent checks its own work as it goes — so the wrong number dies in the loop, not in production.
Free · open source · MIT
A result that doesn't reproduce never ships
Run Calma in CI as a gate. The proof travels with the work, and anyone can replay it later.
GitHub Action included
Proof before the money moves
Before you act on a number, the lab independently re-executes the research and reports — with a reproduction your own side can run.
Engagements are limited; a person replies
Whoever did the work never gets to grade it.
That's the whole idea. Funds have administrators. Companies have auditors. Work done by AI gets Calma. The engine is open source so anyone can check the checker — and the lab signs its name to every report.
Calma is built and run by Rikhin Kavuru. Every line of the verification engine is public at github.com/rikhinkavuru/calma, and a person answers at rikhinkavuru@gmail.com.
