AI did the work. Calma checks it.

Calma re-runs your AI's work, rebuilds the numbers it reported, and tells you — in one word — whether to trust them.

The problem

A wrong number looks exactly like a right one.

An AI tested a trading strategy and reported a spectacular result. The report looked perfect. Re-running the work on data the AI never saw told a different story.

Nobody re-checks these numbers. They get believed, shipped, and spent on.

What the AI reported: +14,698%
What was actually true: −32.4%

Benchmarks

Catches what reviewers miss.

117 labeled results — honest and tampered — built on UCI benchmark datasets, scikit-learn ground truth, and published real-world cases. Calma versus trusting the report, and versus asking Claude to judge the same data.

catch rate on 77 planted wrong numbers
117 cases · 30 metrics · UCI benchmark datasets + published cases · ground truth cross-validated against scikit-learn, SciPy & NumPy · ~216 ms per check

How it works

It doesn't read the work. It runs it.

Point Calma at the work and say what was claimed. Everything happens on your machine, in one command.

01

Run it again

Your AI's work re-executes in a sandbox, from scratch, with nothing taken on faith.

02

Rebuild the number

The result is recomputed from the raw output files — never copied from the AI's report.

03

Get the verdict

Code compares rebuilt against reported and gives one clear answer. Nobody — including the AI — can argue it into passing.
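The three steps above can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not Calma's engine: the function names, the `out.json` file, and the mean metric are hypothetical, and a temporary directory stands in for the sandbox without providing any real isolation.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path


def rerun(command: list[str], workdir: Path) -> None:
    """Step 1: re-execute the work from scratch in an empty directory."""
    subprocess.run(command, cwd=workdir, check=True, timeout=60)


def rebuild_mean(output_file: Path) -> float:
    """Step 2: recompute the metric from the raw output, never from a report."""
    values = json.loads(output_file.read_text())
    return sum(values) / len(values)


def verdict(reported: float, rebuilt: float, tolerance: float) -> str:
    """Step 3: a plain numeric comparison; no model is consulted."""
    return "confirmed" if abs(reported - rebuilt) <= tolerance else "refuted"


def check(command: list[str], output_name: str, reported: float, tolerance: float) -> str:
    # Hypothetical driver: a temp directory is NOT a sandbox, only a clean slate.
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        rerun(command, workdir)
        return verdict(reported, rebuild_mean(workdir / output_name), tolerance)


# Stand-in for the AI's work: a script that writes its raw values to disk.
WORK = (
    "import json, pathlib; "
    "pathlib.Path('out.json').write_text(json.dumps([1.0, 2.0, 3.0]))"
)
COMMAND = [sys.executable, "-c", WORK]

honest = check(COMMAND, "out.json", reported=2.0, tolerance=0.01)
inflated = check(COMMAND, "out.json", reported=9.0, tolerance=0.01)
```

The reported number only ever enters at the comparison step, so nothing in the report can influence how the result is rebuilt.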

Features

Simple to use. Hard to fool.

Adversarial by design

A verdict nobody can argue with — including the AI that did the work

Every number and the verdict itself come from deterministic code, and the ledger re-derives each verdict byte-for-byte from its recorded inputs. A persuasive model — or a motivated author — cannot argue, edit, or charm their way to a pass.

Self-proving isolation

It proves its own sandbox before trusting it

Before any run, Calma plants a fake secret and tries to steal it — and tries to reach the network — under its own sandbox. Only when every attempt fails does it claim isolation; a machine that can't provide it is stamped honestly. Nothing is uploaded: your code and data never leave the machine.

Deterministic to the bit

Same inputs, same number, on any machine. The recompute runs on correctly-rounded kernels with Calma's own deterministic math — no GPU noise, no platform library drift, anywhere in the path.

Calibrated tolerance budgets

A claim is only refuted when the gap is statistically distinguishable: the budget comes from the claim's own reported precision — “$4.2M” is a ±$50k claim — plus the metric's sampling error and a measured noise floor.
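As a rough sketch of how a budget like this might be assembled: the precision implied by a written number is half a unit in its last digit, which is why “$4.2M” reads as ±$50k. The function names below are hypothetical, and adding the three terms together is an assumption made for illustration; the text above does not say how Calma combines them.

```python
import re
from decimal import Decimal

# Multipliers for the magnitude suffixes this sketch understands.
SUFFIXES = {
    "": Decimal(1),
    "k": Decimal(10) ** 3,
    "M": Decimal(10) ** 6,
    "B": Decimal(10) ** 9,
}


def implied_half_width(claim: str) -> Decimal:
    """Half of one unit in the last written digit, scaled by the suffix."""
    match = re.fullmatch(r"\$?(\d+(?:\.(\d+))?)([kMB]?)", claim.strip())
    if match is None:
        raise ValueError(f"unrecognized claim: {claim!r}")
    decimals = len(match.group(2) or "")
    scale = SUFFIXES[match.group(3)]
    return Decimal(5) * Decimal(10) ** (-(decimals + 1)) * scale


def tolerance_budget(claim: str, sampling_error: float, noise_floor: float) -> float:
    """Assumed combination: a simple sum of the three sources of slack."""
    return float(implied_half_width(claim)) + sampling_error + noise_floor


def distinguishable(claimed: float, rebuilt: float, budget: float) -> bool:
    """A gap only counts against the claim when it exceeds the budget."""
    return abs(claimed - rebuilt) > budget
```

Under these assumptions, a rebuilt value of $4.23M sits inside the ±$50k implied by “$4.2M” and would not count as a refutation on precision alone.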

Honesty guards

REFUTED is structurally blocked when a column binding is ambiguous, the re-run fails, the outputs are flaky, or randomness is uncontrolled. In those cases the verdict degrades to can't-confirm, with a “fix:” line naming exactly what would unblock it. A caveat over a false alarm, every time.
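One way to picture these guards is a decision function that can only return a refutation once every guard has passed. The sketch below is hypothetical: the field names and the wording of the fix lines are invented for illustration, and it models only the refutation path described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunFacts:
    """Hypothetical record of what the verifier observed about a run."""
    rerun_succeeded: bool
    binding_unambiguous: bool
    outputs_stable: bool
    randomness_controlled: bool


# Each guard pairs a fact with an illustrative fix line (wording is assumed).
GUARDS = [
    ("rerun_succeeded", "fix: make the re-run exit cleanly"),
    ("binding_unambiguous", "fix: pin the metric column in the config file"),
    ("outputs_stable", "fix: remove the source of run-to-run variation"),
    ("randomness_controlled", "fix: set and record a random seed"),
]


def decide(gap: float, budget: float, facts: RunFacts) -> tuple[str, Optional[str]]:
    """Return (verdict, fix_line); a refutation needs every guard to pass."""
    if gap <= budget:
        return "confirmed", None
    for field, fix in GUARDS:
        if not getattr(facts, field):
            return "can't confirm", fix
    return "refuted", None


clean = RunFacts(True, True, True, True)
flaky = RunFacts(True, True, False, True)
```

With the same gap and budget, `decide(5.0, 1.0, clean)` refutes, while `decide(5.0, 1.0, flaky)` falls back to can't-confirm and names the fix.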

Plain-English claims

“p95 latency 120ms.” “pass@5 0.62.” “monthly CAGR 23.9%.” The number, the metric, and even the convention — which k, which period, Welch or pooled — are parsed straight from the words.
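To make the idea concrete, here is a small pattern-based parser that handles just the three example claims above. It is a toy under stated assumptions: the `Claim` type, the metric labels, and the patterns are hypothetical, and the source does not describe how Calma's own parser works.

```python
import re
from typing import NamedTuple, Optional


class Claim(NamedTuple):
    metric: str                # hypothetical metric label
    convention: Optional[str]  # which percentile, which k, which period
    value: float
    unit: str


# One pattern per claim shape; each captures convention, value, and unit.
PATTERNS = [
    (re.compile(r"p(?P<conv>\d+) latency (?P<value>[\d.]+)\s*(?P<unit>ms|s)"),
     "latency_percentile"),
    (re.compile(r"pass@(?P<conv>\d+) (?P<value>[\d.]+)(?P<unit>)"),
     "pass_at_k"),
    (re.compile(r"(?P<conv>monthly|annual) CAGR (?P<value>[\d.]+)(?P<unit>%)"),
     "cagr"),
]


def parse_claim(text: str) -> Claim:
    """Strip quotes and a trailing period, then try each known shape."""
    cleaned = text.strip().strip('"\u201c\u201d').rstrip(".")
    for pattern, metric in PATTERNS:
        match = pattern.fullmatch(cleaned)
        if match:
            return Claim(metric, match["conv"], float(match["value"]), match["unit"])
    raise ValueError(f"no known claim shape matches: {text!r}")
```

The convention travels with the number: `parse_claim("pass@5 0.62")` keeps the `5`, so a recompute at a different k cannot be compared against it by accident.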

Auto-drafted, graded contracts

Calma scans the output files, works out which column holds the metric, and double-checks that guess independently before it's allowed to matter. Only an independently-verified reading can ever refute. Pin everything with one small config file when you want it explicit.

Signed, forensic attestation

Every run leaves a signed report your counterparty can check with tools already on their machine: stock OpenSSH, zero installs, fully offline. An optional trusted timestamp keeps the date provable years later, on evidence rather than promises. One command re-derives every verdict and can replay the whole run. (For the engineers: DSSE/in-toto, Sigstore-compatible, RFC 3161.)

Built for agent loops & CI

Verifications are cached by the content hash of code, data, and claim — unchanged work answers instantly. Agents branch on machine-readable verdicts mid-task; the GitHub Action gates CI only when a claim actually breaks.
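Content-hash caching of this kind can be sketched in a few lines. The code below assumes SHA-256 and an in-memory dictionary, and the function names are hypothetical; the actual hash and storage Calma uses are not stated here.

```python
import hashlib
from typing import Callable

_cache: dict[str, str] = {}


def cache_key(code: bytes, data: bytes, claim: str) -> str:
    """Hash code, data, and claim together into one key."""
    digest = hashlib.sha256()
    for part in (code, data, claim.encode("utf-8")):
        # A length prefix keeps the boundaries between parts unambiguous.
        digest.update(len(part).to_bytes(8, "big"))
        digest.update(part)
    return digest.hexdigest()


def verify_cached(
    code: bytes,
    data: bytes,
    claim: str,
    verify: Callable[[bytes, bytes, str], str],
) -> str:
    """Run the expensive verification only when something actually changed."""
    key = cache_key(code, data, claim)
    if key not in _cache:
        _cache[key] = verify(code, data, claim)
    return _cache[key]
```

Changing a single byte of the code, the data, or the claim produces a different key, so a stale verdict is never served for changed work.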

Any language, black box

Python, R, Julia, C++, Rust — the program runs as a sealed box and Calma rebuilds the number in its own layer. No instrumentation, no SDK, no changes to the code under test.

Every catch leaves a record

A break produces a shareable teardown — claimed X, recomputed Y, here's the reproduction — and every verification appends to a per-project history. The track record compounds; it can't be retconned.

120 validated recipes

A recipe is how Calma rebuilds one kind of number — a Sortino ratio, a p95 latency, a pass@1, a Fisher exact p, a WER — from the raw output files. Every one is validated against the published reference implementation (scikit-learn, SciPy, NumPy, numpy-financial, statsmodels) across 385 pinned reference vectors before it ships, and runs deterministically: same inputs, same number, to the bit. New recipes are compiled, not improvised: drafted offline, admitted by a deterministic gate, frozen under a content hash.
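As an illustration of what a recipe and its admission gate might look like, the sketch below implements the standard unbiased pass@k estimator, 1 − C(n−c, k)/C(n, k), and checks it against a handful of pinned vectors. Everything here is an assumption for illustration: the function names are hypothetical, and the vectors were worked out by hand rather than taken from a reference implementation.

```python
from fractions import Fraction
from math import comb


def pass_at_k(n: int, c: int, k: int) -> Fraction:
    """Unbiased pass@k: chance that at least one of k samples drawn
    without replacement from n is among the c correct ones."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return 1 - Fraction(comb(n - c, k), comb(n, k))


# Pinned reference vectors: ((n, c, k), expected). Hand-computed for this sketch.
REFERENCE_VECTORS = [
    ((10, 3, 1), Fraction(3, 10)),
    ((10, 3, 5), Fraction(11, 12)),
    ((5, 0, 1), Fraction(0)),
    ((5, 5, 1), Fraction(1)),
]


def admit(recipe, vectors) -> bool:
    """Deterministic gate: every pinned vector must match exactly."""
    return all(recipe(*args) == expected for args, expected in vectors)
```

Exact rational arithmetic makes the gate a strict equality check. A plausible-looking shortcut such as `c / n` agrees at k = 1 but fails the k = 5 vector, so it would not be admitted.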

Browse all 120

Who it's for

Three ways people use Calma.

Builders

Catch the mistake before your users do

Your agent checks its own work as it goes — so the wrong number dies in the loop, not in production.

Free · open source · MIT

Teams

A result that doesn't reproduce never ships

Run Calma in CI as a gate. The proof travels with the work, and anyone can replay it later.

GitHub Action included

Investors & funds

Proof before the money moves

Before you act on a number, the lab independently re-executes the research and reports back, with a reproduction your own side can run.

Engagements are limited; a person replies

About

Whoever did the work never gets to grade it.

That's the whole idea. Funds have administrators. Companies have auditors. Work done by AI gets Calma. The engine is open source so anyone can check the checker — and the lab signs its name to every report.

Calma is built and run by Rikhin Kavuru. Every line of the verification engine is public at github.com/rikhinkavuru/calma, and a person answers at rikhinkavuru@gmail.com.

The lab: every claim under the lamp

Questions
What is Calma, in one sentence?
A tool that re-runs work done by AI and checks the numbers it reported, so you don't have to take its word for it.

Why can't the AI just check its own work?
Because it grades its own homework. Even when it re-runs the code, it still decides whether the answer matches, and it tends to agree with itself. Calma's decision is made by code the AI can't influence.

What do I get back?
One of four answers: confirmed, refuted, can't confirm, or confirmed with caveats (checking several claims at once can come back mixed). Each comes with the reason, the fix when something's missing, and a one-command replay anyone can run.

Does my code or data leave my machine?
No. Everything runs locally, inside a sandbox that blocks the network. Nothing is uploaded, ever.

What does it cost?
The skill is free and open source: install it and your agents use it today. The lab's signed verification reports are paid engagements, for when money is about to move on a number.
CALMA
Independent verification lab

Proof is here.

Catch the bad number before the money moves.

Talk to a person: rikhinkavuru@gmail.com · read the code at github.com/rikhinkavuru/calma

© 2026 Calma · run by Rikhin Kavuru
The producer is never the verifier