AI did the work. Calma checks it.

Calma re-runs your agent's work and recomputes the number from raw outputs, never taking the agent's word for it, and blocks a wrong number before it ships.

Verify a repo · Try the live demo

Every number, one final say.

Backtests, evals, benchmarks, datasets — recomputed from raw outputs, then proven or broken.

Trading

Sharpe 2.61
+14,698%

ML & evals

accuracy 0.94
AUC 0.91

Engineering

2.3× faster
p99 142 ms

Analytics

$4.2M total
10,482 rows

CONFIRMED · REFUTED · INVALIDATED

Anyone can read your number. Calma re-derives it.

Three things a model grading its own homework can't do.

Recompute, don't trust.

Connect a repo and Calma rebuilds the environment, re-running scripts and notebooks in a network-off sandbox — even on era-pinned Python. Then it recomputes the headline number from the raw arrays, never the README.
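As a rough illustration of recomputing from raw arrays rather than a README, the sketch below scores accuracy directly from labels and predictions and measures how far a claimed figure drifts from it. The function names, the sample data, and the choice of accuracy as the metric are assumptions made for this example; they are not Calma's actual code.

```python
from typing import Sequence


def recompute_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Accuracy computed directly from raw labels and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("labels and predictions differ in length")
    if len(y_true) == 0:
        raise ValueError("no rows to score")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def drift(claimed: float, recomputed: float) -> float:
    """Absolute gap between the reported number and the recomputed one."""
    return abs(claimed - recomputed)


# Hypothetical raw outputs: 6 of the 8 predictions match the labels.
labels = [1, 0, 1, 1, 0, 1, 0, 0]
preds = [1, 0, 0, 1, 0, 1, 1, 0]

recomputed = recompute_accuracy(labels, preds)
print(recomputed)               # 0.75
print(drift(0.94, recomputed))  # roughly 0.19, far from the claimed 0.94
```

The point of the design is that the claimed value is only ever an input to the comparison, never a source for the recomputed one.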

Reproducible isn't the same as true.

A number can reproduce perfectly and still be a lie. Calma checks the structure behind it — train/eval leakage, trivial baselines, misreported predictions — and stamps a contaminated result INVALIDATED.
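Two of the structural checks named above, train/eval leakage and a trivial baseline, can be sketched in a few lines. This is a minimal illustration that assumes rows carry stable IDs and that the trivial baseline is "always predict the majority class"; both assumptions and all function names are hypothetical, not Calma's implementation.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def leaked_ids(train_ids: Sequence[Hashable], eval_ids: Sequence[Hashable]) -> List:
    """IDs that appear in both the training split and the eval split."""
    return sorted(set(train_ids) & set(eval_ids))


def majority_baseline_accuracy(y_true: Sequence[int]) -> float:
    """Accuracy of a model that always predicts the most common label."""
    if len(y_true) == 0:
        raise ValueError("no labels to score")
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


def beats_baseline(score: float, y_true: Sequence[int]) -> bool:
    """True only if the reported score is strictly above the trivial baseline."""
    return score > majority_baseline_accuracy(y_true)


# Hypothetical splits: row 3 appears in both, so the result is contaminated.
print(leaked_ids([1, 2, 3], [3, 4]))             # [3]

# Hypothetical eval labels: 75% are class 1, so 0.75 accuracy is trivial.
print(majority_baseline_accuracy([1, 1, 1, 0]))  # 0.75
print(beats_baseline(0.75, [1, 1, 1, 0]))        # False
```

Neither check executes the model, which is why a scan of this kind can run on committed files alone.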

validity scan · no execution

model reproduces

beats trivial baseline

train / eval leakage

A guardrail, not a report.

A stop-hook in your agent's loop spots a checkable claim and runs the scan before the wrong number ships. The decision comes from deterministic code, never the model — when it can't prove a claim it says INCONCLUSIVE. Zero false confirms, ever.
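One way to picture a verdict that comes from deterministic code is a pure function from evidence to one of the four outcomes. The sketch below is an assumption about how such a rule could look: the ordering of the checks, the tolerance value, and every name in it are hypothetical and chosen only for illustration.

```python
from enum import Enum
from typing import Optional, Sequence


class Verdict(Enum):
    CONFIRMED = "CONFIRMED"
    REFUTED = "REFUTED"
    INVALIDATED = "INVALIDATED"
    INCONCLUSIVE = "INCONCLUSIVE"


def decide(
    claimed: float,
    recomputed: Optional[float],
    validity_issues: Sequence[str] = (),
    tol: float = 0.005,  # assumed tolerance, not a documented value
) -> Verdict:
    """Map evidence to a verdict with no model in the loop."""
    # A contaminated result is invalid even if it reproduces exactly.
    if validity_issues:
        return Verdict.INVALIDATED
    # Without a recomputed value there is nothing to prove either way.
    if recomputed is None:
        return Verdict.INCONCLUSIVE
    if abs(claimed - recomputed) <= tol:
        return Verdict.CONFIRMED
    return Verdict.REFUTED


print(decide(0.94, 0.94).value)                                # CONFIRMED
print(decide(0.94, 0.75).value)                                # REFUTED
print(decide(0.94, 0.94, ["train / eval leakage"]).value)      # INVALIDATED
print(decide(0.94, None).value)                                # INCONCLUSIVE
```

Because the function has no randomness and no model call, the same evidence always produces the same verdict, and a missing recomputation can never fall through to CONFIRMED.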

❯ agent …done. Final accuracy: 0.94

The fine print, in plain English.

Connect a repo and Calma independently re-runs the work, recomputes every number it claims, and tells you which ones hold — Confirmed, Refuted, Invalidated, or Inconclusive.

An AI that checks its own work is grading its own homework. Even when it re-runs the code, it still decides whether the answer matches, and it tends to agree with itself. Calma's verdict is produced by deterministic code the AI can't influence.

You get a verdict per claim, with the recomputed number next to the claimed one, the reason behind the verdict, and the evidence: which artifact the number was recomputed from and whether the run was deterministic. When Calma can't verify a claim it says so; it never guesses.
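The per-claim result described above could be modeled as a small record. The field names, the `summary` helper, and the sample values (including the artifact path) are hypothetical, chosen to mirror the items listed in the text rather than any real Calma output format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ClaimVerdict:
    claim: str                      # what was claimed, e.g. "accuracy"
    claimed: float                  # the number as reported
    recomputed: Optional[float]     # None when nothing could be recomputed
    verdict: str                    # CONFIRMED, REFUTED, INVALIDATED or INCONCLUSIVE
    reason: str                     # why this verdict was reached
    artifact: Optional[str]         # which file the number was recomputed from
    deterministic: Optional[bool]   # whether the run was deterministic

    def summary(self) -> str:
        shown = "n/a" if self.recomputed is None else f"{self.recomputed:g}"
        return f"{self.claim}: claimed {self.claimed:g}, recomputed {shown} -> {self.verdict}"


# Hypothetical example record.
result = ClaimVerdict(
    claim="accuracy",
    claimed=0.94,
    recomputed=0.75,
    verdict="REFUTED",
    reason="recomputed value differs from the claim",
    artifact="outputs/predictions.csv",  # made-up path for illustration
    deterministic=True,
)
print(result.summary())  # accuracy: claimed 0.94, recomputed 0.75 -> REFUTED
```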

Your code runs in an isolated, single-use microVM sandbox with the network cut off while it executes. The sandbox is created for your run and destroyed when it finishes.

No. The validity scan works without executing anything — it checks committed train/eval splits for leakage, compares the score against a trivial baseline, and recomputes from committed predictions. The full re-run is what upgrades a claim to Confirmed.

Verifying starts free. Paid tiers add depth and volume — see pricing.