Proof before the money moves.
The lab is a verification practice for capital allocation in the age of AI-produced research. Before a number changes a decision — an allocation, a seed, a mandate — Calma independently re-executes the work and recomputes the result, deterministically, for the people whose money is at stake. Not an opinion on the methodology. The number, rebuilt.
The failure modes that matter — overfitting, cherry-picking, leakage — are driven by incentives, not model error. They get stronger as models improve: a better optimizer produces more convincing overfits. That's why this is forensic work, built to survive an adversarial author.
Allocators & ODD teams
A result in a pitch — a backtest, a model, a research claim — independently re-executed and recomputed before it enters your IC memo. Slots into the operational due diligence you already run.
Seeders & platforms
The same forensic pass across a stable of candidate managers or internal research teams — with a verdict per claim and a reproduction your own analysts can run.
Managers & research teams
Attestation, under terms that make it mean something: prepaid, logged in the registry, and a disclosed trial log behind any headline performance stamp. A stamp anyone could buy would be worthless to you.
Scope
You name the claims that matter — the return, the Sharpe, the accuracy, the capacity figure. We contract the exact artifacts, code, and data that produced them. Prepaid, non-contingent.
Re-execute
The work runs again in an isolated environment, from scratch. The headline numbers are rebuilt from the raw outputs on deterministic kernels — never read from the deck.
Report
Per claim: confirmed, refuted, confirmed-with-caveats, or can't-confirm, with the recomputed number, the gap, and what was and wasn't verified. Each report comes with a signed, trusted-timestamped attestation bundle your side checks offline: verify the signature with stock OpenSSH, re-derive every verdict, replay the entire run.
Registry
The engagement is logged, with consent, in the public registry, including engagements that were withdrawn or refuted. The population of stamps carries signal because the misses are in it too.
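As a rough illustration of the four steps above, here is a minimal Python sketch that rebuilds one headline number from raw output and turns the gap into a verdict. It is not the lab's engine: the names `Verdict`, `annualized_sharpe`, and `check_claim`, the 252-period annualization, the 0.05 tolerance, and the sample returns are all assumptions made for the example. Only two of the four verdicts are modeled.

```python
import hashlib
import json
import math
import statistics
from dataclasses import asdict, dataclass

PERIODS_PER_YEAR = 252  # assumption: the raw output holds daily excess returns
TOLERANCE = 0.05        # assumption: largest absolute gap still called "confirmed"


@dataclass(frozen=True)
class Verdict:
    """Hypothetical per-claim record: numbers and hashes only, no code or data."""
    claim: str
    metric: str
    claimed: float
    recomputed: float
    gap: float
    verdict: str
    data_sha256: str


def annualized_sharpe(returns):
    """Annualized Sharpe ratio from per-period excess returns."""
    mean = statistics.fmean(returns)
    stdev = statistics.stdev(returns)  # sample standard deviation
    return mean / stdev * math.sqrt(PERIODS_PER_YEAR)


def check_claim(claim, claimed, raw_output):
    """Recompute the metric from raw bytes; the claimed value is only compared against."""
    returns = [float(line) for line in raw_output.decode().split()]
    recomputed = annualized_sharpe(returns)
    gap = recomputed - claimed
    # A fuller version would also emit confirmed-with-caveats and can't-confirm.
    verdict = "confirmed" if abs(gap) <= TOLERANCE else "refuted"
    return Verdict(
        claim=claim,
        metric="sharpe",
        claimed=claimed,
        recomputed=round(recomputed, 4),
        gap=round(gap, 4),
        verdict=verdict,
        data_sha256=hashlib.sha256(raw_output).hexdigest(),  # content hash of the input
    )


# Made-up raw output: one return per line, as a re-executed backtest might write it.
raw_output = b"0.010\n-0.005\n0.007\n0.002\n-0.003\n0.004\n"
result = check_claim("Strategy A Sharpe", claimed=8.2, raw_output=raw_output)
print(json.dumps(asdict(result), indent=2))
```

The record carries the claim, the metric, claimed versus recomputed, the verdict, and a content hash, so a reader holding the same raw file can re-derive the verdict without trusting the deck.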
A verification only has value if it would have caught the lie. Every engagement runs under terms designed for the adversarial case:
Prepaid and non-contingent
The fee is the same whether the result confirms or breaks. Nobody can buy a verdict, and no verdict is softened to keep a client.
Every engagement is logged
The registry records every engagement — confirmed, refuted, withdrawn — clinical-trial style. A stamp only means something if the stamps that failed are visible too.
Headline stamps require the trial log
A backtest stamp without the other attempts behind it is marketing. Headline performance claims require a disclosed trial log, and we deflate for multiple testing before any stamp.
Every report states its limits
Reproducible is not the same as right. Each report stamps exactly what was verified, what wasn't, and at what isolation and determinism tier — no verdict ever overreaches its evidence.
Your code and data stay yours
Engagements run under NDA in an isolated environment scoped in the engagement letter. What enters the public registry is redacted by construction — claim, metric, claimed vs recomputed, verdict, and content hashes. Never code, never data, never positions.
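To make the trial-log term above concrete: the page does not say which multiple-testing correction the lab applies, so the sketch below assumes one published approach, the deflated Sharpe ratio of Bailey and López de Prado. It estimates the best Sharpe you would expect from pure luck across the disclosed number of trials, then asks how likely the observed Sharpe is to exceed that bar. The function names and sample inputs are hypothetical, and Sharpe values are per period (not annualized).

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant
STD_NORMAL = NormalDist()


def expected_max_sharpe(n_trials, sharpe_variance):
    """Approximate expected maximum Sharpe across n_trials strategies with no true skill."""
    if n_trials < 2:
        raise ValueError("needs at least two trials in the log")
    z1 = STD_NORMAL.inv_cdf(1 - 1 / n_trials)
    z2 = STD_NORMAL.inv_cdf(1 - 1 / (n_trials * math.e))
    return math.sqrt(sharpe_variance) * ((1 - EULER_GAMMA) * z1 + EULER_GAMMA * z2)


def deflated_sharpe(sharpe, n_obs, n_trials, sharpe_variance, skew=0.0, kurtosis=3.0):
    """Probability that the observed Sharpe beats the luck-only benchmark.

    skew=0 and kurtosis=3 assume normally distributed returns.
    """
    benchmark = expected_max_sharpe(n_trials, sharpe_variance)
    denom = math.sqrt(1 - skew * sharpe + (kurtosis - 1) / 4 * sharpe ** 2)
    z = (sharpe - benchmark) * math.sqrt(n_obs - 1) / denom
    return STD_NORMAL.cdf(z)


# Same headline result, two different trial logs (illustrative numbers only).
few_trials = deflated_sharpe(sharpe=0.1, n_obs=1250, n_trials=2, sharpe_variance=0.0025)
many_trials = deflated_sharpe(sharpe=0.1, n_obs=1250, n_trials=100, sharpe_variance=0.0025)
print(f"2 trials:   {few_trials:.3f}")
print(f"100 trials: {many_trials:.3f}")
```

The same backtest looks far weaker once a hundred attempts stand behind it, which is why the stamp cannot be computed without the trial log.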
The auditor can't be the auditee. The best agents score ~21% at assessing reproducibility on REPRO-Bench (arxiv.org/abs/2507.18901); Calma re-executes and decides with code.
They sell self-evaluation to the builder. The lab sells verdicts to the counterparty — the side whose money moves on the answer.
They prove when you knew the number. Calma proves the number is real.
The engine behind every engagement is open source — the free Calma skill is the live, public proof it's real. The lab is the practice built on top of it.
Have a number that's about to move money?
Send the claim. We'll scope what it would take to prove it — or break it.