A recipe is the deterministic procedure Calma uses to recompute one kind of claim from raw output files. Every recipe obeys the same four rules. It reads only machine-readable raw outputs, never the number that was reported. It runs on bit-stable deterministic kernels (no GPU, no platform math libraries, no model anywhere in the path). It is validated against the published reference implementations (scikit-learn, SciPy, NumPy, numpy-financial, the HumanEval estimator) across 385 pinned reference vectors before it ships. And when its input is broken or ambiguous, it degrades to “can't confirm” instead of guessing.
Claim it verifies · How the number is rebuilt · Validated against
The original family: a backtest's headline numbers, rebuilt from the raw return series — never from the chart or the summary cell.
“+14,698% backtest”
The compounded return of a strategy, from its raw per-period returns.
recompute prod(1+r) − 1 over the return column, computed with a pairwise product for bit-stable accuracy.
“Sharpe 2.1”
Risk-adjusted return, annualized, with a sampling standard error attached.
recompute mean / std (ddof=1) × √periods; near-zero volatility degrades the verdict instead of dividing by noise.
“max drawdown −12%”
The worst peak-to-trough loss on the equity curve.
recompute Running peak vs. equity from the compounded return path; flagged path-dependent so a near-tie can never refute.
“annualized vol 18%”
Annualized standard deviation of returns.
recompute std(ddof=1) × √periods over the raw return column.
“Sortino 2.4”
Return per unit of downside risk — upside volatility doesn't count against you.
recompute mean / √(mean(min(r,0)²)) × √periods, target 0, full-sample denominator.
“Calmar 1.1”
Annualized return against the worst drawdown.
recompute CAGR-style annualized return / |max drawdown|; path-dependent flagged.
“downside deviation 9%”
Volatility of only the losing periods.
recompute √(mean(min(r,0)²)) × √periods.
“95% VaR 2.1%”
The loss the portfolio shouldn't exceed at the stated confidence — historical method.
recompute −quantile(returns, 1−level), reported as a positive loss; level parsed from the claim.
“CVaR 3.0%”
The average loss in the tail beyond VaR — what actually happens on the bad days.
recompute −mean(r | r ≤ VaR cut), the average of the returns at or below the VaR cutoff, reported positive.
“win rate 58%”
Fraction of strictly positive periods.
recompute count(r > 0) / n from the raw returns.
“profit factor 1.8”
Gross gains over gross losses.
recompute Σ gains / |Σ losses|; a series with no losses degrades rather than dividing by zero.
“omega 1.4”
Probability-weighted gains over losses around a threshold.
recompute Σ max(r−θ, 0) / Σ max(θ−r, 0).
“beta 1.2”
Sensitivity to the benchmark.
recompute cov(r, b) / var(b), sample (ddof=1), from the two raw return columns.
“alpha 4% annualized”
Return unexplained by benchmark exposure (simple CAPM, rf = 0).
recompute (mean(r) − β·mean(b)) × periods.
“IR 0.7”
Active return per unit of tracking error.
recompute mean(r−b) / std(r−b, ddof=1) × √periods.
“tracking error 3.1%”
Annualized volatility of the active return.
recompute std(r−b, ddof=1) × √periods.
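As a rough illustration of three recipes in this family, the sketch below recomputes total return, Sharpe, and max drawdown from a raw return list in plain Python. The function names, the `1e-12` volatility floor, and the choice to start the running peak at an initial equity of 1.0 are assumptions made for this example, not Calma's actual kernels.

```python
import math
from statistics import stdev


def pairwise_prod(xs):
    """Multiply by recursive halving so rounding error grows slowly with length."""
    n = len(xs)
    if n == 0:
        return 1.0
    if n == 1:
        return xs[0]
    mid = n // 2
    return pairwise_prod(xs[:mid]) * pairwise_prod(xs[mid:])


def total_return(returns):
    """prod(1 + r) - 1 over the raw per-period returns."""
    return pairwise_prod([1.0 + r for r in returns]) - 1.0


def sharpe(returns, periods, vol_floor=1e-12):
    """mean / std(ddof=1) * sqrt(periods); None stands in for a degraded verdict."""
    n = len(returns)
    if n < 2:
        return None
    sd = stdev(returns)  # sample standard deviation, ddof=1
    if sd < vol_floor:  # assumed floor: near-zero volatility degrades
        return None
    return (math.fsum(returns) / n) / sd * math.sqrt(periods)


def max_drawdown(returns):
    """Worst peak-to-trough loss on the compounded equity path (a value <= 0)."""
    equity = peak = 1.0  # assumption: the path starts at 1.0
    worst = 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = min(worst, equity / peak - 1.0)
    return worst
```

For the returns `[0.10, -0.50, 0.20]` this gives a total return of about −34% and a max drawdown of about −50%, while a constant return series makes `sharpe` return `None` rather than divide by a zero standard deviation.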
The metrics every model card claims, recomputed from the raw predictions file — including the calibration and imbalance metrics that hide the most sins.
“accuracy 0.87”
Fraction of predictions that match the labels.
recompute Exact match count over n, from the prediction and label columns.
“AUC 0.91”
Probability a positive outranks a negative, with tie handling.
recompute Mann–Whitney statistic with ties = 0.5, plus a DeLong sampling standard error for the tolerance budget.
“precision 0.92”
Of everything flagged positive, how much actually was.
recompute TP / (TP + FP) from exact integer confusion counts.
“recall 0.85”
Of everything actually positive, how much was found.
recompute TP / (TP + FN) from exact integer confusion counts.
“F1 0.84”
Harmonic mean of precision and recall.
recompute 2PR / (P + R) from the same exact confusion counts.
“macro F1 0.55”
Multiclass F1 averaged equally across classes — the one that exposes a model coasting on the majority class.
recompute Per-class binary F1 over the union of observed classes, zero when undefined, unweighted mean.
“micro F1 0.78”
Multiclass F1 from globally pooled counts.
recompute Global TP/FP/FN over all classes, one F1 from the pooled totals.
“average precision 0.62”
Area under the precision–recall curve — the honest metric for imbalanced classes.
recompute Step-wise average precision with tied scores grouped at each threshold.
“log loss 0.31”
How well the predicted probabilities fit the outcomes.
recompute −mean(y ln p + (1−y) ln(1−p)) on deterministic log kernels; a hard 0/1 on the wrong side degrades rather than silently clipping.
“Brier 0.09”
Mean squared error of the predicted probabilities.
recompute mean((p − y)²) over the probability column.
“MCC 0.61”
The single correlation coefficient between predictions and truth — robust to imbalance, binary or multiclass.
recompute Gorodkin's multiclass MCC from exact integer sums; zero denominator returns 0.
“ECE 3.2%”
Whether “90% confident” actually means right 90% of the time.
recompute Equal-width confidence bins; Σ (n_b/n) · |accuracy_b − confidence_b|.
“balanced accuracy 0.71”
Accuracy that a majority-class model can't fake — mean per-class recall.
recompute Recall computed per class over the observed classes, unweighted mean.
“kappa 0.64”
Agreement beyond what chance would produce, binary or multiclass.
recompute (p_observed − p_expected) / (1 − p_expected) with exact integer marginals.
“specificity 0.95”
Of everything actually negative, how much was correctly left alone.
recompute TN / (TN + FP) from exact confusion counts.
“F2 score 0.66”
F-score weighted toward recall (F2) or precision (F0.5) — beta parsed from the claim.
recompute (1+β²)PR / (β²P + R) from exact confusion counts.
“IoU 0.58”
Overlap between predicted and actual positives.
recompute TP / (TP + FP + FN).
“weighted F1 0.69”
Multiclass F1 weighted by class support.
recompute Per-class F1 × support share, summed; zero_division=0.
“KS 0.31”
The credit-scoring KS: how separated the score distributions of goods and bads are.
recompute max |ECDF_pos − ECDF_neg| over the pooled score thresholds.
“Gini 0.42”
The credit-model Gini — a rescaled AUC.
recompute 2·AUC − 1 from the Mann–Whitney AUC.
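To make the "exact integer confusion counts" and "ties = 0.5" conventions concrete, here is a minimal sketch for binary 0/1 labels. The function names are hypothetical, returning `None` for an undefined ratio is an assumption of this example, and the quadratic pairwise AUC loop is chosen for readability rather than speed.

```python
def confusion_counts(y_true, y_pred):
    """Exact integer (TP, FP, FN, TN) for binary 0/1 labels and predictions."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == 1 and t == 1:
            tp += 1
        elif p == 1 and t == 0:
            fp += 1
        elif p == 0 and t == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(y_true, y_pred):
    """Ratios taken once from the integer counts; None when a ratio is undefined."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    # 2PR / (P + R) simplifies algebraically to 2TP / (2TP + FP + FN).
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else None
    return precision, recall, f1


def auc_mann_whitney(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly, ties counting 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        return None  # AUC is undefined without both classes
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def gini_from_auc(auc):
    """The credit-model Gini as a rescaled AUC."""
    return 2.0 * auc - 1.0
```

With labels `[1, 1, 0, 0]` and scores `[0.9, 0.4, 0.4, 0.1]`, one of the four positive/negative pairs is tied, so the AUC is 3.5 / 4 = 0.875 and the Gini is 0.75.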
Fit metrics from the raw prediction/actual pairs, plus the forecasting set — where a zero actual degrades honestly instead of being epsilon-fudged.
“RMSE 4.2”
Root mean squared prediction error.
recompute √(mean((p − a)²)) under correctly-rounded summation.
“MAE 2.8”
Mean absolute prediction error.
recompute mean(|p − a|) under correctly-rounded summation.
“R² 0.93”
Variance explained by the predictions.
recompute 1 − SS_res / SS_tot; zero-variance targets degrade instead of dividing by zero.
“MAPE 8.2%”
Percentage forecast error, plain or symmetric.
recompute mean(|p − a| / |a|); any zero actual is a degenerate verdict, never an epsilon fudge. sMAPE: mean(2|p − a| / (|p| + |a|)).
“MASE 0.84”
Forecast error scaled by the naive seasonal forecast — the scale-free standard.
recompute MAE / mean(|a_t − a_{t−m}|), the in-sample naive-m benchmark.
“pinball loss 1.3 at q=0.9”
Quantile forecast accuracy.
recompute mean(max(q(a−p), (q−1)(a−p))) at the claimed quantile.
“RMSLE 0.12”
Log-scale squared error — the Kaggle standard for skewed targets.
recompute √(mean((ln(1+p) − ln(1+a))²)) on deterministic log kernels; values ≤ −1 degrade.
“median absolute error 1.9”
The robust error metric outliers can't inflate.
recompute median(|p − a|) with linear-interpolation quantiles.
“max error 4.2”
The single worst prediction.
recompute max(|p − a|).
“explained variance 0.90”
Variance captured, ignoring constant bias.
recompute 1 − var(a−p) / var(a), population variances.
“adjusted R² 0.91 with 8 predictors”
R² penalized for the number of predictors — kills the add-features-until-it-fits trick.
recompute 1 − (1−R²)(n−1)/(n−p−1); predictor count from the claim, or no verdict.
“NRMSE 7%”
RMSE made scale-free.
recompute RMSE / mean(actual), or / range by convention.
“WAPE 9.3%”
The retail-forecasting error standard — robust where MAPE explodes.
recompute Σ|p − a| / Σ|a|.
“2% over-forecast”
Systematic over- or under-forecasting.
recompute (Σp − Σa) / Σa; positive = over-forecast.
“DW 1.9”
Autocorrelation left in the residuals — the classic misspecification tell.
recompute Σ(e_t − e_{t−1})² / Σe² on the raw residuals.
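The sketch below illustrates how a zero actual can degrade a verdict instead of being patched with an epsilon, using RMSE, MAPE, and WAPE as examples. The function names and the use of `None` as the degraded result are assumptions of this example; `math.fsum` stands in for the correctly-rounded summation the recipes describe.

```python
import math


def rmse(pred, actual):
    """sqrt(mean((p - a)^2)) with correctly-rounded summation."""
    n = len(actual)
    if n == 0:
        return None
    return math.sqrt(math.fsum((p - a) ** 2 for p, a in zip(pred, actual)) / n)


def mape(pred, actual):
    """mean(|p - a| / |a|); any zero actual degrades to None, no epsilon added."""
    if not actual or any(a == 0 for a in actual):
        return None
    return math.fsum(abs(p - a) / abs(a) for p, a in zip(pred, actual)) / len(actual)


def wape(pred, actual):
    """sum|p - a| / sum|a|; stays defined as long as the actuals are not all zero."""
    denom = math.fsum(abs(a) for a in actual)
    if denom == 0:
        return None
    return math.fsum(abs(p - a) for p, a in zip(pred, actual)) / denom
```

On predictions `[1.0, 2.0]` against actuals `[0.0, 2.0]`, `mape` returns `None` while `wape` still returns 0.5, which is the sense in which WAPE stays robust where MAPE explodes.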
What every data agent claims after a pipeline run: totals, medians, group-bys, distincts, nulls — and whether the merge silently dropped rows.
“total $4.2M”
A column total — revenue, amounts, quantities.
recompute Correctly-rounded fsum over the raw column; the claim's own reported precision ($4.2M = ±50k) sets the tolerance.
“average order $54”
A column average.
recompute fsum / n over the raw column.
“median 42”
The 50th percentile of a column.
recompute NumPy's default linear-interpolation quantile at q = 0.5.
“90th percentile 120”
Any quantile of a column.
recompute Linear-interpolation quantile (Hyndman–Fan type 7, NumPy's default) at the claimed q.
“processed 10,000 rows”
How many rows a pipeline actually produced.
recompute A literal count of the artifact's rows.
“revenue in the West was $310k”
A per-group sum or mean.
recompute fsum/mean per group key from the raw rows; without a named group there is no scalar to verify, so it degrades.
“12,402 distinct users”
Unique values in a column.
recompute Distinct stripped cell values, nulls dropped by default.
“only 3 duplicates”
Rows that duplicate an earlier row.
recompute n − distinct, the pandas duplicated(keep='first') convention.
“missing data under 2%”
How much of a column is actually empty.
recompute Fraction of cells that are empty / NaN / null tokens, from the raw strings.
“up 23% MoM”
Period-over-period or total growth of a time-ordered column.
recompute last/prev − 1 (or last/first − 1 for total growth).
“42% of users converted”
The fraction of rows meeting a flag.
recompute Count of truthy flags over n.
“merged without losing rows”
Whether a merge dropped (or fanned out) rows.
recompute len(left) − len(joined) across two artifacts; 0 is lossless, negative is fan-out.
“minimum order $12”
The smallest value in a column.
recompute min over the raw column, NaN-propagating.
“maximum $9,400”
The largest value in a column.
recompute max over the raw column, NaN-propagating.
“standard deviation 4.2”
Spread of a column.
recompute Sample standard deviation (ddof=1), or population (ddof=0) by convention.
“IQR 12”
The middle-50% spread.
recompute q75 − q25 with linear-interpolation quartiles.
“only 12 outliers”
How many values sit outside the Tukey fences.
recompute count outside [q1 − k·IQR, q3 + k·IQR].
“the most common category is 34% of rows”
How dominant the most frequent value is.
recompute count(most frequent cell) / n on the raw strings.
“revenue Gini 0.38”
Inequality of a distribution — how top-heavy the column is.
recompute Σ(2i − n − 1)·x₍ᵢ₎ / (n·Σx) over the sorted non-negative values.
“HHI 0.18”
Herfindahl-Hirschman concentration of amounts.
recompute Σ(xᵢ/Σx)², on the 0–1 scale.
“entropy 2.4 bits”
How spread out a categorical column is.
recompute Shannon entropy of the value counts on deterministic log kernels.
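Three of the data-pipeline recipes above fit in a few lines of standard-library Python: a correctly-rounded total, a linear-interpolation quantile, and a duplicate count. This is only a sketch with hypothetical function names; it assumes the column has already been parsed to numbers and does not handle NaN or null tokens.

```python
import math


def column_total(values):
    """Correctly-rounded sum of a numeric column."""
    return math.fsum(values)


def quantile_linear(values, q):
    """Linear-interpolation quantile: position h = (n - 1) * q in the sorted data."""
    xs = sorted(values)
    if not xs or not 0 <= q <= 1:
        return None
    h = (len(xs) - 1) * q
    lo = math.floor(h)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (h - lo) * (xs[hi] - xs[lo])


def duplicate_count(rows):
    """n - distinct: every repeat of an earlier row counts once.

    Rows must be hashable, for example tuples of cell values.
    """
    return len(rows) - len(set(rows))
```

The difference between `math.fsum` and a naive loop is visible even on tiny inputs: `math.fsum([0.1] * 10)` is exactly `1.0`, whereas `sum([0.1] * 10)` is not.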
The coding-agent pack: “2.3× faster”, latency percentiles, coverage — recomputed from the raw benchmark timings and reports, not the summary line.
“2.3× faster”
Before/after speedup from the raw timing runs.
recompute mean(before) / mean(after) — or medians — from the two timing columns.
“median latency 48ms”
Median latency from the raw per-request durations.
recompute Linear-interpolation quantile at 0.50.
“p95 latency 120ms”
Tail latency at the 95th percentile.
recompute Linear-interpolation quantile at 0.95.
“p99 under 250ms”
Tail latency at the 99th percentile.
recompute Linear-interpolation quantile at 0.99.
“4,200 rps”
Operations per unit time.
recompute n / fsum(durations) over the raw per-op timings.
“peak memory 1.2GB”
The maximum of a sampled memory series.
recompute max over the raw samples, NaN-propagating.
“coverage 87%”
Line coverage recomputed from the report's raw per-line hit counts — not the percentage it printed.
recompute lines with hits > 0 over total lines.
“5xx error rate 0.3%”
Failures over totals, from the raw log rows.
recompute Count of error flags (or HTTP ≥400/≥500 statuses) over n.
“p90 latency 95ms”
Tail latency at the 90th percentile.
recompute Linear-interpolation quantile at 0.90.
“Apdex 0.94 at t=0.5”
The application-performance satisfaction index.
recompute (satisfied + tolerating/2) / n with satisfied ≤ T, tolerating ≤ 4T; T from the claim.
“uptime 99.95%”
Availability from the raw up/down checks.
recompute up checks / total checks.
“cache hit rate 87%”
Hits over lookups, from the raw access log.
recompute hit flags / total accesses.
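As an example from this pack, the Apdex and speedup recipes can be sketched as follows. The function names are hypothetical, durations and the threshold are assumed to share one unit, and using means rather than medians for the speedup is just one of the two conventions the recipe allows.

```python
import math


def apdex(durations, t):
    """(satisfied + tolerating / 2) / n with satisfied <= T and tolerating <= 4T."""
    n = len(durations)
    if n == 0 or t <= 0:
        return None
    satisfied = sum(1 for d in durations if d <= t)
    tolerating = sum(1 for d in durations if t < d <= 4 * t)
    return (satisfied + tolerating / 2) / n


def speedup(before, after):
    """mean(before) / mean(after) from the two raw timing columns."""
    if not before or not after:
        return None
    mean_after = math.fsum(after) / len(after)
    if mean_after <= 0:
        return None  # a non-positive mean timing cannot support a ratio
    return (math.fsum(before) / len(before)) / mean_after
```

For durations `[0.1, 0.4, 0.6, 1.9, 2.5]` and T = 0.5, two requests are satisfied, two are tolerating, and one is frustrated, so the index is (2 + 1) / 5 = 0.6.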
Modern eval claims, from raw ranked results and per-sample grades — including the exact unbiased pass@k estimator from the HumanEval paper.
“recall@10 0.84”
How much of the relevant material the retriever surfaced in the top k.
recompute Per query: relevant-in-top-k over all-relevant; averaged, zero-relevant queries skipped.
“NDCG@10 0.71”
Rank-discounted gain against the ideal ordering.
recompute DCG with 1/log₂(i+1) discounts over IDCG, linear or exponential gains, on deterministic log kernels.
“MRR 0.62”
How early the first relevant result appears.
recompute Mean of 1/rank of the first relevant result per query; 0 when none.
“top-5 accuracy 0.91”
Hit rate: queries with at least one relevant result in the top k.
recompute Per-query any-relevant-in-top-k, averaged.
“exact match 71%”
String-level answer accuracy, strict or SQuAD-normalized.
recompute Mean exact match; normalized mode lowercases, strips punctuation and articles, collapses whitespace.
“pass@1 0.47”
Code-generation success with the unbiased estimator — not the naive fraction.
recompute Per problem 1 − C(n−c,k)/C(n,k) in exact integer arithmetic; fewer than k samples degrades.
“precision@10 0.31”
How much of the top k was actually relevant.
recompute Per query: relevant-in-top-k / k, averaged over all queries.
“MAP@10 0.44”
Mean average precision — rank-sensitive retrieval quality.
recompute Per query Σ P@i·relᵢ / min(R, k), zero-relevant queries skipped, averaged.
“perplexity 12.4”
Language-model fit, recomputed from the raw per-token log-probabilities.
recompute exp(−mean(logprob)) on deterministic exp kernels; positive logprobs degrade.
“WER 8.4%”
Word (or character) error rate of transcriptions against references.
recompute Corpus-level Levenshtein edits / reference tokens, exact DP.
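The pass@k and MRR recipes above lend themselves to exact arithmetic. The sketch below uses Python's `math.comb` and `fractions.Fraction` so that no floating-point rounding enters the result; the function names, the `(n, c)` input layout, and the use of `None` for a degraded result are assumptions of this example.

```python
from fractions import Fraction
from math import comb


def pass_at_k(n, c, k):
    """Unbiased estimator 1 - C(n - c, k) / C(n, k) for one problem.

    n = samples drawn, c = samples that passed. Fewer than k samples degrades.
    """
    if k < 1 or n < k or not 0 <= c <= n:
        return None
    # comb(n - c, k) is 0 when fewer than k samples failed, giving exactly 1.
    return 1 - Fraction(comb(n - c, k), comb(n, k))


def mean_pass_at_k(problems, k):
    """Average the per-problem estimates; problems is a list of (n, c) pairs."""
    values = [pass_at_k(n, c, k) for n, c in problems]
    if not values or any(v is None for v in values):
        return None
    return sum(values, Fraction(0)) / len(values)


def mrr(first_relevant_ranks):
    """Mean of 1 / rank of the first relevant result (1-based); None scores 0."""
    if not first_relevant_ranks:
        return None
    total = sum(
        (Fraction(1, r) if r is not None else Fraction(0) for r in first_relevant_ranks),
        Fraction(0),
    )
    return total / len(first_relevant_ranks)
```

For example, `pass_at_k(5, 2, 2)` is 1 − 3/10 = 7/10, which differs from the naive fraction 2/5 of passing samples.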
“The experiment was significant” — recomputed from the raw samples on deterministic special-function kernels that match SciPy to better than 1e-11.
“p = 0.003”
Two-sided two-sample significance, recomputed from the raw observations.
recompute Welch (default), pooled, or z statistic; p via a deterministic incomplete-beta kernel.
“95% CI ± 0.4”
The margin of error of a mean.
recompute Critical t (or z) × s/√n, with critical values from deterministic bisection inverses.
“the variant lifted conversion 12%”
Treatment uplift over control, relative or absolute.
recompute (mean_B − mean_A) / mean_A for relative uplift, or mean_B − mean_A for absolute, from the raw per-user outcomes.
“χ² = 5.99, p < 0.05”
Independence of two categorical variables, from the raw observation pairs — the contingency table is rebuilt, not trusted.
recompute Expected counts from marginals, Yates correction only at df = 1, p from a deterministic incomplete-gamma kernel.
“Spearman correlation 0.8”
Pearson or Spearman association between two columns.
recompute fsum-centered Pearson; Spearman ranks first with ties midranked.
“Cohen's d 0.4”
Standardized mean difference between two groups.
recompute Pooled-SD d, exact-gamma Hedges' g, or Glass's Δ.
“Mann-Whitney p 0.03”
The non-parametric two-sample test — no normality assumption to hide behind.
recompute Tie-corrected, continuity-corrected asymptotic p from midranks.
“KS test p 0.01”
Whether two samples come from the same distribution.
recompute max ECDF gap, p from the classical Kolmogorov asymptotic on deterministic exp kernels.
“ANOVA p = 0.01”
Whether group means differ, from the raw (group, value) rows.
recompute F = between-group / within-group mean squares; p via the deterministic incomplete beta.
“conversion difference significant, p 0.002”
Whether two conversion rates actually differ.
recompute Pooled two-proportion z from the raw 0/1 outcomes; two-sided p.
“Fisher exact p 0.04”
The small-sample 2×2 test — exact, no approximation.
recompute Exact hypergeometric arithmetic via integer combinatorics, two-sided.
“odds ratio 2.3”
Association strength in a 2×2, from the raw observation pairs.
recompute (a·d)/(b·c); a zero cell degrades unless the Haldane +0.5 convention is named.
“relative risk 1.4”
Risk ratio between two groups, from the raw pairs.
recompute (a/(a+b)) / (c/(c+d)), rows by sorted group key.
“Cramér's V 0.21”
Effect size of a categorical association — significance isn't strength.
recompute √(χ²/(n·min(r−1, c−1))) with the uncorrected χ².
“skewness −0.8”
Asymmetry of a distribution.
recompute Biased g₁ = m₃/m₂^1.5 (the SciPy default).
“excess kurtosis 4.1”
Tail weight — how often the extreme days happen.
recompute Biased excess g₂ = m₄/m₂² − 3 (Fisher).
“JB p < 0.01 — not normal”
Normality test from skewness and kurtosis.
recompute JB = (n/6)·(S² + K²/4), with S the skewness and K the excess kurtosis; p from the deterministic χ²(2) kernel.
“lag-1 autocorrelation 0.22”
Serial dependence in a series — the thing that invalidates i.i.d. claims.
recompute Standard biased ACF at the claimed lag.
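For the 2×2 recipes above, integer combinatorics are enough to get an exact answer. The sketch below computes a two-sided Fisher exact p-value and an odds ratio with `fractions.Fraction`. The function names are hypothetical, and the two-sided rule used here (sum the probabilities of all tables no more likely than the observed one) is an assumption of this example about which convention applies.

```python
from fractions import Fraction
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Exact two-sided p for the table [[a, b], [c, d]] with fixed margins."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2
    if n == 0:
        return None
    total = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return Fraction(comb(row1, x) * comb(row2, col1 - x), total)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Exact rational comparison, so no tolerance is needed for ties.
    return sum((prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs), Fraction(0))


def odds_ratio(a, b, c, d):
    """(a * d) / (b * c); a zero cell degrades to None in this sketch."""
    if 0 in (a, b, c, d):
        return None
    return Fraction(a * d, b * c)
```

For the table `[[3, 1], [1, 3]]` the p-value is exactly 17/35 (about 0.486) and the odds ratio is 9.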
Beyond trading: the numbers that go in board decks and term sheets, rebuilt from the raw ledgers and cash flows.
“monthly CAGR 23.9%”
Compound annual growth from a time-ordered value series.
recompute (last/first)^(1/years) − 1 on deterministic power kernels; the period convention comes from the claim itself.
“NPV $310k at 8%”
Net present value of a raw cash-flow column.
recompute Σ cf_t/(1+r)^t with exact integer-power expansion; no rate claimed means no verdict.
“IRR 18%”
The discount rate where the cash flows break even.
recompute Deterministic bisection on NPV; no sign change in the flows degrades instead of guessing.
“churn 3.1%”
Churned over total from the raw per-customer flags.
recompute Count of churn flags over n; retention is its complement.
“gross margin 61%”
Margin recomputed from the raw revenue and cost rows.
recompute (Σ revenue − Σ cost) / Σ revenue under correctly-rounded summation.
“the totals match the ledger”
Whether two totals actually reconcile — across two files.
recompute fsum(A) − fsum(B) with cross-artifact bindings; 0 is reconciled, anything else is the exact discrepancy.
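The NPV and IRR recipes above can be sketched with a plain bisection. The function names, the search bracket of −99% to +1000%, the tolerance, and the use of Python's `**` operator in place of an exact integer-power expansion are all simplifying assumptions of this example.

```python
import math


def npv(rate, cashflows):
    """Sum of cf_t / (1 + rate)^t with t = 0 for the first flow."""
    return math.fsum(cf / (1.0 + rate) ** t for t, cf in enumerate(cashflows))


def irr(cashflows, lo=-0.99, hi=10.0, tol=1e-12, max_iter=200):
    """Bisection on NPV; None when the flows or the bracket show no sign change."""
    if not (any(cf > 0 for cf in cashflows) and any(cf < 0 for cf in cashflows)):
        return None  # no sign change in the flows: degrade instead of guessing
    f_lo, f_hi = npv(lo, cashflows), npv(hi, cashflows)
    if f_lo == 0:
        return lo
    if f_hi == 0:
        return hi
    if (f_lo > 0) == (f_hi > 0):
        return None  # no root bracketed in the assumed search range
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        f_mid = npv(mid, cashflows)
        if f_mid == 0 or hi - lo < tol:
            return mid
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

Paying 100 today to receive 110 next period yields an IRR of about 10%, and `npv(0.10, [-100, 110])` is zero up to rounding. Flows with several sign changes can have more than one root, and this sketch returns only the one that bisection happens to find.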
Drafted offline as compositions of the existing deterministic kernels, then admitted by a deterministic gate: differential testing against the named reference implementation, a metamorphic property suite, degeneracy checks, and a bit-stability double-run. Frozen under a content hash — never improvised at verify time.
“mean 4.2 ± 0.3 (SEM)”
The ± a mean is reported with: sample dispersion shrunk by √n.
recompute Compiled composition: fstd(ddof=1) / √n on the fsum kernels.
“CV 12%”
Relative dispersion — std as a fraction of the mean; scale-invariant.
recompute Compiled composition: fstd(ddof=1) / fmean on the fsum kernels.
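A compiled composition of this kind might look like the sketch below, where SEM and CV are built from two small helpers. The helper bodies are assumptions written for this example (only the names `fstd` and `fmean` come from the recipes above), with `math.fsum` standing in for the fsum kernels and `None` for a degraded result.

```python
import math


def fmean(xs):
    """Mean via correctly-rounded summation."""
    return math.fsum(xs) / len(xs)


def fstd(xs, ddof=1):
    """Standard deviation with fsum-based accumulation; None when undefined."""
    n = len(xs)
    if n - ddof <= 0:
        return None
    m = fmean(xs)
    return math.sqrt(math.fsum((x - m) ** 2 for x in xs) / (n - ddof))


def sem(xs):
    """Standard error of the mean: fstd(ddof=1) / sqrt(n)."""
    sd = fstd(xs)
    return None if sd is None else sd / math.sqrt(len(xs))


def cv(xs):
    """Coefficient of variation: fstd(ddof=1) / fmean; a zero mean degrades."""
    sd = fstd(xs)
    if sd is None:
        return None
    m = fmean(xs)
    if m == 0:
        return None
    return sd / m
```

For the sample `[2, 4, 4, 4, 5, 5, 7, 9]` the mean is 5 and the sample variance is 32/7, so the SEM is √(4/7) ≈ 0.756 and the CV is about 0.428.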
Every recipe above is exercised by the open-source test suite against its reference implementation, and the whole library is one dependency-free folder. If your claim isn't covered yet, the contract format lets you pin any column to any recipe — and the library grows in validated packs, never one-off hacks.