The recipe library

120 ways to rebuild a number. Zero opinions.

A recipe is the deterministic procedure Calma uses to recompute one kind of claim from raw output files. Every recipe obeys the same four rules: it reads only machine-readable raw outputs — never the number that was reported; it runs on bit-stable deterministic kernels (no GPU, no platform math libraries, no model anywhere in the path); it is validated against the published reference implementation — scikit-learn, SciPy, NumPy, numpy-financial, the HumanEval estimator — across 385 pinned reference vectors before it ships; and when its input is broken or ambiguous it degrades to “can't confirm” instead of guessing.

Claim it verifies · how the number is rebuilt · validated against

01 · Trading & backtests · 16 recipes

The original family: a backtest's headline numbers, rebuilt from the raw return series — never from the chart or the summary cell.

Total return

total_return

“+14,698% backtest”

The compounded return of a strategy, from its raw per-period returns.

recompute prod(1+r) − 1 over the return column, computed with a pairwise product for bit-stable accuracy.

validated against NumPy cumulative product
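
As a rough illustration of the recompute step above, here is a minimal Python sketch of a pairwise (recursive-halving) product. It is not Calma's actual kernel; the function names are hypothetical and the input is assumed to be a plain list of per-period returns expressed as fractions.

```python
def pairwise_product(values):
    """Multiply values by recursive halving, so rounding error grows slowly with length."""
    n = len(values)
    if n == 0:
        return 1.0
    if n == 1:
        return float(values[0])
    mid = n // 2
    return pairwise_product(values[:mid]) * pairwise_product(values[mid:])


def total_return(returns):
    """Compounded return prod(1 + r) - 1 from raw per-period returns."""
    return pairwise_product([1.0 + r for r in returns]) - 1.0
```

For example, total_return([0.10, -0.05, 0.20]) compounds 1.10 × 0.95 × 1.20 and comes out at roughly 0.254, i.e. +25.4%.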

Sharpe ratio

sharpe

“Sharpe 2.1”

Risk-adjusted return, annualized, with a sampling standard error attached.

recompute mean / std (ddof=1) × √periods; near-zero volatility degrades the verdict instead of dividing by noise.

validated against NumPy

conventions periods: 252 / 365 / 52
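
The Sharpe formula above can be sketched in a few lines of Python. This is an illustration only: the function name, the `min_vol` threshold, and the use of `None` as a stand-in for a degraded verdict are assumptions of the sketch, and the sampling standard error mentioned above is left out.

```python
import math
import statistics


def sharpe(returns, periods=252, min_vol=1e-12):
    """Annualized Sharpe: mean / sample std (ddof=1) * sqrt(periods).

    Returns None when there are fewer than two observations or the
    volatility is effectively zero. min_vol is a hypothetical threshold.
    """
    if len(returns) < 2:
        return None
    mean = math.fsum(returns) / len(returns)
    std = statistics.stdev(returns)  # sample standard deviation (ddof=1)
    if std < min_vol:
        return None  # dividing by noise would produce a meaningless ratio
    return mean / std * math.sqrt(periods)
```

A constant return series such as [0.01, 0.01, 0.01] has zero sample volatility, so the sketch returns None rather than a ratio.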

Max drawdown

max_drawdown

“max drawdown −12%”

The worst peak-to-trough loss on the equity curve.

recompute Running peak vs. equity from the compounded return path; flagged path-dependent so a near-tie can never refute.

validated against NumPy
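
A minimal Python sketch of the running-peak computation described above. The function name is hypothetical, and the sketch assumes the equity curve starts at 1.0, which also counts as the first peak; the path-dependence flag is not modelled.

```python
def max_drawdown(returns):
    """Worst peak-to-trough loss on the compounded equity curve, as a negative fraction."""
    equity = 1.0  # assumed starting equity
    peak = 1.0
    worst = 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = min(worst, equity / peak - 1.0)
    return worst
```

With returns of +10%, -20%, +5%, -10%, the equity peaks at 1.10 and bottoms at 0.8316, so the sketch reports roughly -0.244.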

Volatility

volatility

“annualized vol 18%”

Annualized standard deviation of returns.

recompute std(ddof=1) × √periods over the raw return column.

validated against NumPy

conventions periods: 252 / 365 / 52

Sortino ratio

sortino

“Sortino 2.4”

Return per unit of downside risk — upside volatility doesn't count against you.

recompute mean / √(mean(min(r,0)²)) × √periods, target 0, full-sample denominator.

validated against standard definition (documented convention)

conventions periods: 252 / 365 / 52

Calmar ratio

calmar

“Calmar 1.1”

Annualized return against the worst drawdown.

recompute CAGR-style annualized return / |max drawdown|; path-dependent flagged.

validated against standard definition

Downside deviation

downside_deviation

“downside deviation 9%”

Volatility of only the losing periods.

recompute √(mean(min(r,0)²)) × √periods.

validated against standard definition

Value at risk

value_at_risk

“95% VaR 2.1%”

The loss the portfolio shouldn't exceed at the stated confidence — historical method.

recompute −quantile(returns, 1−level), reported as a positive loss; level parsed from the claim.

validated against NumPy quantile (historical VaR)

conventions p95 (default) / p99

CVaR / expected shortfall

cvar

“CVaR 3.0%”

The average loss in the tail beyond VaR — what actually happens on the bad days.

recompute −mean(returns ≤ VaR cut), reported positive.

validated against historical expected shortfall

conventions p95 (default) / p99
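
The two tail-risk recipes above (historical VaR and CVaR) can be sketched together in Python. The helper names are hypothetical, the quantile uses linear interpolation between order statistics (the rule NumPy applies by default), and a non-empty return list is assumed.

```python
import math


def quantile_linear(values, q):
    """Linear-interpolation quantile of a non-empty list, 0 <= q <= 1."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


def historical_var(returns, level=0.95):
    """Loss not exceeded at the given confidence, reported as a positive number."""
    return -quantile_linear(returns, 1.0 - level)


def historical_cvar(returns, level=0.95):
    """Average loss in the tail at or below the VaR cut, reported positive."""
    cut = quantile_linear(returns, 1.0 - level)
    tail = [r for r in returns if r <= cut]  # never empty: the minimum is always <= cut
    return -math.fsum(tail) / len(tail)
```

Because CVaR averages only the returns at or below the VaR cut, it is never smaller than VaR on the same series.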

Win rate

win_rate

“win rate 58%”

Fraction of strictly positive periods.

recompute count(r > 0) / n from the raw returns.

validated against definitional

Profit factor

profit_factor

“profit factor 1.8”

Gross gains over gross losses.

recompute Σ gains / |Σ losses|; a series with no losses degrades rather than dividing by zero.

validated against definitional

Omega ratio

omega_ratio

“omega 1.4”

Probability-weighted gains over losses around a threshold.

recompute Σ max(r−θ, 0) / Σ max(θ−r, 0).

validated against Keating & Shadwick

conventions threshold=<frac> (0)

Beta

beta

“beta 1.2”

Sensitivity to the benchmark.

recompute cov(r, b) / var(b), sample (ddof=1), from the two raw return columns.

validated against NumPy cov

Alpha

alpha

“alpha 4% annualized”

Return unexplained by benchmark exposure (simple CAPM, rf = 0).

recompute (mean(r) − β·mean(b)) × periods.

validated against CAPM definition (documented)

conventions periods: 252 / 365 / 52

Information ratio

information_ratio

“IR 0.7”

Active return per unit of tracking error.

recompute mean(r−b) / std(r−b, ddof=1) × √periods.

validated against standard definition

Tracking error

tracking_error

“tracking error 3.1%”

Annualized volatility of the active return.

recompute std(r−b, ddof=1) × √periods.

validated against standard definition

02 · ML classification · 20 recipes

The metrics every model card claims, recomputed from the raw predictions file — including the calibration and imbalance metrics that hide the most sins.

Accuracy

accuracy

“accuracy 0.87”

Fraction of predictions that match the labels.

recompute Exact match count over n, from the prediction and label columns.

validated against scikit-learn accuracy_score

ROC-AUC

auc

“AUC 0.91”

Probability a positive outranks a negative, with tie handling.

recompute Mann–Whitney statistic with ties = 0.5, plus a DeLong sampling standard error for the tolerance budget.

validated against scikit-learn roc_auc_score
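
To make the tie handling above concrete, here is a small Python sketch of the Mann-Whitney form of AUC. It is a quadratic-time illustration with a hypothetical function name, assumes binary 0/1 labels, and omits the DeLong standard error.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; a tied pair counts 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # only one class present: AUC is undefined
    wins_doubled = 0  # twice the win count, so ties stay in integer arithmetic
    for p in pos:
        for n in neg:
            if p > n:
                wins_doubled += 2
            elif p == n:
                wins_doubled += 1
    return wins_doubled / (2 * len(pos) * len(neg))
```

With labels [0, 0, 1, 1] and scores [0.1, 0.4, 0.35, 0.8], three of the four positive-negative pairs are ordered correctly, giving 0.75.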

Precision

precision

“precision 0.92”

Of everything flagged positive, how much actually was.

recompute TP / (TP + FP) from exact integer confusion counts.

validated against scikit-learn precision_score

Recall

recall

“recall 0.85”

Of everything actually positive, how much was found.

recompute TP / (TP + FN) from exact integer confusion counts.

validated against scikit-learn recall_score

F1

“F1 0.84”

Harmonic mean of precision and recall.

recompute 2PR / (P + R) from the same exact confusion counts.

validated against scikit-learn f1_score

Macro-F1

macro_f1

“macro F1 0.55”

Multiclass F1 averaged equally across classes — the one that exposes a model coasting on the majority class.

recompute Per-class binary F1 over the union of observed classes, zero when undefined, unweighted mean.

validated against scikit-learn f1_score (macro)

Micro-F1

micro_f1

“micro F1 0.78”

Multiclass F1 from globally pooled counts.

recompute Global TP/FP/FN over all classes, one F1 from the pooled totals.

validated against scikit-learn f1_score (micro)

PR-AUC

pr_auc

“average precision 0.62”

Area under the precision–recall curve — the honest metric for imbalanced classes.

recompute Step-wise average precision with tied scores grouped at each threshold.

validated against scikit-learn average_precision_score

conventions average_precision / trapezoid

Log loss

log_loss

“log loss 0.31”

How well the predicted probabilities fit the outcomes.

recompute −mean(y ln p + (1−y) ln(1−p)) on deterministic log kernels; a hard 0/1 on the wrong side degrades rather than silently clipping.

validated against scikit-learn log_loss

conventions exact / clip

Brier score

brier

“Brier 0.09”

Mean squared error of the predicted probabilities.

recompute mean((p − y)²) over the probability column.

validated against scikit-learn brier_score_loss

Matthews correlation

mcc

“MCC 0.61”

The single correlation coefficient between predictions and truth — robust to imbalance, binary or multiclass.

recompute Gorodkin's multiclass MCC from exact integer sums; zero denominator returns 0.

validated against scikit-learn matthews_corrcoef

Calibration error (ECE)

ece

“ECE 3.2%”

Whether “90% confident” actually means right 90% of the time.

recompute Equal-width confidence bins; Σ (n_b/n) · |accuracy_b − confidence_b|.

validated against Guo et al. 2017

conventions bins=<n> (15)
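
The binned sum above can be sketched as follows. This Python illustration assumes two parallel inputs, the model's confidence per prediction and a flag for whether the prediction was correct; the function name and the right-closed bin edges are assumptions of the sketch.

```python
import math


def ece(confidences, correct, bins=15):
    """Expected calibration error: sum over bins of (n_b / n) * |accuracy_b - confidence_b|."""
    n = len(confidences)
    if n == 0:
        return None
    conf_by_bin = [[] for _ in range(bins)]
    hits_by_bin = [0] * bins
    for c, ok in zip(confidences, correct):
        # Bin b covers (b / bins, (b + 1) / bins]; a confidence of 0 goes to the first bin.
        b = min(bins - 1, max(0, math.ceil(c * bins) - 1))
        conf_by_bin[b].append(c)
        hits_by_bin[b] += 1 if ok else 0
    total = 0.0
    for confs, hits in zip(conf_by_bin, hits_by_bin):
        if confs:
            n_b = len(confs)
            total += (n_b / n) * abs(hits / n_b - math.fsum(confs) / n_b)
    return total
```

A model that is always 100% confident and always right scores 0; one that says 0.9 but is right half the time contributes a gap of 0.4 for that bin.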

Balanced accuracy

balanced_accuracy

“balanced accuracy 0.71”

Accuracy that a majority-class model can't fake — mean per-class recall.

recompute Recall computed per class over the observed classes, unweighted mean.

validated against scikit-learn balanced_accuracy_score

Cohen's kappa

cohen_kappa

“kappa 0.64”

Agreement beyond what chance would produce, binary or multiclass.

recompute (p_observed − p_expected) / (1 − p_expected) with exact integer marginals.

validated against scikit-learn cohen_kappa_score

Specificity

specificity

“specificity 0.95”

Of everything actually negative, how much was correctly left alone.

recompute TN / (TN + FP) from exact confusion counts.

validated against definitional (confusion matrix)

F-beta

fbeta

“F2 score 0.66”

F-score weighted toward recall (F2) or precision (F0.5) — beta parsed from the claim.

recompute (1+β²)PR / (β²P + R) from exact confusion counts.

validated against scikit-learn fbeta_score

conventions beta=<v> (1)

Jaccard / IoU

jaccard

“IoU 0.58”

Overlap between predicted and actual positives.

recompute TP / (TP + FP + FN).

validated against scikit-learn jaccard_score

Weighted F1

weighted_f1

“weighted F1 0.69”

Multiclass F1 weighted by class support.

recompute Per-class F1 × support share, summed; zero_division=0.

validated against scikit-learn f1_score (weighted)

KS statistic

ks_statistic

“KS 0.31”

The credit-scoring KS: how separated the score distributions of goods and bads are.

recompute max |ECDF_pos − ECDF_neg| over the pooled score thresholds.

validated against SciPy ks_2samp statistic

Gini (accuracy ratio)

gini_norm

“Gini 0.42”

The credit-model Gini — a rescaled AUC.

recompute 2·AUC − 1 from the Mann-Whitney AUC.

validated against scikit-learn roc_auc_score (rescaled)

03 · Regression & forecasting · 15 recipes

Fit metrics from the raw prediction/actual pairs, plus the forecasting set — where a zero actual degrades honestly instead of being epsilon-fudged.

RMSE

rmse

“RMSE 4.2”

Root mean squared prediction error.

recompute √(mean((p − a)²)) under correctly-rounded summation.

validated against scikit-learn mean_squared_error

MAE

mae

“MAE 2.8”

Mean absolute prediction error.

recompute mean(|p − a|) under correctly-rounded summation.

validated against scikit-learn mean_absolute_error

R²

“R² 0.93”

Variance explained by the predictions.

recompute 1 − SS_res / SS_tot; zero-variance targets degrade instead of dividing by zero.

validated against scikit-learn r2_score

MAPE / sMAPE

mape

“MAPE 8.2%”

Percentage forecast error, plain or symmetric.

recompute mean(|p − a| / |a|); any zero actual is a degenerate verdict, never an epsilon fudge. sMAPE: mean(2|p − a| / (|p| + |a|)).

validated against scikit-learn mean_absolute_percentage_error

conventions mape / smape
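
A short Python sketch of the degrade-on-zero behaviour described above. The function names and the use of `None` for a degenerate verdict are assumptions; treating a sMAPE term with |p| + |a| = 0 as degenerate is also an assumption of this sketch, not something the recipe text states.

```python
import math


def mape(pred, actual):
    """mean(|p - a| / |a|); any zero actual yields None instead of an epsilon fudge."""
    if not actual or len(pred) != len(actual) or any(a == 0 for a in actual):
        return None
    return math.fsum(abs(p - a) / abs(a) for p, a in zip(pred, actual)) / len(actual)


def smape(pred, actual):
    """mean(2|p - a| / (|p| + |a|)); a zero denominator yields None (assumed behaviour)."""
    if not actual or len(pred) != len(actual):
        return None
    if any(abs(p) + abs(a) == 0 for p, a in zip(pred, actual)):
        return None
    terms = (2 * abs(p - a) / (abs(p) + abs(a)) for p, a in zip(pred, actual))
    return math.fsum(terms) / len(actual)
```

Predictions of 110 and 90 against actuals of 100 and 100 give a MAPE of 0.10, while a single zero actual makes the whole result None.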

MASE

mase

“MASE 0.84”

Forecast error scaled by the naive seasonal forecast — the scale-free standard.

recompute MAE / mean(|a_t − a_{t−m}|), the in-sample naive-m benchmark.

validated against Hyndman & Koehler 2006

conventions m=<season> (1)

Pinball loss

pinball_loss

“pinball loss 1.3 at q=0.9”

Quantile forecast accuracy.

recompute mean(max(q(a−p), (q−1)(a−p))) at the claimed quantile.

validated against scikit-learn mean_pinball_loss

conventions q=<quantile> (0.5)

MSLE / RMSLE

msle

“RMSLE 0.12”

Log-scale squared error — the Kaggle standard for skewed targets.

recompute mean((ln(1+p) − ln(1+a))²) on deterministic log kernels; values ≤ −1 degrade.

validated against scikit-learn mean_squared_log_error

conventions msle / rmsle

Median absolute error

medae

“median absolute error 1.9”

The robust error metric outliers can't inflate.

recompute median(|p − a|) with linear-interpolation quantiles.

validated against scikit-learn median_absolute_error

Max error

max_error

“max error 4.2”

The single worst prediction.

recompute max(|p − a|).

validated against scikit-learn max_error

Explained variance

explained_variance

“explained variance 0.90”

Variance captured, ignoring constant bias.

recompute 1 − var(a−p) / var(a), population variances.

validated against scikit-learn explained_variance_score

Adjusted R²

adjusted_r2

“adjusted R² 0.91 with 8 predictors”

R² penalized for the number of predictors — kills the add-features-until-it-fits trick.

recompute 1 − (1−R²)(n−1)/(n−p−1); predictor count from the claim, or no verdict.

validated against standard formula over sklearn R²

conventions p=<predictors> (required)

Normalized RMSE

nrmse

“NRMSE 7%”

RMSE made scale-free.

recompute RMSE / mean(actual), or / range by convention.

validated against standard definition

conventions mean (default) / range

WAPE

wape

“WAPE 9.3%”

The retail-forecasting error standard — robust where MAPE explodes.

recompute Σ|p − a| / Σ|a|.

validated against standard definition

Forecast bias

forecast_bias

“2% over-forecast”

Systematic over- or under-forecasting.

recompute (Σp − Σa) / Σa; positive = over-forecast.

validated against standard definition

Durbin-Watson

durbin_watson

“DW 1.9”

Autocorrelation left in the residuals — the classic misspecification tell.

recompute Σ(e_t − e_{t−1})² / Σe² on the raw residuals.

validated against statsmodels durbin_watson

04 · Data & analytics · 21 recipes

What every data agent claims after a pipeline run: totals, medians, group-bys, distincts, nulls — and whether the merge silently dropped rows.

Column sum

column_sum

“total $4.2M”

A column total — revenue, amounts, quantities.

recompute Correctly-rounded fsum over the raw column; the claim's own reported precision ($4.2M = ±50k) sets the tolerance.

validated against NumPy sum
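
One way to read the tolerance rule above is "half a unit in the last reported digit". The Python sketch below works under that assumption, which is inferred from the $4.2M = ±50k example; the function names and the `scale` parameter (1e6 for millions) are hypothetical.

```python
import math
from decimal import Decimal


def tolerance_from_reported(number_text, scale=1.0):
    """Half a unit in the last reported digit, times the unit scale.

    "4.2" with scale 1e6 (millions) gives 0.05 * 1e6, i.e. about 50,000.
    """
    exponent = Decimal(number_text).as_tuple().exponent  # -1 for "4.2"
    return 0.5 * (10.0 ** exponent) * scale


def check_column_sum(column, number_text, scale=1.0):
    """True when the correctly-rounded column total is within the claim's own precision."""
    claimed = float(number_text) * scale
    recomputed = math.fsum(column)
    return abs(recomputed - claimed) <= tolerance_from_reported(number_text, scale)
```

A column totalling 4,180,000 is consistent with a claimed "$4.2M" under this rule, while a column totalling 3,000,000 is not.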

Column mean

column_mean

“average order $54”

A column average.

recompute fsum / n over the raw column.

validated against NumPy mean

Column median

column_median

“median 42”

The 50th percentile of a column.

recompute NumPy's default linear-interpolation quantile at q = 0.5.

validated against NumPy median

Percentile

percentile

“90th percentile 120”

Any quantile of a column.

recompute Linear-interpolation quantile (NumPy's default “linear” method, Hyndman & Fan type 7) at the claimed q.

validated against NumPy quantile

conventions p95 / p99.9 / q=0.9

Row count

row_count

“processed 10,000 rows”

How many rows a pipeline actually produced.

recompute A literal count of the artifact's rows.

validated against definitional

Group-by aggregate

groupby_aggregate

“revenue in the West was $310k”

A per-group sum or mean.

recompute fsum/mean per group key from the raw rows; without a named group there is no scalar to verify, so it degrades.

validated against NumPy per-group

conventions sum:<group> / mean:<group>

Distinct count

distinct_count

“12,402 distinct users”

Unique values in a column.

recompute Distinct stripped cell values, nulls dropped by default.

validated against pandas nunique convention

conventions drop_null / include_null

Duplicate count

duplicate_count

“only 3 duplicates”

Rows that duplicate an earlier row.

recompute n − distinct, the pandas duplicated(keep='first') convention.

validated against pandas convention

Null fraction

null_fraction

“missing data under 2%”

How much of a column is actually empty.

recompute Fraction of cells that are empty / NaN / null tokens, from the raw strings.

validated against definitional

Growth rate

growth_rate

“up 23% MoM”

Period-over-period or total growth of a time-ordered column.

recompute last/prev − 1 (or last/first − 1 for total growth).

validated against definitional

conventions period / total

Share / ratio

ratio_share

“42% of users converted”

The fraction of rows meeting a flag.

recompute Count of truthy flags over n.

validated against definitional

Join row loss

join_row_loss

“merged without losing rows”

Whether a merge dropped (or fanned out) rows.

recompute len(left) − len(joined) across two artifacts; 0 is lossless, negative is fan-out.

validated against definitional

Column min

column_min

“minimum order $12”

The smallest value in a column.

recompute min over the raw column, NaN-propagating.

validated against NumPy min

Column max

column_max

“maximum $9,400”

The largest value in a column.

recompute max over the raw column, NaN-propagating.

validated against NumPy max

Column std dev

column_std

“standard deviation 4.2”

Spread of a column.

recompute Sample standard deviation (ddof=1 by default; ddof=0 when that convention is named).

validated against NumPy std

conventions ddof=1 (default) / ddof=0

Interquartile range

iqr

“IQR 12”

The middle-50% spread.

recompute q75 − q25 with linear-interpolation quartiles.

validated against SciPy iqr

Outlier count

outlier_count

“only 12 outliers”

How many values sit outside the Tukey fences.

recompute count outside [q1 − k·IQR, q3 + k·IQR].

validated against Tukey fences

conventions k=<fence> (1.5)

Mode share

mode_share

“the most common category is 34% of rows”

How dominant the most frequent value is.

recompute count(most frequent cell) / n on the raw strings.

validated against definitional

Gini coefficient

gini_coefficient

“revenue Gini 0.38”

Inequality of a distribution — how top-heavy the column is.

recompute Σ(2i − n − 1)·x₍ᵢ₎ / (n·Σx) over the sorted non-negative values.

validated against standard Gini formula
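
The sorted-rank formula above translates almost directly into Python. This is a sketch with a hypothetical function name; returning `None` for an empty column, a negative value, or a zero total is an assumption about how such inputs would degrade.

```python
import math


def gini(values):
    """Gini coefficient: sum((2i - n - 1) * x_(i)) / (n * sum(x)), with i starting at 1."""
    xs = sorted(values)
    n = len(xs)
    if n == 0 or xs[0] < 0:
        return None
    total = math.fsum(xs)
    if total == 0:
        return None
    weighted = math.fsum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)
```

A perfectly even column such as [1, 1, 1, 1] scores 0, and [0, 0, 0, 10] scores 0.75, the maximum (n − 1)/n for four values.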

Concentration (HHI)

hhi

“HHI 0.18”

Herfindahl-Hirschman concentration of amounts.

recompute Σ(xᵢ/Σx)², on the 0–1 scale.

validated against standard HHI

Entropy

entropy

“entropy 2.4 bits”

How spread out a categorical column is.

recompute Shannon entropy of the value counts on deterministic log kernels.

validated against SciPy entropy

conventions bits (default) / nats

05 · Performance & engineering · 12 recipes

The coding-agent pack: “2.3× faster”, latency percentiles, coverage — recomputed from the raw benchmark timings and reports, not the summary line.

Speedup ratio

speedup_ratio

“2.3× faster”

Before/after speedup from the raw timing runs.

recompute mean(before) / mean(after) — or medians — from the two timing columns.

validated against NumPy

conventions mean / median

Latency p50

latency_p50

“median latency 48ms”

Median latency from the raw per-request durations.

recompute Linear-interpolation quantile at 0.50.

validated against NumPy quantile

Latency p95

latency_p95

“p95 latency 120ms”

Tail latency at the 95th percentile.

recompute Linear-interpolation quantile at 0.95.

validated against NumPy quantile

Latency p99

latency_p99

“p99 under 250ms”

Tail latency at the 99th percentile.

recompute Linear-interpolation quantile at 0.99.

validated against NumPy quantile

Throughput

throughput

“4,200 rps”

Operations per unit time.

recompute n / fsum(durations) over the raw per-op timings.

validated against definitional

Peak memory

peak_memory

“peak memory 1.2GB”

The maximum of a sampled memory series.

recompute max over the raw samples, NaN-propagating.

validated against definitional

Test coverage

test_coverage

“coverage 87%”

Line coverage recomputed from the report's raw per-line hit counts — not the percentage it printed.

recompute lines with hits > 0 over total lines.

validated against coverage.py semantics

Error rate

error_rate

“5xx error rate 0.3%”

Failures over totals, from the raw log rows.

recompute Count of error flags (or HTTP ≥400/≥500 statuses) over n.

validated against definitional

conventions flag / http4xx / http5xx

Latency p90

latency_p90

“p90 latency 95ms”

Tail latency at the 90th percentile.

recompute Linear-interpolation quantile at 0.90.

validated against NumPy quantile

Apdex

apdex

“Apdex 0.94 at t=0.5”

The application-performance satisfaction index.

recompute (satisfied + tolerating/2) / n, where satisfied means duration ≤ T and tolerating means T < duration ≤ 4T; T from the claim.

validated against Apdex standard

conventions t=<seconds> (required)
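
A small Python sketch of the Apdex arithmetic above. The function name is hypothetical; durations and the threshold `t` are assumed to share the same unit, and an empty input returns `None` as a stand-in for a degraded verdict.

```python
def apdex(durations, t):
    """(satisfied + tolerating / 2) / n, with satisfied <= t and t < tolerating <= 4t."""
    if not durations or t <= 0:
        return None
    satisfied = sum(1 for d in durations if d <= t)
    tolerating = sum(1 for d in durations if t < d <= 4 * t)
    return (satisfied + tolerating / 2) / len(durations)
```

With t = 0.5 and durations [0.1, 0.4, 0.6, 1.9, 2.5], two requests are satisfied, two are tolerating and one is frustrated, so the index is (2 + 1) / 5 = 0.6.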

Uptime

uptime_pct

“uptime 99.95%”

Availability from the raw up/down checks.

recompute up checks / total checks.

validated against definitional

Cache hit rate

cache_hit_rate

“cache hit rate 87%”

Hits over lookups, from the raw access log.

recompute hit flags / total accesses.

validated against definitional

06 · Retrieval, RAG & LLM evals · 10 recipes

Modern eval claims, from raw ranked results and per-sample grades — including the exact unbiased pass@k estimator from the HumanEval paper.

Recall@k

recall_at_k

“recall@10 0.84”

How much of the relevant material the retriever surfaced in the top k.

recompute Per query: relevant-in-top-k over all-relevant; averaged, zero-relevant queries skipped.

validated against standard IR definition

conventions k=<n> (10)

NDCG@k

ndcg_at_k

“NDCG@10 0.71”

Rank-discounted gain against the ideal ordering.

recompute DCG with 1/log₂(i+1) discounts over IDCG, linear or exponential gains, on deterministic log kernels.

validated against scikit-learn ndcg_score

conventions k=<n>, optional exp

MRR

mrr

“MRR 0.62”

How early the first relevant result appears.

recompute Mean of 1/rank of the first relevant result per query; 0 when none.

validated against standard IR definition

conventions k=<n> (uncapped)

Top-k accuracy

top_k_accuracy

“top-5 accuracy 0.91”

Hit rate: queries with at least one relevant result in the top k.

recompute Per-query any-relevant-in-top-k, averaged.

validated against standard definition

conventions k=<n> (5)

Exact match

exact_match

“exact match 71%”

String-level answer accuracy, strict or SQuAD-normalized.

recompute Mean exact match; normalized mode lowercases, strips punctuation and articles, collapses whitespace.

validated against SQuAD evaluation script

conventions strict / normalized

pass@k

pass_at_k

“pass@1 0.47”

Code-generation success with the unbiased estimator — not the naive fraction.

recompute Per problem 1 − C(n−c,k)/C(n,k) in exact integer arithmetic; fewer than k samples degrades.

validated against Chen et al. 2021 (HumanEval)

conventions k=<n> (1)
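
The exact-integer form of the estimator above can be sketched with Python's `math.comb` and `fractions.Fraction`. The function names are hypothetical, and letting one under-sampled problem degrade the whole claim is an assumption of this sketch.

```python
from fractions import Fraction
from math import comb


def pass_at_k_problem(n, c, k):
    """Unbiased estimate 1 - C(n - c, k) / C(n, k) for one problem.

    n = samples generated, c = samples that passed, k = the k in pass@k.
    Returns None when there are fewer than k samples.
    """
    if k < 1 or n < k or not 0 <= c <= n:
        return None
    return 1 - Fraction(comb(n - c, k), comb(n, k))


def pass_at_k(results, k):
    """Mean of the per-problem estimates over (n, c) pairs, kept as an exact fraction."""
    estimates = [pass_at_k_problem(n, c, k) for n, c in results]
    if not estimates or any(e is None for e in estimates):
        return None  # assumed: one under-sampled problem degrades the whole claim
    return sum(estimates) / len(estimates)
```

For a problem with 5 samples of which 1 passed, pass@2 is 1 − C(4,2)/C(5,2) = 1 − 6/10 = 2/5, which differs from the naive pass fraction of 1/5.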

Precision@k

precision_at_k

“precision@10 0.31”

How much of the top k was actually relevant.

recompute Per query: relevant-in-top-k / k, averaged over all queries.

validated against standard IR definition

conventions k=<n> (10)

MAP@k

map_at_k

“MAP@10 0.44”

Mean average precision — rank-sensitive retrieval quality.

recompute Per query Σ P@i·relᵢ / min(R, k), zero-relevant queries skipped, averaged.

validated against standard definition (min(R,k) denominator, documented)

conventions k=<n> (10)
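
A Python sketch of MAP@k with the min(R, k) denominator described above. The input shape is an assumption: each query is a pair of a ranked list of 0/1 relevance flags and R, the total number of relevant items for that query. The function names are hypothetical.

```python
def average_precision_at_k(ranked_relevance, total_relevant, k):
    """Sum of P@i at each relevant rank i <= k, divided by min(R, k)."""
    hits = 0
    score = 0.0
    for i, rel in enumerate(ranked_relevance[:k], start=1):
        if rel:
            hits += 1
            score += hits / i  # precision at rank i
    return score / min(total_relevant, k)


def map_at_k(queries, k=10):
    """Mean of per-query AP@k; queries with no relevant items are skipped."""
    scores = [
        average_precision_at_k(ranked, total, k)
        for ranked, total in queries
        if total > 0
    ]
    if not scores:
        return None
    return sum(scores) / len(scores)
```

For a query ranked [1, 0, 1] with R = 2 and k = 3, the average precision is (1/1 + 2/3) / 2, about 0.833.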

Perplexity

perplexity

“perplexity 12.4”

Language-model fit, recomputed from the raw per-token log-probabilities.

recompute exp(−mean(logprob)) on deterministic exp kernels; positive logprobs degrade.

validated against standard definition

WER / CER

wer

“WER 8.4%”

Word (or character) error rate of transcriptions against references.

recompute Corpus-level Levenshtein edits / reference tokens, exact DP.

validated against jiwer

conventions wer (default) / cer
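
The corpus-level computation above can be illustrated with a standard Levenshtein dynamic program. In this Python sketch the function names are hypothetical and tokenization is assumed to be a plain whitespace split; passing character lists instead of word lists to `edit_distance` would give the CER variant.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution, or match at cost 0
            ))
        prev = cur
    return prev[-1]


def wer(references, hypotheses):
    """Corpus-level WER: total edits over total reference tokens."""
    edits = sum(edit_distance(r.split(), h.split()) for r, h in zip(references, hypotheses))
    tokens = sum(len(r.split()) for r in references)
    if tokens == 0:
        return None
    return edits / tokens
```

Pooling edits and reference tokens across the corpus weights long utterances more than short ones, which is why the result can differ from an average of per-sentence rates.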

07 · Statistical claims · 18 recipes

“The experiment was significant” — recomputed from the raw samples on deterministic special-function kernels that match SciPy to better than 1e-11.

p-value

p_value

“p = 0.003”

Two-sided two-sample significance, recomputed from the raw observations.

recompute Welch (default), pooled, or z statistic; p via a deterministic incomplete-beta kernel.

validated against SciPy ttest_ind

conventions welch / pooled / z

Confidence interval

confidence_interval

“95% CI ± 0.4”

The margin of error of a mean.

recompute Critical t (or z) × s/√n, with critical values from deterministic bisection inverses.

validated against SciPy t.ppf

conventions t95 / t99 / z95

A/B lift

lift

“the variant lifted conversion 12%”

Treatment uplift over control, relative or absolute.

recompute (mean_B − mean_A) / mean_A from the raw per-user outcomes.

validated against definitional

conventions relative / absolute

Chi-square

chi_square

“χ² = 5.99, p < 0.05”

Independence of two categorical variables, from the raw observation pairs — the contingency table is rebuilt, not trusted.

recompute Expected counts from marginals, Yates correction only at df = 1, p from a deterministic incomplete-gamma kernel.

validated against SciPy chi2_contingency

conventions p / statistic, ±yates

Correlation

correlation

“Spearman correlation 0.8”

Pearson or Spearman association between two columns.

recompute fsum-centered Pearson; Spearman ranks first with ties midranked.

validated against SciPy pearsonr / spearmanr

conventions pearson / spearman

Effect size

effect_size

“Cohen's d 0.4”

Standardized mean difference between two groups.

recompute Pooled-SD d, exact-gamma Hedges' g, or Glass's Δ.

validated against standard formulas, exact Hedges J

conventions cohen_d / hedges_g / glass_delta

Mann-Whitney U

mann_whitney

“Mann-Whitney p 0.03”

The non-parametric two-sample test — no normality assumption to hide behind.

recompute Tie-corrected, continuity-corrected asymptotic p from midranks.

validated against SciPy mannwhitneyu (asymptotic)

KS test

ks_test

“KS test p 0.01”

Whether two samples come from the same distribution.

recompute max ECDF gap, p from the classical Kolmogorov asymptotic on deterministic exp kernels.

validated against Kolmogorov asymptotic (SciPy kstwobign)

One-way ANOVA

anova

“ANOVA p = 0.01”

Whether group means differ, from the raw (group, value) rows.

recompute F = between-group / within-group mean squares; p via the deterministic incomplete beta.

validated against SciPy f_oneway

conventions p (default) / statistic

Two-proportion z-test

proportion_z

“conversion difference significant, p 0.002”

Whether two conversion rates actually differ.

recompute Pooled two-proportion z from the raw 0/1 outcomes; two-sided p.

validated against statsmodels proportions_ztest

Fisher's exact test

fisher_exact

“Fisher exact p 0.04”

The small-sample 2×2 test — exact, no approximation.

recompute Exact hypergeometric arithmetic via integer combinatorics, two-sided.

validated against SciPy fisher_exact

Odds ratio

odds_ratio

“odds ratio 2.3”

Association strength in a 2×2, from the raw observation pairs.

recompute (a·d)/(b·c); a zero cell degrades unless the Haldane +0.5 convention is named.

validated against SciPy fisher_exact (sample OR)

conventions sample (default) / haldane

Relative risk

relative_risk

“relative risk 1.4”

Risk ratio between two groups, from the raw pairs.

recompute (a/(a+b)) / (c/(c+d)), rows by sorted group key.

validated against standard definition

Cramér's V

cramers_v

“Cramér's V 0.21”

Effect size of a categorical association — significance isn't strength.

recompute √(χ²/(n·min(r−1, c−1))) with the uncorrected χ².

validated against standard definition over SciPy χ²

Skewness

skewness

“skewness −0.8”

Asymmetry of a distribution.

recompute Biased g₁ = m₃/m₂^1.5 (the SciPy default).

validated against SciPy skew

Kurtosis

kurtosis

“excess kurtosis 4.1”

Tail weight — how often the extreme days happen.

recompute Biased excess g₂ = m₄/m₂² − 3 (Fisher).

validated against SciPy kurtosis

Jarque-Bera

jarque_bera

“JB p < 0.01 — not normal”

Normality test from skewness and kurtosis.

recompute JB = (n/6)·(S² + K²/4), with S the skewness and K the excess kurtosis; p from the deterministic χ²(2) kernel.

validated against SciPy jarque_bera

conventions p (default) / statistic
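
A Python sketch of the Jarque-Bera arithmetic, using the biased moment estimators from the skewness and kurtosis recipes above. It relies on the fact that the χ² survival function with 2 degrees of freedom is exp(−x/2); the function name and the `None` result for a constant series are assumptions of the sketch.

```python
import math


def jarque_bera(xs):
    """Return (JB statistic, p-value) from biased skewness and excess kurtosis."""
    n = len(xs)
    if n == 0:
        return None
    mean = math.fsum(xs) / n
    m2 = math.fsum((x - mean) ** 2 for x in xs) / n
    if m2 == 0:
        return None  # constant series: skewness and kurtosis are undefined
    m3 = math.fsum((x - mean) ** 3 for x in xs) / n
    m4 = math.fsum((x - mean) ** 4 for x in xs) / n
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    jb = n / 6 * (skew ** 2 + excess_kurtosis ** 2 / 4)
    p = math.exp(-jb / 2)  # survival function of chi-square with 2 degrees of freedom
    return jb, p
```

For the symmetric sample [-2, -1, 0, 1, 2] the skewness is 0 and the excess kurtosis is −1.3, so JB is about 0.352 and the p-value is well above any usual significance level.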

Autocorrelation

autocorrelation

“lag-1 autocorrelation 0.22”

Serial dependence in a series — the thing that invalidates i.i.d. claims.

recompute Standard biased ACF at the claimed lag.

validated against statsmodels acf

conventions lag=<k> (1)

08 · Business & finance · 6 recipes

Beyond trading: the numbers that go in board decks and term sheets, rebuilt from the raw ledgers and cash flows.

CAGR

cagr

“monthly CAGR 23.9%”

Compound annual growth from a time-ordered value series.

recompute (last/first)^(1/years) − 1 on deterministic power kernels; the period convention comes from the claim itself.

validated against definitional

conventions periods=<per-year>

NPV

npv

“NPV $310k at 8%”

Net present value of a raw cash-flow column.

recompute Σ cf_t/(1+r)^t with exact integer-power expansion; no rate claimed means no verdict.

validated against numpy-financial npv

conventions rate=<frac>

IRR

irr

“IRR 18%”

The discount rate where the cash flows break even.

recompute Deterministic bisection on NPV; no sign change in the flows degrades instead of guessing.

validated against numpy-financial irr
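
The NPV and IRR recipes above can be sketched together in Python. This is an illustration, not the shipped kernel: the function names, the search bracket of −99% to +1000%, the tolerance, and the use of `None` for a degraded verdict are all assumptions. The fixed lower bracket also limits the sketch to short cash-flow series.

```python
import math


def npv(rate, flows):
    """Sum of cf_t / (1 + rate)^t with t starting at 0."""
    return math.fsum(cf / (1.0 + rate) ** t for t, cf in enumerate(flows))


def irr(flows, lo=-0.99, hi=10.0, tol=1e-12, max_iter=200):
    """Bisection on NPV; returns None when no root can be bracketed."""
    if not (any(cf > 0 for cf in flows) and any(cf < 0 for cf in flows)):
        return None  # no sign change in the flows: no break-even rate to find
    f_lo, f_hi = npv(lo, flows), npv(hi, flows)
    if f_lo * f_hi > 0:
        return None  # root lies outside the assumed bracket
    for _ in range(max_iter):
        mid = (lo + hi) / 2
        f_mid = npv(mid, flows)
        if f_mid == 0 or hi - lo < tol:
            return mid
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return (lo + hi) / 2
```

Investing 100 now and receiving 110 one period later breaks even at a rate of 10%, and bisection always halves the same bracket in the same order, so repeated runs give the same answer.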

Churn / retention

churn_rate

“churn 3.1%”

Churned over total from the raw per-customer flags.

recompute Count of churn flags over n; retention is its complement.

validated against definitional

conventions churn / retention

Margin

margin_pct

“gross margin 61%”

Margin recomputed from the raw revenue and cost rows.

recompute (Σ revenue − Σ cost) / Σ revenue under correctly-rounded summation.

validated against definitional

Reconciliation

reconciliation_total

“the totals match the ledger”

Whether two totals actually reconcile — across two files.

recompute fsum(A) − fsum(B) with cross-artifact bindings; 0 is reconciled, anything else is the exact discrepancy.

validated against definitional

09 · Compiled recipes · 2 recipes

Drafted offline as compositions of the existing deterministic kernels, then admitted by a deterministic gate: differential testing against the named reference implementation, a metamorphic property suite, degeneracy checks, and a bit-stability double-run. Frozen under a content hash — never improvised at verify time.

Standard error of the mean

sem

“mean 4.2 ± 0.3 (SEM)”

The ± a mean is reported with: sample dispersion shrunk by √n.

recompute Compiled composition: fstd(ddof=1) / √n on the fsum kernels.

validated against SciPy sem

conventions compiled-validated · gate: differential + metamorphic + degeneracy + bit-stability

Coefficient of variation

coefficient_of_variation

“CV 12%”

Relative dispersion — std as a fraction of the mean; scale-invariant.

recompute Compiled composition: fstd(ddof=1) / fmean on the fsum kernels.

validated against SciPy variation

conventions compiled-validated · gate: differential + metamorphic + degeneracy + bit-stability

Every recipe above is exercised by the open-source test suite against its reference implementation, and the whole library is one dependency-free folder. If your claim isn't covered yet, the contract format lets you pin any column to any recipe — and the library grows in validated packs, never one-off hacks.
