M/05 — Selection bias decomposition

Decomposing the IS-OOS Sharpe gap

A variance budget that splits in-sample to out-of-sample Sharpe degradation into selection bias, parameter-choice noise, and residual skill — across ten deep walk-forward crypto corpora.

TL;DR

What it is: a variance decomposition of the IS-OOS Sharpe gap into (a) selection bias from picking the in-sample max over a parameter grid, (b) parameter-choice noise across the grid, (c) residual skill.
When to use it: any time a strategy population reports a positive in-sample edge that vanishes out-of-sample — to attribute the collapse to its sources rather than blaming "luck".
What it teaches: on the 10 partitions of the firm's 30-asset corpus where deep-WFO instrumentation is available (7.27M OOS rows), residual skill is statistically indistinguishable from zero — nearly the entire gap is selection bias plus parameter noise. V_param alone explains 18.6% of pooled OOS-Sharpe variance and predicts live profitability monotonically across all ten deciles. The qualitative pattern reproduces on every other asset partition the firm has examined.
Companion: a synthetic null with no skill reproduces the empirical gap shape, confirming the decomposition is mechanical rather than driven by a hidden signal.

The mathematics

Suppose a strategy family is indexed by a parameter k = 1, …, K. For each k you observe an in-sample Sharpe S^IS_k and an out-of-sample Sharpe S^OOS_k. The standard practice — pick the in-sample maximum and hope for the best out-of-sample — defines the IS-OOS gap as

\Delta = S_{k^*}^{IS} - S_{k^*}^{OOS}, \qquad k^* = \arg\max_k S_k^{IS}.

Decompose Δ additively. Write each S^IS_k as a true skill term plus a parameter-choice deviation plus noise:

S_k^{IS} = \mu_k + \delta_k + \varepsilon_k, \qquad E[\delta_k] = 0, \qquad E[\varepsilon_k] = 0.

Then the gap splits into three terms:

\Delta = \underbrace{E[\max_k S_k^{IS}] - \max_k E[S_k^{IS}]}_{\text{selection bias}} + \underbrace{\delta_{k^*} - E[\delta]}_{\text{parameter-choice noise}} + \underbrace{\mu_{k^*} - \mu_{k^*}^{OOS}}_{\text{residual skill}}.

The first term is purely mechanical. For K independent N(0, σ²) draws, Mills’ bound gives the expected maximum

E[\max_{1 \le k \le K} Z_k] \le \sigma \sqrt{2 \log K},

which is tight to leading order. The second term is the within-grid σ inherited by the chosen k*. Only the third term — residual skill — should depend on the strategy actually doing something useful out-of-sample.
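For readers who want the step behind the expected-max bound, one standard route is a moment-generating-function argument, sketched here. It uses only that each Z_k is N(0, σ²); the free parameter t > 0 is introduced for the derivation and is not part of the original notation.

```latex
\begin{aligned}
\exp\!\big(t\, E[\max_k Z_k]\big)
  &\le E\big[\exp(t \max_k Z_k)\big]
     && \text{(Jensen, } t > 0\text{)} \\
  &\le \sum_{k=1}^{K} E\big[e^{t Z_k}\big]
   = K\, e^{t^2 \sigma^2 / 2}
     && \text{(max} \le \text{sum, Gaussian MGF)} \\
\Rightarrow\quad E[\max_k Z_k]
  &\le \frac{\log K}{t} + \frac{t \sigma^2}{2}
   \;\xrightarrow{\; t = \sqrt{2 \log K}/\sigma \;}\;
   \sigma \sqrt{2 \log K}.
\end{aligned}
```

Choosing t to minimise the right-hand side makes the two terms equal, each contributing half of σ√(2 log K).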

The variance budget

The companion repository runs a two-way decomposition of OOS-Sharpe variance over a (strategy × window) panel, observing a battery of proprietary perturbations per cell — the specific perturbation set is part of Daru Finance’s consulting work and is not enumerated here. The panel-level identity is

\mathrm{Var}(S^{OOS}) = V_{param} + V_{strategy} + V_{window} + V_{finite} + R,

where V_param = E[Var_t S^OOS(s, w, t)] is the within-cell variance across perturbations t (knife-edge sensitivity), V_strategy and V_window are the variances of the between-strategy and between-window means, V_finite = (1 + ½·Sh²)/n is the analytic finite-sample noise floor (Sh being the Sharpe estimate and n the sample size behind it), and R is the unexplained interaction.
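The identity can be sketched on a balanced synthetic panel, assuming population (ddof=0) variances. The names `decompose_panel` and `finite_sample_floor` and the synthetic data are illustrative assumptions, not the repository's implementation. In this sketch the residual absorbs everything the three estimated terms leave unexplained, so V_finite is computed separately from its analytic formula rather than split out of the residual.

```python
import numpy as np


def finite_sample_floor(sharpe, n):
    """Analytic noise floor (1 + 0.5 * Sh^2) / n, as given in the text."""
    return (1.0 + 0.5 * sharpe ** 2) / n


def decompose_panel(x):
    """Split Var(x) for a balanced panel of shape
    (n_strategies, n_windows, n_perturbations) with no missing cells."""
    cell_mean = x.mean(axis=2)             # one mean per (strategy, window) cell
    grand = cell_mean.mean()
    strat_mean = cell_mean.mean(axis=1)    # between-strategy means
    win_mean = cell_mean.mean(axis=0)      # between-window means

    # Cell-level interaction: what the two main effects do not explain.
    interaction = cell_mean - strat_mean[:, None] - win_mean[None, :] + grand

    parts = {
        "v_param": x.var(axis=2).mean(),   # E[Var_t S(s, w, t)]
        "v_strategy": strat_mean.var(),
        "v_window": win_mean.var(),
        "residual": (interaction ** 2).mean(),
    }
    total = x.var()
    shares = {name: value / total for name, value in parts.items()}
    return total, parts, shares


# Synthetic panel (assumed effect sizes, chosen only for illustration).
rng = np.random.default_rng(0)
n_s, n_w, n_t = 50, 20, 8
panel = (
    rng.normal(0.0, 0.4, size=(n_s, 1, 1))        # strategy main effect
    + rng.normal(0.0, 0.1, size=(1, n_w, 1))      # window main effect
    + rng.normal(0.0, 0.5, size=(n_s, n_w, n_t))  # perturbation noise
)
total, parts, shares = decompose_panel(panel)
for name, share in shares.items():
    print(f"{name:12s} {share:6.1%}")
```

On a balanced panel the four parts sum to the total variance exactly, which is a useful sanity check before applying any decomposition to unbalanced real data.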

Worked example

Take BTC_30m_27W from the deep-WFO crypto corpus: 538k OOS rows, 27 walk-forward windows. Mean OOS Sharpe across the surviving population is −0.696. Mean V_param is 0.243. Live-proxy profitable rate (last two windows after the funnel filter): 8.4%.

Now plug into the null. With a parameter grid of K = 400 combinations and σ = 1 (typical IS Sharpe noise on 500 trades, q ≈ 0.2), the expected-max bound is

\sigma \sqrt{2 \log K} = 1 \cdot \sqrt{2 \log 400} \approx 3.46.

That is to say: under no skill at all, picking the in-sample winner from a grid of 400 produces an expected IS Sharpe that the bound puts at up to about +3.5, alongside an OOS expectation near zero. The empirical gap is comfortably reproduced without invoking real edge. The interactive demo below recomputes this every time you move K or σ.
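The no-skill claim can be checked with a small Monte Carlo experiment. This is a minimal sketch, assuming IS and OOS Sharpes are independent N(0, σ²) draws for every grid point; the function name `simulate_null_gap` and the trial count are illustrative choices, not part of the reference implementation.

```python
import math

import numpy as np


def simulate_null_gap(K=400, sigma=1.0, n_trials=5000, seed=0):
    """Pick the IS-best of K no-skill strategies; report mean IS, OOS and gap."""
    rng = np.random.default_rng(seed)
    s_is = rng.normal(0.0, sigma, size=(n_trials, K))
    s_oos = rng.normal(0.0, sigma, size=(n_trials, K))  # independent of IS

    k_star = s_is.argmax(axis=1)          # in-sample winner in each trial
    rows = np.arange(n_trials)
    best_is = s_is[rows, k_star]
    best_oos = s_oos[rows, k_star]        # what the winner delivers OOS
    return best_is.mean(), best_oos.mean(), (best_is - best_oos).mean()


bound = 1.0 * math.sqrt(2.0 * math.log(400))
mean_is, mean_oos, mean_gap = simulate_null_gap()
print(f"bound sigma*sqrt(2 log K): {bound:.3f}")
print(f"simulated E[max IS]:       {mean_is:.3f}")
print(f"simulated E[OOS of k*]:    {mean_oos:.3f}")
print(f"simulated IS-OOS gap:      {mean_gap:.3f}")
```

Because σ√(2 log K) is an upper bound, the simulated mean of the selected IS Sharpe comes out below it, while the OOS Sharpe of the selected strategy averages close to zero.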

Demo — predicted IS-OOS gap under a no-skill null

For a parameter grid of size K with iid N(0, σ²) IS Sharpes, the expected maximum is ≈ σ·√(2 log K). Move the sliders — the bar shows the predicted gap, decomposed into selection bias, parameter noise, and residual skill.

K (parameter-grid size): 400
σ (IS Sharpe noise): 1.00
E[max IS]: 3.462
OOS skill: 0.000
IS−OOS gap: 3.462
√(2 log K): 3.46
selection bias: 2.612
parameter σ: 0.850

Under the null hypothesis of no skill, the entire IS-OOS gap is mechanical: scaling with √(log K) (selection) and σ (parameter noise). The residual skill term sits at zero.

At K=400 and σ=1, the expected-max-of-K Gaussians is ≈ 3.46 — the entire IS-OOS gap a strategy population reports under the null can be reproduced without invoking any real edge.

Empirical decomposition

On the canonical 10-asset corpus (ETH, BTC, LTC, TRX, XRP, LINK, ZEC, DOGE, BCH, AVAX — all 30m crypto with ≥17 walk-forward windows; 7.27M OOS rows; 289,374 live-proxy candidates), the pooled variance decomposition is:

V_param — 18.6% (parameter-choice noise across perturbations).
V_strategy — 16.8% (between-strategy main effect).
V_finite — 3.1% (analytic finite-sample floor).
V_window — 1.2% (between-window means / regime proxy).
Residual / interaction — 60.4%.

Mean IS-OOS Sharpe degradation across this corpus is 0.84. The residual-skill term is statistically indistinguishable from zero — the gap is mechanical.

Fig. 1 (OOS Sharpe variance decomposition on the 10-asset crypto corpus): Pooled OOS-Sharpe variance decomposition on the 10 deep-WFO partitions of the 30-asset corpus. V_param + V_strategy + V_finite + V_window account for ~40% of total variance; the remainder is interaction. Crucially, residual skill, the only term that should reflect a real edge, is invisible at the population level.

Fig. 2 (V_param decile vs live-proxy profitable rate): Live-proxy profitable rate by V_param decile, pooled across the 10 assets. D1 (lowest in-sample sensitivity) lands at 11.7%; D10 (highest sensitivity) at 5.3%. The rate is monotonic across all ten deciles: the in-sample variance across small parameter perturbations is a usable predictor of out-of-sample profitability.

Fig. 3 (Per-asset D1-D10 lift in live-proxy profitable rate): Per-asset D1−D10 spread. ETH, ZEC, XRP form a top tier with 11.7-12.4 pp spreads. Every asset shows a positive lift; TRX is the weakest at +1.3 pp. The signal is strongest where the live-proxy baseline rate is lowest: LTC has a 4.7% baseline but D1 hits 7.6% (61% relative lift).
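The decile analysis behind Figs. 2 and 3 can be sketched with pandas. This is an illustration on synthetic data, not the corpus: the function name `profitable_rate_by_decile` is hypothetical, and the monotone link between V_param rank and profit probability is planted by assumption so the output has something to show.

```python
import numpy as np
import pandas as pd


def profitable_rate_by_decile(v_param, profitable, n_deciles=10):
    """Profitable rate (%) per V_param decile; decile 1 = lowest V_param."""
    df = pd.DataFrame({"v_param": v_param, "profitable": profitable})
    # qcut assigns equal-count bins; codes start at 0, so shift to 1..n.
    df["decile"] = pd.qcut(df["v_param"], q=n_deciles, labels=False) + 1
    return df.groupby("decile")["profitable"].mean() * 100.0


# Synthetic population (assumed): profit probability falls with V_param rank.
rng = np.random.default_rng(0)
n = 50_000
v_param = rng.gamma(shape=2.0, scale=0.1, size=n)
rank_pct = v_param.argsort().argsort() / (n - 1)   # 0 = lowest V_param
p_profit = 0.12 - 0.07 * rank_pct
profitable = rng.random(n) < p_profit

rates = profitable_rate_by_decile(v_param, profitable)
spread_pp = rates.loc[1] - rates.loc[10]           # D1 minus D10, in pp
print(rates.round(1))
print(f"D1 - D10 spread: {spread_pp:.1f} pp")
```

Running the same function per asset and collecting `spread_pp` would give the per-asset comparison; ties in real V_param values may require passing `duplicates="drop"` to `pd.qcut`.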

Synthetic null

To check the decomposition is not an artefact of the panel structure, the same pipeline runs on synthetic data with three planted classes (robust / fragile / noise) and a deterministic RNG. The variance shape and decile lift reproduce — confirming the decomposition is identifying mechanical structure, not anomalies in the real corpus.
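A seeded generator for such a synthetic panel might look like the following. The class parameters (mean Sharpe and perturbation standard deviation per class) are hypothetical values chosen for illustration; the source does not specify how the three planted classes are defined, and `make_synthetic_panel` is not a function from the repository.

```python
import numpy as np

# Hypothetical class definitions: (mean Sharpe, sd across perturbations).
CLASSES = {
    "robust": (0.5, 0.1),    # skill that survives small parameter changes
    "fragile": (0.5, 1.0),   # same mean, but knife-edge sensitive
    "noise": (0.0, 0.5),     # no skill at all
}


def make_synthetic_panel(n_per_class=100, n_windows=10, n_perturb=8, seed=42):
    """Return (labels, panel) with panel shaped
    (n_strategies, n_windows, n_perturb). A fixed seed makes it reproducible."""
    rng = np.random.default_rng(seed)
    labels, blocks = [], []
    for name, (mu, sd) in CLASSES.items():
        blocks.append(rng.normal(mu, sd, size=(n_per_class, n_windows, n_perturb)))
        labels += [name] * n_per_class
    return np.array(labels), np.concatenate(blocks, axis=0)


labels, panel = make_synthetic_panel()

# Per-strategy V_param: variance across perturbations, averaged over windows.
v_param = panel.var(axis=2).mean(axis=1)
for name in CLASSES:
    print(f"{name:8s} mean V_param = {v_param[labels == name].mean():.3f}")
```

With these assumed parameters the robust class has the lowest V_param and the fragile class the highest, which is the ordering a V_param ranking needs in order to separate them.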

Why this matters for systematic strategies

Two practical consequences. First, any positive-edge claim from in-sample maximisation over a parameter grid of K must clear σ·√(2 log K) before it should be taken seriously — and σ is rarely smaller than 1 on bar-level crypto Sharpe estimates. Second, V_param is fully observable in-sample (it’s computed from the IS run alone, no OOS leakage), so it can be used as a pre-filter at strategy-selection time. The deep-WFO empirics suggest the lowest-V_param decile beats the highest by a factor of 2.2× in live-proxy profitability, monotonically.
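Both consequences can be combined into a simple pre-filter. This is a sketch under stated assumptions: the function names and the optional `v_param_cutoff` threshold are hypothetical, and how to set that cutoff (for example, the upper edge of the lowest V_param decile) is left to the user.

```python
import math


def selection_hurdle(K, sigma=1.0):
    """Hurdle sigma * sqrt(2 log K) that an IS-maximised Sharpe must clear."""
    if K < 1:
        raise ValueError("K must be at least 1")
    return sigma * math.sqrt(2.0 * math.log(K))


def passes_prefilter(is_sharpe, v_param, K, sigma=1.0, v_param_cutoff=None):
    """True if the IS Sharpe clears the hurdle and, when a cutoff is given,
    the strategy's V_param does not exceed it."""
    clears_hurdle = is_sharpe > selection_hurdle(K, sigma)
    is_stable = v_param_cutoff is None or v_param <= v_param_cutoff
    return clears_hurdle and is_stable


print(f"hurdle at K=400, sigma=1: {selection_hurdle(400):.2f}")
print(passes_prefilter(is_sharpe=4.0, v_param=0.10, K=400, v_param_cutoff=0.20))
print(passes_prefilter(is_sharpe=3.0, v_param=0.10, K=400, v_param_cutoff=0.20))
```

The hurdle grows only with √(log K), so enlarging the grid from 400 to 4,000 combinations raises it modestly, while a doubling of σ doubles it.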

The deflated-Sharpe and PBO frameworks of Bailey-López de Prado attack the same problem from the test-statistic side. This decomposition is complementary: it works at the population level rather than per-strategy, and it produces a usable in-sample predictor (V_param) of out-of-sample profitability.

Reproducibility

DaruFinance / strategy-overfitting

Python — open source reference implementation

Minimal invocation

from strategy_overfitting import (
    load_metrics, decompose_oos_variance, param_vs_live_lift
)

# Walk-forward output: long-format DataFrame with
#   columns: asset, strategy, window, perturbation, S_IS, S_OOS, n_trades
df = load_metrics(parquet_root="/data/strategies", min_trades=20, sharpe_clip=5)

# Two-way decomposition over (strategy x window) x perturbations
shares = decompose_oos_variance(df)
shares["share_v_param"]      # 0.186  - parameter-choice noise
shares["share_v_strategy"]   # 0.168  - between-strategy main effect
shares["share_v_window"]     # 0.012  - regime / window
shares["share_v_finite"]     # 0.031  - analytic finite-sample floor
shares["share_residual"]     # 0.604  - interaction + unexplained

# Predictive validity: rank strategies by V_param (in-sample) and
# measure live-proxy profitable rate by decile.
lift = param_vs_live_lift(df, n_deciles=10)
lift.D1_pct, lift.D10_pct    # e.g. 11.7, 5.3

References

[1] Bailey, D. H., Borwein, J. M., López de Prado, M., & Zhu, Q. J. (2017). The probability of backtest overfitting. Journal of Computational Finance 20(4), 39–69.
[2] Harvey, C. R. & Liu, Y. (2015). Backtesting. Journal of Portfolio Management 42(1), 13–28.
[3] Lo, A. W. & MacKinlay, A. C. (1990). Data-snooping biases in tests of financial asset pricing models. Review of Financial Studies 3(3), 431–467.
[4] López de Prado, M. (2018). Advances in Financial Machine Learning. Wiley, ch. 11–12.
