M/05 — Selection bias decomposition

Decomposing the IS-OOS Sharpe gap

A variance budget that splits in-sample to out-of-sample Sharpe degradation into selection bias, parameter-choice noise, and residual skill — across ten deep walk-forward crypto corpora.

The mathematics

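Suppose a strategy family is indexed by a parameter k = 1, …, K. For each k you observe an in-sample Sharpe $S^{\mathrm{IS}}_k$ and an out-of-sample Sharpe $S^{\mathrm{OOS}}_k$. The standard practice (pick the in-sample maximum and hope for the best out-of-sample) defines the IS-OOS gap as

$$\Delta = S^{\mathrm{IS}}_{k^*} - S^{\mathrm{OOS}}_{k^*}, \qquad k^* = \arg\max_k S^{\mathrm{IS}}_k.$$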

Decompose Δ additively. Write each $S^{\mathrm{IS}}_k$ as the sum of a true skill term, a parameter-choice deviation, and noise. The gap then splits into three terms: selection bias, parameter-choice noise, and residual skill.

The first term is purely mechanical. For K independent N(0, σ²) draws, Mills' bound gives the expected maximum

$$\mathbb{E}\Big[\max_{1 \le k \le K} S^{\mathrm{IS}}_k\Big] \le \sigma\sqrt{2\log K},$$

which is tight to leading order. The second term is the within-grid σ inherited by the chosen k*. Only the third term, residual skill, should depend on the strategy actually doing something useful out-of-sample.
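
As a quick sanity check on the σ·√(2 log K) bound, the sketch below simulates the maximum of K independent N(0, σ²) draws and compares the Monte Carlo mean with the bound. The function names, trial count, and seed are illustrative choices, not part of the companion repository. Because the expression is an upper bound that is tight only to leading order, the simulated mean should sit somewhat below it at finite K.

```python
import math
import random
import statistics


def mean_max_of_k(K, sigma, n_trials=2000, seed=0):
    """Monte Carlo estimate of E[max of K iid N(0, sigma^2) draws]."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    maxima = [
        max(rng.gauss(0.0, sigma) for _ in range(K))
        for _ in range(n_trials)
    ]
    return statistics.fmean(maxima)


def expected_max_bound(K, sigma):
    """Leading-order bound sigma * sqrt(2 log K) quoted in the text."""
    return sigma * math.sqrt(2.0 * math.log(K))


rows = []
for K in (10, 100, 400):
    simulated = mean_max_of_k(K, sigma=1.0)
    bound = expected_max_bound(K, sigma=1.0)
    rows.append((K, simulated, bound))
    print(f"K={K:4d}  simulated={simulated:.3f}  bound={bound:.3f}")
```

The gap between the two columns narrows only slowly as K grows, which is what "tight to leading order" means in practice.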

The variance budget

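The companion repository runs a two-way decomposition of OOS-Sharpe variance over a (strategy × window) panel, observing a battery of proprietary perturbations per cell. The specific perturbation set is part of Daru Finance's consulting work and is not enumerated here. The panel-level identity is

$$\mathrm{Var}\big(S^{\mathrm{OOS}}\big) = V_{\mathrm{param}} + V_{\mathrm{strategy}} + V_{\mathrm{window}} + V_{\mathrm{finite}} + R,$$

where $V_{\mathrm{param}} = \mathbb{E}\big[\mathrm{Var}_t\, S^{\mathrm{OOS}}(s, w, t)\big]$ is the within-cell variance across perturbations t (knife-edge sensitivity), $V_{\mathrm{strategy}}$ and $V_{\mathrm{window}}$ are between-strategy and between-window mean differences, $V_{\mathrm{finite}} = (1 + \tfrac{1}{2}\mathrm{Sh}^2)/n$ is the analytic finite-sample noise floor for a Sharpe ratio Sh estimated from n observations, and R is the unexplained interaction.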
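
To make the identity concrete, here is a minimal sketch of one plausible way to compute the shares on a balanced panel. It is not the repository's implementation: the `decompose` function, the use of population variances of the marginal means for V_strategy and V_window, the single `n_obs` value per cell, and the synthetic panel are all assumptions made for illustration. R is obtained as whatever is left over, so the shares sum to one by construction.

```python
import random
import statistics
from collections import defaultdict


def decompose(panel, n_obs):
    """panel maps (strategy, window) -> list of OOS Sharpes, one per perturbation."""
    cell_mean = {cell: statistics.fmean(v) for cell, v in panel.items()}
    all_values = [x for v in panel.values() for x in v]
    total = statistics.pvariance(all_values)

    # V_param: within-cell variance across perturbations, averaged over cells.
    v_param = statistics.fmean([statistics.pvariance(v) for v in panel.values()])

    # Main effects: variance of the per-strategy and per-window means.
    by_strategy, by_window = defaultdict(list), defaultdict(list)
    for (s, w), m in cell_mean.items():
        by_strategy[s].append(m)
        by_window[w].append(m)
    v_strategy = statistics.pvariance([statistics.fmean(m) for m in by_strategy.values()])
    v_window = statistics.pvariance([statistics.fmean(m) for m in by_window.values()])

    # V_finite: analytic floor (1 + Sh^2 / 2) / n, averaged over cells.
    v_finite = statistics.fmean([(1 + 0.5 * m * m) / n_obs for m in cell_mean.values()])

    parts = {"param": v_param, "strategy": v_strategy,
             "window": v_window, "finite": v_finite}
    parts["residual"] = total - sum(parts.values())  # R: leftover interaction
    return {name: value / total for name, value in parts.items()}


# Hypothetical synthetic panel: 20 strategies x 6 windows x 8 perturbations.
rng = random.Random(42)
panel = {}
for s in range(20):
    skill = rng.gauss(0.0, 0.4)              # between-strategy effect
    for w in range(6):
        regime = 0.1 * (w - 2.5)             # between-window drift
        centre = skill + regime + rng.gauss(0.0, 0.5)  # strategy-window interaction
        panel[(s, w)] = [centre + rng.gauss(0.0, 0.45) for _ in range(8)]

shares = decompose(panel, n_obs=250)
for name, share in shares.items():
    print(f"{name:9s} {share:6.1%}")
```

Under these definitions the residual equals the strategy-window interaction minus the finite-sample floor, so it can in principle go negative on a panel with very little interaction.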

Worked example

Take BTC_30m_27W from the deep-WFO crypto corpus: 538k OOS rows, 27 walk-forward windows. Mean OOS Sharpe across the surviving population is −0.696. Mean Vparam is 0.243. Live-proxy profitable rate (last two windows after the funnel filter): 8.4%.

Now plug into the null. With a parameter grid of K = 400 combinations and σ = 1 (typical IS Sharpe noise on 500 trades, q ≈ 0.2), the expected-max bound is

$$\sigma\sqrt{2\log K} = \sqrt{2\log 400} \approx 3.46.$$

That is to say: under no skill at all, picking the in-sample winner from a grid of 400 produces an expected IS Sharpe near +3.5 — and an OOS expectation near zero. The empirical gap is comfortably reproduced without invoking real edge. The interactive demo below recomputes this every time you move K or σ.
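
The arithmetic is easy to replay and to invert. The sketch below evaluates σ·√(2 log K) over a few grid sizes and then solves the same expression for K, which gives the grid size at which the no-skill null alone accounts for a given gap. The function names and the choice of grid sizes are hypothetical.

```python
import math


def null_gap(K, sigma=1.0):
    """Predicted IS-OOS gap under the no-skill null: sigma * sqrt(2 log K)."""
    return sigma * math.sqrt(2.0 * math.log(K))


def grid_size_explaining(gap, sigma=1.0):
    """Grid size K at which the null bound equals `gap` (inverse of null_gap)."""
    return math.exp(gap * gap / (2.0 * sigma * sigma))


for K in (10, 50, 400, 5000):
    print(f"K={K:5d}  predicted gap={null_gap(K):.3f}")

# Round trip on the worked example: the gap at K = 400 maps back to K = 400.
print(f"K explaining the K=400 gap: {grid_size_explaining(null_gap(400)):.1f}")
```

Since the gap grows with √(log K), a tenfold larger grid adds comparatively little to the null gap, while even a small grid already produces a sizeable one.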

Demo — predicted IS-OOS gap under a no-skill null

For a parameter grid of size K with iid N(0, σ²) IS Sharpes, the expected maximum is ≈ σ·√(2 log K). Move the sliders — the bar shows the predicted gap, decomposed into selection bias, parameter noise, and residual skill.

Slider settings and readouts as captured:

  • K (parameter-grid size): 400
  • σ (IS Sharpe noise): 1.00
  • E[max IS]: 3.462
  • OOS skill: 0.000
  • IS−OOS gap: 3.462
  • √(2 log K): 3.46
  • selection bias: 2.612
  • parameter σ: 0.850

Under the null hypothesis of no skill, the entire IS-OOS gap is mechanical: scaling with √(log K) (selection) and σ (parameter noise). The residual skill term sits at zero.

[Chart: bar in Sharpe ratio (annualized units) running from OOS ≈ 0 to E[max IS] = 3.46, split into selection bias and param σ. σ·√(2 log K) = 3.462; Var-share(selection) ≈ 0.137.]

At K=400 and σ=1, the expected-max-of-K Gaussians is ≈ 3.46 — the entire IS-OOS gap a strategy population reports under the null can be reproduced without invoking any real edge.

Empirical decomposition

On the canonical 10-asset corpus (ETH, BTC, LTC, TRX, XRP, LINK, ZEC, DOGE, BCH, AVAX — all 30m crypto with ≥17 walk-forward windows; 7.27M OOS rows; 289,374 live-proxy candidates), the pooled variance decomposition is:

  • V_param: 18.6% (parameter-choice noise across perturbations).
  • V_strategy: 16.8% (between-strategy main effect).
  • V_finite: 3.1% (analytic finite-sample floor).
  • V_window: 1.2% (between-window means / regime proxy).
  • Residual / interaction: 60.4%.

Mean IS-OOS Sharpe degradation across this corpus is 0.84. The residual-skill term is statistically indistinguishable from zero — the gap is mechanical.

OOS Sharpe variance decomposition on the 10-asset crypto corpus
Fig. 1. Pooled OOS-Sharpe variance decomposition on the 10 deep-WFO partitions of the 30-asset corpus. V_param + V_strategy + V_finite + V_window account for ~40% of total variance; the remainder is interaction. Crucially, residual skill (the only term that should reflect a real edge) is invisible at the population level.
V_param decile vs live-proxy profitable rate
Fig. 2. Live-proxy profitable rate by V_param decile, pooled across the 10 assets. D1 (lowest in-sample sensitivity) lands at 11.7%; D10 (highest sensitivity) at 5.3%. Monotonic across all ten deciles: the in-sample variance across small parameter perturbations is a usable predictor of out-of-sample profitability.
Per-asset D1-D10 lift in live-proxy profitable rate
Fig. 3. Per-asset D1−D10 spread. ETH, ZEC, and XRP form a top tier with spreads of 11.7 to 12.4 pp. Every asset shows a positive lift; TRX is the weakest at +1.3 pp. The signal is strongest where the live-proxy baseline rate is lowest: LTC has a 4.7% baseline but D1 hits 7.6% (61% relative lift).
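
The decile analysis behind Figs. 2 and 3 can be sketched in a few lines: rank strategies by V_param, cut the ranking into ten equal buckets, and compute the profitable rate in each. The `decile_rates` helper and the synthetic records below are hypothetical. In particular, the planted probability model is invented so the example has some lift to find, and it is not the corpus data.

```python
import random


def decile_rates(records, n_deciles=10):
    """Profitable rate per V_param bucket; bucket 1 holds the lowest V_param."""
    ranked = sorted(records, key=lambda r: r[0])
    size = len(ranked) / n_deciles
    rates = []
    for d in range(n_deciles):
        bucket = ranked[round(d * size): round((d + 1) * size)]
        rates.append(sum(profitable for _, profitable in bucket) / len(bucket))
    return rates


# Hypothetical population: profitability odds fall linearly as V_param rises.
rng = random.Random(7)
records = []
for _ in range(20_000):
    v_param = rng.random()
    p_profit = 0.12 - 0.07 * v_param
    records.append((v_param, rng.random() < p_profit))

rates = decile_rates(records)
for d, rate in enumerate(rates, start=1):
    print(f"D{d:<2d} {rate:6.1%}")
print(f"D1/D10 ratio: {rates[0] / rates[-1]:.2f}")
```

On real data the same routine would take the in-sample V_param of each strategy and its live-proxy outcome as the two fields of each record.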

Synthetic null

To check the decomposition is not an artefact of the panel structure, the same pipeline runs on synthetic data with three planted classes (robust / fragile / noise) and a deterministic RNG. The variance shape and decile lift reproduce — confirming the decomposition is identifying mechanical structure, not anomalies in the real corpus.

Synthetic-null OOS Sharpe variance decomposition
Fig. 4. Variance decomposition on synthetic null (three planted classes, deterministic RNG). Same pipeline, same shape: the empirical decomposition is not an artefact.

Why this matters for systematic strategies

Two practical consequences. First, any positive-edge claim from in-sample maximisation over a parameter grid of size K must clear σ·√(2 log K) before it should be taken seriously, and σ is rarely smaller than 1 on bar-level crypto Sharpe estimates. Second, V_param is fully observable in-sample (it's computed from the IS run alone, no OOS leakage), so it can be used as a pre-filter at strategy-selection time. The deep-WFO empirics suggest the lowest-V_param decile beats the highest by a factor of about 2.2 in live-proxy profitability (11.7% vs 5.3%), monotonically across deciles.
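
Both consequences can be combined into a simple selection-time screen. The sketch below is only one way to do so, under assumptions of my own: the `screen` function, the `keep_fraction` parameter, the order of the two filters, and the toy candidate list are all hypothetical. It keeps the lowest-V_param slice of the population and then drops anything whose IS Sharpe does not clear σ·√(2 log K).

```python
import math


def screen(candidates, K, sigma, keep_fraction=0.1):
    """Keep the lowest-V_param slice, then require IS Sharpe above the null hurdle."""
    hurdle = sigma * math.sqrt(2.0 * math.log(K))
    ranked = sorted(candidates, key=lambda c: c["v_param"])
    n_keep = max(1, int(len(ranked) * keep_fraction))
    stable = ranked[:n_keep]  # least parameter-sensitive candidates
    return [c for c in stable if c["is_sharpe"] > hurdle]


# Toy candidates with made-up numbers, for illustration only.
candidates = [
    {"name": "a", "is_sharpe": 3.9, "v_param": 0.05},
    {"name": "b", "is_sharpe": 4.2, "v_param": 0.60},
    {"name": "c", "is_sharpe": 2.1, "v_param": 0.02},
    {"name": "d", "is_sharpe": 3.6, "v_param": 0.31},
    {"name": "e", "is_sharpe": 1.4, "v_param": 0.44},
]

survivors = screen(candidates, K=400, sigma=1.0, keep_fraction=0.4)
print([c["name"] for c in survivors])
```

Note that candidate "b" has the highest IS Sharpe but is screened out on parameter sensitivity, which is the behaviour the decile results argue for.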

The deflated-Sharpe and PBO frameworks of Bailey and López de Prado attack the same problem from the test-statistic side. This decomposition is complementary: it works at the population level rather than per-strategy, and it produces a usable in-sample predictor (V_param) of out-of-sample profitability.

Reproducibility

DaruFinance / strategy-overfitting

Python — open source reference implementation

Minimal invocation

from strategy_overfitting import (
    load_metrics, decompose_oos_variance, param_vs_live_lift
)

# Walk-forward output: long-format DataFrame with
#   columns: asset, strategy, window, perturbation, S_IS, S_OOS, n_trades
df = load_metrics(parquet_root="/data/strategies", min_trades=20, sharpe_clip=5)

# Two-way decomposition over (strategy x window) x perturbations
shares = decompose_oos_variance(df)
shares["share_v_param"]      # 0.186  - parameter-choice noise
shares["share_v_strategy"]   # 0.168  - between-strategy main effect
shares["share_v_window"]     # 0.012  - regime / window
shares["share_v_finite"]     # 0.031  - analytic finite-sample floor
shares["share_residual"]     # 0.604  - interaction + unexplained

# Predictive validity: rank strategies by V_param (in-sample) and
# measure live-proxy profitable rate by decile.
lift = param_vs_live_lift(df, n_deciles=10)
lift.D1_pct, lift.D10_pct    # e.g. 11.7, 5.3

References

  1. Bailey, D. H., Borwein, J. M., López de Prado, M., & Zhu, Q. J. (2017). The probability of backtest overfitting. Journal of Computational Finance 20(4), 39–69.
  2. Harvey, C. R. & Liu, Y. (2015). Backtesting. Journal of Portfolio Management 42(1), 13–28.
  3. Lo, A. W. & MacKinlay, A. C. (1990). Data-snooping biases in tests of financial asset pricing models. Review of Financial Studies 3(3), 431–467.
  4. López de Prado, M. (2018). Advances in Financial Machine Learning. Wiley, ch. 11–12.