M/05 · Selection bias decomposition

Decomposing the IS-OOS Sharpe gap

A variance budget that splits in-sample to out-of-sample Sharpe degradation into selection bias, parameter-choice noise, and residual skill, across ten deep walk-forward crypto corpora.

The mathematics

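Suppose a strategy family is indexed by a parameter k = 1, …, K. For each k you observe an in-sample Sharpe $S^{\mathrm{IS}}_k$ and an out-of-sample Sharpe $S^{\mathrm{OOS}}_k$. The standard practice (pick the in-sample maximum and hope for the best out-of-sample) defines the IS-OOS gap as

$$\Delta = S^{\mathrm{IS}}_{k^*} - S^{\mathrm{OOS}}_{k^*}, \qquad k^* = \arg\max_k S^{\mathrm{IS}}_k.$$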

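Decompose Δ additively. Write each $S^{\mathrm{IS}}_k$ as a true skill term plus a parameter-choice deviation plus noise:

$$S^{\mathrm{IS}}_k = \mu + \delta_k + \varepsilon_k,$$

where μ is the true skill, δ_k the parameter-choice deviation, and ε_k the estimation noise.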

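Then the gap splits into three terms:

$$\Delta = \Delta_{\mathrm{selection}} + \Delta_{\mathrm{param}} + \Delta_{\mathrm{skill}},$$

namely selection bias, parameter-choice noise, and residual skill, in that order.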

The first term is purely mechanical. For K independent N(0, σ²) draws X_1, …, X_K, Mills’ bound gives the expected maximum

$$\mathbb{E}\left[\max_{1 \le k \le K} X_k\right] \le \sigma \sqrt{2 \log K},$$

which is tight to leading order. The second term is the within-grid σ inherited by the chosen k*. Only the third term, residual skill, should depend on the strategy actually doing something useful out-of-sample.
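As a quick sanity check on the selection term, the bound can be compared against a simulated expected maximum. This is a minimal sketch using only the standard library; the function names, trial count, and seed are illustrative choices, not part of the companion repository.

```python
import math
import random


def expected_max_bound(K, sigma=1.0):
    """Upper bound sigma * sqrt(2 log K) on E[max] of K iid N(0, sigma^2) draws."""
    return sigma * math.sqrt(2.0 * math.log(K))


def simulate_expected_max(K, sigma=1.0, n_trials=2000, seed=0):
    """Monte Carlo estimate of E[max] over K no-skill IS Sharpes."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total = 0.0
    for _ in range(n_trials):
        # One trial = one parameter grid of K pure-noise Sharpes; keep the winner.
        total += max(rng.gauss(0.0, sigma) for _ in range(K))
    return total / n_trials


bound = expected_max_bound(400)
simulated = simulate_expected_max(400)
print(f"bound = {bound:.3f}, simulated E[max] = {simulated:.3f}")
```

At finite K the simulated mean sits somewhat below the bound, which is consistent with the bound being tight only to leading order.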

The variance budget

The companion repository runs a two-way decomposition of OOS-Sharpe variance over a (strategy × window) panel, observing a battery of proprietary perturbations per cell. The specific perturbation set is part of Daru Finance’s consulting work and is not enumerated here. The panel-level identity is

$$\mathrm{Var}\left[S^{\mathrm{OOS}}\right] = V_{\mathrm{param}} + V_{\mathrm{strategy}} + V_{\mathrm{window}} + V_{\mathrm{finite}} + R,$$

where V_param = E[Var_t S_OOS(s, w, t)] is the within-cell variance across perturbations t (knife-edge sensitivity), V_strategy and V_window are between-strategy and between-window mean differences, V_finite = (1 + ½·Sh²)/n is the analytic finite-sample noise floor for a Sharpe Sh estimated from n observations, and R is the unexplained interaction.
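To make the terms defined above concrete, here is a minimal sketch of such a budget on a balanced synthetic panel. It is not the repository's implementation: `decompose_panel`, the synthetic effect sizes, and `n_obs` are hypothetical, and treating R as whatever remains after the four named terms are subtracted from total variance is an assumption made for illustration.

```python
import random
from statistics import fmean, pvariance


def decompose_panel(panel, n_obs):
    """Variance shares for panel[s][w] = list of OOS Sharpes, one per perturbation.

    Assumes a balanced panel (same number of windows and perturbations everywhere).
    """
    cell_means = [[fmean(cell) for cell in row] for row in panel]
    flat = [x for row in panel for cell in row for x in cell]
    total = pvariance(flat)
    # Within-cell variance across perturbations, averaged over cells.
    v_param = fmean([pvariance(cell) for row in panel for cell in row])
    # Variance of the per-strategy and per-window means.
    v_strategy = pvariance([fmean(row) for row in cell_means])
    v_window = pvariance([fmean(col) for col in zip(*cell_means)])
    # Analytic floor (1 + Sh^2 / 2) / n, averaged over cells (n_obs is hypothetical).
    v_finite = fmean([(1 + 0.5 * m * m) / n_obs for row in cell_means for m in row])
    # Assumption: the residual is defined as the remainder of total variance.
    residual = total - (v_param + v_strategy + v_window + v_finite)
    parts = {
        "share_v_param": v_param,
        "share_v_strategy": v_strategy,
        "share_v_window": v_window,
        "share_v_finite": v_finite,
        "share_residual": residual,
    }
    return {name: value / total for name, value in parts.items()}


# Synthetic panel with made-up effect sizes, for illustration only.
rng = random.Random(0)
n_strategies, n_windows, n_perturbations = 30, 12, 8
strategy_effect = [rng.gauss(0.0, 0.4) for _ in range(n_strategies)]
window_effect = [rng.gauss(0.0, 0.1) for _ in range(n_windows)]
panel = []
for s in range(n_strategies):
    row = []
    for w in range(n_windows):
        centre = strategy_effect[s] + window_effect[w] + rng.gauss(0.0, 0.7)
        row.append([centre + rng.gauss(0.0, 0.45) for _ in range(n_perturbations)])
    panel.append(row)

shares = decompose_panel(panel, n_obs=500)
for name, value in shares.items():
    print(f"{name}: {value:.3f}")
```

Defining the residual as a remainder guarantees the shares sum to one, at the cost of letting it absorb anything the four named terms miss.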

Worked example

Take BTC_30m_27W from the deep-WFO crypto corpus: 538k OOS rows, 27 walk-forward windows. Mean OOS Sharpe across the surviving population is −0.696. Mean V_param is 0.243. Live-proxy profitable rate (last two windows after the funnel filter): 8.4%.

Now plug into the null. With a parameter grid of K = 400 combinations and σ = 1 (typical IS Sharpe noise on 500 trades, q ≈ 0.2), the expected-max bound is

$$\sigma \sqrt{2 \log K} = \sqrt{2 \log 400} \approx 3.46.$$

That is to say: under no skill at all, picking the in-sample winner from a grid of 400 produces an expected IS Sharpe of up to about +3.5 and an OOS expectation near zero. The empirical gap is comfortably reproduced without invoking real edge. The demo below shows the same calculation at K = 400 and σ = 1.

Demo: predicted IS-OOS gap under a no-skill null

For a parameter grid of size K with iid N(0, σ²) IS Sharpes, the expected maximum is ≈ σ·√(2 log K). The bar shows the predicted gap for the chosen K and σ, decomposed into selection bias, parameter noise, and residual skill.

Demo readout at K = 400, σ = 1.00:

  • K, parameter-grid size: 400
  • σ, IS Sharpe noise: 1.00
  • E[max IS]: 3.462
  • OOS skill: 0.000
  • IS−OOS gap: 3.462
  • √(2 log K): 3.46
  • selection bias: 2.612
  • parameter σ: 0.850

Under the null hypothesis of no skill, the entire IS-OOS gap is mechanical, scaling with √(log K) (selection) and σ (parameter noise). The residual skill term sits at zero.

Chart (axis: Sharpe ratio, annualized units): a bar from OOS ≈ 0 to E[max IS] = 3.46, split into selection bias and parameter σ. σ·√(2 log K) = 3.462; Var-share(selection) ≈ 0.137.

At K = 400 and σ = 1, the σ·√(2 log K) bound on the expected maximum of K Gaussians is ≈ 3.46, so the entire IS-OOS gap a strategy population reports under the null can be reproduced without invoking any real edge.

Empirical decomposition

On the canonical 10-asset corpus (ETH, BTC, LTC, TRX, XRP, LINK, ZEC, DOGE, BCH, AVAX, all 30m crypto with ≥17 walk-forward windows; 7.27M OOS rows; 289,374 live-proxy candidates), the pooled variance decomposition is:

  • V_param: 18.6% (parameter-choice noise across perturbations).
  • V_strategy: 16.8% (between-strategy main effect).
  • V_finite: 3.1% (analytic finite-sample floor).
  • V_window: 1.2% (between-window means / regime proxy).
  • Residual / interaction: 60.4%.

Mean IS-OOS Sharpe degradation across this corpus is 0.84. The residual-skill term is statistically indistinguishable from zero; the gap is mechanical.

Fig. 1: Pooled OOS-Sharpe variance decomposition on the 10 deep-WFO partitions of the 30-asset corpus. V_param + V_strategy + V_finite + V_window account for ~40% of total variance; the remainder is interaction. Crucially, residual skill, the only term that should reflect a real edge, is invisible at the population level.
Fig. 2: Live-proxy profitable rate by V_param decile, pooled across the 10 assets. D1 (lowest in-sample sensitivity) lands at 11.7%; D10 (highest sensitivity) at 5.3%. The rate is monotonic across all ten deciles: the in-sample variance across small parameter perturbations is a usable predictor of out-of-sample profitability.
Fig. 3: Per-asset D1−D10 spread. ETH, ZEC, and XRP form a top tier with 11.7-12.4 pp spreads. Every asset shows a positive lift; TRX is the weakest at +1.3 pp. The signal is strongest where the live-proxy baseline rate is lowest: LTC has a 4.7% baseline but D1 hits 7.6% (a 61% relative lift).
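
The decile comparison behind Fig. 2 can be sketched in a few lines. This is an illustration on synthetic data, not the repository's `param_vs_live_lift`: the helper name `decile_profit_rates` is hypothetical, and the profit probabilities are made up purely so the toy data slopes downward with V_param.

```python
import random


def decile_profit_rates(v_param, profitable, n_deciles=10):
    """Profitable rate (%) per V_param bucket; index 0 is D1, the lowest-V_param decile."""
    order = sorted(range(len(v_param)), key=lambda i: v_param[i])
    n = len(order)
    rates = []
    for d in range(n_deciles):
        # Equal-count buckets over the strategies ranked by V_param.
        bucket = order[d * n // n_deciles:(d + 1) * n // n_deciles]
        rates.append(100.0 * sum(profitable[i] for i in bucket) / len(bucket))
    return rates


# Synthetic population: V_param scaled to [0, 1], and a profit probability that
# falls linearly from 12% to 5% as V_param rises (illustrative numbers only).
rng = random.Random(42)
n_strategies = 20000
v_param = [rng.random() for _ in range(n_strategies)]
profitable = [rng.random() < 0.12 - 0.07 * v for v in v_param]

rates = decile_profit_rates(v_param, profitable)
for d, rate in enumerate(rates, start=1):
    print(f"D{d}: {rate:.1f}%")
print(f"D1 / D10 ratio: {rates[0] / rates[-1]:.2f}")
```

Equal-count buckets keep the per-decile rates comparable even when V_param is heavily skewed, which is why the sketch ranks strategies rather than cutting on fixed V_param thresholds.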

Synthetic null

To check the decomposition is not an artefact of the panel structure, the same pipeline runs on synthetic data with three planted classes (robust / fragile / noise) and a deterministic RNG. The variance shape and decile lift reproduce, confirming the decomposition is identifying mechanical structure, not anomalies in the real corpus.

Fig. 4: Variance decomposition on synthetic null (three planted classes, deterministic RNG). Same pipeline, same shape: the empirical decomposition is not an artefact.

Why this matters for systematic strategies

Two practical consequences. First, any positive-edge claim from in-sample maximisation over a parameter grid of size K must clear σ·√(2 log K) before it should be taken seriously, and σ is rarely smaller than 1 on bar-level crypto Sharpe estimates. Second, V_param is fully observable in-sample (it’s computed from the IS run alone, with no OOS leakage), so it can be used as a pre-filter at strategy-selection time. The deep-WFO empirics suggest the lowest-V_param decile beats the highest by a factor of 2.2 in live-proxy profitability, with the rate falling monotonically across deciles.
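
The first consequence translates directly into a hurdle check. The sketch below assumes the no-skill null with iid Gaussian IS Sharpes; the helper names are hypothetical, and `max_grid_size` simply inverts the hurdle formula to show how large a grid a given in-sample Sharpe could have been selected from before the hurdle catches up with it.

```python
import math


def selection_hurdle(K, sigma=1.0):
    """IS Sharpe that pure selection over K no-skill candidates can explain."""
    return sigma * math.sqrt(2.0 * math.log(K))


def clears_hurdle(is_sharpe, K, sigma=1.0):
    """True if the best in-sample Sharpe exceeds the selection hurdle."""
    return is_sharpe > selection_hurdle(K, sigma)


def max_grid_size(is_sharpe, sigma=1.0):
    """Grid size K (as a real number) at which the hurdle equals is_sharpe."""
    return math.exp(0.5 * (is_sharpe / sigma) ** 2)


for K in (10, 100, 400, 1000):
    print(f"K = {K:>4}: hurdle = {selection_hurdle(K):.2f}")

# An IS Sharpe of 3.0 picked from a 400-point grid does not clear the hurdle.
print(clears_hurdle(3.0, K=400))
# Grid size at which a Sharpe of 2.0 stops being distinguishable from selection.
print(f"{max_grid_size(2.0):.1f}")
```

Because the hurdle grows only with √(log K), shrinking the grid helps less than it might seem; lowering σ through more trades per Sharpe estimate moves the hurdle proportionally.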

The deflated-Sharpe and PBO frameworks of Bailey-López de Prado attack the same problem from the test-statistic side. This decomposition is complementary: it works at the population level rather than per-strategy, and it produces a usable in-sample predictor (V_param) of out-of-sample profitability.

Reproducibility

DaruFinance / strategy-overfitting

Python · open source reference implementation

Minimal invocation

from strategy_overfitting import (
    load_metrics, decompose_oos_variance, param_vs_live_lift
)

# Walk-forward output: long-format DataFrame with
#   columns: asset, strategy, window, perturbation, S_IS, S_OOS, n_trades
df = load_metrics(parquet_root="/data/strategies", min_trades=20, sharpe_clip=5)

# Two-way decomposition over (strategy x window) x perturbations
shares = decompose_oos_variance(df)
shares["share_v_param"]      # 0.186  - parameter-choice noise
shares["share_v_strategy"]   # 0.168  - between-strategy main effect
shares["share_v_window"]     # 0.012  - regime / window
shares["share_v_finite"]     # 0.031  - analytic finite-sample floor
shares["share_residual"]     # 0.604  - interaction + unexplained

# Predictive validity: rank strategies by V_param (in-sample) and
# measure live-proxy profitable rate by decile.
lift = param_vs_live_lift(df, n_deciles=10)
lift.D1_pct, lift.D10_pct    # e.g. 11.7, 5.3

References

  1. Bailey, D. H., Borwein, J. M., López de Prado, M., & Zhu, Q. J. (2014). The probability of backtest overfitting. Journal of Computational Finance 20(4), 39–69.
  2. Harvey, C. R. & Liu, Y. (2014). Backtesting. Journal of Portfolio Management 42(1), 13–28.
  3. Lo, A. W. & MacKinlay, A. C. (1990). Data-snooping biases in tests of financial asset pricing models. Review of Financial Studies 3(3), 431–467.
  4. López de Prado, M. (2018). Advances in Financial Machine Learning. Wiley, ch. 11–12.