M/05 · Selection bias decomposition
Decomposing the IS-OOS Sharpe gap
A variance budget that splits in-sample to out-of-sample Sharpe degradation into selection bias, parameter-choice noise, and residual skill, across ten deep walk-forward crypto corpora.
The mathematics
Suppose a strategy family is indexed by a parameter $k = 1, \dots, K$. For each $k$ you observe an in-sample Sharpe $S^{\mathrm{IS}}_k$ and an out-of-sample Sharpe $S^{\mathrm{OOS}}_k$. The standard practice, picking the in-sample maximum and hoping for the best out-of-sample, defines the IS-OOS gap as

$$\Delta = S^{\mathrm{IS}}_{k^*} - S^{\mathrm{OOS}}_{k^*}, \qquad k^* = \arg\max_k S^{\mathrm{IS}}_k.$$

Decompose $\Delta$ additively. Write each $S^{\mathrm{IS}}_k$ as a true skill term plus a parameter-choice deviation plus noise:
Then the gap splits into three terms:
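One way to write this split explicitly is sketched below. The symbols are assumptions chosen for illustration, not necessarily the original notation: $\mu$ is skill common to the family, $\delta_k$ a zero-mean parameter-choice deviation, $\varepsilon_k$ zero-mean estimation noise, and primes mark the out-of-sample counterparts.

```latex
S^{\mathrm{IS}}_k = \mu + \delta_k + \varepsilon_k,
\qquad
S^{\mathrm{OOS}}_k = \mu' + \delta'_k + \varepsilon'_k

\Delta
= \underbrace{\varepsilon_{k^*} - \varepsilon'_{k^*}}_{\text{selection bias}}
+ \underbrace{\delta_{k^*} - \delta'_{k^*}}_{\text{parameter-choice noise}}
+ \underbrace{\mu - \mu'}_{\text{residual skill}}
```

Under these assumptions the split is an algebraic identity. Because $k^*$ is chosen on in-sample values, $\varepsilon_{k^*}$ is biased upward, while $\mathbb{E}[\varepsilon'_{k^*}] = 0$ when out-of-sample noise is independent of the selection. Under a no-skill null, $\mu = \mu' = 0$ and the third term vanishes.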
The first term is purely mechanical. For $K$ independent $N(0, \sigma^2)$ draws $\varepsilon_1, \dots, \varepsilon_K$, Mills’ bound gives the expected maximum

$$\mathbb{E}\Big[\max_{1 \le k \le K} \varepsilon_k\Big] \le \sigma\sqrt{2\log K},$$

which is tight to leading order. The second term is the within-grid $\sigma$ inherited by the chosen $k^*$. Only the third term, residual skill, should depend on the strategy actually doing something useful out-of-sample.
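The scaling of the expected maximum with K can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function names `expected_max_bound` and `simulated_expected_max` are hypothetical and not part of the companion repository.

```python
import math
import numpy as np


def expected_max_bound(K: int, sigma: float = 1.0) -> float:
    """Upper bound sigma * sqrt(2 log K) on E[max of K iid N(0, sigma^2)]."""
    return sigma * math.sqrt(2.0 * math.log(K))


def simulated_expected_max(K: int, sigma: float = 1.0,
                           n_trials: int = 5000, seed: int = 0) -> float:
    """Monte Carlo estimate of E[max] over n_trials independent grids of size K."""
    rng = np.random.default_rng(seed)
    draws = rng.normal(0.0, sigma, size=(n_trials, K))
    return float(draws.max(axis=1).mean())


if __name__ == "__main__":
    # Compare the analytic bound with the simulated mean for several grid sizes.
    for K in (10, 100, 400, 1000):
        print(K, round(expected_max_bound(K), 3),
              round(simulated_expected_max(K), 3))
```

For finite K the simulated mean sits below the bound, and the ratio approaches one only slowly as K grows, so the bound is best read as a conservative hurdle rather than an exact expectation.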
The variance budget
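The companion repository runs a two-way decomposition of OOS-Sharpe variance over a (strategy × window) panel, observing a battery of proprietary perturbations per cell. The specific perturbation set is part of Daru Finance’s consulting work and is not enumerated here. The panel-level identity is

$$\operatorname{Var}\big(S^{\mathrm{OOS}}\big) = V_{\mathrm{param}} + V_{\mathrm{strategy}} + V_{\mathrm{window}} + V_{\mathrm{finite}} + R,$$

where $V_{\mathrm{param}} = \mathbb{E}\big[\operatorname{Var}_t\, S^{\mathrm{OOS}}(s, w, t)\big]$ is the within-cell variance across perturbations $t$ (knife-edge sensitivity), $V_{\mathrm{strategy}}$ and $V_{\mathrm{window}}$ capture between-strategy and between-window mean differences, $V_{\mathrm{finite}} = (1 + \tfrac{1}{2}\mathrm{Sh}^2)/n$ is the analytic finite-sample noise floor for a Sharpe $\mathrm{Sh}$ estimated from $n$ observations, and $R$ is the unexplained interaction.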
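A minimal sketch of how such a budget could be computed on a balanced panel follows. It is not the repository's estimator: the function name `decompose_panel`, the column names (`strategy`, `window`, `S_OOS`, `n`), and the choice to define the residual as whatever is left after the named terms are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def decompose_panel(df: pd.DataFrame) -> dict:
    """Return variance shares for a (strategy x window) x perturbation panel.

    Expects one row per (strategy, window, perturbation) with columns
    'strategy', 'window', 'S_OOS' and 'n' (observations behind each Sharpe).
    """
    cells = df.groupby(["strategy", "window"])["S_OOS"]
    # Within-cell variance across perturbations, averaged over cells.
    v_param = cells.var(ddof=0).mean()
    cell_mean = cells.mean()
    # Variance of the strategy-level and window-level means.
    v_strategy = cell_mean.groupby(level="strategy").mean().var(ddof=0)
    v_window = cell_mean.groupby(level="window").mean().var(ddof=0)
    # Analytic finite-sample floor (1 + Sh^2 / 2) / n, averaged over rows.
    v_finite = ((1.0 + 0.5 * df["S_OOS"] ** 2) / df["n"]).mean()
    total = df["S_OOS"].var(ddof=0)
    parts = {"param": v_param, "strategy": v_strategy,
             "window": v_window, "finite": v_finite}
    # Residual is defined as the remainder, so the shares sum to one.
    parts["residual"] = total - sum(parts.values())
    return {name: float(value / total) for name, value in parts.items()}
```

Defining the residual as a remainder makes the identity hold by construction; it can come out negative on small or noisy panels, which is a sign that the named terms overlap.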
Worked example
Take BTC_30m_27W from the deep-WFO crypto corpus: 538k OOS rows, 27 walk-forward windows. Mean OOS Sharpe across the surviving population is −0.696. Mean Vparam is 0.243. Live-proxy profitable rate (last two windows after the funnel filter): 8.4%.
Now plug into the null. With a parameter grid of K = 400 combinations and σ = 1 (typical IS Sharpe noise on 500 trades, q ≈ 0.2), the expected-max bound is

$$\sigma\sqrt{2\log K} = \sqrt{2\log 400} \approx 3.46.$$

That is to say: under no skill at all, picking the in-sample winner from a grid of 400 produces an expected IS Sharpe of up to about +3.5 and an OOS expectation near zero. The empirical gap is comfortably reproduced without invoking real edge. The interactive demo below recomputes this every time you move K or σ.
Demo: predicted IS-OOS gap under a no-skill null
For a parameter grid of size K with iid N(0, σ²) IS Sharpes, the expected maximum is ≈ σ·√(2 log K). Move the sliders; the bar shows the predicted gap, decomposed into selection bias, parameter noise, and residual skill.
Under the null hypothesis of no skill, the entire IS-OOS gap is mechanical, scaling with √(log K) (selection) and σ (parameter noise). The residual skill term sits at zero.
At K = 400 and σ = 1, the bound on the expected maximum of K Gaussians is ≈ 3.46, so the entire IS-OOS gap that a strategy population reports under the null can be reproduced without invoking any real edge.
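The same no-skill experiment can be reproduced offline. The sketch below assumes NumPy and independent in-sample and out-of-sample noise; `simulate_null_gap` is a hypothetical helper, not the code behind the demo.

```python
import numpy as np


def simulate_null_gap(K: int = 400, sigma: float = 1.0,
                      n_trials: int = 5000, seed: int = 0) -> dict:
    """Pick the in-sample winner from K no-skill strategies and report
    its mean IS Sharpe, mean OOS Sharpe, and the mean IS-OOS gap."""
    rng = np.random.default_rng(seed)
    s_is = rng.normal(0.0, sigma, size=(n_trials, K))
    s_oos = rng.normal(0.0, sigma, size=(n_trials, K))  # independent of IS
    winner = s_is.argmax(axis=1)        # k* chosen on in-sample values only
    rows = np.arange(n_trials)
    is_win = s_is[rows, winner]
    oos_win = s_oos[rows, winner]
    return {
        "is": float(is_win.mean()),
        "oos": float(oos_win.mean()),
        "gap": float((is_win - oos_win).mean()),
    }


if __name__ == "__main__":
    print(simulate_null_gap())
```

The winner's out-of-sample mean stays near zero while its in-sample mean is large and positive, so the whole simulated gap is selection, with no edge planted anywhere.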
Empirical decomposition
On the canonical 10-asset corpus (ETH, BTC, LTC, TRX, XRP, LINK, ZEC, DOGE, BCH, AVAX, all 30m crypto with ≥17 walk-forward windows; 7.27M OOS rows; 289,374 live-proxy candidates), the pooled variance decomposition is:
- $V_{\mathrm{param}}$, 18.6% (parameter-choice noise across perturbations).
- $V_{\mathrm{strategy}}$, 16.8% (between-strategy main effect).
- $V_{\mathrm{finite}}$, 3.1% (analytic finite-sample floor).
- $V_{\mathrm{window}}$, 1.2% (between-window means / regime proxy).
- Residual / interaction, 60.4%.
Mean IS-OOS Sharpe degradation across this corpus is 0.84. The residual-skill term is statistically indistinguishable from zero, so the gap is mechanical.
Synthetic null
To check the decomposition is not an artefact of the panel structure, the same pipeline runs on synthetic data with three planted classes (robust / fragile / noise) and a deterministic RNG. The variance shape and decile lift reproduce, confirming the decomposition is identifying mechanical structure, not anomalies in the real corpus.
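A toy version of such a planted-class panel is sketched below. The class parameters, panel sizes, and function names are illustrative assumptions; the actual synthetic generator is not described in this text.

```python
import numpy as np
import pandas as pd

# (mean OOS Sharpe, sd of the perturbation effect): illustrative values only.
CLASSES = {"robust": (0.8, 0.1), "fragile": (0.8, 0.8), "noise": (0.0, 0.4)}


def make_synthetic_panel(n_per_class: int = 50, n_windows: int = 6,
                         n_perturb: int = 8, seed: int = 42) -> pd.DataFrame:
    """Build a long-format panel with planted classes and a fixed seed."""
    rng = np.random.default_rng(seed)
    frames = []
    for label, (mu, pert_sd) in CLASSES.items():
        # One base Sharpe per (strategy, window), plus a perturbation wiggle.
        base = rng.normal(mu, 0.2, size=(n_per_class, n_windows, 1))
        wiggle = rng.normal(0.0, pert_sd,
                            size=(n_per_class, n_windows, n_perturb))
        idx = pd.MultiIndex.from_product(
            [[f"{label}_{i}" for i in range(n_per_class)],
             range(n_windows), range(n_perturb)],
            names=["strategy", "window", "perturbation"],
        )
        frame = pd.DataFrame({"cls": label, "S_OOS": (base + wiggle).ravel()},
                             index=idx)
        frames.append(frame.reset_index())
    return pd.concat(frames, ignore_index=True)


def v_param_by_class(df: pd.DataFrame) -> pd.Series:
    """Mean within-cell variance across perturbations, per planted class."""
    cell_var = df.groupby(["cls", "strategy", "window"])["S_OOS"].var(ddof=1)
    return cell_var.groupby(level="cls").mean()
```

With these settings the fragile class shows the largest within-cell variance and the robust class the smallest, which is the ordering a parameter-sensitivity measure should recover if it is picking up planted structure.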
Why this matters for systematic strategies
Two practical consequences. First, any positive-edge claim from in-sample maximisation over a parameter grid of size $K$ must clear $\sigma\sqrt{2\log K}$ before it should be taken seriously, and $\sigma$ is rarely smaller than 1 on bar-level crypto Sharpe estimates. Second, $V_{\mathrm{param}}$ is fully observable in-sample (it’s computed from the IS run alone, no OOS leakage), so it can be used as a pre-filter at strategy-selection time. The deep-WFO empirics suggest the lowest-$V_{\mathrm{param}}$ decile beats the highest by a factor of 2.2 in live-proxy profitability, monotonically.
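Both consequences reduce to a few lines of code. The sketch below assumes pandas and uses hypothetical helper names (`clears_hurdle`, `decile_lift`); it illustrates the idea and is not the repository's `param_vs_live_lift`.

```python
import math
import numpy as np
import pandas as pd


def clears_hurdle(is_sharpe: float, K: int, sigma: float = 1.0) -> bool:
    """True if an in-sample Sharpe exceeds the no-skill bound sigma*sqrt(2 log K)."""
    return is_sharpe > sigma * math.sqrt(2.0 * math.log(K))


def decile_lift(v_param: pd.Series, profitable: pd.Series,
                n_bins: int = 10) -> pd.Series:
    """Profitable rate (%) per V_param bin; D1 holds the lowest V_param."""
    # Rank first so ties cannot break the equal-sized binning.
    ranks = v_param.rank(method="first")
    labels = [f"D{i}" for i in range(1, n_bins + 1)]
    bins = pd.qcut(ranks, n_bins, labels=labels)
    return profitable.groupby(bins, observed=True).mean() * 100.0
```

A typical use is to compute `decile_lift(v_param, profitable)` on a per-strategy table and compare the `D1` and `D10` entries; the ratio of the two is the lift quoted in the text.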
The deflated-Sharpe and PBO [1] frameworks of Bailey and López de Prado attack the same problem from the test-statistic side. This decomposition is complementary: it works at the population level rather than per-strategy, and it produces a usable in-sample predictor ($V_{\mathrm{param}}$) of out-of-sample profitability.
Reproducibility
DaruFinance / strategy-overfitting
Python · open source reference implementation
Minimal invocation
```python
from strategy_overfitting import (
    load_metrics, decompose_oos_variance, param_vs_live_lift
)

# Walk-forward output: long-format DataFrame with
# columns: asset, strategy, window, perturbation, S_IS, S_OOS, n_trades
df = load_metrics(parquet_root="/data/strategies", min_trades=20, sharpe_clip=5)

# Two-way decomposition over (strategy x window) x perturbations
shares = decompose_oos_variance(df)
shares["share_v_param"]     # 0.186 - parameter-choice noise
shares["share_v_strategy"]  # 0.168 - between-strategy main effect
shares["share_v_window"]    # 0.012 - regime / window
shares["share_v_finite"]    # 0.031 - analytic finite-sample floor
shares["share_residual"]    # 0.604 - interaction + unexplained

# Predictive validity: rank strategies by V_param (in-sample) and
# measure live-proxy profitable rate by decile.
lift = param_vs_live_lift(df, n_deciles=10)
lift.D1_pct, lift.D10_pct   # e.g. 11.7, 5.3
```
References
- [1] Bailey, D. H., Borwein, J. M., López de Prado, M., & Zhu, Q. J. (2017). The probability of backtest overfitting. Journal of Computational Finance 20(4), 39–69.
- [2] Harvey, C. R. & Liu, Y. (2015). Backtesting. Journal of Portfolio Management 42(1), 13–28.
- [3] Lo, A. W. & MacKinlay, A. C. (1990). Data-snooping biases in tests of financial asset pricing models. Review of Financial Studies 3(3), 431–467.
- [4] López de Prado, M. (2018). Advances in Financial Machine Learning. Wiley, ch. 11–12.

