Research / Working Paper · Revised May 2026

Predictive Value of Within-Strategy Permutation Tests for Forward Selection

Evidence from Over 6 Billion Strategy-Level Permutations Across Three Asset Classes

A large-scale empirical test of whether within-strategy Monte Carlo permutation testing, a standard validation step in quantitative strategy development, actually improves forward-looking strategy selection. Across 437,911 strategy configurations on nine instruments spanning crypto, forex and commodities, 160 walk-forward windows, and 26.5 billion permutations end-to-end, the answer is no, but for a different reason than a March-2026 draft of this paper claimed. On the genuinely path-dependent statistics for which the test is well-defined, MC filtering produces lift indistinguishable from zero; on sum-based statistics, the previously reported negative-lift result is reattributed to a floating-point summation-order artefact.

The three findings

FINDING 1

Your most-used selection metrics are invisible to the test.

Under fixed per-trade notional sizing (the standard practitioner setting), total ROI, trade-level Sharpe, and Profit Factor are each a function of the multiset of trade returns alone. Their MC rank distributions are therefore degenerate by construction. Any leftward shift you see on these metrics is a floating-point summation-order artefact in vectorised code, not a property of the data.

FINDING 2

On the metrics where the test is well-defined, the signal is noise.

Maximum Drawdown, Calmar, and Ulcer index are genuine path-dependent statistics. Their MC ranks cluster within a few percentage points of the 50% random-reshuffle baseline on every instrument. Adding an MC filter to a simple IS Profit Factor > 1 gate moves OOS profitability by less than ±0.5 pp across the nine instruments.

FINDING 3

At the portfolio level, the signal predicts in the wrong direction.

Combining strategies into 10-strategy equal-weight portfolios produces a clear rightward shift in MC-MDD rank: real path-dependence, detected. But the same rank, used as a forward-OOS selector, fails: top-decile portfolios by in-sample MC-MDD underperform the bottom decile by −3.48 pp pooled. MC detects path-dependence; it does not predict performance.

Abstract

Within-strategy Monte Carlo permutation testing, shuffling a strategy’s trade returns to assess significance, rests on an exchangeability assumption that financial time series violate. The test’s diagnostic properties under exchangeability are well established. What has not been evaluated is whether acting on the test’s output actually improves forward-looking strategy selection.

We evaluate this across 437,911 strategy configurations (8 indicator families; within-family bar-level PnL correlations average only |ρ̄| = 0.031, refuting the common assumption of near-duplicate return streams among parameter variants) on nine instruments spanning three asset classes: crypto perpetual futures (BTC, DOGE, BNB, SOL), forex majors (EUR/USD, USD/JPY, EUR/GBP), and commodities (XAU/USD, WTI Oil). The evaluation covers 160 walk-forward windows. The pipeline produces 6,629,125 strategy-window observations and approximately 26.5 billion permutations end-to-end: 6.63 billion within-strategy MC permutations, 19.9 billion block-permutation runs, and a gold-standard bar-permutation pipeline (~207 million full strategy re-executions, Aronson/Masters construction) as an independent dependence-preserving check.

On path-dependent statistics (Maximum Drawdown, Calmar, and Ulcer index: the metrics for which the within-strategy permutation test is mathematically well-defined), adding MC filtering on top of a simple in-sample profitability gate (IS Profit Factor > 1) produces incremental lift indistinguishable from zero. The calendar-quarter cluster bootstrap (61 clusters across the nine instruments) gives a pooled MC-MDD p50 lift of +0.34 pp [+0.07, +0.62], which does not survive Bonferroni correction across the family of six MC variants. A previously circulated draft of this paper reported a negative-lift result on sum-based statistics (ROI / Sharpe / Profit Factor); that result is reattributed in the May 2026 revision to a floating-point summation-order artefact in vectorised permutation code, not to a property of the data.

Two independent checks reinforce the corrected null. A placebo test on the artefactual filter produces sign-flipping lift across asset classes (crypto and gold negative, three forex pairs positive), a pattern no substantive non-exchangeability theory predicts, but the floating-point summation pitfall does. And a gold-standard bar-permutation Monte Carlo (stationary-bootstrap resampling of bar returns with full strategy re-execution, ~207 million runs across the nine instruments) confirms genuine in-sample bar-level edge yet reproduces the same null forward-selection result.

Fig. 1: Bootstrap distribution of MC filter lift under the corrected path-dependent (MDD) ranks. Per-instrument 95% CIs straddle zero for every asset except WTI (+0.20 pp, p=0.007). The pooled cluster bootstrap point estimate is +0.34 pp [+0.07, +0.62], below any practically actionable threshold and not robust to multiple-testing correction.
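The calendar-quarter cluster bootstrap behind the pooled interval can be sketched as follows. This is a minimal illustration, not the paper's released pipeline. It assumes each observation has already passed the IS Profit Factor > 1 gate and is reduced to a hypothetical tuple (cluster_id, passes_mc_filter, oos_profitable), and it assumes lift is defined as the OOS profitability of the MC-filtered rows minus that of all gated rows. The names lift_pp and cluster_bootstrap_ci are illustrative, and the demo uses fewer resamples than the 10,000 the paper reports.

```python
import random
from collections import defaultdict


def lift_pp(rows):
    """Profitability of MC-filtered rows minus that of all gated rows, in pp.

    Each row is (cluster_id, passes_mc_filter, oos_profitable).
    """
    kept = [prof for _, passes, prof in rows if passes]
    base = sum(prof for _, _, prof in rows) / len(rows)
    return 100.0 * (sum(kept) / len(kept) - base)


def cluster_bootstrap_ci(rows, n_boot=1000, alpha=0.05, seed=42):
    """Percentile CI from resampling whole clusters with replacement."""
    by_cluster = defaultdict(list)
    for row in rows:
        by_cluster[row[0]].append(row)
    clusters = list(by_cluster)
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        sample = []
        # Draw as many clusters as exist, keeping each cluster's rows together.
        for c in rng.choices(clusters, k=len(clusters)):
            sample.extend(by_cluster[c])
        stats.append(lift_pp(sample))
    stats.sort()
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lift_pp(rows), (lo, hi)


# Synthetic panel with no true effect: 61 clusters, filter independent of outcome.
gen = random.Random(0)
rows = [(q, gen.random() < 0.5, gen.random() < 0.6)
        for q in range(61) for _ in range(50)]
point, (lo, hi) = cluster_bootstrap_ci(rows)
print(f"lift {point:+.2f} pp, 95% CI [{lo:+.2f}, {hi:+.2f}]")
```

Resampling whole clusters rather than individual rows is what lets the interval reflect within-quarter dependence; resampling rows independently would treat correlated observations as if they were separate evidence.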

Demo: Watch the floating-point shift appear

Shuffle the same multiset of trade returns thousands of times. Every shuffle should produce the same sum, and mathematically every shuffle does. But when the batched vectorised sums are compared with an IEEE-754 strict greater-than, one shuffle will ‘beat’ another with non-trivial frequency. The rank distribution collapses to ~37 ± 2%, exactly the leftward shift the March 2026 draft mistook for evidence of non-exchangeability. Toggle the summation procedure to see the artefact appear and disappear.
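The order-dependence itself is easy to see in a few lines. The sketch below is not the paper's reproducer (python/fp_pitfall_demo.py); it assumes a plain left-to-right accumulation loop as a stand-in for the batched vectorised sum, and uses synthetic returns. The explicit loop is deliberate: since Python 3.12 the built-in sum applies compensated summation to floats, which would hide part of the effect. The helper names naive_sum and shuffled_sums are illustrative.

```python
import math
import random


def naive_sum(values):
    """Plain left-to-right float accumulation (no compensation)."""
    total = 0.0
    for v in values:
        total += v
    return total


def shuffled_sums(values, n_shuffles, summer, seed=42):
    """Sum the same multiset under n_shuffles random orderings."""
    rng = random.Random(seed)
    work = list(values)
    sums = []
    for _ in range(n_shuffles):
        rng.shuffle(work)
        sums.append(summer(work))
    return sums


# Synthetic trade returns, for illustration only (not data from the paper).
gen = random.Random(0)
returns = [gen.gauss(0.001, 0.02) for _ in range(200)]

naive = shuffled_sums(returns, 1000, naive_sum)
exact = shuffled_sums(returns, 1000, math.fsum)

print("distinct naive sums:", len(set(naive)))
print("distinct fsum sums: ", len(set(exact)))
print("spread of naive sums:", max(naive) - min(naive))

# A three-term case where the ordering effect can be checked by hand.
print((1e16 + 1.0) + -1e16)  # 0.0: the 1.0 is lost to rounding
print((1e16 + -1e16) + 1.0)  # 1.0: the 1.0 survives
```

The spread between naive sums is tiny in absolute terms, but a strict comparison only needs a difference in the last bit to declare a winner.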

Switch to mid-rank tie handling: the rank distribution snaps to a uniform null centred on 50%. Switch to Kahan compensated summation: it collapses to a thin sliver near 0. The data hasn't changed, only the order in which the floats were added.
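The difference between the two counting rules can be illustrated on a hand-built case. This is a sketch under the assumption that the rank is the share of permuted values the realised value beats; the function names are hypothetical, and math.nextafter (Python 3.9+) is used only to manufacture one-bit perturbations of sums that are mathematically equal.

```python
import math


def strict_rank(realised, permuted):
    """Share of permuted values that the realised value strictly exceeds."""
    return sum(realised > p for p in permuted) / len(permuted)


def mid_rank(realised, permuted):
    """Strictly-smaller count plus half of the exact ties, divided by B."""
    less = sum(p < realised for p in permuted)
    equal = sum(p == realised for p in permuted)
    return (less + 0.5 * equal) / len(permuted)


# Four "permuted sums" that are mathematically equal to the realised 1.0,
# one nudged down and one nudged up by a single representable step.
below = math.nextafter(1.0, 0.0)
above = math.nextafter(1.0, 2.0)
permuted = [below, 1.0, 1.0, above]

print(strict_rank(1.0, permuted))   # 0.25
print(mid_rank(1.0, permuted))      # 0.5
print(strict_rank(1.0, [1.0] * 4))  # 0.0
```

With symmetric rounding noise the strict count lands well below one half, the mid-rank stays at one half, and with all sums exactly equal the strict count drops to zero, which matches the three behaviours the demo describes.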

Why it happens

Two structural reasons, one numerical pitfall, and one statistical limitation:

  1. The standard practitioner metrics are permutation-invariant. Under fixed per-trade notional sizing (the standard practitioner setting), total ROI, trade-level Sharpe, and Profit Factor are each functions only of the multiset of trade returns. They take the same value under every reshuffling and are invisible to the MC test by construction (proposition in Appendix A4 extending the classical PF-invariance result). The within-strategy MC test on these metrics is structurally uninformative regardless of any property of the data-generating process.
  2. On the path-dependent metrics where the test is well-defined, realised trade ordering does not separate winners from losers out-of-sample. Maximum Drawdown, Calmar, and Ulcer are genuine functions of trade ordering and produce a non-degenerate MC null. But empirically the realised ordering is statistically indistinguishable from a random reshuffle: mean ranks sit within a few percentage points of the 50% benchmark on every instrument, and the residual filter lift is at most a fraction of a percentage point.
  3. A floating-point summation-order artefact inflated earlier numbers. Vectorised permutation code accumulates batches of shuffled trade-returns in different summation orders than the realised order, producing a small but non-zero fraction of IEEE-754 strict greater-than comparisons between two mathematically equal sums. The previously reported negative-lift signal on sum-based statistics was almost entirely this artefact. The May 2026 revision documents the pitfall, ships a self-contained reproducer, and proposes three repairs (path-dependent statistics, mid-rank tie handling, or exact-precision summation).
  4. Sampling noise in tail-quantile estimation. Even on path-dependent statistics, the MC rank is a noisy estimate of a tail quantile of an unknown null distribution. The noise compounds across the pipeline and dominates whatever residual signal remains after the two structural barriers above.
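The summation repair named in item 3 can be illustrated with a textbook Kahan loop next to Python's math.fsum. This is a generic sketch of the technique, not code from the paper's repository; the input is a constructed case in which every small term is individually too small to change a running total of 1.0.

```python
import math


def naive_sum(values):
    """Plain left-to-right float accumulation."""
    total = 0.0
    for v in values:
        total += v
    return total


def kahan_sum(values):
    """Kahan compensated summation: carry the rounding error of each add."""
    total = 0.0
    comp = 0.0  # running estimate of the low-order bits lost so far
    for v in values:
        y = v - comp
        t = total + y
        comp = (t - total) - y
        total = t
    return total


values = [1.0] + [1e-16] * 10  # each small term is below half an ulp of 1.0

print(naive_sum(values))  # 1.0: every small term is rounded away
print(kahan_sum(values))
print(math.fsum(values))
```

Kahan summation reduces the error but is not exactly order-independent in every case, whereas math.fsum returns a correctly rounded result on standard IEEE-754 builds; that is one reason to pair compensated summation with mid-rank tie handling rather than rely on it alone.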

Demo: Pick a metric, see if MC can rank it

Choose any selection metric. We will shuffle a fixed PnL multiset 1,000 times and plot the distribution of metric values. Any metric that collapses to a single point is invisible to the within-strategy MC test: no amount of permutation can rank what is mathematically constant.

Profit Factor, ROI, trade-level Sharpe, win rate, and Sortino all collapse. Maximum Drawdown, Calmar, and Ulcer index distribute across a non-trivial range. The test is well-defined on the second group only, and on real data those ranks centre near 50%.
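A small offline version of the same experiment, as a sketch: it assumes fixed per-trade notional sizing, so the equity curve is the running sum of trade PnL, and it uses a made-up eight-trade multiset with integer-valued PnL so that every sum is exact. The metric functions are simplified illustrations, not the paper's implementations.

```python
import math
import random


def profit_factor(pnl):
    """Gross gains divided by gross losses."""
    gains = math.fsum(x for x in pnl if x > 0)
    losses = -math.fsum(x for x in pnl if x < 0)
    return gains / losses


def total_return(pnl):
    """Sum of trade PnL (ROI up to a constant under fixed notional)."""
    return math.fsum(pnl)


def max_drawdown(pnl):
    """Largest peak-to-trough drop of the additive equity curve."""
    equity = peak = worst = 0.0
    for x in pnl:
        equity += x
        peak = max(peak, equity)
        worst = max(worst, peak - equity)
    return worst


pnl = [3.0, -1.0, 2.0, -4.0, 1.0, 5.0, -2.0, 1.0]  # illustrative fixed multiset
rng = random.Random(42)
values = {"PF": set(), "ROI": set(), "MDD": set()}
work = list(pnl)
for _ in range(1000):
    rng.shuffle(work)
    values["PF"].add(profit_factor(work))
    values["ROI"].add(total_return(work))
    values["MDD"].add(max_drawdown(work))

for name, seen in values.items():
    print(name, "distinct values across shuffles:", len(seen))
```

Profit Factor and total return each take a single value across all shuffles, while Maximum Drawdown takes several, so only the drawdown has a permutation null to rank against.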

Fig. 2: Mean MC-MDD rank by walk-forward window across the nine instruments under the corrected (path-dependent) procedure. The dashed line is the random-reshuffle expectation (50%). Means hover near the baseline with high within-window variance and no consistent across-window signal, visually consistent with the predictive-power null and showing the result is not driven by a small subset of regimes.
Fig. 3: Synthetic validation under known ground truth. MC rank distributions on path-dependent statistics for three signal tiers (Pure Null, Known Edge, Adversarial). When the data-generating process injects genuine path-dependence, the test detects it as a small leftward shift; on real data the same metrics produce a flat distribution centred on 50, so the empirical absence-of-signal is a genuine null rather than low statistical power.

Demo: Detection isn't prediction

Pick an instrument and slide K from 1 to 10 strategies per portfolio. Watch the in-sample MC-MDD rank distribution drift right of 50: the test correctly detects that the realised portfolio drawdown geometry is better than a typical permutation would suggest. Then check the next-window OOS panel.

On the pooled 9-instrument result, the same rank that drifted right of 50 in-sample inverts as a predictor: smoothest-IS portfolios underperform the noisiest by −3.48 pp out-of-sample. The MC test detects real path-dependence and trades against you.

Fig. 4: Portfolio MC-MDD ranks shift rightward (in-sample detection of real path-dependence) as strategies are combined into 10-strategy equal-weight portfolios, yet the same rank inverts as a predictor: top-decile portfolios by in-sample MC-MDD underperform bottom-decile portfolios in next-window OOS profitability by −3.48 pp pooled across the nine instruments. Detection is not prediction.

Practical consequence

The per-strategy MC-rank signal carries no positive forward-selection content; it adds compute time and an extra mental step that doesn't move the OOS distribution. At the portfolio level the picture is sharper: combining strategies produces a rightward shift in MDD ranks (the actual portfolio drawdown geometry is better than typical permutations would suggest), but a forward-OOS test shows the signal mean-reverts: top-decile portfolios by in-sample MC-MDD rank underperform bottom-decile portfolios in the next walk-forward window by −3.48 pp pooled across the nine instruments. MC at the portfolio level is a detector of path-dependence, not a predictor of future profitability.
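The top-decile versus bottom-decile comparison can be sketched in a few lines. This assumes "profitability" means the share of portfolios with positive OOS PnL in the next window and that deciles are cut on the in-sample MC-MDD rank; the function name and toy inputs are illustrative, not taken from the paper's code.

```python
def decile_spread_pp(is_rank, oos_profitable):
    """Top-decile minus bottom-decile OOS profitability, in percentage points."""
    order = sorted(range(len(is_rank)), key=lambda i: is_rank[i])
    k = max(1, len(order) // 10)  # decile size, at least one portfolio
    bottom, top = order[:k], order[-k:]

    def rate(indices):
        return sum(oos_profitable[i] for i in indices) / len(indices)

    return 100.0 * (rate(top) - rate(bottom))


# Toy example: the two lowest-ranked portfolios are the only profitable ones.
ranks = list(range(20))
profitable = [True, True] + [False] * 18
print(decile_spread_pp(ranks, profitable))  # -100.0
```

A negative spread means the portfolios the in-sample rank favoured did worse out-of-sample than the ones it disfavoured, which is the sign of the pooled result reported above.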

At the portfolio level, equal-weight diversification dominates filter selection: 53–62pp improvement from diversification versus 3–5pp from the best filter. Equal-weight multi-strategy portfolios achieve 89–98% OOS profitability regardless of filter choice, consistent with the broader literature on naive 1/N portfolios (DeMiguel et al., 2009).

Why a shared signal too small to detect at the strategy level becomes a confident signal at the portfolio level is the subject of “The signal is collective”. The production work behind that project and the firm’s broader research programme uses the same Daru Finance proprietary filter, a different construction from the comparison filters evaluated in this paper.

Demo: Move the slider, watch nothing happen

A common practitioner response to a null finding is to tighten the filter. Drag the MC percentile threshold from p0 to p99 and watch the OOS lift confidence interval. The CI ribbon never leaves the ±1 pp band around zero, no matter which path-dependent statistic you pick.

Demo: Threshold lift

Sweep the MC percentile threshold and the metric. The OOS top-bottom decile lift CI never leaves the ±1 pp band, while the IS PF > 1 gate alone delivers +4.54 pp.

Paper Table 15 (calendar-cluster bootstrap), readout at MC percentile threshold p50:

  • Point estimate: +0.34 pp
  • 95% CI: [+0.07, +0.62]
  • Bonferroni: does not survive
  • IS PF > 1 alone: +4.54 pp

[Chart: OOS top−bottom decile lift vs MC percentile threshold (p0 to p99), with a ±1 pp band around zero, a dashed reference line for IS PF > 1 alone (+4.54 pp), and the p50 point at +0.34 pp.]

The whole curve is the story: drag across every threshold and the selected metric, plus the faint trails of the others, stays pinned inside the ±1 pp band. The IS PF > 1 gate alone delivers +4.54 pp (the dashed line far above), so MC filtering adds noise around zero no matter which path-dependent statistic you pick. The MC-ROI* artefact curve drifts above zero only because the floating-point summation pitfall sign-flips on forex pairs (paper §8.4).

Now check the IS PF > 1 gate alone: that single Boolean filter delivers +4.54 pp of OOS lift. MC threshold tightening adds compute time and noise around zero; the in-sample profitability gate does the lifting.

Methodology summary

  • Strict walk-forward optimization: 10,000-candle in-sample windows for crypto (~10,000 clock hours for forex/commodities), advancing every 5,000 candles/hours, with no parameter adjustment during OOS evaluation.
  • Realistic transaction costs: 0.16% round-trip for crypto perpetuals (slippage + fee + funding), 1.4 pips for forex majors, instrument-specific spreads for commodities.
  • 1,000 within-strategy MC permutations per strategy-window pair (~6.63 billion total) on path-dependent statistics (MDD, Calmar, Ulcer); a block-permutation battery at b ∈ {2, 3, 5, 10, 20} adds ~19.9 billion runs at the portfolio level; a gold-standard bar-permutation pipeline (stationary-bootstrap resampling of bar returns with full strategy re-execution, Aronson/Masters construction) adds ~207 million full re-executions across the nine instruments as an independent dependence-preserving check. Approximately 26.5 billion permutations end-to-end.
  • Comparison filters in §3.4: a deterministic robustness battery (cost +50%, timing ±1 bar, parameter ±1 step) under matched gating conditions. The portfolio analysis (§6) covers within-window MC, forward-OOS deciles, and top-N-by-IS-PF construction across approximately 160,000 portfolios.
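The walk-forward schedule in the first bullet can be sketched as an index generator. The 10,000-bar in-sample length and 5,000-bar step come from the text; the out-of-sample length is not stated there, so the sketch assumes it equals the step (each OOS segment is the 5,000 bars that follow the in-sample window). The function name is hypothetical.

```python
def walk_forward_windows(n_bars, is_len=10_000, step=5_000, oos_len=5_000):
    """Yield (is_start, is_end, oos_start, oos_end) as half-open index ranges.

    oos_len is an assumption of this sketch, not a figure from the paper.
    """
    start = 0
    while start + is_len + oos_len <= n_bars:
        is_end = start + is_len
        yield start, is_end, is_end, is_end + oos_len
        start += step


for window in walk_forward_windows(25_000):
    print(window)
# (0, 10000, 10000, 15000)
# (5000, 15000, 15000, 20000)
# (10000, 20000, 20000, 25000)
```

Parameters are chosen on the in-sample range only and then held fixed over the OOS range, which is what "no parameter adjustment during OOS evaluation" requires.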

Reproducibility and disclosure

The repository ships analysis scripts only. Every figure, table, and inline statistic in the paper can be reproduced by running the released pipeline against any strategy-window output that conforms to the documented schema. Code spans three languages: Python (13 scripts: orchestration, figures, tables, and a self-contained reproducer of the floating-point summation-order pitfall), Rust (7 crates parallelised with Rayon: the four new path-dependent crates mc_path_ranks, block_perm_path, portfolio_mc_path, and portfolio_mc_oos, plus the legacy block_perm, the correlation tensor, and the synthetic pipeline), and R (5 scripts independently re-deriving the key statistical claims in a different language as a methodological cross-validation). All scripts use fixed random seeds (42); bootstrap CIs use 10,000 resamples; the lift estimates are stable across three independent seed sequences (mean lift varies by less than 0.05 pp).

The package is explicit about what is not shipped: the raw bar/trade data and the 437,911 strategy configurations remain proprietary. Readers reproduce by pointing the scripts at their own bar-level data and strategy universe; the strategy backtester that produces the trade streams the pipeline consumes is open source at github.com/DaruFinance/quant-research-framework-rs. Separately, the Daru Finance production filter, used in consulting work and in the firm's broader research programme, is a different construction from the comparison filters evaluated here, and is not the subject of this paper.

Practitioner checklist

An eight-item audit to run against your own MC pipeline before trusting its output. The most expensive mistakes are below.

  • 01

    Check your sizing assumption first.

    The multiset-invariance proposition assumes fixed per-trade notional sizing. Under Kelly, fixed-fractional, or volatility-scaled sizing, the structural invariance breaks, but so does the applicability of most of the practitioner literature that motivates the test in the first place. Be explicit about which regime you are in.

  • 02

    Use mid-rank tie handling on your MC counter.

    Replace strict-greater-than counts with mid-rank tie correction: rank = (count_strict_less + 0.5 × count_equal) / B. This single change removes the leftward shift produced by IEEE-754 summation noise on permutation-invariant metrics.

  • 03

    Or switch to compensated / exact summation.

    math.fsum in Python or Kahan summation in any language pushes the surviving rounding error below the threshold at which the artefact can sign-flip a ranking. Slower per call, but eliminates the bug class.

  • 04

    Use a path-dependent statistic, not a sum-based one.

    If your selection metric is in the multiset-invariant family (ROI, Sharpe, Profit Factor, win rate, Sortino) the within-strategy MC test is structurally uninformative regardless of implementation. Switch to MDD, Calmar, or Ulcer if you want a permutation null to mean anything.

  • 05

    Use at least B = 1,000 permutations.

    Below that, the tail of the null is too noisy to rank reliably and you will see arbitrary leftward or rightward shifts driven by sampling noise. The paper uses B = 1,000 throughout.

  • 06

    Cluster your standard errors by calendar period.

    Naive i.i.d. standard errors will convince you of effects that are not there: the design effect on a 9-instrument 160-window panel is roughly 18×. Use a calendar-quarter cluster bootstrap (~61 clusters); this is the source of the headline CI in the paper.

  • 07

    Don't filter on MC rank as a forward selector.

    Use the MC test as a detector of path-dependence, not as a forward predictor. The portfolio-level result shows that even when MC detects real path-dependence, the rank inverts as an OOS predictor by approximately −3.5 pp.

  • 08

    Treat block permutation as detection, not prediction.

    Block permutation preserves local autocorrelation and produces a slightly narrower null than the i.i.d. variant. It does not produce a positive forward-selection signal. Use it to characterise dependence structure, not to pick strategies.
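For item 08, a block permutation of a trade sequence can be sketched as below. This is a generic non-overlapping block shuffle, offered as an assumption about the construction rather than a port of the paper's block_perm crate; the block lengths in the paper's battery are b ∈ {2, 3, 5, 10, 20}.

```python
import random


def block_permute(trades, block_len, rng):
    """Shuffle contiguous blocks of trades, keeping order inside each block."""
    blocks = [trades[i:i + block_len] for i in range(0, len(trades), block_len)]
    rng.shuffle(blocks)
    return [x for block in blocks for x in block]


trades = [1, 2, 3, 4, 5, 6]
shuffled = block_permute(trades, 2, random.Random(42))
print(shuffled)  # pairs (1, 2), (3, 4), (5, 6) stay adjacent, in some block order
```

Because trades inside a block keep their realised order, short-range dependence survives the shuffle; only the arrangement of blocks is randomised.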

A side observation worth flagging

Within-family bar-level PnL correlations average only |ρ̄| = 0.031 across the full sample, refuting the common assumption that parameter variants of a single indicator family produce near-duplicate return streams. Even at the 99th-percentile pair, the average |ρ| is just 0.224, well below the |ρ| > 0.7 level that “near-duplicate” intuition would predict. This means the 437,911-strategy universe carries far more independent information than a casual reading would assume.
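A mean absolute pairwise correlation of this kind can be computed as in the sketch below. It assumes the statistic is the plain average of |ρ| over all pairs of bar-level PnL streams within a family; how the paper aggregates across families and windows is not restated here, and the function names are illustrative.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = math.fsum(x) / n, math.fsum(y) / n
    sxy = math.fsum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = math.fsum((a - mx) ** 2 for a in x)
    syy = math.fsum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mean_abs_correlation(streams):
    """Average |rho| over all unordered pairs of streams."""
    pairs = list(combinations(streams, 2))
    return math.fsum(abs(pearson(x, y)) for x, y in pairs) / len(pairs)


# Toy streams: two uncorrelated with each other, one perfectly anti-correlated.
a = [1.0, -1.0, 1.0, -1.0]
b = [1.0, 1.0, -1.0, -1.0]
c = [-1.0, 1.0, -1.0, 1.0]
print(mean_abs_correlation([a, b, c]))
```

Taking the absolute value before averaging matters: strongly negative and strongly positive pairs would otherwise cancel and understate how much the streams overlap.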

Fig. 5: The floating-point summation-order artefact, in three panels. Vectorised permutation code accumulates batches of shuffled trade returns in a different summation order than the realised one. For roughly 1.1% of shuffles, IEEE-754 reports a strict greater-than between two mathematically equal sums (by ≤ 5×10⁻¹⁵); for roughly 28.6% it reports exact equality; only the remaining ~70.3% are strictly greater for substantive reasons. Under the strict-greater-than count used by typical MC code, the rank distribution collapses to ~37 ± 2% across strategies on i.i.d. data, exactly matching the leftward shift previously interpreted as evidence of non-exchangeability.
Fig. 6: Gold-standard bar-permutation Monte Carlo across the nine instruments. Top row: in-sample PF rank vs the 50% random-reshuffle baseline (real edge present on crypto and commodities). Bottom row: forward-OOS lift relative to the bar-permuted null. Crypto and commodities sit within ±0.75 pp of zero; the apparent forex lift collapses under the placebo control reported alongside.

Even the gold-standard bar-permutation MC produces a null

A reasonable objection at this point: ‘the within-strategy trade-shuffle MC is the cheap variant; the dependence-preserving bar-permutation MC of Aronson and Masters is the one I actually trust.’ The paper runs that procedure too, at scale: stationary-bootstrap resampling of bar returns with full strategy re-execution, approximately 207 million runs across the nine instruments. It detects genuine in-sample bar-level edge (PF rank 80–86 on crypto and commodities), yet it produces the same null forward-selection result.
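The resampling step of that procedure can be sketched with the standard stationary bootstrap of Politis and Romano: blocks of geometrically distributed length, drawn with wrap-around. The mean block length below is an arbitrary illustration, not the paper's setting, and the strategy re-execution on each resampled series is outside the scope of this sketch.

```python
import random


def stationary_bootstrap_indices(n, mean_block, rng):
    """Index sequence with geometric block lengths and circular wrap-around."""
    p = 1.0 / mean_block  # probability of starting a new block at each step
    idx = [rng.randrange(n)]
    while len(idx) < n:
        if rng.random() < p:
            idx.append(rng.randrange(n))   # jump: start a new block
        else:
            idx.append((idx[-1] + 1) % n)  # continue the current block
    return idx


# Synthetic bar returns, for illustration only.
gen = random.Random(0)
bar_returns = [gen.gauss(0.0, 0.01) for _ in range(500)]

idx = stationary_bootstrap_indices(len(bar_returns), mean_block=20,
                                   rng=random.Random(42))
resampled = [bar_returns[i] for i in idx]
print(len(resampled), idx[:10])
```

Each resampled series would then be fed through the unchanged strategy logic to produce one draw from the null, which is why this variant costs a full backtest per permutation.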

The placebo control is what makes the reading airtight. The same artefactual filter that appeared to add value produces lift that flips sign across asset classes (negative on crypto and gold, positive on three forex pairs), a pattern no theory of genuine non-exchangeability predicts but a floating-point summation pitfall reproduces exactly. Once that artefact is corrected, the apparent forex edge collapses with it, and crypto and commodities settle within ±0.75 percentage points of zero. So the gold-standard procedure agrees with the cheap one on the only question that matters at decision time: it confirms there is real bar-level structure to find in-sample, and it confirms that structure does not carry into the next walk-forward window. Detection is not prediction, and paying for the most expensive permutation test buys you a more credible null, not a different one.

FAQ

Does this apply to equity strategies?
The empirical scope of the paper is four crypto perpetuals, three forex majors, and two commodities. The structural arguments (multiset invariance, the floating-point pitfall, calendar-cluster CIs) are properties of the statistic and of its implementation, so they do not depend on the asset class. We have not run the procedure on equities; that is a deliberate scope limitation flagged in §10.
What about Kelly / fixed-fractional / vol-scaled sizing?
Equity-dependent position sizing breaks the multiset-invariance proposition: under Kelly, the order in which trades arrive changes the per-trade notional and therefore ROI and Sharpe. So the structural argument is narrower than the practitioner literature that motivates the test. In practice most published MC permutation tests assume fixed sizing because that is what makes the null tractable; the paper's critique applies to that body of work specifically.
Then what should I use instead?
Two procedures pass the same forward-OOS audit and are methodologically appropriate. Bar-level stationary-bootstrap resampling with full strategy re-execution (the Aronson/Masters construction) tests a meaningful dependence-preserving null. Combinatorial purged cross-validation (López de Prado) tests a different question. Both produce the same null forward-selection result as the within-strategy test on the data here; both are documented in §10.
Did you test universe-null methods like White's Reality Check?
Out of scope. White's Reality Check, Hansen's SPA, and Romano-Wolf StepM address the multiple-comparisons-across-strategies hypothesis, not the within-strategy permutation hypothesis. They could be evaluated with the same forward-lift methodology this paper introduces; we leave that for follow-up work.
Where's the code, and how do I reproduce the floating-point demo?
The full pipeline is at github.com/DaruFinance/Monte-Carlo-paper. The self-contained floating-point reproducer (python/fp_pitfall_demo.py) runs in under 30 seconds with no external data and reproduces the same ~37% mean rank the paper documents.
Why does the May 2026 revision retract a finding from the March 2026 draft?
The March draft reported a uniformly negative MC lift of approximately −1.2 pp across nine instruments. That result was driven by vectorised summation code that compared two mathematically equal floats under IEEE-754 strict-greater-than. Once corrected, via mid-rank ties, exact summation, or a switch to genuinely path-dependent statistics, the leftward shift disappears. The revision documents the bug, ships a public reproducer, and runs the corrected analysis at the same scale.

See also

For a less-formal companion that walks through the broader empirical landscape, and shows where the edge actually lives in a 533,638-strategy population once within-strategy permutation testing is set aside (in selection on the population, reproducible from realised daily PnL), read “Edge is in the Process”.
