M/01 — Random Matrix Theory

Eigenspectrum of the strategy correlation matrix

Marchenko–Pastur theory and parallel analysis as a noise-floor for the principal components of a strategy population.

The mathematics

Suppose you observe an N × T matrix of standardized returns X (N strategies recorded over T bars, each row scaled to zero mean and unit variance). The sample correlation matrix is

$$C = \frac{1}{T}\, X X^{\top}$$
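A minimal NumPy sketch of this construction, assuming X is held as an N × T array with one strategy per row. The helper name `sample_correlation` is illustrative and not part of any library named on this page; it standardizes each row and forms X Xᵀ / T, which should agree with `np.corrcoef`.

```python
import numpy as np

def sample_correlation(returns: np.ndarray) -> np.ndarray:
    """Standardize each row of an N x T array, then return C = X X^T / T."""
    T = returns.shape[1]
    # Row-wise z-score with the population (ddof=0) standard deviation,
    # so that the diagonal of C is exactly 1.
    X = (returns - returns.mean(axis=1, keepdims=True)) / returns.std(axis=1, keepdims=True)
    return X @ X.T / T

rng = np.random.default_rng(0)
raw = rng.standard_normal((5, 200))      # toy data: 5 strategies, 200 bars
C = sample_correlation(raw)
```

Dividing by T rather than T − 1 matches the unit-variance convention used for the rows, so no further rescaling of C is needed.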

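If the entries of X are independent N(0, 1) draws, i.e. there is no real structure, then in the joint limit N, T → ∞ with fixed ratio q = N/T ≤ 1, the empirical eigenvalue distribution of C converges to a deterministic density on a finite support [λ₋, λ₊]:

$$\rho_{MP}(\lambda) = \frac{\sqrt{(\lambda_+ - \lambda)(\lambda - \lambda_-)}}{2\pi q \sigma^2 \lambda}, \qquad \lambda_\pm = \sigma^2 \left(1 \pm \sqrt{q}\right)^2$$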

For standardized data σ² = 1. The interval [λ₋, λ₊] is the bulk. Any eigenvalue outside the bulk cannot be explained by sample-size noise alone — it is a candidate for real structure. At q = 0.1, λ₊ ≈ 1.73; at q = 0.2, λ₊ ≈ 2.09; at q = 0.5, λ₊ ≈ 2.91. The bulk widens as q grows because with fewer observations per dimension, sample correlations become noisier.
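The edge values quoted above follow directly from λ± = σ²(1 ± √q)². A short sketch using only the standard library; `mp_edges` and `mp_density` are hypothetical helper names chosen for this example, and the density is the standard MP form for q ≤ 1.

```python
import math

def mp_edges(q: float, sigma2: float = 1.0) -> tuple:
    """Lower and upper edges of the Marchenko-Pastur bulk for ratio q = N/T."""
    root = math.sqrt(q)
    return sigma2 * (1.0 - root) ** 2, sigma2 * (1.0 + root) ** 2

def mp_density(lam: float, q: float, sigma2: float = 1.0) -> float:
    """MP density at lam (assumes 0 < q <= 1); zero outside the bulk."""
    lo, hi = mp_edges(q, sigma2)
    if lam <= lo or lam >= hi:
        return 0.0
    return math.sqrt((hi - lam) * (lam - lo)) / (2.0 * math.pi * q * sigma2 * lam)

# Print the bulk for the three ratios discussed in the text.
for q in (0.1, 0.2, 0.5):
    lo, hi = mp_edges(q)
    print(f"q={q}: bulk = [{lo:.3f}, {hi:.3f}]")
```

Because the edges depend on the data only through q and σ², this check costs nothing beyond the eigendecomposition itself.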

Parallel analysis as a non-parametric null

The MP result above is stated for independent, identically distributed Gaussian entries. Real strategy returns violate both assumptions. Parallel analysis (Horn 1965), in its permutation form, substitutes a fully data-driven null: take X, independently permute each row in time, and recompute the eigenvalues. Repeat B = 1000 times and record the maximum eigenvalue at each replication b. The 99th percentile of that distribution is a non-parametric upper bound on the bulk:

$$\lambda_{PA} = Q_{0.99}\left(\left\{\lambda_{\max}^{(b)}\right\}_{b=1}^{B}\right)$$

Permuting in time preserves each row’s marginal distribution (so heavy-tailed returns stay heavy-tailed) but destroys cross-row dependence. An eigenvalue exceeding both λ₊ from MP and the PA 99th-percentile is robust signal.
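The permutation null described above can be sketched in a few lines of NumPy. This is an illustrative implementation, not the `strategy_rmt` one: the function name `pa_threshold` and its parameters are assumptions made for this example, and the toy run uses a small replication count to keep it fast.

```python
import numpy as np

def pa_threshold(X: np.ndarray, n_perm: int = 1000, quantile: float = 0.99,
                 seed: int = 0) -> float:
    """Quantile of the top eigenvalue under independent row-wise time shuffles."""
    rng = np.random.default_rng(seed)
    top = np.empty(n_perm)
    for b in range(n_perm):
        # Shuffle every row independently along the time axis: marginals are
        # preserved, cross-row dependence is destroyed.
        Xp = rng.permuted(X, axis=1)
        top[b] = np.linalg.eigvalsh(np.corrcoef(Xp))[-1]   # largest eigenvalue
    return float(np.quantile(top, quantile))

# Toy check on heavy-tailed, structure-free data (Student-t, 5 d.o.f.).
rng = np.random.default_rng(1)
X = rng.standard_t(df=5, size=(20, 200))
threshold = pa_threshold(X, n_perm=200)
mp_edge = (1 + np.sqrt(20 / 200)) ** 2
print(f"PA 99% threshold: {threshold:.3f}   MP edge: {mp_edge:.3f}")
```

Note that shuffling in time also removes any serial dependence within a row, so the null is one of no cross-row and no temporal structure.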

Worked example

Take N = 60 strategies, T = 300 bars, q = 0.2. Plant a single correlated cluster: 10 of the 60 rows are loaded on a common factor with intra-cluster correlation ρ = 0.4. The other 50 are independent N(0,1).

  • MP bulk: λ₊ = (1 + √0.2)² ≈ 2.094, λ₋ = (1 − √0.2)² ≈ 0.306.
  • Theoretical (population) leading eigenvalue from the planted cluster: k · ρ + (1 − ρ) = 10 · 0.4 + 0.6 = 4.6.
  • The remaining 59 eigenvalues should fall inside [λ₋, λ₊], up to small finite-size fluctuations at the edges.
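The worked example can be reproduced with a short simulation. The sketch below assumes a one-factor construction for the cluster (each cluster row is √ρ · f + √(1 − ρ) · ε with a shared factor f), which yields pairwise correlation ρ within the cluster; the source does not say how the demo plants its cluster, so treat this as one plausible reading. It also prints the spiked-model prediction ℓ(1 + q/(ℓ − 1)) for the sample eigenvalue, a standard large-N result for covariance matrices that is only approximate here.

```python
import numpy as np

N, T, k, rho = 60, 300, 10, 0.4
q = N / T
rng = np.random.default_rng(7)

# Independent N(0, 1) noise for all rows; the first k rows also load on a common factor.
X = rng.standard_normal((N, T))
factor = rng.standard_normal(T)
X[:k] = np.sqrt(rho) * factor + np.sqrt(1.0 - rho) * X[:k]

eigs = np.linalg.eigvalsh(np.corrcoef(X))[::-1]      # descending order
lam_minus, lam_plus = (1 - np.sqrt(q)) ** 2, (1 + np.sqrt(q)) ** 2

ell = k * rho + (1 - rho)                 # population eigenvalue of the cluster: 4.6
spiked = ell * (1 + q / (ell - 1))        # asymptotic sample value under the spiked model

print(f"bulk edges      : [{lam_minus:.3f}, {lam_plus:.3f}]")
print(f"population spike: {ell:.3f}, spiked-model sample value: {spiked:.3f}")
print(f"largest sample eigenvalue: {eigs[0]:.3f}")
print(f"eigenvalues above the upper edge: {int((eigs > lam_plus).sum())}")
```

The sample eigenvalue is biased upward relative to 4.6, which is why an observed value somewhat above the population figure is expected. The exact numbers will not match the demo's readout, since the random generator differs.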

The interactive demo below recomputes this every time you change a slider — the bulk-edge marker λ₊ moves with q, and any eigenvalue exceeding it is highlighted in amber. Drop the cluster size to zero and the bulk is all you see.

Demo — Marchenko–Pastur eigenspectrum

Generate an N×T strategy returns matrix with a planted correlated cluster. Eigenvalues outside the MP bulk are signal.

Settings: N (strategies) = 60; T (observations) = 300; cluster size = 10; cluster ρ = 0.40; seed = 7.

Readouts:

  • q = N/T: 0.200
  • λ₊ bulk edge: 2.094
  • λ₋ bulk edge: 0.306
  • λ_max observed: 4.739
  • # eigenvalues > λ₊: 1
  • cluster planted: 10 @ ρ=0.40

[Chart: eigenvalue histogram with the MP density overlaid; x-axis: eigenvalue λ, with λ₋ and λ₊ marked.]

Solid curve: MP density ρ_MP(λ) for q=0.200. Bars: empirical histogram of 60 sample-correlation eigenvalues. Bars in amber are above the bulk edge λ₊=2.094 — these are the signal eigenvalues.

Figures

Empirical eigenspectrum vs Marchenko-Pastur on real BTC strategy correlations
Fig. 1. Empirical eigenvalue density of a 220-strategy correlation matrix built from BTC daily strategy P&L (T ≈ 2,800 days, q ≈ 0.08), with the Marchenko–Pastur bulk overlaid. The bulk fits the body of the histogram; a handful of leading eigenvalues sit far above λ₊, and those are the candidates for genuine cluster structure.
Parallel-analysis null band over the empirical scree on BTC strategies
Fig. 2. Scree plot of the leading 40 eigenvalues against a non-parametric null built by shuffling each row in time and recomputing the spectrum 150 times. Eigenvalues clearing the 99th percentile of that null (red rings) are robust signal: they survive both the MP bulk edge and a fully data-driven distributional null.

Why this matters for systematic strategies

A strategy population at q = N/T = 0.2 (say N = 6,000 strategies on T = 30,000 bars of common history) will show eigenvalues as large as λ₊ ≈ 2.09 in its empirical correlation matrix, roughly twice the true value of 1, purely from sample-size noise. These look like factors but are not. Building a portfolio that diversifies along these directions diversifies nothing real; it only spreads weight across noise. The MP bound is the cheapest guard against this failure mode, because it is a closed-form function of q. In the firm’s production pipeline this check runs before any clustering or factor decomposition.
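To see the failure mode concretely, the sketch below draws a structure-free returns matrix at the same ratio q = 0.2 and counts how many eigenvalues look large. The sizes N = 200 and T = 1,000 are an assumption chosen to keep the example fast; only the ratio matters for the bulk edges.

```python
import numpy as np

N, T = 200, 1000                       # q = 0.2, scaled down from the 6,000 x 30,000 case
rng = np.random.default_rng(0)
noise = rng.standard_normal((N, T))    # independent rows: no true factors at all

eigs = np.linalg.eigvalsh(np.corrcoef(noise))   # ascending order
lam_plus = (1 + np.sqrt(N / T)) ** 2

# Every true eigenvalue is 1, yet the sample spectrum spreads across the MP bulk.
print(f"largest eigenvalue : {eigs[-1]:.3f}  (MP edge {lam_plus:.3f})")
print(f"eigenvalues > 1.5  : {int((eigs > 1.5).sum())} of {N}")
```

A factor model fitted to this matrix without the bulk-edge check would treat the top of a pure-noise spectrum as structure.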

A complementary eigenvector-level statement: under the same null, the leading eigenvector of an unstructured correlation matrix is delocalized, so its participation ratio follows a known noise distribution (components behave approximately like N(0, 1/N) draws, giving Σᵢ vᵢ⁴ ≈ 3/N). Our M/01 implementation reports both the bulk upper bound and the participation ratio of the leading eigenvector against its noise distribution.
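A sketch of the participation-ratio diagnostic, assuming the common definition PR(v) = 1 / Σᵢ vᵢ⁴ for a unit-norm eigenvector v; the source does not state which definition M/01 uses, and `participation_ratio` is a hypothetical helper name. Under this definition a vector spread evenly over m entries has PR = m, so an eigenvector concentrated on a 10-strategy cluster gives a value near 10, while a delocalized noise eigenvector typically gives a sizeable fraction of N.

```python
import numpy as np

def participation_ratio(v: np.ndarray) -> float:
    """1 / sum(v_i^4) after normalizing v to unit Euclidean length."""
    v = np.asarray(v, dtype=float)
    v = v / np.linalg.norm(v)
    return float(1.0 / np.sum(v ** 4))

N = 60
localized = np.zeros(N)
localized[:10] = 1.0                                      # weight on 10 entries only
pr_cluster = participation_ratio(localized)               # 10, up to rounding
pr_uniform = participation_ratio(np.ones(N))              # N = 60, up to rounding

# Leading eigenvector of a structure-free correlation matrix, for comparison.
rng = np.random.default_rng(0)
noise = rng.standard_normal((N, 300))
_, vecs = np.linalg.eigh(np.corrcoef(noise))              # eigh sorts ascending
pr_noise = participation_ratio(vecs[:, -1])
print(pr_cluster, pr_uniform, round(pr_noise, 1))
```

At N = 60 the noise value fluctuates considerably from draw to draw, so in practice the comparison should be made against a simulated or permuted null distribution rather than a single number.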

Reproducibility

DaruFinance / strategy-rmt

Python — open source reference implementation

Minimal invocation

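import numpy as np
from strategy_rmt import mp_bounds, parallel_analysis

# X: N x T returns matrix (rows = strategies, cols = bars); supply your own array
N, T = X.shape
C = np.corrcoef(X)                      # N x N sample correlation matrix
eigs = np.linalg.eigvalsh(C)            # real eigenvalues, ascending order
lo, hi = mp_bounds(N, T, sigma2=1.0)    # (1 - sqrt(q))^2, (1 + sqrt(q))^2 with q = N/T
signal = eigs[eigs > hi]                # eigenvalues above the MP bulk edge
# Optional: parallel analysis null (here q is the quantile, not the ratio N/T)
pa_threshold = parallel_analysis(X, n_perm=1000, q=0.99)
# Robust signal must clear both the MP edge and the PA threshold
robust_signal = eigs[(eigs > hi) & (eigs > pa_threshold)]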

References

  1. Marchenko, V. A. & Pastur, L. A. (1967). Distribution of eigenvalues for some sets of random matrices. Mat. Sb. (N.S.) 72(114):4, 507–536.
  2. Laloux, L., Cizeau, P., Bouchaud, J.-P., & Potters, M. (1999). Noise dressing of financial correlation matrices. Physical Review Letters 83(7), 1467–1470.
  3. Plerou, V., Gopikrishnan, P., Rosenow, B., et al. (2002). Random matrix approach to cross correlations in financial data. Physical Review E 65, 066126.
  4. Bouchaud, J.-P. & Potters, M. (2009). Financial Applications of Random Matrix Theory: a short review. In The Oxford Handbook of Random Matrix Theory.