M/01 · Random Matrix Theory

Eigenspectrum of the strategy correlation matrix

Marchenko–Pastur theory and parallel analysis as a noise-floor for the principal components of a strategy population.

The mathematics

Suppose you observe an N × T matrix of standardized returns X (each of N strategies recorded over T bars, each row mean-zero unit-variance). The sample correlation matrix is

C = \frac{1}{T} X X^{\top} \in \mathbb{R}^{N \times N} .

If the entries of X are i.i.d. N(0, 1), i.e. there is no real structure, then in the joint limit N, T → ∞ with fixed ratio q = N/T, the empirical eigenvalue distribution of C converges to a deterministic density on a finite support [λ₋, λ₊]:

ρ_{MP}(λ) = \frac{\sqrt{(λ_{+} - λ)(λ - λ_{-})}}{2 π q σ^{2} λ}, λ \in [λ_{-}, λ_{+}]

λ_{\pm} = σ^{2} (1 \pm \sqrt{q})^{2}, q = N / T .

For standardized data σ² = 1. The interval [λ₋, λ₊] is the bulk. Any eigenvalue outside the bulk cannot be explained by sample-size noise alone; it is a candidate for real structure. At q = 0.1, λ₊ ≈ 1.73; at q = 0.2, λ₊ ≈ 2.09; at q = 0.5, λ₊ ≈ 2.91. The bulk widens as q grows because, with fewer observations per dimension, sample correlations become noisier.
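The edge values quoted above can be checked numerically. The following is a minimal sketch, not the reference implementation; the helper names `mp_edges` and `mp_density` are hypothetical and chosen only for this illustration, and the density is written for 0 < q ≤ 1.

```python
import numpy as np


def mp_edges(q, sigma2=1.0):
    """Return (lambda_minus, lambda_plus) of the Marchenko-Pastur bulk for q = N/T."""
    root_q = np.sqrt(q)
    return sigma2 * (1.0 - root_q) ** 2, sigma2 * (1.0 + root_q) ** 2


def mp_density(lam, q, sigma2=1.0):
    """Evaluate the MP density at lam; it is zero outside the bulk (0 < q <= 1)."""
    lam = np.atleast_1d(np.asarray(lam, dtype=float))
    lo, hi = mp_edges(q, sigma2)
    inside = (lam > lo) & (lam < hi)
    rho = np.zeros_like(lam)
    rho[inside] = np.sqrt((hi - lam[inside]) * (lam[inside] - lo)) / (
        2.0 * np.pi * q * sigma2 * lam[inside]
    )
    return rho


# Bulk edges for the three ratios discussed in the text.
for q in (0.1, 0.2, 0.5):
    lo, hi = mp_edges(q)
    print(f"q={q:.1f}  lambda_minus={lo:.3f}  lambda_plus={hi:.3f}")
```

A quick sanity check on any implementation of the density is that it integrates to one over [λ₋, λ₊] when q ≤ 1.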

Parallel analysis as a non-parametric null

The MP null above assumes Gaussian, identically distributed entries. Real strategy returns violate both assumptions. Parallel analysis (Horn 1965) substitutes a fully data-driven null: take X, independently permute each row in time, and recompute the eigenvalues. Repeat B = 1000 times and record the maximum eigenvalue at each replication. The 99th percentile of that distribution is a non-parametric upper bound on the bulk:

λ_{(0.99)}^{PA} = Q_{0.99} ( \{ λ_{\max} (C^{(b)}) \}_{b = 1}^{B} ) .

Permuting in time preserves each row’s marginal distribution (so heavy-tailed returns stay heavy-tailed) but destroys cross-row dependence. An eigenvalue exceeding both λ₊ from MP and the PA 99th-percentile is robust signal.
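The permutation procedure can be sketched in a few lines of NumPy. This is an illustrative version under stated assumptions, not the code behind `strategy_rmt.parallel_analysis`: the function name `parallel_analysis_threshold`, its parameters, and the synthetic test matrix are all hypothetical.

```python
import numpy as np


def parallel_analysis_threshold(X, n_perm=1000, quantile=0.99, seed=0):
    """Quantile of the largest eigenvalue under row-wise time permutations of X."""
    rng = np.random.default_rng(seed)
    max_eigs = np.empty(n_perm)
    for b in range(n_perm):
        # Shuffle every row independently along the time axis: each row keeps
        # its marginal distribution, but cross-row dependence is destroyed.
        X_perm = rng.permuted(X, axis=1)
        C_perm = np.corrcoef(X_perm)
        # eigvalsh returns eigenvalues in ascending order; take the largest.
        max_eigs[b] = np.linalg.eigvalsh(C_perm)[-1]
    return np.quantile(max_eigs, quantile)


# Synthetic check on a structureless matrix (N = 20, T = 200, so q = 0.1).
rng = np.random.default_rng(1)
X_null = rng.standard_normal((20, 200))
pa_edge = parallel_analysis_threshold(X_null, n_perm=200)
print(f"PA 99% threshold: {pa_edge:.3f}")
```

Fewer permutations (here 200) make the sketch fast but leave the 99th percentile noisier; B = 1000 as in the text gives a more stable tail estimate.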

Worked example

Take N = 60 strategies, T = 300 bars, q = 0.2. Plant a single correlated cluster: 10 of the 60 rows are loaded on a common factor with intra-cluster correlation ρ = 0.4. The other 50 are independent N(0,1).

MP bulk: λ₊ = (1 + √0.2)² ≈ 2.094, λ₋ = (1 − √0.2)² ≈ 0.306.
Population leading eigenvalue from the planted cluster: k · ρ + (1 − ρ) = 10 · 0.4 + 0.6 = 4.6. The sample value fluctuates around this.
The remaining 59 eigenvalues should fall approximately inside [λ₋, λ₊].
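The same experiment can be reproduced offline. The sketch below assumes a one-factor construction for the cluster (each clustered row is √ρ times a common factor plus √(1−ρ) times idiosyncratic noise), which gives pairwise correlation ρ inside the cluster. The demo's own generator is not documented here, so the numbers will not match its readout exactly, and the seed is arbitrary.

```python
import numpy as np

N, T, k, rho = 60, 300, 10, 0.4
rng = np.random.default_rng(7)  # arbitrary seed; not the demo's generator

# Start from pure noise, then load the first k rows on a common factor.
X = rng.standard_normal((N, T))
factor = rng.standard_normal(T)
X[:k] = np.sqrt(rho) * factor + np.sqrt(1.0 - rho) * X[:k]

C = np.corrcoef(X)            # N x N sample correlation matrix
eigs = np.linalg.eigvalsh(C)  # eigenvalues in ascending order

q = N / T
lam_minus = (1.0 - np.sqrt(q)) ** 2
lam_plus = (1.0 + np.sqrt(q)) ** 2
population_top = k * rho + (1.0 - rho)  # equals 1 + (k - 1) * rho

print(f"MP bulk: [{lam_minus:.3f}, {lam_plus:.3f}]")
print(f"population leading eigenvalue: {population_top:.1f}")
print(f"sample leading eigenvalue:     {eigs[-1]:.3f}")
print(f"eigenvalues above lambda_plus: {(eigs > lam_plus).sum()}")
```

At N = 60 the bulk edge is only asymptotically sharp, so an occasional noise eigenvalue slightly above λ₊ is possible; the planted eigenvalue should sit well clear of it.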

The interactive demo below recomputes this every time you change a slider. The bulk-edge marker λ₊ moves with q, and any eigenvalue exceeding it is highlighted in amber. Drop the cluster size to zero and the bulk is all you see.

Demo: Marchenko–Pastur eigenspectrum

Generate an N×T strategy returns matrix with a planted correlated cluster. Eigenvalues outside the MP bulk are signal.

N (strategies): 60
T (observations): 300
Cluster size: 10
Cluster ρ: 0.40
seed = 7

q = N/T: 0.200
λ₊ bulk edge: 2.094
λ₋ bulk edge: 0.306
λ_max observed: 4.739
# eigenvalues > λ₊
cluster planted: 10 @ ρ=0.40

Solid curve: MP density ρ_MP(λ) for q=0.200. Bars: empirical histogram of 60 sample-correlation eigenvalues. Bars in amber are above the bulk edge λ₊=2.094; these are the signal eigenvalues.

Figures

Fig. 1: Empirical eigenvalue density of a 220-strategy correlation matrix built from BTC daily strategy P&L (T ≈ 2,800 days, q ≈ 0.08), with the Marchenko–Pastur bulk overlaid. The bulk fits the body of the histogram; a handful of leading eigenvalues sit far above λ₊, and those are the candidates for genuine cluster structure.

Fig. 2: Scree plot of the leading 40 eigenvalues against a non-parametric null built by shuffling each row in time and refitting the spectrum 150 times. Eigenvalues clearing the 99-percent quantile of that null (red rings) are robust signal under the strongest available test: they survive both the MP bulk edge and a fully data-driven distributional null.

Why this matters for systematic strategies

A strategy population at q = N/T = 0.2 (say N = 6,000 strategies on T = 30,000 bars of common history) will show eigenvalues up to λ₊ ≈ 2.09 in its empirical correlation matrix purely from sample-size noise, and these can look like several genuine factors. Building a portfolio that diversifies along these directions diversifies nothing real; it only spreads exposure across noise. The MP bound is the cheapest guard against this failure mode. In the firm’s production pipeline this check runs before any clustering or factor decomposition.

A complementary diagnostic: the leading eigenvector of an unstructured correlation matrix is delocalized, so its participation ratio has a known noise distribution (for approximately Gaussian components the inverse participation ratio Σᵢ vᵢ⁴ is close to 3/N [3]). Our M/01 implementation reports both the bulk upper bound and the participation ratio of the leading eigenvector against its noise distribution.
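For concreteness, here is a minimal sketch of a participation ratio. It assumes the common definition PR = 1 / Σᵢ vᵢ⁴ for a unit-norm vector v; the text does not say which definition M/01 uses, and the function name `participation_ratio` is hypothetical.

```python
import numpy as np


def participation_ratio(v):
    """Effective number of components carrying the weight of v: 1 / sum(v_i^4)."""
    v = np.asarray(v, dtype=float)
    v = v / np.linalg.norm(v)  # normalize so that sum(v_i^2) = 1
    return 1.0 / np.sum(v ** 4)


N = 60

# Equal weight on all N components: fully delocalized, PR = N.
uniform = np.ones(N)

# Equal weight on a 10-strategy cluster only: PR = 10.
cluster = np.zeros(N)
cluster[:10] = 1.0

# All weight on one component: fully localized, PR = 1.
single = np.zeros(N)
single[0] = 1.0

for name, vec in (("uniform", uniform), ("cluster", cluster), ("single", single)):
    print(f"{name:8s} PR = {participation_ratio(vec):.1f}")
```

Under this definition, an eigenvector concentrated on a planted cluster has a participation ratio near the cluster size, which is what distinguishes it from a delocalized noise direction.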

Reproducibility

DaruFinance / strategy-rmt

Python · open source reference implementation

Minimal invocation

import numpy as np
from strategy_rmt import mp_bounds, parallel_analysis

# X: N x T returns matrix (rows = strategies, cols = bars)
N, T = X.shape
C = np.corrcoef(X)
eigs = np.linalg.eigvalsh(C)
lo, hi = mp_bounds(N, T, sigma2=1.0)   # (1 - sqrt(q))^2, (1 + sqrt(q))^2
signal = eigs[eigs > hi]
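# Optional: parallel-analysis null (here q is the quantile, not the ratio N/T)
pa_threshold = parallel_analysis(X, n_perm=1000, q=0.99)
# Robust signal must clear both the MP edge and the PA threshold
robust_signal = eigs[eigs > max(hi, pa_threshold)]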

References

[1] Marchenko, V. A. & Pastur, L. A. (1967). Distribution of eigenvalues for some sets of random matrices. Mat. Sb. (N.S.) 72(114):4, 507–536.
[2] Laloux, L., Cizeau, P., Bouchaud, J.-P., & Potters, M. (1999). Noise dressing of financial correlation matrices. Physical Review Letters 83(7), 1467–1470.
[3] Plerou, V., Gopikrishnan, P., Rosenow, B., et al. (2002). Random matrix approach to cross correlations in financial data. Physical Review E 65, 066126.
[4] Bouchaud, J.-P. & Potters, M. (2009). Financial applications of random matrix theory: a short review. In The Oxford Handbook of Random Matrix Theory.
