Lab

M/03 — Topological Data Analysis

Persistence barcodes on strategy correlation structure

H₀ persistent homology under a correlation-distance Vietoris–Rips filtration: counting how many stable clusters survive in a strategy population.

The mathematics

Take N strategies with return histories ri ∈ ℝT. Define the correlation distance

which is the Euclidean distance between the unit-norm representatives of the two return vectors; it is a true metric (Mantegna 1999). At ε = 0 every strategy is its own cluster — N connected components. As ε grows we add an edge for every pair with d(i, j) ≤ ε. The Vietoris–Rips complex VRε is the simplicial complex whose k-simplices are (k+1)-cliques in this graph; for H₀ we only need its 1-skeleton (the graph itself).

Connected components of VRε are tracked by union-find. As ε increases we obtain a nested sequence of graphs

Each new edge either (a) connects two vertices already in the same component (no change to H₀), or (b) merges two components into one. In case (b), the younger component “dies” at ε. Persistence formalizes this: every component has a birth at ε = 0 and a death at the merge time. The 0-th persistence diagram is

i.e. N − 1 finite bars and one immortal bar (the eventually-merged-everything component). The barcode is just this set drawn as horizontal segments.

Stable cluster count

Long bars correspond to clusters that resisted being absorbed for a long range of ε — meaning that any reasonable threshold around the bar would split the population the same way. A standard heuristic for “number of stable clusters” is

with c around 1.5–2.0; the +1 accounts for the immortal bar. Tighter alternatives include the bottleneck distance to a perturbed barcode under bootstrap resampling, which we report in the production pipeline.

Worked example

Three planted clusters of 8 strategies each, intra-cluster ρ = 0.6 on T = 250 bars. With probability 1 as T grows, the empirical correlations within a cluster concentrate around 0.6 (so d ≈ √(2·0.4) ≈ 0.89) and cross-cluster correlations around 0 (d ≈ √2 ≈ 1.41). Expected barcode:

  • 21 short bars dying near ε ≈ 0.89 (the 21 within-cluster merges).
  • 2 long bars dying near ε ≈ 1.41 (the two between-cluster merges).
  • 1 immortal bar.

K* = 1 + 2 = 3, matching ground truth. The demo below reproduces this experiment.

Demo — H0 persistence barcode

3 planted clusters of 8 strategies each. Pairwise correlation distance d(i,j)=√(2(1−ρ)). Barcode: when each connected component merges into a larger one.

K (true clusters)3
N per cluster8
Intra-cluster ρ0.60
seed=5
N points
24
planted K
3
median death
0.85
long bars
2
0.000.290.570.861.151.43connected components (sorted)filtration value ε (correlation distance)

Each horizontal bar is one connected component, born at ε=0 and dying when it merges with another component. Amber bars are persistent — they survive past 1.5× the median merge-distance. Their count + 1 (the immortal component) ≈ the number of stable clusters in the population.

Figures

H0 persistence barcode on a stratified BTC strategy sample
Fig. 1H₀ persistence barcode built from the correlation distance d(i, j) = √(2(1 − ρᵢⱼ)) on a 95-strategy stratified sample of the BTC pool (≈16 strategies per family across 6 indicator families). Each segment is the lifetime of a connected component under the Vietoris-Rips filtration; the immortal bar is shown at top with the right-arrow marker. Teal early-merge bars are within-cluster (near-duplicate) merges; the gray bulk are typical merges close to √2 (independence). Under a median+2·MAD long-bar rule no finite bar clears the threshold — K* = 1 — meaning the BTC strategy population is one connected style continuum, not a discrete partition.
Lifetime histogram of H0 persistence bars on BTC strategies
Fig. 2Companion lifetime distribution. The body of the merge distances clusters between 1.10 and 1.35 — short of √2 ≈ 1.41 (independence) but well clear of the within-cluster regime — characteristic of a population of correlated strategies sharing market data without splitting into discrete style clusters. The MAD-based threshold (amber) is unattained: the topological answer is one stable cluster.

Why this matters for systematic strategies

Cluster-count is an under-appreciated regime indicator. In a benign market the strategy population decomposes into several stable clusters of orthogonal styles — long bars are common. In a stressed regime the cross-cluster correlations rise, the within-cluster correlations rise more, and the barcode collapses to one or two long bars: everything trades together. The integer K* derived from the H₀ barcode telegraphs that collapse before it shows up in headline portfolio statistics.

Operationally we run M/03 alongside M/01 (RMT eigenspectrum). The MP signal eigenvalue count gives an upper bound on the number of effective factors; the H₀ persistent-cluster count gives a lower bound on the number of effective styles. The two should track. When they diverge — usually because a single factor is driving everything — that’s a flag.

Reproducibility

DaruFinance / strategy-tda

Python — open source reference implementation

Minimal invocation

import numpy as np
from strategy_tda import h0_barcode_from_returns

# X: N x T strategy returns matrix
bars = h0_barcode_from_returns(X)
# bars is an array of shape (N-1, 2): [birth, death] for each merge
# H0 is born at 0; long bars indicate stable clusters.
n_robust_clusters = sum(d > 1.5 * np.median(bars[:, 1]) for d in bars[:, 1]) + 1

References

  1. [1]Edelsbrunner, H., Letscher, D. & Zomorodian, A. (2002). Topological persistence and simplification. Discrete & Computational Geometry 28, 511–533.
  2. [2]Carlsson, G. (2009). Topology and data. Bulletin of the AMS 46, 255–308.
  3. [3]Gidea, M. & Katz, Y. (2018). Topological data analysis of financial time series: landscapes of crashes. Physica A 491, 820–834.
  4. [4]Mantegna, R. N. (1999). Hierarchical structure in financial markets. European Physical Journal B 11, 193–197.