M/05 — Strategy Manifold
PCA + UMAP geometry of the strategy population
Embed 100,000 walk-forward strategies from 90-D metric space into 2-D, then ask whether robustness is a contiguous region or a constellation of isolated islands.
The mathematics
Each strategy is a point x ∈ ℝᵈ in a high-dimensional feature space whose components are per-window metrics (Sharpe ratio, profit factor PF, maximum drawdown MaxDD) across the last 6 walk-forward windows, evaluated under Daru Finance’s proprietary perturbation suite. With N = 100,000 strategies and d = 90 we have an N × d matrix X. We ask: in this space, do robust strategies cluster, and if so, does the cluster form a single connected region or many isolated islands?
Principal-component projection
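Centre the columns of X to obtain X_c and take the singular value decomposition:
$$X_c = U \Sigma V^\top, \qquad T = X_c V_2$$
where V_2 holds the first two columns of V, so that T is the linear projection onto the leading two principal directions. On the production corpus PC1 explains 16.6% of the variance and PC2 explains 10.0%, so two coordinates capture about a quarter (26.6%) of the variance in 90 dimensions. That is well above the 2/90 ≈ 2.2% an isotropic cloud would give, although most of the variance still lies outside the plane.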
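The projection above can be sketched directly in NumPy. This is a minimal illustration, not the production code: the helper name `pca_project` and the random 500 × 90 stand-in matrix are assumptions made for the example.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Centre X, take a thin SVD, and project onto the leading directions."""
    Xc = X - X.mean(axis=0)                      # centre each column
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = Xc @ Vt[:n_components].T                 # N x n_components scores
    explained = S**2 / np.sum(S**2)              # variance share per component
    return T, explained[:n_components]

# Synthetic stand-in for the real feature matrix (assumption: 500 x 90).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 90))
T_demo, ratio_demo = pca_project(X_demo)
print(T_demo.shape, ratio_demo.sum())
```

On isotropic noise like this stand-in, the two leading components explain only a small share of the variance, which is the reference point for reading the 26.6% figure.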
UMAP as a non-linear lens
PCA preserves global geometry but ignores neighbourhoods. UMAP (McInnes, Healy & Melville 2018) fits a fuzzy simplicial set μ in input space (each k-NN edge weighted by a local Riemannian metric) and a corresponding set ν in 2-D, then minimises their cross-entropy
$$C(\mu, \nu) = \sum_{e} \left[ \mu(e) \log \frac{\mu(e)}{\nu(e)} + (1 - \mu(e)) \log \frac{1 - \mu(e)}{1 - \nu(e)} \right]$$
where the sum runs over candidate edges e. Under the assumption that the data are uniformly distributed on a Riemannian manifold, this preserves local topology while tearing global geometry. The result is a layout that surfaces tight neighbourhoods that PCA flattens.
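To make the objective concrete, here is a small sketch that evaluates the fuzzy-set cross-entropy for given edge weights. It only scores a layout; it does not reproduce UMAP's optimiser, which in practice uses stochastic gradient descent with negative sampling. The function name and the toy weights are assumptions for illustration.

```python
import math

def fuzzy_cross_entropy(mu, nu, eps=1e-12):
    """Sum over edges of mu*log(mu/nu) + (1-mu)*log((1-mu)/(1-nu))."""
    total = 0.0
    for m, n in zip(mu, nu):
        n = min(max(n, eps), 1.0 - eps)          # keep the logarithms finite
        if m > 0.0:
            total += m * math.log(m / n)         # attractive term
        if m < 1.0:
            total += (1.0 - m) * math.log((1.0 - m) / (1.0 - n))  # repulsive term
    return total

mu_demo = [0.9, 0.5, 0.1]                        # toy input-space edge weights
print(fuzzy_cross_entropy(mu_demo, mu_demo))          # layout matches input
print(fuzzy_cross_entropy(mu_demo, [0.1, 0.5, 0.9]))  # layout inverts the weights
```

The first term penalises pulling apart points that are neighbours in input space; the second penalises placing non-neighbours close together.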
Connectivity and modularity
Build the k-NN graph (k = 15) on the 2-D embedding. Let A be its adjacency matrix, k_i the degree of node i, m the edge count, and c_i a label in {robust, fragile}. Newman’s modularity
$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$$
measures how much edge density inside each label exceeds the configuration-model null. We use a cheap proxy Q̃ = fraction of edges whose endpoints share a label. If labels were assigned independently of the graph, the chance baseline would be r̄² + (1−r̄)², where r̄ is the robust fraction; with r̄ = 0.0687 that gives 0.872. The empirical value on the production embedding is Q̃ ≈ 0.903, a real but small lift of about 0.031 over chance, consistent with weak clustering.
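The three quantities in this section can be sketched in plain Python on an edge list. The function names and the toy graph (two triangles joined by one edge) are assumptions for illustration; the graph is treated as undirected and simple.

```python
def modularity(edges, labels):
    """Newman's Q for an undirected simple graph given as (i, j) pairs."""
    m = len(edges)
    degree = {}
    for i, j in edges:
        degree[i] = degree.get(i, 0) + 1
        degree[j] = degree.get(j, 0) + 1
    # Observed fraction of edges that stay inside one label.
    within = sum(1 for i, j in edges if labels[i] == labels[j]) / m
    # Expected fraction under the configuration model: sum_c (D_c / 2m)^2.
    deg_by_label = {}
    for node, k in degree.items():
        deg_by_label[labels[node]] = deg_by_label.get(labels[node], 0) + k
    expected = sum((D / (2 * m)) ** 2 for D in deg_by_label.values())
    return within - expected

def same_label_fraction(edges, labels):
    """Proxy Q~: share of edges whose endpoints carry the same label."""
    return sum(1 for i, j in edges if labels[i] == labels[j]) / len(edges)

def chance_baseline(r_bar):
    """Probability that two independently labelled endpoints agree."""
    return r_bar ** 2 + (1.0 - r_bar) ** 2

# Toy graph: two triangles joined by a single bridge edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
toy_labels = {0: "robust", 1: "robust", 2: "robust",
              3: "fragile", 4: "fragile", 5: "fragile"}
print(modularity(toy_edges, toy_labels), same_label_fraction(toy_edges, toy_labels))
print(round(chance_baseline(0.0687), 3))
```

The proxy and Q share the same observed term; they differ only in the null that is subtracted, which is why the proxy has to be read against the chance baseline rather than against zero.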
Worked example
- 10 deepest-WFO assets, 100,000-strategy stratified subsample, 6-window metrics feature.
- Robustness rate r̄ = 6.87% → 6,869 robust vs 93,131 fragile.
- UMAP modularity proxy 0.903 (vs 0.872 baseline); PCA proxy 0.911.
- Number of connected components in the robust-only k-NN subgraph: 1,229 for UMAP and 938 for PCA. Average robust strategies per island: ≈ 5.6 under UMAP (6,869 / 1,229) and ≈ 7.3 under PCA (6,869 / 938).
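The arithmetic behind the list above can be checked in a few lines. The numbers are taken from this page; the variable names are only for the example.

```python
n_robust = 6_869
components = {"UMAP": 1_229, "PCA": 938}
proxy = {"UMAP": 0.903, "PCA": 0.911}
baseline = 0.872                                  # r̄² + (1 − r̄)² at r̄ = 0.0687

summary = {}
for name, n_comp in components.items():
    # (average island size, lift of the proxy over the chance baseline)
    summary[name] = (n_robust / n_comp, proxy[name] - baseline)
    avg_size, lift = summary[name]
    print(f"{name}: {avg_size:.1f} robust strategies per island, lift {lift:+.3f}")
```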
The interactive demo below recomputes the connectivity statistic every time you move the τ slider — drag it down and the robust set merges; drag it up and it shatters.
Demo — synthetic strategy manifold
N points sampled from one diffuse fragile cloud + K tight robust islands. Sweep the robustness threshold τ; watch how the robust subset partitions.
Amber: r ≥ τ (robust). Grey: r < τ (fragile). Connectivity is computed on an 8-NN graph over the amber subset using union-find. Lift over baseline = +0.027. With the production corpus (N=100,000, real metrics) the analogous numbers are 6,869 robust points, ~1,229 components, Q̃ ≈ 0.903 (lift ≈ +0.031 over the 0.872 baseline).
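The demo's connectivity count can be approximated offline with a brute-force k-NN graph and union-find. This is a sketch under stated assumptions, not the demo's source: the synthetic population, the island centres, and the way the robustness score r is drawn are all hypothetical.

```python
import random

def find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def count_components(points, k=8):
    """Components of the symmetrised k-NN graph over `points` (brute force)."""
    n = len(points)
    parent = list(range(n))
    for i, (xi, yi) in enumerate(points):
        # Squared distances to every other point, nearest first.
        dists = sorted(
            ((xi - xj) ** 2 + (yi - yj) ** 2, j)
            for j, (xj, yj) in enumerate(points) if j != i
        )
        for _, j in dists[:k]:
            ri, rj = find(parent, i), find(parent, j)
            if ri != rj:
                parent[ri] = rj                   # merge the two components
    return len({find(parent, i) for i in range(n)})

rng = random.Random(0)
# Hypothetical synthetic population: (x, y, robustness score r).
pop = [(rng.gauss(0, 3), rng.gauss(0, 3), rng.uniform(0.0, 0.5)) for _ in range(300)]
for cx, cy in [(-6, -6), (6, -6), (0, 7)]:        # three tight islands
    pop += [(rng.gauss(cx, 0.2), rng.gauss(cy, 0.2), rng.uniform(0.5, 1.0))
            for _ in range(20)]

for tau in (0.6, 0.75, 0.9):
    amber = [(x, y) for x, y, r in pop if r >= tau]
    print(tau, len(amber), count_components(amber))
```

The brute-force neighbour search is quadratic in the subset size, which is fine for a few hundred points but would need a spatial index at production scale. Note that with a fixed k, a very small amber subset is forced to link across islands, so the count depends on both τ and k.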
Why this matters for systematic strategies
Many search procedures over strategy space (gradient ascent on a smoothed score, evolutionary crossover, Bayesian optimisation) implicitly assume the robust region is locally convex: that small perturbations of a robust strategy stay robust. The connectivity analysis argues against that assumption. In the embedded k-NN graph, the robust population is not a single connected manifold at any scale we’ve checked; it is a constellation of islands of roughly 5 to 8 strategies each, separated by fragile gaps.
Operationally, two consequences. First, edge cannot be reliably reached by perturbing a known good strategy — the neighbours of a robust strategy are fragile with high probability. Second, the robust subset must be enumerated combinatorially over the indicator/transform/confluence grid, not recovered by local search. The pipeline that downstream models consume already respects this: candidates are generated combinatorially and only then filtered, never optimised toward.
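The enumerate-then-filter pattern described above can be sketched as follows. Every name here is hypothetical: the grid axes are placeholders, and `toy_is_robust` merely stands in for the proprietary robustness funnel, which is not described on this page.

```python
from itertools import product

# Hypothetical grid axes; the real indicator/transform/confluence sets are not given.
indicators = ["rsi", "macd", "atr"]
transforms = ["raw", "zscore"]
confluences = [1, 2]

def enumerate_candidates():
    """Yield every grid cell exactly once, with no ordering by score."""
    for ind, tr, conf in product(indicators, transforms, confluences):
        yield {"indicator": ind, "transform": tr, "confluence": conf}

def filter_robust(candidates, is_robust):
    """Keep candidates that pass the check; candidates are never modified."""
    return [c for c in candidates if is_robust(c)]

# Placeholder predicate standing in for the proprietary robustness funnel.
def toy_is_robust(c):
    return c["transform"] == "zscore" and c["confluence"] == 2

candidates = list(enumerate_candidates())
robust = filter_robust(candidates, toy_is_robust)
print(len(candidates), len(robust))
```

The point of the structure is that the filter's verdict never feeds back into candidate generation, so there is no local search step that could rely on robust neighbours.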
Reproducibility
DaruFinance / strategy-manifold
Python — open source reference implementation
Minimal invocation
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.decomposition import PCA
from sklearn.neighbors import kneighbors_graph
import umap  # provided by the umap-learn package

# X: N x d feature matrix (rows = strategies, cols = per-window metrics).
# r: length-N {0,1} vector; 1 = passed the proprietary robustness funnel.
# Both must be loaded before this point; coerce r so boolean masks work.
r = np.asarray(r)
N, d = X.shape

# 1) Linear baseline (scikit-learn's PCA centres the columns itself).
pca = PCA(n_components=2).fit(X)
T_pca = pca.transform(X)
print("PC1 var:", pca.explained_variance_ratio_[0])
print("PC2 var:", pca.explained_variance_ratio_[1])

# 2) Non-linear embedding.
emb = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=0).fit_transform(X)

# 3) Connectivity of the robust subset under the 15-NN graph.
A = kneighbors_graph(emb, n_neighbors=15, mode="connectivity")
mask = r == 1
A_robust = A[mask][:, mask]  # keep only robust-to-robust edges
n_components, _ = connected_components(A_robust, directed=False)
print("robust components:", n_components)
References
- [1] Pearson, K. (1901). On Lines and Planes of Closest Fit to Systems of Points in Space. Philosophical Magazine 2(11), 559–572.
- [2] McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv:1802.03426.
- [3] Newman, M. E. J. (2006). Modularity and community structure in networks. PNAS 103(23), 8577–8582.