Subspace identification

Recovering latent dynamics from recorded neural activity.

The inverse problem

You recorded 100 neurons in motor cortex during a reaching task. From Part 12, you know a useful model for what is going on: there is a low-dimensional latent state $x_t \in \mathbb{R}^d$ that evolves according to linear dynamics, and each neuron's firing rate is a linear readout of that state plus noise. Written out:

$$x_{t+1} = A x_t \tag{1}$$

$$y_t = C x_t + w_t \tag{2}$$

Here $A$ is $d \times d$ (the dynamics), $C$ is $N \times d$ (the observation map from latent state to neural activity), and $w_t$ is observation noise. You observe $y_t$, the full vector of 100 firing rates at each time step. You want $A$ and $C$.
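To make the setup concrete, here is a minimal NumPy sketch that simulates this model. The dimensions, the rotation dynamics, and the noise level are all illustrative choices, not anything from a real recording:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, T = 2, 100, 500  # latent dim, neurons, time steps (hypothetical sizes)

# A rotation keeps the latent state on a stable oscillation.
theta = 0.1
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
C = rng.standard_normal((N, d))   # random observation map

x = np.zeros((T, d))
x[0] = [1.0, 0.0]
for t in range(T - 1):
    x[t + 1] = A @ x[t]           # x_{t+1} = A x_t

Y = x @ C.T + 0.1 * rng.standard_normal((T, N))  # y_t = C x_t + w_t
print(Y.shape)                    # (500, 100): T time steps of N firing rates
```

Every example below reuses this generative recipe, so the identification steps can be checked against known ground truth.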

The naive approach is to treat the latent states $x_t$ as unknowns and estimate everything jointly. This runs into trouble fast. With $d$ latent dimensions and $N$ neurons, you need a $d \times d$ dynamics matrix $A$, an $N \times d$ observation matrix $C$, and $T$ latent state vectors $x_t$. The number of unknowns grows with the length of the recording, and the problem is not identified without additional constraints. You could impose structure and iterate (EM does exactly this), but there is a more direct route.

The key idea: don't try to estimate the latent states $x_t$ directly. Instead, exploit the temporal structure of the observations themselves. Consecutive observations are not independent. They are generated by the same latent state evolving through $A$. That shared dynamical structure leaves a signature in the data: a particular pattern of correlations across time lags that can be read off with the SVD.

Time-lagged structure

To see the signature, stack consecutive observations into a single tall vector. Pick a window length $p$ and, starting at time $t$, concatenate $p$ successive observation vectors:

$$z_t = \begin{bmatrix} y_t \\ y_{t+1} \\ y_{t+2} \\ \vdots \\ y_{t+p-1} \end{bmatrix} \tag{3}$$

Each $z_t$ lives in $\mathbb{R}^{pN}$: a $pN$-dimensional vector built from one window of $p$ consecutive time steps across $N$ neurons. Now substitute the state-space model. The observation at time $t$ is $y_t = C x_t$ (ignoring noise for the moment). The observation at $t+1$ is $y_{t+1} = C x_{t+1} = C A x_t$. At $t+2$, $y_{t+2} = C A^2 x_t$. In general, the observation $k$ steps ahead is $C A^k x_t$. So the stacked vector has a compact form:

$$z_t = \underbrace{\begin{bmatrix} C \\ CA \\ CA^2 \\ \vdots \\ CA^{p-1} \end{bmatrix}}_{\mathcal{O}} x_t \tag{4}$$

The matrix $\mathcal{O}$ is called the extended observability matrix. The name comes from control theory: a system $(A, C)$ is said to be observable if $\mathcal{O}$ has full column rank, meaning the latent state can, in principle, be reconstructed from the observations. For the subspace identification method to work, the system must be observable: every latent dimension must leave some trace in the recorded neurons. Each block row tells you how the latent state at time $t$ maps to the observation at a particular lag: $C$ maps it to the current observation, $CA$ maps it to the next time step, $CA^2$ to two steps ahead, and so on. The dynamics matrix $A$ is baked into the structure of $\mathcal{O}$.
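The block structure of $\mathcal{O}$ and the identity $z_t = \mathcal{O} x_t$ are easy to verify numerically. A small sketch with arbitrary example dimensions:

```python
import numpy as np

rng = np.random.default_rng(7)
d, N, p = 2, 4, 3            # hypothetical sizes
A = 0.9 * np.array([[np.cos(0.4), -np.sin(0.4)],
                    [np.sin(0.4),  np.cos(0.4)]])
C = rng.standard_normal((N, d))
x_t = rng.standard_normal(d)

# Extended observability matrix: block rows C, CA, CA^2, ..., CA^{p-1}.
O = np.vstack([C @ np.linalg.matrix_power(A, k) for k in range(p)])

# Generate the window of noiseless observations starting at time t.
ys, x = [], x_t.copy()
for _ in range(p):
    ys.append(C @ x)         # y_{t+k} = C A^k x_t
    x = A @ x
z_t = np.concatenate(ys)

print(np.allclose(z_t, O @ x_t))   # True: stacked window equals O x_t
print(O.shape)                     # (12, 2) = (p*N, d)
```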

Now collect $z_t$ for all valid starting times $t = 1, \ldots, T - p + 1$ and arrange them as columns of a matrix:

$$H = \begin{bmatrix} z_1 & z_2 & \cdots & z_{T-p+1} \end{bmatrix} = \mathcal{O} \begin{bmatrix} x_1 & x_2 & \cdots & x_{T-p+1} \end{bmatrix} \tag{5}$$

This is the Hankel matrix. It has $pN$ rows and $T - p + 1$ columns, but its rank is at most $d$, the latent dimensionality. It does not matter how many neurons you recorded or how many time lags you stacked. The rank of $H$ is bounded by the rank of $\mathcal{O}$, which is $d$ (assuming observability). The observations are high-dimensional, but the time-lagged structure they contain is low-rank because it is generated by a low-dimensional latent process.
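Building the Hankel matrix and checking its rank takes a few lines. A sketch on simulated noiseless data (all sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, T, p = 2, 10, 200, 5   # hypothetical sizes

theta = 0.3
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
C = rng.standard_normal((N, d))
x = np.zeros((T, d))
x[0] = [1.0, 0.5]
for t in range(T - 1):
    x[t + 1] = A @ x[t]
Y = x @ C.T                  # noiseless observations, shape (T, N)

# Column j of H stacks the window y_j, ..., y_{j+p-1} into one tall vector.
H = np.column_stack([Y[j:j + p].ravel() for j in range(T - p + 1)])
print(H.shape)                    # (50, 196): pN rows, T-p+1 columns
print(np.linalg.matrix_rank(H))   # 2: bounded by the latent dimensionality
```

Despite the 50 × 196 size, the rank is exactly the latent dimensionality, because every column is $\mathcal{O} x_t$ for some $x_t \in \mathbb{R}^2$.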

The Hankel matrix stacks time-lagged windows of observations: each column is one window of $p$ consecutive time steps, and entries along each anti-diagonal come from the same time point. The rank of this matrix is at most the latent dimensionality.

The Hankel SVD

The Hankel matrix $H$ is large, potentially thousands of rows and columns, but its rank is small. From Part 6, you know what to do with a matrix like that: take its SVD.

$$H = U \Sigma V^\top \tag{6}$$

The singular values $\sigma_1 \geq \sigma_2 \geq \cdots$ drop off. In the noiseless case, exactly $d$ of them are nonzero and the rest are zero. With noise, you get a gap: the first $d$ singular values are large (they correspond to the latent dynamics) and the remaining ones are small (they correspond to noise). The number of significant singular values tells you the latent dimensionality, just as it did for PCA. The difference is that the matrix here is not a covariance matrix but a matrix of time-lagged observations.

The left singular vectors carry the observability subspace. The first $d$ columns of $U$ span the same column space as $\mathcal{O}$, the subspace through which the latent state is observed by the neurons. Denote the truncated SVD as $H \approx U_d \Sigma_d V_d^\top$, keeping only the first $d$ components. Then $U_d$ is a $pN \times d$ matrix whose columns are an orthonormal basis for the observability subspace.

The right singular vectors encode the latent state sequence. The rows of $V_d$, one per time window, trace out the $d$-dimensional latent trajectory, up to a change of basis. You do not recover the original $x_t$ (as discussed in Part 2, the latent state is only defined up to an invertible linear transformation), but you recover a version of it that preserves all the structure that matters: the subspace it lives in, the dynamics it obeys, and the observations it generates. This is the core of all subspace identification methods: N4SID [1], MOESP [2], and CVA [3]. They differ in how they weight the Hankel matrix before taking the SVD. N4SID uses an oblique projection, MOESP uses an orthogonal one, and CVA normalizes by the noise covariance. But the skeleton is the same everywhere: build a Hankel matrix, take its SVD, read off the subspace. The differences in weighting affect statistical efficiency, not the fundamental structure.

The singular values themselves tell you how much variance each latent mode explains in the time-lagged observations. A mode with a large singular value contributes strongly to the temporal correlations in the data. A mode with a small singular value contributes little. If it is below the noise floor, it is probably not real. This gives you a principled way to choose the latent dimensionality, which we return to in a later section.
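On noisy simulated data, the gap in the Hankel singular values is easy to see. A sketch (the noise level and sizes are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
d, N, T, p = 2, 10, 400, 6   # hypothetical sizes

theta = 0.3
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
C = rng.standard_normal((N, d))
x = np.zeros((T, d))
x[0] = [1.0, 0.5]
for t in range(T - 1):
    x[t + 1] = A @ x[t]
Y = x @ C.T + 0.05 * rng.standard_normal((T, N))  # noisy observations

# Column j of H stacks the window y_j, ..., y_{j+p-1}.
H = np.column_stack([Y[j:j + p].ravel() for j in range(T - p + 1)])
U, s, Vt = np.linalg.svd(H, full_matrices=False)

# The first d singular values dominate; the rest sit near the noise floor.
print(s[:4])
print(s[d - 1] / s[d])   # large ratio: the signal/noise gap
```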

The SVD of the Hankel matrix. The number of significant singular values reveals the latent dimensionality. The left singular vectors span the subspace through which the latent state is observed.

Recovering the system

You now have the observability subspace: the first $d$ columns of $U$ from the Hankel SVD. Call this $pN \times d$ matrix $\hat{\mathcal{O}}$. It is an estimate of the extended observability matrix $\mathcal{O}$, and it contains everything you need to extract $A$ and $C$.

Start with $C$. Look at the structure of $\mathcal{O}$: its first block row is $C$ itself. So $C$ is just the first $N$ rows of $\hat{\mathcal{O}}$. That is the observation matrix, the linear map from the latent state to neural activity at a single time step. No optimization required, no iteration. You read it off.

Getting $A$ takes one more step. The key is a shift relation between consecutive block rows of $\mathcal{O}$. The second block row is $CA$, the third is $CA^2$, and so on. If you define $\mathcal{O}_\uparrow$ as $\mathcal{O}$ with its last block row removed and $\mathcal{O}_\downarrow$ as $\mathcal{O}$ with its first block row removed, then:

$$\mathcal{O}_\downarrow = \mathcal{O}_\uparrow \, A \tag{7}$$

Every block row of $\mathcal{O}_\downarrow$ is the corresponding block row of $\mathcal{O}_\uparrow$ multiplied by $A$. You know both sides of this equation (they come from $\hat{\mathcal{O}}$), so you solve for $A$ by least squares. In practice this is a single line of code: $A = \mathcal{O}_\uparrow^{\dagger} \, \mathcal{O}_\downarrow$, where the dagger denotes the pseudoinverse.
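Putting the pieces together, here is a sketch of the full extraction of $A$ and $C$ on simulated data (sizes and noise level illustrative). Since the recovered matrices are only defined up to a change of basis, the check at the end compares eigenvalues of $A$, which are basis-invariant:

```python
import numpy as np

rng = np.random.default_rng(3)
d, N, T, p = 2, 10, 400, 6   # hypothetical sizes
theta = 0.3                  # true rotation angle of the latent dynamics
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
C = rng.standard_normal((N, d))
x = np.zeros((T, d))
x[0] = [1.0, 0.5]
for t in range(T - 1):
    x[t + 1] = A @ x[t]
Y = x @ C.T + 0.02 * rng.standard_normal((T, N))

H = np.column_stack([Y[j:j + p].ravel() for j in range(T - p + 1)])
U, s, Vt = np.linalg.svd(H, full_matrices=False)
O_hat = U[:, :d]                # estimated observability matrix, (p*N, d)

C_hat = O_hat[:N]               # first block row: C, up to a basis change

# Shift relation: O_up @ A = O_down, solved by least squares via the pseudoinverse.
O_up, O_down = O_hat[:-N], O_hat[N:]
A_hat = np.linalg.pinv(O_up) @ O_down

# Eigenvalues are invariant to the basis ambiguity: expect e^{+/- i*theta}.
eigs = np.linalg.eigvals(A_hat)
print(np.abs(np.angle(eigs)))   # both close to 0.3
```

The recovered $\hat{A}$ is not numerically equal to the true $A$, but its eigenvalues (here, a complex pair on the unit circle at angle $\pm\theta$) match, which is exactly what the basis ambiguity predicts.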

Finally, the latent states themselves. Given $C$, you can recover the latent state at any time step by projecting the observation onto the identified subspace: $\hat{x}_t = C^{\dagger} y_t$. Or you can read the full trajectory from the right singular vectors: $\hat{x}_t$ is the $t$-th column of $\Sigma_d V_d^\top$, rescaled appropriately. Either way, you get a $d$-dimensional trajectory that captures the latent dynamics underlying the recorded neural activity.
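A sketch of the projection route, $\hat{x}_t = C^\dagger y_t$, on simulated data (sizes illustrative). The recovered trajectory should match the true one after some invertible linear map, which the last lines fit by least squares:

```python
import numpy as np

rng = np.random.default_rng(4)
d, N, T, p = 2, 10, 400, 6   # hypothetical sizes
theta = 0.3
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
C = rng.standard_normal((N, d))
x = np.zeros((T, d))
x[0] = [1.0, 0.5]
for t in range(T - 1):
    x[t + 1] = A @ x[t]
Y = x @ C.T + 0.02 * rng.standard_normal((T, N))

H = np.column_stack([Y[j:j + p].ravel() for j in range(T - p + 1)])
U, _, _ = np.linalg.svd(H, full_matrices=False)
C_hat = U[:N, :d]                    # estimated observation matrix

X_hat = Y @ np.linalg.pinv(C_hat).T  # x_hat_t = pinv(C_hat) y_t, shape (T, d)

# The recovered trajectory matches the true one up to an invertible map T_b.
T_b, *_ = np.linalg.lstsq(X_hat, x, rcond=None)
rel_err = np.linalg.norm(x - X_hat @ T_b) / np.linalg.norm(x)
print(rel_err)                       # small: same trajectory, different basis
```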

One important caveat. The recovered coordinates are not unique. If $T$ is any $d \times d$ invertible matrix, you can define a new latent state $x' = Tx$, a new dynamics matrix $A' = TAT^{-1}$, and a new observation matrix $C' = CT^{-1}$. This transformed system produces exactly the same observations as the original. The subspace is unique: the column space of $\mathcal{O}$ does not depend on $T$. But the coordinates within it are not. This is the same basis ambiguity you encountered in Part 2: any invertible change of basis gives an equally valid representation. The coordinate ambiguity is why you cannot directly compare latent states across two separately identified systems. If you fit a model to monkey A's motor cortex and another to monkey B's, the two latent spaces may capture similar dynamics but in different coordinate systems. Aligning them requires Procrustes rotation (Part 10) or CCA (Part 9).
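The ambiguity is easy to verify numerically: transform a system by an arbitrary invertible matrix (called `T_b` below to avoid clashing with the recording length $T$) and confirm the observations are unchanged. A minimal sketch with made-up matrices:

```python
import numpy as np

rng = np.random.default_rng(5)
d, N = 2, 5                  # hypothetical sizes
theta = 0.4
A = 0.9 * np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
C = rng.standard_normal((N, d))
x0 = np.array([1.0, -0.5])

T_b = np.array([[2.0, 1.0],
                [0.5, 1.5]])          # any invertible change of basis
T_inv = np.linalg.inv(T_b)
A2 = T_b @ A @ T_inv                  # A' = T A T^{-1}
C2 = C @ T_inv                        # C' = C T^{-1}

# Both systems produce identical noiseless observations at every step.
x, x2 = x0, T_b @ x0                  # x' = T x
for _ in range(20):
    assert np.allclose(C @ x, C2 @ x2)
    x, x2 = A @ x, A2 @ x2
print("identical observations")
```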

True latent trajectory (left) vs recovered (right). At $d = 2$ the recovered spiral matches the true one. At $d = 1$ it collapses. At $d > 2$ extra noisy dimensions appear.

How many latent dimensions

Everything so far has assumed you know $d$, the number of latent dimensions. In practice, you have to choose it. The singular values of the Hankel matrix tell you how.

In the noiseless case, the answer is exact. The Hankel matrix has rank $d$, so exactly $d$ singular values are nonzero and the rest are zero. With noise, this clean picture blurs. The first few singular values are large (they correspond to the true latent modes) and the remaining ones are small but not zero. You are looking for a gap: a point where the singular values drop from "large" to "small." The number of singular values above the gap is your estimate of $d$.

This is the same logic as PCA's scree plot from Part 7, applied to the Hankel matrix instead of the data covariance. Large singular values correspond to directions that carry structured temporal correlations. Small ones carry noise. The gap separates the two.

In practice, the gap is rarely clean. Finite data smooths the transition from signal to noise, and the drop-off is gradual rather than sharp. Choosing $d$ requires judgment. Cross-validation is one option: fit models with different values of $d$, hold out a portion of the data, and pick the $d$ that predicts held-out observations best. Information criteria like AIC or BIC penalize model complexity and provide an alternative. The parallel analysis approach from PCA also works here: compare the singular values to those obtained from shuffled data and keep the components that exceed the shuffled baseline. None of these methods is perfect, but they all point in the same direction: pick the smallest $d$ that captures the temporal structure in the data without fitting the noise. For neural data, the singular value gap is often ambiguous because neural noise is correlated, not white. Correlated noise inflates the small singular values, pushing them closer to the signal range and obscuring the boundary. Methods like GPFA [4] handle this by modeling the noise structure explicitly, fitting a Gaussian process prior over the latent states and a separate noise covariance for the observations.
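A sketch of the parallel-analysis variant on simulated data: shuffle each neuron's time series to destroy temporal structure, rebuild the Hankel matrix, and keep the components whose singular values exceed the shuffled baseline (all sizes and noise levels are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
d, N, T, p = 2, 10, 400, 6   # hypothetical sizes
theta = 0.3
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
C = rng.standard_normal((N, d))
x = np.zeros((T, d))
x[0] = [1.0, 0.5]
for t in range(T - 1):
    x[t + 1] = A @ x[t]
Y = x @ C.T + 0.1 * rng.standard_normal((T, N))

def hankel_svals(Y, p):
    """Singular values of the time-lagged (Hankel) matrix of Y."""
    n_cols = Y.shape[0] - p + 1
    H = np.column_stack([Y[j:j + p].ravel() for j in range(n_cols)])
    return np.linalg.svd(H, compute_uv=False)

s = hankel_svals(Y, p)

# Shuffling each neuron's time series destroys temporal structure but keeps
# the marginal variances, giving a null level for the singular values.
Y_shuf = np.column_stack([rng.permutation(Y[:, i]) for i in range(N)])
s_null = hankel_svals(Y_shuf, p)

d_hat = int(np.sum(s > s_null[0]))   # components above the shuffled baseline
print(d_hat)
```

A single shuffle gives a rough baseline; averaging several shuffles, as in standard parallel analysis, makes the null level more stable.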

Singular values of the Hankel matrix. The first two are large (signal); the rest are small (noise). The threshold between them sets how many dimensions to keep.

What comes next

Standard subspace identification recovers the latent dynamics that best explain the neural observations. "Best" here means the dynamics that account for the most variance in the time-lagged observation matrix. This is the right criterion if you want a compact description of what the neural population is doing. But it has a blind spot: it privileges directions of high variance in the neural population, the same bias PCA has. If the behaviorally relevant dynamics happen to lie along low-variance directions (directions that move the arm but do not dominate the population firing rates), standard subspace identification will miss them or bury them in the noise.

Preferential subspace identification (PSID) fixes this [5]. Instead of asking "what dynamics explain the most neural variance?", PSID asks "what dynamics are most relevant to behavior?" Its first stage uses CCA between time-lagged neural activity and behavioral variables to find the behaviorally relevant subspace: the latent directions that predict movements, forces, or whatever behavioral signal you recorded. Its second stage recovers the remaining dynamics from the residual neural activity. The result is a latent system cleanly partitioned into two components: one that drives behavior and one that does not.

That partition is the subject of the next post.

References

  1. Van Overschee, P. and De Moor, B. Subspace Identification for Linear Systems. Kluwer Academic Publishers, 1996.
  2. Churchland, M. M., Cunningham, J. P., Kaufman, M. T., et al. "Neural population dynamics during reaching," Nature, vol. 487, pp. 51-56, 2012.
  3. Cunningham, J. P. and Yu, B. M. "Dimensionality reduction for large-scale neural recordings," Nature Neuroscience, vol. 17, pp. 1500-1509, 2014.
  4. Yu, B. M., Cunningham, J. P., Santhanam, G., et al. "Gaussian-process factor analysis for low-dimensional single-trial analysis of neural population activity," Journal of Neurophysiology, vol. 102, no. 1, pp. 614-635, 2009.
  5. Sani, O. G., Abbaspourazad, H., Wong, Y. T., et al. "Modeling behaviorally relevant neural dynamics enabled by preferential subspace identification," Nature Neuroscience, vol. 24, pp. 140-149, 2021.
  6. Pandarinath, C., O'Shea, D. J., Collins, J., et al. "Inferring single-trial neural population dynamics using sequential auto-encoders," Nature Methods, vol. 15, no. 10, pp. 805-815, 2018.