Subspace identification
Recovering latent dynamics from recorded neural activity.
The inverse problem
You recorded 100 neurons in motor cortex during a reaching task. From Part 12, you know a useful model for what is going on: there is a low-dimensional latent state $x_t \in \mathbb{R}^d$ that evolves according to linear dynamics, and each neuron's firing rate is a linear readout of that state plus noise. Written out:

$$x_{t+1} = A x_t, \qquad y_t = C x_t + v_t$$

Here $A \in \mathbb{R}^{d \times d}$ is the dynamics, $C \in \mathbb{R}^{n \times d}$ is the observation map from latent state to neural activity, and $v_t$ is observation noise. You observe $y_t \in \mathbb{R}^n$, the full vector of 100 firing rates at each time step. You want $A$ and $C$.
The naive approach is to treat the latent states as unknowns and estimate everything jointly. This runs into trouble fast. With $d$ latent dimensions and $n$ neurons, you need a dynamics matrix $A \in \mathbb{R}^{d \times d}$, an observation matrix $C \in \mathbb{R}^{n \times d}$, and $T$ latent state vectors $x_1, \dots, x_T$. The number of unknowns grows with the length of the recording, and the problem is not identified without additional constraints. You could impose structure and iterate (EM does exactly this), but there is a more direct route.
The key idea: don't try to estimate the latent states directly. Instead, exploit the temporal structure of the observations themselves. Consecutive observations are not independent. They are generated by the same latent state evolving through $A$. That shared dynamical structure leaves a signature in the data: a particular pattern of correlations across time lags that can be read off with the SVD.
Time-lagged structure
To see the signature, stack consecutive observations into a single tall vector. Pick a window length $k$ and, starting at time $t$, concatenate $k$ successive observation vectors:

$$z_t = \begin{bmatrix} y_t \\ y_{t+1} \\ \vdots \\ y_{t+k-1} \end{bmatrix}$$

Each $z_t$ lives in $\mathbb{R}^{nk}$, an $nk$-dimensional vector built from a window of $k$ time steps across $n$ neurons. Now substitute the state-space model. The observation at time $t$ is $y_t = C x_t$ (ignoring noise for the moment). The observation at $t+1$ is $y_{t+1} = C A x_t$. At $t+2$, $y_{t+2} = C A^2 x_t$. In general, the observation $j$ steps ahead is $y_{t+j} = C A^j x_t$. So the stacked vector has a compact form:

$$z_t = \mathcal{O}_k \, x_t, \qquad \mathcal{O}_k = \begin{bmatrix} C \\ CA \\ CA^2 \\ \vdots \\ CA^{k-1} \end{bmatrix}$$
The matrix $\mathcal{O}_k$ is called the extended observability matrix. The observability matrix comes from control theory. A system is said to be observable if $\mathcal{O}_k$ has full column rank, meaning the latent state can, in principle, be reconstructed from the observations. For the subspace identification method to work, the system must be observable: every latent dimension must leave some trace in the recorded neurons. Each block row tells you how the latent state at time $t$ maps to the observation at a particular lag: $C$ maps it to the current observation, $CA$ maps it to the next time step, $CA^2$ to two steps ahead, and so on. The dynamics matrix $A$ is baked into the structure of $\mathcal{O}_k$.
Now collect $z_t$ for all valid starting times $t = 1, \dots, T-k+1$ and arrange them as columns of a matrix:

$$H = \begin{bmatrix} z_1 & z_2 & \cdots & z_{T-k+1} \end{bmatrix} = \mathcal{O}_k \begin{bmatrix} x_1 & x_2 & \cdots & x_{T-k+1} \end{bmatrix}$$

This is the Hankel matrix. It has $nk$ rows and $T-k+1$ columns, but its rank is at most $d$, the latent dimensionality. It does not matter how many neurons you recorded or how many time lags you stacked. The rank of $H$ is bounded by the rank of $\mathcal{O}_k$, which is $d$ (assuming observability). The observations are high-dimensional, but the time-lagged structure they contain is low-rank because it is generated by a low-dimensional latent process.
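The construction is short enough to sketch in NumPy. Everything here, including the system size, the rotation dynamics, the random observation map, and the random seed, is an illustrative assumption, not data from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy system: d = 2 latent dimensions, n = 20 "neurons",
# T = 200 time steps, k = 5 stacked lags.
d, n, T, k = 2, 20, 200, 5
theta = 0.1
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # slow planar rotation
C = rng.standard_normal((n, d))                   # random observation map

# Simulate x_{t+1} = A x_t and y_t = C x_t (noiseless for clarity).
X = np.zeros((T, d))
X[0] = [1.0, 0.0]
for t in range(T - 1):
    X[t + 1] = A @ X[t]
Y = X @ C.T                                       # (T, n) observations

# Each column of H stacks k consecutive observations y_t, ..., y_{t+k-1}.
H = np.column_stack([Y[t:t + k].ravel() for t in range(T - k + 1)])

# H has n*k = 100 rows, yet its rank is bounded by d = 2.
print(H.shape, np.linalg.matrix_rank(H))          # prints (100, 196) 2
```

Despite stacking 100-dimensional windows, the numerical rank comes out as 2, the latent dimensionality: the low-rank signature described above.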
The Hankel SVD
The Hankel matrix is large, potentially thousands of rows and columns, but its rank is small. From Part 6, you know what to do with a matrix like that: take its SVD,

$$H = U \Sigma V^\top$$
The singular values drop off. In the noiseless case, exactly $d$ of them are nonzero and the rest are zero. With noise, you get a gap: the first $d$ singular values are large (they correspond to the latent dynamics) and the remaining ones are small (they correspond to noise). The number of significant singular values tells you the latent dimensionality, just as it did for PCA. The difference is that the matrix here is not a covariance matrix but a matrix of time-lagged observations.
The left singular vectors carry the observability subspace. The first $d$ columns of $U$ span the same column space as $\mathcal{O}_k$, the subspace through which the latent state is observed by the neurons. Denote the truncated SVD as $H \approx U_d \Sigma_d V_d^\top$, keeping only the first $d$ components. Then $U_d$ is an $nk \times d$ matrix whose columns are an orthonormal basis for the observability subspace.
The right singular vectors encode the latent state sequence. The columns of $\Sigma_d V_d^\top$ are the $d$-dimensional latent states over time, up to a change of basis. You do not recover the original $x_t$ (as discussed in Part 2, the latent state is only defined up to an invertible linear transformation), but you recover a version of it that preserves all the structure that matters: the subspace it lives in, the dynamics it obeys, and the observations it generates. This is the core of all subspace identification methods: N4SID [1], MOESP [2], and CVA [3]. They differ in how they weight the Hankel matrix before taking the SVD. N4SID uses an oblique projection, MOESP uses an orthogonal one, CVA normalizes by the noise covariance. But the skeleton is the same everywhere: build a Hankel matrix, take its SVD, read off the subspace. The differences in weighting affect statistical efficiency, not the fundamental structure.
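Here is a minimal sketch of the Hankel SVD on simulated noisy data. The system, the noise level of 0.05, and all sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy system, now with observation noise added.
d, n, T, k = 2, 20, 500, 5
theta = 0.1
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
C = rng.standard_normal((n, d))

X = np.zeros((T, d))
X[0] = [1.0, 0.0]
for t in range(T - 1):
    X[t + 1] = A @ X[t]
Y = X @ C.T + 0.05 * rng.standard_normal((T, n))   # noisy observations

# Hankel matrix of stacked windows, then its SVD.
H = np.column_stack([Y[t:t + k].ravel() for t in range(T - k + 1)])
U, s, Vt = np.linalg.svd(H, full_matrices=False)

print(s[:3])                             # two large values, then a sharp drop
U_d = U[:, :d]                           # basis for the observability subspace
latents = (np.diag(s[:d]) @ Vt[:d]).T    # latent trajectory, up to a basis change
```

The first two singular values tower over the third: the gap that marks the boundary between latent dynamics and noise.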
The singular values themselves tell you how much variance each latent mode explains in the time-lagged observations. A mode with a large singular value contributes strongly to the temporal correlations in the data. A mode with a small singular value contributes little. If it is below the noise floor, it is probably not real. This gives you a principled way to choose the latent dimensionality, which we return to in a later section.
Recovering the system
You now have the observability subspace: the first $d$ columns of $U$ from the Hankel SVD. Call this matrix $\hat{\mathcal{O}}$. It is an estimate of the extended observability matrix $\mathcal{O}_k$, and it contains everything you need to extract $A$ and $C$.
Start with $C$. Look at the structure of $\mathcal{O}_k$: its first block row is $C$ itself. So $\hat{C}$ is just the first $n$ rows of $\hat{\mathcal{O}}$. That is the observation matrix, which is the linear map from the latent state to neural activity at a single time step. No optimization required, no iteration. You read it off.
Getting $A$ takes one more step. The key is a shift relation between consecutive block rows of $\mathcal{O}_k$. The second block row is $CA$, the third is $CA^2$, and so on: each block row is the previous one multiplied by $A$. If you define $\hat{\mathcal{O}}_{\uparrow}$ as $\hat{\mathcal{O}}$ with its last block row removed and $\hat{\mathcal{O}}_{\downarrow}$ as $\hat{\mathcal{O}}$ with its first block row removed, then:

$$\hat{\mathcal{O}}_{\uparrow} A = \hat{\mathcal{O}}_{\downarrow}$$

Every row of $\hat{\mathcal{O}}_{\downarrow}$ is the corresponding row of $\hat{\mathcal{O}}_{\uparrow}$ multiplied by $A$. You know both sides of this equation (they come from $U_d$), so you solve for $A$ by least squares. In practice this is a single line of code: $\hat{A} = \hat{\mathcal{O}}_{\uparrow}^{\dagger} \hat{\mathcal{O}}_{\downarrow}$, where the dagger denotes the pseudoinverse.
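The whole extraction fits in a few lines. This sketch assumes a hypothetical noiseless toy system (all sizes and parameters are made up for illustration) and checks the recovered dynamics through its eigenvalues, which are invariant to the basis ambiguity:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical noiseless toy system.
d, n, T, k = 2, 20, 300, 5
theta = 0.1
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
C = rng.standard_normal((n, d))

X = np.zeros((T, d))
X[0] = [1.0, 0.0]
for t in range(T - 1):
    X[t + 1] = A @ X[t]
Y = X @ C.T

H = np.column_stack([Y[t:t + k].ravel() for t in range(T - k + 1)])
U, s, Vt = np.linalg.svd(H, full_matrices=False)
O_hat = U[:, :d]                # estimated extended observability matrix

C_hat = O_hat[:n]               # first block row: the observation matrix
O_up = O_hat[:-n]               # all but the last block row
O_down = O_hat[n:]              # all but the first block row
A_hat = np.linalg.pinv(O_up) @ O_down   # shift relation, solved by least squares

# A_hat matches A only up to a similarity transform, but eigenvalues are
# invariant under similarity: both should be cos(theta) +/- i*sin(theta).
print(np.sort_complex(np.linalg.eigvals(A_hat)))
print(np.sort_complex(np.linalg.eigvals(A)))
```

The two printed spectra agree to numerical precision, even though `A_hat` itself lives in a rotated coordinate system.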
Finally, the latent states themselves. Given $\hat{\mathcal{O}}$, you can recover the latent state at any time step by projecting the stacked observation onto the identified subspace: $\hat{x}_t = \hat{\mathcal{O}}^{\dagger} z_t$. Or you can read the full trajectory from the right singular vectors: $\hat{x}_t$ is the $t$-th row of $V_d$, rescaled by $\Sigma_d$. Either way, you get a $d$-dimensional trajectory that captures the latent dynamics underlying the recorded neural activity.
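A sketch of the projection step, again on a hypothetical noiseless system (every parameter here is an illustrative assumption). Because the estimated observability matrix has orthonormal columns, its pseudoinverse is just its transpose:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical noiseless toy system.
d, n, T, k = 2, 20, 300, 5
theta = 0.1
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
C = rng.standard_normal((n, d))

X = np.zeros((T, d))
X[0] = [1.0, 0.0]
for t in range(T - 1):
    X[t + 1] = A @ X[t]
Y = X @ C.T

H = np.column_stack([Y[t:t + k].ravel() for t in range(T - k + 1)])
U, s, Vt = np.linalg.svd(H, full_matrices=False)
O_hat = U[:, :d]
C_hat = O_hat[:n]

# Project each stacked observation onto the identified subspace.
# pinv(O_hat) equals O_hat.T here, since O_hat has orthonormal columns.
X_hat = (np.linalg.pinv(O_hat) @ H).T   # (T-k+1, d) recovered trajectory

# The recovered latents sit in a rotated basis, but they reproduce the
# observations exactly through C_hat.
Y_rec = X_hat @ C_hat.T
print(np.allclose(Y_rec, Y[:T - k + 1]))   # True
```

The recovered trajectory is a change of basis away from the simulated one, yet mapping it through `C_hat` reproduces the observations, which is exactly the guarantee the identification gives you.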
One important caveat. The recovered coordinates are not unique. If $Q$ is any invertible $d \times d$ matrix, you can define a new latent state $x_t' = Q x_t$, a new dynamics matrix $A' = Q A Q^{-1}$, and a new observation matrix $C' = C Q^{-1}$. This transformed system produces exactly the same observations as the original. The subspace is unique: the column space of $\mathcal{O}_k$ does not depend on $Q$. But the coordinates within it are not. This is the same basis ambiguity you encountered in Part 2: any invertible change of basis gives an equally valid representation. The coordinate ambiguity is why you cannot directly compare latent states across two separately identified systems. If you fit a model to monkey A's motor cortex and another to monkey B's, the two latent spaces may capture similar dynamics but in different coordinate systems. Aligning them requires Procrustes rotation (Part 10) or CCA (Part 9).
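The ambiguity is easy to verify numerically. In this sketch the change of basis $Q$ and all system parameters are arbitrary made-up values; the transformed system emits the same observation sequence up to floating-point error:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical toy system and an arbitrary invertible change of basis Q.
d, n = 2, 20
theta = 0.1
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
C = rng.standard_normal((n, d))

Q = np.array([[2.0, 1.0],
              [0.5, 1.5]])          # any invertible d x d matrix works
Q_inv = np.linalg.inv(Q)
A2 = Q @ A @ Q_inv                  # transformed dynamics
C2 = C @ Q_inv                      # transformed observation map

# Run both systems forward from matched initial states.
x = np.array([1.0, 0.0])
x2 = Q @ x
max_diff = 0.0
for _ in range(100):
    max_diff = max(max_diff, np.abs(C @ x - C2 @ x2).max())
    x, x2 = A @ x, A2 @ x2
print(max_diff)                     # numerically zero
```

Two different parameterizations, one observation sequence: no amount of data can distinguish them, which is why only the subspace, not the coordinates, is identifiable.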
How many latent dimensions
Everything so far has assumed you know $d$, the number of latent dimensions. In practice, you have to choose it. The singular values of the Hankel matrix tell you how.
In the noiseless case, the answer is exact. The Hankel matrix has rank $d$, so exactly $d$ singular values are nonzero and the rest are zero. With noise, this clean picture blurs. The first few singular values are large (they correspond to the true latent modes) and the remaining ones are small but not zero. You are looking for a gap: a point where the singular values drop from "large" to "small." The number of singular values above the gap is your estimate of $d$.
This is the same logic as PCA's scree plot from Part 7, applied to the Hankel matrix instead of the data covariance. Large singular values correspond to directions that carry structured temporal correlations. Small ones carry noise. The gap separates the two.
In practice, the gap is rarely clean. Finite data smooths the transition from signal to noise, and the drop-off is gradual rather than sharp. Choosing $d$ requires judgment. Cross-validation is one option: fit models with different values of $d$, hold out a portion of the data, and pick the $d$ that predicts held-out observations best. Information criteria like AIC or BIC penalize model complexity and provide an alternative. The parallel analysis approach from PCA also works here: compare the singular values to those obtained from shuffled data and keep the components that exceed the shuffled baseline. None of these methods is perfect, but they all point in the same direction: pick the smallest $d$ that captures the temporal structure in the data without fitting the noise. For neural data, the singular value gap is often ambiguous because neural noise is correlated, not white. Correlated noise inflates the small singular values, pushing them closer to the signal range and obscuring the boundary. Methods like GPFA [4] handle this by modeling the noise structure explicitly: they fit a Gaussian process prior over the latent states and a separate noise covariance for the observations.
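Here is a sketch of the parallel analysis idea on simulated data. The shuffle scheme (independently permuting each neuron's time series, which destroys temporal and cross-neuron structure while keeping each neuron's marginal distribution) and all sizes are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical noisy toy system with true latent dimensionality d = 2.
d, n, T, k = 2, 20, 500, 5
theta = 0.1
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
C = rng.standard_normal((n, d))

X = np.zeros((T, d))
X[0] = [1.0, 0.0]
for t in range(T - 1):
    X[t + 1] = A @ X[t]
Y = X @ C.T + 0.05 * rng.standard_normal((T, n))

def hankel(Z):
    """Stack k consecutive rows of Z into each column."""
    return np.column_stack([Z[t:t + k].ravel() for t in range(T - k + 1)])

s = np.linalg.svd(hankel(Y), compute_uv=False)

# Shuffled baseline: permute each neuron's time series independently.
Y_shuf = np.column_stack([rng.permutation(Y[:, i]) for i in range(n)])
s_shuf = np.linalg.svd(hankel(Y_shuf), compute_uv=False)

# Keep every mode whose singular value exceeds the shuffled baseline.
d_est = int((s > s_shuf).sum())
print(d_est)   # recovers the true dimensionality, 2
```

Only modes carrying genuine temporal structure beat the shuffled baseline; everything else is indistinguishable from noise and gets discarded.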
What comes next
Standard subspace identification recovers the latent dynamics that best explain the neural observations. "Best" here means the dynamics that account for the most variance in the time-lagged observation matrix. This is the right criterion if you want a compact description of what the neural population is doing. But it has a blind spot: it privileges directions of high variance in the neural population, the same bias PCA has. If the behaviorally relevant dynamics happen to lie along low-variance directions (directions that move the arm but do not dominate the population firing rates), standard subspace identification will miss them or bury them in the noise.
Preferential subspace identification (PSID) fixes this [5]. Instead of asking "what dynamics explain the most neural variance?", PSID asks "what dynamics are most relevant to behavior?" Its first stage uses CCA between time-lagged neural activity and behavioral variables to find the behaviorally relevant subspace: the latent directions that predict movements, forces, or whatever behavioral signal you recorded. Its second stage recovers the remaining dynamics from the residual neural activity. The result is a latent system cleanly partitioned into two components: one that drives behavior and one that does not.
That partition is the subject of the next post.
References
- Van Overschee, P. and De Moor, B. Subspace Identification for Linear Systems. Kluwer Academic Publishers, 1996.
- Churchland, M. M., Cunningham, J. P., Kaufman, M. T., et al. "Neural population dynamics during reaching," Nature, vol. 487, pp. 51-56, 2012.
- Cunningham, J. P. and Yu, B. M. "Dimensionality reduction for large-scale neural recordings," Nature Neuroscience, vol. 17, pp. 1500-1509, 2014.
- Yu, B. M., Cunningham, J. P., Santhanam, G., et al. "Gaussian-process factor analysis for low-dimensional single-trial analysis of neural population activity," Journal of Neurophysiology, vol. 102, no. 1, pp. 614-635, 2009.
- Sani, O. G., Abbaspourazad, H., Wong, Y. T., et al. "Modeling behaviorally relevant neural dynamics enabled by preferential subspace identification," Nature Neuroscience, vol. 24, pp. 140-149, 2021.
- Pandarinath, C., O'Shea, D. J., Collins, J., et al. "Inferring single-trial neural population dynamics using sequential auto-encoders," Nature Methods, vol. 15, no. 10, pp. 805-815, 2018.