Subspace identification

Recovering latent dynamics from recorded neural activity.

The inverse problem

You recorded 100 neurons in motor cortex during a reaching task. From Part 12, you know a useful model for what is going on: there is a low-dimensional latent state $x_t \in \mathbb{R}^d$ that evolves according to linear dynamics, and each neuron's firing rate is a linear readout of that state plus noise. Written out:

$$x_{t+1} = A x_t \tag{1}$$

$$y_t = C x_t + w_t \tag{2}$$

Here $A$ is $d \times d$ (the dynamics), $C$ is $N \times d$ (the observation map from latent state to neural activity), and $w_t$ is observation noise. You observe $y_t$, the full vector of 100 firing rates at each time step. You want $A$ and $C$.
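To make the setup concrete, here is a minimal NumPy sketch that simulates this model. The dimensions, the rotation dynamics, and the noise level are all illustrative choices, not anything from a real recording:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, T = 2, 100, 500  # latent dim, neurons, time steps (hypothetical sizes)

# A rotation keeps the latent state on a stable oscillation.
theta = 0.1
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
C = rng.standard_normal((N, d))   # random observation map

x = np.zeros((T, d))
x[0] = [1.0, 0.0]
for t in range(T - 1):
    x[t + 1] = A @ x[t]           # x_{t+1} = A x_t

Y = x @ C.T + 0.1 * rng.standard_normal((T, N))  # y_t = C x_t + w_t
print(Y.shape)                    # (500, 100): T time steps of N firing rates
```

Every example below reuses this generative recipe, so the identification steps can be checked against known ground truth.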

The naive approach is to treat the latent states $x_t$ as unknowns and estimate everything jointly. This runs into trouble fast. With $d$ latent dimensions and $N$ neurons, you need a $d \times d$ dynamics matrix $A$, an $N \times d$ observation matrix $C$, and $T$ latent state vectors $x_t$. The number of unknowns grows with the length of the recording, and the problem is not identified without additional constraints. You could impose structure and iterate (EM does exactly this), but there is a more direct route.

The key idea: don't try to estimate the latent states $x_t$ directly. Instead, exploit the temporal structure of the observations themselves. Consecutive observations are not independent. They are generated by the same latent state evolving through $A$. That shared dynamical structure leaves a signature in the data: a particular pattern of correlations across time lags that can be read off with the SVD.

Time-lagged structure

To see the signature, stack consecutive observations into a single tall vector. Pick a window length $p$ and, starting at time $t$, concatenate $p$ successive observation vectors:

$$z_t = \begin{bmatrix} y_t \\ y_{t+1} \\ y_{t+2} \\ \vdots \\ y_{t+p-1} \end{bmatrix} \tag{3}$$

Each $z_t$ lives in $\mathbb{R}^{pN}$: a $pN$-dimensional vector built from one window of $p$ consecutive time steps across $N$ neurons. Now substitute the state-space model. The observation at time $t$ is $y_t = C x_t$ (ignoring noise for the moment). The observation at $t+1$ is $y_{t+1} = C x_{t+1} = C A x_t$. At $t+2$, $y_{t+2} = C A^2 x_t$. In general, the observation $k$ steps ahead is $C A^k x_t$. So the stacked vector has a compact form:

$$z_t = \underbrace{\begin{bmatrix} C \\ CA \\ CA^2 \\ \vdots \\ CA^{p-1} \end{bmatrix}}_{\mathcal{O}} x_t \tag{4}$$

The matrix $\mathcal{O}$ is called the extended observability matrix. The name comes from control theory: a system $(A, C)$ is said to be observable if $\mathcal{O}$ has full column rank, meaning the latent state can, in principle, be reconstructed from the observations. For the subspace identification method to work, the system must be observable: every latent dimension must leave some trace in the recorded neurons. Each block row tells you how the latent state at time $t$ maps to the observation at a particular lag: $C$ maps it to the current observation, $CA$ maps it to the next time step, $CA^2$ to two steps ahead, and so on. The dynamics matrix $A$ is baked into the structure of $\mathcal{O}$.
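The block structure of $\mathcal{O}$ and the identity $z_t = \mathcal{O} x_t$ are easy to verify numerically. A small sketch with arbitrary example dimensions:

```python
import numpy as np

rng = np.random.default_rng(7)
d, N, p = 2, 4, 3            # hypothetical sizes
A = 0.9 * np.array([[np.cos(0.4), -np.sin(0.4)],
                    [np.sin(0.4),  np.cos(0.4)]])
C = rng.standard_normal((N, d))
x_t = rng.standard_normal(d)

# Extended observability matrix: block rows C, CA, CA^2, ..., CA^{p-1}.
O = np.vstack([C @ np.linalg.matrix_power(A, k) for k in range(p)])

# Generate the window of noiseless observations starting at time t.
ys, x = [], x_t.copy()
for _ in range(p):
    ys.append(C @ x)         # y_{t+k} = C A^k x_t
    x = A @ x
z_t = np.concatenate(ys)

print(np.allclose(z_t, O @ x_t))   # True: stacked window equals O x_t
print(O.shape)                     # (12, 2) = (p*N, d)
```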

Now collect $z_t$ for all valid starting times $t = 1, \ldots, T - p + 1$ and arrange them as columns of a matrix:

$$H = \begin{bmatrix} z_1 & z_2 & \cdots & z_{T-p+1} \end{bmatrix} = \mathcal{O} \begin{bmatrix} x_1 & x_2 & \cdots & x_{T-p+1} \end{bmatrix} \tag{5}$$

This is the Hankel matrix. It has $pN$ rows and $T - p + 1$ columns, but its rank is at most $d$, the latent dimensionality. It does not matter how many neurons you recorded or how many time lags you stacked. The rank of $H$ is bounded by the rank of $\mathcal{O}$, which is $d$ (assuming observability). The observations are high-dimensional, but the time-lagged structure they contain is low-rank because it is generated by a low-dimensional latent process.
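Building the Hankel matrix and checking its rank takes a few lines. A sketch on simulated noiseless data (all sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, T, p = 2, 10, 200, 5   # hypothetical sizes

theta = 0.3
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
C = rng.standard_normal((N, d))
x = np.zeros((T, d))
x[0] = [1.0, 0.5]
for t in range(T - 1):
    x[t + 1] = A @ x[t]
Y = x @ C.T                  # noiseless observations, shape (T, N)

# Column j of H stacks the window y_j, ..., y_{j+p-1} into one tall vector.
H = np.column_stack([Y[j:j + p].ravel() for j in range(T - p + 1)])
print(H.shape)                    # (50, 196): pN rows, T-p+1 columns
print(np.linalg.matrix_rank(H))   # 2: bounded by the latent dimensionality
```

Despite the 50 × 196 size, the rank is exactly the latent dimensionality, because every column is $\mathcal{O} x_t$ for some $x_t \in \mathbb{R}^2$.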

The Hankel matrix stacks time-lagged windows of observations: each column is one window of $p$ consecutive time steps, and entries along each anti-diagonal come from the same time point. The rank of this matrix is at most the latent dimensionality.

The Hankel SVD

The Hankel matrix $H$ is large, potentially thousands of rows and columns, but its rank is small. From Part 6, you know what to do with a matrix like that: take its SVD.

$$H = U \Sigma V^\top \tag{6}$$

The singular values $\sigma_1 \geq \sigma_2 \geq \cdots$ drop off. In the noiseless case, exactly $d$ of them are nonzero and the rest are zero. With noise, you get a gap: the first $d$ singular values are large (they correspond to the latent dynamics) and the remaining ones are small (they correspond to noise). The number of significant singular values tells you the latent dimensionality, just as it did for PCA. The difference is that the matrix here is not a covariance matrix but a matrix of time-lagged observations.

The left singular vectors carry the observability subspace. The first $d$ columns of $U$ span the same column space as $\mathcal{O}$, the subspace through which the latent state is observed by the neurons. Denote the truncated SVD as $H \approx U_d \Sigma_d V_d^\top$, keeping only the first $d$ components. Then $U_d$ is a $pN \times d$ matrix whose columns are an orthonormal basis for the observability subspace.

The right singular vectors encode the latent state sequence. The rows of $V_d$, one per time window, trace out the $d$-dimensional latent trajectory, up to a change of basis. You do not recover the original $x_t$ (as discussed in Part 2, the latent state is only defined up to an invertible linear transformation), but you recover a version of it that preserves all the structure that matters: the subspace it lives in, the dynamics it obeys, and the observations it generates. This is the core of all subspace identification methods: N4SID [1], MOESP [2], and CVA [3]. They differ in how they weight the Hankel matrix before taking the SVD. N4SID uses an oblique projection, MOESP uses an orthogonal one, and CVA normalizes by the noise covariance. But the skeleton is the same everywhere: build a Hankel matrix, take its SVD, read off the subspace. The differences in weighting affect statistical efficiency, not the fundamental structure.

The singular values themselves tell you how much variance each latent mode explains in the time-lagged observations. A mode with a large singular value contributes strongly to the temporal correlations in the data. A mode with a small singular value contributes little. If it is below the noise floor, it is probably not real. This gives you a principled way to choose the latent dimensionality, which we return to in a later section.
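On noisy simulated data, the gap in the Hankel singular values is easy to see. A sketch (the noise level and sizes are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
d, N, T, p = 2, 10, 400, 6   # hypothetical sizes

theta = 0.3
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
C = rng.standard_normal((N, d))
x = np.zeros((T, d))
x[0] = [1.0, 0.5]
for t in range(T - 1):
    x[t + 1] = A @ x[t]
Y = x @ C.T + 0.05 * rng.standard_normal((T, N))  # noisy observations

# Column j of H stacks the window y_j, ..., y_{j+p-1}.
H = np.column_stack([Y[j:j + p].ravel() for j in range(T - p + 1)])
U, s, Vt = np.linalg.svd(H, full_matrices=False)

# The first d singular values dominate; the rest sit near the noise floor.
print(s[:4])
print(s[d - 1] / s[d])   # large ratio: the signal/noise gap
```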

The SVD of the Hankel matrix. The number of significant singular values reveals the latent dimensionality. The left singular vectors span the subspace through which the latent state is observed.

Recovering the system

You now have the observability subspace: the first $d$ columns of $U$ from the Hankel SVD. Call this $pN \times d$ matrix $\hat{\mathcal{O}}$. It is an estimate of the extended observability matrix $\mathcal{O}$, and it contains everything you need to extract $A$ and $C$.

Start with $C$. Look at the structure of $\mathcal{O}$: its first block row is $C$ itself. So $C$ is just the first $N$ rows of $\hat{\mathcal{O}}$. That is the observation matrix, the linear map from the latent state to neural activity at a single time step. No optimization required, no iteration. You read it off.

Getting $A$ takes one more step. The key is a shift relation between consecutive block rows of $\mathcal{O}$. The second block row is $CA$, the third is $CA^2$, and so on. If you define $\mathcal{O}_\uparrow$ as $\mathcal{O}$ with its last block row removed and $\mathcal{O}_\downarrow$ as $\mathcal{O}$ with its first block row removed, then:

$$\mathcal{O}_\downarrow = \mathcal{O}_\uparrow \, A \tag{7}$$

Every block row of $\mathcal{O}_\downarrow$ is the corresponding block row of $\mathcal{O}_\uparrow$ multiplied by $A$. You know both sides of this equation (they come from $\hat{\mathcal{O}}$), so you solve for $A$ by least squares. In practice this is a single line of code: $A = \mathcal{O}_\uparrow^{\dagger} \, \mathcal{O}_\downarrow$, where the dagger denotes the pseudoinverse.
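Putting the pieces together, here is a sketch of the full extraction of $A$ and $C$ on simulated data (sizes and noise level illustrative). Since the recovered matrices are only defined up to a change of basis, the check at the end compares eigenvalues of $A$, which are basis-invariant:

```python
import numpy as np

rng = np.random.default_rng(3)
d, N, T, p = 2, 10, 400, 6   # hypothetical sizes
theta = 0.3                  # true rotation angle of the latent dynamics
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
C = rng.standard_normal((N, d))
x = np.zeros((T, d))
x[0] = [1.0, 0.5]
for t in range(T - 1):
    x[t + 1] = A @ x[t]
Y = x @ C.T + 0.02 * rng.standard_normal((T, N))

H = np.column_stack([Y[j:j + p].ravel() for j in range(T - p + 1)])
U, s, Vt = np.linalg.svd(H, full_matrices=False)
O_hat = U[:, :d]                # estimated observability matrix, (p*N, d)

C_hat = O_hat[:N]               # first block row: C, up to a basis change

# Shift relation: O_up @ A = O_down, solved by least squares via the pseudoinverse.
O_up, O_down = O_hat[:-N], O_hat[N:]
A_hat = np.linalg.pinv(O_up) @ O_down

# Eigenvalues are invariant to the basis ambiguity: expect e^{+/- i*theta}.
eigs = np.linalg.eigvals(A_hat)
print(np.abs(np.angle(eigs)))   # both close to 0.3
```

The recovered $\hat{A}$ is not numerically equal to the true $A$, but its eigenvalues (here, a complex pair on the unit circle at angle $\pm\theta$) match, which is exactly what the basis ambiguity predicts.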

Finally, the latent states themselves. Given $C$, you can recover the latent state at any time step by projecting the observation onto the identified subspace: $\hat{x}_t = C^{\dagger} y_t$. Or you can read the full trajectory from the right singular vectors: $\hat{x}_t$ is the $t$-th column of $\Sigma_d V_d^\top$, rescaled appropriately. Either way, you get a $d$-dimensional trajectory that captures the latent dynamics underlying the recorded neural activity.
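A sketch of the projection route, $\hat{x}_t = C^\dagger y_t$, on simulated data (sizes illustrative). The recovered trajectory should match the true one after some invertible linear map, which the last lines fit by least squares:

```python
import numpy as np

rng = np.random.default_rng(4)
d, N, T, p = 2, 10, 400, 6   # hypothetical sizes
theta = 0.3
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
C = rng.standard_normal((N, d))
x = np.zeros((T, d))
x[0] = [1.0, 0.5]
for t in range(T - 1):
    x[t + 1] = A @ x[t]
Y = x @ C.T + 0.02 * rng.standard_normal((T, N))

H = np.column_stack([Y[j:j + p].ravel() for j in range(T - p + 1)])
U, _, _ = np.linalg.svd(H, full_matrices=False)
C_hat = U[:N, :d]                    # estimated observation matrix

X_hat = Y @ np.linalg.pinv(C_hat).T  # x_hat_t = pinv(C_hat) y_t, shape (T, d)

# The recovered trajectory matches the true one up to an invertible map T_b.
T_b, *_ = np.linalg.lstsq(X_hat, x, rcond=None)
rel_err = np.linalg.norm(x - X_hat @ T_b) / np.linalg.norm(x)
print(rel_err)                       # small: same trajectory, different basis
```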

One important caveat. The recovered coordinates are not unique. If $T$ is any $d \times d$ invertible matrix, you can define a new latent state $x' = Tx$, a new dynamics matrix $A' = TAT^{-1}$, and a new observation matrix $C' = CT^{-1}$. This transformed system produces exactly the same observations as the original. The subspace is unique: the column space of $\mathcal{O}$ does not depend on $T$. But the coordinates within it are not. This is the same basis ambiguity you encountered in Part 2: any invertible change of basis gives an equally valid representation. The coordinate ambiguity is why you cannot directly compare latent states across two separately identified systems. If you fit a model to monkey A's motor cortex and another to monkey B's, the two latent spaces may capture similar dynamics but in different coordinate systems. Aligning them requires Procrustes rotation (Part 10) or CCA (Part 9).
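The ambiguity is easy to verify numerically: transform a system by an arbitrary invertible matrix (called `T_b` below to avoid clashing with the recording length $T$) and confirm the observations are unchanged. A minimal sketch with made-up matrices:

```python
import numpy as np

rng = np.random.default_rng(5)
d, N = 2, 5                  # hypothetical sizes
theta = 0.4
A = 0.9 * np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
C = rng.standard_normal((N, d))
x0 = np.array([1.0, -0.5])

T_b = np.array([[2.0, 1.0],
                [0.5, 1.5]])          # any invertible change of basis
T_inv = np.linalg.inv(T_b)
A2 = T_b @ A @ T_inv                  # A' = T A T^{-1}
C2 = C @ T_inv                        # C' = C T^{-1}

# Both systems produce identical noiseless observations at every step.
x, x2 = x0, T_b @ x0                  # x' = T x
for _ in range(20):
    assert np.allclose(C @ x, C2 @ x2)
    x, x2 = A @ x, A2 @ x2
print("identical observations")
```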

True latent trajectory (left) vs recovered (right). At $d = 2$ the recovered spiral matches the true one. At $d = 1$ it collapses. At $d > 2$ extra noisy dimensions appear.

How many latent dimensions

Everything so far has assumed you know $d$, the number of latent dimensions. In practice, you have to choose it. The singular values of the Hankel matrix tell you how.

In the noiseless case, the answer is exact. The Hankel matrix has rank $d$, so exactly $d$ singular values are nonzero and the rest are zero. With noise, this clean picture blurs. The first few singular values are large (they correspond to the true latent modes) and the remaining ones are small but not zero. You are looking for a gap: a point where the singular values drop from "large" to "small." The number of singular values above the gap is your estimate of $d$.

This is the same logic as PCA's scree plot from Part 7, applied to the Hankel matrix instead of the data covariance. Large singular values correspond to directions that carry structured temporal correlations. Small ones carry noise. The gap separates the two.

In practice, the gap is rarely clean. Finite data smooths the transition from signal to noise, and the drop-off is gradual rather than sharp. Choosing $d$ requires judgment. Cross-validation is one option: fit models with different values of $d$, hold out a portion of the data, and pick the $d$ that predicts held-out observations best. Information criteria like AIC or BIC penalize model complexity and provide an alternative. The parallel analysis approach from PCA also works here: compare the singular values to those obtained from shuffled data and keep the components that exceed the shuffled baseline. None of these methods is perfect, but they all point in the same direction: pick the smallest $d$ that captures the temporal structure in the data without fitting the noise. For neural data, the singular value gap is often ambiguous because neural noise is correlated, not white. Correlated noise inflates the small singular values, pushing them closer to the signal range and obscuring the boundary. Methods like GPFA [4] handle this by modeling the noise structure explicitly, fitting a Gaussian process prior over the latent states and a separate noise covariance for the observations.
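A sketch of the parallel-analysis variant on simulated data: shuffle each neuron's time series to destroy temporal structure, rebuild the Hankel matrix, and keep the components whose singular values exceed the shuffled baseline (all sizes and noise levels are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
d, N, T, p = 2, 10, 400, 6   # hypothetical sizes
theta = 0.3
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
C = rng.standard_normal((N, d))
x = np.zeros((T, d))
x[0] = [1.0, 0.5]
for t in range(T - 1):
    x[t + 1] = A @ x[t]
Y = x @ C.T + 0.1 * rng.standard_normal((T, N))

def hankel_svals(Y, p):
    """Singular values of the time-lagged (Hankel) matrix of Y."""
    n_cols = Y.shape[0] - p + 1
    H = np.column_stack([Y[j:j + p].ravel() for j in range(n_cols)])
    return np.linalg.svd(H, compute_uv=False)

s = hankel_svals(Y, p)

# Shuffling each neuron's time series destroys temporal structure but keeps
# the marginal variances, giving a null level for the singular values.
Y_shuf = np.column_stack([rng.permutation(Y[:, i]) for i in range(N)])
s_null = hankel_svals(Y_shuf, p)

d_hat = int(np.sum(s > s_null[0]))   # components above the shuffled baseline
print(d_hat)
```

A single shuffle gives a rough baseline; averaging several shuffles, as in standard parallel analysis, makes the null level more stable.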

Singular values of the Hankel matrix. The first two are large (signal); the rest are small (noise). The threshold between them sets how many dimensions to keep.

What comes next

Standard subspace identification recovers the latent dynamics that best explain the neural observations. "Best" here means the dynamics that account for the most variance in the time-lagged observation matrix. This is the right criterion if you want a compact description of what the neural population is doing. But it has a blind spot: it privileges directions of high variance in the neural population, the same bias PCA has. If the behaviorally relevant dynamics happen to lie along low-variance directions (directions that move the arm but do not dominate the population firing rates), standard subspace identification will miss them or bury them in the noise.

Preferential subspace identification (PSID) fixes this [5]. Instead of asking "what dynamics explain the most neural variance?", PSID asks "what dynamics are most relevant to behavior?" Its first stage uses CCA between time-lagged neural activity and behavioral variables to find the behaviorally relevant subspace: the latent directions that predict movements, forces, or whatever behavioral signal you recorded. Its second stage recovers the remaining dynamics from the residual neural activity. The result is a latent system cleanly partitioned into two components: one that drives behavior and one that does not.

That partition is the subject of the next post.

References

  1. Van Overschee, P. and De Moor, B. Subspace Identification for Linear Systems. Kluwer Academic Publishers, 1996.
  2. Churchland, M. M., Cunningham, J. P., Kaufman, M. T., et al. "Neural population dynamics during reaching," Nature, vol. 487, pp. 51-56, 2012.
  3. Cunningham, J. P. and Yu, B. M. "Dimensionality reduction for large-scale neural recordings," Nature Neuroscience, vol. 17, pp. 1500-1509, 2014.
  4. Yu, B. M., Cunningham, J. P., Santhanam, G., et al. "Gaussian-process factor analysis for low-dimensional single-trial analysis of neural population activity," Journal of Neurophysiology, vol. 102, no. 1, pp. 614-635, 2009.
  5. Sani, O. G., Abbaspourazad, H., Wong, Y. T., et al. "Modeling behaviorally relevant neural dynamics enabled by preferential subspace identification," Nature Neuroscience, vol. 24, pp. 140-149, 2021.
  6. Pandarinath, C., O'Shea, D. J., Collins, J., et al. "Inferring single-trial neural population dynamics using sequential auto-encoders," Nature Methods, vol. 15, no. 10, pp. 805-815, 2018.