Preferential Subspace Identification
Recovering latent dynamics from neural recordings and separating behaviorally relevant from irrelevant structure, derived from scratch with interactive figures.
Introduction
In the previous post, CCA gave us a way to find shared linear structure between two datasets. We found weight vectors that maximized the correlation between projections of paired observations, and used these to align neural recordings across subjects. But CCA treats each row of data as an independent observation. Neural data has temporal structure: the activity at time $t$ depends on the activity at time $t-1$, and this dependence is what we mean by dynamics. CCA ignores it entirely.
This post picks up where CCA left off. We will build up the mathematics of linear dynamical systems, derive a classical method for recovering latent dynamics from observations (subspace identification), and then extend it to preferentially extract the dynamics that relate to behavior. The result is Preferential Subspace Identification, or PSID [1], a method that cleanly partitions latent neural dynamics into a behaviorally relevant subspace and an irrelevant one.

PSID was introduced by Sani et al. [1] as a way to dissociate behaviorally relevant neural dynamics from the full latent state. The key idea: use CCA in a subspace identification framework to preferentially extract the dynamics that predict behavior.
Everything is derived from scratch. I assume you are comfortable with linear algebra basics (matrix multiplication, eigenvalues, SVD) but nothing about state-space models, system identification, or dynamics. If you have read the CCA post, you will recognize some of the machinery (particularly the whitening-then-SVD step), but it is not required.
State-space models
A linear state-space model describes a system evolving through time in two equations. The state equation says that the latent state evolves linearly:

$$x_{t+1} = A x_t + w_t,$$

where $A$ is the state transition matrix and $w_t$ is process noise. The matrix $A$ encodes the dynamics: its eigenvalues determine whether the system oscillates, decays, or drifts, and how fast.

Linear state-space models date to Kalman's foundational work [2] on optimal filtering. The framework is general enough to describe everything from electrical circuits to population dynamics, and it remains the default starting point for modeling temporal neural data.
The observation equation says that we never see the latent state directly. Instead, we observe a noisy linear function of it:

$$y_t = C x_t + v_t,$$

where $C$ maps the $n$-dimensional latent state to the $m$ observed variables (e.g., neurons), and $v_t$ is observation noise. In a neural recording, $y_t$ might be the firing rates of 100 neurons at time $t$, each a noisy weighted sum of a handful of latent dimensions.
Together, the two equations define the generative model: a low-dimensional state evolves according to $A$, and we observe it through the lens of $C$. The inverse problem is: given only the observed $y_t$, recover $A$, $C$, and the latent trajectory $x_t$.
The figure above shows a synthetic system with three latent dimensions and six observed neurons. Two latent dimensions form a slow rotation (the circular trajectory in the x₁–x₂ projection), while the third is an independent, faster oscillation. All six neurons are noisy linear combinations of these three latent variables. As you increase the observation noise, the neural traces become harder to distinguish from each other, but the underlying latent structure is still there, waiting to be extracted.
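The generative model is easy to simulate directly. The sketch below loosely mirrors the post's synthetic system (three latent dimensions, six neurons); the rotation frequency, the sign-flipping stand-in for the faster oscillation, and all noise scales are made-up values for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy system loosely matching the post's example: 3 latent dims, 6 neurons.
# Frequencies, damping, and noise scales are made up for illustration.
theta = 2 * np.pi / 50                     # slow rotation
A = np.zeros((3, 3))
A[:2, :2] = 0.995 * np.array([[np.cos(theta), -np.sin(theta)],
                              [np.sin(theta),  np.cos(theta)]])
A[2, 2] = -0.9                             # fast sign-flipping mode, a 1-D stand-in
C = rng.standard_normal((6, 3))            # observation matrix: neurons mix latents

T = 500
x = np.zeros((T, 3))
x[0] = rng.standard_normal(3)
for t in range(T - 1):
    x[t + 1] = A @ x[t] + 0.05 * rng.standard_normal(3)   # state equation
y = x @ C.T + 0.1 * rng.standard_normal((T, 6))           # observation equation
```

Every neuron in `y` is a noisy weighted sum of the three latent dimensions, which is exactly the setting the inverse problem has to undo.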
Subspace identification
The key insight behind subspace identification is that future observations carry information about the current latent state, and this information has a specific structure that we can exploit. Unrolling the state equation forward from time $t$ gives:

$$x_{t+k} = A^k x_t + \sum_{j=0}^{k-1} A^{k-1-j} w_{t+j}.$$

Substituting into the observation equation, the observation at time $t+k$ is:

$$y_{t+k} = C A^k x_t + C \sum_{j=0}^{k-1} A^{k-1-j} w_{t+j} + v_{t+k}.$$

If we stack several future observations into a vector $y_t^+ = [y_t^\top, y_{t+1}^\top, \ldots, y_{t+k-1}^\top]^\top$, the signal part factors as:

$$y_t^+ = \Gamma_k x_t + \text{(noise terms)},$$

where $\Gamma_k$ is the extended observability matrix:

$$\Gamma_k = \begin{bmatrix} C \\ CA \\ \vdots \\ CA^{k-1} \end{bmatrix}.$$
This matrix has $km$ rows and $n$ columns. Its column space is the observability subspace: the set of directions in observation space that the latent state can influence. The rank of $\Gamma_k$ equals the latent dimensionality $n$, as long as the system is observable and $k$ is large enough.

A system is observable if different initial states produce different output sequences. For linear systems, this is equivalent to $\Gamma_k$ having full column rank. When this fails (say two latent dimensions project identically onto all neurons), those dimensions become indistinguishable from observations alone.
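The rank property is quick to check numerically. The helper below builds $\Gamma_k$ by repeated multiplication; the 2-D system it is applied to is a made-up example.

```python
import numpy as np

def extended_observability(C, A, k):
    """Stack C, CA, ..., CA^{k-1} into the extended observability matrix."""
    blocks = [C]
    for _ in range(k - 1):
        blocks.append(blocks[-1] @ A)
    return np.vstack(blocks)

# Toy system: n = 2 latent dims, m = 5 neurons (values are illustrative)
A = np.array([[0.9, -0.2],
              [0.2,  0.9]])
C = np.random.default_rng(1).standard_normal((5, 2))

Gamma = extended_observability(C, A, k=4)
print(Gamma.shape)                    # (20, 2): k*m rows, n columns
print(np.linalg.matrix_rank(Gamma))   # 2: full column rank, so the system is observable
```

If two rows of a latent rotation projected identically onto every neuron, the rank would drop below $n$ and those dimensions would be unrecoverable, which is exactly the failure mode described above.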
Now stack these future observation vectors for every time point into columns of a matrix. The result is the block Hankel matrix. Each column contains $k$ consecutive observation vectors starting from a different time step. Columns shift forward by one time step, giving the matrix its characteristic diagonal-constant structure:

$$H = \begin{bmatrix} y_1 & y_2 & \cdots & y_N \\ y_2 & y_3 & \cdots & y_{N+1} \\ \vdots & \vdots & & \vdots \\ y_k & y_{k+1} & \cdots & y_{N+k-1} \end{bmatrix}.$$
Each entry $y_t$ is itself an $m$-dimensional vector (one value per neuron), so each column of $H$ is a $km$-dimensional vector. The word "block" refers to this: $H$ is a matrix of vectors, not scalars. Notice the structure: moving one step down a column shifts the time index by one, and moving one step right across a row also shifts it by one. This means that every anti-diagonal contains the same observation vector, the defining property of a Hankel matrix.
Using the factorization $y_t^+ = \Gamma_k x_t + \text{noise}$, the signal part of each column is $\Gamma_k x_t$ for some state $x_t$. This means the signal part of $H$ has rank at most $n$, so all columns live in the $n$-dimensional column space of $\Gamma_k$, regardless of how many time points we have.
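The low-rank claim can be verified directly: with noiseless observations from a 2-D latent system (made-up values below), the block Hankel matrix has rank exactly 2 no matter how many columns it has.

```python
import numpy as np

rng = np.random.default_rng(2)
A = np.array([[0.95, -0.3],
              [0.3,  0.95]])            # n = 2: a slightly damped rotation
C = rng.standard_normal((4, 2))         # m = 4 neurons
T, k = 300, 5

x = np.zeros((T, 2))
x[0] = [1.0, 0.0]
for t in range(T - 1):
    x[t + 1] = A @ x[t]                 # noiseless dynamics
y = x @ C.T                             # noiseless observations

# Column j stacks y_j, ..., y_{j+k-1}
N = T - k + 1
H = np.column_stack([y[j:j + k].ravel() for j in range(N)])
print(H.shape)                          # (20, 296): km rows, N columns
print(np.linalg.matrix_rank(H))         # 2: the latent dimensionality, despite 296 columns
```

With observation noise added, the rank becomes numerically full, but the first two singular values dominate, which is the gap the next step exploits.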
Taking the SVD of the block Hankel matrix reveals the latent dimensionality as a gap in the singular value spectrum. The first $n$ singular values correspond to the signal; the rest correspond to noise. The left singular vectors span the observability subspace, giving us $C$ (from the first block of rows) and $A$ (from the shift structure). The right singular vectors, scaled by the singular values, give the latent state sequence.

This procedure is the essence of N4SID and related algorithms [3]. The name "subspace identification" comes from the fact that we identify the system by finding the column subspace of the observability matrix. In practice, an oblique projection step removes the effect of past inputs before taking the SVD; the algebra is similar to the whitening step in CCA.
With the observability subspace in hand, extracting the system matrices is straightforward. Write the SVD as $H = U S V^\top$, keeping only the first $n$ components (the ones above the gap). The left singular vectors estimate the column space of the observability matrix: $\hat{\Gamma}_k = U_n S_n^{1/2}$. The observation matrix $\hat{C}$ is the first $m$ rows of $\hat{\Gamma}_k$. The latent state sequence is recovered by multiplying the pseudoinverse of this observability estimate by the original Hankel columns: $\hat{X} = \hat{\Gamma}_k^{\dagger} H$. Finally, the state transition matrix $\hat{A}$ is estimated by least-squares regression: find the matrix that best maps each recovered state $\hat{x}_t$ to the next state $\hat{x}_{t+1}$.
The figure above shows the punchline of standard subspace identification: the latent states are recovered accurately (up to an invertible linear transformation), but the individual recovered dimensions have no particular meaning. The first recovered dimension might be a mixture of the behaviorally relevant rotation and the irrelevant oscillation. There is nothing in the procedure that distinguishes them.
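The extraction steps above fit in a few lines. On low-noise synthetic data (made-up system below), the eigenvalue moduli of the estimated $\hat{A}$ recover those of the true $A$, even though the recovered states come back in arbitrary coordinates.

```python
import numpy as np

rng = np.random.default_rng(3)
A_true = np.array([[0.95, -0.3],
                   [0.3,  0.95]])       # illustrative 2-D rotation-like dynamics
C_true = rng.standard_normal((4, 2))
T, k, m, n = 400, 5, 4, 2

x = np.zeros((T, n))
x[0] = [1.0, 0.0]
for t in range(T - 1):
    x[t + 1] = A_true @ x[t] + 0.01 * rng.standard_normal(n)
y = x @ C_true.T + 0.01 * rng.standard_normal((T, m))

# Build the block Hankel matrix and truncate its SVD at n components
N = T - k + 1
H = np.column_stack([y[j:j + k].ravel() for j in range(N)])
U, s, Vt = np.linalg.svd(H, full_matrices=False)
Gamma_hat = U[:, :n] * np.sqrt(s[:n])            # estimated observability matrix
C_hat = Gamma_hat[:m]                            # first block of rows
X_hat = np.linalg.pinv(Gamma_hat) @ H            # states, up to a linear transform
A_hat = X_hat[:, 1:] @ np.linalg.pinv(X_hat[:, :-1])  # least-squares transition

# Eigenvalue moduli are invariant to the unknown similarity transform,
# so they should roughly match the true ones.
print(np.sort(np.abs(np.linalg.eigvals(A_hat))))
print(np.sort(np.abs(np.linalg.eigvals(A_true))))
```

The similarity-transform ambiguity is exactly the problem the next section raises: the recovered coordinates are accurate but meaningless.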
Adding behavior
So far we have been recovering the full latent state without asking what any of it means. But in practice, we care about a specific aspect of the latent dynamics: the part that drives behavior. The figure below illustrates the problem. The brain's full latent state space contains signals that are encoded in our neural recordings and signals that are not. Of the encoded signals, only a subset relates to the behavior we measure. Different methods see different parts of this picture.
Now suppose we also have a behavioral measurement $z_t$ that is a linear function of the latent state:

$$z_t = C_z x_t + \epsilon_t,$$

where $C_z$ picks out the behaviorally relevant latent dimensions and $\epsilon_t$ is noise. In our synthetic system, the third entry of $C_z$ is zero, giving a scalar behavior ($p = 1$) that depends on the first two latent dimensions (the slow rotation) and ignores the third (the fast oscillation).
Standard subspace identification does not know about $z_t$. It recovers the full latent state, but in an arbitrary coordinate system. The behavioral information ends up smeared across all recovered dimensions.
This is the central limitation: if you want to know which part of the neural dynamics drives behavior, standard subspace identification cannot tell you. The behavioral signal is there, but it is mixed with irrelevant dynamics. Can we modify the procedure to extract the behaviorally relevant dynamics first?
Preferential subspace identification
PSID answers this question with a two-stage procedure [1]. The idea is to use CCA (the same whitening-and-SVD machinery from the previous post) to find the part of the neural state space that predicts future behavior, and then recover the remaining dynamics from the residual.
Stage 1: behaviorally relevant subspace. Build two block Hankel matrices: one from past neural observations $y_t$ and one from future behavioral measurements $z_t$. Run a procedure analogous to CCA: an oblique projection of future behavior onto past neural activity. The top directions from the neural side span the subspace of past neural activity that is maximally predictive of future behavior. These are the behaviorally relevant directions.

Unlike symmetric CCA (which finds mutual directions in both spaces), this projection is asymmetric: we care about directions in neural space that predict behavior, not the reverse. The whitening-then-SVD machinery is the same as in the CCA post, but the past-neural-to-future-behavior cross-covariance replaces the static cross-covariance between the two datasets.
Project the neural data onto these canonical directions to get a low-dimensional state estimate for the relevant subspace. Extract the transition matrix $A_1$ and observation matrix $C_1$ by the same least-squares procedure as before.
Stage 2: remaining dynamics. Project out the behaviorally relevant component from the observations: $y_t^{\text{resid}} = y_t - C_1 \hat{x}_t^{(1)}$. Run standard subspace identification on the residual to capture the remaining latent dynamics. This gives $A_2$ and $C_2$.
The combined model has a block lower-triangular transition matrix and a concatenated observation matrix:

$$A = \begin{bmatrix} A_1 & 0 \\ A_{21} & A_2 \end{bmatrix}, \qquad C = \begin{bmatrix} C_1 & C_2 \end{bmatrix}.$$
The zero in the upper-right block is the key constraint: the relevant subspace evolves independently of the irrelevant one (governed by $A_1$ alone), but the irrelevant subspace can be driven by the relevant state through $A_{21}$. The $x^{(1)}$ parts talk to behavior; the $x^{(2)}$ parts don't. The latent state is cleanly partitioned.
The difference is immediate. Standard subspace identification gives three dimensions with behavioral information mixed across all of them. PSID gives two dimensions that are strongly correlated with behavior and one that is not. The partition matches the ground truth of the synthetic system: two relevant dimensions (the slow rotation) and one irrelevant dimension (the fast oscillation).
When PSID misleads
PSID requires you to specify the relevant dimensionality: how many latent dimensions should be assigned to the behaviorally relevant subspace. If you get this wrong, the results can be misleading.
Over-specifying the relevant dimensionality. If you tell PSID to extract more relevant dimensions than truly exist, the extra dimensions will fit noise or capture dynamics that are actually irrelevant to behavior. In the figure below, the true relevant dimensionality is 2. Setting it to 3 or higher forces PSID to designate additional dimensions as "relevant," but their correlation with the behavioral variable is low.
No real behavioral relevance. If the behavioral variable is actually independent of the neural activity, PSID will still extract "relevant" dimensions. It has no way to know that the CCA directions are fitting noise rather than signal. Cross-validation is the standard defense: fit PSID on a training set, predict behavior on a held-out test set, and check whether the prediction exceeds what you would expect by chance.
In practice, the right approach is to sweep the relevant dimensionality from 0 upward and monitor the cross-validated prediction accuracy. The point where the curve plateaus tells you how many latent dimensions truly contribute to behavior.
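A minimal sketch of that sweep, using a static reduced-rank stand-in for PSID's stage 1 on made-up data (a real analysis would cross-validate the full PSID fit, not this simplification):

```python
import numpy as np

rng = np.random.default_rng(4)

# Made-up stand-in: behavior is driven by 2 of 5 latent dims.
T = 1000
X = rng.standard_normal((T, 5))
Y = X @ rng.standard_normal((5, 20)) + 0.1 * rng.standard_normal((T, 20))       # neural
Z = X[:, :2] @ rng.standard_normal((2, 3)) + 0.1 * rng.standard_normal((T, 3))  # behavior

train, test = np.arange(700), np.arange(700, 1000)

def heldout_r2(d):
    """Held-out R^2 when only d predictive neural directions are kept."""
    Yc, Zc = Y - Y[train].mean(0), Z - Z[train].mean(0)
    # Whiten the neural data, then SVD the cross-covariance with behavior:
    # a static analogue of PSID's stage 1.
    Cyy = Yc[train].T @ Yc[train] / len(train)
    evals, evecs = np.linalg.eigh(Cyy)
    Wht = evecs @ np.diag(evals ** -0.5) @ evecs.T
    U, _, _ = np.linalg.svd(Wht @ Yc[train].T @ Zc[train], full_matrices=False)
    W = Wht @ U[:, :d]                       # d predictive neural directions
    beta, *_ = np.linalg.lstsq(Yc[train] @ W, Zc[train], rcond=None)
    resid = Zc[test] - (Yc[test] @ W) @ beta
    return 1 - resid.var() / Zc[test].var()

scores = {d: heldout_r2(d) for d in (1, 2, 3)}
# Expect a jump from d=1 to d=2, then a plateau: the true relevant dim is 2.
print(scores)
```

The plateau after `d=2` is the signature to look for: extra "relevant" dimensions stop improving held-out prediction once the true behaviorally relevant dimensionality is covered.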
Extensions
PSID assumes a linear autonomous system: no external inputs, linear dynamics, linear observations. Three subsequent papers relax these assumptions while preserving the dissociative structure that separates behaviorally relevant from irrelevant dynamics.
IPSID: measured inputs. In many experiments, subjects receive sensory stimuli or perturbations under the experimenter's control. Standard PSID folds these input-driven responses into the latent state, conflating intrinsic dynamics with stimulus-evoked activity. IPSID [6] adds measured inputs $u_t$ to the state equation:

$$x_{t+1} = A x_t + B u_t + w_t.$$
This dissociates intrinsic dynamics (governed by $A$) from input-driven responses (governed by $B$), letting you ask which part of neural activity reflects the brain's own dynamics versus what is driven by external events.
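A toy simulation makes the dissociation concrete. The $A$ and $B$ below are illustrative values, not from any fitted model: between pulses the state follows its intrinsic dynamics, and at each pulse the input kicks the first latent dimension.

```python
import numpy as np

rng = np.random.default_rng(5)

# Sketch of the input-driven state equation x_{t+1} = A x_t + B u_t + w_t.
A = np.array([[0.9, -0.1],
              [0.1,  0.9]])              # intrinsic dynamics (toy values)
B = np.array([[1.0],
              [0.0]])                    # the input drives the first latent dim
T = 200
u = (rng.random((T, 1)) < 0.05).astype(float)   # sparse stimulus pulses
x = np.zeros((T, 2))
for t in range(T - 1):
    # intrinsic term (A x_t) plus input-driven term (B u_t) plus process noise
    x[t + 1] = A @ x[t] + B @ u[t] + 0.02 * rng.standard_normal(2)
```

Without modeling $u_t$, a PSID fit would have to absorb the pulse responses into the latent state, which is the conflation IPSID removes.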
DPAD: nonlinear dynamics. PSID's linear machinery cannot capture thresholding, gating, or other nonlinear interactions in neural circuits. DPAD [8] replaces the linear state-space model with a multi-section recurrent neural network, trained to prioritize behaviorally relevant dynamics in the first sections and capture the remainder in the rest. This lets the model learn nonlinear latent dynamics without sacrificing the relevant/irrelevant partition.
Optimal smoothing. PSID uses past neural activity to predict future behavior, a purely causal (filtering) direction. Sani & Shanechi [9] extend this by incorporating concurrent and future neural observations through optimal smoothing. The hierarchy is clean: prediction uses only past neural data, filtering adds concurrent data, and smoothing adds future data as well. Each step reduces estimation error in the latent states, which matters for offline analyses where the full recording is available.

The progression from prediction to filtering to smoothing mirrors the classical Kalman filter literature [2]: the Kalman filter is a causal estimator, while the Rauch–Tung–Striebel smoother refines those estimates by running backward through the data.
Assumptions and limitations
What PSID gives you. A latent dynamical system with a clean partition into behaviorally relevant and irrelevant subspaces, identified directly from data. The system matrices ($A$, $C$) are estimated without iterative optimization, and the partition comes from the structure of the CCA solution.
Linearity. Both the dynamics and the observation model are assumed linear. If the true dynamics involve nonlinear interactions (thresholding, saturation, gating), the linear model will capture only the best linear approximation, which may miss important structure. This is the same limitation as linear CCA, just extended to the temporal domain.
Stationarity. The system matrices $A$ and $C$ are assumed constant over time. If the neural code changes over the course of a session (as it does during learning), PSID will average over these changes rather than tracking them.
Linear behavioral readout. The behavioral variable is assumed to be a linear function of the latent state. If the relationship is nonlinear (for example, behavior depends on the phase of a neural oscillation rather than its amplitude), PSID's "relevant" subspace will not capture the full picture.
Two-stage is greedy. PSID extracts the relevant subspace first, then the irrelevant subspace from the residual. This greedy approach is not jointly optimal: errors in the first stage propagate to the second.
Where this leads. The extensions above address some of these limitations directly: DPAD relaxes linearity, IPSID handles exogenous inputs, and smoothing improves offline estimates. Beyond the PSID family, LFADS [5] learns nonlinear latent dynamics through a sequential autoencoder, and switching linear dynamical systems can relax stationarity by allowing the matrices to change over time.
PSID and its neighbors
| Method | Key matrix | Dynamics? | Behavior-aware? | Relationship to PSID |
|---|---|---|---|---|
| PCA | Covariance $\Sigma_{yy}$ | No | No | SVD of one covariance instead of a cross-covariance |
| CCA | Whitened cross-covariance | No | Yes | Static version of PSID's stage 1 |
| N4SID | Block Hankel of $y$ | Yes | No | PSID without the preferential step |
| PSID | Past-$y$ / future-$z$ Hankel | Yes | Yes | (this post) |
| IPSID | As PSID, plus inputs $u$ | Yes | Yes | PSID + measured inputs; dissociates intrinsic from input-driven |
| DPAD | (RNN) | Yes (nonlinear) | Yes | Nonlinear PSID; RNNs replace linear state-space |
| Shared-AE | (autoencoder) | No | Yes | Nonlinear shared/private split via CS-divergence; extends to 3+ modalities |
| LFADS | (seq. autoencoder) | Yes | Optional | Nonlinear, requires training a neural network |
CCA [4] gave us static shared directions between paired datasets. PSID adds two things: temporal structure (dynamics) and a principled partition between behaviorally relevant and irrelevant latent dimensions. The connection between the two methods is direct: PSID's first stage is CCA applied to time-lagged neural and behavioral data, using the same whitening-and-SVD pipeline.

In the CCA post, both datasets were neural recordings from different subjects, and the goal was to find shared directions between two populations. Here, one dataset is neural and the other is behavioral. The math is identical (whiten, then SVD the cross-covariance), but the interpretation changes: instead of "shared across subjects," we get "predictive of behavior."
The original PSID paper [1] demonstrated the method on intracortical recordings from monkeys performing reaching tasks, showing that behaviorally relevant dynamics occupied a low-dimensional subspace that standard methods like N4SID missed. Subsequent work extended the framework in three directions: IPSID [6] adds measured inputs to dissociate intrinsic from stimulus-driven dynamics, DPAD [8] replaces the linear model with recurrent neural networks, and optimal smoothing [9] leverages future neural data for offline estimation. All methods come with a publicly available MATLAB/Python toolbox [7].
A different line of work drops the dynamical model entirely. Shared-AE [10] uses dual autoencoders with Cauchy-Schwarz divergence regularization to split each modality's latent space into shared and private components. Because the encoders are nonlinear, Shared-AE can capture richer neural-behavioral relationships than PSID's linear model, and it generalizes naturally to more than two modalities. The trade-off is that it gives up the temporal structure and closed-form identifiability that make PSID's latent dynamics directly interpretable.
The next post in this series will move beyond pairwise comparisons to aligning multiple datasets simultaneously, the multi-population setting that motivates most of my PhD work.
Implementation
The full PSID algorithm is compact enough to fit in about 50 lines of NumPy. Here is a self-contained implementation that takes observations $Y$, behavioral measurements $Z$, the relevant dimensionality, and the total dimensionality, and returns the system matrices $A$, $C$, the behavioral readout $C_z$, and the recovered latent states $\hat{X}$.
```python
import numpy as np
from scipy.linalg import svd, sqrtm, pinv


def build_hankel(data, num_lags):
    """Stack time-lagged snapshots into a block Hankel matrix."""
    T, d = data.shape
    N = T - 2 * num_lags + 1
    past = np.zeros((num_lags * d, N))
    future = np.zeros((num_lags * d, N))
    for lag in range(num_lags):
        past[lag*d:(lag+1)*d] = data[lag:lag+N].T
        future[lag*d:(lag+1)*d] = data[num_lags+lag:num_lags+lag+N].T
    return past, future


def psid(Y, Z, d_rel, d_total, num_lags):
    """
    Preferential Subspace Identification.

    Parameters
    ----------
    Y : (T, m) neural observations
    Z : (T, p) behavioral measurements
    d_rel : number of behaviorally relevant latent dimensions
    d_total : total latent dimensionality
    num_lags : number of time lags for Hankel matrices

    Returns
    -------
    A, C, Cz, Xhat : system matrices, behavioral readout, and states
    """
    # Center
    Y = Y - Y.mean(0)
    Z = Z - Z.mean(0)
    m = Y.shape[1]

    # Stage 1: CCA between past-neural and future-behavior
    Y_past, Y_future = build_hankel(Y, num_lags)
    Z_past, Z_future = build_hankel(Z, num_lags)
    Cyy = Y_past @ Y_past.T / Y_past.shape[1]
    Czz = Z_future @ Z_future.T / Z_future.shape[1]
    Cyz = Y_past @ Z_future.T / Y_past.shape[1]

    # Whiten, then SVD of cross-covariance
    Cyy_inv_sqrt = np.real(pinv(sqrtm(Cyy)))
    Czz_inv_sqrt = np.real(pinv(sqrtm(Czz)))
    M = Cyy_inv_sqrt @ Cyz @ Czz_inv_sqrt
    U, s, Vt = svd(M, full_matrices=False)

    # Top d_rel canonical directions (in neural Hankel space)
    W_rel = Cyy_inv_sqrt @ U[:, :d_rel]

    # Recover relevant states from past observations
    Xhat_rel = W_rel.T @ Y_past  # (d_rel, N)

    # Observability matrix estimate and single-lag C1
    O_rel = W_rel[:m, :]  # first block = C1
    C1 = O_rel

    # Fit A1 by least squares: x_{t+1} = A1 @ x_t
    A1 = Xhat_rel[:, 1:] @ pinv(Xhat_rel[:, :-1])

    # Behavioral readout: regress Z onto relevant states
    N = Y_past.shape[1]
    Z_aligned = Z[num_lags:num_lags+N]  # (N, p)
    Cz = Z_aligned.T @ pinv(Xhat_rel)  # (p, d_rel)

    # Stage 2: residual subspace identification
    d_irr = d_total - d_rel
    if d_irr > 0:
        # Project out relevant component, then re-embed as Hankel
        Y_single = Y[num_lags:num_lags+N].T  # (m, N)
        Y_resid_ts = (Y_single - C1 @ Xhat_rel).T  # (N, m)
        Yr_past, _ = build_hankel(Y_resid_ts, num_lags)
        U2, s2, V2t = svd(Yr_past, full_matrices=False)
        Xhat_irr = np.diag(s2[:d_irr]) @ V2t[:d_irr]  # (d_irr, N2)
        C2 = U2[:m, :d_irr]
        A2 = Xhat_irr[:, 1:] @ pinv(Xhat_irr[:, :-1])
        # Trim relevant states to match residual length
        N2 = Xhat_irr.shape[1]
        Xhat_rel = Xhat_rel[:, :N2]
    else:
        N2 = Xhat_rel.shape[1]
        Xhat_irr = np.zeros((0, N2))
        C2 = np.zeros((m, 0))
        A2 = np.zeros((0, 0))

    # Combine: block lower-triangular A
    d1, d2 = d_rel, d_irr
    A = np.zeros((d1 + d2, d1 + d2))
    A[:d1, :d1] = A1
    if d2 > 0:
        A[d1:, d1:] = A2
        # A21: regression of irrelevant next-state on relevant state
        A[d1:, :d1] = Xhat_irr[:, 1:] @ pinv(Xhat_rel[:, :-1])
    C = np.hstack([C1, C2])
    Xhat = np.vstack([Xhat_rel, Xhat_irr]).T  # (N2, d_total)
    return A, C, Cz, Xhat
```

The structure mirrors the two stages. Lines 1–20 build block Hankel matrices. Lines 21–40 run the oblique projection between past neural and future behavioral Hankel matrices to get the relevant subspace, then estimate the behavioral readout by regressing behavior onto the relevant states. Lines 41–65 project out the relevant component, re-embed the residual as a new Hankel matrix, and run standard subspace ID on it. The final assembly builds the block lower-triangular $A$ (including the coupling term $A_{21}$), the concatenated $C$, and the full latent state sequence.
References
- O. G. Sani, H. Abbaspourazad, Y. T. Wong, B. Pesaran, and M. M. Shanechi, "Modeling behaviorally relevant neural dynamics enabled by preferential subspace identification," Nature Neuroscience, vol. 24, pp. 140–149, 2021.
- R. E. Kalman, "A new approach to linear filtering and prediction problems," Journal of Basic Engineering, vol. 82, no. 1, pp. 35–45, 1960.
- P. Van Overschee and B. De Moor, Subspace Identification for Linear Systems: Theory, Implementation, Applications. Springer, 1996.
- H. Hotelling, "Relations between two sets of variates," Biometrika, vol. 28, no. 3/4, pp. 321–377, 1936.
- C. Pandarinath, D. J. O'Shea, J. Collins, et al., "Inferring single-trial neural population dynamics using sequential auto-encoders," Nature Methods, vol. 15, pp. 805–815, 2018.
- P. Vahidi*, O. G. Sani*, and M. M. Shanechi, "Modeling and dissociation of intrinsic and input-driven neural population dynamics underlying behavior," PNAS, vol. 121, no. 7, e2212887121, 2024.
- O. G. Sani and M. M. Shanechi, "PSID: Preferential Subspace Identification" (MATLAB/Python toolbox), GitHub.
- O. G. Sani, B. Pesaran, and M. M. Shanechi, "Dissociative and prioritized modeling of behaviorally relevant neural dynamics using recurrent neural networks," Nature Neuroscience, vol. 27, pp. 2033–2045, 2024.
- O. G. Sani and M. M. Shanechi, "Preferential subspace identification (PSID) with forward-backward smoothing," arXiv:2507.15288, 2025.
- D. Yi, H. Dong, M. J. Higley, A. Churchland, and S. Saxena, "Shared-AE: automatic identification of shared subspaces in high-dimensional neural and behavioral activity," ICLR, 2025.