Preferential Subspace Identification
Recovering latent dynamics from neural recordings and separating behaviorally relevant from irrelevant structure, derived from scratch with interactive figures.
Introduction
In the previous post, CCA gave us a way to find shared linear structure between two datasets. We found weight vectors that maximized the correlation between projections of paired observations, and used these to align neural recordings across subjects. But CCA treats each row of data as an independent observation. Neural data has temporal structure: the activity at time $t$ depends on the activity at time $t-1$, and this dependence is what we mean by dynamics. CCA ignores it entirely.
This post picks up where CCA left off. We will build up the mathematics of linear dynamical systems, derive a classical method for recovering latent dynamics from observations (subspace identification), and then extend it to preferentially extract the dynamics that relate to behavior. The result is Preferential Subspace Identification, or PSID [1], a method that cleanly partitions latent neural dynamics into a behaviorally relevant subspace and an irrelevant one.

PSID was introduced by Sani et al. [1] as a way to dissociate behaviorally relevant neural dynamics from the full latent state. The key idea: use CCA in a subspace identification framework to preferentially extract the dynamics that predict behavior.
Everything is derived from scratch. I assume you are comfortable with linear algebra basics (matrix multiplication, eigenvalues, SVD) but nothing about state-space models, system identification, or dynamics. If you have read the CCA post, you will recognize some of the machinery (particularly the whitening-then-SVD step), but it is not required.
State-space models
A linear state-space model describes a system evolving through time in two equations. The state equation says that the latent state evolves linearly:

$$x_{t+1} = A x_t + w_t,$$

where $A$ is the state transition matrix and $w_t$ is process noise. The matrix $A$ encodes the dynamics: its eigenvalues determine whether the system oscillates, decays, or drifts, and how fast.

Linear state-space models date to Kalman's foundational work [2] on optimal filtering. The framework is general enough to describe everything from electrical circuits to population dynamics, and it remains the default starting point for modeling temporal neural data.
The observation equation says that we never see the latent state directly. Instead, we observe a noisy linear function of it:

$$y_t = C x_t + v_t,$$

where $C$ maps the $n$-dimensional latent state to the $m$ observed variables (e.g., neurons), and $v_t$ is observation noise. In a neural recording, $y_t$ might be the firing rates of 100 neurons at time $t$, each a noisy weighted sum of a handful of latent dimensions.
Together, the two equations define the generative model: a low-dimensional state evolves according to $A$, and we observe it through the lens of $C$. The inverse problem is: given only the observed $y_t$, recover $A$, $C$, and the latent trajectory $x_t$.
The figure above shows a synthetic system with three latent dimensions and six observed neurons. Two latent dimensions form a slow rotation (the circular trajectory in the x₁–x₂ projection), while the third is an independent, faster oscillation. All six neurons are noisy linear combinations of these three latent variables. As you increase the observation noise, the neural traces become harder to distinguish from each other, but the underlying latent structure is still there, waiting to be extracted.
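The generative model is easy to simulate directly. The sketch below loosely mirrors the post's synthetic system (three latent dimensions, six neurons); the rotation frequency, the sign-flipping stand-in for the faster oscillation, and all noise scales are made-up values for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy system loosely matching the post's example: 3 latent dims, 6 neurons.
# Frequencies, damping, and noise scales are made up for illustration.
theta = 2 * np.pi / 50                     # slow rotation
A = np.zeros((3, 3))
A[:2, :2] = 0.995 * np.array([[np.cos(theta), -np.sin(theta)],
                              [np.sin(theta),  np.cos(theta)]])
A[2, 2] = -0.9                             # fast sign-flipping mode, a 1-D stand-in
C = rng.standard_normal((6, 3))            # observation matrix: neurons mix latents

T = 500
x = np.zeros((T, 3))
x[0] = rng.standard_normal(3)
for t in range(T - 1):
    x[t + 1] = A @ x[t] + 0.05 * rng.standard_normal(3)   # state equation
y = x @ C.T + 0.1 * rng.standard_normal((T, 6))           # observation equation
```

Every neuron in `y` is a noisy weighted sum of the three latent dimensions, which is exactly the setting the inverse problem has to undo.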
Subspace identification
The key insight behind subspace identification is that future observations carry information about the current latent state, and this information has a specific structure that we can exploit. Unrolling the state equation forward from time $t$ gives:

$$x_{t+k} = A^k x_t + \sum_{j=0}^{k-1} A^{k-1-j} w_{t+j}.$$

Substituting into the observation equation, the observation at time $t+k$ is:

$$y_{t+k} = C A^k x_t + C \sum_{j=0}^{k-1} A^{k-1-j} w_{t+j} + v_{t+k}.$$

If we stack several future observations into a vector $y_t^+ = [y_t^\top, y_{t+1}^\top, \ldots, y_{t+k-1}^\top]^\top$, the signal part factors as:

$$y_t^+ = \Gamma_k x_t + \text{(noise terms)},$$

where $\Gamma_k$ is the extended observability matrix:

$$\Gamma_k = \begin{bmatrix} C \\ CA \\ \vdots \\ CA^{k-1} \end{bmatrix}.$$
This matrix has $km$ rows and $n$ columns. Its column space is the observability subspace: the set of directions in observation space that the latent state can influence. The rank of $\Gamma_k$ equals the latent dimensionality $n$, as long as the system is observable and $k$ is large enough.

A system is observable if different initial states produce different output sequences. For linear systems, this is equivalent to $\Gamma_k$ having full column rank. When this fails (say two latent dimensions project identically onto all neurons), those dimensions become indistinguishable from observations alone.
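The rank property is quick to check numerically. The helper below builds $\Gamma_k$ by repeated multiplication; the 2-D system it is applied to is a made-up example.

```python
import numpy as np

def extended_observability(C, A, k):
    """Stack C, CA, ..., CA^{k-1} into the extended observability matrix."""
    blocks = [C]
    for _ in range(k - 1):
        blocks.append(blocks[-1] @ A)
    return np.vstack(blocks)

# Toy system: n = 2 latent dims, m = 5 neurons (values are illustrative)
A = np.array([[0.9, -0.2],
              [0.2,  0.9]])
C = np.random.default_rng(1).standard_normal((5, 2))

Gamma = extended_observability(C, A, k=4)
print(Gamma.shape)                    # (20, 2): k*m rows, n columns
print(np.linalg.matrix_rank(Gamma))   # 2: full column rank, so the system is observable
```

If two rows of a latent rotation projected identically onto every neuron, the rank would drop below $n$ and those dimensions would be unrecoverable, which is exactly the failure mode described above.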
Now stack these future observation vectors for every time point into columns of a matrix. The result is the block Hankel matrix. Each column contains $k$ consecutive observation vectors starting from a different time step. Columns shift forward by one time step, giving the matrix its characteristic diagonal-constant structure:

$$H = \begin{bmatrix} y_1 & y_2 & \cdots & y_N \\ y_2 & y_3 & \cdots & y_{N+1} \\ \vdots & \vdots & & \vdots \\ y_k & y_{k+1} & \cdots & y_{N+k-1} \end{bmatrix}.$$
Each entry $y_t$ is itself an $m$-dimensional vector (one value per neuron), so each column of $H$ is a $km$-dimensional vector. The word "block" refers to this: $H$ is a matrix of vectors, not scalars. Notice the structure: moving one step down a column shifts the time index by one, and moving one step right across a row also shifts it by one. This means that every anti-diagonal contains the same observation vector, the defining property of a Hankel matrix.
Using the factorization $y_t^+ = \Gamma_k x_t + \text{noise}$, the signal part of each column is $\Gamma_k x_t$ for some state $x_t$. This means the signal part of $H$ has rank at most $n$, so all columns live in the $n$-dimensional column space of $\Gamma_k$, regardless of how many time points we have.
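The low-rank claim can be verified directly: with noiseless observations from a 2-D latent system (made-up values below), the block Hankel matrix has rank exactly 2 no matter how many columns it has.

```python
import numpy as np

rng = np.random.default_rng(2)
A = np.array([[0.95, -0.3],
              [0.3,  0.95]])            # n = 2: a slightly damped rotation
C = rng.standard_normal((4, 2))         # m = 4 neurons
T, k = 300, 5

x = np.zeros((T, 2))
x[0] = [1.0, 0.0]
for t in range(T - 1):
    x[t + 1] = A @ x[t]                 # noiseless dynamics
y = x @ C.T                             # noiseless observations

# Column j stacks y_j, ..., y_{j+k-1}
N = T - k + 1
H = np.column_stack([y[j:j + k].ravel() for j in range(N)])
print(H.shape)                          # (20, 296): km rows, N columns
print(np.linalg.matrix_rank(H))         # 2: the latent dimensionality, despite 296 columns
```

With observation noise added, the rank becomes numerically full, but the first two singular values dominate, which is the gap the next step exploits.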
Taking the SVD of the block Hankel matrix reveals the latent dimensionality as a gap in the singular value spectrum. The first $n$ singular values correspond to the signal; the rest correspond to noise. The left singular vectors span the observability subspace, giving us $C$ (from the first block of rows) and $A$ (from the shift structure). The right singular vectors, scaled by the singular values, give the latent state sequence.

This procedure is the essence of N4SID and related algorithms [3]. The name "subspace identification" comes from the fact that we identify the system by finding the column subspace of the observability matrix. In practice, an oblique projection step removes the effect of past inputs before taking the SVD; the algebra is similar to the whitening step in CCA.
With the observability subspace in hand, extracting the system matrices is straightforward. Write the SVD as $H = U S V^\top$, keeping only the first $n$ components (the ones above the gap). The left singular vectors estimate the column space of the observability matrix: $\hat{\Gamma}_k = U_n S_n^{1/2}$. The observation matrix $\hat{C}$ is the first $m$ rows of $\hat{\Gamma}_k$. The latent state sequence is recovered by multiplying the pseudoinverse of this observability estimate by the original Hankel columns: $\hat{X} = \hat{\Gamma}_k^{\dagger} H$. Finally, the state transition matrix $\hat{A}$ is estimated by least-squares regression: find the matrix that best maps each recovered state $\hat{x}_t$ to the next state $\hat{x}_{t+1}$.
The figure above shows the punchline of standard subspace identification: the latent states are recovered accurately (up to an invertible linear transformation), but the individual recovered dimensions have no particular meaning. The first recovered dimension might be a mixture of the behaviorally relevant rotation and the irrelevant oscillation. There is nothing in the procedure that distinguishes them.
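The extraction steps above fit in a few lines. On low-noise synthetic data (made-up system below), the eigenvalue moduli of the estimated $\hat{A}$ recover those of the true $A$, even though the recovered states come back in arbitrary coordinates.

```python
import numpy as np

rng = np.random.default_rng(3)
A_true = np.array([[0.95, -0.3],
                   [0.3,  0.95]])       # illustrative 2-D rotation-like dynamics
C_true = rng.standard_normal((4, 2))
T, k, m, n = 400, 5, 4, 2

x = np.zeros((T, n))
x[0] = [1.0, 0.0]
for t in range(T - 1):
    x[t + 1] = A_true @ x[t] + 0.01 * rng.standard_normal(n)
y = x @ C_true.T + 0.01 * rng.standard_normal((T, m))

# Build the block Hankel matrix and truncate its SVD at n components
N = T - k + 1
H = np.column_stack([y[j:j + k].ravel() for j in range(N)])
U, s, Vt = np.linalg.svd(H, full_matrices=False)
Gamma_hat = U[:, :n] * np.sqrt(s[:n])            # estimated observability matrix
C_hat = Gamma_hat[:m]                            # first block of rows
X_hat = np.linalg.pinv(Gamma_hat) @ H            # states, up to a linear transform
A_hat = X_hat[:, 1:] @ np.linalg.pinv(X_hat[:, :-1])  # least-squares transition

# Eigenvalue moduli are invariant to the unknown similarity transform,
# so they should roughly match the true ones.
print(np.sort(np.abs(np.linalg.eigvals(A_hat))))
print(np.sort(np.abs(np.linalg.eigvals(A_true))))
```

The similarity-transform ambiguity is exactly the problem the next section raises: the recovered coordinates are accurate but meaningless.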
Adding behavior
So far we have been recovering the full latent state without asking what any of it means. But in practice, we care about a specific aspect of the latent dynamics: the part that drives behavior. The figure below illustrates the problem. The brain's full latent state space contains signals that are encoded in our neural recordings and signals that are not. Of the encoded signals, only a subset relates to the behavior we measure. Different methods see different parts of this picture.
Now suppose we also have a behavioral measurement $z_t$ that is a linear function of the latent state:

$$z_t = C_z x_t + \epsilon_t,$$

where $C_z$ picks out the behaviorally relevant latent dimensions and $\epsilon_t$ is noise. In our synthetic system, the third entry of $C_z$ is zero, giving a scalar behavior ($p = 1$) that depends on the first two latent dimensions (the slow rotation) and ignores the third (the fast oscillation).
Standard subspace identification does not know about $z_t$. It recovers the full latent state, but in an arbitrary coordinate system. The behavioral information ends up smeared across all recovered dimensions.
This is the central limitation: if you want to know which part of the neural dynamics drives behavior, standard subspace identification cannot tell you. The behavioral signal is there, but it is mixed with irrelevant dynamics. Can we modify the procedure to extract the behaviorally relevant dynamics first?
Preferential subspace identification
PSID answers this question with a two-stage procedure [1]. The idea is to use CCA (the same whitening-and-SVD machinery from the previous post) to find the part of the neural state space that predicts future behavior, and then recover the remaining dynamics from the residual.
Stage 1: behaviorally relevant subspace. Build two block Hankel matrices: one from past neural observations $y_t$ and one from future behavioral measurements $z_t$. Run a procedure analogous to CCA: an oblique projection of future behavior onto past neural activity. The top directions from the neural side span the subspace of past neural activity that is maximally predictive of future behavior. These are the behaviorally relevant directions.

Unlike symmetric CCA (which finds mutual directions in both spaces), this projection is asymmetric: we care about directions in neural space that predict behavior, not the reverse. The whitening-then-SVD machinery is the same as in the CCA post, but the past-neural-to-future-behavior cross-covariance replaces the static cross-covariance between the two datasets.
Project the neural data onto these canonical directions to get a low-dimensional state estimate for the relevant subspace. Extract the transition matrix $A_1$ and observation matrix $C_1$ by the same least-squares procedure as before.
Stage 2: remaining dynamics. Project out the behaviorally relevant component from the observations: $y_t^{\text{resid}} = y_t - C_1 \hat{x}_t^{(1)}$. Run standard subspace identification on the residual to capture the remaining latent dynamics. This gives $A_2$ and $C_2$.
The combined model has a block lower-triangular transition matrix and a concatenated observation matrix:

$$A = \begin{bmatrix} A_1 & 0 \\ A_{21} & A_2 \end{bmatrix}, \qquad C = \begin{bmatrix} C_1 & C_2 \end{bmatrix}.$$
The zero in the upper-right block is the key constraint: the relevant subspace evolves independently of the irrelevant one (governed by $A_1$ alone), but the irrelevant subspace can be driven by the relevant state through $A_{21}$. The $x^{(1)}$ parts talk to behavior; the $x^{(2)}$ parts don't. The latent state is cleanly partitioned.
The difference is immediate. Standard subspace identification gives three dimensions with behavioral information mixed across all of them. PSID gives two dimensions that are strongly correlated with behavior and one that is not. The partition matches the ground truth of the synthetic system: two relevant dimensions (the slow rotation) and one irrelevant dimension (the fast oscillation).
When PSID misleads
PSID requires you to specify the relevant dimensionality: how many latent dimensions should be assigned to the behaviorally relevant subspace. If you get this wrong, the results can be misleading.
Over-specifying the relevant dimensionality. If you tell PSID to extract more relevant dimensions than truly exist, the extra dimensions will fit noise or capture dynamics that are actually irrelevant to behavior. In the figure below, the true relevant dimensionality is 2. Setting it to 3 or higher forces PSID to designate additional dimensions as "relevant," but their correlation with the behavioral variable is low.
No real behavioral relevance. If the behavioral variable is actually independent of the neural activity, PSID will still extract "relevant" dimensions. It has no way to know that the CCA directions are fitting noise rather than signal. Cross-validation is the standard defense: fit PSID on a training set, predict behavior on a held-out test set, and check whether the prediction exceeds what you would expect by chance.
In practice, the right approach is to sweep the relevant dimensionality from 0 upward and monitor the cross-validated prediction accuracy. The point where the curve plateaus tells you how many latent dimensions truly contribute to behavior.
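A minimal sketch of that sweep, using a static reduced-rank stand-in for PSID's stage 1 on made-up data (a real analysis would cross-validate the full PSID fit, not this simplification):

```python
import numpy as np

rng = np.random.default_rng(4)

# Made-up stand-in: behavior is driven by 2 of 5 latent dims.
T = 1000
X = rng.standard_normal((T, 5))
Y = X @ rng.standard_normal((5, 20)) + 0.1 * rng.standard_normal((T, 20))       # neural
Z = X[:, :2] @ rng.standard_normal((2, 3)) + 0.1 * rng.standard_normal((T, 3))  # behavior

train, test = np.arange(700), np.arange(700, 1000)

def heldout_r2(d):
    """Held-out R^2 when only d predictive neural directions are kept."""
    Yc, Zc = Y - Y[train].mean(0), Z - Z[train].mean(0)
    # Whiten the neural data, then SVD the cross-covariance with behavior:
    # a static analogue of PSID's stage 1.
    Cyy = Yc[train].T @ Yc[train] / len(train)
    evals, evecs = np.linalg.eigh(Cyy)
    Wht = evecs @ np.diag(evals ** -0.5) @ evecs.T
    U, _, _ = np.linalg.svd(Wht @ Yc[train].T @ Zc[train], full_matrices=False)
    W = Wht @ U[:, :d]                       # d predictive neural directions
    beta, *_ = np.linalg.lstsq(Yc[train] @ W, Zc[train], rcond=None)
    resid = Zc[test] - (Yc[test] @ W) @ beta
    return 1 - resid.var() / Zc[test].var()

scores = {d: heldout_r2(d) for d in (1, 2, 3)}
# Expect a jump from d=1 to d=2, then a plateau: the true relevant dim is 2.
print(scores)
```

The plateau after `d=2` is the signature to look for: extra "relevant" dimensions stop improving held-out prediction once the true behaviorally relevant dimensionality is covered.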
Extensions
PSID assumes a linear autonomous system: no external inputs, linear dynamics, linear observations. Three subsequent papers relax these assumptions while preserving the dissociative structure that separates behaviorally relevant from irrelevant dynamics.
IPSID: measured inputs. In many experiments, subjects receive sensory stimuli or perturbations under the experimenter's control. Standard PSID folds these input-driven responses into the latent state, conflating intrinsic dynamics with stimulus-evoked activity. IPSID [6] adds measured inputs $u_t$ to the state equation:

$$x_{t+1} = A x_t + B u_t + w_t.$$
This dissociates intrinsic dynamics (governed by $A$) from input-driven responses (governed by $B$), letting you ask which part of neural activity reflects the brain's own dynamics versus what is driven by external events.
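A toy simulation makes the dissociation concrete. The $A$ and $B$ below are illustrative values, not from any fitted model: between pulses the state follows its intrinsic dynamics, and at each pulse the input kicks the first latent dimension.

```python
import numpy as np

rng = np.random.default_rng(5)

# Sketch of the input-driven state equation x_{t+1} = A x_t + B u_t + w_t.
A = np.array([[0.9, -0.1],
              [0.1,  0.9]])              # intrinsic dynamics (toy values)
B = np.array([[1.0],
              [0.0]])                    # the input drives the first latent dim
T = 200
u = (rng.random((T, 1)) < 0.05).astype(float)   # sparse stimulus pulses
x = np.zeros((T, 2))
for t in range(T - 1):
    # intrinsic term (A x_t) plus input-driven term (B u_t) plus process noise
    x[t + 1] = A @ x[t] + B @ u[t] + 0.02 * rng.standard_normal(2)
```

Without modeling $u_t$, a PSID fit would have to absorb the pulse responses into the latent state, which is the conflation IPSID removes.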
DPAD: nonlinear dynamics. PSID's linear machinery cannot capture thresholding, gating, or other nonlinear interactions in neural circuits. DPAD [8] replaces the linear state-space model with a multi-section recurrent neural network, trained to prioritize behaviorally relevant dynamics in the first sections and capture the remainder in the rest. This lets the model learn nonlinear latent dynamics without sacrificing the relevant/irrelevant partition.
Optimal smoothing. PSID uses past neural activity to predict future behavior, a purely causal (filtering) direction. Sani & Shanechi [9] extend this by incorporating concurrent and future neural observations through optimal smoothing. The hierarchy is clean: prediction uses only past neural data, filtering adds concurrent data, and smoothing adds future data as well. Each step reduces estimation error in the latent states, which matters for offline analyses where the full recording is available.

The progression from prediction to filtering to smoothing mirrors the classical Kalman filter literature [2]: the Kalman filter is a causal estimator, while the Rauch–Tung–Striebel smoother refines those estimates by running backward through the data.
Assumptions and limitations
What PSID gives you. A latent dynamical system with a clean partition into behaviorally relevant and irrelevant subspaces, identified directly from data. The system matrices ($A$, $C$) are estimated without iterative optimization, and the partition comes from the structure of the CCA solution.
Linearity. Both the dynamics and the observation model are assumed linear. If the true dynamics involve nonlinear interactions (thresholding, saturation, gating), the linear model will capture only the best linear approximation, which may miss important structure. This is the same limitation as linear CCA, just extended to the temporal domain.
Stationarity. The system matrices $A$ and $C$ are assumed constant over time. If the neural code changes over the course of a session (as it does during learning), PSID will average over these changes rather than tracking them.
Linear behavioral readout. The behavioral variable is assumed to be a linear function of the latent state. If the relationship is nonlinear (for example, behavior depends on the phase of a neural oscillation rather than its amplitude), PSID's "relevant" subspace will not capture the full picture.
Two-stage is greedy. PSID extracts the relevant subspace first, then the irrelevant subspace from the residual. This greedy approach is not jointly optimal: errors in the first stage propagate to the second.
Where this leads. The extensions above address some of these limitations directly: DPAD relaxes linearity, IPSID handles exogenous inputs, and smoothing improves offline estimates. Beyond the PSID family, LFADS [5] learns nonlinear latent dynamics through a sequential autoencoder, and switching linear dynamical systems can relax stationarity by allowing the matrices to change over time.
PSID and its neighbors
| Method | Key matrix | Dynamics? | Behavior-aware? | Relationship to PSID |
|---|---|---|---|---|
| PCA | Covariance $\Sigma_{yy}$ | No | No | SVD of one covariance instead of a cross-covariance |
| CCA | Whitened cross-covariance | No | Yes | Static version of PSID's stage 1 |
| N4SID | Block Hankel of $y$ | Yes | No | PSID without the preferential step |
| PSID | Past-$y$ / future-$z$ Hankel | Yes | Yes | (this post) |
| IPSID | As PSID, plus inputs $u$ | Yes | Yes | PSID + measured inputs; dissociates intrinsic from input-driven |
| DPAD | (RNN) | Yes (nonlinear) | Yes | Nonlinear PSID; RNNs replace linear state-space |
| Shared-AE | (autoencoder) | No | Yes | Nonlinear shared/private split via CS-divergence; extends to 3+ modalities |
| LFADS | (seq. autoencoder) | Yes | Optional | Nonlinear, requires training a neural network |
CCA [4] gave us static shared directions between paired datasets. PSID adds two things: temporal structure (dynamics) and a principled partition between behaviorally relevant and irrelevant latent dimensions. The connection between the two methods is direct: PSID's first stage is CCA applied to time-lagged neural and behavioral data, using the same whitening-and-SVD pipeline.

In the CCA post, both datasets were neural recordings from different subjects, and the goal was to find shared directions between two populations. Here, one dataset is neural and the other is behavioral. The math is identical (whiten, then SVD the cross-covariance), but the interpretation changes: instead of "shared across subjects," we get "predictive of behavior."
The original PSID paper [1] demonstrated the method on intracortical recordings from monkeys performing reaching tasks, showing that behaviorally relevant dynamics occupied a low-dimensional subspace that standard methods like N4SID missed. Subsequent work extended the framework in three directions: IPSID [6] adds measured inputs to dissociate intrinsic from stimulus-driven dynamics, DPAD [8] replaces the linear model with recurrent neural networks, and optimal smoothing [9] leverages future neural data for offline estimation. All methods come with a publicly available MATLAB/Python toolbox [7].
A different line of work drops the dynamical model entirely. Shared-AE [10] uses dual autoencoders with Cauchy-Schwarz divergence regularization to split each modality's latent space into shared and private components. Because the encoders are nonlinear, Shared-AE can capture richer neural-behavioral relationships than PSID's linear model, and it generalizes naturally to more than two modalities. The trade-off is that it gives up the temporal structure and closed-form identifiability that make PSID's latent dynamics directly interpretable.
The next post in this series will move beyond pairwise comparisons to aligning multiple datasets simultaneously, the multi-population setting that motivates most of my PhD work.
Implementation
The full PSID algorithm is compact enough to fit in about 50 lines of NumPy. Here is a self-contained implementation that takes observations $Y$, behavioral measurements $Z$, the relevant dimensionality, and the total dimensionality, and returns the system matrices $A$, $C$, the behavioral readout $C_z$, and the recovered latent states $\hat{X}$.
```python
import numpy as np
from scipy.linalg import svd, sqrtm, pinv


def build_hankel(data, num_lags):
    """Stack time-lagged snapshots into a block Hankel matrix."""
    T, d = data.shape
    N = T - 2 * num_lags + 1
    past = np.zeros((num_lags * d, N))
    future = np.zeros((num_lags * d, N))
    for lag in range(num_lags):
        past[lag*d:(lag+1)*d] = data[lag:lag+N].T
        future[lag*d:(lag+1)*d] = data[num_lags+lag:num_lags+lag+N].T
    return past, future


def psid(Y, Z, d_rel, d_total, num_lags):
    """
    Preferential Subspace Identification.

    Parameters
    ----------
    Y : (T, m) neural observations
    Z : (T, p) behavioral measurements
    d_rel : number of behaviorally relevant latent dimensions
    d_total : total latent dimensionality
    num_lags : number of time lags for Hankel matrices

    Returns
    -------
    A, C, Cz, Xhat : system matrices, behavioral readout, and states
    """
    # Center
    Y = Y - Y.mean(0)
    Z = Z - Z.mean(0)
    m = Y.shape[1]

    # Stage 1: CCA between past-neural and future-behavior
    Y_past, Y_future = build_hankel(Y, num_lags)
    Z_past, Z_future = build_hankel(Z, num_lags)
    Cyy = Y_past @ Y_past.T / Y_past.shape[1]
    Czz = Z_future @ Z_future.T / Z_future.shape[1]
    Cyz = Y_past @ Z_future.T / Y_past.shape[1]

    # Whiten, then SVD of cross-covariance
    Cyy_inv_sqrt = np.real(pinv(sqrtm(Cyy)))
    Czz_inv_sqrt = np.real(pinv(sqrtm(Czz)))
    M = Cyy_inv_sqrt @ Cyz @ Czz_inv_sqrt
    U, s, Vt = svd(M, full_matrices=False)

    # Top d_rel canonical directions (in neural Hankel space)
    W_rel = Cyy_inv_sqrt @ U[:, :d_rel]

    # Recover relevant states from past observations
    Xhat_rel = W_rel.T @ Y_past  # (d_rel, N)

    # Observability matrix estimate and single-lag C1
    O_rel = W_rel[:m, :]  # first block = C1
    C1 = O_rel

    # Fit A1 by least squares: x_{t+1} = A1 @ x_t
    A1 = Xhat_rel[:, 1:] @ pinv(Xhat_rel[:, :-1])

    # Behavioral readout: regress Z onto relevant states
    N = Y_past.shape[1]
    Z_aligned = Z[num_lags:num_lags+N]  # (N, p)
    Cz = Z_aligned.T @ pinv(Xhat_rel)  # (p, d_rel)

    # Stage 2: residual subspace identification
    d_irr = d_total - d_rel
    if d_irr > 0:
        # Project out relevant component, then re-embed as Hankel
        Y_single = Y[num_lags:num_lags+N].T  # (m, N)
        Y_resid_ts = (Y_single - C1 @ Xhat_rel).T  # (N, m)
        Yr_past, _ = build_hankel(Y_resid_ts, num_lags)
        U2, s2, V2t = svd(Yr_past, full_matrices=False)
        Xhat_irr = np.diag(s2[:d_irr]) @ V2t[:d_irr]  # (d_irr, N2)
        C2 = U2[:m, :d_irr]
        A2 = Xhat_irr[:, 1:] @ pinv(Xhat_irr[:, :-1])
        # Trim relevant states to match residual length
        N2 = Xhat_irr.shape[1]
        Xhat_rel = Xhat_rel[:, :N2]
    else:
        N2 = Xhat_rel.shape[1]
        Xhat_irr = np.zeros((0, N2))
        C2 = np.zeros((m, 0))
        A2 = np.zeros((0, 0))

    # Combine: block lower-triangular A
    d1, d2 = d_rel, d_irr
    A = np.zeros((d1 + d2, d1 + d2))
    A[:d1, :d1] = A1
    if d2 > 0:
        A[d1:, d1:] = A2
        # A21: regression of irrelevant next-state on relevant state
        A[d1:, :d1] = Xhat_irr[:, 1:] @ pinv(Xhat_rel[:, :-1])
    C = np.hstack([C1, C2])
    Xhat = np.vstack([Xhat_rel, Xhat_irr]).T  # (N2, d_total)
    return A, C, Cz, Xhat
```

The structure mirrors the two stages. Lines 1–20 build block Hankel matrices. Lines 21–40 run the oblique projection between past neural and future behavioral Hankel matrices to get the relevant subspace, then estimate the behavioral readout by regressing behavior onto the relevant states. Lines 41–65 project out the relevant component, re-embed the residual as a new Hankel matrix, and run standard subspace ID on it. The final assembly builds the block lower-triangular $A$ (including the coupling term $A_{21}$), the concatenated $C$, and the full latent state sequence.
References
- O. G. Sani, H. Abbaspourazad, Y. T. Wong, B. Pesaran, and M. M. Shanechi, "Modeling behaviorally relevant neural dynamics enabled by preferential subspace identification," Nature Neuroscience, vol. 24, pp. 140–149, 2021.
- R. E. Kalman, "A new approach to linear filtering and prediction problems," Journal of Basic Engineering, vol. 82, no. 1, pp. 35–45, 1960.
- P. Van Overschee and B. De Moor, Subspace Identification for Linear Systems: Theory, Implementation, Applications. Springer, 1996.
- H. Hotelling, "Relations between two sets of variates," Biometrika, vol. 28, no. 3/4, pp. 321–377, 1936.
- C. Pandarinath, D. J. O'Shea, J. Collins, et al., "Inferring single-trial neural population dynamics using sequential auto-encoders," Nature Methods, vol. 15, pp. 805–815, 2018.
- P. Vahidi*, O. G. Sani*, and M. M. Shanechi, "Modeling and dissociation of intrinsic and input-driven neural population dynamics underlying behavior," PNAS, vol. 121, no. 7, e2212887121, 2024.
- O. G. Sani and M. M. Shanechi, "PSID: Preferential Subspace Identification" (MATLAB/Python toolbox), GitHub.
- O. G. Sani, B. Pesaran, and M. M. Shanechi, "Dissociative and prioritized modeling of behaviorally relevant neural dynamics using recurrent neural networks," Nature Neuroscience, vol. 27, pp. 2033–2045, 2024.
- O. G. Sani and M. M. Shanechi, "Preferential subspace identification (PSID) with forward-backward smoothing," arXiv:2507.15288, 2025.
- D. Yi, H. Dong, M. J. Higley, A. Churchland, and S. Saxena, "Shared-AE: automatic identification of shared subspaces in high-dimensional neural and behavioral activity," ICLR, 2025.