Procrustes and hyperalignment

Aligning neural representations across subjects.

The alignment problem

You record from motor cortex in two monkeys performing the same reaching task. Each monkey has its own electrode array, its own set of neurons, its own coordinate system. Monkey A gives you a 500-by-96 data matrix. Monkey B gives you a 500-by-112 data matrix. The 500 rows are matched: row t in both matrices corresponds to the same time point during the same reach condition.

You run PCA on each monkey separately and keep 10 components. Now you have two 500-by-10 matrices of PC scores: X for monkey A and Y for monkey B. Each matrix describes the same set of reaches, but in a different coordinate system. If the two populations encode movement similarly, there should be a way to rotate one coordinate system to align with the other.

But which rotation? The PC axes for monkey A were chosen to capture variance in monkey A's data. Monkey B's axes capture variance in monkey B's data. There is no reason the first PC of one monkey should correspond to the first PC of another. The axes might be permuted, flipped, or rotated by some arbitrary angle.

The question is: find the rotation that makes the two descriptions match as closely as possible. This is the Procrustes problem [1].

The name comes from Greek mythology. Procrustes was an innkeeper who adjusted his guests to fit his bed — by stretching them or cutting off their legs. The mathematical Procrustes problem is more civilized: it adjusts one dataset to fit another, but only by rotations and reflections, not by stretching or cutting. Schönemann [1] gave the SVD-based solution in 1966.

Measuring misalignment

Before we can find the best rotation, we need a way to measure how far apart two matrices are. The standard choice is the Frobenius norm: the square root of the sum of all squared entries.

\|M\|_F = \sqrt{\sum_{i,j} M_{ij}^2}
(1)

Think of it as flattening the matrix into one long vector and computing its L^2 length. The distance between matrices X and Y is \|X - Y\|_F. If the two matrices are identical, the distance is zero. If they differ, each disagreeing entry contributes to the total.
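The flatten-and-measure reading is easy to confirm in NumPy; a quick check on an arbitrary random matrix (the shape is chosen only for illustration):

```python
import numpy as np

# The Frobenius norm equals the L2 norm of the flattened matrix.
rng = np.random.default_rng(0)
M = rng.standard_normal((4, 3))

fro = np.linalg.norm(M, 'fro')     # Frobenius norm of the matrix
flat = np.linalg.norm(M.ravel())   # L2 length of the flattened vector

assert np.isclose(fro, flat)
```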

The Frobenius norm connects to the trace in a useful way: \|M\|_F^2 = \text{tr}(M^\top M). This lets us expand the squared distance between X and YR (monkey B's data, rotated by R):

\|X - YR\|_F^2 = \text{tr}(X^\top X) - 2\,\text{tr}(R^\top Y^\top X) + \text{tr}(Y^\top Y)
(2)

The first and third terms do not depend on R. Only the middle term does. So minimizing the distance is the same as maximizing \text{tr}(R^\top Y^\top X). That is the problem we need to solve.

Verify the expansion by writing \|X - YR\|_F^2 = \text{tr}((X - YR)^\top(X - YR)) and distributing. The cross term -2\,\text{tr}(R^\top Y^\top X) uses the cyclic property of the trace, \text{tr}(ABC) = \text{tr}(CAB), and the third term uses orthogonality: \text{tr}(R^\top Y^\top Y R) = \text{tr}(Y^\top Y R R^\top) = \text{tr}(Y^\top Y). This identity shows up throughout the series — in CCA, in the SVD proofs, and in the SRM derivation below.
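The expansion in equation (2) is worth checking numerically once; a quick sketch with random data and a random orthogonal R built via QR:

```python
import numpy as np

# Verify ||X - YR||_F^2 = tr(X^T X) - 2 tr(R^T Y^T X) + tr(Y^T Y)
# for an orthogonal R (the last term needs R R^T = I).
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 10))
Y = rng.standard_normal((500, 10))
R, _ = np.linalg.qr(rng.standard_normal((10, 10)))  # random orthogonal matrix

lhs = np.linalg.norm(X - Y @ R, 'fro') ** 2
rhs = (np.trace(X.T @ X)
       - 2 * np.trace(R.T @ Y.T @ X)
       + np.trace(Y.T @ Y))

assert np.isclose(lhs, rhs)
```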

[Interactive figure: Frobenius norm loss as a function of rotation angle θ. Drag the rotation angle to see the Frobenius distance \|X - YR(\theta)\|_F change. The minimum of the loss curve coincides with the SVD solution.]

The orthogonal Procrustes problem

Find the orthogonal matrix R (rotation or reflection) that minimizes the distance between X and YR:

\min_{R^\top R = I}\; \|X - YR\|_F^2
(3)

From equation (2), this is equivalent to:

\max_{R^\top R = I}\; \text{tr}(R^\top Y^\top X)
(4)

The solution comes from the SVD. Compute the cross-product M = Y^\top X, an n \times n matrix where n is the number of retained components (10 in our example). Take its SVD: M = U\Sigma V^\top. The optimal rotation is:

R = U V^\top
(5)

That is the entire solution. Let's verify that it is orthogonal: R^\top R = V U^\top U V^\top = V V^\top = I. Yes.

Why does this work? The trace is \text{tr}(R^\top M) = \text{tr}(R^\top U \Sigma V^\top). Substituting R = UV^\top gives \text{tr}(V U^\top U \Sigma V^\top) = \text{tr}(V \Sigma V^\top) = \text{tr}(\Sigma). The trace of \Sigma is the sum of the singular values. Von Neumann's trace inequality guarantees that this is the maximum possible value of \text{tr}(R^\top M) over all orthogonal R.

The intuition: the SVD decomposes the cross-product into "rotate, scale, rotate." The optimal R undoes both rotations, leaving only the scaling. The remaining trace is the sum of singular values, which is the largest possible alignment score. Compare this to the CCA post, where the SVD of the whitened cross-covariance gives the canonical correlations. The algebraic structure is identical: form a cross-product matrix, take the SVD, read off the answer.

Let's make this concrete with small numbers. Suppose after PCA we have 2D scores. Monkey A: X = \bigl(\begin{smallmatrix} 1 & 0 \\ 0 & 1 \\ -1 & 0 \end{smallmatrix}\bigr). Monkey B: Y = \bigl(\begin{smallmatrix} 0 & 1 \\ -1 & 0 \\ 0 & -1 \end{smallmatrix}\bigr). Monkey B's data looks like monkey A's, rotated 90 degrees. Compute M = Y^\top X = \bigl(\begin{smallmatrix} 0 & -1 \\ 2 & 0 \end{smallmatrix}\bigr). Its SVD has singular values 2 and 1, and R = UV^\top = \bigl(\begin{smallmatrix} 0 & -1 \\ 1 & 0 \end{smallmatrix}\bigr) is exactly the inverse rotation. After alignment, YR = X exactly.
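The worked example is small enough to check in NumPy directly:

```python
import numpy as np

# Monkey A's scores, and monkey B's scores (A rotated 90 degrees).
X = np.array([[1., 0.], [0., 1.], [-1., 0.]])
Y = np.array([[0., 1.], [-1., 0.], [0., -1.]])

M = Y.T @ X                        # cross-product
U, s, Vt = np.linalg.svd(M)
R = U @ Vt                         # optimal rotation

assert np.allclose(s, [2.0, 1.0])  # singular values of M
assert np.allclose(Y @ R, X)       # alignment is exact
```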

[Interactive figure: two point clouds X (reference) and Y (to align), initially misaligned. Click Align to compute the SVD-based rotation. Drag points in Y to perturb the alignment and see the Frobenius residual \|YR - X\|_F change.]

Why restrict to rotations?

The Procrustes problem restricts R to be orthogonal. Why not allow any matrix? If you drop the constraint, the minimizer of \|X - YR\|_F^2 is the least-squares solution R = (Y^\top Y)^{-1} Y^\top X, which is just regression. That would work, but it allows stretching.

Stretching is dangerous for alignment. If monkey B's population happens to have more variance along one axis, an unconstrained transformation will stretch that axis to match monkey A's variance. Now you cannot tell whether the two populations truly share structure or whether the alignment just inflated the right dimensions to create an artificial match.
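A small simulation (synthetic data, purely for illustration) makes the danger concrete. When one dataset is a stretched copy of the other, the unconstrained least-squares map absorbs the stretch and is far from orthogonal, while the Procrustes solution is orthogonal by construction:

```python
import numpy as np

rng = np.random.default_rng(2)
Z = rng.standard_normal((200, 3))
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))    # random rotation

X = Z + 0.1 * rng.standard_normal((200, 3))
Y = (Z @ Q) * np.array([2.0, 1.0, 0.5])             # rotated AND stretched

# Unconstrained fit: plain least squares, X ≈ Y @ R_ls
R_ls, *_ = np.linalg.lstsq(Y, X, rcond=None)

# Constrained fit: orthogonal Procrustes
U, _, Vt = np.linalg.svd(Y.T @ X)
R_proc = U @ Vt

# The least-squares map undoes the stretch, so it is not orthogonal.
assert np.linalg.norm(R_ls.T @ R_ls - np.eye(3)) > 0.5
assert np.allclose(R_proc.T @ R_proc, np.eye(3))
```

The least-squares fit achieves a smaller residual, but only by rescaling Y's axes; the orthogonal fit leaves Y's internal geometry untouched.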

Orthogonal transformations preserve distances. Every pairwise distance between time points stays the same after rotation. That is a strong guarantee: the internal geometry of monkey B's data is unchanged. Only its orientation relative to monkey A's is adjusted. If the two populations genuinely share a geometric structure, a rotation is enough to reveal it. If they do not, no amount of rotation will create a good match, and the residual distance will be large.

Some methods allow scaling as well as rotation: a common Procrustes variant finds the best similarity transformation (rotation plus uniform scaling). For neural data, the pure orthogonal constraint is usually preferred because it makes the alignment result harder to overfit. Williams et al. [8] develop a framework of "shape metrics" for comparing neural representations that generalizes beyond rotations to affine and more flexible classes of transformations, with statistical tests for whether the added flexibility is justified by the data.

From pairwise to group alignment

With two subjects, you align B to A and you are done. With ten subjects, pairwise alignment gets complicated. You could align every pair, but that gives you \binom{10}{2} = 45 different alignments, and they will not be consistent: aligning B to A and C to A does not guarantee that B and C are well aligned with each other.
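The inconsistency is easy to demonstrate with simulated noisy subjects (the setup here is illustrative): routing B and C through the reference A is never better than aligning them directly, and with noise it is typically strictly worse.

```python
import numpy as np

def procrustes_R(X, Y):
    """Orthogonal R minimizing ||X - Y R||_F."""
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt

rng = np.random.default_rng(3)
Z = rng.standard_normal((300, 5))      # shared latent trajectory

def noisy_subject():
    Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
    return Z @ Q + 0.5 * rng.standard_normal((300, 5))

A, B, C = noisy_subject(), noisy_subject(), noisy_subject()

# Route through the reference subject A ...
R_BA = procrustes_R(A, B)
R_CA = procrustes_R(A, C)
indirect = np.linalg.norm(B @ R_BA - C @ R_CA, 'fro')

# ... versus aligning C to B directly.
R_CB = procrustes_R(B, C)
direct = np.linalg.norm(B - C @ R_CB, 'fro')

assert direct <= indirect + 1e-9   # direct alignment can only be better
```

The inequality is guaranteed: by orthogonal invariance of the Frobenius norm, the indirect route corresponds to one particular orthogonal matrix, while the direct solve minimizes over all of them.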

A better approach: align all subjects simultaneously to a shared template. Define a template S (a matrix the same size as each subject's PCA scores) and find rotations R_1, R_2, \ldots, R_K that minimize the total distance:

\min_{S,\, R_1, \ldots, R_K}\; \sum_{k=1}^{K} \|S - X_k R_k\|_F^2 \qquad \text{s.t. } R_k^\top R_k = I
(6)

This alternates between two steps. Fix the template S and solve for each R_k by Procrustes (equation 5). Fix the rotations and update the template as S = \tfrac{1}{K}\sum_k X_k R_k, the average of the aligned datasets. Repeat until convergence. This is generalized Procrustes analysis: alternating Procrustes alignment with template estimation.

The result is a shared coordinate system in which all subjects' data are expressed. The template represents what is common. The residuals represent what is individual.

Convergence is guaranteed because neither step (fixing the template and solving for rotations, or fixing rotations and averaging) can increase the total objective. The objective is bounded below by zero, so the alternation converges. The solution is not unique — rotating all subjects and the template by the same orthogonal matrix gives the same total distance — but the relative alignment between subjects is unique up to this global rotation.

[Interactive figure: four subjects in neural state space, initially misaligned. Click Align all to run iterative generalized Procrustes. Each subject's data rotates toward the group template as the total alignment error decreases over iterations.]

Hyperalignment

Haxby et al. [2] introduced hyperalignment for fMRI data, where the goal is to align neural representations across subjects watching the same movie. The data are voxel-by-time matrices, and each subject's voxels are in a different anatomical coordinate system.

Hyperalignment is generalized Procrustes analysis applied to neural data. Run PCA on each subject to reduce dimensionality, then iteratively align all subjects to a shared template using orthogonal rotations. The key insight is that the shared template captures the representational structure common to all subjects: the stimulus-driven geometry that is preserved across individual anatomical differences.

In practice, the alignment is done in a high-dimensional PCA space (often 100+ components) rather than in full voxel space. The orthogonal constraint means each subject's internal geometry is preserved. What changes is how the dimensions relate to the shared template.

The original hyperalignment paper [2] used a Procrustean rotation between each subject and a reference subject, iterated with template updates. Later work reformulated this as an optimization over the Stiefel manifold (the space of orthogonal matrices), which guarantees convergence and gives the same result more efficiently.

Safaie et al. [5] applied a similar strategy to align neural dynamics across monkeys and mice performing reaching movements. They used CCA rather than Procrustes for the alignment step, but the underlying logic is the same: find a shared coordinate system in which the population trajectories of different individuals overlap.

The shared response model

Chen et al. [3] observed that hyperalignment and PCA can be combined more tightly. Instead of running PCA on each subject separately and then aligning, you can solve for the shared low-dimensional representation and the per-subject rotations simultaneously.

The model: each subject's data matrix X_k (time by voxels/neurons) is approximated as a shared response matrix S (time by d latent components) transformed by a subject-specific mixing matrix W_k (neurons by d, with orthonormal columns):

X_k \approx S\, W_k^\top \qquad \text{with } W_k^\top W_k = I
(7)

The optimization minimizes the total reconstruction error across subjects:

\min_{S,\, W_1, \ldots, W_K}\; \sum_{k=1}^{K} \|X_k - S\, W_k^\top\|_F^2
(8)

This is a joint matrix factorization: each subject's data is a shared signal viewed through a subject-specific orthogonal lens. The shared signal SS captures stimulus-driven structure. The per-subject matrices WkW_k capture how each subject's neurons encode that shared structure.

The solution alternates, like generalized Procrustes. Fix S and solve for each W_k by Procrustes (SVD of X_k^\top S). Fix the W_k's and update S by averaging the aligned data. The difference from plain hyperalignment is that the shared component S is explicitly low-dimensional (rank d), not full rank.

SRM can also be derived as a probabilistic model. The shared response is a latent variable, each subject's observation is a noisy rotation of it, and the EM algorithm gives the same alternating updates. This probabilistic view connects SRM to factor analysis and probabilistic CCA [4].

After fitting, a new subject can be brought into the shared space by estimating a mapping W for that subject against the fixed shared response. This enables cross-subject decoding (train a decoder on the shared representation, apply it to any subject) and cross-subject comparison (are two subjects' representations similar? Compare them in the shared space).

The distinction between SRM and CCA is subtle but important. Both learn a shared latent space from multiple datasets. CCA maximizes correlation between pairs of projections. SRM minimizes reconstruction error with orthogonal subject-specific maps. CCA does not require the projections to preserve geometry; SRM does. For neuroscience, SRM's constraint is often preferred because it guarantees that within-subject distances are preserved after alignment.
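A sketch of that out-of-sample step (the helper name is mine, not from the SRM paper): with the shared response S held fixed, fitting a held-out subject is one more Procrustes problem.

```python
import numpy as np

def fit_new_subject(X_new, S):
    """Orthogonal W (p_new x d) minimizing ||X_new - S W^T||_F,
    holding the fitted shared response S fixed."""
    U, _, Vt = np.linalg.svd(X_new.T @ S, full_matrices=False)
    return U @ Vt

# Simulated check: build a new subject from the shared response ...
rng = np.random.default_rng(4)
n, d, p_new = 200, 5, 40
S = rng.standard_normal((n, d))                    # stand-in for a fitted S
W_true, _ = np.linalg.qr(rng.standard_normal((p_new, d)))
X_new = S @ W_true.T + 0.1 * rng.standard_normal((n, p_new))

# ... and recover its mapping into the shared space.
W_new = fit_new_subject(X_new, S)
shared_coords = X_new @ W_new                      # (n, d) shared-space data

assert np.allclose(W_new.T @ W_new, np.eye(d))     # orthonormal columns
assert np.linalg.norm(W_new - W_true) < 0.5        # close to ground truth
```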

[Figure: the SRM factorization. Each subject's data X_k (neurons by time) is a shared response S viewed through a subject-specific orthogonal mixing matrix W_k, so X_k ≈ S W_k^T. The shared structure is the same; only the embedding differs.]

What comes next

Procrustes and CCA both align two datasets, but they solve different problems. CCA finds the most correlated projections of two datasets, regardless of the original coordinate systems. Procrustes finds the rotation that best matches one dataset to another in a fixed coordinate system.

CCA allows each dataset to be projected to a different low-dimensional space. Procrustes requires both datasets to be in the same dimensionality and finds a rigid alignment. CCA is more flexible (it can discover shared structure even when the original dimensions do not correspond). Procrustes is more constrained (it preserves the internal geometry of each dataset).

For cross-subject neural alignment, the typical pipeline is: PCA to reduce dimensionality, then Procrustes (or hyperalignment, or SRM) to align. Cunningham and Yu [7] review this pipeline and its variants. CCA provides an alternative that does not require the PCA step, but it is also more prone to overfitting in high dimensions.

Methods like ShaReD combine elements of both: they learn shared and private latent spaces with behavioral supervision, using the structure of dynamics-based methods (like PSID) rather than static alignment. The Procrustes framework is the conceptual ancestor, but the optimization objective changes to incorporate temporal structure and behavioral relevance.

Both Procrustes and CCA are symmetric: they treat the two datasets equally. But many problems in neuroscience are asymmetric. You want to predict behavior from neural activity, not the reverse. You want to find the subspace of neural activity that predicts a target, not the most correlated subspace. That asymmetry leads to reduced-rank regression, targeted dimensionality reduction, and communication subspaces, which are the next post.

Implementation

The Procrustes solution is four lines of NumPy. Below is a complete implementation covering pairwise alignment, generalized Procrustes, and SRM:

import numpy as np

def procrustes(X, Y):
    """
    Find the orthogonal R minimizing ||X - Y R||_F.

    Parameters
    ----------
    X : array, shape (n, d)
        Target data (n observations, d dimensions).
    Y : array, shape (n, d)
        Source data to be aligned to X.

    Returns
    -------
    R : array, shape (d, d)
        Orthogonal alignment matrix (Y @ R ≈ X).
    distance : float
        Frobenius norm of the residual after alignment.
    """
    M = Y.T @ X            # cross-product
    U, s, Vt = np.linalg.svd(M)
    R = U @ Vt              # optimal rotation
    distance = np.linalg.norm(X - Y @ R, 'fro')
    return R, distance


def generalized_procrustes(datasets, n_iter=100, tol=1e-8):
    """
    Align K datasets to a shared template by alternating
    Procrustes alignment with template estimation.

    Parameters
    ----------
    datasets : list of arrays, each shape (n, d)
        K datasets with matched rows.

    Returns
    -------
    aligned : list of arrays, each shape (n, d)
        Aligned datasets.
    template : array, shape (n, d)
        Group-average template.
    rotations : list of arrays, each shape (d, d)
        Per-subject orthogonal alignment matrices.
    """
    K = len(datasets)
    template = datasets[0].copy()
    rotations = [np.eye(datasets[0].shape[1])] * K

    for iteration in range(n_iter):
        old_template = template.copy()

        # Align each dataset to the current template
        for k in range(K):
            R, _ = procrustes(template, datasets[k])
            rotations[k] = R

        # Update template as mean of aligned datasets
        template = np.mean(
            [datasets[k] @ rotations[k] for k in range(K)],
            axis=0,
        )

        # Check convergence
        if np.linalg.norm(template - old_template) < tol:
            break

    aligned = [datasets[k] @ rotations[k] for k in range(K)]
    return aligned, template, rotations


def srm(datasets, d, n_iter=100, tol=1e-8):
    """
    Shared Response Model: find a d-dimensional shared
    representation and per-subject orthogonal mappings.

    Parameters
    ----------
    datasets : list of arrays, each shape (n, p_k)
        K datasets. Each can have a different number of
        columns (neurons/voxels), but all have n rows.
    d : int
        Dimensionality of the shared response.

    Returns
    -------
    S : array, shape (n, d)
        Shared response (low-dimensional template).
    W_list : list of arrays, each shape (p_k, d)
        Per-subject orthogonal mixing matrices.
    """
    K = len(datasets)

    # Initialize: PCA on concatenated data
    concat = np.hstack(datasets)
    U, s, Vt = np.linalg.svd(concat, full_matrices=False)
    S = U[:, :d] * s[np.newaxis, :d]

    W_list = [None] * K
    for iteration in range(n_iter):
        old_S = S.copy()

        # Fix S, solve for each W_k by Procrustes
        for k in range(K):
            M = datasets[k].T @ S          # (p_k, d) cross-product
            U_k, _, Vt_k = np.linalg.svd(M, full_matrices=False)
            W_list[k] = U_k @ Vt_k

        # Fix W_k's, update S
        S = np.mean(
            [datasets[k] @ W_list[k] for k in range(K)],
            axis=0,
        )

        if np.linalg.norm(S - old_S) < tol:
            break

    return S, W_list


# ── Example: align two simulated monkeys ──
rng = np.random.default_rng(42)

n_time, d_latent = 500, 10

# Shared latent trajectory
Z = rng.standard_normal((n_time, d_latent))

# Subject-specific orthogonal mixings
def random_orthogonal(d, rng):
    H = rng.standard_normal((d, d))
    Q, _ = np.linalg.qr(H)
    return Q

Q_a = random_orthogonal(d_latent, rng)
Q_b = random_orthogonal(d_latent, rng)

X = Z @ Q_a.T + 0.3 * rng.standard_normal((n_time, d_latent))
Y = Z @ Q_b.T + 0.3 * rng.standard_normal((n_time, d_latent))

# Pairwise Procrustes
R, dist = procrustes(X, Y)
print(f"Residual after alignment: {dist:.2f}")

# Correlation between aligned Y and X
corrs = [np.corrcoef(X[:, j], (Y @ R)[:, j])[0, 1]
         for j in range(d_latent)]
print(f"Mean correlation per dim: {np.mean(corrs):.3f}")

A practical note: always center each dataset (subtract the column mean) before running Procrustes. The alignment optimizes rotation, not translation. If the datasets have different means, the mean offset will be treated as misalignment and absorbed into the rotation, which is not what you want.
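A minimal demonstration of why centering matters, with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(5)
Z = rng.standard_normal((300, 4))
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))

X = Z
Y = Z @ Q + 5.0          # rotated copy with a large mean offset

def align_residual(X, Y):
    """Residual ||X - Y R||_F after orthogonal Procrustes."""
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return np.linalg.norm(X - Y @ (U @ Vt), 'fro')

raw = align_residual(X, Y)                               # offset corrupts the fit
centered = align_residual(X - X.mean(0), Y - Y.mean(0))  # exact recovery

assert centered < 1e-6 < raw
```

After centering, Y is exactly a rotation of X again, so the residual drops to numerical zero; without centering, the rotation wastes effort trying to absorb the translation.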

References

  1. Schönemann, P. H. "A generalized solution of the orthogonal Procrustes problem," Psychometrika, vol. 31, no. 1, pp. 1-10, 1966.
  2. Haxby, J. V., Guntupalli, J. S., Connolly, A. C., et al. "A common, high-dimensional model of the representational space in human ventral temporal cortex," Neuron, vol. 72, no. 2, pp. 404-416, 2011.
  3. Chen, P.-H., Chen, J., Yeshurun, Y., et al. "A reduced-dimension fMRI shared response model," Advances in Neural Information Processing Systems, vol. 28, 2015.
  4. Bach, F. R. and Jordan, M. I. "A probabilistic interpretation of canonical correlation analysis," Technical Report 688, UC Berkeley, 2005.
  5. Safaie, M., Chang, J. C., Park, J., et al. "Preserved neural dynamics across animals performing similar behaviour," Nature, vol. 623, pp. 765-771, 2023.
  6. Churchland, M. M., Cunningham, J. P., Kaufman, M. T., et al. "Neural population dynamics during reaching," Nature, vol. 487, pp. 51-56, 2012.
  7. Cunningham, J. P. and Yu, B. M. "Dimensionality reduction for large-scale neural recordings," Nature Neuroscience, vol. 17, pp. 1500-1509, 2014.
  8. Williams, A. H., Kunz, E., Kornblith, S., and Linderman, S. W. "Generalized shape metrics on neural representations," Advances in Neural Information Processing Systems, vol. 34, 2021.