Canonical Correlation Analysis

Finding shared structure between two high-dimensional datasets, derived from scratch with interactive figures.

CCA finds directions in two datasets that are maximally correlated. Click Regenerate to draw new data.

Introduction

Throughout my PhD I have been developing methods for aligning neural population activity across subjects. Different animals performing the same task give you different neurons, different numbers of neurons, and different relationships between those neurons. But if the task is the same, there should be some shared structure in the population dynamics, and finding it requires a way to compare high-dimensional recordings that don't share a common coordinate system. Canonical correlation analysis, or CCA [1], is one of the earliest and cleanest solutions to this problem, and the ideas show up in nearly everything that came after.

CCA was introduced by Harold Hotelling in 1936 [1]. It generalizes the familiar Pearson correlation between two scalar variables to the setting where each "variable" is an entire multidimensional dataset. For a neuroscience-oriented review, see Zhuang et al. [2].

Here is a concrete example. You've recorded neural populations from two monkeys performing the same reaching task. Each gives you a matrix of activity, dozens of neurons over thousands of timepoints. You could correlate individual neurons across recordings, one pair at a time, but neural populations encode information in patterns of activity, and no single neuron tells the full story. What you really want is to find weighted combinations of neurons in each recording that move together. CCA does this: it finds the linear projections of each dataset that are maximally correlated with each other. Safaie, Chang et al. [7] used CCA to answer this question, aligning low-dimensional neural trajectories across monkeys and across mice to show that the underlying dynamics are preserved.

The idea generalizes well beyond neuroscience (gene expression paired with protein levels, audio paired with video features), but the cross-animal alignment setting is where CCA becomes most intuitive to me, and it is the lens I'll use throughout. This is the first in a series of posts working through my PhD notes. CCA felt like the right place to start: it is simple enough to derive from scratch in a single sitting, but the ideas connect directly to the harder problems I'll cover later.

Setup and notation

Stack your observations into two matrices: $X_a \in \mathbb{R}^{n \times p}$ holds $n$ observations of $p$ variables from the first dataset, and $X_b \in \mathbb{R}^{n \times q}$ holds the same $n$ observations of $q$ variables from the second. The rows are paired: row $i$ in both matrices corresponds to the same trial, the same timepoint, or whatever links your two measurements together.

We assume the data are centered (column means subtracted), so the covariance formulas below need no mean terms. Most implementations handle this for you.

CCA looks for weight vectors $w_a \in \mathbb{R}^p$ and $w_b \in \mathbb{R}^q$ that define one-dimensional projections of each dataset, $X_a w_a$ and $X_b w_b$, such that these projections are as correlated as possible. Think of each weight vector as a "recipe" for combining variables: how much of neuron 1, how much of neuron 2, and so on.

Three matrices govern the problem. The within-dataset covariance matrices $\Sigma_{aa} = \tfrac{1}{n} X_a^\top X_a$ and $\Sigma_{bb} = \tfrac{1}{n} X_b^\top X_b$ describe the variance structure inside each dataset, that is, how the variables in each group relate to each other. The cross-covariance matrix $\Sigma_{ab} = \tfrac{1}{n} X_a^\top X_b$ captures how variables in the first dataset co-vary with variables in the second, and this is the matrix that CCA ultimately exploits. Explore all three below.
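These definitions translate directly into NumPy. A minimal sketch on synthetic data (the sizes are arbitrary, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 1000, 3, 2  # illustrative sizes

# Paired observations: row i of Xa and row i of Xb come from the same trial.
Xa = rng.standard_normal((n, p))
Xb = rng.standard_normal((n, q))

# Center each column (CCA assumes centered data).
Xa = Xa - Xa.mean(axis=0)
Xb = Xb - Xb.mean(axis=0)

# Within-dataset covariances and the cross-covariance.
Saa = Xa.T @ Xa / n   # (p, p)
Sbb = Xb.T @ Xb / n   # (q, q)
Sab = Xa.T @ Xb / n   # (p, q)
```

Note that the within-dataset covariances are symmetric, while the cross-covariance is in general a rectangular $p \times q$ matrix.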

The within-dataset covariance matrices (Σ_aa, Σ_bb) and the cross-covariance matrix (Σ_ab). Hover a cell to see the corresponding variable pair in the scatter plot.

The objective

We want our projections to be maximally correlated. Writing this out, the correlation between $X_a w_a$ and $X_b w_b$ is:

$$\rho = \operatorname{corr}(X_a w_a,\; X_b w_b) = \frac{w_a^\top \Sigma_{ab}\, w_b}{\sqrt{w_a^\top \Sigma_{aa}\, w_a}\;\sqrt{w_b^\top \Sigma_{bb}\, w_b}} \tag{1}$$
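Equation (1) is nothing more than Pearson correlation applied to the two projections, which is easy to sanity-check numerically (random data and arbitrary weight vectors, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 500, 4, 3  # illustrative sizes
Xa = rng.standard_normal((n, p)); Xa -= Xa.mean(axis=0)
Xb = rng.standard_normal((n, q)); Xb -= Xb.mean(axis=0)
wa = rng.standard_normal(p)  # arbitrary weight vectors
wb = rng.standard_normal(q)

Saa = Xa.T @ Xa / n
Sbb = Xb.T @ Xb / n
Sab = Xa.T @ Xb / n

# Equation (1), written with the covariance matrices...
rho_formula = (wa @ Sab @ wb) / np.sqrt((wa @ Saa @ wa) * (wb @ Sbb @ wb))

# ...equals the plain Pearson correlation of the two projected series.
rho_direct = np.corrcoef(Xa @ wa, Xb @ wb)[0, 1]

assert np.isclose(rho_formula, rho_direct)
```

The normalization factor (1/n versus 1/(n-1)) cancels between numerator and denominator, which is why the two computations agree exactly.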

Notice something about this formula: if you double $w_a$, the numerator doubles, but so does the denominator. Correlation is insensitive to scale; it only depends on direction. This means the optimization problem as stated has infinitely many equivalent solutions (any scalar multiple of an optimum is also an optimum). To pin things down, we fix the variance of each projection to one. (This is the same trick used in PCA, but with a twist: in PCA you constrain the weight vector to unit norm, while here we constrain the projected variance, which accounts for the covariance structure of the data. When the data are correlated, a unit-norm weight vector can still produce a projection with wildly different variance depending on its direction, so the PCA-style constraint would not standardize the denominator.) The constraints are:

$$w_a^\top \Sigma_{aa}\, w_a = 1, \qquad w_b^\top \Sigma_{bb}\, w_b = 1 \tag{2}$$

With unit-variance projections, the denominator in equation (1) becomes one, and maximizing the correlation simplifies to maximizing just the numerator:

$$\max_{w_a,\, w_b}\; w_a^\top \Sigma_{ab}\, w_b \qquad \text{s.t.}\quad w_a^\top \Sigma_{aa}\, w_a = 1,\;\; w_b^\top \Sigma_{bb}\, w_b = 1 \tag{3}$$

Solving the objective

This is a constrained optimization problem: maximize an objective subject to two equality constraints. The standard approach is to introduce Lagrange multipliers, one for each constraint, and look for stationary points of the combined expression. The Lagrangian is (the factor of $\tfrac{1}{2}$ in front of each multiplier is a convenience that simplifies the derivatives without changing the solution; this derivation follows Gundersen [4] and Borga [6]):

$$\mathcal{L} = w_a^\top \Sigma_{ab}\, w_b \;-\; \frac{\lambda_1}{2}\left(w_a^\top \Sigma_{aa}\, w_a - 1\right) \;-\; \frac{\lambda_2}{2}\left(w_b^\top \Sigma_{bb}\, w_b - 1\right) \tag{4}$$

At a maximum, the gradient of the Lagrangian with respect to each weight vector must vanish. Differentiating and setting the results to zero gives us two conditions, one for each dataset:

$$\frac{\partial \mathcal{L}}{\partial w_a} = \Sigma_{ab}\, w_b - \lambda_1\, \Sigma_{aa}\, w_a = 0 \tag{5}$$

$$\frac{\partial \mathcal{L}}{\partial w_b} = \Sigma_{ba}\, w_a - \lambda_2\, \Sigma_{bb}\, w_b = 0 \tag{6}$$

where $\Sigma_{ba} = \Sigma_{ab}^\top$. Each equation says the same thing: the gradient from the cross-covariance (pulling toward the other dataset's structure) is balanced by the gradient from the within-dataset covariance (enforcing the constraint). To see what the multipliers are, left-multiply equation (5) by $w_a^\top$ and equation (6) by $w_b^\top$:

$$w_a^\top \Sigma_{ab}\, w_b = \lambda_1\, w_a^\top \Sigma_{aa}\, w_a = \lambda_1$$

$$w_b^\top \Sigma_{ba}\, w_a = \lambda_2\, w_b^\top \Sigma_{bb}\, w_b = \lambda_2$$

The left-hand sides of these two expressions are scalars, and they are transposes of each other, so they must be equal. This forces $\lambda_1 = \lambda_2$. And since the constraints set each projected variance to one, both multipliers equal the objective value itself: $\lambda_1 = \lambda_2 = \rho$, the canonical correlation. The Lagrange multipliers, introduced as bookkeeping devices to enforce the constraints, turn out to have a direct interpretation: they equal the quantity we are maximizing.

The eigenvalue problem

Equations (5) and (6) couple $w_a$ and $w_b$ through the covariance matrices. We can eliminate one unknown by solving equation (6) for $w_b$ (using $\lambda_2 = \rho$):

$$w_b = \frac{1}{\rho}\, \Sigma_{bb}^{-1}\, \Sigma_{ba}\, w_a \tag{7}$$

Substituting this into equation (5) eliminates $w_b$ entirely:

$$\Sigma_{ab}\left(\frac{1}{\rho}\, \Sigma_{bb}^{-1}\, \Sigma_{ba}\, w_a\right) = \rho\, \Sigma_{aa}\, w_a$$

Rearranging and left-multiplying by $\Sigma_{aa}^{-1}$ gives an eigenvalue equation in $w_a$ alone:

$$\Sigma_{aa}^{-1}\, \Sigma_{ab}\, \Sigma_{bb}^{-1}\, \Sigma_{ba}\, w_a = \rho^2\, w_a \tag{8}$$

This is a standard eigenvalue problem. The matrix $\Sigma_{aa}^{-1} \Sigma_{ab} \Sigma_{bb}^{-1} \Sigma_{ba}$ encodes the entire relationship between the two datasets, filtered through their individual covariance structures. Its eigenvalues are $\rho^2$, so the canonical correlations are the square roots: $\rho_1 \geq \rho_2 \geq \cdots$. The corresponding eigenvectors are the canonical weight vectors, the weight "recipes" from the setup. The number of nonzero canonical correlations is at most $\min(p, q)$, the smaller of the two datasets' dimensionalities: you can't find more shared directions than the lower-dimensional space has room for.
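A quick numerical check of equation (8) is worthwhile. The sketch below builds synthetic data with genuine shared structure (a common latent signal observed through different random linear maps, an illustrative construction), solves the eigenproblem, recovers $w_b$ via equation (7), and verifies the stationarity condition (5):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, q = 2000, 3, 2

# Both datasets observe a common latent signal z through different
# random linear maps, plus independent noise.
z = rng.standard_normal((n, 1))
Xa = z @ rng.standard_normal((1, p)) + 0.5 * rng.standard_normal((n, p))
Xb = z @ rng.standard_normal((1, q)) + 0.5 * rng.standard_normal((n, q))
Xa -= Xa.mean(axis=0); Xb -= Xb.mean(axis=0)

Saa = Xa.T @ Xa / n
Sbb = Xb.T @ Xb / n
Sab = Xa.T @ Xb / n

# Equation (8): eigenvalues of this matrix are the squared canonical correlations.
M = np.linalg.inv(Saa) @ Sab @ np.linalg.inv(Sbb) @ Sab.T
eigvals, eigvecs = np.linalg.eig(M)
top = np.argmax(eigvals.real)
rho = np.sqrt(eigvals.real[top])
wa = eigvecs[:, top].real

# Recover w_b from equation (7)...
wb = np.linalg.inv(Sbb) @ Sab.T @ wa / rho

# ...then check stationarity condition (5) and the achieved correlation.
assert np.allclose(Sab @ wb, rho * Saa @ wa)
assert np.isclose(np.corrcoef(Xa @ wa, Xb @ wb)[0, 1], rho)
```

Note that the achieved correlation comes out as $+\rho$ regardless of the sign of the eigenvector, because flipping $w_a$ flips $w_b$ along with it through equation (7).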

Geometric interpretation

The eigenvalue equation solves the problem, but it does not explain much about what CCA is doing geometrically. There is another way to derive the same result that makes the geometry explicit and also leads to a better algorithm.

Start with a question: why is CCA harder than ordinary correlation? Because correlation between scalar variables is just the cosine of an angle. But our data live in spaces with their own internal structure. The variables within each dataset may be correlated with each other, stretched along some directions and compressed along others. This internal structure distorts the geometry, making it hard to compare directions across the two spaces. (This geometric view connects CCA to the Procrustes problem and to methods for aligning representational spaces in neuroscience and machine learning.)

The fix is to remove that internal structure first. If we whiten each dataset (transform it so that its covariance becomes the identity matrix), then all directions within each space become equivalent. Variance is the same in every direction, and within-dataset correlations vanish. In these whitened coordinates, finding the most correlated directions between the two datasets reduces to the singular value decomposition (SVD) of the whitened cross-covariance matrix $\Sigma_{aa}^{-1/2}\, \Sigma_{ab}\, \Sigma_{bb}^{-1/2}$. The singular values of this matrix are the canonical correlations, and the left and right singular vectors give the canonical directions in whitened space. The SVD approach is also preferred in practice because the eigenvalue formulation requires inverting covariance matrices and multiplying them together, which amplifies numerical errors when the matrices are ill-conditioned [3].
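The whitening claim and the equivalence with the eigenvalue route can both be verified numerically. A sketch, again on synthetic data with a shared latent signal (an illustrative construction):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 5000, 3
z = rng.standard_normal((n, 1))
Xa = z @ rng.standard_normal((1, p)) + rng.standard_normal((n, p))
Xb = z @ rng.standard_normal((1, p)) + rng.standard_normal((n, p))
Xa -= Xa.mean(axis=0); Xb -= Xb.mean(axis=0)

Saa, Sbb = Xa.T @ Xa / n, Xb.T @ Xb / n
Sab = Xa.T @ Xb / n

def inv_sqrt(S):
    # Inverse matrix square root of a symmetric positive-definite matrix.
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

# Whitening makes each dataset's covariance the identity.
Za = Xa @ inv_sqrt(Saa)
assert np.allclose(Za.T @ Za / n, np.eye(p))

# Singular values of the whitened cross-covariance are the canonical correlations...
s = np.linalg.svd(inv_sqrt(Saa) @ Sab @ inv_sqrt(Sbb), compute_uv=False)

# ...and they match the square roots of the eigenvalues from equation (8).
M = np.linalg.inv(Saa) @ Sab @ np.linalg.inv(Sbb) @ Sab.T
eig = np.sort(np.linalg.eigvals(M).real)[::-1]
assert np.allclose(s, np.sqrt(eig))
```

The two routes agree because the whitened cross-covariance and the matrix in equation (8) are related by a similarity transform, so squared singular values of one are eigenvalues of the other.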

The figure below shows this three-step pipeline. Start with the raw data and its covariance ellipse. Whiten each dataset to make the ellipse circular. Then read off the canonical directions from the SVD of the whitened cross-covariance. Each subsequent canonical pair captures the next strongest relationship, orthogonal to all the ones before it.

The whitening-then-SVD pipeline: (1) raw data with its covariance ellipse, (2) whitened data with unit covariance, (3) canonical directions from the SVD of the whitened cross-covariance.

Try it yourself

The figure below lets you play the role of the optimizer. Drag the direction arrows on each scatter plot to choose your own projection directions, and watch how the correlation between the projected values changes in real time. The CCA solution is shown as dashed lines. Try aligning with the principal axis of one dataset and see what happens. The CCA solution often points in a direction that looks wrong for one dataset considered alone, because it is jointly optimizing over both. This is harder than it looks: you're optimizing over two directions simultaneously, and a good direction for one dataset might be terrible once you pair it with the wrong direction in the other. This is why we need the machinery above rather than guessing.

Drag the direction arrows on each scatter plot to change the projection direction. The right panel shows the projected values and their correlation. The dashed lines show the CCA-optimal directions.

When CCA misleads

CCA always finds something. It optimizes over all possible projection directions in both datasets, and in a high-dimensional space there will always be some pair of directions with nonzero correlation, even when the two datasets are completely independent. When the number of variables approaches the number of observations, this problem becomes severe: the canonical correlations can be close to one even in pure noise. The distribution of canonical correlations between independent Gaussian matrices is well studied; as $\min(p,q)/n \to 1$, the largest canonical correlation converges to one [8].

The figure below generates independent data (no shared structure) and runs CCA. Start by keeping $p$ and $q$ small relative to $n$. Then drag them up. Watch how the canonical correlations inflate even though the data have nothing in common.

CCA on independent Gaussian data. Drag p and q toward n to see spurious canonical correlations appear.

This is not a theoretical curiosity. In neural data, 200 neurons recorded over 50 trials puts you deep in the danger zone. How do you know which canonical correlations are real?
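The effect is easy to reproduce. A sketch that runs CCA on completely independent Gaussian data while growing the number of variables toward the number of observations (`first_cc` is an illustrative helper, not a library function):

```python
import numpy as np

def first_cc(X, Y):
    # First canonical correlation via the whitened-SVD route.
    n = X.shape[0]
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    def inv_sqrt(S):
        vals, vecs = np.linalg.eigh(S)
        return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    Sxx, Syy, Sxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n
    return np.linalg.svd(inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy), compute_uv=False)[0]

rng = np.random.default_rng(4)
n = 100
results = {}
for p in (2, 20, 80):
    # Independent data: any correlation CCA finds here is pure overfitting.
    results[p] = first_cc(rng.standard_normal((n, p)), rng.standard_normal((n, p)))
    print(f"p = q = {p:2d}, n = {n}: rho_1 = {results[p]:.2f}")
```

In the extreme case there is even a clean geometric explanation: with $p = q = 80$ and $n = 100$, the two 80-dimensional column spaces living in a 100-dimensional space must intersect, so the first canonical correlation is exactly one despite the datasets being independent.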

One approach is a permutation test. Shuffle the row pairing between the two datasets to destroy any true relationship, run CCA on the shuffled data, and record the first canonical correlation. Repeat many times to build a null distribution. If a real canonical correlation falls above the 95th percentile of this null, it is unlikely to have arisen from noise alone. Parametric alternatives exist (Bartlett's test uses a chi-squared approximation), but the permutation approach is more general: it makes no distributional assumptions and works for any sample size.
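The permutation test takes only a few lines. A sketch on synthetic data with a genuine shared latent signal (`first_cc` is the same illustrative helper as before; 200 permutations is for brevity, not a recommendation):

```python
import numpy as np

def first_cc(X, Y):
    # First canonical correlation via the whitened-SVD route.
    n = X.shape[0]
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    def inv_sqrt(S):
        vals, vecs = np.linalg.eigh(S)
        return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    Sxx, Syy, Sxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n
    return np.linalg.svd(inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy), compute_uv=False)[0]

rng = np.random.default_rng(5)
n, p, q = 100, 5, 5
z = rng.standard_normal((n, 1))                      # shared latent signal
X = z @ rng.standard_normal((1, p)) + rng.standard_normal((n, p))
Y = z @ rng.standard_normal((1, q)) + rng.standard_normal((n, q))

rho1 = first_cc(X, Y)

# Null distribution: shuffling the row pairing destroys any true relationship
# while preserving each dataset's internal covariance structure.
null = np.array([first_cc(X[rng.permutation(n)], Y) for _ in range(200)])

threshold = np.quantile(null, 0.95)
p_value = (1 + np.sum(null >= rho1)) / (1 + len(null))
print(f"rho_1 = {rho1:.2f}, null 95th percentile = {threshold:.2f}, p ~ {p_value:.3f}")
```

The key detail is that shuffling only breaks the pairing between rows; each dataset's own covariance is untouched, so the null distribution reflects the overfitting behavior of CCA at exactly this $n$, $p$, and $q$.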

(Figure settings: n = 100, p = 5, q = 5, true correlations 0.85 and 0.6.)
Permutation test for CCA significance. The histogram shows the null distribution of the first canonical correlation from shuffled data. Vertical lines mark the real canonical correlations.

Beyond permutation testing, you can address the problem structurally: regularization (shrinkage covariance estimates that pull the eigenvalues away from zero) or reduced-rank approaches that only estimate the top few canonical directions.
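A common form of the shrinkage fix is ridge-regularized CCA: add $\gamma I$ to each within-dataset covariance before whitening. The sketch below is illustrative, not a tuned implementation; `cca_corrs_ridge` is a hypothetical helper and `gamma` is a hyperparameter you would choose by cross-validation:

```python
import numpy as np

def cca_corrs_ridge(X, Y, gamma=0.1):
    # Canonical correlations with ridge-shrunk covariance estimates.
    # gamma = 0 recovers plain CCA; larger gamma damps spurious correlations.
    n, p = X.shape
    q = Y.shape[1]
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Sxx = X.T @ X / n + gamma * np.eye(p)
    Syy = Y.T @ Y / n + gamma * np.eye(q)
    Sxy = X.T @ Y / n
    def inv_sqrt(S):
        vals, vecs = np.linalg.eigh(S)
        return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    return np.linalg.svd(inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy), compute_uv=False)

rng = np.random.default_rng(6)
n, p = 50, 40  # many variables relative to n: plain CCA overfits badly here
X = rng.standard_normal((n, p))
Y = rng.standard_normal((n, p))

rho_plain = cca_corrs_ridge(X, Y, gamma=0.0)[0]
rho_ridge = cca_corrs_ridge(X, Y, gamma=0.5)[0]
print(f"gamma=0.0: rho_1 = {rho_plain:.2f}   gamma=0.5: rho_1 = {rho_ridge:.2f}")
```

On this independent data, the unregularized first canonical correlation sits at one (here $p + q > n$, so the column spaces must intersect), while the ridge term pulls it back down.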

Implementation

The geometric perspective translates directly into code: center, whiten, SVD. Here is an implementation. It assumes $n > \max(p, q)$ so that the covariance matrices are full rank; when you have more variables than observations, use regularized covariance estimates (e.g., shrinkage) [3]. For a practical overview of dimensionality reduction methods for neural data, see Mineault [5].

import numpy as np

def cca(X, Y):
    """Canonical Correlation Analysis via whitened SVD.

    Parameters
    ----------
    X : array, shape (n, p)
    Y : array, shape (n, q)

    Returns
    -------
    corrs : canonical correlations
    W_x : canonical weights for X
    W_y : canonical weights for Y
    """
    n = X.shape[0]

    # Center
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)

    # Covariance matrices (1/n here; 1/(n-1) also works, since the factor cancels in the correlations)
    Sxx = X.T @ X / n
    Syy = Y.T @ Y / n
    Sxy = X.T @ Y / n

    # Whiten
    Ux, sx, _ = np.linalg.svd(Sxx)
    Uy, sy, _ = np.linalg.svd(Syy)
    Sxx_inv_half = Ux @ np.diag(1 / np.sqrt(sx)) @ Ux.T
    Syy_inv_half = Uy @ np.diag(1 / np.sqrt(sy)) @ Uy.T

    # SVD of whitened cross-covariance
    M = Sxx_inv_half @ Sxy @ Syy_inv_half
    U, s, Vt = np.linalg.svd(M, full_matrices=False)

    W_x = Sxx_inv_half @ U
    W_y = Syy_inv_half @ Vt.T

    return s, W_x, W_y

The singular values of $M$ are the canonical correlations, and the left and right singular vectors (after un-whitening) give the canonical weight vectors for each dataset.
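To see the function produce sensible output, here is a quick end-to-end check. The function is restated compactly so the snippet runs standalone (with `eigh` in place of `svd` for the inverse square roots, which is equivalent for symmetric positive-definite matrices); the data are synthetic, with two shared latent signals:

```python
import numpy as np

def cca(X, Y):
    # Compact restatement of cca() above (eigh-based inverse square roots).
    n = X.shape[0]
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    def inv_sqrt(S):
        vals, vecs = np.linalg.eigh(S)
        return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    Cx, Cy = inv_sqrt(X.T @ X / n), inv_sqrt(Y.T @ Y / n)
    U, s, Vt = np.linalg.svd(Cx @ (X.T @ Y / n) @ Cy, full_matrices=False)
    return s, Cx @ U, Cy @ Vt.T

# Two datasets observing the same two latent signals through different mixings.
rng = np.random.default_rng(7)
n = 2000
Z = rng.standard_normal((n, 2))
X = Z @ rng.standard_normal((2, 6)) + 0.5 * rng.standard_normal((n, 6))
Y = Z @ rng.standard_normal((2, 4)) + 0.5 * rng.standard_normal((n, 4))

corrs, W_x, W_y = cca(X, Y)
print(np.round(corrs, 2))  # two large correlations (shared latents), rest small

# Each pair of canonical projections is unit-variance and correlated by corrs[k].
Xc = (X - X.mean(axis=0)) @ W_x
Yc = (Y - Y.mean(axis=0)) @ W_y
assert np.allclose(Xc.var(axis=0), 1.0)
for k in range(len(corrs)):
    assert np.isclose(np.corrcoef(Xc[:, k], Yc[:, k])[0, 1], corrs[k])
```

The assertions verify exactly the two properties derived above: each projection satisfies the unit-variance constraint (2), and the $k$-th pair of projections achieves correlation $\rho_k$.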

Assumptions and limitations

What CCA gives you

For each canonical pair, CCA returns a weight vector in each dataset and a correlation measuring how well the two projections track each other. These come in a ranked sequence, $\rho_1 \geq \rho_2 \geq \cdots$, measuring how strongly the two datasets co-vary along each successive direction. The whitening-then-SVD pipeline cleanly separates the within-dataset structure from the between-dataset relationship: all the shared linear information lives in the SVD of the whitened cross-covariance.

Assumptions

Paired observations. Both datasets must have the same number of rows, and row $i$ in one must correspond to row $i$ in the other. CCA cannot handle unpaired or differently-sized datasets.

Linearity. CCA finds linear projections. If the shared structure between your datasets is nonlinear, CCA will miss it entirely or capture only its linear component.

More observations than variables. The basic method requires $n > \max(p, q)$ so the covariance matrices are invertible. When this fails, you need regularization or dimensionality reduction as a preprocessing step.

Sensitivity to outliers. Both the covariance estimates and the linear projections are affected by outliers, as with any method built on second-order statistics.

Limitations

CCA captures only linear shared structure. Kernel CCA and deep CCA [10] extend the idea to nonlinear mappings, at the cost of interpretability and additional hyperparameters.

CCA finds the shared part but does not model what is unique to each dataset. Probabilistic CCA [9] gives a generative model where both datasets are generated from a shared latent variable plus independent noise, but the "private" component is just isotropic Gaussian noise, not a structured private latent space.

Canonical directions can be hard to interpret in high dimensions. A weight vector with 200 entries does not point to a single neuron or a single feature.

These limitations point directly to the methods I will cover in future posts: nonlinear extensions, probabilistic formulations, and approaches that relax the pairing requirement. CCA is the foundation, and knowing where it breaks helps you know when to reach for something stronger.

CCA and its neighbors

The same ingredients (whitening, SVD, cross-covariance) recombine to give different methods depending on what you optimize and how you constrain the solution. The table below shows how CCA relates to three methods you are likely to encounter alongside it.

| Method | Objective | Constraint | Relationship to CCA |
| --- | --- | --- | --- |
| PCA | Variance | Unit-norm weights | Single-dataset analogue: SVD of one covariance instead of the cross-covariance between two |
| CCA | Correlation | Unit projected variance | (this post) |
| PLS | Covariance | Unit-norm weights | No whitening, so high-variance directions dominate |
| RRR | Prediction error | Rank constraint | Asymmetric: predicts Y from X |

The choice between correlation (CCA), covariance (PLS), and prediction (RRR) depends on what you care about. CCA gives you paired linear directions with temporal correspondence, which is exactly what cross-animal alignment needs. The next post will address what happens when you want to relax one or more of those constraints: nonlinear relationships, multiple datasets simultaneously, or the absence of trial-matched observations.

References

  1. H. Hotelling, "Relations between two sets of variates," Biometrika, vol. 28, no. 3/4, pp. 321–377, 1936.
  2. X. Zhuang, Z. Yang, and D. Cordes, "A technical review of canonical correlation analysis for neuroscience applications," Human Brain Mapping, vol. 41, no. 13, pp. 3807–3833, 2020.
  3. D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor, "Canonical correlation analysis: An overview with application to learning methods," Neural Computation, vol. 16, no. 12, pp. 2639–2664, 2004.
  4. G. Gundersen, "Canonical Correlation Analysis", 2018.
  5. P. Mineault, "Dimensionality reduction in neural data analysis", 2021.
  6. M. Borga, "Canonical correlation: a tutorial", 2001.
  7. M. Safaie, J. C. Chang, et al., "Preserved neural dynamics across animals performing similar behaviour," Nature, vol. 623, pp. 765–771, 2023.
  8. K. W. Wachter, "The limiting empirical measure of multiple discriminant ratios," Annals of Statistics, vol. 8, no. 5, pp. 937–957, 1980.
  9. F. R. Bach and M. I. Jordan, "A probabilistic interpretation of canonical correlation analysis," Technical Report 688, Dept. of Statistics, UC Berkeley, 2005.
  10. G. Andrew, R. Arora, J. Bilmes, and K. Livescu, "Deep canonical correlation analysis," Proceedings of the 30th International Conference on Machine Learning (ICML), 2013.