Canonical Correlation Analysis
Finding shared structure between two high-dimensional datasets, derived from scratch with interactive figures.
Introduction
Throughout my PhD I have been developing methods for aligning neural population activity across subjects. Different animals performing the same task give you different neurons, different numbers of neurons, and different relationships between those neurons. But if the task is the same, there should be some shared structure in the population dynamics, and finding it requires a way to compare high-dimensional recordings that don't share a common coordinate system. Canonical correlation analysis, or CCA [1], is one of the earliest and cleanest solutions to this problem, and the ideas show up in nearly everything that came after.

CCA was introduced by Harold Hotelling in 1936 [1]. It generalizes the familiar Pearson correlation between two scalar variables to the setting where each "variable" is an entire multidimensional dataset. For a neuroscience-oriented review, see Zhuang et al. [2].
Here is a concrete example. You've recorded neural populations from two monkeys performing the same reaching task. Each gives you a matrix of activity, dozens of neurons over thousands of timepoints. You could correlate individual neurons across recordings, one pair at a time, but neural populations encode information in patterns of activity, and no single neuron tells the full story. What you really want is to find weighted combinations of neurons in each recording that move together. CCA does this: it finds the linear projections of each dataset that are maximally correlated with each other. Safaie, Chang et al. [7] used CCA in exactly this setting, aligning low-dimensional neural trajectories across monkeys and across mice to show that the underlying dynamics are preserved.
The idea generalizes well beyond neuroscience (gene expression paired with protein levels, audio paired with video features), but the cross-animal alignment setting is where CCA becomes most intuitive to me, and it is the lens I'll use throughout. This is the first in a series of posts working through my PhD notes. CCA felt like the right place to start: it is simple enough to derive from scratch in a single sitting, but the ideas connect directly to the harder problems I'll cover later.
Setup and notation
Stack your observations into two matrices: $X \in \mathbb{R}^{n \times p}$ holds $n$ observations of $p$ variables from the first dataset, and $Y \in \mathbb{R}^{n \times q}$ holds the same $n$ observations of $q$ variables from the second. The rows are paired: row $i$ in both matrices corresponds to the same trial, the same timepoint, or whatever links your two measurements together.

We assume the data are centered (column means subtracted). This is standard practice and ensures the covariance matrices are computed correctly. Most implementations handle this for you.
CCA looks for weight vectors $w_x \in \mathbb{R}^p$ and $w_y \in \mathbb{R}^q$ that define one-dimensional projections of each dataset, $z_x = X w_x$ and $z_y = Y w_y$, such that these projections are as correlated as possible. Think of each weight vector as a "recipe" for combining variables: how much of neuron 1, how much of neuron 2, and so on.
Three matrices govern the problem. The within-dataset covariance matrices $\Sigma_{xx} = \tfrac{1}{n} X^\top X$ and $\Sigma_{yy} = \tfrac{1}{n} Y^\top Y$ describe the variance structure inside each dataset, that is, how the variables in each group relate to each other. The cross-covariance matrix $\Sigma_{xy} = \tfrac{1}{n} X^\top Y$ captures how variables in the first dataset co-vary with variables in the second, and this is the matrix that CCA ultimately exploits. Explore all three below.
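To make the shapes concrete, here is a minimal sketch computing all three matrices (the synthetic shared-latent-signal construction is just an illustration, not part of the method):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 1000, 3, 2

# Synthetic paired data: a shared latent signal plus independent noise
z = rng.standard_normal((n, 1))
X = z @ rng.standard_normal((1, p)) + 0.5 * rng.standard_normal((n, p))
Y = z @ rng.standard_normal((1, q)) + 0.5 * rng.standard_normal((n, q))

# Center, then form the three covariance matrices
X = X - X.mean(axis=0)
Y = Y - Y.mean(axis=0)
Sxx = X.T @ X / n  # (p, p) within-dataset covariance
Syy = Y.T @ Y / n  # (q, q) within-dataset covariance
Sxy = X.T @ Y / n  # (p, q) cross-covariance
print(Sxx.shape, Syy.shape, Sxy.shape)  # (3, 3) (2, 2) (3, 2)
```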
The objective
We want our projections to be maximally correlated. Writing this out, the correlation between $z_x = X w_x$ and $z_y = Y w_y$ is:

$$\rho = \frac{w_x^\top \Sigma_{xy} w_y}{\sqrt{w_x^\top \Sigma_{xx} w_x}\,\sqrt{w_y^\top \Sigma_{yy} w_y}} \tag{1}$$
Notice something about this formula: if you double $w_x$, the numerator doubles, but so does the denominator. Correlation is insensitive to scale; it only depends on direction. This means the optimization problem as stated has infinitely many equivalent solutions (any scalar multiple of an optimum is also an optimum). To pin things down, we fix the variance of each projection to one:

$$w_x^\top \Sigma_{xx} w_x = 1 \tag{2}$$

$$w_y^\top \Sigma_{yy} w_y = 1 \tag{3}$$

This is the same trick used in PCA, but with a twist. In PCA you constrain the weight vector to unit norm. Here we constrain the projected variance, which accounts for the covariance structure of the data. When the data are correlated, a unit-norm weight vector can still produce a projection with wildly different variance depending on its direction, so the PCA-style constraint would not standardize the denominator.
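The scale-invariance is easy to check numerically. A toy sketch with random data and arbitrary weight vectors (nothing here is optimized; the point is only that rescaling the weights leaves the correlation untouched):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 4))
Y = X @ rng.standard_normal((4, 3)) + rng.standard_normal((500, 3))
X = X - X.mean(axis=0)
Y = Y - Y.mean(axis=0)

def proj_corr(wx, wy):
    """Correlation between the two one-dimensional projections."""
    zx, zy = X @ wx, Y @ wy
    return (zx @ zy) / (np.linalg.norm(zx) * np.linalg.norm(zy))

w_x = rng.standard_normal(4)
w_y = rng.standard_normal(3)
r1 = proj_corr(w_x, w_y)
r2 = proj_corr(3.0 * w_x, 0.5 * w_y)  # rescale both weight vectors
print(np.isclose(r1, r2))  # True: correlation depends only on direction
```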
With unit-variance projections, the denominator in equation (1) becomes one, and maximizing the correlation simplifies to maximizing just the numerator:

$$\max_{w_x,\, w_y} \; w_x^\top \Sigma_{xy} w_y \quad \text{subject to} \quad w_x^\top \Sigma_{xx} w_x = w_y^\top \Sigma_{yy} w_y = 1 \tag{4}$$
Solving the objective
This is a constrained optimization problem: maximize an objective subject to two equality constraints. The standard approach is to introduce Lagrange multipliers, one for each constraint, and look for stationary points of the combined expression. The Lagrangian is:

$$\mathcal{L}(w_x, w_y, \lambda_x, \lambda_y) = w_x^\top \Sigma_{xy} w_y - \frac{\lambda_x}{2}\left(w_x^\top \Sigma_{xx} w_x - 1\right) - \frac{\lambda_y}{2}\left(w_y^\top \Sigma_{yy} w_y - 1\right)$$

The factor of $\tfrac{1}{2}$ in front of each multiplier is a convenience that simplifies the derivatives without changing the solution. This derivation follows Gundersen [4] and Borga [6].
At a maximum, the gradient of the Lagrangian with respect to each weight vector must vanish. Differentiating and setting the results to zero gives us two conditions, one for each dataset:

$$\Sigma_{xy} w_y = \lambda_x \Sigma_{xx} w_x \tag{5}$$

$$\Sigma_{yx} w_x = \lambda_y \Sigma_{yy} w_y \tag{6}$$
where $\Sigma_{yx} = \Sigma_{xy}^\top$. Each equation says the same thing: the gradient from the cross-covariance (pulling toward the other dataset's structure) is balanced by the gradient from the within-dataset covariance (enforcing the constraint). To see what the multipliers are, left-multiply equation (5) by $w_x^\top$ and equation (6) by $w_y^\top$:

$$w_x^\top \Sigma_{xy} w_y = \lambda_x \, w_x^\top \Sigma_{xx} w_x = \lambda_x$$

$$w_y^\top \Sigma_{yx} w_x = \lambda_y \, w_y^\top \Sigma_{yy} w_y = \lambda_y$$
The left-hand sides of these two expressions are scalars, and they are transposes of each other, so they must be equal. This forces $\lambda_x = \lambda_y$; call the common value $\lambda$. And since the constraints set each projected variance to one, both multipliers equal the objective value itself: $\lambda = w_x^\top \Sigma_{xy} w_y = \rho$, the canonical correlation.

The Lagrange multipliers, introduced as bookkeeping devices to enforce the constraints, have a direct interpretation: they equal the canonical correlation, the quantity we are maximizing.
The eigenvalue problem
Equations (5) and (6) couple $w_x$ and $w_y$ through the covariance matrices. We can eliminate one unknown by solving equation (6) for $w_y$:

$$w_y = \frac{1}{\lambda} \Sigma_{yy}^{-1} \Sigma_{yx} w_x$$
Substituting this into equation (5) eliminates $w_y$ entirely:

$$\frac{1}{\lambda} \Sigma_{xy} \Sigma_{yy}^{-1} \Sigma_{yx} w_x = \lambda \Sigma_{xx} w_x$$
Rearranging and left-multiplying by $\Sigma_{xx}^{-1}$ gives an eigenvalue equation in $w_x$ alone:

$$\Sigma_{xx}^{-1} \Sigma_{xy} \Sigma_{yy}^{-1} \Sigma_{yx} w_x = \lambda^2 w_x$$
This is a standard eigenvalue problem. The matrix $\Sigma_{xx}^{-1} \Sigma_{xy} \Sigma_{yy}^{-1} \Sigma_{yx}$ encodes the entire relationship between the two datasets, filtered through their individual covariance structures. Its eigenvalues are the squared canonical correlations $\rho_i^2$, so the canonical correlations are their square roots. The corresponding eigenvectors are the canonical weight vectors, the weight "recipes" from the setup.

The number of nonzero canonical correlations is at most $\min(p, q)$, the smaller of the two datasets' dimensionalities. You can't find more shared directions than the lower-dimensional space has room for.
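As a sketch of this route (synthetic data with two shared latent dimensions; the construction is assumed for illustration), the canonical correlations fall out of a single eigendecomposition:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, q = 2000, 4, 3
z = rng.standard_normal((n, 2))  # two shared latent signals
X = z @ rng.standard_normal((2, p)) + rng.standard_normal((n, p))
Y = z @ rng.standard_normal((2, q)) + rng.standard_normal((n, q))
X = X - X.mean(axis=0)
Y = Y - Y.mean(axis=0)

Sxx, Syy, Sxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n
M = np.linalg.inv(Sxx) @ Sxy @ np.linalg.inv(Syy) @ Sxy.T
evals = np.linalg.eigvals(M).real  # eigenvalues are rho_i^2
rhos = np.sort(np.sqrt(np.clip(evals, 0, None)))[::-1][: min(p, q)]
print(rhos)  # canonical correlations, largest first
```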
Geometric interpretation
The eigenvalue equation solves the problem, but it does not explain much about what CCA is doing geometrically. There is another way to derive the same result that makes the geometry explicit and also leads to a better algorithm.
Start with a question: why is CCA harder than ordinary correlation? Because correlation between scalar variables is just the cosine of an angle. But our data live in spaces with their own internal structure. The variables within each dataset may be correlated with each other, stretched along some directions and compressed along others. This internal structure distorts the geometry, making it hard to compare directions across the two spaces.

This geometric view connects CCA to the Procrustes problem and to methods for aligning representational spaces in neuroscience and machine learning.
The fix is to remove that internal structure first. If we whiten each dataset (transform it so that its covariance becomes the identity matrix), then all directions within each space become equivalent. Variance is the same in every direction, and within-dataset correlations vanish. In these whitened coordinates, finding the most correlated directions between the two datasets reduces to the singular value decomposition (SVD) of the whitened cross-covariance matrix $\Sigma_{xx}^{-1/2} \Sigma_{xy} \Sigma_{yy}^{-1/2}$. The singular values of this matrix are the canonical correlations, and the left and right singular vectors give the canonical directions in whitened space.

The SVD approach is also preferred in practice because the eigenvalue formulation requires inverting covariance matrices and multiplying them together, which amplifies numerical errors when the matrices are ill-conditioned [3].
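A quick numerical check that the whitened-SVD route gives the same correlations as the eigenvalue route (toy data; `inv_sqrt` is a small helper defined here, not a library function):

```python
import numpy as np

def inv_sqrt(S):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

rng = np.random.default_rng(3)
n = 3000
z = rng.standard_normal((n, 1))
X = z @ rng.standard_normal((1, 3)) + rng.standard_normal((n, 3))
Y = z @ rng.standard_normal((1, 2)) + rng.standard_normal((n, 2))
X = X - X.mean(axis=0)
Y = Y - Y.mean(axis=0)
Sxx, Syy, Sxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n

# Route 1: singular values of the whitened cross-covariance
s = np.sort(np.linalg.svd(inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy), compute_uv=False))[::-1]

# Route 2: eigenvalues of Sxx^-1 Sxy Syy^-1 Syx are the squared correlations
M = np.linalg.inv(Sxx) @ Sxy @ np.linalg.inv(Syy) @ Sxy.T
evals = np.sort(np.linalg.eigvals(M).real)[::-1]

print(np.allclose(s ** 2, evals[: len(s)]))  # True: the two routes agree
```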
The figure below shows this three-step pipeline. Start with the raw data and its covariance ellipse. Whiten each dataset to make the ellipse circular. Then read off the canonical directions from the SVD of the whitened cross-covariance. Each subsequent canonical pair captures the next strongest relationship, orthogonal to all the ones before it.
(Figure: Step 1, raw data: each dataset has its own covariance structure.)
Try it yourself
The figure below lets you play the role of the optimizer. Drag the direction arrows on each scatter plot to choose your own projection directions, and watch how the correlation between the projected values changes in real time. The CCA solution is shown as dashed lines. Try aligning with the principal axis of one dataset and see what happens. The CCA solution often points in a direction that looks wrong for one dataset considered alone, because it is jointly optimizing over both.

This is harder than it looks. You're optimizing over two directions simultaneously: a good direction for one dataset might be terrible once you pair it with the wrong direction in the other. This is why we need the machinery above rather than guessing.
When CCA misleads
CCA always finds something. It optimizes over all possible projection directions in both datasets, and in a high-dimensional space, there will always be some pair of directions with nonzero correlation, even when the two datasets are completely independent. When the number of variables approaches the number of observations, this problem becomes severe: the canonical correlations can be close to one even in pure noise.

The distribution of canonical correlations between independent Gaussian matrices is well-studied. As the ratio of variables to observations approaches one, the largest canonical correlation converges to one [8].
The figure below generates independent data (no shared structure) and runs CCA. Start by keeping $p$ and $q$ small relative to $n$. Then drag them up. Watch how the canonical correlations inflate even though the data have nothing in common.
This is not a theoretical curiosity. In neural data, 200 neurons recorded over 50 trials puts you deep in the danger zone. How do you know which canonical correlations are real?
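The inflation itself is easy to reproduce in a few lines (independent Gaussian data; `first_cc` is a small helper sketching the first canonical correlation via the whitened SVD):

```python
import numpy as np

def first_cc(X, Y):
    """First canonical correlation via the whitened-SVD pipeline."""
    n = X.shape[0]
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    def inv_sqrt(S):
        vals, vecs = np.linalg.eigh(S)
        return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    M = inv_sqrt(X.T @ X / n) @ (X.T @ Y / n) @ inv_sqrt(Y.T @ Y / n)
    return np.linalg.svd(M, compute_uv=False)[0]

rng = np.random.default_rng(4)
n = 100
rhos = {}
for p in (2, 20, 45):
    # X and Y are completely independent: any correlation is pure noise
    rhos[p] = first_cc(rng.standard_normal((n, p)), rng.standard_normal((n, p)))
    print(f"n={n}, p=q={p}: first canonical correlation = {rhos[p]:.2f}")
```

With only 100 observations, pushing the dimensionality toward $n$ drives the first canonical correlation toward one even though the data share nothing.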
One approach is a permutation test. Shuffle the row pairing between the two datasets to destroy any true relationship, run CCA on the shuffled data, and record the first canonical correlation. Repeat many times to build a null distribution. If a real canonical correlation falls above the 95th percentile of this null, it is unlikely to have arisen from noise alone.

Parametric alternatives exist (Bartlett's test uses a chi-squared approximation), but the permutation approach is more general: it makes no distributional assumptions and works for any sample size.
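A minimal sketch of the permutation test (synthetic data with one shared latent signal; `first_cc` restates the whitened-SVD computation so the snippet is self-contained):

```python
import numpy as np

def first_cc(X, Y):
    """First canonical correlation via the whitened-SVD pipeline."""
    n = X.shape[0]
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    def inv_sqrt(S):
        vals, vecs = np.linalg.eigh(S)
        return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    M = inv_sqrt(X.T @ X / n) @ (X.T @ Y / n) @ inv_sqrt(Y.T @ Y / n)
    return np.linalg.svd(M, compute_uv=False)[0]

rng = np.random.default_rng(5)
n, p, q = 200, 10, 8
z = rng.standard_normal((n, 1))
X = z @ rng.standard_normal((1, p)) + rng.standard_normal((n, p))
Y = z @ rng.standard_normal((1, q)) + rng.standard_normal((n, q))

observed = first_cc(X, Y)

# Null distribution: shuffle the row pairing, destroying any true relationship
null = np.array([first_cc(X, Y[rng.permutation(n)]) for _ in range(200)])
p_value = (1 + np.sum(null >= observed)) / (1 + len(null))
print(f"observed rho_1 = {observed:.2f}, "
      f"null 95th pct = {np.percentile(null, 95):.2f}, p = {p_value:.3f}")
```

The `+1` in the p-value is the standard correction that counts the observed statistic as one member of the null, so the p-value can never be exactly zero.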
Beyond permutation testing, you can address the problem structurally: regularization (shrinkage covariance estimates that pull the eigenvalues away from zero) or reduced-rank approaches that only estimate the top few canonical directions.
Implementation
The geometric perspective translates directly into code: center, whiten, SVD. This implementation assumes $n > \max(p, q)$ so that the covariance matrices are full rank; when you have more variables than observations, use regularized covariance estimates (e.g., shrinkage) [3]. For a practical overview of dimensionality reduction methods for neural data, see Mineault [5]. Here is an implementation:
```python
import numpy as np


def cca(X, Y):
    """Canonical Correlation Analysis via whitened SVD.

    Parameters
    ----------
    X : array, shape (n, p)
    Y : array, shape (n, q)

    Returns
    -------
    corrs : canonical correlations
    W_x : canonical weights for X
    W_y : canonical weights for Y
    """
    n = X.shape[0]

    # Center
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)

    # Covariance matrices (1/n, not 1/(n-1); either works,
    # the choice doesn't affect the result)
    Sxx = X.T @ X / n
    Syy = Y.T @ Y / n
    Sxy = X.T @ Y / n

    # Whiten: inverse matrix square roots of the within-dataset covariances
    Ux, sx, _ = np.linalg.svd(Sxx)
    Uy, sy, _ = np.linalg.svd(Syy)
    Sxx_inv_half = Ux @ np.diag(1 / np.sqrt(sx)) @ Ux.T
    Syy_inv_half = Uy @ np.diag(1 / np.sqrt(sy)) @ Uy.T

    # SVD of whitened cross-covariance
    M = Sxx_inv_half @ Sxy @ Syy_inv_half
    U, s, Vt = np.linalg.svd(M, full_matrices=False)

    # Un-whiten to recover the canonical weight vectors
    W_x = Sxx_inv_half @ U
    W_y = Syy_inv_half @ Vt.T
    return s, W_x, W_y
```

The singular values of $M = \Sigma_{xx}^{-1/2} \Sigma_{xy} \Sigma_{yy}^{-1/2}$ are the canonical correlations, and the left and right singular vectors (after un-whitening) give the canonical weight vectors for each dataset.
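As a sanity check (restating the same center-whiten-SVD pipeline compactly on synthetic data, not introducing a new API), the projections should come out with unit variance and pairwise correlations equal to the returned singular values:

```python
import numpy as np

def cca(X, Y):
    # Compact restatement of the whitened-SVD pipeline
    n = X.shape[0]
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    def inv_sqrt(S):
        vals, vecs = np.linalg.eigh(S)
        return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    Wx_white, Wy_white = inv_sqrt(X.T @ X / n), inv_sqrt(Y.T @ Y / n)
    U, s, Vt = np.linalg.svd(Wx_white @ (X.T @ Y / n) @ Wy_white, full_matrices=False)
    return s, Wx_white @ U, Wy_white @ Vt.T

rng = np.random.default_rng(6)
n = 1000
z = rng.standard_normal((n, 2))
X = z @ rng.standard_normal((2, 5)) + rng.standard_normal((n, 5))
Y = z @ rng.standard_normal((2, 4)) + rng.standard_normal((n, 4))

s, W_x, W_y = cca(X, Y)
Zx = (X - X.mean(axis=0)) @ W_x
Zy = (Y - Y.mean(axis=0)) @ W_y

# Unit projected variance (1/n convention), and correlation = singular value
print(np.allclose(np.var(Zx, axis=0), 1.0), np.allclose(np.var(Zy, axis=0), 1.0))  # True True
print(np.allclose([np.corrcoef(Zx[:, i], Zy[:, i])[0, 1] for i in range(len(s))], s))  # True
```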
Assumptions and limitations
What CCA gives you
For each canonical pair, CCA returns a weight vector in each dataset and a correlation measuring how well the two projections track each other. These come in a ranked sequence, $\rho_1 \ge \rho_2 \ge \cdots \ge \rho_{\min(p,q)}$, measuring how strongly the two datasets co-vary along each successive direction. The whitening-then-SVD pipeline separates the within-dataset structure from the between-dataset relationship cleanly: all the shared linear information lives in the SVD of the whitened cross-covariance.
Assumptions
Paired observations. Both datasets must have the same number of rows, and row $i$ in one must correspond to row $i$ in the other. CCA cannot handle unpaired or differently-sized datasets.
Linearity. CCA finds linear projections. If the shared structure between your datasets is nonlinear, CCA will miss it entirely or capture only its linear component.
More observations than variables. The basic method requires $n > \max(p, q)$ so the covariance matrices are invertible. When this fails, you need regularization or dimensionality reduction as a preprocessing step.
Sensitivity to outliers. Both the covariance estimates and the linear projections are affected by outliers, as with any method built on second-order statistics.
Limitations
CCA captures only linear shared structure. Kernel CCA and deep CCA [10] extend the idea to nonlinear mappings, at the cost of interpretability and additional hyperparameters.
CCA finds the shared part but does not model what is unique to each dataset. Probabilistic CCA [9] gives a generative model where both datasets are generated from a shared latent variable plus independent noise, but the "private" component is just isotropic Gaussian noise, not a structured private latent space.
Canonical directions can be hard to interpret in high dimensions. A weight vector with 200 entries does not point to a single neuron or a single feature.
These limitations point directly to the methods I will cover in future posts: nonlinear extensions, probabilistic formulations, and approaches that relax the pairing requirement. CCA is the foundation, and knowing where it breaks helps you know when to reach for something stronger.
CCA and its neighbors
The same ingredients (whitening, SVD, cross-covariance) recombine to give different methods depending on what you optimize and how you constrain the solution. The table below shows how CCA relates to three methods you are likely to encounter alongside it.
| Method | Objective | Constraint | Relationship to CCA |
|---|---|---|---|
| PCA | Variance | Unit norm weights | Single-dataset analogue: SVD of one covariance instead of the cross-covariance between two |
| CCA | Correlation | Unit projected variance | (this post) |
| PLS | Covariance | Unit norm weights | No whitening, so high-variance directions dominate |
| RRR | Prediction error | Rank constraint | Asymmetric: predicts Y from X |
The choice between correlation (CCA), covariance (PLS), and prediction (RRR) depends on what you care about. CCA gives you paired linear directions with temporal correspondence, which is exactly what cross-animal alignment needs. The next post will address what happens when you want to relax one or more of those constraints: nonlinear relationships, multiple datasets simultaneously, or the absence of trial-matched observations.
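The difference between PLS and CCA in the table is easy to see numerically. In this constructed toy example, X has a high-variance direction that is only weakly related to Y and a low-variance direction that is strongly related; PLS's covariance objective favors the former, CCA's correlation objective the latter:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5000
z = rng.standard_normal(n)  # shared latent signal
X = np.column_stack([
    10 * (z + 2 * rng.standard_normal(n)),  # high variance, weak correlation
    z + 0.3 * rng.standard_normal(n),       # low variance, strong correlation
])
Y = np.column_stack([
    z + 0.3 * rng.standard_normal(n),
    rng.standard_normal(n),                 # pure noise
])
X = X - X.mean(axis=0)
Y = Y - Y.mean(axis=0)
Sxx, Syy, Sxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n

def inv_sqrt(S):
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

def top_pair_corr(wx, wy):
    return abs(np.corrcoef(X @ wx, Y @ wy)[0, 1])

# PLS: SVD of the raw cross-covariance (no whitening)
Up, _, Vpt = np.linalg.svd(Sxy)
corr_pls = top_pair_corr(Up[:, 0], Vpt[0])

# CCA: SVD of the whitened cross-covariance, weights un-whitened
Uc, _, Vct = np.linalg.svd(inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy))
corr_cca = top_pair_corr(inv_sqrt(Sxx) @ Uc[:, 0], inv_sqrt(Syy) @ Vct[0])

print(f"correlation of top PLS pair: {corr_pls:.2f}")  # dragged down by the high-variance direction
print(f"correlation of top CCA pair: {corr_cca:.2f}")  # finds the strongly correlated pair
```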
References
- H. Hotelling, "Relations between two sets of variates," Biometrika, vol. 28, no. 3/4, pp. 321–377, 1936.
- X. Zhuang, Z. Yang, and D. Cordes, "A technical review of canonical correlation analysis for neuroscience applications," Human Brain Mapping, vol. 41, no. 13, pp. 3807–3833, 2020.
- D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor, "Canonical correlation analysis: An overview with application to learning methods," Neural Computation, vol. 16, no. 12, pp. 2639–2664, 2004.
- G. Gundersen, "Canonical Correlation Analysis", 2018.
- P. Mineault, "Dimensionality reduction in neural data analysis", 2021.
- M. Borga, "Canonical correlation: a tutorial", 2001.
- M. Safaie, J. C. Chang, et al., "Preserved neural dynamics across animals performing similar behaviour," Nature, vol. 623, pp. 765–771, 2023.
- K. W. Wachter, "The limiting empirical measure of multiple discriminant ratios," Annals of Statistics, vol. 8, no. 5, pp. 937–957, 1980.
- F. R. Bach and M. I. Jordan, "A probabilistic interpretation of canonical correlation analysis," Technical Report 688, Dept. of Statistics, UC Berkeley, 2005.
- G. Andrew, R. Arora, J. Bilmes, and K. Livescu, "Deep canonical correlation analysis," Proceedings of the 30th International Conference on Machine Learning (ICML), 2013.