Principal component analysis

Variance maximization and low-rank approximation turn out to be the same problem.

The question PCA answers

You record from 100 neurons over 500 time bins. The data is a 500-by-100 matrix $X$ (centered: each neuron's mean has been subtracted). You believe the population's activity is approximately low-dimensional, say confined near a 5-dimensional subspace of the 100-dimensional neuron space.

You want to find that subspace. But which 5-dimensional subspace? There are infinitely many 5-dimensional subspaces of $\mathbb{R}^{100}$. You need a criterion, a way to rank one subspace as better than another.

There are two natural criteria, and they look quite different.

Criterion 1: variance. Find the 5-dimensional subspace along which the data spread out the most. The data are informative in directions where they vary, and uninformative in directions where they sit still. Capture the most variance with the fewest axes.

Criterion 2: reconstruction. Find the 5-dimensional subspace such that projecting the data onto it and then lifting back to 100 dimensions loses as little as possible. Minimize the reconstruction error.

These seem like different problems. The first is about keeping what moves. The second is about approximating the original data. But they turn out to have the same answer. That answer is PCA.

Maximize variance

Start with one direction. Find the unit vector $w$ in $\mathbb{R}^{100}$ such that the variance of the data projected onto $w$ is as large as possible.

From Post 5, the variance of the data along a unit direction $w$ is the quadratic form $w^\top C w$, where $C = \tfrac{1}{T} X^\top X$ is the covariance matrix. So we want:

$$\max_{w}\; w^\top C\, w \qquad \text{subject to}\quad \|w\| = 1 \tag{1}$$

This is a constrained optimization. Use a Lagrange multiplier $\lambda$ for the constraint $w^\top w = 1$. The Lagrangian is $L = w^\top C w - \lambda(w^\top w - 1)$. Set the gradient to zero:

$$\frac{\partial L}{\partial w} = 2Cw - 2\lambda w = 0 \qquad \Longrightarrow \qquad Cw = \lambda w \tag{2}$$

That is an eigenvalue equation. The direction that maximizes variance is an eigenvector of the covariance matrix. Which one? Multiply both sides by $w^\top$: $w^\top C w = \lambda w^\top w = \lambda$. The variance along $w$ equals the eigenvalue. To maximize variance, pick the eigenvector with the largest eigenvalue.

That eigenvector is the first principal component.
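This is easy to check numerically. A quick sketch (mine, not the post's code, using synthetic anisotropic data): eigendecompose a covariance matrix and confirm that no random unit direction carries more variance than the top eigenvector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Anisotropic data: three axes with very different spreads
X = rng.standard_normal((500, 3)) * np.array([3.0, 1.0, 0.3])
X -= X.mean(axis=0)
C = X.T @ X / X.shape[0]                  # covariance matrix

evals, evecs = np.linalg.eigh(C)          # eigh returns ascending order
w_star = evecs[:, -1]                     # eigenvector of the largest eigenvalue

# Variance along w_star equals the top eigenvalue...
top_var = w_star @ C @ w_star
assert np.isclose(top_var, evals[-1])

# ...and no random unit direction beats it
random_w = rng.standard_normal((10_000, 3))
random_w /= np.linalg.norm(random_w, axis=1, keepdims=True)
random_vars = np.einsum('ij,jk,ik->i', random_w, C, random_w)  # w^T C w per row
assert random_vars.max() <= top_var + 1e-9
```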

Drag the direction vector around the unit circle. The right panel shows variance as a function of angle. The maximum coincides with the first eigenvector of the covariance matrix.

Now find the second. We want the unit direction of maximum variance that is orthogonal to the first. Add the constraint $w^\top w_1 = 0$ and repeat. The same argument gives another eigenvalue equation, and the maximum is the second-largest eigenvalue. Its eigenvector is the second principal component. Continue: the $k$-th principal component is the eigenvector with the $k$-th largest eigenvalue.

Each successive component is orthogonal to all previous ones, because the eigenvectors of a symmetric matrix are orthogonal (the spectral theorem from Post 5). The orthogonality constraint does not need to be imposed separately; it follows from the spectral theorem.

Minimize reconstruction error

Now the second criterion. We want to approximate each 100-dimensional data vector by its projection onto a $k$-dimensional subspace, and choose the subspace that makes the approximation as good as possible.

From Post 2, if the subspace is spanned by orthonormal vectors $w_1, \ldots, w_k$, the projection of a data point $x_t$ is:

$$\hat{x}_t = \sum_{j=1}^{k} (x_t \cdot w_j)\, w_j \tag{3}$$

The reconstruction error for one data point is $\|x_t - \hat{x}_t\|^2$: the squared distance between the original and its projection. Average over all time bins:

$$\text{error} = \frac{1}{T} \sum_{t=1}^{T} \|x_t - \hat{x}_t\|^2 \tag{4}$$

We want to minimize this. Since $\hat{x}_t$ is the projection onto the subspace and $x_t - \hat{x}_t$ is perpendicular to it (that was the point of projection, from Post 4), Pythagoras gives:

$$\|x_t\|^2 = \|\hat{x}_t\|^2 + \|x_t - \hat{x}_t\|^2 \tag{5}$$

The squared length of each data point splits into the part captured by the projection and the part lost. Average over all $t$:

$$\underbrace{\frac{1}{T}\sum_t \|x_t\|^2}_{\text{total variance}} = \underbrace{\frac{1}{T}\sum_t \|\hat{x}_t\|^2}_{\text{captured variance}} + \underbrace{\frac{1}{T}\sum_t \|x_t - \hat{x}_t\|^2}_{\text{error}} \tag{6}$$

The total variance (left side) is fixed. It does not depend on the choice of subspace. So minimizing the error (third term) is the same as maximizing the captured variance (second term). The two criteria are not just similar. They are exactly equivalent, connected by Pythagoras.
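The decomposition in equation (6) holds for any subspace, not just the optimal one. A hedged sketch (my own, on random data): project onto an arbitrary orthonormal basis and confirm that total variance splits exactly into captured variance plus error.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 10))
X -= X.mean(axis=0)

# Any orthonormal basis for a 3-dim subspace (QR of random vectors)
W, _ = np.linalg.qr(rng.standard_normal((10, 3)))   # 10 x 3, orthonormal columns

X_hat = X @ W @ W.T                                 # project, then lift back
total    = np.sum(X**2) / len(X)                    # total variance
captured = np.sum(X_hat**2) / len(X)                # captured variance
error    = np.sum((X - X_hat)**2) / len(X)          # reconstruction error

assert np.isclose(total, captured + error)          # Pythagoras, eq. (6)
```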

Why both give the same answer

Let's verify this. The captured variance is $\tfrac{1}{T}\sum_t \|\hat{x}_t\|^2$. Expanding $\hat{x}_t$ using equation (3):

$$\frac{1}{T}\sum_t \|\hat{x}_t\|^2 = \frac{1}{T}\sum_t \sum_{j=1}^{k} (x_t \cdot w_j)^2 = \sum_{j=1}^{k} \underbrace{\frac{1}{T}\sum_t (x_t \cdot w_j)^2}_{w_j^\top C w_j} \tag{7}$$

Each term $w_j^\top C w_j$ is the variance along direction $w_j$. So the total captured variance is the sum of variances along the $k$ subspace directions. To maximize this sum, you want each $w_j$ to point along a direction of high variance, subject to mutual orthogonality. The solution: the top $k$ eigenvectors of $C$. The maximum captured variance is $\lambda_1 + \lambda_2 + \cdots + \lambda_k$.

The reconstruction error is the leftover: $\lambda_{k+1} + \lambda_{k+2} + \cdots + \lambda_n$. Maximize the first sum or minimize the second; same answer either way.
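A numerical check of this claim (my sketch, synthetic data): for the top-$k$ eigenvector subspace, the reconstruction error should equal the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((300, 8))
X -= X.mean(axis=0)
C = X.T @ X / len(X)

evals = np.sort(np.linalg.eigvalsh(C))[::-1]        # eigenvalues, descending
k = 3
V = np.linalg.eigh(C)[1][:, ::-1][:, :k]            # top-k eigenvectors

X_hat = X @ V @ V.T                                 # project onto top-k subspace
error = np.sum((X - X_hat)**2) / len(X)

assert np.isclose(error, evals[k:].sum())           # error = leftover eigenvalues
```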

Drag the green line to rotate the projection subspace. The captured variance and reconstruction error update in real time. The dashed line shows the optimal direction (the first principal component).

PCA via the SVD

The derivation above says: eigendecompose the covariance matrix, take the top eigenvectors. That is mathematically correct, but it is not how you should compute PCA.

From the SVD post, the centered data matrix has SVD $X = U\Sigma V^\top$, and:

$$C = \frac{1}{T}\,X^\top X = V\,\frac{\Sigma^2}{T}\,V^\top \tag{8}$$

The columns of $V$ are the eigenvectors of $C$, i.e., the principal components. The eigenvalues are $\sigma_i^2 / T$. You never need to form $C$ at all. In practice, PCA is always computed via the SVD of the data matrix, because it is faster and more numerically stable.

Forming $X^\top X$ squares the condition number of the problem. If the ratio of the largest to smallest singular value is $\kappa$, the ratio of the largest to smallest eigenvalue of $X^\top X$ is $\kappa^2$. The SVD avoids this squaring.
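To see that the two routes agree, here is a sketch (mine, on synthetic data) comparing `np.linalg.eigh` on $C$ against `np.linalg.svd` on the data matrix. Eigenvectors are only defined up to sign, so the columns are compared up to sign.

```python
import numpy as np

rng = np.random.default_rng(3)
# Scale the columns so the eigenvalues are well separated
X = rng.standard_normal((400, 6)) * np.array([3.0, 2.5, 2.0, 1.5, 1.0, 0.5])
X -= X.mean(axis=0)
T = len(X)

# Route 1: eigendecompose the covariance matrix
C = X.T @ X / T
evals, evecs = np.linalg.eigh(C)
evals, evecs = evals[::-1], evecs[:, ::-1]          # descending order

# Route 2: SVD of the data matrix, never forming C
U, s, Vt = np.linalg.svd(X, full_matrices=False)

assert np.allclose(evals, s**2 / T)                 # eigenvalues = sigma_i^2 / T
for j in range(6):                                  # columns agree up to sign
    assert np.allclose(np.abs(evecs[:, j]), np.abs(Vt[j]))
```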

The recipe: center the data, compute the (thin) SVD, and read off the principal components from the columns of $V$. That is PCA in three steps.

Scores, loadings, and reconstruction

PCA produces two objects, and the terminology is often confused.

The loadings are the principal component directions: the columns of $V$. Each loading is a unit vector in $\mathbb{R}^{100}$ (neuron space). It tells you the "recipe" for one component: how much each neuron contributes to that direction.

The scores are the coordinates of the data in the principal component basis. For data point $x_t$, the score on component $j$ is the dot product $x_t \cdot v_j$. The full score matrix is $Z = XV$, a 500-by-100 matrix (or 500-by-$k$ if you keep only $k$ components). Each row is a time bin. Each column is a component.

From the SVD, the scores have a clean form: $Z = XV = U\Sigma$. The left singular vectors $U$, scaled by the singular values, are the PC scores. The structure is visible: $U$ gives the temporal patterns, $\Sigma$ tells you how important each one is, $V$ tells you how each neuron contributes.

Reconstruction from $k$ components:

$$\hat{X} = Z_k V_k^\top = U_k \Sigma_k V_k^\top \tag{9}$$

This is the rank-$k$ truncated SVD from the previous post: the best rank-$k$ approximation to the data matrix. PCA and the truncated SVD are the same thing, seen from different angles.

The same neural trajectory in two bases. Left: neuron-space axes. Right: principal component axes. The trajectory has not changed; only the coordinates have.

How many components to keep

The eigenvalues tell you the variance captured by each component. The fraction of total variance explained by the first $k$ components is:

$$\text{explained variance} = \frac{\lambda_1 + \cdots + \lambda_k}{\lambda_1 + \cdots + \lambda_n} \tag{10}$$

Plot this as a function of $k$ and you get the cumulative explained variance curve. Plot the individual eigenvalues and you get the scree plot. Both help you choose $k$.

Common heuristics: keep enough components to explain 90% or 95% of the variance. Or look for an "elbow" in the scree plot where the eigenvalues drop off sharply. Or use cross-validation: hold out some data, project using $k$ components, measure reconstruction error on the held-out data, and pick the $k$ that minimizes it.

The 90% or 95% threshold is a convention, not a theorem. For some datasets, 5 components explain 95% of the variance and the remaining dimensions are noise. For others, the variance is spread more evenly and 95% requires 50 components, which may not feel like dimensionality reduction. The right $k$ depends on the question, not on a fixed threshold.
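Computing the cumulative curve and applying a threshold takes a few lines. A sketch on synthetic data mirroring the running example (rank-5 signal plus small noise); with data this clean, the 90% threshold should land at the true rank.

```python
import numpy as np

rng = np.random.default_rng(5)

# Rank-5 signal plus modest noise, as in the running example
latent = rng.standard_normal((500, 5))
mixing = rng.standard_normal((5, 100))
X = latent @ mixing + 0.3 * rng.standard_normal((500, 100))
X -= X.mean(axis=0)

s = np.linalg.svd(X, compute_uv=False)
explained = s**2 / np.sum(s**2)          # variance fraction per component
cumulative = np.cumsum(explained)        # eq. (10) as a function of k

# Smallest k whose cumulative explained variance reaches 90%
k_90 = int(np.argmax(cumulative >= 0.90)) + 1
print(k_90)                              # should match the true rank, 5
```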

For neural data, the scree plot often shows a smooth decay without an obvious elbow. This happens because neural noise is correlated (nearby neurons share noise sources), which inflates the variance along many dimensions. Methods like factor analysis and GPFA handle this by explicitly modeling shared signal variance separately from private noise [4].

When PCA misleads

PCA finds directions of maximum variance. That is useful when variance corresponds to the structure you care about. But it can mislead in several ways.

Forgetting to center. If you run PCA on uncentered data, the first component may simply point toward the mean firing rate rather than capturing the most variable direction. This is the most common PCA mistake in practice. Always subtract the mean of each neuron first.
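The failure is easy to reproduce. A sketch (mine, with made-up numbers): give every neuron a large shared mean rate, skip centering, and the first right singular vector of the raw data matrix points almost exactly at the mean.

```python
import numpy as np

rng = np.random.default_rng(6)

# Neurons with a large shared mean rate (~20) but unit variability
X = 20.0 + rng.standard_normal((500, 10))

# The mistake: SVD of the *uncentered* data
_, _, Vt_raw = np.linalg.svd(X, full_matrices=False)

mean_dir = X.mean(axis=0)
mean_dir /= np.linalg.norm(mean_dir)
alignment = abs(Vt_raw[0] @ mean_dir)

assert alignment > 0.99    # "PC1" is essentially the mean direction
```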

High variance is not the same as importance. The direction with the most variance might be a nuisance: a global gain fluctuation, a slow drift across the session, or a motion artifact. PCA does not know what is scientifically relevant. It finds what moves the most, and what moves the most might not be what you care about.This is exactly the limitation that motivates PSID. PCA captures variance regardless of behavioral relevance. PSID finds the subspace that is maximally predictive of behavior, which may have less variance than the top PCA components but more scientific relevance.

Nonlinear structure. PCA finds linear subspaces. If the data lie on a curved surface (a nonlinear manifold), PCA will approximate it with a flat sheet. The approximation may be poor, and the number of components needed may be much larger than the intrinsic dimensionality of the manifold. Nonlinear methods like UMAP, t-SNE, and autoencoders address this, at the cost of losing PCA's clean guarantees.

Sensitivity to scaling. If one neuron fires at rates in the hundreds and another fires at rates in the single digits, PCA will be dominated by the high-rate neuron. This is a consequence of the $L^2$ norm from Post 1. If the variables have different units or very different scales, you may need to standardize (divide each neuron's activity by its standard deviation) before running PCA. This changes the geometry, and the PCA results will differ.
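A two-neuron sketch (my own toy numbers) makes the point: before standardizing, PC1 is essentially the loud neuron; after dividing each column by its standard deviation, the two directions carry comparable variance.

```python
import numpy as np

rng = np.random.default_rng(7)
T = 500

# Neuron 1: high rates (std 50). Neuron 2: low rates (std 2). Independent.
X = np.column_stack([50 * rng.standard_normal(T),
                     2 * rng.standard_normal(T)])
X -= X.mean(axis=0)

_, _, Vt = np.linalg.svd(X, full_matrices=False)
assert abs(Vt[0, 0]) > 0.99               # PC1 is dominated by the loud neuron

# Standardize each neuron; now the directions compete on equal terms
Xz = X / X.std(axis=0)
s_z = np.linalg.svd(Xz, compute_uv=False)
assert abs(s_z[0] - s_z[1]) / s_z[0] < 0.2   # comparable variance along both
```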

Rotational indeterminacy within a subspace. PCA gives you a specific set of orthogonal axes within the best subspace. But any rotation of those axes within the subspace captures the same total variance. The individual components are not unique. They depend on the ordering (by decreasing variance) and the orthogonality constraint. Methods like varimax rotation and ICA address this by imposing additional criteria.
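The indeterminacy can be demonstrated directly. A sketch (mine): rotate the top-2 loadings by 30 degrees inside their own plane and confirm the captured variance is unchanged; only the split between the two axes moves.

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.standard_normal((300, 6)) * np.array([4.0, 3.0, 2.0, 0.5, 0.4, 0.3])
X -= X.mean(axis=0)

_, s, Vt = np.linalg.svd(X, full_matrices=False)
V2 = Vt[:2].T                          # top-2 loadings, 6 x 2

# Rotate the two axes by 30 degrees *within* the subspace
theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
V2_rot = V2 @ R                        # still orthonormal, same span

def captured(W):
    """Total variance captured by the span of W's orthonormal columns."""
    return np.sum((X @ W)**2) / len(X)

assert np.isclose(captured(V2), captured(V2_rot))   # same total variance
```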

Toggle between failure modes: uncentered data, high-variance noise, and slow drift. Watch PC1 shift to follow the nuisance instead of the structure.
Click Change Basis to rotate from neuron-space to PC-space. The low-dimensional structure becomes visible.

What comes next

PCA finds the best subspace for representing a single dataset. But PCA's decoder — the least-squares readout from PC scores back to neural activity — has a problem: the best-fitting decoder is usually the worst one to use on new data. It overfits. The coefficients blow up whenever two dimensions are correlated, which in neural data they almost always are.

The fix is regularization. Ridge regression, the lasso, and their relatives add a penalty on the size of the coefficients. The result is a decoder that fits slightly worse in-sample but generalizes far better out-of-sample. The geometry of regularization — why it shrinks some directions and not others, how the penalty interacts with the covariance structure of the data — is the subject of the next post.

Implementation

PCA via the SVD is a few lines of NumPy. Center the data, compute the thin SVD, read off the principal components from the columns of $V$, and the scores from $U\Sigma$.

import numpy as np

def pca(X, k=None):
    """
    PCA via the thin SVD of the centered data matrix.

    Parameters
    ----------
    X : array, shape (n_time, n_neurons)
        Raw data matrix (rows = observations).
    k : int or None
        Number of components. If None, keep all.

    Returns
    -------
    scores : array, shape (n_time, k)
        PC scores (projections onto principal components).
    loadings : array, shape (n_neurons, k)
        PC loadings (principal component directions).
    explained : array, shape (k,)
        Fraction of variance explained by each component.
    """
    # Center
    X_centered = X - X.mean(axis=0)

    # Thin SVD
    U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

    # Explained variance
    var_explained = s**2 / np.sum(s**2)

    # Truncate
    if k is None:
        k = len(s)
    U_k = U[:, :k]
    s_k = s[:k]
    Vt_k = Vt[:k, :]

    scores = U_k * s_k[np.newaxis, :]   # n_time x k
    loadings = Vt_k.T                    # n_neurons x k

    return scores, loadings, var_explained[:k]


# ── Example: PCA on simulated neural data ──
rng = np.random.default_rng(42)

# 500 time bins, 100 neurons, ~5 latent dimensions
n_time, n_neurons, true_rank = 500, 100, 5
latent = rng.standard_normal((n_time, true_rank))
weights = rng.standard_normal((true_rank, n_neurons))
noise = 0.3 * rng.standard_normal((n_time, n_neurons))
X = latent @ weights + noise

scores, loadings, explained = pca(X, k=10)

print("Explained variance (first 10 PCs):")
print(explained.round(3))
print(f"Cumulative: {explained.cumsum().round(3)}")
# First 5 PCs capture ~95% — matches the true rank

# Reconstruct from k components
X_approx = scores @ loadings.T + X.mean(axis=0)
error = np.linalg.norm(X - X_approx) / np.linalg.norm(X - X.mean(axis=0))
print(f"Relative reconstruction error (k=10): {error:.3f}")

References

  1. Strang, G. Introduction to Linear Algebra, 6th ed. Wellesley-Cambridge Press, 2023.
  2. 3Blue1Brown. "Essence of Linear Algebra" video series, 2016.
  3. Churchland, M. M., Cunningham, J. P., Kaufman, M. T., et al. "Neural population dynamics during reaching," Nature, vol. 487, pp. 51-56, 2012.
  4. Cunningham, J. P. and Yu, B. M. "Dimensionality reduction for large-scale neural recordings," Nature Neuroscience, vol. 17, pp. 1500-1509, 2014.
  5. Axler, S. Linear Algebra Done Right, 4th ed. Springer, 2024.
  6. Strang, G. "The fundamental theorem of linear algebra," The American Mathematical Monthly, vol. 100, no. 9, pp. 848-855, 1993.
  7. Safaie, M., Chang, J. C., Park, J., et al. "Preserved neural dynamics across animals performing similar behaviour," Nature, vol. 623, pp. 765-771, 2023.
  8. Jolliffe, I. T. Principal Component Analysis, 2nd ed. Springer, 2002.