The singular value decomposition
One factorization behind most of neural data analysis.
What a matrix does to a circle
Every matrix you will meet in this series does something geometric: it takes a set of vectors and rearranges them. To understand what any particular matrix does, a good starting question is: what happens to the unit circle (in 2D) or the unit sphere (in higher dimensions) when the matrix acts on it?
Take a concrete example. In Post 3, we built a 2-by-3 neural decoder that maps 3-dimensional neural activity to 2-dimensional hand velocity. But to see the geometry clearly, start with a square symmetric matrix such as

$$A = \begin{pmatrix} 3 & 1 \\ 1 & 3 \end{pmatrix}.$$

Apply this matrix to every point on the unit circle. The result is an ellipse. We saw in Post 5 that a symmetric matrix maps a circle to an ellipse whose axes align with the eigenvectors. The eigenvalues are 4 and 2 (check: $A(1,1)^\top = (4,4)^\top = 4\,(1,1)^\top$, so $(1,1)^\top$ gives $\lambda = 4$). The eigenvectors $(1,1)^\top/\sqrt{2}$ and $(1,-1)^\top/\sqrt{2}$ point along the ellipse axes, and the eigenvalues are the semi-axis lengths. So far, the eigendecomposition from Post 5 handles everything.
Now consider a non-symmetric matrix:

$$B = \begin{pmatrix} 2 & 1 \\ 0 & 1 \end{pmatrix}.$$

This still maps the circle to an ellipse, but the eigenvectors of $B$ are not perpendicular. The eigenvalues are 2 and 1, with eigenvectors $(1,0)^\top$ and $(1,-1)^\top/\sqrt{2}$, which meet at an oblique angle. Yet the ellipse still has a longest axis and a shortest axis, and those axes are still perpendicular to each other. The eigenvectors simply do not point along them.

Why are the ellipse axes always perpendicular? Because an ellipse is a level set of a quadratic form, and any quadratic form can be diagonalized by an orthogonal change of coordinates. The SVD is the machinery that finds those coordinates for arbitrary (including non-symmetric and rectangular) matrices.
This is the gap the SVD fills. The eigendecomposition finds directions that scale without rotating, which requires a square matrix and yields perpendicular axes only for symmetric matrices. The SVD instead asks: what are the perpendicular axes of the output ellipse, and which perpendicular axes on the input circle map to them? It answers this question for any matrix, symmetric or not, square or not.
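To see the gap numerically, here is a quick check, using the matrix [[2, 1], [0, 1]] (eigenvalues 2 and 1) as an assumed stand-in for the non-symmetric example: the eigenvectors from np.linalg.eig are oblique, while the left singular vectors from np.linalg.svd, which point along the output ellipse axes, are exactly perpendicular.

```python
import numpy as np

# Assumed example: a non-symmetric matrix with eigenvalues 2 and 1
B = np.array([[2.0, 1.0],
              [0.0, 1.0]])

# Eigenvectors: unit-normalized, but NOT perpendicular to each other
eigvals, eigvecs = np.linalg.eig(B)
eig_dot = eigvecs[:, 0] @ eigvecs[:, 1]

# Left singular vectors: the output ellipse axes, always perpendicular
U, s, Vt = np.linalg.svd(B)
svd_dot = U[:, 0] @ U[:, 1]

print(f"eigenvalues:             {np.sort(eigvals)[::-1]}")
print(f"eigenvector dot product: {eig_dot:.4f}  (oblique)")
print(f"singular vector dot:     {svd_dot:.4f}  (perpendicular)")
```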
There exists a pair of orthonormal bases, one for the input space and one for the output space, such that the matrix acts as pure scaling between them. The input basis vectors are the right singular vectors. The output basis vectors are the left singular vectors. The scaling factors are the singular values. That is the entire content of the SVD.
Rotate, scale, rotate
Every matrix, no matter how complicated, decomposes into three steps. The SVD of any matrix $A$ is:

$$A = U \Sigma V^\top$$

Read this right to left, as three operations applied to a vector $x$:
Step 1. $V^\top x$: rotate the input vector so that the right singular vectors align with the coordinate axes. This is a change of basis in the input space. Because $V$ is orthogonal, $V^\top$ is a rotation (or rotation plus reflection). No stretching happens.
Step 2. $\Sigma$: scale each coordinate axis independently. The stretch factors are the singular values $\sigma_1, \sigma_2, \ldots$, the diagonal entries of $\Sigma$. This is where the geometry actually changes: directions with large singular values get amplified, directions with small singular values get crushed.
Step 3. $U$: rotate the scaled result into the output space, aligning the coordinate axes with the left singular vectors. Again, no stretching.
The two rotations can be completely different. $V^\top$ orients the input space and $U$ orients the output space. Between them, $\Sigma$ does the only non-rigid work: independent scaling along each axis. This factorization is why the SVD reveals so much about a matrix. The singular values tell you how much each direction is amplified. The singular vectors tell you which directions.

The "rotate, scale, rotate" picture is the SVD's fundamental insight, and Strang [1] and Trefethen and Bau [8] both emphasize it as the single most important factorization in numerical linear algebra. Trefethen in particular argues that the SVD should be taught before the eigendecomposition, because it works for all matrices and the geometric meaning is immediate.
Let's verify with a concrete example. Take the shear matrix:

$$A = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}$$

Compute $A^\top A = \begin{pmatrix} 1 & 1 \\ 1 & 2 \end{pmatrix}$. The characteristic polynomial is $\lambda^2 - 3\lambda + 1 = 0$, giving eigenvalues $\lambda_1 = (3+\sqrt{5})/2 \approx 2.618$ and $\lambda_2 = (3-\sqrt{5})/2 \approx 0.382$. The singular values are the square roots: $\sigma_1 \approx 1.618$ and $\sigma_2 \approx 0.618$.
Two checks. First: $\sigma_1 \sigma_2 \approx 1.618 \times 0.618 = 1 = |\det A|$. The product of singular values equals the absolute determinant, because the determinant measures total area (or volume) scaling, and the singular values measure how much each axis contributes. Second: $\sigma_1^2 + \sigma_2^2 \approx 2.618 + 0.382 = 3$, the squared Frobenius norm (sum of all squared entries of $A$: $1^2 + 1^2 + 0^2 + 1^2 = 3$).

Both of these identities hold in general. For any square matrix, $\prod_i \sigma_i = |\det A|$; for any matrix, $\sum_i \sigma_i^2 = \|A\|_F^2$. The first says singular values decompose volume scaling. The second says they decompose total "energy." These identities are why singular values appear in matrix norms, condition numbers, and information-theoretic quantities throughout the series.
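Both checks take one line of NumPy each. A minimal sketch, assuming the shear matrix [[1, 1], [0, 1]]:

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [0.0, 1.0]])          # shear matrix (assumed example)

s = np.linalg.svd(A, compute_uv=False)   # singular values, descending

print(s.round(3))                        # [1.618 0.618]
# Product of singular values = |det A| = 1
assert np.isclose(s.prod(), abs(np.linalg.det(A)))
# Sum of squared singular values = squared Frobenius norm = 3
assert np.isclose(np.sum(s**2), np.sum(A**2))
# The larger singular value happens to be the golden ratio
assert np.isclose(s[0], (1 + np.sqrt(5)) / 2)
```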
Where the SVD comes from
The SVD is not a new piece of machinery. It falls directly out of the eigendecomposition from Post 5, applied to a carefully chosen matrix.
We want to find the directions in the input space that get stretched the most by $A$. The squared length of $Ax$ is $\|Ax\|^2 = (Ax)^\top (Ax) = x^\top A^\top A x$. Maximizing this over unit vectors is an eigenvalue problem for the matrix $A^\top A$, exactly as in Post 5. The maximum of $x^\top A^\top A x$ subject to $\|x\| = 1$ is the largest eigenvalue of $A^\top A$, achieved at the corresponding eigenvector.
Why $A^\top A$ and not just $A$? Because $A$ might not be square, and even if it is, it might not be symmetric. But $A^\top A$ is always square, always symmetric, and always positive semidefinite:

$$(A^\top A)^\top = A^\top (A^\top)^\top = A^\top A, \qquad x^\top (A^\top A)\, x = \|Ax\|^2 \ge 0.$$

By the spectral theorem from Post 5, it has a full orthogonal eigendecomposition:

$$A^\top A = V \Lambda V^\top$$

where the columns of $V$ are orthonormal eigenvectors and $\Lambda$ is diagonal with nonnegative eigenvalues. These eigenvectors $v_i$ are the right singular vectors of $A$. They are the input directions that align with the ellipse axes.
The eigenvalues of $A^\top A$ are the squared singular values: $\lambda_i = \sigma_i^2$. This is the connection: the singular values of $A$ are the square roots of the eigenvalues of $A^\top A$.
Now we need the output directions. For each nonzero singular value, define:

$$u_i = \frac{A v_i}{\sigma_i}$$

This is $A$ applied to the right singular vector, normalized by the singular value. In other words: take the input direction, push it through the matrix, and normalize the result. The claim is that these $u_i$'s are orthonormal. Verify:
$$u_i^\top u_j = \frac{(A v_i)^\top (A v_j)}{\sigma_i \sigma_j} = \frac{v_i^\top (A^\top A\, v_j)}{\sigma_i \sigma_j} = \frac{\sigma_j^2\, v_i^\top v_j}{\sigma_i \sigma_j} = \delta_{ij},$$

where we used $A^\top A\, v_j = \sigma_j^2 v_j$ (the eigenvalue equation) and the orthonormality of the $v_i$'s. So the $u_i$'s are indeed orthonormal. They are the left singular vectors.

The orthonormality proof is worth internalizing. It is the same structure that appears in the CCA derivation (where whitening plays the role of normalizing) and in subspace identification (where the left singular vectors of the Hankel matrix span the observability subspace). Every time you see an SVD in this series, the left and right singular vectors have a similar "push through the matrix, normalize" relationship.
Assemble everything. From the definition $u_i = A v_i / \sigma_i$, we have $A v_i = \sigma_i u_i$ for each singular value. Writing all of these equations as a single matrix equation:

$$AV = U\Sigma$$

where $V$ collects the right singular vectors, $U$ collects the left singular vectors, and $\Sigma$ is diagonal with the singular values. Right-multiplying by $V^\top$ (which equals $V^{-1}$ since $V$ is orthogonal):

$$A = U \Sigma V^\top$$
That is the SVD. We started from the spectral theorem for the symmetric matrix , defined the left singular vectors via the matrix itself, proved they are orthonormal, and assembled the factorization. No new axioms or theorems were needed beyond what Post 5 provides.
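The construction is short enough to execute directly. A sketch, assuming a random full-rank rectangular matrix: eigendecompose $A^\top A$, take square roots, push the right singular vectors through $A$, and confirm the factorization.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))   # rectangular, full column rank (a.s.)

# Step 1: spectral theorem on the symmetric PSD matrix A^T A
lam, V = np.linalg.eigh(A.T @ A)      # ascending eigenvalues
lam, V = lam[::-1], V[:, ::-1]        # sort descending

# Step 2: singular values are square roots of the eigenvalues
sigma = np.sqrt(lam)

# Step 3: left singular vectors u_i = A v_i / sigma_i, columnwise
U = (A @ V) / sigma

# Checks: the u_i are orthonormal, and A = U Sigma V^T
assert np.allclose(U.T @ U, np.eye(3))
assert np.allclose((U * sigma) @ V.T, A)
```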
You could equally well start from $AA^\top$ instead of $A^\top A$. Its eigenvectors are the left singular vectors, and you would construct the right singular vectors from them. Both $A^\top A$ and $AA^\top$ share the same nonzero eigenvalues (the squared singular values). The two paths converge to the same factorization.

For a symmetric matrix, the SVD and the eigendecomposition coincide: left and right singular vectors are both eigenvectors (up to a sign flip for negative eigenvalues), and singular values are the absolute values of eigenvalues. The SVD is strictly more general. It works for any matrix, including rectangular ones, and the singular values are always nonnegative (unlike eigenvalues, which can be negative or complex). Stewart [9] gives a thorough treatment of the SVD's history and properties.
Rectangular matrices and singular values
The eigendecomposition only works for square matrices. The SVD works for any matrix, including the rectangular ones that appear constantly in neural data.
Take a data matrix $X$ with 500 rows (time bins) and 100 columns (neurons). Its SVD is $X = U \Sigma V^\top$, where $U$ is 500-by-500, $\Sigma$ is 500-by-100, and $V^\top$ is 100-by-100. But most of those entries are not doing useful work. At most 100 of the singular values can be nonzero (since the matrix has only 100 columns), and the corresponding 400 extra columns of $U$ span the left null space, which carries no information about the data.
In practice, you use the thin (or economy) SVD, which keeps only the parts that matter: $X = U_r \Sigma_r V_r^\top$, where $r$ is the rank. $U_r$ is 500-by-$r$, $\Sigma_r$ is $r$-by-$r$, and $V_r$ is 100-by-$r$. This is smaller and faster, with no loss of information.
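The shape difference is easy to see with NumPy's full_matrices flag. A quick sketch on random data of the sizes used above:

```python
import numpy as np

X = np.random.default_rng(1).standard_normal((500, 100))

# Full SVD: U is 500-by-500, most columns span the left null space
U_full, s_full, Vt_full = np.linalg.svd(X)

# Thin SVD: only the min(n, p) = 100 components that carry information
U_thin, s_thin, Vt_thin = np.linalg.svd(X, full_matrices=False)

print(U_full.shape, U_thin.shape)    # (500, 500) (500, 100)

# Same singular values, same reconstruction either way
assert np.allclose(s_full, s_thin)
assert np.allclose((U_thin * s_thin) @ Vt_thin, X)
```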
The singular values are conventionally sorted in decreasing order: $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0$. Their magnitudes tell you how important each component is. A large gap between $\sigma_k$ and $\sigma_{k+1}$ means the matrix is well approximated by its first $k$ components.
This is exactly the singular value spectrum that the PSID post uses to detect latent dimensionality. There, the matrix is the block Hankel matrix of time-lagged observations, and a gap in its singular value spectrum reveals how many latent dimensions the dynamical system has. The logic is the same: the number of large singular values is the effective rank of the matrix, and the effective rank is the number of independent components that matter.

The "numerical rank" (the number of singular values above some threshold) is a more useful measure than the exact rank for noisy data. Exact rank will always be full if there is any noise at all, since noise fills every dimension. Gavish and Donoho [10] derive an optimal singular value threshold for denoising matrices with known noise levels, giving a principled alternative to eyeballing the elbow.
The four fundamental subspaces from Post 4 appear directly in the SVD. The columns of $V_r$ are an orthonormal basis for the row space. The remaining columns of $V$ (the ones dropped in the thin SVD) span the null space. The columns of $U_r$ span the column space. The remaining columns of $U$ span the left null space. This is the four-subspace picture from Post 4 made computational. The SVD hands you orthonormal bases for all four subspaces in one shot. Strang [1, 6] builds his entire "fundamental theorem of linear algebra" around this observation.
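Here is a minimal sketch of reading all four subspaces off one SVD, assuming a small random matrix built to have rank 2:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 4))  # rank 2

U, s, Vt = np.linalg.svd(A)          # full SVD: U is 6x6, Vt is 4x4
r = int(np.sum(s > 1e-10))           # numerical rank

col_space  = U[:, :r]      # orthonormal basis for the column space
left_null  = U[:, r:]      # ... for the left null space
row_space  = Vt[:r, :].T   # ... for the row space
null_space = Vt[r:, :].T   # ... for the null space

assert r == 2
assert np.allclose(A @ null_space, 0)    # A annihilates the null space
assert np.allclose(A.T @ left_null, 0)   # A^T annihilates the left null space
```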
Low-rank approximation
You have a 500-by-100 data matrix and you suspect it is approximately low-rank: a hundred neurons, but maybe only ten or fifteen latent dimensions driving them. You want to find the rank-$k$ matrix closest to your data. Which one is it?
Write the SVD as a sum of rank-1 terms. Each term is the outer product of a left and right singular vector, scaled by the singular value:

$$A = \sum_{i=1}^{r} \sigma_i\, u_i v_i^\top$$

The truncated SVD keeps only the first $k$ terms:

$$A_k = \sum_{i=1}^{k} \sigma_i\, u_i v_i^\top$$
The Eckart-Young-Mirsky theorem [11] says: among all matrices with rank at most $k$, the truncated SVD $A_k$ is the one closest to $A$. This holds in both the Frobenius norm (sum of squared entries) and the operator norm (largest singular value of the difference). No other rank-$k$ matrix does better.

The Eckart-Young theorem dates to 1936 [11], the same year as Hotelling's CCA paper [12]. The proof is clean: any rank-$k$ matrix has at most $k$ nonzero singular values, and the squared Frobenius norm is the sum of squared singular values of the difference, which is minimized by zeroing out the smallest. Mirsky later extended the result to unitarily invariant norms.
The approximation error has a simple expression:

$$\|A - A_k\|_F^2 = \sum_{i=k+1}^{r} \sigma_i^2$$

The error is entirely determined by the discarded singular values. If the singular values drop off sharply after the first $k$, the approximation is excellent. The fraction of total "energy" (variance) captured is:

$$\frac{\sum_{i=1}^{k} \sigma_i^2}{\sum_{i=1}^{r} \sigma_i^2}$$
If this fraction is 0.95 with $k = 10$ out of 100 neurons, then ten components capture 95% of the data's content. This is exactly how PCA's explained variance works. It is not a coincidence.
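The error formula can be checked numerically in a few lines. A sketch with random data; the exact numbers are arbitrary, the identity is the point:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 5
X_k = (U[:, :k] * s[:k]) @ Vt[:k, :]     # best rank-k approximation

# Squared Frobenius error = sum of the discarded squared singular values
err_sq = np.sum((X - X_k) ** 2)
assert np.isclose(err_sq, np.sum(s[k:] ** 2))

# Fraction of energy captured by the first k components
frac = np.sum(s[:k] ** 2) / np.sum(s ** 2)
print(f"rank-{k} keeps {frac:.1%} of the energy")
```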
Let's make the PCA connection precise. If $X$ is the centered data matrix ($n$ rows of observations, columns are neurons) with SVD $X = U \Sigma V^\top$, then the covariance matrix is:

$$C = \frac{1}{n-1} X^\top X = \frac{1}{n-1} V \Sigma^\top \Sigma V^\top = V \left( \frac{\Sigma^\top \Sigma}{n-1} \right) V^\top$$

since $U^\top U = I$. This is an eigendecomposition: the right singular vectors of the data matrix are the eigenvectors of the covariance matrix, and the squared singular values divided by $n-1$ are the eigenvalues. The PCA scores are $XV = U\Sigma$, which are the projections of the data onto the principal components.

In practice, PCA is always computed via the SVD of the data matrix, never by forming the covariance matrix and eigendecomposing it. Forming $X^\top X$ squares the condition number: if $X$ has condition number $\kappa$, then $X^\top X$ has condition number $\kappa^2$, making small eigenvalues much harder to resolve. The SVD avoids this entirely. Golub and Van Loan [13] give a detailed analysis of the numerical stability differences.
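The two routes can be checked against each other directly. A sketch on random centered data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
X = X - X.mean(axis=0)               # center the columns
n = X.shape[0]

# Route 1: SVD of the data matrix
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Route 2: eigendecomposition of the covariance matrix
C = X.T @ X / (n - 1)
lam = np.linalg.eigvalsh(C)[::-1]    # eigenvalues, descending

# Covariance eigenvalues = squared singular values / (n - 1)
assert np.allclose(lam, s**2 / (n - 1))

# PCA scores: projecting onto the right singular vectors equals U * s
assert np.allclose(X @ Vt.T, U * s)
```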
The pseudoinverse
In Post 4, we introduced the least-squares formula $\hat{x} = (A^\top A)^{-1} A^\top b$ and called $(A^\top A)^{-1} A^\top$ the pseudoinverse. But that formula requires the columns of $A$ to be linearly independent (so $A^\top A$ is invertible). What happens when they are not?
A rank-deficient matrix sends some directions to zero. Those directions cannot be recovered. But the directions that survive can be inverted. The SVD separates these two cases.
Start from the SVD: $A = U \Sigma V^\top$. The "inverse" should undo each step. Reverse the output rotation ($U^\top$), invert the scaling (replace each $\sigma_i$ with $1/\sigma_i$), and reverse the input rotation ($V$). For the zero singular values, there is nothing to invert: the matrix crushed those directions, and no inverse can reconstruct what was destroyed. So we leave them at zero.

$$A^+ = V \Sigma^+ U^\top$$

where $\Sigma^+$ is formed by taking the reciprocal of each nonzero entry of $\Sigma$ and transposing (to handle the rectangular case). Concretely: $(\Sigma^+)_{ii} = 1/\sigma_i$ if $\sigma_i > 0$, and zero otherwise.
This is the Moore-Penrose pseudoinverse [14]. It gives the unique matrix satisfying four conditions (the Penrose conditions), but you do not need to memorize them. The geometric picture is enough: invert what you can, leave the rest at zero.
The pseudoinverse does exactly the right thing in every case. For an overdetermined system (more equations than unknowns), $\hat{x} = A^+ b$ gives the least-squares solution: the $\hat{x}$ that minimizes $\|A\hat{x} - b\|$. For an underdetermined system (more unknowns than equations), it gives the minimum-norm solution among all exact solutions. For a rank-deficient system, it does both simultaneously: minimum-norm, least-squares.

The pseudoinverse appears throughout the rest of the series. In subspace identification, you need it to recover latent states from the observability estimate. In PSID, both stages use pseudoinverses to extract state sequences from projections. In CCA, the whitening step is essentially a partial pseudoinverse of the covariance. The SVD-based definition handles all cases cleanly.
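A quick sketch of the underdetermined case, using np.linalg.pinv (which implements this SVD recipe): the pseudoinverse solution solves the system exactly, and adding any null-space component gives another exact solution with a strictly larger norm.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5))   # 3 equations, 5 unknowns: underdetermined
b = rng.standard_normal(3)

x_min = np.linalg.pinv(A) @ b     # minimum-norm exact solution

assert np.allclose(A @ x_min, b)  # solves the system exactly

# Perturb along a null-space direction: still an exact solution...
v_null = np.linalg.svd(A)[2][-1]  # last right singular vector spans null space
x_other = x_min + v_null
assert np.allclose(A @ x_other, b)
# ...but with a strictly larger norm
assert np.linalg.norm(x_other) > np.linalg.norm(x_min)
```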
There is a critical practical point. If some singular values are very small (not zero, but close), inverting them amplifies noise. A singular value of 0.001 becomes a reciprocal of 1000, and any noise in that direction gets multiplied by a factor of 1000. The standard fix is to treat singular values below a threshold as zero, setting their reciprocals to zero instead of computing them. This is truncated SVD regularization. It trades a small amount of bias (ignoring a real but tiny signal) for a large reduction in variance (not amplifying noise). The figure above lets you adjust the threshold and see this tradeoff directly.

The choice of truncation threshold connects to Tikhonov regularization (ridge regression). Adding a penalty $\lambda \|x\|^2$ to the least-squares objective effectively replaces each reciprocal $1/\sigma_i$ in the pseudoinverse with $\sigma_i / (\sigma_i^2 + \lambda)$. Large singular values are barely affected; small ones are shrunk toward zero. This is softer than the hard truncation of the SVD pseudoinverse, but the underlying idea is the same: do not trust directions where the signal is small relative to the noise.
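The two regularizers differ only in how they filter the reciprocals. A sketch comparing them on a hypothetical set of singular values:

```python
import numpy as np

sigma = np.array([10.0, 1.0, 0.1, 0.001])   # hypothetical singular values

naive = 1.0 / sigma                          # blows up for tiny sigma

# Hard truncation: zero out reciprocals below a threshold
threshold = 0.01
hard = np.where(sigma > threshold, 1.0 / sigma, 0.0)

# Tikhonov / ridge: smooth shrinkage sigma / (sigma^2 + lambda)
lam = 1e-2
ridge = sigma / (sigma**2 + lam)

print("naive:", naive)    # last entry is 1000: pure noise amplification
print("hard :", hard)     # last entry zeroed out
print("ridge:", ridge)    # large sigmas ~ 1/sigma, small ones shrunk
```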
Where the SVD appears
Let's count the places where the SVD has already appeared in this series, or will appear in the posts ahead.
PCA (Post 7). The right singular vectors of the centered data matrix are the principal components. The squared singular values are proportional to the covariance eigenvalues. PCA is the truncated SVD of the data matrix. Cunningham and Yu [4] review PCA and related dimensionality reduction methods for neural data.
Least squares and the pseudoinverse. The pseudoinverse gives the least-squares solution for any matrix, regardless of rank or shape. Truncating small singular values regularizes the solution.
Low-rank approximation. The best rank-$k$ approximation to any matrix is the truncated SVD. This is the mathematical foundation for keeping only the first $k$ principal components.
CCA (Post 9). The canonical correlations are the singular values of the whitened cross-covariance matrix, and the canonical directions are its left and right singular vectors. CCA is an SVD after whitening.
Procrustes alignment (Post 10). The orthogonal matrix that best aligns one dataset to another comes from the SVD of their cross-product. If $X^\top Y = U \Sigma V^\top$, the optimal rotation is $R = U V^\top$. This is the core of hyperalignment and shared response models [7].
Reduced-rank regression (Post 11). The optimal low-rank predictor comes from the SVD of the regression coefficient matrix, constrained to a low-dimensional subspace.
Subspace identification. In the PSID post, the latent dimensionality of a dynamical system is the number of significant singular values of the block Hankel matrix. The left singular vectors span the observability subspace. The state sequence comes from scaling and projecting.
Seven methods, each reducible to "form a matrix and take its SVD." They differ in what matrix you construct and how many components you keep.

This is not a coincidence. Many problems in applied linear algebra reduce to one of two questions: "find the best low-rank approximation to this matrix" or "find orthonormal bases for the row and column spaces of this matrix." The SVD answers both. Any method that can be phrased either way will use the SVD, whether or not it says so on the label.
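As one instance of the pattern, the Procrustes item fits in a few lines: build a rotated copy of a dataset, then recover the rotation from the SVD of the cross-product. A sketch on random data, using the standard orthogonal Procrustes recipe $R = UV^\top$ from the SVD of $X^\top Y$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))                      # dataset one
R_true, _ = np.linalg.qr(rng.standard_normal((5, 5)))  # random orthogonal matrix
Y = X @ R_true                                         # dataset two: rotated copy

# Orthogonal Procrustes: the R minimizing ||X R - Y||_F over orthogonal R
U, s, Vt = np.linalg.svd(X.T @ Y)
R_hat = U @ Vt

assert np.allclose(R_hat, R_true)   # exact recovery (noise-free case)
```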
We now have the complete linear algebra foundation. Vectors, bases, linear maps, subspaces, eigendecomposition, and the SVD. The next post puts them all together: PCA, derived from first principles, computed via the SVD, and connected to everything we have built.
Implementation
In NumPy, the SVD is a single function call. Here is a minimal implementation that computes the thin SVD of a data matrix, truncates to a chosen rank, and reconstructs the approximation:
import numpy as np
def truncated_svd(X, k):
"""
Compute the rank-k truncated SVD of X.
Parameters
----------
X : array, shape (n, p)
Data matrix (rows = observations, columns = variables).
k : int
Number of components to keep.
Returns
-------
U_k : array, shape (n, k)
Left singular vectors (score directions).
s_k : array, shape (k,)
Singular values.
Vt_k : array, shape (k, p)
Right singular vectors (loading directions), transposed.
"""
# Thin SVD — only compute the min(n, p) components
U, s, Vt = np.linalg.svd(X, full_matrices=False)
# Truncate to rank k
U_k = U[:, :k]
s_k = s[:k]
Vt_k = Vt[:k, :]
return U_k, s_k, Vt_k
def reconstruct(U_k, s_k, Vt_k):
"""Reconstruct the rank-k approximation."""
return U_k * s_k[np.newaxis, :] @ Vt_k
def pseudoinverse(X, threshold=1e-10):
"""
Compute the Moore-Penrose pseudoinverse via SVD.
Singular values below threshold are treated as zero.
"""
U, s, Vt = np.linalg.svd(X, full_matrices=False)
s_inv = np.where(s > threshold, 1.0 / s, 0.0)
return (Vt.T * s_inv[np.newaxis, :]) @ U.T
# ── Example: SVD of simulated neural data ──
rng = np.random.default_rng(42)
# 500 time bins, 100 neurons, true rank ~5
n_time, n_neurons, true_rank = 500, 100, 5
latent = rng.standard_normal((n_time, true_rank))
weights = rng.standard_normal((true_rank, n_neurons))
noise = 0.3 * rng.standard_normal((n_time, n_neurons))
X = latent @ weights + noise
# Center
X = X - X.mean(axis=0)
# Full thin SVD
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(f"Top 10 singular values: {s[:10].round(1)}")
# Notice the sharp drop after index 5
# Truncate to rank 5
U_k, s_k, Vt_k = truncated_svd(X, k=5)
X_approx = reconstruct(U_k, s_k, Vt_k)
frac = np.sum(s_k**2) / np.sum(s**2)
print(f"Rank-5 captures {frac:.1%} of variance")
# Pseudoinverse
X_pinv = pseudoinverse(X)
print(f"X shape: {X.shape}, X+ shape: {X_pinv.shape}")
# X+ @ X is approximately the identity on the row space

The key detail: always use np.linalg.svd with full_matrices=False for data matrices. The full SVD computes an unnecessarily large $U$ matrix (500-by-500 for our example), most of which is never used. The thin SVD gives $U$ as 500-by-100, which is all you need.

For very large matrices (millions of rows), even the thin SVD is expensive. Randomized SVD algorithms [15] approximate the top $k$ singular values and vectors in roughly $O(npk)$ time instead of $O(np \min(n, p))$, by first projecting onto a random low-dimensional subspace. The scikit-learn implementation (sklearn.utils.extmath.randomized_svd) is the standard choice for neural data matrices too large for a direct SVD.
References
- [1] Strang, G. Introduction to Linear Algebra, 6th ed. Wellesley-Cambridge Press, 2023.
- [2] 3Blue1Brown. "Essence of Linear Algebra" video series, 2016.
- [3] Churchland, M. M., Cunningham, J. P., Kaufman, M. T., et al. "Neural population dynamics during reaching," Nature, vol. 487, pp. 51-56, 2012.
- [4] Cunningham, J. P. and Yu, B. M. "Dimensionality reduction for large-scale neural recordings," Nature Neuroscience, vol. 17, pp. 1500-1509, 2014.
- [5] Axler, S. Linear Algebra Done Right, 4th ed. Springer, 2024.
- [6] Strang, G. "The fundamental theorem of linear algebra," The American Mathematical Monthly, vol. 100, no. 9, pp. 848-855, 1993.
- [7] Safaie, M., Chang, J. C., Park, J., et al. "Preserved neural dynamics across animals performing similar behaviour," Nature, vol. 623, pp. 765-771, 2023.
- [8] Trefethen, L. N. and Bau, D. Numerical Linear Algebra. SIAM, 1997.
- [9] Stewart, G. W. "On the early history of the singular value decomposition," SIAM Review, vol. 35, no. 4, pp. 551-566, 1993.
- [10] Gavish, M. and Donoho, D. L. "The optimal hard threshold for singular values is 4/√3," IEEE Transactions on Information Theory, vol. 60, no. 8, pp. 5040-5053, 2014.
- [11] Eckart, C. and Young, G. "The approximation of one matrix by another of lower rank," Psychometrika, vol. 1, no. 3, pp. 211-218, 1936.
- [12] Hotelling, H. "Relations between two sets of variates," Biometrika, vol. 28, no. 3-4, pp. 321-377, 1936.
- [13] Golub, G. H. and Van Loan, C. F. Matrix Computations, 4th ed. Johns Hopkins University Press, 2013.
- [14] Penrose, R. "A generalized inverse for matrices," Mathematical Proceedings of the Cambridge Philosophical Society, vol. 51, no. 3, pp. 406-413, 1955.
- [15] Halko, N., Martinsson, P. G., and Tropp, J. A. "Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions," SIAM Review, vol. 53, no. 2, pp. 217-288, 2011.