The singular value decomposition

One factorization behind most of neural data analysis.

What a matrix does to a circle

Every matrix you will meet in this series does something geometric: it takes a set of vectors and rearranges them. To understand what any particular matrix does, a good starting question is: what happens to the unit circle (in 2D) or the unit sphere (in higher dimensions) when the matrix acts on it?

Take a concrete example. In Post 3, we built a 2-by-3 neural decoder that maps 3-dimensional neural activity to 2-dimensional hand velocity. But to see the geometry clearly, start with a square matrix:

A = \begin{bmatrix} 3 & 1 \\ 1 & 3 \end{bmatrix}
(1)

Apply this matrix to every point on the unit circle. The result is an ellipse. We saw in Post 5 that a symmetric matrix maps a circle to an ellipse whose axes align with the eigenvectors. The eigenvalues are 4 and 2 (check: \text{trace} = 6, \det = 8, so \lambda^2 - 6\lambda + 8 = 0 gives \lambda = 4, 2). The eigenvectors (1, 1)/\sqrt{2} and (1, -1)/\sqrt{2} point along the ellipse axes, and the eigenvalues are the semi-axis lengths. So far, the eigendecomposition from Post 5 handles everything.
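These claims take a few lines of NumPy to verify (eigh is the symmetric eigensolver; it returns eigenvalues in ascending order):

```python
import numpy as np

# Symmetric example from Equation 1: check eigenvalues 4 and 2
# and the orthogonality of the eigenvectors.
A = np.array([[3.0, 1.0],
              [1.0, 3.0]])

eigvals, eigvecs = np.linalg.eigh(A)   # eigh: eigensolver for symmetric matrices
print(eigvals)                          # ascending: 2 then 4

# The eigenvectors are orthonormal, so the ellipse axes are perpendicular
print(np.allclose(eigvecs.T @ eigvecs, np.eye(2)))   # True
```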

Now consider a non-symmetric matrix:

B = \begin{bmatrix} 2 & 0 \\ 1 & 1 \end{bmatrix}
(2)

This still maps the circle to an ellipse, but the eigenvectors of B are not perpendicular. The eigenvalues are 2 and 1, with eigenvectors (1, 1) and (0, 1), which meet at an oblique angle. Yet the ellipse still has a longest axis and a shortest axis, and those axes are still perpendicular to each other. The eigenvectors simply do not point along them.

Why are the ellipse axes always perpendicular? Because an ellipse is a level set of a quadratic form, and any quadratic form can be diagonalized by an orthogonal change of coordinates. The SVD is the machinery that finds those coordinates for arbitrary (including non-symmetric and rectangular) matrices.

This is the gap the SVD fills. The eigendecomposition finds directions that scale without rotating, which requires a square matrix and yields perpendicular axes only for symmetric matrices. The SVD instead asks: what are the perpendicular axes of the output ellipse, and which perpendicular axes on the input circle map to them? It answers this question for any matrix, symmetric or not, square or not.
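NumPy makes the contrast concrete: the eigenvectors of B are oblique, while both sets of singular vectors are orthonormal (a quick numerical check):

```python
import numpy as np

# Non-symmetric example from Equation 2
B = np.array([[2.0, 0.0],
              [1.0, 1.0]])

eigvals, eigvecs = np.linalg.eig(B)
U, s, Vt = np.linalg.svd(B)

# Eigenvectors of B are not perpendicular...
print(eigvecs[:, 0] @ eigvecs[:, 1])          # nonzero dot product
# ...but both singular bases are orthonormal
print(np.allclose(U.T @ U, np.eye(2)))        # True
print(np.allclose(Vt @ Vt.T, np.eye(2)))      # True
print(s[0] * s[1])                            # equals |det B| = 2
```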

A unit circle maps to an ellipse. The right singular vectors (on the circle) map to the left singular vectors (on the ellipse), scaled by the singular values. Toggle between matrices to compare symmetric and non-symmetric cases.

There exists a pair of orthonormal bases, one for the input space and one for the output space, such that the matrix acts as pure scaling between them. The input basis vectors are the right singular vectors. The output basis vectors are the left singular vectors. The scaling factors are the singular values. That is the entire content of the SVD.

Rotate, scale, rotate

Every matrix, no matter how complicated, decomposes into three steps. The SVD of any matrix A is:

A = U \Sigma V^\top
(3)

Read this right to left, as three operations applied to a vector x:

Step 1. V^\top x: rotate the input vector so that the right singular vectors align with the coordinate axes. This is a change of basis in the input space. Because V is orthogonal, V^\top is a rotation (or rotation plus reflection). No stretching happens.

Step 2. \Sigma(V^\top x): scale each coordinate axis independently. The stretch factors are the singular values \sigma_1 \geq \sigma_2 \geq \cdots \geq 0, the diagonal entries of \Sigma. This is where the geometry actually changes: directions with large singular values get amplified, directions with small singular values get crushed.

Step 3. U(\Sigma V^\top x): rotate the scaled result into the output space, aligning the coordinate axes with the left singular vectors. Again, no stretching.
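The three stages can be applied one at a time to confirm that only the middle step changes lengths (a sketch with a random matrix and vector):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 2))
U, s, Vt = np.linalg.svd(A)

x = rng.standard_normal(2)

# Apply the three stages of Equation 3, read right to left
step1 = Vt @ x          # rotate input: length unchanged
step2 = s * step1       # scale each axis by its singular value
step3 = U @ step2       # rotate into output space: length unchanged

print(np.isclose(np.linalg.norm(step1), np.linalg.norm(x)))  # True: rotations preserve length
print(np.allclose(step3, A @ x))                             # True: the steps compose to A
```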

Step through the three stages of the SVD. Each button applies one operation. Between the two rotations, the only thing that happens is axis-aligned scaling. Every matrix, regardless of what it looks like, has this internal structure.

The two rotations can be completely different. V orients the input space and U orients the output space. Between them, \Sigma does the only non-rigid work: independent scaling along each axis. This factorization is why the SVD reveals so much about a matrix. The singular values tell you how much each direction is amplified. The singular vectors tell you which directions.

The "rotate, scale, rotate" picture is the SVD's fundamental insight, and Strang [1] and Trefethen and Bau [8] both emphasize it as the single most important factorization in numerical linear algebra. Trefethen in particular argues that the SVD should be taught before the eigendecomposition, because it works for all matrices and the geometric meaning is immediate.

Let's verify with a concrete example. Take the shear matrix:

A = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}
(4)

Compute A^\top A = \bigl(\begin{smallmatrix} 1 & 1 \\ 1 & 2 \end{smallmatrix}\bigr). The characteristic polynomial is \lambda^2 - 3\lambda + 1 = 0, giving eigenvalues \tfrac{3 + \sqrt{5}}{2} \approx 2.618 and \tfrac{3 - \sqrt{5}}{2} \approx 0.382. The singular values are the square roots: \sigma_1 \approx 1.618 and \sigma_2 \approx 0.618.

Two checks. First: \sigma_1 \cdot \sigma_2 = 1.618 \times 0.618 \approx 1 = |\det(A)|. The product of singular values equals the absolute determinant, because the determinant measures total area (or volume) scaling, and the singular values measure how much each axis contributes. Second: \sigma_1^2 + \sigma_2^2 = 2.618 + 0.382 = 3 = \|A\|_F^2, the squared Frobenius norm (sum of all squared entries of A: 1 + 1 + 0 + 1 = 3).

Both of these identities hold in general. For any square matrix, |\det(A)| = \prod_i \sigma_i, and for any matrix, \|A\|_F^2 = \sum_i \sigma_i^2. The first says singular values decompose volume scaling. The second says they decompose total "energy." These identities are why singular values appear in matrix norms, condition numbers, and information-theoretic quantities throughout the series.
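Both checks take three lines in NumPy:

```python
import numpy as np

# Shear matrix from Equation 4
A = np.array([[1.0, 1.0],
              [0.0, 1.0]])
s = np.linalg.svd(A, compute_uv=False)   # singular values only

print(s)                                  # approx [1.618, 0.618]: the golden ratio and its inverse
print(s[0] * s[1])                        # 1: product of sigmas = |det A|
print(np.sum(s**2))                       # 3: sum of sigma^2 = squared Frobenius norm
```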

Where the SVD comes from

The SVD is not a new piece of machinery. It falls directly out of the eigendecomposition from Post 5, applied to a carefully chosen matrix.

We want to find the directions in the input space that get stretched the most by A. The length of Ax is \|Ax\| = \sqrt{x^\top A^\top A\, x}. Maximizing this over unit vectors \|x\| = 1 is an eigenvalue problem for the matrix A^\top A, exactly as in Post 5. The maximum of \|Ax\|^2 subject to \|x\| = 1 is the largest eigenvalue of A^\top A, achieved at the corresponding eigenvector.

Why A^\top A and not just A? Because A might not be square, and even if it is, it might not be symmetric. But A^\top A is always square, always symmetric, and always positive semidefinite:

(A^\top A)^\top = A^\top A \qquad\text{and}\qquad x^\top A^\top A\, x = \|Ax\|^2 \geq 0

By the spectral theorem from Post 5, it has a full orthogonal eigendecomposition:

A^\top A = V \Lambda V^\top
(5)

where the columns of V are orthonormal eigenvectors and \Lambda = \text{diag}(\lambda_1, \ldots, \lambda_n) has nonnegative eigenvalues. These eigenvectors are the right singular vectors of A. They are the input directions that align with the ellipse axes.

The eigenvalues of A^\top A are the squared singular values: \lambda_i = \sigma_i^2. This is the connection: the singular values of A are the square roots of the eigenvalues of A^\top A.

Now we need the output directions. For each nonzero singular value, define:

u_i = \frac{1}{\sigma_i}\, A\, v_i
(6)

This is A applied to the right singular vector, normalized by the singular value. In other words: take the input direction, push it through the matrix, and normalize the result. The claim is that these u_i's are orthonormal. Verify:

u_i^\top u_j = \frac{1}{\sigma_i \sigma_j}\, v_i^\top A^\top A\, v_j = \frac{1}{\sigma_i \sigma_j}\, v_i^\top (\sigma_j^2\, v_j) = \frac{\sigma_j}{\sigma_i}\, v_i^\top v_j = \delta_{ij}

where we used A^\top A\, v_j = \sigma_j^2\, v_j (the eigenvalue equation) and the orthonormality of the v_i's. So the u_i's are indeed orthonormal. They are the left singular vectors.

The orthonormality proof is worth internalizing. It is the same structure that appears in the CCA derivation (where whitening plays the role of normalizing) and in subspace identification (where the left singular vectors of the Hankel matrix span the observability subspace). Every time you see an SVD in this series, the left and right singular vectors have a similar "push through the matrix, normalize" relationship.

Assemble everything. From Equation 6, we have A\, v_i = \sigma_i\, u_i for each singular value. Writing all r of these equations as a single matrix equation:

A\, V_r = U_r\, \Sigma_r

where V_r collects the right singular vectors, U_r collects the left singular vectors, and \Sigma_r is diagonal with the singular values. Right-multiplying by V_r^\top (which equals V_r^{-1} since it is orthogonal):

A = U_r\, \Sigma_r\, V_r^\top
(7)

That is the SVD. We started from the spectral theorem for the symmetric matrix A^\top A, defined the left singular vectors via the matrix itself, proved they are orthonormal, and assembled the factorization. No new axioms or theorems were needed beyond what Post 5 provides.
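The whole derivation fits in a few lines of NumPy: eigendecompose A^\top A, build the u_i from Equation 6, and confirm Equation 7 (a sketch for a full-column-rank matrix, so no zero singular values need special handling):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))           # full column rank with probability 1

# Spectral theorem applied to A^T A (Equation 5)
evals, V = np.linalg.eigh(A.T @ A)        # eigh returns ascending eigenvalues
order = np.argsort(evals)[::-1]           # sort descending, matching SVD convention
evals, V = evals[order], V[:, order]
sigma = np.sqrt(evals)                    # lambda_i = sigma_i^2

# Equation 6: u_i = A v_i / sigma_i, column by column via broadcasting
U = A @ V / sigma

print(np.allclose(U.T @ U, np.eye(3)))                 # True: u_i's are orthonormal
print(np.allclose(A, U @ np.diag(sigma) @ V.T))        # True: Equation 7 holds
```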

You could equally well start from AA^\top instead of A^\top A. Its eigenvectors are the left singular vectors, and you would construct the right singular vectors from them. Both A^\top A and AA^\top share the same nonzero eigenvalues (the squared singular values). The two paths converge to the same factorization.

For a symmetric matrix, the SVD and the eigendecomposition coincide: left and right singular vectors are both eigenvectors, and singular values are the absolute values of eigenvalues. The SVD is strictly more general. It works for any matrix, including rectangular ones, and the singular values are always nonnegative (unlike eigenvalues, which can be negative or complex). Stewart [9] gives a thorough treatment of the SVD's history and properties.

Rectangular matrices and singular values

The eigendecomposition only works for square matrices. The SVD works for any matrix, including the rectangular ones that appear constantly in neural data.

Take a data matrix X with 500 rows (time bins) and 100 columns (neurons). Its SVD is X = U \Sigma V^\top, where U is 500-by-500, \Sigma is 500-by-100, and V is 100-by-100. But most of those entries are not doing useful work. At most 100 of the singular values can be nonzero (since the matrix has only 100 columns), and the corresponding 400 extra columns of U span the left null space, which carries no information about the data.

In practice, you use the thin (or economy) SVD, which keeps only the parts that matter: X = U_r \Sigma_r V_r^\top, where r is the rank. U_r is 500-by-r, \Sigma_r is r-by-r, and V_r is 100-by-r. This is smaller and faster, with no loss of information.
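The shape bookkeeping is easy to see in NumPy, where full_matrices toggles between the two forms:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((500, 100))

# Full SVD: U is 500x500, most of it spanning the left null space
U_full, s_full, Vt_full = np.linalg.svd(X, full_matrices=True)
print(U_full.shape, s_full.shape, Vt_full.shape)   # (500, 500) (100,) (100, 100)

# Thin SVD: keeps only the 100 columns of U that can carry signal
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(U.shape, s.shape, Vt.shape)                  # (500, 100) (100,) (100, 100)

print(np.allclose(X, U @ np.diag(s) @ Vt))         # True: no information lost
```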

The SVD of a rectangular matrix. The four fundamental subspaces from Post 4 appear directly: columns of V_r span the row space, remaining columns of V span the null space, columns of U_r span the column space, remaining columns of U span the left null space. Adjust the rank to see how the subspaces partition each space.

The singular values are conventionally sorted in decreasing order: \sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0. Their magnitudes tell you how important each component is. A large gap between \sigma_k and \sigma_{k+1} means the matrix is well approximated by its first k components.

This is exactly the singular value spectrum that the PSID post uses to detect latent dimensionality. There, the matrix is the block Hankel matrix of time-lagged observations, and a gap in its singular value spectrum reveals how many latent dimensions the dynamical system has. The logic is the same: the number of large singular values is the effective rank of the matrix, and the effective rank is the number of independent components that matter.

The "numerical rank" (the number of singular values above some threshold) is a more useful measure than the exact rank for noisy data. Exact rank will always be full if there is any noise at all, since noise fills every dimension. Gavish and Donoho [10] derive an optimal singular value threshold for denoising matrices with known noise levels, giving a principled alternative to eyeballing the elbow.

The four fundamental subspaces from Post 4 appear directly in the SVD. The columns of V_r are an orthonormal basis for the row space. The remaining columns of V (the ones dropped in the thin SVD) span the null space. The columns of U_r span the column space. The remaining columns of U span the left null space. This is the four-subspace picture from Post 4 made computational. The SVD hands you orthonormal bases for all four subspaces in one shot. Strang [1, 6] builds his entire "fundamental theorem of linear algebra" around this observation.
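A small numerical sketch, using a deliberately rank-deficient matrix, reads all four bases out of one SVD call:

```python
import numpy as np

# A 4x3 matrix of rank 2: a product of a 4x2 and a 2x3 factor
rng = np.random.default_rng(3)
A = rng.standard_normal((4, 2)) @ rng.standard_normal((2, 3))

U, s, Vt = np.linalg.svd(A)               # full SVD: U is 4x4, Vt is 3x3
r = int(np.sum(s > 1e-10))                # numerical rank
print(r)                                  # 2

col_space  = U[:, :r]                     # column space basis (in R^4)
left_null  = U[:, r:]                     # left null space basis
row_space  = Vt[:r, :].T                  # row space basis (in R^3)
null_space = Vt[r:, :].T                  # null space basis

print(np.allclose(A @ null_space, 0))     # True: A kills the null space
print(np.allclose(left_null.T @ A, 0))    # True: left null space is orthogonal to the columns
```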

Low-rank approximation

You have a 500-by-100 data matrix and you suspect it is approximately low-rank: a hundred neurons, but maybe only ten or fifteen latent dimensions driving them. You want to find the rank-k matrix closest to your data. Which one is it?

Write the SVD as a sum of rank-1 terms. Each term \sigma_i\, u_i\, v_i^\top is the outer product of a left and right singular vector, scaled by the singular value:

A = \sum_{i=1}^{r} \sigma_i\, u_i\, v_i^\top
(8)

The truncated SVD keeps only the first k terms:

A_k = \sum_{i=1}^{k} \sigma_i\, u_i\, v_i^\top
(9)

The Eckart-Young-Mirsky theorem [11] says: among all matrices with rank at most k, the truncated SVD A_k is the one closest to A. This holds in both the Frobenius norm (sum of squared entries) and the operator norm (largest singular value of the difference). No other rank-k matrix does better.

The Eckart-Young theorem dates to 1936 [11], the same year as Hotelling's CCA paper [12]. The proof is clean: any rank-k matrix has at most k nonzero singular values, and the Frobenius norm is the sum of squared singular values of the difference, which is minimized by zeroing out the r - k smallest. Mirsky later extended the result to unitarily invariant norms.

The approximation error has a simple expression:

\|A - A_k\|_F^2 = \sigma_{k+1}^2 + \sigma_{k+2}^2 + \cdots + \sigma_r^2
(10)

The error is entirely determined by the discarded singular values. If the singular values drop off sharply after the first k, the approximation is excellent. The fraction of total "energy" (variance) captured is:

\text{fraction explained} = \frac{\sigma_1^2 + \cdots + \sigma_k^2}{\sigma_1^2 + \cdots + \sigma_r^2}
(11)

If this fraction is 0.95 with k = 10 out of 100 neurons, then ten components capture 95% of the data's content. This is exactly how PCA's explained variance works. It is not a coincidence.
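Both Equation 10 and Equation 11 are one-liners to verify numerically (a sketch on a random matrix):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((20, 10))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-3 truncation
k = 3
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Equation 10: squared error equals the sum of discarded sigma^2
err = np.linalg.norm(A - A_k, 'fro')**2
print(np.isclose(err, np.sum(s[k:]**2)))   # True

# Equation 11: fraction of energy captured by the first k components
frac = np.sum(s[:k]**2) / np.sum(s**2)
print(0 < frac < 1)                         # True
```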

Drag the truncation rank k to see the reconstruction degrade. The heat map shows the original matrix, the rank-k approximation, and the residual. Below, the singular value spectrum shows which components you are keeping and which you are discarding.

Let's make the PCA connection precise. If X is the centered data matrix (rows are observations, columns are neurons) with SVD X = U\Sigma V^\top, then the covariance matrix is:

\frac{1}{T}\, X^\top X = \frac{1}{T}\, V \Sigma^\top U^\top U \Sigma V^\top = V\!\left(\frac{\Sigma^2}{T}\right)\! V^\top
(12)

since U^\top U = I. This is an eigendecomposition: the right singular vectors V of the data matrix are the eigenvectors of the covariance matrix, and the squared singular values divided by T are the eigenvalues. The PCA scores are U_k \Sigma_k, which are the projections of the data onto the first k principal components.

In practice, PCA is always computed via the SVD of the data matrix, never by forming the covariance matrix and eigendecomposing it. Forming X^\top X squares the condition number: if X has condition number \kappa, then X^\top X has condition number \kappa^2, making small eigenvalues much harder to resolve. The SVD avoids this entirely. Golub and Van Loan [13] give a detailed analysis of the numerical stability differences.
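Equation 12 can be confirmed numerically: run both routes on the same centered matrix and compare (a sketch with simulated data; eigenvectors match only up to sign, hence the absolute values):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((200, 6))
X = X - X.mean(axis=0)                     # center, as PCA requires
T = X.shape[0]

# Route 1: SVD of the data matrix
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Route 2: eigendecomposition of the covariance
cov = X.T @ X / T
evals, evecs = np.linalg.eigh(cov)
evals, evecs = evals[::-1], evecs[:, ::-1]   # sort descending

print(np.allclose(evals, s**2 / T))          # True: eigenvalues = sigma^2 / T
print(np.allclose(np.abs(evecs), np.abs(Vt.T)))  # True: same eigenvectors up to sign
```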

The singular value spectrum of a simulated neural population. Click a bar to set the truncation rank k. The cumulative explained variance curve shows how quickly the approximation improves. The "elbow" is where adding more components stops helping much.

The pseudoinverse

In Post 4, we introduced the least-squares formula \hat{x} = (A^\top A)^{-1} A^\top b and called (A^\top A)^{-1} A^\top the pseudoinverse. But that formula requires the columns of A to be linearly independent (so A^\top A is invertible). What happens when they are not?

A rank-deficient matrix sends some directions to zero. Those directions cannot be recovered. But the directions that survive can be inverted. The SVD separates these two cases.

Start from the SVD: A = U\Sigma V^\top. The "inverse" should undo each step. Reverse the output rotation (U^\top), invert the scaling (replace each \sigma_i with 1/\sigma_i), and reverse the input rotation (V). For the zero singular values, there is nothing to invert: the matrix crushed those directions, and no inverse can reconstruct what was destroyed. So we leave them at zero.

A^+ = V \Sigma^+ U^\top
(13)

where \Sigma^+ is formed by taking the reciprocal of each nonzero entry and transposing (to handle the rectangular case). Concretely: \Sigma^+_{ii} = 1/\sigma_i if \sigma_i > 0, and zero otherwise.

This is the Moore-Penrose pseudoinverse [14]. It gives the unique matrix satisfying four conditions (the Penrose conditions), but you do not need to memorize them. The geometric picture is enough: invert what you can, leave the rest at zero.

The pseudoinverse in action. Left: the system Ax = b may have no exact solution (when b is outside the column space). The pseudoinverse finds the closest point in the column space, then maps back to the minimum-norm input. Right: adjust the truncation threshold to see how discarding small singular values regularizes the solution.

The pseudoinverse does exactly the right thing in every case. For an overdetermined system (more equations than unknowns), A^+ b gives the least-squares solution: the x that minimizes \|Ax - b\|^2. For an underdetermined system (more unknowns than equations), it gives the minimum-norm solution among all exact solutions. For a rank-deficient system, it does both simultaneously: minimum-norm, least-squares.

The pseudoinverse appears throughout the rest of the series. In subspace identification, you need it to recover latent states from the observability estimate. In PSID, both stages use pseudoinverses to extract state sequences from projections. In CCA, the whitening step \Sigma^{-1/2} is essentially a partial pseudoinverse of the covariance. The SVD-based definition handles all cases cleanly.
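A quick check of the first two behaviors against NumPy's built-in solvers (np.linalg.pinv computes the SVD-based pseudoinverse):

```python
import numpy as np

rng = np.random.default_rng(6)

# Overdetermined: pseudoinverse gives the least-squares solution
A = rng.standard_normal((10, 3))
b = rng.standard_normal(10)
x_pinv = np.linalg.pinv(A) @ b
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x_pinv, x_lstsq))        # True

# Underdetermined: pseudoinverse gives the minimum-norm exact solution
C = rng.standard_normal((3, 10))
d = rng.standard_normal(3)
x = np.linalg.pinv(C) @ d
print(np.allclose(C @ x, d))               # True: an exact solution

# Any other solution x + z (z in the null space of C) is at least as long
z = np.eye(10)[0] - np.linalg.pinv(C) @ (C @ np.eye(10)[0])  # null-space component of e1
print(np.linalg.norm(x) <= np.linalg.norm(x + z))            # True
```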

There is a critical practical point. If some singular values are very small (not zero, but close), inverting them amplifies noise. A singular value of 0.001 becomes a reciprocal of 1000, and any noise in that direction gets multiplied by a factor of 1000. The standard fix is to treat singular values below a threshold as zero, setting their reciprocals to zero instead of computing them. This is truncated SVD regularization. It trades a small amount of bias (ignoring a real but tiny signal) for a large reduction in variance (not amplifying noise). The figure above lets you adjust the threshold and see this tradeoff directly.

The choice of truncation threshold connects to Tikhonov regularization (ridge regression). Adding a penalty \lambda \|x\|^2 to the least-squares objective effectively replaces each reciprocal 1/\sigma_i in the pseudoinverse with \sigma_i / (\sigma_i^2 + \lambda). Large singular values are barely affected; small ones are shrunk toward zero. This is softer than the hard truncation of the SVD pseudoinverse, but the underlying idea is the same: do not trust directions where the signal is small relative to the noise.
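The two regularizers can be compared directly on a toy spectrum (the numbers here are made up for illustration):

```python
import numpy as np

# Toy singular value spectrum: one tiny sigma that would amplify noise 100x
sigma = np.array([10.0, 1.0, 0.01])

# Naive inversion: the last direction gets a factor of 100
naive = 1.0 / sigma
print(naive[2])                            # 100

# Hard truncation: reciprocals below the threshold are set to zero
hard = np.where(sigma > 0.1, 1.0 / sigma, 0.0)
print(hard)                                # last entry is exactly 0

# Ridge filter sigma / (sigma^2 + lambda): smooth shrinkage instead of a cliff
lam = 0.01
ridge = sigma / (sigma**2 + lam)
print(ridge)                               # large sigmas barely change; 100x amplification is gone
```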

Where the SVD appears

Let's count the places where the SVD has already appeared in this series, or will appear in the posts ahead.

PCA (Post 7). The right singular vectors of the centered data matrix are the principal components. The squared singular values are proportional to the covariance eigenvalues. PCA is the truncated SVD of the data matrix. Cunningham and Yu [4] review PCA and related dimensionality reduction methods for neural data.

Least squares and the pseudoinverse. The pseudoinverse A^+ gives the least-squares solution for any matrix, regardless of rank or shape. Truncating small singular values regularizes the solution.

Low-rank approximation. The best rank-k approximation to any matrix is the truncated SVD. This is the mathematical foundation for keeping only the first k principal components.

CCA (Post 9). The canonical correlations are the singular values of the whitened cross-covariance matrix, and the canonical directions are its left and right singular vectors. CCA is an SVD after whitening.

Procrustes alignment (Post 10). The orthogonal matrix that best aligns one dataset to another comes from the SVD of their cross-product. If X^\top Y = U\Sigma V^\top, the optimal rotation is R = VU^\top. This is the core of hyperalignment and shared response models [7].

Reduced-rank regression (Post 11). The optimal low-rank predictor comes from the SVD of the regression coefficient matrix, constrained to a low-dimensional subspace.

Subspace identification. In the PSID post, the latent dimensionality of a dynamical system is the number of significant singular values of the block Hankel matrix. The left singular vectors span the observability subspace. The state sequence comes from scaling and projecting.
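The Procrustes formula above is easy to verify on synthetic data: rotate a point cloud by a known rotation, then recover that rotation from the SVD of the cross-product (a minimal noiseless sketch; real alignment problems add noise and centering):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.standard_normal((100, 3))

# Build a random rotation G and a rotated copy of the data
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
G = Q * np.sign(np.linalg.det(Q))      # flip to det +1 (works in odd dimension)
Y = X @ G

# Procrustes via the cross-product, as in the text:
# X^T Y = U S V^T gives the rotation R = V U^T that aligns Y to X
U, S, Vt = np.linalg.svd(X.T @ Y)
R = Vt.T @ U.T

print(np.allclose(Y @ R, X))           # True: alignment is exact for noiseless data
print(np.allclose(R, G.T))             # True: recovered the inverse rotation
```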

Seven methods, one decomposition. Each method forms a different matrix and takes its SVD. The methods differ in what matrix you build and how many components you keep. The SVD is the shared computation.

Seven methods, each reducible to "form a matrix and take its SVD." They differ in what matrix you construct and how many components you keep.

This is not a coincidence. Many problems in applied linear algebra reduce to one of two questions: "find the best low-rank approximation to this matrix" or "find orthonormal bases for the row and column spaces of this matrix." The SVD answers both. Any method that can be phrased either way will use the SVD, whether or not it says so on the label.

We now have the complete linear algebra foundation. Vectors, bases, linear maps, subspaces, eigendecomposition, and the SVD. The next post puts them all together: PCA, derived from first principles, computed via the SVD, and connected to everything we have built.

Implementation

In NumPy, the SVD is a single function call. Here is a minimal implementation that computes the thin SVD of a data matrix, truncates to a chosen rank, and reconstructs the approximation:

import numpy as np

def truncated_svd(X, k):
    """
    Compute the rank-k truncated SVD of X.

    Parameters
    ----------
    X : array, shape (n, p)
        Data matrix (rows = observations, columns = variables).
    k : int
        Number of components to keep.

    Returns
    -------
    U_k : array, shape (n, k)
        Left singular vectors (score directions).
    s_k : array, shape (k,)
        Singular values.
    Vt_k : array, shape (k, p)
        Right singular vectors (loading directions), transposed.
    """
    # Thin SVD — only compute the min(n, p) components
    U, s, Vt = np.linalg.svd(X, full_matrices=False)

    # Truncate to rank k
    U_k = U[:, :k]
    s_k = s[:k]
    Vt_k = Vt[:k, :]

    return U_k, s_k, Vt_k


def reconstruct(U_k, s_k, Vt_k):
    """Reconstruct the rank-k approximation."""
    return U_k * s_k[np.newaxis, :] @ Vt_k


def pseudoinverse(X, threshold=1e-10):
    """
    Compute the Moore-Penrose pseudoinverse via SVD.

    Singular values below threshold are treated as zero.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_inv = np.where(s > threshold, 1.0 / s, 0.0)
    return (Vt.T * s_inv[np.newaxis, :]) @ U.T


# ── Example: SVD of simulated neural data ──
rng = np.random.default_rng(42)

# 500 time bins, 100 neurons, true rank ~5
n_time, n_neurons, true_rank = 500, 100, 5
latent = rng.standard_normal((n_time, true_rank))
weights = rng.standard_normal((true_rank, n_neurons))
noise = 0.3 * rng.standard_normal((n_time, n_neurons))
X = latent @ weights + noise

# Center
X = X - X.mean(axis=0)

# Full thin SVD
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(f"Top 10 singular values: {s[:10].round(1)}")
# Notice the sharp drop after index 5

# Truncate to rank 5
U_k, s_k, Vt_k = truncated_svd(X, k=5)
X_approx = reconstruct(U_k, s_k, Vt_k)

frac = np.sum(s_k**2) / np.sum(s**2)
print(f"Rank-5 captures {frac:.1%} of variance")

# Pseudoinverse
X_pinv = pseudoinverse(X)
print(f"X shape: {X.shape}, X+ shape: {X_pinv.shape}")
# X+ @ X is approximately the identity on the row space

The key detail: always use np.linalg.svd with full_matrices=False for data matrices. The full SVD computes an unnecessarily large U matrix (500-by-500 for our example), most of which is never used. The thin SVD gives U as 500-by-100, which is all you need.

For very large matrices (millions of rows), even the thin SVD is expensive. Randomized SVD algorithms [15] approximate the top k singular values and vectors in O(mnk) time instead of O(mn \min(m,n)), by first projecting onto a random low-dimensional subspace. The scikit-learn implementation (sklearn.utils.extmath.randomized_svd) is the standard choice for neural data matrices too large for a direct SVD.
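The random-projection idea can be sketched in pure NumPy. This is an illustrative toy in the spirit of Halko et al. [15], not the scikit-learn implementation; the function name and defaults are made up for this example:

```python
import numpy as np

def randomized_svd_sketch(A, k, n_oversamples=10, n_iter=2, seed=0):
    """Toy randomized SVD: range-finding + SVD of a small projected matrix."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Multiply A by a random test matrix; the result (almost surely)
    # captures the dominant column space
    Q, _ = np.linalg.qr(A @ rng.standard_normal((n, k + n_oversamples)))
    for _ in range(n_iter):
        # Power iterations sharpen the subspace when the spectrum decays slowly
        Q, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q)
    # Exact SVD of the small (k + oversamples)-by-n matrix Q^T A
    U_small, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return Q @ U_small[:, :k], s[:k], Vt[:k, :]

# Low-rank-plus-noise test matrix
rng = np.random.default_rng(42)
A = (rng.standard_normal((2000, 5)) @ rng.standard_normal((5, 300))
     + 0.01 * rng.standard_normal((2000, 300)))

U_r, s_r, Vt_r = randomized_svd_sketch(A, k=5)
s_exact = np.linalg.svd(A, compute_uv=False)[:5]
print(np.allclose(s_r, s_exact, rtol=0.05))   # True: top singular values match closely
```

The payoff is that the expensive exact SVD runs on a 15-by-300 matrix instead of a 2000-by-300 one; only QR factorizations and matrix products touch the full data.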

References

  1. Strang, G. Introduction to Linear Algebra, 6th ed. Wellesley-Cambridge Press, 2023.
  2. 3Blue1Brown. "Essence of Linear Algebra" video series, 2016.
  3. Churchland, M. M., Cunningham, J. P., Kaufman, M. T., et al. "Neural population dynamics during reaching," Nature, vol. 487, pp. 51-56, 2012.
  4. Cunningham, J. P. and Yu, B. M. "Dimensionality reduction for large-scale neural recordings," Nature Neuroscience, vol. 17, pp. 1500-1509, 2014.
  5. Axler, S. Linear Algebra Done Right, 4th ed. Springer, 2024.
  6. Strang, G. "The fundamental theorem of linear algebra," The American Mathematical Monthly, vol. 100, no. 9, pp. 848-855, 1993.
  7. Safaie, M., Chang, J. C., Park, J., et al. "Preserved neural dynamics across animals performing similar behaviour," Nature, vol. 623, pp. 765-771, 2023.
  8. Trefethen, L. N. and Bau, D. Numerical Linear Algebra. SIAM, 1997.
  9. Stewart, G. W. "On the early history of the singular value decomposition," SIAM Review, vol. 35, no. 4, pp. 551-566, 1993.
  10. Gavish, M. and Donoho, D. L. "The optimal hard threshold for singular values is 4/√3," IEEE Transactions on Information Theory, vol. 60, no. 8, pp. 5040-5053, 2014.
  11. Eckart, C. and Young, G. "The approximation of one matrix by another of lower rank," Psychometrika, vol. 1, no. 3, pp. 211-218, 1936.
  12. Hotelling, H. "Relations between two sets of variates," Biometrika, vol. 28, no. 3-4, pp. 321-377, 1936.
  13. Golub, G. H. and Van Loan, C. F. Matrix Computations, 4th ed. Johns Hopkins University Press, 2013.
  14. Penrose, R. "A generalized inverse for matrices," Mathematical Proceedings of the Cambridge Philosophical Society, vol. 51, no. 3, pp. 406-413, 1955.
  15. Halko, N., Martinsson, P. G., and Tropp, J. A. "Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions," SIAM Review, vol. 53, no. 2, pp. 217-288, 2011.