Least squares and regularization

Why the best-fitting decoder is usually the worst one to use.

PublishedMarch 2026

Reading time

SeriesLinear Algebra for Neural Data, Part 8

The decoder that memorized noise

In Post 3, a collaborator handed us a decoder: a weight matrix $W$ that maps neural firing rates to hand velocity. We used it without asking where it came from. Now we need to build one ourselves.

The setup is standard. You record from 100 neurons in motor cortex while a monkey makes repeated reaches. Each trial gives you a 100-dimensional firing-rate vector $a_i \in \mathbb{R}^{100}$ and a 2D hand velocity $v_i \in \mathbb{R}^2$ . You want a 2-by-100 weight matrix $W$ such that $Wa_i \approx v_i$ for every trial. Stack the trials: the neural data form a 50-by-100 matrix $A$ (one row per trial), and the velocities form a 50-by-2 matrix $V$ . You want $AW^\top \approx V$ , or equivalently, you want to solve each column of the velocity matrix separately as a least-squares problem.

You have 50 training trials. You apply the least-squares formula (which we will derive in the next section) and get a weight matrix. On the training data, the predictions are nearly perfect: $R^2 = 0.99$ . You hold out 50 additional test trials that the decoder has never seen. On those, $R^2 = 0.03$ . The decoder explains almost none of the variance on new data.

What happened is not subtle. You have 50 observations and 100 unknowns (one weight per neuron). The system is underdetermined: there are infinitely many weight vectors that fit the training data exactly, and least squares picks one of them. That solution threads through every training point perfectly, but it does so by exploiting the particular pattern of trial-to-trial noise in those 50 trials. On new trials, the noise is different, and the decoder falls apart.*This is a general phenomenon, not specific to neural decoding. Whenever the number of predictors exceeds the number of observations, ordinary least squares will overfit. In neuroscience, the ratio is often extreme: hundreds of neurons, tens of trials. The overfitting is not a failure of the math. It is the math doing exactly what you asked.

N = 20 neurons

Adjust the number of neurons. Training R² stays high as N increases, but test R² collapses once N exceeds the number of training trials.

The figure shows the pattern clearly. As you increase the number of neurons used in the decoder, training $R^2$ climbs steadily toward 1. Test $R^2$ follows at first, because the added neurons carry real information about the reach direction. But once the number of neurons exceeds the number of training trials, test $R^2$ drops. With 100 neurons and 50 trials, the gap between training and test performance is enormous.

The least-squares formula gave us an exact answer. The math was correct. The answer was useless. To understand why, we need to look at what the formula actually computes.

The normal equations

We want to solve $Ax = b$ where $A$ is an $m \times n$ matrix, $x \in \mathbb{R}^n$ is the unknown, and $b \in \mathbb{R}^m$ is the target. When $b$ is not in the column space of $A$ (and usually it is not), there is no exact solution. We settle for the $x$ that makes $Ax$ as close to $b$ as possible, minimizing the squared error $\|Ax - b\|^2$ .

The geometric answer comes from Post 4. The closest point to $b$ in the column space of $A$ is the orthogonal projection of $b$ onto that subspace. Call it $\hat{b} = A\hat{x}$ . The residual $b - A\hat{x}$ is perpendicular to the column space.

Perpendicular to the column space means perpendicular to every column of $A$ . Stacking those perpendicularity conditions into one matrix equation:

A^\top (b - A\hat{x}) = 0

(1)

Distribute and rearrange:

A^\top A\, \hat{x} = A^\top b

(2)

These are the normal equations. If $A^\top A$ is invertible (which requires the columns of $A$ to be linearly independent), the solution is unique:

\hat{x} = (A^\top A)^{-1} A^\top b

(3)

The matrix $(A^\top A)^{-1} A^\top$ is the pseudoinverse of $A$ , which we met in Post 6 through the SVD. The two derivations give the same object. The SVD route ( $A^+ = V\Sigma^+ U^\top$ ) works even when the columns are dependent; the normal-equations route requires invertibility of $A^\top A$ but makes the geometry explicit.*The connection is direct. From the SVD $A = U\Sigma V^\top$ , we get $A^\top A = V\Sigma^2 V^\top$ and $(A^\top A)^{-1} = V\Sigma^{-2} V^\top$ . Then $(A^\top A)^{-1} A^\top = V\Sigma^{-1} U^\top = A^+$ . The normal equations and the pseudoinverse are two ways of writing the same computation.

The derivation has a satisfying structure. We start from a geometric fact (the residual is perpendicular to the column space), translate it into algebra (equation 1), and arrive at an explicit formula (equation 3). The formula minimizes the squared error. It is the unique minimizer. So why did it fail on the decoder problem?

The formula works. What goes wrong is the matrix it operates on. When $A^\top A$ is nearly singular, the inverse amplifies noise. A column of $A$ that carries almost no signal still contributes a direction in the solution, and the inverse inflates that direction by a factor proportional to the reciprocal of the smallest eigenvalue of $A^\top A$ . If the smallest eigenvalue is $10^{-6}$ , the noise in that direction gets amplified by $10^{6}$ .*In the SVD picture, this is the same observation. The pseudoinverse replaces each singular value $\sigma_i$ with $1/\sigma_i$ . A tiny singular value becomes a huge reciprocal, and whatever noise happens to align with that direction gets blown up. The condition number $\sigma_1 / \sigma_n$ measures how badly this can happen.

Condition number

The previous section ended with a specific claim: when $A^\top A$ is nearly singular, the pseudoinverse amplifies noise. The condition number makes that claim precise.

Take the SVD of $A$ . Its singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_n$ measure how much $A$ stretches each input direction. The condition number is the ratio of the largest stretch to the smallest:

\kappa(A) = \frac{\sigma_{\max}}{\sigma_{\min}}

(4)

This ratio controls sensitivity. If you perturb the data $b$ by a small relative amount $\epsilon$ , the least-squares solution can change by up to $\kappa(A) \cdot \epsilon$ in relative terms. When $\kappa = 10^6$ , a 0.1% change in the data can produce a 1000x change in the estimated weights. The solution is not wrong in the algebraic sense — it still minimizes the squared error — but it is useless in the practical sense, because the weights are dominated by amplified noise rather than signal.

Why does neural population data tend to have large $\kappa$ ? Two reasons, both common in practice. First, neurons are correlated. Motor cortex neurons that prefer similar reach directions fire together, so the columns of $A$ point in similar directions. That makes some singular values large (the shared activity patterns) and others tiny (the directions along which the population barely varies). $A^\top A$ inherits this structure: its eigenvalues are the squared singular values, so a singular value of $10^{-3}$ becomes an eigenvalue of $10^{-6}$ .*The neural manifold literature quantifies this directly. Gallego et al. (2017) showed that motor cortex population activity during reaching lives on a manifold whose dimensionality is far lower than the neuron count. If you record from 100 neurons but the data only spans 10 dimensions, then 90 singular values are near zero. The condition number of such a matrix is enormous.

Second, when the number of neurons exceeds the number of trials ( $n > m$ ), $A^\top A$ is not just nearly singular — it is exactly singular. It has at most $m$ nonzero eigenvalues, and the remaining $n - m$ eigenvalues are zero. The condition number is infinite. This is the situation from section 1: 100 neurons, 50 trials, and a decoder that memorized noise.

The small singular values are the specific source of trouble. Each singular value $\sigma_i$ corresponds to a direction in the input space. The pseudoinverse divides by $\sigma_i$ along that direction. For the large singular values, this division is harmless — a signal that was stretched by 10 gets compressed back by 1/10. For the tiny singular values, the division is catastrophic. A direction where the data barely varies (say $\sigma_i = 0.001$ ) gets inflated by a factor of 1000. Whatever noise happens to project onto that direction is amplified into a huge weight component. The resulting weight vector has enormous norm and is almost entirely noise.

-3-2-1123-3-2-112345κ = 2.0‖w‖ = 1.28ŷ=1.03 + 0.76xWell-conditioned — drag any point

Drag a data point. In the well-conditioned case, the regression line barely moves. In the ill-conditioned case, it swings wildly. The condition number κ quantifies the instability.

The figure lets you see the instability directly. In the well-conditioned panel, moving a single data point barely changes the fit. In the ill-conditioned panel, the same perturbation sends the regression line swinging through a completely different slope. Both panels solve the same least-squares problem. The only difference is the geometry of the data: one has singular values that are roughly equal, the other has singular values that span orders of magnitude.

Ridge regression

The diagnosis points directly to a fix. The problem is that small eigenvalues of $A^\top A$ produce enormous entries in the inverse. So make the small eigenvalues bigger. Add a positive constant $\lambda$ to every diagonal entry of $A^\top A$ before inverting:

\hat{x}_{\text{ridge}} = (A^\top A + \lambda I)^{-1} A^\top b

(5)

This is ridge regression. The matrix $A^\top A + \lambda I$ has eigenvalues $\sigma_i^2 + \lambda$ instead of $\sigma_i^2$ . An eigenvalue that was $10^{-6}$ becomes $\lambda + 10^{-6} \approx \lambda$ . The condition number drops from $\sigma_{\max}^2 / \sigma_{\min}^2$ to $(\sigma_{\max}^2 + \lambda) / (\sigma_{\min}^2 + \lambda)$ , which is always closer to 1. The inverse no longer blows up along the low-variance directions.*Ridge regression is also called Tikhonov regularization, after the Soviet mathematician who studied it in the context of ill-posed integral equations. In BCI decoding, ridge is the standard method. The review by Glaser et al. (2020) found that ridge regression matched or outperformed more complex decoders across multiple datasets and brain areas. Nearly every practical neural decoder uses some form of regularization — the question is not whether to regularize, but how much.

There is an equivalent way to arrive at the same formula. Instead of modifying the matrix you invert, you can modify the objective you minimize. Add a penalty on the size of the weights:

\min_x \|Ax - b\|^2 + \lambda \|x\|^2

(6)

Take the gradient, set it to zero, and you get equation 5 back. The two views — inflating eigenvalues versus penalizing weight magnitude — describe the same computation. The penalty view makes the tradeoff explicit: the first term wants the predictions to match the data, the second term wants the weights to stay small. The parameter $\lambda$ controls the balance.

Geometrically, ridge regression constrains the solution to lie within a ball $\|x\|^2 \leq t$ for some radius $t$ that depends on $\lambda$ . Ordinary least squares finds the minimum of the squared error anywhere in parameter space. Ridge finds the minimum within the ball. If the unconstrained minimum has enormous norm (because tiny singular values inflated the weights), the ball pulls the solution back toward the origin. The weights shrink. The fit to the training data gets slightly worse. The fit to new data gets much better.

This is the bias-variance tradeoff stated concretely. At $\lambda = 0$ , there is no penalty, and you recover ordinary least squares: zero bias (the expected estimate equals the true parameter), but potentially enormous variance (the estimate swings wildly with each new sample of noise). As $\lambda$ increases, the weights shrink toward zero, introducing bias (the estimate is systematically pulled away from the truth), but reducing variance (the estimate becomes stable across different noise realizations). Somewhere between these extremes is a value of $\lambda$ that minimizes the total error on new data. Finding that value is the subject of the next section.

0.20.40.60.80.41.01.72.32.9xytraintestλ = 0.00Train MSE: 0.02Test MSE: 0.05‖w‖²: 4.12

Adjust λ. At λ = 0 the fit passes through the training points but overshoots on test points. Increasing λ stabilizes the regression line at the cost of a slightly worse training fit.

The figure shows what happens as you turn the dial. With no regularization, the fit hugs the training data closely — too closely, because it is tracking noise. As you increase $\lambda$ , the regression line smooths out. The training error rises slightly, but the test error drops. Push $\lambda$ too far and the line flattens toward zero, ignoring the data entirely. The optimal $\lambda$ sits where test error is minimized: enough regularization to suppress noise, not so much that you suppress signal.

Choosing lambda

The bias-variance tradeoff tells you that an optimal $\lambda$ exists. It does not tell you what it is. The answer depends on the data: on the singular values of $A$ , on the noise level, on how many trials you have. You cannot compute it from first principles. You have to estimate it empirically.

The standard approach is cross-validation. Split your $m$ trials into $K$ groups (folds). For each fold $k$ , hold it out, fit ridge regression on the remaining $K-1$ folds using a candidate $\lambda$ , and measure the prediction error on the held-out fold. Average the error across folds:

\text{CV}(\lambda) = \frac{1}{K} \sum_{k=1}^{K} \| A_k \hat{x}_{-k}(\lambda) - b_k \|^2

(7)

Here $A_k$ and $b_k$ are the data from fold $k$ , and $\hat{x}_{-k}(\lambda)$ is the ridge solution fit on everything except fold $k$ . Repeat this for a grid of $\lambda$ values (typically spaced logarithmically from $10^{-4}$ to $10^4$ ) and pick the $\lambda$ that minimizes $\text{CV}(\lambda)$ .*In practice, most neuroscience toolboxes use generalized cross-validation (GCV) or leave-one-out CV, both of which have closed-form solutions for ridge regression. You do not need to loop over folds. GCV replaces the combinatorial leave-one-out procedure with a single matrix trace computation, making it fast even for large datasets.

If you plot $R^2$ on held-out data against $\log \lambda$ , the curve is U-shaped. On the left ( $\lambda$ too small), the model overfits: the weights are large, the training fit is excellent, and the test fit is poor. On the right ( $\lambda$ too large), the model underfits: the weights are crushed toward zero, and the decoder ignores real structure in the data. The bottom of the U is the CV-optimal $\lambda$ . The RidgeExplorer above already shows this pattern — as you increase $\lambda$ , training error rises monotonically while test error dips and then climbs back up.

Where the optimum falls depends on how ill-conditioned the problem is. Consider decoding hand velocity from motor cortex. With 100 neurons and 200 trials, the data matrix is overdetermined and reasonably well-conditioned. The optimal $\lambda$ is small — perhaps 1 to 10 — because there is enough data to estimate the weights reliably, and only a gentle nudge is needed. Now cut the trial count to 50. The problem becomes underdetermined (more neurons than trials), the condition number goes to infinity, and the optimal $\lambda$ shifts higher, perhaps by an order of magnitude. Less data means more regularization. The cross-validation curve adapts automatically: it finds the $\lambda$ that matches the severity of the problem.

Why shrinkage works

Cross-validation tells you which $\lambda$ to use. It does not tell you what ridge is actually doing to the solution. For that, go back to the SVD. Write $A = U\Sigma V^\top$ . The ordinary least-squares solution expands in the right singular vectors: the weight along direction $v_i$ is proportional to $1/\sigma_i$ . We saw in section 3 that this is the source of trouble — when $\sigma_i$ is small, $1/\sigma_i$ is enormous, and the corresponding weight component is dominated by noise.

Ridge regression replaces the pseudoinverse filter $1/\sigma_i$ with a different function:

f_i = \frac{\sigma_i}{\sigma_i^2 + \lambda}

(8)

This filter has two regimes. When $\sigma_i$ is large relative to $\sqrt{\lambda}$ , the denominator is dominated by $\sigma_i^2$ , and $f_i \approx 1/\sigma_i$ — essentially the same as the pseudoinverse. The well-measured directions pass through unchanged. When $\sigma_i$ is small relative to $\sqrt{\lambda}$ , the denominator is dominated by $\lambda$ , and $f_i \approx \sigma_i/\lambda$ — a value close to zero. The poorly measured directions are heavily damped. There is no sharp cutoff. The transition is smooth: the filter rolls off continuously as the singular values shrink.*This perspective connects ridge to the broader family of spectral filters. Truncated SVD, ridge regression, and Landweber iteration are all linear filters applied to the singular values. They differ only in the shape of the filter function: $f_i = \mathbf{1}[\sigma_i > \text{threshold}] / \sigma_i$ for truncated SVD, $f_i = \sigma_i / (\sigma_i^2 + \lambda)$ for ridge, and an iterative approximation for Landweber. Ridge is the most widely used because its filter is smooth and its solution has a closed form.

λ = 0.50

Singular values (bars) with two filter functions overlaid. The pseudoinverse (red) divides by each singular value — small values produce huge weights. Ridge (teal) damps the small ones. Adjust λ to see the filter change.

The figure makes the mechanism visible. The bars show the singular values of a typical neural data matrix — a few large ones capturing the dominant population activity patterns, then a long tail of small values reflecting noise directions. The pseudoinverse filter (red) climbs steeply as the singular values shrink, amplifying whatever noise projects onto those directions. The ridge filter (teal) follows the pseudoinverse for the large singular values but peels away and stays small where the singular values are tiny. As you increase $\lambda$ , the point where the two curves diverge moves to the right, damping more and more directions.

Compare this to the truncated SVD from Part 6. Truncated SVD applies a hard threshold: keep the top $k$ singular values, discard the rest entirely. The filter function is a step — $1/\sigma_i$ for the first $k$ directions, zero for the remainder. Ridge is a soft threshold: it gradually damps the contribution of each direction rather than making a binary keep-or-discard decision. In practice, ridge is often the better choice. The hard cutoff of truncated SVD forces you to pick $k$ , and directions just below the cutoff may still carry useful signal. Ridge avoids the arbitrary boundary. Each direction contributes in proportion to how well-measured it is, which is usually what you want when decoding from a noisy neural population.

Sparsity and the L1 norm

Ridge regression shrinks all weights toward zero, but it never sets any of them exactly to zero. Every neuron retains some nonzero weight, no matter how small. Sometimes that is not what you want. If you record from 100 neurons in motor cortex but only 15 of them actually carry information about the decoded variable, you want the decoder to find those 15 and ignore the rest. A decoder that spreads its weights across all 100 neurons is harder to interpret, harder to validate, and often less stable than one that commits to a subset.

The tool for this is a different penalty. Replace the squared L2 norm $\|w\|^2 = w_1^2 + w_2^2 + \cdots + w_n^2$ with the L1 norm $\|w\|_1 = |w_1| + |w_2| + \cdots + |w_n|$ . The resulting method is the lasso (least absolute shrinkage and selection operator), introduced by Tibshirani (1996):

\min_w \|Aw - b\|^2 + \lambda \|w\|_1

(9)

The geometric reason the L1 penalty produces zeros connects back to the norm balls from Part 1. The L2 ball is round — a circle in two dimensions, a sphere in three. The L1 ball is a diamond (a rotated square in 2D, an octahedron in 3D), with sharp corners sitting on the coordinate axes. Constrained optimization finds the point where the elliptical cost contours first touch the constraint region. For the L2 circle, that tangent point generically lies on a smooth curve, so all weights are nonzero. For the L1 diamond, the contours almost always hit a corner first. A corner sits on a coordinate axis, which means one or more weights are exactly zero. The sharper the corner, the stronger the pull toward sparsity.*The elastic net, introduced by Zou and Hastie (2005), combines both penalties: $\lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2$ . This gives sparsity from the L1 term and stability from the L2 term. Unlike pure lasso, elastic net handles correlated predictors gracefully — it tends to include or exclude groups of correlated neurons together rather than arbitrarily picking one from each correlated pair.

constraint radius t = 1.5

Toggle between L₂ (circle) and L₁ (diamond) constraints. The L₁ diamond has corners on the axes — when the cost contours touch a corner, one weight is exactly zero.

In practice, the sparsity is real and useful. Apply lasso to the motor cortex decoding problem with 100 neurons, sweep $\lambda$ upward, and watch what happens. At small $\lambda$ , all 100 weights are nonzero and the decoder overfits, just like unregularized least squares. As $\lambda$ increases, weights start dropping to exactly zero, one by one. At some intermediate value, perhaps 15 to 20 neurons remain, and the test-set performance peaks. Push $\lambda$ further and even informative neurons get zeroed out, and performance degrades. The cross-validation procedure from section 5 works here too — plot held-out error against $\lambda$ , find the minimum, and read off both the optimal penalty and the set of neurons the decoder selected.

There is one cost. Unlike ridge, the lasso objective has no closed-form solution. The absolute value in the L1 norm makes the penalty non-differentiable at zero, so you cannot just take the gradient and set it to zero. Numerical algorithms (coordinate descent, ISTA, ADMM) handle this efficiently, but the computation is iterative rather than a single matrix formula. For most neural decoding problems, this is a minor inconvenience. The real question is whether you need sparsity at all. If the goal is pure prediction accuracy, ridge usually wins. If the goal is identifying which neurons matter, lasso is the right tool.

What comes next

Every method in the rest of this series will face a version of the same problem. Canonical correlation analysis (CCA) requires inverting sample covariance matrices. Record 100 neurons across 50 trials and the sample covariance is 100-by-100 but rank 50 at most — exactly singular. The same ill-conditioning that wrecked ordinary least squares wrecks ordinary CCA. The fix is the same too: add $\lambda I$ to the covariance matrix before inverting. Regularized CCA is to CCA what ridge is to least squares. The next post builds this up from scratch.

The broader point is worth stating plainly. More parameters than data points, correlated features, noise in the target — these are not unusual conditions in neuroscience. They are the default. Regularization is not an optional add-on for when things go wrong. It is what makes any of these methods usable on real neural data. The normal equations give you the math. Regularization gives you something you can trust.

References

Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning, 2nd ed. Springer, 2009.
Hoerl, A. E. and Kennard, R. W. "Ridge regression: biased estimation for nonorthogonal problems," Technometrics, vol. 12, no. 1, pp. 55-67, 1970.
Tibshirani, R. "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society: Series B, vol. 58, no. 1, pp. 267-288, 1996.
Glaser, J. I., Benjamin, A. S., Farhoodi, R., et al. "Machine learning for neural decoding," eNeuro, vol. 7, no. 4, 2020.
Strang, G. Introduction to Linear Algebra, 6th ed. Wellesley-Cambridge Press, 2023.
Gallego, J. A., Perich, M. G., Miller, L. E., and Solla, S. A. "Neural manifolds for the control of movement," Neuron, vol. 94, no. 5, pp. 978-984, 2017.
Zou, H. and Hastie, T. "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society: Series B, vol. 67, no. 2, pp. 301-320, 2005.

← Back to home